1
00:00:00,000 --> 00:00:03,160
All right, ready to dive into some cutting edge AI research.

2
00:00:03,160 --> 00:00:04,000
Always.

3
00:00:04,000 --> 00:00:05,000
Awesome.

4
00:00:05,000 --> 00:00:07,380
Today, we're looking at a brand new paper,

5
00:00:07,380 --> 00:00:11,960
PaliGemma 2, a family of versatile VLMs for transfer.

6
00:00:11,960 --> 00:00:14,400
Oh yeah, this one's generating a lot of buzz.

7
00:00:14,400 --> 00:00:15,800
It's all about vision language models,

8
00:00:15,800 --> 00:00:17,040
or VLMs for short, right?

9
00:00:17,040 --> 00:00:17,880
You got it.

10
00:00:17,880 --> 00:00:20,760
So for our listeners who might not be familiar with VLMs,

11
00:00:20,760 --> 00:00:22,600
what exactly are we talking about here?

12
00:00:22,600 --> 00:00:26,160
Imagine AI that can not only see images,

13
00:00:26,160 --> 00:00:29,160
but actually understand them and connect them with text.

14
00:00:29,160 --> 00:00:31,800
Like they can look at a picture of a cat and know it's a cat,

15
00:00:31,800 --> 00:00:33,840
and then also understand a description of a cat.

16
00:00:33,840 --> 00:00:35,520
So it's like giving AI the ability

17
00:00:35,520 --> 00:00:37,800
to process information the way humans do,

18
00:00:37,800 --> 00:00:39,360
using both sight and language.

19
00:00:39,360 --> 00:00:40,520
Exactly.

20
00:00:40,520 --> 00:00:43,640
And PaliGemma 2 is pushing the boundaries of what's

21
00:00:43,640 --> 00:00:46,240
possible with these types of models.

22
00:00:46,240 --> 00:00:49,080
It's building on the success of the original PaliGemma,

23
00:00:49,080 --> 00:00:50,960
but with some serious upgrades.

24
00:00:50,960 --> 00:00:53,880
OK, so what makes PaliGemma 2 so special?

25
00:00:53,880 --> 00:00:56,920
What sets it apart from the original and other VLMs out

26
00:00:56,920 --> 00:00:57,280
there?

27
00:00:57,280 --> 00:00:59,640
Well, for starters, it's not just one model.

28
00:00:59,640 --> 00:01:01,160
It's a whole family of models.

29
00:01:01,160 --> 00:01:02,200
A family of models.

30
00:01:02,200 --> 00:01:04,400
Like different versions of the same AI.

31
00:01:04,400 --> 00:01:05,520
Precisely.

32
00:01:05,520 --> 00:01:07,800
They come in different sizes with varying numbers

33
00:01:07,800 --> 00:01:08,760
of parameters.

34
00:01:08,760 --> 00:01:11,440
You can think of parameters as a measure of the model's

35
00:01:11,440 --> 00:01:12,480
complexity.

36
00:01:12,480 --> 00:01:15,160
So some are smaller and faster, while others

37
00:01:15,160 --> 00:01:17,400
are larger and more powerful.

38
00:01:17,400 --> 00:01:17,880
Interesting.

39
00:01:17,880 --> 00:01:20,600
So it's like having different tools for different jobs.

40
00:01:20,600 --> 00:01:22,680
You might use a smaller, more efficient model

41
00:01:22,680 --> 00:01:25,160
for a simple task, and then bring out the big guns

42
00:01:25,160 --> 00:01:26,440
for something more complex.

43
00:01:26,440 --> 00:01:27,240
Exactly.

44
00:01:27,240 --> 00:01:28,400
And here's another cool thing.

45
00:01:28,400 --> 00:01:30,600
They also come in different resolutions.

46
00:01:30,600 --> 00:01:34,320
You know, like how you can have an image in standard definition

47
00:01:34,320 --> 00:01:35,400
or high definition.

48
00:01:35,400 --> 00:01:36,120
Yeah, it makes sense.

49
00:01:36,120 --> 00:01:38,120
Well, PaliGemma 2 can process images

50
00:01:38,120 --> 00:01:39,360
at different resolutions, too.

51
00:01:39,360 --> 00:01:42,800
It's like upgrading from standard def to HD,

52
00:01:42,800 --> 00:01:46,040
the higher the resolution, the more detail the model can see.

53
00:01:46,040 --> 00:01:48,520
Ah, so it's not just about having a big brain,

54
00:01:48,520 --> 00:01:49,760
but also sharp eyes.

55
00:01:49,760 --> 00:01:50,440
Exactly.

56
00:01:50,440 --> 00:01:52,880
And the researchers found that this increase in resolution

57
00:01:52,880 --> 00:01:55,160
was just as important as the size of the model

58
00:01:55,160 --> 00:01:57,200
when it came to performance improvement.

59
00:01:57,200 --> 00:01:58,080
That's fascinating.

60
00:01:58,080 --> 00:02:01,000
It really highlights how important visual detail

61
00:02:01,000 --> 00:02:02,720
is for these AI systems.

62
00:02:02,720 --> 00:02:05,080
OK, so we've got these different sizes and resolutions.

63
00:02:05,080 --> 00:02:08,800
How do they actually train these AI models to be so versatile?

64
00:02:08,800 --> 00:02:10,520
What's the secret sauce?

65
00:02:10,520 --> 00:02:12,200
It's a pretty sophisticated process,

66
00:02:12,200 --> 00:02:13,520
but I'll try to break it down.

67
00:02:13,520 --> 00:02:15,920
They use a three-stage training process.

68
00:02:15,920 --> 00:02:16,840
Three stages.

69
00:02:16,840 --> 00:02:17,760
All right, take me through it.

70
00:02:17,760 --> 00:02:18,880
What happens in stage one?

71
00:02:18,880 --> 00:02:21,520
Stage one is all about laying the foundation.

72
00:02:21,520 --> 00:02:23,320
You could say it's like AI boot camp.

73
00:02:23,320 --> 00:02:23,880
Boot camp.

74
00:02:23,880 --> 00:02:24,920
I like that analogy.

75
00:02:24,920 --> 00:02:27,400
So they start with two pre-trained models,

76
00:02:27,400 --> 00:02:30,000
SigLIP-So400m for the vision part,

77
00:02:30,000 --> 00:02:32,440
and Gemma 2 for the language part.

78
00:02:32,440 --> 00:02:34,680
And then they combine these two models

79
00:02:34,680 --> 00:02:38,200
and train them on a massive data set of a billion examples.

80
00:02:38,200 --> 00:02:39,520
Wow, a billion examples.

81
00:02:39,520 --> 00:02:40,800
That's a lot of data.

82
00:02:40,800 --> 00:02:41,400
It is.

83
00:02:41,400 --> 00:02:43,240
But that's how these models learn

84
00:02:43,240 --> 00:02:46,280
to make connections between images and text.

85
00:02:46,280 --> 00:02:48,560
They need to see a huge variety of examples

86
00:02:48,560 --> 00:02:50,920
to understand the nuances of the world.

87
00:02:50,920 --> 00:02:53,120
So it's like showing a child thousands of pictures

88
00:02:53,120 --> 00:02:55,760
and explaining what each one is until they start to understand

89
00:02:55,760 --> 00:02:57,360
how to recognize things on their own.

90
00:02:57,360 --> 00:02:58,880
It's similar to that, yeah.

91
00:02:58,880 --> 00:03:00,760
And then in stage two, they move on

92
00:03:00,760 --> 00:03:02,480
to more specialized training.

93
00:03:02,480 --> 00:03:06,000
This stage is all about focusing on tasks that really benefit

94
00:03:06,000 --> 00:03:07,240
from a keen eye for detail.

95
00:03:07,240 --> 00:03:08,640
OK, give me some examples.

96
00:03:08,640 --> 00:03:10,240
What kind of tasks are we talking about here?

97
00:03:10,240 --> 00:03:12,840
Think about things like optical character recognition,

98
00:03:12,840 --> 00:03:14,640
or OCR for short.

99
00:03:14,640 --> 00:03:17,480
That's the ability to read text from an image.

100
00:03:17,480 --> 00:03:19,400
Right, like when your phone can scan a document

101
00:03:19,400 --> 00:03:21,080
and turn it into editable text.

102
00:03:21,080 --> 00:03:23,760
Exactly, or recognizing handwriting,

103
00:03:23,760 --> 00:03:25,800
even if it's messy or stylized.

104
00:03:25,800 --> 00:03:27,720
You know, it's funny how some AI struggled

105
00:03:27,720 --> 00:03:29,720
to decipher handwritten notes.

106
00:03:29,720 --> 00:03:32,000
I wonder how PaliGemma 2 does with that.

107
00:03:32,000 --> 00:03:34,520
Well, that's what the stage two training is all about.

108
00:03:34,520 --> 00:03:38,240
They crank up the resolution, first to 448 px²,

109
00:03:38,240 --> 00:03:41,840
and then to a whopping 896 px²,

110
00:03:41,840 --> 00:03:44,000
really honing in on those fine points.

111
00:03:44,000 --> 00:03:46,920
So they're giving the AI a super-powered magnifying glass

112
00:03:46,920 --> 00:03:48,880
to examine every little detail.

113
00:03:48,880 --> 00:03:49,520
That's pretty cool.

114
00:03:49,520 --> 00:03:50,760
And what about the third stage?

115
00:03:50,760 --> 00:03:51,560
What happens there?

116
00:03:51,560 --> 00:03:53,920
Stage three is all about fine-tuning.

117
00:03:53,920 --> 00:03:56,360
It's like taking those general skills from boot camp

118
00:03:56,360 --> 00:03:58,920
and applying them to a specialized field.

119
00:03:58,920 --> 00:04:02,600
They take the model and train it further on specific tasks

120
00:04:02,600 --> 00:04:05,600
to make it an expert in that particular area.

121
00:04:05,600 --> 00:04:07,840
So it's like sending the AI to grad school

122
00:04:07,840 --> 00:04:09,360
to become a specialist.

123
00:04:09,360 --> 00:04:10,560
Makes sense.

124
00:04:10,560 --> 00:04:12,520
Now, the big question,

125
00:04:12,520 --> 00:04:15,080
did all that training actually pay off?

126
00:04:15,080 --> 00:04:17,560
Did PaliGemma 2 live up to the hype?

127
00:04:17,560 --> 00:04:18,560
It absolutely did.

128
00:04:18,560 --> 00:04:20,720
And the results are pretty impressive.

129
00:04:20,720 --> 00:04:22,080
OK, I'm all ears.

130
00:04:22,080 --> 00:04:23,280
Spill the beans.

131
00:04:23,280 --> 00:04:25,680
What kind of amazing things can PaliGemma 2 do?

132
00:04:25,680 --> 00:04:28,640
Well, first of all, it outperformed the original PaliGemma

133
00:04:28,640 --> 00:04:30,280
by a significant margin.

134
00:04:30,280 --> 00:04:33,720
Not only that, it even beat some of the current state-of-the-art

135
00:04:33,720 --> 00:04:34,520
models out there.

136
00:04:34,520 --> 00:04:35,000
Wow.

137
00:04:35,000 --> 00:04:37,640
So it's like the valedictorian of the VLM class.

138
00:04:37,640 --> 00:04:38,200
That's a kettle.

139
00:04:38,200 --> 00:04:39,040
It's pretty impressive.

140
00:04:39,040 --> 00:04:40,840
And what's even more exciting is that they

141
00:04:40,840 --> 00:04:43,960
tested it on some unique tasks, going beyond the typical things

142
00:04:43,960 --> 00:04:45,240
you see with VLMs.

143
00:04:45,240 --> 00:04:46,120
OK, I'm intrigued.

144
00:04:46,120 --> 00:04:48,120
What kind of unusual tasks are we talking about here?

145
00:04:48,120 --> 00:04:49,040
Give me some examples.

146
00:04:49,040 --> 00:04:51,120
Well, remember that OCR ability we talked about?

147
00:04:51,120 --> 00:04:54,120
PaliGemma 2 actually outperformed a specialized OCR

148
00:04:54,120 --> 00:04:55,720
model called HTS.

149
00:04:55,720 --> 00:04:56,120
Hold on.

150
00:04:56,120 --> 00:04:59,040
So it's beating the specialists at their own game.

151
00:04:59,040 --> 00:04:59,960
It seems so.

152
00:04:59,960 --> 00:05:01,360
And that's just the beginning.

153
00:05:01,360 --> 00:05:04,280
It can also understand tables, not just the words in them,

154
00:05:04,280 --> 00:05:06,360
but the actual layout and structure.

155
00:05:06,360 --> 00:05:08,320
You mean it can figure out which rows are headings,

156
00:05:08,320 --> 00:05:10,040
which columns are related, all that?

157
00:05:10,040 --> 00:05:11,280
That's amazing.

158
00:05:11,280 --> 00:05:11,800
Exactly.

159
00:05:11,800 --> 00:05:13,560
It's showing a deep understanding

160
00:05:13,560 --> 00:05:15,400
of the visual information.

161
00:05:15,400 --> 00:05:16,480
And there's more.

162
00:05:16,480 --> 00:05:20,040
Imagine showing it a drawing of a molecule.

163
00:05:20,040 --> 00:05:22,800
And it can actually figure out the chemical structure

164
00:05:22,800 --> 00:05:26,160
represented by that drawing, spitting out something called

165
00:05:26,160 --> 00:05:29,600
a SMILES string, which is like a code describing

166
00:05:29,600 --> 00:05:30,920
the molecule's makeup.

167
00:05:30,920 --> 00:05:31,840
Wow.

168
00:05:31,840 --> 00:05:33,760
That's getting into some serious science stuff.

169
00:05:33,760 --> 00:05:36,080
PaliGemma 2 sounds like it could be incredibly valuable

170
00:05:36,080 --> 00:05:37,920
for researchers in chemistry and beyond.

171
00:05:37,920 --> 00:05:38,520
Absolutely.

172
00:05:38,520 --> 00:05:38,920
And get this.

173
00:05:38,920 --> 00:05:41,760
It can actually read sheet music and translate it

174
00:05:41,760 --> 00:05:43,200
into a digital format.

175
00:05:43,200 --> 00:05:44,880
Wait, are you serious?

176
00:05:44,880 --> 00:05:48,200
This AI can hear music just by looking at the sheet music?

177
00:05:48,200 --> 00:05:49,600
That's what the research suggests.

178
00:05:49,600 --> 00:05:52,560
Imagine taking a picture of a handwritten musical score

179
00:05:52,560 --> 00:05:55,480
and having it instantly playable on your computer.

180
00:05:55,480 --> 00:05:56,120
Wow.

181
00:05:56,120 --> 00:05:57,200
That's mind-blowing.

182
00:05:57,200 --> 00:05:59,600
What else can this AI prodigy do?

183
00:05:59,600 --> 00:06:01,440
Well, one thing that really stood out to me

184
00:06:01,440 --> 00:06:05,520
was its ability to generate incredibly detailed captions.

185
00:06:05,520 --> 00:06:06,360
Detailed captions.

186
00:06:06,360 --> 00:06:07,800
You mean like more than just saying,

187
00:06:07,800 --> 00:06:08,920
this is a picture of a cat.

188
00:06:08,920 --> 00:06:09,960
Way more than that.

189
00:06:09,960 --> 00:06:12,960
These captions are long and provide factual information

190
00:06:12,960 --> 00:06:15,480
about the image, almost like a short story.

191
00:06:15,480 --> 00:06:17,680
So it's like having an AI storyteller that

192
00:06:17,680 --> 00:06:21,440
can describe a picture with incredible accuracy and detail.

193
00:06:21,440 --> 00:06:22,000
Exactly.

194
00:06:22,000 --> 00:06:23,600
They tested it on a data set called

195
00:06:23,600 --> 00:06:27,160
DOCCI, which stands for Descriptions of Connected

196
00:06:27,160 --> 00:06:28,800
and Contrasting Images.

197
00:06:28,800 --> 00:06:32,280
And PaliGemma 2 generated these super detailed captions,

198
00:06:32,280 --> 00:06:34,240
even pointing out the spatial relationships

199
00:06:34,240 --> 00:06:36,960
between objects, the number of objects,

200
00:06:36,960 --> 00:06:39,080
and adding general knowledge about the world.

201
00:06:39,080 --> 00:06:40,000
Wow.

202
00:06:40,000 --> 00:06:41,800
Almost like it's thinking for itself.

203
00:06:41,800 --> 00:06:44,480
Well, it is processing information

204
00:06:44,480 --> 00:06:46,280
and making connections in a way that's

205
00:06:46,280 --> 00:06:47,960
quite similar to human reasoning.

206
00:06:47,960 --> 00:06:49,240
It's pretty remarkable.

207
00:06:49,240 --> 00:06:52,920
So we've got an AI that can read, recognize complex structures,

208
00:06:52,920 --> 00:06:55,640
hear music, and tell detailed stories.

209
00:06:55,640 --> 00:06:57,320
It sounds almost too good to be true.

210
00:06:57,320 --> 00:06:59,560
Is there anything this model can't do?

211
00:06:59,560 --> 00:07:00,560
That's a great question.

212
00:07:00,560 --> 00:07:02,720
And it leads us to some really interesting findings

213
00:07:02,720 --> 00:07:04,880
about PaliGemma 2's limitations.

214
00:07:04,880 --> 00:07:06,280
Limitations.

215
00:07:06,280 --> 00:07:07,480
OK, now I'm really curious.

216
00:07:07,480 --> 00:07:08,560
What are we talking about here?

217
00:07:08,560 --> 00:07:10,320
Let's delve into those next.

218
00:07:10,320 --> 00:07:13,880
So as amazing as PaliGemma 2 is,

219
00:07:13,880 --> 00:07:16,120
it does have its limitations.

220
00:07:16,120 --> 00:07:17,760
No AI is perfect, right?

221
00:07:17,760 --> 00:07:18,920
Right, of course.

222
00:07:18,920 --> 00:07:20,680
What kind of limitations did they find?

223
00:07:20,680 --> 00:07:23,480
Well, one thing they noticed was that the largest model, the one

224
00:07:23,480 --> 00:07:26,240
with a whopping 28 billion parameters.

225
00:07:26,240 --> 00:07:27,800
Yeah, the one we were talking about earlier.

226
00:07:27,800 --> 00:07:29,240
Yeah, exactly.

227
00:07:29,240 --> 00:07:33,480
That one didn't always perform as well as they expected.

228
00:07:33,480 --> 00:07:34,120
That's interesting.

229
00:07:34,120 --> 00:07:36,480
I would have thought bigger would always mean better.

230
00:07:36,480 --> 00:07:37,880
Why do you think that was?

231
00:07:37,880 --> 00:07:39,600
Well, the researchers think it might

232
00:07:39,600 --> 00:07:42,160
be because of how it was trained.

233
00:07:42,160 --> 00:07:46,400
This particular model was trained completely from scratch.

234
00:07:46,400 --> 00:07:49,280
But the smaller models benefited from a technique

235
00:07:49,280 --> 00:07:51,200
called distillation.

236
00:07:51,200 --> 00:07:52,280
Distillation.

237
00:07:52,280 --> 00:07:54,080
Now, that's a word I haven't heard in a while.

238
00:07:54,080 --> 00:07:55,640
What does that mean in this context?

239
00:07:55,640 --> 00:07:58,120
OK, imagine a master chef teaching a student

240
00:07:58,120 --> 00:07:59,600
a complex recipe.

241
00:07:59,600 --> 00:08:02,520
Instead of going through every single step from scratch,

242
00:08:02,520 --> 00:08:05,280
they might give the student a simplified version first.

243
00:08:05,280 --> 00:08:06,120
I see.

244
00:08:06,120 --> 00:08:08,160
So they get the gist of it without getting bogged down

245
00:08:08,160 --> 00:08:09,240
in all the details.

246
00:08:09,240 --> 00:08:10,120
Exactly.

247
00:08:10,120 --> 00:08:12,760
And distillation for AI is kind of like that.

248
00:08:12,760 --> 00:08:15,160
You take a larger, more complex model

249
00:08:15,160 --> 00:08:17,840
and use it to teach a smaller model, transferring knowledge

250
00:08:17,840 --> 00:08:18,720
more efficiently.

251
00:08:18,720 --> 00:08:21,000
So it's like giving the smaller models a head start.

252
00:08:21,000 --> 00:08:22,840
Yeah, you could say that.

253
00:08:22,840 --> 00:08:24,480
And it seems like that gave them an edge

254
00:08:24,480 --> 00:08:26,040
when it came to performance.

255
00:08:26,040 --> 00:08:28,520
It highlights that it's not just about the size of the model,

256
00:08:28,520 --> 00:08:30,000
but also how you train it.

257
00:08:30,000 --> 00:08:31,000
So it's not just brawn.

258
00:08:31,000 --> 00:08:32,640
It's about brains, too.

259
00:08:32,640 --> 00:08:33,800
Exactly.

260
00:08:33,800 --> 00:08:36,520
And speaking of brains, another interesting observation

261
00:08:36,520 --> 00:08:38,920
they made was about how PaliGemma 2 handled

262
00:08:38,920 --> 00:08:41,080
traditional object detection tasks.

263
00:08:41,080 --> 00:08:42,040
Oh, interesting.

264
00:08:42,040 --> 00:08:43,360
How did it do with that?

265
00:08:43,360 --> 00:08:45,480
I'm assuming not great if you're bringing it up, no?

266
00:08:45,480 --> 00:08:47,360
Well, it wasn't bad.

267
00:08:47,360 --> 00:08:49,680
But it didn't perform as well as they expected,

268
00:08:49,680 --> 00:08:52,560
considering how well it did with those other complex tasks,

269
00:08:52,560 --> 00:08:55,080
like understanding tables and molecules.

270
00:08:55,080 --> 00:08:56,640
Yeah, that is kind of surprising.

271
00:08:56,640 --> 00:08:58,640
You'd think if it can handle those things,

272
00:08:58,640 --> 00:09:01,080
basic object detection would be a piece of cake.

273
00:09:01,080 --> 00:09:01,760
Right.

274
00:09:01,760 --> 00:09:02,640
You'd think so.

275
00:09:02,640 --> 00:09:05,560
But it seems like even though it can see and understand

276
00:09:05,560 --> 00:09:08,440
images, it's not always speaking the same language

277
00:09:08,440 --> 00:09:09,960
as those traditional tests.

278
00:09:09,960 --> 00:09:11,120
That's a great way to put it.

279
00:09:11,120 --> 00:09:14,320
It's like it's learned a whole new way of seeing the world,

280
00:09:14,320 --> 00:09:16,480
but it still needs to figure out how to communicate that

281
00:09:16,480 --> 00:09:17,440
with the old systems.

282
00:09:17,440 --> 00:09:18,280
Exactly.

283
00:09:18,280 --> 00:09:19,800
The researchers believe this might

284
00:09:19,800 --> 00:09:22,720
be because of the language-based training approach they used

285
00:09:22,720 --> 00:09:24,160
for PaliGemma 2.

286
00:09:24,160 --> 00:09:26,920
It doesn't quite align with how object detection is usually

287
00:09:26,920 --> 00:09:27,680
evaluated.

288
00:09:27,680 --> 00:09:28,480
Makes sense.

289
00:09:28,480 --> 00:09:31,440
So are there any areas where this unique approach actually

290
00:09:31,440 --> 00:09:33,760
gives PaliGemma 2 an advantage?

291
00:09:33,760 --> 00:09:35,320
That's a great question.

292
00:09:35,320 --> 00:09:38,240
And it leads us to one of the most exciting applications

293
00:09:38,240 --> 00:09:40,320
of this technology, medicine.

294
00:09:40,320 --> 00:09:40,720
Oh, yeah.

295
00:09:40,720 --> 00:09:42,440
You mentioned earlier that PaliGemma 2

296
00:09:42,440 --> 00:09:44,840
had some really impressive results when it came

297
00:09:44,840 --> 00:09:48,000
to analyzing medical images, particularly chest X-rays.

298
00:09:48,000 --> 00:09:48,600
Exactly.

299
00:09:48,600 --> 00:09:53,560
They trained on a massive data set called MIMIC-CXR,

300
00:09:53,560 --> 00:09:57,160
which has over 377,000 chest X-rays

301
00:09:57,160 --> 00:09:59,520
and their corresponding radiologist reports.

302
00:09:59,520 --> 00:10:00,000
Wow.

303
00:10:00,000 --> 00:10:01,280
That's a lot of X-rays.

304
00:10:01,280 --> 00:10:03,760
So they basically showed the AI a ton of examples

305
00:10:03,760 --> 00:10:04,960
and taught it what to look for.

306
00:10:04,960 --> 00:10:05,440
Right.

307
00:10:05,440 --> 00:10:07,120
And the results were really remarkable.

308
00:10:07,120 --> 00:10:09,880
PaliGemma 2 was able to analyze chest X-rays

309
00:10:09,880 --> 00:10:11,800
and generate reports that were similar to what

310
00:10:11,800 --> 00:10:13,560
a human radiologist might write.

311
00:10:13,560 --> 00:10:13,920
Wow.

312
00:10:13,920 --> 00:10:16,000
So it's like having an AI radiologist.

313
00:10:16,000 --> 00:10:19,440
How did they even measure how accurate it was?

314
00:10:19,440 --> 00:10:22,400
They used a special metric called RadGraph, which

315
00:10:22,400 --> 00:10:25,720
compares the AI's reports to the actual radiologist reports

316
00:10:25,720 --> 00:10:26,840
and get this.

317
00:10:26,840 --> 00:10:30,200
PaliGemma 2 achieved state-of-the-art results.

318
00:10:30,200 --> 00:10:30,800
State-of-the-art.

319
00:10:30,800 --> 00:10:32,280
So it's not just doing a decent job.

320
00:10:32,280 --> 00:10:33,680
It's actually one of the best out there.

321
00:10:33,680 --> 00:10:34,600
That's right.

322
00:10:34,600 --> 00:10:36,640
And just like with those other tasks,

323
00:10:36,640 --> 00:10:38,400
they found that increasing the resolution

324
00:10:38,400 --> 00:10:41,560
and the size of the model led to even better performance.

325
00:10:41,560 --> 00:10:42,280
That's incredible.

326
00:10:42,280 --> 00:10:44,320
It's like giving the AI a better microscope

327
00:10:44,320 --> 00:10:46,320
to examine those X-rays.

328
00:10:46,320 --> 00:10:48,440
This has huge potential for health care.

329
00:10:48,440 --> 00:10:51,760
Imagine being able to use AI to assist radiologists,

330
00:10:51,760 --> 00:10:54,160
especially in areas where there's a shortage of specialists.

331
00:10:54,160 --> 00:10:54,880
Exactly.

332
00:10:54,880 --> 00:10:56,800
It could help streamline diagnoses,

333
00:10:56,800 --> 00:10:59,840
make health care more accessible, and ultimately save lives.

334
00:10:59,840 --> 00:11:00,480
That's amazing.

335
00:11:00,480 --> 00:11:02,320
This is truly groundbreaking stuff.

336
00:11:02,320 --> 00:11:04,600
It sounds like the research on PaliGemma 2

337
00:11:04,600 --> 00:11:07,000
isn't just about incremental improvements.

338
00:11:07,000 --> 00:11:08,840
It's about fundamentally changing

339
00:11:08,840 --> 00:11:10,400
what's possible with AI.

340
00:11:10,400 --> 00:11:11,120
You got it.

341
00:11:11,120 --> 00:11:14,360
It's a giant leap towards creating truly versatile AI

342
00:11:14,360 --> 00:11:17,400
that can understand and interact with the world in ways

343
00:11:17,400 --> 00:11:19,200
we never thought possible before.

344
00:11:19,200 --> 00:11:20,400
I'm really blown away by all this.

345
00:11:20,400 --> 00:11:23,000
It makes me wonder, what's next?

346
00:11:23,000 --> 00:11:24,760
Where do we go from here?

347
00:11:24,760 --> 00:11:26,080
So where do we go from here?

348
00:11:26,080 --> 00:11:29,240
What's next for PaliGemma 2 and VLMs in general?

349
00:11:29,240 --> 00:11:30,560
What's the future look like?

350
00:11:30,560 --> 00:11:33,240
It's hard to say for sure, but I think the possibilities

351
00:11:33,240 --> 00:11:34,520
are incredibly exciting.

352
00:11:34,520 --> 00:11:36,640
The potential applications of this technology

353
00:11:36,640 --> 00:11:38,560
are vast, really.

354
00:11:38,560 --> 00:11:41,360
We've already talked about medicine and scientific research,

355
00:11:41,360 --> 00:11:44,080
but imagine AI that can not only analyze data,

356
00:11:44,080 --> 00:11:45,920
but also generate creative content.

357
00:11:45,920 --> 00:11:47,640
And I mean, based on visual input.

358
00:11:47,640 --> 00:11:50,160
You're talking about AI composing music, maybe?

359
00:11:50,160 --> 00:11:52,600
Or designing molecules, even writing screenplays

360
00:11:52,600 --> 00:11:53,800
based on what it sees.

361
00:11:53,800 --> 00:11:54,440
Exactly.

362
00:11:54,440 --> 00:11:56,160
It really raises some interesting questions

363
00:11:56,160 --> 00:11:57,720
about the future of creativity.

364
00:11:57,720 --> 00:12:00,360
What's the role of AI going to be in our lives?

365
00:12:00,360 --> 00:12:01,120
It's fascinating.

366
00:12:01,120 --> 00:12:01,840
For sure.

367
00:12:01,840 --> 00:12:04,000
But beyond those creative applications,

368
00:12:04,000 --> 00:12:06,480
there's also the potential for AI like this

369
00:12:06,480 --> 00:12:08,920
to help us tackle some of the world's biggest problems.

370
00:12:08,920 --> 00:12:09,560
Absolutely.

371
00:12:09,560 --> 00:12:12,040
I mean, think about climate change, disease diagnosis,

372
00:12:12,040 --> 00:12:13,480
even poverty.

373
00:12:13,480 --> 00:12:16,680
AI that can understand both images and text

374
00:12:16,680 --> 00:12:18,560
could be a game changer.

375
00:12:18,560 --> 00:12:20,440
It's like having this super-powered research

376
00:12:20,440 --> 00:12:23,800
assistant sift through mountains of data,

377
00:12:23,800 --> 00:12:26,520
identify patterns, help us find solutions

378
00:12:26,520 --> 00:12:27,960
that we might miss otherwise.

379
00:12:27,960 --> 00:12:28,600
Precisely.

380
00:12:28,600 --> 00:12:30,760
And maybe even solutions we couldn't have imagined

381
00:12:30,760 --> 00:12:31,720
without AI.

382
00:12:31,720 --> 00:12:32,400
Wow.

383
00:12:32,400 --> 00:12:34,320
So it sounds like the research on PaliGemma 2

384
00:12:34,320 --> 00:12:36,840
isn't just about making things a little bit better.

385
00:12:36,840 --> 00:12:39,960
It's about fundamentally changing what AI can do.

386
00:12:39,960 --> 00:12:41,440
I think that's a great way to put it.

387
00:12:41,440 --> 00:12:45,720
It's a big step forward in creating truly versatile AI.

388
00:12:45,720 --> 00:12:49,080
AI that can understand the world in a way that, well,

389
00:12:49,080 --> 00:12:51,040
was unimaginable before.

390
00:12:51,040 --> 00:12:53,840
It really feels like we're on the edge of something big here.

391
00:12:53,840 --> 00:12:56,680
It's just been an incredible deep dive into PaliGemma 2.

392
00:12:56,680 --> 00:12:59,320
I'm definitely walking away with a sense of awe

393
00:12:59,320 --> 00:13:01,800
at what this technology can achieve, that's for sure.

394
00:13:01,800 --> 00:13:03,400
It's been a pleasure exploring this with you.

395
00:13:03,400 --> 00:13:05,200
The future of AI is definitely bright.

396
00:13:05,200 --> 00:13:07,360
Yeah, I'm excited to see what breakthroughs come next.

397
00:13:07,360 --> 00:13:10,920
And to everyone listening out there, stay curious.

398
00:13:10,920 --> 00:13:31,080
Until next time, this is the Deep Dive, signing off.

