1
00:00:00,000 --> 00:00:02,920
All right, let's dive into some AI and video today.

2
00:00:02,920 --> 00:00:03,800
Sounds good to me.

3
00:00:03,800 --> 00:00:06,560
AI that can like watch and understand videos.

4
00:00:06,560 --> 00:00:08,160
Yeah, you know, just like we do.

5
00:00:08,160 --> 00:00:08,480
Yeah.

6
00:00:08,480 --> 00:00:11,880
So we're looking at this paper called Apollo,

7
00:00:11,880 --> 00:00:14,560
an exploration of video understanding

8
00:00:14,560 --> 00:00:17,160
in large multimodal models.

9
00:00:17,160 --> 00:00:19,600
It's all about teaching AI to make sense

10
00:00:19,600 --> 00:00:23,120
of what it's seeing in a video.

11
00:00:23,120 --> 00:00:24,920
So where do we even start with this?

12
00:00:24,920 --> 00:00:26,480
Well, you know, it's pretty complex.

13
00:00:26,480 --> 00:00:28,480
The researchers are working with these things called

14
00:00:28,480 --> 00:00:32,160
large multimodal models or LMMs.

15
00:00:32,160 --> 00:00:32,840
LMMs, OK.

16
00:00:32,840 --> 00:00:35,560
Yeah, think of them like really powerful AI brains

17
00:00:35,560 --> 00:00:38,560
that can process different types of information,

18
00:00:38,560 --> 00:00:41,520
like text and images and video all together.

19
00:00:41,520 --> 00:00:44,040
So it's like teaching these AI brains

20
00:00:44,040 --> 00:00:46,080
to watch and understand a movie.

21
00:00:46,080 --> 00:00:46,600
Right.

22
00:00:46,600 --> 00:00:49,080
But isn't that like a ton of information to process?

23
00:00:49,080 --> 00:00:49,560
It is.

24
00:00:49,560 --> 00:00:49,920
It is.

25
00:00:49,920 --> 00:00:51,400
And that's one of the biggest challenges.

26
00:00:51,400 --> 00:00:52,840
And you know, the researchers had

27
00:00:52,840 --> 00:00:56,040
to tackle the sheer amount of data involved

28
00:00:56,040 --> 00:00:57,560
in video processing.

29
00:00:57,560 --> 00:01:00,200
Just imagine all the pixels and the movement,

30
00:01:00,200 --> 00:01:01,760
the different objects, the actions.

31
00:01:01,760 --> 00:01:02,600
It's a lot.

32
00:01:02,600 --> 00:01:03,760
It's a computational beast.

33
00:01:03,760 --> 00:01:04,640
It really is.

34
00:01:04,640 --> 00:01:06,880
So how did they even begin to approach this?

35
00:01:06,880 --> 00:01:08,240
How do you even start?

36
00:01:08,240 --> 00:01:10,360
Well, one of the really cool things they discovered

37
00:01:10,360 --> 00:01:13,440
is something called scaling consistency.

38
00:01:13,440 --> 00:01:13,880
OK.

39
00:01:13,880 --> 00:01:17,000
I like to think of it like a shortcut for AI research.

40
00:01:17,000 --> 00:01:17,680
OK.

41
00:01:17,680 --> 00:01:18,680
I'm intrigued.

42
00:01:18,680 --> 00:01:19,960
Tell me more about this shortcut.

43
00:01:19,960 --> 00:01:20,960
What is it?

44
00:01:20,960 --> 00:01:22,960
So basically, they found that they

45
00:01:22,960 --> 00:01:26,880
could test their ideas on smaller LMMs

46
00:01:26,880 --> 00:01:28,720
and smaller data sets.

47
00:01:28,720 --> 00:01:31,280
And the results they get will likely hold true,

48
00:01:31,280 --> 00:01:34,440
even when they scale up to those massive, computationally

49
00:01:34,440 --> 00:01:36,160
expensive models.

50
00:01:36,160 --> 00:01:39,720
So they don't have to train these huge models every time

51
00:01:39,720 --> 00:01:40,840
they want to try something new.

52
00:01:40,840 --> 00:01:41,280
Exactly.

53
00:01:41,280 --> 00:01:43,080
So it saves a lot of time and resources.

54
00:01:43,080 --> 00:01:43,800
Yeah.

55
00:01:43,800 --> 00:01:46,520
It's a really clever way to make research more efficient.

56
00:01:46,520 --> 00:01:46,880
Yeah.

57
00:01:46,880 --> 00:01:49,000
And it opens up a lot of possibilities

58
00:01:49,000 --> 00:01:50,800
for faster progress.

59
00:01:50,800 --> 00:01:52,880
That's a pretty big win right out of the gate.

60
00:01:52,880 --> 00:01:55,760
So what did they do with this newfound efficiency?

61
00:01:55,760 --> 00:02:00,600
Did they start building like a super video understanding AI?

62
00:02:00,600 --> 00:02:01,520
Well, not quite.

63
00:02:01,520 --> 00:02:05,800
First, they realized that the standard tests used

64
00:02:05,800 --> 00:02:09,000
to measure how good an AI is at understanding video

65
00:02:09,000 --> 00:02:10,200
were kind of flawed.

66
00:02:10,200 --> 00:02:11,680
The tests themselves were the problem.

67
00:02:11,680 --> 00:02:12,200
Yeah.

68
00:02:12,200 --> 00:02:14,040
They found that a lot of these tests

69
00:02:14,040 --> 00:02:16,560
could be cheated by the AI.

70
00:02:16,560 --> 00:02:19,400
Instead of actually understanding the video,

71
00:02:19,400 --> 00:02:24,120
the AI could find workarounds by focusing only on the text that

72
00:02:24,120 --> 00:02:25,000
was with the video.

73
00:02:25,000 --> 00:02:25,560
Oh, interesting.

74
00:02:25,560 --> 00:02:27,640
Or even just a single frame from the video.

75
00:02:27,640 --> 00:02:31,200
So it was like the AI was acing the test

76
00:02:31,200 --> 00:02:33,320
without really learning the material.

77
00:02:33,320 --> 00:02:34,760
That's a great way to put it.

78
00:02:34,760 --> 00:02:36,400
And that's why the researchers decided

79
00:02:36,400 --> 00:02:41,080
to create their own benchmark, a new test called ApolloBench,

80
00:02:41,080 --> 00:02:44,000
which is much tougher, specifically designed

81
00:02:44,000 --> 00:02:47,480
to measure true video understanding.

82
00:02:47,480 --> 00:02:50,280
OK, so ApolloBench is like the ultimate test

83
00:02:50,280 --> 00:02:52,560
for a video-savvy AI.

84
00:02:52,560 --> 00:02:55,800
What kind of challenges does it throw at them?

85
00:02:55,800 --> 00:02:59,240
Oh, it includes things like reading text that

86
00:02:59,240 --> 00:03:00,680
appears in the video over time.

87
00:03:00,680 --> 00:03:01,320
Oh, wow.

88
00:03:01,320 --> 00:03:05,120
Understanding where objects are and how they move,

89
00:03:05,120 --> 00:03:08,200
figuring out cause and effect relationships.

90
00:03:08,200 --> 00:03:10,040
And even making sense of videos that

91
00:03:10,040 --> 00:03:12,600
are taken from a first-person perspective.

92
00:03:12,600 --> 00:03:13,080
Really?

93
00:03:13,080 --> 00:03:13,580
Yeah.

94
00:03:13,580 --> 00:03:15,240
So they really raised the bar for what

95
00:03:15,240 --> 00:03:17,480
it means to understand a video.

96
00:03:17,480 --> 00:03:18,800
They did, absolutely.

97
00:03:18,800 --> 00:03:21,880
And not only is it more challenging,

98
00:03:21,880 --> 00:03:25,160
but it's also a lot faster to evaluate

99
00:03:25,160 --> 00:03:28,040
than those older benchmarks, which means researchers

100
00:03:28,040 --> 00:03:32,600
get quicker feedback and can make progress more rapidly.

101
00:03:32,600 --> 00:03:34,640
So they streamlined the testing process, too.

102
00:03:34,640 --> 00:03:36,800
It's like they're clearing all the hurdles

103
00:03:36,800 --> 00:03:40,520
to make video understanding research more accessible.

104
00:03:40,520 --> 00:03:42,120
I'm really curious, how did they actually

105
00:03:42,120 --> 00:03:46,400
go about building this amazing video understanding AI?

106
00:03:46,400 --> 00:03:51,040
Well, they dove into the nitty gritty of LMM design.

107
00:03:51,040 --> 00:03:54,080
Systematically testing different approaches

108
00:03:54,080 --> 00:03:55,600
to see what worked best.

109
00:03:55,600 --> 00:03:57,040
It was a lot of trial and error.

110
00:03:57,040 --> 00:04:00,160
But they uncovered some really insightful things along the way.

111
00:04:00,160 --> 00:04:00,600
I bet.

112
00:04:00,600 --> 00:04:03,000
So let's unpack some of those design decisions.

113
00:04:03,000 --> 00:04:06,360
So first up, how does an AI even watch a video?

114
00:04:06,360 --> 00:04:07,560
It doesn't have eyes, right?

115
00:04:07,560 --> 00:04:08,120
Exactly.

116
00:04:08,120 --> 00:04:09,960
That's a great question.

117
00:04:09,960 --> 00:04:14,000
They had to figure out how to feed the video information

118
00:04:14,000 --> 00:04:15,160
to the AI.

119
00:04:15,160 --> 00:04:17,600
Should it look at every single frame?

120
00:04:17,600 --> 00:04:18,640
Just a few?

121
00:04:18,640 --> 00:04:22,200
Should it focus on the action-packed moments?

122
00:04:22,200 --> 00:04:24,760
Or take a more even approach?

123
00:04:24,760 --> 00:04:26,400
There's a lot of ways to go about it.

124
00:04:26,400 --> 00:04:28,720
So what did they find was the best strategy?

125
00:04:28,720 --> 00:04:32,760
So they found that sampling based on frames per second,

126
00:04:32,760 --> 00:04:37,440
or FPS, was crucial for the AI to understand

127
00:04:37,440 --> 00:04:39,400
the flow of time in a video.

128
00:04:39,400 --> 00:04:41,360
FPS, like what we see in video games and movies.

129
00:04:41,360 --> 00:04:42,240
Exactly.

130
00:04:42,240 --> 00:04:43,040
Think about it.

131
00:04:43,040 --> 00:04:46,200
If you want to understand how fast a car is going,

132
00:04:46,200 --> 00:04:48,480
you need to see how its position changes.

133
00:04:48,480 --> 00:04:49,960
From one frame to the next.

134
00:04:49,960 --> 00:04:55,680
So FPS sampling gives the AI that kind of temporal information.

135
00:04:55,680 --> 00:04:58,720
So it's not just about seeing what's in each frame.

136
00:04:58,720 --> 00:05:02,120
It's about understanding how those frames connect over time

137
00:05:02,120 --> 00:05:03,320
to tell a story.

138
00:05:03,320 --> 00:05:03,960
Exactly.

139
00:05:03,960 --> 00:05:04,760
You got it.

140
00:05:04,760 --> 00:05:09,240
It's about giving the AI a sense of timing and sequence.

141
00:05:09,240 --> 00:05:10,440
That makes a lot of sense.

142
00:05:10,440 --> 00:05:13,000
Now, the paper also mentioned something about vision encoders.

143
00:05:13,000 --> 00:05:16,040
Can you explain what those are and why they're important?

144
00:05:16,040 --> 00:05:16,440
Sure.

145
00:05:16,440 --> 00:05:20,680
So vision encoders are like translators for the AI.

146
00:05:20,680 --> 00:05:23,800
They take the raw visual data from the video frames,

147
00:05:23,800 --> 00:05:26,280
all those pixels and colors, and convert them

148
00:05:26,280 --> 00:05:28,360
into a language that AI can understand.

149
00:05:28,360 --> 00:05:30,800
So they're like turning light and shadows into something

150
00:05:30,800 --> 00:05:32,440
the AI can actually process.

151
00:05:32,440 --> 00:05:33,520
Exactly.

152
00:05:33,520 --> 00:05:35,760
And the researchers experimented with different types

153
00:05:35,760 --> 00:05:36,920
of vision encoders.

154
00:05:36,920 --> 00:05:39,760
Some that focus on spatial information,

155
00:05:39,760 --> 00:05:42,400
like the layout of a scene, and others that

156
00:05:42,400 --> 00:05:44,840
are better at capturing temporal information,

157
00:05:44,840 --> 00:05:46,880
like the movement of objects.

158
00:05:46,880 --> 00:05:48,960
It seems like having both of those perspectives

159
00:05:48,960 --> 00:05:51,880
would be important for understanding a video as a whole.

160
00:05:51,880 --> 00:05:52,680
You're right.

161
00:05:52,680 --> 00:05:55,960
And they found that combining a powerful image encoder called

162
00:05:55,960 --> 00:06:00,560
SigLIP with a video encoder called InternVideo2

163
00:06:00,560 --> 00:06:02,320
gave them the best results.

164
00:06:02,320 --> 00:06:06,120
It was like having two different experts analyze the video

165
00:06:06,120 --> 00:06:07,720
and then put their insights together.

166
00:06:07,720 --> 00:06:09,440
Teamwork, even within the AI itself.

167
00:06:09,440 --> 00:06:10,400
Exactly.

168
00:06:10,400 --> 00:06:11,680
OK.

169
00:06:11,680 --> 00:06:16,880
I'm really starting to see how much thought and experimentation

170
00:06:16,880 --> 00:06:20,320
went into building this video understanding system.

171
00:06:20,320 --> 00:06:24,120
But I have to wonder, videos generate a ton of data.

172
00:06:24,120 --> 00:06:27,320
How do they manage to keep things manageable for the AI?

173
00:06:27,320 --> 00:06:28,280
That's a good question.

174
00:06:28,280 --> 00:06:28,800
Yeah.

175
00:06:28,800 --> 00:06:32,760
That's where a clever technique called video token resampling

176
00:06:32,760 --> 00:06:33,360
comes in.

177
00:06:33,360 --> 00:06:33,840
OK.

178
00:06:33,840 --> 00:06:36,200
It's all about compressing the visual information

179
00:06:36,200 --> 00:06:37,880
without losing the important details.

180
00:06:37,880 --> 00:06:40,720
So creating a super efficient summary of each frame.

181
00:06:40,720 --> 00:06:41,520
Exactly.

182
00:06:41,520 --> 00:06:41,920
OK.

183
00:06:41,920 --> 00:06:45,480
And they found that using a perceiver resampler

184
00:06:45,480 --> 00:06:47,680
was the most effective way to do this.

185
00:06:47,680 --> 00:06:51,200
It's like a smart filter that helps the AI focus on what

186
00:06:51,200 --> 00:06:53,120
matters most in each frame.

187
00:06:53,120 --> 00:06:54,600
That sounds pretty high tech.

188
00:06:54,600 --> 00:06:55,080
It is.

189
00:06:55,080 --> 00:06:57,600
So I'm guessing they had to test a bunch of different resampling

190
00:06:57,600 --> 00:06:59,000
methods to find the best one.

191
00:06:59,000 --> 00:06:59,800
They did.

192
00:06:59,800 --> 00:07:00,640
They did.

193
00:07:00,640 --> 00:07:02,560
And this one really stood out for its ability

194
00:07:02,560 --> 00:07:05,480
to condense the data while retaining

195
00:07:05,480 --> 00:07:07,360
crucial visual information.

196
00:07:07,360 --> 00:07:07,600
OK.

197
00:07:07,600 --> 00:07:10,240
So they figured out how to streamline the data.

198
00:07:10,240 --> 00:07:13,160
What about actually connecting the visual information

199
00:07:13,160 --> 00:07:14,440
with text?

200
00:07:14,440 --> 00:07:14,920
Yeah.

201
00:07:14,920 --> 00:07:19,680
How does the AI link what it's seeing to the questions

202
00:07:19,680 --> 00:07:20,440
it's being asked?

203
00:07:20,440 --> 00:07:21,000
Yeah.

204
00:07:21,000 --> 00:07:23,360
That's where video token integration comes into play.

205
00:07:23,360 --> 00:07:23,680
OK.

206
00:07:23,680 --> 00:07:27,280
It's all about how to weave the visual information

207
00:07:27,280 --> 00:07:30,280
into the stream of text that the AI is processing.

208
00:07:30,280 --> 00:07:30,640
Right.

209
00:07:30,640 --> 00:07:32,240
Because ultimately, we want the AI

210
00:07:32,240 --> 00:07:35,200
to be able to answer our questions about the video.

211
00:07:35,200 --> 00:07:35,960
Exactly.

212
00:07:35,960 --> 00:07:38,640
And something as simple as adding timestamps

213
00:07:38,640 --> 00:07:39,800
between video clips.

214
00:07:39,800 --> 00:07:40,040
OK.

215
00:07:40,040 --> 00:07:42,000
It turned out to be a major improvement.

216
00:07:42,000 --> 00:07:42,560
Really?

217
00:07:42,560 --> 00:07:42,920
Yeah.

218
00:07:42,920 --> 00:07:45,680
It's like giving the AI a roadmap of what's

219
00:07:45,680 --> 00:07:47,680
happening in the video and when.

220
00:07:47,680 --> 00:07:50,600
So even small design choices can have a big impact

221
00:07:50,600 --> 00:07:52,160
on the AI's performance.

222
00:07:52,160 --> 00:07:54,720
It's fascinating to see how all these pieces fit together.

223
00:07:54,720 --> 00:07:55,600
It really is.

224
00:07:55,600 --> 00:07:57,240
And they didn't stop there.

225
00:07:57,240 --> 00:07:59,800
They also dug deep into the best ways

226
00:07:59,800 --> 00:08:01,960
to train these video LMMs.

227
00:08:01,960 --> 00:08:02,600
Oh, that's right.

228
00:08:02,600 --> 00:08:03,520
Yeah.

229
00:08:03,520 --> 00:08:06,040
The paper mentioned something about catastrophic forgetting.

230
00:08:06,040 --> 00:08:07,280
Can you remind us what that is?

231
00:08:07,280 --> 00:08:07,840
Yeah.

232
00:08:07,840 --> 00:08:10,000
Catastrophic forgetting is when an AI,

233
00:08:10,000 --> 00:08:13,080
while learning something new, starts

234
00:08:13,080 --> 00:08:14,960
to forget things that it learned previously.

235
00:08:14,960 --> 00:08:15,440
Oh, right.

236
00:08:15,440 --> 00:08:20,000
So in this case, if you only train an AI on video data,

237
00:08:20,000 --> 00:08:22,160
it might start forgetting its language skills.

238
00:08:22,160 --> 00:08:24,680
So it's like it's becoming a video expert

239
00:08:24,680 --> 00:08:26,400
but forgetting how to speak.

240
00:08:26,400 --> 00:08:26,960
Exactly.

241
00:08:26,960 --> 00:08:30,480
So they found that including a moderate amount of text data

242
00:08:30,480 --> 00:08:34,760
in the training mix helps prevent this kind of forgetting

243
00:08:34,760 --> 00:08:38,920
while still allowing the AI to benefit from that richness

244
00:08:38,920 --> 00:08:39,840
of the video data.

245
00:08:39,840 --> 00:08:42,440
So it's like a balancing act, making sure the AI develops

246
00:08:42,440 --> 00:08:44,320
both its video and its language skills.

247
00:08:44,320 --> 00:08:45,120
Exactly.

248
00:08:45,120 --> 00:08:45,440
OK.

249
00:08:45,440 --> 00:08:47,880
And they even found that leaning a little more heavily

250
00:08:47,880 --> 00:08:53,360
on video data in the mix gives the best overall performance.

251
00:08:53,360 --> 00:08:56,400
So they optimized the watching strategy, the representation

252
00:08:56,400 --> 00:08:59,440
of visual information, the data flow,

253
00:08:59,440 --> 00:09:01,400
and even the training process.

254
00:09:01,400 --> 00:09:03,200
So what did all this effort lead to?

255
00:09:03,200 --> 00:09:07,280
Did they actually build this amazing video understanding AI?

256
00:09:07,280 --> 00:09:07,920
They did.

257
00:09:07,920 --> 00:09:08,560
They did.

258
00:09:08,560 --> 00:09:11,680
And they named their creation Apollo.

259
00:09:11,680 --> 00:09:14,440
In fact, they created a whole family of Apollo models.

260
00:09:14,440 --> 00:09:15,320
A whole family.

261
00:09:15,320 --> 00:09:15,960
Yeah.

262
00:09:15,960 --> 00:09:16,520
That's cool.

263
00:09:16,520 --> 00:09:18,320
Ranging in size and complexity.

264
00:09:18,320 --> 00:09:21,280
And in the paper, they talk about the number of parameters

265
00:09:21,280 --> 00:09:22,160
in each model.

266
00:09:22,160 --> 00:09:25,160
Can you remind us what those are and why they matter?

267
00:09:25,160 --> 00:09:28,240
Parameters are like the knobs and dials

268
00:09:28,240 --> 00:09:32,640
that the AI can adjust to learn from data.

269
00:09:32,640 --> 00:09:35,880
The more parameters a model has, the more potential

270
00:09:35,880 --> 00:09:39,000
it has to learn complex patterns and understand

271
00:09:39,000 --> 00:09:40,080
nuanced information.

272
00:09:40,080 --> 00:09:43,560
So more parameters means a more sophisticated, potentially

273
00:09:43,560 --> 00:09:45,320
smarter AI.

274
00:09:45,320 --> 00:09:45,760
Right.

275
00:09:45,760 --> 00:09:48,720
But it also probably means more computing power is needed.

276
00:09:48,720 --> 00:09:49,240
Exactly.

277
00:09:49,240 --> 00:09:50,160
There's a trade-off there.

278
00:09:50,160 --> 00:09:50,600
Yeah.

279
00:09:50,600 --> 00:09:52,440
You want a powerful model, but you also

280
00:09:52,440 --> 00:09:53,720
want it to be efficient.

281
00:09:53,720 --> 00:09:55,800
And how did the Apollo models perform?

282
00:09:55,800 --> 00:09:57,600
Like, did all that optimization pay off?

283
00:09:57,600 --> 00:09:58,120
It did.

284
00:09:58,120 --> 00:09:59,200
Big time.

285
00:09:59,200 --> 00:10:04,320
Even their smaller Apollo 3B model with 3 billion parameters

286
00:10:04,320 --> 00:10:07,920
often outperformed existing models that were much larger,

287
00:10:07,920 --> 00:10:09,760
sometimes even twice the size.

288
00:10:09,760 --> 00:10:12,720
So it's not just about brute force computing power.

289
00:10:12,720 --> 00:10:16,000
It's about smart design choices and clever optimization.

290
00:10:16,000 --> 00:10:17,160
Exactly.

291
00:10:17,160 --> 00:10:19,840
And their largest model, Apollo 7B,

292
00:10:19,840 --> 00:10:23,400
set a new state of the art for models of its size,

293
00:10:23,400 --> 00:10:28,480
even surpassing some models with 30 billion parameters.

294
00:10:28,480 --> 00:10:31,720
That shows just how powerful well-designed and optimized

295
00:10:31,720 --> 00:10:32,640
models can be.

296
00:10:32,640 --> 00:10:33,360
OK.

297
00:10:33,360 --> 00:10:34,680
I'm officially impressed.

298
00:10:34,680 --> 00:10:36,640
They tackled some really tough challenges,

299
00:10:36,640 --> 00:10:39,600
came up with some ingenious solutions,

300
00:10:39,600 --> 00:10:42,320
and created a family of AI models that

301
00:10:42,320 --> 00:10:46,160
are pushing the limits of what's possible in video

302
00:10:46,160 --> 00:10:46,800
understanding.

303
00:10:46,800 --> 00:10:47,640
It's true.

304
00:10:47,640 --> 00:10:48,960
And this is just the beginning.

305
00:10:48,960 --> 00:10:49,320
I bet.

306
00:10:49,320 --> 00:10:51,440
In the next part, we'll delve deeper

307
00:10:51,440 --> 00:10:53,240
into the implications of this research.

308
00:10:53,240 --> 00:10:54,760
What does it mean for the future of AI?

309
00:10:54,760 --> 00:10:57,680
How could it change the way we interact with video?

310
00:10:57,680 --> 00:10:58,400
Stay tuned.

311
00:10:58,400 --> 00:10:59,400
All right, it sounds good.

312
00:10:59,400 --> 00:11:00,400
I'm excited to find out.

313
00:11:00,400 --> 00:11:01,440
So welcome back.

314
00:11:01,440 --> 00:11:05,680
Back to marveling at the ingenuity of this Apollo research.

315
00:11:05,680 --> 00:11:07,680
Yeah, it's truly groundbreaking work.

316
00:11:07,680 --> 00:11:11,680
I mean, this paper isn't just about some cool new AI model.

317
00:11:11,680 --> 00:11:14,920
It's like a window into the future of AI research itself.

318
00:11:14,920 --> 00:11:15,480
Totally.

319
00:11:15,480 --> 00:11:16,960
I agree.

320
00:11:16,960 --> 00:11:21,200
It's fascinating to see how these researchers are pushing

321
00:11:21,200 --> 00:11:23,560
the boundaries, not just of what AI can do,

322
00:11:23,560 --> 00:11:26,080
but also how we go about developing and improving

323
00:11:26,080 --> 00:11:27,000
AI systems.

324
00:11:27,000 --> 00:11:31,280
OK, so before we get lost in all the possibilities,

325
00:11:31,280 --> 00:11:35,600
can we take a step back and highlight the key takeaways

326
00:11:35,600 --> 00:11:36,760
from this paper?

327
00:11:36,760 --> 00:11:37,280
Sure.

328
00:11:37,280 --> 00:11:40,120
What stood out to you as the most important aspects

329
00:11:40,120 --> 00:11:41,080
of this research?

330
00:11:41,080 --> 00:11:42,920
For me, one of the biggest takeaways

331
00:11:42,920 --> 00:11:45,720
is this whole concept of scaling consistency.

332
00:11:45,720 --> 00:11:47,520
We touched on it earlier, but it's

333
00:11:47,520 --> 00:11:51,400
worth emphasizing just how revolutionary this idea is.

334
00:11:51,400 --> 00:11:54,080
Yeah, so to recap, they discovered

335
00:11:54,080 --> 00:11:57,000
that they could test their ideas on smaller, more manageable

336
00:11:57,000 --> 00:11:58,760
models and data sets.

337
00:11:58,760 --> 00:12:00,680
And those findings would reliably translate

338
00:12:00,680 --> 00:12:02,840
to larger, more complex models.

339
00:12:02,840 --> 00:12:04,240
Exactly.

340
00:12:04,240 --> 00:12:06,800
That's a potential game changer for AI research,

341
00:12:06,800 --> 00:12:08,600
because it means we don't always need to rely

342
00:12:08,600 --> 00:12:12,040
on those massive, computationally expensive models

343
00:12:12,040 --> 00:12:13,080
for every experiment.

344
00:12:13,080 --> 00:12:14,880
That makes research more accessible.

345
00:12:14,880 --> 00:12:18,200
Not everyone has access to supercomputers and stuff.

346
00:12:18,200 --> 00:12:18,760
Absolutely.

347
00:12:18,760 --> 00:12:22,120
Scaling consistency opens the door for more researchers

348
00:12:22,120 --> 00:12:26,600
to contribute to the field, which could accelerate progress

349
00:12:26,600 --> 00:12:28,000
significantly.

350
00:12:28,000 --> 00:12:31,640
It's like democratizing AI research, which is amazing.

351
00:12:31,640 --> 00:12:32,880
It really is.

352
00:12:32,880 --> 00:12:35,680
And speaking of democratization, they also

353
00:12:35,680 --> 00:12:40,160
tackled those flawed benchmarks, right?

354
00:12:40,160 --> 00:12:42,280
Those tests that weren't really measuring

355
00:12:42,280 --> 00:12:44,280
true video understanding.

356
00:12:44,280 --> 00:12:47,480
They created ApolloBench, which is, I mean, that's huge.

357
00:12:47,480 --> 00:12:47,880
It is.

358
00:12:47,880 --> 00:12:49,520
It's a huge contribution in itself.

359
00:12:49,520 --> 00:12:52,000
It sets a much more rigorous and efficient way

360
00:12:52,000 --> 00:12:55,160
to evaluate video understanding in AI, which

361
00:12:55,160 --> 00:12:57,120
will lead to more meaningful progress.

362
00:12:57,120 --> 00:12:58,360
It's like they set a new standard.

363
00:12:58,360 --> 00:12:58,960
They did.

364
00:12:58,960 --> 00:13:02,640
A higher bar for what it means for AI to get video.

365
00:13:02,640 --> 00:13:03,600
Exactly.

366
00:13:03,600 --> 00:13:05,800
And another key takeaway for me was,

367
00:13:05,800 --> 00:13:09,280
their focus on the temporal aspect of video,

368
00:13:09,280 --> 00:13:13,080
like how they really emphasize the importance of understanding

369
00:13:13,080 --> 00:13:15,640
the flow of time, the sequence of events.

370
00:13:15,640 --> 00:13:16,160
It's key.

371
00:13:16,160 --> 00:13:19,120
Yeah, I remember that aha moment when we discussed

372
00:13:19,120 --> 00:13:20,800
the frames per second sampling.

373
00:13:20,800 --> 00:13:22,640
It seems so intuitive now.

374
00:13:22,640 --> 00:13:23,440
It does.

375
00:13:23,440 --> 00:13:26,360
But it's such a crucial insight.

376
00:13:26,360 --> 00:13:30,200
It reminds us that AI doesn't perceive the world the same way

377
00:13:30,200 --> 00:13:30,880
we do.

378
00:13:30,880 --> 00:13:32,880
We need to explicitly teach it how

379
00:13:32,880 --> 00:13:35,160
to pay attention to the temporal dimension.

380
00:13:35,160 --> 00:13:35,760
Exactly.

381
00:13:35,760 --> 00:13:37,720
It's like giving the AI a sense of rhythm, right?

382
00:13:37,720 --> 00:13:38,600
Yeah.

383
00:13:38,600 --> 00:13:42,160
Like helping it see the beat of the video, so to speak.

384
00:13:42,160 --> 00:13:43,400
That's a great analogy.

385
00:13:43,400 --> 00:13:46,440
And that's essential for understanding actions,

386
00:13:46,440 --> 00:13:50,120
relationships between events, cause and effect.

387
00:13:50,120 --> 00:13:53,120
All those things that make a video more than just

388
00:13:53,120 --> 00:13:54,760
a collection of static images.

389
00:13:54,760 --> 00:13:56,720
It's about seeing the story unfold,

390
00:13:56,720 --> 00:13:58,040
not just the individual frames.

391
00:13:58,040 --> 00:13:58,680
Exactly.

392
00:13:58,680 --> 00:14:01,200
And their work on combining different types

393
00:14:01,200 --> 00:14:04,800
of vision encoders was also quite insightful.

394
00:14:04,800 --> 00:14:05,800
Yes.

395
00:14:05,800 --> 00:14:07,680
I was really intrigued by that whole idea

396
00:14:07,680 --> 00:14:11,600
of having an AI with both a photographic memory

397
00:14:11,600 --> 00:14:12,880
and a sense of timing.

398
00:14:12,880 --> 00:14:14,520
It's a powerful combination.

399
00:14:14,520 --> 00:14:18,040
It's like giving the AI a more balanced perspective,

400
00:14:18,040 --> 00:14:20,880
allowing it to process both the spatial and the temporal

401
00:14:20,880 --> 00:14:22,880
information in a video.

402
00:14:22,880 --> 00:14:25,360
I mean, they really thought through every aspect of this,

403
00:14:25,360 --> 00:14:25,960
didn't they?

404
00:14:25,960 --> 00:14:27,120
They did.

405
00:14:27,120 --> 00:14:29,800
From how the AI watches the video

406
00:14:29,800 --> 00:14:32,920
to how it represents the information, how it compresses

407
00:14:32,920 --> 00:14:36,200
the data, how it integrates text and visuals,

408
00:14:36,200 --> 00:14:37,720
and even how it's trained.

409
00:14:37,720 --> 00:14:42,240
They took a very systematic and thorough approach.

410
00:14:42,240 --> 00:14:44,720
And that's reflected in those impressive results

411
00:14:44,720 --> 00:14:47,560
they achieved with the Apollo models.

412
00:14:47,560 --> 00:14:51,320
This whole discussion is giving me a new appreciation

413
00:14:51,320 --> 00:14:53,760
for the complexity of video understanding.

414
00:14:53,760 --> 00:14:54,440
It's complex.

415
00:14:54,440 --> 00:14:57,320
It's something we humans do so effortlessly.

416
00:14:57,320 --> 00:15:00,160
But teaching an AI to do the same

417
00:15:00,160 --> 00:15:01,560
is a massive undertaking.

418
00:15:01,560 --> 00:15:01,920
It is.

419
00:15:01,920 --> 00:15:04,680
It makes you realize just how much is going on

420
00:15:04,680 --> 00:15:07,600
beneath the surface when we watch a video.

421
00:15:07,600 --> 00:15:10,160
Like we're processing so much information

422
00:15:10,160 --> 00:15:11,920
without even realizing it.

423
00:15:11,920 --> 00:15:15,720
Emotions, facial expressions, body language, context,

424
00:15:15,720 --> 00:15:18,040
storylines, humor, symbolism.

425
00:15:18,040 --> 00:15:19,680
It's mind-boggling.

426
00:15:19,680 --> 00:15:20,160
It is.

427
00:15:20,160 --> 00:15:22,000
And the researchers behind Apollo

428
00:15:22,000 --> 00:15:25,280
are making remarkable strides towards building

429
00:15:25,280 --> 00:15:29,320
AI that can grasp at least some of that complexity.

430
00:15:29,320 --> 00:15:32,000
So now I can't help but wonder about the future.

431
00:15:32,000 --> 00:15:35,800
What does all of this mean for how we interact with video?

432
00:15:35,800 --> 00:15:37,840
How could this research change things for us

433
00:15:37,840 --> 00:15:39,840
in the real world?

434
00:15:39,840 --> 00:15:44,120
I feel like we're on the verge of something really big here.

435
00:15:44,120 --> 00:15:45,120
Yeah, we are.

436
00:15:45,120 --> 00:15:47,240
The potential applications of this research

437
00:15:47,240 --> 00:15:52,120
are vast and incredibly exciting.

438
00:15:52,120 --> 00:15:53,960
OK, so paint me a picture.

439
00:15:53,960 --> 00:15:58,480
What kind of world could Apollo and its descendants create?

440
00:15:58,480 --> 00:16:01,400
Well, imagine a world where you can search for videos,

441
00:16:01,400 --> 00:16:04,440
not just by keywords, but by the actual content.

442
00:16:04,440 --> 00:16:05,800
What's happening in the video?

443
00:16:05,800 --> 00:16:07,200
Who's in it, what they're doing?

444
00:16:07,200 --> 00:16:10,200
So instead of typing recipe lasagna,

445
00:16:10,200 --> 00:16:11,840
I could search for videos where someone

446
00:16:11,840 --> 00:16:13,800
is making lasagna from scratch.

447
00:16:13,800 --> 00:16:14,400
Exactly.

448
00:16:14,400 --> 00:16:16,000
Or videos where someone is making lasagna

449
00:16:16,000 --> 00:16:18,000
and they're explaining the steps as they go.

450
00:16:18,000 --> 00:16:20,080
Or even videos where someone's making lasagna

451
00:16:20,080 --> 00:16:22,200
and it looks really delicious.

452
00:16:22,200 --> 00:16:23,400
That'd be amazing.

453
00:16:23,400 --> 00:16:27,320
No more wading through endless search results

454
00:16:27,320 --> 00:16:29,480
to find the exact video I'm looking for.

455
00:16:29,480 --> 00:16:31,200
Right, AI that can understand video

456
00:16:31,200 --> 00:16:35,000
could completely transform how we search for and discover

457
00:16:35,000 --> 00:16:36,120
content.

458
00:16:36,120 --> 00:16:38,760
It's like having a personal video librarian who

459
00:16:38,760 --> 00:16:40,280
knows exactly what you want.

460
00:16:40,280 --> 00:16:43,840
And it's not just about finding the right videos.

461
00:16:43,840 --> 00:16:46,600
It's about understanding them on a deeper level too, right?

462
00:16:46,600 --> 00:16:48,440
Yes.

463
00:16:48,440 --> 00:16:51,720
What if the AI could help us learn from videos more

464
00:16:51,720 --> 00:16:52,200
effectively?

465
00:16:52,200 --> 00:16:53,000
Absolutely.

466
00:16:53,000 --> 00:16:55,000
Think about educational videos.

467
00:16:55,000 --> 00:16:59,240
AI could analyze the content and create interactive quizzes

468
00:16:59,240 --> 00:17:02,360
or summaries or even personalized study guides.

469
00:17:02,360 --> 00:17:03,800
It's like having a tutor built right

470
00:17:03,800 --> 00:17:06,280
into every educational video, guiding you

471
00:17:06,280 --> 00:17:08,840
through the material and making sure you're actually

472
00:17:08,840 --> 00:17:11,120
learning. And for entertainment,

473
00:17:11,120 --> 00:17:15,400
I mean, imagine AI that can recommend movies or shows

474
00:17:15,400 --> 00:17:17,560
based on your specific preferences.

475
00:17:17,560 --> 00:17:20,440
Not just general categories like comedy or drama,

476
00:17:20,440 --> 00:17:24,320
but on the actual themes and characters and storylines

477
00:17:24,320 --> 00:17:25,160
that you enjoy.

478
00:17:25,160 --> 00:17:29,560
So instead of suggesting another generic rom-com,

479
00:17:29,560 --> 00:17:34,360
it might recommend a hidden gem with a similar emotional arc

480
00:17:34,360 --> 00:17:37,200
to my favorite movie, even if it's

481
00:17:37,200 --> 00:17:39,080
like a completely different genre.

482
00:17:39,080 --> 00:17:42,360
The AI could become a true tastemaker,

483
00:17:42,360 --> 00:17:44,280
helping you discover content you might never

484
00:17:44,280 --> 00:17:45,600
have found on your own.

485
00:17:45,600 --> 00:17:48,320
It's like having a friend who knows your taste perfectly

486
00:17:48,320 --> 00:17:51,480
and is always surprising you with new and interesting things.

487
00:17:51,480 --> 00:17:52,520
Exactly.

488
00:17:52,520 --> 00:17:54,960
And beyond entertainment and education,

489
00:17:54,960 --> 00:17:58,160
think about the impact on accessibility.

490
00:17:58,160 --> 00:17:59,240
Oh, right, right.

491
00:17:59,240 --> 00:18:03,760
AI could generate real-time captions and descriptions

492
00:18:03,760 --> 00:18:06,600
for people who are deaf or hard of hearing,

493
00:18:06,600 --> 00:18:09,280
making video content much more inclusive.

494
00:18:09,280 --> 00:18:11,680
And it could translate videos into different languages, right?

495
00:18:11,680 --> 00:18:12,240
Yes.

496
00:18:12,240 --> 00:18:13,720
Opening up a world of content.

497
00:18:13,720 --> 00:18:14,200
It could.

498
00:18:14,200 --> 00:18:15,080
All around the globe.

499
00:18:15,080 --> 00:18:16,080
Absolutely.

500
00:18:16,080 --> 00:18:20,520
It's exciting to think about how this technology could

501
00:18:20,520 --> 00:18:24,480
break down barriers and make information more accessible.

502
00:18:24,480 --> 00:18:25,280
It is.

503
00:18:25,280 --> 00:18:25,800
Everyone.

504
00:18:25,800 --> 00:18:27,960
And let's not forget about the potential

505
00:18:27,960 --> 00:18:29,480
for our creative industry.

506
00:18:29,480 --> 00:18:30,360
Oh, how so?

507
00:18:30,360 --> 00:18:33,480
AI could assist filmmakers and editors

508
00:18:33,480 --> 00:18:38,400
by automating tasks like scene detection or object tracking,

509
00:18:38,400 --> 00:18:42,240
freeing them up to focus more on the creative side of things.

510
00:18:42,240 --> 00:18:44,680
It's like giving them a superpowered assistant

511
00:18:44,680 --> 00:18:46,720
to handle the more tedious parts of their work.

512
00:18:46,720 --> 00:18:47,320
Exactly.

513
00:18:47,320 --> 00:18:50,120
And AI could even help generate new content.

514
00:18:50,120 --> 00:18:50,760
Really?

515
00:18:50,760 --> 00:18:53,000
Imagine AI-powered tools that can

516
00:18:53,000 --> 00:18:58,240
create storyboards or animatics or even entire animated sequences.

517
00:18:58,240 --> 00:19:00,440
Wow, that would be a game changer

518
00:19:00,440 --> 00:19:03,400
for independent filmmakers or anyone with a story to tell.

519
00:19:03,400 --> 00:19:03,900
Right.

520
00:19:03,900 --> 00:19:05,240
But limited resources.

521
00:19:05,240 --> 00:19:07,720
The creative possibilities are endless.

522
00:19:07,720 --> 00:19:11,960
OK, I'm officially convinced: the future of video is intelligent.

523
00:19:11,960 --> 00:19:12,440
It is.

524
00:19:12,440 --> 00:19:14,040
And it's going to be amazing.

525
00:19:14,040 --> 00:19:16,360
But as with any powerful technology,

526
00:19:16,360 --> 00:19:18,280
it's important to proceed thoughtfully

527
00:19:18,280 --> 00:19:21,080
and consider the potential implications.

528
00:19:21,080 --> 00:19:21,400
Right.

529
00:19:21,400 --> 00:19:24,440
We need to make sure that these systems are developed and used

530
00:19:24,440 --> 00:19:24,960
responsibly.

531
00:19:24,960 --> 00:19:25,960
Absolutely.

532
00:19:25,960 --> 00:19:30,760
We need to be mindful of things like bias in training data,

533
00:19:30,760 --> 00:19:32,800
the potential for misuse.

534
00:19:32,800 --> 00:19:36,960
And the impact on human creativity and employment.

535
00:19:36,960 --> 00:19:41,160
It's like we're explorers setting foot on a new continent.

536
00:19:41,160 --> 00:19:41,960
It is.

537
00:19:41,960 --> 00:19:45,600
You know, we need to be aware of the potential dangers

538
00:19:45,600 --> 00:19:47,400
as well as the incredible opportunities.

539
00:19:47,400 --> 00:19:50,040
And one of the things that impressed me most about this Apollo

540
00:19:50,040 --> 00:19:53,320
research was their honesty about the limitations

541
00:19:53,320 --> 00:19:57,520
of their work, especially when it comes to conversational AI.

542
00:19:57,520 --> 00:19:57,840
Right.

543
00:19:57,840 --> 00:20:00,920
They acknowledge that existing benchmarks don't really

544
00:20:00,920 --> 00:20:04,680
capture the nuances of, like, back-and-forth conversation.

545
00:20:04,680 --> 00:20:07,200
It's one thing for an AI to answer a simple question

546
00:20:07,200 --> 00:20:08,320
about a video.

547
00:20:08,320 --> 00:20:10,520
But it's a whole other challenge to engage

548
00:20:10,520 --> 00:20:14,800
in a natural, fluid conversation about the content,

549
00:20:14,800 --> 00:20:17,560
to understand intent and emotion,

550
00:20:17,560 --> 00:20:21,640
and all the subtle cues we humans use when we communicate.

551
00:20:21,640 --> 00:20:25,160
It's like the difference between a basic chatbot

552
00:20:25,160 --> 00:20:27,720
and a truly intelligent conversational partner.

553
00:20:27,720 --> 00:20:28,720
Exactly.

554
00:20:28,720 --> 00:20:31,840
And that's where the real potential of video understanding

555
00:20:31,840 --> 00:20:33,160
lies.

556
00:20:33,160 --> 00:20:36,440
To truly unlock its power, we need AI

557
00:20:36,440 --> 00:20:38,840
that can not only process information,

558
00:20:38,840 --> 00:20:42,320
but also interact with us in a way that feels human,

559
00:20:42,320 --> 00:20:44,720
intuitive, and engaging.

560
00:20:44,720 --> 00:20:47,240
So instead of just asking what happened,

561
00:20:47,240 --> 00:20:49,600
we could have deeper, more nuanced conversations

562
00:20:49,600 --> 00:20:51,560
with the AI about the video.

563
00:20:51,560 --> 00:20:52,120
Exactly.

564
00:20:52,120 --> 00:20:54,400
We could ask, why did the character make that choice?

565
00:20:54,400 --> 00:20:54,760
Yeah.

566
00:20:54,760 --> 00:20:57,080
Or what do you think will happen next?

567
00:20:57,080 --> 00:20:59,160
And the AI could respond in a way that

568
00:20:59,160 --> 00:21:02,240
shows genuine understanding and even

569
00:21:02,240 --> 00:21:04,640
offer its own interpretations or predictions.

570
00:21:04,640 --> 00:21:08,080
It'd be like discussing a movie with a friend,

571
00:21:08,080 --> 00:21:10,760
bouncing ideas off each other, and creating new insights.

572
00:21:10,760 --> 00:21:13,240
That kind of interactive video understanding

573
00:21:13,240 --> 00:21:15,400
is still in its early stages.

574
00:21:15,400 --> 00:21:19,520
But the Apollo research lays a really strong foundation

575
00:21:19,520 --> 00:21:21,440
for future work in this area.

576
00:21:21,440 --> 00:21:23,480
It's exciting to think about. It's

577
00:21:23,480 --> 00:21:26,920
like we're on the cusp of a new era where AI

578
00:21:26,920 --> 00:21:31,600
becomes less of a tool and more of a companion,

579
00:21:31,600 --> 00:21:33,840
helping us explore and understand the world

580
00:21:33,840 --> 00:21:35,040
through the lens of video.

581
00:21:35,040 --> 00:21:36,880
And as we continue down this path,

582
00:21:36,880 --> 00:21:39,120
it's crucial that we involve diverse voices

583
00:21:39,120 --> 00:21:42,240
and perspectives in the development and deployment

584
00:21:42,240 --> 00:21:45,640
of these technologies to ensure that they benefit

585
00:21:45,640 --> 00:21:47,240
all of humanity.

586
00:21:47,240 --> 00:21:48,240
Well said.

587
00:21:48,240 --> 00:21:50,880
This has been an incredible deep dive

588
00:21:50,880 --> 00:21:53,000
into the world of video understanding

589
00:21:53,000 --> 00:21:55,760
and the amazing work being done by the researchers

590
00:21:55,760 --> 00:21:56,720
behind Apollo.

591
00:21:56,720 --> 00:21:58,520
Truly groundbreaking stuff.

592
00:21:58,520 --> 00:22:01,120
It is a field that's ripe with potential.

593
00:22:01,120 --> 00:22:03,360
And I can't wait to see what the future holds.

594
00:22:03,360 --> 00:22:04,040
Me neither.

595
00:22:04,040 --> 00:22:05,040
Thanks for joining us.

596
00:22:05,040 --> 00:22:08,400
And until next time, keep exploring, keep questioning.

597
00:22:08,400 --> 00:22:08,840
Yes.

598
00:22:08,840 --> 00:22:29,000
And keep diving deep.

