1
00:00:00,000 --> 00:00:02,920
Welcome back to the AI Papers podcast daily.

2
00:00:02,920 --> 00:00:04,200
You're not going to believe this,

3
00:00:04,200 --> 00:00:08,040
but we just got our hands on the Qwen2.5 technical report.

4
00:00:08,040 --> 00:00:09,000
Oh, wow.

5
00:00:09,000 --> 00:00:14,520
Hot off the presses, released just today, December 20th, 2024.

6
00:00:14,520 --> 00:00:15,560
No way.

7
00:00:15,560 --> 00:00:17,840
And let me tell you, the Qwen team

8
00:00:17,840 --> 00:00:19,920
is really pushing the boundaries of what's

9
00:00:19,920 --> 00:00:22,320
possible with large language models.

10
00:00:22,320 --> 00:00:23,840
Yeah, they really are.

11
00:00:23,840 --> 00:00:25,960
I mean, this report is packed with details

12
00:00:25,960 --> 00:00:28,400
about their latest series of models.

13
00:00:28,400 --> 00:00:31,560
And some of the advancements are truly remarkable,

14
00:00:31,560 --> 00:00:33,360
especially when it comes to context length

15
00:00:33,360 --> 00:00:35,160
and how they've tackled the data challenges.

16
00:00:35,160 --> 00:00:35,800
Absolutely.

17
00:00:35,800 --> 00:00:37,080
OK, let's unpack this.

18
00:00:37,080 --> 00:00:37,760
Sounds good.

19
00:00:37,760 --> 00:00:40,680
First things first, Qwen2.5 isn't just one model.

20
00:00:40,680 --> 00:00:42,480
It's a whole family of them, right?

21
00:00:42,480 --> 00:00:43,160
Exactly, yeah.

22
00:00:43,160 --> 00:00:46,120
They've developed models of various sizes, both open weight,

23
00:00:46,120 --> 00:00:48,880
meaning they're freely available for anyone to use,

24
00:00:48,880 --> 00:00:52,720
and proprietary, which are accessible through Alibaba Cloud.

25
00:00:52,720 --> 00:00:55,160
So for all the AI enthusiasts out there,

26
00:00:55,160 --> 00:00:58,600
whether you're a researcher, developer, or just someone who

27
00:00:58,600 --> 00:01:03,200
loves to tinker with the latest AI tools, you've got options.

28
00:01:03,200 --> 00:01:08,080
We're looking at sizes from a nimble 0.5 billion parameters,

29
00:01:08,080 --> 00:01:12,200
all the way up to a massive 72 billion parameter model

30
00:01:12,200 --> 00:01:13,600
in the open weight category.

31
00:01:13,600 --> 00:01:14,800
That's incredible, yeah.

32
00:01:14,800 --> 00:01:17,400
And for those who need even more horsepower,

33
00:01:17,400 --> 00:01:21,360
there's Qwen2.5-Turbo and Qwen2.5-Plus,

34
00:01:21,360 --> 00:01:23,240
which are the proprietary models.

35
00:01:23,240 --> 00:01:26,080
These are built using a mixture of experts approach,

36
00:01:26,080 --> 00:01:28,000
interesting, which we can dive into a little later.

37
00:01:28,000 --> 00:01:31,480
Yeah, I'm really excited to see the return of the 3B, 14B,

38
00:01:31,480 --> 00:01:32,960
and 32B models.

39
00:01:32,960 --> 00:01:33,440
Yeah.

40
00:01:33,440 --> 00:01:35,440
It's great that they're offering a range of sizes

41
00:01:35,440 --> 00:01:37,360
to fit different needs and resources.

42
00:01:37,360 --> 00:01:38,440
It's a smart move.

43
00:01:38,440 --> 00:01:39,880
You know, not everyone has access

44
00:01:39,880 --> 00:01:41,680
to massive computing clusters.

45
00:01:41,680 --> 00:01:42,000
Right.

46
00:01:42,000 --> 00:01:43,520
So having these mid-range options

47
00:01:43,520 --> 00:01:46,600
makes this cutting-edge technology much more accessible.

48
00:01:46,600 --> 00:01:48,120
To a wider range of users.

49
00:01:48,120 --> 00:01:50,160
Now, let's talk about what they fed this beast.

50
00:01:50,160 --> 00:01:50,660
OK.

51
00:01:50,660 --> 00:01:54,080
I was blown away by the sheer scale of the training data,

52
00:01:54,080 --> 00:01:56,080
18 trillion tokens.

53
00:01:56,080 --> 00:01:56,960
Oh.

54
00:01:56,960 --> 00:01:59,120
That's more than double the previous version.

55
00:01:59,120 --> 00:02:00,960
That's a lot of data.

56
00:02:00,960 --> 00:02:04,280
But what's really interesting is how they went beyond just size.

57
00:02:04,280 --> 00:02:04,880
You know?

58
00:02:04,880 --> 00:02:09,040
They used their previous models to filter the data for quality.

59
00:02:09,040 --> 00:02:10,360
So it's not just about quantity.

60
00:02:10,360 --> 00:02:14,200
It's about making sure every token counts.

61
00:02:14,200 --> 00:02:17,480
So they're using their own AI to train even better AI.

62
00:02:17,480 --> 00:02:18,560
Exactly.

63
00:02:18,560 --> 00:02:18,960
Exactly.

64
00:02:18,960 --> 00:02:19,720
That makes sense, though.

65
00:02:19,720 --> 00:02:20,200
Yeah.

66
00:02:20,200 --> 00:02:23,080
If you want your model to be a math whiz or a coding guru,

67
00:02:23,080 --> 00:02:25,080
you got to make sure it's got the right stuff

68
00:02:25,080 --> 00:02:26,200
in its learning diet.

69
00:02:26,200 --> 00:02:26,560
Right.

70
00:02:26,560 --> 00:02:27,960
It's like cross-training for AI.

71
00:02:27,960 --> 00:02:28,960
Exactly.

72
00:02:28,960 --> 00:02:31,960
They also went a step further and generated synthetic data

73
00:02:31,960 --> 00:02:36,080
specifically for math, code, and general knowledge domains,

74
00:02:36,080 --> 00:02:39,560
all using their previous models to ensure quality.

75
00:02:39,560 --> 00:02:41,680
So they're really trying to cover all their bases.

76
00:02:41,680 --> 00:02:42,120
They are.

77
00:02:42,120 --> 00:02:43,200
They are.

78
00:02:43,200 --> 00:02:45,280
And it sounds like they were very aware

79
00:02:45,280 --> 00:02:49,280
of the potential for bias in their data,

80
00:02:49,280 --> 00:02:51,400
especially from sources like social media.

81
00:02:51,400 --> 00:02:51,960
Absolutely.

82
00:02:51,960 --> 00:02:52,460
Yeah.

83
00:02:52,460 --> 00:02:54,520
They strategically balance the content

84
00:02:54,520 --> 00:02:57,760
across different domains, downsampling

85
00:02:57,760 --> 00:03:00,400
overrepresented areas like social media,

86
00:03:00,400 --> 00:03:04,200
and upsampling valuable areas like tech and science.

87
00:03:04,200 --> 00:03:07,800
This helps create a more balanced training set

88
00:03:07,800 --> 00:03:10,160
and a more knowledgeable model.
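The domain balancing described here can be sketched as a simple reweighting of per-domain token counts. This is only an illustration: the domain names, token counts, and multipliers below are hypothetical and are not taken from the Qwen2.5 report.

```python
def rebalance(token_counts, multipliers):
    """Scale each domain's token count by a sampling multiplier and
    return each domain's share of the resulting training mixture."""
    scaled = {d: n * multipliers.get(d, 1.0) for d, n in token_counts.items()}
    total = sum(scaled.values())
    return {d: n / total for d, n in scaled.items()}

# Hypothetical raw counts (billions of tokens) and sampling multipliers.
raw = {"social_media": 600, "science": 100, "tech": 100, "other": 200}
mult = {"social_media": 0.25, "science": 2.0, "tech": 2.0}  # <1 downsamples, >1 upsamples

mix = rebalance(raw, mult)
for domain, share in mix.items():
    print(f"{domain}: {share:.1%}")
```

With these made-up numbers, social media falls from 60% of the raw data to 20% of the mixture, while science and tech each rise from 10% to roughly 27%.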

89
00:03:10,160 --> 00:03:12,640
This meticulous approach to data quality

90
00:03:12,640 --> 00:03:14,280
really sets them apart.

91
00:03:14,280 --> 00:03:16,800
It's not just about chasing bigger numbers.

92
00:03:16,800 --> 00:03:21,360
It's about building models that are both capable and reliable.

93
00:03:21,360 --> 00:03:22,360
Right.

94
00:03:22,360 --> 00:03:22,680
OK.

95
00:03:22,680 --> 00:03:24,240
So we've covered the data feast.

96
00:03:24,240 --> 00:03:25,040
The feast, yeah.

97
00:03:25,040 --> 00:03:26,520
Now let's talk about the architecture

98
00:03:26,520 --> 00:03:27,880
they use to build these models.

99
00:03:27,880 --> 00:03:28,520
OK.

100
00:03:28,520 --> 00:03:30,600
Sounds complex.

101
00:03:30,600 --> 00:03:34,600
So for the dense models, they stuck with a transformer-based

102
00:03:34,600 --> 00:03:38,760
decoder architecture, similar to the previous Qwen2.

103
00:03:38,760 --> 00:03:39,240
Gotcha.

104
00:03:39,240 --> 00:03:41,080
But with some key enhancements.

105
00:03:41,080 --> 00:03:41,680
Like what?

106
00:03:41,680 --> 00:03:44,680
Well, they've used some interesting tricks in the architecture,

107
00:03:44,680 --> 00:03:48,120
like this thing called grouped query attention.

108
00:03:48,120 --> 00:03:48,560
What's that?

109
00:03:48,560 --> 00:03:52,200
It basically helps the model use memory more efficiently.

110
00:03:52,200 --> 00:03:53,800
Which is a big deal when you're dealing

111
00:03:53,800 --> 00:03:56,080
with massive amounts of data.
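A rough sketch of why grouped query attention saves memory: several query heads share one key/value head, so the key/value cache kept during generation shrinks in proportion. The layer count, head counts, and head size below are illustrative assumptions, not the actual Qwen2.5 configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV-cache size for one sequence.
    The factor of 2 covers one key tensor and one value tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape: 32 layers, head size 128, 32,768-token context, 16-bit values.
cfg = dict(layers=32, head_dim=128, seq_len=32768)

mha = kv_cache_bytes(kv_heads=32, **cfg)  # standard attention: one KV head per query head
gqa = kv_cache_bytes(kv_heads=8, **cfg)   # grouped: 4 query heads share each KV head

print(f"full multi-head cache: {mha / 2**30:.0f} GiB")
print(f"grouped-query cache:   {gqa / 2**30:.0f} GiB")
```

Under these assumptions the cache drops from 16 GiB to 4 GiB, which matters more and more as the context length grows.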

112
00:03:56,080 --> 00:03:56,920
That's a mouthful.

113
00:03:56,920 --> 00:03:57,480
It is.

114
00:03:57,480 --> 00:03:58,160
But I get it.

115
00:03:58,160 --> 00:03:58,480
Yeah.

116
00:03:58,480 --> 00:03:59,880
Efficiency is key when you're working

117
00:03:59,880 --> 00:04:00,920
with these massive models.

118
00:04:00,920 --> 00:04:01,560
Definitely.

119
00:04:01,560 --> 00:04:04,080
It's like optimizing your code so it runs faster.

120
00:04:04,080 --> 00:04:04,680
Right.

121
00:04:04,680 --> 00:04:05,560
Exactly.

122
00:04:05,560 --> 00:04:07,440
And they've also incorporated other features,

123
00:04:07,440 --> 00:04:10,920
like a SwiGLU activation function and rotary

124
00:04:10,920 --> 00:04:13,600
positional embeddings, or RoPE.

125
00:04:13,600 --> 00:04:14,000
Wow.

126
00:04:14,000 --> 00:04:14,560
OK.

127
00:04:14,560 --> 00:04:17,320
All of which are designed to improve the model's performance.

128
00:04:17,320 --> 00:04:18,680
I'm sensing a theme here.

129
00:04:18,680 --> 00:04:19,000
Yeah.

130
00:04:19,000 --> 00:04:21,480
Squeeze every bit of performance out of these models.

131
00:04:21,480 --> 00:04:22,600
It seems so.

132
00:04:22,600 --> 00:04:23,640
So we've got the data.

133
00:04:23,640 --> 00:04:24,760
Uh-huh.

134
00:04:24,760 --> 00:04:26,440
We've got the blueprints for the model.

135
00:04:26,440 --> 00:04:26,960
OK.

136
00:04:26,960 --> 00:04:28,720
Now let's talk about the training process.

137
00:04:28,720 --> 00:04:29,520
Sounds good.

138
00:04:29,520 --> 00:04:30,520
It's a two-phase approach.

139
00:04:30,520 --> 00:04:31,200
Right.

140
00:04:31,200 --> 00:04:31,560
OK.

141
00:04:31,560 --> 00:04:32,400
Walk me through it.

142
00:04:32,400 --> 00:04:36,280
So phase one focuses on pre-training

143
00:04:36,280 --> 00:04:40,480
with a context length of 4,096 tokens.

144
00:04:40,480 --> 00:04:41,480
OK.

145
00:04:41,480 --> 00:04:43,200
Think of it as building the foundation,

146
00:04:43,200 --> 00:04:45,960
getting the model familiar with the basics of language

147
00:04:45,960 --> 00:04:47,920
and how words relate to each other.

148
00:04:47,920 --> 00:04:49,760
So that's like teaching the model the alphabet

149
00:04:49,760 --> 00:04:50,560
and basic grammar.

150
00:04:50,560 --> 00:04:51,240
Exactly, yeah.

151
00:04:51,240 --> 00:04:52,320
And what happens in phase two?

152
00:04:52,320 --> 00:04:55,320
In phase two, they extend that context length

153
00:04:55,320 --> 00:04:56,840
for longer sequence training.

154
00:04:56,840 --> 00:04:57,240
OK.

155
00:04:57,240 --> 00:05:01,560
Up to 32,768 tokens for most models.

156
00:05:01,560 --> 00:05:02,160
Hold on.

157
00:05:02,160 --> 00:05:04,720
32,768 tokens.

158
00:05:04,720 --> 00:05:06,040
How much information is that?

159
00:05:06,040 --> 00:05:06,880
It's a lot.

160
00:05:06,880 --> 00:05:10,200
Think of it as feeding the model an entire research paper

161
00:05:10,200 --> 00:05:11,960
or a long-form article.

162
00:05:11,960 --> 00:05:12,560
Oh, wow.

163
00:05:12,560 --> 00:05:16,400
But there's one model, Qwen2.5-Turbo,

164
00:05:16,400 --> 00:05:19,280
that gets its own special training regimen.

165
00:05:19,280 --> 00:05:20,120
Special treatment?

166
00:05:20,120 --> 00:05:22,720
Yeah, they gradually expand the context length,

167
00:05:22,720 --> 00:05:28,920
ultimately reaching a mind-blowing 262,144 tokens.

168
00:05:28,920 --> 00:05:31,600
262,144 tokens.

169
00:05:31,600 --> 00:05:34,040
That's, I don't even know what to compare that to.

170
00:05:34,040 --> 00:05:34,960
It's a lot.

171
00:05:34,960 --> 00:05:35,440
That is a lot.

172
00:05:35,440 --> 00:05:38,120
How does a model even handle that much information?

173
00:05:38,120 --> 00:05:39,160
That's a great question.

174
00:05:39,160 --> 00:05:40,000
Yeah.

175
00:05:40,000 --> 00:05:41,680
They implemented some clever techniques,

176
00:05:41,680 --> 00:05:44,640
like YaRN and Dual Chunk Attention, or DCA.

177
00:05:44,640 --> 00:05:45,200
OK.

178
00:05:45,200 --> 00:05:47,080
These innovations not only help the model

179
00:05:47,080 --> 00:05:50,160
understand long contexts, but also maintain its performance

180
00:05:50,160 --> 00:05:51,080
on shorter ones.

181
00:05:51,080 --> 00:05:53,240
So it's not just about cramming in as much information

182
00:05:53,240 --> 00:05:53,880
as possible.

183
00:05:53,880 --> 00:05:55,880
It's about making sure the model can actually

184
00:05:55,880 --> 00:05:56,760
use it effectively.

185
00:05:56,760 --> 00:05:57,280
Precisely.

186
00:05:57,280 --> 00:05:57,520
Yeah.

187
00:05:57,520 --> 00:05:59,400
And you know what's even more amazing?

188
00:05:59,400 --> 00:06:02,240
They figured out a way to make Qwen2.5-Turbo

189
00:06:02,240 --> 00:06:05,840
handle up to 1 million tokens during inference,

190
00:06:05,840 --> 00:06:08,000
which is when the model is actually using what

191
00:06:08,000 --> 00:06:09,680
it's learned to perform a task.

192
00:06:09,680 --> 00:06:10,440
1 million.

193
00:06:10,440 --> 00:06:11,560
That's astronomical.

194
00:06:11,560 --> 00:06:12,080
It is.

195
00:06:12,080 --> 00:06:12,360
It is.

196
00:06:12,360 --> 00:06:13,800
I can't even wrap my head around that.

197
00:06:13,800 --> 00:06:14,400
Yeah.

198
00:06:14,400 --> 00:06:16,560
What are the practical implications

199
00:06:16,560 --> 00:06:19,720
of this kind of long context processing power?

200
00:06:19,720 --> 00:06:22,560
Well, imagine you're a researcher digging

201
00:06:22,560 --> 00:06:25,240
through mountains of historical documents.

202
00:06:25,240 --> 00:06:29,320
Qwen2.5 can analyze all that, find connections

203
00:06:29,320 --> 00:06:30,600
you wouldn't believe.

204
00:06:30,600 --> 00:06:31,240
Wow.

205
00:06:31,240 --> 00:06:33,880
And even summarize it for you in plain English.

206
00:06:33,880 --> 00:06:36,720
That's the power of a million token context length.

207
00:06:36,720 --> 00:06:37,720
Exactly.

208
00:06:37,720 --> 00:06:38,760
That's incredible.

209
00:06:38,760 --> 00:06:40,880
I'm starting to see how this technology could

210
00:06:40,880 --> 00:06:43,600
revolutionize all sorts of fields

211
00:06:43,600 --> 00:06:45,840
from scientific research to creative writing.

212
00:06:45,840 --> 00:06:46,360
Yeah.

213
00:06:46,360 --> 00:06:49,640
And this is where those mixture of experts models,

214
00:06:49,640 --> 00:06:52,920
or MoE for short, really come into play.

215
00:06:52,920 --> 00:06:56,280
They're designed for handling this kind of massive scale.

216
00:06:56,280 --> 00:06:56,600
OK.

217
00:06:56,600 --> 00:06:59,000
Remind me again, how do these MoE models work?

218
00:06:59,000 --> 00:07:00,560
I know we touched on it earlier.

219
00:07:00,560 --> 00:07:02,000
But I want to make sure I understand it fully.

220
00:07:02,000 --> 00:07:02,500
OK.

221
00:07:02,500 --> 00:07:04,920
So imagine instead of one massive network trying

222
00:07:04,920 --> 00:07:08,520
to do everything, you have multiple smaller networks,

223
00:07:08,520 --> 00:07:11,240
each specializing in a specific task.

224
00:07:11,240 --> 00:07:12,080
Like specialists.

225
00:07:12,080 --> 00:07:12,600
Exactly.

226
00:07:12,600 --> 00:07:14,160
These specialists are called experts.

227
00:07:14,160 --> 00:07:14,560
That's it.

228
00:07:14,560 --> 00:07:18,560
So when a new task comes along, only the experts

229
00:07:18,560 --> 00:07:20,880
whose skills are needed get activated.

230
00:07:20,880 --> 00:07:22,960
Oh, so it's like having a team of specialists.

231
00:07:22,960 --> 00:07:23,560
Exactly.

232
00:07:23,560 --> 00:07:26,080
Instead of one generalist trying to do everything,

233
00:07:26,080 --> 00:07:27,640
that sounds much more efficient.

234
00:07:27,640 --> 00:07:29,320
Exactly.

235
00:07:29,320 --> 00:07:33,360
This specialization makes MoE models incredibly efficient.

236
00:07:33,360 --> 00:07:34,120
Makes sense.

237
00:07:34,120 --> 00:07:36,240
Especially when dealing with huge amounts of data

238
00:07:36,240 --> 00:07:37,760
and complex tasks.
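The "only the needed experts get activated" idea can be sketched with a toy top-k router. This is a minimal illustration, not the design of Qwen2.5-Turbo or Qwen2.5-Plus: the experts here are hypothetical one-line functions, the router scores are made up, and in typical MoE layers the routing decision is made per token rather than per task.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits, experts, k=2):
    """Run only the k highest-scoring experts and mix their outputs,
    weighting each by its renormalized router probability."""
    probs = softmax(router_logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = sum(probs[i] / norm * experts[i](x) for i in top)
    return out, top

# Four toy "experts"; a real expert would be a feed-forward network.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]

# Hypothetical router scores for one input: experts 1 and 3 score highest.
y, active = moe_forward(3.0, [0.1, 2.0, -1.0, 2.0], experts, k=2)
print(active, y)
```

Only two of the four experts run for this input, which is where the compute savings come from: the total parameter count can grow while the work done per input stays roughly fixed.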

239
00:07:37,760 --> 00:07:41,040
So we've covered the data, the architecture,

240
00:07:41,040 --> 00:07:42,440
and the pre-training.

241
00:07:42,440 --> 00:07:42,720
OK.

242
00:07:42,720 --> 00:07:44,120
But the journey doesn't end there.

243
00:07:44,120 --> 00:07:45,200
No, it doesn't.

244
00:07:45,200 --> 00:07:46,880
Let's talk about post-training.

245
00:07:46,880 --> 00:07:47,360
All right.

246
00:07:47,360 --> 00:07:50,960
Where the Qwen team really fine-tuned these models

247
00:07:50,960 --> 00:07:53,280
for specific skills and behaviors.

248
00:07:53,280 --> 00:07:53,760
That's right.

249
00:07:53,760 --> 00:07:55,560
That's when things get really interesting.

250
00:07:55,560 --> 00:07:56,920
This is where things get really interesting.

251
00:07:56,920 --> 00:07:57,480
Absolutely.

252
00:07:57,480 --> 00:07:57,680
Yeah.

253
00:07:57,680 --> 00:08:00,760
Post-training is like sending your LLM to grad school.

254
00:08:00,760 --> 00:08:01,080
OK.

255
00:08:01,080 --> 00:08:03,160
It's where they take that raw potential

256
00:08:03,160 --> 00:08:05,000
and shape it into a model.

257
00:08:05,000 --> 00:08:05,400
OK.

258
00:08:05,400 --> 00:08:08,840
That can follow instructions, understand complex concepts,

259
00:08:08,840 --> 00:08:10,960
and generate human-quality text.

260
00:08:10,960 --> 00:08:11,400
Gotcha.

261
00:08:11,400 --> 00:08:14,600
And Qwen2.5's post-training regimen

262
00:08:14,600 --> 00:08:17,200
involved two major advancements.

263
00:08:17,200 --> 00:08:17,480
OK.

264
00:08:17,480 --> 00:08:21,000
Expanded supervised fine-tuning data coverage.

265
00:08:21,000 --> 00:08:24,040
And a two-stage reinforcement learning process.

266
00:08:24,040 --> 00:08:24,280
OK.

267
00:08:24,280 --> 00:08:25,440
Let's break those down.

268
00:08:25,440 --> 00:08:25,960
Sounds good.

269
00:08:25,960 --> 00:08:29,080
First up, expanded supervised fine-tuning.

270
00:08:29,080 --> 00:08:30,760
What's new and exciting here?

271
00:08:30,760 --> 00:08:32,960
Well, they used a massive data set

272
00:08:32,960 --> 00:08:36,240
with millions of high-quality examples

273
00:08:36,240 --> 00:08:39,520
to address limitations they observed in previous models.

274
00:08:39,520 --> 00:08:41,000
So what kind of limitations?

275
00:08:41,000 --> 00:08:43,600
Well, they wanted to refine the models in areas

276
00:08:43,600 --> 00:08:48,720
like long sequence generation, math, coding, and following

277
00:08:48,720 --> 00:08:50,320
instructions flawlessly.

278
00:08:50,320 --> 00:08:50,640
Really?

279
00:08:50,640 --> 00:08:51,520
Nail those down.

280
00:08:51,520 --> 00:08:51,880
Yeah.

281
00:08:51,880 --> 00:08:55,040
It's all about creating a model that's not just smart,

282
00:08:55,040 --> 00:08:58,480
but can actually put those smarts to work in a useful way.

283
00:08:58,480 --> 00:09:01,120
It's like taking a brilliant student

284
00:09:01,120 --> 00:09:03,560
and teaching them how to apply their knowledge

285
00:09:03,560 --> 00:09:05,160
to solve real-world problems.

286
00:09:05,160 --> 00:09:05,680
I like it.

287
00:09:05,680 --> 00:09:06,920
And they didn't stop there.

288
00:09:06,920 --> 00:09:07,400
OK.

289
00:09:07,400 --> 00:09:10,200
They also implemented a two-stage reinforcement learning

290
00:09:10,200 --> 00:09:13,360
process, which is all about training models

291
00:09:13,360 --> 00:09:15,800
through a system of rewards and penalties.

292
00:09:15,800 --> 00:09:17,960
Kind of like teaching a dog new tricks with treats

293
00:09:17,960 --> 00:09:18,920
and corrections.

294
00:09:18,920 --> 00:09:19,880
Exactly.

295
00:09:19,880 --> 00:09:22,920
The first stage, offline RL, focused

296
00:09:22,920 --> 00:09:26,280
on developing capabilities like reasoning and factuality.

297
00:09:26,280 --> 00:09:26,680
OK.

298
00:09:26,680 --> 00:09:27,160
Makes sense.

299
00:09:27,160 --> 00:09:27,760
Yeah.

300
00:09:27,760 --> 00:09:30,800
It's like teaching the model to solve a complex puzzle.

301
00:09:30,800 --> 00:09:31,360
Right.

302
00:09:31,360 --> 00:09:33,600
Where the solution isn't immediately obvious.

303
00:09:33,600 --> 00:09:34,400
Exactly.

304
00:09:34,400 --> 00:09:36,320
So what happens in the second stage?

305
00:09:36,320 --> 00:09:39,720
Online RL. They refined the model's ability

306
00:09:39,720 --> 00:09:43,960
to produce responses that are truthful, helpful, concise,

307
00:09:43,960 --> 00:09:45,680
relevant, and harmless.

308
00:09:45,680 --> 00:09:48,560
So it's like taking those raw reasoning skills

309
00:09:48,560 --> 00:09:51,720
and polishing them into a communication style

310
00:09:51,720 --> 00:09:53,960
that's both effective and aligned with human values.

311
00:09:53,960 --> 00:09:54,840
Right.

312
00:09:54,840 --> 00:09:57,680
They're shaping not just the model's intelligence,

313
00:09:57,680 --> 00:09:59,120
but its personality as well.

314
00:09:59,120 --> 00:10:00,280
That's a great way to put it.

315
00:10:00,280 --> 00:10:04,000
Now the big question is, did all this careful training

316
00:10:04,000 --> 00:10:04,920
pay off?

317
00:10:04,920 --> 00:10:06,560
That is the question.

318
00:10:06,560 --> 00:10:07,680
Let's get to the results.

319
00:10:07,680 --> 00:10:08,320
OK.

320
00:10:08,320 --> 00:10:11,640
How do these instruction-tuned models perform?

321
00:10:11,640 --> 00:10:13,480
I'm ready to hear about some AI magic.

322
00:10:13,480 --> 00:10:14,000
All right.

323
00:10:14,000 --> 00:10:16,200
So the results were impressive across the board.

324
00:10:16,200 --> 00:10:19,920
They used a mix of open benchmarks, their own in-house

325
00:10:19,920 --> 00:10:23,400
data sets, and even assessments of their long context

326
00:10:23,400 --> 00:10:24,280
capabilities.

327
00:10:24,280 --> 00:10:24,760
OK.

328
00:10:24,760 --> 00:10:26,600
Especially for those models designed

329
00:10:26,600 --> 00:10:29,360
to handle massive amounts of information.

330
00:10:29,360 --> 00:10:29,840
OK.

331
00:10:29,840 --> 00:10:31,760
Let's start with those open benchmarks.

332
00:10:31,760 --> 00:10:33,240
These are the standardized tests that

333
00:10:33,240 --> 00:10:36,360
allow us to compare performance across different LLMs,

334
00:10:36,360 --> 00:10:36,600
right?

335
00:10:36,600 --> 00:10:37,000
Yeah.

336
00:10:37,000 --> 00:10:38,280
Like the SATs for AI.

337
00:10:38,280 --> 00:10:38,880
Exactly.

338
00:10:38,880 --> 00:10:39,280
OK.

339
00:10:39,280 --> 00:10:42,320
How did Qwen2.5 do on its exams?

340
00:10:42,320 --> 00:10:44,280
They passed with flying colors.

341
00:10:44,280 --> 00:10:44,520
Right?

342
00:10:44,520 --> 00:10:47,520
Starting with the 72B parameter model,

343
00:10:47,520 --> 00:10:51,600
Qwen2.5-72B consistently outperformed its peers

344
00:10:51,600 --> 00:10:54,720
in the same size category across a wide range of tasks.

345
00:10:54,720 --> 00:10:55,320
Wow.

346
00:10:55,320 --> 00:10:58,720
It even achieved results comparable to the much larger

347
00:10:58,720 --> 00:11:00,600
Llama-3-405B.

348
00:11:00,600 --> 00:11:01,280
That's a big one.

349
00:11:01,280 --> 00:11:03,280
Which has five times the parameters.

350
00:11:03,280 --> 00:11:03,640
Wow.

351
00:11:03,640 --> 00:11:04,840
That's a huge accomplishment.

352
00:11:04,840 --> 00:11:07,080
So they're achieving state-of-the-art performance

353
00:11:07,080 --> 00:11:09,160
with a model that's significantly smaller.

354
00:11:09,160 --> 00:11:09,440
Right.

355
00:11:09,440 --> 00:11:11,160
I bet that has implications for making

356
00:11:11,160 --> 00:11:12,160
these models more accessible.

357
00:11:12,160 --> 00:11:12,720
Absolutely.

358
00:11:12,720 --> 00:11:12,960
Yeah.

359
00:11:12,960 --> 00:11:15,640
And they weren't content with just matching performance.

360
00:11:15,640 --> 00:11:18,360
They also wanted to beat their own previous best.

361
00:11:18,360 --> 00:11:20,360
You're talking about the Qwen2-72B model, right?

362
00:11:20,360 --> 00:11:22,200
The one they released before this new series?

363
00:11:22,200 --> 00:11:22,760
That's the one.

364
00:11:22,760 --> 00:11:23,160
OK.

365
00:11:23,160 --> 00:11:27,280
And Qwen2.5-72B showed marked improvements

366
00:11:27,280 --> 00:11:30,640
over its predecessor in almost all the benchmark evaluations.

367
00:11:30,640 --> 00:11:31,200
Really?

368
00:11:31,200 --> 00:11:32,280
In what areas?

369
00:11:32,280 --> 00:11:35,360
Especially in general tasks, math, and coding.

370
00:11:35,360 --> 00:11:36,640
I'm sensing a theme here.

371
00:11:36,640 --> 00:11:36,960
Yeah.

372
00:11:36,960 --> 00:11:40,600
Math and coding seem to be areas where Qwen2.5 really shines.

373
00:11:40,600 --> 00:11:41,320
Definitely.

374
00:11:41,320 --> 00:11:42,160
It's a strength.

375
00:11:42,160 --> 00:11:42,560
Yeah.

376
00:11:42,560 --> 00:11:44,600
They clearly focused on building models

377
00:11:44,600 --> 00:11:47,600
that can go beyond simply understanding text

378
00:11:47,600 --> 00:11:49,960
to actually applying that understanding

379
00:11:49,960 --> 00:11:51,880
in practical and challenging ways.

380
00:11:51,880 --> 00:11:53,000
That's really impressive.

381
00:11:53,000 --> 00:11:53,480
It is.

382
00:11:53,480 --> 00:11:53,720
OK.

383
00:11:53,720 --> 00:11:55,680
What about the proprietary MoE models?

384
00:11:55,680 --> 00:11:56,360
Right.

385
00:11:56,360 --> 00:11:57,400
How did they fare?

386
00:11:57,400 --> 00:11:58,680
Exceptionally well.

387
00:11:58,680 --> 00:11:59,280
Really?

388
00:11:59,280 --> 00:12:02,160
Especially considering their lower training and inference

389
00:12:02,160 --> 00:12:04,800
costs compared to the dense models.

390
00:12:04,800 --> 00:12:05,360
Wow.

391
00:12:05,360 --> 00:12:08,680
Qwen2.5-Plus delivered very competitive results

392
00:12:08,680 --> 00:12:13,320
compared to both Qwen2.5-72B and Llama-3-405B.

393
00:12:13,320 --> 00:12:14,520
So those are the biggest models?

394
00:12:14,520 --> 00:12:14,840
Yeah.

395
00:12:14,840 --> 00:12:15,160
OK.

396
00:12:15,160 --> 00:12:16,640
It even outperformed other models

397
00:12:16,640 --> 00:12:18,560
on several challenging benchmarks.

398
00:12:18,560 --> 00:12:20,120
So they're not just powerful.

399
00:12:20,120 --> 00:12:21,560
They're also efficient.

400
00:12:21,560 --> 00:12:22,800
That's a winning combination.

401
00:12:22,800 --> 00:12:23,360
It is.

402
00:12:23,360 --> 00:12:25,480
Let's talk about some of the smaller models now.

403
00:12:25,480 --> 00:12:27,880
Like the 14B and 32B models.

404
00:12:27,880 --> 00:12:28,400
Right.

405
00:12:28,400 --> 00:12:30,960
Where Qwen2.5-Turbo also fits in.

406
00:12:30,960 --> 00:12:31,520
OK.

407
00:12:31,520 --> 00:12:32,600
How do they stack up?

408
00:12:32,600 --> 00:12:36,560
Even at these smaller sizes, Qwen2.5 continued to impress.

409
00:12:36,560 --> 00:12:36,840
OK.

410
00:12:36,840 --> 00:12:40,760
The 14B model delivered solid performance across the board,

411
00:12:40,760 --> 00:12:43,800
even outperforming some larger models in certain areas.

412
00:12:43,800 --> 00:12:45,400
And the 32B model.

413
00:12:45,400 --> 00:12:46,680
What stood out there?

414
00:12:46,680 --> 00:12:49,480
The 32B model was a real standout,

415
00:12:49,480 --> 00:12:51,720
often exceeding the performance of larger models

416
00:12:51,720 --> 00:12:53,280
with similar architectures.

417
00:12:53,280 --> 00:12:53,720
Wow.

418
00:12:53,720 --> 00:12:56,280
Particularly in math and coding tasks.

419
00:12:56,280 --> 00:12:58,280
There's that math and coding prowess again.

420
00:12:58,280 --> 00:12:58,680
Right.

421
00:12:58,680 --> 00:13:03,280
Now, let's bring in Qwen2.5-Turbo in this size range.

422
00:13:03,280 --> 00:13:03,760
OK.

423
00:13:03,760 --> 00:13:05,120
How did it stack up?

424
00:13:05,120 --> 00:13:08,520
It managed to achieve comparable results to the 14B model.

425
00:13:08,520 --> 00:13:09,120
Wow.

426
00:13:09,120 --> 00:13:11,080
Even though its training and inference costs

427
00:13:11,080 --> 00:13:12,400
were significantly lower.

428
00:13:12,400 --> 00:13:13,280
That's impressive.

429
00:13:13,280 --> 00:13:15,040
This efficiency is really remarkable,

430
00:13:15,040 --> 00:13:16,640
especially when you consider that they managed

431
00:13:16,640 --> 00:13:18,440
to maintain performance.

432
00:13:18,440 --> 00:13:20,160
It makes me wonder what they could achieve

433
00:13:20,160 --> 00:13:23,480
if they applied this MoE approach to even larger models.

434
00:13:23,480 --> 00:13:25,320
The possibilities are exciting, aren't they?

435
00:13:25,320 --> 00:13:27,000
It's like they've found a secret formula.

436
00:13:27,000 --> 00:13:28,120
Right.

437
00:13:28,120 --> 00:13:30,600
All right, now let's shift gears and talk about the 7B models.

438
00:13:30,600 --> 00:13:31,160
OK.

439
00:13:31,160 --> 00:13:32,160
How did they measure up?

440
00:13:32,160 --> 00:13:36,400
They compared Qwen2.5-7B with Mistral-7B,

441
00:13:36,400 --> 00:13:39,160
Llama-3-8B, Gemma2-9B.

442
00:13:39,160 --> 00:13:40,360
So some other 7Bs.

443
00:13:40,360 --> 00:13:40,880
Yep.

444
00:13:40,880 --> 00:13:43,720
And their own previous model, Qwen2-7B.

445
00:13:43,720 --> 00:13:46,640
So a mix of the top contenders and their own previous work.

446
00:13:46,640 --> 00:13:47,240
Exactly.

447
00:13:47,240 --> 00:13:49,000
Which is a great way to track progress.

448
00:13:49,000 --> 00:13:49,560
It is.

449
00:13:49,560 --> 00:13:50,000
It is.

450
00:13:50,000 --> 00:13:51,400
And how did it perform?

451
00:13:51,400 --> 00:13:55,440
The results were very positive, despite having slightly

452
00:13:55,440 --> 00:13:58,000
fewer parameters than some of the other models.

453
00:13:58,000 --> 00:14:02,120
Qwen2.5-7B surpassed both its predecessors and competitors

454
00:14:02,120 --> 00:14:04,080
on numerous benchmarks.

455
00:14:04,080 --> 00:14:07,960
What were some of the standout wins for Qwen2.5-7B?

456
00:14:07,960 --> 00:14:09,920
Anything particularly impressive?

457
00:14:09,920 --> 00:14:12,320
Yeah, it demonstrated significant improvements

458
00:14:12,320 --> 00:14:14,200
across a variety of tasks.

459
00:14:14,200 --> 00:14:14,600
OK.

460
00:14:14,600 --> 00:14:19,720
From general benchmarks like MMLU to math challenges

461
00:14:19,720 --> 00:14:23,320
like MATH, and coding tasks like HumanEval.

462
00:14:23,320 --> 00:14:25,240
So it did really well on the coding and math.

463
00:14:25,240 --> 00:14:25,840
It did.

464
00:14:25,840 --> 00:14:26,280
It did.

465
00:14:26,280 --> 00:14:28,480
So even at this smaller scale, they're still holding their own.

466
00:14:28,480 --> 00:14:29,480
Absolutely, yeah.

467
00:14:29,480 --> 00:14:29,800
OK.

468
00:14:29,800 --> 00:14:32,680
And then we have the smallest models designed for things

469
00:14:32,680 --> 00:14:34,080
like your smartphone and laptop.

470
00:14:34,080 --> 00:14:34,520
Right.

471
00:14:34,520 --> 00:14:38,320
Qwen2.5-0.5B, 1.5B, and 3B.

472
00:14:38,320 --> 00:14:38,760
OK.

473
00:14:38,760 --> 00:14:40,560
So these are the smallest models in the set.

474
00:14:40,560 --> 00:14:40,960
Yeah.

475
00:14:40,960 --> 00:14:43,760
And despite their compact size, these models

476
00:14:43,760 --> 00:14:46,160
held their own against established baselines.

477
00:14:46,160 --> 00:14:46,560
Really?

478
00:14:46,560 --> 00:14:53,000
In fact, Qwen2.5-0.5B even outperformed Gemma2-2.6B

479
00:14:53,000 --> 00:14:55,440
on several math and coding tasks.

480
00:14:55,440 --> 00:14:56,040
Oh, wow.

481
00:14:56,040 --> 00:14:58,600
So even the smallest ones are still very capable.

482
00:14:58,600 --> 00:14:59,280
They are.

483
00:14:59,280 --> 00:14:59,760
They are.

484
00:14:59,760 --> 00:15:00,680
That's pretty incredible.

485
00:15:00,680 --> 00:15:01,240
It is.

486
00:15:01,240 --> 00:15:04,280
It's like they miniaturized the power of an LLM

487
00:15:04,280 --> 00:15:06,320
without sacrificing its core capabilities.

488
00:15:06,320 --> 00:15:06,760
Yeah.

489
00:15:06,760 --> 00:15:08,560
What are the implications of being

490
00:15:08,560 --> 00:15:12,800
able to run these powerful models on everyday devices?

491
00:15:12,800 --> 00:15:14,280
That's a great question.

492
00:15:14,280 --> 00:15:17,640
It really opens up the possibilities for using LLMs

493
00:15:17,640 --> 00:15:20,640
in new and exciting ways, bringing the power of AI

494
00:15:20,640 --> 00:15:21,400
to everyone.

495
00:15:21,400 --> 00:15:21,920
Yeah.

496
00:15:21,920 --> 00:15:23,760
Think about personalized AI tutors.

497
00:15:23,760 --> 00:15:24,400
Oh, wow.

498
00:15:24,400 --> 00:15:25,680
That can fit in your pocket.

499
00:15:25,680 --> 00:15:26,840
That's amazing.

500
00:15:26,840 --> 00:15:28,880
The future of AI is looking pretty bright.

501
00:15:28,880 --> 00:15:29,440
It is.

502
00:15:29,440 --> 00:15:31,840
Before we move on to the next part of this deep dive,

503
00:15:31,840 --> 00:15:34,200
I'd like to understand the MoE models a little better.

504
00:15:34,200 --> 00:15:34,680
All right.

505
00:15:34,680 --> 00:15:36,160
I get the basic idea.

506
00:15:36,160 --> 00:15:36,680
Yeah.

507
00:15:36,680 --> 00:15:40,200
Having specialized experts instead of one generalist.

508
00:15:40,200 --> 00:15:42,560
But how does this actually work in practice?

509
00:15:42,560 --> 00:15:46,440
It's like having a team of chefs in a restaurant, each

510
00:15:46,440 --> 00:15:48,680
with their own specialty, instead of one chef

511
00:15:48,680 --> 00:15:49,920
trying to cook everything.

512
00:15:49,920 --> 00:15:50,400
Oh, OK.

513
00:15:50,400 --> 00:15:54,360
You have specialists for pasta, desserts, grilling.

514
00:15:54,360 --> 00:15:55,440
I like where you're going with this.

515
00:15:55,440 --> 00:15:56,240
And so on.

516
00:15:56,240 --> 00:15:56,600
OK.

517
00:15:56,600 --> 00:15:59,440
This makes the kitchen run much more efficiently.

518
00:15:59,440 --> 00:15:59,840
OK.

519
00:15:59,840 --> 00:16:01,440
That's a great analogy.

520
00:16:01,440 --> 00:16:05,800
So in an MoE model, instead of one giant network,

521
00:16:05,800 --> 00:16:08,080
you have these specialized experts.

522
00:16:08,080 --> 00:16:08,440
Right.

523
00:16:08,440 --> 00:16:10,960
But how does the model know which expert

524
00:16:10,960 --> 00:16:13,000
to call on for a given task?

525
00:16:13,000 --> 00:16:15,240
That's where the routing mechanism comes in.

526
00:16:15,240 --> 00:16:15,880
Gotcha.

527
00:16:15,880 --> 00:16:18,840
It's like the head chef directing traffic in the kitchen,

528
00:16:18,840 --> 00:16:21,440
making sure the right cooks are working on the right orders.

529
00:16:21,440 --> 00:16:25,520
So it analyzes the input data and decides

530
00:16:25,520 --> 00:16:28,600
which experts are most likely to provide the best output.

531
00:16:28,600 --> 00:16:29,240
Right.

532
00:16:29,240 --> 00:16:30,680
It's like having a smart manager.

533
00:16:30,680 --> 00:16:31,120
Yeah.

534
00:16:31,120 --> 00:16:33,040
Who knows exactly which team member

535
00:16:33,040 --> 00:16:34,680
has the right skills for the job.

536
00:16:34,680 --> 00:16:35,440
Exactly.

537
00:16:35,440 --> 00:16:36,000
OK.

538
00:16:36,000 --> 00:16:39,400
This routing mechanism is what makes MoE models so efficient.

539
00:16:39,400 --> 00:16:39,800
Gotcha.

540
00:16:39,800 --> 00:16:43,040
They're only using the brain power they need for a specific task.

541
00:16:43,040 --> 00:16:47,280
I see why the Qwen team chose to use MoE models

542
00:16:47,280 --> 00:16:49,240
for their largest and most powerful offerings.

543
00:16:49,240 --> 00:16:49,880
Yeah.

544
00:16:49,880 --> 00:16:52,200
It's a really clever way to handle the challenges

545
00:16:52,200 --> 00:16:53,520
of scale and complexity.

546
00:16:53,520 --> 00:16:54,280
It is.

547
00:16:54,280 --> 00:16:56,840
But did these MoE models live up to the hype?

548
00:16:56,840 --> 00:16:57,600
They did.

549
00:16:57,600 --> 00:17:01,360
Both Qwen2.5-Turbo and Qwen2.5-Plus

550
00:17:01,360 --> 00:17:04,360
delivered exceptional results, often exceeding

551
00:17:04,360 --> 00:17:06,520
the performance of larger dense models.

552
00:17:06,520 --> 00:17:09,000
So better than the non-MoE models.

553
00:17:09,000 --> 00:17:09,400
Yep.

554
00:17:09,400 --> 00:17:12,200
While using significantly less computational power.

555
00:17:12,200 --> 00:17:13,920
So they're not just powerful.

556
00:17:13,920 --> 00:17:15,600
They're also incredibly efficient.

557
00:17:15,600 --> 00:17:16,000
Right.

558
00:17:16,000 --> 00:17:18,200
It's like getting the same amazing meal.

559
00:17:18,200 --> 00:17:20,520
But the kitchen is running like a well-oiled machine.

560
00:17:20,520 --> 00:17:23,840
And that's what makes the Qwen2.5 research so exciting.

561
00:17:23,840 --> 00:17:24,320
I agree.

562
00:17:24,320 --> 00:17:26,200
They're achieving high performance in a way that's

563
00:17:26,200 --> 00:17:28,200
actually practical and accessible.

564
00:17:28,200 --> 00:17:30,160
I can't wait to hear more about these MoE models

565
00:17:30,160 --> 00:17:31,000
as they evolve.

566
00:17:31,000 --> 00:17:31,600
Me neither.

567
00:17:31,600 --> 00:17:34,680
This feels like a major turning point in AI development.

568
00:17:34,680 --> 00:17:35,240
Yeah.

569
00:17:35,240 --> 00:17:37,960
Now, I think we've laid a solid foundation

570
00:17:37,960 --> 00:17:41,080
with our discussion of the base models and the MoE approach.

571
00:17:41,080 --> 00:17:41,680
OK.

572
00:17:41,680 --> 00:17:44,000
Are you ready to move on to the next part of the deep dive?

573
00:17:44,000 --> 00:17:44,560
I'm ready.

574
00:17:44,560 --> 00:17:46,640
Those instruction-tuned models.

575
00:17:46,640 --> 00:17:47,440
Absolutely.

576
00:17:47,440 --> 00:17:49,880
This is where we see these base models transformed

577
00:17:49,880 --> 00:17:54,000
into truly capable and responsive AI assistants.

578
00:17:54,000 --> 00:17:55,200
I'm ready to be impressed.

579
00:17:55,200 --> 00:17:55,480
All right.

580
00:17:55,480 --> 00:17:56,000
Let's do it.

581
00:17:56,000 --> 00:17:56,520
Let's do it.

582
00:17:56,520 --> 00:17:56,800
All right.

583
00:17:56,800 --> 00:18:00,080
Let's dive into the world of instruction-tuned models.

584
00:18:00,080 --> 00:18:00,360
OK.

585
00:18:00,360 --> 00:18:03,240
What makes these models different from the base models

586
00:18:03,240 --> 00:18:04,560
we discussed earlier?

587
00:18:04,560 --> 00:18:07,440
Think of the base models as having raw talent.

588
00:18:07,440 --> 00:18:07,680
OK.

589
00:18:07,680 --> 00:18:09,720
Like a promising athlete who hasn't yet

590
00:18:09,720 --> 00:18:11,160
received formal coaching.

591
00:18:11,160 --> 00:18:12,160
I see where you're going with this.

592
00:18:12,160 --> 00:18:12,800
Yeah.

593
00:18:12,800 --> 00:18:16,160
They have the potential to do incredible things.

594
00:18:16,160 --> 00:18:18,040
But they haven't been specifically

595
00:18:18,040 --> 00:18:21,240
trained to follow instructions, understand

596
00:18:21,240 --> 00:18:24,320
complex nuances of language, or consistently

597
00:18:24,320 --> 00:18:26,360
generate human quality text.

598
00:18:26,360 --> 00:18:27,000
Gotcha.

599
00:18:27,000 --> 00:18:29,800
Instruction tuning is like giving that athlete expert

600
00:18:29,800 --> 00:18:32,200
coaching, honing their natural abilities

601
00:18:32,200 --> 00:18:34,080
into finely tuned skills.

602
00:18:34,080 --> 00:18:36,320
So it's like taking a brilliant student

603
00:18:36,320 --> 00:18:38,080
and teaching them how to apply their knowledge

604
00:18:38,080 --> 00:18:40,880
in the real world, solving practical problems.

605
00:18:40,880 --> 00:18:41,760
Exactly.

606
00:18:41,760 --> 00:18:46,360
And in the case of Qwen2.5, they took their instruction

607
00:18:46,360 --> 00:18:48,600
tuning to a whole new level.

608
00:18:48,600 --> 00:18:51,400
This is where those two key advancements come in, right?

609
00:18:51,400 --> 00:18:54,560
The expanded supervised fine-tuning data

610
00:18:54,560 --> 00:18:57,000
and that two-stage reinforcement learning process.

611
00:18:57,000 --> 00:18:58,000
You got it.

612
00:18:58,000 --> 00:18:59,560
Yeah, those advancements were crucial

613
00:18:59,560 --> 00:19:03,120
for creating the impressive instruction-tuned models

614
00:19:03,120 --> 00:19:04,960
in the Qwen2.5 series.

615
00:19:04,960 --> 00:19:05,400
Cool.

616
00:19:05,400 --> 00:19:07,480
They supercharged their training data

617
00:19:07,480 --> 00:19:10,320
with millions of high quality examples.

618
00:19:10,320 --> 00:19:10,840
Wow.

619
00:19:10,840 --> 00:19:13,240
Targeting areas where previous models struggled.

620
00:19:13,240 --> 00:19:16,520
So areas like generating long, coherent responses,

621
00:19:16,520 --> 00:19:19,480
solving complex math problems, writing code,

622
00:19:19,480 --> 00:19:21,240
and following instructions flawlessly.

623
00:19:21,240 --> 00:19:22,280
Precisely, yeah.

624
00:19:22,280 --> 00:19:24,280
They wanted to make sure the model had seen and learned

625
00:19:24,280 --> 00:19:26,320
from as many different types of instructions

626
00:19:26,320 --> 00:19:28,000
and desired outputs as possible.

627
00:19:28,000 --> 00:19:29,760
So they really tried to cover all the bases.

628
00:19:29,760 --> 00:19:30,260
They did.

629
00:19:30,260 --> 00:19:30,920
They did.

630
00:19:30,920 --> 00:19:32,760
Almost like creating a training simulation

631
00:19:32,760 --> 00:19:34,560
for a real world AI assistant.

632
00:19:34,560 --> 00:19:36,680
It sounds like they were trying to anticipate

633
00:19:36,680 --> 00:19:39,680
the kinds of tasks these models might encounter

634
00:19:39,680 --> 00:19:42,960
in real world applications and train them accordingly.

635
00:19:42,960 --> 00:19:44,200
Exactly.

636
00:19:44,200 --> 00:19:47,480
And the two-stage reinforcement learning process

637
00:19:47,480 --> 00:19:51,480
was key for fine tuning the model's ability

638
00:19:51,480 --> 00:19:54,920
to not only understand instructions,

639
00:19:54,920 --> 00:19:59,000
but also to generate responses that aligned with human

640
00:19:59,000 --> 00:20:00,400
preferences and values.

641
00:20:00,400 --> 00:20:03,120
I remember you mentioned earlier that offline RL focused

642
00:20:03,120 --> 00:20:06,600
on developing capabilities like reasoning and factuality,

643
00:20:06,600 --> 00:20:10,160
which can be tricky for a reward model to evaluate directly.

644
00:20:10,160 --> 00:20:11,880
Yeah, it's like teaching the model

645
00:20:11,880 --> 00:20:16,280
to solve a complex puzzle, where the solution isn't obvious.

646
00:20:16,280 --> 00:20:18,680
They used clever techniques like answer matching

647
00:20:18,680 --> 00:20:22,000
and execution feedback along with human review

648
00:20:22,000 --> 00:20:24,000
to make sure the model was learning the right lessons.

649
00:20:24,000 --> 00:20:25,360
So it's more nuanced than that.

650
00:20:25,360 --> 00:20:27,000
Even when the reward signals were subtle.

651
00:20:27,000 --> 00:20:27,640
Exactly.

652
00:20:27,640 --> 00:20:30,360
So it's not just about giving the model a pat on the back

653
00:20:30,360 --> 00:20:31,960
when it gets the answer right.

654
00:20:31,960 --> 00:20:34,280
It's also about guiding it towards developing

655
00:20:34,280 --> 00:20:36,600
a deeper understanding of the task at hand.

656
00:20:36,600 --> 00:20:40,800
Then with online RL, they refined the model's output

657
00:20:40,800 --> 00:20:43,120
for more nuanced qualities like truthfulness,

658
00:20:43,120 --> 00:20:47,000
helpfulness, conciseness, relevance, and harmlessness.

659
00:20:47,000 --> 00:20:47,680
Exactly.

660
00:20:47,680 --> 00:20:50,040
So it's like taking those raw reasoning skills

661
00:20:50,040 --> 00:20:52,640
and polishing them into a communication style that's

662
00:20:52,640 --> 00:20:55,240
both effective and aligned with human values.

663
00:20:55,240 --> 00:20:55,880
Yeah.

664
00:20:55,880 --> 00:20:58,600
They're shaping not just the model's intelligence,

665
00:20:58,600 --> 00:21:00,080
but its personality as well.

666
00:21:00,080 --> 00:21:01,400
That's a great way to put it.

667
00:21:01,400 --> 00:21:06,120
Now the big question is, did all this careful training pay off?

668
00:21:06,120 --> 00:21:07,160
Let's get to the results.

669
00:21:07,160 --> 00:21:07,660
Yeah.

670
00:21:07,660 --> 00:21:10,720
How did these instruction-tuned models perform?

671
00:21:10,720 --> 00:21:12,800
I'm ready to hear about some AI magic.

672
00:21:12,800 --> 00:21:15,040
The results were impressive across the board.

673
00:21:15,040 --> 00:21:16,040
OK, in what ways?

674
00:21:16,040 --> 00:21:18,680
Well, they used a mix of open benchmarks,

675
00:21:18,680 --> 00:21:21,760
their own in-house data sets, and even assessments

676
00:21:21,760 --> 00:21:24,040
of their long context capabilities,

677
00:21:24,040 --> 00:21:25,680
especially for those models designed

678
00:21:25,680 --> 00:21:27,680
to handle massive amounts of information.

679
00:21:27,680 --> 00:21:29,960
OK, let's start with those open benchmarks.

680
00:21:29,960 --> 00:21:31,400
These are the standardized tests that

681
00:21:31,400 --> 00:21:34,600
allow us to compare performance across different LLMs,

682
00:21:34,600 --> 00:21:36,720
right, like the SATs for AI.

683
00:21:36,720 --> 00:21:37,880
Exactly.

684
00:21:37,880 --> 00:21:40,960
And in these standardized tests, Qwen2.5

685
00:21:40,960 --> 00:21:43,600
held its own against some of the top models in the field.

686
00:21:43,600 --> 00:21:44,560
OK, that's good.

687
00:21:44,560 --> 00:21:49,240
In fact, their flagship model, Qwen2.5-72B Instruct,

688
00:21:49,240 --> 00:21:51,480
delivered exceptional performance,

689
00:21:51,480 --> 00:21:56,120
even surpassing the much larger Llama 3.1 405B Instruct

690
00:21:56,120 --> 00:21:58,000
on several key benchmarks.

691
00:21:58,000 --> 00:21:59,280
So they beat the big guys.

692
00:21:59,280 --> 00:22:00,440
They did.

693
00:22:00,440 --> 00:22:02,720
Wow, that's a huge accomplishment.

694
00:22:02,720 --> 00:22:04,720
So they're achieving state-of-the-art performance

695
00:22:04,720 --> 00:22:06,720
with a smaller, more efficient model.

696
00:22:06,720 --> 00:22:07,280
Right.

697
00:22:07,280 --> 00:22:08,120
That's a big deal.

698
00:22:08,120 --> 00:22:08,760
It is.

699
00:22:08,760 --> 00:22:10,960
And it wasn't just the largest model that impressed.

700
00:22:10,960 --> 00:22:11,280
OK.

701
00:22:11,280 --> 00:22:12,600
Qwen2.5-Plus.

702
00:22:12,600 --> 00:22:13,920
So one of the MoE models.

703
00:22:13,920 --> 00:22:17,600
Yeah, their MoE model also outperformed Qwen2.5-72B

704
00:22:17,600 --> 00:22:19,480
Instruct on several benchmarks.

705
00:22:19,480 --> 00:22:20,840
So even better in some cases.

706
00:22:20,840 --> 00:22:23,280
Yeah, further demonstrating the power and efficiency

707
00:22:23,280 --> 00:22:24,160
of the MoE approach.

708
00:22:24,160 --> 00:22:26,320
I'm starting to see why you're so excited about MoE models.

709
00:22:26,320 --> 00:22:28,800
They seem to be consistently punching above their weight.

710
00:22:28,800 --> 00:22:29,960
They really are.

711
00:22:29,960 --> 00:22:34,040
Now, moving down to the 14B and 32B instruction-tuned models.

712
00:22:34,040 --> 00:22:34,600
OK.

713
00:22:34,600 --> 00:22:37,280
Where Qwen2.5-Turbo also comes into play.

714
00:22:37,280 --> 00:22:37,680
Right.

715
00:22:37,680 --> 00:22:39,600
They compared their models with some heavy hitters,

716
00:22:39,600 --> 00:22:43,600
like GPT-4o mini and Gemma2-27B-IT.

717
00:22:43,600 --> 00:22:43,920
Yep.

718
00:22:43,920 --> 00:22:46,800
So a mix of proprietary and open weight models.

719
00:22:46,800 --> 00:22:47,240
OK.

720
00:22:47,240 --> 00:22:49,000
All known for their strong performance.

721
00:22:49,000 --> 00:22:49,880
And how did they do?

722
00:22:49,880 --> 00:22:52,920
The Qwen2.5 models proved to be strong contenders

723
00:22:52,920 --> 00:22:53,680
in this arena.

724
00:22:53,680 --> 00:22:54,080
OK.

725
00:22:54,080 --> 00:22:56,800
The 32B model in particular stood out

726
00:22:56,800 --> 00:22:59,120
with impressive capabilities in math and coding,

727
00:22:59,120 --> 00:23:01,520
consistently ranking among the top performers.

728
00:23:01,520 --> 00:23:02,840
There's that theme again.

729
00:23:02,840 --> 00:23:06,160
Qwen2.5 consistently excels at tasks

730
00:23:06,160 --> 00:23:08,720
that require more than just basic language understanding.

731
00:23:08,720 --> 00:23:09,160
All right.

732
00:23:09,160 --> 00:23:12,240
They're really good at logic, reasoning, and problem solving.

733
00:23:12,240 --> 00:23:14,520
Yeah, it's like they have a knack for figuring things out.

734
00:23:14,520 --> 00:23:15,480
Exactly.

735
00:23:15,480 --> 00:23:18,680
And how about Qwen2.5-14B Instruct?

736
00:23:18,680 --> 00:23:20,280
Did it keep up with the big guys?

737
00:23:20,280 --> 00:23:21,640
It definitely held its own.

738
00:23:21,640 --> 00:23:22,120
OK.

739
00:23:22,120 --> 00:23:24,480
Delivering competitive results across the benchmarks,

740
00:23:24,480 --> 00:23:28,600
and proving to be a strong contender against even the proprietary GPT-4o

741
00:23:28,600 --> 00:23:29,240
mini model.

742
00:23:29,240 --> 00:23:30,120
So it did pretty well.

743
00:23:30,120 --> 00:23:30,840
It did.

744
00:23:30,840 --> 00:23:33,480
And of course, we can't forget about Qwen2.5-Turbo.

745
00:23:33,480 --> 00:23:33,760
Right.

746
00:23:33,760 --> 00:23:37,680
It continued to impress, even outperforming Qwen2.5-14B

747
00:23:37,680 --> 00:23:39,600
Instruct on several benchmarks.

748
00:23:39,600 --> 00:23:39,920
Really?

749
00:23:39,920 --> 00:23:42,240
Despite its lower training and inference costs.

750
00:23:42,240 --> 00:23:45,240
I'm telling you, these MoE models are going to be game changers.

751
00:23:45,240 --> 00:23:46,560
I think so too.

752
00:23:46,560 --> 00:23:49,120
Now for the 7B instruction-tuned models.

753
00:23:49,120 --> 00:23:49,520
OK.

754
00:23:49,520 --> 00:23:52,920
They focused on comparing Qwen2.5-7B

755
00:23:52,920 --> 00:23:59,280
with two leading open-weight models, Gemma2-9B-IT and Llama 3.1 8B

756
00:23:59,280 --> 00:24:00,360
Instruct.

757
00:24:00,360 --> 00:24:05,200
So how did Qwen2.5-7B stack up against those established competitors?

758
00:24:05,200 --> 00:24:08,720
It significantly outperformed both of them on almost every task.

759
00:24:08,720 --> 00:24:09,320
Really?

760
00:24:09,320 --> 00:24:11,800
This really highlights the effectiveness of their approach

761
00:24:11,800 --> 00:24:14,800
to pre-training, fine tuning, and data selection.

762
00:24:14,800 --> 00:24:17,520
It's like they've figured out the secret sauce for training highly

763
00:24:17,520 --> 00:24:18,960
capable LLMs.

764
00:24:18,960 --> 00:24:19,920
It seems so.

765
00:24:19,920 --> 00:24:21,520
It's not just about building bigger models.

766
00:24:21,520 --> 00:24:23,040
It's about training them smarter.

767
00:24:23,040 --> 00:24:26,000
And finally, they evaluated their smallest instruction-tuned

768
00:24:26,000 --> 00:24:28,600
models, the ones designed for edge side applications.

769
00:24:28,600 --> 00:24:29,240
OK.

770
00:24:29,240 --> 00:24:34,320
Like Qwen2.5-0.5B, 1.5B, and 3B.

771
00:24:34,320 --> 00:24:35,640
OK, so like for smartphones.

772
00:24:35,640 --> 00:24:36,000
Yeah.

773
00:24:36,000 --> 00:24:38,200
These compact models showed substantial improvements

774
00:24:38,200 --> 00:24:41,080
over their previous versions, making them ideal for applications

775
00:24:41,080 --> 00:24:42,680
with limited computational resources.

776
00:24:42,680 --> 00:24:43,280
That makes sense.

777
00:24:43,280 --> 00:24:43,560
Yeah.

778
00:24:43,560 --> 00:24:45,760
It's amazing to see how they're packing so much power

779
00:24:45,760 --> 00:24:47,400
into smaller and smaller models.

780
00:24:47,400 --> 00:24:47,720
Right.

781
00:24:47,720 --> 00:24:49,960
It really democratizes access to AI,

782
00:24:49,960 --> 00:24:52,560
bringing its potential to a wider audience.

783
00:24:52,560 --> 00:24:53,440
Absolutely.

784
00:24:53,440 --> 00:24:56,360
They've achieved a great balance between performance and accessibility.

785
00:24:56,360 --> 00:24:56,800
They have.

786
00:24:56,800 --> 00:24:57,440
They have.

787
00:24:57,440 --> 00:25:03,680
So these open benchmarks paint a very positive picture of Qwen2.5's

788
00:25:03,680 --> 00:25:04,400
capabilities.

789
00:25:04,400 --> 00:25:05,280
Yeah.

790
00:25:05,280 --> 00:25:08,480
But you mentioned earlier that they went beyond these standard tests.

791
00:25:08,480 --> 00:25:09,040
Right.

792
00:25:09,040 --> 00:25:10,360
Why was that?

793
00:25:10,360 --> 00:25:14,960
They recognized that open benchmarks, while valuable,

794
00:25:14,960 --> 00:25:18,400
don't always capture the full range of skills and behaviors

795
00:25:18,400 --> 00:25:23,000
needed for a truly helpful and reliable AI assistant.

796
00:25:23,000 --> 00:25:23,520
OK.

797
00:25:23,520 --> 00:25:24,640
In the real world.

798
00:25:24,640 --> 00:25:25,480
Yeah, that makes sense.

799
00:25:25,480 --> 00:25:28,280
They wanted to put their models through more realistic tests.

800
00:25:28,280 --> 00:25:30,120
So they created their own custom exams.

801
00:25:30,120 --> 00:25:30,640
Exactly.

802
00:25:30,640 --> 00:25:35,280
They created in-house data sets to assess their models' performance

803
00:25:35,280 --> 00:25:38,000
in areas like knowledge understanding, text generation,

804
00:25:38,000 --> 00:25:39,200
and coding.

805
00:25:39,200 --> 00:25:42,440
All crucial skills for real world tasks.

806
00:25:42,440 --> 00:25:43,280
And how did those go?

807
00:25:43,280 --> 00:25:45,000
Well, they didn't just stick to English.

808
00:25:45,000 --> 00:25:45,360
OK.

809
00:25:45,360 --> 00:25:48,960
They conducted these evaluations in both English and Chinese.

810
00:25:48,960 --> 00:25:53,040
So they're really serious about making their models work well

811
00:25:53,040 --> 00:25:54,960
across different languages and cultures.

812
00:25:54,960 --> 00:25:55,400
They are.

813
00:25:55,400 --> 00:25:58,000
It's like they're training them to be global citizens.

814
00:25:58,000 --> 00:25:59,040
I like that analogy.

815
00:25:59,040 --> 00:26:01,160
And in their in-house evaluations,

816
00:26:01,160 --> 00:26:04,520
they compared their models with leading language models.

817
00:26:04,520 --> 00:26:05,520
So the competition.

818
00:26:05,520 --> 00:26:10,160
Yeah, including giants like GPT-4, Claude 3.5 Sonnet, Qwen2,

819
00:26:10,160 --> 00:26:11,440
and Llama 3.1.

820
00:26:11,440 --> 00:26:12,000
Oh, wow.

821
00:26:12,000 --> 00:26:14,040
So the heavy hitters in the LLM world?

822
00:26:14,040 --> 00:26:16,480
Yep, across both English and Chinese.

823
00:26:16,480 --> 00:26:21,880
OK, so how did Qwen2.5 measure up against these AI titans?

824
00:26:21,880 --> 00:26:23,640
They more than held their own.

825
00:26:23,640 --> 00:26:26,280
They often matched or exceeded the performance

826
00:26:26,280 --> 00:26:28,560
of these larger and more established models,

827
00:26:28,560 --> 00:26:30,160
even in very challenging areas.

828
00:26:30,160 --> 00:26:30,800
That's impressive.

829
00:26:30,800 --> 00:26:32,640
Can you give me some specific examples?

830
00:26:32,640 --> 00:26:36,720
Yeah, in English, Qwen2.5-72B really stood out.

831
00:26:36,720 --> 00:26:40,720
Matching or exceeding the performance of Llama 3.1 405B

832
00:26:40,720 --> 00:26:42,000
in almost every metric.

833
00:26:42,000 --> 00:26:43,040
Wow, that's impressive.

834
00:26:43,040 --> 00:26:45,960
And in Chinese, Qwen2.5-Plus addressed

835
00:26:45,960 --> 00:26:47,760
some previous limitations in instruction

836
00:26:47,760 --> 00:26:50,960
following while maintaining its already strong performance

837
00:26:50,960 --> 00:26:52,040
in other areas.

838
00:26:52,040 --> 00:26:55,880
It's inspiring to see a newer model like Qwen2.5 competing

839
00:26:55,880 --> 00:26:58,320
at such a high level against the giants of the field.

840
00:26:58,320 --> 00:26:58,640
It is.

841
00:26:58,640 --> 00:27:01,600
It's a clear sign of how quickly this field is advancing.

842
00:27:01,600 --> 00:27:03,840
New approaches to training and architecture

843
00:27:03,840 --> 00:27:05,680
are leading to rapid improvements.

844
00:27:05,680 --> 00:27:08,000
It seems like every week there's something new coming out.

845
00:27:08,000 --> 00:27:08,720
There is.

846
00:27:08,720 --> 00:27:11,160
We're witnessing an AI revolution.

847
00:27:11,160 --> 00:27:13,560
OK, let's talk about their multilingual capabilities.

848
00:27:13,560 --> 00:27:15,440
I know we touched on it briefly.

849
00:27:15,440 --> 00:27:19,280
But how do they really assess how well these models perform

850
00:27:19,280 --> 00:27:20,600
across different languages?

851
00:27:20,600 --> 00:27:23,120
They took a multifaceted approach.

852
00:27:23,120 --> 00:27:25,680
They expanded existing benchmarks

853
00:27:25,680 --> 00:27:28,840
to include multilingual examples and used

854
00:27:28,840 --> 00:27:32,280
several MMLU-like benchmarks in various languages,

855
00:27:32,280 --> 00:27:37,240
including Arabic, Japanese, Korean, Indonesian, and Turkish.

856
00:27:37,240 --> 00:27:38,680
Wow, a lot of languages.

857
00:27:38,680 --> 00:27:42,040
Yeah, they even tested the models on translated versions

858
00:27:42,040 --> 00:27:45,880
of benchmarks, ensuring consistency across different language

859
00:27:45,880 --> 00:27:46,360
versions.

860
00:27:46,360 --> 00:27:47,960
Wow, they really went all out.

861
00:27:47,960 --> 00:27:49,520
It sounds like they're committed to building

862
00:27:49,520 --> 00:27:52,280
AI that can truly communicate across language barriers.

863
00:27:52,280 --> 00:27:52,760
They are.

864
00:27:52,760 --> 00:27:54,360
What were the results of these tests?

865
00:27:54,360 --> 00:27:56,400
The results were encouraging.

866
00:27:56,400 --> 00:27:58,920
Qwen2.5 consistently performed well

867
00:27:58,920 --> 00:28:01,760
in instruction following, multilingual knowledge

868
00:28:01,760 --> 00:28:04,120
understanding, and even math reasoning

869
00:28:04,120 --> 00:28:05,720
across various languages.

870
00:28:05,720 --> 00:28:08,360
So they're not just claiming to be multilingual.

871
00:28:08,360 --> 00:28:10,280
They're proving it with hard data.

872
00:28:10,280 --> 00:28:11,040
Exactly.

873
00:28:11,040 --> 00:28:15,080
They even went a step further, testing the models' ability

874
00:28:15,080 --> 00:28:18,080
to understand subtle cultural nuances.

875
00:28:18,080 --> 00:28:18,640
Oh, wow.

876
00:28:18,640 --> 00:28:20,760
It's one thing to translate words,

877
00:28:20,760 --> 00:28:24,520
but it's quite another to grasp the cultural context.

878
00:28:24,520 --> 00:28:25,800
That's really fascinating.

879
00:28:25,800 --> 00:28:27,880
They're not just building language models.

880
00:28:27,880 --> 00:28:29,680
They're building cultural interpreters.

881
00:28:29,680 --> 00:28:32,880
Now, before we move on to their impressive long context

882
00:28:32,880 --> 00:28:36,040
capabilities, we need to discuss one more piece

883
00:28:36,040 --> 00:28:37,880
of the evaluation puzzle.

884
00:28:37,880 --> 00:28:39,160
The reward model.

885
00:28:39,160 --> 00:28:39,520
Oh, right.

886
00:28:39,520 --> 00:28:40,440
The reward model.

887
00:28:40,440 --> 00:28:43,200
This plays a key role in the reinforcement learning

888
00:28:43,200 --> 00:28:43,960
process, right?

889
00:28:43,960 --> 00:28:44,600
Exactly.

890
00:28:44,600 --> 00:28:47,440
The reward model is like a coach.

891
00:28:47,440 --> 00:28:51,360
Giving feedback to the LLM and guiding it towards generating

892
00:28:51,360 --> 00:28:52,520
better responses.

893
00:28:52,520 --> 00:28:55,720
But evaluating a reward model is different than evaluating

894
00:28:55,720 --> 00:28:57,400
the LLM itself, isn't it?

895
00:28:57,400 --> 00:28:58,400
You're absolutely right.

896
00:28:58,400 --> 00:29:00,880
You can't just give it a test and grade it.

897
00:29:00,880 --> 00:29:03,280
Instead, you have to assess how well it

898
00:29:03,280 --> 00:29:07,120
predicts which responses humans would find most helpful,

899
00:29:07,120 --> 00:29:09,280
truthful, and so on.

900
00:29:09,280 --> 00:29:12,840
So how did they evaluate the reward model for Qwen2.5?

901
00:29:12,840 --> 00:29:15,640
They used specialized benchmarks designed

902
00:29:15,640 --> 00:29:19,080
for assessing reward models, including their own Chinese

903
00:29:19,080 --> 00:29:20,600
human preference benchmark.

904
00:29:20,600 --> 00:29:21,200
Gotcha.

905
00:29:21,200 --> 00:29:24,040
It's a complex process, but essential for ensuring

906
00:29:24,040 --> 00:29:26,560
that the reward model is actually doing its job.

907
00:29:26,560 --> 00:29:30,280
And how did the Qwen2.5 reward model perform?

908
00:29:30,280 --> 00:29:33,400
It performed well overall, even surpassing some established

909
00:29:33,400 --> 00:29:35,240
reward models on certain benchmarks.

910
00:29:35,240 --> 00:29:35,800
Really?

911
00:29:35,800 --> 00:29:39,200
For example, it was the top performer in both the PPE

912
00:29:39,200 --> 00:29:41,680
and Human-Preference-Chinese evaluations.

913
00:29:41,680 --> 00:29:43,680
So it sounds like they did a good job with that, too.

914
00:29:43,680 --> 00:29:44,400
They did.

915
00:29:44,400 --> 00:29:46,120
This is a testament to their efforts

916
00:29:46,120 --> 00:29:48,320
in developing a robust reward model.

917
00:29:48,320 --> 00:29:50,000
It's amazing to see how much effort

918
00:29:50,000 --> 00:29:54,240
they put into refining every aspect of the training process.

919
00:29:54,240 --> 00:29:55,920
It's clear they're not cutting any corners.

920
00:29:55,920 --> 00:29:58,000
They're really striving for excellence,

921
00:29:58,000 --> 00:30:00,040
and that's reflected in their results.

922
00:30:00,040 --> 00:30:03,240
However, they made a very insightful observation

923
00:30:03,240 --> 00:30:04,840
during their evaluation.

924
00:30:04,840 --> 00:30:05,360
What's that?

925
00:30:05,360 --> 00:30:07,600
They found that simply achieving a high score

926
00:30:07,600 --> 00:30:10,000
on a reward model benchmark doesn't always

927
00:30:10,000 --> 00:30:12,600
mean that the LLM will also perform well.

928
00:30:12,600 --> 00:30:13,400
That's interesting.

929
00:30:13,400 --> 00:30:13,900
Yeah.

930
00:30:13,900 --> 00:30:17,000
So there's a gap between how we currently evaluate reward

931
00:30:17,000 --> 00:30:20,160
models and how effective they are in practice.

932
00:30:20,160 --> 00:30:21,880
So there's still room for improvement there.

933
00:30:21,880 --> 00:30:22,800
Exactly.

934
00:30:22,800 --> 00:30:25,960
This highlights a need for better evaluation methods

935
00:30:25,960 --> 00:30:29,520
that can more accurately predict real world performance.

936
00:30:29,520 --> 00:30:30,520
Yeah, that makes sense.

937
00:30:30,520 --> 00:30:32,760
It's an area where more research is needed.

938
00:30:32,760 --> 00:30:37,600
It's a reminder that even as we're making great strides in AI,

939
00:30:37,600 --> 00:30:40,280
there are still fundamental challenges to address.

940
00:30:40,280 --> 00:30:41,000
Yeah.

941
00:30:41,000 --> 00:30:43,000
It's a constantly evolving field.

942
00:30:43,000 --> 00:30:44,080
Absolutely.

943
00:30:44,080 --> 00:30:47,160
Now, are you ready to move on to those long context

944
00:30:47,160 --> 00:30:48,000
capabilities?

945
00:30:48,000 --> 00:30:48,840
Let's do it.

946
00:30:48,840 --> 00:30:51,200
This is where things get really mind blowing.

947
00:30:51,200 --> 00:30:54,720
All right, they use benchmarks like RULER, LV-Eval,

948
00:30:54,720 --> 00:30:57,560
and LongBench-Chat, which are specifically

949
00:30:57,560 --> 00:30:59,440
designed to assess a model's ability

950
00:30:59,440 --> 00:31:01,600
to handle long pieces of text.

951
00:31:01,600 --> 00:31:03,480
What are some of the key skills needed

952
00:31:03,480 --> 00:31:05,400
for handling long contexts?

953
00:31:05,400 --> 00:31:07,480
What were they looking for in these evaluations?

954
00:31:07,480 --> 00:31:09,360
Well, first and foremost, the model

955
00:31:09,360 --> 00:31:10,880
needs to have a good memory.

956
00:31:10,880 --> 00:31:11,320
OK.

957
00:31:11,320 --> 00:31:13,920
It needs to remember information from earlier parts

958
00:31:13,920 --> 00:31:15,040
of a long text.

959
00:31:15,040 --> 00:31:18,080
Kind of like reading a novel and recalling details

960
00:31:18,080 --> 00:31:20,640
from the first chapter when you're halfway through the book.

961
00:31:20,640 --> 00:31:21,680
That's a great analogy.

962
00:31:21,680 --> 00:31:24,040
It's one thing to process a few sentences.

963
00:31:24,040 --> 00:31:24,440
Right.

964
00:31:24,440 --> 00:31:27,160
It's quite another to maintain understanding over thousands

965
00:31:27,160 --> 00:31:28,720
or even millions of tokens.

966
00:31:28,720 --> 00:31:29,880
Exactly.

967
00:31:29,880 --> 00:31:32,160
They were also looking for the model's ability

968
00:31:32,160 --> 00:31:36,400
to see the big picture, to understand the relationships

969
00:31:36,400 --> 00:31:39,200
and connections between different parts of a long text.

970
00:31:39,200 --> 00:31:42,200
So not just remembering isolated facts,

971
00:31:42,200 --> 00:31:45,960
but actually grasping the narrative or argument,

972
00:31:45,960 --> 00:31:47,280
the flow of ideas.

973
00:31:47,280 --> 00:31:48,400
Precisely.

974
00:31:48,400 --> 00:31:51,120
And Qwen2.5 delivered impressive results.

975
00:31:51,120 --> 00:31:52,000
OK, that's good.

976
00:31:52,000 --> 00:31:55,840
Especially the larger models like Qwen2.5-72B-Instruct.

977
00:31:55,840 --> 00:31:57,240
So the big ones did well.

978
00:31:57,240 --> 00:31:57,960
They did.

979
00:31:57,960 --> 00:32:01,400
They even surpassed existing open weight and proprietary models

980
00:32:01,400 --> 00:32:03,240
in their long context performance.

981
00:32:03,240 --> 00:32:05,600
So they're not just handling long contexts.

982
00:32:05,600 --> 00:32:08,680
They're excelling at tasks that require a deep understanding

983
00:32:08,680 --> 00:32:09,480
of those contexts.

984
00:32:09,480 --> 00:32:10,040
Exactly.

985
00:32:10,040 --> 00:32:12,000
And they didn't just test their ability

986
00:32:12,000 --> 00:32:13,680
to understand long texts.

987
00:32:13,680 --> 00:32:14,000
OK.

988
00:32:14,000 --> 00:32:16,920
They also tested their ability to generate them.

989
00:32:16,920 --> 00:32:21,320
They gave Qwen2.5-Turbo a task called Passkey Retrieval,

990
00:32:21,320 --> 00:32:24,240
where it had to find specific information hidden

991
00:32:24,240 --> 00:32:26,240
within a million token document.

992
00:32:26,240 --> 00:32:28,200
Wait, a million tokens?

993
00:32:28,200 --> 00:32:30,160
That sounds impossible, like finding

994
00:32:30,160 --> 00:32:32,760
a needle in a haystack of words.

995
00:32:32,760 --> 00:32:33,760
It's a real challenge.

996
00:32:33,760 --> 00:32:34,320
But it is.

997
00:32:34,320 --> 00:32:37,000
But Qwen2.5-Turbo aced it.

998
00:32:37,000 --> 00:32:37,480
Really?

999
00:32:37,480 --> 00:32:40,280
Achieving 100% accuracy in retrieving

1000
00:32:40,280 --> 00:32:41,240
that hidden information.

1001
00:32:41,240 --> 00:32:42,520
Wow, that's impressive.

1002
00:32:42,520 --> 00:32:44,360
It's a testament to the model's ability

1003
00:32:44,360 --> 00:32:48,400
to not only process, but also effectively use

1004
00:32:48,400 --> 00:32:50,000
vast amounts of data.

1005
00:32:50,000 --> 00:32:52,080
That opens up incredible possibilities.

1006
00:32:52,080 --> 00:32:54,600
Imagine researchers, analysts, or writers

1007
00:32:54,600 --> 00:32:57,000
being able to instantly analyze and extract

1008
00:32:57,000 --> 00:32:59,920
key information from massive amounts of text.

1009
00:32:59,920 --> 00:33:00,600
Yeah.

1010
00:33:00,600 --> 00:33:02,520
It's like having an army of research assistants

1011
00:33:02,520 --> 00:33:03,360
at your fingertips.

1012
00:33:03,360 --> 00:33:03,920
Exactly.

1013
00:33:03,920 --> 00:33:06,000
And they also addressed the computational cost

1014
00:33:06,000 --> 00:33:08,320
of handling these long contexts.

1015
00:33:08,320 --> 00:33:10,960
They implemented a sparse attention mechanism,

1016
00:33:10,960 --> 00:33:13,480
which significantly speeds up the processing.

1017
00:33:13,480 --> 00:33:15,440
So they found a way to make it faster and more efficient.

1018
00:33:15,440 --> 00:33:16,280
They did.

1019
00:33:16,280 --> 00:33:18,720
That's great news for anyone who wants to use these models.

1020
00:33:18,720 --> 00:33:20,800
Yeah, it means more people can benefit

1021
00:33:20,800 --> 00:33:23,640
from these powerful capabilities without needing

1022
00:33:23,640 --> 00:33:27,320
access to specialized hardware or massive computing budgets.

1023
00:33:27,320 --> 00:33:30,120
I love how the Qwen team focuses on both performance

1024
00:33:30,120 --> 00:33:31,240
and accessibility.

1025
00:33:31,240 --> 00:33:31,960
I agree.

1026
00:33:31,960 --> 00:33:33,720
They're pushing the boundaries of AI

1027
00:33:33,720 --> 00:33:35,360
while making sure these advancements are

1028
00:33:35,360 --> 00:33:37,160
available to a wider audience.

1029
00:33:37,160 --> 00:33:37,640
I agree.

1030
00:33:37,640 --> 00:33:40,480
Their work is a great example of how AI research should

1031
00:33:40,480 --> 00:33:44,520
be done with a focus on both innovation and social impact.

1032
00:33:44,520 --> 00:33:46,280
Before we wrap up this deep dive,

1033
00:33:46,280 --> 00:33:49,800
I want to make sure we're presenting a balanced perspective.

1034
00:33:49,800 --> 00:33:51,840
Every groundbreaking development comes

1035
00:33:51,840 --> 00:33:53,360
with potential limitations.

1036
00:33:53,360 --> 00:33:54,160
It does.

1037
00:33:54,160 --> 00:33:55,920
And I want to make sure we cover those as well.

1038
00:33:55,920 --> 00:33:56,880
That's a great point.

1039
00:33:56,880 --> 00:33:58,640
So what are some of the limitations you noticed

1040
00:33:58,640 --> 00:34:01,120
in the Qwen2.5 research?

1041
00:34:01,120 --> 00:34:04,000
Well, while their efforts to ensure data quality

1042
00:34:04,000 --> 00:34:07,320
are commendable, they acknowledge that biases can still

1043
00:34:07,320 --> 00:34:11,120
find their way into data sets, even with careful curation.

1044
00:34:11,120 --> 00:34:14,240
So bias is still a problem, even in these huge models.

1045
00:34:14,240 --> 00:34:17,080
Yeah, it's an ongoing challenge in the field of AI.

1046
00:34:17,080 --> 00:34:17,680
That's true.

1047
00:34:17,680 --> 00:34:20,600
Eliminating bias completely is a tough nut to crack.

1048
00:34:20,600 --> 00:34:21,680
What about the MoE models?

1049
00:34:21,680 --> 00:34:23,840
Do they mention any limitations there?

1050
00:34:23,840 --> 00:34:26,480
Well, while MoE models are very promising,

1051
00:34:26,480 --> 00:34:28,560
there's still a lot of ongoing research

1052
00:34:28,560 --> 00:34:31,240
into how to best design and train them.

1053
00:34:31,240 --> 00:34:33,400
There's no one-size-fits-all solution.

1054
00:34:33,400 --> 00:34:36,640
And finding the optimal configuration for a given task

1055
00:34:36,640 --> 00:34:38,240
can be complex.

1056
00:34:38,240 --> 00:34:41,800
Yeah, it's like a new tool with incredible potential.

1057
00:34:41,800 --> 00:34:44,000
But we're still learning how to use it most effectively.

1058
00:34:44,000 --> 00:34:45,080
Exactly.

1059
00:34:45,080 --> 00:34:48,240
And they also highlighted the need for better evaluation

1060
00:34:48,240 --> 00:34:50,560
methods for reward models.

1061
00:34:50,560 --> 00:34:53,720
Remember that observation about benchmark scores not always

1062
00:34:53,720 --> 00:34:56,120
being predictive of real-world performance?

1063
00:34:56,120 --> 00:34:56,880
I do.

1064
00:34:56,880 --> 00:35:00,160
Yeah, this suggests that we need to refine our evaluation

1065
00:35:00,160 --> 00:35:03,040
techniques to better assess how these models will actually

1066
00:35:03,040 --> 00:35:04,640
perform in practice.

1067
00:35:04,640 --> 00:35:06,360
It's like we need to find new ways

1068
00:35:06,360 --> 00:35:10,160
to measure not just how intelligent an AI is,

1069
00:35:10,160 --> 00:35:13,640
but how well it can actually apply that intelligence

1070
00:35:13,640 --> 00:35:14,880
to real-world situations.

1071
00:35:14,880 --> 00:35:15,840
Precisely.

1072
00:35:15,840 --> 00:35:18,920
And finally, even with their impressive long context

1073
00:35:18,920 --> 00:35:20,960
capabilities, they acknowledge that there's

1074
00:35:20,960 --> 00:35:23,600
a gap between what these models can do

1075
00:35:23,600 --> 00:35:27,120
and what humans can do when it comes to processing information

1076
00:35:27,120 --> 00:35:29,480
over extended periods.

1077
00:35:29,480 --> 00:35:31,000
It's good to be reminded that we're still

1078
00:35:31,000 --> 00:35:33,040
in the early days of AI development.

1079
00:35:33,040 --> 00:35:33,440
It is.

1080
00:35:33,440 --> 00:35:35,320
We've come a long way, but there's still a lot of room

1081
00:35:35,320 --> 00:35:36,000
to grow.

1082
00:35:36,000 --> 00:35:36,840
Absolutely.

1083
00:35:36,840 --> 00:35:38,920
But these limitations shouldn't overshadow

1084
00:35:38,920 --> 00:35:41,880
the incredible progress demonstrated in the Qwen2.5

1085
00:35:41,880 --> 00:35:42,360
research.

1086
00:35:42,360 --> 00:35:42,880
Right.

1087
00:35:42,880 --> 00:35:45,520
They've clearly addressed many of the challenges facing

1088
00:35:45,520 --> 00:35:46,120
LLMs.

1089
00:35:46,120 --> 00:35:46,760
Yeah, I agree.

1090
00:35:46,760 --> 00:35:49,600
And their work is a major contribution to the field.

1091
00:35:49,600 --> 00:35:51,960
I'm excited to see how this research inspires further

1092
00:35:51,960 --> 00:35:54,920
innovation and leads to the development of even more

1093
00:35:54,920 --> 00:35:56,960
capable and helpful AI systems.

1094
00:35:56,960 --> 00:35:57,560
I agree.

1095
00:35:57,560 --> 00:36:00,440
What do you see as the key takeaways from this research?

1096
00:36:00,440 --> 00:36:03,320
What does it mean for the AI community and the world beyond?

1097
00:36:03,320 --> 00:36:04,960
Well, I think the most important takeaway

1098
00:36:04,960 --> 00:36:07,400
is that open-weight LLMs are catching up

1099
00:36:07,400 --> 00:36:09,440
to the performance of proprietary models.

1100
00:36:09,440 --> 00:36:09,960
OK.

1101
00:36:09,960 --> 00:36:13,080
The fact that Qwen2.5-72B-Instruct

1102
00:36:13,080 --> 00:36:18,000
can compete head-to-head with models like Llama-3.1-405B-Instruct

1103
00:36:18,000 --> 00:36:22,000
is a huge victory for open access and collaboration

1104
00:36:22,000 --> 00:36:23,000
in AI.

1105
00:36:23,000 --> 00:36:24,360
That's an important point.

1106
00:36:24,360 --> 00:36:24,840
It is.

1107
00:36:24,840 --> 00:36:28,280
Openness fosters innovation and accelerates progress.

1108
00:36:28,280 --> 00:36:29,000
Absolutely.

1109
00:36:29,000 --> 00:36:31,600
What other key takeaways did you notice?

1110
00:36:31,600 --> 00:36:35,760
I was also very impressed by the multilingual capabilities

1111
00:36:35,760 --> 00:36:37,400
of Qwen2.5.

1112
00:36:37,400 --> 00:36:38,360
Me too.

1113
00:36:38,360 --> 00:36:41,200
It seems they've really focused on training these models

1114
00:36:41,200 --> 00:36:43,320
to work well in many different languages.

1115
00:36:43,320 --> 00:36:44,680
It's like they're creating AI that

1116
00:36:44,680 --> 00:36:46,480
can speak to the whole world.

1117
00:36:46,480 --> 00:36:49,960
Yeah, and not just speak, but understand different cultures

1118
00:36:49,960 --> 00:36:50,840
and nuances.

1119
00:36:50,840 --> 00:36:51,480
Exactly.

1120
00:36:51,480 --> 00:36:54,080
It's a step towards breaking down language barriers

1121
00:36:54,080 --> 00:36:56,120
and fostering global communication.

1122
00:36:56,120 --> 00:36:57,240
It really is.

1123
00:36:57,240 --> 00:36:59,880
I'm curious to see how this kind of technology will impact

1124
00:36:59,880 --> 00:37:02,400
fields like translation, diplomacy,

1125
00:37:02,400 --> 00:37:04,560
and international collaboration.

1126
00:37:04,560 --> 00:37:06,040
It's exciting to think about.

1127
00:37:06,040 --> 00:37:09,040
I can imagine a future where people from different countries

1128
00:37:09,040 --> 00:37:11,240
can seamlessly communicate with each other,

1129
00:37:11,240 --> 00:37:13,000
regardless of their native language.

1130
00:37:13,000 --> 00:37:15,400
It's a beautiful vision, and I think this research is bringing

1131
00:37:15,400 --> 00:37:16,840
us closer to that reality.

1132
00:37:16,840 --> 00:37:17,680
I agree.

1133
00:37:17,680 --> 00:37:20,600
Their focus on efficiency is another crucial contribution.

1134
00:37:20,600 --> 00:37:21,280
Definitely.

1135
00:37:21,280 --> 00:37:24,280
They're showing that it's possible to achieve high performance

1136
00:37:24,280 --> 00:37:26,880
without relying on massive computing clusters.

1137
00:37:26,880 --> 00:37:27,380
Right.

1138
00:37:27,380 --> 00:37:32,520
This democratizes access to powerful AI tools,

1139
00:37:32,520 --> 00:37:35,240
making them available to a wider range of users

1140
00:37:35,240 --> 00:37:36,480
and applications.

1141
00:37:36,480 --> 00:37:38,960
And their work on multilingual and long context

1142
00:37:38,960 --> 00:37:42,280
capabilities is pushing the boundaries of what's

1143
00:37:42,280 --> 00:37:43,920
possible with language models.

1144
00:37:43,920 --> 00:37:44,560
Absolutely.

1145
00:37:44,560 --> 00:37:47,040
We're getting closer to AI that can truly understand

1146
00:37:47,040 --> 00:37:51,520
and communicate across linguistic and cultural barriers

1147
00:37:51,520 --> 00:37:53,600
and process information on a scale that

1148
00:37:53,600 --> 00:37:55,360
was previously unimaginable.

1149
00:37:55,360 --> 00:37:57,680
It's like we're breaking down the communication barriers

1150
00:37:57,680 --> 00:38:00,520
between humans and machines, opening up

1151
00:38:00,520 --> 00:38:02,320
a whole new world of possibilities.

1152
00:38:02,320 --> 00:38:02,840
I love that.

1153
00:38:02,840 --> 00:38:06,080
And let's not forget their emphasis on human alignment,

1154
00:38:06,080 --> 00:38:08,920
ensuring that these powerful models are used responsibly

1155
00:38:08,920 --> 00:38:09,760
and ethically.

1156
00:38:09,760 --> 00:38:11,360
The Qwen team has shown that it's

1157
00:38:11,360 --> 00:38:15,000
possible to create models that are both highly capable

1158
00:38:15,000 --> 00:38:16,760
and aligned with human values.

1159
00:38:16,760 --> 00:38:17,240
I agree.

1160
00:38:17,240 --> 00:38:20,000
That's essential for building a future where AI benefits

1161
00:38:20,000 --> 00:38:21,240
all of humanity.

1162
00:38:21,240 --> 00:38:21,740
I agree.

1163
00:38:21,740 --> 00:38:24,160
Their research reminds us that AI development is not just

1164
00:38:24,160 --> 00:38:25,840
about creating smarter machines.

1165
00:38:25,840 --> 00:38:26,160
Right.

1166
00:38:26,160 --> 00:38:28,240
It's about creating a better world

1167
00:38:28,240 --> 00:38:29,480
with the help of those machines.

1168
00:38:29,480 --> 00:38:29,980
Yeah.

1169
00:38:29,980 --> 00:38:32,400
It's about using this incredible technology

1170
00:38:32,400 --> 00:38:36,280
to solve real world problems and enhance human capabilities.

1171
00:38:36,280 --> 00:38:37,480
Well said.

1172
00:38:37,480 --> 00:38:40,120
The Qwen2.5 research is a shining

1173
00:38:40,120 --> 00:38:42,480
example of how technical innovation

1174
00:38:42,480 --> 00:38:45,720
and a deep understanding of the societal impact of AI

1175
00:38:45,720 --> 00:38:46,880
can go hand in hand.

1176
00:38:46,880 --> 00:38:49,480
OK, before we move on to the final part of our deep dive,

1177
00:38:49,480 --> 00:38:50,800
a question for you.

1178
00:38:50,800 --> 00:38:54,320
What's the most exciting aspect of this research?

1179
00:38:54,320 --> 00:38:57,440
What has you buzzing about the future of LLMs?

1180
00:38:57,440 --> 00:38:58,720
That's a tough one.

1181
00:38:58,720 --> 00:39:00,640
But if I had to choose, I'd say it's the work they've

1182
00:39:00,640 --> 00:39:02,040
done with MoE models.

1183
00:39:02,040 --> 00:39:04,320
Those MoE models, always stealing the show.

1184
00:39:04,320 --> 00:39:06,480
What specifically excites you about them?

1185
00:39:06,480 --> 00:39:08,000
I believe they have the potential

1186
00:39:08,000 --> 00:39:11,640
to unlock the next level of LLM capabilities.

1187
00:39:11,640 --> 00:39:12,140
OK.

1188
00:39:12,140 --> 00:39:15,880
They offer a way to scale performance without the cost.

1189
00:39:15,880 --> 00:39:16,320
Yeah.

1190
00:39:16,320 --> 00:39:18,760
Making these models more accessible and applicable

1191
00:39:18,760 --> 00:39:20,720
to a wider range of problems.

1192
00:39:20,720 --> 00:39:22,720
It's like they've found a way to break free

1193
00:39:22,720 --> 00:39:25,680
from the limitations of traditional model architectures.

1194
00:39:25,680 --> 00:39:28,240
It's like a paradigm shift in AI.

1195
00:39:28,240 --> 00:39:30,600
And their innovative techniques like fine-grained

1196
00:39:30,600 --> 00:39:33,280
expert segmentation and shared experts routing

1197
00:39:33,280 --> 00:39:35,720
show that there's still so much to explore

1198
00:39:35,720 --> 00:39:37,400
in this exciting new frontier.

1199
00:39:37,400 --> 00:39:39,240
It's a field ripe for innovation.

1200
00:39:39,240 --> 00:39:41,720
And I think we'll be talking about MoE models a lot more

1201
00:39:41,720 --> 00:39:42,600
in the future.

1202
00:39:42,600 --> 00:39:43,960
I'm eager to see what they do next.

1203
00:39:43,960 --> 00:39:45,200
Me too.

1204
00:39:45,200 --> 00:39:47,240
OK, are you ready to move on to the final part

1205
00:39:47,240 --> 00:39:48,640
of this epic deep dive?

1206
00:39:48,640 --> 00:39:49,280
Let's do it.

1207
00:39:49,280 --> 00:39:49,880
All right.

1208
00:39:49,880 --> 00:39:53,680
We're back, ready to wrap up our deep dive into the Qwen2.5

1209
00:39:53,680 --> 00:39:54,800
technical report.

1210
00:39:54,800 --> 00:39:57,000
This has been an incredible journey.

1211
00:39:57,000 --> 00:39:57,880
Yeah, it has.

1212
00:39:57,880 --> 00:40:00,840
And I feel like we've only scratched the surface of what

1213
00:40:00,840 --> 00:40:02,120
these models can do.

1214
00:40:02,120 --> 00:40:02,640
I agree.

1215
00:40:02,640 --> 00:40:04,440
We have covered a lot of ground.

1216
00:40:04,440 --> 00:40:07,200
But it's clear the Qwen team is pushing the boundaries

1217
00:40:07,200 --> 00:40:09,320
of what's possible with LLMs.

1218
00:40:09,320 --> 00:40:10,960
They've really set a new standard

1219
00:40:10,960 --> 00:40:13,960
for performance, efficiency, and accessibility.

1220
00:40:13,960 --> 00:40:14,600
They have.

1221
00:40:14,600 --> 00:40:17,800
I'm so curious to see what the future holds for this technology.

1222
00:40:17,800 --> 00:40:18,680
What are your thoughts?

1223
00:40:18,680 --> 00:40:20,920
Well, their commitment to open-weighting

1224
00:40:20,920 --> 00:40:24,520
so many of these models is a huge gift to the AI community.

1225
00:40:24,520 --> 00:40:24,920
It is.

1226
00:40:24,920 --> 00:40:26,360
This kind of open access is going

1227
00:40:26,360 --> 00:40:29,480
to accelerate progress in all sorts of unexpected ways.

1228
00:40:29,480 --> 00:40:32,120
It's like they've thrown open the doors to their AI lab.

1229
00:40:32,120 --> 00:40:32,640
Yeah.

1230
00:40:32,640 --> 00:40:35,120
And invited the whole world to come in and explore.

1231
00:40:35,120 --> 00:40:36,640
That's a beautiful thing.

1232
00:40:36,640 --> 00:40:39,400
This research really shows the power of collaboration.

1233
00:40:39,400 --> 00:40:39,800
I agree.

1234
00:40:39,800 --> 00:40:42,360
By making these tools available to everyone,

1235
00:40:42,360 --> 00:40:45,360
we're going to see a burst of creativity and innovation.

1236
00:40:45,360 --> 00:40:48,120
I can't wait to see what people create with these models.

1237
00:40:48,120 --> 00:40:53,640
Imagine students using Qwen2.5 to help them learn new languages.

1238
00:40:53,640 --> 00:40:57,120
Or researchers analyzing massive data sets to uncover

1239
00:40:57,120 --> 00:40:58,160
hidden patterns.

1240
00:40:58,160 --> 00:40:59,520
The possibilities are endless.

1241
00:40:59,520 --> 00:41:01,480
And I think we're just at the beginning

1242
00:41:01,480 --> 00:41:03,200
of this AI revolution.

1243
00:41:03,200 --> 00:41:03,720
I agree.

1244
00:41:03,720 --> 00:41:06,680
It's an exciting time to be alive.

1245
00:41:06,680 --> 00:41:09,600
But as we move forward with AI development,

1246
00:41:09,600 --> 00:41:12,480
we need to keep ethical considerations in mind.

1247
00:41:12,480 --> 00:41:13,080
Absolutely.

1248
00:41:13,080 --> 00:41:16,640
We want to ensure that these powerful tools are used for good

1249
00:41:16,640 --> 00:41:18,160
to benefit all of humanity.

1250
00:41:18,160 --> 00:41:18,760
Right.

1251
00:41:18,760 --> 00:41:20,640
The Qwen team's work is a great example

1252
00:41:20,640 --> 00:41:22,960
of how to combine technical brilliance

1253
00:41:22,960 --> 00:41:25,600
with a focus on responsible AI development.

1254
00:41:25,600 --> 00:41:26,160
Absolutely.

1255
00:41:26,160 --> 00:41:27,840
They're not just building smarter machines.

1256
00:41:27,840 --> 00:41:30,040
They're building a better world with AI.

1257
00:41:30,040 --> 00:41:31,200
Well said.

1258
00:41:31,200 --> 00:41:32,240
OK.

1259
00:41:32,240 --> 00:41:33,840
I think we've covered just about everything.

1260
00:41:33,840 --> 00:41:35,720
Any final thoughts before we sign off?

1261
00:41:35,720 --> 00:41:37,320
Just a huge thank you to you.

1262
00:41:37,320 --> 00:41:37,840
Oh, you're welcome.

1263
00:41:37,840 --> 00:41:39,920
For taking us on this incredible journey,

1264
00:41:39,920 --> 00:41:42,120
your insights have been so valuable.

1265
00:41:42,120 --> 00:41:42,680
Well, thank you.

1266
00:41:42,680 --> 00:41:44,360
It was my pleasure talking with you about all this.

1267
00:41:44,360 --> 00:41:44,960
Likewise.

1268
00:41:44,960 --> 00:41:47,680
And to our listeners, what stood out most to you

1269
00:41:47,680 --> 00:41:49,440
in this deep dive?

1270
00:41:49,440 --> 00:41:51,920
The million token context length.

1271
00:41:51,920 --> 00:41:53,600
The amazing MoE models.

1272
00:41:53,600 --> 00:41:56,400
Or was it something else that really blew your mind?

1273
00:41:56,400 --> 00:41:57,040
Let us know.

1274
00:41:57,040 --> 00:41:58,520
Yeah, we'd love to hear from you.

1275
00:41:58,520 --> 00:42:00,520
Keep those AI papers coming, and we'll

1276
00:42:00,520 --> 00:42:02,920
be back to break down all the exciting new developments.

1277
00:42:02,920 --> 00:42:06,000
Until next time, stay curious and keep exploring

1278
00:42:06,000 --> 00:42:08,120
the amazing world of AI.

1279
00:42:08,120 --> 00:42:22,880
See you on the next episode of AI Papers podcast daily.