1
00:00:00,000 --> 00:00:03,400
Hey everyone, welcome back, ready for another deep dive.

2
00:00:03,400 --> 00:00:07,280
Today we're looking at a paper that's trying to,

3
00:00:07,280 --> 00:00:09,520
well, it's basically trying to figure out

4
00:00:09,520 --> 00:00:12,080
how to make AI better at thinking.

5
00:00:12,080 --> 00:00:12,920
Oh yeah.

6
00:00:12,920 --> 00:00:14,440
Like really thinking things through.

7
00:00:14,440 --> 00:00:15,400
Yeah, that's a big one.

8
00:00:15,400 --> 00:00:19,000
Especially for, you know, like those really complex problems.

9
00:00:19,000 --> 00:00:19,840
Absolutely.

10
00:00:19,840 --> 00:00:21,840
The kind of stuff that you need for advanced math

11
00:00:21,840 --> 00:00:24,200
or those tough STEM problems.

12
00:00:24,200 --> 00:00:26,040
So this paper is called,

13
00:00:26,040 --> 00:00:30,600
Enhancing LLM Reasoning with Reward-guided Tree Search.

14
00:00:30,600 --> 00:00:32,560
And it's kind of trying to crack the code

15
00:00:32,560 --> 00:00:36,960
of how to get large language models to be better reasoners.

16
00:00:36,960 --> 00:00:39,000
Right, because they're good at so many things,

17
00:00:39,000 --> 00:00:40,720
but this is still a big challenge.

18
00:00:40,720 --> 00:00:41,840
And especially, you know,

19
00:00:41,840 --> 00:00:43,280
OpenAI has been kind of secretive

20
00:00:43,280 --> 00:00:46,040
about how they made their model o1 so smart.

21
00:00:46,040 --> 00:00:47,640
They haven't exactly given us the recipe.

22
00:00:47,640 --> 00:00:48,560
Exactly.

23
00:00:48,560 --> 00:00:49,400
So this paper's like,

24
00:00:49,400 --> 00:00:50,920
all right, we're gonna try to figure this out.

25
00:00:50,920 --> 00:00:52,600
So whether you're an AI expert

26
00:00:52,600 --> 00:00:54,640
or just curious about this stuff,

27
00:00:54,640 --> 00:00:56,360
we're gonna break this down in a way that makes sense.

28
00:00:56,360 --> 00:00:58,480
Yeah, we'll make it clear and fun, I promise.

29
00:00:58,480 --> 00:01:00,600
So you're right, LLMs have come a long way,

30
00:01:00,600 --> 00:01:04,360
but when you start talking about multi-step reasoning,

31
00:01:04,360 --> 00:01:08,320
you know, like if you think about solving a geometry proof

32
00:01:08,320 --> 00:01:10,880
or figuring out a complex physics problem,

33
00:01:10,880 --> 00:01:12,600
it's, that's a different ball game

34
00:01:12,600 --> 00:01:14,240
than, you know, writing a poem or something.

35
00:01:14,240 --> 00:01:15,960
Absolutely, it's a different kind of thinking, right?

36
00:01:15,960 --> 00:01:18,560
Like it's not just about being creative

37
00:01:18,560 --> 00:01:20,280
or fluent with language.

38
00:01:20,280 --> 00:01:22,680
It's about logic and, you know,

39
00:01:22,680 --> 00:01:24,000
being able to think step by step.

40
00:01:24,000 --> 00:01:25,080
Right, exactly.

41
00:01:25,080 --> 00:01:26,680
So this paper is basically saying,

42
00:01:26,680 --> 00:01:29,800
what if instead of just making these models bigger,

43
00:01:29,800 --> 00:01:32,080
we make them think more during a task?

44
00:01:32,080 --> 00:01:34,880
Yeah, like give them a little more time to process.

45
00:01:34,880 --> 00:01:36,960
Yeah, like how we humans might, you know,

46
00:01:36,960 --> 00:01:38,600
pause and jot down some notes

47
00:01:38,600 --> 00:01:41,040
when we're faced with a tough problem.

48
00:01:41,040 --> 00:01:43,800
That's where this whole idea of test-time scaling comes in.

49
00:01:43,800 --> 00:01:46,000
Exactly, and the paper presents a framework

50
00:01:46,000 --> 00:01:47,080
for doing just that.

51
00:01:47,080 --> 00:01:49,520
And it's got three main parts that work together,

52
00:01:49,520 --> 00:01:51,600
almost like a team tackling a puzzle, right?

53
00:01:51,600 --> 00:01:52,720
Oh, I like that.

54
00:01:52,720 --> 00:01:54,760
So first, you've got the policy model.

55
00:01:54,760 --> 00:01:56,200
Think of that as our puzzle solver,

56
00:01:56,200 --> 00:01:57,680
trying out different pieces.

57
00:01:57,680 --> 00:01:58,520
Okay, got it.

58
00:01:58,520 --> 00:01:59,800
The one trying to find the solution.

59
00:01:59,800 --> 00:02:02,360
Right, then there's the reward model.

60
00:02:02,360 --> 00:02:03,880
This one acts like the coach,

61
00:02:03,880 --> 00:02:06,360
giving feedback on which moves look promising.

62
00:02:06,360 --> 00:02:08,880
So it's like, hey, good job, that piece fits there,

63
00:02:08,880 --> 00:02:11,160
or uh-oh, that's not quite right.

64
00:02:11,160 --> 00:02:15,320
Exactly, and finally, you've got the search algorithm.

65
00:02:15,320 --> 00:02:16,880
This is our strategy guide,

66
00:02:16,880 --> 00:02:20,400
making sure the policy model explores the puzzle efficiently.

67
00:02:20,400 --> 00:02:22,240
So it doesn't waste time on dead ends, right?

68
00:02:22,240 --> 00:02:23,080
Exactly.

69
00:02:23,080 --> 00:02:24,000
Makes sense.

70
00:02:24,000 --> 00:02:27,080
Okay, so let's zoom in on that policy model for a second.

71
00:02:27,080 --> 00:02:31,920
How do they actually train it to be a better reasoner?

72
00:02:31,920 --> 00:02:34,000
What I found really fascinating was this idea

73
00:02:34,000 --> 00:02:37,080
of using a bigger, more powerful LLM

74
00:02:37,080 --> 00:02:38,680
to create example solutions.

75
00:02:38,680 --> 00:02:39,800
Yeah, it's a clever trick.

76
00:02:39,800 --> 00:02:41,880
And it's all about showing the policy model

77
00:02:41,880 --> 00:02:44,160
the right format for reasoning.

78
00:02:44,160 --> 00:02:46,600
Ah, so it's not just learning from any old data,

79
00:02:46,600 --> 00:02:47,600
it's learning from the best.

80
00:02:47,600 --> 00:02:49,680
Precisely, it's like learning from a master chef.

81
00:02:49,680 --> 00:02:51,160
Instead of just watching them cook,

82
00:02:51,160 --> 00:02:52,920
you get to see their detailed recipe notes,

83
00:02:52,920 --> 00:02:55,440
every step laid out, all the little tips and tricks.

84
00:02:55,440 --> 00:02:56,280
Okay, I get it.

85
00:02:56,280 --> 00:02:58,880
So the policy model is basically learning by example

86
00:02:58,880 --> 00:03:00,280
by seeing how the pros do it.

87
00:03:00,280 --> 00:03:03,000
Exactly, and to make it even more interesting,

88
00:03:03,000 --> 00:03:05,760
they use something called preference optimization.

89
00:03:05,760 --> 00:03:07,640
Ooh, that sounds fancy.

90
00:03:07,640 --> 00:03:11,000
It is, so the policy model actually

91
00:03:11,000 --> 00:03:13,880
generates multiple potential solutions,

92
00:03:13,880 --> 00:03:17,120
and then the reward model steps in and judges them,

93
00:03:17,120 --> 00:03:20,440
like which one seemed the most likely to be correct?

94
00:03:20,440 --> 00:03:22,560
Ah, so the reward model's like,

95
00:03:22,560 --> 00:03:25,720
okay, solution A is better than solution B, because...

96
00:03:25,720 --> 00:03:27,760
Exactly, it's a constant back and forth.

97
00:03:27,760 --> 00:03:30,680
So it's like the policy model's always getting feedback

98
00:03:30,680 --> 00:03:32,320
and refining its approach.

99
00:03:32,320 --> 00:03:34,880
Exactly, it's always learning and improving

100
00:03:34,880 --> 00:03:36,960
based on what the reward model prefers.

101
00:03:36,960 --> 00:03:38,200
Okay, that's really cool.

102
00:03:38,200 --> 00:03:41,000
So we've got our problem solver getting trained by a coach,

103
00:03:41,000 --> 00:03:42,560
but what about that search algorithm?

104
00:03:42,560 --> 00:03:43,480
How does that fit in?

105
00:03:43,480 --> 00:03:45,800
Well, that's where things get even more strategic,

106
00:03:45,800 --> 00:03:49,480
because it's not just about trying every possible solution,

107
00:03:49,480 --> 00:03:53,440
it's about exploring the solution space in a smart way.

108
00:03:53,440 --> 00:03:55,760
Right, like a good chess player thinks several moves ahead.

109
00:03:55,760 --> 00:03:58,160
Exactly, and that's where these search algorithms

110
00:03:58,160 --> 00:03:59,280
come into play.

111
00:03:59,280 --> 00:04:01,200
Interesting, so they're not just randomly trying things,

112
00:04:01,200 --> 00:04:02,800
they're being strategic about it.

113
00:04:02,800 --> 00:04:05,680
Right, so they tested out a couple of different

114
00:04:05,680 --> 00:04:07,760
search algorithms to see which worked best

115
00:04:07,760 --> 00:04:09,720
for these reasoning tasks.

116
00:04:09,720 --> 00:04:12,200
The first one they tried is called MCTS,

117
00:04:12,200 --> 00:04:14,760
which stands for Monte Carlo Tree Search.

118
00:04:14,760 --> 00:04:16,560
It's a popular algorithm for problems

119
00:04:16,560 --> 00:04:20,600
with lots of possibilities, like in games like chess or Go.

120
00:04:20,600 --> 00:04:22,640
So it's like a strategic planner looking ahead

121
00:04:22,640 --> 00:04:24,480
to see the consequences of different moves.

122
00:04:24,480 --> 00:04:27,280
Exactly, and then they tested out a modified version

123
00:04:27,280 --> 00:04:29,760
of MCTS called MCTSG.

124
00:04:29,760 --> 00:04:31,240
This one is a bit more adventurous,

125
00:04:31,240 --> 00:04:33,160
considering all the potential next steps

126
00:04:33,160 --> 00:04:35,000
before picking the most promising one.

127
00:04:35,000 --> 00:04:37,120
Okay, so it's like, let's look at all the options

128
00:04:37,120 --> 00:04:38,200
and then pick the best one.

129
00:04:38,200 --> 00:04:40,840
Precisely, and what's really interesting is that

130
00:04:40,840 --> 00:04:44,560
for these math problems, the simpler MCTSG

131
00:04:44,560 --> 00:04:48,000
actually outperformed the standard MCTS.

132
00:04:48,000 --> 00:04:50,960
Oh wow, so sometimes a more straightforward approach

133
00:04:50,960 --> 00:04:52,360
can be surprisingly effective.

134
00:04:52,360 --> 00:04:54,480
That's right, it suggests that the way humans

135
00:04:54,480 --> 00:04:56,920
intuitively break down complex tasks

136
00:04:56,920 --> 00:04:59,680
might be more aligned with this simpler search strategy.

137
00:04:59,680 --> 00:05:01,360
Hmm, that's fascinating.

138
00:05:01,360 --> 00:05:02,720
So we've got this whole framework,

139
00:05:02,720 --> 00:05:03,960
but does it actually work?

140
00:05:03,960 --> 00:05:05,680
Well, to find out, they tested it

141
00:05:05,680 --> 00:05:07,600
on four different math data sets,

142
00:05:07,600 --> 00:05:09,560
and we're talking challenging stuff here.

143
00:05:09,560 --> 00:05:12,880
Problems from math Olympiads, college-level courses,

144
00:05:12,880 --> 00:05:15,240
the kind that would make most people sweat a little.

145
00:05:15,240 --> 00:05:18,160
Okay, so no easy questions here.

146
00:05:18,160 --> 00:05:20,720
Nope, and the results were pretty promising.

147
00:05:20,720 --> 00:05:23,760
Not only did their framework beat a basic LLM

148
00:05:23,760 --> 00:05:26,960
on all four data sets, but it also beat approaches

149
00:05:26,960 --> 00:05:30,080
that rely on generating tons of random guesses.

150
00:05:30,080 --> 00:05:31,760
So it's not just about brute force,

151
00:05:31,760 --> 00:05:34,080
it's about being smart and strategic.

152
00:05:34,080 --> 00:05:36,520
Exactly, and that's one of the key takeaways

153
00:05:36,520 --> 00:05:37,560
from this research.

154
00:05:37,560 --> 00:05:39,920
Okay, that's really cool, but I'm curious about

155
00:05:39,920 --> 00:05:42,160
how they actually trained the policy model

156
00:05:42,160 --> 00:05:43,720
to think step by step.

157
00:05:43,720 --> 00:05:46,240
Yeah, how did they teach it to approach these problems

158
00:05:46,240 --> 00:05:48,240
in a logical, structured way?

159
00:05:48,240 --> 00:05:50,000
That's where things get really interesting.

160
00:05:50,000 --> 00:05:52,760
Well, remember we talked about using a bigger LLM

161
00:05:52,760 --> 00:05:55,240
to generate those example solutions?

162
00:05:55,240 --> 00:05:57,360
Yeah, like the master chef's recipe notes?

163
00:05:57,360 --> 00:05:59,200
Right, those examples weren't just about

164
00:05:59,200 --> 00:06:01,120
giving the final answer, they were about

165
00:06:01,120 --> 00:06:03,280
showing the entire thought process, you know?

166
00:06:03,280 --> 00:06:05,760
Oh, so like not just, here's the cake,

167
00:06:05,760 --> 00:06:08,280
but here's how you bake the cake step by step.

168
00:06:08,280 --> 00:06:10,640
Exactly, it's like having a math tutor

169
00:06:10,640 --> 00:06:15,400
who shows you their work, every step labeled and explained.

170
00:06:15,400 --> 00:06:18,200
I see, so the policy model is learning by example

171
00:06:18,200 --> 00:06:20,800
by seeing how the pros break down the problem.

172
00:06:20,800 --> 00:06:24,280
Precisely, and to make sure it really gets that structure,

173
00:06:24,280 --> 00:06:26,880
they trained it on this data set called NuminaMath.

174
00:06:26,880 --> 00:06:28,680
It's full of math problems solved

175
00:06:28,680 --> 00:06:31,240
in that detailed step-by-step format.

176
00:06:31,240 --> 00:06:33,000
Okay, so lots of good examples to learn from.

177
00:06:33,000 --> 00:06:36,160
And they also did this cool thing called instruction tuning.

178
00:06:36,160 --> 00:06:37,840
Instruction tuning, what's that?

179
00:06:37,840 --> 00:06:40,080
So basically they gave the policy model

180
00:06:40,080 --> 00:06:42,360
clear instructions on how to approach the problems,

181
00:06:42,360 --> 00:06:45,960
like analyze the question, rephrase it in your own words,

182
00:06:45,960 --> 00:06:49,200
and then break down the solution into labeled steps.

183
00:06:49,200 --> 00:06:50,840
Ah, so it's like giving it a checklist

184
00:06:50,840 --> 00:06:52,080
for good problem solving.

185
00:06:52,080 --> 00:06:54,600
Exactly, it's not just learning by example,

186
00:06:54,600 --> 00:06:56,560
but also getting explicit guidance

187
00:06:56,560 --> 00:06:58,040
on how to think things through.

188
00:06:58,040 --> 00:06:59,480
Got it, so they're really trying to make sure

189
00:06:59,480 --> 00:07:00,960
it learns the right approach,

190
00:07:00,960 --> 00:07:03,160
but this all still depends on that reward model

191
00:07:03,160 --> 00:07:04,920
being able to judge the solutions, right?

192
00:07:04,920 --> 00:07:06,960
Oh, absolutely, the reward model is key.

193
00:07:06,960 --> 00:07:08,680
So what were some of the things they tried

194
00:07:08,680 --> 00:07:10,840
with the reward model's training?

195
00:07:10,840 --> 00:07:13,560
Well, they experimented with a bunch of different approaches

196
00:07:13,560 --> 00:07:14,760
to see what worked best.

197
00:07:14,760 --> 00:07:16,800
Like they tried simple scoring,

198
00:07:16,800 --> 00:07:19,520
you know, so thumbs up or thumbs down.

199
00:07:19,520 --> 00:07:22,160
But they also tried more detailed evaluations

200
00:07:22,160 --> 00:07:25,520
where the reward model had to actually explain its reasoning.

201
00:07:25,520 --> 00:07:27,560
Oh, interesting, so kind of like a teacher

202
00:07:27,560 --> 00:07:29,960
who might just circle a wrong answer,

203
00:07:29,960 --> 00:07:33,280
versus one who writes comments explaining why it's wrong

204
00:07:33,280 --> 00:07:34,200
and how to improve.

205
00:07:34,200 --> 00:07:37,560
Exactly, and they found that this more elaborate approach,

206
00:07:37,560 --> 00:07:41,000
what they call generative, actually worked better.

207
00:07:41,000 --> 00:07:44,240
So like forcing the model to explain itself

208
00:07:44,240 --> 00:07:45,880
made it a better judge?

209
00:07:45,880 --> 00:07:48,920
It seems that way, like if you can't explain it clearly,

210
00:07:48,920 --> 00:07:50,960
you probably don't understand it well enough yourself.

211
00:07:50,960 --> 00:07:52,040
Right, makes sense.

212
00:07:52,040 --> 00:07:54,760
But all this training needs data, tons of it.

213
00:07:54,760 --> 00:07:56,360
How did they make sure the reward model

214
00:07:56,360 --> 00:07:58,640
was learning from the right kind of information?

215
00:07:58,640 --> 00:08:01,240
They were very deliberate about the data they used.

216
00:08:01,240 --> 00:08:03,080
They focused on creating a data set

217
00:08:03,080 --> 00:08:05,920
with what they call outcome-level supervision signals.

218
00:08:05,920 --> 00:08:10,840
So basically a clear answer key for each problem.

219
00:08:10,840 --> 00:08:13,680
Exactly, so the reward model knows for sure

220
00:08:13,680 --> 00:08:15,760
when a solution is right or wrong.

221
00:08:15,760 --> 00:08:17,800
Got it, no ambiguity there.

222
00:08:17,800 --> 00:08:20,680
But did they just feed it a bunch of easy problems?

223
00:08:20,680 --> 00:08:22,160
Or did they try to challenge it?

224
00:08:22,160 --> 00:08:24,600
They actually used something called active learning.

225
00:08:24,600 --> 00:08:25,520
Active learning.

226
00:08:25,520 --> 00:08:28,880
Yeah, they strategically chose the most informative

227
00:08:28,880 --> 00:08:32,480
and challenging examples for the reward model to learn from.

228
00:08:32,480 --> 00:08:33,320
Oh, that's smart.

229
00:08:33,320 --> 00:08:36,080
Like a teacher who knows which problems will really

230
00:08:36,080 --> 00:08:37,560
make their students think.

231
00:08:37,560 --> 00:08:38,320
Precisely.

232
00:08:38,320 --> 00:08:42,160
And they also filtered out any redundant or biased examples

233
00:08:42,160 --> 00:08:45,640
to prevent the reward model from developing bad habits.

234
00:08:45,640 --> 00:08:49,120
So they're making sure it gets a balanced diet of math

235
00:08:49,120 --> 00:08:50,400
problems, so to speak.

236
00:08:50,400 --> 00:08:51,480
That's pretty cool.

237
00:08:51,480 --> 00:08:54,800
But it makes me wonder, how much does the reward model's

238
00:08:54,800 --> 00:08:56,200
performance really matter?

239
00:08:56,200 --> 00:08:58,960
I mean, it's not the one actually solving the problems, right?

240
00:08:58,960 --> 00:09:00,080
That's a great question.

241
00:09:00,080 --> 00:09:02,120
And it turns out it matters a lot.

242
00:09:02,120 --> 00:09:04,880
They did some tests to see how well the reward model could

243
00:09:04,880 --> 00:09:07,320
judge individual reasoning steps.

244
00:09:07,320 --> 00:09:10,640
Not just the final answer, but each step along the way.

245
00:09:10,640 --> 00:09:11,720
Oh, wow.

246
00:09:11,720 --> 00:09:13,360
So they were really putting it to the test.

247
00:09:13,360 --> 00:09:13,760
They were.

248
00:09:13,760 --> 00:09:16,040
And what they found was pretty impressive.

249
00:09:16,040 --> 00:09:17,760
Even though the reward model was mainly

250
00:09:17,760 --> 00:09:20,760
trained on whether the final answer was right or wrong,

251
00:09:20,760 --> 00:09:22,600
it still got really good at assessing

252
00:09:22,600 --> 00:09:24,680
the quality of those individual steps.

253
00:09:24,680 --> 00:09:25,200
That's amazing.

254
00:09:25,200 --> 00:09:27,920
It's like a music teacher who can not only tell

255
00:09:27,920 --> 00:09:30,080
if you played the wrong note, but can also

256
00:09:30,080 --> 00:09:31,680
critique your technique and phrasing.

257
00:09:31,680 --> 00:09:32,880
Exactly.

258
00:09:32,880 --> 00:09:34,480
It suggests that the reward model

259
00:09:34,480 --> 00:09:36,600
has developed a deeper understanding of what

260
00:09:36,600 --> 00:09:39,160
good reasoning looks like, even beyond just

261
00:09:39,160 --> 00:09:40,720
getting the right answer.

262
00:09:40,720 --> 00:09:41,480
That's really cool.

263
00:09:41,480 --> 00:09:42,880
So it's not just about the outcome.

264
00:09:42,880 --> 00:09:44,240
It's about the process, too.

265
00:09:44,240 --> 00:09:45,160
Precisely.

266
00:09:45,160 --> 00:09:47,560
And that opens up some really exciting possibilities

267
00:09:47,560 --> 00:09:49,080
for the future of AI.

268
00:09:49,080 --> 00:09:52,080
OK, so we've talked about this framework, this policy model,

269
00:09:52,080 --> 00:09:54,760
this reward model, these search algorithms.

270
00:09:54,760 --> 00:09:57,280
But where does all this leave us?

271
00:09:57,280 --> 00:09:59,960
What's the big takeaway from this research?

272
00:09:59,960 --> 00:10:02,720
I think the main message is that we're making real progress

273
00:10:02,720 --> 00:10:06,360
towards building LLMs that can reason more like humans.

274
00:10:06,360 --> 00:10:08,880
This paper gives us a solid framework for doing that,

275
00:10:08,880 --> 00:10:11,240
and it shows us some of the key ingredients we need,

276
00:10:11,240 --> 00:10:13,160
things like structured training data,

277
00:10:13,160 --> 00:10:15,640
thoughtful reward modeling, and those strategic search

278
00:10:15,640 --> 00:10:16,480
algorithms.

279
00:10:16,480 --> 00:10:19,080
And while there's still a lot of work to be done,

280
00:10:19,080 --> 00:10:20,800
the results are pretty encouraging.

281
00:10:20,800 --> 00:10:22,240
Absolutely.

282
00:10:22,240 --> 00:10:25,320
We might not have AI that can solve all our problems just

283
00:10:25,320 --> 00:10:28,880
yet, but this research definitely brings us one step

284
00:10:28,880 --> 00:10:30,080
closer to that goal.

285
00:10:30,080 --> 00:10:31,320
That's for sure.

286
00:10:31,320 --> 00:10:33,360
And it raises some really interesting questions

287
00:10:33,360 --> 00:10:34,720
about the future.

288
00:10:34,720 --> 00:10:38,720
Like, if we can create LLMs that are this good at reasoning,

289
00:10:38,720 --> 00:10:40,280
what are the possibilities?

290
00:10:40,280 --> 00:10:43,640
Yeah, what could this mean for fields like medicine,

291
00:10:43,640 --> 00:10:46,040
engineering, or even art?

292
00:10:46,040 --> 00:10:48,720
Those are definitely questions worth pondering.

293
00:10:48,720 --> 00:10:52,640
And on that note, we'll wrap up today's deep dive.

294
00:10:52,640 --> 00:10:54,120
Thanks for joining us on this journey

295
00:10:54,120 --> 00:10:56,800
into the fascinating world of AI reasoning.

296
00:10:56,800 --> 00:10:59,320
Until next time, keep those minds curious

297
00:10:59,320 --> 00:11:00,920
and those algorithms humming.

298
00:11:00,920 --> 00:11:03,200
It's all about how they train that policy model

299
00:11:03,200 --> 00:11:05,680
to think step by step, like a human would.

300
00:11:05,680 --> 00:11:06,200
Right.

301
00:11:06,200 --> 00:11:08,000
Like we were talking about with those example solutions

302
00:11:08,000 --> 00:11:08,960
from the bigger LLM.

303
00:11:08,960 --> 00:11:09,840
Yeah, exactly.

304
00:11:09,840 --> 00:11:11,920
Like those examples weren't just about the answer.

305
00:11:11,920 --> 00:11:14,280
They were about showing the whole process.

306
00:11:14,280 --> 00:11:14,760
Right.

307
00:11:14,760 --> 00:11:16,680
So it wasn't just, here's the answer.

308
00:11:16,680 --> 00:11:18,760
It was more like, here's how we got to the answer.

309
00:11:18,760 --> 00:11:19,680
Precisely.

310
00:11:19,680 --> 00:11:22,320
Think of it like having a math tutor who doesn't just

311
00:11:22,320 --> 00:11:25,440
give you the answer, but shows you their work every step,

312
00:11:25,440 --> 00:11:26,600
clearly laid out.

313
00:11:26,600 --> 00:11:27,480
Ah, OK.

314
00:11:27,480 --> 00:11:30,400
So it's like seeing the logic behind each step,

315
00:11:30,400 --> 00:11:31,280
the reasoning process.

316
00:11:31,280 --> 00:11:32,400
Exactly.

317
00:11:32,400 --> 00:11:35,520
And to really drill that in, they trained the policy model

318
00:11:35,520 --> 00:11:37,800
on this data set called NuminaMath.

319
00:11:37,800 --> 00:11:41,480
It's full of problems solved in this detailed step by step

320
00:11:41,480 --> 00:11:41,960
format.

321
00:11:41,960 --> 00:11:44,080
So it's got lots of good examples to learn from, to see

322
00:11:44,080 --> 00:11:45,280
how those steps fit together.

323
00:11:45,280 --> 00:11:45,640
Right.

324
00:11:45,640 --> 00:11:48,080
And they also used something called instruction tuning.

325
00:11:48,080 --> 00:11:50,560
They actually gave the model clear instructions

326
00:11:50,560 --> 00:11:52,320
on how to approach the problems.

327
00:11:52,320 --> 00:11:53,840
Instruction tuning.

328
00:11:53,840 --> 00:11:55,440
So it's not just learning by example.

329
00:11:55,440 --> 00:11:57,680
It's getting explicit instructions too.

330
00:11:57,680 --> 00:11:58,200
Exactly.

331
00:11:58,200 --> 00:12:01,360
It's like, OK, first analyze the question,

332
00:12:01,360 --> 00:12:03,240
then rephrase it in your own words,

333
00:12:03,240 --> 00:12:05,920
and then break down the solution into labeled steps.

334
00:12:05,920 --> 00:12:08,120
So it's like a checklist, a set of guidelines

335
00:12:08,120 --> 00:12:09,800
for effective problem solving.

336
00:12:09,800 --> 00:12:10,320
Right.

337
00:12:10,320 --> 00:12:12,840
And that helps the model learn the right approach,

338
00:12:12,840 --> 00:12:14,520
the step-by-step thinking.

339
00:12:14,520 --> 00:12:15,440
Makes sense.

340
00:12:15,440 --> 00:12:17,760
But we've talked a lot about the policy model.

341
00:12:17,760 --> 00:12:19,240
What about that reward model?

342
00:12:19,240 --> 00:12:21,880
How did they make sure it was judging the solutions accurately?

343
00:12:21,880 --> 00:12:23,720
Well, they experimented with different ways

344
00:12:23,720 --> 00:12:25,960
to train it to see what worked best.

345
00:12:25,960 --> 00:12:29,520
They tried simple scoring, like a thumbs up or thumbs down.

346
00:12:29,520 --> 00:12:32,720
But they also tried more complex evaluations.

347
00:12:32,720 --> 00:12:33,200
Oh.

348
00:12:33,200 --> 00:12:37,080
So instead of just saying good or bad,

349
00:12:37,080 --> 00:12:39,960
the reward model had to explain why it thought

350
00:12:39,960 --> 00:12:41,400
a solution was good or bad.

351
00:12:41,400 --> 00:12:42,520
Exactly.

352
00:12:42,520 --> 00:12:44,920
And they found that this more elaborate approach,

353
00:12:44,920 --> 00:12:47,720
where the model had to justify its judgments,

354
00:12:47,720 --> 00:12:50,200
actually led to better performance.

355
00:12:50,200 --> 00:12:50,840
Interesting.

356
00:12:50,840 --> 00:12:53,320
So forcing the model to explain itself

357
00:12:53,320 --> 00:12:54,760
actually made it a better judge.

358
00:12:54,760 --> 00:12:55,640
It seems that way.

359
00:12:55,640 --> 00:12:57,080
Like if you can't explain it clearly,

360
00:12:57,080 --> 00:12:58,920
you probably don't understand it that well yourself.

361
00:12:58,920 --> 00:12:59,240
Right.

362
00:12:59,240 --> 00:13:00,280
Makes sense.

363
00:13:00,280 --> 00:13:02,560
But all this training needs data, right?

364
00:13:02,560 --> 00:13:02,920
Yeah.

365
00:13:02,920 --> 00:13:04,880
How did they make sure the reward model was

366
00:13:04,880 --> 00:13:06,640
getting the right kind of information?

367
00:13:06,640 --> 00:13:09,320
They were really careful about the data they used,

368
00:13:09,320 --> 00:13:12,520
focusing on creating a data set with clear outcome

369
00:13:12,520 --> 00:13:14,000
level supervision signals.

370
00:13:14,000 --> 00:13:14,720
Outcome level.

371
00:13:14,720 --> 00:13:17,440
So basically a clear answer key for each problem.

372
00:13:17,440 --> 00:13:20,600
So the model knows for sure if a solution is right or wrong.

373
00:13:20,600 --> 00:13:21,480
Exactly.

374
00:13:21,480 --> 00:13:22,520
No ambiguity there.

375
00:13:22,520 --> 00:13:24,240
The reward model knows exactly when

376
00:13:24,240 --> 00:13:26,040
it's looking at a correct solution.

377
00:13:26,040 --> 00:13:26,360
Got it.

378
00:13:26,360 --> 00:13:29,200
So it's got a solid foundation to learn from.

379
00:13:29,200 --> 00:13:31,200
But did they just give it easy problems?

380
00:13:31,200 --> 00:13:31,800
No.

381
00:13:31,800 --> 00:13:34,840
They actually used a technique called active learning

382
00:13:34,840 --> 00:13:38,240
to strategically pick the most informative and challenging

383
00:13:38,240 --> 00:13:40,520
examples for the reward model.

384
00:13:40,520 --> 00:13:42,800
So they were making sure it got a good mix,

385
00:13:42,800 --> 00:13:44,000
not just the easy stuff.

386
00:13:44,000 --> 00:13:44,720
Right.

387
00:13:44,720 --> 00:13:46,440
And they also made sure to filter out

388
00:13:46,440 --> 00:13:48,640
any redundant or biased examples.

389
00:13:48,640 --> 00:13:50,840
So they were really being careful about the data,

390
00:13:50,840 --> 00:13:52,640
making sure it was balanced and helpful.

391
00:13:52,640 --> 00:13:53,240
Exactly.

392
00:13:53,240 --> 00:13:55,120
They wanted to prevent the reward model

393
00:13:55,120 --> 00:13:57,960
from developing bad habits or taking shortcuts.

394
00:13:57,960 --> 00:13:58,960
Smart.

395
00:13:58,960 --> 00:14:02,120
So they've put a lot of work into this reward model.

396
00:14:02,120 --> 00:14:04,680
But how much does its performance really matter?

397
00:14:04,680 --> 00:14:07,200
I mean, it's not directly solving the problems, right?

398
00:14:07,200 --> 00:14:08,120
That's a great question.

399
00:14:08,120 --> 00:14:10,680
And it turns out it matters a lot.

400
00:14:10,680 --> 00:14:13,920
They did some tests to see how well the reward model could

401
00:14:13,920 --> 00:14:16,880
judge individual steps in the reasoning process.

402
00:14:16,880 --> 00:14:17,680
Oh, wow.

403
00:14:17,680 --> 00:14:20,600
So not just the final answer, but each step along the way.

404
00:14:20,600 --> 00:14:21,240
Exactly.

405
00:14:21,240 --> 00:14:23,480
And what they found was pretty amazing.

406
00:14:23,480 --> 00:14:26,560
Even though the reward model was trained on the final answer,

407
00:14:26,560 --> 00:14:29,240
it could still assess the quality of those individual steps

408
00:14:29,240 --> 00:14:29,960
really well.

409
00:14:29,960 --> 00:14:31,200
That's really impressive.

410
00:14:31,200 --> 00:14:33,400
So it's like it developed a deeper understanding

411
00:14:33,400 --> 00:14:36,160
of good reasoning, not just getting the right answer.

412
00:14:36,160 --> 00:14:37,120
Precisely.

413
00:14:37,120 --> 00:14:38,560
And that's really exciting because it

414
00:14:38,560 --> 00:14:42,160
means we could potentially build LLMs that can not only

415
00:14:42,160 --> 00:14:45,640
solve problems, but also explain their thinking in a way

416
00:14:45,640 --> 00:14:46,600
that we can understand.

417
00:14:46,600 --> 00:14:48,320
That would be incredible.

418
00:14:48,320 --> 00:14:51,400
So after this deep dive, what's the main takeaway for our listeners?

419
00:14:51,400 --> 00:14:52,720
Well, I think the biggest takeaway

420
00:14:52,720 --> 00:14:55,960
is that we're getting closer to building LLMs that reason more

421
00:14:55,960 --> 00:14:57,440
like humans.

422
00:14:57,440 --> 00:15:00,840
This research gives us a framework, some key ingredients,

423
00:15:00,840 --> 00:15:02,800
and some really promising results.

424
00:15:02,800 --> 00:15:05,640
It's definitely encouraging to see this progress.

425
00:15:05,640 --> 00:15:07,280
It seems like we're on the right track.

426
00:15:07,280 --> 00:15:08,560
Absolutely.

427
00:15:08,560 --> 00:15:11,240
We might not have AI that can solve all our problems just

428
00:15:11,240 --> 00:15:13,840
yet, but we're definitely moving in the right direction.

429
00:15:13,840 --> 00:15:14,560
That's for sure.

430
00:15:14,560 --> 00:15:18,120
And it makes you wonder, what could this mean for the future?

431
00:15:18,120 --> 00:15:21,000
If we can get LLMs to reason this well,

432
00:15:21,000 --> 00:15:22,200
what are the possibilities?

433
00:15:22,200 --> 00:15:25,240
Yeah, how could this impact fields like medicine, engineering,

434
00:15:25,240 --> 00:15:26,760
maybe even art?

435
00:15:26,760 --> 00:15:28,320
So many possibilities.

436
00:15:28,320 --> 00:15:30,280
It's definitely exciting to think about.

437
00:15:30,280 --> 00:15:33,760
And on that note, we'll wrap up this deep dive.

438
00:15:33,760 --> 00:15:35,160
Thanks for joining us on this journey

439
00:15:35,160 --> 00:15:37,280
into the world of AI reasoning.

440
00:15:37,280 --> 00:15:39,440
Until next time, keep those minds curious

441
00:15:39,440 --> 00:15:48,800
and those algorithms humming.