1
00:00:00,000 --> 00:00:01,920
Okay, so today we're diving into this paper.

2
00:00:01,920 --> 00:00:02,420
Okay.

3
00:00:02,420 --> 00:00:08,160
It's all about getting AI to kind of, you know, do what we want it to, but safely.

4
00:00:08,160 --> 00:00:08,660
Right.

5
00:00:08,660 --> 00:00:11,520
So we're talking about those big, large language models,

6
00:00:11,520 --> 00:00:15,080
LLMs, that are, you know, writing text, they're coding, they're translating,

7
00:00:15,080 --> 00:00:19,560
answering all of our questions in a really human-like way.

8
00:00:19,560 --> 00:00:20,080
Yeah.

9
00:00:20,080 --> 00:00:26,440
Um, but you know, how do we train these things to actually be helpful and safe?

10
00:00:26,440 --> 00:00:26,840
Right.

11
00:00:27,880 --> 00:00:29,200
That seems to be the tricky part.

12
00:00:29,200 --> 00:00:29,640
Yeah.

13
00:00:29,640 --> 00:00:31,480
And that's where this RLHF comes in.

14
00:00:31,480 --> 00:00:31,960
Yeah.

15
00:00:31,960 --> 00:00:33,440
RLHF, that's right.

16
00:00:33,440 --> 00:00:35,640
Reinforcement learning from human feedback.

17
00:00:35,640 --> 00:00:39,720
It's, like, become the main way to train these big LLMs now.

18
00:00:39,720 --> 00:00:40,000
Okay.

19
00:00:40,000 --> 00:00:42,720
So imagine we've got this powerful AI.

20
00:00:42,720 --> 00:00:43,240
Yeah.

21
00:00:43,240 --> 00:00:47,720
We want to make sure that it's actually, you know, doing what we want it to.

22
00:00:47,720 --> 00:00:48,040
Right.

23
00:00:48,040 --> 00:00:49,080
Aligning with our values.

24
00:00:49,080 --> 00:00:52,840
Align with human values, preferences, all that stuff.

25
00:00:52,840 --> 00:00:56,560
So is RLHF kind of like giving AI like a crash course?

26
00:00:56,560 --> 00:00:57,000
Yeah.

27
00:00:57,000 --> 00:00:59,320
You know, how to be a good AI?

28
00:00:59,320 --> 00:00:59,560
Yeah.

29
00:00:59,560 --> 00:01:00,480
Think of it as a loop.

30
00:01:00,480 --> 00:01:01,760
Like you give it a task.

31
00:01:01,760 --> 00:01:03,520
Like write a poem about a cat.

32
00:01:03,520 --> 00:01:03,920
Okay.

33
00:01:03,920 --> 00:01:05,640
And it generates some responses.

34
00:01:05,640 --> 00:01:06,040
Uh-huh.

35
00:01:06,040 --> 00:01:08,280
And then humans come in and they read them.

36
00:01:08,280 --> 00:01:09,240
Like that was good.

37
00:01:09,240 --> 00:01:09,600
Okay.

38
00:01:09,600 --> 00:01:10,320
That was bad.

39
00:01:10,320 --> 00:01:11,360
This is why.

40
00:01:11,360 --> 00:01:11,880
Gotcha.

41
00:01:11,880 --> 00:01:14,800
And that becomes the AI's reward signal.

42
00:01:14,800 --> 00:01:15,960
Oh, so it's learning.

43
00:01:15,960 --> 00:01:17,960
It's learning from these human judgments.

44
00:01:17,960 --> 00:01:18,280
Okay.

45
00:01:18,280 --> 00:01:18,520
Yeah.

46
00:01:18,520 --> 00:01:21,040
It's like, okay, humans like this poem.

47
00:01:21,040 --> 00:01:21,440
Mm-hmm.

48
00:01:21,440 --> 00:01:22,800
I'm going to try and do more of that.

49
00:01:22,800 --> 00:01:23,480
Exactly.

50
00:01:23,480 --> 00:01:28,720
The AI is constantly adjusting its behavior based on that feedback.

51
00:01:28,720 --> 00:01:29,080
Gotcha.

52
00:01:29,080 --> 00:01:30,640
And the more it does what we like,

53
00:01:30,640 --> 00:01:30,960
Yeah.

54
00:01:30,960 --> 00:01:33,760
the more reward it gets, the better it becomes at this.

55
00:01:33,760 --> 00:01:34,920
Okay, but this paper.

56
00:01:34,920 --> 00:01:35,680
Yeah.

57
00:01:35,680 --> 00:01:37,480
"Towards Reliable Alignment:

58
00:01:37,480 --> 00:01:39,400
Uncertainty-aware RLHF."

59
00:01:39,400 --> 00:01:40,160
Right.

60
00:01:40,160 --> 00:01:42,840
This is saying there's kind of a problem with this reward system,

61
00:01:42,840 --> 00:01:43,200
right?

62
00:01:43,200 --> 00:01:43,560
Yeah.

63
00:01:43,560 --> 00:01:45,640
The issue is these reward models.

64
00:01:45,640 --> 00:01:46,000
Okay.

65
00:01:46,000 --> 00:01:48,000
That are kind of at the heart of RLHF.

66
00:01:48,000 --> 00:01:48,520
Mm-hmm.

67
00:01:48,520 --> 00:01:51,840
These models are trying to predict what humans would like.

68
00:01:51,840 --> 00:01:52,280
Okay.

69
00:01:52,280 --> 00:01:53,880
But they can be a bit unreliable.

70
00:01:53,880 --> 00:01:54,240
Okay.

71
00:01:54,240 --> 00:01:56,120
So imagine training 10.

72
00:01:56,120 --> 00:01:56,480
Okay.

73
00:01:56,480 --> 00:01:59,120
Identical reward models on the same data.

74
00:01:59,120 --> 00:02:00,920
All right, so they're all learning the same stuff.

75
00:02:00,920 --> 00:02:02,520
The exact same information.

76
00:02:02,520 --> 00:02:05,160
You would think that they would all give the same score

77
00:02:05,160 --> 00:02:07,560
to the same AI output.

78
00:02:07,560 --> 00:02:08,680
Yeah, that seems pretty logical.

79
00:02:08,680 --> 00:02:09,160
Right.

80
00:02:09,160 --> 00:02:10,720
If they're learning from the same information,

81
00:02:10,720 --> 00:02:12,400
they should come up with the same conclusion.

82
00:02:12,400 --> 00:02:16,240
But what you might find is that one reward model.

83
00:02:16,240 --> 00:02:16,680
Okay.

84
00:02:16,680 --> 00:02:20,000
Gives an AI's output a stellar score.

85
00:02:20,000 --> 00:02:20,600
Mm-hmm.

86
00:02:20,600 --> 00:02:23,200
And another thinks it's absolute garbage.

87
00:02:23,200 --> 00:02:24,080
Wow.

88
00:02:24,080 --> 00:02:26,560
Figure one in the paper visualizes this.

89
00:02:26,560 --> 00:02:27,000
Okay.

90
00:02:27,000 --> 00:02:29,000
It's like imagine looking at a bar chart.

91
00:02:29,000 --> 00:02:29,440
Yeah.

92
00:02:29,440 --> 00:02:32,800
And instead of neat uniform bars,

93
00:02:32,800 --> 00:02:34,160
they're all over the place.

94
00:02:34,160 --> 00:02:34,480
Okay.

95
00:02:34,480 --> 00:02:35,760
Huge differences.

96
00:02:35,760 --> 00:02:38,920
So it's like asking 10 chefs to rate a dish,

97
00:02:38,920 --> 00:02:41,000
and they're all giving completely different scores.

98
00:02:41,000 --> 00:02:41,600
Exactly.

99
00:02:41,600 --> 00:02:43,400
How do you trust any of them at that point?

100
00:02:43,400 --> 00:02:43,760
Right.

101
00:02:43,760 --> 00:02:48,160
And that inconsistency is a major problem for AI alignment.

102
00:02:48,160 --> 00:02:51,000
Because if we can't trust the reward model,

103
00:02:51,000 --> 00:02:54,480
how can we trust that it's guiding the AI in the right direction?

104
00:02:54,480 --> 00:02:57,200
So this inconsistency, this unreliability,

105
00:02:57,200 --> 00:02:58,840
this is what the researchers in this paper

106
00:02:58,840 --> 00:03:00,160
are trying to address.

107
00:03:00,160 --> 00:03:00,440
Yeah.

108
00:03:00,440 --> 00:03:03,960
But why are these reward models so flaky to begin with?

109
00:03:03,960 --> 00:03:06,160
It boils down to two main factors.

110
00:03:06,160 --> 00:03:06,520
Okay.

111
00:03:06,520 --> 00:03:07,920
Limited data.

112
00:03:07,920 --> 00:03:08,360
Okay.

113
00:03:08,360 --> 00:03:10,560
And randomness in training.

114
00:03:10,560 --> 00:03:10,960
Okay.

115
00:03:10,960 --> 00:03:13,960
So think of it like teaching a kid to bake a cake.

116
00:03:13,960 --> 00:03:15,800
But you only give them a couple of recipes.

117
00:03:15,800 --> 00:03:16,520
Okay.

118
00:03:16,520 --> 00:03:20,440
And you let them kind of mess around with the oven temperature.

119
00:03:20,440 --> 00:03:24,800
So even if the recipe is good, the lack of experience

120
00:03:24,800 --> 00:03:27,400
and that random element could mess things up.

121
00:03:27,400 --> 00:03:28,280
Exactly.

122
00:03:28,280 --> 00:03:30,480
And the same thing happens with these reward models.

123
00:03:30,480 --> 00:03:30,840
Okay.

124
00:03:30,840 --> 00:03:33,480
They're trained on a relatively small amount of data.

125
00:03:33,480 --> 00:03:33,840
Okay.

126
00:03:33,840 --> 00:03:35,640
And there's a lot of randomness involved.

127
00:03:35,640 --> 00:03:36,080
Gotcha.

128
00:03:36,080 --> 00:03:37,920
In how they learn from that data.

129
00:03:37,920 --> 00:03:39,560
So they're not all going to be the same.

130
00:03:39,560 --> 00:03:42,840
So even if you train the same model multiple times

131
00:03:42,840 --> 00:03:46,000
on the same data, you're going to get slightly different models

132
00:03:46,000 --> 00:03:46,720
each time.

133
00:03:46,720 --> 00:03:47,080
Okay.

134
00:03:47,080 --> 00:03:48,520
I'm starting to see the problem here.

135
00:03:48,520 --> 00:03:49,040
Yeah.

136
00:03:49,040 --> 00:03:51,640
So how does this paper propose to fix it?

137
00:03:51,640 --> 00:03:53,640
Well, they've got this really interesting approach.

138
00:03:53,640 --> 00:03:56,080
And to explain it, they use this analogy.

139
00:03:56,080 --> 00:03:56,280
Okay.

140
00:03:56,280 --> 00:03:59,360
Called the three-armed bandit problem.

141
00:03:59,360 --> 00:04:00,600
I love a good analogy.

142
00:04:00,600 --> 00:04:01,040
Yeah.

143
00:04:01,040 --> 00:04:01,280
Okay.

144
00:04:01,280 --> 00:04:02,880
So imagine you're in a casino.

145
00:04:02,880 --> 00:04:03,280
Okay.

146
00:04:03,280 --> 00:04:05,200
You've got three slot machines in front of you.

147
00:04:05,200 --> 00:04:05,600
Okay.

148
00:04:05,600 --> 00:04:06,880
Three-armed bandit.

149
00:04:06,880 --> 00:04:07,200
Yeah.

150
00:04:07,200 --> 00:04:10,800
Each machine has a different average payout.

151
00:04:10,800 --> 00:04:11,440
Yeah.

152
00:04:11,440 --> 00:04:13,600
But you don't know what those averages are.

153
00:04:13,600 --> 00:04:13,960
Okay.

154
00:04:13,960 --> 00:04:15,200
You only have estimates.

155
00:04:15,200 --> 00:04:15,480
Gotcha.

156
00:04:15,480 --> 00:04:18,880
And those estimates come with varying degrees of certainty.

157
00:04:18,880 --> 00:04:19,160
Okay.

158
00:04:19,160 --> 00:04:20,400
So we've got these slot machines.

159
00:04:20,400 --> 00:04:20,800
Yeah.

160
00:04:20,800 --> 00:04:21,760
Mysterious payouts.

161
00:04:21,760 --> 00:04:22,640
Yeah.

162
00:04:22,640 --> 00:04:24,720
Varying degrees of certainty.

163
00:04:24,720 --> 00:04:26,080
What's the best strategy?

164
00:04:26,080 --> 00:04:29,200
So a naive approach would be to just go for the machine.

165
00:04:29,200 --> 00:04:29,480
Yeah.

166
00:04:29,480 --> 00:04:31,360
With the highest estimated payout.

167
00:04:31,360 --> 00:04:31,680
Right.

168
00:04:31,680 --> 00:04:33,080
Just go for the big bucks.

169
00:04:33,080 --> 00:04:34,240
Sounds reasonable.

170
00:04:34,240 --> 00:04:34,560
Yeah.

171
00:04:34,560 --> 00:04:37,320
But remember, those estimates have uncertainties.

172
00:04:37,320 --> 00:04:37,920
Right.

173
00:04:37,920 --> 00:04:41,640
You might end up picking a machine that's actually terrible.

174
00:04:41,640 --> 00:04:43,920
Just because the estimate was way off.

175
00:04:43,920 --> 00:04:45,880
So it's kind of like gambling on a stock.

176
00:04:45,880 --> 00:04:46,200
Yeah.

177
00:04:46,200 --> 00:04:47,720
That everyone's saying is going to take off,

178
00:04:47,720 --> 00:04:49,000
but it has no real track record.

179
00:04:49,000 --> 00:04:50,000
Yeah, exactly.

180
00:04:50,000 --> 00:04:50,360
Okay.

181
00:04:50,360 --> 00:04:53,480
A smarter approach is to factor in those uncertainties.

182
00:04:53,480 --> 00:04:53,960
Okay.

183
00:04:53,960 --> 00:04:59,160
Maybe a machine has a slightly lower estimated payout.

184
00:04:59,160 --> 00:05:02,840
But if you're much more certain of that estimate,

185
00:05:02,840 --> 00:05:04,680
it might be the safer, smarter bet.

186
00:05:04,680 --> 00:05:04,960
Okay.

187
00:05:04,960 --> 00:05:06,840
So it's like diversifying your investment.

188
00:05:06,840 --> 00:05:07,200
Exactly.

189
00:05:07,200 --> 00:05:09,440
You're not putting all your eggs in one basket.

190
00:05:09,440 --> 00:05:09,960
Right.

191
00:05:09,960 --> 00:05:11,560
Even if that basket looks really tempting.

192
00:05:11,560 --> 00:05:12,160
Exactly.

193
00:05:12,160 --> 00:05:12,360
Okay.

194
00:05:12,360 --> 00:05:13,440
I see where you're going with this.

195
00:05:13,440 --> 00:05:14,040
Yeah.

196
00:05:14,040 --> 00:05:17,520
So how does the slot machine analogy connect back

197
00:05:17,520 --> 00:05:21,440
to the reward models in AI alignment?

198
00:05:21,440 --> 00:05:23,520
Think of each arm of the bandit.

199
00:05:23,520 --> 00:05:24,040
Okay.

200
00:05:24,040 --> 00:05:25,160
As a different way.

201
00:05:25,160 --> 00:05:25,800
Yeah.

202
00:05:25,800 --> 00:05:28,560
The AI could respond to a prompt.

203
00:05:28,560 --> 00:05:31,280
So a naive reward model might say,

204
00:05:31,280 --> 00:05:33,400
this response gets the highest reward.

205
00:05:33,400 --> 00:05:33,800
Yeah.

206
00:05:33,800 --> 00:05:38,000
But if that estimate is highly uncertain,

207
00:05:38,000 --> 00:05:41,400
following it could lead the AI down a path.

208
00:05:41,400 --> 00:05:42,040
Okay.

209
00:05:42,040 --> 00:05:44,720
That's actually less aligned with human preferences.

210
00:05:44,720 --> 00:05:47,800
So it's like the AI is blindly following that high-reward

211
00:05:47,800 --> 00:05:50,120
slot machine without considering the uncertainty.

212
00:05:50,120 --> 00:05:51,960
And that could lead to some not so great outcomes.

213
00:05:51,960 --> 00:05:52,640
Precisely.

214
00:05:52,640 --> 00:05:52,960
Okay.

215
00:05:52,960 --> 00:05:55,000
And that's where this paper's solution comes in.

216
00:05:55,000 --> 00:05:55,320
Okay.

217
00:05:55,320 --> 00:05:57,800
They're basically saying, let's not be so naive.

218
00:05:57,800 --> 00:05:58,320
Okay.

219
00:05:58,320 --> 00:06:03,040
Let's be more cautious about rewards with high uncertainty.

220
00:06:03,040 --> 00:06:03,680
Gotcha.

221
00:06:03,680 --> 00:06:05,600
They call this a conservative approach.

222
00:06:05,600 --> 00:06:05,960
Okay.

223
00:06:05,960 --> 00:06:09,760
And they actually bake that uncertainty into the AI training

224
00:06:09,760 --> 00:06:10,760
process itself.

225
00:06:10,760 --> 00:06:12,360
Oh, this is where it gets really interesting.

226
00:06:12,360 --> 00:06:12,600
Yeah.

227
00:06:12,600 --> 00:06:18,000
So how do they actually make the AI more conservative or risk

228
00:06:18,000 --> 00:06:19,960
averse in its learning?

229
00:06:19,960 --> 00:06:24,920
Instead of blindly chasing the highest estimated reward,

230
00:06:24,920 --> 00:06:29,160
they encourage the AI to consider the confidence

231
00:06:29,160 --> 00:06:30,840
level of those estimates.

232
00:06:30,840 --> 00:06:33,120
It's like the AI is learning to say,

233
00:06:33,120 --> 00:06:35,400
hold on, this reward looks really tempting.

234
00:06:35,400 --> 00:06:36,000
Yeah.

235
00:06:36,000 --> 00:06:37,800
But I'm not so sure about it.

236
00:06:37,800 --> 00:06:39,840
Maybe I should explore other options

237
00:06:39,840 --> 00:06:41,720
that have more certain rewards, even

238
00:06:41,720 --> 00:06:42,760
if they're a little bit lower.

239
00:06:42,760 --> 00:06:44,800
So it's learning to be a little bit more skeptical.

240
00:06:44,800 --> 00:06:45,360
Exactly.

241
00:06:45,360 --> 00:06:47,600
Not so prone to those risky gambles.

242
00:06:47,600 --> 00:06:48,160
Right.

243
00:06:48,160 --> 00:06:48,920
I like that.

244
00:06:48,920 --> 00:06:49,440
Yeah.

245
00:06:49,440 --> 00:06:52,240
And this cautious approach has some cool benefits.

246
00:06:52,240 --> 00:06:53,000
Okay.

247
00:06:53,000 --> 00:06:56,320
The researchers did some theoretical analysis

248
00:06:56,320 --> 00:06:57,840
and found that,

249
00:06:57,840 --> 00:07:00,480
by incorporating this uncertainty,

250
00:07:00,480 --> 00:07:03,040
they could make it much less likely that the AI would

251
00:07:03,040 --> 00:07:04,800
get worse during training.

252
00:07:04,800 --> 00:07:08,880
That's huge, because it's like we're putting some guardrails up.

253
00:07:08,880 --> 00:07:09,280
Right.

254
00:07:09,280 --> 00:07:12,760
It's not going to completely veer off the rails

255
00:07:12,760 --> 00:07:16,800
and become less aligned with what we actually want.

256
00:07:16,800 --> 00:07:17,320
Right.

257
00:07:17,320 --> 00:07:20,480
But did they actually test this out in practice?

258
00:07:20,480 --> 00:07:21,320
They did.

259
00:07:21,320 --> 00:07:25,200
They went beyond just the theory and actually tested this out.

260
00:07:25,200 --> 00:07:28,840
First, they built an ensemble of 10 reward models

261
00:07:28,840 --> 00:07:33,680
using a model called Gemma-2B-it and a massive open-source

262
00:07:33,680 --> 00:07:34,480
data set.

263
00:07:34,480 --> 00:07:36,480
So kind of like a panel of judges.

264
00:07:36,480 --> 00:07:37,000
Yeah.

265
00:07:37,000 --> 00:07:38,440
Think of it like a panel of judges

266
00:07:38,440 --> 00:07:40,360
with slightly different perspectives.

267
00:07:40,360 --> 00:07:42,360
And did they test this ensemble out?

268
00:07:42,360 --> 00:07:42,960
Oh, yeah.

269
00:07:42,960 --> 00:07:43,320
OK.

270
00:07:43,320 --> 00:07:46,160
They used a benchmark called RewardBench

271
00:07:46,160 --> 00:07:49,480
to see how well their conservative method performed

272
00:07:49,480 --> 00:07:51,400
compared to more traditional approaches.

273
00:07:51,400 --> 00:07:51,800
OK.

274
00:07:51,800 --> 00:07:52,600
And guess what?

275
00:07:52,600 --> 00:07:53,000
Right.

276
00:07:53,000 --> 00:07:55,400
It held up incredibly well.

277
00:07:55,400 --> 00:07:55,880
Nice.

278
00:07:55,880 --> 00:07:58,560
So this whole uncertainty-aware approach

279
00:07:58,560 --> 00:08:00,440
isn't just a good idea in theory.

280
00:08:00,440 --> 00:08:00,600
Yeah.

281
00:08:00,600 --> 00:08:02,200
It actually works in practice.

282
00:08:02,200 --> 00:08:02,400
OK.

283
00:08:02,400 --> 00:08:05,920
So they've got these more reliable reward models working

284
00:08:05,920 --> 00:08:06,760
together as a team.

285
00:08:06,760 --> 00:08:07,120
Right.

286
00:08:07,120 --> 00:08:07,920
What's the next step?

287
00:08:07,920 --> 00:08:09,880
That's where PPO comes in.

288
00:08:09,880 --> 00:08:10,240
OK.

289
00:08:10,240 --> 00:08:12,040
Proximal policy optimization.

290
00:08:12,040 --> 00:08:12,600
OK.

291
00:08:12,600 --> 00:08:16,240
It's a popular algorithm for fine-tuning language models.

292
00:08:16,240 --> 00:08:16,880
OK.

293
00:08:16,880 --> 00:08:20,360
And it's the method they chose to really test

294
00:08:20,360 --> 00:08:22,760
their variance-aware approach.

295
00:08:22,760 --> 00:08:24,800
So they're taking an existing training method.

296
00:08:24,800 --> 00:08:25,480
Yeah.

297
00:08:25,480 --> 00:08:28,480
And they're adding their uncertainty awareness

298
00:08:28,480 --> 00:08:29,120
to the mix.

299
00:08:29,120 --> 00:08:29,720
Exactly.

300
00:08:29,720 --> 00:08:31,280
They ran two experiments.

301
00:08:31,280 --> 00:08:31,520
OK.

302
00:08:31,520 --> 00:08:36,160
One used their cautious, uncertainty-aware PPO.

303
00:08:36,160 --> 00:08:37,960
And the other used regular PPO.

304
00:08:37,960 --> 00:08:38,440
Gotcha.

305
00:08:38,440 --> 00:08:39,200
As a baseline.

306
00:08:39,200 --> 00:08:40,800
So it was like a head-to-head competition.

307
00:08:40,800 --> 00:08:41,040
Yeah.

308
00:08:41,040 --> 00:08:41,920
It was like a head-to-head.

309
00:08:41,920 --> 00:08:43,840
To see which one produced better results.

310
00:08:43,840 --> 00:08:44,680
Exactly.

311
00:08:44,680 --> 00:08:44,840
OK.

312
00:08:44,840 --> 00:08:47,400
So I'm picturing like two AIs going at it in a training

313
00:08:47,400 --> 00:08:47,840
montage.

314
00:08:47,840 --> 00:08:48,400
Yeah.

315
00:08:48,400 --> 00:08:49,000
This is great.

316
00:08:49,000 --> 00:08:49,400
Yeah.

317
00:08:49,400 --> 00:08:51,160
So how did they judge the results?

318
00:08:51,160 --> 00:08:51,360
Right.

319
00:08:51,360 --> 00:08:53,360
Was it just a matter of opinion?

320
00:08:53,360 --> 00:08:56,800
Or was there a more objective way to measure who was better?

321
00:08:56,800 --> 00:09:00,000
To make sure everything was fair and square.

322
00:09:00,000 --> 00:09:00,600
OK.

323
00:09:00,600 --> 00:09:04,800
They brought in a completely independent reward model.

324
00:09:04,800 --> 00:09:05,280
OK.

325
00:09:05,280 --> 00:09:06,600
To act as the judge.

326
00:09:06,600 --> 00:09:07,680
So a third party.

327
00:09:07,680 --> 00:09:08,080
Yeah.

328
00:09:08,080 --> 00:09:08,560
OK.

329
00:09:08,560 --> 00:09:10,880
This one was based on Llama 3.

330
00:09:10,880 --> 00:09:11,320
OK.

331
00:09:11,320 --> 00:09:14,520
So it had nothing to do with their training process whatsoever.

332
00:09:14,520 --> 00:09:15,400
It was neutral.

333
00:09:15,400 --> 00:09:15,640
Yeah.

334
00:09:15,640 --> 00:09:16,480
Completely neutral.

335
00:09:16,480 --> 00:09:16,720
OK.

336
00:09:16,720 --> 00:09:17,280
I love that.

337
00:09:17,280 --> 00:09:17,640
Yeah.

338
00:09:17,640 --> 00:09:17,840
OK.

339
00:09:17,840 --> 00:09:18,600
Drum roll please.

340
00:09:18,600 --> 00:09:19,240
Yeah.

341
00:09:19,240 --> 00:09:23,680
What were the results of this AI training showdown?

342
00:09:23,680 --> 00:09:25,440
Well, just like their theory predicted.

343
00:09:25,440 --> 00:09:25,800
OK.

344
00:09:25,800 --> 00:09:28,880
The variance-aware PPO delivered more consistent

345
00:09:28,880 --> 00:09:29,720
improvements.

346
00:09:29,720 --> 00:09:30,280
OK.

347
00:09:30,280 --> 00:09:32,960
It might not have hit the same peak performance levels.

348
00:09:32,960 --> 00:09:33,200
OK.

349
00:09:33,200 --> 00:09:34,280
As the regular PPO.

350
00:09:34,280 --> 00:09:34,840
Ah.

351
00:09:34,840 --> 00:09:37,040
But it was far more reliable.

352
00:09:37,040 --> 00:09:37,360
OK.

353
00:09:37,360 --> 00:09:41,040
And less prone to those dramatic drops in performance.

354
00:09:41,040 --> 00:09:43,680
So it's like the classic tortoise and the hare story.

355
00:09:43,680 --> 00:09:44,000
Right.

356
00:09:44,000 --> 00:09:44,520
Exactly.

357
00:09:44,520 --> 00:09:46,640
Regular PPO is the hare sprinting.

358
00:09:46,640 --> 00:09:47,040
Yeah.

359
00:09:47,040 --> 00:09:47,680
Ahead.

360
00:09:47,680 --> 00:09:47,920
Yeah.

361
00:09:47,920 --> 00:09:50,560
But also, you know, risking a major stumble.

362
00:09:50,560 --> 00:09:51,160
Yeah.

363
00:09:51,160 --> 00:09:53,120
While the variance-aware one was the tortoise.

364
00:09:53,120 --> 00:09:53,760
Right.

365
00:09:53,760 --> 00:09:55,240
Slow and steady wins the race.

366
00:09:55,240 --> 00:09:56,720
Exactly.

367
00:09:56,720 --> 00:09:58,640
Remember that whole slot machine analogy?

368
00:09:58,640 --> 00:09:59,280
Yeah.

369
00:09:59,280 --> 00:10:02,920
The regular PPO was that high-risk, high-reward machine.

370
00:10:02,920 --> 00:10:03,720
Yeah.

371
00:10:03,720 --> 00:10:08,040
But the variance-aware one was that more conservative choice.

372
00:10:08,040 --> 00:10:08,480
Right.

373
00:10:08,480 --> 00:10:10,840
Offering more predictable and consistent outcomes.

374
00:10:10,840 --> 00:10:11,160
Yeah.

375
00:10:11,160 --> 00:10:12,920
And I think in the world of AI alignment,

376
00:10:12,920 --> 00:10:14,160
that's really what we want.

377
00:10:14,160 --> 00:10:14,960
Absolutely.

378
00:10:14,960 --> 00:10:19,120
Like slow and steady, reliable, predictable.

379
00:10:19,120 --> 00:10:21,120
Especially when you're talking about AI systems.

380
00:10:21,120 --> 00:10:21,960
Yeah.

381
00:10:21,960 --> 00:10:23,680
They could have a huge impact on our lives.

382
00:10:23,680 --> 00:10:24,280
Exactly.

383
00:10:24,280 --> 00:10:26,760
We want to make sure that they're not making decisions.

384
00:10:26,760 --> 00:10:27,320
Right.

385
00:10:27,320 --> 00:10:31,600
Based on unreliable or unpredictable reward signals.

386
00:10:31,600 --> 00:10:34,760
OK, now I want to go back to those 10 reward models

387
00:10:34,760 --> 00:10:36,080
that were in the ensemble.

388
00:10:36,080 --> 00:10:36,400
Right.

389
00:10:36,400 --> 00:10:39,160
I'm curious, like, how much did they actually

390
00:10:39,160 --> 00:10:40,440
disagree with each other?

391
00:10:40,440 --> 00:10:40,760
Yeah.

392
00:10:40,760 --> 00:10:42,360
Because I imagine that there's got

393
00:10:42,360 --> 00:10:44,360
to be some interesting insights there.

394
00:10:44,360 --> 00:10:47,000
They did a deep dive into the variability

395
00:10:47,000 --> 00:10:48,520
of those reward models.

396
00:10:48,520 --> 00:10:48,960
OK.

397
00:10:48,960 --> 00:10:51,240
So remember, they were all trained on the same data.

398
00:10:51,240 --> 00:10:51,640
Right.

399
00:10:51,640 --> 00:10:54,600
But they wanted to see just how different their scores were

400
00:10:54,600 --> 00:10:57,000
for the same AI-generated text.

401
00:10:57,000 --> 00:10:58,200
So what did they find?

402
00:10:58,200 --> 00:11:00,760
Was it like a polite disagreement?

403
00:11:00,760 --> 00:11:03,440
Or were these models just, like, throwing shade at each other?

404
00:11:03,440 --> 00:11:06,720
Well, the variance in those reward scores

405
00:11:06,720 --> 00:11:08,760
could be pretty wild.

406
00:11:08,760 --> 00:11:09,080
OK.

407
00:11:09,080 --> 00:11:13,920
Sometimes the scores range from 3 to 14 on their scale.

408
00:11:13,920 --> 00:11:15,160
So that's a pretty big difference.

409
00:11:15,160 --> 00:11:15,640
Yeah.

410
00:11:15,640 --> 00:11:15,960
OK.

411
00:11:15,960 --> 00:11:20,680
Meaning one model might give an AI output a glowing review

412
00:11:20,680 --> 00:11:23,000
while another thought it was a total flop.

413
00:11:23,000 --> 00:11:23,520
Wow.

414
00:11:23,520 --> 00:11:25,760
So even with that same training data,

415
00:11:25,760 --> 00:11:28,320
they still had, like, their own distinct opinions.

416
00:11:28,320 --> 00:11:30,560
It really highlights that these reward models

417
00:11:30,560 --> 00:11:32,840
aren't these perfect, all-knowing oracles.

418
00:11:32,840 --> 00:11:33,080
Right.

419
00:11:33,080 --> 00:11:34,760
They're complex systems.

420
00:11:34,760 --> 00:11:35,040
Yeah.

421
00:11:35,040 --> 00:11:37,240
And their outputs can be influenced

422
00:11:37,240 --> 00:11:39,840
by these really subtle factors that we might not even

423
00:11:39,840 --> 00:11:40,840
fully understand.

424
00:11:40,840 --> 00:11:42,960
It's kind of like you might love sausage.

425
00:11:42,960 --> 00:11:43,760
Yeah.

426
00:11:43,760 --> 00:11:45,880
But you don't want to see how it's made.

427
00:11:45,880 --> 00:11:47,080
I love that analogy.

428
00:11:47,080 --> 00:11:48,840
We don't always want to know the messy details

429
00:11:48,840 --> 00:11:49,960
behind how things work.

430
00:11:49,960 --> 00:11:51,440
But in this case, I think understanding

431
00:11:51,440 --> 00:11:55,920
how that sausage is made, meaning how those reward models work

432
00:11:55,920 --> 00:11:57,880
and where their uncertainties lie,

433
00:11:57,880 --> 00:11:58,920
is incredibly important.

434
00:11:58,920 --> 00:12:00,400
Because it's that understanding.

435
00:12:00,400 --> 00:12:00,760
Yeah.

436
00:12:00,760 --> 00:12:05,600
That allows us to develop these more robust and trustworthy AI

437
00:12:05,600 --> 00:12:06,840
alignment techniques.

438
00:12:06,840 --> 00:12:07,120
OK.

439
00:12:07,120 --> 00:12:12,360
So zooming out a bit, how do you think this research is going

440
00:12:12,360 --> 00:12:15,600
to shape the future of AI alignment?

441
00:12:15,600 --> 00:12:17,640
Is this like a really big deal?

442
00:12:17,640 --> 00:12:18,000
Yeah.

443
00:12:18,000 --> 00:12:20,240
Or is it just one small step?

444
00:12:20,240 --> 00:12:22,120
I think it's a pretty significant step.

445
00:12:22,120 --> 00:12:22,520
OK.

446
00:12:22,520 --> 00:12:27,360
By directly tackling this issue of reward model uncertainty,

447
00:12:27,360 --> 00:12:30,920
this paper has opened up some really promising avenues

448
00:12:30,920 --> 00:12:32,480
for research and development.

449
00:12:32,480 --> 00:12:35,360
It's like they've given us a new lens through which

450
00:12:35,360 --> 00:12:38,440
to view this whole process of AI alignment.

451
00:12:38,440 --> 00:12:41,680
We're not just blindly chasing the highest reward anymore.

452
00:12:41,680 --> 00:12:45,480
We're taking this more measured, thoughtful, cautious approach.

453
00:12:45,480 --> 00:12:47,840
And that caution, that awareness of uncertainty,

454
00:12:47,840 --> 00:12:50,400
seems like it's going to be crucial as we develop

455
00:12:50,400 --> 00:12:52,880
these even more powerful AI systems.

456
00:12:52,880 --> 00:12:55,840
Because we want to make sure that they're aligned with our values,

457
00:12:55,840 --> 00:12:59,000
not just blindly optimizing for some reward signal.

458
00:12:59,000 --> 00:12:59,520
Exactly.

459
00:12:59,520 --> 00:13:01,800
It's not just about making AI smarter.

460
00:13:01,800 --> 00:13:02,360
Right.

461
00:13:02,360 --> 00:13:05,840
It's about making it smarter in a way that benefits humanity.

462
00:13:05,840 --> 00:13:06,080
Yeah.

463
00:13:06,080 --> 00:13:09,640
And I think this research is a fantastic step in that direction.

464
00:13:09,640 --> 00:13:11,680
I think we've covered a lot of ground today.

465
00:13:11,680 --> 00:13:12,080
Yeah.

466
00:13:12,080 --> 00:13:17,760
From flaky reward models to slot machine analogies

467
00:13:17,760 --> 00:13:19,400
and cautious AI.

468
00:13:19,400 --> 00:13:22,960
So for our listeners who want to go even deeper,

469
00:13:22,960 --> 00:13:25,240
I definitely recommend checking out the original research

470
00:13:25,240 --> 00:13:25,720
paper.

471
00:13:25,720 --> 00:13:26,600
Definitely.

472
00:13:26,600 --> 00:13:28,880
It's full of really insightful analysis.

473
00:13:28,880 --> 00:13:29,640
Yeah.

474
00:13:29,640 --> 00:13:32,960
And technical details that we didn't have time to get into today.

475
00:13:32,960 --> 00:13:33,640
Absolutely.

476
00:13:33,640 --> 00:13:35,480
Stay curious, keep learning.

477
00:13:35,480 --> 00:13:36,240
Yes.

478
00:13:36,240 --> 00:13:39,120
And we'll catch you on our next deep dive.

479
00:13:39,120 --> 00:13:58,760
Sounds good.

