1
00:00:00,000 --> 00:00:05,560
All right, everyone, get ready, because today we are taking a deep dive into this new OpenAI paper.

2
00:00:05,560 --> 00:00:06,960
Oh yeah, this one's a good one.

3
00:00:06,960 --> 00:00:09,960
All about keeping those large language models.

4
00:00:09,960 --> 00:00:13,640
You know, LLMs on the straight and narrow.

5
00:00:13,640 --> 00:00:15,000
Yeah, keeping them safe.

6
00:00:15,000 --> 00:00:16,040
Safe, exactly.

7
00:00:16,040 --> 00:00:20,120
It's called Rule-Based Rewards for Language Model Safety.

8
00:00:20,120 --> 00:00:25,920
And the best part is we're going straight to the source with excerpts from the paper itself.

9
00:00:25,920 --> 00:00:27,040
I love it when we do that.

10
00:00:27,040 --> 00:00:28,200
Me too, I think it's fun.

11
00:00:28,200 --> 00:00:34,680
So this paper really digs into a problem that's been a thorn in the side of AI for a while now.

12
00:00:34,680 --> 00:00:35,240
Okay.

13
00:00:35,240 --> 00:00:38,120
You know how LLMs are getting so incredibly good.

14
00:00:38,120 --> 00:00:39,040
At so many things.

15
00:00:39,040 --> 00:00:43,480
At things like writing and translating, but sometimes they can say things that are...

16
00:00:43,480 --> 00:00:44,040
A little.

17
00:00:44,040 --> 00:00:45,760
A little off-color, shall we say.

18
00:00:45,760 --> 00:00:48,080
Yeah, it's like that friend who always puts their foot in their mouth.

19
00:00:48,080 --> 00:00:53,800
Exactly, and the old ways of fixing this, like having humans constantly checking, correcting them,

20
00:00:53,800 --> 00:00:55,240
it's just not feasible.

21
00:00:55,240 --> 00:00:56,480
Yeah, super time consuming.

22
00:00:56,480 --> 00:00:57,480
Way too expensive.

23
00:00:57,480 --> 00:01:00,800
So this paper's got a new method that's way more efficient.

24
00:01:00,800 --> 00:01:04,480
And gives us a lot more control over how these models behave.

25
00:01:04,480 --> 00:01:06,240
Okay, so spill the beans.

26
00:01:06,240 --> 00:01:07,120
What's the secret?

27
00:01:07,120 --> 00:01:08,280
Yeah, what's the secret sauce?

28
00:01:08,280 --> 00:01:11,600
It's called Rule-Based Rewards, or we can just call them RBRs for short.

29
00:01:11,600 --> 00:01:12,360
RBRs, okay.

30
00:01:12,360 --> 00:01:17,680
And it's all about setting very specific rules for what the AI should and shouldn't say.

31
00:01:17,680 --> 00:01:18,640
Okay, like boundaries.

32
00:01:18,640 --> 00:01:20,320
Boundaries, exactly.

33
00:01:20,320 --> 00:01:26,960
So imagine you're teaching a chatbot to politely refuse requests for things that are dangerous

34
00:01:26,960 --> 00:01:27,960
or illegal.

35
00:01:27,960 --> 00:01:29,080
Right, I'm with you.

36
00:01:29,080 --> 00:01:33,120
Okay, so with RBRs, you break down that polite refusal into clear rules.

37
00:01:33,120 --> 00:01:34,120
Like what?

38
00:01:34,120 --> 00:01:38,640
Like, apologize, don't be judgmental, and clearly state you can't do that.

39
00:01:38,640 --> 00:01:41,560
So it's basically like giving the AI a code of conduct.

40
00:01:41,560 --> 00:01:43,320
It's like a digital etiquette guide, yeah.

41
00:01:43,320 --> 00:01:45,240
I like that, a digital etiquette guide.

42
00:01:45,240 --> 00:01:46,960
And here's where things get really interesting.

43
00:01:46,960 --> 00:01:47,560
Okay.

44
00:01:47,560 --> 00:01:49,560
They actually use another AI.

45
00:01:49,560 --> 00:01:50,200
Whoa.

46
00:01:50,200 --> 00:01:55,080
Almost like a robot judge to score how well the chatbot is following those rules during

47
00:01:55,080 --> 00:01:55,920
its training.

48
00:01:55,920 --> 00:01:57,640
So AI judging AI.

49
00:01:57,640 --> 00:01:59,360
Exactly, it's like AI inception.

50
00:01:59,360 --> 00:02:03,360
That is wild, and that means we don't need humans constantly supervising.

51
00:02:03,360 --> 00:02:04,440
That's the idea.

52
00:02:04,440 --> 00:02:09,760
This AI judge gives feedback, so we don't need humans hovering over it all the time.

53
00:02:09,760 --> 00:02:11,520
I mean, that makes it much more efficient.

54
00:02:11,520 --> 00:02:12,040
Exactly.

55
00:02:12,040 --> 00:02:13,320
And I guess more consistent too.

56
00:02:13,320 --> 00:02:15,080
You got it, way more consistent.

57
00:02:15,080 --> 00:02:17,160
And here's another cool thing about RBRs.

58
00:02:17,160 --> 00:02:17,560
Yeah.

59
00:02:17,560 --> 00:02:22,600
They're really easy to update as the world changes, you know, as our ideas about safety evolve.

60
00:02:22,600 --> 00:02:26,040
Oh, that's important because things change so quickly these days.

61
00:02:26,040 --> 00:02:27,920
Things change so fast, yeah.

62
00:02:27,920 --> 00:02:31,480
So let's say some new kind of harmful content pops up online.

63
00:02:31,480 --> 00:02:32,520
Like deepfakes or something.

64
00:02:32,520 --> 00:02:34,200
Yeah, like deepfakes, exactly.

65
00:02:34,200 --> 00:02:39,600
Well, with RBRs, you can just add new rules or tweak the old ones to keep the AI in check.

66
00:02:39,600 --> 00:02:43,000
So it's not just a one-time fix, it's like a constantly evolving system.

67
00:02:43,000 --> 00:02:45,480
It's like a living, breathing set of guidelines.

68
00:02:45,480 --> 00:02:46,120
I like that.

69
00:02:46,120 --> 00:02:51,280
So it sounds good in theory, but how well does this actually work in practice?

70
00:02:51,280 --> 00:02:53,400
You know, they got some pretty impressive results.

71
00:02:53,400 --> 00:02:54,800
Yeah, they put it through the wringer.

72
00:02:54,800 --> 00:02:58,360
They tested these RBRs in all kinds of scenarios.

73
00:02:58,360 --> 00:03:05,480
And one example they gave was training a chatbot to respond appropriately to requests for advice on self-harm.

74
00:03:05,480 --> 00:03:06,280
Oh, wow.

75
00:03:06,280 --> 00:03:07,200
That's a tough one.

76
00:03:07,200 --> 00:03:08,240
That's a sensitive area.

77
00:03:08,240 --> 00:03:11,520
You really need the AI to handle that carefully.

78
00:03:11,520 --> 00:03:12,160
Absolutely.

79
00:03:12,160 --> 00:03:14,400
You can't just have it spitting out generic advice.

80
00:03:14,400 --> 00:03:16,680
You don't want canned responses.

81
00:03:16,680 --> 00:03:21,600
An early version of their system tended to just recommend calling a US suicide hotline.

82
00:03:21,600 --> 00:03:24,080
Right, which isn't helpful for someone in another country.

83
00:03:24,080 --> 00:03:26,400
Totally useless for someone outside the US.

84
00:03:26,400 --> 00:03:29,800
So how did the RBRs actually help in this case?

85
00:03:29,800 --> 00:03:35,640
Well, they were able to refine the system so it acknowledges the user's distress,

86
00:03:35,640 --> 00:03:37,800
but doesn't actually give harmful advice.

87
00:03:37,800 --> 00:03:39,480
So it's learning to be more empathetic?

88
00:03:39,480 --> 00:03:40,520
More empathetic.

89
00:03:40,520 --> 00:03:43,080
It offers more general support and guidance.

90
00:03:43,080 --> 00:03:43,640
Wow.

91
00:03:43,640 --> 00:03:49,240
So it avoids those potentially harmful responses, but it's still engaging with the user.

92
00:03:49,240 --> 00:03:50,400
It's still there for them.

93
00:03:50,400 --> 00:03:53,600
That's a pretty big step forward in making these models safer.

94
00:03:53,600 --> 00:03:54,680
It's a huge step.

95
00:03:54,680 --> 00:03:57,800
And they said that it helped reduce something called over-refusals.

96
00:03:57,800 --> 00:03:58,600
Oh, yeah.

97
00:03:58,600 --> 00:03:59,520
Over-refusals.

98
00:03:59,520 --> 00:04:00,600
What is an over-refusal?

99
00:04:00,600 --> 00:04:04,720
So that's when the chatbot becomes overly cautious and refuses to do anything.

100
00:04:04,720 --> 00:04:06,000
Even if it's harmless?

101
00:04:06,000 --> 00:04:07,680
Even if it's totally harmless.

102
00:04:07,680 --> 00:04:10,560
It's like that friend who always says no to every invitation.

103
00:04:10,560 --> 00:04:12,840
Right, even if you're just asking them to hang out and watch a movie.

104
00:04:12,840 --> 00:04:13,400
Exactly.

105
00:04:13,400 --> 00:04:14,960
You just want to watch a movie.

106
00:04:14,960 --> 00:04:20,960
And you want the AI to be safe, but not so locked down that it can't hold a conversation.

107
00:04:20,960 --> 00:04:21,280
Right.

108
00:04:21,280 --> 00:04:22,280
It still needs to be useful.

109
00:04:22,280 --> 00:04:23,960
Still needs to be useful.

110
00:04:23,960 --> 00:04:27,160
And the RBRs seem to help find that sweet spot.

111
00:04:27,160 --> 00:04:29,960
So it's that balance between safety and functionality?

112
00:04:29,960 --> 00:04:31,280
Finding that balance.

113
00:04:31,280 --> 00:04:36,960
And what I really like about this approach is that it seems to give us more control than we had before.

114
00:04:36,960 --> 00:04:37,800
Way more control.

115
00:04:37,800 --> 00:04:40,280
We're not just giving a thumbs up or thumbs down on a response.

116
00:04:40,280 --> 00:04:42,120
Yeah, we're giving very specific instructions.

117
00:04:42,120 --> 00:04:43,200
We're giving instructions.

118
00:04:43,200 --> 00:04:45,320
We're saying this is how you should behave.

119
00:04:45,320 --> 00:04:48,240
So it's almost like we're teaching the AI good manners.

120
00:04:48,240 --> 00:04:52,080
Yeah, like teaching it how to be a responsible conversational partner.

121
00:04:52,080 --> 00:04:52,880
Exactly.

122
00:04:52,880 --> 00:04:59,440
So it sounds like a major win for AI safety, but I'm sure there are some limitations to this approach.

123
00:04:59,440 --> 00:05:01,520
Well, of course, no system is perfect.

124
00:05:01,520 --> 00:05:07,840
One thing the researchers point out is that RBRs work best when you can clearly define the rules.

125
00:05:07,840 --> 00:05:09,320
When the rules are very objective.

126
00:05:09,320 --> 00:05:10,160
Very objective.

127
00:05:10,160 --> 00:05:16,320
Yeah, like if the desired behavior is really subjective, it might not be as effective.

128
00:05:16,320 --> 00:05:17,000
I see.

129
00:05:17,000 --> 00:05:19,680
So like if you're trying to teach an AI humor.

130
00:05:19,680 --> 00:05:20,880
Like how to be funny.

131
00:05:20,880 --> 00:05:23,440
Yeah, that's really hard to define with specific rules.

132
00:05:23,440 --> 00:05:23,960
Exactly.

133
00:05:23,960 --> 00:05:25,280
Humor is so subjective.

134
00:05:25,280 --> 00:05:29,120
It's subjective and it changes depending on the person and the context.

135
00:05:29,120 --> 00:05:29,400
Right.

136
00:05:29,400 --> 00:05:33,960
So it's not like this is a silver bullet solution that will solve every AI challenge.

137
00:05:33,960 --> 00:05:39,920
It's not a magic wand, but it's a very powerful tool when used in the right situations.

138
00:05:39,920 --> 00:05:41,560
So this is all super fascinating.

139
00:05:41,560 --> 00:05:42,240
It is.

140
00:05:42,240 --> 00:05:45,480
I am so eager to dig into the details of their experiments.

141
00:05:45,480 --> 00:05:47,120
Yeah, let's look at some of the specifics.

142
00:05:47,120 --> 00:05:49,400
Well, let's unpack those ablations they were talking about.

143
00:05:49,400 --> 00:05:51,120
All right, let's get down to the nitty gritty.

144
00:05:51,120 --> 00:05:51,720
You bet.

145
00:05:51,720 --> 00:05:53,160
Okay, so ablations.

146
00:05:53,160 --> 00:05:53,840
Ablations.

147
00:05:53,840 --> 00:05:54,720
I like saying that word.

148
00:05:54,720 --> 00:05:55,840
It's a fun one, isn't it?

149
00:05:55,840 --> 00:05:56,520
It is.

150
00:05:56,520 --> 00:06:01,720
So they're like these little tests they run to see what happens when they tweak different parts of the system.

151
00:06:01,720 --> 00:06:06,680
Yeah, it's like tinkering under the hood to really understand how this RBR engine works.

152
00:06:06,680 --> 00:06:08,080
I like that analogy a lot.

153
00:06:08,080 --> 00:06:08,840
Glad you like it.

154
00:06:08,840 --> 00:06:10,200
It's a good one.

155
00:06:10,200 --> 00:06:16,680
One thing that I was really curious about was how the amount of training data affects all of this.

156
00:06:16,680 --> 00:06:17,800
Yeah, that's a good one.

157
00:06:17,800 --> 00:06:22,480
Because we know that AI, generally speaking, gets smarter with more data.

158
00:06:22,480 --> 00:06:24,480
Right, more data, more smarts usually.

159
00:06:24,480 --> 00:06:28,320
But I was wondering how that plays out with these safety-focused rules, these RBRs.

160
00:06:28,320 --> 00:06:35,200
So basically they wanted to find out if there's a point where adding more and more safety rules doesn't actually make much difference.

161
00:06:35,200 --> 00:06:38,000
So it's like, is there a saturation point?

162
00:06:38,000 --> 00:06:42,280
Like can the AI only handle so many rules before it just gets overloaded?

163
00:06:42,280 --> 00:06:45,720
Yeah, like does it just kind of freak out and say, I can't handle any more rules?

164
00:06:45,720 --> 00:06:46,920
Exactly, exactly.

165
00:06:46,920 --> 00:06:48,360
So what did they find?

166
00:06:48,360 --> 00:06:54,720
Well, at least in their experiments, what they found is that more rules usually meant safer behavior.

167
00:06:54,720 --> 00:06:55,440
Oh, interesting.

168
00:06:55,440 --> 00:07:00,480
So it's not like there's this magic number of rules where it's like, OK, now the AI is a safety expert.

169
00:07:00,480 --> 00:07:01,920
OK, so more is more in this case.

170
00:07:01,920 --> 00:07:03,160
More is more, it seems like.

171
00:07:03,160 --> 00:07:05,600
But they did find some really interesting patterns.

172
00:07:05,600 --> 00:07:12,280
In how often the AI refused to do things and also like how it phrased those refusals.

173
00:07:12,280 --> 00:07:13,480
So basically how it said no.

174
00:07:13,480 --> 00:07:15,200
Exactly how it said no.

175
00:07:15,200 --> 00:07:20,800
So as they added more safety rules, they noticed that the AI did get a little bit more cautious.

176
00:07:20,800 --> 00:07:22,400
Like more hesitant to do things.

177
00:07:22,400 --> 00:07:23,280
Yeah, more hesitant.

178
00:07:23,280 --> 00:07:27,160
So it was refusing more requests, even if they were harmless.

179
00:07:27,160 --> 00:07:30,880
So a little more uptight, but not to the point where it was completely useless.

180
00:07:30,880 --> 00:07:32,640
Exactly, not totally useless.

181
00:07:32,640 --> 00:07:37,040
It was a gradual increase and it never got as bad as some of the other methods they tried,

182
00:07:37,040 --> 00:07:40,640
like the one where they just relied on human feedback on safety.

183
00:07:40,640 --> 00:07:41,760
Right, which we talked about earlier.

184
00:07:41,760 --> 00:07:44,640
And that one seemed to be really quick to just hit the no button.

185
00:07:44,640 --> 00:07:47,960
Oh, yeah, that one was very, very cautious.

186
00:07:47,960 --> 00:07:49,640
Yeah, like overly cautious.

187
00:07:49,640 --> 00:07:50,360
Yeah.

188
00:07:50,360 --> 00:07:55,200
So I guess what you're saying is that like finding that perfect balance between being safe

189
00:07:55,200 --> 00:07:59,160
and still being able to actually do stuff is still a work in progress.

190
00:07:59,160 --> 00:08:00,840
It's a constant balancing act.

191
00:08:00,840 --> 00:08:02,600
A constant balancing act, yeah.

192
00:08:02,600 --> 00:08:05,920
It makes sense because it probably is different for every AI model, depending on what you

193
00:08:05,920 --> 00:08:06,920
want it to do.

194
00:08:06,920 --> 00:08:08,960
Absolutely, it all depends on the context.

195
00:08:08,960 --> 00:08:09,960
Right.

196
00:08:09,960 --> 00:08:11,560
So what about the style of the refusals?

197
00:08:11,560 --> 00:08:13,480
Did they see anything interesting there?

198
00:08:13,480 --> 00:08:18,840
Yeah, so they found that as they added more safety rules, the AI actually got better at

199
00:08:18,840 --> 00:08:21,640
politely refusing requests.

200
00:08:21,640 --> 00:08:24,480
So it learned how to say no in a nicer way.

201
00:08:24,480 --> 00:08:26,520
It learned to say no with more grace.

202
00:08:26,520 --> 00:08:27,520
Yeah.

203
00:08:27,520 --> 00:08:28,520
Okay.

204
00:08:28,520 --> 00:08:30,280
So it's all about learning from good examples, right?

205
00:08:30,280 --> 00:08:33,960
Like if you read a lot of well-written stuff, you'll probably start writing better yourself.

206
00:08:33,960 --> 00:08:34,960
It makes sense, yeah.

207
00:08:34,960 --> 00:08:35,960
Right.

208
00:08:35,960 --> 00:08:39,640
So it's like feeding the AI good examples of how to politely say no.

209
00:08:39,640 --> 00:08:45,320
It makes you wonder though, did they experiment with using different kinds of safety rules,

210
00:08:45,320 --> 00:08:46,560
not just more rules?

211
00:08:46,560 --> 00:08:49,640
Oh, you mean like strong refusals versus softer refusals?

212
00:08:49,640 --> 00:08:52,160
Yeah, like no way versus I'd rather not.

213
00:08:52,160 --> 00:08:53,160
They did.

214
00:08:53,160 --> 00:08:55,800
They actually experimented with that and that's where things got really interesting in terms

215
00:08:55,800 --> 00:08:58,160
of like fine-tuning the AI's behavior.

216
00:08:58,160 --> 00:08:59,160
Okay.

217
00:08:59,160 --> 00:09:00,360
So it's a tweak, tell me more.

218
00:09:00,360 --> 00:09:05,360
So they found that by adjusting the mix of those strong like hard no refusals and more

219
00:09:05,360 --> 00:09:11,560
like compliant softer responses, they could actually make the AI more or less cautious.

220
00:09:11,560 --> 00:09:14,560
So more strong refusals made it a bit more uptight.

221
00:09:14,560 --> 00:09:15,560
Okay.

222
00:09:15,560 --> 00:09:18,800
While more compliant examples made it more willing to go along with the request.

223
00:09:18,800 --> 00:09:22,400
So it's almost like you're setting the sensitivity on like a smoke detector.

224
00:09:22,400 --> 00:09:24,920
Oh, that's a great analogy.

225
00:09:24,920 --> 00:09:28,480
Like you want it to go off when there's a real fire, but you don't want it going off

226
00:09:28,480 --> 00:09:30,360
every time someone burns toast.

227
00:09:30,360 --> 00:09:31,360
Exactly.

228
00:09:31,360 --> 00:09:33,120
It's all about striking that right balance.

229
00:09:33,120 --> 00:09:35,160
So it's that balance again.

230
00:09:35,160 --> 00:09:37,280
But then they dug even deeper, right?

231
00:09:37,280 --> 00:09:41,880
They looked at how those softer, more empathetic refusals actually affected things.

232
00:09:41,880 --> 00:09:42,880
They did, yeah.

233
00:09:42,880 --> 00:09:47,400
And it turns out that the more they used those softer refusals in the training, the better

234
00:09:47,400 --> 00:09:51,760
the AI got at handling those really tricky situations like the request for advice on

235
00:09:51,760 --> 00:09:52,760
self-harm.

236
00:09:52,760 --> 00:09:53,760
Oh, that's so interesting.

237
00:09:53,760 --> 00:09:57,760
So instead of just shutting down the conversation, it learned to be more understanding, more

238
00:09:57,760 --> 00:09:58,760
supportive.

239
00:09:58,760 --> 00:10:00,200
It learned to be more human basically.

240
00:10:00,200 --> 00:10:01,200
Yeah, more human.

241
00:10:01,200 --> 00:10:05,680
And it did all of this without becoming any less safe overall.

242
00:10:05,680 --> 00:10:10,280
So it was still really good at avoiding the harmful content, but it was also able to offer

243
00:10:10,280 --> 00:10:13,680
this like compassionate response.

244
00:10:13,680 --> 00:10:14,680
Yeah.

245
00:10:14,680 --> 00:10:16,440
It was able to kind of walk that line.

246
00:10:16,440 --> 00:10:17,440
That's amazing.

247
00:10:17,440 --> 00:10:18,440
It really is.

248
00:10:18,440 --> 00:10:23,280
It's like they found the perfect recipe, you know, like safety plus compassion equals

249
00:10:23,280 --> 00:10:24,960
a really good AI assistant.

250
00:10:24,960 --> 00:10:25,960
Yeah.

251
00:10:25,960 --> 00:10:26,960
It's like the secret sauce.

252
00:10:26,960 --> 00:10:31,400
It's pretty amazing how much we can influence these models just by carefully choosing what

253
00:10:31,400 --> 00:10:32,400
we teach them.

254
00:10:32,400 --> 00:10:37,200
It really highlights the importance of the training data and the rules that we set.

255
00:10:37,200 --> 00:10:39,160
It's like raising a kid, right?

256
00:10:39,160 --> 00:10:40,760
You don't just want to say no all the time.

257
00:10:40,760 --> 00:10:41,760
Exactly.

258
00:10:41,760 --> 00:10:43,600
You want to guide them, teach them how to make good choices.

259
00:10:43,600 --> 00:10:47,320
You want them to be good people, but you also want them to be able to experience life and

260
00:10:47,320 --> 00:10:48,320
learn and grow.

261
00:10:48,320 --> 00:10:49,320
Exactly.

262
00:10:49,320 --> 00:10:50,320
You don't want to stifle them.

263
00:10:50,320 --> 00:10:51,320
Right.

264
00:10:51,320 --> 00:10:52,320
Yeah.

265
00:10:52,320 --> 00:10:55,440
So it sounds like they made some really great progress here, but even with all these fancy

266
00:10:55,440 --> 00:11:00,320
rules and all this training, there's still a human element to all of this, right?

267
00:11:00,320 --> 00:11:01,320
Absolutely.

268
00:11:01,320 --> 00:11:06,080
Like someone still has to decide what the rules are in the first place and how to actually

269
00:11:06,080 --> 00:11:08,120
teach those rules to the AI.

270
00:11:08,120 --> 00:11:09,120
You're absolutely right.

271
00:11:09,120 --> 00:11:11,120
And the researchers acknowledge that in the paper.

272
00:11:11,120 --> 00:11:15,160
They talk about how important it is to have a diverse group of people involved in this

273
00:11:15,160 --> 00:11:16,160
process.

274
00:11:16,160 --> 00:11:17,160
So not just AI researchers.

275
00:11:17,160 --> 00:11:18,640
Not just AI researchers, no.

276
00:11:18,640 --> 00:11:22,160
But also ethicists, people from different backgrounds and perspectives.

277
00:11:22,160 --> 00:11:25,760
Yeah, people who can bring their own lived experiences to the table.

278
00:11:25,760 --> 00:11:29,880
So that we can make sure that these AI models are actually reflecting the values that we

279
00:11:29,880 --> 00:11:31,760
want to see in the world.

280
00:11:31,760 --> 00:11:38,280
It's about making sure that AI is a force for good and not just this powerful tool that

281
00:11:38,280 --> 00:11:40,640
could be used for who knows what.

282
00:11:40,640 --> 00:11:41,640
Right.

283
00:11:41,640 --> 00:11:42,640
Exactly.

284
00:11:42,640 --> 00:11:44,280
It's not just about building a super smart AI.

285
00:11:44,280 --> 00:11:48,960
It's about making sure that AI is actually beneficial for society as a whole.

286
00:11:48,960 --> 00:11:52,680
It's about making sure that AI is aligned with our values as humans.

287
00:11:52,680 --> 00:11:53,680
Exactly.

288
00:11:53,680 --> 00:11:54,680
Well said.

289
00:11:54,680 --> 00:11:59,200
So speaking of pushing the boundaries, I know that they also tried some really out there

290
00:11:59,200 --> 00:12:02,800
approaches to training these AI models.

291
00:12:02,800 --> 00:12:03,800
Yeah, they did.

292
00:12:03,800 --> 00:12:08,320
Like what if you didn't teach it to be helpful at all, but just focus purely on safety?

293
00:12:08,320 --> 00:12:12,840
Oh, you're talking about the experiment where they threw out all the helpful training data

294
00:12:12,840 --> 00:12:15,840
and they just used those safety rules.

295
00:12:15,840 --> 00:12:16,840
Yeah.

296
00:12:16,840 --> 00:12:18,000
Did that even work?

297
00:12:18,000 --> 00:12:19,000
You know, it's interesting.

298
00:12:19,000 --> 00:12:21,520
It wasn't a total failure.

299
00:12:21,520 --> 00:12:25,280
The AI was still pretty good at saying no to harmful stuff.

300
00:12:25,280 --> 00:12:26,280
Okay.

301
00:12:26,280 --> 00:12:27,280
So it passed the safety test.

302
00:12:27,280 --> 00:12:29,360
It aced the safety test.

303
00:12:29,360 --> 00:12:31,240
But there was a catch.

304
00:12:31,240 --> 00:12:36,280
It also became way more likely to refuse requests that were totally harmless.

305
00:12:36,280 --> 00:12:37,280
Oh no.

306
00:12:37,280 --> 00:12:40,760
So it's like that super strict teacher who sucks all the fun out of learning.

307
00:12:40,760 --> 00:12:41,760
Exactly.

308
00:12:41,760 --> 00:12:44,120
You just want to ask a simple question and they shut you down.

309
00:12:44,120 --> 00:12:46,040
It's like, can we just have a little fun here?

310
00:12:46,040 --> 00:12:47,040
Exactly.

311
00:12:47,040 --> 00:12:53,000
It really shows how important it is to build a foundation of helpfulness into these models

312
00:12:53,000 --> 00:12:54,000
from the start.

313
00:12:54,000 --> 00:12:56,560
It's not just about teaching them what not to do.

314
00:12:56,560 --> 00:12:59,520
It's about teaching them how to be useful and engaging and helpful.

315
00:12:59,520 --> 00:13:01,400
It's about that balance again, right?

316
00:13:01,400 --> 00:13:02,400
Safety and usefulness.

317
00:13:02,400 --> 00:13:03,400
Yeah.

318
00:13:03,400 --> 00:13:04,400
Yeah.

319
00:13:04,400 --> 00:13:06,800
Like think about it in terms of like raising a child.

320
00:13:06,800 --> 00:13:08,600
You don't want to just tell them no all the time.

321
00:13:08,600 --> 00:13:10,680
You want them to learn and explore.

322
00:13:10,680 --> 00:13:13,240
You want them to experience the world.

323
00:13:13,240 --> 00:13:15,400
And you want them to enjoy the experience.

324
00:13:15,400 --> 00:13:16,400
Oh yeah.

325
00:13:16,400 --> 00:13:17,400
Exactly.

326
00:13:17,400 --> 00:13:18,400
And it's the same with AI.

327
00:13:18,400 --> 00:13:23,120
We want AI that's responsible, but we also want it to be enjoyable to interact with.

328
00:13:23,120 --> 00:13:26,640
We want AI that can make our lives better and easier and more fun.

329
00:13:26,640 --> 00:13:29,760
We want AI that can be a true companion and helper.

330
00:13:29,760 --> 00:13:30,760
Yeah.

331
00:13:30,760 --> 00:13:34,680
So I think we've covered a ton of ground here, really digging into the nitty gritty of these

332
00:13:34,680 --> 00:13:35,680
RBRs.

333
00:13:35,680 --> 00:13:37,760
It's been quite the journey.

334
00:13:37,760 --> 00:13:38,760
It really has.

335
00:13:38,760 --> 00:13:43,440
And I think these experiments show just how much work there still is to be done in AI

336
00:13:43,440 --> 00:13:44,440
safety research.

337
00:13:44,440 --> 00:13:45,440
Oh absolutely.

338
00:13:45,440 --> 00:13:47,840
There's still so much to explore and learn.

339
00:13:47,840 --> 00:13:51,560
It's a complex field, but it's clear that we're making some really exciting progress.

340
00:13:51,560 --> 00:13:52,560
Yeah.

341
00:13:52,560 --> 00:13:53,560
We're definitely moving in the right direction.

342
00:13:53,560 --> 00:13:57,640
So for the last part of our deep dive, I want to kind of zoom out and think about the bigger

343
00:13:57,640 --> 00:13:58,800
picture.

344
00:13:58,800 --> 00:14:01,880
You know, what does all this mean for the future of AI?

345
00:14:01,880 --> 00:14:04,040
What's next for these rule-based systems?

346
00:14:04,040 --> 00:14:21,480
Stay tuned.

347
00:14:21,480 --> 00:14:26,040
This OpenAI paper, it's been a wild ride, really exploring all the different facets

348
00:14:26,040 --> 00:14:28,680
and turns of this approach to making AI safer.

349
00:14:28,680 --> 00:14:30,400
I feel like we really got in the weeds.

350
00:14:30,400 --> 00:14:31,400
We did.

351
00:14:31,400 --> 00:14:32,400
We got into the weeds.

352
00:14:32,400 --> 00:14:33,400
We got down into the nitty gritty.

353
00:14:33,400 --> 00:14:36,960
We talked about the basic idea of these rules, the RBRs and all those experiments.

354
00:14:36,960 --> 00:14:38,560
Yeah, and all those tests.

355
00:14:38,560 --> 00:14:40,320
But now I want to step back.

356
00:14:40,320 --> 00:14:41,320
Zoom out a little.

357
00:14:41,320 --> 00:14:43,720
And look at the big picture, you know.

358
00:14:43,720 --> 00:14:45,920
What does this all mean for the future?

359
00:14:45,920 --> 00:14:47,320
The future of AI.

360
00:14:47,320 --> 00:14:50,080
Yeah, where do we go from here?

361
00:14:50,080 --> 00:14:52,360
That's the million dollar question, isn't it?

362
00:14:52,360 --> 00:14:53,360
It is.

363
00:14:53,360 --> 00:14:56,200
And this paper doesn't give us all the answers.

364
00:14:56,200 --> 00:14:59,560
But it opens up some really interesting possibilities.

365
00:14:59,560 --> 00:15:00,560
It does.

366
00:15:00,560 --> 00:15:02,280
One thing that really struck me was.

367
00:15:02,280 --> 00:15:03,280
What's that?

368
00:15:03,280 --> 00:15:07,480
The RBRs could potentially change the way we train AI.

369
00:15:07,480 --> 00:15:09,400
Yeah, the training process.

370
00:15:09,400 --> 00:15:14,400
Because right now, it takes so much time and effort to get humans to label all that data.

371
00:15:14,400 --> 00:15:16,080
Oh, it's a massive undertaking.

372
00:15:16,080 --> 00:15:17,720
It's a huge job.

373
00:15:17,720 --> 00:15:20,840
And it's not just the time, it's also the potential for bias, right?

374
00:15:20,840 --> 00:15:21,840
Oh, yeah.

375
00:15:21,840 --> 00:15:23,000
Human bias can creep in.

376
00:15:23,000 --> 00:15:24,000
And errors.

377
00:15:24,000 --> 00:15:25,000
So many errors.

378
00:15:25,000 --> 00:15:28,240
But with these RBRs, it's like you're automating that process.

379
00:15:28,240 --> 00:15:29,880
Using AI to train AI.

380
00:15:29,880 --> 00:15:30,880
Exactly.

381
00:15:30,880 --> 00:15:33,000
So you have this other AI judging the responses.

382
00:15:33,000 --> 00:15:34,160
Based on the rules.

383
00:15:34,160 --> 00:15:35,720
And those rules have already been defined.

384
00:15:35,720 --> 00:15:38,600
So you don't need humans to do that tedious labeling work.

385
00:15:38,600 --> 00:15:40,520
Yeah, that frees up the humans.

386
00:15:40,520 --> 00:15:43,040
Frees them up to do more interesting things.

387
00:15:43,040 --> 00:15:44,040
Like what?

388
00:15:44,040 --> 00:15:45,320
Like coming up with better rules.

389
00:15:45,320 --> 00:15:47,040
Yeah, more nuanced rules.

390
00:15:47,040 --> 00:15:48,680
Or tackling those ethical questions.

391
00:15:48,680 --> 00:15:49,680
It's a big question.

392
00:15:49,680 --> 00:15:50,680
Yeah, exactly.

393
00:15:50,680 --> 00:15:53,560
So it's not just about labeling data points.

394
00:15:53,560 --> 00:15:56,680
It's about shaping the overall values of the system.

395
00:15:56,680 --> 00:15:59,320
Guiding the AI, giving it a moral compass.

396
00:15:59,320 --> 00:16:00,320
Exactly.

397
00:16:00,320 --> 00:16:02,400
Which leads to another exciting possibility.

398
00:16:02,400 --> 00:16:03,640
What's that?

399
00:16:03,640 --> 00:16:10,640
What if we could use these RBRs to create AI models that aren't just safe, but also reflect

400
00:16:10,640 --> 00:16:13,040
like a broader set of values?

401
00:16:13,040 --> 00:16:15,280
Oh, I like where you're going with this.

402
00:16:15,280 --> 00:16:17,000
So it's not just about avoiding harm.

403
00:16:17,000 --> 00:16:18,680
It's about promoting good.

404
00:16:18,680 --> 00:16:20,480
Like being fair and unbiased.

405
00:16:20,480 --> 00:16:21,480
Exactly.

406
00:16:21,480 --> 00:16:22,920
Like imagine an AI assistant.

407
00:16:22,920 --> 00:16:24,640
Helpful and harmless.

408
00:16:24,640 --> 00:16:29,320
But also inclusive, respectful of different cultures, sensitive to people's emotions.

409
00:16:29,320 --> 00:16:30,720
AI with empathy.

410
00:16:30,720 --> 00:16:31,720
Yeah.

411
00:16:31,720 --> 00:16:33,880
It's not just about preventing bad things.

412
00:16:33,880 --> 00:16:36,120
It's about actively making things better.

413
00:16:36,120 --> 00:16:37,600
That's a powerful idea.

414
00:16:37,600 --> 00:16:40,440
And I think that's where the real potential of this research lies.

415
00:16:40,440 --> 00:16:42,240
It's not just about minimizing risks.

416
00:16:42,240 --> 00:16:43,800
It's about maximizing benefit.

417
00:16:43,800 --> 00:16:44,800
Yeah.

418
00:16:44,800 --> 00:16:47,200
Unlocking the positive potential of AI.

419
00:16:47,200 --> 00:16:49,040
Of course there are still challenges, right?

420
00:16:49,040 --> 00:16:50,040
Like, a lot of challenges.

421
00:16:50,040 --> 00:16:53,760
Like we've been talking about, these RBRs, they're better suited for tasks where the

422
00:16:53,760 --> 00:16:56,000
rules are really clear.

423
00:16:56,000 --> 00:16:57,000
Objective rules.

424
00:16:57,000 --> 00:16:58,640
But for things like creativity.

425
00:16:58,640 --> 00:16:59,640
Art.

426
00:16:59,640 --> 00:17:00,640
Storytelling.

427
00:17:00,640 --> 00:17:01,640
It's harder to define those rules.

428
00:17:01,640 --> 00:17:03,040
It's much more subjective.

429
00:17:03,040 --> 00:17:04,360
So there's still a lot of work to be done.

430
00:17:04,360 --> 00:17:06,560
Oh yeah, we're just scratching the surface.

431
00:17:06,560 --> 00:17:12,280
Figuring out how to adapt these rule-based approaches to these more nuanced areas.

432
00:17:12,280 --> 00:17:14,360
AI that can write a symphony.

433
00:17:14,360 --> 00:17:15,360
Compose a poem.

434
00:17:15,360 --> 00:17:16,560
Or paint a masterpiece.

435
00:17:16,560 --> 00:17:17,560
Exactly.

436
00:17:17,560 --> 00:17:21,240
It makes you wonder if we need entirely new types of AI, right?

437
00:17:21,240 --> 00:17:23,160
Ones that are less reliant on rules.

438
00:17:23,160 --> 00:17:26,120
And more capable of learning and adapting like humans do.

439
00:17:26,120 --> 00:17:28,200
AI that can think outside the box.

440
00:17:28,200 --> 00:17:29,200
Think for itself.

441
00:17:29,200 --> 00:17:30,200
Be creative.

442
00:17:30,200 --> 00:17:32,680
It's an exciting time to be working on this stuff.

443
00:17:32,680 --> 00:17:33,680
It really is.

444
00:17:33,680 --> 00:17:36,360
It feels like we're at the beginning of a whole new era.

445
00:17:36,360 --> 00:17:38,080
A new chapter in AI.

446
00:17:38,080 --> 00:17:39,320
Where the focus is shifting.

447
00:17:39,320 --> 00:17:41,400
From just building powerful machines.

448
00:17:41,400 --> 00:17:44,160
To building machines that are powerful and beneficial.

449
00:17:44,160 --> 00:17:46,800
AI that's aligned with our values.

450
00:17:46,800 --> 00:17:47,800
Exactly.

451
00:17:47,800 --> 00:17:50,360
But before we get too carried away with all these possibilities.

452
00:17:50,360 --> 00:17:52,080
We need a dose of reality.

453
00:17:52,080 --> 00:17:55,080
It's important to remember that no system is perfect.

454
00:17:55,080 --> 00:17:56,800
No AI is foolproof.

455
00:17:56,800 --> 00:17:58,640
And the researchers are very clear about that.

456
00:17:58,640 --> 00:18:00,160
They acknowledge the limitations.

457
00:18:00,160 --> 00:18:02,720
They say that these RBRs aren't a magic solution.

458
00:18:02,720 --> 00:18:03,720
They're not a silver bullet.

459
00:18:03,720 --> 00:18:06,440
And there's always a chance for unexpected consequences, right?

460
00:18:06,440 --> 00:18:08,800
AI could find loopholes.

461
00:18:08,800 --> 00:18:10,040
Find workarounds.

462
00:18:10,040 --> 00:18:11,040
Outsmart the rules.

463
00:18:11,040 --> 00:18:12,280
It's a constant arms race.

464
00:18:12,280 --> 00:18:13,800
So we have to be vigilant.

465
00:18:13,800 --> 00:18:16,240
Keep researching and testing and refining.

466
00:18:16,240 --> 00:18:17,960
Making sure AI stays on the right track.

467
00:18:17,960 --> 00:18:19,440
It's like a game of chess.

468
00:18:19,440 --> 00:18:20,440
Right.

469
00:18:20,440 --> 00:18:21,440
Against a supercomputer.

470
00:18:21,440 --> 00:18:23,520
You have to think several moves ahead.

471
00:18:23,520 --> 00:18:25,400
Anticipate the AI's strategy.

472
00:18:25,400 --> 00:18:26,400
Come up with countermeasures.

473
00:18:26,400 --> 00:18:29,120
It's a challenge, but it's an exciting challenge.

474
00:18:29,120 --> 00:18:31,040
And that's why this research is so important.

475
00:18:31,040 --> 00:18:32,040
It's pushing the boundaries.

476
00:18:32,040 --> 00:18:33,760
It's showing us what's possible.

477
00:18:33,760 --> 00:18:35,680
And reminding us to be careful.

478
00:18:35,680 --> 00:18:36,680
Exactly.

479
00:18:36,680 --> 00:18:39,040
So I think that's a good place to wrap up our deep dive.

480
00:18:39,040 --> 00:18:40,600
It's been a fascinating conversation.

481
00:18:40,600 --> 00:18:42,600
We've explored these rule-based rewards.

482
00:18:42,600 --> 00:18:43,680
How they work.

483
00:18:43,680 --> 00:18:46,240
What they could mean for the future of AI.

484
00:18:46,240 --> 00:18:49,080
A future where AI is safe and ethical.

485
00:18:49,080 --> 00:18:51,760
A future where AI empowers us all.

486
00:18:51,760 --> 00:18:53,880
And helps us build a better world.

487
00:18:53,880 --> 00:18:54,880
That's the goal.

488
00:18:54,880 --> 00:18:55,880
That's the dream.

489
00:18:55,880 --> 00:18:59,080
And thanks to research like this, we're one step closer.

490
00:18:59,080 --> 00:19:01,360
To making that dream a reality.

491
00:19:01,360 --> 00:19:06,760
So to everyone listening, keep learning, keep asking questions, and stay engaged in the

492
00:19:06,760 --> 00:19:08,840
conversation about AI.

493
00:19:08,840 --> 00:19:11,000
Because the future of AI is up to all of us.

494
00:19:11,000 --> 00:19:12,000
That's right.

495
00:19:12,000 --> 00:19:14,160
And on that note, we'll sign off for today.

496
00:19:14,160 --> 00:19:15,160
Until next time.

497
00:19:15,160 --> 00:19:32,200
Take care, everyone.

