1
00:00:00,000 --> 00:00:04,600
Okay, so ready to dive into some AI evaluation today?

2
00:00:04,600 --> 00:00:08,320
It's one thing to build a cool algorithm,

3
00:00:08,320 --> 00:00:10,560
but it's something else entirely to know if it works.

4
00:00:10,560 --> 00:00:11,560
Totally, and you know what?

5
00:00:11,560 --> 00:00:14,040
It's like anyone who's worked on a real AI project knows

6
00:00:14,920 --> 00:00:17,040
evaluation can be kind of a pain, right?

7
00:00:17,040 --> 00:00:18,960
It's like something people just tack on at the end,

8
00:00:18,960 --> 00:00:21,760
but these posts we're looking at by Hamel Husain,

9
00:00:21,760 --> 00:00:22,600
they really say like,

10
00:00:22,600 --> 00:00:25,000
nope, gotta bake that in from the start.

11
00:00:25,000 --> 00:00:26,040
It's gotta be part of the process.

12
00:00:26,040 --> 00:00:27,080
Yeah, from day one.

13
00:00:27,080 --> 00:00:28,560
Exactly, and he doesn't just, like,

14
00:00:28,560 --> 00:00:30,280
talk about it generally.

15
00:00:30,280 --> 00:00:33,880
He uses this case study, this AI assistant, Lucy.

16
00:00:33,880 --> 00:00:35,320
Yeah, Lucy, she was built

17
00:00:35,320 --> 00:00:37,960
for this real estate company, ReChat,

18
00:00:37,960 --> 00:00:42,960
and they used her to help agents automate some tasks.

19
00:00:43,560 --> 00:00:44,440
Things were going great,

20
00:00:44,440 --> 00:00:47,600
but then Lucy kind of plateaued.

21
00:00:47,600 --> 00:00:49,320
Her performance wasn't getting any better.

22
00:00:49,320 --> 00:00:50,160
They hit a wall.

23
00:00:50,160 --> 00:00:52,360
Yeah, and that's when they realized,

24
00:00:52,360 --> 00:00:53,920
okay, we need a system,

25
00:00:53,920 --> 00:00:55,400
like a real way to figure out

26
00:00:55,400 --> 00:00:58,320
what's holding Lucy back and how to improve.

27
00:00:58,320 --> 00:00:59,440
How to make her better.

28
00:00:59,440 --> 00:01:01,480
Exactly, and what I thought was really interesting

29
00:01:01,480 --> 00:01:04,960
is how he breaks this whole evaluation process

30
00:01:04,960 --> 00:01:07,040
down into like three levels.

31
00:01:07,040 --> 00:01:08,640
It's like you're leveling up in a game.

32
00:01:08,640 --> 00:01:09,480
I like that.

33
00:01:09,480 --> 00:01:10,320
Yeah.

34
00:01:10,320 --> 00:01:11,160
So like what's the first level?

35
00:01:11,160 --> 00:01:12,720
What's the first step on this ladder?

36
00:01:12,720 --> 00:01:15,080
So the first level is all about these things

37
00:01:15,080 --> 00:01:16,240
called unit tests.

38
00:01:16,240 --> 00:01:17,640
So this is where you're really

39
00:01:17,640 --> 00:01:19,240
in the nitty gritty of the code, right?

40
00:01:19,240 --> 00:01:21,240
Making sure all the little pieces work on their own

41
00:01:21,240 --> 00:01:22,120
like they should.

42
00:01:22,120 --> 00:01:24,640
Okay, so this is like, if I'm writing an essay,

43
00:01:24,640 --> 00:01:26,240
I'm checking for typos,

44
00:01:26,240 --> 00:01:28,480
I'm checking for all that before I submit.

45
00:01:28,480 --> 00:01:29,600
Yeah, it's a good analogy.

46
00:01:29,600 --> 00:01:31,080
He's saying he gives this example

47
00:01:31,080 --> 00:01:33,480
of how ReChat uses unit tests

48
00:01:33,480 --> 00:01:36,160
to make sure Lucy doesn't accidentally

49
00:01:36,160 --> 00:01:37,680
give away sensitive info,

50
00:01:37,680 --> 00:01:40,680
like those unique ID codes, UUIDs.

51
00:01:40,680 --> 00:01:42,640
So they wanna make sure Lucy's not blabbing

52
00:01:42,640 --> 00:01:43,800
about confidential stuff.

53
00:01:43,800 --> 00:01:44,920
Right, keep it all under wraps.

54
00:01:44,920 --> 00:01:46,680
Exactly, gotta protect those secrets.

55
00:01:46,680 --> 00:01:47,600
Makes sense.

56
00:01:47,600 --> 00:01:49,760
So level two, what's that about?

57
00:01:49,760 --> 00:01:51,320
Okay, so level two,

58
00:01:51,320 --> 00:01:53,000
this is where it gets really interesting.

59
00:01:53,000 --> 00:01:55,800
This is where you start bringing in humans,

60
00:01:55,800 --> 00:01:56,880
human judgment.

61
00:01:56,880 --> 00:02:00,040
You're actually looking at how the AI is doing

62
00:02:00,040 --> 00:02:02,360
and having humans say like, yep, good job,

63
00:02:02,360 --> 00:02:04,240
or nope, not so much.

64
00:02:04,240 --> 00:02:06,880
So it's like having a teacher grade my essay now, right?

65
00:02:06,880 --> 00:02:07,720
Yeah.

66
00:02:07,720 --> 00:02:09,520
They're giving me feedback on my writing and stuff.

67
00:02:09,520 --> 00:02:10,360
You got it.

68
00:02:10,360 --> 00:02:13,040
And Husain, he really emphasizes how important it is

69
00:02:13,040 --> 00:02:15,520
to make this process, you know,

70
00:02:15,520 --> 00:02:17,960
smooth, easy for these humans.

71
00:02:17,960 --> 00:02:18,800
Yeah.

72
00:02:18,800 --> 00:02:20,000
He even suggests building custom tools

73
00:02:20,000 --> 00:02:22,240
just to help people quickly review the AI's work

74
00:02:22,240 --> 00:02:23,120
and give feedback.

75
00:02:23,120 --> 00:02:24,400
You don't want them to get bogged down

76
00:02:24,400 --> 00:02:26,760
by a bunch of technical stuff.

77
00:02:26,760 --> 00:02:28,880
Right, you want them to focus on what matters

78
00:02:28,880 --> 00:02:30,880
is the AI doing a good job or not.

79
00:02:30,880 --> 00:02:31,720
Exactly.

80
00:02:31,720 --> 00:02:33,880
And what's cool, he talks about using,

81
00:02:33,880 --> 00:02:37,840
get this, LLMs to automate part of this human evaluation.

82
00:02:37,840 --> 00:02:40,560
Like an AI is helping grade the essays.

83
00:02:40,560 --> 00:02:41,960
Wait, that's kind of meta, isn't it?

84
00:02:41,960 --> 00:02:44,160
An AI evaluating another AI?

85
00:02:44,160 --> 00:02:46,640
It is, it is, but if you're already working with AI,

86
00:02:46,640 --> 00:02:48,160
might as well make it work for you, right?

87
00:02:48,160 --> 00:02:48,980
Yeah, I guess so.

88
00:02:48,980 --> 00:02:50,120
Streamlining the whole thing.

89
00:02:50,120 --> 00:02:51,640
Now the third level,

90
00:02:51,640 --> 00:02:54,240
this is what Husain calls like the gold standard.

91
00:02:54,240 --> 00:02:55,080
Okay.

92
00:02:55,080 --> 00:02:55,900
A/B testing.

93
00:02:55,900 --> 00:02:56,960
Oh yeah, A/B testing, I've heard of that.

94
00:02:56,960 --> 00:02:58,400
So that's where you're putting it out there

95
00:02:58,400 --> 00:02:59,240
in the real world.

96
00:02:59,240 --> 00:03:00,680
Yes.

97
00:03:00,680 --> 00:03:03,720
You're releasing different versions to actual users

98
00:03:03,720 --> 00:03:05,200
and seeing how they react.

99
00:03:05,200 --> 00:03:06,720
So you're seeing what works, what doesn't work.

100
00:03:06,720 --> 00:03:08,760
Exactly, you're getting real world data,

101
00:03:08,760 --> 00:03:11,320
which at the end of the day is what really matters.

102
00:03:11,320 --> 00:03:13,080
Your AI could pass all the tests,

103
00:03:13,080 --> 00:03:15,400
get great feedback from humans,

104
00:03:15,400 --> 00:03:17,360
but if the users don't like it.

105
00:03:17,360 --> 00:03:18,680
Right, if it doesn't actually help them.

106
00:03:18,680 --> 00:03:19,800
Right, it's a flop.

107
00:03:19,800 --> 00:03:22,160
A/B testing helps you catch those little things

108
00:03:22,160 --> 00:03:25,120
that might not show up in a lab setting

109
00:03:25,120 --> 00:03:27,080
because the real world is messy.

110
00:03:27,080 --> 00:03:28,240
Yeah, totally.

111
00:03:28,240 --> 00:03:30,200
So we've got our three levels.

112
00:03:30,200 --> 00:03:35,320
Unit tests, human and model eval, and A/B testing.

113
00:03:35,320 --> 00:03:37,920
Okay, but before we even get to those stages,

114
00:03:37,920 --> 00:03:40,560
Husain mentions this thing called critique shadowing.

115
00:03:40,560 --> 00:03:41,440
What is that?

116
00:03:41,440 --> 00:03:42,840
Okay, so critique shadowing,

117
00:03:42,840 --> 00:03:44,440
this is a really cool way to build

118
00:03:44,440 --> 00:03:46,440
what he calls an LLM judge.

119
00:03:46,440 --> 00:03:47,920
So you have your expert, right?

120
00:03:47,920 --> 00:03:49,560
They're making those pass/fail judgments

121
00:03:49,560 --> 00:03:50,880
like we talked about in level two.

122
00:03:50,880 --> 00:03:51,880
Okay, sounds familiar.

123
00:03:51,880 --> 00:03:53,240
But here's the twist.

124
00:03:53,240 --> 00:03:55,120
They don't just give a thumbs up or down,

125
00:03:55,120 --> 00:03:57,200
they also gotta explain why.

126
00:03:57,200 --> 00:04:00,040
Ah, so it's like not just yes or no,

127
00:04:00,040 --> 00:04:02,240
but yes because or no because.

128
00:04:02,240 --> 00:04:03,440
You got it, what did they like,

129
00:04:03,440 --> 00:04:05,080
what didn't they like, all the details.

130
00:04:05,080 --> 00:04:06,640
So you're capturing their thinking,

131
00:04:06,640 --> 00:04:08,160
not just the final decision.

132
00:04:08,160 --> 00:04:10,200
And then you take all these critiques

133
00:04:10,200 --> 00:04:12,960
and use them to train your LLM judge.

134
00:04:12,960 --> 00:04:15,400
So the LLM is learning from the expert,

135
00:04:15,400 --> 00:04:17,360
trying to see the world through their eyes.

136
00:04:17,360 --> 00:04:18,200
That's pretty cool.

137
00:04:18,200 --> 00:04:20,280
It is, and it helps you uncover all these things

138
00:04:20,280 --> 00:04:23,200
that we just assume when we interact with AI,

139
00:04:23,200 --> 00:04:25,360
things we don't even realize we're expecting.

140
00:04:25,360 --> 00:04:26,240
Like hidden rules.

141
00:04:26,240 --> 00:04:27,480
Yeah, exactly.

142
00:04:27,480 --> 00:04:30,200
Like an apprentice learning from a master craftsman.

143
00:04:30,200 --> 00:04:33,320
But in this case, the craft is judging AI.

144
00:04:33,320 --> 00:04:34,680
I like that analogy.

145
00:04:34,680 --> 00:04:36,720
So the benefits of this LLM judge,

146
00:04:36,720 --> 00:04:39,400
they go beyond just grading the AI.

147
00:04:39,400 --> 00:04:41,200
You can use it to make the AI better, right?

148
00:04:41,200 --> 00:04:43,240
Absolutely, it's not just about giving a score,

149
00:04:43,240 --> 00:04:45,920
it's about giving feedback that helps the AI

150
00:04:45,920 --> 00:04:48,120
learn and improve, like a personal tutor.

151
00:04:48,120 --> 00:04:49,080
Nice.

152
00:04:49,080 --> 00:04:50,720
And the best part, he says,

153
00:04:50,720 --> 00:04:55,040
a lot of the work you do for this whole evaluation process,

154
00:04:55,040 --> 00:04:57,840
you can use it again later for fixing bugs

155
00:04:57,840 --> 00:04:58,800
and improving the model.

156
00:04:58,800 --> 00:05:00,080
Yeah, super efficient.

157
00:05:00,080 --> 00:05:02,200
It's like you're getting double the bang for your buck.

158
00:05:02,200 --> 00:05:03,280
That's awesome.

159
00:05:03,280 --> 00:05:04,680
Now you mentioned that this approach

160
00:05:04,680 --> 00:05:09,520
forces you to define what good actually means for your AI.

161
00:05:09,520 --> 00:05:10,360
What does that mean?

162
00:05:10,360 --> 00:05:11,200
Yeah, you can't just be like,

163
00:05:11,200 --> 00:05:12,520
oh, I'll know it when I see it.

164
00:05:12,520 --> 00:05:13,640
You gotta actually say, okay,

165
00:05:13,640 --> 00:05:15,560
this is what success looks like.

166
00:05:15,560 --> 00:05:19,360
What are the specific things that make this AI good?

167
00:05:19,360 --> 00:05:21,680
So you're not just building an AI that works.

168
00:05:21,680 --> 00:05:23,880
You're building one that works well

169
00:05:23,880 --> 00:05:26,080
and you're defining what well means.

170
00:05:26,080 --> 00:05:27,040
Right, right.

171
00:05:27,040 --> 00:05:30,480
And that process of figuring out what well means,

172
00:05:30,480 --> 00:05:33,120
it can actually help you build a better product.

173
00:05:33,120 --> 00:05:35,120
Because you're really thinking about what's important,

174
00:05:35,120 --> 00:05:36,640
what the users care about.

175
00:05:36,640 --> 00:05:38,400
So evaluation isn't just a chore,

176
00:05:38,400 --> 00:05:40,000
it can actually make your AI better.

177
00:05:40,000 --> 00:05:40,840
Totally.

178
00:05:40,840 --> 00:05:42,800
It's all about finding those hidden gems

179
00:05:42,800 --> 00:05:45,480
that make your AI really shine.

180
00:05:45,480 --> 00:05:46,840
That makes a lot of sense.

181
00:05:46,840 --> 00:05:48,320
Okay, so we've talked about

182
00:05:48,320 --> 00:05:50,320
like the foundations of AI evaluation.

183
00:05:50,320 --> 00:05:51,160
Yeah, the basics.

184
00:05:51,160 --> 00:05:52,720
But where does this fit in the bigger picture?

185
00:05:52,720 --> 00:05:54,000
What's the end goal here?

186
00:05:54,000 --> 00:05:57,360
Now this is where your LLM judge really starts to pay off.

187
00:05:57,360 --> 00:05:59,400
You mean like putting it to work.

188
00:05:59,400 --> 00:06:01,360
Exactly, you can use it to evaluate

189
00:06:01,360 --> 00:06:02,840
all those user interactions,

190
00:06:02,840 --> 00:06:05,080
the real ones and those synthetic ones we talked about.

191
00:06:05,080 --> 00:06:06,680
So now you can see the big picture.

192
00:06:06,680 --> 00:06:09,000
How's the AI actually doing out in the wild?

193
00:06:09,000 --> 00:06:09,840
Totally.

194
00:06:09,840 --> 00:06:12,400
It's not just about individual judgments anymore.

195
00:06:12,400 --> 00:06:15,640
It's like looking for patterns, trends

196
00:06:15,640 --> 00:06:18,240
across your entire dataset.

197
00:06:18,240 --> 00:06:20,960
Husain, he recommends breaking down your analysis,

198
00:06:20,960 --> 00:06:23,880
looking at things like different user types,

199
00:06:23,880 --> 00:06:25,680
different situations, different features.

200
00:06:25,680 --> 00:06:27,760
So like you might discover, for example,

201
00:06:27,760 --> 00:06:31,160
your AI struggles with new users who are,

202
00:06:31,160 --> 00:06:33,400
I don't know, getting multiple matches

203
00:06:33,400 --> 00:06:35,680
compared to expert users who are getting no matches.

204
00:06:35,680 --> 00:06:36,520
Exactly.

205
00:06:36,520 --> 00:06:38,560
That kind of detail helps you pinpoint

206
00:06:38,560 --> 00:06:40,360
where your AI needs work.

207
00:06:40,360 --> 00:06:42,520
It's like a heat map showing you the hotspots

208
00:06:42,520 --> 00:06:43,880
where things are going wrong.

209
00:06:43,880 --> 00:06:45,400
Okay, I'm seeing how this gets really useful.

210
00:06:45,400 --> 00:06:48,160
So once you know where those errors are, what's next?

211
00:06:48,160 --> 00:06:50,200
Time to put on your detective hat.

212
00:06:50,200 --> 00:06:52,040
Do some error analysis.

213
00:06:52,040 --> 00:06:54,680
Husain says look at examples of each type of error,

214
00:06:54,680 --> 00:06:55,960
classify them by hand.

215
00:06:55,960 --> 00:06:56,800
Okay.

216
00:06:56,800 --> 00:06:58,840
He even uses a spreadsheet to keep track of everything.

217
00:06:58,840 --> 00:07:01,280
So it's like categorizing clues to figure out,

218
00:07:01,280 --> 00:07:03,000
you know, why the crime happened.

219
00:07:03,000 --> 00:07:03,840
Exactly.

220
00:07:03,840 --> 00:07:05,400
You might find a lot of errors are happening

221
00:07:05,400 --> 00:07:08,080
because I don't know, the AI is missing some context

222
00:07:08,080 --> 00:07:09,440
about the user.

223
00:07:09,440 --> 00:07:11,880
Or maybe it's not giving clear error messages,

224
00:07:11,880 --> 00:07:12,720
things like that.

225
00:07:12,720 --> 00:07:14,120
It's all about those little details, right?

226
00:07:14,120 --> 00:07:16,240
That can make or break the user experience.

227
00:07:16,240 --> 00:07:18,440
Right, and as you're classifying those errors,

228
00:07:18,440 --> 00:07:19,840
you'll start to see patterns.

229
00:07:19,840 --> 00:07:23,160
Like maybe 40% are because users need more guidance,

230
00:07:23,160 --> 00:07:25,680
30% are tied to login issues.

231
00:07:25,680 --> 00:07:26,840
That's great.

232
00:07:26,840 --> 00:07:28,760
So now you know where to focus your energy

233
00:07:28,760 --> 00:07:30,280
to make the biggest improvements.

234
00:07:30,280 --> 00:07:32,960
And the best part, Husain says you can get a lot done

235
00:07:32,960 --> 00:07:36,600
in just 15 minutes of, you know, focused error analysis.

236
00:07:36,600 --> 00:07:37,440
Okay.

237
00:07:37,440 --> 00:07:38,280
Of course it might take longer

238
00:07:38,280 --> 00:07:40,960
if you have a ton of data, but even a quick analysis

239
00:07:40,960 --> 00:07:42,040
can be really helpful.

240
00:07:42,040 --> 00:07:42,880
That's good to know.

241
00:07:42,880 --> 00:07:44,920
I was picturing like hours and hours

242
00:07:44,920 --> 00:07:46,120
staring at a spreadsheet.

243
00:07:46,120 --> 00:07:48,080
So once you've analyzed your errors,

244
00:07:48,080 --> 00:07:50,720
figured out the root causes, now you can fix them.

245
00:07:50,720 --> 00:07:52,680
So it's like another round of edits

246
00:07:52,680 --> 00:07:56,400
on that important paper, polishing it up, making it shine.

247
00:07:56,400 --> 00:07:57,440
Exactly.

248
00:07:57,440 --> 00:07:59,240
And remember those unit tests.

249
00:07:59,240 --> 00:08:00,560
This is where you add new ones

250
00:08:00,560 --> 00:08:02,320
to make sure those errors don't come back.

251
00:08:02,320 --> 00:08:03,960
Like patching up the leaks in your boat

252
00:08:03,960 --> 00:08:06,040
so you can sail smoothly.

253
00:08:06,040 --> 00:08:08,200
Now what if you find certain types of errors

254
00:08:08,200 --> 00:08:10,240
that are just like really tricky

255
00:08:10,240 --> 00:08:13,120
for your general LLM judge to catch?

256
00:08:13,120 --> 00:08:14,080
Hmm.

257
00:08:14,080 --> 00:08:15,840
I guess you need a specialist.

258
00:08:15,840 --> 00:08:17,080
Precisely.

259
00:08:17,080 --> 00:08:19,760
You can create specialized judges

260
00:08:19,760 --> 00:08:22,480
that are, you know, experts in specific areas.

261
00:08:22,480 --> 00:08:24,240
It's like having a team of specialists,

262
00:08:24,240 --> 00:08:26,600
a cardiologist, neurologist, and so on.

263
00:08:26,600 --> 00:08:27,440
But I'm guessing you don't want

264
00:08:27,440 --> 00:08:28,560
to jump straight to specialists.

265
00:08:28,560 --> 00:08:29,400
Right.

266
00:08:29,400 --> 00:08:32,240
Husain's big on starting with that general purpose judge,

267
00:08:32,240 --> 00:08:34,800
doing that thorough error analysis.

268
00:08:34,800 --> 00:08:38,880
That way you know where those specialists are needed most.

269
00:08:38,880 --> 00:08:39,720
Makes sense.

270
00:08:39,720 --> 00:08:40,560
Don't want to bring in the whole team

271
00:08:40,560 --> 00:08:41,720
if only one doctor is needed.

272
00:08:41,720 --> 00:08:42,560
Exactly.

273
00:08:42,560 --> 00:08:44,840
And one thing I really like about his approach

274
00:08:44,840 --> 00:08:46,680
is how iterative it is, you know?

275
00:08:46,680 --> 00:08:49,600
It's this cycle of evaluation, analysis, refinement.

276
00:08:49,600 --> 00:08:52,480
It's not about getting it perfect on the first try.

277
00:08:52,480 --> 00:08:54,880
It's about learning and improving as you go.

278
00:08:54,880 --> 00:08:56,640
It's like training for a marathon.

279
00:08:56,640 --> 00:08:59,080
You don't run 26 miles on day one.

280
00:08:59,080 --> 00:09:00,640
Right. You build up slowly.

281
00:09:00,640 --> 00:09:01,960
And what really stood out to me

282
00:09:01,960 --> 00:09:05,920
was how much he emphasized actually looking at your data.

283
00:09:05,920 --> 00:09:07,520
Not just building fancy models,

284
00:09:07,520 --> 00:09:10,120
but really understanding what the data is telling you.

285
00:09:10,120 --> 00:09:12,040
He even said the real value

286
00:09:12,040 --> 00:09:15,160
comes from that close analysis of your data.

287
00:09:15,160 --> 00:09:17,960
The LLM judge is just a tool to get you there.

288
00:09:17,960 --> 00:09:18,800
Right.

289
00:09:18,800 --> 00:09:19,920
He's basically saying,

290
00:09:19,920 --> 00:09:22,000
don't get so caught up in the tools

291
00:09:22,000 --> 00:09:24,200
that you forget the fundamentals.

292
00:09:24,200 --> 00:09:26,640
So it's about that data-driven mindset.

293
00:09:26,640 --> 00:09:28,400
Let the data be your guide.

294
00:09:28,400 --> 00:09:29,240
Exactly.

295
00:09:29,240 --> 00:09:32,560
And for anyone feeling a bit overwhelmed by all this,

296
00:09:32,560 --> 00:09:34,800
Husain has some great advice.

297
00:09:34,800 --> 00:09:36,080
Start simple.

298
00:09:36,080 --> 00:09:37,560
Use the tools you already have.

299
00:09:37,560 --> 00:09:39,960
Don't get bogged down in fancy frameworks.

300
00:09:39,960 --> 00:09:42,520
And remember, there's no shame in asking for help.

301
00:09:42,520 --> 00:09:44,280
There's tons of resources out there,

302
00:09:44,280 --> 00:09:46,800
and the AI community is super helpful.

303
00:09:46,800 --> 00:09:48,960
Like he's saying, just dive in, start experimenting.

304
00:09:48,960 --> 00:09:50,960
You don't need to be an expert to get started.

305
00:09:50,960 --> 00:09:51,880
Exactly.

306
00:09:51,880 --> 00:09:52,800
Now we've covered a lot,

307
00:09:52,800 --> 00:09:54,160
but there's one more piece of the puzzle

308
00:09:54,160 --> 00:09:55,000
we need to talk about.

309
00:09:55,000 --> 00:09:57,320
Okay, so we've talked about all the technical stuff

310
00:09:57,320 --> 00:09:58,640
of AI evaluation.

311
00:09:58,640 --> 00:09:59,920
Yeah, the nuts and bolts.

312
00:09:59,920 --> 00:10:01,480
But where does it all fit?

313
00:10:01,480 --> 00:10:02,640
What's the big picture here?

314
00:10:02,640 --> 00:10:04,400
What are we really aiming for?

315
00:10:04,400 --> 00:10:06,200
That's a great question.

316
00:10:06,200 --> 00:10:09,120
And Husain, he doesn't just give us a toolbox

317
00:10:09,120 --> 00:10:10,400
and say, good luck.

318
00:10:10,400 --> 00:10:13,600
He wants us to think about how all of this connects

319
00:10:13,600 --> 00:10:15,960
to the bigger goals we have for AI.

320
00:10:15,960 --> 00:10:17,840
So it's not just about making AI a little bit better,

321
00:10:17,840 --> 00:10:18,880
it's about something bigger.

322
00:10:18,880 --> 00:10:20,640
Yeah, exactly.

323
00:10:20,640 --> 00:10:25,440
Imagine if AI products had to meet a much higher standard.

324
00:10:25,440 --> 00:10:27,120
Not just how well they work,

325
00:10:27,120 --> 00:10:29,800
but also how they impact people,

326
00:10:29,800 --> 00:10:31,440
how they impact society.

327
00:10:31,440 --> 00:10:34,000
So we're talking about going beyond just like,

328
00:10:34,000 --> 00:10:35,480
accuracy and efficiency.

329
00:10:35,480 --> 00:10:38,360
Right, it's about things like transparency and fairness

330
00:10:38,360 --> 00:10:41,320
and whether people can trust these systems.

331
00:10:41,320 --> 00:10:43,120
And that's where those detailed critiques

332
00:10:43,120 --> 00:10:44,920
from experts become so important.

333
00:10:44,920 --> 00:10:46,760
Because they give you that human feedback

334
00:10:46,760 --> 00:10:48,320
that shapes how the AI behaves.

335
00:10:48,320 --> 00:10:49,800
So it's like you're building a conscience

336
00:10:49,800 --> 00:10:50,640
into the whole process.

337
00:10:50,640 --> 00:10:53,080
Yeah, you're not just creating intelligent machines,

338
00:10:53,080 --> 00:10:54,840
you're creating responsible machines.

339
00:10:54,840 --> 00:10:55,680
I like that.

340
00:10:55,680 --> 00:10:58,320
When you combine that human element with the power of LLMs

341
00:10:58,320 --> 00:10:59,800
to learn and adapt,

342
00:10:59,800 --> 00:11:02,000
you start to see the potential for AI

343
00:11:02,000 --> 00:11:04,960
that is not just smarter, but also more ethical.

344
00:11:04,960 --> 00:11:07,080
Okay, that's a really exciting idea,

345
00:11:07,080 --> 00:11:09,680
but it also feels like a huge responsibility.

346
00:11:09,680 --> 00:11:10,920
It is a big responsibility.

347
00:11:10,920 --> 00:11:12,920
So how do we actually make this happen?

348
00:11:12,920 --> 00:11:16,280
Well, Husain, he leaves us with this really powerful question.

349
00:11:16,280 --> 00:11:19,560
He asks, how can we use all these evaluation techniques,

350
00:11:19,560 --> 00:11:21,680
not just to improve performance,

351
00:11:21,680 --> 00:11:25,080
but to make sure that AI is developed and used responsibly?

352
00:11:25,080 --> 00:11:28,280
So it's a call to action for everyone who's working in AI?

353
00:11:28,280 --> 00:11:31,680
Exactly, a reminder that we're not just building cool tech,

354
00:11:31,680 --> 00:11:33,240
we're shaping the future.

355
00:11:33,240 --> 00:11:34,960
And we have a responsibility to get it right,

356
00:11:34,960 --> 00:11:36,880
to make sure the future is good for everyone.

357
00:11:36,880 --> 00:11:38,520
Couldn't have said it better myself.

358
00:11:38,520 --> 00:11:39,960
So for all of you listening out there

359
00:11:39,960 --> 00:11:42,000
who are both excited about AI

360
00:11:42,000 --> 00:11:44,640
and a little bit worried about the risks,

361
00:11:44,640 --> 00:11:46,520
this deep dive has given you a roadmap.

362
00:11:46,520 --> 00:11:48,480
Build those evaluation systems,

363
00:11:48,480 --> 00:11:49,960
get your hands dirty with the data,

364
00:11:49,960 --> 00:11:52,080
and never stop asking the tough questions.

365
00:11:52,080 --> 00:11:54,480
Let's build the future of AI together, responsibly.

366
00:11:54,480 --> 00:11:55,320
Absolutely.

367
00:11:55,320 --> 00:11:57,080
And that's a wrap for today's deep dive.

368
00:11:57,080 --> 00:12:26,080
Thanks for joining us.