1
00:00:00,000 --> 00:00:02,680
Ever get, like, bad info from an AI?

2
00:00:02,680 --> 00:00:05,560
You're like, wait a minute, how could it be so wrong?

3
00:00:05,560 --> 00:00:08,480
Like these AIs can just hallucinate facts.

4
00:00:08,480 --> 00:00:09,560
And it can be kind of scary,

5
00:00:09,560 --> 00:00:12,480
especially as we use AI more and more for important stuff.

6
00:00:12,480 --> 00:00:13,960
Yeah, absolutely.

7
00:00:13,960 --> 00:00:15,800
And that's what's so interesting about this research

8
00:00:15,800 --> 00:00:17,480
we're gonna deep dive into today.

9
00:00:17,480 --> 00:00:19,880
It's all about measuring how well AI

10
00:00:19,880 --> 00:00:21,400
can tell fact from fiction.

11
00:00:21,400 --> 00:00:24,840
Okay, so like separating the real from the made up

12
00:00:24,840 --> 00:00:26,280
in the AI world.

13
00:00:26,280 --> 00:00:27,120
What's the approach here?

14
00:00:27,120 --> 00:00:28,840
How did they try to figure that out?

15
00:00:28,840 --> 00:00:30,960
Well, researchers at OpenAI actually created

16
00:00:30,960 --> 00:00:33,720
this new benchmark, they call it SimpleQA.

17
00:00:33,720 --> 00:00:36,360
And it's basically a giant test for AI models.

18
00:00:36,360 --> 00:00:38,240
Like think of it as a trivia game,

19
00:00:38,240 --> 00:00:41,200
but specifically designed to test how good AI is

20
00:00:41,200 --> 00:00:43,880
at handling straightforward factual questions.

21
00:00:43,880 --> 00:00:46,800
Ah, so it's not about how fancy the AI can write

22
00:00:46,800 --> 00:00:48,400
or code or anything like that.

23
00:00:48,400 --> 00:00:49,880
It's about whether it actually knows stuff

24
00:00:49,880 --> 00:00:50,720
about the real world.

25
00:00:50,720 --> 00:00:52,440
Exactly, it's all about factual accuracy.

26
00:00:52,440 --> 00:00:53,280
Interesting, so tell me,

27
00:00:53,280 --> 00:00:55,600
how does this SimpleQA thing actually work?

28
00:00:55,600 --> 00:00:58,440
Okay, so imagine like a massive trivia game,

29
00:00:58,440 --> 00:01:00,760
but for AI, right?

30
00:01:00,760 --> 00:01:04,640
SimpleQA has over 4,000 of these short questions.

31
00:01:04,640 --> 00:01:08,280
And each question has a single correct answer.

32
00:01:08,280 --> 00:01:10,200
No room for interpretation.

33
00:01:10,200 --> 00:01:12,720
Oh, so like true or false?

34
00:01:12,720 --> 00:01:15,960
But for AI, no wishy-washy answers allowed.

35
00:01:15,960 --> 00:01:18,560
Yep, it's about as straightforward as it gets.

36
00:01:18,560 --> 00:01:19,880
And the questions they came up with

37
00:01:19,880 --> 00:01:22,000
are surprisingly tricky.

38
00:01:22,000 --> 00:01:25,640
Things like who won the Nobel Prize in Physics in 2022?

39
00:01:25,640 --> 00:01:28,000
Or what's the capital of Australia?

40
00:01:28,000 --> 00:01:30,160
Okay, yeah, those sound pretty simple.

41
00:01:30,160 --> 00:01:31,920
I feel like I should know those.

42
00:01:31,920 --> 00:01:33,280
But I'm guessing they made these questions

43
00:01:33,280 --> 00:01:34,600
a lot tougher than they sound.

44
00:01:34,600 --> 00:01:37,160
Oh yeah, they definitely didn't go easy on the AIs.

45
00:01:37,160 --> 00:01:39,160
They actually designed the questions to be hard,

46
00:01:39,160 --> 00:01:41,480
even for really advanced models like GPT-4.

47
00:01:41,480 --> 00:01:43,920
Wow, so they were really trying to push these AIs

48
00:01:43,920 --> 00:01:44,760
to their limits.

49
00:01:44,760 --> 00:01:46,480
I mean, how did they even come up with questions

50
00:01:46,480 --> 00:01:48,800
that could stump something as smart as GPT-4?

51
00:01:48,800 --> 00:01:51,520
Well, they used this process called adversarial collection,

52
00:01:51,520 --> 00:01:53,080
which basically means they kept tweaking

53
00:01:53,080 --> 00:01:55,600
and refining the questions until even the most advanced

54
00:01:55,600 --> 00:01:57,240
models were struggling to get them right.

55
00:01:57,240 --> 00:01:59,600
So they were determined to find those AI weak spots.

56
00:01:59,600 --> 00:02:01,720
Clever, but okay.

57
00:02:01,720 --> 00:02:04,040
So let's say they've got all these tricky trivia questions

58
00:02:04,040 --> 00:02:04,920
for the AI.

59
00:02:04,920 --> 00:02:07,080
How do they actually know if the AI

60
00:02:07,080 --> 00:02:08,720
is getting the answers right?

61
00:02:08,720 --> 00:02:11,040
Like how do they make sure they're not grading the AI

62
00:02:11,040 --> 00:02:12,440
based on bad information?

63
00:02:12,440 --> 00:02:13,600
That's a really good point.

64
00:02:13,600 --> 00:02:15,680
Data quality is super important.

65
00:02:15,680 --> 00:02:17,200
And the researchers knew that.

66
00:02:17,200 --> 00:02:18,760
So they were incredibly meticulous

67
00:02:18,760 --> 00:02:21,840
about how they made sure the answers themselves were correct.

68
00:02:21,840 --> 00:02:24,080
They used two separate AI trainers

69
00:02:24,080 --> 00:02:26,560
to answer each question independently.

70
00:02:26,560 --> 00:02:28,280
And they only kept the questions

71
00:02:28,280 --> 00:02:30,840
where both AIs gave the same answer.

72
00:02:30,840 --> 00:02:33,080
Oh, so it's like double checking their work.

73
00:02:33,080 --> 00:02:35,840
It's like, okay, you two AIs agree on this

74
00:02:35,840 --> 00:02:37,760
so we can be pretty confident this answer's right?

75
00:02:37,760 --> 00:02:40,920
Yeah, and not only that, but they also insisted

76
00:02:40,920 --> 00:02:43,200
that every single answer had to be backed up

77
00:02:43,200 --> 00:02:45,600
with solid evidence from a reputable source.

78
00:02:45,600 --> 00:02:47,520
So it's like showing your work in math class.

79
00:02:47,520 --> 00:02:49,800
You gotta be able to prove where you got that answer from.

80
00:02:49,800 --> 00:02:50,640
Exactly.

81
00:02:50,640 --> 00:02:52,400
It can't just be like some random guess.

82
00:02:52,400 --> 00:02:54,000
It has to be verifiable.

83
00:02:54,000 --> 00:02:54,840
I like that.

84
00:02:54,840 --> 00:02:55,680
Makes a lot of sense.

85
00:02:55,680 --> 00:02:57,920
So besides this whole double checking thing,

86
00:02:57,920 --> 00:03:00,160
what were some of the other criteria they used

87
00:03:00,160 --> 00:03:01,880
to pick which questions to include?

88
00:03:01,880 --> 00:03:04,720
Like, were there any rules about the kind of questions

89
00:03:04,720 --> 00:03:06,200
that made it into SimpleQA?

90
00:03:06,200 --> 00:03:07,520
Oh yeah, for sure.

91
00:03:07,520 --> 00:03:09,600
They had a few really important rules

92
00:03:09,600 --> 00:03:11,000
for selecting the questions.

93
00:03:11,000 --> 00:03:14,840
First, the question had to have one and only one right answer.

94
00:03:14,840 --> 00:03:17,080
Like a hard fact that doesn't change over time.

95
00:03:17,080 --> 00:03:20,160
Okay, so no trick questions or opinions,

96
00:03:20,160 --> 00:03:22,320
just straight up undeniable truths.

97
00:03:22,320 --> 00:03:23,160
Yep.

98
00:03:23,160 --> 00:03:25,880
And then they also made sure that all the questions

99
00:03:25,880 --> 00:03:28,080
can be answered using information that was available

100
00:03:28,080 --> 00:03:29,960
up to the end of 2023.

101
00:03:29,960 --> 00:03:31,760
So nothing super recent,

102
00:03:31,760 --> 00:03:34,200
that the AI might not have had a chance to learn yet.

103
00:03:34,200 --> 00:03:35,440
Oh, that makes sense.

104
00:03:35,440 --> 00:03:36,880
You don't wanna test the AI on something

105
00:03:36,880 --> 00:03:38,240
it couldn't have possibly known.

106
00:03:38,240 --> 00:03:40,240
Right, and of course the answer had to be provable

107
00:03:40,240 --> 00:03:41,720
with evidence, we already talked about that.

108
00:03:41,720 --> 00:03:42,560
Right, right.

109
00:03:42,560 --> 00:03:46,960
So verifiable facts, no recent events, no trickery.

110
00:03:46,960 --> 00:03:48,280
Okay, got it.

111
00:03:48,280 --> 00:03:49,760
Can you give me some actual examples

112
00:03:49,760 --> 00:03:51,720
of the kinds of questions they ask in SimpleQA?

113
00:03:51,720 --> 00:03:53,000
I'm really curious now.

114
00:03:53,000 --> 00:03:54,600
Absolutely.

115
00:03:54,600 --> 00:03:56,120
Let's see, so table one in their paper

116
00:03:56,120 --> 00:03:57,240
has some good examples.

117
00:03:57,240 --> 00:03:58,720
Like one question is,

118
00:03:58,720 --> 00:04:03,720
who received the IEEE Frank Rosenblatt Award in 2010?

119
00:04:04,480 --> 00:04:06,320
Okay, I'm guessing I'm supposed to know that.

120
00:04:06,320 --> 00:04:08,280
Well, the answer is Michio Sugeno.

121
00:04:08,280 --> 00:04:11,480
Another one is, on which US TV station

122
00:04:11,480 --> 00:04:14,920
did the Canadian reality series To Serve and Protect debut?

123
00:04:14,920 --> 00:04:18,040
Oh wow, see that's way too specific for me, I have no idea.

124
00:04:18,040 --> 00:04:20,520
And the answer to that one is KVOS-TV.

125
00:04:20,520 --> 00:04:22,320
Okay, so they're not messing around with these questions.

126
00:04:22,320 --> 00:04:24,480
They really span a pretty wide range of topics too.

127
00:04:24,480 --> 00:04:25,560
It sounds like they were trying to make sure

128
00:04:25,560 --> 00:04:27,600
the AIs had to know a little bit of everything,

129
00:04:27,600 --> 00:04:30,160
like science, history, pop culture.

130
00:04:30,160 --> 00:04:31,320
Did they mention if they focused

131
00:04:31,320 --> 00:04:33,160
on any specific subject areas?

132
00:04:33,160 --> 00:04:34,520
Not really, they were really going

133
00:04:34,520 --> 00:04:36,000
for breadth of knowledge with this.

134
00:04:36,000 --> 00:04:37,840
Like they wanted to make sure these AIs

135
00:04:37,840 --> 00:04:39,640
had a good understanding of facts

136
00:04:39,640 --> 00:04:41,160
across a whole bunch of different areas.

137
00:04:41,160 --> 00:04:43,600
It's kind of like Jeopardy, but for AI.

138
00:04:43,600 --> 00:04:46,400
I like it, an AI game show.

139
00:04:46,400 --> 00:04:47,240
That'd be fun to watch.

140
00:04:47,240 --> 00:04:48,680
Where do they even get all this information

141
00:04:48,680 --> 00:04:50,360
from, all these facts and figures?

142
00:04:50,360 --> 00:04:52,480
They used a variety of sources,

143
00:04:52,480 --> 00:04:54,800
but they tried to stick to really reliable ones.

144
00:04:54,800 --> 00:04:57,320
So Wikipedia was their go-to.

145
00:04:57,320 --> 00:05:00,320
But they also pulled from things like fandom websites

146
00:05:00,320 --> 00:05:02,880
and academic articles, even IMDb

147
00:05:02,880 --> 00:05:05,120
for some of the movie and TV stuff.

148
00:05:05,120 --> 00:05:08,320
Right, so they weren't just like Googling random trivia.

149
00:05:08,320 --> 00:05:09,640
They were actually trying to make sure

150
00:05:09,640 --> 00:05:12,080
the AI was learning from trustworthy sources.

151
00:05:12,080 --> 00:05:13,080
Yeah, exactly.

152
00:05:13,080 --> 00:05:15,200
So we've got these super tricky trivia questions,

153
00:05:15,200 --> 00:05:16,880
all fact checked and everything.

154
00:05:16,880 --> 00:05:19,960
But how did they actually grade the AI's answers?

155
00:05:19,960 --> 00:05:23,200
Was it just like right or wrong, or was there more to it?

156
00:05:23,200 --> 00:05:24,880
They actually used three different grades.

157
00:05:24,880 --> 00:05:28,520
So you had correct, incorrect, and then not attempted.

158
00:05:28,520 --> 00:05:29,640
Not attempted.

159
00:05:29,640 --> 00:05:30,480
That's interesting.

160
00:05:30,480 --> 00:05:32,720
Why would an AI not even try to answer a question?

161
00:05:32,720 --> 00:05:34,720
Well, sometimes an AI might recognize

162
00:05:34,720 --> 00:05:36,640
that it just doesn't know the answer.

163
00:05:36,640 --> 00:05:38,480
And it doesn't want to just guess randomly.

164
00:05:38,480 --> 00:05:39,640
So it might say something like,

165
00:05:39,640 --> 00:05:41,400
I don't know or I need more information.

166
00:05:41,400 --> 00:05:44,560
And in those cases, they would mark it as not attempted.

167
00:05:44,560 --> 00:05:46,320
So a little bit of AI humility there.

168
00:05:46,320 --> 00:05:47,240
Probably a good thing.

169
00:05:47,240 --> 00:05:48,080
Yeah.

170
00:05:48,080 --> 00:05:48,920
Right.

171
00:05:48,920 --> 00:05:50,360
So what are some examples of how the grading worked

172
00:05:50,360 --> 00:05:52,200
for the correct and incorrect answers?

173
00:05:52,200 --> 00:05:53,040
Sure.

174
00:05:53,040 --> 00:05:55,720
So Table Two in the paper has some good examples.

175
00:05:55,720 --> 00:05:57,800
Let's say there's a question about the Dutch player

176
00:05:57,800 --> 00:06:01,600
who scored in the 2022 World Cup game against Argentina.

177
00:06:01,600 --> 00:06:05,240
The correct answer is Wout Weghorst.

178
00:06:05,240 --> 00:06:07,480
So any response that includes his full name,

179
00:06:07,480 --> 00:06:09,960
even if it adds some extra details about the game,

180
00:06:09,960 --> 00:06:12,160
would be marked as correct.

181
00:06:12,160 --> 00:06:15,480
Okay, so as long as the AI gets that key fact right,

182
00:06:15,480 --> 00:06:16,400
it's good.

183
00:06:16,400 --> 00:06:18,120
What about an incorrect example?

184
00:06:18,120 --> 00:06:19,920
Okay, so for that same question,

185
00:06:19,920 --> 00:06:21,920
if the AI answered Virgil van Dijk,

186
00:06:21,920 --> 00:06:24,840
or even if it said like Virgil van Dijk and Wout Weghorst,

187
00:06:24,840 --> 00:06:26,280
it would be considered incorrect

188
00:06:26,280 --> 00:06:29,120
because it's contradicting that single correct answer.

189
00:06:29,120 --> 00:06:30,960
They're really strict about there only being

190
00:06:30,960 --> 00:06:32,440
one verifiable answer.

191
00:06:32,440 --> 00:06:33,440
Got it.

192
00:06:33,440 --> 00:06:34,480
No partial credit here.

193
00:06:34,480 --> 00:06:36,560
This all sounds super labor intensive.

194
00:06:36,560 --> 00:06:38,920
Did they have actual humans grading

195
00:06:38,920 --> 00:06:40,320
all of these AI responses?

196
00:06:40,320 --> 00:06:43,320
Actually, they use AI to grade the other AI.

197
00:06:43,320 --> 00:06:45,080
Whoa, AI inception, I like it.

198
00:06:45,080 --> 00:06:47,960
Yeah, they created a special ChatGPT classifier.

199
00:06:47,960 --> 00:06:49,400
And gave it detailed instructions

200
00:06:49,400 --> 00:06:50,680
on how to grade the answers.

201
00:06:50,680 --> 00:06:52,840
They even gave it examples for each category.

202
00:06:52,840 --> 00:06:54,480
So it knew exactly what to look for.

203
00:06:54,480 --> 00:06:56,920
So meta, an AI grading other AI.

204
00:06:56,920 --> 00:06:59,040
Did they have any humans involved in this at all though?

205
00:06:59,040 --> 00:06:59,880
Oh yeah, for sure.

206
00:06:59,880 --> 00:07:02,120
Humans were still definitely part of the process.

207
00:07:02,120 --> 00:07:05,360
So they had a third AI trainer

208
00:07:05,360 --> 00:07:07,880
answer a random sample of 1,000 questions

209
00:07:07,880 --> 00:07:10,000
just to kind of double check the accuracy

210
00:07:10,000 --> 00:07:11,440
of the whole benchmark itself.

211
00:07:11,440 --> 00:07:14,560
And then they also had humans review any questions

212
00:07:14,560 --> 00:07:16,440
where the AI graders disagreed,

213
00:07:16,440 --> 00:07:19,000
or if the AI flagged any potential problems.

214
00:07:19,000 --> 00:07:20,840
So there was still that human oversight

215
00:07:20,840 --> 00:07:22,520
to make sure everything was running smoothly.

216
00:07:22,520 --> 00:07:25,000
Exactly, it wasn't just AIs gone wild.

217
00:07:25,000 --> 00:07:25,840
Good to know.

218
00:07:25,840 --> 00:07:27,520
Did they say anything about how accurate

219
00:07:27,520 --> 00:07:28,960
they think the benchmark itself is?

220
00:07:28,960 --> 00:07:30,760
Like what's the margin of error here?

221
00:07:30,760 --> 00:07:32,400
They estimated that the benchmark has

222
00:07:32,400 --> 00:07:36,600
about a 3% error rate, which is actually really impressive

223
00:07:36,600 --> 00:07:39,600
when you consider how big and complex this whole project was.

224
00:07:39,600 --> 00:07:41,360
They did find that some of the questions

225
00:07:41,360 --> 00:07:43,240
were maybe a little bit ambiguous,

226
00:07:43,240 --> 00:07:45,160
or that even some reputable sources

227
00:07:45,160 --> 00:07:47,000
would sometimes have conflicting information.

228
00:07:47,000 --> 00:07:48,120
Right, it happens.

229
00:07:48,120 --> 00:07:50,600
Even human experts disagree sometimes.

230
00:07:50,600 --> 00:07:52,200
So they've got this massive data set

231
00:07:52,200 --> 00:07:55,120
of challenging trivia questions for AI,

232
00:07:55,120 --> 00:07:57,120
all graded and double checked.

233
00:07:57,120 --> 00:07:58,480
What did they actually do with all this?

234
00:07:58,480 --> 00:08:00,880
How did they use this SimpleQA thing

235
00:08:00,880 --> 00:08:02,840
to test different AI models?

236
00:08:02,840 --> 00:08:04,920
So they basically used SimpleQA

237
00:08:04,920 --> 00:08:07,960
to test a whole bunch of different AI models,

238
00:08:07,960 --> 00:08:10,320
kind of like an AI Olympics, you know?

239
00:08:10,320 --> 00:08:12,240
See how they all stacked up against each other.

240
00:08:12,240 --> 00:08:13,520
Oh, I like that.

241
00:08:13,520 --> 00:08:15,240
So it's a benchmark.

242
00:08:15,240 --> 00:08:17,160
To see which AIs are the trivia champs,

243
00:08:17,160 --> 00:08:19,200
did any of them get a perfect score?

244
00:08:19,200 --> 00:08:20,440
Not even close.

245
00:08:20,440 --> 00:08:23,320
Even the most advanced models, like GPT-4,

246
00:08:23,320 --> 00:08:25,840
they still struggled with some of the questions.

247
00:08:25,840 --> 00:08:29,240
So no AI geniuses just yet.

248
00:08:29,240 --> 00:08:31,880
Makes you wonder how hard it really is to build an AI

249
00:08:31,880 --> 00:08:34,560
that can actually understand and remember facts

250
00:08:34,560 --> 00:08:35,640
the way humans do.

251
00:08:35,640 --> 00:08:37,160
Yeah, it's a really tough challenge.

252
00:08:37,160 --> 00:08:39,120
And this research definitely highlights that.

253
00:08:39,120 --> 00:08:41,480
So how did the different models actually perform?

254
00:08:41,480 --> 00:08:44,520
Did they notice any interesting patterns or differences

255
00:08:44,520 --> 00:08:46,800
between how the different AIs handled the questions?

256
00:08:46,800 --> 00:08:47,760
Yeah, definitely.

257
00:08:47,760 --> 00:08:50,160
So they tested a bunch of different versions of GPT-4,

258
00:08:50,160 --> 00:08:52,480
and then they also tested this model called Claude,

259
00:08:52,480 --> 00:08:54,360
which is made by a company called Anthropic.

260
00:08:54,360 --> 00:08:55,880
And one thing they found was that generally

261
00:08:55,880 --> 00:08:58,960
the bigger, more complex models did better on SimpleQA.

262
00:08:58,960 --> 00:09:00,320
Okay, that makes sense.

263
00:09:00,320 --> 00:09:01,720
The more parameters, the more data.

264
00:09:01,720 --> 00:09:04,040
The smarter the AI is gonna be.

265
00:09:04,040 --> 00:09:05,720
But were there any surprises,

266
00:09:05,720 --> 00:09:09,200
any models that did better or worse than you might expect?

267
00:09:09,200 --> 00:09:10,960
There were, actually.

268
00:09:10,960 --> 00:09:13,720
So one interesting thing they found was that the Claude models

269
00:09:13,720 --> 00:09:16,520
were a lot more cautious than the GPT-4 models.

270
00:09:16,520 --> 00:09:18,760
Like they were way more likely to say,

271
00:09:18,760 --> 00:09:22,520
I don't know, rather than risk getting the answer wrong.

272
00:09:22,520 --> 00:09:25,720
Oh, so it's like Claude was playing it safe

273
00:09:25,720 --> 00:09:28,440
while GPT-4 was more willing to take a gamble,

274
00:09:28,440 --> 00:09:29,840
even if it meant making a mistake.

275
00:09:29,840 --> 00:09:30,960
That's kind of funny when you think about it,

276
00:09:30,960 --> 00:09:33,000
like different AI personalities.

277
00:09:33,000 --> 00:09:34,800
Did they have a term for this cautiousness?

278
00:09:34,800 --> 00:09:36,520
Yeah, they called it calibration.

279
00:09:36,520 --> 00:09:39,080
So in this context, calibration basically means

280
00:09:39,080 --> 00:09:41,560
how well an AI can judge its own likelihood

281
00:09:41,560 --> 00:09:43,000
of being right or wrong.

282
00:09:43,000 --> 00:09:44,680
Kind of like knowing when to bet big

283
00:09:44,680 --> 00:09:45,920
and when to fold in poker.

284
00:09:45,920 --> 00:09:49,440
Ah, so a well-calibrated AI knows its limits.

285
00:09:49,440 --> 00:09:52,360
While an overconfident AI might just bluff its way through

286
00:09:52,360 --> 00:09:53,720
even when it has no clue.

287
00:09:54,680 --> 00:09:55,520
Interesting.

288
00:09:55,520 --> 00:09:57,280
So how did they measure this calibration thing?

289
00:09:57,280 --> 00:09:59,480
Well, they actually used two different methods.

290
00:09:59,480 --> 00:10:02,200
So first, they straight up asked the AI

291
00:10:02,200 --> 00:10:04,360
how confident it was in its answer.

292
00:10:04,360 --> 00:10:05,920
Like they would use prompts like,

293
00:10:05,920 --> 00:10:07,960
on a scale of zero to 100%,

294
00:10:07,960 --> 00:10:09,720
how sure are you about this answer?

295
00:10:09,720 --> 00:10:11,520
So they were making the AI rate

296
00:10:11,520 --> 00:10:13,680
its own confidence level.

297
00:10:13,680 --> 00:10:15,360
That's pretty clever.

298
00:10:15,360 --> 00:10:16,520
What did they find?

299
00:10:16,520 --> 00:10:18,320
Were any of the AIs actually good

300
00:10:18,320 --> 00:10:19,960
at judging their own knowledge?

301
00:10:19,960 --> 00:10:21,360
Well, it turns out most models,

302
00:10:21,360 --> 00:10:22,800
even the really advanced ones,

303
00:10:22,800 --> 00:10:24,840
tend to be a bit overconfident.

304
00:10:24,840 --> 00:10:28,000
Like they often think they know more than they actually do.

305
00:10:28,000 --> 00:10:30,560
Classic case of AI hubris.

306
00:10:30,560 --> 00:10:32,400
But hey, at least they're optimistic.

307
00:10:32,400 --> 00:10:34,480
What was the second way they measured calibration?

308
00:10:34,480 --> 00:10:37,320
Okay, so for this one, they got a little creative.

309
00:10:37,320 --> 00:10:40,720
They asked the same question to each AI model 100 times,

310
00:10:40,720 --> 00:10:41,880
but with little variations

311
00:10:41,880 --> 00:10:43,720
in how the question was phrased each time.

312
00:10:43,720 --> 00:10:45,520
Oh, so they were trying to see

313
00:10:45,520 --> 00:10:48,160
if the AI would give the same answer consistently.

314
00:10:48,160 --> 00:10:50,640
Even when the wording of the question was slightly different.

315
00:10:50,640 --> 00:10:52,240
Exactly, because the idea is that

316
00:10:52,240 --> 00:10:54,000
if an AI keeps giving the same answer,

317
00:10:54,000 --> 00:10:56,160
even when the question changes a little bit,

318
00:10:56,160 --> 00:10:58,120
it's probably more confident in that answer.

319
00:10:58,120 --> 00:10:59,480
Right, so they looked at how often

320
00:10:59,480 --> 00:11:01,640
each AI gave consistent answers

321
00:11:01,640 --> 00:11:04,440
and compared that to how accurate it actually was.

322
00:11:04,440 --> 00:11:05,960
So if an AI gave the same answer,

323
00:11:05,960 --> 00:11:07,880
like 80 times out of 100,

324
00:11:07,880 --> 00:11:10,800
you'd want to see it getting at least 80% of those answers

325
00:11:10,800 --> 00:11:13,720
right to consider it well calibrated. Makes sense.

326
00:11:13,720 --> 00:11:15,720
What did they find when they did this experiment?

327
00:11:15,720 --> 00:11:18,440
So it turns out that accuracy did generally go up

328
00:11:18,440 --> 00:11:20,640
when the AI was giving more consistent answers,

329
00:11:20,640 --> 00:11:21,920
which is a good sign.

330
00:11:21,920 --> 00:11:23,160
And there was this one model,

331
00:11:23,160 --> 00:11:25,160
o1-preview from OpenAI,

332
00:11:25,160 --> 00:11:27,960
that really stood out as being particularly well calibrated.

333
00:11:27,960 --> 00:11:29,360
Like its answer consistency

334
00:11:29,360 --> 00:11:31,440
pretty much matched up with its actual accuracy.

335
00:11:31,440 --> 00:11:33,920
Oh, so o1-preview had a good sense

336
00:11:33,920 --> 00:11:35,880
of when it knew something for sure.

337
00:11:35,880 --> 00:11:37,600
And when it was just guessing,

338
00:11:37,600 --> 00:11:40,480
did that mean it aced the whole SimpleQA test though?

339
00:11:40,480 --> 00:11:42,320
Not necessarily, remember calibration

340
00:11:42,320 --> 00:11:43,680
is just one piece of the puzzle.

341
00:11:43,680 --> 00:11:45,200
It doesn't guarantee that an AI

342
00:11:45,200 --> 00:11:46,760
will get every answer right.

343
00:11:46,760 --> 00:11:48,800
Right, it's like saying someone is really good

344
00:11:48,800 --> 00:11:50,560
at knowing what they don't know.

345
00:11:50,560 --> 00:11:52,440
But that doesn't mean they're a genius.

346
00:11:52,440 --> 00:11:55,680
They still need to have the actual knowledge to back it up.

347
00:11:55,680 --> 00:11:57,440
Did they look at any other aspects

348
00:11:57,440 --> 00:11:59,680
of AI factuality in this research?

349
00:11:59,680 --> 00:12:01,200
They did, actually.

350
00:12:01,200 --> 00:12:02,880
One of the most interesting parts of this research

351
00:12:02,880 --> 00:12:05,560
was not just measuring AI factuality,

352
00:12:05,560 --> 00:12:08,400
but also starting to really understand its limitations.

353
00:12:08,400 --> 00:12:09,720
And that's something we can dig into

354
00:12:09,720 --> 00:12:11,640
a little deeper in the last part of our deep dive.

355
00:12:11,640 --> 00:12:12,720
All right, so we've talked about

356
00:12:12,720 --> 00:12:14,880
how this SimpleQA thing works

357
00:12:14,880 --> 00:12:16,800
and how they grade the AI's answers.

358
00:12:16,800 --> 00:12:20,000
And even how they measure whether an AI knows when

359
00:12:20,000 --> 00:12:21,280
it's bluffing or not.

360
00:12:21,280 --> 00:12:23,440
But you mentioned there's some limits to all this.

361
00:12:23,440 --> 00:12:25,480
What are some of the things that SimpleQA

362
00:12:25,480 --> 00:12:27,960
doesn't really tell us about AI and facts?

363
00:12:27,960 --> 00:12:28,880
Yeah, that's a great point.

364
00:12:28,880 --> 00:12:31,120
I mean, SimpleQA is a great starting point,

365
00:12:31,120 --> 00:12:32,920
but it definitely has its boundaries.

366
00:12:32,920 --> 00:12:34,880
It's important to remember that it was really designed

367
00:12:34,880 --> 00:12:37,640
to zero in on those short factual questions

368
00:12:37,640 --> 00:12:39,840
with one clear right answer,

369
00:12:39,840 --> 00:12:41,480
kind of like a trivia game for AI.

370
00:12:41,480 --> 00:12:44,640
Right, very controlled, very specific.

371
00:12:44,640 --> 00:12:46,040
But what's the problem with that?

372
00:12:46,040 --> 00:12:47,760
Isn't that a good way to start understanding

373
00:12:47,760 --> 00:12:49,480
how AI handles facts?

374
00:12:49,480 --> 00:12:50,600
It is, for sure.

375
00:12:50,600 --> 00:12:51,880
But we also have to keep in mind

376
00:12:51,880 --> 00:12:54,120
that the real world is a lot messier

377
00:12:54,120 --> 00:12:56,680
and a lot more complex than a trivia game.

378
00:12:56,680 --> 00:12:59,120
The researchers themselves even point out

379
00:12:59,120 --> 00:13:02,040
that just because an AI can answer these trivia questions

380
00:13:02,040 --> 00:13:03,680
doesn't necessarily mean it can handle

381
00:13:03,680 --> 00:13:06,880
like longer, more nuanced tasks,

382
00:13:06,880 --> 00:13:09,160
like writing a factual news article

383
00:13:09,160 --> 00:13:12,000
or summarizing a complicated research paper.

384
00:13:12,000 --> 00:13:14,400
So SimpleQA is kind of like a basic skills test.

385
00:13:14,400 --> 00:13:16,000
You have to pass it before you can move on

386
00:13:16,000 --> 00:13:18,400
to the harder stuff, but acing the basics

387
00:13:18,400 --> 00:13:19,840
doesn't mean you're gonna be a star performer

388
00:13:19,840 --> 00:13:20,680
in the real world.

389
00:13:20,680 --> 00:13:24,960
Exactly, and even if an AI does really well on SimpleQA,

390
00:13:24,960 --> 00:13:27,120
it doesn't mean that it's like totally immune

391
00:13:27,120 --> 00:13:28,160
to making things up.

392
00:13:28,160 --> 00:13:30,240
It just means it's less likely to get things wrong

393
00:13:30,240 --> 00:13:31,800
in this very specific context.

394
00:13:31,800 --> 00:13:32,640
Right.

395
00:13:32,640 --> 00:13:33,480
Got it.

396
00:13:33,480 --> 00:13:36,120
So we can't just assume an AI is like a perfectly factual

397
00:13:36,120 --> 00:13:38,520
oracle based on how it does on one test.

398
00:13:38,520 --> 00:13:40,840
We need to keep coming up with new ways to evaluate

399
00:13:40,840 --> 00:13:42,800
and improve AI's grasp of facts

400
00:13:42,800 --> 00:13:45,560
across all sorts of different tasks and complexities.

401
00:13:45,560 --> 00:13:47,880
Where do you think research in this area should go next?

402
00:13:47,880 --> 00:13:49,960
What are some of the big questions we still need to answer?

403
00:13:49,960 --> 00:13:52,240
That's the million dollar question.

404
00:13:52,240 --> 00:13:53,720
I think one of the biggest challenges

405
00:13:53,720 --> 00:13:56,840
is figuring out how to evaluate AI factuality

406
00:13:56,840 --> 00:13:59,480
in more open-ended situations.

407
00:13:59,480 --> 00:14:01,120
Like how do you measure accuracy

408
00:14:01,120 --> 00:14:03,920
when there isn't just one single right answer?

409
00:14:03,920 --> 00:14:07,240
Think about summarizing a long document or writing an essay.

410
00:14:07,240 --> 00:14:09,800
Can we develop benchmarks for those kinds of tasks?

411
00:14:09,800 --> 00:14:12,120
Yeah, that sounds really tough.

412
00:14:12,120 --> 00:14:15,440
It's like going from multiple choice to essay questions,

413
00:14:15,440 --> 00:14:16,680
way harder to grade.

414
00:14:16,680 --> 00:14:19,840
Any other areas you think are important for future research?

415
00:14:19,840 --> 00:14:20,680
Definitely.

416
00:14:20,680 --> 00:14:22,240
I think we also need to understand

417
00:14:22,240 --> 00:14:25,080
why these AI systems sometimes make stuff up.

418
00:14:25,080 --> 00:14:26,680
Like what's going on under the hood?

419
00:14:26,680 --> 00:14:28,400
Is it because of how they're trained?

420
00:14:28,400 --> 00:14:29,920
Or is there something more fundamental

421
00:14:29,920 --> 00:14:31,600
about how they process information

422
00:14:31,600 --> 00:14:33,720
that makes them prone to these errors?

423
00:14:33,720 --> 00:14:35,040
If we can answer those questions,

424
00:14:35,040 --> 00:14:37,920
I think we'll be a lot closer to developing AI systems

425
00:14:37,920 --> 00:14:40,000
that are truly reliable and trustworthy.

426
00:14:40,000 --> 00:14:41,800
Yeah, makes sense.

427
00:14:41,800 --> 00:14:44,120
If you know why something breaks,

428
00:14:44,120 --> 00:14:46,120
you're more likely to be able to fix it.

429
00:14:46,120 --> 00:14:47,800
Well, this has been a fascinating deep dive

430
00:14:47,800 --> 00:14:49,840
into the world of AI and facts.

431
00:14:49,840 --> 00:14:51,560
Really appreciate your insights on this.

432
00:14:51,560 --> 00:14:52,400
My pleasure.

433
00:14:52,400 --> 00:14:54,680
Any final thoughts for our listeners before we wrap up?

434
00:14:54,680 --> 00:14:57,480
Hmm, I guess the main takeaway is that this whole AI

435
00:14:57,480 --> 00:14:59,920
factuality thing is a really complex issue.

436
00:14:59,920 --> 00:15:01,720
And it's definitely still a work in progress.

437
00:15:01,720 --> 00:15:04,760
We're making some good headway with tools like SimpleQA.

438
00:15:04,760 --> 00:15:06,080
But there's still a lot more to do.

439
00:15:06,080 --> 00:15:07,480
And I think it's important for everyone,

440
00:15:07,480 --> 00:15:10,160
not just AI experts, to understand these limitations

441
00:15:10,160 --> 00:15:11,400
and to be part of the conversation

442
00:15:11,400 --> 00:15:13,880
about how we can make sure AI is used responsibly

443
00:15:13,880 --> 00:15:15,080
and ethically.

444
00:15:15,080 --> 00:15:16,640
You know, it's a big deal.

445
00:15:16,640 --> 00:15:17,800
Absolutely.

446
00:15:17,800 --> 00:15:18,800
Well said.

447
00:15:18,800 --> 00:15:20,200
All right, on that note, we're gonna wrap up

448
00:15:20,200 --> 00:15:21,360
this episode of the Deep Dive.

449
00:15:21,360 --> 00:15:24,160
Thanks for joining us on this exploration of AI and facts.

450
00:15:24,160 --> 00:15:27,600
And until next time, stay curious.