1
00:00:00,000 --> 00:00:05,000
Okay, so today we're diving into like the foundation of AI,

2
00:00:05,720 --> 00:00:07,800
training data sets, the huge amount of data

3
00:00:07,800 --> 00:00:10,240
that all these amazing AI models we hear about.

4
00:00:10,240 --> 00:00:11,520
Yeah. Learn from.

5
00:00:11,520 --> 00:00:12,880
You sent over this paper,

6
00:00:13,960 --> 00:00:16,920
Bridging the Data Provenance Gap Across Text,

7
00:00:16,920 --> 00:00:18,480
Speech and Video.

8
00:00:18,480 --> 00:00:19,320
Yeah.

9
00:00:19,320 --> 00:00:20,760
It's a massive study.

10
00:00:20,760 --> 00:00:22,680
Like the biggest audit of its kind.

11
00:00:22,680 --> 00:00:23,520
Right.

12
00:00:23,520 --> 00:00:25,280
Covering like almost 4,000 data sets.

13
00:00:25,280 --> 00:00:27,320
It is a really extensive study

14
00:00:27,320 --> 00:00:29,440
and it's really the first of its kind

15
00:00:29,440 --> 00:00:31,640
to take this sort of comprehensive look.

16
00:00:31,640 --> 00:00:34,360
Across text, speech, and video.

17
00:00:34,360 --> 00:00:35,200
Yeah.

18
00:00:35,200 --> 00:00:36,040
So that's pretty crazy.

19
00:00:36,040 --> 00:00:36,880
So that's basically saying,

20
00:00:36,880 --> 00:00:38,960
where is AI getting its education from?

21
00:00:38,960 --> 00:00:39,960
Yeah, exactly.

22
00:00:39,960 --> 00:00:41,960
I mean, you think about like how we learn, you know,

23
00:00:41,960 --> 00:00:44,040
as humans through books and experiences.

24
00:00:44,040 --> 00:00:44,880
Yeah.

25
00:00:44,880 --> 00:00:46,160
This is essentially AI's version of that.

26
00:00:46,160 --> 00:00:48,080
So this is how it's learning about the world.

27
00:00:48,080 --> 00:00:49,360
And it's really interesting to look

28
00:00:49,360 --> 00:00:51,000
at where that data is coming from.

29
00:00:51,000 --> 00:00:51,960
Yeah, totally.

30
00:00:51,960 --> 00:00:53,560
So what were some of the things that kind of jumped out

31
00:00:53,560 --> 00:00:55,120
at you when you first read this study?

32
00:00:55,120 --> 00:00:57,120
Well, one of the first things that struck me

33
00:00:57,120 --> 00:01:00,880
was the sheer volume of data that's coming from the internet.

34
00:01:00,880 --> 00:01:03,800
We're talking about web crawled data,

35
00:01:03,800 --> 00:01:06,480
you know, websites that are scraped for information,

36
00:01:06,480 --> 00:01:08,680
social media posts,

37
00:01:08,680 --> 00:01:11,560
even content that's been synthetically generated.

38
00:01:11,560 --> 00:01:12,720
Wait, synthetically generated?

39
00:01:12,720 --> 00:01:13,560
Yeah.

40
00:01:13,560 --> 00:01:14,400
What do you mean?

41
00:01:14,400 --> 00:01:16,200
So this is content that's actually created by AI.

42
00:01:16,200 --> 00:01:17,040
Oh, wow.

43
00:01:17,040 --> 00:01:18,240
So AI is learning from AI.

44
00:01:18,240 --> 00:01:19,080
Yes.

45
00:01:19,080 --> 00:01:19,920
That's interesting.

46
00:01:19,920 --> 00:01:21,120
And that's a growing trend actually.

47
00:01:21,120 --> 00:01:21,960
Really?

48
00:01:21,960 --> 00:01:22,800
Yeah.

49
00:01:22,800 --> 00:01:23,640
And it's fascinating, right?

50
00:01:23,640 --> 00:01:25,240
It creates this sort of feedback loop

51
00:01:25,240 --> 00:01:27,720
where AI is learning from itself.

52
00:01:27,720 --> 00:01:29,120
But it also raises questions.

53
00:01:29,120 --> 00:01:33,320
You know, if AI is learning from content created by other AI,

54
00:01:33,320 --> 00:01:35,720
are we just perpetuating existing biases

55
00:01:35,720 --> 00:01:37,480
or could we even be amplifying them?

56
00:01:37,480 --> 00:01:38,320
Right, yeah.

57
00:01:38,320 --> 00:01:39,160
That makes total sense.

58
00:01:39,160 --> 00:01:43,080
So when we talk about speech and video data,

59
00:01:43,080 --> 00:01:44,920
like what are the main sources for that?

60
00:01:44,920 --> 00:01:47,360
Is it still, like is YouTube still the king?

61
00:01:47,360 --> 00:01:49,040
YouTube is definitely a major player,

62
00:01:49,040 --> 00:01:50,600
especially for video data.

63
00:01:50,600 --> 00:01:53,720
The study found that almost a million hours of video data

64
00:01:53,720 --> 00:01:56,120
used in AI training comes from YouTube.

65
00:01:56,120 --> 00:01:59,080
That's way more than like movies or audio books,

66
00:01:59,080 --> 00:01:59,920
for example.

67
00:01:59,920 --> 00:02:00,760
Interesting.

68
00:02:00,760 --> 00:02:02,200
So when it comes to speech, then, what are we looking at?

69
00:02:02,200 --> 00:02:04,160
So speech data is a little bit more diverse.

70
00:02:04,160 --> 00:02:05,280
You've got YouTube, of course.

71
00:02:05,280 --> 00:02:06,200
You've got audio books.

72
00:02:06,200 --> 00:02:09,040
You've got things like podcasts, lectures,

73
00:02:09,040 --> 00:02:10,600
even things like voice recordings

74
00:02:10,600 --> 00:02:12,520
from customer service interactions.

75
00:02:12,520 --> 00:02:13,280
Wow.

76
00:02:13,280 --> 00:02:15,800
OK, so we've got AI learning from all

77
00:02:15,800 --> 00:02:19,840
of these different places, from web pages, social media,

78
00:02:19,840 --> 00:02:21,760
even other AI.

79
00:02:21,760 --> 00:02:24,960
YouTube is just like a massive digital melting pot.

80
00:02:24,960 --> 00:02:26,680
Yeah, it really is.

81
00:02:26,680 --> 00:02:29,320
But can anyone just grab this data and use it?

82
00:02:29,320 --> 00:02:31,480
Isn't it all publicly available?

83
00:02:31,480 --> 00:02:33,480
That's where things get really interesting

84
00:02:33,480 --> 00:02:34,680
and a little bit messy.

85
00:02:34,680 --> 00:02:35,040
OK.

86
00:02:35,040 --> 00:02:37,320
The research uncovered a huge issue

87
00:02:37,320 --> 00:02:40,120
with what they call undocumented restrictions.

88
00:02:40,120 --> 00:02:41,640
Undocumented restrictions.

89
00:02:41,640 --> 00:02:42,880
What does that even mean?

90
00:02:42,880 --> 00:02:45,360
So basically, over 80% of the content

91
00:02:45,360 --> 00:02:46,560
used in these data sets.

92
00:02:46,560 --> 00:02:47,680
Across all formats.

93
00:02:47,680 --> 00:02:50,800
Across all formats has some kind of hidden limitation.

94
00:02:50,800 --> 00:02:53,800
Buried in the fine print of the original source material.

95
00:02:53,800 --> 00:02:56,640
OK, so even if a data set has an open license,

96
00:02:56,640 --> 00:02:59,600
the actual data itself might not be free to use.

97
00:02:59,600 --> 00:03:00,560
Exactly.

98
00:03:00,560 --> 00:03:02,200
And this is a really big problem,

99
00:03:02,200 --> 00:03:05,280
especially for AI developers who might be

100
00:03:05,280 --> 00:03:08,120
assuming that they have the right to use certain data,

101
00:03:08,120 --> 00:03:10,280
when in fact they don't. A lot of this content...

102
00:03:10,280 --> 00:03:11,520
Especially if they're scraping it

103
00:03:11,520 --> 00:03:12,680
from these different websites.

104
00:03:12,680 --> 00:03:13,320
Right.

105
00:03:13,320 --> 00:03:16,240
Especially stuff scraped from places like YouTube,

106
00:03:16,240 --> 00:03:18,480
you know, might have restrictions against commercial use

107
00:03:18,480 --> 00:03:20,480
even if the data set itself seems open.

108
00:03:20,480 --> 00:03:21,320
Wow.

109
00:03:21,320 --> 00:03:23,960
So that's a whole other layer of complexity.

110
00:03:23,960 --> 00:03:27,680
So AI developers could be unknowingly training

111
00:03:27,680 --> 00:03:30,400
their models on data that they're not even allowed to use.

112
00:03:30,400 --> 00:03:31,080
That's right.

113
00:03:31,080 --> 00:03:33,400
And we're already seeing lawsuits pop up

114
00:03:33,400 --> 00:03:35,960
around AI training and copyright infringement.

115
00:03:35,960 --> 00:03:38,560
So this is a really serious issue.

116
00:03:38,560 --> 00:03:41,160
So it's not just about finding the data.

117
00:03:41,160 --> 00:03:43,440
It's about understanding the legal complexities that

118
00:03:43,440 --> 00:03:44,240
come with it.

119
00:03:44,240 --> 00:03:47,840
So who is actually doing all of this data sourcing for AI?

120
00:03:47,840 --> 00:03:50,080
Is it all like big tech companies just

121
00:03:50,080 --> 00:03:51,880
gathering up all this information?

122
00:03:51,880 --> 00:03:54,080
It's actually a lot more diverse than you might think,

123
00:03:54,080 --> 00:03:57,360
especially when it comes to publicly available data sets.

124
00:03:57,360 --> 00:03:57,880
Oh, really?

125
00:03:57,880 --> 00:03:58,800
Yeah.

126
00:03:58,800 --> 00:04:02,040
For speech and video data sets, academic institutions

127
00:04:02,040 --> 00:04:03,680
are actually the biggest players.

128
00:04:03,680 --> 00:04:04,320
Interesting.

129
00:04:04,320 --> 00:04:05,960
I would have thought it was big tech.

130
00:04:05,960 --> 00:04:08,240
Why are universities so involved in this?

131
00:04:08,240 --> 00:04:10,680
Well, I think it's a combination of factors.

132
00:04:10,680 --> 00:04:14,960
Academia has a long tradition of open research and data

133
00:04:14,960 --> 00:04:15,520
sharing.

134
00:04:15,520 --> 00:04:16,040
Yeah.

135
00:04:16,040 --> 00:04:18,640
And they often have the expertise and the resources

136
00:04:18,640 --> 00:04:20,680
to create high quality data sets,

137
00:04:20,680 --> 00:04:23,000
even if those data sets are smaller in scale

138
00:04:23,000 --> 00:04:25,960
than the ones created by large corporations.

139
00:04:25,960 --> 00:04:30,160
They're responsible for over 70% of video data set creation.

140
00:04:30,160 --> 00:04:30,800
Wow.

141
00:04:30,800 --> 00:04:31,520
That's incredible.

142
00:04:31,520 --> 00:04:32,000
Yeah.

143
00:04:32,000 --> 00:04:35,080
So big tech might have the scale, but academia brings

144
00:04:35,080 --> 00:04:37,760
the expertise and the focus on research.

145
00:04:37,760 --> 00:04:39,320
What about text data sets?

146
00:04:39,320 --> 00:04:40,920
Who's leading the charge there?

147
00:04:40,920 --> 00:04:43,160
Text data sets are a lot more diverse.

148
00:04:43,160 --> 00:04:45,640
You've got university researchers, of course.

149
00:04:45,640 --> 00:04:48,600
But you also have industry labs, research groups,

150
00:04:48,600 --> 00:04:52,800
even startups contributing to the creation of text data sets.

151
00:04:52,800 --> 00:04:55,120
So it's not just like one monolithic entity?

152
00:04:55,120 --> 00:04:56,280
No, not at all.

153
00:04:56,280 --> 00:04:58,120
It's much more of a collaborative effort.

154
00:04:58,120 --> 00:05:00,040
And probably contributing to those licensing issues.

155
00:05:00,040 --> 00:05:01,040
You're absolutely right.

156
00:05:01,040 --> 00:05:02,200
That's a key point.

157
00:05:02,200 --> 00:05:02,680
Yeah.

158
00:05:02,680 --> 00:05:04,680
The researchers highlighted that the more players

159
00:05:04,680 --> 00:05:08,000
involved, the more complex that licensing landscape becomes.

160
00:05:08,000 --> 00:05:09,600
Yeah, for sure.

161
00:05:09,600 --> 00:05:11,560
Let's explore that in more detail in the next segment.

162
00:05:11,560 --> 00:05:11,800
OK.

163
00:05:11,800 --> 00:05:13,960
Yeah, it really is a double-edged sword, right?

164
00:05:13,960 --> 00:05:15,120
Yeah.

165
00:05:15,120 --> 00:05:18,080
It's great to have more people involved in creating

166
00:05:18,080 --> 00:05:18,800
these data sets.

167
00:05:18,800 --> 00:05:21,200
I mean, that's fantastic for innovation.

168
00:05:21,200 --> 00:05:23,480
But it also makes it really hard to keep track

169
00:05:23,480 --> 00:05:25,920
of all those different licensing agreements.

170
00:05:25,920 --> 00:05:27,000
Yeah.

171
00:05:27,000 --> 00:05:28,120
I can imagine.

172
00:05:28,120 --> 00:05:29,200
Like herding cats.

173
00:05:29,200 --> 00:05:30,440
Exactly.

174
00:05:30,440 --> 00:05:31,920
And it gets even trickier when you

175
00:05:31,920 --> 00:05:34,480
realize that a lot of this data is coming from sources

176
00:05:34,480 --> 00:05:37,600
with strict restrictions on commercial use.

177
00:05:37,600 --> 00:05:38,240
Right.

178
00:05:38,240 --> 00:05:41,280
Think about social media platforms, websites

179
00:05:41,280 --> 00:05:44,520
with very specific terms of service, even content that's

180
00:05:44,520 --> 00:05:45,840
protected by copyright.

181
00:05:45,840 --> 00:05:46,320
Right.

182
00:05:46,320 --> 00:05:46,960
Yeah.

183
00:05:46,960 --> 00:05:49,400
We talked about the undocumented restrictions

184
00:05:49,400 --> 00:05:53,040
earlier that are lurking in that original source material.

185
00:05:53,040 --> 00:05:54,840
This is where that really becomes a problem.

186
00:05:54,840 --> 00:05:56,360
This is where it becomes a huge problem.

187
00:05:56,360 --> 00:05:58,200
And think about all the data that's

188
00:05:58,200 --> 00:05:59,560
being scraped from YouTube.

189
00:05:59,560 --> 00:06:00,600
Yeah.

190
00:06:00,600 --> 00:06:02,840
While the data sets themselves might seem

191
00:06:02,840 --> 00:06:05,600
to have open licenses, the original content

192
00:06:05,600 --> 00:06:08,720
is often covered by YouTube's terms of service, which often

193
00:06:08,720 --> 00:06:12,440
prohibits things like data scraping and commercial use.

194
00:06:12,440 --> 00:06:12,960
Wow.

195
00:06:12,960 --> 00:06:16,360
So AI developers could be walking into a legal trap,

196
00:06:16,360 --> 00:06:18,720
assuming they have the right to use data when they actually

197
00:06:18,720 --> 00:06:19,240
don't.

198
00:06:19,240 --> 00:06:19,840
Yeah.

199
00:06:19,840 --> 00:06:21,720
And that's incredibly risky, especially now

200
00:06:21,720 --> 00:06:23,640
that we're seeing all these lawsuits popping up.

201
00:06:23,640 --> 00:06:24,280
Totally.

202
00:06:24,280 --> 00:06:24,960
Yeah.

203
00:06:24,960 --> 00:06:27,520
So how do you even begin to navigate that?

204
00:06:27,520 --> 00:06:29,240
That's the million dollar question.

205
00:06:29,240 --> 00:06:33,200
The researchers suggest that we need better tools

206
00:06:33,200 --> 00:06:36,760
and practices for actually tracking data provenance

207
00:06:36,760 --> 00:06:39,360
throughout the entire lifecycle of a data set.

208
00:06:39,360 --> 00:06:41,960
You know, it's not enough to just kind of slap an open license

209
00:06:41,960 --> 00:06:43,400
on a collection of data.

210
00:06:43,400 --> 00:06:46,680
We need to know where each piece of data came from,

211
00:06:46,680 --> 00:06:49,240
what the original licensing terms were,

212
00:06:49,240 --> 00:06:51,800
and how those terms might have changed over time.

213
00:06:51,800 --> 00:06:52,200
OK.

214
00:06:52,200 --> 00:06:55,400
So let's say we've got a handle on all these licensing issues.

215
00:06:55,400 --> 00:06:57,120
Are there any other challenges we

216
00:06:57,120 --> 00:07:00,320
need to be thinking about when it comes to AI training data?

217
00:07:00,320 --> 00:07:01,520
Oh, there definitely are.

218
00:07:01,520 --> 00:07:02,020
OK.

219
00:07:02,020 --> 00:07:05,640
And one of the biggest ones is the issue of representation.

220
00:07:05,640 --> 00:07:08,120
You mean like whether the data actually reflects

221
00:07:08,120 --> 00:07:09,960
the diversity of the real world?

222
00:07:09,960 --> 00:07:10,720
Exactly.

223
00:07:10,720 --> 00:07:14,640
If we want AI to be truly intelligent and useful,

224
00:07:14,640 --> 00:07:17,440
it has to learn from a wide range of perspectives

225
00:07:17,440 --> 00:07:18,600
and experiences.

226
00:07:18,600 --> 00:07:19,100
Right.

227
00:07:19,100 --> 00:07:21,160
It needs to get a well-rounded education.

228
00:07:21,160 --> 00:07:22,480
That's a great analogy.

229
00:07:22,480 --> 00:07:25,360
Otherwise it might have blind spots or even make decisions

230
00:07:25,360 --> 00:07:27,800
that are biased or unfair.

231
00:07:27,800 --> 00:07:28,560
Exactly.

232
00:07:28,560 --> 00:07:30,320
And unfortunately, the study found

233
00:07:30,320 --> 00:07:32,320
that things haven't really improved much when

234
00:07:32,320 --> 00:07:35,960
it comes to geographic and linguistic representation.

235
00:07:35,960 --> 00:07:36,680
Really?

236
00:07:36,680 --> 00:07:37,200
Yeah.

237
00:07:37,200 --> 00:07:38,560
That's kind of surprising.

238
00:07:38,560 --> 00:07:40,000
I would have thought with all this talk

239
00:07:40,000 --> 00:07:42,880
about building more inclusive AI,

240
00:07:42,880 --> 00:07:45,520
we would be seeing some progress in this area.

241
00:07:45,520 --> 00:07:46,020
Yeah.

242
00:07:46,020 --> 00:07:47,360
It's a really complex issue.

243
00:07:47,360 --> 00:07:49,760
And while we are seeing more data

244
00:07:49,760 --> 00:07:53,280
from underrepresented languages and regions,

245
00:07:53,280 --> 00:07:56,760
the overall picture is still very Western-centric.

246
00:07:56,760 --> 00:07:58,960
Most of the data sets and the organizations

247
00:07:58,960 --> 00:08:02,560
creating them are concentrated in North America and Europe.

248
00:08:02,560 --> 00:08:05,320
So even though there's more diversity than before,

249
00:08:05,320 --> 00:08:06,560
it's still not balanced.

250
00:08:06,560 --> 00:08:07,680
Not even close.

251
00:08:07,680 --> 00:08:09,600
For example, the study found that data sets

252
00:08:09,600 --> 00:08:12,720
from organizations in Africa or South America

253
00:08:12,720 --> 00:08:15,200
make up less than 0.2% of all the data.

254
00:08:15,200 --> 00:08:16,360
Less than 0.2%.

255
00:08:16,360 --> 00:08:17,720
Less than 0.2%.

256
00:08:17,720 --> 00:08:19,400
That's incredibly low.

257
00:08:19,400 --> 00:08:21,800
It's like those regions are barely even represented

258
00:08:21,800 --> 00:08:24,000
in the data that's shaping the future of AI.

259
00:08:24,000 --> 00:08:24,800
Yeah.

260
00:08:24,800 --> 00:08:25,960
That's the reality right now.

261
00:08:25,960 --> 00:08:27,240
And it has huge implications.

262
00:08:27,240 --> 00:08:28,400
I mean, think about it.

263
00:08:28,400 --> 00:08:32,320
AI systems trained on such skewed data

264
00:08:32,320 --> 00:08:35,360
are likely to perpetuate existing biases.

265
00:08:35,360 --> 00:08:38,000
And they're going to have trouble understanding

266
00:08:38,000 --> 00:08:40,640
the nuances of different cultures and languages.

267
00:08:40,640 --> 00:08:41,080
Yeah.

268
00:08:41,080 --> 00:08:44,800
So AI could end up with this very narrow view of the world.

269
00:08:44,800 --> 00:08:45,400
Exactly.

270
00:08:45,400 --> 00:08:47,640
Missing out on all the richness and complexity

271
00:08:47,640 --> 00:08:49,680
that comes from different cultures and perspectives.

272
00:08:49,680 --> 00:08:50,360
Exactly.

273
00:08:50,360 --> 00:08:51,440
And that's a problem

274
00:08:51,440 --> 00:08:55,240
if we want AI to be a force for good in the world.

275
00:08:55,240 --> 00:08:57,680
If it's only exposed to a limited set of experiences,

276
00:08:57,680 --> 00:09:01,360
it's only going to make decisions that are biased or unfair.

277
00:09:01,360 --> 00:09:03,960
It's like teaching AI to see the world

278
00:09:03,960 --> 00:09:08,480
through a very specific lens, potentially missing

279
00:09:08,480 --> 00:09:10,120
the bigger picture.

280
00:09:10,120 --> 00:09:11,720
So what can be done about this?

281
00:09:11,720 --> 00:09:12,760
How do we address this?

282
00:09:12,760 --> 00:09:13,960
Well, there are a few things we can do.

283
00:09:13,960 --> 00:09:16,000
First of all, we need to be a lot more intentional

284
00:09:16,000 --> 00:09:19,080
about collecting data from a wider range of sources.

285
00:09:19,080 --> 00:09:22,720
That means working with organizations and communities

286
00:09:22,720 --> 00:09:25,760
in underrepresented regions to build data sets that

287
00:09:25,760 --> 00:09:28,440
reflect their lived experiences.

288
00:09:28,440 --> 00:09:31,040
It also means investing in research and development

289
00:09:31,040 --> 00:09:33,520
to create better tools and techniques

290
00:09:33,520 --> 00:09:38,600
that can help us identify and mitigate biases in AI systems.

291
00:09:38,600 --> 00:09:40,520
So it's not just about collecting more data.

292
00:09:40,520 --> 00:09:41,920
It's about collecting the right data.

293
00:09:41,920 --> 00:09:42,480
Exactly.

294
00:09:42,480 --> 00:09:43,520
In the right way.

295
00:09:43,520 --> 00:09:45,600
And making sure that it's used responsibly.

296
00:09:45,600 --> 00:09:46,280
Absolutely.

297
00:09:46,280 --> 00:09:48,120
And it's not just the responsibility

298
00:09:48,120 --> 00:09:49,640
of researchers and developers.

299
00:09:49,640 --> 00:09:53,080
We need policymakers and industry leaders

300
00:09:53,080 --> 00:09:54,440
to get involved, too.

301
00:09:54,440 --> 00:09:55,560
We need guidelines.

302
00:09:55,560 --> 00:09:58,440
We need incentives that promote the development

303
00:09:58,440 --> 00:10:01,480
of more inclusive AI systems.

304
00:10:01,480 --> 00:10:03,600
This makes me think about the long-term implications

305
00:10:03,600 --> 00:10:04,760
of all this.

306
00:10:04,760 --> 00:10:07,000
If we're not careful, we could end up

307
00:10:07,000 --> 00:10:10,840
with AI that reflects and amplifies the inequalities that

308
00:10:10,840 --> 00:10:12,080
already exist in the world.

309
00:10:12,080 --> 00:10:14,640
It's a very real concern.

310
00:10:14,640 --> 00:10:17,640
But I don't think it's too late to course correct.

311
00:10:17,640 --> 00:10:20,880
We have this amazing opportunity to build AI that's

312
00:10:20,880 --> 00:10:24,440
fair, equitable, and beneficial to everyone.

313
00:10:24,440 --> 00:10:27,600
But it's going to take a collective effort from everyone,

314
00:10:27,600 --> 00:10:29,720
from all the stakeholders to make that happen.

315
00:10:29,720 --> 00:10:31,000
Well said.

316
00:10:31,000 --> 00:10:32,280
Yeah, this conversation has really

317
00:10:32,280 --> 00:10:37,040
highlighted the importance of thinking critically

318
00:10:37,040 --> 00:10:39,920
about the data that's shaping the future of AI.

319
00:10:39,920 --> 00:10:41,280
It's not just a technical issue.

320
00:10:41,280 --> 00:10:42,520
It's a societal one.

321
00:10:42,520 --> 00:10:43,360
Absolutely.

322
00:10:43,360 --> 00:10:46,000
And it's a conversation that we need to keep having.

323
00:10:46,000 --> 00:10:48,320
The choices we make today about AI training data

324
00:10:48,320 --> 00:10:50,800
are going to have a huge impact on the world of tomorrow.

325
00:10:50,800 --> 00:10:54,360
Yeah, it feels like we're really at this turning point

326
00:10:54,360 --> 00:10:56,080
in AI development.

327
00:10:56,080 --> 00:10:59,640
We have this amazing technology with so much potential.

328
00:10:59,640 --> 00:11:00,960
But we're also starting to see what

329
00:11:00,960 --> 00:11:05,440
happens when that technology is built on data that's

330
00:11:05,440 --> 00:11:09,120
incomplete, biased, or even, like we said, illegally obtained.

331
00:11:09,120 --> 00:11:10,240
That's a really good point.

332
00:11:10,240 --> 00:11:12,680
And it brings us back to this fundamental issue

333
00:11:12,680 --> 00:11:14,680
that this research paper is really highlighting,

334
00:11:14,680 --> 00:11:16,560
the importance of data provenance.

335
00:11:16,560 --> 00:11:20,440
Really understanding where our data comes from,

336
00:11:20,440 --> 00:11:24,240
how it's been handled, what limitations it might have.

337
00:11:24,240 --> 00:11:26,960
That's absolutely critical if we want to build

338
00:11:26,960 --> 00:11:30,280
AI that's trustworthy and responsible.

339
00:11:30,280 --> 00:11:32,000
It's like building a house.

340
00:11:32,000 --> 00:11:34,480
You wouldn't just use any materials you found lying around.

341
00:11:34,480 --> 00:11:35,240
Exactly.

342
00:11:35,240 --> 00:11:38,880
You want to make sure they're strong and safe

343
00:11:38,880 --> 00:11:41,440
for the job, and you'd want to know where they came from.

344
00:11:41,440 --> 00:11:43,120
You want to know where they came from, exactly.

345
00:11:43,120 --> 00:11:45,280
And just like you would have a building inspector come in

346
00:11:45,280 --> 00:11:49,120
and check the foundation of a house,

347
00:11:49,120 --> 00:11:52,720
we need to find ways to audit the data that's

348
00:11:52,720 --> 00:11:54,880
going into these AI systems.

349
00:11:54,880 --> 00:11:57,640
We need to be able to trace the origins of the data,

350
00:11:57,640 --> 00:12:03,000
verify the licensing, assess its potential for bias.

351
00:12:03,000 --> 00:12:05,480
So it's not just about collecting more data.

352
00:12:05,480 --> 00:12:07,680
It's about being much smarter about how we collect it,

353
00:12:07,680 --> 00:12:09,760
how we document it, how we use it.

354
00:12:09,760 --> 00:12:11,880
It's about quality over quantity in many ways.

355
00:12:11,880 --> 00:12:13,800
Yeah, totally.

356
00:12:13,800 --> 00:12:16,120
This makes me think about the role of transparency

357
00:12:16,120 --> 00:12:17,400
in all this.

358
00:12:17,400 --> 00:12:20,320
If developers are unknowingly using data

359
00:12:20,320 --> 00:12:23,280
that they're not supposed to, or if AI systems are being

360
00:12:23,280 --> 00:12:27,320
trained on biased data, how can we even

361
00:12:27,320 --> 00:12:29,080
begin to address these issues?

362
00:12:29,080 --> 00:12:31,240
Well, transparency is absolutely key.

363
00:12:31,240 --> 00:12:34,400
We need much clearer documentation.

364
00:12:34,400 --> 00:12:37,440
We need standardized licensing practices.

365
00:12:37,440 --> 00:12:39,840
We need open communication between the people who

366
00:12:39,840 --> 00:12:42,600
are creating these data sets and the developers who

367
00:12:42,600 --> 00:12:43,720
are actually using them.

368
00:12:43,720 --> 00:12:45,760
Right, and it also feels like there's a huge need

369
00:12:45,760 --> 00:12:50,360
for more education and awareness within the AI community

370
00:12:50,360 --> 00:12:51,640
itself.

371
00:12:51,640 --> 00:12:54,880
So not every developer is going to be a legal expert

372
00:12:54,880 --> 00:12:58,120
on data licensing or an ethicist who specializes

373
00:12:58,120 --> 00:12:59,080
in bias detection.

374
00:12:59,080 --> 00:12:59,680
Absolutely.

375
00:12:59,680 --> 00:13:02,520
We need to give developers the knowledge and the tools

376
00:13:02,520 --> 00:13:05,000
they need to make really informed decisions about the data

377
00:13:05,000 --> 00:13:05,800
that they're using.

378
00:13:05,800 --> 00:13:09,200
And that means better education on things like data provenance,

379
00:13:09,200 --> 00:13:11,640
licensing, ethical considerations.

380
00:13:11,640 --> 00:13:14,920
And it also means developing new tools and resources

381
00:13:14,920 --> 00:13:17,360
to help developers identify potential issues

382
00:13:17,360 --> 00:13:18,600
with their data sets.

383
00:13:18,600 --> 00:13:20,920
Yeah, it's almost like we need a whole new field

384
00:13:20,920 --> 00:13:25,120
of expertise within AI.

385
00:13:25,120 --> 00:13:29,040
Data provenance specialists who can help people navigate

386
00:13:29,040 --> 00:13:30,080
these complexities.

387
00:13:30,080 --> 00:13:31,040
That's a great idea.

388
00:13:31,040 --> 00:13:33,480
I mean, it's something that we're already starting to see.

389
00:13:33,480 --> 00:13:35,240
There's a growing demand for people

390
00:13:35,240 --> 00:13:41,920
who understand the ethical and the technical aspects of data

391
00:13:41,920 --> 00:13:43,440
sourcing for AI.

392
00:13:43,440 --> 00:13:45,800
So it's not just about the algorithms anymore.

393
00:13:45,800 --> 00:13:48,400
It's about the entire data ecosystem.

394
00:13:48,400 --> 00:13:49,400
A whole ecosystem.

395
00:13:49,400 --> 00:13:50,960
That supports those algorithms.

396
00:13:50,960 --> 00:13:52,760
And that's why this research is so important.

397
00:13:52,760 --> 00:13:56,480
It's really shining a light on this often overlooked,

398
00:13:56,480 --> 00:14:01,040
but absolutely crucial aspect of AI development.

399
00:14:01,040 --> 00:14:03,720
The data that we use to train these systems

400
00:14:03,720 --> 00:14:06,040
is ultimately going to determine their capabilities

401
00:14:06,040 --> 00:14:06,840
or limitations.

402
00:14:06,840 --> 00:14:08,640
And, you know, how they impact the world.

403
00:14:08,640 --> 00:14:09,160
Right.

404
00:14:09,160 --> 00:14:12,640
Yeah, it's a powerful reminder that building responsible AI

405
00:14:12,640 --> 00:14:15,120
is about so much more than just writing code.

406
00:14:15,120 --> 00:14:16,440
It really is.

407
00:14:16,440 --> 00:14:18,080
It's about making thoughtful choices

408
00:14:18,080 --> 00:14:21,560
at every stage of the process, from data collection

409
00:14:21,560 --> 00:14:23,040
to deployment.

410
00:14:23,040 --> 00:14:23,800
Absolutely.

411
00:14:23,800 --> 00:14:25,680
Well, this has been a really fascinating and eye-opening

412
00:14:25,680 --> 00:14:26,240
conversation.

413
00:14:26,240 --> 00:14:27,000
Yeah, it has.

414
00:14:27,000 --> 00:14:29,360
Thanks for walking us through all this research

415
00:14:29,360 --> 00:14:33,600
and helping us understand the complexities of data provenance

416
00:14:33,600 --> 00:14:34,360
in AI.

417
00:14:34,360 --> 00:14:35,120
My pleasure.

418
00:14:35,120 --> 00:14:37,480
I'm really glad we had a chance to really dive deep

419
00:14:37,480 --> 00:14:40,000
into this because it's something that needs a lot more attention.

420
00:14:40,000 --> 00:14:41,080
Totally.

421
00:14:41,080 --> 00:14:43,480
So listeners, I hope this deep dive has given you

422
00:14:43,480 --> 00:14:49,360
a new appreciation for this unseen world of AI training data

423
00:14:49,360 --> 00:14:52,200
and the incredibly important role it

424
00:14:52,200 --> 00:14:54,880
plays in shaping the future of AI.

425
00:14:54,880 --> 00:14:58,560
So until next time, keep those minds curious

426
00:14:58,560 --> 00:15:14,600
and those questions coming.

