1
00:00:00,000 --> 00:00:03,100
All right, let's jump into some more AI research today.

2
00:00:03,100 --> 00:00:06,540
We're gonna be looking at "Navigating the Risks:

3
00:00:06,540 --> 00:00:10,440
A Survey of Security, Privacy, and Ethics Threats

4
00:00:10,440 --> 00:00:12,400
in LLM-Based Agents."

5
00:00:12,400 --> 00:00:15,420
Quite a title, but yeah, it's a really interesting look

6
00:00:15,420 --> 00:00:19,520
at how AI, specifically those powerful AI systems,

7
00:00:19,520 --> 00:00:21,700
could potentially cause some problems

8
00:00:21,700 --> 00:00:23,440
as they become more common.

9
00:00:23,440 --> 00:00:25,480
Right, and I think what's interesting about this paper

10
00:00:25,480 --> 00:00:28,480
is that it kind of goes beyond just like saying,

11
00:00:28,480 --> 00:00:31,240
oh, AI can mess up sometimes.

12
00:00:31,240 --> 00:00:32,740
It's really trying to break down

13
00:00:32,740 --> 00:00:35,840
how we can categorize those potential problems

14
00:00:35,840 --> 00:00:37,080
based on where they come from.

15
00:00:37,080 --> 00:00:38,880
Yeah, exactly, it's not just, you know,

16
00:00:38,880 --> 00:00:40,200
AI made a mistake, it's like,

17
00:00:40,200 --> 00:00:42,360
is the issue the info it's getting?

18
00:00:42,360 --> 00:00:44,240
Is it the AI model itself,

19
00:00:44,240 --> 00:00:46,160
or is it some crazy combination of both?

20
00:00:46,160 --> 00:00:47,320
Yeah, that makes a lot of sense.

21
00:00:47,320 --> 00:00:48,720
It's like a doctor trying to figure out

22
00:00:48,720 --> 00:00:49,560
what's wrong with you.

23
00:00:49,560 --> 00:00:50,400
Right.

24
00:00:50,400 --> 00:00:51,220
They don't just treat the symptoms.

25
00:00:51,220 --> 00:00:52,520
They wanna figure out the root cause

26
00:00:52,520 --> 00:00:53,920
so they can give you the right treatment.

27
00:00:53,920 --> 00:00:57,040
Absolutely, finding the why behind a problem

28
00:00:57,040 --> 00:00:59,680
is just as important as knowing what the problem is.

29
00:00:59,680 --> 00:01:01,580
And, you know, in this case,

30
00:01:01,580 --> 00:01:04,680
the paper's really focused on these LLM-based agents,

31
00:01:04,680 --> 00:01:07,200
which are kind of like chatbots on steroids,

32
00:01:07,200 --> 00:01:08,120
I guess you could say.

33
00:01:08,120 --> 00:01:10,200
So more than just your typical chatbot,

34
00:01:10,200 --> 00:01:12,520
that could just, you know, have a basic conversation.

35
00:01:12,520 --> 00:01:13,440
Way more.

36
00:01:13,440 --> 00:01:14,920
We're talking about AI systems

37
00:01:14,920 --> 00:01:17,320
that can actually do things like complex tasks,

38
00:01:17,320 --> 00:01:20,040
make decisions, even interact with the real world.

39
00:01:20,040 --> 00:01:22,360
Whoa, so like imagine a chatbot

40
00:01:22,360 --> 00:01:24,480
that can not only book your flight,

41
00:01:24,480 --> 00:01:26,920
but also figure out how you're getting to the airport

42
00:01:26,920 --> 00:01:28,960
and check you in for your flight, that kind of thing.

43
00:01:28,960 --> 00:01:30,000
Exactly.

44
00:01:30,000 --> 00:01:32,600
We're talking about some serious potential here,

45
00:01:32,600 --> 00:01:35,760
but of course with great power comes, well, you know,

46
00:01:35,760 --> 00:01:36,600
the rest.

47
00:01:36,600 --> 00:01:38,240
The potential for things to go very wrong.

48
00:01:38,240 --> 00:01:41,040
Yeah, that's a big part of what this paper is exploring.

49
00:01:41,040 --> 00:01:42,900
It's highlighting how all these features

50
00:01:42,900 --> 00:01:45,960
that make LLM-based agents so powerful,

51
00:01:45,960 --> 00:01:47,700
well, they also open up these new ways

52
00:01:47,700 --> 00:01:49,580
for them to be attacked or manipulated.

53
00:01:49,580 --> 00:01:51,280
Like think about their ability

54
00:01:51,280 --> 00:01:54,040
to process all kinds of input, not just text.

55
00:01:54,040 --> 00:01:55,440
Right, like multimodal input,

56
00:01:55,440 --> 00:01:57,760
that they can handle text, images,

57
00:01:57,760 --> 00:01:59,760
maybe even audio all at the same time, right?

58
00:01:59,760 --> 00:02:02,740
Exactly, that makes them super versatile,

59
00:02:02,740 --> 00:02:06,080
but it also means there are more ways for them to be tricked.

60
00:02:06,080 --> 00:02:09,440
Think of an AI that's supposed to help you navigate a city.

61
00:02:09,440 --> 00:02:13,080
It's using street signs, maps, even traffic updates,

62
00:02:13,080 --> 00:02:15,360
but then an attacker can mess with any of that

63
00:02:15,360 --> 00:02:18,600
to send the AI and you in the wrong direction.

64
00:02:18,600 --> 00:02:20,280
Like those optical illusion tricks,

65
00:02:20,280 --> 00:02:21,820
but instead of tricking our eyes,

66
00:02:21,820 --> 00:02:23,480
they're tricking the AI's brain.

67
00:02:23,480 --> 00:02:24,320
Exactly.

68
00:02:24,320 --> 00:02:27,000
And it's not just about the type of input either.

69
00:02:27,000 --> 00:02:28,720
Think about how these agents can have these

70
00:02:28,720 --> 00:02:30,040
back-and-forth conversations,

71
00:02:30,040 --> 00:02:32,000
like with multiple rounds of interaction.

72
00:02:32,000 --> 00:02:33,040
Okay, that's cool.

73
00:02:33,040 --> 00:02:34,840
So they're not just responding to one prompt,

74
00:02:34,840 --> 00:02:36,840
they can ask questions, get clarification,

75
00:02:36,840 --> 00:02:37,920
kind of learn as they go.

76
00:02:37,920 --> 00:02:41,400
Right, but think about that from like a security standpoint.

77
00:02:41,400 --> 00:02:44,580
A skilled attacker could actually use that multi-round thing

78
00:02:44,580 --> 00:02:47,540
to manipulate the AI, kind of like a puppet master

79
00:02:47,540 --> 00:02:50,320
slowly pulling strings to get it to do what they want.

80
00:02:50,320 --> 00:02:52,200
Hold on, so you're saying someone could hijack

81
00:02:52,200 --> 00:02:55,680
the AI's goals without even like hacking

82
00:02:55,680 --> 00:02:58,520
into the system directly, that's kind of scary.

83
00:02:58,520 --> 00:03:00,920
It is, and the paper digs deep

84
00:03:00,920 --> 00:03:02,560
into how that could actually happen.

85
00:03:02,560 --> 00:03:04,600
And this is where it gets really concerning

86
00:03:04,600 --> 00:03:09,480
because imagine if an AI had access to external tools,

87
00:03:09,480 --> 00:03:11,720
things that can do stuff in the real world.

88
00:03:11,720 --> 00:03:14,280
If the AI's goal gets hijacked,

89
00:03:14,280 --> 00:03:16,160
those actions could be harmful.

90
00:03:16,160 --> 00:03:17,800
So we're not just talking about the AI

91
00:03:17,800 --> 00:03:19,680
giving you wrong information,

92
00:03:19,680 --> 00:03:22,160
we're talking about it potentially doing things.

93
00:03:22,160 --> 00:03:24,720
Like making purchases you didn't authorize,

94
00:03:24,720 --> 00:03:26,440
controlling your smart home devices,

95
00:03:26,440 --> 00:03:29,280
maybe even spreading false information online.

96
00:03:29,280 --> 00:03:31,040
This is making AI assistants seem

97
00:03:31,040 --> 00:03:33,100
a little less appealing all of a sudden.

98
00:03:33,100 --> 00:03:34,560
I get it, it's definitely something to consider.

99
00:03:34,560 --> 00:03:36,920
That's why understanding these risks early on

100
00:03:36,920 --> 00:03:39,520
is so important before these super powerful

101
00:03:39,520 --> 00:03:41,400
AI systems are everywhere.

102
00:03:41,400 --> 00:03:43,400
So we've talked about multimodal input

103
00:03:43,400 --> 00:03:44,840
and multi-round interaction,

104
00:03:44,840 --> 00:03:46,880
creating potential vulnerabilities.

105
00:03:46,880 --> 00:03:49,480
Are there other features of LLM-based agents

106
00:03:49,480 --> 00:03:52,640
that make them both powerful but also kind of risky?

107
00:03:52,640 --> 00:03:53,920
Oh yeah, definitely.

108
00:03:53,920 --> 00:03:56,000
Another one is their ability to use,

109
00:03:56,000 --> 00:03:58,200
well, what are basically memory mechanisms.

110
00:03:58,200 --> 00:03:59,520
So like a human memory,

111
00:03:59,520 --> 00:04:01,760
they can store information and retrieve it later.

112
00:04:01,760 --> 00:04:04,280
Exactly, that's how they keep track of context

113
00:04:04,280 --> 00:04:06,200
and learn from what they've done in the past,

114
00:04:06,200 --> 00:04:09,760
makes their interactions way more complex and natural.

115
00:04:09,760 --> 00:04:13,840
But, you know, that memory thing also opens up some,

116
00:04:13,840 --> 00:04:15,880
well, privacy risks.

117
00:04:15,880 --> 00:04:16,920
I could see that.

118
00:04:16,920 --> 00:04:19,600
Like what if an attacker could get into the AI's memory

119
00:04:19,600 --> 00:04:22,200
and pull out sensitive info it had picked up along the way?

120
00:04:22,200 --> 00:04:23,720
Yeah, like someone having access

121
00:04:23,720 --> 00:04:25,760
to every conversation you've ever had

122
00:04:25,760 --> 00:04:28,800
and being able to use that against you, not good.

123
00:04:28,800 --> 00:04:32,040
And then on top of that, these agents can invoke tools.

124
00:04:32,040 --> 00:04:34,440
Basically that means they can use external software

125
00:04:34,440 --> 00:04:36,940
or APIs to do things in the real world.

126
00:04:36,940 --> 00:04:40,080
Booking flights, controlling robots, even writing code.

127
00:04:40,080 --> 00:04:42,560
Whoa, that's giving the AI some serious power.

128
00:04:42,560 --> 00:04:43,880
It is incredibly powerful.

129
00:04:43,880 --> 00:04:46,280
But also you have to remember any weakness in those tools

130
00:04:46,280 --> 00:04:48,120
becomes a weakness for the AI too.

131
00:04:48,120 --> 00:04:51,000
So you could have an AI that can control your smart home,

132
00:04:51,000 --> 00:04:53,560
but if there's a vulnerability in how it unlocks your doors,

133
00:04:53,560 --> 00:04:55,360
that could be a huge security risk.

134
00:04:55,360 --> 00:04:56,840
Definitely not a good situation.

135
00:04:56,840 --> 00:04:57,680
Yeah.

136
00:04:57,680 --> 00:04:59,600
So it seems like every strength of these agents

137
00:04:59,600 --> 00:05:02,200
has this flip side where it could also be a weakness, huh?

138
00:05:02,200 --> 00:05:03,600
Yeah, that's a great way to put it.

139
00:05:03,600 --> 00:05:05,720
And that's really why this paper is so important.

140
00:05:05,720 --> 00:05:08,840
It's basically a call to action for everyone working on AI

141
00:05:08,840 --> 00:05:10,680
to think critically about these risks

142
00:05:10,680 --> 00:05:12,720
and figure out how to stop them

143
00:05:12,720 --> 00:05:15,200
before these agents become a bigger part of our lives.

144
00:05:15,200 --> 00:05:16,720
Sounds like a race against time almost.

145
00:05:16,720 --> 00:05:18,000
In a way, yeah.

146
00:05:18,000 --> 00:05:19,960
But instead of focusing on the doom and gloom,

147
00:05:19,960 --> 00:05:21,920
maybe we should dive into the specifics of what

148
00:05:21,920 --> 00:05:23,760
could actually go wrong.

149
00:05:23,760 --> 00:05:25,680
The paper breaks down these risks

150
00:05:25,680 --> 00:05:27,280
into three main categories.

151
00:05:27,280 --> 00:05:29,760
And I think that'll give us a much clearer idea

152
00:05:29,760 --> 00:05:30,920
of what we're dealing with.

153
00:05:30,920 --> 00:05:32,440
OK, sounds like a plan.

154
00:05:32,440 --> 00:05:34,320
Let's start with that first category, then: risks

155
00:05:34,320 --> 00:05:36,640
that come from problematic inputs.

156
00:05:36,640 --> 00:05:39,040
What kind of threats fall under that umbrella?

157
00:05:39,040 --> 00:05:41,320
This is where things get really interesting.

158
00:05:41,320 --> 00:05:42,840
So think about it.

159
00:05:42,840 --> 00:05:46,240
Even the smartest AI is only as good as the information

160
00:05:46,240 --> 00:05:47,840
it's working with, right?

161
00:05:47,840 --> 00:05:49,920
But in the real world, that info

162
00:05:49,920 --> 00:05:54,520
can be manipulated or messed up or just plain wrong.

163
00:05:54,520 --> 00:05:57,000
Right, so even if the AI itself is perfectly fine,

164
00:05:57,000 --> 00:05:59,840
it can still get tripped up by bad information.

165
00:05:59,840 --> 00:06:00,640
Exactly.

166
00:06:00,640 --> 00:06:02,480
And there are a few ways that can happen.

167
00:06:02,480 --> 00:06:04,480
But one of the most, well, I guess,

168
00:06:04,480 --> 00:06:07,040
both fascinating and kind of scary

169
00:06:07,040 --> 00:06:09,280
is this idea of adversarial examples.

170
00:06:09,280 --> 00:06:10,640
Adversarial examples.

171
00:06:10,640 --> 00:06:13,080
That sounds ominous, tell me more.

172
00:06:13,080 --> 00:06:15,160
Remember those optical illusions we talked about?

173
00:06:15,160 --> 00:06:16,160
It's kind of like that,

174
00:06:16,160 --> 00:06:18,800
but for AI. You can actually make these tiny little changes

175
00:06:18,800 --> 00:06:19,360
to the input.

176
00:06:19,360 --> 00:06:22,000
Could be text, images, audio, whatever.

177
00:06:22,000 --> 00:06:23,960
And a human might not even notice the difference,

178
00:06:23,960 --> 00:06:26,800
but it can totally screw up how the AI understands things.

179
00:06:26,800 --> 00:06:30,440
So you're saying someone could manipulate the data

180
00:06:30,440 --> 00:06:31,920
to trick the AI.

181
00:06:31,920 --> 00:06:34,120
But if a human looked at it, they'd be like, what's wrong?

182
00:06:34,120 --> 00:06:34,760
This looks fine.

183
00:06:34,760 --> 00:06:35,600
Yep, exactly.

184
00:06:35,600 --> 00:06:37,720
Think about a self-driving car that's

185
00:06:37,720 --> 00:06:40,120
using AI to read road signs.

186
00:06:40,120 --> 00:06:42,160
Someone could change a stop sign in a way

187
00:06:42,160 --> 00:06:44,800
that's invisible to us, but makes the AI think

188
00:06:44,800 --> 00:06:45,960
it's a speed limit sign.

189
00:06:45,960 --> 00:06:46,520
Oh, wow.

190
00:06:46,520 --> 00:06:48,200
That's actually kind of terrifying.

191
00:06:48,200 --> 00:06:50,400
It's like hacking, but on a whole other level.

192
00:06:50,400 --> 00:06:54,440
It's exploiting how the AI sees the world, not just the code.

193
00:06:54,440 --> 00:06:56,600
Yeah, and this isn't just theoretical stuff either.

194
00:06:56,600 --> 00:06:59,440
The paper talks about some real-world examples

195
00:06:59,440 --> 00:07:01,040
where this has happened.

196
00:07:01,040 --> 00:07:05,320
They were able to trick a city navigation AI

197
00:07:05,320 --> 00:07:08,000
by messing with an image of a landmark.

198
00:07:08,000 --> 00:07:10,520
It ended up taking a wrong turn because of it.

199
00:07:10,520 --> 00:07:11,360
Yikes.

200
00:07:11,360 --> 00:07:13,440
So it's like creating a custom-made illusion

201
00:07:13,440 --> 00:07:15,880
to fool a specific AI.

202
00:07:15,880 --> 00:07:17,560
Are there other ways attackers could

203
00:07:17,560 --> 00:07:20,200
use these problematic inputs to mess things up?

204
00:07:20,200 --> 00:07:21,160
Definitely.

205
00:07:21,160 --> 00:07:22,840
We've already talked about goal hijacking,

206
00:07:22,840 --> 00:07:25,000
which is basically manipulating the AI

207
00:07:25,000 --> 00:07:27,000
into doing something it wasn't supposed to.

208
00:07:27,000 --> 00:07:29,720
Super dangerous if it can take actions in the real world.

209
00:07:29,720 --> 00:07:32,000
But there's also this thing called model extraction.

210
00:07:32,000 --> 00:07:32,840
Model extraction.

211
00:07:32,840 --> 00:07:35,960
So that's like stealing the AI's brain or something.

212
00:07:35,960 --> 00:07:36,680
Pretty much.

213
00:07:36,680 --> 00:07:39,120
It's when someone tries to copy what the AI can do

214
00:07:39,120 --> 00:07:41,920
without actually having the code or the training data,

215
00:07:41,920 --> 00:07:44,840
like trying to figure out how a super complex machine works

216
00:07:44,840 --> 00:07:46,600
just by looking at what it does.

217
00:07:46,600 --> 00:07:48,280
I'd imagine that's got to be really tough to do,

218
00:07:48,280 --> 00:07:52,000
especially with these LLM-based agents being so complex.

219
00:07:52,000 --> 00:07:53,280
It is, for sure.

220
00:07:53,280 --> 00:07:54,960
But it's not impossible.

221
00:07:54,960 --> 00:07:56,520
The paper talks about how attackers

222
00:07:56,520 --> 00:07:58,120
could use all sorts of techniques

223
00:07:58,120 --> 00:08:01,240
to basically probe the AI, give it different inputs,

224
00:08:01,240 --> 00:08:03,120
see how it reacts, and try to piece together

225
00:08:03,120 --> 00:08:04,480
how it works on the inside.

226
00:08:04,480 --> 00:08:07,800
So it's like a high-stakes game of 20 questions,

227
00:08:07,800 --> 00:08:10,440
trying to figure out the secret recipe of the AI.

228
00:08:10,440 --> 00:08:12,360
Huh, that's a good way to put it.

229
00:08:12,360 --> 00:08:14,600
And yeah, the stakes are high, because if someone

230
00:08:14,600 --> 00:08:16,960
can pull off model extraction, they

231
00:08:16,960 --> 00:08:18,640
could make their own version of the AI,

232
00:08:18,640 --> 00:08:20,280
and who knows what they'd do with it,

233
00:08:20,280 --> 00:08:22,400
maybe even get rid of all the safety features.

234
00:08:22,400 --> 00:08:26,240
Kind of like if someone stole the secret formula for Coca-Cola

235
00:08:26,240 --> 00:08:28,360
and started selling their own version with, like,

236
00:08:28,360 --> 00:08:30,240
who knows what kind of weird ingredients in it.

237
00:08:30,240 --> 00:08:30,960
Exactly.

238
00:08:30,960 --> 00:08:34,000
That's why having strong security measures is so important.

239
00:08:34,000 --> 00:08:37,600
Gotta make sure those valuable AIs don't get stolen or copied.

240
00:08:37,600 --> 00:08:39,320
Oh, and speaking of things getting stolen,

241
00:08:39,320 --> 00:08:42,560
there's also this thing called prompt leakage.

242
00:08:42,560 --> 00:08:43,640
Prompt leakage.

243
00:08:43,640 --> 00:08:45,560
I don't think we've talked about that one yet.

244
00:08:45,560 --> 00:08:46,160
Right.

245
00:08:46,160 --> 00:08:48,920
So think about the prompts we give to AI.

246
00:08:48,920 --> 00:08:51,480
It's like giving them instructions, right?

247
00:08:51,480 --> 00:08:54,920
Well, prompt leakage is when those instructions accidentally

248
00:08:54,920 --> 00:08:58,120
get revealed, either through what the AI says back to you

249
00:08:58,120 --> 00:09:00,120
or through some other security flaw.

250
00:09:00,120 --> 00:09:02,720
So it's like someone figuring out the secret code words

251
00:09:02,720 --> 00:09:04,160
to control the AI.

252
00:09:04,160 --> 00:09:04,720
Yep.

253
00:09:04,720 --> 00:09:06,480
That's one way to think about it.

254
00:09:06,480 --> 00:09:09,800
And it's a problem because if you know the right prompts,

255
00:09:09,800 --> 00:09:13,280
you can basically make the AI do whatever you want,

256
00:09:13,280 --> 00:09:15,880
especially if it can access sensitive information

257
00:09:15,880 --> 00:09:17,880
or take actions in the real world.

258
00:09:17,880 --> 00:09:20,080
Like we gotta be super careful about how we're talking

259
00:09:20,080 --> 00:09:22,800
to these AIs, making sure we don't accidentally give away

260
00:09:22,800 --> 00:09:25,440
the keys to the kingdom, so to speak.

261
00:09:25,440 --> 00:09:26,480
Absolutely.

262
00:09:26,480 --> 00:09:29,000
And finally, we've got this thing called jailbreaking,

263
00:09:29,000 --> 00:09:32,400
which is all about breaking through the AI's safety rules

264
00:09:32,400 --> 00:09:34,000
and making it do bad stuff.

265
00:09:34,000 --> 00:09:34,800
Jailbreaking.

266
00:09:34,800 --> 00:09:35,800
I guess that makes sense.

267
00:09:35,800 --> 00:09:37,760
It's like you're freeing the AI from its cage, huh?

268
00:09:37,760 --> 00:09:38,880
Yeah, kind of.

269
00:09:38,880 --> 00:09:40,640
Like imagine a chatbot that's designed

270
00:09:40,640 --> 00:09:42,160
to be super friendly and helpful,

271
00:09:42,160 --> 00:09:44,520
but then someone figures out how to jailbreak it

272
00:09:44,520 --> 00:09:48,040
and it starts spouting hate speech or spreading fake news.

273
00:09:48,040 --> 00:09:48,960
Not good.

274
00:09:48,960 --> 00:09:51,440
OK, that's definitely not what we want.

275
00:09:51,440 --> 00:09:54,760
It seems like every time we make these AIs more powerful,

276
00:09:54,760 --> 00:09:57,560
we also create more ways for them to go wrong.

277
00:09:57,560 --> 00:09:59,520
It's like this endless game of cat and mouse

278
00:09:59,520 --> 00:10:02,160
between the people making AIs and the people trying

279
00:10:02,160 --> 00:10:02,880
to exploit them.

280
00:10:02,880 --> 00:10:04,160
That's a perfect way to put it.

281
00:10:04,160 --> 00:10:05,640
It's a constant back and forth.

282
00:10:05,640 --> 00:10:08,680
And that actually brings us to the second big category

283
00:10:08,680 --> 00:10:11,480
of risks, which is when the problem isn't the input

284
00:10:11,480 --> 00:10:15,120
the AI is getting, but the AI model itself.

285
00:10:15,120 --> 00:10:18,440
So now we're talking about issues with the AI's brain.

286
00:10:18,440 --> 00:10:20,720
Even if it's getting perfect information,

287
00:10:20,720 --> 00:10:22,600
what kinds of things can go wrong there?

288
00:10:22,600 --> 00:10:25,120
Well, this is where we get into some of those weird quirks

289
00:10:25,120 --> 00:10:26,480
that AIs can have.

290
00:10:26,480 --> 00:10:29,120
How sometimes they just do unexpected things.

291
00:10:29,120 --> 00:10:30,960
One of the most well-known examples of this

292
00:10:30,960 --> 00:10:32,760
is something called hallucination.

293
00:10:32,760 --> 00:10:34,280
Hallucination?

294
00:10:34,280 --> 00:10:36,400
Like seeing things that aren't there?

295
00:10:36,400 --> 00:10:38,320
How can an AI hallucinate?

296
00:10:38,320 --> 00:10:39,920
Well, not literally seeing things.

297
00:10:39,920 --> 00:10:42,560
It's more like when the AI makes stuff up.

298
00:10:42,560 --> 00:10:44,400
It says things that are just plain wrong,

299
00:10:44,400 --> 00:10:47,920
but it says them with total confidence as if they were facts.

300
00:10:47,920 --> 00:10:51,800
So it's like the AI is just confidently spouting nonsense?

301
00:10:51,800 --> 00:10:54,160
That sounds almost funny, but I guess

302
00:10:54,160 --> 00:10:56,520
it could also be a big problem depending on the situation.

303
00:10:56,520 --> 00:10:57,920
Oh, yeah, for sure.

304
00:10:57,920 --> 00:10:59,480
Think about an AI that's supposed

305
00:10:59,480 --> 00:11:02,040
to be helping doctors diagnose patients,

306
00:11:02,040 --> 00:11:04,320
but it starts hallucinating symptoms,

307
00:11:04,320 --> 00:11:07,040
could lead to the wrong treatment, maybe even hurt someone,

308
00:11:07,040 --> 00:11:08,960
or what about a financial AI that's

309
00:11:08,960 --> 00:11:12,400
giving you investment advice based on made-up market data?

310
00:11:12,400 --> 00:11:14,800
OK, yeah, those are not good scenarios.

311
00:11:14,800 --> 00:11:18,880
So what causes these AI hallucinations in the first place?

312
00:11:18,880 --> 00:11:21,960
It's kind of complicated, but it could be all sorts of things.

313
00:11:21,960 --> 00:11:24,880
Maybe there were biases in the data it was trained on.

314
00:11:24,880 --> 00:11:26,480
Maybe there are gaps in its knowledge,

315
00:11:26,480 --> 00:11:28,560
or maybe it's just a fundamental limitation

316
00:11:28,560 --> 00:11:31,120
of how these models actually learn.

317
00:11:31,120 --> 00:11:32,640
It seems like we're still figuring out

318
00:11:32,640 --> 00:11:35,200
how these complex AIs really work, huh?

319
00:11:35,200 --> 00:11:39,200
Are there any other model flaws we should be watching out for?

320
00:11:39,200 --> 00:11:42,640
Yeah, another important one is this idea of memorization

321
00:11:42,640 --> 00:11:44,560
versus generalization.

322
00:11:44,560 --> 00:11:48,000
Ideally, we want AIs to actually learn from their experiences

323
00:11:48,000 --> 00:11:50,840
and be able to apply that knowledge to new situations,

324
00:11:50,840 --> 00:11:53,480
but sometimes they just end up memorizing patterns

325
00:11:53,480 --> 00:11:55,720
without really understanding what's going on.

326
00:11:55,720 --> 00:11:57,960
So it's like the difference between a student who can actually

327
00:11:57,960 --> 00:12:01,760
use what they learn to solve new problems versus a student who

328
00:12:01,760 --> 00:12:05,080
just memorizes facts for a test but can't apply any of it

329
00:12:05,080 --> 00:12:05,920
in the real world.

330
00:12:05,920 --> 00:12:06,800
Exactly.

331
00:12:06,800 --> 00:12:09,240
And that can be a problem for LLM-based agents

332
00:12:09,240 --> 00:12:12,280
because they might get stuck doing the same things over and over

333
00:12:12,280 --> 00:12:14,400
again, even when it doesn't make sense anymore.

334
00:12:14,400 --> 00:12:16,800
Sounds frustrating, but is that actually dangerous?

335
00:12:16,800 --> 00:12:17,960
Could be, yeah.

336
00:12:17,960 --> 00:12:20,520
Think about a self-driving car that's

337
00:12:20,520 --> 00:12:23,880
memorized a specific route, but then there's road construction.

338
00:12:23,880 --> 00:12:25,560
If it can't adapt to the new situation,

339
00:12:25,560 --> 00:12:27,360
it might make a really bad decision.

340
00:12:27,360 --> 00:12:28,120
OK, yeah.

341
00:12:28,120 --> 00:12:29,520
I definitely see the problem there.

342
00:12:29,520 --> 00:12:32,840
We need AI that can think on its feet, not just follow

343
00:12:32,840 --> 00:12:34,560
memorized instructions blindly.

344
00:12:34,560 --> 00:12:35,400
Exactly.

345
00:12:35,400 --> 00:12:37,120
And that's one of the big challenges right now.

346
00:12:37,120 --> 00:12:40,520
Building AI that can truly learn and adapt, not just parrot

347
00:12:40,520 --> 00:12:42,720
back what it's seen before.

348
00:12:42,720 --> 00:12:45,480
And speaking of things that AI can sometimes parrot back,

349
00:12:45,480 --> 00:12:48,560
there's one more big model flaw we should talk about: bias.

350
00:12:48,560 --> 00:12:50,800
Ah, bias.

351
00:12:50,800 --> 00:12:53,040
That's a big topic these days, not just with AI,

352
00:12:53,040 --> 00:12:55,120
but everywhere, really.

353
00:12:55,120 --> 00:12:58,800
How does bias show up in these LLM-based agents?

354
00:12:58,800 --> 00:13:00,960
It's kind of like with humans, you know?

355
00:13:00,960 --> 00:13:04,640
AI can pick up biases from the data it's trained on.

356
00:13:04,640 --> 00:13:08,040
So if that data reflects existing societal biases,

357
00:13:08,040 --> 00:13:11,080
the AI can end up perpetuating those biases

358
00:13:11,080 --> 00:13:12,520
in what it does and says.

359
00:13:12,520 --> 00:13:15,560
So it's like the AI is inheriting all the prejudices

360
00:13:15,560 --> 00:13:17,440
and stereotypes that are already out there.

361
00:13:17,440 --> 00:13:17,840
Right.

362
00:13:17,840 --> 00:13:19,840
And this can be a huge problem, especially

363
00:13:19,840 --> 00:13:22,840
as these LLM-based agents start playing bigger roles

364
00:13:22,840 --> 00:13:24,920
in our lives, like making decisions that actually

365
00:13:24,920 --> 00:13:25,760
affect people.

366
00:13:25,760 --> 00:13:28,240
Like imagine an AI that's being used to help companies

367
00:13:28,240 --> 00:13:30,920
hire people, but it unknowingly discriminates

368
00:13:30,920 --> 00:13:34,240
against certain candidates because of biases in the data

369
00:13:34,240 --> 00:13:35,040
it learned from.

370
00:13:35,040 --> 00:13:35,920
Exactly.

371
00:13:35,920 --> 00:13:38,520
Or what about an AI that's supposed to be helping

372
00:13:38,520 --> 00:13:40,680
with criminal justice, but it ends up

373
00:13:40,680 --> 00:13:42,800
recommending harsher sentences for people

374
00:13:42,800 --> 00:13:46,000
from certain backgrounds because of those same biases?

375
00:13:46,000 --> 00:13:48,120
It's a really complex and sensitive issue

376
00:13:48,120 --> 00:13:50,200
that we definitely need to figure out how to solve.

377
00:13:50,200 --> 00:13:52,680
Yeah, this is getting into the whole ethics of AI, isn't it?

378
00:13:52,680 --> 00:13:54,400
It's not just about building smart systems.

379
00:13:54,400 --> 00:13:57,080
It's about making sure they're actually used fairly

380
00:13:57,080 --> 00:13:58,080
and responsibly.

381
00:13:58,080 --> 00:13:58,720
Absolutely.

382
00:13:58,720 --> 00:14:02,080
And that kind of leads us into the last big category of risks

383
00:14:02,080 --> 00:14:04,880
that the paper talks about, those that happen because

384
00:14:04,880 --> 00:14:08,680
of the way the AI model interacts with the input it gets.

385
00:14:08,680 --> 00:14:10,720
So it's not just about the input being bad

386
00:14:10,720 --> 00:14:12,200
or the model being flawed.

387
00:14:12,200 --> 00:14:14,480
It's about what happens when those two things come together.

388
00:14:14,480 --> 00:14:17,080
OK, so this is where things get even more tangled, right?

389
00:14:17,080 --> 00:14:19,120
We're talking about the intersection of those first two

390
00:14:19,120 --> 00:14:20,200
categories.

391
00:14:20,200 --> 00:14:24,720
What kind of threats emerge in this weird overlapping zone?

392
00:14:24,720 --> 00:14:26,560
Well, one that's especially concerning

393
00:14:26,560 --> 00:14:28,560
is something called a backdoor attack.

394
00:14:28,560 --> 00:14:30,040
Backdoor attack.

395
00:14:30,040 --> 00:14:31,160
That sounds kind of sneaky.

396
00:14:31,160 --> 00:14:32,080
What's that all about?

397
00:14:32,080 --> 00:14:34,720
It's like, imagine you've got this AI agent, right?

398
00:14:34,720 --> 00:14:36,000
And it seems totally normal.

399
00:14:36,000 --> 00:14:37,280
Does its job just fine.

400
00:14:37,280 --> 00:14:40,440
But someone snuck in this little hidden vulnerability,

401
00:14:40,440 --> 00:14:42,960
like a secret backdoor in a building that

402
00:14:42,960 --> 00:14:45,200
lets someone bypass all the security.

403
00:14:45,200 --> 00:14:47,440
So it's kind of like a time bomb waiting to go off.

404
00:14:47,440 --> 00:14:49,280
But how would someone even do that with AI?

405
00:14:49,280 --> 00:14:50,800
Like, plant a backdoor?

406
00:14:50,800 --> 00:14:51,600
I mean.

407
00:14:51,600 --> 00:14:53,200
It's a bit technical, but basically they

408
00:14:53,200 --> 00:14:55,920
can mess with either the training data or the model

409
00:14:55,920 --> 00:14:59,200
itself and create this hidden trigger.

410
00:14:59,200 --> 00:15:02,080
So the AI works fine normally, but if it encounters

411
00:15:02,080 --> 00:15:05,200
a specific thing, could be a word, an image, whatever, boom,

412
00:15:05,200 --> 00:15:07,240
it executes this malicious code.

413
00:15:07,240 --> 00:15:10,120
Whoa, so like an example to help me wrap my head around this.

414
00:15:10,120 --> 00:15:11,680
OK, so picture this.

415
00:15:11,680 --> 00:15:13,200
You have a smart home chatbot.

416
00:15:13,200 --> 00:15:16,200
It controls all your lights, locks, everything.

417
00:15:16,200 --> 00:15:17,800
But it's been backdoored.

418
00:15:17,800 --> 00:15:19,480
And let's say the trigger phrase is,

419
00:15:19,480 --> 00:15:20,840
it's time to water the plants.

420
00:15:20,840 --> 00:15:22,960
Totally normal phrase, right?

421
00:15:22,960 --> 00:15:24,960
But instead of turning on the sprinklers,

422
00:15:24,960 --> 00:15:27,000
the AI unlocks all your doors.

423
00:15:27,000 --> 00:15:28,880
OK, yeah, that's straight out of a nightmare.

424
00:15:28,880 --> 00:15:30,640
And you wouldn't even know until it was too late.

425
00:15:30,640 --> 00:15:31,360
Exactly.

426
00:15:31,360 --> 00:15:34,320
It's why we need really, really good security testing

427
00:15:34,320 --> 00:15:36,160
for these AI systems, especially ones that

428
00:15:36,160 --> 00:15:38,200
can do stuff in the real world.

429
00:15:38,200 --> 00:15:41,800
Like imagine if someone backdoored a self-driving car AI.

430
00:15:41,800 --> 00:15:43,480
We can't have those kinds of risks.

431
00:15:43,480 --> 00:15:44,400
Absolutely not.

432
00:15:44,400 --> 00:15:48,200
It's like the stakes keep getting higher as these AI agents

433
00:15:48,200 --> 00:15:50,400
become more powerful, especially as they get integrated

434
00:15:50,400 --> 00:15:52,520
into our daily lives.

435
00:15:52,520 --> 00:15:55,200
So what else should we be worried about

436
00:15:55,200 --> 00:15:57,840
when it comes to this interplay between inputs

437
00:15:57,840 --> 00:15:59,280
and the AI models themselves?

438
00:15:59,280 --> 00:16:01,560
Well, there's the whole issue of privacy leakage, which

439
00:16:01,560 --> 00:16:02,680
we touched on before.

440
00:16:02,680 --> 00:16:05,440
You know, how even an AI designed to be private

441
00:16:05,440 --> 00:16:07,960
can accidentally leak sensitive information

442
00:16:07,960 --> 00:16:09,120
through its outputs.

443
00:16:09,120 --> 00:16:11,640
Right, it's not just about protecting the data going

444
00:16:11,640 --> 00:16:14,480
into the AI, but also controlling what comes out.

445
00:16:14,480 --> 00:16:16,600
Yeah, because even if the AI is not

446
00:16:16,600 --> 00:16:18,320
trying to expose private stuff, it

447
00:16:18,320 --> 00:16:21,520
can happen just because of the way it processes information.

448
00:16:21,520 --> 00:16:22,880
Like imagine an AI that's learned

449
00:16:22,880 --> 00:16:24,600
from tons of medical records.

450
00:16:24,600 --> 00:16:27,520
It might accidentally reveal a patient's info

451
00:16:27,520 --> 00:16:30,360
in a response to a totally unrelated question.

452
00:16:30,360 --> 00:16:32,600
It's like the AI is putting two and two together in a way

453
00:16:32,600 --> 00:16:35,720
that reveals something private, even if it's not intentional.

454
00:16:35,720 --> 00:16:36,640
Exactly.

455
00:16:36,640 --> 00:16:37,880
It's tricky stuff.

456
00:16:37,880 --> 00:16:39,600
We need to be developing better ways

457
00:16:39,600 --> 00:16:42,560
to make sure AIs keep that kind of information locked down

458
00:16:42,560 --> 00:16:43,800
tight no matter what.

459
00:16:43,800 --> 00:16:47,080
OK, so we've gone through those three main risk categories.

460
00:16:47,080 --> 00:16:50,560
Problems with the input, problems with the AI model itself,

461
00:16:50,560 --> 00:16:52,840
and then the weird stuff that happens when those two things

462
00:16:52,840 --> 00:16:53,600
combine.

463
00:16:53,600 --> 00:16:54,960
It's a lot to think about.

464
00:16:54,960 --> 00:16:57,280
But I guess that's why we're doing this deep dive, right?

465
00:16:57,280 --> 00:16:59,240
We need to understand this stuff if we're

466
00:16:59,240 --> 00:17:00,720
going to use AI safely.

467
00:17:00,720 --> 00:17:02,440
Couldn't have said it better myself.

468
00:17:02,440 --> 00:17:04,600
And you know what really brings it all home?

469
00:17:04,600 --> 00:17:07,480
Those real-world case studies the paper looks at,

470
00:17:07,480 --> 00:17:10,160
they show just how real these risks are.

471
00:17:10,160 --> 00:17:10,640
You're right.

472
00:17:10,640 --> 00:17:11,720
Let's get into those.

473
00:17:11,720 --> 00:17:15,000
First up, we've got WebGPT, which is an AI agent that

474
00:17:15,000 --> 00:17:17,720
uses a web browser to answer your questions.

475
00:17:17,720 --> 00:17:19,520
WebGPT, yeah.

476
00:17:19,520 --> 00:17:22,640
It's a really cool example of how AI can use tools,

477
00:17:22,640 --> 00:17:24,680
like an internet browser, to do more.

478
00:17:24,680 --> 00:17:27,200
It's like having your own personal AI researcher.

479
00:17:27,200 --> 00:17:30,000
But of course, using the internet also opens up some,

480
00:17:30,000 --> 00:17:32,400
well, let's say, opportunities for things to go wrong.

481
00:17:32,400 --> 00:17:34,360
Right, because the internet is full of information,

482
00:17:34,360 --> 00:17:36,840
but it's also full of bias, inaccuracies,

483
00:17:36,840 --> 00:17:38,680
and just plain old bad stuff.

484
00:17:38,680 --> 00:17:40,360
Right, exactly.

485
00:17:40,360 --> 00:17:43,920
And the paper points out that WebGPT is especially vulnerable

486
00:17:43,920 --> 00:17:46,280
to goal hijacking and those backdoor attacks

487
00:17:46,280 --> 00:17:48,680
we were talking about, because it relies so much

488
00:17:48,680 --> 00:17:50,120
on external data.

489
00:17:50,120 --> 00:17:53,880
So if that data is compromised, the AI is compromised too.

490
00:17:53,880 --> 00:17:56,080
It's like, giving an AI access to the internet,

491
00:17:56,080 --> 00:17:59,280
it's kind of like giving a kid the keys to a candy store.

492
00:17:59,280 --> 00:18:00,720
Sure, there's awesome stuff in there,

493
00:18:00,720 --> 00:18:03,560
but also a lot of potential for things to go wrong.

494
00:18:03,560 --> 00:18:04,480
Perfect analogy.

495
00:18:04,480 --> 00:18:06,080
We need to be really thoughtful about how

496
00:18:06,080 --> 00:18:10,440
we design these AIs so they can use powerful tools safely.

497
00:18:10,440 --> 00:18:14,040
OK, next case study, we've got Voyager, the AI that plays

498
00:18:14,040 --> 00:18:14,800
Minecraft.

499
00:18:14,800 --> 00:18:16,040
Yeah, Voyager.

500
00:18:16,040 --> 00:18:17,680
This one's super interesting because it's

501
00:18:17,680 --> 00:18:21,200
an example of an embodied AI, meaning it can actually

502
00:18:21,200 --> 00:18:23,960
learn and interact with a simulated environment.

503
00:18:23,960 --> 00:18:25,920
It's like teaching an AI to play a video game.

504
00:18:25,920 --> 00:18:27,600
And Minecraft is a perfect environment for that,

505
00:18:27,600 --> 00:18:29,520
because it's so complex and open-ended.

506
00:18:29,520 --> 00:18:32,840
But Voyager also shows us how even when an AI is super smart,

507
00:18:32,840 --> 00:18:35,480
if it doesn't have enough knowledge about a specific domain,

508
00:18:35,480 --> 00:18:36,520
it can make mistakes.

509
00:18:36,520 --> 00:18:39,000
And those mistakes can be amplified when the AI is actually

510
00:18:39,000 --> 00:18:41,280
doing things, not just answering questions.

511
00:18:41,280 --> 00:18:43,360
Because in Minecraft, Voyager is not just

512
00:18:43,360 --> 00:18:44,600
thinking about what to do.

513
00:18:44,600 --> 00:18:47,600
It's actually writing code to control its actions

514
00:18:47,600 --> 00:18:48,560
in the game, right?

515
00:18:48,560 --> 00:18:49,320
Yep.

516
00:18:49,320 --> 00:18:52,760
So even small errors can have big consequences.

517
00:18:52,760 --> 00:18:55,320
Plus, the paper points out that Voyager's design makes it

518
00:18:55,320 --> 00:18:58,600
a bit easier for attackers to do that model extraction thing

519
00:18:58,600 --> 00:18:59,400
we talked about.

520
00:18:59,400 --> 00:19:01,920
So there are definitely security risks there too.

521
00:19:01,920 --> 00:19:04,160
OK, on to case study number three.

522
00:19:04,160 --> 00:19:07,280
PReP, which is an AI for navigating cities.

523
00:19:07,280 --> 00:19:09,880
PReP is all about multimodal AI.

524
00:19:09,880 --> 00:19:12,680
It uses text and images to find its way around,

525
00:19:12,680 --> 00:19:15,200
kind of like a super-powered GPS.

526
00:19:15,200 --> 00:19:17,360
It's awesome because it can understand street signs,

527
00:19:17,360 --> 00:19:19,200
recognize landmarks, all that stuff.

528
00:19:19,200 --> 00:19:21,880
But again, when you're dealing with the real world,

529
00:19:21,880 --> 00:19:23,360
there are always more risks.

530
00:19:23,360 --> 00:19:23,640
Right.

531
00:19:23,640 --> 00:19:25,320
Like what if someone messed with a street sign

532
00:19:25,320 --> 00:19:27,560
to trick the AI into sending you the wrong way?

533
00:19:27,560 --> 00:19:28,360
Exactly.

534
00:19:28,360 --> 00:19:30,920
The paper talks about how PReP is really vulnerable

535
00:19:30,920 --> 00:19:33,600
to those adversarial examples, because attackers

536
00:19:33,600 --> 00:19:36,800
can manipulate both the text and images it uses.

537
00:19:36,800 --> 00:19:38,120
And when you're navigating a city,

538
00:19:38,120 --> 00:19:40,320
taking a wrong turn could be a lot more serious

539
00:19:40,320 --> 00:19:41,360
than in a video game.

540
00:19:41,360 --> 00:19:42,840
Yeah, that's for sure.

541
00:19:42,840 --> 00:19:44,080
OK, last case study.

542
00:19:44,080 --> 00:19:46,080
We've got ChatDev, which is a system

543
00:19:46,080 --> 00:19:49,760
where multiple AIs work together to write software code.

544
00:19:49,760 --> 00:19:50,600
This one's wild.

545
00:19:50,600 --> 00:19:53,760
It's like having a whole team of AI programmers.

546
00:19:53,760 --> 00:19:54,840
Super efficient.

547
00:19:54,840 --> 00:19:56,080
But also, think about it.

548
00:19:56,080 --> 00:19:58,560
If one of those AIs messes up, it

549
00:19:58,560 --> 00:20:00,800
could have a ripple effect on the whole project.

550
00:20:00,800 --> 00:20:02,320
It's like that game of telephone

551
00:20:02,320 --> 00:20:05,320
where the message gets all jumbled up as it's passed around.

552
00:20:05,320 --> 00:20:07,880
Except in this case, instead of a silly message,

553
00:20:07,880 --> 00:20:10,880
you could end up with buggy or even dangerous code.

554
00:20:10,880 --> 00:20:12,600
Yeah, good analogy.

555
00:20:12,600 --> 00:20:15,800
It highlights why we need ways to catch and fix errors

556
00:20:15,800 --> 00:20:18,600
in these multi-agent systems, especially when they're

557
00:20:18,600 --> 00:20:20,720
doing something as important as writing software.

558
00:20:20,720 --> 00:20:23,200
So I think we've covered a lot of ground today.

559
00:20:23,200 --> 00:20:25,160
It's both exciting and a little bit scary

560
00:20:25,160 --> 00:20:26,960
to see how fast AI is progressing.

561
00:20:26,960 --> 00:20:27,800
It is.

562
00:20:27,800 --> 00:20:29,840
But that's why these deep dives are so important, right?

563
00:20:29,840 --> 00:20:32,400
Yeah, you can't just blindly embrace new technology.

564
00:20:32,400 --> 00:20:34,840
We have to be aware of the potential downsides, too.

565
00:20:34,840 --> 00:20:35,560
Couldn't agree more.

566
00:20:35,560 --> 00:20:37,400
This paper is a real wake-up call.

567
00:20:37,400 --> 00:20:41,360
It's like, hey, AI is amazing, but we need to be careful.

568
00:20:41,360 --> 00:20:43,680
We can't just assume it's always going to be used for good.

569
00:20:43,680 --> 00:20:46,680
We need to think about the risks, build in strong security,

570
00:20:46,680 --> 00:20:49,080
and make sure we're using AI ethically.

571
00:20:49,080 --> 00:20:50,280
Well said.

572
00:20:50,280 --> 00:20:52,680
So what's the one big takeaway you hope our listeners

573
00:20:52,680 --> 00:20:54,120
get from today's episode?

574
00:20:54,120 --> 00:20:55,040
I think it's this.

575
00:20:55,040 --> 00:20:58,480
AI is changing the world in awesome ways, but it's not

576
00:20:58,480 --> 00:20:59,880
perfect.

577
00:20:59,880 --> 00:21:03,520
LLM-based agents especially, they have incredible potential,

578
00:21:03,520 --> 00:21:06,040
but also some serious risks.

579
00:21:06,040 --> 00:21:08,280
We all need to be aware of those risks

580
00:21:08,280 --> 00:21:10,240
and be part of the conversation about how

581
00:21:10,240 --> 00:21:12,120
to make sure AI is used for good.

582
00:21:12,120 --> 00:21:13,280
Totally agree.

583
00:21:13,280 --> 00:21:15,360
And as AI keeps evolving, those risks

584
00:21:15,360 --> 00:21:16,320
are going to change, too.

585
00:21:16,320 --> 00:21:18,160
So we've got to stay informed.

586
00:21:18,160 --> 00:21:20,120
Well, this has been a fascinating deep dive.

587
00:21:20,120 --> 00:21:21,200
Thanks for joining me today.

588
00:21:21,200 --> 00:21:21,720
My pleasure.

589
00:21:21,720 --> 00:21:23,320
And thanks to all our listeners out there.

590
00:21:23,320 --> 00:21:24,680
Keep those brains curious.

591
00:21:24,680 --> 00:21:28,760
Until next time.