WEBVTT

00:00:00.000 --> 00:00:03.000
You know, when we build a high-tech security

00:00:03.000 --> 00:00:05.259
system, we kind of just expect it to see the

00:00:05.259 --> 00:00:07.540
world essentially the same way we do. I mean,

00:00:07.599 --> 00:00:09.400
imagine you have a highly trained guard dog,

00:00:09.560 --> 00:00:11.460
right? Right, yeah. You completely trust that

00:00:11.460 --> 00:00:14.779
dog to instinctively know the difference between

00:00:14.779 --> 00:00:17.079
a burglar sneaking through the yard and, say,

00:00:17.219 --> 00:00:20.339
a mail carrier dropping off a package. It's just

00:00:20.339 --> 00:00:22.859
reliable. Yeah, we rely on our human senses.

00:00:23.219 --> 00:00:26.079
And so we naturally project that same basic grip

00:00:26.079 --> 00:00:30.339
on reality onto the systems we build. We assume

00:00:30.339 --> 00:00:32.140
they process the physical world with the exact

00:00:32.140 --> 00:00:34.439
same context that we do. You step into the world

00:00:34.439 --> 00:00:36.539
of artificial intelligence and the assumption

00:00:36.539 --> 00:00:38.979
completely falls apart. Suddenly your high-tech

00:00:38.979 --> 00:00:41.560
guard dog is staring straight at a burglar, but

00:00:41.560 --> 00:00:43.899
because the burglar is wearing a specific,

00:00:44.280 --> 00:00:47.020
slightly weird-looking sweater, the computer's

00:00:47.020 --> 00:00:49.920
brain just short-circuits. It literally registers

00:00:49.920 --> 00:00:53.420
the burglar as a potted plant. Yeah, it is the

00:00:53.420 --> 00:00:55.700
ultimate cognitive blind spot. And I think the

00:00:55.700 --> 00:00:58.100
terrifying part is how overwhelmingly easy it

00:00:58.100 --> 00:01:00.820
is to exploit once you actually understand the

00:01:00.820 --> 00:01:04.819
math behind it. Totally. Well, welcome to today's

00:01:04.819 --> 00:01:07.980
deep dive. We are looking at a massive stack

00:01:07.980 --> 00:01:10.719
of research today centered around a really fascinating

00:01:10.719 --> 00:01:13.420
field called adversarial machine learning, which

00:01:13.420 --> 00:01:16.579
is basically the study of how machine learning

00:01:16.579 --> 00:01:18.719
algorithms get attacked and how developers are

00:01:18.719 --> 00:01:21.060
frantically trying to defend them. Frantically

00:01:21.060 --> 00:01:23.349
is definitely the right word for it. Right. So

00:01:23.349 --> 00:01:25.849
our mission today is to equip you, the listener,

00:01:25.989 --> 00:01:28.829
with a clear understanding of the invisible vulnerabilities

00:01:28.829 --> 00:01:31.329
hidden inside the AI systems you interact with

00:01:31.329 --> 00:01:33.349
every single day. We're going to cut through

00:01:33.349 --> 00:01:35.790
all the dense academic jargon to show you exactly

00:01:35.790 --> 00:01:38.290
how these systems can be tricked, the actual

00:01:38.290 --> 00:01:41.930
mechanics behind the illusions, and why knowing

00:01:41.930 --> 00:01:44.790
this is basically becoming a crucial modern survival

00:01:44.790 --> 00:01:47.730
skill. Yeah. And to really grasp how an AI gets

00:01:47.730 --> 00:01:50.150
tricked, we first have to understand this massive

00:01:50.150 --> 00:01:52.819
foundational flaw in how it learns about the world.

00:01:53.299 --> 00:01:55.459
It all comes down to a fundamental statistical

00:01:55.459 --> 00:01:58.680
assumption that underpins almost all machine

00:01:58.680 --> 00:02:01.400
learning. In the field, it's called the IID assumption,

00:02:01.879 --> 00:02:03.519
which stands for independent and identically

00:02:03.519 --> 00:02:05.680
distributed. OK, let's unpack this for a second,

00:02:05.680 --> 00:02:08.560
because that sounds incredibly dense. What does

00:02:08.560 --> 00:02:11.340
independent and identically distributed actually

00:02:11.340 --> 00:02:14.740
mean for the AI sitting in a server somewhere?

00:02:14.879 --> 00:02:18.080
Right. In plain English, when developers build

00:02:18.080 --> 00:02:20.639
the machine learning model, they are desperately

00:02:20.639 --> 00:02:24.909
hoping that the test data, meaning the chaotic,

00:02:25.189 --> 00:02:28.389
unpredictable real world that the AI is eventually

00:02:28.389 --> 00:02:31.590
going to operate in, will look exactly like the

00:02:31.590 --> 00:02:33.849
training data it was educated on in the laboratory.

00:02:33.949 --> 00:02:36.810
Oh, I see. Yeah, they are assuming the statistical

00:02:36.810 --> 00:02:39.800
distribution of the real world... perfectly matches

00:02:39.800 --> 00:02:41.620
their controlled environment. So it's kind of

00:02:41.620 --> 00:02:44.280
like training that guard dog to fetch a stick

00:02:44.280 --> 00:02:47.740
in your perfectly manicured, fenced-in backyard.

00:02:48.180 --> 00:02:50.039
You throw the stick, the dog brings it back a

00:02:50.039 --> 00:02:52.699
thousand times. Flawless execution. Exactly.

00:02:52.699 --> 00:02:54.860
And then you take that same dog, you drop it

00:02:54.860 --> 00:02:57.120
in the middle of the Amazon jungle, throw a live

00:02:57.120 --> 00:02:59.900
snake, and expect the exact same fetching behavior.

00:02:59.960 --> 00:03:01.800
Right. Because the shape of the snake is roughly

00:03:01.800 --> 00:03:04.020
similar to the stick, completely ignoring that

00:03:04.020 --> 00:03:06.500
the context, the environment, all the underlying

00:03:06.500 --> 00:03:09.379
reality has violently changed. Yeah, and the

00:03:09.379 --> 00:03:12.300
source material highlights just how spectacularly

00:03:12.300 --> 00:03:15.680
that assumption breaks down when users intentionally

00:03:15.680 --> 00:03:18.960
supply fabricated data. They give this one really

00:03:18.960 --> 00:03:21.780
crazy example. Researchers used a low-cost,

00:03:22.039 --> 00:03:24.639
commercially available 3D printer to create a

00:03:24.639 --> 00:03:26.939
little toy turtle. Just a standard plastic turtle.

00:03:27.060 --> 00:03:29.020
Just a standard little plastic turtle. To the

00:03:29.020 --> 00:03:31.860
human eye, it's obviously just a toy. But they

00:03:31.860 --> 00:03:34.219
mathematically engineered the texture on the

00:03:34.219 --> 00:03:38.280
shell so specifically that Google's object detection

00:03:38.280 --> 00:03:41.599
AI classified it as a rifle. Wait, a plastic

00:03:41.599 --> 00:03:44.639
toy turtle registers as a deadly weapon. A rifle,

00:03:44.840 --> 00:03:47.080
yeah. And not just from one tricky angle, but

00:03:47.080 --> 00:03:49.080
from almost every angle it was viewed from. How

00:03:49.080 --> 00:03:51.740
does a computer even make that leap? I mean,

00:03:51.780 --> 00:03:53.819
they don't look anything alike. Because the AI

00:03:53.819 --> 00:03:56.000
isn't looking at the turtle-ness of the object,

00:03:56.120 --> 00:03:57.939
right? Yeah. It has absolutely no concept of

00:03:57.939 --> 00:04:00.199
what an animal or weapon actually is in the physical

00:04:00.199 --> 00:04:02.560
world. It is literally just looking at pixel

00:04:02.560 --> 00:04:04.819
values and mathematical gradients. OK, gradient.

00:04:04.960 --> 00:04:07.729
Yeah, think of a gradient as... like a topographic

00:04:07.729 --> 00:04:10.789
map of the AI's confidence. When an attacker

00:04:10.789 --> 00:04:13.550
suddenly alters the pixels on that turtle shell,

00:04:13.969 --> 00:04:16.089
they are mapping out a path down the steepest

00:04:16.089 --> 00:04:18.790
mathematical hill, essentially pushing the image

00:04:18.790 --> 00:04:21.310
across a strict boundary in the AI's programming

00:04:21.310 --> 00:04:24.009
until it falls into the rifle category. Wow.

00:04:24.189 --> 00:04:26.250
And it goes the other way too, right? Because

00:04:26.250 --> 00:04:28.589
the sources mention a machine-tweaked image

00:04:28.589 --> 00:04:30.949
of a dog that looks completely like a normal

00:04:30.949 --> 00:04:33.769
dog to you and me, but the computer absolutely

00:04:33.769 --> 00:04:36.420
insists that it's a cat. Yeah, they are breaking

00:04:36.420 --> 00:04:39.199
that IID assumption on purpose. They're proving

00:04:39.199 --> 00:04:43.360
that the AI sees math, not meaning. OK, so if

00:04:43.360 --> 00:04:45.500
the AI is essentially just doing math on pixels,

00:04:46.040 --> 00:04:48.740
it makes you wonder how people are actively weaponizing

00:04:48.740 --> 00:04:51.300
that math out in the wild. And the research breaks

00:04:51.300 --> 00:04:53.240
this down into two main battlegrounds, which

00:04:53.240 --> 00:04:56.269
are evasion and data poisoning. Right, and these

00:04:56.269 --> 00:04:58.790
two strategies attack completely different phases

00:04:58.790 --> 00:05:01.610
of the machine learning lifecycle. Evasion attacks

00:05:01.610 --> 00:05:03.990
target a finished, deployed model that's already

00:05:03.990 --> 00:05:06.170
operating in the real world. Data poisoning,

00:05:06.170 --> 00:05:08.750
on the other hand, targets the model's education

00:05:08.750 --> 00:05:10.990
while it's still learning. So imagine you're

00:05:10.990 --> 00:05:13.470
trying to crash an exclusive VIP party, right?

00:05:13.990 --> 00:05:16.709
Evasion is wearing a brilliantly crafted disguise

00:05:16.709 --> 00:05:19.730
to fool the bouncer at the door. You're tricking

00:05:19.730 --> 00:05:22.290
the system in real time. I love that analogy.

00:05:22.689 --> 00:05:25.529
Yes. While data poisoning is like hacking the

00:05:25.529 --> 00:05:27.750
bouncer's guest list a week before the party

00:05:27.750 --> 00:05:30.170
even starts. You're corrupting the foundation.

00:05:30.509 --> 00:05:32.870
Exactly. You are corrupting the foundation the

00:05:32.870 --> 00:05:36.029
bouncer relies on. So let's look at evasion first.

00:05:36.970 --> 00:05:39.189
The goal here is to exploit the imperfection

00:05:39.189 --> 00:05:41.470
of the trained model without ever touching its

00:05:41.470 --> 00:05:44.040
training data. And there's this classic example

00:05:44.040 --> 00:05:46.660
from the sources involving a McAfee attack on

00:05:46.660 --> 00:05:49.439
a Tesla's Mobileye system. Oh, this one is wild.

00:05:49.560 --> 00:05:52.220
It really is. The attackers didn't hack the car's

00:05:52.220 --> 00:05:54.379
computer at all. They simply added a two-inch

00:05:54.379 --> 00:05:56.579
strip of black tape to a physical speed limit

00:05:56.579 --> 00:05:58.939
sign. Just two inches of tape, not even covering

00:05:58.939 --> 00:06:00.819
the number on the sign, right? Just placed on

00:06:00.819 --> 00:06:03.699
the sign itself. That is all it took. To a human,

00:06:04.019 --> 00:06:06.079
it clearly still looks like a normal speed limit

00:06:06.079 --> 00:06:08.490
sign with a weird piece of tape on it. But to

00:06:08.490 --> 00:06:12.089
the AI, that tiny perturbation shifted the mathematical

00:06:12.089 --> 00:06:14.949
average of the pixels just enough that it crossed

00:06:14.949 --> 00:06:16.910
one of those unseen boundaries we talked about.

00:06:17.230 --> 00:06:19.689
And it fooled the car into driving 50 miles per

00:06:19.689 --> 00:06:22.889
hour over the speed limit. Right. It read a 35

00:06:22.889 --> 00:06:25.990
mile per hour sign as an 85 mile per hour sign.

00:06:26.170 --> 00:06:29.149
That is terrifying. And it's not just cars, either.

00:06:29.350 --> 00:06:31.730
The sources detail this entire niche industry

00:06:31.730 --> 00:06:34.490
of stealth streetwear. It's basically clothing

00:06:34.490 --> 00:06:37.129
or glasses printed with adversarial patterns,

00:06:37.389 --> 00:06:39.870
these mathematical optical illusions designed

00:06:39.870 --> 00:06:42.889
specifically to deceive facial recognition systems.

00:06:43.149 --> 00:06:45.009
Yeah, you just wear a weirdly patterned t-shirt

00:06:45.009 --> 00:06:46.850
and the security camera thinks you're a flock

00:06:46.850 --> 00:06:48.970
of birds or it simply doesn't register a human

00:06:48.970 --> 00:06:51.269
face at all. So you're disguising the input at

00:06:51.269 --> 00:06:53.680
the door. Exactly. Data poisoning, however, is

00:06:53.680 --> 00:06:56.600
far more insidious. This is contaminating the

00:06:56.600 --> 00:06:58.639
training data set with data specifically designed

00:06:58.639 --> 00:07:01.319
to increase errors or to teach the algorithm

00:07:01.319 --> 00:07:04.319
malicious behaviors from day one. And the sources

00:07:04.319 --> 00:07:06.519
bring up a really fascinating modern example

00:07:06.519 --> 00:07:10.180
of this from 2023. Researchers at the University

00:07:10.180 --> 00:07:12.939
of Chicago created this tool called Nightshade.

00:07:13.199 --> 00:07:15.420
Nightshade is a brilliant application of data

00:07:15.420 --> 00:07:18.129
poisoning. So visual artists were finding that

00:07:18.129 --> 00:07:20.170
massive tech companies were just scraping their

00:07:20.170 --> 00:07:23.149
artwork off the internet without consent to train

00:07:23.149 --> 00:07:25.990
text-to-image generative models, right? So

00:07:25.990 --> 00:07:28.889
artists started using Nightshade to proactively

00:07:28.889 --> 00:07:31.689
poison their own artwork online. Wait, so how

00:07:31.689 --> 00:07:34.129
do you booby trap a painting? Well, they alter

00:07:34.129 --> 00:07:36.069
the pixels of their art in a way that is totally

00:07:36.069 --> 00:07:39.629
invisible to human fans But when the AI scrapes

00:07:39.629 --> 00:07:42.170
it and processes it the data is fundamentally

00:07:42.170 --> 00:07:45.620
corrupted It teaches the model the wrong associations

00:07:45.620 --> 00:07:48.540
in its underlying vector space. Oh, I see. So

00:07:48.540 --> 00:07:50.980
later, when a user asks the AI to generate an

00:07:50.980 --> 00:07:53.259
image of a dog, it might spit out a picture of

00:07:53.259 --> 00:07:55.360
a purse because the poison data fundamentally

00:07:55.360 --> 00:07:57.319
broke the mathematical link between the word

00:07:57.319 --> 00:08:00.920
dog and the visual representation of a dog. Wow,
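The broken word-to-image association the hosts describe can be sketched with a toy poisoning experiment. This is a minimal illustrative sketch, not how Nightshade actually works: the "model" is a 1-nearest-neighbour classifier over made-up four-number feature vectors, and the poison flips label associations outright, whereas the real tool perturbs pixels imperceptibly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: made-up 4-number "feature vectors" standing in for images.
dog_imgs = rng.normal(loc=0.0, size=(20, 4))   # dog-like features
cat_imgs = rng.normal(loc=5.0, size=(20, 4))   # cat-like features

def train(samples):
    """A 1-nearest-neighbour 'model': predict the label of the closest
    training example."""
    X = np.array([feats for feats, _ in samples])
    y = [lbl for _, lbl in samples]
    return lambda x: y[int(np.argmin(((X - x) ** 2).sum(axis=1)))]

clean = [(f, "dog") for f in dog_imgs] + [(f, "cat") for f in cat_imgs]
# Poisoned scrape: the images themselves are unchanged, but the
# word/feature associations the model learns from them are corrupted.
poisoned = [(f, "cat") for f in dog_imgs] + [(f, "dog") for f in cat_imgs]

query = np.zeros(4)            # unambiguously dog-like features
print(train(clean)(query))     # dog
print(train(poisoned)(query))  # cat: the word/image link is broken
```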

00:08:01.100 --> 00:08:03.459
okay. So evasion and poisoning are all about

00:08:03.459 --> 00:08:05.980
manipulating the data to break the AI or make

00:08:05.980 --> 00:08:09.129
it hallucinate. But what if the attacker doesn't

00:08:09.129 --> 00:08:10.930
want to break the AI? What if they just want

00:08:10.930 --> 00:08:12.889
to straight up steal the algorithm itself? I

00:08:12.889 --> 00:08:16.470
mean, is a locked proprietary system safe from

00:08:16.470 --> 00:08:18.910
that? That brings up a whole different category

00:08:18.910 --> 00:08:21.910
of attacks that target the model itself, not

00:08:21.910 --> 00:08:24.970
just its outputs. This includes model extraction

00:08:24.970 --> 00:08:27.569
and Byzantine attacks. Wait, hold on. If a model

00:08:27.569 --> 00:08:29.870
is locked in a black box, meaning, you know,

00:08:29.870 --> 00:08:31.449
you don't have the source code, you don't know

00:08:31.449 --> 00:08:33.850
the parameters, and you only interact with it

00:08:33.850 --> 00:08:36.029
through a basic interface where you ask a question

00:08:36.029 --> 00:08:39.149
and get an answer, How do you steal the underlying

00:08:39.149 --> 00:08:41.789
math? Isn't that like trying to steal a master

00:08:41.789 --> 00:08:44.029
chef's secret recipe just by tasting the soup?

00:08:44.190 --> 00:08:46.389
It is exactly like that. But imagine you have

00:08:46.389 --> 00:08:49.070
unlimited soup and you're a mathematical super

00:08:49.070 --> 00:08:51.570
taster. OK. Unlimited soup. I'm listening. Right.

00:08:51.570 --> 00:08:55.350
By systematically tasting, querying the model

00:08:55.350 --> 00:08:57.850
thousands or millions of times with highly specific

00:08:57.850 --> 00:09:01.190
inputs, and analyzing exactly how the model responds

00:09:01.190 --> 00:09:04.230
to tiny variations, you can mathematically reverse

00:09:04.230 --> 00:09:06.450
engineer the ingredients. You're just mapping

00:09:06.450 --> 00:09:09.470
the edges of what it knows. Attackers probe

00:09:09.470 --> 00:09:11.870
the black box to extract its underlying logic.

00:09:13.110 --> 00:09:15.649
The sources highlight this scenario where an

00:09:15.649 --> 00:09:18.950
adversary targets a highly proprietary multi-

00:09:18.950 --> 00:09:21.470
million-dollar stock trading model. Oh wow.

00:09:21.809 --> 00:09:24.330
By pinging it continuously until they've mapped

00:09:24.330 --> 00:09:27.190
its decision making boundaries, they can perfectly

00:09:27.190 --> 00:09:30.490
recreate a clone of that model for their own

00:09:30.490 --> 00:09:32.919
financial gain. without ever breaching the actual

00:09:32.919 --> 00:09:35.000
servers. That's crazy. And it gets darker than

00:09:35.000 --> 00:09:36.899
just stealing financial algorithms, right? Because

00:09:36.899 --> 00:09:38.960
there's a subset of this called membership inference.

00:09:39.039 --> 00:09:42.220
Yes. This exploits a common flaw in AI called

00:09:42.220 --> 00:09:44.860
overfitting, where a model remembers its training

00:09:44.860 --> 00:09:47.940
material a little too well. OK. Attackers can

00:09:47.940 --> 00:09:49.960
query the system to figure out if a specific

00:09:49.960 --> 00:09:52.580
data point was used to train it. They could leverage

00:09:52.580 --> 00:09:56.149
this to discover if sensitive personally identifiable

00:09:56.149 --> 00:09:58.809
information, like private medical records, was

00:09:58.809 --> 00:10:01.009
secretly fed into the system by the developers.

00:10:01.210 --> 00:10:04.070
So just by analyzing the outputs, they can mathematically

00:10:04.070 --> 00:10:06.549
prove your private data is baked into the system.

00:10:06.629 --> 00:10:08.789
I mean, that is a massive privacy violation.
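The overfitting leak can be made concrete with a caricature. A sketch under strong assumptions: the "model" below is maximally overfit, its confidence spikes exactly on memorized points, and all the data is invented.

```python
import numpy as np

rng = np.random.default_rng(4)

# Invented "sensitive" training records (3 numbers each).
train_pts = rng.normal(size=(10, 3))

def confidence(x):
    """Caricature of an overfit model: confidence decays with distance
    from the nearest memorized training record."""
    nearest = np.min(((train_pts - x) ** 2).sum(axis=1))
    return float(np.exp(-nearest))   # exactly 1.0 on a training record

member = train_pts[0]                 # was in the training set
outsider = rng.normal(size=3) + 10.0  # was not

# Membership-inference rule: suspiciously high confidence => "member".
print(confidence(member) > 0.99)    # True
print(confidence(outsider) > 0.99)  # False
```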

00:10:09.029 --> 00:10:11.669
It really is. Yeah. And we also see vulnerabilities

00:10:11.669 --> 00:10:14.710
in how AI models collaborate, which is known

00:10:14.710 --> 00:10:18.009
as Byzantine attacks. Right. As machine learning

00:10:18.009 --> 00:10:20.929
scales up, developers use something called federated

00:10:20.929 --> 00:10:23.919
learning. Instead of one massive central computer

00:10:23.919 --> 00:10:27.500
doing all the training, edge devices like thousands

00:10:27.500 --> 00:10:30.659
of mobile phones or local servers train locally

00:10:30.659 --> 00:10:33.440
and send updates back to a central server. So

00:10:33.440 --> 00:10:35.700
the central brain essentially relies on thousands

00:10:35.700 --> 00:10:38.799
of little scout brains. Exactly. A Byzantine

00:10:38.799 --> 00:10:41.639
attack occurs when a minority of those edge devices

00:10:41.639 --> 00:10:44.659
deviate from their expected behavior. They send

00:10:44.659 --> 00:10:47.679
back malicious updates, either to harm the central

00:10:47.679 --> 00:10:50.500
model or to bias it. Oh, I see. For example,

00:10:50.740 --> 00:10:52.940
they might amplify the recommendation of disinformation

00:10:52.940 --> 00:10:55.480
on a social media feed. The system is mathematically

00:10:55.480 --> 00:10:57.940
designed to aggregate and trust the scouts, so

00:10:57.940 --> 00:11:00.440
just a few bad actors can completely derail the

00:11:00.440 --> 00:11:02.980
central server. Man. You know, this whole idea
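The derailment is easy to see with numbers. A toy sketch: the updates are invented single-parameter "gradients"; real federated learning aggregates whole weight vectors, and robust aggregators are more sophisticated than a plain median.

```python
import numpy as np

# Six honest scouts roughly agree the update should be about +1.0;
# one Byzantine scout reports something wild.
honest = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0]
byzantine = [-50.0]

updates = np.array(honest + byzantine)

mean_agg = updates.mean()               # naive aggregation: trust everyone
median_agg = float(np.median(updates))  # a simple robust alternative

print(round(mean_agg, 2))  # -6.29: one bad scout derails the average
print(median_agg)          # 1.0: the median shrugs the outlier off
```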

00:11:02.980 --> 00:11:05.379
of querying and manipulating a locked system

00:11:05.379 --> 00:11:07.639
brings up something that genuinely broke my brain

00:11:07.639 --> 00:11:10.960
in these sources. How exactly are attackers crafting

00:11:10.960 --> 00:11:14.720
these perfect optical illusions, like the stealth

00:11:14.720 --> 00:11:17.820
streetwear or the tape on the speed sign, if

00:11:17.820 --> 00:11:20.840
they don't even know the AI's underlying code?

00:11:20.960 --> 00:11:23.740
Like, how is that possible? Right, so to understand

00:11:23.740 --> 00:11:25.919
that, we have to distinguish between white box

00:11:25.919 --> 00:11:28.440
and black box attacks. In a white box attack,

00:11:28.649 --> 00:11:30.590
like the fast gradient sign method mentioned

00:11:30.590 --> 00:11:33.269
in the sources, the attacker knows all the model

00:11:33.269 --> 00:11:35.990
parameters. They have the topographic map. Exactly.

00:11:36.090 --> 00:11:38.149
They have the map. So they use the model's own

00:11:38.149 --> 00:11:40.450
mathematical gradients against it to calculate

00:11:40.450 --> 00:11:43.190
the exact minute amount of pixel noise needed

00:11:43.190 --> 00:11:45.649
to force the AI down the hill into the wrong

00:11:45.649 --> 00:11:47.929
answer. But in a black box attack, they don't
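The white-box recipe, following the model's own gradient downhill to the wrong answer, is the fast gradient sign method in miniature. An illustrative sketch assuming the "model" is a tiny logistic scorer with made-up weights, and with the step size exaggerated so the flip is obvious:

```python
import numpy as np

# A fully known ("white-box") model: a logistic scorer with 3 weights.
w = np.array([2.0, -3.0, 1.5])   # made-up, known model weights
b = 0.5

def p_dog(x):
    """Model's probability that x is a 'dog'."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def fgsm(x, y_true, eps):
    """Fast Gradient Sign Method: nudge every input dimension by eps in
    whichever direction increases the loss (sign of the gradient)."""
    grad = (p_dog(x) - y_true) * w   # d(cross-entropy loss)/dx
    return x + eps * np.sign(grad)

x = np.array([1.0, -1.0, 1.0])        # clean input: confidently "dog"
x_adv = fgsm(x, y_true=1.0, eps=1.5)  # eps exaggerated for the demo

print(p_dog(x) > 0.5)      # True: dog
print(p_dog(x_adv) > 0.5)  # False: pushed over the boundary
```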

00:11:47.929 --> 00:11:49.929
have the map. They're essentially flying blind.

00:11:50.110 --> 00:11:52.309
Right. In a black box scenario, the adversary

00:11:52.309 --> 00:11:55.009
can only feed inputs in and see what comes out.

00:11:55.419 --> 00:11:58.580
yet they still manage to create devastating adversarial

00:11:58.580 --> 00:12:02.159
examples. The research details two fascinating

00:12:02.159 --> 00:12:05.039
methods for this, the square attack and the hop-

00:12:05.039 --> 00:12:06.879
skip-jump attack. Okay, let's start with the

00:12:06.879 --> 00:12:09.500
square attack. How do you optimize an image if

00:12:09.500 --> 00:12:11.519
you can't see the code? So the square attack

00:12:11.519 --> 00:12:14.100
is incredibly query -efficient. The attacker

00:12:14.100 --> 00:12:16.600
only needs the probability scores the model spits

00:12:16.600 --> 00:12:20.419
out. Say the AI responds, I am 90 % sure this

00:12:20.419 --> 00:12:23.100
image is a dog and 10 % sure it's a cat. Okay.

00:12:23.919 --> 00:12:26.440
The attacking algorithm takes the image and randomly

00:12:26.440 --> 00:12:29.259
changes very small square sections of pixels.

00:12:29.679 --> 00:12:32.539
It feeds the altered image back in. Did the cat

00:12:32.539 --> 00:12:35.580
probability go up to 11 %? And if it did? If

00:12:35.580 --> 00:12:37.940
yes, it keeps that change. Did it go down? It

00:12:37.940 --> 00:12:40.259
discards it. It just iteratively alters these

00:12:40.259 --> 00:12:43.519
tiny squares, optimizing entirely based on those

00:12:43.519 --> 00:12:46.559
shifting probability percentages until it forces

00:12:46.559 --> 00:12:48.820
the AI to be highly confident in the incorrect

00:12:48.820 --> 00:12:51.659
class. Wow, so it's just brute forcing the math,
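That keep-it-if-the-score-moved loop is simple to sketch. An illustrative toy with an invented scoring model standing in for the real API; a genuine square attack also schedules the square size and respects a perturbation budget.

```python
import numpy as np

rng = np.random.default_rng(0)

# Black-box model: the attacker sees only this probability score.
_hidden = rng.normal(size=(8, 8))   # unknown to the attacker
def cat_prob(img):
    return 1.0 / (1.0 + np.exp(-(img * _hidden).sum()))

def square_attack(img, steps=500, size=2, eps=0.5):
    """Randomly perturb small pixel squares; keep a change only if the
    'cat' probability the model reports goes up."""
    best, best_p = img.copy(), cat_prob(img)
    for _ in range(steps):
        cand = best.copy()
        r, c = rng.integers(0, img.shape[0] - size + 1, size=2)
        cand[r:r + size, c:c + size] += rng.choice([-eps, eps])
        p = cat_prob(cand)
        if p > best_p:              # score went up: keep the square
            best, best_p = cand, p
    return best

dog_img = rng.normal(size=(8, 8)) * 0.1
adv_img = square_attack(dog_img)
# cat_prob only ever increases, so the attack ratchets toward "cat".
```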

00:12:51.860 --> 00:12:54.480
pixel square by pixel square. Exactly. But the

00:12:54.480 --> 00:12:56.139
hop-skip-jump attack takes this even further,

00:12:56.279 --> 00:12:58.320
doesn't it? Oh yeah. Hop-skip-jump doesn't even

00:12:58.320 --> 00:13:00.379
need the probability score. It relies only on

00:13:00.379 --> 00:13:02.559
the final class prediction. The AI just says

00:13:02.559 --> 00:13:05.409
dog or cat, no percentages at all. Wait, hold

00:13:05.409 --> 00:13:07.210
on. If it doesn't even get a percentage, if the

00:13:07.210 --> 00:13:09.830
AI literally just says dog, how does the algorithm

00:13:09.830 --> 00:13:12.070
know if a random pixel change is moving in the

00:13:12.070 --> 00:13:14.429
right direction? That sounds mathematically impossible

00:13:14.429 --> 00:13:16.990
to optimize. It sounds impossible, but it uses

00:13:16.990 --> 00:13:19.690
a modified binary search to find the literal

00:13:19.690 --> 00:13:22.490
edge of the AI's reality. It takes an image of

00:13:22.490 --> 00:13:25.590
a dog and a completely random image of static

00:13:25.590 --> 00:13:28.720
that the AI happens to classify as a cat. OK.

00:13:29.019 --> 00:13:31.399
It draws a mathematical line between them and

00:13:31.399 --> 00:13:33.840
finds the exact intersection where the AI switches

00:13:33.840 --> 00:13:36.940
its mind from cat to dog. Like finding the exact

00:13:36.940 --> 00:13:39.759
border wall between two countries on a map. Yes.

00:13:40.320 --> 00:13:42.299
And once it finds that boundary, it throws out

00:13:42.299 --> 00:13:44.539
random vectors to approximate the mathematical

00:13:44.539 --> 00:13:46.580
gradient. It's like playing an incredibly intense

00:13:46.580 --> 00:13:49.440
game of Marco Polo in the dark. That's a perfect

00:13:49.440 --> 00:13:52.419
way to visualize it. By shouting Marco and seeing

00:13:52.419 --> 00:13:54.940
where the boundary lies, it inches along that

00:13:54.940 --> 00:13:57.460
border wall, closer and closer to the original

00:13:57.460 --> 00:14:00.120
image of the dog. It keeps inching until it finds

00:14:00.120 --> 00:14:02.279
a point that looks exactly like the dog to the

00:14:02.279 --> 00:14:05.019
human eye, but falls just one millimeter over

00:14:05.019 --> 00:14:07.639
the mathematical line into the cat territory

00:14:07.639 --> 00:14:10.759
for the AI. That is wild. And it requires absolutely

00:14:10.759 --> 00:14:13.200
zero knowledge of the model's inner workings.

00:14:13.370 --> 00:14:16.009
So if we know the exact mathematical mechanisms

00:14:16.009 --> 00:14:18.190
attackers are using to do this, it makes you

00:14:18.190 --> 00:14:20.490
wonder why we can't just patch these vulnerabilities

00:14:20.490 --> 00:14:23.230
like a standard software bug. Why is defense

00:14:23.230 --> 00:14:25.809
so much harder than offense here? Because applying

00:14:25.809 --> 00:14:28.529
machine learning to real -world security domains

00:14:28.529 --> 00:14:31.809
like malware detection presents massive hurdles

00:14:31.809 --> 00:14:34.149
that just don't exist in a clean laboratory.

00:14:34.769 --> 00:14:37.169
The first major hurdle is called concept drift.

00:14:37.529 --> 00:14:39.850
Which is really just about the bad guys constantly

00:14:39.850 --> 00:14:42.570
changing their tactics, right? Exactly. Malware

00:14:42.570 --> 00:14:45.450
creators are evolving every single day. They

00:14:45.450 --> 00:14:47.529
change the statistical properties of their malicious

00:14:47.529 --> 00:14:51.649
code to evade detection. The data itself is fundamentally

00:14:51.649 --> 00:14:54.330
shifting. So a model trained on yesterday's malware

00:14:54.330 --> 00:14:56.809
is already out of date today. The concept it

00:14:56.809 --> 00:14:59.049
originally learned has literally drifted. Right.

00:14:59.350 --> 00:15:00.889
And then there's the math problem of the real

00:15:00.889 --> 00:15:03.809
world. The sources call it class imbalance. Yeah,

00:15:03.909 --> 00:15:06.509
so in a laboratory, you might train an AI with

00:15:06.509 --> 00:15:09.250
50 % safe files and 50 % malware so it learns

00:15:09.250 --> 00:15:11.909
the difference perfectly. But in realistic deployment

00:15:11.909 --> 00:15:13.929
environments, malicious samples might only make

00:15:13.929 --> 00:15:18.049
up 0.01% to 2% of the total data passing through

00:15:18.049 --> 00:15:21.029
a network. It's a needle in a massive stack of

00:15:21.029 --> 00:15:23.210
needles. Right. And this creates a phenomenon

00:15:23.210 --> 00:15:26.259
called the base rate fallacy. The model quickly

00:15:26.259 --> 00:15:28.500
figures out that if it just blindly guesses that

00:15:28.500 --> 00:15:32.399
every single file is safe, it will be 99.9%

00:15:32.399 --> 00:15:35.759
accurate. Because 99.9% of the files actually

00:15:35.759 --> 00:15:38.879
are safe. Exactly. So it gets an A -plus on its

00:15:38.879 --> 00:15:41.519
accuracy report by just being completely lazy

00:15:41.519 --> 00:15:44.320
while failing its actual purpose of catching

00:15:44.320 --> 00:15:46.659
the rare malware. Yep. It develops a massive
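The lazy A-plus is easy to reproduce. A sketch with invented counts, using 0.1% malware, within the range the sources give:

```python
# Invented counts: 100,000 files, of which 0.1% are malware.
n_total = 100_000
n_malware = 100

labels = [1] * n_malware + [0] * (n_total - n_malware)  # 1 = malware
preds = [0] * n_total   # the "lazy" model: everything is safe

accuracy = sum(p == y for p, y in zip(preds, labels)) / n_total
recall = sum(p == 1 for p, y in zip(preds, labels) if y == 1) / n_malware

print(f"accuracy = {accuracy:.1%}")  # 99.9%: looks like an A-plus
print(f"recall   = {recall:.0%}")    # 0%: catches no malware at all
```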

00:15:46.659 --> 00:15:49.759
bias toward the majority class. However, the

00:15:49.759 --> 00:15:52.159
research does highlight a really creative fix

00:15:52.159 --> 00:15:54.360
for this using a natural language processing

00:15:54.360 --> 00:15:57.240
model called BERT. Wait, BERT? Like the text

00:15:57.240 --> 00:15:59.740
prediction AI, the thing that writes essays and

00:15:59.740 --> 00:16:01.700
answers questions? That's the one. A computer

00:16:01.700 --> 00:16:04.080
virus is a compiled program, it's not a paragraph.

00:16:04.539 --> 00:16:07.440
How do you apply a grammar tool to malware? Well,

00:16:07.440 --> 00:16:09.220
this is where the defense gets incredibly innovative.

00:16:09.700 --> 00:16:11.480
A computer program is essentially a sequence

00:16:11.480 --> 00:16:13.620
of commands, right? Open this file, read this

00:16:13.620 --> 00:16:16.429
memory, send this data. Okay. Researchers realized

00:16:16.429 --> 00:16:19.169
that this sequence of application programming

00:16:19.169 --> 00:16:23.049
interface, or API calls, mathematically resembles

00:16:23.049 --> 00:16:26.169
a sequence of words in a sentence. They're essentially

00:16:26.169 --> 00:16:28.570
the nouns and verbs of the software. Oh, wow.

00:16:28.629 --> 00:16:30.809
So they treated the malware like a foreign language?

00:16:31.049 --> 00:16:34.179
Exactly. They've fine-tuned BERT, which is incredible

00:16:34.179 --> 00:16:36.659
at understanding context and sentences, to read

00:16:36.659 --> 00:16:38.980
this malware language. And it actually worked

00:16:38.980 --> 00:16:41.259
beautifully. Really? Yeah. On an Android data

00:16:41.259 --> 00:16:44.460
set with only 0.5% malware, it achieved massive

00:16:44.460 --> 00:16:46.799
improvements over traditional sequence models.

00:16:47.259 --> 00:16:49.700
It learned to spot the malicious grammar even

00:16:49.700 --> 00:16:51.659
when it was incredibly rare. That makes a lot
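The programs-as-sentences idea can be illustrated without BERT at all. A toy sketch, with all call names and traces invented, that flags suspicious word pairs (bigrams) instead of using a learned language model:

```python
# Treat a program's API-call trace like a sentence of words.
benign_traces = [
    "open_file read_file close_file",
    "open_file read_file send_data close_file",
]
malicious_traces = [
    "open_file read_memory encrypt_file send_data delete_file",
]

def bigrams(trace):
    """The 'word pairs' of the trace, e.g. (open_file, read_file)."""
    toks = trace.split()
    return {(a, b) for a, b in zip(toks, toks[1:])}

# "Grammar" of suspicious pairs: seen in malware, never in benign code.
# A real system would learn these associations in context, BERT-style.
suspicious = set.union(*(bigrams(t) for t in malicious_traces)) - \
             set.union(*(bigrams(t) for t in benign_traces))

def flag(trace):
    return bool(bigrams(trace) & suspicious)

print(flag("open_file read_file close_file"))                # False
print(flag("open_file read_memory encrypt_file send_data"))  # True
```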

00:16:51.659 --> 00:16:54.809
of sense. Let me throw a simpler defense at you

00:16:54.809 --> 00:16:57.669
based on the text. Sure. If attackers are tricking

00:16:57.669 --> 00:17:00.330
single models with these evasion attacks, why

00:17:00.330 --> 00:17:02.730
do we need complex language models? Why not just

00:17:02.730 --> 00:17:05.210
use an ensemble, like group a dozen different

00:17:05.210 --> 00:17:08.269
AI models together? That way, if one AI gets

00:17:08.269 --> 00:17:10.049
fooled by the piece of tape on the speed limit sign,

00:17:10.210 --> 00:17:12.990
the other 11 models catch the attacker. Safety

00:17:12.990 --> 00:17:15.690
in numbers. It sounds like common sense, but

00:17:15.690 --> 00:17:18.109
this reveals a pretty harsh truth about how these

00:17:18.109 --> 00:17:21.529
attacks function on a mathematical level. The

00:17:21.529 --> 00:17:23.950
sources explicitly state that while ensembles

00:17:23.950 --> 00:17:26.589
are indeed effective against data poisoning attacks,

00:17:27.230 --> 00:17:29.309
they are utterly ineffective against evasion

00:17:29.309 --> 00:17:32.849
attacks. Wait, really? Why wouldn't safety in

00:17:32.849 --> 00:17:36.170
numbers work? Because evasion attacks exploit

00:17:36.170 --> 00:17:38.829
the fundamental mathematical way neural networks

00:17:38.829 --> 00:17:41.930
process data. If you craft an adversarial example

00:17:41.930 --> 00:17:44.569
that tricks one deep learning model, it has a

00:17:44.569 --> 00:17:46.769
very high probability of transferring and tricking

00:17:46.769 --> 00:17:48.809
all similar models, even if they were trained

00:17:48.809 --> 00:17:51.710
slightly differently. Oh. Because they're all

00:17:51.710 --> 00:17:53.730
optimizing towards similar mathematical features,

00:17:54.450 --> 00:17:56.750
the blind spot is shared. So the whole squad

00:17:56.750 --> 00:17:58.849
falls for the exact same trick. Unfortunately,
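Why the whole squad falls together can be shown with two toy models. An illustrative sketch with made-up weights: the adversarial input is crafted against model A alone, yet it fools the similarly-trained model B too.

```python
import numpy as np

def score(wt, x):
    """Linear 'confidence' for class 1; positive means class 1."""
    return float(wt @ x)

# Two models trained on similar data: weights differ only slightly.
w_a = np.array([2.0, -3.0, 1.5, 0.8])
w_b = w_a + np.array([0.2, 0.1, -0.2, 0.1])  # "trained slightly differently"

x = np.array([1.0, -1.0, 1.0, 1.0])  # both models score this positive
x_adv = x - 1.5 * np.sign(w_a)       # crafted against model A only

print(score(w_a, x) > 0, score(w_b, x) > 0)          # True True
print(score(w_a, x_adv) < 0, score(w_b, x_adv) < 0)  # True True: shared blind spot
```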

00:17:59.069 --> 00:18:03.059
yes. Wow. You know, that realization that traditional

00:18:03.059 --> 00:18:06.200
security defenses completely fail against adversarial

00:18:06.200 --> 00:18:08.859
machine learning really forces us to reconsider

00:18:08.859 --> 00:18:11.980
the entire battlefield between human and artificial

00:18:11.980 --> 00:18:16.160
intelligence. Let's do a quick recap. AI is incredibly

00:18:16.160 --> 00:18:18.619
powerful. It can read malware as a language.

00:18:18.720 --> 00:18:21.440
It can drive cars. It can manage finances. But

00:18:21.440 --> 00:18:23.960
it is fundamentally fragile. It relies on these

00:18:23.960 --> 00:18:26.099
strict mathematical boundaries that can be poisoned

00:18:26.099 --> 00:18:28.759
from the inside, evaded with a literal piece

00:18:28.759 --> 00:18:31.480
of tape, or extracted by probing in the dark.

00:18:31.640 --> 00:18:34.380
Yeah. And as AI integrates deeper into our cars,

00:18:34.539 --> 00:18:36.640
our banks, our hospitals, knowing where these

00:18:36.640 --> 00:18:38.940
blind spots are isn't just an academic exercise

00:18:38.940 --> 00:18:41.299
anymore. You need to know that the machine's

00:18:41.299 --> 00:18:43.700
perception of reality is entirely different from

00:18:43.700 --> 00:18:46.240
yours. Right. And we want

00:18:46.240 --> 00:18:49.039
to leave you, the listener, with a final completely

00:18:49.039 --> 00:18:50.819
different angle from the source material to think

00:18:50.819 --> 00:18:52.819
about. We've spent this entire deep dive talking

00:18:52.819 --> 00:18:55.460
about complex mathematical attacks, random pixel

00:18:55.460 --> 00:18:58.420
perturbations, gradient calculations, altering

00:18:58.420 --> 00:19:01.000
latent vector spaces. But Google Brain researcher

00:19:01.000 --> 00:19:03.619
Nick Frosst offers a pretty sobering reality check.

00:19:03.700 --> 00:19:06.079
Oh yeah, this part was great. The adversarial

00:19:06.079 --> 00:19:08.799
machine learning community spends massive amounts

00:19:08.799 --> 00:19:12.680
of time and computing power inventing hyper-complex

00:19:12.680 --> 00:19:16.500
math noise to trick an autonomous car into missing

00:19:16.500 --> 00:19:19.599
a stop sign. Right. Finding the exact microscopic

00:19:19.599 --> 00:19:22.059
placement of black tape to ruin the pixel math

00:19:22.059 --> 00:19:24.380
for the car's camera. But Frost points out that

00:19:24.380 --> 00:19:27.259
it is vastly easier for a bad actor in the real

00:19:27.259 --> 00:19:29.920
world to just physically remove the stop sign.

00:19:30.000 --> 00:19:32.880
Just take a wrench, unbolt the sign, and throw

00:19:32.880 --> 00:19:35.339
it in the bushes. Exactly. The AI will crash

00:19:35.339 --> 00:19:37.799
the car just the same. It really makes you wonder.

00:19:38.079 --> 00:19:41.119
In our absolute obsession with securing the complex

00:19:41.119 --> 00:19:44.220
virtual code and patching digital optical illusions,

00:19:44.839 --> 00:19:46.859
we might be completely ignoring the profound

00:19:46.859 --> 00:19:49.200
vulnerabilities of the physical world the AI

00:19:49.200 --> 00:19:51.579
actually operates in. It's definitely something

00:19:51.579 --> 00:19:53.220
to think about the next time you get into a smart

00:19:53.220 --> 00:19:56.279
car or rely on an automated system. Thanks for

00:19:56.279 --> 00:19:58.400
joining us on this deep dive. Stay curious, stay

00:19:58.400 --> 00:19:59.799
sharp, and we'll catch you next time.
