WEBVTT

00:00:00.000 --> 00:00:02.080
If you've ever wondered how your inbox magically

00:00:02.080 --> 00:00:04.480
knows what's junk and what's actually important,

00:00:04.759 --> 00:00:07.139
without you having to do anything, you are going

00:00:07.139 --> 00:00:09.060
to really love today's deep dive. Oh, absolutely.

00:00:09.160 --> 00:00:12.240
It's such a wild story. It really is. We are

00:00:12.240 --> 00:00:15.060
looking into a totally fascinating paradox in

00:00:15.060 --> 00:00:17.000
the world of statistics and machine learning

00:00:17.000 --> 00:00:20.600
today. We've got a great stack of sources, primarily

00:00:20.600 --> 00:00:24.359
a really detailed piece on the naive Bayes classifier.

00:00:24.600 --> 00:00:26.620
Yeah, which is just a classic algorithm in this

00:00:26.620 --> 00:00:28.449
space. Right. And our mission for this session

00:00:28.449 --> 00:00:32.390
is to figure out how something so fundamentally

00:00:32.390 --> 00:00:35.570
flawed on paper ended up powering some of the

00:00:35.570 --> 00:00:37.869
most important technology of the early internet.

00:00:38.130 --> 00:00:39.829
It really is a paradox. I mean, you are looking

00:00:39.829 --> 00:00:42.630
at a mathematical model that is literally referred

00:00:42.630 --> 00:00:45.710
to in the statistical literature as idiot's Bayes.

00:00:46.030 --> 00:00:48.630
Right, idiot's Bayes, which is just, I mean, that

00:00:48.630 --> 00:00:50.909
is a brutal nickname for an algorithm. It really

00:00:50.909 --> 00:00:53.969
is. But here is the core irony we are exploring

00:00:53.969 --> 00:00:57.289
today. Despite having this foundational assumption

00:00:57.289 --> 00:01:00.969
about the world that is highly unrealistic, this

00:01:00.969 --> 00:01:04.349
family of probabilistic classifiers is massively

00:01:04.349 --> 00:01:07.469
scalable. Oh yeah, massively. It requires surprisingly

00:01:07.469 --> 00:01:11.189
little training data, and it is incredibly fast.

00:01:11.469 --> 00:01:14.390
Exactly. I mean, it routinely outperforms vastly

00:01:14.390 --> 00:01:18.150
more complex algorithms in specific high-stakes

00:01:18.150 --> 00:01:20.769
environments, even when its underlying logic

00:01:20.769 --> 00:01:23.989
seems completely disconnected from reality. OK,

00:01:23.989 --> 00:01:26.370
let's unpack this. To understand why it's called

00:01:26.370 --> 00:01:29.049
Idiot's Bayes, we first need to look at exactly

00:01:29.049 --> 00:01:31.859
what this model assumes about the world. And

00:01:31.859 --> 00:01:34.200
that brings us to the math of independence. So

00:01:34.200 --> 00:01:36.299
to start, the classifier is built on Bayes' theorem,

00:01:36.560 --> 00:01:39.400
which is a fundamental principle of probability.

00:01:39.700 --> 00:01:42.959
Right. At its core, it helps us find the probability

00:01:42.959 --> 00:01:46.640
of a label given some observed features. It updates

00:01:46.640 --> 00:01:48.980
its beliefs based on the evidence it sees. OK,

00:01:49.079 --> 00:01:52.799
makes sense. But naive Bayes adds a massive, deliberately

00:01:52.799 --> 00:01:55.750
naive caveat to that theorem. It assumes that

00:01:55.750 --> 00:01:57.950
all features or predictors are conditionally

00:01:57.950 --> 00:02:00.670
independent, given the target class. Meaning

00:02:00.670 --> 00:02:02.569
what? Exactly. Like, let's bring that down to

00:02:02.569 --> 00:02:04.189
earth for a second. Yeah, meaning it assumes

00:02:04.189 --> 00:02:06.969
the information provided by one variable is completely

00:02:06.969 --> 00:02:09.009
unrelated to the information provided by any

00:02:09.009 --> 00:02:11.449
other variable. So no connection at all. Right.
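
NOTE
In standard notation (not quoted from the source), Bayes' theorem and the naive step described here are:
P(C \mid x_1, \dots, x_n) \propto P(C)\, P(x_1, \dots, x_n \mid C),
and the naive assumption replaces the joint likelihood with a product of per-feature terms:
P(x_1, \dots, x_n \mid C) = \prod_{i=1}^{n} P(x_i \mid C),
so each feature contributes on its own, given the class, with no shared context between them.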

00:02:11.629 --> 00:02:14.550
It assumes there is absolutely zero shared context

00:02:14.550 --> 00:02:17.210
between them. Let's use the classic fruit example

00:02:17.210 --> 00:02:19.750
from the text. Oh, I like this one. Imagine a

00:02:19.750 --> 00:02:21.870
computer model is trying to decide if a piece

00:02:21.870 --> 00:02:25.259
of fruit sitting on a table is an apple. It looks

00:02:25.259 --> 00:02:28.199
at three specific features. The fruit is red,

00:02:28.520 --> 00:02:31.740
it is round, and it is about 10 centimeters in

00:02:31.740 --> 00:02:34.199
diameter. Makes total sense so far. I mean, that

00:02:34.199 --> 00:02:36.340
sounds like an apple to me. Right, but a naive

00:02:36.340 --> 00:02:39.259
Bayes classifier considers each of those features

00:02:39.259 --> 00:02:41.900
to contribute completely independently to the

00:02:41.900 --> 00:02:44.340
probability that the fruit is an apple. Independent

00:02:44.340 --> 00:02:47.180
of each other. Exactly. It completely ignores

00:02:47.180 --> 00:02:50.219
any obvious real-world correlations between

00:02:50.219 --> 00:02:53.639
color, roundness, and size. It treats red as

00:02:53.639 --> 00:02:56.080
existing in a vacuum, round as existing in a

00:02:56.080 --> 00:02:59.219
vacuum, and 10 centimeters as existing in a vacuum.

00:02:59.580 --> 00:03:02.340
Well, it's like judging a band by listening to

00:03:02.340 --> 00:03:05.120
the drummer, the guitarist, and the singer in

00:03:05.120 --> 00:03:07.759
completely separate soundproof rooms, and then

00:03:07.759 --> 00:03:09.919
guessing if the song is good without ever hearing

00:03:09.919 --> 00:03:12.099
them play together. That is a perfect way to

00:03:12.099 --> 00:03:14.699
visualize it. I mean, in reality, the drummer

00:03:14.699 --> 00:03:17.139
and the bassist are interacting. Right. The redness

00:03:17.139 --> 00:03:19.479
of a fruit and its roundness might be biologically

00:03:19.479 --> 00:03:22.439
linked. Certain types of trees produce large

00:03:22.439 --> 00:03:25.180
red fruits while others produce, I don't know,

00:03:25.599 --> 00:03:27.379
small yellow ones. Yeah, nature has patterns.

00:03:27.680 --> 00:03:30.280
But Naive Bayes covers its ears and says, nope,

00:03:30.319 --> 00:03:32.159
they have absolutely nothing to do with each

00:03:32.159 --> 00:03:34.599
other. But I have to push back here. How could

00:03:34.599 --> 00:03:37.439
a model that actively ignores how things relate

00:03:37.439 --> 00:03:40.939
to each other in the real world possibly be useful?

00:03:41.259 --> 00:03:43.500
I mean, everything in the real world is interconnected.

00:03:43.719 --> 00:03:45.719
Oh, totally. If I build a model that ignores

00:03:45.719 --> 00:03:49.000
reality it should just fail. Right. What's fascinating

00:03:49.000 --> 00:03:52.000
here is what you actually gain by making that

00:03:52.000 --> 00:03:54.560
ridiculous assumption. Which is? By pretending

00:03:54.560 --> 00:03:57.979
these features don't interact, you turn a massive

00:03:57.979 --> 00:04:00.620
mathematically impossible calculation into a

00:04:00.620 --> 00:04:03.819
very simple closed form expression. Closed form

00:04:03.819 --> 00:04:07.180
meaning it has a neat, tidy mathematical solution.
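
NOTE
A minimal Python sketch of that closed form: count feature values per class, then multiply the per-feature probabilities. The tiny fruit table below is invented for illustration, not from the source.
from collections import Counter, defaultdict
# Invented training data: (color, shape, size in cm) -> fruit label.
train = [
    (("red", "round", 10), "apple"),
    (("red", "round", 9), "apple"),
    (("green", "round", 10), "apple"),
    (("yellow", "long", 18), "banana"),
    (("yellow", "long", 20), "banana"),
]
class_counts = Counter(label for _, label in train)       # how many examples of each fruit
value_counts = defaultdict(Counter)                       # (label, feature index) -> value counts
for features, label in train:
    for i, value in enumerate(features):
        value_counts[(label, i)][value] += 1
def score(features, label):
    """Prior times the product of per-feature conditional probabilities (the naive closed form)."""
    p = class_counts[label] / sum(class_counts.values())
    for i, value in enumerate(features):
        p *= value_counts[(label, i)][value] / class_counts[label]   # features treated as independent
    return p
sample = ("red", "round", 10)
print({label: score(sample, label) for label in class_counts})       # the larger score wins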

00:04:07.419 --> 00:04:10.560
Exactly. Without this naive assumption, a computer

00:04:10.560 --> 00:04:13.020
would have to calculate how every single feature

00:04:13.020 --> 00:04:15.000
interacts with every other feature. Oh, wow.

00:04:15.020 --> 00:04:17.160
That sounds like a nightmare. It is. If you have

00:04:17.160 --> 00:04:19.699
thousands of features, the math becomes a tangled

00:04:19.699 --> 00:04:23.019
web of dependencies that requires expensive iterative

00:04:23.019 --> 00:04:25.319
approximation algorithms and massive amounts

00:04:25.319 --> 00:04:27.379
of computing power. And lots of time, I bet.

00:04:27.620 --> 00:04:31.459
Huge amounts of time. But by assuming independence,

00:04:31.699 --> 00:04:33.579
you basically just count the observations in

00:04:33.579 --> 00:04:36.379
each group and multiply those independent probabilities

00:04:36.379 --> 00:04:39.339
together. It reduces an impossible web into a

00:04:39.339 --> 00:04:41.319
simple checklist. OK, so we've been talking about

00:04:41.319 --> 00:04:44.399
this in theory, combining isolated, discrete traits

00:04:44.399 --> 00:04:47.180
like colors and shapes. Let's see exactly how

00:04:47.180 --> 00:04:49.139
this math plays out when it's forced to look

00:04:49.139 --> 00:04:52.519
at continuous messy human data. Let's do it.

00:04:52.639 --> 00:04:54.699
Let's look at the person classifier example from

00:04:54.699 --> 00:04:56.620
the source text. Right. So this is a scenario

00:04:56.620 --> 00:04:59.420
where the model is trying to predict if a sample

00:04:59.420 --> 00:05:02.860
person is male or female based entirely on three

00:05:02.860 --> 00:05:05.759
continuous measurements: height, weight, and

00:05:05.759 --> 00:05:08.740
foot size. And right away, the flaw in the naive

00:05:08.740 --> 00:05:10.959
independence assumption is just glaring. The

00:05:10.959 --> 00:05:13.459
model assumes height and weight are completely

00:05:13.459 --> 00:05:16.920
unrelated. That is demonstrably false. Taller

00:05:16.920 --> 00:05:20.160
people usually weigh more. Your pushback is completely

00:05:20.160 --> 00:05:23.180
valid. They're absolutely correlated in reality.

00:05:23.339 --> 00:05:26.480
If you are seven feet tall, your weight distribution

00:05:26.480 --> 00:05:28.199
is going to be very different than someone who

00:05:28.199 --> 00:05:31.439
is five feet tall. Obviously. But the model marches

00:05:31.439 --> 00:05:33.300
forward anyway, pretending they aren't linked.

00:05:33.959 --> 00:05:36.620
And because we are dealing with continuous data,

00:05:37.079 --> 00:05:40.220
meaning you could be 5.92 feet tall, not just

00:05:40.220 --> 00:05:43.060
a binary tall or short, you can't just count

00:05:43.060 --> 00:05:45.540
simple categories like you did with red or round.

00:05:45.639 --> 00:05:47.939
So what does it do? It has to use a Gaussian

00:05:47.939 --> 00:05:50.560
or normal distribution to handle the numbers.

00:05:50.889 --> 00:05:54.069
So explain how that bell curve actually works

00:05:54.069 --> 00:05:56.509
in practice for a physical trait. Like, how does

00:05:56.509 --> 00:05:59.430
height turn into a probability score? Think of

00:05:59.430 --> 00:06:02.269
it like this. Imagine a graph plotting the heights

00:06:02.269 --> 00:06:05.009
of all the known men in the training data. It

00:06:05.009 --> 00:06:07.949
forms a bell curve. The peak of that bell is

00:06:07.949 --> 00:06:10.410
the average height for men. Okay, I'm picturing

00:06:10.410 --> 00:06:12.810
it. The further away you get from that average,

00:06:13.170 --> 00:06:15.750
either extremely tall or extremely short, the

00:06:15.750 --> 00:06:18.550
lower the curve drops, representing a lower probability.

00:06:18.750 --> 00:06:21.189
The model does this for every trait. It looks

00:06:21.189 --> 00:06:23.589
at the training data, separates it by class,

00:06:23.790 --> 00:06:26.629
male and female, and calculates that bell curve

00:06:26.629 --> 00:06:29.569
for height, weight, and foot size for each group.
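
NOTE
A minimal Python sketch of fitting those per-class bell curves with the standard library. The training numbers are invented placeholders, not the source's table.
from statistics import NormalDist, mean, stdev
# Invented measurements (height in feet, weight in pounds, foot size in inches).
train = {
    "male":   {"height": [6.0, 5.9, 5.6, 5.9], "weight": [180, 190, 170, 165], "foot": [12, 11, 12, 10]},
    "female": {"height": [5.0, 5.5, 5.4, 5.8], "weight": [100, 150, 130, 150], "foot": [6, 8, 7, 9]},
}
# For each class and each trait, fit a bell curve from the sample mean and standard deviation.
curves = {
    label: {trait: NormalDist(mean(values), stdev(values)) for trait, values in traits.items()}
    for label, traits in train.items()
}
print(curves["male"]["height"].pdf(6.0))   # the height of the male height curve at 6 feet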

00:06:29.769 --> 00:06:31.829
Okay, so it has these established bell curves,

00:06:31.949 --> 00:06:34.509
then we give it a test sample. In the text, they

00:06:34.509 --> 00:06:37.230
feed the model a completely new, unclassified

00:06:37.230 --> 00:06:40.370
person. This person is six feet tall, weighs

00:06:40.370 --> 00:06:43.829
130 pounds, and has an eight-inch foot. And

00:06:43.829 --> 00:06:46.370
the model has to figure out, based on its very

00:06:46.370 --> 00:06:48.990
isolated view of the world, if this person is

00:06:48.990 --> 00:06:52.129
more likely male or female. How does that calculation

00:06:52.129 --> 00:06:54.930
actually look? Well, it looks at the male bell

00:06:54.930 --> 00:06:57.529
curves first. It finds exactly where a height

00:06:57.529 --> 00:07:00.170
of six feet falls on the male height curve and

00:07:00.170 --> 00:07:02.689
grabs that probability. Okay. Then it finds where

00:07:02.689 --> 00:07:05.970
130 pounds falls on the male weight curve. Now,

00:07:05.970 --> 00:07:08.629
because 130 pounds is quite far below the male

00:07:08.629 --> 00:07:12.100
average in the training data, that specific probability

00:07:12.100 --> 00:07:14.339
density is going to be tiny. Like out on the

00:07:14.339 --> 00:07:16.560
edges. Exactly. It's way down at the flat tail

00:07:16.560 --> 00:07:18.540
of the bell curve. And then because of its naive

00:07:18.540 --> 00:07:21.300
rule, it just multiplies those individual probabilities

00:07:21.300 --> 00:07:25.139
together, right? Yes. It multiplies the probability

00:07:25.139 --> 00:07:28.740
of that height, that weight, and that foot size

00:07:28.740 --> 00:07:31.279
existing within the male distribution. Then it

00:07:31.279 --> 00:07:34.060
does the exact same process for the female distribution.
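
NOTE
A minimal Python sketch of the scoring step just described, using invented per-class curves; the numbers it prints are stand-ins, not the source's 6.1984e-9 and 5.378e-4 figures.
from statistics import NormalDist
# Invented bell curves (mean, standard deviation) for each trait in each class.
curves = {
    "male":   {"height": NormalDist(5.85, 0.19), "weight": NormalDist(176, 11), "foot": NormalDist(11.3, 0.96)},
    "female": {"height": NormalDist(5.42, 0.33), "weight": NormalDist(132, 23), "foot": NormalDist(7.5, 1.3)},
}
priors = {"male": 0.5, "female": 0.5}            # assume the classes are equally likely up front
sample = {"height": 6.0, "weight": 130, "foot": 8}
numerators = {}
for label, trait_curves in curves.items():
    p = priors[label]
    for trait, value in sample.items():
        p *= trait_curves[trait].pdf(value)      # naive step: multiply the independent densities
    numerators[label] = p
print(numerators)                                # the "posterior numerators" being compared
print(max(numerators, key=numerators.get))       # the larger one wins: here, "female"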

00:07:34.250 --> 00:07:37.050
But wait, if you're taking small decimals from

00:07:37.050 --> 00:07:39.389
the tails of these bell curves and multiplying

00:07:39.389 --> 00:07:41.829
them by other small decimals, like a fraction

00:07:41.829 --> 00:07:44.089
of a fraction of a fraction, the final numbers

00:07:44.089 --> 00:07:46.370
must be microscopic. Oh, they are microscopic.

00:07:46.790 --> 00:07:49.529
For the male classification, the posterior numerator,

00:07:49.850 --> 00:07:51.670
which is the final number we use to compare the

00:07:51.670 --> 00:07:57.529
classes, comes out to 6.1984 times 10 to the

00:07:57.529 --> 00:07:59.959
negative 9. Let me just translate that for everyone

00:07:59.959 --> 00:08:02.519
listening. 10 to the negative 9. That means the

00:08:02.519 --> 00:08:07.279
probability score is 0.000... with eight zeros

00:08:07.279 --> 00:08:09.819
before you even hit a nonzero digit. It's tiny. It is

00:08:09.819 --> 00:08:12.579
microscopically tiny. Exactly. But then it calculates

00:08:12.579 --> 00:08:14.699
the posterior numerator for female, and that

00:08:14.699 --> 00:08:18.100
comes out to 5.378 times 10 to the negative

00:08:18.100 --> 00:08:22.220
4. So that's 0.0005. Still a tiny fraction,

00:08:22.639 --> 00:08:24.399
but compared to the male score with its eight

00:08:24.399 --> 00:08:27.639
zeros, the female score is vastly larger.

00:08:27.980 --> 00:08:31.019
Spot on. Because the final number for the female

00:08:31.019 --> 00:08:34.159
class is so much larger, the model confidently

00:08:34.159 --> 00:08:37.250
predicts the sample is female. Okay, this brings

00:08:37.250 --> 00:08:39.429
us to what I think is the most mind-bending

00:08:39.429 --> 00:08:42.370
part of this whole deep dive, the paradox of

00:08:42.370 --> 00:08:45.590
its success. Because I pointed out that the model's

00:08:45.590 --> 00:08:47.870
core assumption is flawed. Height and weight

00:08:47.870 --> 00:08:51.029
are obviously related. A six-foot person weighing

00:08:51.029 --> 00:08:54.409
130 pounds is an unusual combination that the

00:08:54.409 --> 00:08:57.029
model evaluates poorly because it splits those

00:08:57.029 --> 00:08:59.509
traits up. Right. Yet the model made a successful

00:08:59.509 --> 00:09:02.389
prediction anyway. How does it survive its own

00:09:02.389 --> 00:09:04.710
ignorance? It survives because of what it is

00:09:04.710 --> 00:09:07.549
actually being asked to do. You have to understand

00:09:07.549 --> 00:09:10.090
that Naive Bayes is actually terrible at quantifying

00:09:10.090 --> 00:09:12.850
uncertainty. Really? Oh yeah. If you ask it for

00:09:12.850 --> 00:09:15.649
an exact percentage of certainty, it will often

00:09:15.649 --> 00:09:18.419
produce wildly overconfident probabilities. It

00:09:18.419 --> 00:09:22.820
might say it is 99.999% sure of something when

00:09:22.820 --> 00:09:25.259
a truly accurate mathematical model would only

00:09:25.259 --> 00:09:28.039
be like 60% sure. But the actual percentage

00:09:28.039 --> 00:09:30.039
doesn't matter. It doesn't matter because of

00:09:30.039 --> 00:09:33.720
something called the MAP decision rule. MAP stands

00:09:33.720 --> 00:09:37.940
for maximum a posteriori. Maximum a posteriori.

00:09:38.110 --> 00:09:41.629
In many practical applications, the classifier

00:09:41.629 --> 00:09:44.990
does not need to produce a highly accurate, beautifully

00:09:44.990 --> 00:09:48.149
calibrated percentage. It only needs to rank the

00:09:48.149 --> 00:09:50.850
correct class as more probable than the others.
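
NOTE
In standard notation (not quoted from the source), the MAP decision rule only cares about ranking:
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c),
so badly calibrated probabilities are harmless as long as the correct class still receives the largest score.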

00:09:51.230 --> 00:09:53.929
Oh. So it's like a broken stopwatch at a race

00:09:53.929 --> 00:09:57.230
that says the winner ran a two-second mile. The

00:09:57.230 --> 00:10:00.509
time is totally absurdly wrong, but it still

00:10:00.509 --> 00:10:02.690
correctly identifies who crossed the finish line

00:10:02.690 --> 00:10:05.590
first. That is exactly it. As long as the order

00:10:05.590 --> 00:10:07.950
of the probabilities is correct, as long as the

00:10:07.950 --> 00:10:09.649
winning runner is placed ahead of the second

00:10:09.649 --> 00:10:11.990
place runner, the actual numbers don't matter.

00:10:12.590 --> 00:10:14.909
The overall classifier is robust enough to ignore

00:10:14.909 --> 00:10:17.470
the serious deficiencies in its underlying probability

00:10:17.470 --> 00:10:20.669
model. That is wild. It's basically failing successfully,

00:10:20.750 --> 00:10:22.470
but I'm stuck on something we talked about a

00:10:22.470 --> 00:10:24.730
minute ago, the microscopic numbers. Yeah. If

00:10:24.730 --> 00:10:27.149
this model scales up to look at thousands of

00:10:27.149 --> 00:10:29.529
features, and it keeps multiplying these tiny

00:10:29.529 --> 00:10:32.110
microscopic decimals together, wouldn't a computer

00:10:32.110 --> 00:10:34.210
eventually just run out of decimal places and

00:10:34.210 --> 00:10:36.169
round the whole thing down to zero? You've just

00:10:36.169 --> 00:10:38.649
identified a massive technical hurdle. Yes, computers

00:10:38.649 --> 00:10:41.830
absolutely do that. It's called arithmetic underflow.

00:10:42.190 --> 00:10:44.950
Arithmetic underflow. When you multiply thousands

00:10:44.950 --> 00:10:48.000
of tiny probabilities together, a standard computer

00:10:48.000 --> 00:10:50.299
simply runs out of precision to represent all those

00:10:50.299 --> 00:10:53.120
trailing zeros, and it rounds the number to absolute

00:10:53.120 --> 00:10:55.799
zero, which completely shatters the math. So

00:10:55.799 --> 00:10:58.320
how do engineers actually build this without

00:10:58.320 --> 00:11:00.980
breaking the computer? They do the math under

00:11:00.980 --> 00:11:04.159
the hood in log space. Log space? Yeah, by taking

00:11:04.159 --> 00:11:06.779
the logarithm of the probabilities, you fundamentally

00:11:06.779 --> 00:11:09.440
change the arithmetic. In log space, instead

00:11:09.440 --> 00:11:11.759
of multiplying tiny fractions together, you're

00:11:11.759 --> 00:11:13.820
adding negative numbers together. Oh, that is

00:11:13.820 --> 00:11:15.980
a clever engineering trick. So you bypass the

00:11:15.980 --> 00:11:19.279
rounding errors entirely. Yes. And mathematically,

00:11:19.679 --> 00:11:21.860
expressing it in log space reveals something

00:11:21.860 --> 00:11:25.080
really profound. It turns the multinomial naive

00:11:25.080 --> 00:11:28.279
Bayes model into an equivalent of a linear classifier.
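
NOTE
A minimal Python sketch of the underflow problem and the log-space fix; the probabilities are arbitrary stand-ins.
import math
probs = [1e-3] * 200                          # 200 tiny per-word probabilities
naive_product = 1.0
for p in probs:
    naive_product *= p                        # the true value, 1e-600, is below what a float can hold
log_score = sum(math.log(p) for p in probs)   # in log space it is just an ordinary negative number
print(naive_product)                          # 0.0 -> arithmetic underflow
print(log_score)                              # about -1381.6 -> still perfectly usable for ranking
# In log space the multinomial score is log P(c) + sum over words of count(w) * log P(w | c):
# a weight per word times its frequency, plus a bias term, which is exactly a linear classifier.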

00:11:28.399 --> 00:11:29.809
So what is this? What does this all mean for

00:11:29.809 --> 00:11:32.210
the big picture? Well, if we connect this to

00:11:32.210 --> 00:11:35.129
the bigger picture, this entire process, the

00:11:35.129 --> 00:11:37.789
naive decoupling of features, the shift to log

00:11:37.789 --> 00:11:40.809
space, is what solves the dreaded curse of dimensionality

00:11:40.809 --> 00:11:43.450
in machine learning. The curse of dimensionality

00:11:43.450 --> 00:11:46.750
sounds ominous. It is, for data scientists anyway.

00:11:47.289 --> 00:11:49.389
As you add more features for a model to look

00:11:49.389 --> 00:11:52.970
at, the volume of possible combinations increases

00:11:52.970 --> 00:11:56.059
exponentially. Okay. Usually you need exponentially

00:11:56.059 --> 00:11:58.320
more training data just to understand how all

00:11:58.320 --> 00:12:01.039
those new dimensions relate to each other. But

00:12:01.039 --> 00:12:03.539
because Naive Bayes treats every single feature

00:12:03.539 --> 00:12:06.600
independently, it ignores that curse, because

00:12:06.600 --> 00:12:08.559
it's not looking for the connections. Exactly.

00:12:08.600 --> 00:12:10.559
It only needs one parameter for each feature.

00:12:10.600 --> 00:12:13.919
It scales linearly. It is incredibly fast and

00:12:13.919 --> 00:12:17.220
incredibly robust. And that speed and that ability

00:12:17.220 --> 00:12:20.059
to handle tens of thousands of isolated features

00:12:20.059 --> 00:12:22.720
simultaneously without breaking a sweat made

00:12:22.720 --> 00:12:25.779
it the absolute perfect weapon for a very specific,

00:12:25.919 --> 00:12:28.799
very messy war that broke out in the late 1990s.

00:12:28.860 --> 00:12:31.580
The war on spam. Yes. The ultimate battlefield

00:12:31.580 --> 00:12:34.740
for naive Bayes. The text notes that Bayesian

00:12:34.740 --> 00:12:36.879
algorithms were used for filtering as early as

00:12:36.879 --> 00:12:39.340
1996, but the real turning point was around 1998.

00:12:39.899 --> 00:12:41.879
Right. That's when Mehran Sahami and his colleagues

00:12:41.879 --> 00:12:44.399
published the first major scholarly paper on

00:12:44.399 --> 00:12:46.840
using a naive Bayes classifier to fight junk

00:12:46.840 --> 00:12:49.679
email. Which was desperately needed because early

00:12:49.679 --> 00:12:52.240
email inboxes were rapidly becoming completely

00:12:52.240 --> 00:12:55.600
unusable. Just flooded with junk. To fight spam,

00:12:55.840 --> 00:12:57.980
you were essentially doing document classification.

00:12:58.659 --> 00:13:01.480
And the text outlines two main event models used

00:13:01.480 --> 00:13:04.980
for this: multinomial and Bernoulli. But both

00:13:04.980 --> 00:13:07.480
of them generally rely on what's called the bag

00:13:07.480 --> 00:13:10.120
of words assumption. Okay, let's look at the

00:13:10.120 --> 00:13:12.659
bag of words assumption, because to me it implies

00:13:12.659 --> 00:13:14.539
that the filter doesn't actually read the email.

00:13:14.860 --> 00:13:16.840
Does it just dump all the words out of the sentence

00:13:16.840 --> 00:13:19.500
structure like a bucket of magnetic poetry? It

00:13:19.500 --> 00:13:22.200
is entirely a bucket of magnetic poetry. It completely

00:13:22.200 --> 00:13:25.379
strips away all context. The phrase "I am not

00:13:25.379 --> 00:13:27.919
a spammer" becomes the isolated words "I am not

00:13:27.919 --> 00:13:30.399
a spammer," just mixed around in a bucket. Wow.
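
NOTE
A minimal Python sketch of the bag-of-words step; the tokenizer here is just a lowercase split, a deliberate simplification.
from collections import Counter
text = "I am not a spammer. I am definitely not a spammer."
tokens = text.lower().replace(".", "").split()   # drop punctuation, lowercase, split on spaces
bag = Counter(tokens)                            # word order, grammar, and position are all thrown away
print(bag)   # e.g. Counter({'i': 2, 'am': 2, 'not': 2, 'a': 2, 'spammer': 2, 'definitely': 1})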

00:13:30.620 --> 00:13:33.460
It completely ignores grammar, sentence structure,

00:13:33.720 --> 00:13:35.360
the length of the document, or where the words

00:13:35.360 --> 00:13:37.460
are positioned. It just looks at the tokens,

00:13:37.799 --> 00:13:40.299
the individual words, and counts how many times

00:13:40.299 --> 00:13:42.240
they appear. And how does it know which magnetic

00:13:42.240 --> 00:13:44.679
poetry words are good and which are bad? By calculating

00:13:44.679 --> 00:13:47.840
a metric called spamicity. Spamicity, I love

00:13:47.840 --> 00:13:50.360
that word. During its training phase, it correlates

00:13:50.360 --> 00:13:53.539
the use of specific words with spam emails and, on

00:13:53.539 --> 00:13:56.639
the other hand, with legitimate emails, which engineers

00:13:56.639 --> 00:13:59.870
affectionately call ham. Ham and spam. So if

00:13:59.870 --> 00:14:02.850
it sees the word "discount" or "winner," it thinks,

00:14:03.129 --> 00:14:06.350
uh-oh. Exactly. But it's highly efficient about

00:14:06.350 --> 00:14:09.149
what it ignores. Neutral words, which we call

00:14:09.149 --> 00:14:12.350
stop words, like "the," "a," "some," or "is," are generally

00:14:12.350 --> 00:14:14.389
ignored because their spamicity is right around

00:14:14.389 --> 00:14:17.149
0.5. Meaning they're a coin toss. Right. They

00:14:17.149 --> 00:14:19.110
appear equally in good emails and bad emails.
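
NOTE
A minimal Python sketch of one common way to compute a per-word spamicity from training counts. The counts below are invented, and the exact formula in the source may differ.
# How many spam / ham emails each word appeared in (invented numbers).
spam_emails, ham_emails = 1000, 1000
word_in_spam = {"discount": 400, "winner": 350, "the": 930, "meeting": 15}
word_in_ham  = {"discount": 20,  "winner": 10,  "the": 940, "meeting": 400}
def spamicity(word):
    """P(spam | word) under equal priors: near 1.0 is spammy, near 0.0 is hammy, near 0.5 is useless."""
    p_word_spam = word_in_spam.get(word, 0) / spam_emails
    p_word_ham = word_in_ham.get(word, 0) / ham_emails
    return p_word_spam / (p_word_spam + p_word_ham)
for w in ("discount", "winner", "the", "meeting"):
    print(w, round(spamicity(w), 3))   # "the" lands near 0.5 and gets ignored; the rest are informative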

00:14:19.190 --> 00:14:21.710
They don't help the model make a decision. The

00:14:21.710 --> 00:14:24.470
filter instead focuses exclusively on words with

00:14:24.470 --> 00:14:28.179
a spamicity near 1.0, which are highly distinctive

00:14:28.179 --> 00:14:32.000
of spam, or near 0.0, which are highly distinctive

00:14:32.000 --> 00:14:34.759
of legitimate mail. It's just scanning that bucket

00:14:34.759 --> 00:14:37.500
of scattered words for the most extreme red flags

00:14:37.500 --> 00:14:39.919
and green flags. Exactly. But an email inbox

00:14:39.919 --> 00:14:42.379
isn't a static data set like a list of heights

00:14:42.379 --> 00:14:44.299
and weights. It's an adversarial environment.

00:14:44.700 --> 00:14:46.700
There are motivated humans on the other side.

00:14:47.139 --> 00:14:49.600
As these Bayesian filters got smarter, the spammers

00:14:49.600 --> 00:14:53.070
fought back. And they initially did so by exploiting

00:14:53.070 --> 00:14:55.549
mathematical vulnerabilities in the model itself.

00:14:56.210 --> 00:14:58.769
The most basic vulnerability is the zero probability

00:14:58.769 --> 00:15:01.509
problem. Right. What happens if a word shows

00:15:01.509 --> 00:15:03.889
up in an email that the filter has never seen

00:15:03.889 --> 00:15:06.309
during its training phase? Well, if a word has

00:15:06.309 --> 00:15:08.950
never been seen, its frequency-based probability

00:15:08.950 --> 00:15:12.480
is exactly zero. And remember, the core math

00:15:12.480 --> 00:15:15.799
of Naive Bayes involves multiplying all the probabilities

00:15:15.799 --> 00:15:19.240
together. If you multiply anything by zero, the

00:15:19.240 --> 00:15:21.580
entire equation collapses to zero. It doesn't

00:15:21.580 --> 00:15:24.059
matter if the rest of the email is pure, obvious,

00:15:24.399 --> 00:15:27.259
undeniable spam. If there is one unseen word,

00:15:27.600 --> 00:15:29.919
you multiply by zero, and the whole spam score

00:15:29.919 --> 00:15:32.559
is wiped out. So early on, spammers would just

00:15:32.559 --> 00:15:34.860
invent a random new word, throw it in the subject

00:15:34.860 --> 00:15:36.919
line, and instantly break the filter. Exactly.

00:15:37.320 --> 00:15:39.279
To fix this, programmers introduced something

00:15:39.279 --> 00:15:41.720
called Laplace smoothing. Laplace smoothing?

00:15:42.100 --> 00:15:44.620
You basically add a pseudo count, usually just

00:15:44.620 --> 00:15:47.720
a value of one, to every single word's

00:15:47.720 --> 00:15:49.529
frequency estimate. Even the ones you haven't

00:15:49.529 --> 00:15:52.370
seen, it's like giving every potential word a

00:15:52.370 --> 00:15:54.970
tiny participation trophy so that no probability

00:15:54.970 --> 00:15:57.500
is ever exactly zero. OK, here's where it gets

00:15:57.500 --> 00:16:00.059
really interesting. So the spammers realized

00:16:00.059 --> 00:16:02.179
they couldn't just break the math with new words

00:16:02.179 --> 00:16:04.860
anymore because Laplace smoothing gave them

00:16:04.860 --> 00:16:07.559
a tiny penalty instead of an automatic bypass.
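
NOTE
A minimal Python sketch of add-one (Laplace) smoothing as described, so an unseen word gets a small nonzero probability instead of zeroing out the whole product. Counts and vocabulary size are invented.
word_counts_in_spam = {"winner": 350, "discount": 400}   # word occurrences seen in spam during training
total_spam_words = 5000                                   # total word occurrences in the spam class
vocab_size = 2000                                         # distinct words the filter knows about
def p_word_given_spam(word):
    """Add-one smoothing: every word, even an unseen one, gets a pseudo-count of 1."""
    return (word_counts_in_spam.get(word, 0) + 1) / (total_spam_words + vocab_size)
print(p_word_given_spam("winner"))            # a normal, data-driven probability
print(p_word_given_spam("zqxvnewword"))       # tiny but nonzero, so the product no longer collapses to 0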

00:16:07.620 --> 00:16:10.720
Right. They had to change tactics entirely. Instead

00:16:10.720 --> 00:16:13.279
of focusing on the bad words, they started weaponizing

00:16:13.279 --> 00:16:17.320
the good words. This brings us to Bayesian poisoning.

00:16:17.580 --> 00:16:20.340
Yes. This is a classic example of an adversarial

00:16:20.340 --> 00:16:22.419
machine learning attack. The spammers realized

00:16:22.419 --> 00:16:25.259
the filter was scoring the whole bag of words.

00:16:25.610 --> 00:16:28.330
So they started stuffing their spam emails with

00:16:28.330 --> 00:16:31.149
massive amounts of completely legitimate text.

00:16:31.330 --> 00:16:34.289
Just padding it out. Yeah. They would copy paste

00:16:34.289 --> 00:16:36.990
entire news articles or paragraphs from classic

00:16:36.990 --> 00:16:38.769
literature right at the bottom of the email,

00:16:39.190 --> 00:16:41.549
usually formatting it in invisible white text

00:16:41.549 --> 00:16:43.730
so the user wouldn't see it, but the computer

00:16:43.730 --> 00:16:45.850
filter would read it. It's a calculated attempt

00:16:45.850 --> 00:16:48.370
to dilute the overall spam score. It's like a

00:16:48.370 --> 00:16:50.710
smuggler trying to hide a tiny bit of contraband

00:16:50.710 --> 00:16:53.710
inside a massive, boring shipment of office supplies,

00:16:54.029 --> 00:16:56.350
hoping the sheer volume of normal stuff throws

00:16:56.350 --> 00:16:59.190
off the border dogs. That's precisely the goal.

00:16:59.809 --> 00:17:01.909
They want the overwhelming number of good words

00:17:01.909 --> 00:17:04.630
to mathematically outweigh the few bad words.

00:17:04.630 --> 00:17:07.609
Did it work? For a bit. But the defenders adapted

00:17:07.609 --> 00:17:10.549
again. The text highlights a defense strategy

00:17:10.549 --> 00:17:13.440
known as Paul Graham's Scheme. Okay. Instead

00:17:13.440 --> 00:17:15.680
of averaging out every single word in the entire

00:17:15.680 --> 00:17:18.559
email, the filter was adjusted to only look at

00:17:18.559 --> 00:17:21.339
the most significant probabilities. It finds

00:17:21.339 --> 00:17:24.279
the 10 or 15 most extreme words in the document,

00:17:24.680 --> 00:17:27.920
the ones closest to a spamicity of 1.0 or 0

00:17:27.920 --> 00:17:31.539
.0, and only uses those to make the final calculation.
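
NOTE
A minimal Python sketch of the "most extreme tokens only" idea: rank words by how far their spamicity sits from the neutral 0.5 and score only the top 15. The combining formula at the end is the one commonly credited to Paul Graham's "A Plan for Spam" essay, included here as an assumption rather than a quote from the source.
from math import prod
# Invented per-word spamicities for one incoming email (the padding words sit near 0.5).
spamicity = {"viagra": 0.99, "winner": 0.97, "unsubscribe": 0.85, "meeting": 0.04,
             "the": 0.50, "report": 0.48, "weather": 0.51, "article": 0.49}
N = 15
# Keep only the words whose spamicity is furthest from the useless 0.5 midpoint.
extreme = sorted(spamicity, key=lambda w: abs(spamicity[w] - 0.5), reverse=True)[:N]
ps = [spamicity[w] for w in extreme]
combined = prod(ps) / (prod(ps) + prod(1 - p for p in ps))
print(extreme)
print(round(combined, 4))   # close to 1.0: judged spam even if the email is padded with neutral words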

00:17:32.299 --> 00:17:35.119
Ah, so the border dogs are trained to completely

00:17:35.119 --> 00:17:37.500
ignore the millions of normal paper clips, and

00:17:37.500 --> 00:17:39.759
they zero in exclusively on the strange package

00:17:39.759 --> 00:17:42.119
hidden in the middle. The padding strategy just

00:17:42.119 --> 00:17:44.640
stops working. Right, so the spammers tried word

00:17:44.640 --> 00:17:46.700
transformation next. They realized the filter

00:17:46.700 --> 00:17:49.160
was looking closely at specific words like Viagra,

00:17:49.460 --> 00:17:52.220
so they started spelling it Viagra with two A's,

00:17:52.359 --> 00:17:56.720
or V!agra. Oh, I remember that era. The whole

00:17:56.720 --> 00:17:59.430
inbox looked like a bad password generator. Yeah,

00:17:59.509 --> 00:18:01.609
it was a mess. But the text notes that this generally

00:18:01.609 --> 00:18:04.690
fails as a long-term strategy. Because a naive

00:18:04.690 --> 00:18:07.490
Bayes filter is continuously learning from the

00:18:07.490 --> 00:18:10.089
user's inbox and observing what gets marked as

00:18:10.089 --> 00:18:12.230
junk, it eventually just learns the new spelling.

00:18:12.549 --> 00:18:16.849
Oh, nice. Yeah, V!agra quickly gets assigned

00:18:16.849 --> 00:18:20.329
its own spamicity of 1.0, just like the original

00:18:20.329 --> 00:18:23.170
word. So they escalated yet again. Image spam.

00:18:23.349 --> 00:18:26.809
If the filter is so incredibly good at reading

00:18:26.809 --> 00:18:29.970
text, the spammers just stop sending text entirely.

00:18:30.410 --> 00:18:32.849
They would put all the spam words inside a linked

00:18:32.849 --> 00:18:35.690
JPEG or GIF image. Which is very clever because

00:18:35.690 --> 00:18:37.690
traditional Bayesian filters can't read a picture

00:18:37.690 --> 00:18:39.910
of a word. Right. But the platforms fought back

00:18:39.910 --> 00:18:42.589
again. Google's Gmail system implemented OCR,

00:18:42.589 --> 00:18:45.210
optical character recognition. Whenever an email

00:18:45.210 --> 00:18:47.930
arrived with a mid to large size image, Google

00:18:47.930 --> 00:18:50.549
servers would scan the image, extract the hidden

00:18:50.549 --> 00:18:53.109
text out of the pixels, and then feed that text

00:18:53.109 --> 00:18:55.589
back into the naive base filter. This raises

00:18:55.589 --> 00:18:57.750
an important question about the nature of machine

00:18:57.750 --> 00:19:01.769
learning itself. It is an endless evolving cat

00:19:01.769 --> 00:19:04.710
and mouse game. You build a naive model, it works

00:19:04.710 --> 00:19:07.609
surprisingly well, adversaries exploit its naivete,

00:19:08.130 --> 00:19:10.509
and you have to continually patch it with clever

00:19:10.509 --> 00:19:14.410
engineering like OCR or Laplace smoothing. It

00:19:14.410 --> 00:19:16.769
really is an incredible journey. I mean, we started

00:19:16.769 --> 00:19:19.410
with an algorithm named Idiot's Bayes because

00:19:19.410 --> 00:19:21.529
it makes an assumption that is mathematically

00:19:21.529 --> 00:19:24.269
laughable, pretending the world has no interconnected

00:19:24.269 --> 00:19:26.869
variables. Yeah. Yet through the magic of the

00:19:26.869 --> 00:19:29.930
MAP decision rule and log space computing, it

00:19:29.930 --> 00:19:32.910
turns into a highly scalable lightning fast tool.

00:19:33.170 --> 00:19:35.630
And it is really worth noting its lasting legacy

00:19:35.630 --> 00:19:39.079
in the field. The text mentions that Naive Bayes

00:19:39.079 --> 00:19:41.599
classifiers actually form a generative-discriminative

00:19:41.599 --> 00:19:44.339
pair with logistic regression. Okay, let's break

00:19:44.339 --> 00:19:46.519
that jargon down. A generative-discriminative

00:19:46.519 --> 00:19:48.500
pair. What does that mean for the listener? Well,

00:19:48.740 --> 00:19:50.480
it's two different ways of solving the same problem.

00:19:50.859 --> 00:19:53.460
A generative model, like Naive Bayes, tries to

00:19:53.460 --> 00:19:56.059
build a mathematical picture of what a spam email

00:19:56.059 --> 00:19:57.960
actually looks like. It generates a profile.

00:19:58.160 --> 00:20:00.740
A discriminative model like logistic regression

00:20:00.740 --> 00:20:03.460
doesn't care what spam looks like. It just tries

00:20:03.460 --> 00:20:06.559
to draw a hard mathematical line between the

00:20:06.559 --> 00:20:08.940
spam pile and the good pile. So they approach

00:20:08.940 --> 00:20:10.680
the data differently, but how do they compare

00:20:10.680 --> 00:20:13.279
in performance? This is where it gets surprising.

00:20:13.839 --> 00:20:17.240
Research by Andrew Ng and Michael Jordan showed

00:20:17.240 --> 00:20:20.059
that while more complex models like logistic

00:20:20.059 --> 00:20:23.180
regression might have a lower asymptotic error,

00:20:23.599 --> 00:20:26.359
meaning their absolute peak potential for accuracy

00:20:26.359 --> 00:20:31.980
is technically higher, naive Bayes can actually

00:20:31.980 --> 00:20:34.880
reach its own asymptotic error much, much faster

00:20:34.880 --> 00:20:38.059
in many practical real -world cases. Meaning,

00:20:38.279 --> 00:20:40.740
Naive Bayes hits its absolute peak performance

00:20:40.740 --> 00:20:43.099
requiring a fraction of the training data, even

00:20:43.099 --> 00:20:45.319
if that peak isn't, like, technically perfect.

00:20:45.539 --> 00:20:48.400
Exactly. Sometimes being blazing fast and good

00:20:48.400 --> 00:20:50.559
enough is far better than being theoretically

00:20:50.559 --> 00:20:53.420
perfect, but computationally exhausting. So the

00:20:53.420 --> 00:20:55.640
next time your inbox flawlessly filters out a

00:20:55.640 --> 00:20:58.279
shady email from a wealthy prince or effortlessly

00:20:58.279 --> 00:21:00.180
spots a mutated spelling of a pharmaceutical

00:21:00.180 --> 00:21:02.200
drug, you really have the idiot's algorithm to

00:21:02.200 --> 00:21:04.279
thank for it. It might be mathematically naive,

00:21:04.319 --> 00:21:07.170
but it is deeply street smart. Before we wrap

00:21:07.170 --> 00:21:10.190
up, there is one final provocative detail buried

00:21:10.190 --> 00:21:12.630
in the text that I want to leave you with. It

00:21:12.630 --> 00:21:15.549
mentions a process called semi-supervised parameter

00:21:15.549 --> 00:21:19.369
estimation. Ah, yes. Using the expectation maximization

00:21:19.369 --> 00:21:22.509
algorithm, or EM algorithm. Right. The text explains

00:21:22.509 --> 00:21:24.849
that you can run a naive Bayes classifier in

00:21:24.849 --> 00:21:28.039
a loop. You start with a few labeled spam emails,

00:21:28.400 --> 00:21:30.940
but then you feed it a mountain of completely

00:21:30.940 --> 00:21:33.519
unlabeled data. Just raw data. Yeah. Through

00:21:33.519 --> 00:21:35.839
this EM algorithm, the model guesses the labels,

00:21:36.000 --> 00:21:38.319
checks the statistical results, updates its parameters,

00:21:38.740 --> 00:21:41.559
and loops over and over until it converges. Which

00:21:41.559 --> 00:21:43.700
makes me wonder, if the model can learn from

00:21:43.700 --> 00:21:46.140
unlabeled data just by observing mathematical

00:21:46.140 --> 00:21:48.940
clusters in the dark, could future spam filters

00:21:48.940 --> 00:21:51.940
eventually learn to identify entirely new unseen

00:21:51.940 --> 00:21:54.849
categories of junk mail on their own? That's

00:21:54.849 --> 00:21:56.890
a fascinating thought. Could it know what spam

00:21:56.890 --> 00:21:59.150
is without a human ever needing to click the

00:21:59.150 --> 00:22:01.490
report spam button? If the underlying mixture

00:22:01.490 --> 00:22:03.890
model holds true, meaning if the mathematical

00:22:03.890 --> 00:22:06.109
structures of those hidden categories exist in

00:22:06.109 --> 00:22:08.609
the data, it theoretically could find those latent

00:22:08.609 --> 00:22:11.349
classes all by itself purely through observation.
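
NOTE
A minimal, heavily simplified Python sketch of the semi-supervised EM loop described above, for a two-class multinomial model over a four-word vocabulary. Every number is invented, and a real implementation would keep the labeled documents' labels fixed.
import math
vocab = ["winner", "discount", "meeting", "report"]
labeled = [([5, 4, 0, 0], "spam"), ([0, 0, 3, 4], "ham")]               # a few hand-labeled count vectors
unlabeled = [[4, 3, 0, 1], [0, 1, 4, 3], [3, 3, 0, 0], [1, 0, 2, 4]]    # raw, unlabeled mail
classes = ["spam", "ham"]
docs = [counts for counts, _ in labeled] + unlabeled
# Soft labels: certain for the labeled documents, a 50/50 guess for everything else.
resp = {c: [1.0 if lab == c else 0.0 for _, lab in labeled] + [0.5] * len(unlabeled) for c in classes}
for _ in range(20):                                         # loop until convergence (fixed passes here)
    # M-step: re-estimate priors and smoothed word probabilities from the current soft labels.
    prior = {c: sum(resp[c]) / len(docs) for c in classes}
    word_p = {}
    for c in classes:
        totals = [1 + sum(r * d[j] for r, d in zip(resp[c], docs)) for j in range(len(vocab))]
        word_p[c] = [t / sum(totals) for t in totals]
    # E-step: re-guess, in log space, how likely each document is to belong to each class.
    for i, d in enumerate(docs):
        logs = {c: math.log(prior[c]) + sum(n * math.log(word_p[c][j]) for j, n in enumerate(d))
                for c in classes}
        top = max(logs.values())
        norm = sum(math.exp(v - top) for v in logs.values())
        for c in classes:
            resp[c][i] = math.exp(logs[c] - top) / norm
print({c: round(prior[c], 2) for c in classes})
print([max(classes, key=lambda c: resp[c][i]) for i in range(len(docs))])   # the model's guessed labels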

00:22:11.549 --> 00:22:14.470
A machine navigating the muddy waters of human

00:22:14.470 --> 00:22:17.849
communication entirely on its own just by finding

00:22:17.849 --> 00:22:19.930
shapes in the dark. Thank you for joining us

00:22:19.930 --> 00:22:22.470
on this deep dive. And remember, sometimes the

00:22:22.470 --> 00:22:24.650
best way to solve an impossibly complex problem

00:22:24.650 --> 00:22:27.029
is to simply pretend the complexity doesn't exist.
