WEBVTT

00:00:00.000 --> 00:00:02.279
You know, if I were to suddenly just like...

00:00:02.089 --> 00:00:04.570
jump out from behind a door and yell boo at you

00:00:04.570 --> 00:00:06.549
right now, you'd be surprised. I would absolutely

00:00:06.549 --> 00:00:08.310
jump. Right, your heart rate would spike, you'd

00:00:08.310 --> 00:00:11.189
probably jump back a foot or two, maybe drop

00:00:11.189 --> 00:00:13.849
whatever you're holding. It's this very visceral

00:00:13.849 --> 00:00:17.489
emotion, it's messy, it's physiological, and

00:00:17.489 --> 00:00:20.469
it is just a deeply human experience. Yeah, and

00:00:20.469 --> 00:00:22.429
it's also a completely subjective experience,

00:00:22.469 --> 00:00:24.710
like you can't look at someone who just got jump

00:00:24.710 --> 00:00:28.050
scared and say, ah yes, my sensors indicate you

00:00:28.050 --> 00:00:31.510
were exactly 4.7 units surprised by that. Exactly.

00:00:31.500 --> 00:00:34.979
But what if you actually could? What if surprise

00:00:34.979 --> 00:00:38.240
wasn't just this fleeting feeling, but a cold,

00:00:38.479 --> 00:00:42.000
hard, mathematically rigorous metric? Like a

00:00:42.000 --> 00:00:44.479
specific number you could calculate to the decimal

00:00:44.479 --> 00:00:47.700
point to measure exactly how confused or uncertain

00:00:47.700 --> 00:00:50.280
or just utterly blindsided a system is. It does

00:00:50.280 --> 00:00:52.640
sound a bit like science fiction, right? Attempting

00:00:52.640 --> 00:00:56.240
to quantify confusion. It really does. But today,

00:00:56.359 --> 00:00:58.979
we are looking at the secret invisible metric

00:00:58.979 --> 00:01:01.939
that dictates how every single artificial intelligence

00:01:01.939 --> 00:01:05.099
on your phone actually thinks. We are diving into

00:01:05.099 --> 00:01:08.019
the math of surprise. Yeah, and yet it is the

00:01:08.019 --> 00:01:09.939
foundational mathematics powering the language

00:01:09.939 --> 00:01:12.780
models, the predictive text, and really the AI

00:01:12.780 --> 00:01:14.700
we interact with every single day. We're talking

00:01:14.700 --> 00:01:16.920
about a concept known as perplexity. Welcome

00:01:16.920 --> 00:01:19.500
to today's deep dive. Our mission for you today

00:01:19.500 --> 00:01:23.079
is to demystify exactly how data scientists,

00:01:23.500 --> 00:01:26.659
statisticians, and modern AI mathematically measure

00:01:26.659 --> 00:01:29.719
this concept of uncertainty. So we're going to

00:01:29.719 --> 00:01:31.540
strip away the magic of AI and just look at the

00:01:31.540 --> 00:01:33.739
hard numbers. OK, let's unpack this. So to really

00:01:33.739 --> 00:01:37.280
grasp this, we have to... step away from the

00:01:37.280 --> 00:01:40.299
massive, multi-billion-parameter neural networks

00:01:40.299 --> 00:01:42.219
for a minute. Right. Take a step back from the

00:01:42.219 --> 00:01:44.439
ChatGPTs of the world. Exactly. If we want to

00:01:44.439 --> 00:01:46.319
understand what perplexity actually means in

00:01:46.319 --> 00:01:48.859
information theory, we have to ground the concept

00:01:48.859 --> 00:01:52.519
in everyday physical reality. And surprisingly,

00:01:52.739 --> 00:01:55.260
this isn't some new AI fad born in the 2020s.

00:01:55.260 --> 00:01:57.159
It was actually originally introduced back in

00:01:57.159 --> 00:02:00.680
1977. Wait, 1977? That is essentially the Stone

00:02:00.680 --> 00:02:02.700
Age in tech time. I mean, we're talking about

00:02:02.700 --> 00:02:05.180
the era of the Apple II and, like, the scope.

00:02:05.290 --> 00:02:07.969
It really is, which just highlights how fundamental

00:02:07.969 --> 00:02:10.550
the math is. It was introduced in the context

00:02:10.550 --> 00:02:13.169
of early speech recognition by a team of researchers

00:02:13.169 --> 00:02:17.129
at IBM, Frederick Jelinek, Robert Leroy Mercer,

00:02:17.550 --> 00:02:19.949
Lalit Bahl, and James Baker. OK, so a whole team

00:02:19.949 --> 00:02:22.150
trying to figure this out. Right. They were trying

00:02:22.150 --> 00:02:24.250
to get computers to understand human speech,

00:02:24.250 --> 00:02:27.000
and they just hit a wall. They needed a reliable

00:02:27.000 --> 00:02:29.360
way to measure the raw difficulty of a speech

00:02:29.360 --> 00:02:32.479
recognition task. Specifically, they needed to

00:02:32.479 --> 00:02:35.439
measure uncertainty for a discrete probability

00:02:35.439 --> 00:02:38.379
distribution. A discrete probability distribution?

00:02:38.439 --> 00:02:40.139
Okay, I want to make sure we don't lose anyone

00:02:40.139 --> 00:02:42.259
in the jargon right out of the gate here. We're

00:02:42.259 --> 00:02:43.560
essentially talking about things like flipping

00:02:43.560 --> 00:02:45.780
coins and rolling dice, right? Like things with

00:02:45.780 --> 00:02:48.400
a set countable number of outcomes. That is the

00:02:48.400 --> 00:02:50.539
perfect starting point, yeah. The mathematics

00:02:50.539 --> 00:02:53.120
of perplexity dictate that a fair coin toss has

00:02:53.120 --> 00:02:56.289
a perplexity of exactly two. A fair six-sided

00:02:56.289 --> 00:02:59.310
die roll has a perplexity of exactly six. Okay.
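
The coin and die figures quoted here can be checked directly. A minimal Python sketch (an editor's illustration, not from the episode) computing perplexity as two raised to the Shannon entropy:

```python
import math

def perplexity(probs):
    # Perplexity = 2 ** H(p), where H is Shannon entropy in bits.
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy_bits

coin = [0.5, 0.5]    # fair coin: two equally likely outcomes
die = [1 / 6] * 6    # fair six-sided die

print(perplexity(coin))  # 2.0
print(perplexity(die))   # ~6.0
```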

00:02:59.610 --> 00:03:02.050
So generally speaking, if you have a probability

00:03:02.050 --> 00:03:05.449
distribution with exactly n outcomes, and every

00:03:05.449 --> 00:03:07.710
single outcome has an equal probability, so a

00:03:07.710 --> 00:03:11.259
1-in-n chance, the perplexity is simply n. Got

00:03:11.259 --> 00:03:13.719
it. So if I'm trying to visualize this, if I'm,

00:03:13.719 --> 00:03:16.680
say, k-ways perplexed, I picture it like standing

00:03:16.680 --> 00:03:19.000
at a crossroads. Oh, I like that. Yeah. Like,

00:03:19.000 --> 00:03:21.020
if I'm standing at a fork in the road with three

00:03:21.020 --> 00:03:23.460
perfectly identical paths to choose from and

00:03:23.460 --> 00:03:25.719
no signs telling me where to go, I am three ways

00:03:25.719 --> 00:03:28.280
perplexed. Exactly. If a random variable has

00:03:28.280 --> 00:03:31.120
a uniform distribution over k outcomes, it has

00:03:31.120 --> 00:03:33.960
the exact same level of uncertainty as me rolling

00:03:33.960 --> 00:03:37.259
a fair k-sided die. So if I have 20 equally

00:03:37.259 --> 00:03:40.780
viable choices, my perplexity is 20. It's just

00:03:40.780 --> 00:03:43.639
that feeling of having options, but absolutely

00:03:43.639 --> 00:03:45.759
no clue which one is going to happen. What's

00:03:45.759 --> 00:03:48.000
fascinating here is how this connects to a broader

00:03:48.000 --> 00:03:51.770
concept called information entropy. Perplexity

00:03:51.770 --> 00:03:54.030
isn't just a basic tally of how many paths are

00:03:54.030 --> 00:03:56.750
at your crossroads. It is mathematically defined

00:03:56.750 --> 00:03:59.330
as the exponentiation of information entropy.

00:03:59.530 --> 00:04:02.569
Whoa, okay, big words. Yeah, I know. But basically,

00:04:02.770 --> 00:04:04.789
entropy is a way to translate that feeling of

00:04:04.789 --> 00:04:08.069
uncertainty into a concrete number based on logarithms.

00:04:08.270 --> 00:04:10.389
Okay, entropy and logarithms. Walk us through

00:04:10.389 --> 00:04:12.150
how that actually works under the hood, because,

00:04:12.210 --> 00:04:14.810
I mean, why do we need logarithms to understand

00:04:14.810 --> 00:04:17.170
a simple dice roll? It really comes down to how

00:04:17.170 --> 00:04:20.329
we measure information itself. Claude Shannon,

00:04:20.459 --> 00:04:22.939
who was basically the father of information theory,

00:04:23.600 --> 00:04:25.480
figured out that we can measure the surprise

00:04:25.480 --> 00:04:29.519
of an event in specific units. OK. And depending

00:04:29.519 --> 00:04:31.600
on the base of the logarithm you use for the

00:04:31.600 --> 00:04:34.639
calculation, the units change. If you use base

00:04:34.639 --> 00:04:37.259
two, the entropy of the distribution is measured

00:04:37.259 --> 00:04:42.120
in shannons, or much more commonly, bits. Bits,

00:04:42.240 --> 00:04:44.500
like computer bits. Exactly. And if you use the

00:04:44.500 --> 00:04:47.209
natural logarithm, base e, it's measured in nats.

00:04:47.449 --> 00:04:49.829
OK, so entropy tells me how many bits of information

00:04:49.829 --> 00:04:52.790
I'm missing before the die lands. Precisely.

00:04:52.910 --> 00:04:56.470
But talking about having, say, 2.58 bits of

00:04:56.470 --> 00:04:59.269
uncertainty is incredibly abstract for a human

00:04:59.269 --> 00:05:01.230
brain to process. Yeah, I have no idea what

00:05:01.230 --> 00:05:04.069
2.58 bits feels like. Right, and that is exactly

00:05:04.069 --> 00:05:06.970
where perplexity comes in. Perplexity takes that

00:05:06.970 --> 00:05:09.370
abstract entropy measurement and exponentiates

00:05:09.370 --> 00:05:11.910
it. It basically reverses the logarithm. It turns

00:05:11.910 --> 00:05:14.350
those bits back into a number that feels intuitive

00:05:14.350 --> 00:05:17.509
to us. Oh, I see. It converts 2.58 bits back

00:05:17.509 --> 00:05:20.490
into six, meaning you are as confused as someone

00:05:20.490 --> 00:05:24.160
rolling a six-sided die. The larger the perplexity,

00:05:24.500 --> 00:05:26.480
the harder it is for an observer to guess what's

00:05:26.480 --> 00:05:29.060
going to happen next. That feels incredibly clean

00:05:29.060 --> 00:05:31.300
and logical when everything is fair and uniform.
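
To make the bits-and-nats conversation concrete, a small sketch (an editor's illustration, not from the episode) showing that exponentiating entropy in either base recovers the same perplexity:

```python
import math

# Entropy of a fair six-sided die, in two different units.
probs = [1 / 6] * 6
h_bits = -sum(p * math.log2(p) for p in probs)  # ~2.585 shannons (bits)
h_nats = -sum(p * math.log(p) for p in probs)   # ~1.792 nats

# Exponentiating "undoes" the logarithm: both routes give perplexity ~6.
print(2 ** h_bits)
print(math.e ** h_nats)
```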

00:05:31.860 --> 00:05:35.500
A six-sided die is six, a coin is two. But the

00:05:35.500 --> 00:05:38.060
real world isn't a fair casino, like dice are

00:05:38.060 --> 00:05:41.259
loaded, coins are scuffed. Can this metric actually

00:05:41.259 --> 00:05:43.939
handle non-uniform situations where some outcomes

00:05:43.939 --> 00:05:46.879
are way more likely than others? It absolutely

00:05:46.879 --> 00:05:49.160
handles non-uniform distributions. In fact,

00:05:49.160 --> 00:05:52.040
that's its primary purpose. But this is exactly

00:05:52.040 --> 00:05:54.459
where our human intuition starts to completely

00:05:54.459 --> 00:05:56.920
clash with the mathematics. Here's where it gets

00:05:56.939 --> 00:05:59.040
really interesting because I would naturally

00:05:59.040 --> 00:06:01.180
assume that perplexity is just a mirror image

00:06:01.180 --> 00:06:03.439
of how likely I am to guess correctly. Like,

00:06:03.740 --> 00:06:05.660
if I am playing a game where I have a massive

00:06:05.660 --> 00:06:08.540
probability of winning, my uncertainty, my perplexity

00:06:08.540 --> 00:06:11.920
should be almost zero, right? And that is the

00:06:11.920 --> 00:06:15.800
exact trap a lot of people fall into. Perplexity

00:06:15.800 --> 00:06:18.459
is a measure of the difficulty of a prediction

00:06:18.459 --> 00:06:22.579
problem as a whole. It is absolutely not just

00:06:22.579 --> 00:06:25.319
a straightforward representation of the probability

00:06:25.319 --> 00:06:27.800
of success on a single guess. Yeah, I actually

00:06:27.800 --> 00:06:30.160
stumbled over this totally counterintuitive math

00:06:30.160 --> 00:06:32.180
paradox when I was looking into the source material.

00:06:32.620 --> 00:06:35.339
Let's imagine a scenario where you have two choices.

00:06:35.459 --> 00:06:39.519
One of those choices has a 0.9 probability of

00:06:39.519 --> 00:06:41.480
happening. That's a 90% chance. The other outcome

00:06:41.480 --> 00:06:44.519
is a 10% chance. If I always bet on the 90%

00:06:44.519 --> 00:06:47.180
outcome, my probability of making a correct guess

00:06:47.180 --> 00:06:51.060
is 0.9. I would feel incredibly confident. I

00:06:51.060 --> 00:06:53.079
mean, I'd bet my house on it. You'd assume

00:06:53.079 --> 00:06:55.879
your uncertainty is minimal. But the math tells

00:06:55.879 --> 00:06:58.240
a very different story here. If you actually

00:06:58.240 --> 00:07:00.660
calculate the perplexity for that 90-10 split,

00:07:01.100 --> 00:07:03.199
you don't get a number that cleanly maps back

00:07:03.199 --> 00:07:05.079
to 90%. Wait, let's actually run the numbers

00:07:05.079 --> 00:07:07.279
on that. How does the perplexity formula process

00:07:07.279 --> 00:07:10.439
a 90% probability? Well, it uses those logarithms

00:07:10.439 --> 00:07:12.040
and negative exponents we mentioned earlier.

00:07:12.240 --> 00:07:16.079
For the 90% outcome, you take 0.9 to the power

00:07:16.079 --> 00:07:19.019
of negative 0.9. OK. Then you multiply that

00:07:19.259 --> 00:07:22.980
by the 10% outcome, which is 0.1, to the power

00:07:22.980 --> 00:07:26.199
of negative 0.1. And when you compute all that,

00:07:26.899 --> 00:07:30.560
the perplexity spits out a value of 1.38.

00:07:30.560 --> 00:07:35.699
1.38. OK, so I am 1.38 ways perplexed. But if

00:07:35.699 --> 00:07:38.939
I try to reverse engineer that back into a probability

00:07:38.939 --> 00:07:41.519
percentage like taking the inverse, 1 divided

00:07:41.519 --> 00:07:46.540
by 1.38, I get 0.72. That is 72%. It's absolutely

00:07:46.540 --> 00:07:49.920
not 90%. My actual chance of success is 90%,

00:07:49.920 --> 00:07:52.379
but the perplexity metric is acting like my

00:07:52.379 --> 00:07:56.399
odds are only 72%. Why is there this huge, like,

00:07:56.399 --> 00:07:59.560
18% disconnect between my odds of winning and

00:07:59.560 --> 00:08:01.560
the mathematical measure of my surprise? It goes

00:08:01.560 --> 00:08:03.500
right back to what entropy actually measures.
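
The 90/10 numbers just discussed can be reproduced in a couple of lines. A quick check (editor's sketch):

```python
# Perplexity of the 90/10 distribution: the product of p_i ** (-p_i).
ppl = (0.9 ** -0.9) * (0.1 ** -0.1)
print(ppl)      # ~1.38: "1.38 ways perplexed"
print(1 / ppl)  # ~0.72: not the 0.9 success probability
```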

00:08:03.699 --> 00:08:05.779
We have to look at something called optimal variable

00:08:05.779 --> 00:08:08.139
length coding. That sounds incredibly dense.

00:08:08.279 --> 00:08:10.500
It does, but just think of it like Morse code.

00:08:10.959 --> 00:08:13.060
In Morse code, the most common letters in the

00:08:13.060 --> 00:08:15.519
English language, like E, get the shortest possible

00:08:15.519 --> 00:08:19.360
code. Just a single dot. Rare letters, like Z,

00:08:19.660 --> 00:08:22.519
get these long, complex codes to keep them distinct

00:08:22.519 --> 00:08:25.000
without wasting time on the common stuff. OK,

00:08:25.060 --> 00:08:27.620
so you design a system that uses the least amount

00:08:27.620 --> 00:08:30.439
of energy for the most frequent events. Yes.

00:08:31.000 --> 00:08:34.460
Entropy, and by extension perplexity, measures

00:08:34.460 --> 00:08:37.059
the expected or average number of bits required

00:08:37.059 --> 00:08:39.679
to encode the outcome using that kind of optimal

00:08:39.679 --> 00:08:42.840
code. It provides insight into the information

00:08:42.840 --> 00:08:45.000
gain expected when you finally learn the outcome.

00:08:45.240 --> 00:08:47.220
Information gain. Yeah, think about it. If you

00:08:47.220 --> 00:08:50.000
bet on the 90% outcome and you win, you don't

00:08:50.000 --> 00:08:51.779
really learn very much. You expected to win.

00:08:52.360 --> 00:08:55.100
10% of the time, you are going to lose. And

00:08:55.100 --> 00:08:57.360
when you lose that bet, the surprise is massive.

00:08:57.600 --> 00:08:59.960
You gain a ton of information about the unpredictability

00:08:59.960 --> 00:09:02.820
of the system. Oh, wow. I see. So the perplexity

00:09:02.820 --> 00:09:05.500
is evaluating the entire shape of the uncertainty.

00:09:05.940 --> 00:09:08.220
It's factoring in the catastrophic surprise of

00:09:08.220 --> 00:09:10.740
that 10% chance happening. It's not just handing

00:09:10.740 --> 00:09:13.639
me the raw odds of winning a single bet. It's

00:09:13.639 --> 00:09:16.440
this holistic measure of the inherent chaos in

00:09:16.440 --> 00:09:18.720
the environment I'm betting in. Exactly. And

00:09:18.720 --> 00:09:20.700
that distinction becomes critical when scientists

00:09:20.700 --> 00:09:23.620
use this math in the real world. Because in the

00:09:23.620 --> 00:09:25.259
real world, whether we're predicting the stock

00:09:25.259 --> 00:09:27.320
market or trying to get a machine to translate

00:09:27.320 --> 00:09:30.879
French to English, the true underlying probabilities

00:09:30.879 --> 00:09:34.039
of the universe are completely unknown. We don't

00:09:34.039 --> 00:09:36.200
have a cheat sheet. Right. We don't. So if we

00:09:36.200 --> 00:09:37.799
don't know the actual odds of the environment,

00:09:38.039 --> 00:09:40.360
how do data scientists actually use this metric?

00:09:40.659 --> 00:09:43.179
They use it to evaluate models. Let's say there's

00:09:43.179 --> 00:09:46.000
an unknown true probability distribution out

00:09:46.000 --> 00:09:48.360
there in the world. We'll call this true reality

00:09:48.360 --> 00:09:51.639
P. OK, true reality is P. We don't know P, but

00:09:51.639 --> 00:09:54.039
we really want to predict it. So we gather a

00:09:54.039 --> 00:09:56.299
bunch of data, a training sample drawn from P,

00:09:56.419 --> 00:09:58.700
and we build a mathematical model based on that

00:09:58.700 --> 00:10:02.019
data. We'll call our proposed model Q. So P is

00:10:02.019 --> 00:10:05.799
actual reality, and Q is our AI's best guess

00:10:05.799 --> 00:10:09.919
at how reality works. Exactly that. Now, to figure

00:10:09.919 --> 00:10:13.740
out how good our model Q really is, we obviously

00:10:13.740 --> 00:10:15.740
can't test it on the data it already learned

00:10:15.740 --> 00:10:18.070
from. That would be cheating. Right. It already

00:10:18.070 --> 00:10:20.490
knows the answers. Yeah. So we have to see how

00:10:20.490 --> 00:10:22.809
well it predicts a completely separate test sample,

00:10:23.070 --> 00:10:25.370
like a fresh batch of data that was also drawn

00:10:25.370 --> 00:10:27.870
from the real world, P. You know, this immediately

00:10:27.870 --> 00:10:29.990
reminds me of being a student. Like, think of

00:10:29.990 --> 00:10:32.590
a low -perplexity model like a student who genuinely

00:10:32.590 --> 00:10:34.809
studied the underlying concepts for an exam.

00:10:34.950 --> 00:10:37.330
Oh, that's a great analogy. Right. So the unknown

00:10:37.330 --> 00:10:40.330
distribution, P, is the teacher's actual exam

00:10:40.330 --> 00:10:44.159
paper. The model, Q, is the student's brain,

00:10:44.480 --> 00:10:46.480
filled with the logic they developed from doing

00:10:46.480 --> 00:10:49.480
practice problems. When that student sits down

00:10:49.480 --> 00:10:51.639
to take the final test, the new test sample,

00:10:52.000 --> 00:10:54.179
if their mental model of the subject is good,

00:10:54.480 --> 00:10:56.899
they assign a really high probability to the

00:10:56.899 --> 00:10:58.740
questions that appear. They aren't blindsided.

00:10:59.100 --> 00:11:01.460
Right. They are simply less surprised by the

00:11:01.460 --> 00:11:04.860
test sample. And mathematically, a better model

00:11:04.860 --> 00:11:07.639
assigns higher probabilities to the events that

00:11:07.639 --> 00:11:10.440
actually occur in the test data. Because it predicts

00:11:10.580 --> 00:11:13.200
well, it is fundamentally less surprised by what

00:11:13.200 --> 00:11:15.779
reality throws at it, which results in a lower

00:11:15.779 --> 00:11:18.659
perplexity value. Low perplexity models do a

00:11:18.659 --> 00:11:21.399
better job of compressing the test sample. Compressing

00:11:21.399 --> 00:11:24.519
it like a zip file on a computer. Very similar,

00:11:24.580 --> 00:11:27.139
actually. Because the model already anticipated

00:11:27.139 --> 00:11:30.259
the patterns in the data, it requires fewer bits

00:11:30.259 --> 00:11:33.000
per test element on average to encode the information.

00:11:33.200 --> 00:11:35.059
That makes total sense. And if we connect this

00:11:35.059 --> 00:11:39.090
to the bigger picture, the exponent in these

00:11:39.090 --> 00:11:42.070
complex mathematical formulas actually represents

00:11:42.070 --> 00:11:44.350
something called cross entropy. Cross entropy.

00:11:44.529 --> 00:11:47.149
Cross entropy looks at the empirical distribution

00:11:47.149 --> 00:11:49.029
of the test sample, what actually happened in

00:11:49.029 --> 00:11:51.570
reality, and compares it to what our model Q

00:11:51.570 --> 00:11:53.549
predicted would happen. So it's like a direct

00:11:53.549 --> 00:11:55.610
comparison between the student's expectations

00:11:55.610 --> 00:11:58.190
and the teacher's grading key. Yes, exactly.

00:11:58.549 --> 00:12:00.690
And what that actually means on paper is defined

00:12:00.690 --> 00:12:03.730
by a concept called Kullback-Leibler divergence,

00:12:03.929 --> 00:12:07.149
or KL divergence. OK, KL divergence. KL divergence

00:12:07.149 --> 00:12:09.509
is essentially a penalty. It measures the distance

00:12:09.509 --> 00:12:12.669
between two probability distributions. It calculates

00:12:12.669 --> 00:12:15.509
the extra wasted bits of information you're forced

00:12:15.509 --> 00:12:18.669
to use simply because your model Q is slightly

00:12:18.669 --> 00:12:21.110
wrong about reality P. OK, so if the student's

00:12:21.110 --> 00:12:23.330
brain Q is a perfect match for the teacher's

00:12:23.330 --> 00:12:26.389
exam P, the distance is zero. There is no penalty.
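
The cross-entropy and KL-divergence ideas in this exchange can be sketched in a few lines; the distributions `p` and `q` below are illustrative choices, not from the episode:

```python
import math

def cross_entropy(p, q):
    # Expected bits to encode outcomes drawn from p with a code optimized for q.
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # Extra, wasted bits because the model q differs from reality p.
    return cross_entropy(p, q) - cross_entropy(p, p)

p = [0.9, 0.1]  # "reality"
q = [0.7, 0.3]  # an imperfect model of it
print(kl_divergence(p, q))  # positive: a penalty for being wrong
print(kl_divergence(p, p))  # 0.0: perfect model, no penalty
```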

00:12:26.590 --> 00:12:29.389
Yes. And that is the only scenario where perplexity

00:12:29.389 --> 00:12:31.389
is completely minimized. The divergence between

00:12:31.389 --> 00:12:34.190
expectation and reality drops all the way to

00:12:34.190 --> 00:12:37.090
zero. OK. We have covered flipping coins,

00:12:37.090 --> 00:12:40.230
90-10 math paradoxes, and evaluating how well a

00:12:40.230 --> 00:12:43.250
model predicts a sequence of test events. But

00:12:43.250 --> 00:12:46.009
rolling a die or taking a multiple choice test

00:12:46.009 --> 00:12:48.370
is one thing. How on earth does this math scale

00:12:48.370 --> 00:12:50.409
up to a whole paragraph? Like, how does this

00:12:50.409 --> 00:12:52.990
power natural language processing or the large

00:12:52.990 --> 00:12:55.090
language models like ChatGPT that are generating

00:12:55.090 --> 00:12:57.529
thousands of words in a row? Well, in natural

00:12:57.529 --> 00:13:00.779
language processing, or NLP, we aren't just predicting

00:13:00.779 --> 00:13:03.879
a single isolated event anymore. We are evaluating

00:13:03.879 --> 00:13:07.919
entire sprawling text documents. We look at a

00:13:07.919 --> 00:13:10.940
corpus, which is a massive structured collection

00:13:10.940 --> 00:13:13.740
of texts. And a language model is essentially

00:13:13.740 --> 00:13:16.379
a probability distribution mapped over those

00:13:16.379 --> 00:13:19.840
entire texts. But documents are incredibly messy.

00:13:20.100 --> 00:13:22.080
Like you might have a test sample that's a three

00:13:22.080 --> 00:13:24.779
word text message and another that's a 300-page

00:13:24.779 --> 00:13:28.019
historical biography. How can you possibly compare

00:13:27.919 --> 00:13:30.899
the perplexity of a model on those two wildly

00:13:30.899 --> 00:13:32.740
different things? You're right. You can't compare

00:13:32.740 --> 00:13:34.679
them raw. You have to normalize the math. And

00:13:34.679 --> 00:13:37.059
this brings us to token normalized perplexity.

00:13:37.320 --> 00:13:40.379
Token normalized perplexity? Right. In NLP, a

00:13:40.379 --> 00:13:42.500
token is usually a single word or sometimes just

00:13:42.500 --> 00:13:44.720
a piece of a word. To make meaningful comparisons

00:13:44.720 --> 00:13:46.879
across different lengths of text, you calculate

00:13:46.879 --> 00:13:49.360
the overall perplexity of the document, and then

00:13:49.360 --> 00:13:51.580
you normalize it by the total number of tokens.
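
Token-normalized perplexity as just described can be sketched like this; the per-token probabilities below are hypothetical, purely for illustration:

```python
import math

def token_perplexity(token_probs):
    # Exponentiated average negative log-probability per token.
    n = len(token_probs)
    avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / n
    return 2 ** avg_neg_log2

# Hypothetical probabilities a model assigned to each word of a short test text.
print(token_perplexity([0.2, 0.05, 0.5, 0.1]))
print(token_perplexity([0.5, 0.5]))  # 2.0: coin-flip confused at every word
```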

00:13:52.100 --> 00:13:55.039
So we're basically boiling the AI's massive confusion

00:13:55.039 --> 00:13:57.899
down to a per word average. What does that actually

00:13:57.899 --> 00:14:00.200
look like in practice? There was a really famous

00:14:00.200 --> 00:14:02.519
historical benchmark for this using the Brown

00:14:02.519 --> 00:14:04.779
Corpus. The Brown Corpus? Yeah, the Brown Corpus

00:14:04.779 --> 00:14:06.679
is a collection of one million words of American

00:14:06.679 --> 00:14:08.980
English compiled to cover all sorts of varying

00:14:08.980 --> 00:14:12.779
topics and genres. Back in 1992, the absolute

00:14:12.779 --> 00:14:15.159
state-of-the-art lowest published perplexity

00:14:12.779 --> 00:14:15.159
on the Brown Corpus was about 247 per token.

00:14:19.320 --> 00:14:22.899
Okay, so what does this all mean? If I'm a computer

00:14:22.899 --> 00:14:26.159
scientist in 1992 and my AI gets a perplexity

00:14:26.159 --> 00:14:30.059
of 247, what is actually happening inside the

00:14:30.059 --> 00:14:32.820
machine? Well, think back to our dice analogy.

00:14:33.070 --> 00:14:36.789
An AI with a token normalized perplexity of 247

00:14:36.789 --> 00:14:39.149
is just as confused when looking at the test

00:14:39.149 --> 00:14:42.250
data as if it had to choose uniformly and independently

00:14:42.250 --> 00:14:45.850
among 247 different, equally likely possibilities

00:14:45.850 --> 00:14:48.470
for every single word it tries to predict. Whoa!

00:14:48.840 --> 00:14:52.419
It's rolling a 247-sided die for every single

00:14:52.419 --> 00:14:55.100
word in a sentence. That sounds agonizingly difficult.

00:14:55.379 --> 00:14:58.059
I mean, if you just naively guessed out of 247

00:14:58.059 --> 00:15:00.519
options, your accuracy rate would be 1 divided

00:15:00.519 --> 00:15:05.039
by 247, which is about 0.4%. You would be wrong

00:15:05.039 --> 00:15:08.259
99.6% of the time. It sounds like a terrible

00:15:08.259 --> 00:15:11.500
model, doesn't it? But there is a brilliant nuance

00:15:11.500 --> 00:15:14.580
here that underscores exactly why we can't just

00:15:14.580 --> 00:15:17.320
blindly trust a single mathematical metric. Oh,

00:15:17.320 --> 00:15:19.620
I love this part from the reading: the loophole.

00:15:19.620 --> 00:15:23.059
Yes, the math shows that if your perplexity is

00:15:23.059 --> 00:15:28.200
247, naive guessing gets you a 0.4 percent accuracy

00:15:28.200 --> 00:15:31.799
rate. But if you completely throw that sophisticated

00:15:31.799 --> 00:15:34.899
model in the trash and you just program a dumb

00:15:34.899 --> 00:15:38.779
machine to blindly guess the word "the", T-H-E, for

00:15:38.779 --> 00:15:41.539
every single word in the English language, you

00:15:41.539 --> 00:15:44.399
will actually achieve an accuracy rate of 7%.

00:15:44.399 --> 00:15:48.120
It's a massive discrepancy. 0.4% versus 7%.
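
That 0.4% versus 7% comparison is simple arithmetic; a quick check (the 7% baseline is the figure quoted in the episode):

```python
# Naive guessing among 247 equally likely options per word:
naive_accuracy = 1 / 247  # about 0.4%
# Always guessing "the" (the figure quoted in the episode):
the_accuracy = 0.07

print(f"naive: {naive_accuracy:.3%}")
print(f"wrong {1 - naive_accuracy:.1%} of the time")
print(f"'the' baseline is roughly {the_accuracy / naive_accuracy:.0f}x more accurate")
```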

00:15:48.120 --> 00:15:50.620
And it highlights the deeply nuanced nature of

00:15:50.620 --> 00:15:53.080
predictiveness. I mean, how is a blind, repetitive

00:15:53.080 --> 00:15:55.779
guess of "the" mathematically outperforming a

00:15:55.779 --> 00:15:59.080
state-of-the-art 1992 AI model on this metric? Yeah. Why

00:15:59.080 --> 00:16:01.179
shouldn't we just build an AI that screams

00:16:01.179 --> 00:16:03.379
"the" all day? Because it comes down to the types of

00:16:03.379 --> 00:16:05.080
statistics the models are actually utilizing.

00:16:05.289 --> 00:16:07.809
Guessing the word "the" constantly is based on

00:16:07.809 --> 00:16:10.629
unigram statistics. Right. A unigram just looks

00:16:10.629 --> 00:16:13.230
at the frequency of single words in total isolation.

00:16:13.519 --> 00:16:16.460
"The" is the most common word in the English language.

00:16:16.700 --> 00:16:19.379
So purely by brute force, it hits 7% of the

00:16:19.379 --> 00:16:23.340
time. But the model that achieved the 247 perplexity

00:16:23.340 --> 00:16:27.159
score in 1992 wasn't utilizing unigrams. It was

00:16:27.159 --> 00:16:30.440
utilizing a trigram statistic. OK. A trigram

00:16:30.440 --> 00:16:32.659
looks at sequences of three words. It's actually

00:16:32.659 --> 00:16:35.259
looking at the context. Yes. A trigram model

00:16:35.259 --> 00:16:37.720
tries to predict the next word based on the context

00:16:37.720 --> 00:16:40.480
of the two preceding words. It is trying to learn

00:16:40.480 --> 00:16:42.720
the actual grammatical structure, the rhythm,

00:16:42.919 --> 00:16:45.200
and the flow of the language. It isn't just firing

00:16:45.200 --> 00:16:47.419
out the most common word regardless of context.
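
The unigram-versus-trigram distinction can be illustrated with a toy counting model; the tiny corpus below is invented for illustration:

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration.
words = "the cat sat on the mat the cat ate the fish".split()

# Unigram: always predict the single most frequent word, ignoring context.
unigram_guess = Counter(words).most_common(1)[0][0]
print(unigram_guess)  # 'the'

# Trigram: predict the next word from the two preceding words.
trigram = defaultdict(Counter)
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    trigram[(w1, w2)][w3] += 1

print(dict(trigram[("the", "cat")]))  # counts of what followed "the cat"
```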

00:16:47.720 --> 00:16:50.539
So guessing "the" is extremely safe and gets a

00:16:50.539 --> 00:16:53.220
higher raw accuracy percentage on paper, but

00:16:53.220 --> 00:16:55.759
it's incredibly boring and completely useless.

00:16:56.259 --> 00:16:58.700
The trigram model is actually far more sophisticated

00:16:58.700 --> 00:17:00.980
because it's mapping the true complexity of human

00:17:00.980 --> 00:17:05.230
language. It takes risks. Utilizing trigram statistics

00:17:05.230 --> 00:17:07.430
fundamentally refines the prediction in a way

00:17:07.430 --> 00:17:09.690
unigrams never could, even if the model is still

00:17:09.690 --> 00:17:12.730
faced with 247 equivalent choices at every step.

00:17:13.009 --> 00:17:15.589
That's the key. And obviously the field has moved

00:17:15.589 --> 00:17:19.420
far beyond 1992 trigram models. Since around

00:17:19.420 --> 00:17:22.220
2007, deep learning has taken over entirely.

00:17:22.960 --> 00:17:24.779
Today, we have these dominant transformer models,

00:17:24.920 --> 00:17:27.720
Google's BERT, OpenAI's GPT-4, all these massive

00:17:27.720 --> 00:17:30.619
large language models. And token normalized perplexity

00:17:30.619 --> 00:17:32.839
is still a central tool for evaluating them.

00:17:33.859 --> 00:17:35.619
Engineers use this metric to compare different

00:17:35.619 --> 00:17:38.460
models on the exact same data set and to mathematically

00:17:38.460 --> 00:17:41.480
guide the optimization of the AI's internal settings,

00:17:41.619 --> 00:17:44.130
its hyperparameters. But as these models have

00:17:44.130 --> 00:17:46.569
gotten larger and more complex, haven't researchers

00:17:46.569 --> 00:17:49.170
discovered limitations to using perplexity as

00:17:49.170 --> 00:17:51.130
the ultimate North Star? I mean, it can't be

00:17:51.130 --> 00:17:53.470
perfect. They absolutely have. It is a powerful

00:17:53.470 --> 00:17:56.450
tool, but it is deeply flawed if used in a vacuum.

00:17:57.190 --> 00:17:58.990
First off, it's highly sensitive to linguistic

00:17:58.990 --> 00:18:01.569
features and sentence length. You cannot just

00:18:01.569 --> 00:18:03.849
blindly compare a perplexity score from a dataset

00:18:03.849 --> 00:18:05.990
of medical journals to a perplexity score from

00:18:05.990 --> 00:18:08.289
a dataset of Twitter posts and expect a perfect

00:18:08.289 --> 00:18:10.289
apples-to-apples comparison. The intrinsic

00:18:10.289 --> 00:18:12.009
entropy of those two environments is entirely

00:18:12.009 --> 00:18:15.140
different. That makes total sense. The baseline

00:18:15.140 --> 00:18:18.220
surprise of a dense medical journal is obviously

00:18:18.220 --> 00:18:20.759
going to be different from a random tweet. Furthermore,

00:18:20.980 --> 00:18:23.799
it turns out perplexity is an inadequate predictor

00:18:23.799 --> 00:18:26.819
of actual performance in the real world, particularly

00:18:26.819 --> 00:18:29.299
in tasks like speech recognition. Wait, really?

00:18:29.460 --> 00:18:31.500
Yeah. You might engineer a model that achieves

00:18:31.500 --> 00:18:34.180
a mathematically stunning ultra-low perplexity

00:18:34.180 --> 00:18:37.440
score. The KL divergence is incredibly low on

00:18:37.440 --> 00:18:39.920
the test set. But when you hook it up to a microphone

00:18:39.920 --> 00:18:42.759
and have a real human speak to it, it might still

00:18:42.759 --> 00:18:46.420
hallucinate, misinterpret accents, or just fail

00:18:46.420 --> 00:18:48.920
to transcribe accurately. So how do they reality

00:18:48.920 --> 00:18:51.579
check the math, then, if perplexity isn't enough?

00:18:51.799 --> 00:18:53.740
Researchers often have to rely on a simpler,

00:18:53.779 --> 00:18:56.380
alternative evaluation metric called word error

00:18:56.380 --> 00:18:59.730
rate, or WER. Word error rate. Okay, that sounds

00:18:59.730 --> 00:19:01.910
much more grounded. It's simply taking the percentage

00:19:01.910 --> 00:19:05.230
of erroneously recognized words, so deletions,

00:19:05.390 --> 00:19:07.609
insertions, substitutions, and dividing it by

00:19:07.609 --> 00:19:10.509
the total number of words spoken. Exactly. It's

00:19:10.509 --> 00:19:13.589
a harsh reality check against the purely theoretical

00:19:13.589 --> 00:19:17.109
clean math of perplexity because there is a massive

00:19:17.109 --> 00:19:20.299
danger in AI development. If you blindly optimize

00:19:20.299 --> 00:19:23.000
your system just to achieve the absolute lowest

00:19:23.000 --> 00:19:25.859
perplexity score possible, you run into severe

00:19:25.859 --> 00:19:28.119
issues with overfitting. Overfitting, meaning

00:19:28.119 --> 00:19:30.519
the AI essentially just memorizes its training

00:19:30.519 --> 00:19:32.339
data. Going back to our student analogy, it's

00:19:32.339 --> 00:19:35.039
like a student who memorized the answer key to

00:19:35.039 --> 00:19:37.039
the practice test without actually understanding

00:19:37.039 --> 00:19:39.660
the underlying subject. They get zero perplexity

00:19:39.660 --> 00:19:42.400
on the practice exam, but fail miserably when

00:19:42.400 --> 00:19:44.440
they face a question worded slightly differently

00:19:44.440 --> 00:19:46.480
in the real world. Right. They completely lose

00:19:46.480 --> 00:19:49.680
their ability to generalize to new unseen information.

00:19:50.240 --> 00:19:52.779
The AI becomes brilliant at predicting the past

00:19:52.779 --> 00:19:55.059
and totally useless at navigating the future.
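The word error rate described earlier is simple enough to sketch directly. This is a minimal edit-distance implementation; real scoring toolkits normalize the text first, but the deletion/insertion/substitution accounting is the same idea.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference,
    computed with a standard edit-distance dynamic program.
    Assumes a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edits turning first i reference words into first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,                # deletion
                          d[i][j - 1] + 1,                # insertion
                          d[i - 1][j - 1] + cost)         # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat",
                      "the cat sat on mat"))  # one deletion out of six words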

00:19:55.500 --> 00:19:58.140
Wow. So to wrap all of this up, you have gone

00:19:58.140 --> 00:19:59.759
on quite a mathematical journey with us today.

00:20:00.119 --> 00:20:02.500
We started with the simple uncertainty of rolling

00:20:02.500 --> 00:20:05.619
a fair six-sided die. We untangled the paradox

00:20:05.619 --> 00:20:08.460
of why predicting a 90% probability still carries

00:20:08.460 --> 00:20:11.119
a mathematical penalty of surprise due to variable

00:20:11.119 --> 00:20:14.480
length coding. We learned how KL divergence measures

00:20:14.480 --> 00:20:16.779
the distance between a model's guess and true

00:20:16.779 --> 00:20:19.519
reality. And finally, we scaled all that math

00:20:19.519 --> 00:20:22.019
up to see how the most advanced large language

00:20:22.019 --> 00:20:24.319
models in the world measure their own confusion

00:20:24.319 --> 00:20:27.339
on a per-word basis. Next time you see an article

00:20:27.339 --> 00:20:29.690
or a headline about a brand new AI model, and

00:20:29.690 --> 00:20:31.450
the companies are boasting about their groundbreaking

00:20:31.450 --> 00:20:34.210
new benchmark stats, you know exactly what perplexity

00:20:34.210 --> 00:20:37.309
means. It is the AI's internal mathematical measure

00:20:37.309 --> 00:20:40.069
of surprise. And more importantly, you know exactly

00:20:40.069 --> 00:20:42.190
why a single metric, no matter how low it gets,

00:20:42.410 --> 00:20:44.410
doesn't tell the whole story of how well an AI

00:20:44.410 --> 00:20:46.930
actually understands human language. And this

00:20:46.930 --> 00:20:49.279
raises an important question. We've established

00:20:49.279 --> 00:20:51.319
that data scientists and engineers are constantly

00:20:51.319 --> 00:20:53.619
trying to minimize an AI model's perplexity.
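The link between minimizing cross-entropy and minimizing perplexity is a one-line identity; this sketch assumes cross-entropy measured in nats (natural-log units), and the number fed in is illustrative.

```python
import math

def perplexity_from_cross_entropy(h_nats):
    """Perplexity = e^H when the per-word cross-entropy H is in nats
    (equivalently 2^H when H is measured in bits)."""
    return math.exp(h_nats)

# A model whose average cross-entropy is ln(10) nats per word is, on
# average, as uncertain as a fair 10-sided die at every word.
print(perplexity_from_cross_entropy(math.log(10)))  # ≈ 10
```

Because exp is monotonic, driving cross-entropy down and driving perplexity down are the same optimization, which is why the two metrics are used interchangeably during training.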

00:20:54.220 --> 00:20:56.420
They are tweaking the hyperparameters so that

00:20:56.420 --> 00:20:58.660
the AI is trained to never be surprised by a

00:20:58.660 --> 00:21:00.799
sequence of human words. It should perfectly

00:21:00.799 --> 00:21:03.720
anticipate everything we say, reducing its cross-

00:21:00.799 --> 00:21:03.720
entropy to the absolute minimum. But if we mathematically

00:21:07.019 --> 00:21:09.759
train a system to ruthlessly eliminate all surprise,

00:21:10.440 --> 00:21:13.039
are we inadvertently mathematically stripping

00:21:13.039 --> 00:21:15.740
away the model's ability to ever generate truly

00:21:15.740 --> 00:21:19.079
surprising, original, or creative thoughts? Wow,

00:21:19.400 --> 00:21:22.400
that is definitely something to mull over. Because

00:21:22.400 --> 00:21:24.400
while an AI might be rigorously engineered to

00:21:24.400 --> 00:21:27.400
be a perfect unsurprised predictor of text, when

00:21:27.400 --> 00:21:29.599
a person jumps out from behind a door, that subjective,

00:21:29.880 --> 00:21:32.079
unpredictable human surprise is sometimes exactly

00:21:32.079 --> 00:21:33.059
what makes life interesting.
