WEBVTT

00:00:00.000 --> 00:00:03.160
You know, it's basically a universal human experience

00:00:03.160 --> 00:00:05.799
at this point. You're packing for a simple weekend

00:00:05.799 --> 00:00:08.400
getaway, like two nights, tops. Right. But you

00:00:08.400 --> 00:00:11.660
end up staring at this massive groaning steamer

00:00:11.660 --> 00:00:15.019
trunk of a suitcase. Like, you've packed four

00:00:15.019 --> 00:00:18.280
pairs of shoes, a heavy winter coat, just in

00:00:18.280 --> 00:00:20.920
case there's a freak blizzard in July, you know.

00:00:20.940 --> 00:00:23.780
I mean, just in case. Exactly. And enough outfits

00:00:23.780 --> 00:00:26.460
to supply a small theater company. You don't

00:00:26.460 --> 00:00:29.000
actually need, I mean, probably 90% of it, but

00:00:29.000 --> 00:00:31.480
it's all in there. It's that comfort of having

00:00:31.480 --> 00:00:33.920
every possible option available. Yeah, you feel

00:00:33.920 --> 00:00:35.710
prepared for anything. Right. But then you get

00:00:35.710 --> 00:00:38.429
to the airport and the airline tells you, uh,

00:00:38.429 --> 00:00:40.829
you can only bring a single tiny carry-on. And

00:00:40.829 --> 00:00:43.229
suddenly you have to make some ruthless decisions.

00:00:43.250 --> 00:00:45.549
You really do. You have to figure out how to

00:00:45.549 --> 00:00:48.609
take the core utility of that giant steamer trunk

00:00:48.609 --> 00:00:51.469
and like magically compress it into a bag that

00:00:51.469 --> 00:00:54.250
fits under the seat in front of you. And well,

00:00:54.250 --> 00:00:57.070
in the world of artificial intelligence, engineers

00:00:57.070 --> 00:00:59.770
are facing this exact same nightmare. That is,

00:00:59.850 --> 00:01:01.630
I mean, that's such a perfect parallel to draw

00:01:01.630 --> 00:01:04.530
because modern AI models are fundamentally

00:01:04.530 --> 00:01:06.569
over-packed. Right, which is what we're looking

00:01:06.569 --> 00:01:09.650
at in today's Deep Dive. It's called model compression,

00:01:10.090 --> 00:01:13.269
and it's basically the invisible magic making

00:01:13.269 --> 00:01:16.489
your smartphone capable of running advanced AI

00:01:16.489 --> 00:01:19.349
without literally melting in your hand. Yeah,

00:01:19.549 --> 00:01:21.890
and we constantly hear about these massive neural

00:01:21.890 --> 00:01:24.629
networks achieving human-level accuracy in

00:01:24.629 --> 00:01:27.109
language and vision. But that performance comes

00:01:27.109 --> 00:01:29.609
at a massive, massive cost. Because they're huge,

00:01:29.870 --> 00:01:32.280
right? Exactly. They require staggering amounts

00:01:32.280 --> 00:01:35.239
of memory and computing power just to function.

00:01:36.040 --> 00:01:38.379
So if we want AI everywhere running on the resource

00:01:38.379 --> 00:01:41.040
constrained edge devices in your life, like the

00:01:41.040 --> 00:01:42.739
smart speaker in your kitchen or the embedded

00:01:42.739 --> 00:01:45.180
systems in your car, it just can't weigh a ton.

00:01:45.260 --> 00:01:47.680
It has to be lean. Right. And even for the huge

00:01:47.680 --> 00:01:50.040
tech corporations, shrinking these massive models

00:01:50.040 --> 00:01:53.519
means saving millions in compute costs when they

00:01:53.519 --> 00:01:56.280
serve AI to you over an API. Oh, absolutely.

00:01:56.400 --> 00:01:58.680
The cost savings are astronomical. Which ultimately

00:01:58.680 --> 00:02:00.780
translates to those lightning fast response times

00:02:00.780 --> 00:02:03.439
that you, the user, expect. But before we get

00:02:03.439 --> 00:02:05.219
into the brilliant mechanics of how engineers

00:02:05.219 --> 00:02:07.140
actually achieve this, we need to establish a

00:02:07.140 --> 00:02:08.979
vital ground rule based on the sources we're

00:02:08.979 --> 00:02:11.500
looking at. Yeah, this is important. Model compression

00:02:11.500 --> 00:02:14.659
is, at its core, a process of lossy compression.

00:02:14.909 --> 00:02:17.689
Like, we are intentionally throwing data away.

00:02:17.870 --> 00:02:22.270
We are. But the key is we're doing it so strategically

00:02:22.270 --> 00:02:25.810
that the model's overall accuracy doesn't take

00:02:25.810 --> 00:02:28.129
a significant hit. Right. And I think we should

00:02:28.129 --> 00:02:31.229
draw a clear line here between model compression

00:02:31.229 --> 00:02:33.750
and another popular technique called knowledge

00:02:33.750 --> 00:02:35.830
distillation. Oh, right, because people confuse

00:02:35.830 --> 00:02:38.750
those all the time. Exactly. So knowledge distillation

00:02:38.750 --> 00:02:41.810
involves training a brand new, separate, smaller

00:02:41.810 --> 00:02:45.909
student model to simply mimic the output of a

00:02:45.909 --> 00:02:48.169
massive teacher model. Like a copycat. Yeah,

00:02:48.270 --> 00:02:51.490
a really smart copycat. That is a fascinating

00:02:51.490 --> 00:02:53.629
process, but it's fundamentally different from

00:02:53.629 --> 00:02:55.590
what we're talking about today. We aren't building

00:02:55.590 --> 00:02:58.250
a smaller copycat. We are taking the original

00:02:58.250 --> 00:03:00.689
massive brain and performing surgery to shrink

00:03:00.689 --> 00:03:03.250
it down. Precisely. OK, so let's unpack this.

00:03:03.610 --> 00:03:06.409
How do we actually fit that giant AI steamer

00:03:06.409 --> 00:03:08.849
trunk into a carry-on? I mean, the most obvious

00:03:08.849 --> 00:03:10.389
first step is just throwing out the clothes we

00:03:10.389 --> 00:03:12.689
aren't going to wear, right? Just cut out the

00:03:12.689 --> 00:03:15.210
dead weight. Right. And in machine learning,

00:03:15.370 --> 00:03:17.590
that technique is formally known as pruning.

00:03:17.840 --> 00:03:20.800
Pruning, like a tree. Exactly like a tree. Yeah.

00:03:21.020 --> 00:03:23.780
You are taking a dense neural network and sparsifying

00:03:23.780 --> 00:03:27.180
it. If you picture a neural network as billions

00:03:27.180 --> 00:03:29.439
of mathematical weights connecting different

00:03:29.439 --> 00:03:33.300
artificial neurons, pruning is going in and forcibly

00:03:33.300 --> 00:03:36.020
setting specific parameters to exactly zero.
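NOTE
A minimal sketch of what "forcibly setting parameters to exactly zero" can look like in practice, assuming PyTorch and its torch.nn.utils.prune utilities; the layer size and the 50% ratio are illustrative, not from the conversation.

import torch
import torch.nn as nn
from torch.nn.utils import prune

layer = nn.Linear(512, 512)                       # one dense layer: 512 x 512 = 262,144 weights
prune.l1_unstructured(layer, name="weight", amount=0.5)   # zero the half of the weights with the smallest magnitude
sparsity = (layer.weight == 0).float().mean()
print(f"{sparsity:.0%} of the connections are now exactly zero")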

00:03:36.280 --> 00:03:39.060
Wow, just straight to zero. Yep. You are literally

00:03:39.060 --> 00:03:41.219
snipping the connections. OK, I want to visualize

00:03:41.219 --> 00:03:44.439
this. It feels a bit like playing a massive game

00:03:44.439 --> 00:03:47.580
of Jenga. OK. You have this huge solid tower

00:03:47.580 --> 00:03:50.000
of blocks. Those are all your parameters. And

00:03:50.000 --> 00:03:52.340
you start carefully tapping out individual blocks,

00:03:52.340 --> 00:03:55.819
trying to keep the tower standing. But the Jenga

00:03:55.819 --> 00:03:57.979
analogy breaks down a bit, doesn't it? It does,

00:03:58.120 --> 00:03:59.939
yeah. Because in Jenga, every block you remove

00:03:59.939 --> 00:04:02.620
makes the tower inherently more fragile and unstable

00:04:02.620 --> 00:04:05.340
until it eventually just collapses. But AI models

00:04:05.340 --> 00:04:07.699
don't seem to just collapse from a stiff breeze.

00:04:07.800 --> 00:04:10.500
No, they don't. A better way to think about the

00:04:10.500 --> 00:04:13.039
Jenga tower is to imagine that every time you

00:04:13.039 --> 00:04:15.979
pull a useless block out, the remaining blocks

00:04:15.979 --> 00:04:18.800
magically fuse together to become even stronger

00:04:18.800 --> 00:04:21.399
and more resilient. Oh, wow. So they compensate

00:04:21.399 --> 00:04:24.860
for the gap. Exactly. The AI adapts to the missing

00:04:24.860 --> 00:04:27.579
pieces. But to your point about which blocks

00:04:27.579 --> 00:04:30.360
to pull, engineers have a few specific criteria

00:04:30.360 --> 00:04:33.160
for that. Okay, what's the first one? Well, the

00:04:33.160 --> 00:04:36.350
simplest metric is magnitude. If a mathematical

00:04:36.350 --> 00:04:38.990
weight is already very close to zero, it means

00:04:38.990 --> 00:04:41.350
it's barely influencing the network's calculations

00:04:41.350 --> 00:04:43.790
anyway. Right. It's just a weak connection. Yeah,

00:04:43.790 --> 00:04:45.750
so you just round it down to zero and prune the

00:04:45.750 --> 00:04:47.470
connection. Makes total sense. If it's weak,

00:04:47.529 --> 00:04:50.310
cut it. What else? You can also look at the statistical

00:04:50.310 --> 00:04:52.889
pattern of neural activations. Which means what,

00:04:52.970 --> 00:04:55.790
exactly? Well, if certain pathways in the network

00:04:55.790 --> 00:04:58.990
rarely ever light up when processing a wide variety

00:04:58.990 --> 00:05:02.269
of data, those pathways might just be redundant

00:05:02.269 --> 00:05:04.920
architecture. Oh, I see. Just dead ends that

00:05:04.920 --> 00:05:09.220
aren't being used. Exactly. And then you have

00:05:09.220 --> 00:05:12.879
highly complex mathematical criteria, like utilizing

00:05:12.879 --> 00:05:14.939
Hessian values. OK, wait. Let's pause right there.

00:05:15.279 --> 00:05:17.339
Because Hessian values sound incredibly dense.

00:05:17.500 --> 00:05:19.459
It is a bit dense, yeah. How does that actually

00:05:19.459 --> 00:05:23.060
help us find dead weight? So the Hessian essentially

00:05:23.060 --> 00:05:25.639
measures curvature. Or in this context, we can

00:05:25.639 --> 00:05:28.160
think of it as sensitivity. It calculates how

00:05:28.160 --> 00:05:31.600
sensitive the overall error of the model is to

00:05:31.600 --> 00:05:34.019
changes in specific weights. OK, I'm with you

00:05:34.019 --> 00:05:37.019
so far. Imagine poking a sleeping bear. Always

00:05:37.019 --> 00:05:40.100
a good idea. Right. But if you poke one spot

00:05:40.100 --> 00:05:42.839
and the bear doesn't even flinch, that spot isn't

00:05:42.839 --> 00:05:45.879
highly sensitive. OK. So if you tweak a mathematical

00:05:45.879 --> 00:05:48.860
weight in the model and the model's overall accuracy

00:05:48.860 --> 00:05:51.800
doesn't flinch, that weight is a prime candidate

00:05:51.800 --> 00:05:56.060
for pruning. Wow. But, okay, let me ask you this.
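NOTE
A toy version of that "poke it and see if the error flinches" test: zero one weight at a time and watch the loss. This is an empirical stand-in for the Hessian-style sensitivity idea, not an actual second-derivative computation, and the model and data are invented for illustration.

import torch

model = torch.nn.Linear(16, 1)
x, y = torch.randn(64, 16), torch.randn(64, 1)
loss_fn = torch.nn.MSELoss()
base = loss_fn(model(x), y).item()
with torch.no_grad():
    for idx in [(0, 0), (0, 5)]:                  # poke two individual connections
        saved = model.weight[idx].clone()
        model.weight[idx] = 0.0                   # temporarily snip this connection
        delta = loss_fn(model(x), y).item() - base   # did the overall error flinch?
        model.weight[idx] = saved                 # put the block back
        print(idx, round(delta, 6))               # a near-zero delta marks a pruning candidate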

00:05:56.399 --> 00:05:59.100
If engineers can just go in and set millions

00:05:59.100 --> 00:06:01.639
of parameters to zero based on things like their

00:06:01.639 --> 00:06:03.740
magnitude or how often they activate, and the

00:06:03.740 --> 00:06:06.199
model still works perfectly, doesn't that mean

00:06:06.199 --> 00:06:08.040
that AI was just sort of overthinking in the

00:06:08.040 --> 00:06:09.939
first place? Like, why did it learn to hoard

00:06:09.939 --> 00:06:12.339
all that useless data during training if it didn't

00:06:12.339 --> 00:06:14.939
actually need it? What's fascinating here is

00:06:14.939 --> 00:06:17.360
the difference between exploration and exploitation.

00:06:17.740 --> 00:06:21.180
A model's raw size during training does not equal

00:06:21.180 --> 00:06:23.970
its ultimate efficiency. During the learning

00:06:23.970 --> 00:06:27.550
phase, the AI absolutely needs that vast web

00:06:27.550 --> 00:06:30.149
of connections to explore different possibilities.

00:06:30.350 --> 00:06:32.930
To make mistakes and hit dead ends. Yes, to hit

00:06:32.930 --> 00:06:35.509
dead ends and to eventually find the optimal

00:06:35.509 --> 00:06:38.420
mathematical pathways. But once those successful

00:06:38.420 --> 00:06:40.740
pathways are cemented, much of the surrounding

00:06:40.740 --> 00:06:43.399
exploratory scaffolding just becomes obsolete.

00:06:43.540 --> 00:06:45.899
Oh, it's like building a bridge. You need massive

00:06:45.899 --> 00:06:48.120
cranes and temporary support structures to build

00:06:48.120 --> 00:06:50.480
the arch, but once the arch is locked in place,

00:06:50.579 --> 00:06:52.800
you take all that scaffolding away. That's a

00:06:52.800 --> 00:06:54.360
great way to put it. The bridge still stands.

00:06:55.120 --> 00:06:57.399
But earlier you mentioned that setting these

00:06:57.399 --> 00:07:00.579
parameters to exactly zero is the key. Why is

00:07:00.579 --> 00:07:03.509
the number zero so magical here? It really comes

00:07:03.509 --> 00:07:07.050
down to the physical hardware of computing. Specifically,

00:07:07.649 --> 00:07:10.449
sparse matrix operations versus dense matrix

00:07:10.449 --> 00:07:12.670
operations. Okay, lay that out for me. Well,

00:07:12.689 --> 00:07:15.490
a dense matrix operation forces the computer's

00:07:15.490 --> 00:07:18.649
processor to calculate every single number, moving

00:07:18.649 --> 00:07:21.170
data in and out of its memory registers, even

00:07:21.170 --> 00:07:24.589
if that data is tiny or totally irrelevant. It

00:07:24.589 --> 00:07:26.649
takes a literal clock cycle for the hardware

00:07:26.649 --> 00:07:29.269
to process it. And a sparse matrix operation

00:07:29.269 --> 00:07:32.709
just skips that. Exactly. Modern computing hardware

00:07:32.709 --> 00:07:35.610
and software frameworks are optimized to recognize

00:07:35.610 --> 00:07:38.769
when a matrix is sparse, which just means it's

00:07:38.769 --> 00:07:41.370
heavily populated with zeros. The system knows

00:07:41.370 --> 00:07:44.670
that anything multiplied by zero is zero. Therefore,

00:07:45.029 --> 00:07:47.509
it physically skips the calculation entirely.
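NOTE
A small sketch of that payoff using PyTorch's sparse tensors; the 90% sparsity level is just an example, and real speedups depend on hardware support for sparse kernels.

import torch

W = torch.randn(1000, 1000)
W = W * torch.bernoulli(torch.full_like(W, 0.1))  # pretend pruning zeroed roughly 90% of the connections
x = torch.randn(1000, 1)
W_sparse = W.to_sparse()                          # keep only the non-zero values plus their coordinates
print(W_sparse.values().numel())                  # roughly 100,000 stored numbers instead of 1,000,000
print(torch.allclose(W @ x, torch.sparse.mm(W_sparse, x)))   # same answer, far fewer multiplications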

00:07:47.769 --> 00:07:50.709
So you aren't just saving memory space by storing

00:07:50.709 --> 00:07:53.269
fewer numbers. You are radically speeding up

00:07:53.269 --> 00:07:55.730
the processing time by giving the computer millions

00:07:55.730 --> 00:07:58.079
of fewer math problems to solve per second. Okay,

00:07:58.339 --> 00:08:00.560
so pruning throws away the clothes we don't need,

00:08:00.740 --> 00:08:02.899
the Jenga blocks that aren't load-bearing. But

00:08:02.899 --> 00:08:05.060
what do we do when we are left with only the

00:08:05.060 --> 00:08:07.620
essential blocks and the model is still too heavy?

00:08:07.740 --> 00:08:09.720
Yeah, you hit a wall eventually. Right. If you

00:08:09.720 --> 00:08:11.779
can't throw a block away, can you make the block

00:08:11.779 --> 00:08:13.959
itself physically smaller in the computer's memory?

00:08:14.300 --> 00:08:17.259
You can. And that introduces the strategy of

00:08:17.259 --> 00:08:20.680
quantization. This is entirely about reducing the

00:08:20.680 --> 00:08:23.279
numerical precision of the weights and activations

00:08:23.279 --> 00:08:24.980
inside the network. And here's where it gets

00:08:24.980 --> 00:08:26.879
really interesting, because this is a concept

00:08:26.879 --> 00:08:29.600
we experience constantly in digital media without

00:08:29.600 --> 00:08:31.860
even thinking about it. We do all the time. Like,

00:08:31.860 --> 00:08:35.419
think about downsizing a massive uncompressed

00:08:35.419 --> 00:08:38.860
4K photograph so you can text it to a friend

00:08:38.860 --> 00:08:41.840
over a weak cellular connection. Right, you reduce

00:08:41.840 --> 00:08:44.720
the resolution to 1080p. Exactly. It takes up

00:08:44.720 --> 00:08:46.399
a fraction of the space on your hard drive. It

00:08:46.399 --> 00:08:49.139
sends instantly. But when your friend looks at

00:08:49.139 --> 00:08:51.840
it on their smartphone screen, it still looks

00:08:51.840 --> 00:08:54.399
crisp and perfectly recognizable. You lowered

00:08:54.399 --> 00:08:57.299
the resolution, but you kept the core image intact.

00:08:57.759 --> 00:09:00.399
And applying that exact concept to pure mathematics

00:09:00.399 --> 00:09:03.919
is what quantization does. How so? Well, in a

00:09:03.919 --> 00:09:06.700
standard uncompressed AI model, the mathematical

00:09:06.700 --> 00:09:09.159
weights are typically stored as 32-bit floating

00:09:09.159 --> 00:09:11.500
point numbers. Which are huge. They are. They

00:09:11.500 --> 00:09:14.779
offer incredible microscopic granularity for

00:09:14.779 --> 00:09:17.360
the math. But they also take up a relatively

00:09:17.360 --> 00:09:19.919
large amount of physical memory on a microchip,

00:09:20.120 --> 00:09:22.340
and it takes the processor significant effort

00:09:22.340 --> 00:09:25.120
to multiply them together. Right. So quantization

00:09:25.120 --> 00:09:28.460
asks a simple question. What if we take those

00:09:28.460 --> 00:09:31.820
heavy 32-bit floating point numbers and just

00:09:31.820 --> 00:09:34.159
compress them into lightweight 8-bit integers?
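NOTE
A minimal sketch of that idea with a single symmetric scale factor, written in plain PyTorch rather than any particular quantization toolkit; real pipelines add calibration data, zero-points, and per-channel scales.

import torch

w = torch.randn(1000, 1000)                       # 32-bit float weights: 4 bytes each
scale = w.abs().max() / 127                       # map the largest magnitude onto int8's range
w_int8 = (w / scale).round().clamp(-128, 127).to(torch.int8)   # 1 byte each: a 4x smaller footprint
w_back = w_int8.float() * scale                   # dequantize to see what the rounding cost us
print((w - w_back).abs().max())                   # worst-case error is about scale / 2, tiny next to the weights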

00:09:34.580 --> 00:09:37.360
Wow, so you trade that microscopic decimal precision

00:09:37.360 --> 00:09:40.519
for simple, smaller whole numbers. Exactly. Less

00:09:40.519 --> 00:09:43.679
precision equals less physical space and radically

00:09:43.679 --> 00:09:46.259
less computational power required to crunch the

00:09:46.259 --> 00:09:49.350
arithmetic. But, and there's a big but here. I figured.

00:09:49.509 --> 00:09:51.929
The catch is that you can't just bluntly chop

00:09:51.929 --> 00:09:54.730
every single parameter down to 8-bit without

00:09:54.730 --> 00:09:56.730
sometimes fundamentally breaking the model's

00:09:56.730 --> 00:09:58.429
logic. Like, it gets too blurry. Right. And that

00:09:58.429 --> 00:10:00.450
introduces a much more nuanced approach called

00:10:00.450 --> 00:10:03.190
mixed precision. Ah, so playing favorites. You

00:10:03.190 --> 00:10:05.529
aggressively quantize the less important parameters

00:10:05.529 --> 00:10:09.350
down to 8-bit. But for the crucial, highly sensitive

00:10:09.350 --> 00:10:11.549
load-bearing parameters of the neural network,

00:10:11.929 --> 00:10:14.149
you let them keep a higher 16-bit precision.

00:10:14.370 --> 00:10:16.590
Exactly. You mix and match based on what the

00:10:16.590 --> 00:10:19.470
model needs. But wait. How does a computer actually

00:10:19.470 --> 00:10:22.029
manage that on the fly? Like the sources mentioned

00:10:22.029 --> 00:10:24.629
PyTorch as a major framework for this, juggling

00:10:24.629 --> 00:10:27.710
8-bit and 16-bit numbers simultaneously. Yeah,

00:10:27.710 --> 00:10:29.669
PyTorch is a big player here. Doesn't that make

00:10:29.669 --> 00:10:31.629
the math incredibly messy? It sounds like trying

00:10:31.629 --> 00:10:33.649
to do your taxes in three different currencies

00:10:33.649 --> 00:10:35.409
at once. It would be an absolute nightmare if

00:10:35.409 --> 00:10:38.029
it were done manually, yeah. But modern frameworks

00:10:38.029 --> 00:10:40.450
are co-designed with the hardware to handle

00:10:40.450 --> 00:10:43.070
those currency exchanges seamlessly. Oh, OK.

00:10:43.450 --> 00:10:46.110
PyTorch handles this through Automatic Mixed

00:10:46.110 --> 00:10:49.639
Precision, or AMP. It basically acts as an incredibly

00:10:49.639 --> 00:10:52.679
fast translator under the hood, utilizing a feature

00:10:52.679 --> 00:10:55.419
called auto-casting. Auto-casting? How does

00:10:55.419 --> 00:10:57.840
it decide which currency to use? It evaluates

00:10:57.840 --> 00:11:00.080
the specific mathematical operation queued up.

00:11:00.620 --> 00:11:02.899
So if the calculation is historically robust

00:11:02.899 --> 00:11:05.559
and stable, auto-casting routes it through the

00:11:05.559 --> 00:11:08.500
fast, cheap, lower-precision math pathways on the hardware.

00:11:08.620 --> 00:11:10.840
Okay. But if it detects that a calculation is

00:11:10.840 --> 00:11:13.419
highly sensitive, say, an activation function

00:11:13.419 --> 00:11:15.500
where dropping precision would cause a massive

00:11:15.500 --> 00:11:17.820
spike in the error rate, it automatically casts

00:11:17.820 --> 00:11:20.799
the numbers up to a higher precision to preserve the granularity.
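NOTE
A minimal autocast sketch, assuming PyTorch and a CUDA device. Under the hood, torch.autocast mixes 16-bit floats (float16 or bfloat16) with full 32-bit floats; the layer sizes are illustrative.

import torch

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = model(x)                                  # this matmul runs in cheap 16-bit math
print(y.dtype)                                    # torch.float16
print(model.weight.dtype)                         # torch.float32: the stored weights keep full precision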

00:11:21.019 --> 00:11:24.139
That is so smart. And the research also emphasizes

00:11:24.139 --> 00:11:26.899
the importance of gradient scaling and loft scaling

00:11:26.899 --> 00:11:29.419
when using these lower precisions. Yes, very

00:11:29.419 --> 00:11:31.320
important. I want to make sure we explain the

00:11:31.320 --> 00:11:33.620
mechanism there because it sounds vital to keeping

00:11:33.620 --> 00:11:36.279
the model alive. It really is the ultimate safety

00:11:36.279 --> 00:11:38.879
net. When you compress numbers down to lower

00:11:38.879 --> 00:11:41.740
precisions like 8-bit or 16-bit, you introduce

00:11:41.740 --> 00:11:45.139
a severe risk called underflow. Underflow, what

00:11:45.139 --> 00:11:47.429
is that? So during the training process, the

00:11:47.429 --> 00:11:49.529
model calculates gradients, which are essentially

00:11:49.529 --> 00:11:52.230
the tiny adjustments it needs to make to get

00:11:52.230 --> 00:11:56.259
smarter. Often these gradient numbers get infinitesimally

00:11:56.259 --> 00:12:00.820
small. Like 0.000... Exactly. And if you're using

00:12:00.820 --> 00:12:03.320
16-bit precision, a number might get so small

00:12:03.320 --> 00:12:05.440
that the system literally can't represent it

00:12:05.440 --> 00:12:08.139
anymore. It hits the floor of what the math can

00:12:08.139 --> 00:12:10.120
hold, and the computer just rounds it down to

00:12:10.120 --> 00:12:12.659
absolute zero by mistake. Oh, wow. And if your

00:12:12.659 --> 00:12:15.000
adjustments become zero, the model just stops

00:12:15.000 --> 00:12:17.600
learning entirely? It just freezes. Yeah. Wow.
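NOTE
A tiny demonstration of that underflow, using float16 as the low-precision format; the numbers are invented purely to show the effect.

import torch

g = torch.tensor(1e-8)                            # a tiny gradient: no problem for 32-bit float
print(g.half())                                   # tensor(0., dtype=torch.float16): it underflows and vanishes
print((g * 65536).half())                         # scaled up first, it stays representable in 16-bit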

00:12:17.799 --> 00:12:20.539
So gradient scaling swoops in to save the day.

00:12:21.360 --> 00:12:23.679
Before the computer tries to process those tiny

00:12:23.679 --> 00:12:26.740
numbers with limited precision math, the framework

00:12:26.740 --> 00:12:29.720
artificially multiplies them by a large scaling

00:12:29.720 --> 00:12:32.200
factor. Oh, it blows them up? Right, it temporarily

00:12:32.200 --> 00:12:34.179
blows them up into larger numbers so they stay

00:12:34.179 --> 00:12:37.019
safely visible on the radar of the lower precision

00:12:37.019 --> 00:12:39.720
math. The hardware crunches the numbers safely

00:12:39.720 --> 00:12:42.320
and then the system divides them back down before

00:12:42.320 --> 00:12:44.860
applying the update. That is a brilliant way

00:12:44.860 --> 00:12:47.659
to get the speed of cheap math while preventing

00:12:47.659 --> 00:12:50.799
those catastrophic rounding errors of underflow.
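NOTE
The standard PyTorch pattern that puts autocast and gradient scaling together; it assumes a CUDA device, and the model, data, and optimizer here are stand-ins for whatever is actually being trained.

import torch

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()              # owns the scaling factor and adapts it over time
for step in range(10):                            # stands in for looping over real batches
    data = torch.randn(32, 1024, device="cuda")
    target = torch.randn(32, 10, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(data), target)       # forward pass in cheap 16-bit math
    scaler.scale(loss).backward()                 # blow the loss (and so the gradients) up first
    scaler.step(optimizer)                        # gradients are unscaled back down, then applied
    scaler.update()                               # the scaling factor adjusts for the next step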

00:12:50.940 --> 00:12:53.539
It's incredibly clever engineering. OK, so we've

00:12:53.539 --> 00:12:55.600
thrown out the extra Jenga blocks with pruning,

00:12:55.720 --> 00:12:58.399
and we've shrunk the blocks we kept using quantization.

00:12:58.840 --> 00:13:00.960
We are making serious progress on our carry-on

00:13:00.960 --> 00:13:03.159
bag here. We definitely are. But it makes me

00:13:03.159 --> 00:13:06.960
wonder, is there a way to cheat the actual math

00:13:06.960 --> 00:13:09.860
itself? Like, we've deleted numbers, we've changed

00:13:09.860 --> 00:13:12.679
the size of numbers, but the computer still fundamentally

00:13:12.679 --> 00:13:14.700
has to multiply them together in grids, right?

00:13:14.779 --> 00:13:17.059
It does, yes. Is there a mathematical shortcut

00:13:17.059 --> 00:13:19.600
for the grids themselves? There is, actually.

00:13:19.879 --> 00:13:21.899
And it brings us to the third major structural

00:13:21.899 --> 00:13:25.240
technique we need to talk about. Low -rank factorization.

00:13:25.899 --> 00:13:27.480
Okay, let's break down the mechanics of this

00:13:27.480 --> 00:13:30.570
because it sounds super intimidating. The core

00:13:30.570 --> 00:13:33.250
idea is approximating massive weight matrices

00:13:33.250 --> 00:13:36.990
with smaller, low-rank matrices. Right. Let's

00:13:36.990 --> 00:13:39.330
look at the underlying math conceptually. Go

00:13:39.330 --> 00:13:41.870
for it. Inside a neural network, a layer of neurons

00:13:41.870 --> 00:13:44.610
is represented mathematically as a massive grid

00:13:44.610 --> 00:13:47.570
of numbers, a matrix. Let's call it matrix W.

00:13:47.710 --> 00:13:50.129
Matrix W. Got it. Assume it has a shape of m

00:13:50.129 --> 00:13:53.769
rows by n columns. If both m and n are, say,

00:13:53.889 --> 00:13:56.909
1,000, that single matrix contains 1 million

00:13:56.909 --> 00:14:00.000
individual parameters. Every single time the

00:14:00.000 --> 00:14:03.019
AI processes a piece of data, it has to multiply

00:14:03.019 --> 00:14:04.779
incoming numbers against that million-number

00:14:04.779 --> 00:14:07.500
grid. Which is incredibly computationally heavy.

00:14:08.059 --> 00:14:11.220
Insanely heavy. So how does low-rank factorization

00:14:11.220 --> 00:14:13.879
cheat that grid? It capitalizes on a concept

00:14:13.879 --> 00:14:16.600
called low intrinsic dimensionality. Okay, another

00:14:16.600 --> 00:14:20.360
big term. I know, I know. But basically... Even

00:14:20.360 --> 00:14:23.019
though that grid has a million numbers, a lot

00:14:23.019 --> 00:14:25.899
of the complex calculations the AI is doing are

00:14:25.899 --> 00:14:28.440
actually highly repetitive. They can be mapped

00:14:28.440 --> 00:14:31.100
to much simpler underlying patterns. Really?

00:14:31.240 --> 00:14:33.980
Yeah. So low-rank factorization proves that

00:14:33.980 --> 00:14:35.980
you don't actually need the million-number grid.

00:14:36.039 --> 00:14:40.220
You can approximate matrix W by multiplying two

00:14:40.220 --> 00:14:42.799
much smaller matrices together. Let's call them

00:14:42.799 --> 00:14:46.519
matrix U and matrix V. Okay. Here is my analogy

00:14:46.519 --> 00:14:48.600
for this. Let me know if I have this right. Let's

00:14:48.600 --> 00:14:51.340
hear it. Imagine I hand you an enormous spreadsheet

00:14:51.340 --> 00:14:53.799
with a million complex numbers on it and tell

00:14:53.799 --> 00:14:56.100
you to memorize it. Impossible, right? Totally

00:14:56.100 --> 00:14:58.059
impossible. But what if I told you that every

00:14:58.059 --> 00:15:00.899
single number on that giant spreadsheet was actually

00:15:00.899 --> 00:15:03.600
just the result of multiplying a specific number

00:15:03.600 --> 00:15:06.299
from a tiny list A and a specific number from

00:15:06.299 --> 00:15:08.860
a tiny list B? Okay. If you just memorize those

00:15:08.860 --> 00:15:11.159
two tiny lists, you essentially have the DNA

00:15:11.159 --> 00:15:14.039
of the giant spreadsheet. You can instantly recreate

00:15:14.039 --> 00:15:16.299
any number you need in your head. That captures

00:15:16.299 --> 00:15:19.740
the dynamic perfectly. Yes. In our matrix example,

00:15:19.860 --> 00:15:22.840
instead of a 1,000 by 1,000 grid, matrix U

00:15:22.840 --> 00:15:25.379
might be 1,000 rows by 10 columns, and

00:15:25.379 --> 00:15:27.559
matrix V might be 10 rows by 1,000 columns.

00:15:27.779 --> 00:15:29.919
That internal number, the 10, is what we call

00:15:29.919 --> 00:15:31.679
a low rank. Wait, let me just do the math on

00:15:31.679 --> 00:15:35.179
that. 1,000 times 10 is 10,000. So two of those

00:15:35.179 --> 00:15:38.340
matrices is just 20,000 parameters total. Exactly.

00:15:38.360 --> 00:15:40.460
You went from requiring the computer to store

00:15:40.460 --> 00:15:43.299
and process 1 million parameters down to just

00:15:43.299 --> 00:15:45.919
20,000 parameters. It is a staggering reduction

00:15:45.919 --> 00:15:49.759
in compute. Finding the exact numbers for those

00:15:49.759 --> 00:15:52.340
tiny matrices is a really complex undertaking.

00:15:53.399 --> 00:15:55.600
The core method used is singular value decomposition.
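NOTE
A minimal sketch of that factorization using torch.linalg.svd; the 1,000 x 1,000 grid and the rank of 10 mirror the numbers in the conversation, and the matrix is built to genuinely have low rank so the approximation comes out nearly exact.

import torch

W = torch.randn(1000, 10) @ torch.randn(10, 1000)    # a million-entry grid hiding a simple pattern
U, S, Vh = torch.linalg.svd(W, full_matrices=False)  # the underlying patterns, ranked by importance
r = 10
A = U[:, :r] * S[:r]                                  # 1,000 x 10
B = Vh[:r, :]                                         # 10 x 1,000
print(A.numel() + B.numel())                          # 20,000 numbers instead of 1,000,000
print(torch.dist(W, A @ B))                           # reconstruction error: essentially zero here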

00:15:55.879 --> 00:15:59.460
SVD. Okay, explain how SVD actually extracts

00:15:59.460 --> 00:16:01.679
that DNA. So SVD is essentially a mathematical

00:16:01.679 --> 00:16:04.340
tool that scans a massive matrix and extracts

00:16:04.340 --> 00:16:06.500
its most important underlying patterns while

00:16:06.500 --> 00:16:08.620
discarding the redundant noise. Kind of like

00:16:08.620 --> 00:16:11.000
pruning, but for the math itself. Exactly. If

00:16:11.000 --> 00:16:13.820
you have a massive million pixel image of a clear

00:16:13.820 --> 00:16:17.559
blue sky, SVD realizes it doesn't actually need

00:16:17.559 --> 00:16:20.440
to memorize a million unique pixels. It just

00:16:20.440 --> 00:16:23.340
needs to isolate the core pattern, blue, and the

00:16:23.340 --> 00:16:26.330
instruction to repeat it. It decomposes the original

00:16:26.330 --> 00:16:29.389
messy matrix into those clean, low-rank matrices

00:16:29.389 --> 00:16:31.460
we talked about. But man, this sounds like a

00:16:31.460 --> 00:16:33.779
massive optimization headache. I mean, figuring

00:16:33.779 --> 00:16:36.820
out those exact patterns, deciding what the optimal

00:16:36.820 --> 00:16:39.379
rank should be, the sources even say you have

00:16:39.379 --> 00:16:41.259
to account for how activation functions like

00:16:41.259 --> 00:16:44.360
ReLU affect the math. Oh, absolutely. The ReLU

00:16:44.360 --> 00:16:46.320
activation function, which essentially acts as

00:16:46.320 --> 00:16:48.360
a gatekeeper telling signals to pass through

00:16:48.360 --> 00:16:51.279
or stop, actually forces some outputs to zero,

00:16:51.379 --> 00:16:53.759
which implicitly changes the rank of the matrix

00:16:53.759 --> 00:16:55.639
after training. That sounds like a nightmare.

00:16:55.899 --> 00:16:58.039
It is. Engineers have to formulate this as a

00:16:58.039 --> 00:17:01.240
mixed discrete-continuous optimization problem

00:17:01.240 --> 00:17:04.119
to factor all that in. It is brutal, difficult

00:17:04.119 --> 00:17:06.160
mathematics. So is going through that brutal

00:17:06.160 --> 00:17:08.480
mathematics actually worth the trouble? If we

00:17:08.480 --> 00:17:10.960
connect this back to the bigger picture of why

00:17:10.960 --> 00:17:13.559
we compress models in the first place, it absolutely

00:17:13.559 --> 00:17:16.599
is. Yes. Doing the singular value decomposition

00:17:16.599 --> 00:17:19.079
and solving the optimization problem is incredibly

00:17:19.079 --> 00:17:21.480
hard work. But all of that hard work happens

00:17:21.480 --> 00:17:24.660
exactly once in a laboratory by the developers.

00:17:25.180 --> 00:17:28.059
The pain is entirely front-loaded. Oh, I see.

00:17:28.119 --> 00:17:29.980
Once they solve it and lock the tiny matrices

00:17:29.980 --> 00:17:33.559
in place, the resulting compressed model is exponentially

00:17:33.559 --> 00:17:36.640
faster at inference. The end user never feels

00:17:36.640 --> 00:17:39.680
the pain of the complex algebra. They only experience

00:17:39.680 --> 00:17:42.079
the blazing speed on their smartphone. OK, that

00:17:42.079 --> 00:17:44.559
makes total sense. So we have our three main

00:17:44.559 --> 00:17:47.740
structural tools, pruning to cut the dead connections,

00:17:48.579 --> 00:17:50.880
quantization to lower the precision and shrink

00:17:50.880 --> 00:17:54.099
the memory footprint, and low-rank factorization

00:17:54.099 --> 00:17:56.460
to mathematically cheat the matrix multiplication.

00:17:56.700 --> 00:17:58.940
That's the trifecta. Which brings us to a very

00:17:58.940 --> 00:18:01.440
practical question about strategy. Building these

00:18:01.440 --> 00:18:04.220
AI models requires millions of dollars in compute

00:18:04.220 --> 00:18:06.660
time. If the goal is to compress them, how do

00:18:06.660 --> 00:18:09.559
engineers sequence this? Do they build a giant,

00:18:09.819 --> 00:18:12.059
perfect model first and and then ruthlessly shrink

00:18:12.059 --> 00:18:14.599
it down? Or do they try to shrink it while they

00:18:14.599 --> 00:18:16.779
are building it? Research indicates that the

00:18:16.779 --> 00:18:18.920
answer is really both, depending on the specific

00:18:18.920 --> 00:18:21.799
architectural goals. Compression can be entirely

00:18:21.799 --> 00:18:24.839
decoupled from training. Like, you bake the cake

00:18:24.839 --> 00:18:27.980
and then you shrink it. Gotcha. A popular strategy

00:18:27.980 --> 00:18:31.460
here is literally called, train big, then compress.

00:18:31.960 --> 00:18:34.319
You train a huge model for a short amount of

00:18:34.319 --> 00:18:36.140
time, and this is important, less time than it

00:18:36.140 --> 00:18:38.099
would take for the model to fully converge or

00:18:38.099 --> 00:18:40.119
finish learning perfectly, and then you heavily

00:18:40.119 --> 00:18:42.869
compress it. And studies actually show that on

00:18:42.869 --> 00:18:46.049
the exact same computational budget, this train

00:18:46.049 --> 00:18:50.009
huge but briefly then shrink method yields a

00:18:50.009 --> 00:18:52.750
much more accurate model than if you had just

00:18:52.750 --> 00:18:54.730
spent that exact same budget training a small

00:18:54.730 --> 00:18:57.269
model until it was finished. It does, yes. Which,

00:18:57.430 --> 00:19:00.039
wait. I have to ask, if the ultimate goal of

00:19:00.039 --> 00:19:02.700
the engineers is to have a small efficient model

00:19:02.700 --> 00:19:05.299
that runs on a phone, why on earth wouldn't they

00:19:05.299 --> 00:19:08.160
just build a small model from scratch? Why go

00:19:08.160 --> 00:19:10.319
through the expensive effort of building a giant

00:19:10.319 --> 00:19:13.000
brain just to aggressively shrink it? It is honestly

00:19:13.000 --> 00:19:15.299
one of the most counterintuitive truths in modern

00:19:15.299 --> 00:19:17.559
machine learning. It really is. It turns out

00:19:17.559 --> 00:19:20.200
that neural networks seem to fundamentally require

00:19:20.200 --> 00:19:23.000
that massive sprawling initial parameter space

00:19:23.000 --> 00:19:25.640
to successfully map out the problem in the first

00:19:25.640 --> 00:19:28.269
place. Interesting. They need the vastness to

00:19:28.269 --> 00:19:31.470
find the subtle mathematical pathways and learn

00:19:31.470 --> 00:19:34.750
the deep patterns in the data. A small model

00:19:34.750 --> 00:19:37.190
simply lacks the surface area, that exploratory

00:19:37.190 --> 00:19:39.490
scaffolding we talked about earlier, to catch

00:19:39.490 --> 00:19:41.809
those complex insights during training. So it

00:19:41.809 --> 00:19:44.349
needs the giant brain to learn the information,

00:19:44.349 --> 00:19:47.269
but not to remember it. Precisely. Once the giant

00:19:47.269 --> 00:19:49.529
model has successfully learned the pattern, that

00:19:49.529 --> 00:19:51.769
refined knowledge can be extracted and stored

00:19:51.769 --> 00:19:53.910
in a fraction of the space. That is just mind

00:19:53.910 --> 00:19:56.099
blowing. Isn't it? OK, so what about the other

00:19:56.099 --> 00:19:58.819
strategy? Combining compression directly into

00:19:58.819 --> 00:20:02.160
the training process. Ah, so a landmark approach

00:20:02.160 --> 00:20:04.980
to this is a technique developed by Han and colleagues.

00:20:05.259 --> 00:20:07.880
It's literally called deep compression. Deep

00:20:07.880 --> 00:20:10.500
compression. Yeah. And it is essentially the

00:20:10.500 --> 00:20:13.579
master class in using every tool we've discussed,

00:20:14.240 --> 00:20:16.500
but forcing the model through a grueling three

00:20:16.500 --> 00:20:18.000
-step loop. OK, let's walk through the loop.

00:20:18.170 --> 00:20:21.690
Step one is our old friend pruning. Exactly.

00:20:22.250 --> 00:20:24.809
They take the train network and they prune all

00:20:24.809 --> 00:20:27.029
the connections that fall below a certain threshold.

00:20:28.029 --> 00:20:30.490
But then, and this is the crucial step that keeps

00:20:30.490 --> 00:20:33.470
the Jenga tower from falling over, they fine

00:20:33.470 --> 00:20:35.369
tune the network. They train it again? Yeah.

00:20:35.490 --> 00:20:38.170
They run training data through it again, allowing

00:20:38.170 --> 00:20:41.009
the AI to mathematically readjust its remaining

00:20:41.009 --> 00:20:43.450
weights to compensate for the missing pieces.
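NOTE
A compressed sketch of that prune-then-fine-tune loop, leaning on torch.nn.utils.prune; the thresholds, data, and loop counts are placeholders rather than the recipe from the paper.

import torch
from torch.nn.utils import prune

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
x, y = torch.randn(512, 256), torch.randint(0, 10, (512,))
for cycle in range(3):                                         # loop, adapt, loop, adapt
    for module in model:
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, "weight", amount=0.3)   # snip 30% of the surviving weights
    for step in range(50):                                     # fine-tune so the rest compensate
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()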

00:20:44.289 --> 00:20:47.329
Then they prune again, fine-tune again, looping

00:20:47.329 --> 00:20:49.670
until it's as sparse as physically possible without

00:20:49.670 --> 00:20:52.450
breaking. Loop, adapt, loop, adapt. Exactly. Then

00:20:52.450 --> 00:20:55.130
they move to step two. Step two is a brilliant

00:20:55.130 --> 00:20:58.130
form of quantization. They take the surviving

00:20:58.130 --> 00:21:00.789
weights and they cluster them together, enforcing

00:21:00.789 --> 00:21:02.779
weight sharing. Weight sharing, what does that

00:21:02.779 --> 00:21:05.019
look like? Imagine looking at a sprawling painting

00:21:05.019 --> 00:21:07.319
and realizing that instead of needing a hundred

00:21:07.319 --> 00:21:09.940
slightly different hyper-specific shades of

00:21:09.940 --> 00:21:12.480
blue, you can just use one central shade of blue

00:21:12.480 --> 00:21:15.640
for all of them. Oh wow. They group similar weights,

00:21:16.220 --> 00:21:18.720
force them to share a single quantized value,

00:21:18.980 --> 00:21:21.240
and then again they fine-tune the network to

00:21:21.240 --> 00:21:23.680
adapt to this shared reality. Which leaves us

00:21:23.680 --> 00:21:26.259
with a highly pruned network where the remaining

00:21:26.259 --> 00:21:28.640
weights are heavily clustered into shared values.
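NOTE
A rough sketch of weight sharing as a small k-means clustering over one layer's surviving weights; 16 shared values is an arbitrary choice here, whereas the paper tunes the number per layer.

import torch

w = torch.randn(10000)                            # the surviving weights of one layer
k = 16                                            # every weight must adopt one of 16 shared values
centers = torch.linspace(w.min().item(), w.max().item(), k)
for _ in range(20):                               # a few rounds of plain k-means
    assign = (w[:, None] - centers[None, :]).abs().argmin(dim=1)
    for j in range(k):
        if (assign == j).any():
            centers[j] = w[assign == j].mean()
w_shared = centers[assign]                        # each weight snaps to its cluster's shared value
print(w_shared.unique().numel())                  # at most 16 distinct numbers left to store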

00:21:28.740 --> 00:21:31.460
Yes, which sets the stage perfectly for step

00:21:31.460 --> 00:21:35.539
three, the grand finale, Huffman coding. You

00:21:35.539 --> 00:21:38.720
got it. Huffman coding is a classic lossless

00:21:38.720 --> 00:21:41.440
compression algorithm. Because step two created

00:21:41.440 --> 00:21:44.359
all these shared clustered weights, the model's

00:21:44.359 --> 00:21:47.920
data now has highly predictable repetitive patterns.

00:21:48.339 --> 00:21:50.640
Right. And explain the mechanism of Huffman coding

00:21:50.640 --> 00:21:52.779
here. How does it compress those patterns? The

00:21:52.779 --> 00:21:55.579
best analogy is Morse code. In Morse code, the

00:21:55.579 --> 00:21:57.759
most commonly used letter in the English language,

00:21:57.859 --> 00:22:01.220
E, is assigned the shortest possible code, a

00:22:01.220 --> 00:22:04.839
single dot. A rare letter, like Q, gets a long

00:22:04.839 --> 00:22:07.700
complex code, dash dash dot dash. Huffman coding

00:22:07.700 --> 00:22:10.319
does the exact same thing for digital data. It

00:22:10.319 --> 00:22:12.660
assigns very short digital codes to the most

00:22:12.660 --> 00:22:15.279
frequently occurring clustered weights, and longer

00:22:15.279 --> 00:22:18.180
codes to the rare ones. It compresses the digital

00:22:18.180 --> 00:22:20.380
footprint of the repetitive data without losing

00:22:20.380 --> 00:22:23.099
a single drop of information. That is genius.
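NOTE
A minimal Huffman coder over a toy set of clustered weights, using only Python's standard library; the values and counts are invented purely to show short codes going to the frequent value.

import heapq
from collections import Counter

def huffman_code(symbols):
    # Classic construction: repeatedly merge the two rarest nodes into one.
    heap = [[freq, [sym, ""]] for sym, freq in Counter(symbols).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]               # everything under the rarer branch gains a bit
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(heapq.heappop(heap)[1:])

weights = [0.5] * 90 + [-0.25] * 8 + [1.0] * 2    # one shared value dominates after clustering
print(huffman_code(weights))                      # the common 0.5 gets a one-bit code, the rare values get longer ones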

00:22:23.279 --> 00:22:25.400
And the results of chaining these three steps

00:22:25.400 --> 00:22:28.380
together are staggering. The data cites that

00:22:28.380 --> 00:22:31.140
the SqueezeNet paper applied this deep compression

00:22:31.140 --> 00:22:34.200
loop to the famous AlexNet model. Yes, and the

00:22:34.200 --> 00:22:36.160
numbers are incredible. They achieved a compression

00:22:36.160 --> 00:22:40.029
ratio of 35x? Yeah. They shrank a massive foundational

00:22:40.029 --> 00:22:43.950
model to roughly 3% of its original size. 3%.

00:22:43.950 --> 00:22:46.609
And a ratio of about 10x on the SqueezeNets themselves.

00:22:47.269 --> 00:22:50.750
Taking these massive computationally heavy beasts

00:22:50.750 --> 00:22:53.589
and turning them into sleek, agile systems that

00:22:53.589 --> 00:22:56.250
can live right in your pocket. Exactly. And doing

00:22:56.250 --> 00:22:58.990
so while maintaining the benchmark accuracy that

00:22:58.990 --> 00:23:01.009
made those models famous in the first place.

00:23:01.230 --> 00:23:03.569
It truly is the invisible infrastructure of the

00:23:03.569 --> 00:23:07.160
modern AI boom. Without it, the edge device revolution

00:23:07.160 --> 00:23:09.980
simply wouldn't exist. It really is. So we've

00:23:09.980 --> 00:23:12.680
journeyed from staring at an impossibly large

00:23:12.680 --> 00:23:15.539
steamer trunk to sliding a perfectly packed carry

00:23:15.539 --> 00:23:18.039
-on under our seat. We did it. We've explored

00:23:18.039 --> 00:23:20.339
how engineers pull this off by pruning the dead

00:23:20.339 --> 00:23:22.740
weight, by quantizing the precision to shrink

00:23:22.740 --> 00:23:26.039
the math, and by using clever matrix factorization

00:23:26.039 --> 00:23:28.420
to give the computer a mathematical shortcut,

00:23:28.799 --> 00:23:31.720
often chaining these techniques together in relentless

00:23:31.720 --> 00:23:34.799
loops of adaptation. So the next time you are

00:23:34.799 --> 00:23:36.799
out, and your smartphone instantly recognizes

00:23:36.799 --> 00:23:39.859
a friend's face in a photo, or your keyboard

00:23:39.859 --> 00:23:42.900
quickly autocompletes a highly complex sentence.

00:23:42.980 --> 00:23:45.680
And it does all of this instantly. Right. Without

00:23:45.680 --> 00:23:48.180
needing a Wi -Fi connection to a supercomputer,

00:23:48.680 --> 00:23:50.559
you'll know exactly how it's happening. You are

00:23:50.559 --> 00:23:53.259
witnessing model compression in action. It's

00:23:53.259 --> 00:23:56.099
just an elegant, deeply mathematical solution

00:23:56.099 --> 00:23:59.039
to a massive physical limitation. It really is.

00:23:59.440 --> 00:24:01.299
But before we wrap up, I want to leave you with

00:24:01.299 --> 00:24:04.190
a final thought to mull over. We talk about how

00:24:04.190 --> 00:24:06.869
an AI needs a massive brain to map out a problem,

00:24:06.970 --> 00:24:10.049
but not to retain the answer. If a foundational

00:24:10.049 --> 00:24:12.849
AI model like AlexNet can be ruthlessly pruned,

00:24:13.069 --> 00:24:16.009
quantized, and shrunk down to 1/35th of its original

00:24:16.009 --> 00:24:18.829
size without losing its accuracy or intelligence,

00:24:19.630 --> 00:24:22.470
it really begs a fascinating philosophical question.

00:24:22.779 --> 00:24:25.559
Is the vast majority of an AI's massive brain

00:24:25.559 --> 00:24:27.880
just empty space and redundant noise to begin

00:24:27.880 --> 00:24:31.059
with? And if so, as we pack our own mental trunks

00:24:31.059 --> 00:24:33.059
with experiences and knowledge, how much of human

00:24:33.059 --> 00:24:34.740
learning operates the exact same way?
