WEBVTT

00:00:00.000 --> 00:00:02.600
To make the world's most powerful artificial

00:00:02.600 --> 00:00:05.219
intelligence models faster, computer scientists

00:00:05.219 --> 00:00:09.000
are purposefully giving them brain damage.

00:00:09.080 --> 00:00:11.240
Yeah, which sounds completely wild. It sounds

00:00:11.240 --> 00:00:13.679
totally counterintuitive, but they are intentionally

00:00:13.679 --> 00:00:15.919
ripping out millions of connections inside these

00:00:15.919 --> 00:00:19.120
neural networks just to see if the AI can, you

00:00:19.120 --> 00:00:21.920
know, survive the trauma. Exactly. And thrive,

00:00:21.940 --> 00:00:24.850
actually. Right. Well, welcome to the Deep Dive.

00:00:24.989 --> 00:00:27.789
Today we are pulling from a Wikipedia article,

00:00:28.410 --> 00:00:30.989
and its underlying source material basically

00:00:30.989 --> 00:00:33.229
breaking down a concept in artificial neural

00:00:33.229 --> 00:00:35.950
networks called pruning. And it's such an important

00:00:35.950 --> 00:00:38.530
topic right now. It really is. The mission for

00:00:38.530 --> 00:00:40.670
you listening today is to explore the reality

00:00:40.670 --> 00:00:43.189
of modern machine learning, because making an

00:00:43.189 --> 00:00:45.729
artificial intelligence physically smaller is

00:00:45.729 --> 00:00:48.270
often the absolute secret to making it smarter,

00:00:48.630 --> 00:00:51.850
faster, and actually usable in your daily life.

00:00:52.149 --> 00:00:54.049
Yeah, usable being the key word there. Totally.

00:00:54.350 --> 00:00:56.649
And to figure out how to do this, engineers aren't

00:00:56.649 --> 00:00:59.130
just looking at math. They are borrowing blueprints

00:00:59.130 --> 00:01:01.789
directly from human biology. Which is, I mean,

00:01:01.789 --> 00:01:04.280
it's a complete paradigm shift from how

00:01:04.280 --> 00:01:07.400
we usually view computational power. Right. Because

00:01:07.400 --> 00:01:10.180
we always think bigger is better. Exactly. We

00:01:10.180 --> 00:01:12.459
have this pervasive assumption that scaling up

00:01:12.459 --> 00:01:15.840
is the only path to better performance. You know,

00:01:16.060 --> 00:01:19.159
more parameters, bigger server farms, massive

00:01:19.159 --> 00:01:21.579
energy grids. Just throw more hardware at it.

00:01:21.680 --> 00:01:24.659
Right. But pruning flips that entirely on its

00:01:24.659 --> 00:01:28.060
head. It formally defines the practice of removing

00:01:28.060 --> 00:01:30.680
parameters from an existing artificial neural

00:01:30.680 --> 00:01:34.230
network, specifically to reduce its overall size.

00:01:34.409 --> 00:01:37.090
Okay, let's unpack this. Because to really grasp

00:01:37.090 --> 00:01:39.650
what a monumental task that is, I want you to

00:01:39.650 --> 00:01:41.730
imagine you are writing a massive, sprawling,

00:01:41.750 --> 00:01:44.370
like, thousand-page fantasy epic. Ooh, I like

00:01:44.370 --> 00:01:46.250
this. Right. So it's got hundreds of characters,

00:01:46.590 --> 00:01:49.030
thousands of subplots, incredible detail, and

00:01:49.030 --> 00:01:50.950
you finally finish it, you hand it to an editor,

00:01:51.030 --> 00:01:52.549
and the editor looks at you and says, well, this

00:01:52.549 --> 00:01:56.269
is brilliant. Now publish it as a 200-page novella.

00:01:56.390 --> 00:01:58.329
Good luck with that. Exactly. And they say it

00:01:58.329 --> 00:02:00.629
has to keep the exact same plot, the exact...

00:02:00.560 --> 00:02:02.980
same emotional impact and tell the exact same

00:02:02.980 --> 00:02:05.519
story. Which most writers would say is impossible.

00:02:05.920 --> 00:02:07.680
You'd assume the entire narrative structure would

00:02:07.680 --> 00:02:10.800
just collapse. But pruning argues that cutting

00:02:10.800 --> 00:02:13.199
all of those extra pages actually makes the story

00:02:13.199 --> 00:02:16.379
sharper. Removing the fluff makes it faster to

00:02:16.379 --> 00:02:19.099
read, easier to process, and, you know, objectively

00:02:19.099 --> 00:02:21.099
a better book. What's fascinating here is where

00:02:21.099 --> 00:02:24.000
that editorial ruthlessness actually originates.

00:02:24.460 --> 00:02:27.099
Because the source material highlights a direct

00:02:27.280 --> 00:02:30.680
biological parallel that engineers use as the

00:02:30.680 --> 00:02:33.520
foundational model. The human biology part. Yes.

00:02:33.860 --> 00:02:36.539
This computational process of shrinking a massive

00:02:36.539 --> 00:02:40.199
AI is mapped directly to the biological process

00:02:40.199 --> 00:02:43.000
of synaptic pruning, which happens naturally

00:02:43.000 --> 00:02:45.340
in mammalian brains during development. Wow.

00:02:45.639 --> 00:02:48.699
Yeah. And the article cites this pivotal 1998

00:02:48.699 --> 00:02:52.139
research paper by Gal Chechik, Isaac Meilijson,

00:02:52.219 --> 00:02:54.759
and Eytan Ruppin. It's titled Synaptic Pruning

00:02:54.759 --> 00:02:57.340
in Development: A Computational Account. So it's

00:02:57.340 --> 00:03:00.060
not just a metaphor? No, not at all. They didn't

00:03:00.060 --> 00:03:02.439
just notice a loose metaphor. They mathematically

00:03:02.439 --> 00:03:04.800
modeled how a mammalian brain develops and applied

00:03:04.800 --> 00:03:07.219
that exact framework to computer science. I have

00:03:07.219 --> 00:03:09.020
to push back on that slightly though. Okay, sure.

00:03:09.199 --> 00:03:12.120
Because the biological comparison sounds great

00:03:12.120 --> 00:03:15.580
in a textbook, but does it actually hold up mathematically?

00:03:15.780 --> 00:03:18.599
I mean, we usually think of learning, especially

00:03:18.599 --> 00:03:21.639
if you look at a toddler developing or an AI

00:03:21.639 --> 00:03:24.259
scraping the internet, we think of it as acquiring

00:03:24.259 --> 00:03:26.319
more. Right, gathering more data. Yeah, you build

00:03:26.319 --> 00:03:28.979
more vocabulary, you establish more neural connections.

00:03:29.439 --> 00:03:31.520
The general assumption is that a smarter brain

00:03:31.520 --> 00:03:34.580
is just a denser brain. Well, that assumption

00:03:34.580 --> 00:03:36.759
ignores the physical constraints of biology.

00:03:37.389 --> 00:03:40.530
Which, interestingly enough, maps perfectly to

00:03:40.530 --> 00:03:42.689
the physical constraints of hardware. How do

00:03:42.689 --> 00:03:45.349
you mean? Well, a mammalian brain requires a

00:03:45.349 --> 00:03:48.169
massive amount of metabolic energy just to maintain

00:03:48.169 --> 00:03:51.229
its neural pathways. Oh, right, calories. Exactly,

00:03:51.530 --> 00:03:54.129
calories. If a toddler's brain kept every single

00:03:54.129 --> 00:03:56.569
synaptic connection it ever formed during its,

00:03:56.569 --> 00:03:59.349
you know, its explosive growth phase, the organism

00:03:59.349 --> 00:04:01.610
would burn an unsustainable amount of energy

00:04:01.610 --> 00:04:03.909
just sitting completely still. It would just

00:04:03.909 --> 00:04:06.699
exhaust itself. Exactly. And it would also be

00:04:06.699 --> 00:04:09.280
overwhelmed with sensory noise. So as the brain

00:04:09.280 --> 00:04:12.060
develops, it evaluates which pathways are actually

00:04:12.060 --> 00:04:15.960
firing efficiently to process information. And

00:04:15.960 --> 00:04:18.519
it aggressively lets the unused or redundant

00:04:18.519 --> 00:04:21.300
ones just wither away. So true knowledge isn't

00:04:21.300 --> 00:04:24.699
about hoarding data. It's about optimizing the

00:04:24.699 --> 00:04:27.519
pathways to access that data. Yes. It's like

00:04:27.519 --> 00:04:30.019
it's actively choosing what to forget so you

00:04:30.019 --> 00:04:33.329
can recall what actually matters faster. Precisely.

00:04:33.410 --> 00:04:36.129
That is the underlying philosophy of that 1998

00:04:36.129 --> 00:04:39.370
paper. The mammalian brain has to learn how to

00:04:39.370 --> 00:04:41.689
be efficient with its caloric resources, just

00:04:41.689 --> 00:04:43.649
as a neural network has to learn how to be efficient

00:04:43.649 --> 00:04:45.649
with its computational resources. OK, here's

00:04:45.649 --> 00:04:47.610
where it gets really interesting, because the

00:04:47.610 --> 00:04:49.490
big question for anyone listening who works with

00:04:49.490 --> 00:04:52.750
code or hardware is how we actually translate

00:04:52.750 --> 00:04:55.550
that biological withering into a digital algorithm.

00:04:55.730 --> 00:04:57.610
Right. How do you actually write the code for

00:04:57.610 --> 00:05:00.930
that? Exactly. How do we perform this digital

00:05:00.930 --> 00:05:04.930
surgery? Our sources point to two main targets

00:05:04.930 --> 00:05:08.110
when we try to shrink these networks. Node pruning,

00:05:08.370 --> 00:05:10.470
which targets the artificial neurons themselves,

00:05:11.310 --> 00:05:13.930
and edge pruning, which targets the weights or

00:05:13.930 --> 00:05:15.889
the connections between the neurons. Yeah, node

00:05:15.889 --> 00:05:18.370
versus edge. Right. So let's ground this in a

00:05:18.370 --> 00:05:20.370
visual. Let's say we're dealing with a tree.

00:05:20.649 --> 00:05:23.430
Are we talking about taking a chainsaw and cutting

00:05:23.430 --> 00:05:25.730
off whole branches, which would be the neurons,

00:05:26.050 --> 00:05:28.490
or are we just plucking individual unnecessary

00:05:28.490 --> 00:05:30.959
leaves, which would be the weights? That's a

00:05:30.959 --> 00:05:32.680
great way to visualize it. Let's look at the

00:05:32.680 --> 00:05:36.240
chainsaw approach first. Node pruning, which

00:05:36.240 --> 00:05:38.279
is technically called structured pruning. OK,

00:05:38.540 --> 00:05:40.560
the chainsaw. Right. If you're going to remove

00:05:40.560 --> 00:05:42.759
a whole branch, a whole node, you don't just

00:05:42.759 --> 00:05:45.500
randomly hack away at the tree. You have to

00:05:45.500 --> 00:05:48.220
build an algorithm to find the weakest link.

00:05:48.879 --> 00:05:51.129
So how do they measure that? Well, you'd start

00:05:51.129 --> 00:05:53.670
by evaluating the importance of every single

00:05:53.670 --> 00:05:56.050
neuron across the network based on a specific

00:05:56.050 --> 00:05:58.850
metric. Then you rank them all from most critical

00:05:58.850 --> 00:06:01.670
to least critical. You isolate the absolute lowest

00:06:01.670 --> 00:06:03.689
performing neuron and you rip it out entirely.
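
The rank-and-remove loop just described can be sketched in a few lines of Python. This is a toy illustration, not a production method: the `importance` scores and the `evaluate` callback are placeholders for whatever scoring metric and validation loop a real system would plug in.

```python
import numpy as np

def prune_nodes(W, importance, evaluate, target_accuracy):
    """Iteratively remove whole neurons (columns of W), weakest first,
    until the accuracy check would fall below the target.

    `importance` and `evaluate` stand in for whatever metric and
    validation loop a real system would use."""
    alive = list(range(W.shape[1]))
    for idx in np.argsort(importance):        # least important first
        candidate = [i for i in alive if i != idx]
        if candidate and evaluate(W[:, candidate]) >= target_accuracy:
            alive = candidate                 # the cut survived the check
        else:
            break                             # stopping condition reached
    return W[:, alive], alive
```

Each pass removes the lowest-ranked surviving neuron, re-checks performance, and stops as soon as the network dips below the target.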

00:06:03.709 --> 00:06:06.230
Just gone. Gone. Then you check your network's

00:06:06.230 --> 00:06:09.110
overall performance. Like, is it still running

00:06:09.110 --> 00:06:12.089
above your target accuracy? And if it is? If

00:06:12.089 --> 00:06:14.870
it is, you loop back and you chainsaw the next

00:06:14.870 --> 00:06:17.490
weakest branch. And you keep doing that until

00:06:17.490 --> 00:06:20.180
you hit a user-defined termination condition. Oh wait,

00:06:20.220 --> 00:06:22.620
taking a chainsaw to a whole branch destroys

00:06:22.620 --> 00:06:24.899
the underlying architecture. If you remove an

00:06:24.899 --> 00:06:27.920
entire neuron, every single connection, you know,

00:06:27.980 --> 00:06:30.240
the leaves that were on that branch, they're

00:06:30.240 --> 00:06:32.379
destroyed by default too. Right, it's a massive

00:06:32.379 --> 00:06:34.819
structural change. That seems incredibly disruptive

00:06:34.819 --> 00:06:37.220
to a model that has already spent, what, thousands

00:06:37.220 --> 00:06:39.720
of hours training its pathways. Is that really

00:06:39.720 --> 00:06:42.560
how modern deep learning engineers maintain these

00:06:42.560 --> 00:06:45.079
systems? Honestly, they rarely do it that way

00:06:45.079 --> 00:06:47.620
anymore for that exact reason. Oh, really? Yeah.

00:06:47.759 --> 00:06:49.980
Structured pruning is just too blunt of an instrument.

00:06:50.459 --> 00:06:53.019
You risk collapsing critical pathways that happen

00:06:53.019 --> 00:06:56.259
to route through an otherwise quiet node. Ah,

00:06:56.420 --> 00:06:58.879
so you might accidentally cut a vital artery.

00:06:59.379 --> 00:07:02.360
Exactly. Most modern AI development actually

00:07:02.360 --> 00:07:05.560
focuses on unstructured pruning, which is your

00:07:05.560 --> 00:07:07.959
leaf plucking analogy. They leave all the branches,

00:07:08.120 --> 00:07:11.019
the nodes, completely intact, but they target

00:07:11.019 --> 00:07:13.420
the individual weights, the edges connecting

00:07:13.420 --> 00:07:16.160
those neurons. They are finding the most insignificant

00:07:16.160 --> 00:07:18.420
connections and shutting them down. The source

00:07:18.420 --> 00:07:20.459
material mentions that to do this, engineers

00:07:20.459 --> 00:07:23.339
simply set the value of those weights to zero.
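
That zeroing step can be sketched with magnitude-based unstructured pruning in NumPy; the weight values and the 50% sparsity target here are invented for illustration.

```python
import numpy as np

# Toy layer weights; in practice this is one tensor of a trained model.
weights = np.array([[0.80, -0.03, 0.55],
                    [0.01,  0.90, -0.02]])

# Unstructured pruning: zero out the entries with the smallest magnitude.
sparsity = 0.5                                   # prune 50% of the entries
k = int(weights.size * sparsity)
threshold = np.sort(np.abs(weights), axis=None)[k - 1]
mask = np.abs(weights) > threshold
pruned = weights * mask
```

Note the tensor keeps its original shape: the zeros are still stored and still multiplied, which is exactly the hardware objection raised next.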

00:07:23.560 --> 00:07:25.420
Just change the number to a zero, yep. But wait,

00:07:25.480 --> 00:07:28.019
let's look at the actual hardware reality here.

00:07:28.379 --> 00:07:31.259
Because if I have a parameter with a weight of,

00:07:31.259 --> 00:07:34.980
say, 0.8, and I change that value to 0, I haven't

00:07:34.980 --> 00:07:36.980
actually removed anything from the computer's

00:07:36.980 --> 00:07:40.079
memory. Right. The 0 still takes up space. Exactly.

00:07:40.360 --> 00:07:42.339
The data is still there, it's just a different

00:07:42.339 --> 00:07:45.139
number. The GPU still has to look at the zero,

00:07:45.680 --> 00:07:47.620
multiply it by whatever input is coming through,

00:07:47.899 --> 00:07:50.339
get a result of zero, and then pass it along.

00:07:50.379 --> 00:07:52.939
It's still doing the math. Right. So how does

00:07:52.939 --> 00:07:55.959
changing a number to zero save any energy or

00:07:55.959 --> 00:07:57.860
compute time for the listener running this on

00:07:57.860 --> 00:08:00.300
their laptop? That is the crucial engineering

00:08:00.300 --> 00:08:02.720
bottleneck that separates theory from practice,

00:08:03.000 --> 00:08:05.759
right there. Because you are completely right.

00:08:06.120 --> 00:08:09.100
If you just swap numbers for zeros, in a standard

00:08:09.100 --> 00:08:11.860
dense matrix. You save absolutely zero compute

00:08:11.860 --> 00:08:13.819
time. Because it's still doing the operation.

00:08:14.180 --> 00:08:16.819
Right. Multiplying by zero takes the exact same

00:08:16.819 --> 00:08:19.300
amount of hardware cycles as multiplying by seven.

00:08:20.540 --> 00:08:22.540
The secret mechanism that makes unstructured

00:08:22.540 --> 00:08:25.519
pruning actually work relies on how modern software

00:08:25.519 --> 00:08:27.819
and specialized hardware handle what are called

00:08:27.819 --> 00:08:32.350
sparse matrices. Sparse matrices, as opposed to

00:08:32.350 --> 00:08:34.629
dense matrices where every coordinate has a value.

00:08:34.950 --> 00:08:37.129
Correct. When you aggressively prune a network,

00:08:37.470 --> 00:08:39.649
you create a matrix that is overwhelmingly filled

00:08:39.649 --> 00:08:42.710
with zeros. So instead of storing a massive grid

00:08:42.710 --> 00:08:44.649
and forcing the processor to calculate all those

00:08:44.649 --> 00:08:47.289
meaningless zero multiplications, engineers use

00:08:47.289 --> 00:08:50.210
compressed formats, like compressed sparse row.
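
A minimal hand-rolled version of the compressed sparse row idea makes the "coordinates plus pointer system" concrete; real libraries implement the same layout far more efficiently, and the matrix below is invented for illustration.

```python
import numpy as np

def to_csr(dense):
    """Compressed Sparse Row: store only the non-zero values, their column
    coordinates, and a row-pointer array marking where each row begins."""
    values, cols, indptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                cols.append(j)
        indptr.append(len(values))
    return values, cols, indptr

def csr_matvec(values, cols, indptr, x):
    """Matrix-vector product that skips every zero entry entirely."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += values[k] * x[cols[k]]
    return y

# A heavily pruned weight matrix: 7 of its 9 entries are zero.
dense = np.array([[0.0, 0.0, 0.9],
                  [0.0, 0.0, 0.0],
                  [0.4, 0.0, 0.0]])
values, cols, indptr = to_csr(dense)
y = csr_matvec(values, cols, indptr, [1.0, 2.0, 3.0])
```

The multiply loop only ever touches the two stored values, which is where the skipped work and the speed come from.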

00:08:50.350 --> 00:08:52.309
Okay, what does that do? The software essentially

00:08:52.309 --> 00:08:54.409
maps out the coordinates of only the non-zero

00:08:54.409 --> 00:08:56.350
weights, and creates a pointer system. Oh,

00:08:56.409 --> 00:08:58.370
I see. Yeah, so it tells the processor, hey,

00:08:58.529 --> 00:09:00.769
skip all of this empty space entirely, go directly

00:09:00.769 --> 00:09:03.129
to coordinate x, y, and multiply that specific

00:09:03.129 --> 00:09:05.649
number. Ah, so you are literally telling the

00:09:05.649 --> 00:09:08.870
hardware to ignore vast swaths of the architecture.

00:09:09.389 --> 00:09:12.009
You aren't just changing the math. You are physically

00:09:12.009 --> 00:09:14.070
shortening the distance the processor has to

00:09:14.070 --> 00:09:17.110
travel to get an answer. Exactly. And that is

00:09:17.110 --> 00:09:19.169
where the massive speed-ups and energy savings

00:09:19.169 --> 00:09:22.639
actually come from. Modern AI chips, like the

00:09:22.639 --> 00:09:26.440
latest GPUs and TPUs, are physically architected

00:09:26.440 --> 00:09:29.080
to accelerate these sparse matrix operations

00:09:29.080 --> 00:09:31.690
natively. So the hardware is literally built

00:09:31.690 --> 00:09:34.870
for the pruned software? Yes. They detect the

00:09:34.870 --> 00:09:37.049
sparsity and route the compute power only to

00:09:37.049 --> 00:09:39.629
the active pathways. That brings up a fascinating

00:09:39.629 --> 00:09:42.049
architectural choice, though. When we decide

00:09:42.049 --> 00:09:45.129
to pluck these leaves, when we zero out these

00:09:45.129 --> 00:09:47.509
weights, what is the scope of that operation?

00:09:47.629 --> 00:09:50.070
The scope. Do we look at the entire tree at once

00:09:50.070 --> 00:09:52.070
to find the absolute weakest leaves anywhere

00:09:52.070 --> 00:09:54.350
on the plant, or do we go branch by branch, layer

00:09:54.350 --> 00:09:56.309
by layer, finding the weakest leaves in each

00:09:56.309 --> 00:09:59.070
specific section? Both strategies are used, actually.

00:09:59.120 --> 00:10:01.980
And they are formalized as global versus local

00:10:01.980 --> 00:10:05.179
pruning. So global pruning looks at the entire

00:10:05.179 --> 00:10:08.740
forest. You compare weights from every single

00:10:08.740 --> 00:10:11.240
layer across the entire network simultaneously.

00:10:11.820 --> 00:10:14.019
You find the globally weakest connections and

00:10:14.019 --> 00:10:16.460
you zero them out completely, regardless of where

00:10:16.460 --> 00:10:18.700
they live. But that implies a massive risk, doesn't

00:10:18.700 --> 00:10:21.200
it? It does. Because if you use global pruning,

00:10:21.360 --> 00:10:24.279
you could theoretically have one specific layer

00:10:24.279 --> 00:10:27.480
in your network that just naturally uses smaller

00:10:27.480 --> 00:10:29.940
weights across the board. If you do a global

00:10:29.940 --> 00:10:32.879
sweep, you might accidentally wipe out an entire

00:10:32.879 --> 00:10:35.000
layer completely. And just sever the network.

00:10:35.080 --> 00:10:37.019
Yeah, effectively cutting the network in half.

00:10:37.139 --> 00:10:40.100
Which is the exact vulnerability of global pruning.

00:10:40.179 --> 00:10:42.559
It can cause what's called layer collapse. That's

00:10:42.559 --> 00:10:45.580
why local pruning is often preferred for really

00:10:45.580 --> 00:10:48.139
deep, complex architectures. Going branch by

00:10:48.139 --> 00:10:50.820
branch. Right. Local pruning forces the algorithm

00:10:50.820 --> 00:10:53.759
to compare weights only within their specific

00:10:53.759 --> 00:10:57.440
layer. Layer 1 will have its own bottom 10%

00:10:57.440 --> 00:10:59.720
of weights pruned. Layer 2 will have its own

00:10:59.720 --> 00:11:02.539
bottom 10% pruned, completely independent of

00:11:02.539 --> 00:11:05.080
layer 1. It guarantees that every branch gets

00:11:05.080 --> 00:11:08.360
simplified, while ensuring no single branch gets

00:11:08.360 --> 00:11:10.299
completely cut off from the rest of the tree.
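
The two strategies, and the layer-collapse failure mode, can be sketched side by side; the layer values and the 50% pruning fraction are invented so that the collapse is easy to see.

```python
import numpy as np

def prune_global(layers, fraction):
    """One magnitude threshold computed across ALL layers at once."""
    all_w = np.concatenate([np.abs(w).ravel() for w in layers])
    cut = np.quantile(all_w, fraction)
    return [w * (np.abs(w) > cut) for w in layers]

def prune_local(layers, fraction):
    """A separate threshold per layer, so no layer can be emptied out."""
    return [w * (np.abs(w) > np.quantile(np.abs(w), fraction))
            for w in layers]

# One layer with ordinary weights, one whose weights are uniformly tiny.
layers = [np.array([1.0, 2.0, 3.0, 4.0]),
          np.array([0.01, 0.02, 0.03, 0.04])]

global_pruned = prune_global(layers, 0.5)   # wipes out the tiny layer
local_pruned = prune_local(layers, 0.5)     # trims half of each layer
```

With the global threshold, every weight in the tiny layer falls below the cut, severing the network; the local version prunes the same overall fraction without ever emptying a layer.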

00:11:10.480 --> 00:11:12.759
That makes perfect sense structurally. It's like

00:11:12.759 --> 00:11:15.100
corporate downsizing in a way. Oh, do tell.

00:11:15.259 --> 00:11:17.919
Well, it's like firing the least productive employees

00:11:17.919 --> 00:11:20.440
to save the company money. But the hard part

00:11:20.440 --> 00:11:23.600
is, how do you objectively measure an artificial

00:11:23.600 --> 00:11:26.100
neuron's productivity? Right. What's the metric?

00:11:26.399 --> 00:11:28.320
Exactly. Our sources point out that the most

00:11:28.320 --> 00:11:31.679
basic metric in pruning is just weight magnitude.

00:11:31.740 --> 00:11:34.100
Yes. You just look at the absolute value of the

00:11:34.100 --> 00:11:37.529
weight. And if the number is already incredibly

00:11:37.529 --> 00:11:40.269
close to zero, the logic assumes it's probably

00:11:40.269 --> 00:11:42.789
not having a massive impact on the final output.

00:11:43.049 --> 00:11:45.049
That is definitely the easiest and cheapest way

00:11:45.049 --> 00:11:47.429
to prune. You sort the weights by size and you

00:11:47.429 --> 00:11:50.289
chop the smallest ones. But relying solely on

00:11:50.289 --> 00:11:53.370
magnitude assumes that small automatically equals

00:11:53.370 --> 00:11:56.870
insignificant. Right. And in highly complex nonlinear

00:11:56.870 --> 00:12:00.429
networks, a tiny weight might be serving as a

00:12:00.429 --> 00:12:03.000
crucial bottleneck. or like a subtle modifier

00:12:03.000 --> 00:12:05.379
that the network desperately needs for edge cases.

00:12:05.679 --> 00:12:07.500
Right. If we go back to the corporate analogy,

00:12:08.080 --> 00:12:10.539
an employee might only process 10 emails a day,

00:12:10.919 --> 00:12:13.019
but if those 10 emails are paying the company's

00:12:13.019 --> 00:12:15.639
legal bills, firing them shuts down the whole

00:12:15.639 --> 00:12:18.460
business. Exactly. You need a metric that measures

00:12:18.460 --> 00:12:20.799
sensitivity, not just volume. So how do they

00:12:20.799 --> 00:12:22.840
do that? Which leads us to the more sophisticated

00:12:22.840 --> 00:12:26.139
approach, using gradient information. specifically

00:12:26.139 --> 00:12:28.700
looking at the second derivative, or what's called

00:12:28.700 --> 00:12:31.019
the Hessian matrix. The Hessian matrix. Yeah,

00:12:31.139 --> 00:12:33.440
and the history of how computer scientists formalized

00:12:33.440 --> 00:12:36.100
this is just brilliant. The source references

00:12:36.100 --> 00:12:40.990
a landmark 1989 paper by Yann LeCun, John Denker,

00:12:40.990 --> 00:12:43.370
and Sara Solla. Oh, Yann LeCun, yeah. Yeah.

00:12:43.669 --> 00:12:45.990
And the title of this paper is legendary in the

00:12:45.990 --> 00:12:48.549
field. It's called Optimal Brain Damage. Optimal

00:12:48.549 --> 00:12:51.149
Brain Damage. Giving the AI intentional brain

00:12:51.149 --> 00:12:53.970
damage, but doing it optimally. Exactly. It perfectly

00:12:53.970 --> 00:12:56.129
captures the trauma we are inflicting on the

00:12:56.129 --> 00:12:59.610
system. But how does that gradient math actually

00:12:59.610 --> 00:13:02.450
measure sensitivity to prevent us from, you know,

00:13:02.710 --> 00:13:05.629
firing the critical legal guy? The math essentially

00:13:05.629 --> 00:13:08.330
creates a topographical map of the AI's error

00:13:08.330 --> 00:13:11.649
rate. The first derivative, the gradient, tells

00:13:11.649 --> 00:13:14.230
you the slope of where you are. But the second

00:13:14.230 --> 00:13:16.610
derivative, the Hessian, tells you the curvature

00:13:16.610 --> 00:13:19.490
of the space. Okay, visualize that for me. Imagine

00:13:19.490 --> 00:13:22.509
you are standing in a valley. If you are in a

00:13:22.509 --> 00:13:25.629
really steep, narrow ravine, taking one step

00:13:25.629 --> 00:13:28.450
in any direction causes your elevation to shoot

00:13:28.450 --> 00:13:32.169
up drastically. Right. In AI terms, elevation

00:13:32.169 --> 00:13:35.480
is your error rate. So if a weight sits in a

00:13:35.480 --> 00:13:38.159
steep ravine on that mathematical map, tweaking

00:13:38.159 --> 00:13:41.120
it even slightly causes a massive spike in errors.

00:13:41.799 --> 00:13:43.980
That weight is highly sensitive. It's critical.

00:13:44.240 --> 00:13:46.059
Even if the weight's magnitude is tiny, like

00:13:46.059 --> 00:13:48.100
even if it's a really small number. Exactly.

00:13:48.220 --> 00:13:49.720
The size of the number doesn't matter. It's where

00:13:49.720 --> 00:13:51.940
it sits. Yeah. Conversely, if a weight sits on

00:13:51.940 --> 00:13:54.519
a massive flat plane, you could drastically change

00:13:54.519 --> 00:13:56.960
its value or just set it to zero, and the elevation,

00:13:57.039 --> 00:13:59.200
the error rate barely changes at all. Oh, wow.

00:13:59.279 --> 00:14:02.860
Yeah. Optimal brain damage utilized this Hessian

00:14:02.860 --> 00:14:05.580
matrix to find the weights sitting on those flat

00:14:05.580 --> 00:14:07.860
planes. You prune those and you leave the ones

00:14:07.860 --> 00:14:09.720
in the ravines alone, completely independent

00:14:09.720 --> 00:14:12.539
of their absolute size. That is a staggering

00:14:12.539 --> 00:14:15.580
level of mathematical precision. But the really

00:14:15.580 --> 00:14:18.860
wild part about LeCun's 1989 paper is that it

00:14:18.860 --> 00:14:21.039
didn't just suggest deleting the flat plane weights.

00:14:21.240 --> 00:14:23.809
Right. There was a second part. It proposed that

00:14:23.809 --> 00:14:26.210
when you inflict this brain damage, you should

00:14:26.210 --> 00:14:28.629
actively change the values of the non-pruned

00:14:28.629 --> 00:14:31.870
weights, the survivors, to mathematically compensate

00:14:31.870 --> 00:14:34.570
for the ones you just deleted. It's an enforced

00:14:34.570 --> 00:14:37.309
adaptation. You don't just, you know, fire the

00:14:37.309 --> 00:14:40.649
employees. You immediately rewrite the job descriptions

00:14:41.080 --> 00:14:43.659
for the surviving employees to absorb the displaced

00:14:43.659 --> 00:14:46.500
workload, minimizing the disruption to the overall

00:14:46.500 --> 00:14:48.659
network. Which introduces a massive question

00:14:48.659 --> 00:14:51.019
about the timeline of this entire process. Yeah.

00:14:51.320 --> 00:14:53.860
Because does this intentional brain damage and

00:14:53.860 --> 00:14:57.299
the compensatory rewriting happen while the AI is

00:14:57.299 --> 00:14:59.519
actively learning from scratch? Or does it happen

00:14:59.519 --> 00:15:02.419
after the AI is fully baked, trained, and ready

00:15:02.419 --> 00:15:04.279
to be deployed? If we connect this to the bigger

00:15:04.279 --> 00:15:06.700
picture, the timing is one of the most hotly

00:15:06.700 --> 00:15:08.759
debated areas in machine learning research today.

00:15:08.799 --> 00:15:11.740
Really? Oh, yeah. Pruning can theoretically be applied

00:15:11.740 --> 00:15:14.860
at three different stages, before training, during

00:15:14.860 --> 00:15:17.519
training, or after training. Pruning before training,

00:15:18.019 --> 00:15:20.539
basically trying to find the optimal sparse architecture

00:15:20.539 --> 00:15:22.379
right from the starting block before it learns

00:15:22.379 --> 00:15:25.779
anything, that is a massive area of ongoing research.

00:15:26.200 --> 00:15:29.240
But it is incredibly difficult to predict which

00:15:29.240 --> 00:15:32.200
pathways will be needed before the data even

00:15:32.200 --> 00:15:34.059
starts flowing. Right, you don't know what it

00:15:34.059 --> 00:15:36.480
needs to learn yet. So the vast majority of practical

00:15:36.480 --> 00:15:39.519
applications happen during or after. Yes. But

00:15:39.519 --> 00:15:42.019
if you take a fully trained model, say a massive

00:15:42.019 --> 00:15:44.620
language model, and you suddenly sweep through

00:15:44.620 --> 00:15:47.500
and zero out 40% of its connections using this

00:15:47.500 --> 00:15:50.080
Hessian math, its accuracy is still going to

00:15:50.080 --> 00:15:52.299
take a hit, right? I mean, you just ripped out

00:15:52.299 --> 00:15:54.759
half its brain. It absolutely takes a hit. Yeah.

00:15:54.860 --> 00:15:56.700
The source explains that when pruning is performed

00:15:56.700 --> 00:15:59.340
during or after training, the network inevitably

00:15:59.340 --> 00:16:02.529
requires a recovery phase. It has to heal. Exactly.

00:16:02.870 --> 00:16:05.470
You have to subject the damaged model to additional

00:16:05.470 --> 00:16:08.009
fine-tuning epochs. You essentially put it back

00:16:08.009 --> 00:16:10.509
into training mode with a smaller learning rate,

00:16:11.250 --> 00:16:12.929
allowing the remaining surviving connections

00:16:12.929 --> 00:16:15.870
to heal, adjust their weights, and basically

00:16:15.870 --> 00:16:18.090
figure out how to achieve the original accuracy

00:16:18.090 --> 00:16:21.779
with a fraction of the resources. So it's a continuous,

00:16:22.259 --> 00:16:24.559
delicate trade-off. Always. You are constantly

00:16:24.559 --> 00:16:27.559
balancing the desire to slash the computational

00:16:27.559 --> 00:16:30.559
cost against the temporary degradation of the

00:16:30.559 --> 00:16:33.539
network's accuracy, factoring in the extra time

00:16:33.539 --> 00:16:36.019
and electricity it takes to run that fine-tuning

00:16:36.019 --> 00:16:38.679
recovery phase. Right. You have to ensure the

00:16:38.679 --> 00:16:41.159
cost of surgery doesn't outweigh the benefits

00:16:41.159 --> 00:16:44.039
of having a leaner model. And every single AI

00:16:44.039 --> 00:16:46.759
lab has a different formula for balancing that

00:16:46.759 --> 00:16:49.059
trade-off, really depending on what hardware

00:16:49.059 --> 00:16:51.149
the final model is destined for. So what does

00:16:51.149 --> 00:16:54.490
this all mean? Let's bring all this dense topological

00:16:54.490 --> 00:16:57.169
math back down to earth to your daily life as

00:16:57.169 --> 00:16:59.649
a listener. We live in a tech culture that is

00:16:59.649 --> 00:17:02.330
obsessed with scale. Oh, completely. We constantly

00:17:02.330 --> 00:17:05.450
assume that bigger is fundamentally better. Trillions

00:17:05.450 --> 00:17:08.210
of parameters, massive server warehouses in the

00:17:08.210 --> 00:17:11.150
desert, gigawatts of power. But what pruning

00:17:11.150 --> 00:17:13.930
proves is that raw scale is actually a liability

00:17:13.930 --> 00:17:16.190
if you can't process it efficiently. Yeah, it

00:17:16.190 --> 00:17:18.559
becomes dead weight. Exactly. If you're listening

00:17:18.559 --> 00:17:20.440
to this and wondering how your smartphone can

00:17:20.440 --> 00:17:22.839
suddenly run highly complex voice recognition,

00:17:23.460 --> 00:17:25.480
or how you can download a localized language

00:17:25.480 --> 00:17:27.559
model straight to your laptop without it just

00:17:27.559 --> 00:17:30.750
melting your hard drive, unstructured edge pruning

00:17:30.750 --> 00:17:33.670
and sparse matrix hardware are the invisible

00:17:33.670 --> 00:17:36.109
tricks making that happen for you. It's all happening

00:17:36.109 --> 00:17:39.390
behind the scenes. For AI to truly integrate

00:17:39.390 --> 00:17:42.369
into our daily lives without consuming the world's

00:17:42.369 --> 00:17:44.490
entire energy grid, it doesn't just need to get

00:17:44.490 --> 00:17:47.049
smarter. It has to learn how to aggressively

00:17:47.049 --> 00:17:50.049
edit its own manuscript. This raises an important

00:17:50.049 --> 00:17:52.420
question though. And it's actually hiding in

00:17:52.420 --> 00:17:54.859
plain sight within the see also and reference

00:17:54.859 --> 00:17:57.019
sections of our source material. Oh, what's that?

00:17:57.339 --> 00:17:59.440
There are two biological concepts listed there

00:17:59.440 --> 00:18:01.380
that are fascinating to apply to this digital

00:18:01.380 --> 00:18:05.140
framework. Neural Darwinism and Neuroregeneration.

00:18:05.480 --> 00:18:07.720
Neural Darwinism. That sounds like we are moving

00:18:07.720 --> 00:18:11.240
from editing manuscripts to actual digital evolution.

00:18:11.680 --> 00:18:13.579
Where are you going with this? Think about the

00:18:13.579 --> 00:18:16.299
mechanics we've just laid out. Yeah. We are systematically

00:18:16.299 --> 00:18:18.420
applying the Darwinian principle of survival

00:18:18.420 --> 00:18:21.299
of the fittest to individual lines of code and

00:18:21.299 --> 00:18:24.180
mathematical weights. Right. We evaluate connections.

00:18:24.680 --> 00:18:26.420
We find the ones sitting on the flat planes.

00:18:26.779 --> 00:18:28.920
And we systematically kill them off so the

00:18:28.920 --> 00:18:31.200
high-performing connections in the ravines can hoard

00:18:31.200 --> 00:18:34.039
the computational resources and thrive. Survival

00:18:34.039 --> 00:18:37.690
of the fittest neurons. Exactly. But if we fold

00:18:37.690 --> 00:18:40.069
in neuroregeneration, which is actually referenced

00:18:40.069 --> 00:18:43.329
in a 2022 paper by Laurent and colleagues exploring

00:18:43.329 --> 00:18:46.150
sparse training, we aren't just cutting. What

00:18:46.150 --> 00:18:48.910
do you mean? The research is exploring how networks

00:18:48.910 --> 00:18:52.369
might dynamically regrow necessary connections

00:18:52.369 --> 00:18:55.450
during that fine-tuning recovery phase if they

00:18:55.450 --> 00:18:57.970
realize they pruned too aggressively. Wait, really?

00:18:58.029 --> 00:19:00.349
Like a biological organism healing a wound by

00:19:00.349 --> 00:19:03.829
generating new tissue? Yes. Or routing entirely

00:19:03.829 --> 00:19:06.630
new blood vessels around a trauma. The code is

00:19:06.630 --> 00:19:08.569
learning to heal itself based on environmental

00:19:08.569 --> 00:19:11.230
demands. That is wild. So the profound question

00:19:11.230 --> 00:19:14.089
is, by mimicking synaptic pruning, introducing

00:19:14.089 --> 00:19:16.529
Darwinian survival of the fittest at the parameter

00:19:16.529 --> 00:19:19.190
level and allowing for dynamic neuroregeneration.

00:19:19.569 --> 00:19:21.970
Are we inadvertently designing AI systems that

00:19:21.970 --> 00:19:24.269
behave like biological organisms evolving in

00:19:24.269 --> 00:19:27.990
real time? Exactly. Rather than just static mathematical

00:19:27.990 --> 00:19:30.970
tools executing a script. What happens to the

00:19:30.970 --> 00:19:34.009
architecture of artificial brains when they undergo

00:19:34.009 --> 00:19:36.910
full evolutionary cycles on their own? We started

00:19:36.910 --> 00:19:39.589
by trying to save some battery life on a GPU,

00:19:39.589 --> 00:19:42.529
and we ended up simulating the biological evolution

00:19:42.529 --> 00:19:44.630
of computer code. It really makes you think.

00:19:44.630 --> 00:19:46.970
It completely reframes how you look at the tech

00:19:46.970 --> 00:19:49.829
in your pocket. To everyone listening, thank

00:19:49.829 --> 00:19:52.509
you for joining us on this deep dive. Keep questioning

00:19:52.509 --> 00:19:55.329
the world around you, look for the hidden architecture

00:19:55.329 --> 00:19:57.309
and the tools you use every day, and we'll catch

00:19:57.309 --> 00:19:58.130
you on the next one.
