WEBVTT

00:00:00.000 --> 00:00:01.720
You know, we hear about artificial intelligence

00:00:01.720 --> 00:00:04.459
like constantly these days. Oh, absolutely. It's

00:00:04.459 --> 00:00:06.839
everywhere. Right. Sorting our emails, diagnosing

00:00:06.839 --> 00:00:11.400
diseases, driving our cars. And the way it's

00:00:11.400 --> 00:00:14.119
usually presented to you and me is like this

00:00:14.119 --> 00:00:18.399
sleek, totally impenetrable black box. Yeah,

00:00:18.440 --> 00:00:21.579
just millions of data points go in and then magic.

00:00:21.679 --> 00:00:24.440
Exactly. Some mysterious computational magic

00:00:24.440 --> 00:00:26.500
happens and this perfectly formed answer just

00:00:26.500 --> 00:00:28.769
pops out. Yeah. But if you actually peek inside

00:00:28.769 --> 00:00:31.489
that box, what you find isn't, you know, magic

00:00:31.489 --> 00:00:34.000
at all. No, far from it. I mean, when you look

00:00:34.000 --> 00:00:36.359
under the hood of a large language model or an

00:00:36.359 --> 00:00:38.679
image generator, you don't find a synthetic

00:00:38.679 --> 00:00:42.460
brain spontaneously thinking thoughts. You find

00:00:42.460 --> 00:00:45.659
a very specific, incredibly rigorous mathematical

00:00:45.659 --> 00:00:48.039
engine, and it's basically running a very old

00:00:48.039 --> 00:00:50.780
optimization problem. And today we are going

00:00:50.780 --> 00:00:53.619
to crack that black box wide open. Welcome to

00:00:53.619 --> 00:00:56.340
today's Deep Dive. We've got a single, just remarkably

00:00:56.340 --> 00:00:58.299
comprehensive source for you today. It's the

00:00:58.299 --> 00:01:00.789
Wikipedia article on backpropagation. That's

00:01:00.789 --> 00:01:03.429
the one. And our mission today is to understand

00:01:03.429 --> 00:01:06.849
the exact mathematical engine that allows AI

00:01:06.849 --> 00:01:09.829
to actually learn from its mistakes. Because

00:01:09.829 --> 00:01:12.109
whenever you hear about an AI learning to do

00:01:12.109 --> 00:01:14.489
something, it almost always boils down to this

00:01:14.489 --> 00:01:16.810
one specific algorithm. It really does. It's

00:01:16.810 --> 00:01:19.129
the core of it all. OK, let's unpack this. Before

00:01:19.129 --> 00:01:21.510
we can really grasp how backpropagation works,

00:01:21.909 --> 00:01:24.480
we kind of need to understand the problem it

00:01:24.480 --> 00:01:26.640
was actually invented to solve, right? Yeah,

00:01:26.659 --> 00:01:29.519
you need the why before the how. Because in machine

00:01:29.519 --> 00:01:31.640
learning, specifically in this framework we call

00:01:31.640 --> 00:01:34.340
supervised learning, an AI doesn't start out

00:01:34.340 --> 00:01:36.719
with any inherent knowledge. It's just a blank

00:01:36.719 --> 00:01:40.120
slate. Completely ignorant. Its internal connections,

00:01:40.420 --> 00:01:42.599
these numerical values that we call weights,

00:01:43.079 --> 00:01:45.480
they are essentially just set at random initially.

00:01:45.640 --> 00:01:49.739
So if you ask a brand new untrained network to

00:01:50.890 --> 00:01:53.390
identify a picture of a cat, it is literally

00:01:53.390 --> 00:01:55.689
just guessing blindly. I was actually thinking

00:01:55.689 --> 00:01:57.370
about an analogy for this while I was reading

00:01:57.370 --> 00:01:59.709
through the source material. Imagine you are standing

00:01:59.709 --> 00:02:02.989
in a large room, right? And you're trying to

00:02:02.989 --> 00:02:05.069
hit a dart board. OK, I like where this is going.

00:02:05.370 --> 00:02:09.150
But you are completely blindfolded. You throw

00:02:09.150 --> 00:02:12.969
your first dart and you have absolutely no idea

00:02:12.969 --> 00:02:14.650
where it landed. I mean, it probably hit the

00:02:14.650 --> 00:02:17.610
ceiling or like a potted plant. Almost certainly

00:02:17.610 --> 00:02:19.870
a potted plant. Right. Now, if nobody tells you

00:02:19.870 --> 00:02:22.069
anything, your next throw is just gonna be another

00:02:22.069 --> 00:02:24.150
totally random guess. You're never gonna learn.

00:02:24.150 --> 00:02:27.009
Exactly, but in supervised learning you have

00:02:27.009 --> 00:02:30.270
a guide. You have your training data. Okay, so

00:02:30.270 --> 00:02:33.189
that data acts as the person standing next to

00:02:33.189 --> 00:02:36.110
you, watching where your dart actually lands. So

00:02:36.110 --> 00:02:38.349
they say, okay, that throw was, you know, three feet

00:02:38.349 --> 00:02:40.830
too high and two feet too far to the left. They

00:02:40.830 --> 00:02:43.909
are actively quantifying your error. Yes. They

00:02:43.909 --> 00:02:46.110
give you the exact measurements of how bad you

00:02:46.110 --> 00:02:49.389
missed. So for your next throw, you adjust. You

00:02:49.389 --> 00:02:51.409
change your angle. Maybe you change your force

00:02:51.409 --> 00:02:53.610
just trying to correct for that specific mistake.
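To pin that feedback down concretely, here is a minimal Python sketch, with made-up dartboard coordinates: the training signal is literally the measured difference between where the dart landed and where it should have.

```python
# Supervised feedback as a quantified error (coordinates are illustrative).
target = (0.0, 0.0)    # the bullseye
landed = (-2.0, 3.0)   # 2 feet left, 3 feet high

# The guide reports exactly how you missed, axis by axis:
error = (landed[0] - target[0], landed[1] - target[1])
print(error)  # (-2.0, 3.0) -- the correction signal for the next throw
```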

00:02:53.770 --> 00:02:55.530
Yeah. And you keep throwing and they keep giving

00:02:55.530 --> 00:02:58.050
you precise feedback. Until eventually you're

00:02:58.050 --> 00:03:01.129
hitting the bullseye. Right. Now, in the mathematical

00:03:01.129 --> 00:03:03.449
realm of a neural network, that distance between

00:03:03.449 --> 00:03:05.629
your dart and the bullseye is measured by something

00:03:05.629 --> 00:03:09.009
called the loss function. Or sometimes the cost

00:03:09.009 --> 00:03:11.409
function, right? I saw both terms in the text.

00:03:11.729 --> 00:03:13.349
Yeah, they're generally interchangeable. The

00:03:13.349 --> 00:03:15.889
source actually mentions the squared error, which

00:03:15.889 --> 00:03:18.069
is used for regression problems. And the neat

00:03:18.069 --> 00:03:19.949
thing about squaring the distance of your miss

00:03:19.949 --> 00:03:23.090
is that the math heavily penalizes a dart that

00:03:23.090 --> 00:03:25.669
hits the ceiling, but only slightly penalizes

00:03:25.669 --> 00:03:27.870
a dart that lands, you know, just outside the

00:03:27.870 --> 00:03:29.889
bullseye. That makes sense. It really wants

00:03:29.889 --> 00:03:32.870
to fix the huge mistakes. So if we visualize

00:03:32.870 --> 00:03:35.310
that relationship between the network's prediction

00:03:35.310 --> 00:03:38.729
and the actual error, it forms a shape like a

00:03:38.729 --> 00:03:42.439
giant bowl. Or, mathematically speaking, a parabola.

00:03:42.599 --> 00:03:44.759
A big mathematical bowl, exactly. And the goal

00:03:44.759 --> 00:03:48.240
of the AI, like its only fundamental goal, is

00:03:48.240 --> 00:03:50.580
to find the absolute lowest point of that bowl.

00:03:51.139 --> 00:03:53.780
The minimum. Because at that lowest point, down

00:03:53.780 --> 00:03:56.659
at the very bottom, the error is at its smallest.
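As a rough sketch of that idea in Python (names and numbers here are illustrative, not from the source): squaring the miss penalizes a wild throw far more than a near miss, and sweeping the prediction traces out the bowl, a parabola.

```python
# Squared-error loss: the "distance of your miss", squared.
def squared_error(prediction: float, target: float) -> float:
    return (prediction - target) ** 2

print(squared_error(0.1, 0.0))  # 0.01 -- just outside the bullseye, barely penalized
print(squared_error(3.0, 0.0))  # 9.0  -- hit the ceiling, heavily penalized

# Sweeping the prediction traces out the bowl, with its minimum at the target:
for p in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(p, squared_error(p, 0.0))
```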

00:03:57.159 --> 00:03:59.560
Your dart is squarely in the bullseye. Okay,

00:03:59.560 --> 00:04:01.219
so how does it get there? Well, if we connect

00:04:01.219 --> 00:04:03.719
this to the bigger picture, finding that minimum

00:04:03.719 --> 00:04:05.800
point is done through a process called gradient

00:04:05.800 --> 00:04:08.770
descent. Gradient descent. Right. Since the AI

00:04:08.770 --> 00:04:11.750
is blindfolded, remember, it can't just look

00:04:11.750 --> 00:04:13.610
at the bowl and jump straight to the bottom.

00:04:13.949 --> 00:04:16.509
All it can do is feel the slope of the bowl right

00:04:16.509 --> 00:04:19.110
beneath its own feet. Oh, I see. So it calculates

00:04:19.110 --> 00:04:21.509
the steepest downward direction from its current

00:04:21.509 --> 00:04:24.730
position, takes a tiny mathematical step in that

00:04:24.730 --> 00:04:27.410
direction, and then feels the slope again. Yeah.
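A minimal sketch of that feel-and-step loop, assuming a single weight on the squared-error bowl (the starting point and learning rate are invented for illustration):

```python
# Gradient descent on a one-dimensional bowl: loss(w) = (w - 3)**2,
# whose true minimum sits at w = 3. The slope (gradient) is 2 * (w - 3).
w = 10.0             # random starting guess: the blindfolded first throw
learning_rate = 0.1  # size of each tiny mathematical step

for _ in range(50):
    gradient = 2 * (w - 3)         # feel the slope under your feet
    w -= learning_rate * gradient  # step in the steepest downhill direction

print(w)  # ~3.0 -- settled at the bottom of the bowl
```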

00:04:27.410 --> 00:04:29.410
Backpropagation is just the specific method

00:04:29.410 --> 00:04:32.459
used to calculate that exact slope: the steepest

00:04:32.459 --> 00:04:35.019
descent direction of the loss function relative

00:04:35.019 --> 00:04:37.180
to the current weights. You got it. So it takes

00:04:37.180 --> 00:04:40.100
a step, recalculates the error, adjusts the weights,

00:04:40.300 --> 00:04:42.060
takes another step. I mean, that makes logical

00:04:42.060 --> 00:04:45.000
sense for our single dart thrower. But reading

00:04:45.000 --> 00:04:47.180
through this source, I really started thinking

00:04:47.180 --> 00:04:50.220
about the sheer scale of modern AI. It's massive.

00:04:50.300 --> 00:04:53.300
Yeah. Yeah. A neural network isn't just one dart

00:04:53.300 --> 00:04:55.819
thrower. It has multiple layers. You've got an

00:04:55.819 --> 00:04:58.259
input layer taking in the data, an output layer

00:04:58.259 --> 00:05:00.339
giving the answer, and potentially hundreds of

00:05:00.339 --> 00:05:02.800
these hidden layers in between. Sometimes thousands

00:05:02.800 --> 00:05:05.540
of layers. And they contain millions or even

00:05:05.540 --> 00:05:07.709
billions of individual weights connecting all

00:05:07.709 --> 00:05:10.209
of them together. And this is where the logistics

00:05:10.209 --> 00:05:13.269
just seem impossible to me. If you have to calculate

00:05:13.269 --> 00:05:17.110
the exact adjustment for every single tiny weight

00:05:17.110 --> 00:05:21.120
across billions of connections every single time

00:05:21.120 --> 00:05:24.319
the network makes a guess. I mean, how does calculating

00:05:24.319 --> 00:05:28.399
the change for every possible path not just instantly

00:05:28.399 --> 00:05:30.899
overwhelm the computer? It sounds like an infinite

00:05:30.899 --> 00:05:33.420
math problem. Exactly. It just sounds completely

00:05:33.420 --> 00:05:36.660
impossible. And that is the exact logistical

00:05:36.660 --> 00:05:39.100
nightmare that stumped researchers for years.

00:05:39.439 --> 00:05:42.279
Because if you try to calculate the error by

00:05:42.279 --> 00:05:43.899
starting at the beginning of the network and

00:05:43.899 --> 00:05:47.529
moving forward, calculating how every tiny tweak

00:05:47.529 --> 00:05:49.889
to a weight might eventually affect the final

00:05:49.889 --> 00:05:52.769
output, the math becomes astronomically complex.

00:05:52.829 --> 00:05:54.569
Like tracking a billion different butterfly effects.

00:05:55.009 --> 00:05:58.050
Exactly. You end up calculating the same intermediate

00:05:58.050 --> 00:06:00.930
steps over and over and over again. It requires

00:06:00.930 --> 00:06:03.189
computing power that didn't exist back then and,

00:06:03.189 --> 00:06:06.009
frankly, still barely exists today. Let's visualize

00:06:06.009 --> 00:06:07.750
that bottleneck because I think I have a way

00:06:07.750 --> 00:06:09.970
to picture this. It's like a massive factory

00:06:09.970 --> 00:06:12.829
assembly line. OK, I'm with you. So a car rolls

00:06:12.829 --> 00:06:15.110
off the end of the line and the steering wheel

00:06:15.110 --> 00:06:18.050
is installed upside down. The forward approach

00:06:18.050 --> 00:06:20.189
we just talked about would be like standing at

00:06:20.189 --> 00:06:22.269
the very beginning of the assembly line looking

00:06:22.269 --> 00:06:25.209
at the raw steel and the very first worker. Right.

00:06:25.430 --> 00:06:28.029
And trying to mathematically simulate every possible

00:06:28.029 --> 00:06:30.850
action every single worker could take just to

00:06:30.850 --> 00:06:32.730
figure out how it might eventually result in

00:06:32.730 --> 00:06:34.569
an upside down steering wheel at the end. It's

00:06:34.569 --> 00:06:37.209
totally absurd. It's a logistical nightmare.

00:06:37.310 --> 00:06:40.990
It is. And the core brilliance of the backpropagation

00:06:40.990 --> 00:06:42.629
algorithm is actually right there in the name.

00:06:42.649 --> 00:06:45.310
You don't go forward. You propagate the error

00:06:45.310 --> 00:06:47.829
backward. Backward. You compute the gradient

00:06:47.829 --> 00:06:51.209
one layer at a time, but you iterate in reverse

00:06:51.209 --> 00:06:53.810
from the final output layer back to the first

00:06:53.810 --> 00:06:56.399
input. Bring us back to the factory. You start

00:06:56.399 --> 00:06:58.600
with the flawed car. You look at the final station.

00:06:58.759 --> 00:07:00.680
You ask, did the last person put the steering

00:07:00.680 --> 00:07:03.220
wheel on upside down? And if they say no, it

00:07:03.220 --> 00:07:04.899
arrived to them that way. Right. You just take

00:07:04.899 --> 00:07:06.860
one step backward. Did the person before them

00:07:06.860 --> 00:07:10.600
do it? No. So you trace that specific actual

00:07:10.600 --> 00:07:13.720
error backwards, step by step, following the

00:07:13.720 --> 00:07:16.480
exact chain of causality. Until you find the

00:07:16.480 --> 00:07:18.899
specific worker who made the mistake, and you

00:07:18.899 --> 00:07:22.000
tell them to adjust their process. That is so

00:07:22.000 --> 00:07:24.399
incredibly elegant. It really highlights the

00:07:24.399 --> 00:07:27.379
raw efficiency of the algorithm. You aren't evaluating

00:07:27.379 --> 00:07:29.639
every possible thing that could go wrong in the

00:07:29.639 --> 00:07:33.279
factory. You are only evaluating the actual error

00:07:33.279 --> 00:07:36.279
that occurred. And mathematically, the source

00:07:36.279 --> 00:07:38.480
explains this as an efficient application of

00:07:38.480 --> 00:07:41.339
the chain rule from calculus. The chain rule?

00:07:41.480 --> 00:07:43.819
I definitely remember that term from high school

00:07:43.819 --> 00:07:47.300
calculus, but usually in a very, like, abstract

00:07:47.300 --> 00:07:49.879
way. Well, in calculus, the chain rule is just

00:07:49.879 --> 00:07:52.959
a formula for calculating the derivative of composite

00:07:52.959 --> 00:07:55.079
functions. Composite functions. Think of them

00:07:55.079 --> 00:07:58.160
like Russian nesting dolls of math. Functions

00:07:58.160 --> 00:08:01.319
hidden entirely inside other functions. Oh, okay.

00:08:01.439 --> 00:08:04.019
I like that. In our neural network, the output

00:08:04.019 --> 00:08:06.319
of one layer simply becomes the input for the

00:08:06.319 --> 00:08:09.000
next layer. They are nested. So if you want to

00:08:09.000 --> 00:08:11.300
know how a tiny change to the smallest doll in

00:08:11.300 --> 00:08:13.500
the center affects the largest doll on the outside,

00:08:13.879 --> 00:08:15.740
the chain rule lets you just multiply the rates

00:08:15.740 --> 00:08:18.100
of change at each layer to find the total effect.
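In code, the nesting-doll picture is just function composition, and the chain rule multiplies the local rates of change. A toy sketch with two made-up layers:

```python
# Two nested functions, outer(inner(x)), like one layer feeding the next.
def inner(x):
    return 3 * x   # local rate of change: d(inner)/dx = 3

def outer(u):
    return u ** 2  # local rate of change: d(outer)/du = 2 * u

x = 5.0
u = inner(x)        # forward pass: x flows through the inner layer
d_outer_du = 2 * u  # slope at the outer layer (evaluated at u = 15)
d_inner_dx = 3      # slope at the inner layer

# Chain rule: multiply the local slopes to get the total effect of x on the output.
print(d_outer_du * d_inner_dx)  # 90.0, the derivative of (3x)**2 = 9x**2 at x = 5
```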

00:08:18.500 --> 00:08:21.060
But the key to making this actually run on a

00:08:21.060 --> 00:08:22.939
physical computer without, you know, melting

00:08:22.939 --> 00:08:25.819
the processor comes down to linear algebra. Exactly.

00:08:25.899 --> 00:08:29.220
This is the realm of matrices and vectors. If

00:08:29.220 --> 00:08:31.199
you try to calculate the error forwards, like

00:08:31.199 --> 00:08:33.179
we were saying earlier, you have to multiply

00:08:33.179 --> 00:08:36.000
a matrix by a matrix. And a matrix is essentially

00:08:36.000 --> 00:08:38.080
just a massive spreadsheet of numbers, right?

00:08:38.279 --> 00:08:41.799
Yep. And multiplying two giant spreadsheets together

00:08:41.799 --> 00:08:44.799
is exactly like your factory simulation. You

00:08:44.799 --> 00:08:47.860
are trying to track every possible path of change

00:08:47.860 --> 00:08:51.299
through the entire network simultaneously. It's

00:08:51.299 --> 00:08:54.259
highly, highly inefficient. But with backpropagation,

00:08:54.539 --> 00:08:57.019
by starting at the end with a known error, we

00:08:57.019 --> 00:08:59.100
just bypass all those redundant calculations.

00:08:59.440 --> 00:09:02.279
Right. Instead of a matrix multiplied by a matrix,

00:09:02.419 --> 00:09:04.980
each step backward is simply multiplying a vector

00:09:04.980 --> 00:09:07.919
by a matrix. And a vector is simpler. Much simpler.

00:09:08.019 --> 00:09:10.200
A vector is just a single column of numbers.

00:09:10.519 --> 00:09:12.700
In this case, it represents the specific error,

00:09:12.960 --> 00:09:15.240
which the math denotes as a lowercase delta.

00:09:15.679 --> 00:09:17.759
Multiplying a single column by a spreadsheet

00:09:17.759 --> 00:09:20.240
is infinitely faster for a computer to process.
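A rough numpy sketch of that backward step, with a small made-up layer: the error signal delta stays a single vector all the way back, so each step is a cheap vector-matrix product instead of a matrix-matrix product.

```python
import numpy as np

# A made-up layer with 4 inputs and 3 outputs.
W = np.random.randn(3, 4)   # the weight matrix: a 3-by-4 "spreadsheet"
delta = np.random.randn(3)  # the error signal at this layer's output (lowercase delta)

# One backward step: push the error vector through the weights.
# A vector-matrix product like this costs 3 * 4 multiplications,
# versus the far larger matrix-matrix products of the forward approach.
delta_prev = W.T @ delta
print(delta_prev.shape)  # (4,) -- the error signal for the previous layer
```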

00:09:20.110 --> 00:09:22.669
Because it just avoids the duplicate calculations.

00:09:23.769 --> 00:09:26.049
Exactly. It skips computing all those intermediate

00:09:26.049 --> 00:09:29.230
values you don't actually need. And that one

00:09:29.230 --> 00:09:32.250
single mathematical trick is what makes training

00:09:32.250 --> 00:09:35.129
deep neural networks physically possible. Which

00:09:35.129 --> 00:09:39.370
is wild. This elegant trick is the absolute bedrock

00:09:39.370 --> 00:09:42.610
of the entire modern AI boom. I mean, without

00:09:42.610 --> 00:09:45.450
it... And none of the AI we use today even exists.

00:09:45.710 --> 00:09:47.330
So you would naturally assume this was invented

00:09:47.330 --> 00:09:49.789
recently, right? For sure. Like maybe 10 or 15

00:09:49.789 --> 00:09:52.970
years ago in some sleek Silicon Valley tech incubator.

00:09:53.149 --> 00:09:54.870
That would make sense. But the history of this

00:09:54.870 --> 00:09:58.190
algorithm is actually this tangled, bizarre web

00:09:58.190 --> 00:10:00.769
that stretches back centuries. Here's where it

00:10:00.769 --> 00:10:02.750
gets really interesting. Because the foundational

00:10:02.750 --> 00:10:04.669
math behind this, that chain rule we just broke

00:10:04.669 --> 00:10:06.750
down, it was first written down by Gottfried

00:10:06.750 --> 00:10:09.659
Wilhelm Leibniz. Yes, the German mathematician

00:10:09.659 --> 00:10:12.139
and philosopher. In the year 1676. I mean, we

00:10:12.139 --> 00:10:15.159
are literally using 17th century math to train

00:10:15.159 --> 00:10:18.500
modern AI chatbots. It is a profound reminder

00:10:18.500 --> 00:10:20.940
that in mathematics, foundational truths never

00:10:20.940 --> 00:10:23.929
expire. You know, the application of those ideas

00:10:23.929 --> 00:10:26.149
to what we now call machine learning, that took

00:10:26.149 --> 00:10:29.450
a very winding path. To say the least. The precursors

00:10:29.450 --> 00:10:32.470
to modern backpropagation actually appeared in

00:10:32.470 --> 00:10:35.049
optimal control theory back in the 1950s and

00:10:35.049 --> 00:10:37.649
60s. Control theory, that's the math used for

00:10:37.649 --> 00:10:39.990
engineering dynamic systems, right? Like flight

00:10:39.990 --> 00:10:43.389
paths and industrial processes. Exactly. Researchers

00:10:43.389 --> 00:10:46.289
like Henry J. Kelley, Arthur E. Bryson, and Lev

00:10:46.289 --> 00:10:48.750
Pontryagin, they were trying to figure out how

00:10:48.750 --> 00:10:52.629
to optimize multi-stage processes. Imagine trying

00:10:52.629 --> 00:10:55.190
to optimize a multi-stage rocket flight. You

00:10:55.190 --> 00:10:58.649
have thrust, wind resistance, fuel burn, all

00:10:58.649 --> 00:11:02.009
interacting continuously. If you want to adjust

00:11:02.009 --> 00:11:04.230
the starting launch angle to perfectly hit a

00:11:04.230 --> 00:11:06.629
target in orbit, they used something called the

00:11:06.629 --> 00:11:09.429
adjoint state method. Adjoint state method? Yeah.

00:11:09.789 --> 00:11:12.509
It basically calculates how the final miss relates

00:11:12.509 --> 00:11:14.789
to the initial parameters by working backward

00:11:14.789 --> 00:11:16.970
from the target. Wait, that sounds exactly like

00:11:16.970 --> 00:11:19.210
backpropagation. It is conceptually identical

00:11:19.210 --> 00:11:21.990
to it. It's just applied to the continuous physics

00:11:21.990 --> 00:11:24.350
of rockets instead of artificial neural networks.

00:11:24.690 --> 00:11:26.710
So they had the right tools, but they were just

00:11:26.710 --> 00:11:28.470
building a completely different house. Exactly.

00:11:28.529 --> 00:11:31.470
But I have to say, The absolute most surprising

00:11:31.470 --> 00:11:34.350
detail in the entire source involves a researcher

00:11:34.350 --> 00:11:38.549
named Paul Werbos, because in 1971, Werbos developed

00:11:38.549 --> 00:11:40.649
the backpropagation algorithm as we generally

00:11:40.649 --> 00:11:43.370
understand it today for his PhD thesis. Yes,

00:11:43.370 --> 00:11:45.570
he did. But he didn't do it to teach computers

00:11:45.570 --> 00:11:48.929
how to recognize images or play chess. He developed

00:11:48.929 --> 00:11:51.929
it to mathematicize Sigmund Freud's concept of

00:11:51.929 --> 00:11:54.870
the flow of psychic energy. I know. It remains

00:11:54.870 --> 00:11:57.070
one of the most unexpected crossovers in the

00:11:57.070 --> 00:11:59.970
history of science. Werbos was essentially attempting

00:11:59.970 --> 00:12:02.669
to map psychological theories of human behavior.

00:12:02.789 --> 00:12:05.350
The ego, the id. Right. And the way traumatic

00:12:05.350 --> 00:12:08.049
errors supposedly propagate backward through

00:12:08.049 --> 00:12:10.009
the human subconscious, he wanted to map all

00:12:10.009 --> 00:12:12.610
of that into a rigorous mathematical model. He

00:12:12.610 --> 00:12:15.129
literally translated Freudian psychoanalysis

00:12:15.129 --> 00:12:18.210
into calculus, which is just mind blowing. And

00:12:18.210 --> 00:12:20.190
the frustrating part is, Werbos couldn't even

00:12:20.190 --> 00:12:22.950
get his work published until 1981 because the

00:12:22.950 --> 00:12:25.860
concept was viewed as too unorthodox. The utility

00:12:25.860 --> 00:12:28.120
for computer science just wasn't recognized yet.

00:12:28.600 --> 00:12:30.740
And that delay is a recurring theme here. The

00:12:30.740 --> 00:12:32.639
concept just kept being discovered and then ignored.

00:12:34.200 --> 00:12:37.139
Seppo Linnainmaa published a concept called the reverse

00:12:37.139 --> 00:12:40.580
mode of automatic differentiation in 1970. Which

00:12:40.580 --> 00:12:42.919
is mathematically identical to modern backprop.

00:12:43.000 --> 00:12:45.700
Exactly. But it wasn't until the mid-1980s that

00:12:45.700 --> 00:12:48.940
the dam finally broke. And that culminated in

00:12:48.940 --> 00:12:52.240
a very famous 1986 paper in the journal Nature.

00:12:52.590 --> 00:12:54.750
published by David Rumelhart, Geoffrey Hinton,

00:12:55.250 --> 00:12:57.710
who, by the way, recently won the 2024 Nobel

00:12:57.710 --> 00:13:00.029
Prize in Physics for his AI contribution. Oh,

00:13:00.169 --> 00:13:02.629
yeah. Huge figure in the field. And Ronald Williams.

00:13:02.690 --> 00:13:05.970
Yeah. They demonstrated experimentally that backpropagation

00:13:05.970 --> 00:13:08.230
could train neural networks to learn internal

00:13:08.230 --> 00:13:11.029
representations of data. They popularized the

00:13:11.029 --> 00:13:13.509
algorithm and forced the broader scientific community

00:13:13.509 --> 00:13:16.009
to finally pay attention. They really put it

00:13:16.009 --> 00:13:17.769
on the map. I do have to pause here, though.

00:13:18.149 --> 00:13:21.509
If Linnainmaa had the exact math in 1970, and

00:13:21.509 --> 00:13:25.029
Werbos had it in 1971, why did it take until

00:13:25.029 --> 00:13:27.549
1986 for anyone in the actual neural network

00:13:27.549 --> 00:13:30.529
field to adopt this magic shortcut? What's fascinating

00:13:30.529 --> 00:13:33.169
here is that the delay stemmed from a fundamental

00:13:33.169 --> 00:13:36.090
biological misunderstanding. Yeah. Back in those

00:13:36.090 --> 00:13:38.870
days, artificial neural network design was heavily,

00:13:38.870 --> 00:13:41.370
heavily influenced by physiologists who were

00:13:41.370 --> 00:13:44.690
studying actual human brains. And the prevailing

00:13:44.690 --> 00:13:47.009
biological belief at the time was that neurons

00:13:47.009 --> 00:13:50.649
only fired in discrete binary signals, a zero

00:13:50.649 --> 00:13:53.230
or a one, off or on. Oh, like a light switch.

00:13:53.850 --> 00:13:56.590
Precisely. But returning to our concept of gradient

00:13:56.590 --> 00:13:59.269
descent from earlier, backpropagation requires

00:13:59.269 --> 00:14:01.289
calculating a slope. Right. Feeling your way

00:14:01.289 --> 00:14:03.870
down the bowl. Exactly. And to do that, it requires

00:14:03.870 --> 00:14:06.909
continuous, differentiable functions. You cannot

00:14:06.909 --> 00:14:09.169
calculate the mathematical slope of a discrete

00:14:09.169 --> 00:14:11.389
light switch. Because it's just on or off. Right.

00:14:11.629 --> 00:14:14.330
On a graph, a step function is either a perfectly

00:14:14.330 --> 00:14:17.269
flat horizontal line, or it's a sheer vertical

00:14:17.269 --> 00:14:20.389
cliff. And the slope of a vertical cliff is mathematically

00:14:20.389 --> 00:14:23.110
undefined, while the slope of a flat line is

00:14:23.110 --> 00:14:25.909
just zero. So there's no slope to feel. Right.
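A small sketch of that dead end, comparing the binary light switch with a smooth sigmoid (a standard differentiable activation, used here as an illustrative stand-in): the step gives gradient descent no signal, while the sigmoid has a usable slope everywhere.

```python
import math

def step(x):     # the light switch: fires 0 or 1, nothing in between
    return 1.0 if x > 0 else 0.0

def sigmoid(x):  # a smooth, continuous curve instead
    return 1 / (1 + math.exp(-x))

def sigmoid_slope(x):  # its derivative exists everywhere: s * (1 - s)
    s = sigmoid(x)
    return s * (1 - s)

for x in [-2.0, 0.0, 2.0]:
    # The step function's slope is 0 on the flats and undefined at the cliff;
    # the sigmoid always reports a direction to feel your way down.
    print(x, step(x), round(sigmoid_slope(x), 4))
```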

00:14:26.210 --> 00:14:28.230
Because early researchers believed artificial

00:14:28.230 --> 00:14:30.570
neurons had to perfectly mimic this discrete

00:14:30.570 --> 00:14:33.549
biological firing, they assumed backpropagation

00:14:33.549 --> 00:14:36.110
was useless for neural networks. There was no

00:14:36.110 --> 00:14:38.389
gradient slope for the blindfolded AI to feel.

00:14:38.889 --> 00:14:41.889
Oh wow. They were trying so hard to copy human

00:14:41.889 --> 00:14:44.190
biology that they just completely blinded themselves

00:14:44.190 --> 00:14:46.809
to the mathematical solution. Exactly. To make

00:14:46.809 --> 00:14:49.870
the math work, they had to abandon strict biological

00:14:49.870 --> 00:14:52.409
realism. They needed activation functions that

00:14:52.409 --> 00:14:55.389
were smooth, continuous curves where a derivative

00:14:55.389 --> 00:14:58.169
could actually be evaluated. And once the field

00:14:58.169 --> 00:15:00.870
finally accepted that and the 1986 paper proved

00:15:00.870 --> 00:15:03.500
it worked in practice, backpropagation just

00:15:03.500 --> 00:15:05.960
completely took over. It really did. The source

00:15:05.960 --> 00:15:09.000
lists this rapid-fire string of absolute triumphs

00:15:09.000 --> 00:15:12.899
that followed. In 1987, a system called NETtalk

00:15:12.899 --> 00:15:15.759
learned to convert English text into pronunciation,

00:15:15.980 --> 00:15:17.879
literally reading out loud. Huge breakthrough.

00:15:18.299 --> 00:15:21.740
Then in 1989, a project named ALVINN used

00:15:21.740 --> 00:15:24.240
backpropagation to learn how to drive a vehicle autonomously.

00:15:24.480 --> 00:15:27.360
That same year, Yann LeCun published LeNet,

00:15:27.600 --> 00:15:29.600
which could recognize handwritten zip codes on

00:15:29.600 --> 00:15:33.399
envelopes. And by 1992, a system called TD-Gammon

00:15:33.399 --> 00:15:36.059
achieved human-level play in backgammon. It

00:15:36.059 --> 00:15:38.559
was a golden age of rapid advancement. But, you

00:15:38.559 --> 00:15:41.039
know, the source is very clear that backpropagation

00:15:41.039 --> 00:15:44.059
isn't a flawless silver bullet. There are significant

00:15:44.059 --> 00:15:47.100
limitations. And the most famous of which brings

00:15:47.100 --> 00:15:49.100
us back to that bowl shape we talked about, the

00:15:49.100 --> 00:15:51.000
error landscape. Right, the perfect parabola

00:15:51.000 --> 00:15:52.980
where the bottom is the bullseye. Well, the problem

00:15:52.980 --> 00:15:56.919
is, real world data is rarely a perfect, smooth

00:15:56.919 --> 00:16:00.019
bowl. The error landscape of a highly complex

00:16:00.019 --> 00:16:02.940
neural network looks more like a rugged, chaotic

00:16:02.940 --> 00:16:05.700
mountain range. It's filled with peaks, valleys,

00:16:05.940 --> 00:16:09.019
and flat plateaus. Oh, I see. So the danger of

00:16:09.019 --> 00:16:12.159
gradient descent is that the blindfolded AI might

00:16:12.159 --> 00:16:14.740
feel its way down into a small valley, hit the

00:16:14.740 --> 00:16:17.360
bottom, and stop. Because all the slopes around

00:16:17.360 --> 00:16:19.879
it go up, it assumes it has solved the problem.

00:16:20.100 --> 00:16:22.870
Even though it's not the absolute bottom. It

00:16:22.870 --> 00:16:25.529
gets stuck in what they call a local minimum,

00:16:25.710 --> 00:16:27.850
right, like a shallow valley halfway up the mountain,

00:16:28.009 --> 00:16:30.409
instead of finding the global minimum, which

00:16:30.409 --> 00:16:32.690
is the true bottom of the deepest valley. Exactly.

00:16:32.929 --> 00:16:35.049
And it can also get completely stalled on flat

00:16:35.049 --> 00:16:37.090
plateaus where the slope is practically zero,

00:16:37.549 --> 00:16:39.570
which provides no directional guidance at all.
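Here is a hedged one-dimensional sketch of getting stuck, using an invented bumpy loss with two valleys: started on the wrong side, plain gradient descent settles into the shallow valley and stays there.

```python
# An invented bumpy landscape: loss(w) = (w**2 - 1)**2 + 0.3 * w.
# It has a deep (global) valley near w = -1 and a shallow (local) one near w = +1.
def grad(w):
    return 4 * w * (w**2 - 1) + 0.3  # derivative of the loss above

w = 2.0  # blindfolded start on the right-hand slope
for _ in range(1000):
    w -= 0.01 * grad(w)

print(round(w, 2))  # ~0.96: stuck in the shallow local valley, not the global minimum
```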

00:16:39.730 --> 00:16:42.379
Just wandering around in the dark. Yeah. Though

00:16:42.379 --> 00:16:45.220
it is worth noting that Yann LeCun actually argues

00:16:45.220 --> 00:16:48.039
that in highly complex, high-dimensional practical

00:16:48.039 --> 00:16:50.860
problems, getting stuck in a local minimum isn't

00:16:50.860 --> 00:16:54.139
really a catastrophic drawback. The good enough

00:16:54.139 --> 00:16:56.720
valley is usually perfectly fine for the AI to

00:16:56.720 --> 00:16:59.340
function effectively in the real world. But wait,

00:16:59.919 --> 00:17:01.799
knowing the slope of that mountain implies the

00:17:01.799 --> 00:17:05.240
mountain is perfectly smooth, right? If we use

00:17:05.240 --> 00:17:07.599
activation functions that create sharp, jagged

00:17:07.599 --> 00:17:10.279
cliffs in the mathematics, doesn't the entire

00:17:10.279 --> 00:17:13.109
backpropagation formula just fall apart? Theoretically,

00:17:13.390 --> 00:17:15.990
yes, absolutely. Backpropagation strictly requires

00:17:15.990 --> 00:17:18.089
that the derivatives of the activation functions

00:17:18.089 --> 00:17:20.690
inside the neurons be known and continuous. Which

00:17:20.690 --> 00:17:22.789
leads to a truly fascinating contradiction in

00:17:22.789 --> 00:17:25.309
the text. Because we just established that researchers

00:17:25.309 --> 00:17:27.470
in the 70s were stalled because backprop needs

00:17:27.470 --> 00:17:29.809
smooth, differentiable math. It needs curves.

00:17:29.849 --> 00:17:32.109
Right. But the text points out that in the modern

00:17:32.109 --> 00:17:34.349
era, one of the most popular activation functions

00:17:34.349 --> 00:17:36.750
used in massive networks, like the famous AlexNet,

00:17:37.390 --> 00:17:41.130
is something called ReLU, the rectified linear unit. And ReLU

00:17:41.130 --> 00:17:43.410
is completely non-differentiable at zero. Yeah,

00:17:43.490 --> 00:17:46.069
it literally has a sharp jagged point on the

00:17:46.069 --> 00:17:48.769
graph. And why does a sharp point break calculus?

00:17:49.450 --> 00:17:51.549
Because a derivative is essentially finding the

00:17:51.549 --> 00:17:54.769
tangent line to a curve. On a smooth curve, there

00:17:54.769 --> 00:17:57.690
is only one way to balance a tangent line. But

00:17:57.690 --> 00:18:00.529
on a sharp pointy corner, you can balance a line

00:18:00.529 --> 00:18:03.869
in infinite ways, meaning there is no single

00:18:03.869 --> 00:18:06.819
defined slope. The blindfolded AI suddenly has

00:18:06.819 --> 00:18:10.119
no idea which way is down. Exactly. ReLU completely

00:18:10.119 --> 00:18:12.799
breaks the foundational rule of backpropagation.
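A brief sketch of that workaround (the convention shown is an assumption about typical practice, not something the source spells out): code simply assigns the kink at zero a slope of 0 and moves on.

```python
def relu(x):
    return max(0.0, x)  # flat at 0 for negatives, a straight ramp for positives

def relu_grad(x):
    # Mathematically undefined exactly at x == 0 -- the sharp corner --
    # but in practice frameworks just pick a value (commonly 0), and
    # floating-point inputs almost never land on exactly 0.0 anyway.
    return 1.0 if x > 0 else 0.0

for x in [-1.5, 0.0, 2.0]:
    print(x, relu(x), relu_grad(x))
```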

00:18:13.380 --> 00:18:16.180
And yet, everyone uses it, and it works incredibly

00:18:16.180 --> 00:18:18.640
well. It is one of the great ironies of modern

00:18:18.640 --> 00:18:21.279
deep learning. Theoretically, ReLU should cause

00:18:21.279 --> 00:18:24.000
the algorithm to crash the exact moment the network's

00:18:24.000 --> 00:18:26.869
value hits exactly zero. But in practice, with

00:18:26.869 --> 00:18:28.890
millions of data points and floating-point math

00:18:28.890 --> 00:18:31.049
running on physical computers, hitting absolute

00:18:31.049 --> 00:18:33.670
perfect mathematical zero is just so incredibly

00:18:33.670 --> 00:18:35.849
rare that the system just hums along. So they

00:18:35.849 --> 00:18:37.710
just ignore the math rule? Yeah. It's a case

00:18:37.710 --> 00:18:40.069
where practical engineering simply ignores the

00:18:40.069 --> 00:18:42.089
theoretical math constraint because the real

00:18:42.089 --> 00:18:43.970
world results are just too good to pass up.

00:18:44.190 --> 00:18:47.250
Wow. So researchers had this incredibly powerful

00:18:47.250 --> 00:18:50.150
algorithm, and they finally figured out the practical

00:18:50.150 --> 00:18:52.630
engineering to make it work. But the dominance

00:18:52.630 --> 00:18:55.109
of backprop wasn't just a straight line from

00:18:55.109 --> 00:18:58.650
1986 to today, was it? No, not at all, because

00:18:58.650 --> 00:19:00.910
the source mentions the algorithm actually fell

00:19:00.910 --> 00:19:03.789
completely out of favor in the 2000s during a

00:19:03.789 --> 00:19:05.950
period known as the AI winter. Yeah, they hit

00:19:05.950 --> 00:19:08.630
a really hard wall with computational power.

00:19:09.150 --> 00:19:10.990
The networks that researchers wanted to build

00:19:10.990 --> 00:19:13.950
were just becoming too deep and too complex. Standard

00:19:13.950 --> 00:19:16.990
computer processors, CPUs, they just couldn't

00:19:16.990 --> 00:19:19.450
crunch those matrix multiplications fast enough.

00:19:19.569 --> 00:19:21.849
Even with the backpropagation shortcut. Even

00:19:21.849 --> 00:19:24.089
with the shortcut. So researchers have the perfect

00:19:24.089 --> 00:19:25.690
algorithm, but we're essentially trying to run

00:19:25.690 --> 00:19:28.369
it on... Like a pocket calculator? Essentially,

00:19:28.589 --> 00:19:30.490
yes. How did they finally get the horsepower

00:19:30.490 --> 00:19:33.289
they needed to thaw out the AI winter? The breakthrough

00:19:33.289 --> 00:19:36.289
actually came from a completely unrelated industry.

00:19:36.789 --> 00:19:40.269
Video games. Really? Video games. Yep. In the

00:19:40.269 --> 00:19:44.130
2010s, researchers realized that GPUs, graphics

00:19:44.130 --> 00:19:47.250
processing units, were structurally very different

00:19:47.250 --> 00:19:50.269
from standard CPUs. A traditional CPU is like

00:19:50.269 --> 00:19:53.309
a brilliant math professor. It can solve incredibly

00:19:53.309 --> 00:19:55.869
complex problems, but it solves them one at a

00:19:55.869 --> 00:19:59.029
time, sequentially. Which is terrible for backpropagation,

00:19:59.230 --> 00:20:01.470
because you need to update millions of weights

00:20:01.470 --> 00:20:05.009
across a whole network. Exactly. Now, a GPU,

00:20:05.029 --> 00:20:07.529
on the other hand, is designed to render millions

00:20:07.529 --> 00:20:10.519
of individual pixels for a video game simultaneously.

00:20:10.640 --> 00:20:12.359
Oh, I see where this is going. Structurally,

00:20:12.500 --> 00:20:15.160
a GPU is like an army of elementary school students.

00:20:15.460 --> 00:20:17.599
They can't do complex calculus, but they can

00:20:17.599 --> 00:20:20.019
perform thousands of simple arithmetic problems

00:20:20.019 --> 00:20:22.759
at the exact same time. This is called parallel

00:20:22.759 --> 00:20:25.099
processing. And since we established earlier

00:20:25.099 --> 00:20:27.400
that backpropagation is mostly just multiplying

00:20:27.400 --> 00:20:30.440
vectors by matrices, which is really just thousands

00:20:30.440 --> 00:20:33.390
of simple multiplications. The GPU is the perfect

00:20:33.390 --> 00:20:35.109
tool. It's the absolute perfect tool. If you

00:20:35.109 --> 00:20:37.589
were playing a high end video game in 2012, the

00:20:37.589 --> 00:20:39.309
graphics card humming inside your computer was

00:20:39.309 --> 00:20:42.430
exactly the piece of hardware that broke AI from

00:20:42.430 --> 00:20:45.670
its hibernation. That is wild. Cheap, powerful

00:20:45.670 --> 00:20:48.670
GPUs suddenly made it physically possible to

00:20:48.670 --> 00:20:51.029
train networks with dozens or hundreds of layers

00:20:51.029 --> 00:20:54.210
using vast amounts of data. And that hardware

00:20:54.210 --> 00:20:56.829
evolution ushered in the deep learning revolution

00:20:56.829 --> 00:20:59.490
we are living in right now, powering everything

00:20:59.490 --> 00:21:02.410
from large language models to machine vision.

00:21:02.769 --> 00:21:04.470
And that hardware evolution is still continuing,

00:21:04.730 --> 00:21:06.849
right? Oh, very much so. The source notes that

00:21:06.849 --> 00:21:10.890
in 2023, a team at Stanford University actually

00:21:10.890 --> 00:21:14.109
implemented a back propagation algorithm on a

00:21:14.109 --> 00:21:17.880
photonic processor. Yes. Photonic, meaning using

00:21:17.880 --> 00:21:20.480
light instead of electricity. Using lasers and

00:21:20.480 --> 00:21:22.619
optics to compute the gradients. That sounds like

00:21:22.619 --> 00:21:25.680
sci-fi. It does. Photons have zero electrical

00:21:25.680 --> 00:21:27.599
resistance, and they move at the speed of light,

00:21:27.759 --> 00:21:29.880
which promises to be exponentially faster and

00:21:29.880 --> 00:21:31.980
more energy efficient than traditional silicon

00:21:31.980 --> 00:21:35.039
chips. The core mathematical algorithm remains

00:21:35.039 --> 00:21:37.339
exactly the same. It's the physical medium we

00:21:37.339 --> 00:21:39.240
use to run it that's fundamentally changing.

00:21:39.579 --> 00:21:41.960
So what does this all mean? We started out looking

00:21:41.960 --> 00:21:44.720
at this mysterious black box of AI, and what

00:21:44.720 --> 00:21:47.859
we found inside was really just an elegant optimization

00:21:47.859 --> 00:21:51.359
problem. We found a blindfolded dart thrower,

00:21:51.579 --> 00:21:53.839
feeling their way down a rugged mountain of error

00:21:53.839 --> 00:21:56.640
using gradient descent. We found a mathematical

00:21:56.640 --> 00:22:00.200
shortcut. Exactly. The chain rule, first scribbled

00:22:00.200 --> 00:22:03.619
down by Leibniz in 1676, combined with an attempt

00:22:03.619 --> 00:22:06.420
to map Freudian psychic energy, and it's running

00:22:06.420 --> 00:22:08.859
on hardware originally designed to render video

00:22:08.859 --> 00:22:11.519
game graphics. It's just a stunning testament

00:22:11.519 --> 00:22:14.380
to how human knowledge builds on itself, even

00:22:14.380 --> 00:22:16.400
when it takes a few wrong turns along the way.

00:22:16.900 --> 00:22:18.960
Absolutely. But before we wrap up, the source

00:22:18.960 --> 00:22:21.579
material mentions a connection to cognitive science

00:22:21.579 --> 00:22:24.680
that kind of completely flips this entire narrative

00:22:24.680 --> 00:22:26.619
on its head. It really does. You know, we spent

00:22:26.619 --> 00:22:28.500
a lot of time discussing how early researchers

00:22:28.500 --> 00:22:31.019
delayed the adoption of backpropagation because

00:22:31.019 --> 00:22:33.599
they firmly believed artificial neurons had to

00:22:33.599 --> 00:22:36.720
perfectly mimic biological brain cells, which

00:22:36.720 --> 00:22:38.779
fire in those discrete steps. Right, the light

00:22:38.779 --> 00:22:41.400
switch problem. But recently, the dynamic has

00:22:41.400 --> 00:22:44.519
reversed. Cognitive scientists are now suggesting

00:22:44.519 --> 00:22:47.539
that error backpropagation might actually explain

00:22:47.539 --> 00:22:50.359
human brain event -related potentials. What does

00:22:50.359 --> 00:22:53.240
that mean? Specifically, it's referring to brain

00:22:53.240 --> 00:22:56.279
wave patterns known as the N400 and P600. Wait,

00:22:56.640 --> 00:22:58.619
wait. Scientists are looking at how our artificial

00:22:58.619 --> 00:23:01.500
networks learn and wondering if our human brains

00:23:01.500 --> 00:23:03.859
are actually using the exact same mathematical

00:23:03.859 --> 00:23:07.079
algorithm. Yes. And it raises a really profound

00:23:07.079 --> 00:23:10.069
question for anyone listening to this. If artificial

00:23:10.069 --> 00:23:13.069
networks learn so powerfully by propagating error

00:23:13.069 --> 00:23:15.509
signals backward to adjust their internal weights,

00:23:16.250 --> 00:23:19.049
and our physical brainwaves show similar measurement

00:23:19.049 --> 00:23:20.990
patterns during language learning and prediction

00:23:20.990 --> 00:23:24.549
tasks, it forces us to reconsider our own cognition.

00:23:24.710 --> 00:23:27.529
It is intense. Are we humans, at a fundamental

00:23:27.529 --> 00:23:30.809
level, just biological algorithms? Are you constantly

00:23:30.809 --> 00:23:33.410
calculating your own gradient of error every

00:23:33.410 --> 00:23:35.890
single time you make a mistake, propagating that

00:23:35.890 --> 00:23:38.190
failure backward through your biological neurons

00:23:38.029 --> 00:23:40.529
to ensure you do better next time? The brain

00:23:40.529 --> 00:23:43.029
might not be a black box either. It might just

00:23:43.029 --> 00:23:45.509
be adjusting its weights, trying to get the dart

00:23:45.509 --> 00:23:47.809
a little closer to the bullseye. Something to

00:23:47.809 --> 00:23:50.049
think about the next time you try, fail, and

00:23:50.049 --> 00:23:50.710
learn something new.
