WEBVTT

00:00:00.000 --> 00:00:03.459
Imagine you are standing in the middle of this

00:00:03.459 --> 00:00:06.599
expansive, totally rugged mountain range. There's

00:00:06.599 --> 00:00:09.900
zero moonlight. I mean, it is just pitch black.

00:00:10.099 --> 00:00:13.419
And then this fog rolls in. Exactly. It's incredibly

00:00:13.419 --> 00:00:15.820
thick fog. You can't even see your own hand in

00:00:15.820 --> 00:00:18.579
front of your face, let alone the trail. But

00:00:18.579 --> 00:00:21.920
your survival depends entirely on finding the

00:00:21.920 --> 00:00:24.579
absolute lowest point in the valley, the global

00:00:24.579 --> 00:00:27.559
minimum, before you freeze. Which is a terrifying

00:00:27.559 --> 00:00:31.149
situation to be in. Right. You have no map, no

00:00:31.149 --> 00:00:33.390
compass, no light. Literally the only tool you

00:00:33.390 --> 00:00:35.149
have is your sense of touch, just feeling the

00:00:35.149 --> 00:00:36.390
ground right underneath your boots to figure

00:00:36.390 --> 00:00:38.590
out which way is downhill. Yeah. I mean, if you're

00:00:38.590 --> 00:00:42.409
a hiker, that is a pure nightmare scenario. But

00:00:42.409 --> 00:00:44.950
if you are a mathematician or, you know, a computer

00:00:44.950 --> 00:00:47.710
scientist, it is actually the perfect framing

00:00:47.710 --> 00:00:49.869
for one of the most elegant problems in existence.

00:00:49.890 --> 00:00:52.570
It really is. Because that exact process, just

00:00:52.570 --> 00:00:54.670
feeling the slope in the dark and taking a step

00:00:54.670 --> 00:00:57.270
downward, that is the fundamental essence of

00:00:57.270 --> 00:00:59.890
how we train machines to quote-unquote think,

00:01:00.530 --> 00:01:02.829
it's, well, it's the bedrock of modern technology.

00:01:03.429 --> 00:01:06.829
Welcome to today's Deep Dive. I am so thrilled

00:01:06.829 --> 00:01:09.170
to be jumping into this with you, our listener,

00:01:09.730 --> 00:01:12.790
because today we have a very specific mission.

00:01:13.150 --> 00:01:16.409
We really do. We are exploring the invisible

00:01:16.409 --> 00:01:20.030
engine that's powering, well, almost every piece

00:01:20.030 --> 00:01:21.909
of artificial intelligence in the world right

00:01:21.909 --> 00:01:24.530
now, which is gradient descent. Yeah, gradient

00:01:24.530 --> 00:01:27.290
descent. Now, we are basing today's journey on

00:01:27.290 --> 00:01:32.200
this really dense... mathematically intense but

00:01:32.200 --> 00:01:34.519
utterly vital Wikipedia article on the subject.

00:01:34.980 --> 00:01:36.879
But don't worry, I know it sounds like intimidating

00:01:36.879 --> 00:01:40.099
calculus, but our mission today is to turn that

00:01:40.099 --> 00:01:42.359
math into pure intuitive aha moments for you.

00:01:42.620 --> 00:01:44.980
We're gonna decode those matrices and Greek letters

00:01:44.980 --> 00:01:47.280
into concepts you can actually visualize. And

00:01:47.280 --> 00:01:49.480
we really need to do that because while the terminology

00:01:49.480 --> 00:01:53.079
can absolutely wall off the casual learner, the

00:01:53.079 --> 00:01:55.400
underlying logic is surprisingly intuitive. Totally.

00:01:55.609 --> 00:01:57.909
This algorithm is essentially a 19th century

00:01:57.909 --> 00:02:00.189
mathematical engine that we just strapped to

00:02:00.189 --> 00:02:02.549
21st century supercomputers. I mean, without

00:02:02.549 --> 00:02:04.430
it, the AI revolution simply does not happen.

00:02:04.750 --> 00:02:07.230
OK, let's unpack this. Before we get into how

00:02:07.230 --> 00:02:10.509
a modern AI chatbot uses this math to talk to

00:02:10.509 --> 00:02:13.550
you, we need to establish exactly what this algorithm

00:02:13.550 --> 00:02:17.250
is actually trying to do and where it came from.

00:02:17.270 --> 00:02:20.349
Right. Fundamentally, gradient descent is a first

00:02:20.349 --> 00:02:24.110
order iterative algorithm for unconstrained mathematical

00:02:24.110 --> 00:02:27.460
optimization. Which sounds like a mouthful. Yeah.

00:02:27.699 --> 00:02:30.340
In plain English, it's just a way to find the

00:02:30.340 --> 00:02:32.180
absolute minimum of a mathematical function.

00:02:32.259 --> 00:02:34.159
You have this landscape of numbers and you just

00:02:34.159 --> 00:02:36.060
want to find the bottom of the deepest valley.
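
NOTE
The whole recipe in a few lines of Python. A minimal sketch, assuming a
toy one-variable function f(x) = (x - 3)^2 whose true minimum sits at x = 3:
    def grad(x):
        return 2.0 * (x - 3.0)       # slope of f(x) = (x - 3)^2
    x = 10.0                         # start somewhere on the "mountain"
    eta = 0.1                        # step size (the episode returns to this)
    for _ in range(50):
        x = x - eta * grad(x)        # feel the slope, step downhill
    print(x)                         # ends very close to the minimum, 3.0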

00:02:36.400 --> 00:02:38.000
Exactly. And just for the record, if you want

00:02:38.000 --> 00:02:39.780
to find the maximum, like if you're climbing

00:02:39.780 --> 00:02:42.539
to the highest peak of that foggy mountain, it's

00:02:42.539 --> 00:02:44.219
just called gradient ascent. Precisely. You're

00:02:44.219 --> 00:02:46.159
just trying to optimize something, whether that's

00:02:46.159 --> 00:02:48.840
up or down. And historically, the credit for

00:02:48.840 --> 00:02:51.020
this foundational idea goes all the way back

00:02:51.020 --> 00:02:53.520
to a French mathematician named Augustin Louis

00:02:53.520 --> 00:02:57.240
Cauchy. And this was in 1847. 1847, that's wild.

00:02:57.400 --> 00:02:59.900
Right, imagine that, a guy with a quill pen in

00:02:59.900 --> 00:03:02.580
Paris decades before the invention of the light

00:03:02.580 --> 00:03:05.000
bulb, sketching out the exact mathematical logic

00:03:05.000 --> 00:03:07.560
that would eventually train generative AI. That

00:03:07.560 --> 00:03:10.039
is crazy to think about. Yeah. And later on,

00:03:10.240 --> 00:03:12.520
Jacques Hadamard independently proposed a really

00:03:12.520 --> 00:03:15.620
similar method around 1907. And then Haskell

00:03:15.620 --> 00:03:18.699
Curry really dug into its properties for complex

00:03:18.699 --> 00:03:23.080
nonlinear problems in 1944. So we are talking

00:03:23.080 --> 00:03:25.539
about mathematics that significantly pre-dates

00:03:25.539 --> 00:03:28.180
the very first digital computers. Oh, absolutely.

00:03:28.400 --> 00:03:30.969
By a long shot. OK. So. The mountain analogy

00:03:30.969 --> 00:03:33.569
is great for the overall concept. But to make

00:03:33.569 --> 00:03:35.930
this visceral for you, the listener, I want to

00:03:35.930 --> 00:03:38.090
bring it indoors and kind of scale it down a

00:03:38.090 --> 00:03:40.150
bit. OK, I like where this is going. If I am

00:03:40.150 --> 00:03:42.729
understanding this core concept, it's basically

00:03:42.729 --> 00:03:45.449
like trying to find the drain in a massive, totally

00:03:45.449 --> 00:03:47.889
dry bathtub while wearing a blindfold. Yeah,

00:03:47.909 --> 00:03:49.800
that's a great way to picture it. You don't know

00:03:49.800 --> 00:03:52.800
where the drain is. So you just put your fingertips

00:03:52.800 --> 00:03:54.460
on the porcelain, you feel which way the surface

00:03:54.460 --> 00:03:56.759
is sloping away from you, and you move your hand

00:03:56.759 --> 00:03:59.099
in the direction of the steepest descent. You

00:03:59.099 --> 00:04:01.460
just keep following that downward curve until

00:04:01.460 --> 00:04:03.639
your fingertips hit the drain. Is that a fair

00:04:03.639 --> 00:04:05.960
way to think about it? The bathtub analogy works

00:04:05.960 --> 00:04:08.939
perfectly for the geometry, yes. But what's fascinating

00:04:08.939 --> 00:04:11.340
here is that it completely misses the element

00:04:11.340 --> 00:04:15.400
of time. Oh, really? Time? Yeah. And time is

00:04:15.400 --> 00:04:17.959
a crucial constraint in the math. Think about

00:04:17.959 --> 00:04:20.980
it. In a real bathtub, you feel the slope of

00:04:20.980 --> 00:04:24.279
the porcelain instantly, right? Your brain processes

00:04:24.279 --> 00:04:27.600
the curve in a millisecond. Right. But in mathematics,

00:04:28.019 --> 00:04:30.920
measuring that steepness takes massive computational

00:04:30.920 --> 00:04:33.660
effort. Wait, really? Because finding a slope

00:04:33.660 --> 00:04:36.699
sounds like... basic algebra to me. Yeah. Like,

00:04:36.879 --> 00:04:38.660
rise over run, right? Well, if it's a perfectly

00:04:38.660 --> 00:04:41.300
straight line, sure. But we are dealing with

00:04:41.300 --> 00:04:43.980
really complex curves and multiple dimensions

00:04:43.980 --> 00:04:46.899
here. To find the slope at a specific point,

00:04:47.220 --> 00:04:49.399
the computer has to perform differentiation.
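
NOTE
One way to make that cost concrete: a brute-force numerical slope check.
A minimal sketch; real systems use calculus (backpropagation) rather than
finite differences, but the bookkeeping burden is the point. For n variables
this needs 2n extra function evaluations per "touch of the ground":
    def numerical_gradient(f, x, h=1e-6):
        g = []
        for i in range(len(x)):
            up = list(x); up[i] += h
            down = list(x); down[i] -= h
            g.append((f(up) - f(down)) / (2.0 * h))
        return g
    f = lambda v: v[0] ** 2 + 3.0 * v[1] ** 2   # assumed toy function
    print(numerical_gradient(f, [1.0, 2.0]))    # roughly [2.0, 12.0]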

00:04:49.779 --> 00:04:51.879
OK, differentiation. Yeah, it has to calculate

00:04:51.879 --> 00:04:54.180
the gradient of the function. It's effectively

00:04:54.180 --> 00:04:57.019
using a highly sophisticated mathematical instrument

00:04:57.019 --> 00:04:59.579
every single time it touches the ground. Oh,

00:04:59.579 --> 00:05:02.399
wow. So if you are on that foggy mountain and

00:05:02.399 --> 00:05:04.339
you stop to take out your surveying instruments

00:05:04.339 --> 00:05:07.519
to calculate the exact slope of the ground every

00:05:07.519 --> 00:05:09.399
single inch you walk. You'll never get down.

00:05:09.680 --> 00:05:12.139
Exactly. You will never ever make it to the bottom

00:05:12.139 --> 00:05:14.279
of the mountain before sunset. You will freeze.

00:05:14.740 --> 00:05:18.160
Ah, I see. you literally just run out of time

00:05:18.160 --> 00:05:21.779
and compute power. So the difficulty isn't just

00:05:21.779 --> 00:05:24.240
knowing which way is downhill. The difficulty

00:05:24.240 --> 00:05:26.720
is choosing how often you should stop to measure

00:05:26.720 --> 00:05:29.720
the steepness so you don't go off track while

00:05:29.720 --> 00:05:32.120
still actually making meaningful progress down

00:05:32.120 --> 00:05:34.699
the mountain. Exactly. And that seamlessly introduces

00:05:34.699 --> 00:05:37.339
the next massive problem that mathematicians

00:05:37.339 --> 00:05:39.879
had to solve. Which is? Once you calculate that

00:05:39.879 --> 00:05:42.339
slope and you know which direction is downhill,

00:05:42.920 --> 00:05:45.160
How far do you walk before you stop to measure

00:05:45.160 --> 00:05:48.819
again? Right. This is the step size. In the math,

00:05:49.160 --> 00:05:50.959
the article says it's represented by the Greek

00:05:50.959 --> 00:05:54.160
letter eta, or what computer scientists now call

00:05:54.160 --> 00:05:56.639
the learning rate. Yes, the learning rate. And

00:05:56.639 --> 00:05:58.519
this feels like a classic Goldilocks problem,

00:05:58.759 --> 00:06:01.480
right? It is entirely a delicate, often frustrating

00:06:01.480 --> 00:06:04.100
balancing act. If your step size, your eta, is

00:06:04.100 --> 00:06:06.220
too small, your convergence to the minimum is

00:06:06.220 --> 00:06:08.579
just agonizingly slow. Because you're taking

00:06:08.579 --> 00:06:10.779
millimeter-long steps down a massive mountain.
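
NOTE
A minimal sketch of the step-size dilemma on an assumed toy function
f(x) = x^2 (slope 2x), including the too-large case discussed next:
    def run(eta, steps=20):
        x = 10.0
        for _ in range(steps):
            x = x - eta * (2.0 * x)
        return x
    print(run(0.001))   # too small: barely moves, still near 10
    print(run(0.4))     # about right: lands essentially at the minimum, 0
    print(run(1.1))     # too large: every step overshoots, and x blows up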

00:06:10.920 --> 00:06:13.819
Exactly, you waste all your computational power

00:06:13.819 --> 00:06:16.829
just recalculating the slope constantly. But

00:06:16.829 --> 00:06:19.389
on the flip side, if your step size is too large,

00:06:19.769 --> 00:06:22.110
you run into an even bigger problem, which is

00:06:22.110 --> 00:06:24.790
overshoot. Overshoot. Meaning, like, you take

00:06:24.790 --> 00:06:27.449
such a massive confident leap that you completely

00:06:27.449 --> 00:06:30.069
jump over the valley and land higher up on the

00:06:30.069 --> 00:06:32.490
opposite mountain peak. Yes. And that leads to

00:06:32.490 --> 00:06:34.629
divergence. Divergence, right. And instead of

00:06:34.629 --> 00:06:37.009
getting closer to the minimum, your algorithm

00:06:37.009 --> 00:06:40.189
bounces wildly out of control, getting further

00:06:40.189 --> 00:06:42.329
and further away from the solution with every

00:06:42.329 --> 00:06:45.910
giant leap. So finding a good safe setting for

00:06:45.910 --> 00:06:49.050
that step size is one of the major practical

00:06:49.050 --> 00:06:51.569
problems in optimization. So what does this all

00:06:51.569 --> 00:06:53.889
mean? I want to challenge a piece of the history

00:06:53.889 --> 00:06:55.629
here, because I'm trying to put myself in the

00:06:55.629 --> 00:06:57.610
listener's shoes. Sure, go for it. Back in the

00:06:57.610 --> 00:07:00.129
mid 20th century, a mathematician named Philip

00:07:00.129 --> 00:07:03.069
Wolfe advocated for using quote-unquote clever

00:07:03.069 --> 00:07:06.620
choices of direction to fix this step size problem.

00:07:06.959 --> 00:07:09.399
Yeah, the Wolfe conditions. Right, but if taking

00:07:09.399 --> 00:07:12.000
a step straight down like the absolute negative

00:07:12.000 --> 00:07:14.759
gradient is the mathematical definition of the

00:07:14.759 --> 00:07:18.259
steepest descent, why did Wolfe say we need a

00:07:18.259 --> 00:07:21.600
clever direction that deviates from that? Isn't

00:07:21.600 --> 00:07:24.060
straight down literally always the fastest way

00:07:24.060 --> 00:07:26.290
to the bottom? I mean, it seems like it should

00:07:26.290 --> 00:07:28.410
be, intuitively, if I want to get down the mountain,

00:07:28.529 --> 00:07:30.730
I just point my boots down the steepest possible

00:07:30.730 --> 00:07:33.189
angle. Yeah, exactly. But mathematically, taking

00:07:33.189 --> 00:07:35.750
the absolute steepest path right this second

00:07:35.750 --> 00:07:39.069
might be a terrible trap. A trap? How so? It

00:07:39.069 --> 00:07:41.810
might lead you directly to a sheer cliff

00:07:41.810 --> 00:07:44.029
drop-off. And that forces you to stop, turn around,

00:07:44.129 --> 00:07:46.649
and take tiny, inefficient steps later just to

00:07:46.649 --> 00:07:49.350
recover. Oh, wow. Okay, so the greediest immediate

00:07:49.350 --> 00:07:51.790
choice isn't always the best long-term strategy.
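
NOTE
A rough sketch of the kind of test this leads to. This simplified version
checks only "sufficient decrease" (the Armijo condition, one half of the
full Wolfe conditions) on an assumed toy function f(x) = x^4:
    def f(x): return x ** 4
    def grad(x): return 4.0 * x ** 3
    def cautious_step(x, eta=1.0, c=1e-4, shrink=0.5):
        g = grad(x)
        while f(x - eta * g) > f(x) - c * eta * g * g:  # not enough progress?
            eta *= shrink                               # take a gentler step
        return x - eta * g
    x = 3.0
    for _ in range(30):
        x = cautious_step(x)
    print(x)   # inches toward the minimum at 0 without ever blowing up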

00:07:52.050 --> 00:07:54.740
Precisely. It's a trade-off. A slightly gentler

00:07:54.740 --> 00:07:58.040
slope, one that doesn't point perfectly straight

00:07:58.040 --> 00:08:00.819
down, might actually be compensated for by being

00:08:00.819 --> 00:08:02.879
sustained over a much longer distance. Oh, I

00:08:02.879 --> 00:08:05.339
get it. Right, you might be able to take a massive

00:08:05.339 --> 00:08:08.680
fast stride on a gentle slope safely, whereas

00:08:08.680 --> 00:08:11.060
the steepest slope might require you to just

00:08:11.060 --> 00:08:14.459
inch forward carefully. Wolfe and others developed

00:08:14.459 --> 00:08:16.959
conditions, mathematical inequalities that

00:08:16.959 --> 00:08:19.300
balance the angle of descent against how quickly

00:08:19.300 --> 00:08:21.680
the landscape itself is changing. But wait, you

00:08:21.680 --> 00:08:24.259
just said that checking how the landscape changes

00:08:24.259 --> 00:08:27.040
takes massive computational effort. How do you

00:08:27.040 --> 00:08:28.899
know a cliff is coming if you're blindfolded?

00:08:29.160 --> 00:08:31.939
Well, that is the genius of it. Mathematicians

00:08:31.939 --> 00:08:34.419
found clever workarounds to estimate the terrain

00:08:34.419 --> 00:08:36.919
without perfectly calculating it. For example,

00:08:37.000 --> 00:08:38.940
they used something called a Lipschitz constant.

00:08:39.179 --> 00:08:41.500
Okay, let's translate that. What is a Lipschitz

00:08:41.500 --> 00:08:44.100
constant? Think of it as a cosmic speed limit

00:08:44.100 --> 00:08:47.120
for the terrain. A speed limit? Yeah. It is a

00:08:47.120 --> 00:08:49.200
mathematical guarantee that the slope of the

00:08:49.200 --> 00:08:51.759
mountain cannot suddenly drop off into a sheer

00:08:51.759 --> 00:08:54.360
cliff without warning. It essentially bounds

00:08:54.360 --> 00:08:56.879
how fast the gradient can possibly change. Oh,

00:08:56.879 --> 00:08:58.899
that makes total sense. Right. If you know that

00:08:58.899 --> 00:09:01.580
speed limit, you can calculate exactly how far

00:09:01.580 --> 00:09:04.659
it is mathematically safe to step without accidentally

00:09:04.659 --> 00:09:06.980
jumping over the drain in the bathtub. That is

00:09:06.980 --> 00:09:09.710
brilliant. Okay, so we've figured out our step

00:09:09.710 --> 00:09:11.870
size. We have our cosmic speed limits telling

00:09:11.870 --> 00:09:15.309
us how far to stride safely. But what happens

00:09:15.309 --> 00:09:18.230
if the landscape itself is inherently hostile?

00:09:18.669 --> 00:09:21.389
Like this moves us from the theory of the step

00:09:21.389 --> 00:09:23.769
to the actual geometric shape of the terrain.
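
NOTE
Before the terrain talk, the "speed limit" in code form. A minimal sketch
assuming the quadratic f(x, y) = 0.5*(a*x^2 + b*y^2), whose gradient can
change no faster than L = max(a, b); the classic safe stride is eta = 1/L:
    a, b = 1.0, 25.0
    L = max(a, b)                  # the terrain's Lipschitz "speed limit"
    eta = 1.0 / L                  # a provably safe step for this landscape
    x, y = 8.0, 8.0
    for _ in range(200):
        x, y = x - eta * (a * x), y - eta * (b * y)
    print(x, y)                    # steadily approaches the minimum (0, 0)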

00:09:23.960 --> 00:09:26.500
And the geometry can absolutely break this algorithm

00:09:26.500 --> 00:09:28.340
if you aren't careful. Yeah, let's talk about

00:09:28.340 --> 00:09:30.720
the zigzag trap. The article mentions that when

00:09:30.720 --> 00:09:33.700
you apply gradient descent to systems of linear

00:09:33.700 --> 00:09:35.840
equations, which mathematicians formulate as

00:09:35.840 --> 00:09:39.379
a quadratic minimization problem, the algorithm

00:09:39.379 --> 00:09:41.600
really struggles with terrain that is shaped

00:09:41.600 --> 00:09:45.519
like an elongated narrow bowl or like a narrow

00:09:45.519 --> 00:09:48.679
ravine. Yes. What does an elongated ravine actually

00:09:48.679 --> 00:09:51.940
do to our poor blindfolded hiker? Well, picture

00:09:51.940 --> 00:09:54.919
walking into a long, incredibly narrow canyon.

00:09:55.200 --> 00:09:57.419
You have a very steep drop on your left and your

00:09:57.419 --> 00:10:00.120
right, but a very gentle, almost flat slope down

00:10:00.120 --> 00:10:02.200
the center of the valley toward your actual destination.

00:10:02.440 --> 00:10:04.500
Okay, I'm picturing it. Now remember the core

00:10:04.500 --> 00:10:07.059
rule of our blindfolded hiker. They only feel

00:10:07.059 --> 00:10:09.360
the ground directly under their boots, and they

00:10:09.360 --> 00:10:11.840
step in the steepest immediate direction. They

00:10:11.840 --> 00:10:14.080
do not look down the length of the valley. So

00:10:14.080 --> 00:10:16.460
if they are standing on the sidewall of the ravine,

00:10:16.799 --> 00:10:19.399
the steepest way down isn't forward toward the

00:10:19.399 --> 00:10:21.980
destination. It's sideways, straight down the

00:10:21.980 --> 00:10:24.299
wall. Right. You take a step down the steep wall.

00:10:24.759 --> 00:10:27.460
But because the valley is narrow and your step

00:10:27.460 --> 00:10:30.139
size is fixed, you slightly overshoot the center

00:10:30.139 --> 00:10:32.679
and end up stepping slightly up the opposite

00:10:32.679 --> 00:10:35.580
wall. Oh, no. Yeah. And now you calculate the

00:10:35.580 --> 00:10:38.179
slope again. The steepest direction is now pointing

00:10:38.179 --> 00:10:40.080
right back across the valley where you just came

00:10:40.080 --> 00:10:41.820
from, just slightly further down the canyon.

00:10:42.120 --> 00:10:44.539
Because each step is taken in the steepest direction,

00:10:45.320 --> 00:10:47.559
the vectors become orthogonal to each other.
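
NOTE
The zigzag is easy to reproduce. A minimal sketch on an assumed ravine
f(x, y) = 0.5*x^2 + 50*y^2: x runs down the canyon, y climbs the walls.
    x, y = 10.0, 1.0
    eta = 0.019                    # fixed step, close to the stability limit
    for step in range(6):
        gx, gy = x, 100.0 * y      # gradient components
        x, y = x - eta * gx, y - eta * gy
        print(f"step {step}: x = {x:.3f}, y = {y:+.3f}")
    # y flips sign every step (the ping-pong between the walls), while x,
    # the direction that actually matters, shrinks agonizingly slowly.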

00:10:47.620 --> 00:10:50.580
They hit 90-degree angles. Yes, you step left,

00:10:50.720 --> 00:10:53.019
then you step right, then you step left. It creates

00:10:53.019 --> 00:10:56.220
this highly inefficient characteristic zigzag

00:10:56.220 --> 00:10:59.289
path. That sounds exhausting. It is. You are

00:10:59.289 --> 00:11:01.570
endlessly ping-ponging back and forth across

00:11:01.570 --> 00:11:04.590
the walls of this narrow ravine, making incredibly

00:11:04.590 --> 00:11:07.950
slow, agonizing progress toward the actual minimum

00:11:07.950 --> 00:11:09.929
down the length of the valley. Wait, I'm losing

00:11:09.929 --> 00:11:12.950
the thread here. If this algorithm is so easily

00:11:12.950 --> 00:11:15.570
trapped, bouncing back and forth across a narrow

00:11:15.570 --> 00:11:18.429
ravine, why are we using it at all? Doesn't the

00:11:18.429 --> 00:11:20.509
math community prefer something called the conjugate

00:11:20.509 --> 00:11:22.750
gradient method for these kinds of linear equations?

00:11:23.169 --> 00:11:26.120
It supposedly fixes this exact problem. You are

00:11:26.120 --> 00:11:29.139
absolutely right, and it is a great catch. For

00:11:29.139 --> 00:11:31.980
simple, predictable systems of linear equations,

00:11:32.519 --> 00:11:34.700
the conjugate gradient method is absolutely preferred.

00:11:35.080 --> 00:11:38.360
Okay, so why the focus on gradient descent? Because

00:11:38.360 --> 00:11:40.759
conjugate gradient essentially warps the geometry

00:11:40.759 --> 00:11:43.360
of the space to eliminate the ravine entirely.

00:11:43.740 --> 00:11:46.759
The zigzag makes standard gradient descent practically

00:11:46.759 --> 00:11:49.840
useless for those specific simple problems. But

00:11:49.840 --> 00:11:52.639
this raises an important question. OK. What happens

00:11:52.639 --> 00:11:55.159
when the problems are no longer simple linear

00:11:55.159 --> 00:11:58.519
equations? The real world. Exactly. What happens

00:11:58.519 --> 00:12:00.539
when the equations are highly nonlinear? What

00:12:00.539 --> 00:12:02.919
happens when they are infinitely complex and

00:12:02.919 --> 00:12:05.960
require us to calculate massive Jacobian matrices

00:12:05.960 --> 00:12:08.529
just to figure out where we are? And the Jacobian

00:12:08.529 --> 00:12:11.990
matrix is what exactly? Think of it as the ultimate

00:12:11.990 --> 00:12:15.309
dashboard of all possible slopes. It is a matrix

00:12:15.309 --> 00:12:18.710
of all first order partial derivatives for a

00:12:18.710 --> 00:12:21.409
multivariable function. Got it. It tells you

00:12:21.409 --> 00:12:24.149
how every single variable is changing in relation

00:12:24.149 --> 00:12:26.990
to every other variable simultaneously. In those

00:12:26.990 --> 00:12:30.409
wildly complex scenarios, we cannot easily use

00:12:30.409 --> 00:12:33.090
those alternative cleaner methods like conjugate

00:12:33.090 --> 00:12:36.230
gradients. We are mathematically forced to use

00:12:36.230 --> 00:12:39.649
gradient descent. Wow. Which means the mathematicians

00:12:39.649 --> 00:12:42.549
had to find a way to fix the blindfolded hiker's

00:12:42.549 --> 00:12:45.389
zigzag problem. And to fix the math, they turned

00:12:45.389 --> 00:12:48.590
to physics, which I absolutely love. They introduced

00:12:48.590 --> 00:12:51.509
momentum, specifically the heavy ball method.

00:12:51.629 --> 00:12:53.850
Yes, it completely changed the physical dynamics

00:12:53.850 --> 00:12:56.370
of our analogy. Here's where it gets really interesting,

00:12:56.470 --> 00:12:58.210
because instead of a person blindly stepping,

00:12:58.409 --> 00:13:00.169
stopping, and suddenly snapping 90 degrees to

00:13:00.169 --> 00:13:02.710
the other way, I want you to picture sliding

00:13:02.710 --> 00:13:05.149
a heavy bowling ball down that same mountain.

00:13:05.250 --> 00:13:07.710
Okay, bowling ball. The bowling ball has physical

00:13:07.710 --> 00:13:10.570
mass, and let's say it's moving through a viscous

00:13:10.570 --> 00:13:12.490
medium, like a thick fluid, just to create some

00:13:12.490 --> 00:13:15.529
friction. If a heavy bowling ball rolls down

00:13:15.529 --> 00:13:18.070
the steep wall of that ravine, it doesn't instantly

00:13:18.070 --> 00:13:20.230
snap 90 degrees when it crosses the center line.

00:13:20.360 --> 00:13:23.500
It can't. Its momentum forces it to carry forward.

00:13:23.700 --> 00:13:26.279
Right. But wait, math equations don't have actual

00:13:26.279 --> 00:13:29.820
mass. How do you literally program momentum into

00:13:29.820 --> 00:13:32.019
lines of code? You do it through memory. Memory.

00:13:32.179 --> 00:13:34.100
Yeah. The algorithm is programmed to remember

00:13:34.100 --> 00:13:36.940
its previous update, its last directional vector.

00:13:37.539 --> 00:13:39.460
When it calculates the next step, it doesn't

00:13:39.460 --> 00:13:42.139
just use the current downward slope. It calculates

00:13:42.139 --> 00:13:44.440
a linear combination of the current gradient

00:13:44.440 --> 00:13:47.100
and the momentum from the past step. That's

00:13:47.100 --> 00:13:49.480
so smart. So if the ravine wall is telling it

00:13:49.480 --> 00:13:51.929
to go sharply left, but its momentum is pushing

00:13:51.929 --> 00:13:54.690
it forward down the valley, the resulting vector

00:13:54.690 --> 00:13:57.809
smooths out. It cancels out the ping-ponging

00:13:57.809 --> 00:14:00.669
sideways energy and aggressively pushes the path

00:14:00.669 --> 00:14:02.990
down the center of the valley. The math essentially

00:14:02.990 --> 00:14:06.470
gives the algorithm inertia. It resists sudden

00:14:06.470 --> 00:14:10.000
changes in direction. And this concept of momentum

00:14:10.000 --> 00:14:12.899
was then taken even further by a mathematician

00:14:12.899 --> 00:14:16.279
named Yurii Nesterov with his fast gradient methods,

00:14:16.740 --> 00:14:18.919
which are often called Nesterov acceleration.

00:14:19.340 --> 00:14:21.659
Yes, exactly. But how does Nesterov acceleration

00:14:21.659 --> 00:14:24.080
differ from just rolling the heavy bowling ball?

00:14:24.759 --> 00:14:27.360
Well, Nesterov took that bowling ball idea and

00:14:27.360 --> 00:14:29.679
essentially gave it a superpower, which is the

00:14:29.679 --> 00:14:32.799
ability to slightly look ahead. Look ahead. Yeah,

00:14:33.179 --> 00:14:35.419
with standard momentum, you calculate your slope

00:14:35.419 --> 00:14:37.580
exactly where you were standing, and then you

00:14:37.580 --> 00:14:39.340
add your momentum to figure out where to go.

00:14:39.620 --> 00:14:41.419
With Nesterov, you know your momentum is going

00:14:41.419 --> 00:14:43.620
to carry you forward anyway, right? Right. So

00:14:43.620 --> 00:14:45.779
you project where you are going to be based on

00:14:45.779 --> 00:14:47.960
that momentum, and you calculate the slope at

00:14:47.960 --> 00:14:50.100
that future point. Then you apply the correction.

00:14:50.200 --> 00:14:52.580
Hold on. That is wild. So you are sensing the

00:14:52.580 --> 00:14:54.860
slope, not where your boots are currently standing,

00:14:54.940 --> 00:14:57.580
but where your boots are about to land. Exactly.
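
NOTE
A minimal sketch of the look-ahead, in one common formulation of Nesterov's
method: measure the slope at the projected point, not where you stand.
Same assumed ravine, f(x, y) = 0.5*x^2 + 50*y^2:
    x, y = 10.0, 1.0
    vx, vy = 0.0, 0.0
    eta, beta = 0.009, 0.9
    for _ in range(100):
        lx, ly = x + beta * vx, y + beta * vy  # peek where momentum leads
        gx, gy = lx, 100.0 * ly                # slope at that future point
        vx = beta * vx - eta * gx
        vy = beta * vy - eta * gy
        x, y = x + vx, y + vy
    print(x, y)   # the correction happens before hitting the wall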

00:14:58.080 --> 00:15:00.299
You peek ahead to see if your momentum is about

00:15:00.299 --> 00:15:03.419
to carry you into an uphill wall, and you correct

00:15:03.419 --> 00:15:05.419
your trajectory before you even get there. That

00:15:05.419 --> 00:15:08.340
is incredible. And the mathematical result of

00:15:08.340 --> 00:15:12.139
that look ahead is profound. For smooth, convex

00:15:12.139 --> 00:15:15.419
problems, standard gradient descent reduces the

00:15:15.419 --> 00:15:18.039
error rate at a speed of big O of k to the negative

00:15:18.039 --> 00:15:21.240
1. But Nesterov's acceleration drops that error

00:15:21.240 --> 00:15:24.639
rate to big O of k to the negative 2. Okay, for

00:15:24.639 --> 00:15:26.799
the listener who isn't writing down Big O notation

00:15:26.799 --> 00:15:29.759
on a napkin right now, translate that. How big

00:15:29.759 --> 00:15:32.100
of a real-world difference is that negative

00:15:32.100 --> 00:15:34.860
one to negative two? It is massive. Big O notation

00:15:34.860 --> 00:15:37.340
is just how computer scientists track how quickly

00:15:37.340 --> 00:15:39.899
an algorithm's error shrinks as the steps pile up.
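
NOTE
The two rates side by side, taking error proportional to 1/k versus 1/k^2
at face value (an illustrative toy, not a benchmark):
    for k in [10, 100, 1000]:
        print(f"{k:>4} steps:  1/k = {1/k:.6f}   1/k^2 = {1/k**2:.8f}")
    # after 1000 steps the accelerated rate is 1000 times more accurate,
    # i.e., the same accuracy for a tiny fraction of the compute.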

00:15:40.399 --> 00:15:42.279
Dropping from negative one to negative two means

00:15:42.279 --> 00:15:45.580
the error shrinks with the square of the step count,

00:15:45.580 --> 00:15:48.740
not just the step count itself. And it is the

00:15:48.740 --> 00:15:50.940
mathematically proven absolute optimal speed

00:15:50.940 --> 00:15:53.299
for any first-order optimization method. So

00:15:53.299 --> 00:15:55.059
in real world terms, what does that look like?

00:15:55.279 --> 00:15:57.399
In real world terms, an AI training model that

00:15:57.399 --> 00:15:59.320
used to take three months and millions of dollars

00:15:59.320 --> 00:16:01.340
in supercomputer processing power could suddenly

00:16:01.340 --> 00:16:04.639
be done in a fraction of the time. It is computationally

00:16:04.639 --> 00:16:06.840
life-saving. It literally rescued machine learning

00:16:06.840 --> 00:16:09.039
from a computational bottleneck. Which brings

00:16:09.039 --> 00:16:12.419
us to the grand finale. We have Augustin-Louis

00:16:12.419 --> 00:16:16.200
Cauchy's 1840s calculus. We have the Goldilocks

00:16:16.200 --> 00:16:19.240
step sizes and the cosmic speed limits. We have

00:16:19.240 --> 00:16:22.179
heavy mathematical bowling balls peering into

00:16:22.179 --> 00:16:24.240
the future. All of it coming together. How does

00:16:24.240 --> 00:16:26.919
this all culminate in the technology that you

00:16:26.919 --> 00:16:30.379
and I and the listener use every single day on

00:16:30.379 --> 00:16:33.259
our phones? It all converges on an extension

00:16:33.259 --> 00:16:35.940
of this concept called stochastic gradient descent

00:16:35.940 --> 00:16:39.039
or SGD. Stochastic meaning random. Yes. If we

00:16:39.039 --> 00:16:41.610
connect this to the bigger picture, Think about

00:16:41.610 --> 00:16:44.710
a modern deep neural network, like the architecture

00:16:44.710 --> 00:16:47.789
behind ChatGPT. The landscape it is trying to

00:16:47.789 --> 00:16:49.789
navigate isn't just a three dimensional mountain.

00:16:50.389 --> 00:16:53.289
It is a mathematical space with billions, sometimes

00:16:53.289 --> 00:16:55.509
hundreds of billions of dimensions. I mean, the

00:16:55.509 --> 00:16:58.009
mind completely recoils at trying to picture

00:16:58.009 --> 00:17:00.389
a hundred billion dimensional valley. It does.

00:17:00.409 --> 00:17:02.789
It's impossible to visualize. But mathematically,

00:17:02.789 --> 00:17:05.450
it works the exact same way. In this case, the

00:17:05.450 --> 00:17:07.769
landscape is the cost function, which is essentially

00:17:07.769 --> 00:17:10.539
a measure of how wrong the AI currently is. When

00:17:10.539 --> 00:17:12.640
an AI is trying to learn how to speak English,

00:17:12.779 --> 00:17:15.259
it makes a billion guesses and the cost function

00:17:15.259 --> 00:17:18.579
scores how terrible those guesses are. The goal

00:17:18.579 --> 00:17:21.059
is to get to the bottom of the valley where the

00:17:21.059 --> 00:17:24.240
wrongness is at its absolute minimum. But to

00:17:24.240 --> 00:17:27.859
calculate the exact perfect slope of that multi

00:17:27.859 --> 00:17:30.059
multi-billion dimensional mountain, you would have

00:17:30.059 --> 00:17:32.700
to evaluate every single piece of training data

00:17:32.700 --> 00:17:35.980
in the world. Every book, every website, every

00:17:35.980 --> 00:17:39.250
article, all at once, for every single step. Which

00:17:39.250 --> 00:17:41.670
is computationally impossible. Precisely. You'd

00:17:41.670 --> 00:17:44.670
be frozen on the mountain forever. So stochastic

00:17:44.670 --> 00:17:47.150
gradient descent adds that random property to

00:17:47.150 --> 00:17:49.470
the update direction. Instead of looking at the

00:17:49.470 --> 00:17:51.910
whole mountain, it grabs a tiny random sample

00:17:51.910 --> 00:17:54.309
of data called a mini-batch. A mini-batch?

00:17:54.450 --> 00:17:56.589
Yeah, maybe just a few hundred sentences. And

00:17:56.589 --> 00:17:59.130
it calculates a rough, noisy estimate of the

00:17:59.130 --> 00:18:01.390
slope just based on that tiny sample. It's like,

00:18:01.470 --> 00:18:03.170
instead of trying to map the whole valley with

00:18:03.170 --> 00:18:06.349
a laser level, the blindfolded hiker just feels

00:18:06.349 --> 00:18:08.930
the slope of a single random pebble under their

00:18:08.930 --> 00:18:12.230
boot and says, good enough. I'm stepping. Yes.
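
NOTE
A minimal sketch of that pebble-level shortcut. Assumed toy task: learn the
single weight w in y = w*x from data generated with the true answer w = 3.
    import random
    data = [(x, 3.0 * x) for x in range(1000)]
    w, eta, batch_size = 0.0, 1e-6, 32
    for _ in range(2000):
        batch = random.sample(data, batch_size)  # a tiny random "pebble"
        # noisy slope of the squared error, estimated from the batch alone
        g = sum(2.0 * (w * x - y) * x for x, y in batch) / batch_size
        w = w - eta * g
    print(w)   # wobbles its way to essentially 3.0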

00:18:12.690 --> 00:18:15.849
It is incredibly noisy and chaotic. The hiker

00:18:15.849 --> 00:18:18.390
is stumbling all over the place. But surprisingly,

00:18:18.630 --> 00:18:20.990
that random noise is exactly what makes it work

00:18:20.990 --> 00:18:23.609
so beautifully. Wait, how does stumbling blindly

00:18:23.609 --> 00:18:27.609
help you find the drain faster? Because a hundred

00:18:27.609 --> 00:18:30.069
billion dimensional mountain is full of fake

00:18:30.069 --> 00:18:33.369
valleys, local minima, little potholes on the

00:18:33.369 --> 00:18:34.950
side of the mountain where the slope flattens

00:18:34.950 --> 00:18:37.650
out. If you take perfectly calculated, smooth

00:18:37.650 --> 00:18:40.230
steps, you will step right into a pothole and

00:18:40.230 --> 00:18:42.650
get stuck, thinking you found the absolute bottom.

00:18:42.690 --> 00:18:45.309
Oh, I see. But because stochastic gradient descent

00:18:45.309 --> 00:18:48.170
is noisy and random, it constantly shakes the

00:18:48.170 --> 00:18:50.849
algorithm out of those shallow potholes. It stumbles

00:18:50.849 --> 00:18:52.789
its way out, allowing it to keep falling toward

00:18:52.789 --> 00:18:55.769
the true, deep, global minimum. That is incredible.
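
NOTE
A minimal sketch of the pothole escape, on an assumed double-well function
f(x) = x^4 - x^2 + 0.3x: a shallow dip near x = +0.62, the true valley
near x = -0.77. The randomness is seeded, so runs are repeatable:
    import random
    random.seed(0)
    def grad(x): return 4 * x ** 3 - 2 * x + 0.3
    def descend(noise_level):
        x = 0.6                    # start inside the shallow pothole
        for _ in range(20000):
            kick = noise_level * random.gauss(0.0, 1.0)
            x = x - 0.01 * (grad(x) + kick)
        return x
    print(descend(0.0))   # stays stuck near +0.62: smooth steps never escape
    print(descend(3.0))   # almost always ends near -0.77: shaken loose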

00:18:55.910 --> 00:18:58.529
The flaw is actually the feature. Exactly. This

00:18:58.529 --> 00:19:01.109
noisy, momentum-driven descent, combined with

00:19:01.109 --> 00:19:02.809
an algorithm called backpropagation,

00:19:02.809 --> 00:19:05.450
is the absolute bedrock of how neural networks

00:19:05.450 --> 00:19:07.849
learn. Modern optimizers that you might hear

00:19:07.849 --> 00:19:10.269
computer scientists talk about, like Adam or Yogi

00:19:10.269 --> 00:19:13.029
or AdaBelief, these are all just highly advanced

00:19:13.029 --> 00:19:15.829
momentum-based descendants of this exact same

00:19:15.829 --> 00:19:18.829
process. Next time you, the listener, type a

00:19:18.829 --> 00:19:21.650
prompt into a generative AI to write an email

00:19:21.650 --> 00:19:24.650
or generate an image or summarize a document,

00:19:24.809 --> 00:19:26.569
I want you to remember this. Yes, definitely.

00:19:26.869 --> 00:19:29.490
Behind the slick interface and the magic of a

00:19:29.490 --> 00:19:33.619
machine speaking to you. Well, it is just a highly

00:19:33.619 --> 00:19:37.460
advanced version of an 1847 math equation strapped

00:19:37.460 --> 00:19:40.680
to a metaphorical heavy bowling ball, stumbling

00:19:40.680 --> 00:19:43.319
and feeling its way down a hundred billion dimensional

00:19:43.319 --> 00:19:46.599
mountain in the dark. That is what learning actually

00:19:46.599 --> 00:19:49.640
means for a machine. It really is an elegant,

00:19:49.859 --> 00:19:52.779
beautiful reduction of error through sheer geometry.

00:19:53.000 --> 00:19:55.180
We've covered so much vital ground today. From

00:19:55.180 --> 00:19:57.880
Cauchy's early theories in Paris to the visceral

00:19:57.880 --> 00:20:00.400
panic of the fog on the mountain analogy, we

00:20:00.400 --> 00:20:02.700
learned how finding the right step size, the learning

00:20:02.700 --> 00:20:05.900
rate, is a delicate mathematical Goldilocks balance

00:20:05.900 --> 00:20:08.619
to avoid overshooting the valley entirely. Such

00:20:08.619 --> 00:20:10.619
an important piece of the puzzle. And we discussed

00:20:10.619 --> 00:20:13.319
the agonizing danger of the zigzag trap in narrow,

00:20:13.539 --> 00:20:16.420
elongated ravines, and how adding momentum broke

00:20:16.420 --> 00:20:19.099
that trap and paved the way for modern stochastic

00:20:19.099 --> 00:20:21.880
neural networks. It really highlights how modern

00:20:21.880 --> 00:20:24.420
technology is entirely built on a foundation

00:20:24.420 --> 00:20:27.619
of centuries-old mathematical problem solving.

00:20:28.569 --> 00:20:31.029
But before we wrap up today's deep dive, I want

00:20:31.029 --> 00:20:33.609
to leave the listener with one final slightly

00:20:33.609 --> 00:20:35.569
mind-bending fact straight from the math that

00:20:35.569 --> 00:20:37.369
we haven't even touched on yet. Oh, please lay

00:20:37.369 --> 00:20:40.769
it on us. We've been visualizing this in mostly

00:20:40.769 --> 00:20:43.970
three dimensions, a hiker on a mountain, a bowling

00:20:43.970 --> 00:20:47.210
ball in a physical valley. And we just stretched

00:20:47.210 --> 00:20:51.400
it to billions of dimensions for AI. But mathematically,

00:20:51.859 --> 00:20:54.039
gradient descent doesn't stop there. It actually

00:20:54.039 --> 00:20:56.440
works in infinite dimensional spaces. I'm sorry,

00:20:56.680 --> 00:20:59.099
infinite dimensions? Yes. In what mathematicians

00:20:59.099 --> 00:21:02.259
call function spaces. To do it, they use a tool

00:21:02.259 --> 00:21:04.759
called the Fréchet derivative. The Fréchet

00:21:04.759 --> 00:21:06.759
derivative. Without getting bogged down in the

00:21:06.759 --> 00:21:08.940
jargon, it is basically a way to calculate a

00:21:08.940 --> 00:21:10.940
derivative not just at a point but of an entire

00:21:10.940 --> 00:21:13.200
function. And when viewed through the lens of

00:21:13.200 --> 00:21:15.299
Euler's method, which is a way to solve differential

00:21:15.299 --> 00:21:18.779
equations, gradient descent becomes a fluid continuous

00:21:18.779 --> 00:21:22.400
flow. My brain is slightly melting. Meaning mathematically

00:21:22.400 --> 00:21:25.359
we can use this exact same process of feeling

00:21:25.359 --> 00:21:28.680
the slope to find the most optimal perfect path

00:21:28.680 --> 00:21:31.420
through an infinite number of possible dimensions

00:21:31.420 --> 00:21:33.759
simultaneously. I want you to just chew on that

00:21:33.759 --> 00:21:37.099
for a second. Try to picture what walking downhill

00:21:37.099 --> 00:21:39.900
in infinite dimensions might possibly look like

00:21:39.900 --> 00:21:42.019
in your mind's eye. Good luck with that. Next

00:21:42.019 --> 00:21:44.240
time you find yourself stuck in the dark, whether

00:21:44.240 --> 00:21:46.599
in a literal fog on a hiking trail or just trying

00:21:46.599 --> 00:21:49.700
to navigate a massively complex problem in your

00:21:49.700 --> 00:21:52.259
own life, just remember the math. You don't need

00:21:52.259 --> 00:21:54.019
to see the whole path. You don't need to map

00:21:54.019 --> 00:21:56.119
the entire valley. You just need to figure out

00:21:56.119 --> 00:21:58.720
which way is down, find your momentum, and take

00:21:58.720 --> 00:21:59.599
the right size step.
