WEBVTT

00:00:00.000 --> 00:00:05.379
Imagine for a second the sheer computational

00:00:05.379 --> 00:00:08.220
power it takes to train modern artificial intelligence.

00:00:08.939 --> 00:00:10.660
I mean, we completely take it for granted now,

00:00:10.740 --> 00:00:12.539
right? Oh, absolutely. It's just background noise

00:00:12.539 --> 00:00:14.539
to us now. Yeah. You point your smartphone at

00:00:14.539 --> 00:00:17.500
your face, and it instantly recognizes you. Or

00:00:17.500 --> 00:00:20.420
you type a complex phrase, and a machine translates

00:00:20.420 --> 00:00:22.679
it into flawless Japanese on the fly. Right.

00:00:22.679 --> 00:00:24.980
Or you ask a chat bot to write a sonnet about

00:00:24.980 --> 00:00:27.059
a toaster, and it spits it out in what? Three

00:00:27.059 --> 00:00:30.480
seconds? Exactly. Three seconds. But behind those

00:00:30.480 --> 00:00:32.740
split-second miracles is this training process

00:00:32.740 --> 00:00:35.579
that involves processing billions, sometimes

00:00:35.579 --> 00:00:39.259
literally trillions, of data points. By all normal

00:00:39.259 --> 00:00:41.719
mathematical logic, a machine should take billions

00:00:41.719 --> 00:00:44.179
of years to process all that data and actually

00:00:44.179 --> 00:00:46.359
learn anything from it. It really should. The

00:00:46.359 --> 00:00:48.399
math is staggering. But it doesn't take billions

00:00:48.399 --> 00:00:51.250
of years. It happens in weeks or months. So,

00:00:51.549 --> 00:00:54.810
how? Today, we are taking a deep dive into a

00:00:54.810 --> 00:00:57.990
dense, highly technical Wikipedia article on

00:00:57.990 --> 00:01:00.189
a concept called Stochastic Gradient Descent,

00:01:00.390 --> 00:01:03.689
or SGD. Which is quite the mouthful. It is. But

00:01:03.689 --> 00:01:06.689
our mission for this deep dive is to decode this

00:01:06.689 --> 00:01:09.549
exact mathematical engine. We're going to transform

00:01:09.549 --> 00:01:11.849
what looks like an overwhelming wall of calculus

00:01:11.849 --> 00:01:14.790
into a really fascinating story of problem-solving.

00:01:14.989 --> 00:01:17.689
Because this single algorithm is basically the

00:01:17.689 --> 00:01:20.290
beating heart that makes all modern machine learning

00:01:20.290 --> 00:01:22.629
possible. Okay, let's unpack this, starting with

00:01:22.629 --> 00:01:24.790
the name itself, Stochastic Gradient Descent.

00:01:25.000 --> 00:01:28.840
It sounds incredibly intimidating. It does, yeah.

00:01:28.920 --> 00:01:31.459
It carries a really heavy academic weight. Yeah.

00:01:31.739 --> 00:01:33.620
But if we break it down, the underlying physical

00:01:33.620 --> 00:01:36.379
concept is actually very intuitive. OK, I'm holding

00:01:36.379 --> 00:01:38.640
you to that. I promise. So to understand the

00:01:38.640 --> 00:01:40.519
stochastic part, which by the way is just a fancy

00:01:40.519 --> 00:01:44.019
mathematical term for random, we first have to

00:01:44.019 --> 00:01:45.739
understand the gradient descent part. Right.

00:01:45.920 --> 00:01:48.560
In statistics and machine learning, you are essentially

00:01:48.560 --> 00:01:51.060
trying to solve a massive minimization problem.

00:01:51.239 --> 00:01:53.180
You have an objective function, which is often

00:01:53.180 --> 00:01:56.079
referred to as empirical risk. I want you to

00:01:56.079 --> 00:01:58.650
imagine a landscape. Like a physical landscape.

00:01:58.930 --> 00:02:01.349
Yeah, exactly. A massive, incredibly complex

00:02:01.349 --> 00:02:04.709
landscape of hills, mountains and valleys. Your

00:02:04.709 --> 00:02:07.370
singular goal is to find the absolute lowest

00:02:07.370 --> 00:02:10.750
point in that entire landscape because that lowest

00:02:10.750 --> 00:02:13.870
point represents the minimum possible error for

00:02:13.870 --> 00:02:16.569
your AI model. So gradient descent is literally

00:02:16.569 --> 00:02:18.889
how we find the bottom. Exactly. It's like trying

00:02:18.889 --> 00:02:21.689
to find the very bottom of a valley while completely

00:02:21.689 --> 00:02:23.930
blindfolded. You can't see the destination. You

00:02:23.930 --> 00:02:26.509
can only like feel the slope of the ground right

00:02:26.509 --> 00:02:29.050
under your feet. That is a perfect way to visualize

00:02:29.050 --> 00:02:31.569
it. Now, in classical batch gradient descent

00:02:31.569 --> 00:02:34.530
to figure out which way is down, the algorithm

00:02:34.530 --> 00:02:37.539
calculates the gradient, the slope using every

00:02:37.539 --> 00:02:40.000
single piece of data in your entire training

00:02:40.000 --> 00:02:43.300
set. Every single one? Every single one. It looks

00:02:43.300 --> 00:02:45.360
at the entirety of the data set to calculate

00:02:45.360 --> 00:02:48.039
the mathematical average, just to take one single

00:02:48.039 --> 00:02:50.780
perfect step down the hill. Which, I mean, that

00:02:50.780 --> 00:02:52.819
makes total sense if you're working with, say,

00:02:53.060 --> 00:02:55.180
100 data points in a simple spreadsheet. Right,

00:02:55.219 --> 00:02:58.550
it's easy. But if you're training an AI, on like

00:02:58.550 --> 00:03:01.569
every piece of text on the public internet, evaluating

00:03:01.569 --> 00:03:04.090
the slope from all those billions of individual

00:03:04.090 --> 00:03:06.229
pieces of training data just to take one step

00:03:06.229 --> 00:03:09.789
is completely unfeasible. Batch descent is like

00:03:09.789 --> 00:03:11.969
surveying every microscopic inch of the mountain

00:03:11.969 --> 00:03:13.569
before deciding where to put your foot next.
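
NOTE
A minimal Python sketch of the full-batch step just described: every example's
gradient is averaged before a single move is made. Names like loss_grad and eta
are illustrative assumptions, not from the source.
    import numpy as np
    def batch_gd_step(w, X, y, loss_grad, eta=0.01):
        # average the per-example gradients over the ENTIRE training set
        grads = np.stack([loss_grad(w, X[i], y[i]) for i in range(len(X))])
        # ...just to take one carefully averaged step downhill
        return w - eta * grads.mean(axis=0)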

00:03:13.789 --> 00:03:16.939
It's just computationally crushing. It is. The

00:03:16.939 --> 00:03:19.319
computational cost at every iteration becomes

00:03:19.319 --> 00:03:21.759
astronomical. Your machine would literally never

00:03:21.759 --> 00:03:24.539
finish training. And this is exactly where the

00:03:24.539 --> 00:03:27.520
stochastic shortcut comes in. OK. The basic idea

00:03:27.520 --> 00:03:30.300
actually traces all the way back to the 1950s

00:03:30.300 --> 00:03:32.979
to something called the Robbins-Monro algorithm.

00:03:33.659 --> 00:03:35.759
Instead of looking at all the data to calculate

00:03:35.759 --> 00:03:38.780
that one perfect slope, stochastic gradient descent

00:03:38.780 --> 00:03:42.000
samples just a single randomly selected subset

00:03:42.000 --> 00:03:44.240
of the data. Really? Just one? Often, yeah. It's

00:03:44.240 --> 00:03:46.689
just one single data point. It estimates the

00:03:46.689 --> 00:03:49.169
entire gradient based solely on that one random

00:03:49.169 --> 00:03:52.110
sample and it immediately takes a step. So SGD

00:03:52.110 --> 00:03:54.009
is just feeling the ground right under your left

00:03:54.009 --> 00:03:55.930
toe and immediately stepping down. But wait I

00:03:55.930 --> 00:03:57.830
have to push back on that. Go for it. If we're

00:03:57.830 --> 00:04:00.270
just guessing the slope of the entire mountain

00:04:00.270 --> 00:04:03.789
based on one random single data point, isn't

00:04:03.789 --> 00:04:05.840
that incredibly risky? I mean, couldn't that

00:04:05.840 --> 00:04:09.039
one data point be a massive outlier? We could

00:04:09.039 --> 00:04:11.060
easily step the completely wrong way and start

00:04:11.060 --> 00:04:13.879
climbing up a different hill. You've hit on the

00:04:13.879 --> 00:04:15.840
exact trade-off that makes the algorithm work.
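
NOTE
For contrast, a sketch of the single-sample stochastic version described here:
one randomly chosen example stands in for the whole dataset at each step (same
assumed helper names as above; illustrative only).
    import numpy as np
    def sgd_step(w, X, y, loss_grad, eta=0.01, rng=np.random.default_rng()):
        i = rng.integers(len(X))        # pick ONE random training example
        g = loss_grad(w, X[i], y[i])    # noisy estimate of the full gradient
        return w - eta * g              # step immediately: cheap, fast, imperfect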

00:04:16.040 --> 00:04:18.360
You are exchanging a high convergence rate, that

00:04:18.360 --> 00:04:21.720
perfect, smooth, guaranteed path to the bottom

00:04:21.720 --> 00:04:25.180
for vastly faster iterations. OK, so it's a speed

00:04:25.180 --> 00:04:28.300
thing. Exactly. If you were doing this in a simple

00:04:28.300 --> 00:04:30.459
three-dimensional valley, yes, you would bounce

00:04:30.459 --> 00:04:33.660
around wildly and probably get lost. But what

00:04:33.660 --> 00:04:36.060
we have to remember is that modern machine learning

00:04:36.060 --> 00:04:38.899
operates in incredibly high dimensional optimization

00:04:38.899 --> 00:04:42.240
problems. We are talking about models with millions

00:04:42.240 --> 00:04:44.500
or even billions of dimensions, which the human

00:04:44.500 --> 00:04:47.139
brain obviously can't even picture. In those

00:04:47.139 --> 00:04:50.060
high dimensional spaces, the sheer speed of taking

00:04:50.060 --> 00:04:52.720
thousands of rapid, slightly imperfect steps

00:04:52.720 --> 00:04:55.220
drastically outweighs the computational cost

00:04:55.220 --> 00:04:58.540
of calculating one perfect step. You sweep through

00:04:58.540 --> 00:05:01.420
the training set, continually shuffling the data

00:05:01.420 --> 00:05:03.519
to prevent the algorithm from getting stuck in

00:05:03.529 --> 00:05:07.269
loops, and those noisy, erratic steps will eventually

00:05:07.269 --> 00:05:09.750
trend downward over time. So we're basically

00:05:09.750 --> 00:05:12.290
trading precision for speed and trusting the

00:05:12.290 --> 00:05:14.810
law of averages to get us there eventually. That's

00:05:14.810 --> 00:05:17.750
the core of it, yeah. But relying on just a single

00:05:17.750 --> 00:05:21.689
random sample still feels a bit too chaotic to

00:05:21.689 --> 00:05:24.730
me, especially as datasets get larger. It seems

00:05:24.730 --> 00:05:26.769
like taking one step forward and three steps

00:05:26.769 --> 00:05:28.870
sideways wouldn't be the most efficient way to

00:05:28.870 --> 00:05:31.189
learn, even if the steps are really fast. Well,

00:05:31.230 --> 00:05:33.170
the field realized that fairly quickly, too.

00:05:33.470 --> 00:05:36.430
A compromise emerged to bridge the gap between

00:05:36.430 --> 00:05:39.689
true stochastic gradient descent, which uses

00:05:39.689 --> 00:05:43.230
just that one single sample and full batch gradient

00:05:43.230 --> 00:05:45.089
descent. It's called the mini-batch. Yeah,

00:05:45.089 --> 00:05:46.990
mini-batch, okay. Right. Instead of one sample per

00:05:46.990 --> 00:05:48.949
step, the algorithm computes the gradient against

00:05:48.949 --> 00:05:51.110
a small group of training samples at each step,

00:05:51.449 --> 00:05:54.810
maybe 32 samples or maybe a few hundred. Ah,

00:05:54.810 --> 00:05:57.290
got it. So instead of asking one random person

00:05:57.290 --> 00:05:59.829
in the valley which way is down, you ask a small

00:05:59.829 --> 00:06:03.149
group of people to vote. It filters out the absolute

00:06:03.149 --> 00:06:05.870
craziest answers without requiring you to poll

00:06:05.870 --> 00:06:08.149
the entire population of the planet. That group

00:06:08.149 --> 00:06:10.769
voting analogy captures the math perfectly. And

00:06:10.769 --> 00:06:13.430
from a historical perspective, this was a massive

00:06:13.430 --> 00:06:15.769
breakthrough. The sources point out that back

00:06:15.769 --> 00:06:19.250
in 1997, this was explored under the rather clunky

00:06:19.250 --> 00:06:22.410
name of the bunch mode back propagation algorithm.

00:06:22.569 --> 00:06:24.829
Wow, bunch mode. Very kitschy. Yeah, it didn't

00:06:24.829 --> 00:06:27.430
really stick. But the reason the mini-batch

00:06:27.430 --> 00:06:30.879
became the absolute unquestioned norm for training

00:06:30.879 --> 00:06:33.060
neural networks isn't just because the math is

00:06:33.060 --> 00:06:35.339
smoother, it's because of the physical hardware

00:06:35.339 --> 00:06:38.600
of computers. Using mini-batches allows the

00:06:38.600 --> 00:06:41.579
code to heavily leverage vectorization libraries.
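
NOTE
A sketch of the mini-batch compromise and the vectorization it enables: the
gradient for a small random group is computed in one array operation. The
squared-error loss for a linear model is a stand-in assumption, not the source's.
    import numpy as np
    def minibatch_step(w, X, y, eta=0.01, batch=32, rng=np.random.default_rng()):
        idx = rng.choice(len(X), size=batch, replace=False)   # small random group
        Xb, yb = X[idx], y[idx]
        # one vectorized call handles the whole batch, which is exactly the kind
        # of work a GPU or BLAS library can chew through in parallel
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / batch
        return w - eta * grad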

00:06:41.980 --> 00:06:43.899
Ah, I see. So instead of the computer's processor

00:06:43.899 --> 00:06:46.959
handling one single data point at a time sequentially,

00:06:46.959 --> 00:06:49.600
it can process the entire mini-batch simultaneously

00:06:49.600 --> 00:06:52.100
using parallel processing on a modern graphics

00:06:52.100 --> 00:06:55.439
card. The hardware synergy is the real secret

00:06:55.439 --> 00:07:00.079
sauce. That synergy is exactly why it took off. Now,

00:07:00.220 --> 00:07:02.759
alongside the mini-batch, there is another crucial

00:07:02.759 --> 00:07:04.860
concept you have to manage, and that's the step

00:07:04.860 --> 00:07:07.579
size. In machine learning, this is called the

00:07:07.579 --> 00:07:09.939
learning rate, and it's represented mathematically

00:07:09.939 --> 00:07:13.839
by the Greek letter eta. You have to decide exactly

00:07:13.839 --> 00:07:16.420
how big a step to take once you know the direction.

00:07:16.680 --> 00:07:20.000
If you set the learning rate too high, the algorithm

00:07:20.000 --> 00:07:24.139
diverges. It takes massive sweeping leaps, overshoots

00:07:24.139 --> 00:07:26.800
the valley entirely, and mathematically flies

00:07:26.800 --> 00:07:28.660
off into space. Right, like jumping across the

00:07:28.660 --> 00:07:31.220
entire canyon. Exactly. But if you set it too

00:07:31.220 --> 00:07:34.540
low... It converges so painfully slowly that

00:07:34.540 --> 00:07:37.279
you lose the entire speed advantage of the stochastic

00:07:37.279 --> 00:07:39.480
shortcut. It's the classic Goldilocks problem.
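
NOTE
A tiny illustration of the Goldilocks problem on the one-dimensional bowl
loss(w) = w**2; the eta values are arbitrary picks for the demo.
    def run(eta, steps=20, w=5.0):
        for _ in range(steps):
            w = w - eta * 2 * w        # the gradient of w**2 is 2*w
        return w
    print(run(eta=1.5))    # too high: the iterates overshoot and blow up
    print(run(eta=0.001))  # too low: after 20 steps w has barely moved
    print(run(eta=0.3))    # just right: w heads quickly toward the minimum at 0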

00:07:39.579 --> 00:07:42.339
You need the step size to be just right. The

00:07:42.339 --> 00:07:44.480
source material actually has a really fascinating

00:07:44.480 --> 00:07:46.879
example of this involving linear regression.

00:07:47.339 --> 00:07:49.680
It mentions that if you have an over-parameterized

00:07:49.680 --> 00:07:51.379
case, which just means you have more parameters

00:07:51.379 --> 00:07:53.480
or variables than you have actual data points,

00:07:54.180 --> 00:07:58.740
SGD does something deeply unexpected. The overparameterized

00:07:58.740 --> 00:08:01.139
linear regression case is remarkable because

00:08:01.139 --> 00:08:03.399
it highlights a hidden feature of the algorithm.

00:08:04.180 --> 00:08:06.699
In that specific scenario, there are technically

00:08:06.699 --> 00:08:09.680
infinite ways to perfectly fit the data. Infinite

00:08:09.680 --> 00:08:12.839
ways. Right. But even if you keep the learning

00:08:12.839 --> 00:08:15.620
rate completely constant, stochastic gradient

00:08:15.620 --> 00:08:18.279
descent will naturally converge to the interpolation

00:08:18.279 --> 00:08:20.300
solution that has the minimum distance from where

00:08:20.300 --> 00:08:23.170
it started. Which means out of all the infinite

00:08:23.170 --> 00:08:25.810
possible solutions that perfectly solve the problem,

00:08:26.329 --> 00:08:28.589
it's lazy in the absolute best possible way.

00:08:28.610 --> 00:08:31.149
Yes. It naturally finds the solution that required

00:08:31.149 --> 00:08:33.210
the shortest journey from its starting point.
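
NOTE
A small numerical sketch of that "lazy" behavior for over-parameterized least
squares (dimensions, step size, and iteration count are assumptions; the
minimum-distance property is the one the discussion describes).
    import numpy as np
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 20))   # 5 data points, 20 parameters
    y = rng.normal(size=5)
    w0 = rng.normal(size=20)       # arbitrary starting point
    w = w0.copy()
    for _ in range(20000):         # plain single-sample SGD, constant learning rate
        i = rng.integers(5)
        w -= 0.01 * (X[i] @ w - y[i]) * X[i]
    w_star = w0 + np.linalg.pinv(X) @ (y - X @ w0)   # interpolator closest to w0
    print(np.abs(X @ w - y).max())       # ~0: the data is fit exactly
    print(np.linalg.norm(w - w_star))    # ~0: SGD found the closest interpolator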

00:08:33.429 --> 00:08:36.490
Ugh. OK, but even with a group voting to

00:08:36.490 --> 00:08:38.950
smooth out the direction and a perfectly tuned

00:08:38.950 --> 00:08:41.710
step size, I still have a major question about

00:08:41.710 --> 00:08:44.549
navigating this multi -billion dimension landscape.

00:08:44.850 --> 00:08:47.590
Shoot. What if the valley has weird shallow grooves?

00:08:48.029 --> 00:08:50.429
Or what if it's really steep on one side but

00:08:50.429 --> 00:08:52.830
totally flat on the other? Won't our blindfolded

00:08:52.830 --> 00:08:55.090
steps still bounce back and forth erratically

00:08:55.090 --> 00:08:57.850
across the steep walls while barely moving forward

00:08:57.850 --> 00:09:00.429
along the flat bottom? That bouncing problem,

00:09:00.679 --> 00:09:03.899
often called oscillation, severely plagued early

00:09:03.899 --> 00:09:06.980
optimization efforts. The algorithm would basically

00:09:06.980 --> 00:09:09.860
waste all its energy ping-ponging across a ravine

00:09:09.860 --> 00:09:12.139
instead of walking down the center of it. To

00:09:12.139 --> 00:09:14.100
solve it, computer scientists actually looked

00:09:14.100 --> 00:09:16.179
outside of mathematics and borrowed a concept

00:09:16.179 --> 00:09:18.919
straight from physics. Oh, I know this one, momentum.

00:09:19.299 --> 00:09:21.980
Yes, the momentum method, sometimes called the

00:09:21.980 --> 00:09:24.480
heavy ball method. The roots of this go back

00:09:24.480 --> 00:09:28.039
to a 1964 article by Soviet mathematician Boris

00:09:28.039 --> 00:09:30.779
Polyak, but it was really brought into the mainstream

00:09:30.779 --> 00:09:33.779
machine learning consciousness around 1986 by

00:09:33.779 --> 00:09:36.320
researchers Rumelhart, Hinton, and Williams.

00:09:37.200 --> 00:09:39.480
What's fascinating here is how beautifully the

00:09:39.480 --> 00:09:42.799
physics analogy maps onto the math. How so? Well,

00:09:42.840 --> 00:09:44.840
we stop thinking about our algorithm as just

00:09:44.840 --> 00:09:47.720
a dimensionless point making isolated blind steps,

00:09:47.799 --> 00:09:49.840
and we start mathematically treating it as a

00:09:49.840 --> 00:09:52.600
physical particle, a heavy iron ball traveling

00:09:52.600 --> 00:09:54.899
through parameter space. I get the concept of

00:09:54.899 --> 00:09:57.139
momentum in the physical world, like a heavy

00:09:57.139 --> 00:09:59.460
ball rolling down a hill builds up speed, but

00:09:59.460 --> 00:10:01.539
an algorithm is just numbers in a server somewhere.

00:10:01.700 --> 00:10:04.059
Yeah. How do you give a line of code weight and

00:10:04.059 --> 00:10:07.919
speed? By giving it a memory. In classical SGD,

00:10:08.100 --> 00:10:10.299
every single step is entirely independent of

00:10:10.299 --> 00:10:13.700
the last one. But with momentum, the next update

00:10:13.700 --> 00:10:16.220
is a linear combination of the current gradient

00:10:16.220 --> 00:10:18.879
and the previous update. OK, so it remembers

00:10:18.879 --> 00:10:22.559
where it just was. Exactly. You introduce an

00:10:22.559 --> 00:10:25.220
exponential decay factor, usually denoted by

00:10:25.220 --> 00:10:27.500
the Greek letter alpha, to manage this memory.
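
NOTE
A sketch of the "heavy ball" update as described: the new step blends the
decayed previous step (the memory) with a push from the current gradient.
The alpha and eta defaults are placeholders.
    def momentum_step(w, velocity, grad, eta=0.01, alpha=0.9):
        velocity = alpha * velocity - eta * grad   # remembered direction + new force
        return w + velocity, velocity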

00:10:28.299 --> 00:10:30.840
The gradient, or the slope of the loss function,

00:10:31.139 --> 00:10:34.190
acts as a force that accelerates the ball. Because

00:10:34.190 --> 00:10:36.950
the math remembers its past velocity, it tends

00:10:36.950 --> 00:10:39.049
to keep traveling in the same general direction,

00:10:39.549 --> 00:10:42.529
which smooths out those erratic ping-pong oscillations

00:10:42.529 --> 00:10:45.090
you asked about. It literally powers through

00:10:45.090 --> 00:10:48.110
the shallow grooves. But wait, if we're mathematically

00:10:48.110 --> 00:10:50.529
building up speed like a heavy ball rolling down

00:10:50.529 --> 00:10:53.570
a steep hill... Don't we run the risk of completely

00:10:53.570 --> 00:10:55.570
overshooting the actual minimum at the bottom?

00:10:55.690 --> 00:10:57.649
I mean, if it's picking up that much speed, it's

00:10:57.649 --> 00:10:59.210
going to hit the bottom of the valley and just

00:10:59.210 --> 00:11:01.350
roll right up the other side. That is exactly

00:11:01.350 --> 00:11:03.769
where that exponential decay factor alpha comes

00:11:03.769 --> 00:11:05.669
in. In physics terms, you can think of alpha

00:11:05.669 --> 00:11:08.330
acting like friction or air resistance. Ah, friction.

00:11:08.669 --> 00:11:10.950
Right. It determines the relative contribution

00:11:10.950 --> 00:11:12.950
of the earlier gradients to the weight change,

00:11:13.470 --> 00:11:16.149
slowly bleeding off the stored energy. Without

00:11:16.149 --> 00:11:18.529
that friction, yes, the ball would oscillate

00:11:18.529 --> 00:11:20.809
wildly or shoot right out of the valley. But

00:11:20.809 --> 00:11:23.370
with the right decay factor, the momentum allows

00:11:23.370 --> 00:11:26.350
it to punch through local shallow minimums, little

00:11:26.350 --> 00:11:28.610
potholes on the way down, while the friction

00:11:28.610 --> 00:11:31.450
ensures it eventually settles smoothly at the

00:11:31.450 --> 00:11:33.610
true bottom of the valley. It's an incredibly

00:11:33.610 --> 00:11:36.850
elegant mathematical balancing act. It is elegant.

00:11:37.330 --> 00:11:40.590
But looking at how fast AI has evolved, I have

00:11:40.590 --> 00:11:43.659
to assume even momentum had its limits. The math

00:11:43.659 --> 00:11:46.940
assumes you're using fixed hyperparameters, meaning

00:11:46.940 --> 00:11:49.639
you as the programmer have to set a constant

00:11:49.639 --> 00:11:51.600
learning rate and a constant momentum factor

00:11:51.600 --> 00:11:53.620
for every single dimension of your problem before

00:11:53.620 --> 00:11:55.320
you press play. Right, before you even start.

00:11:55.519 --> 00:11:58.000
And as we established, these AI models have millions

00:11:58.000 --> 00:12:00.679
of dimensions. What if the landscape is incredibly

00:12:00.679 --> 00:12:03.139
steep in one direction but totally flat in another?

00:12:03.539 --> 00:12:06.340
Using a single fixed step size for every dimension

00:12:06.340 --> 00:12:09.529
suddenly seems wildly inefficient. That realization

00:12:09.529 --> 00:12:12.809
birthed the adaptive era in the early 2010s.

00:12:13.070 --> 00:12:15.389
The algorithms needed to get smarter about adjusting

00:12:15.389 --> 00:12:18.009
themselves on the fly without human intervention.

00:12:18.809 --> 00:12:21.470
The first major breakthrough was AdaGrad, or

00:12:21.470 --> 00:12:24.509
adaptive gradient, published in 2011. AdaGrad

00:12:24.509 --> 00:12:26.450
is brilliant because it adapts the learning rate

00:12:26.450 --> 00:12:29.370
on a per parameter basis. Meaning different dimensions

00:12:29.370 --> 00:12:31.809
get totally different step sizes based on what

00:12:31.809 --> 00:12:33.870
the landscape looks like locally. Exactly. Here's how

00:12:33.870 --> 00:12:36.970
it works. Imagine our valley is a steep ravine.

00:12:37.120 --> 00:12:39.220
AdaGrad looks at the historical sum of your

00:12:39.220 --> 00:12:41.519
steps. If you've been bouncing wildly up and

00:12:41.519 --> 00:12:43.659
down the steep walls, meaning those specific

00:12:43.659 --> 00:12:46.240
parameters are changing a lot and have huge gradients,

00:12:46.639 --> 00:12:48.779
the algorithm stores that history and throws

00:12:48.779 --> 00:12:51.440
on the brakes, aggressively dampening the learning

00:12:51.440 --> 00:12:54.279
rate for those specific dimensions. But for sparse

00:12:54.279 --> 00:12:56.539
parameters, features that don't show up often,

00:12:56.720 --> 00:12:58.779
like a flat plane where you've barely moved,

00:12:58.980 --> 00:13:01.700
it boosts the learning rate. So when those rare

00:13:01.700 --> 00:13:04.639
features do appear, the model puts its foot on

00:13:04.639 --> 00:13:07.779
the gas and takes a big, meaningful step. This

00:13:07.779 --> 00:13:10.460
made it incredibly powerful for things like natural

00:13:10.460 --> 00:13:13.100
language processing and image recognition, where

00:13:13.100 --> 00:13:16.100
data is often very sparse and uneven. But AdaGrad

00:13:16.100 --> 00:13:19.080
wasn't perfect. Because it keeps accumulating

00:13:19.080 --> 00:13:21.159
all those historical gradients in the denominator

00:13:21.159 --> 00:13:23.740
of its equation, the math means the learning

00:13:23.740 --> 00:13:25.740
rate eventually shrinks down to practically zero.
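
NOTE
A sketch of the per-parameter AdaGrad update: each coordinate divides its step
by the square root of its own accumulated squared-gradient history, which is
also why the effective rate eventually shrinks toward zero.
    import numpy as np
    def adagrad_step(w, hist, grad, eta=0.01, eps=1e-8):
        hist = hist + grad ** 2                      # lifetime sum, per parameter
        w = w - eta * grad / (np.sqrt(hist) + eps)   # busy dimensions damped, sparse ones boosted
        return w, hist                               # hist only grows, so steps eventually vanish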

00:13:25.820 --> 00:13:28.240
Yeah, that was the fatal flaw. The brakes basically

00:13:28.240 --> 00:13:30.799
lock up, and the model just stops learning entirely.

00:13:31.419 --> 00:13:34.179
Which brings us to arguably the funniest piece

00:13:34.179 --> 00:13:36.539
of trivia in this entire deep dive. Here's where

00:13:36.539 --> 00:13:39.539
it gets really interesting. To fix AdaGrad's

00:13:39.539 --> 00:13:42.679
locked brakes, two PhD students, James Martens

00:13:42.679 --> 00:13:45.039
and Ilya Sutskever, invented a method called

00:13:45.039 --> 00:13:49.019
RMSProp in 2012. RMSProp introduced a forgetting

00:13:49.019 --> 00:13:51.940
factor, turning that infinite history into a

00:13:51.940 --> 00:13:54.379
running average so the algorithm wouldn't get

00:13:54.379 --> 00:13:57.320
bogged down by ancient data. It was highly influential,

00:13:57.620 --> 00:13:59.960
wildly successful, and it was never actually

00:13:59.960 --> 00:14:02.240
published in a peer-reviewed academic paper.

00:14:02.639 --> 00:14:04.980
It really is an incredible piece of AI lore.

00:14:05.360 --> 00:14:07.980
It was merely described in an online Coursera

00:14:07.980 --> 00:14:10.519
video lecture by their professor, the legendary

00:14:10.519 --> 00:14:13.480
AI researcher Geoffrey Hinton. I just love that.

00:14:13.559 --> 00:14:15.600
We have this highly rigorous mathematical field,

00:14:16.120 --> 00:14:18.960
a field heavily dependent on peer-reviewed validation

00:14:18.960 --> 00:14:21.960
and exhaustive proofs. And one of its foundational

00:14:21.960 --> 00:14:24.379
pillars was essentially dropped in an online

00:14:24.379 --> 00:14:26.799
video lecture. Just casually mentioned to the

00:14:26.799 --> 00:14:28.820
class. Right. Hey, everyone, divide the gradient

00:14:28.820 --> 00:14:30.700
by a running average of its recent magnitude.
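
NOTE
A sketch of the RMSProp tweak just quoted: the lifetime sum is replaced by an
exponentially decaying running average, the "forgetting factor" (decay value assumed).
    import numpy as np
    def rmsprop_step(w, avg_sq, grad, eta=0.001, decay=0.9, eps=1e-8):
        avg_sq = decay * avg_sq + (1 - decay) * grad ** 2   # running average, not a lifetime sum
        return w - eta * grad / (np.sqrt(avg_sq) + eps), avg_sq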

00:14:31.080 --> 00:14:33.240
It works great. And the entire industry just

00:14:33.240 --> 00:14:35.220
said, yep, sounds good to us. Let's put it in

00:14:35.220 --> 00:14:37.850
everything. It speaks volumes to how fast the

00:14:37.850 --> 00:14:39.830
field was moving at the time. If something worked

00:14:39.830 --> 00:14:42.710
in practice, researchers ran with it, proofs

00:14:42.710 --> 00:14:45.750
be damned, and RMSProp worked brilliantly. But

00:14:45.750 --> 00:14:48.789
it wasn't the final form. In 2014, we got ADAM,

00:14:49.169 --> 00:14:51.690
which stands for Adaptive Moment Estimation.

00:14:51.970 --> 00:14:54.750
Yeah. Adam is essentially the ultimate hybrid

00:14:54.750 --> 00:14:57.690
combination of RMSProp's adaptive learning rates

00:14:58.029 --> 00:14:59.850
and the momentum method we discussed earlier.

00:14:59.950 --> 00:15:02.110
Taking the best of both worlds. The absolute

00:15:02.110 --> 00:15:04.950
best. It uses running averages with exponential

00:15:04.950 --> 00:15:06.710
forgetting for both the gradients, which are

00:15:06.710 --> 00:15:08.590
the first moments, and the squared gradients,

00:15:08.629 --> 00:15:11.070
which are the second moments. It even introduces

00:15:11.070 --> 00:15:13.929
a clever bias correction to make sure the mathematical

00:15:13.929 --> 00:15:16.210
estimates don't lean too heavily towards zero

00:15:16.210 --> 00:15:18.409
during the very first few training iterations

00:15:18.409 --> 00:15:21.929
when it doesn't have much history yet. Wow.
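
NOTE
A sketch of the Adam update combining the two ideas just described: running
averages of the gradient (first moment) and squared gradient (second moment),
plus the bias correction for the earliest steps. Constants are typical defaults,
assumed here rather than taken from the source.
    import numpy as np
    def adam_step(w, m, v, grad, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
        m = b1 * m + (1 - b1) * grad          # momentum-style average of gradients
        v = b2 * v + (1 - b2) * grad ** 2     # RMSProp-style average of squared gradients
        m_hat = m / (1 - b1 ** t)             # bias corrections keep early estimates
        v_hat = v / (1 - b2 ** t)             # from leaning toward zero (t counts from 1)
        return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v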

00:15:21.929 --> 00:15:28.909
And that dual approach is why frameworks like TensorFlow

00:15:29.529 --> 00:15:31.970
today are dominated by Adam and its later variants,

00:15:32.289 --> 00:15:35.669
like AdamW or Adamax. But wait, the source text

00:15:35.669 --> 00:15:38.269
notes a really quirky detail about Adam. Despite

00:15:38.269 --> 00:15:40.389
it being the absolute industry standard that

00:15:40.389 --> 00:15:43.570
basically runs the AI world, its initial mathematical

00:15:43.570 --> 00:15:45.389
proof of convergence was actually completely

00:15:45.389 --> 00:15:48.269
flawed. Yes, it was. In fact, later analysis

00:15:48.269 --> 00:15:51.429
proved that there are simple, perfectly smooth,

00:15:51.639 --> 00:15:54.539
bowl-shaped valleys, what mathematicians call

00:15:54.539 --> 00:15:57.820
convex objectives, where Adam can mathematically

00:15:57.820 --> 00:16:00.220
fail to find the bottom. It can theoretically

00:16:00.220 --> 00:16:02.620
get stuck. It mathematically shouldn't always

00:16:02.620 --> 00:16:05.440
work. And yet, it continues to be used everywhere

00:16:05.440 --> 00:16:07.580
because of its incredibly strong performance

00:16:07.580 --> 00:16:10.500
in actual practice. It's a recurring theme in

00:16:10.500 --> 00:16:13.100
modern machine learning, honestly. Empirical

00:16:13.100 --> 00:16:15.799
success often massively outpaces theoretical

00:16:15.799 --> 00:16:18.419
understanding. The engineers build a bridge that

00:16:18.419 --> 00:16:21.200
carries heavy traffic perfectly, and the mathematicians

00:16:21.200 --> 00:16:23.259
are left standing on the shore trying to prove

00:16:23.259 --> 00:16:25.980
why the bridge hasn't collapsed yet. That bridge

00:16:25.980 --> 00:16:28.179
analogy perfectly sets up the struggle in the

00:16:28.179 --> 00:16:31.240
theoretical space. Because while Adam dominates

00:16:31.240 --> 00:16:34.500
the practical engineering side of AI, the theoretical

00:16:34.500 --> 00:16:36.820
side, the mathematicians on the shore are still

00:16:36.820 --> 00:16:39.279
wrestling with the extreme numerical instability

00:16:39.279 --> 00:16:41.559
of these algorithms and trying to mathematically

00:16:41.559 --> 00:16:44.740
model how SGD's inherent randomness actually

00:16:44.740 --> 00:16:46.960
functions in the real world. Let's examine that

00:16:46.960 --> 00:16:48.799
instability first because it's a massive headache.

00:16:49.600 --> 00:16:52.379
Classical SGD can become wildly unstable if that

00:16:52.379 --> 00:16:54.779
learning rate, eta, is misspecified by even

00:16:54.779 --> 00:16:57.460
a tiny amount. Just a tiny bit off. Just a fraction.

00:16:58.100 --> 00:17:00.700
The math becomes unstable, essentially. The feedback

00:17:00.700 --> 00:17:03.259
loop of the algorithm multiplies on itself, causing

00:17:03.259 --> 00:17:06.099
our virtual heavy ball to vibrate so violently

00:17:06.099 --> 00:17:08.799
it shatters the calculation and diverges numerically

00:17:08.799 --> 00:17:11.720
within just a few iterations. Everything crashes.

00:17:12.319 --> 00:17:14.819
To solve this, theoreticians explored something

00:17:14.819 --> 00:17:18.640
called implicit updates, or ISGD. Okay, so I

00:17:18.640 --> 00:17:20.799
understand the explicit math calculating the

00:17:20.799 --> 00:17:23.079
slope and taking a step. How does an implicit

00:17:23.079 --> 00:17:25.559
update actually stop the calculation from shattering?

00:17:25.740 --> 00:17:28.180
In classic SGD, you look at exactly where you

00:17:28.180 --> 00:17:30.119
are standing right now, decide your slope, and

00:17:30.119 --> 00:17:32.519
take a leap. It's an explicit equation. If you

00:17:32.519 --> 00:17:35.660
leap too far, you crash. In implicit SGD, the

00:17:35.660 --> 00:17:37.920
stochastic gradient is evaluated at the next

00:17:37.920 --> 00:17:40.319
iterate. The new position mathematically appears

00:17:40.319 --> 00:17:42.660
on both sides of the equation. So it's like having

00:17:42.660 --> 00:17:45.759
a mathematical tether. It calculates the slope

00:17:45.759 --> 00:17:48.160
based on where you are going to land, pulling

00:17:48.160 --> 00:17:51.140
you back if that future landing spot is wildly

00:17:51.140 --> 00:17:53.940
off course. It self-corrects the step before

00:17:53.940 --> 00:17:56.660
you even fully take it. That tether concept is

00:17:56.660 --> 00:17:59.140
spot on. It effectively normalizes the learning

00:17:59.140 --> 00:18:01.779
rate from the future backward. So even if you

00:18:01.779 --> 00:18:04.680
wildly misspecified the step size, the procedure

00:18:04.680 --> 00:18:07.240
remains numerically stable virtually for all

00:18:07.240 --> 00:18:09.859
possible values of eta. It's incredibly robust.
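
NOTE
A sketch of the implicit update for a single squared-error example: because the
gradient is evaluated at the landing point, the step has a closed form, and the
1 + eta * ||x||^2 denominator is the "tether" that keeps it stable for any eta.
This is the standard derivation for this particular loss, shown as an illustration.
    import numpy as np
    def isgd_step(w, x, y, eta=1.0):
        # explicit SGD would take:  w - eta * (x @ w - y) * x
        # implicit SGD solves w_new = w - eta * (x @ w_new - y) * x, giving:
        return w - eta * (x @ w - y) * x / (1.0 + eta * (x @ x))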

00:18:09.900 --> 00:18:12.200
That's amazing. But fixing this stability is

00:18:12.200 --> 00:18:14.920
only half the battle. The sources also dive deep

00:18:14.920 --> 00:18:17.400
into how mathematicians try to model the actual

00:18:17.400 --> 00:18:19.940
continuous path the algorithm takes. They try

00:18:19.940 --> 00:18:22.880
to view SGD as a discretization of a gradient

00:18:22.880 --> 00:18:25.559
flow ordinary differential equation, or ODE.

00:18:25.839 --> 00:18:27.839
Basically trying to draw a smooth continuous

00:18:27.839 --> 00:18:30.000
line over all the jagged random little steps.

00:18:30.200 --> 00:18:33.380
Exactly. But an ODE only gets you so far because

00:18:33.380 --> 00:18:35.720
SGD is inherently noisy. Remember, it's just sampling

00:18:35.720 --> 00:18:38.720
random subsets of data at every step. A standard

00:18:38.720 --> 00:18:40.859
ordinary differential equation cannot capture

00:18:40.859 --> 00:18:43.099
the random fluctuations around the mean behavior.

00:18:43.759 --> 00:18:46.440
So theoreticians have to move to stochastic differential

00:18:46.440 --> 00:18:49.400
equations or SDEs. To capture the randomness.

00:18:49.819 --> 00:18:52.859
And the source mentions they model this random

00:18:52.859 --> 00:18:55.740
noise using something called Brownian motion.
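
NOTE
For reference, the continuous-time picture sketched here is commonly written,
up to scaling conventions, roughly as follows (a hedged paraphrase of the usual
SDE-approximation form, not a formula quoted from the source):
    dW_t = -\nabla L(W_t)\, dt                                        (gradient-flow ODE)
    dW_t = -\nabla L(W_t)\, dt + \sqrt{\eta}\, \Sigma(W_t)^{1/2}\, dB_t   (SDE with noise)
where B_t is Brownian motion and \Sigma is the covariance of the stochastic
gradient noise.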

00:18:56.000 --> 00:18:58.619
That's the exact same math used in physics to

00:18:58.619 --> 00:19:01.440
describe the random jiggling of microscopic pollen

00:19:01.440 --> 00:19:04.279
particles suspended in water. When I read that

00:19:04.279 --> 00:19:06.150
I had to stop. It sounds totally disconnected,

00:19:06.210 --> 00:19:09.210
right? Yeah. I mean, is this just academic flexing,

00:19:09.349 --> 00:19:11.470
using the math of floating pollen to describe

00:19:11.470 --> 00:19:14.690
cutting edge AI? Or does modeling this noise

00:19:14.690 --> 00:19:17.269
actually help us build better artificial intelligence?

00:19:17.289 --> 00:19:19.049
If we connect this to the bigger picture, these

00:19:19.049 --> 00:19:21.130
theoretical models, the SDEs and the Brownian

00:19:21.130 --> 00:19:23.769
motion, they are attempts to mathematically tame

00:19:23.769 --> 00:19:26.390
and map the beautiful chaos of the algorithm

00:19:26.390 --> 00:19:29.589
to prove exactly why the blindfolded climber

00:19:29.589 --> 00:19:32.109
makes it to the bottom. And it absolutely helps

00:19:32.109 --> 00:19:34.549
us build better tools. Really? How so? Well,

00:19:34.609 --> 00:19:37.309
understanding that the noise in SGD behaves like

00:19:37.309 --> 00:19:40.009
Brownian motion allows researchers to design

00:19:40.009 --> 00:19:42.849
much better limits, vastly superior learning

00:19:42.849 --> 00:19:45.750
rate schedules, and even new variations of the

00:19:45.750 --> 00:19:48.329
algorithm that intentionally inject or manipulate

00:19:48.329 --> 00:19:51.230
that noise to prevent the AI from getting stuck

00:19:51.230 --> 00:19:54.490
in a bad local minimum. It bridges the gap between

00:19:54.490 --> 00:19:57.170
the empirical success we talked about and deep

00:19:57.170 --> 00:19:59.589
theoretical guarantees. So what does this all

00:19:59.589 --> 00:20:02.769
mean? Let's zoom all the way out. We started

00:20:02.769 --> 00:20:05.470
our journey in 1951 with the Robbins-Monro

00:20:05.470 --> 00:20:07.950
algorithm, watched it evolve through the back

00:20:07.950 --> 00:20:11.250
propagation breakthroughs of 1986, took a detour

00:20:11.250 --> 00:20:13.650
through Geoffrey Hinton's Coursera class in 2012,

00:20:14.170 --> 00:20:16.750
and arrived at the Adam optimizer that dominates

00:20:16.750 --> 00:20:19.309
the frameworks today. It is a remarkable trajectory,

00:20:19.609 --> 00:20:21.690
a mathematical framework refined over decades,

00:20:21.890 --> 00:20:23.950
borrowing from physics, hardware engineering,

00:20:24.390 --> 00:20:26.990
and pure necessity. And it's wild to think that

00:20:26.990 --> 00:20:29.390
every single time you use an AI tool, every time

00:20:29.390 --> 00:20:32.029
you generate a gorgeous image, or ask a chat

00:20:32.029 --> 00:20:34.630
bot to summarize a document, or rely on a medical

00:20:34.630 --> 00:20:38.089
AI to scan an x-ray, this iterative stochastic

00:20:38.089 --> 00:20:40.130
game of blindfolded mountain climbing is what

00:20:40.130 --> 00:20:42.309
made it possible. A concept that was designed

00:20:42.309 --> 00:20:45.190
originally for one very simple pragmatic reason:

00:20:45.190 --> 00:20:47.769
to save compute time. They just wanted to avoid

00:20:47.769 --> 00:20:50.329
calculating the whole data set. And that shortcut,

00:20:50.509 --> 00:20:52.930
that compromise became the beating heart of modern

00:20:52.930 --> 00:20:55.630
technology. Necessity truly is the mother of

00:20:55.630 --> 00:20:58.589
invention. By embracing the noise and the imperfection

00:20:58.589 --> 00:21:01.029
rather than fighting it, they unlocked the massive

00:21:01.029 --> 00:21:03.410
scale we see today. And that actually leaves me

00:21:03.410 --> 00:21:05.329
with a final lingering thought for you to ponder.

00:21:05.930 --> 00:21:07.970
Based on that whole section about continuous

00:21:07.970 --> 00:21:11.930
time and Brownian motion, if SGD's defining feature

00:21:11.930 --> 00:21:15.190
is its randomness, the stochastic noise it injects

00:21:15.190 --> 00:21:17.890
just to save time, is it possible that this noise

00:21:17.890 --> 00:21:20.769
isn't a bug to be minimized? What if it's the

00:21:20.769 --> 00:21:23.660
actual feature? Think about it. If the algorithm

00:21:23.660 --> 00:21:25.740
had infinite compute power and took a mathematically

00:21:25.740 --> 00:21:28.380
perfect noiseless path every single time, it

00:21:28.380 --> 00:21:30.380
might rigidly memorize the training data and

00:21:30.380 --> 00:21:33.599
completely fail at anything new. Could the random

00:21:33.599 --> 00:21:36.140
jiggling chaos of Brownian motion be the very

00:21:36.140 --> 00:21:39.400
thing that forces an AI to be adaptable, to generalize

00:21:39.400 --> 00:21:41.759
essentially, to be creative? It makes you look

00:21:41.759 --> 00:21:44.000
at the errors, the noise, the random stumbling

00:21:44.000 --> 00:21:45.960
in your own learning process in a whole new light.

00:21:46.660 --> 00:21:48.539
Maybe the blindfold is what teaches us how to

00:21:48.539 --> 00:21:49.279
truly see the mountain.
