WEBVTT

00:00:00.000 --> 00:00:03.080
You know that feeling like when you're cramming

00:00:03.080 --> 00:00:06.400
for a massive, high-stakes exam? Oh, yeah. Freaking

00:00:06.400 --> 00:00:08.779
out at the last minute. Exactly. You sit down at

00:00:08.779 --> 00:00:12.220
a desk at like two in the morning, totally surrounded

00:00:12.220 --> 00:00:15.199
by textbooks and you just try to ingest an entire

00:00:15.199 --> 00:00:17.559
semester's worth of dense information all at

00:00:17.559 --> 00:00:19.760
once before the sun comes up. Right. Which is exhausting,

00:00:19.760 --> 00:00:22.600
just forcing it into your working memory. It requires

00:00:22.600 --> 00:00:25.120
a ridiculous amount of brainpower to hold it

00:00:25.120 --> 00:00:27.559
all in. And, well, the worst part is, once the exam

00:00:27.559 --> 00:00:30.440
is actually over, your knowledge base is essentially

00:00:30.649 --> 00:00:33.729
locked in. Yeah, the test is done. Right. If

00:00:33.729 --> 00:00:35.549
a groundbreaking new piece of research drops

00:00:35.549 --> 00:00:38.070
the very next day, your exam score doesn't reflect

00:00:38.070 --> 00:00:40.210
it. You've already taken the test, so your model

00:00:40.210 --> 00:00:43.009
of the world is basically frozen. Which is exactly

00:00:43.009 --> 00:00:45.909
the limitation traditional machine learning architectures

00:00:45.909 --> 00:00:48.130
face. Yeah. I mean, the moment the engineers

00:00:48.130 --> 00:00:50.869
finish running the training data and put their

00:00:50.869 --> 00:00:53.630
pencils down, that model is effectively static.

00:00:53.950 --> 00:00:57.229
Welcome to the deep dive. We have a seriously

00:00:57.229 --> 00:00:59.490
fascinating piece of source material on the table

00:00:59.490 --> 00:01:02.719
today. We're exploring a comprehensive Wikipedia

00:01:02.719 --> 00:01:05.739
article on online machine learning. It's a huge

00:01:05.739 --> 00:01:08.620
topic. It really is. And if you are listening

00:01:08.620 --> 00:01:10.180
to this on your commute right now, or maybe you're

00:01:10.180 --> 00:01:13.079
working out, our mission today is to cut through

00:01:13.079 --> 00:01:15.680
all the dense algorithmic jargon and show you

00:01:15.680 --> 00:01:18.060
how machines are actually learning to adapt in

00:01:18.060 --> 00:01:21.040
real time, side by side with you. Because the

00:01:21.040 --> 00:01:22.920
real world doesn't work like that late night

00:01:22.920 --> 00:01:25.599
cram session. Exactly. The real world never stops

00:01:25.599 --> 00:01:27.920
generating data. Okay, let's unpack this. Well

00:01:27.920 --> 00:01:29.840
what's fascinating here is that the transition

00:01:29.840 --> 00:01:32.799
to online machine learning allows algorithms

00:01:32.799 --> 00:01:36.680
to dynamically adapt to sequential data, basically

00:01:36.680 --> 00:01:39.620
processing it one piece at a time. Right. And

00:01:39.620 --> 00:01:41.939
this isn't just some theoretical computer science

00:01:41.939 --> 00:01:44.659
concept waiting in a lab somewhere. If your map

00:01:44.659 --> 00:01:47.200
app just rerouted you around a sudden traffic

00:01:47.200 --> 00:01:50.420
jam that didn't exist five minutes ago, that

00:01:50.420 --> 00:01:53.159
is online learning. Oh, wow. So it's everywhere.

00:01:53.459 --> 00:01:55.959
Completely. It's the architecture updating automated

00:01:55.959 --> 00:01:58.920
stock portfolios as the financial markets fluctuate

00:01:58.920 --> 00:02:01.579
second by second. It's spotting anomalous credit

00:02:01.579 --> 00:02:04.489
card transactions in milliseconds. We're even

00:02:04.489 --> 00:02:06.930
seeing a massive push to apply these paradigms

00:02:06.930 --> 00:02:09.930
to large language models. Right, so they can

00:02:09.930 --> 00:02:12.770
continuously adapt and update their factual baselines

00:02:12.770 --> 00:02:15.629
after their initial training runs, which are

00:02:15.629 --> 00:02:18.949
incredibly expensive. Exactly. So to really appreciate

00:02:18.949 --> 00:02:21.729
how much of a paradigm shift that is, I think

00:02:21.729 --> 00:02:24.110
we need to look at the baseline standard the

00:02:24.110 --> 00:02:25.969
industry uses, which is called batch learning,

00:02:26.150 --> 00:02:30.550
and why it's currently hitting a massive brick

00:02:30.550 --> 00:02:32.969
wall. Yeah, batch learning is struggling. In

00:02:32.969 --> 00:02:36.389
a batch environment, you generate your best possible

00:02:36.389 --> 00:02:39.710
predictor by feeding the entire historical data

00:02:39.710 --> 00:02:42.490
set into the machine simultaneously. All at once.

00:02:42.569 --> 00:02:44.990
All at once. The algorithm churns through the

00:02:44.990 --> 00:02:47.490
entirety of the data, finds the patterns, and

00:02:47.490 --> 00:02:50.330
just spits out a finalized model. Yeah. And for

00:02:50.330 --> 00:02:52.610
a long time, when data sets were relatively small,

00:02:52.830 --> 00:02:55.050
that worked brilliantly. But not anymore. No.

00:02:55.090 --> 00:02:57.150
We have scaled way past the point where that

00:02:57.150 --> 00:02:59.370
is physically viable for a lot of applications.

00:02:59.810 --> 00:03:02.090
When you're dealing with modern web scale data

00:03:02.090 --> 00:03:04.650
sets, batch learning becomes computationally

00:03:04.650 --> 00:03:06.409
impossible. Because it's just too much data,

00:03:06.449 --> 00:03:10.280
right? Right. The volume of data literally cannot

00:03:10.280 --> 00:03:13.139
fit into the computer's main memory, or RAM,

00:03:13.300 --> 00:03:15.780
all at once. When a system tries to load more

00:03:15.780 --> 00:03:18.599
data than the RAM can hold, it starts paging

00:03:18.599 --> 00:03:20.740
or swapping data to the hard drive. Which is

00:03:20.740 --> 00:03:23.840
agonizingly slow. It is. It effectively chokes

00:03:23.840 --> 00:03:26.520
the system. This forces engineers to use what

00:03:26.520 --> 00:03:29.360
are called out of core algorithms, where the

00:03:29.360 --> 00:03:32.419
system has to process data from the storage drive

00:03:32.419 --> 00:03:36.500
in chunks. Yikes. And beyond the hardware limitations,

00:03:36.659 --> 00:03:39.000
the other major failure point for batch learning

00:03:39.000 --> 00:03:41.800
is dealing with time series data. Oh, absolutely.

00:03:41.939 --> 00:03:43.840
Like if you're building a predictive model for

00:03:43.840 --> 00:03:46.879
international financial market prices, you can't

00:03:46.879 --> 00:03:49.340
train your model on all the data because, well,

00:03:49.479 --> 00:03:51.469
tomorrow's data hasn't happened yet. Right. The

00:03:51.469 --> 00:03:53.509
underlying environment is constantly shifting.

00:03:54.009 --> 00:03:56.289
Economic policies, consumer sentiment. It never

00:03:56.289 --> 00:03:59.069
stops. In statistical terms, the underlying distribution

00:03:59.069 --> 00:04:01.430
of the data is non-stationary. The rules of

00:04:01.430 --> 00:04:03.430
the game are changing while you are actively

00:04:03.430 --> 00:04:06.030
playing it. I look at it this way. Batch learning

00:04:06.030 --> 00:04:09.229
is like trying to memorize an entire public library.

00:04:09.750 --> 00:04:12.389
The moment the librarian adds a single new book

00:04:12.389 --> 00:04:15.270
to the shelf, your knowledge is technically incomplete.

00:04:15.530 --> 00:04:18.480
Yeah, out of date immediately. Exactly. And to

00:04:18.480 --> 00:04:21.420
update it, you have to read the entire library

00:04:21.420 --> 00:04:24.220
from start to finish all over again just to integrate

00:04:24.220 --> 00:04:27.180
that one new plot line. Whereas online learning

00:04:27.180 --> 00:04:30.120
is like reading a single page, updating your

00:04:30.120 --> 00:04:32.279
worldview based on the new information on that

00:04:32.279 --> 00:04:34.800
specific page, and moving on to the next. That's

00:04:34.800 --> 00:04:37.180
a perfect way to picture it. Imagine your email

00:04:37.180 --> 00:04:39.339
spam filter waiting until the end of the year

00:04:39.339 --> 00:04:42.160
to run a massive batch learning session just

00:04:42.160 --> 00:04:44.500
to learn what a newly invented phishing scam

00:04:44.500 --> 00:04:47.379
looks like. Your inbox would be completely compromised

00:04:47.379 --> 00:04:50.279
for 12 months. You'd be ruined. Seriously. So

00:04:50.279 --> 00:04:52.699
to understand how we engineer a system to learn

00:04:52.699 --> 00:04:55.540
page by page, the source outlines the statistical

00:04:55.540 --> 00:04:57.779
view of the problem. Yeah. So a machine learning

00:04:57.779 --> 00:05:00.519
model never actually knows the true distribution

00:05:00.519 --> 00:05:04.560
of data. It doesn't possess a god's eye view

00:05:04.560 --> 00:05:07.720
of all possible spam emails or financial shifts

00:05:07.720 --> 00:05:10.180
that will ever exist. It's basically flying blind

00:05:10.180 --> 00:05:12.879
to the big picture. Right. It only has access

00:05:12.879 --> 00:05:15.500
to a finite training set of examples and, crucially,

00:05:16.360 --> 00:05:19.819
a loss function. Ah, the loss function. This

00:05:19.819 --> 00:05:22.259
is our North Star here. For you listening, it's

00:05:22.259 --> 00:05:24.379
the mathematical delta between our prediction

00:05:24.379 --> 00:05:26.639
and reality. It basically measures the penalty

00:05:26.639 --> 00:05:29.579
for a bad prediction. Exactly. If the model guesses

00:05:29.579 --> 00:05:32.500
a stock will go up by 2% and it actually drops

00:05:32.500 --> 00:05:36.589
by 5%, the loss function quantifies exactly how

00:05:36.589 --> 00:05:39.170
wrong that guess was. And minimizing that total

00:05:39.170 --> 00:05:42.069
loss over time is the sole objective of the system.
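
NOTE
The source doesn't name a specific loss function, so purely as an illustration, here is the squared loss applied to the stock example above (predicted +2%, actual -5%); the choice of squared loss is our assumption:
  \ell(\hat{y}, y) = (\hat{y} - y)^2, \qquad \ell(0.02, -0.05) = (0.02 - (-0.05))^2 = 0.0049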

00:05:42.689 --> 00:05:45.029
In a pure online model, the algorithm looks at

00:05:45.029 --> 00:05:47.529
the brand new input, checks its current internal

00:05:47.529 --> 00:05:49.850
weights to make a prediction, looks at the true

00:05:49.850 --> 00:05:52.649
outcome, calculates the loss function, and makes

00:05:52.649 --> 00:05:54.829
a tiny mathematical adjustment to its weights.
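
NOTE
A minimal sketch of that predict-observe-update loop in Python. The model interface (predict, loss, update) is hypothetical, named here only to make the steps explicit:
  def run_online(stream, model):
      for x, y_true in stream:                  # one (input, outcome) pair at a time
          y_pred = model.predict(x)             # 1. predict with the current weights
          penalty = model.loss(y_pred, y_true)  # 2. measure how wrong the guess was
          model.update(x, y_true)               # 3. tiny adjustment to the weights
          # 4. the raw data point is then discarded, not stored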

00:05:55.050 --> 00:05:57.029
But the mechanical question is how the system

00:05:57.029 --> 00:05:59.230
actually executes that math without starting

00:05:59.230 --> 00:06:02.399
over. Right, because if we absolutely must update

00:06:02.399 --> 00:06:04.720
sequentially to avoid the memory limits of batch

00:06:04.720 --> 00:06:07.379
learning, how does the machine recalculate its

00:06:07.379 --> 00:06:09.879
trajectory every single time a new data point

00:06:09.879 --> 00:06:12.180
arrives? Well, the text walks us through the

00:06:12.180 --> 00:06:15.379
fundamental example of linear least squares.

00:06:15.980 --> 00:06:18.680
It's a classic method for drawing a line of best

00:06:18.680 --> 00:06:21.160
fit through scattered data. Like the scatter

00:06:21.160 --> 00:06:23.259
plots you'd see in high school math. Exactly.

00:06:23.720 --> 00:06:25.879
Now, if you attempt this using a brute force

00:06:25.879 --> 00:06:28.959
batch approach, every time a new data point arrives,

00:06:29.420 --> 00:06:31.379
you have to recalculate everything to find the

00:06:31.379 --> 00:06:34.689
new perfect line. Mathematically, this involves

00:06:34.689 --> 00:06:37.189
inverting a covariance matrix. And when you look

00:06:37.189 --> 00:06:39.310
at the covariance matrix, for anyone dealing

00:06:39.310 --> 00:06:41.610
with high dimensional data, you know that tracking

00:06:41.610 --> 00:06:44.529
how every single variable relates to every other

00:06:44.529 --> 00:06:47.670
variable becomes a massive computationally heavy

00:06:47.670 --> 00:06:50.790
grid. Oh, yeah. As your features grow, that matrix

00:06:50.790 --> 00:06:53.629
becomes unwieldy. The computational cost is immense.

00:06:53.889 --> 00:06:55.750
How bad does it get? Well, the text notes that

00:06:55.750 --> 00:06:57.889
performing this naive recalculation for every

00:06:57.889 --> 00:07:01.069
new data point results in a time complexity of

00:07:01.149 --> 00:07:04.110
Big O of n times d cubed. Yeah, okay, d cubed.

00:07:04.370 --> 00:07:06.569
Right, where n is the number of data points,

00:07:07.029 --> 00:07:09.470
and d represents the dimensions, or the number

00:07:09.470 --> 00:07:11.629
of features in your data. Because the dimensions

00:07:11.629 --> 00:07:14.129
are cubed, if you increase your features from

00:07:14.129 --> 00:07:17.750
10 to just 100, the time it takes to compute

00:07:17.750 --> 00:07:20.790
that matrix inversion skyrockets from 1,000

00:07:20.790 --> 00:07:24.769
operations to 1 million. Wow. It scales terribly.

00:07:25.029 --> 00:07:27.810
It scales incredibly poorly. That's an absolute

00:07:27.810 --> 00:07:30.310
death sentence for complex modern data sets.
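
NOTE
For reference, the naive batch recalculation being described is the closed-form least squares solution (notation ours, not quoted from the source):
  w = (X^\top X)^{-1} X^\top y
Re-solving this from scratch for each of the n arriving points means n matrix inversions at roughly d^3 operations each, which is the O(n d^3) total quoted above: about 10^3 operations per inversion at d = 10 features, about 10^6 at d = 100.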

00:07:30.649 --> 00:07:32.970
So the first solution the source offers to mitigate

00:07:32.970 --> 00:07:36.949
this is recursive least squares, or RLS. Right,

00:07:36.949 --> 00:07:39.790
RLS. Instead of recalculating the entire covariance

00:07:39.790 --> 00:07:43.149
matrix from scratch, RLS stores a single summary

00:07:43.149 --> 00:07:45.829
matrix like a condensed representation of everything

00:07:45.829 --> 00:07:48.329
it's learned so far. When a new data point arrives,

00:07:48.529 --> 00:07:50.649
the algorithm performs a much simpler update

00:07:50.649 --> 00:07:53.149
to just that summary matrix. It's a really elegant

00:07:53.149 --> 00:07:55.769
trick. It drops the time complexity from D cubed

00:07:55.769 --> 00:07:58.720
down to big O of n times d squared. Which is

00:07:58.720 --> 00:08:01.079
huge. You shave off an entire order of magnitude

00:08:01.079 --> 00:08:03.480
in processing time. It makes the math significantly

00:08:03.480 --> 00:08:06.019
faster, allowing the system to keep up with faster

00:08:06.019 --> 00:08:09.060
data streams. OK, sure, RLS drops the compute

00:08:09.060 --> 00:08:11.240
time, but I'm looking at the memory requirements.

00:08:11.939 --> 00:08:14.639
If we are tracking an infinite stream of data

00:08:14.639 --> 00:08:16.879
in a production environment, that summary matrix

00:08:16.879 --> 00:08:18.699
is still going to grow and become a bottleneck,

00:08:18.819 --> 00:08:21.379
right? Yes, absolutely. We haven't entirely solved

00:08:21.379 --> 00:08:23.579
the resource issue. We've just shifted the burden.

00:08:24.079 --> 00:08:26.819
As the data flows in continuously, wouldn't storing

00:08:26.819 --> 00:08:29.079
and updating that dense matrix eventually hit

00:08:29.079 --> 00:08:31.939
a wall anyway? That is the pivotal bottleneck.
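
NOTE
A minimal sketch of the recursive least squares update just described, done as a rank-one (Sherman-Morrison) update of the stored summary matrix; variable names and the initialization are our assumptions:
  import numpy as np
  def rls_update(w, P, x, y):
      # w: current weights (d,); P: running inverse of the summary matrix (d, d).
      err = y - w @ x                # prediction error on the new point
      Px = P @ x                     # O(d^2): the only heavy work per update
      gain = Px / (1.0 + x @ Px)     # how strongly to correct for this point
      w = w + gain * err             # adjust the weights, no fresh inversion
      P = P - np.outer(gain, Px)     # keep the stored inverse current
      return w, P
  # A typical start: w = np.zeros(d), P = np.eye(d) / lam for a small lam above 0.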

00:08:32.039 --> 00:08:34.559
And it's exactly why the field had to develop

00:08:34.559 --> 00:08:37.179
a fundamentally different approach. That brings

00:08:37.179 --> 00:08:41.580
us to stochastic gradient descent, or SGD, which

00:08:41.580 --> 00:08:44.580
just abandons complex matrix inversions entirely.

00:08:45.059 --> 00:08:47.440
SGD is essentially the underlying engine powering

00:08:47.440 --> 00:08:49.960
the vast majority of modern artificial intelligence.

00:08:50.179 --> 00:08:52.909
It really is the workhorse of the industry. Rather

00:08:52.909 --> 00:08:55.389
than trying to find the perfect exact mathematical

00:08:55.389 --> 00:08:57.769
line of best fit by looking at a massive matrix

00:08:57.769 --> 00:09:01.110
of past data, SGD updates the predictor by taking

00:09:01.110 --> 00:09:04.330
a small, calculated step based solely on the gradient

00:09:04.330 --> 00:09:06.669
of the loss function for the current data point.

00:09:06.970 --> 00:09:09.509
I love visualizing this. If we visualize the

00:09:09.509 --> 00:09:12.149
loss function as a physical landscape, the gradient

00:09:12.149 --> 00:09:14.429
is the slope of the ground beneath your feet.

00:09:14.669 --> 00:09:17.730
Yes. The algorithm is essentially standing on

00:09:17.730 --> 00:09:20.269
a mountain blindfolded. It can't see the entire

00:09:20.269 --> 00:09:22.830
landscape like it would in batch data. So it

00:09:22.830 --> 00:09:25.769
just feels around with its foot to find the steepest

00:09:25.769 --> 00:09:28.529
downward slope, taking one step at a time toward

00:09:28.529 --> 00:09:30.990
the valley floor, which represents the lowest

00:09:30.990 --> 00:09:34.110
possible error. That's a great analogy. And because

00:09:34.110 --> 00:09:36.549
it only takes that one step based on the slope

00:09:36.549 --> 00:09:39.289
of the immediate data point, the time complexity

00:09:39.289 --> 00:09:43.570
drops again down to big O of N times D. Not squared,

00:09:43.669 --> 00:09:47.549
not cubed. Just D. Just D. But more importantly,

00:09:47.990 --> 00:09:50.669
this addresses your exact concern regarding memory.

00:09:51.330 --> 00:09:53.990
The storage requirements become strictly constant.

00:09:54.330 --> 00:09:57.629
Wait, really? Constant memory? Yes. You do not

00:09:57.629 --> 00:09:59.970
need to store a growing matrix of past data.

00:10:01.529 --> 00:10:04.590
Whether the system processes 10 data points, or 10 billion, the

00:10:04.590 --> 00:10:06.769
amount of RAM required to hold the model parameters

00:10:06.769 --> 00:10:09.269
stays exactly the same. OK, here's where it gets

00:10:09.269 --> 00:10:11.450
really interesting, though. There is a specific

00:10:11.450 --> 00:10:13.669
mechanic the text highlights regarding how the

00:10:13.669 --> 00:10:15.669
algorithm takes those steps. It's known as the

00:10:15.669 --> 00:10:17.509
decaying step size, or the learning rate schedule.

00:10:17.590 --> 00:10:20.960
Ah, yes. Crucial detail. If the algorithm reacts

00:10:20.960 --> 00:10:23.960
too aggressively to every single new piece of

00:10:23.960 --> 00:10:26.519
data, like taking massive leaps down that mountain,

00:10:26.940 --> 00:10:29.279
it will constantly overshoot the lowest point

00:10:29.279 --> 00:10:31.220
of the valley. Right, it'll just bounce back

00:10:31.220 --> 00:10:33.120
and forth up the opposite walls of the canyon

00:10:33.120 --> 00:10:36.179
forever. Exactly. So by actively decaying the

00:10:36.179 --> 00:10:38.279
step size, meaning the algorithm takes smaller

00:10:38.279 --> 00:10:40.460
and smaller steps as time goes on and it gets

00:10:40.460 --> 00:10:43.740
closer to the minimum, it stabilizes and smoothly

00:10:43.740 --> 00:10:46.299
converges on the optimal mathematical truth.
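
NOTE
A minimal sketch of that idea for squared loss: one gradient step per data point with a step size that decays over time. The constants and the synthetic stream are our own, purely for illustration:
  import numpy as np
  def sgd_step(w, x, y, t, eta0=0.1):
      grad = (w @ x - y) * x           # gradient of 0.5*(w.x - y)^2 at this one point
      eta = eta0 / np.sqrt(1.0 + t)    # decaying step size: smaller steps over time
      return w - eta * grad            # O(d) work, constant memory
  rng = np.random.default_rng(0)
  w_true, w = np.array([2.0, -1.0, 0.5]), np.zeros(3)
  for t in range(10_000):              # the stream: each point is seen once, then dropped
      x = rng.normal(size=3)
      y = w_true @ x + rng.normal(scale=0.1)
      w = sgd_step(w, x, y, t)
  print(np.round(w, 2))                # ends up close to w_true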

00:10:46.659 --> 00:10:49.850
It's a brilliant stabilization mechanism. However,

00:10:50.570 --> 00:10:53.549
and this is a big however, gradient descent in

00:10:53.549 --> 00:10:56.870
this pure form works remarkably well for relatively

00:10:56.870 --> 00:11:00.190
simple linear problems. Yeah. But complex real

00:11:00.190 --> 00:11:02.929
world data is rarely linear. Yeah, you cannot

00:11:02.929 --> 00:11:05.509
just draw a straight line to separate user behavior

00:11:05.509 --> 00:11:07.590
patterns or financial anomalies. The data points

00:11:07.590 --> 00:11:10.289
are tangled up in incredibly messy nonlinear

00:11:10.289 --> 00:11:12.929
ways. So to handle nonlinear decision making

00:11:12.929 --> 00:11:15.509
on the fly, the text introduces kernel methods.

00:11:15.789 --> 00:11:17.929
Okay, kernels, how do those work? Well, when

00:11:17.929 --> 00:11:20.149
you have a data set that cannot be separated

00:11:20.149 --> 00:11:22.149
by a straight plane in its current dimensions,

00:11:22.789 --> 00:11:25.389
kernel methods use a mathematical function to

00:11:25.389 --> 00:11:28.230
project that data into a much higher, sometimes

00:11:28.230 --> 00:11:30.789
infinite dimensional space. Infinite dimensions?

00:11:30.909 --> 00:11:33.309
Wow. Yeah. By adding these dimensions, the data

00:11:33.309 --> 00:11:36.529
stretches out and suddenly a straight line actually

00:11:36.529 --> 00:11:39.370
can cleanly separate the categories. Projecting

00:11:39.370 --> 00:11:41.649
data into infinite dimensions mathematically

00:11:41.649 --> 00:11:44.870
solves the non-linearity problem. But kernel

00:11:44.870 --> 00:11:47.590
methods create a totally new architectural problem

00:11:47.590 --> 00:11:50.629
for an online system. The text points to the

00:11:50.629 --> 00:11:52.769
representer theorem. Right, this representer

00:11:52.769 --> 00:11:54.789
theorem basically states that to evaluate this

00:11:54.789 --> 00:11:56.929
infinite dimensional kernel space for a brand

00:11:56.929 --> 00:11:59.409
new piece of data, the algorithm actually has

00:11:59.409 --> 00:12:01.990
to reference the past data points. Which completely

00:12:01.990 --> 00:12:04.409
breaks the pure online learning paradigm we just

00:12:04.409 --> 00:12:07.110
talked about. It does. The algorithm can no longer

00:12:07.110 --> 00:12:09.110
just take a step and forget the data like it

00:12:09.110 --> 00:12:11.590
does in pure stochastic gradient descent. It

00:12:11.590 --> 00:12:14.710
must store a subset of previous examples, often

00:12:14.710 --> 00:12:17.509
called support vectors, to define that complex

00:12:17.509 --> 00:12:20.269
boundary. So it becomes a hybrid approach. Exactly.

00:12:20.570 --> 00:12:23.610
You gain the ability to model highly complex

00:12:23.610 --> 00:12:27.190
nonlinear realities, but you lose that beautiful

00:12:27.190 --> 00:12:29.870
constant memory guarantee. You're forced back

00:12:29.870 --> 00:12:32.889
into a tradeoff between accuracy and memory limits.
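
NOTE
In symbols, the representer theorem says the learned function is expressed through the retained past examples: for a kernel K, stored points x_i, and learned coefficients alpha_i (notation ours),
  f(x) = \sum_{i=1}^{m} \alpha_i \, K(x_i, x)
so memory now grows with m, the number of stored examples, instead of staying constant.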

00:12:33.029 --> 00:12:35.450
And because we're no longer just cleanly calculating

00:12:35.450 --> 00:12:38.389
a straight line, but rather projecting data and

00:12:38.389 --> 00:12:40.669
placing bets in a highly uncertain, shifting

00:12:40.669 --> 00:12:43.370
environment, mathematicians had to reframe the

00:12:43.370 --> 00:12:46.200
entire problem. They did. The source introduces

00:12:46.200 --> 00:12:48.799
a broader theoretical framework called Online

00:12:48.799 --> 00:12:52.960
Convex Optimization, or OCO. OCO abstracts this

00:12:52.960 --> 00:12:55.639
entire process into a repeated strategic game

00:12:55.639 --> 00:12:59.559
played against nature. You, acting as the machine

00:12:59.559 --> 00:13:02.019
learning algorithm, must make a prediction for

00:13:02.019 --> 00:13:05.340
the current round. Then, nature, the real world,

00:13:05.600 --> 00:13:07.639
reveals the true answer, which is your loss function.

00:13:07.740 --> 00:13:10.120
Right. You suffer a quantifiable loss based on

00:13:10.120 --> 00:13:12.039
the inaccuracy of your prediction. You update

00:13:12.039 --> 00:13:13.659
your internal parameters, and the next round

00:13:13.659 --> 00:13:16.340
begins. The ultimate mathematical objective of

00:13:16.340 --> 00:13:18.779
this game is not to be perfect every time, but

00:13:18.779 --> 00:13:22.750
to minimize regret. The game of regret. I love

00:13:22.750 --> 00:13:25.049
how that's phrased. To put this into a real-world

00:13:25.049 --> 00:13:27.990
perspective for you listening, think about mathematical

00:13:27.990 --> 00:13:31.029
regret like reviewing your personal investment

00:13:31.029 --> 00:13:34.360
portfolio at the end of the fiscal year. That's

00:13:34.360 --> 00:13:36.980
a painful thought. Right. You look at the cumulative

00:13:36.980 --> 00:13:39.340
returns you actually achieved by making daily

00:13:39.340 --> 00:13:42.960
guesses and trading actively. Then you compare

00:13:42.960 --> 00:13:45.679
that exact return against the single perfect

00:13:45.679 --> 00:13:48.259
stock you could have picked on day one and just

00:13:48.259 --> 00:13:50.480
held all year if you had a crystal ball. And

00:13:50.480 --> 00:13:52.679
the difference between your active sequential

00:13:52.679 --> 00:13:55.779
guesses and the best possible fixed hindsight

00:13:55.779 --> 00:13:58.600
choice, that delta is your mathematical regret.
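
NOTE
Written out, with \ell_t as the loss nature reveals in round t (notation ours):
  \mathrm{Regret}_T = \sum_{t=1}^{T} \ell_t(w_t) - \min_{w} \sum_{t=1}^{T} \ell_t(w)
The first sum is the sequence of daily guesses; the minimum is the single fixed choice you would pick with perfect hindsight.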

00:13:58.779 --> 00:14:01.309
So how do you win this game? Winning requires

00:14:01.309 --> 00:14:04.549
specific algorithmic strategies. The text outlines

00:14:04.549 --> 00:14:06.549
a baseline approach called Follow the Leader,

00:14:07.049 --> 00:14:10.309
or FTL. It's a purely greedy algorithm. At every

00:14:10.309 --> 00:14:12.429
single step, it looks back at all the past rounds

00:14:12.429 --> 00:14:14.830
and simply selects the hypothesis or weight distribution

00:14:14.830 --> 00:14:16.570
that would have resulted in the least amount

00:14:16.570 --> 00:14:18.870
of loss up to that point. It basically just jumps

00:14:18.870 --> 00:14:21.049
on whatever mathematical bandwagon is currently

00:14:21.049 --> 00:14:24.250
winning the game. Exactly. Which introduces severe

00:14:24.250 --> 00:14:27.169
volatility. If the data environment is highly

00:14:27.169 --> 00:14:30.110
dynamic, Follow the Leader will violently swing

00:14:30.110 --> 00:14:32.490
its predictions back and forth, overreacting

00:14:32.490 --> 00:14:33.990
to whatever just happened in the previous

00:14:33.990 --> 00:14:36.889
few rounds. So it's too reactive. Way too reactive.

00:14:37.230 --> 00:14:39.990
To counter this, the field utilizes Follow the

00:14:39.990 --> 00:14:43.879
Regularized Leader, or FTRL. This algorithm introduces

00:14:43.879 --> 00:14:47.480
a mathematical stabilization penalty, a regularization

00:14:47.480 --> 00:14:49.559
term. So it fundamentally changes the incentive

00:14:49.559 --> 00:14:51.539
structure. Instead of just chasing the lowest

00:14:51.539 --> 00:14:54.460
loss, FTRL punishes the model's internal weights

00:14:54.460 --> 00:14:56.639
if they grow too large or change too rapidly.
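
NOTE
Side by side, in our notation, with R(w) as the regularization penalty just mentioned:
  \text{FTL:} \quad w_{t+1} = \arg\min_{w} \sum_{s=1}^{t} \ell_s(w)
  \text{FTRL:} \quad w_{t+1} = \arg\min_{w} \Big( R(w) + \sum_{s=1}^{t} \ell_s(w) \Big)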

00:14:56.700 --> 00:14:59.299
Right. It effectively tells the algorithm, find

00:14:59.299 --> 00:15:02.120
the lowest error, but you are penalized for drastically

00:15:02.120 --> 00:15:04.340
overhauling your worldview based on short-term

00:15:04.340 --> 00:15:07.580
noise. It smooths out the learning curve by demanding

00:15:07.580 --> 00:15:10.840
consistency alongside accuracy. Yes, consistency

00:15:10.840 --> 00:15:13.509
is key. So what does this all mean when we zoom

00:15:13.509 --> 00:15:17.009
out? We have algorithms making dynamic predictions,

00:15:17.409 --> 00:15:19.429
constantly updating their internal weights to

00:15:19.429 --> 00:15:22.129
minimize regret, and stepping down gradient slopes.

00:15:22.730 --> 00:15:26.169
But if a neural network is continuously updating

00:15:26.169 --> 00:15:28.870
and physically overwriting its internal parameters

00:15:28.870 --> 00:15:32.610
with brand new information, what happens to the

00:15:32.610 --> 00:15:36.169
knowledge it already acquired? Oh. You are pointing

00:15:36.169 --> 00:15:38.169
to one of the most critical vulnerabilities in

00:15:38.169 --> 00:15:40.129
the entire architecture of continuous learning.

00:15:40.279 --> 00:15:43.299
The source literature formally refers to it as

00:15:43.299 --> 00:15:45.399
catastrophic interference, but it's commonly

00:15:45.399 --> 00:15:47.549
known as catastrophic forgetting. When you look

00:15:47.549 --> 00:15:49.389
at the architecture of a neural network, the

00:15:49.389 --> 00:15:51.409
knowledge isn't stored in neat little databases,

00:15:51.690 --> 00:15:54.110
right? It's distributed across the weight connections

00:15:54.110 --> 00:15:57.269
between artificial neurons. When the model processes

00:15:57.269 --> 00:15:59.809
incrementally available information from non-stationary

00:15:59.809 --> 00:16:02.370
data, meaning the fundamental rules

00:16:02.370 --> 00:16:04.850
of the incoming data have shifted, it has to

00:16:04.850 --> 00:16:07.149
adjust those weights to accommodate the new reality.

00:16:07.529 --> 00:16:10.389
And by doing so, it can completely overwrite

00:16:10.389 --> 00:16:13.029
and destroy the delicate web of pathways that

00:16:13.029 --> 00:16:17.000
represented previously learned patterns. It physically

00:16:17.000 --> 00:16:20.240
evicts the old knowledge. Imagine you spend five

00:16:20.240 --> 00:16:23.039
grueling years learning to play the piano. You

00:16:23.039 --> 00:16:25.860
master the instrument. Then you decide to take

00:16:25.860 --> 00:16:28.820
an intensive week of tennis lessons. OK, I see

00:16:28.820 --> 00:16:31.950
where this is going. Because your brain, in this

00:16:31.950 --> 00:16:34.029
analogy, suffers from catastrophic forgetting,

00:16:34.549 --> 00:16:37.549
the specific neural pathways you use to swing

00:16:37.549 --> 00:16:40.429
the racket forcefully overwrite the pathways

00:16:40.429 --> 00:16:43.370
you used for finger dexterity. You walk back

00:16:43.370 --> 00:16:45.269
to a piano and have completely forgotten what

00:16:45.269 --> 00:16:48.389
a keyboard is or how it functions. The new data

00:16:48.389 --> 00:16:50.870
obliterated the foundational data. And applying

00:16:50.870 --> 00:16:53.090
that to enterprise systems is highly problematic.

00:16:53.610 --> 00:16:55.909
If you deploy an online machine learning model

00:16:55.909 --> 00:16:58.649
to detect credit card fraud and initially train

00:16:58.649 --> 00:17:01.570
it on European transaction behaviors, it'll perform

00:17:01.570 --> 00:17:04.089
exceptionally well. Right. But if you then expand

00:17:04.089 --> 00:17:06.069
your business and start feeding it a continuous

00:17:06.069 --> 00:17:08.410
stream of real-time data from Asian markets,

00:17:08.990 --> 00:17:11.170
the model adjusts its weights to understand the

00:17:11.170 --> 00:17:14.490
new Asian fraud patterns. In the process, it

00:17:14.490 --> 00:17:16.849
might completely lose its ability to recognize

00:17:16.849 --> 00:17:19.029
the European fraud patterns it used to catch

00:17:19.029 --> 00:17:21.910
effortlessly. Yeah, it optimizes for the present

00:17:21.910 --> 00:17:24.839
at the total expense of the past. It basically

00:17:24.839 --> 00:17:27.599
induces a form of algorithmic amnesia. This is

00:17:27.599 --> 00:17:30.220
exactly why the text highlights continual learning

00:17:30.220 --> 00:17:33.880
paradigms as a massive, heavily funded area of

00:17:33.880 --> 00:17:36.559
active research. Continual learning specifically

00:17:36.559 --> 00:17:39.380
attempts to solve this plasticity-stability dilemma.

00:17:39.779 --> 00:17:42.519
How do you design an architecture plastic enough

00:17:42.519 --> 00:17:45.500
to learn new things, but stable enough to protect

00:17:45.500 --> 00:17:48.460
its core foundation? Exactly. It's absolutely

00:17:48.460 --> 00:17:51.299
essential for autonomous agents interacting in

00:17:51.299 --> 00:17:53.880
the physical world. A self-driving car cannot

00:17:53.880 --> 00:17:56.940
learn how to navigate a new type of snowy intersection

00:17:56.940 --> 00:17:59.920
and suddenly forget how to recognize a standard

00:17:59.920 --> 00:18:02.079
red light. The same applies to large language

00:18:02.079 --> 00:18:05.180
models. If an LLM is continuously adapting its

00:18:05.180 --> 00:18:07.059
weights based on individual interactions with

00:18:07.059 --> 00:18:10.339
users to stay current with modern events, engineers

00:18:10.339 --> 00:18:12.960
must ensure it doesn't catastrophically forget

00:18:12.960 --> 00:18:16.509
the basic logical structures of language just

00:18:16.509 --> 00:18:19.049
because it ingested a million new slang terms

00:18:19.049 --> 00:18:22.009
from social media. Right. Balancing that adaptability

00:18:22.009 --> 00:18:23.890
is an incredibly complex engineering hurdle.

00:18:24.190 --> 00:18:27.089
So we've covered the theoretical math, the shift

00:18:27.089 --> 00:18:30.670
to sequential processing, and the danger of catastrophic

00:18:30.670 --> 00:18:33.349
forgetting. Let's look at how this is practically

00:18:33.349 --> 00:18:35.849
implemented in the development environments engineers

00:18:35.849 --> 00:18:39.440
are using today. The text highlights two major

00:18:39.440 --> 00:18:42.059
open-source implementations that power a lot

00:18:42.059 --> 00:18:45.200
of this infrastructure. First is scikit-learn,

00:18:45.480 --> 00:18:47.720
which is practically ubiquitous in data science.

00:18:48.299 --> 00:18:50.839
They provide out-of-core tools like an SGD

00:18:50.839 --> 00:18:53.380
classifier and something called mini-batch k-means.

00:18:53.380 --> 00:18:56.279
Mini-batch represents a highly practical

00:18:56.279 --> 00:18:59.019
compromise between pure online learning and traditional

00:18:59.019 --> 00:19:02.250
batch processing. Rather than updating the internal

00:19:02.250 --> 00:19:04.650
weights after every single individual data point,

00:19:04.789 --> 00:19:07.009
which, you know, can introduce a lot of erratic

00:19:07.009 --> 00:19:09.109
noise into the gradient descent, it processes

00:19:09.109 --> 00:19:11.769
a small, manageable batch of data. Like a few

00:19:11.769 --> 00:19:13.609
hundred examples at a time. Yeah, say a few hundred

00:19:13.609 --> 00:19:15.950
examples. It smooths out the gradient steps while

00:19:15.950 --> 00:19:18.109
still bypassing the massive memory requirements

00:19:18.109 --> 00:19:20.309
of full batch learning. It's a really pragmatic

00:19:20.309 --> 00:19:22.569
middle ground. But the second tool mentioned

00:19:22.569 --> 00:19:25.390
in the source is where the engineering gets remarkably

00:19:25.390 --> 00:19:28.349
creative: Vowpal Wabbit. Setting aside the fantastic

00:19:28.349 --> 00:19:32.869
name. Right. Great name. The text details a specific

00:19:32.869 --> 00:19:35.390
mechanism it utilizes called the hashing trick.

00:19:35.930 --> 00:19:38.369
In an infinite data stream, you don't just have

00:19:38.369 --> 00:19:40.710
an infinite number of rows, you often encounter

00:19:40.710 --> 00:19:43.410
an endlessly expanding number of features, like

00:19:43.410 --> 00:19:46.910
unique user IDs or novel text strings. And tracking

00:19:46.910 --> 00:19:48.750
an ever -growing dictionary of these features

00:19:48.750 --> 00:19:51.069
will eventually exhaust your memory. Totally.

00:19:51.349 --> 00:19:53.869
The hashing trick circumvents that entirely.

00:19:54.450 --> 00:19:56.809
Instead of maintaining a massive lookup table

00:19:56.809 --> 00:19:59.369
that assigns a unique index to every new feature

00:19:59.369 --> 00:20:02.849
it sees, Vowpal Wabbit passes the raw feature

00:20:02.849 --> 00:20:05.470
data, like a text string, through a mathematical

00:20:05.470 --> 00:20:07.490
hash function. Right, which is brilliant. It

00:20:07.490 --> 00:20:09.869
really is. This function takes any arbitrary

00:20:09.869 --> 00:20:12.150
piece of data and mathematically maps it to a

00:20:12.150 --> 00:20:14.849
specific integer index within a fixed size array.

00:20:15.190 --> 00:20:17.470
It essentially creates a predetermined, hard-coded

00:20:17.470 --> 00:20:19.529
number of buckets. And when a new feature

00:20:19.529 --> 00:20:22.819
arrives, the function instantly calculates which

00:20:22.819 --> 00:20:25.660
bucket it belongs in. Exactly. Even if two different

00:20:25.660 --> 00:20:27.619
features occasionally hash into the exact same

00:20:27.619 --> 00:20:30.019
bucket, which is a collision, the mathematical

00:20:30.019 --> 00:20:31.980
models are robust enough to handle the noise,

00:20:32.319 --> 00:20:35.240
provided the array is large enough. It bounds

00:20:35.240 --> 00:20:37.940
the memory footprint of the feature set entirely

00:20:37.940 --> 00:20:39.779
independently of the amount of training data

00:20:39.779 --> 00:20:42.480
flowing in. So you can process an infinite firehose

00:20:42.480 --> 00:20:44.720
of data with a strictly capped memory limit.
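
NOTE
A minimal scikit-learn sketch of the same idea (Vowpal Wabbit's own command-line workflow differs): FeatureHasher fixes the number of buckets up front, and SGDClassifier updates incrementally. The feature strings, labels, and bucket count here are invented for illustration:
  from sklearn.feature_extraction import FeatureHasher
  from sklearn.linear_model import SGDClassifier
  hasher = FeatureHasher(n_features=2**18, input_type="string")  # fixed bucket count
  model = SGDClassifier()                                        # online linear learner
  stream = [(["user=alice", "country=de", "amount=high"], 1),
            (["user=bob", "country=fr", "amount=low"], 0),
            (["user=carol", "country=jp", "amount=high"], 1)]
  for tokens, label in stream:                       # one example at a time
      X = hasher.transform([tokens])                 # any string maps into 2**18 slots
      model.partial_fit(X, [label], classes=[0, 1])  # incremental update, memory stays flat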

00:20:44.859 --> 00:20:47.480
Yes. Well, if we connect this brilliant piece

00:20:47.480 --> 00:20:49.920
of engineering to the bigger picture, the text

00:20:49.920 --> 00:20:52.759
clarifies that utilizing tools like Vowpal Wabbit

00:20:52.759 --> 00:20:56.880
or scikit-learn's SGD hinges entirely on which

00:20:56.880 --> 00:20:58.839
of the two interpretations of online learning

00:20:58.839 --> 00:21:01.299
you are executing. Because the objective dictates

00:21:01.299 --> 00:21:03.799
the interpretation. Exactly. If you're dealing

00:21:03.799 --> 00:21:06.680
with a truly infinite stream of real-world data

00:21:06.680 --> 00:21:09.609
like... analyzing a continuous, never-ending

00:21:09.609 --> 00:21:12.930
feed of global sensor data or social media posts,

00:21:13.470 --> 00:21:15.950
you are utilizing online learning to minimize

00:21:15.950 --> 00:21:18.730
what statisticians call the expected risk. Your

00:21:18.730 --> 00:21:21.150
ultimate goal is to build a model that performs

00:21:21.150 --> 00:21:24.069
accurately on the true, fundamentally unknowable

00:21:24.069 --> 00:21:26.990
distribution of future unseen data. Right. You're

00:21:26.990 --> 00:21:29.910
modeling a moving target. But the second interpretation

00:21:29.910 --> 00:21:32.990
is far more pragmatic and resource-driven. Imagine

00:21:32.990 --> 00:21:36.099
you possess a massive, finite data set. It's

00:21:36.099 --> 00:21:38.220
completely static, but it's simply half a terabyte

00:21:38.220 --> 00:21:42.220
in size and physically cannot fit into your workstation's

00:21:42.220 --> 00:21:45.480
RAM. In that scenario, you use the exact same

00:21:45.480 --> 00:21:48.240
online learning algorithms like stochastic gradient

00:21:48.240 --> 00:21:51.140
descent merely to loop sequentially through the

00:21:51.140 --> 00:21:53.869
static data point by point to save memory. You're

00:21:53.869 --> 00:21:56.210
not trying to predict an unknown future environment.
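
NOTE
In symbols (notation ours), the objective just described and the one named next:
  \text{expected risk: } \; \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[\ell(f(x), y)\big] \qquad \text{empirical risk: } \; \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)
The first averages over the unknown distribution of future data; the second averages over the finite data set already sitting on disk.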

00:21:56.509 --> 00:21:59.549
You are simply minimizing the empirical risk

00:21:59.549 --> 00:22:01.910
of the massive data set you already possess on

00:22:01.910 --> 00:22:04.289
your hard drive. It's the exact same underlying

00:22:04.289 --> 00:22:06.769
mathematics applied to achieve a completely different

00:22:06.769 --> 00:22:09.569
philosophical and practical objective. Yes. One

00:22:09.569 --> 00:22:12.089
explores the unknown continuous frontier, while

00:22:12.089 --> 00:22:14.210
the other is just an efficient mechanism for

00:22:14.210 --> 00:22:16.650
processing a warehouse of data too massive for

00:22:16.650 --> 00:22:18.940
standard hardware. We have covered some profound

00:22:18.940 --> 00:22:21.420
architectural shifts today. We journeyed from

00:22:21.420 --> 00:22:23.960
the sheer computational limits of batch learning

00:22:23.960 --> 00:22:27.059
cram sessions, where memory swapping physically

00:22:27.059 --> 00:22:29.920
chokes the system into the dynamic necessity

00:22:29.920 --> 00:22:33.079
of online learning. We broke down how techniques

00:22:33.079 --> 00:22:36.200
like stochastic gradient descent abandon matrix

00:22:36.200 --> 00:22:38.759
inversions to keep the mathematics incredibly

00:22:38.759 --> 00:22:41.559
fast, stepping carefully down the gradient slope

00:22:41.559 --> 00:22:44.200
while keeping memory requirements strictly flat.

00:22:44.480 --> 00:22:46.700
And we also explored the complex trade-offs, from

00:22:46.700 --> 00:22:49.200
the infinite dimensional projections of kernel

00:22:49.200 --> 00:22:52.200
methods demanding memory back to the very real

00:22:52.200 --> 00:22:54.779
threat of catastrophic forgetting physically

00:22:54.779 --> 00:22:57.079
overwriting a model's foundational knowledge.

00:22:57.299 --> 00:22:59.500
And we saw how the hashing trick in tools like

00:22:59.500 --> 00:23:02.410
Vowpal Wabbit allows engineers to process infinite

00:23:02.410 --> 00:23:04.490
streams without crashing the system. Why does

00:23:04.490 --> 00:23:06.269
grasping this underlying architecture matter

00:23:06.269 --> 00:23:08.849
to you? Because we are rapidly moving away from

00:23:08.849 --> 00:23:11.490
static technology. The next time your financial

00:23:11.490 --> 00:23:13.910
app adjusts your portfolio based on a sudden

00:23:13.910 --> 00:23:16.569
market swing or your email catches a phishing

00:23:16.569 --> 00:23:18.890
attempt that was invented three hours ago, you

00:23:18.890 --> 00:23:20.650
will know exactly what is happening under the

00:23:20.650 --> 00:23:23.369
hood. It's online machine learning, mathematically

00:23:23.369 --> 00:23:25.869
calculating regret, and adjusting its worldview

00:23:25.869 --> 00:23:28.890
in real time. And this raises an important question,

00:23:29.170 --> 00:23:31.130
something for you to consider long after this

00:23:31.130 --> 00:23:34.430
deep dive ends. If the research community manages

00:23:34.430 --> 00:23:37.390
to fully solve the catastrophic forgetting problem

00:23:37.390 --> 00:23:39.750
within continual learning. Which would be a huge

00:23:39.750 --> 00:23:43.609
deal. Massive. It introduces a profound architectural

00:23:43.609 --> 00:23:47.289
possibility. Could a single, continuously adapting

00:23:47.289 --> 00:23:50.789
machine learning model essentially learn and compound

00:23:50.789 --> 00:23:53.589
its knowledge indefinitely? A system capable

00:23:53.589 --> 00:23:55.910
of evolving its understanding alongside human

00:23:55.910 --> 00:23:58.529
history, adapting to every new paradigm and era

00:23:58.529 --> 00:24:01.049
of data a century from now without ever requiring

00:24:01.049 --> 00:24:03.809
a system reboot or a fresh batch retraining run.

00:24:04.170 --> 00:24:06.170
An intelligence that just continuously reads

00:24:06.170 --> 00:24:09.109
the pages of history as they are written, compounding

00:24:09.109 --> 00:24:11.569
its understanding without ever having to unlearn

00:24:11.569 --> 00:24:14.329
the past. That is a truly wild thought to leave

00:24:14.329 --> 00:24:16.750
off on. Thank you for joining us on this deep

00:24:16.750 --> 00:24:18.289
dive. We will catch you next time.
