WEBVTT

00:00:00.000 --> 00:00:03.459
So welcome to today's Deep Dive. You know, usually

00:00:03.459 --> 00:00:06.599
when we talk about artificial intelligence, there's

00:00:06.599 --> 00:00:08.679
this expectation of magic, right? Oh, absolutely.

00:00:08.820 --> 00:00:11.580
Like it's just this mystical black box of instant

00:00:11.580 --> 00:00:13.560
intelligence. Yeah, exactly. Like something straight

00:00:13.560 --> 00:00:16.100
out of a sci-fi movie, where you just feed a

00:00:16.100 --> 00:00:18.980
computer a mountain of data, the screen flashes,

00:00:19.100 --> 00:00:22.019
and boom, the machine just boldly says, you know,

00:00:22.199 --> 00:00:24.399
here is the answer. Right. We really like to

00:00:24.399 --> 00:00:27.859
think the machine just intrinsically knows how

00:00:27.859 --> 00:00:29.699
to learn from the exact second you turn it on.

00:00:29.859 --> 00:00:32.659
But then you step into the actual engine room

00:00:32.659 --> 00:00:35.399
of machine learning and suddenly that whole illusion

00:00:35.399 --> 00:00:38.159
of magic is, well, it's entirely gone. Yeah,

00:00:38.159 --> 00:00:40.039
it's a very different picture behind the curtain.

00:00:40.320 --> 00:00:42.659
It really is. We're looking at a highly mechanical

00:00:42.659 --> 00:00:46.420
landscape that relies on just a staggering amount

00:00:46.420 --> 00:00:50.780
of trial, error, and meticulous tuning. And that

00:00:50.780 --> 00:00:52.890
brings us to our mission today. Which is a fun

00:00:52.890 --> 00:00:55.329
one. It is. We are going to completely demystify

00:00:55.329 --> 00:00:58.270
the engine room of AI, focusing on a concept

00:00:58.270 --> 00:01:01.210
called hyperparameter optimization. It sounds

00:01:01.210 --> 00:01:03.049
intimidating, I know, but it's really the secret

00:01:03.049 --> 00:01:06.329
sauce. Exactly. We're pulling all our insights

00:01:06.329 --> 00:01:09.349
today directly from a really comprehensive Wikipedia

00:01:09.349 --> 00:01:11.950
article on the topic. So if you've ever wondered

00:01:11.950 --> 00:01:14.170
how machine learning models actually get smart,

00:01:14.650 --> 00:01:17.170
well, it doesn't just happen magically. It requires

00:01:17.170 --> 00:01:19.530
finding the absolute perfect settings before

00:01:19.530 --> 00:01:22.280
the system even boots up. Right. And it's a fascinating

00:01:22.280 --> 00:01:24.719
journey because it reveals that these algorithms,

00:01:25.739 --> 00:01:28.099
the ones we think of as smart, they actually

00:01:28.099 --> 00:01:31.379
need an immense amount of human design guidance.

00:01:31.680 --> 00:01:33.780
Yeah. And our goal today is to break down exactly

00:01:33.780 --> 00:01:36.439
how this works, completely jargon free. We want

00:01:36.439 --> 00:01:39.319
to give you some serious tech literacy without

00:01:39.319 --> 00:01:43.239
the information overload. Definitely. So before

00:01:43.239 --> 00:01:45.519
we can even dive into how to optimize these things,

00:01:45.579 --> 00:01:47.640
we really need to establish what it is we're

00:01:47.640 --> 00:01:50.420
actually tuning. Right. OK, let's unpack this.

00:01:50.760 --> 00:01:53.780
Think of training in AI, and bear with me here,

00:01:54.299 --> 00:01:56.620
like baking a really complex cake. I love a good

00:01:56.620 --> 00:01:58.459
baking analogy. Right. So the machine learning

00:01:58.459 --> 00:02:00.780
process itself, like the algorithm figuring out

00:02:00.780 --> 00:02:03.519
the patterns in the data, that's the mixing of

00:02:03.519 --> 00:02:05.680
the ingredients and the chemical reactions happening

00:02:05.680 --> 00:02:09.060
inside the cake tin. OK, I follow. But the hyperparameters,

00:02:09.360 --> 00:02:13.340
those are the oven temperature or the rack position.

00:02:13.479 --> 00:02:15.740
You have to set those before you put the cake

00:02:15.740 --> 00:02:18.300
in the oven. What's fascinating here is how rigid

00:02:18.300 --> 00:02:22.180
that rule is fundamentally. The source specifically

00:02:22.180 --> 00:02:25.240
defines a hyperparameter as a parameter whose

00:02:25.240 --> 00:02:28.039
value controls the learning process itself. Which

00:02:28.039 --> 00:02:29.860
means it has to be configured before the learning

00:02:29.860 --> 00:02:33.360
even begins. Exactly. The AI cannot learn its

00:02:33.360 --> 00:02:35.800
own oven temperature while it's baking. It needs

00:02:35.800 --> 00:02:38.319
those dials set first. Right, because if you

00:02:38.319 --> 00:02:41.080
set the temperature too high, the outside burns

00:02:41.080 --> 00:02:43.699
while the inside is raw. Too low, and it never

00:02:43.699 --> 00:02:46.520
sets. So we have all these dials, but how do

00:02:46.520 --> 00:02:48.620
we actually know if we picked the right combination?

00:02:48.909 --> 00:02:50.689
Well, that's where the objective function comes

00:02:50.689 --> 00:02:53.569
in. The whole optimization process aims to find

00:02:53.569 --> 00:02:55.590
the set of hyperparameters that gives you an

00:02:55.590 --> 00:02:58.229
optimal model. And optimal is basically just

00:02:58.229 --> 00:03:00.189
the lowest penalty score, right? Yeah, exactly.

00:03:00.349 --> 00:03:02.990
They define it as minimizing a predefined loss

00:03:02.990 --> 00:03:05.370
function on a given data set. So the objective

00:03:05.370 --> 00:03:07.210
function basically takes your oven settings,

00:03:07.810 --> 00:03:09.909
bakes the cake, and returns a loss. Which is

00:03:09.909 --> 00:03:12.330
just a score of how badly the model performed.

00:03:12.810 --> 00:03:15.009
Right. And to estimate this, the source notes

00:03:15.009 --> 00:03:16.960
that they usually use cross-validation, you

00:03:16.960 --> 00:03:19.039
know, testing the model on different subsets

00:03:19.039 --> 00:03:21.520
of the data to make sure it actually works generally.
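
A minimal sketch of that idea in Python, not from the source: scikit-learn and the tiny iris dataset are assumptions here, but it shows the shape of an objective function that takes a set of hyperparameters, trains the model, and returns a cross-validated loss.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

def objective(hyperparams):
    """Take the oven settings, bake the cake, return a penalty score (loss)."""
    model = SVC(kernel="rbf", C=hyperparams["C"], gamma=hyperparams["gamma"])
    # Cross-validation: score the model on several held-out subsets of the data.
    accuracy = cross_val_score(model, X, y, cv=5).mean()
    return 1.0 - accuracy  # lower is better

print(objective({"C": 100, "gamma": 0.1}))
```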

00:03:21.680 --> 00:03:24.219
OK, that makes sense. We have our dials and we

00:03:24.219 --> 00:03:26.960
want the lowest penalty score. So how do we actually

00:03:26.960 --> 00:03:29.219
find that perfect combination? I mean, we've

00:03:29.219 --> 00:03:31.180
got to start with the most basic method, right?

00:03:31.639 --> 00:03:34.580
Grid search. Grid search is, yeah, it's essentially

00:03:34.580 --> 00:03:37.099
the brute force approach. The article describes

00:03:37.099 --> 00:03:39.560
it as an exhaustive search through a manually

00:03:39.560 --> 00:03:41.960
specified subset. So you just make a grid of

00:03:41.960 --> 00:03:44.620
all the possible values and check every single

00:03:44.620 --> 00:03:46.840
one. Exactly. The source gives this really specific

00:03:46.840 --> 00:03:50.240
example of a soft margin SVM classifier. Oh,

00:03:50.240 --> 00:03:52.139
right. With the C and gamma values. Yeah. So

00:03:52.139 --> 00:03:55.620
you have a regularization constant C with values

00:03:55.620 --> 00:04:00.319
like 10, 100, and 1,000, and then a kernel hyperparameter

00:04:00.319 --> 00:04:03.419
gamma with values like 0.1, 0.2, 0.5, and

00:04:03.419 --> 00:04:06.439
1.0. And grid search just checks every single

00:04:06.439 --> 00:04:08.300
combination of those numbers. Every single one.
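
For concreteness, here is what that exhaustive grid looks like with scikit-learn's GridSearchCV, reusing the C and gamma values from the source's SVM example; the dataset and RBF kernel are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The grid from the source's SVM example: every C paired with every gamma,
# so 3 x 4 = 12 combinations, each scored with cross-validation.
param_grid = {"C": [10, 100, 1000], "gamma": [0.1, 0.2, 0.5, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```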

00:04:08.340 --> 00:04:10.509
It checks the Cartesian product: the first C with

00:04:10.509 --> 00:04:12.210
the first gamma, then the first C with the second

00:04:12.210 --> 00:04:15.770
gamma, and so on. It's exhaustive. Yeah. But

00:04:15.770 --> 00:04:19.129
both grid search and the next method, random

00:04:19.129 --> 00:04:21.730
search, the source calls them embarrassingly

00:04:21.730 --> 00:04:24.430
parallel, right? Yes, meaning the computations

00:04:24.430 --> 00:04:26.430
don't depend on each other at all. You can run

00:04:26.430 --> 00:04:28.250
them all at the exact same time if you have enough

00:04:28.250 --> 00:04:30.449
computers. But wait, I have to challenge the

00:04:30.449 --> 00:04:33.310
logic here, because random search just replaces

00:04:33.310 --> 00:04:36.050
the exhaustive grid with random selection. It

00:04:36.050 --> 00:04:39.110
does, yes. But randomly guessing sounds strictly

00:04:39.110 --> 00:04:41.670
worse than checking everything systematically.

00:04:42.490 --> 00:04:45.240
Why does the source say random search can actually

00:04:45.240 --> 00:04:47.500
outperform grid search? Oh, no, it sounds super

00:04:47.500 --> 00:04:49.860
counterintuitive, but it comes down to this concept

00:04:49.860 --> 00:04:52.600
called low intrinsic dimensionality. Low intrinsic

00:04:52.600 --> 00:04:54.899
dimensionality. Right. It basically means that

00:04:54.899 --> 00:04:58.300
out of all the dials you have, often only a really

00:04:58.300 --> 00:05:01.000
small number actually affect the final performance.

00:05:01.339 --> 00:05:03.439
Ah, so the rest are basically just dead weight.

00:05:03.620 --> 00:05:06.480
Exactly. Grid search wastes a ton of time checking

00:05:06.480 --> 00:05:09.620
every minor variation of those useless dials.

00:05:09.680 --> 00:05:12.259
But random search can explore a much wider range

00:05:12.259 --> 00:05:14.420
of values for the continuous parameters that

00:05:14.420 --> 00:05:17.339
actually matter. So casting a wider net randomly

00:05:17.339 --> 00:05:20.939
actually lets the math figure out which dials

00:05:20.939 --> 00:05:23.060
are important. Wow, that flips how I thought

00:05:23.060 --> 00:05:25.360
about it. It's a really clever workaround for

00:05:25.360 --> 00:05:27.899
human intuition, honestly. But still, rolling

00:05:27.899 --> 00:05:31.399
the dice, even if it's smart, feels kind of primitive.
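
A hedged sketch of the same search done randomly with scikit-learn's RandomizedSearchCV; the log-uniform ranges and dataset are illustrative assumptions, chosen to show how random sampling covers the continuous dials that actually matter.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Instead of a fixed grid, draw 20 random combinations from continuous ranges.
# If only one dial really matters, those 20 trials still try 20 distinct
# values of it, where a small grid would only ever try a handful.
param_distributions = {"C": loguniform(1e0, 1e3), "gamma": loguniform(1e-3, 1e0)}
search = RandomizedSearchCV(SVC(kernel="rbf"), param_distributions,
                            n_iter=20, cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```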

00:05:32.139 --> 00:05:33.959
Wouldn't it be better if the algorithm actually

00:05:33.959 --> 00:05:36.930
learned from its previous guesses? Which leads

00:05:36.930 --> 00:05:39.790
us perfectly into Bayesian optimization. Yes.

00:05:40.050 --> 00:05:42.170
And here's where it gets really interesting to

00:05:42.170 --> 00:05:43.910
me. I was reading this part, and it's basically

00:05:43.910 --> 00:05:45.810
like playing the game Battleship. Oh, that's

00:05:45.810 --> 00:05:47.470
a great way to look at it. Right. Because you

00:05:47.470 --> 00:05:50.550
don't just keep guessing random coordinates on

00:05:50.550 --> 00:05:53.689
the board. You know, like B4, miss, G7, hit.

00:05:54.149 --> 00:05:56.579
Once you get a hit, you update your mental model.

00:05:56.740 --> 00:05:58.740
You definitely do. You start guessing around

00:05:58.740 --> 00:06:01.279
that hit to sink the ship. Exactly. And that's

00:06:01.279 --> 00:06:04.000
Bayesian optimization. It really is. The source

00:06:04.000 --> 00:06:06.459
explains that it's a global optimization method

00:06:06.459 --> 00:06:09.740
specifically for noisy black box functions. It

00:06:09.740 --> 00:06:12.519
builds a probabilistic model mapping the hyperparameter

00:06:12.519 --> 00:06:15.199
values to the objective evaluated on a validation

00:06:15.199 --> 00:06:19.000
set. So it's constantly deciding whether to guess

00:06:19.000 --> 00:06:22.139
near its past hits or try a totally new area.
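
To make that "Battleship" loop concrete, here is a hand-rolled sketch of Bayesian optimization on a toy one-dimensional problem. A Gaussian-process surrogate with an expected-improvement rule is a standard choice, but the specific function, ranges, and settings are illustrative assumptions, not from the source.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy "noisy black box": pretend this trains a model for one hyperparameter
# value and returns a validation loss. In real use it would be expensive.
def black_box(x):
    return np.sin(3 * x) + 0.1 * (x - 1.0) ** 2

rng = np.random.default_rng(0)
lo, hi = -3.0, 3.0

# A few random shots first, then let the surrogate model steer the guesses.
X = rng.uniform(lo, hi, size=(3, 1))
y = np.array([black_box(v[0]) for v in X])
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(15):
    gp.fit(X, y)
    cand = rng.uniform(lo, hi, size=(256, 1))
    mu, sigma = gp.predict(cand, return_std=True)
    best = y.min()
    # Expected improvement balances exploitation (low predicted loss)
    # against exploration (high uncertainty).
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = cand[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, black_box(x_next[0]))

print("best x:", float(X[np.argmin(y)][0]), "loss:", float(y.min()))
```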

00:06:22.250 --> 00:06:24.750
Yeah, that's the classic balance between exploration

00:06:24.750 --> 00:06:27.670
and exploitation. Exploration is trying settings

00:06:27.670 --> 00:06:30.410
where the outcome is totally uncertain. Exploitation

00:06:30.410 --> 00:06:33.230
is trying settings we expect are close to the

00:06:33.230 --> 00:06:35.629
optimum based on our past hits. And because it

00:06:35.629 --> 00:06:38.930
does this, it gets better results in fewer evaluations

00:06:38.930 --> 00:06:41.930
than grid or random search, right? Exactly. The

00:06:41.930 --> 00:06:44.250
core advantage is the system's ability to reason

00:06:44.250 --> 00:06:46.290
about the quality of experiments before they're

00:06:46.290 --> 00:06:49.149
even run. It saves just massive amounts of compute

00:06:49.149 --> 00:06:52.269
time. That is so cool. OK, so Bayesian optimization

00:06:52.269 --> 00:06:54.810
is like learning from experience. But what if

00:06:54.810 --> 00:06:59.129
we treat these algorithms like a biological species

00:06:59.129 --> 00:07:01.949
trying to survive? Now you're talking about evolutionary

00:07:01.949 --> 00:07:04.230
and population-based approaches. Right, where

00:07:04.230 --> 00:07:06.889
we literally evolve the dials. Yeah, evolutionary

00:07:06.889 --> 00:07:09.089
optimization. It creates an initial population

00:07:09.089 --> 00:07:12.100
of, say, a hundred or more random solutions.

00:07:12.339 --> 00:07:14.620
So a hundred different randomly generated organisms,

00:07:14.660 --> 00:07:17.459
basically. And then it evaluates their fitness,

00:07:17.639 --> 00:07:19.839
like checking their tenfold cross-validation

00:07:19.839 --> 00:07:23.120
accuracy. It ranks them, keeps the best, and

00:07:23.120 --> 00:07:25.699
replaces the worst performing ones with new ones.

00:07:25.819 --> 00:07:27.839
And it makes those new ones through crossover

00:07:27.839 --> 00:07:30.079
and mutation, right? Like combining the settings

00:07:30.079 --> 00:07:32.360
of the winners or just randomly tweaking a dial.
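
A minimal sketch of that evolve-the-dials loop. The population size, mutation scheme, and dataset are illustrative assumptions kept tiny so it runs quickly (the source describes a hundred or more individuals and tenfold cross-validation), but the fitness is a cross-validated score and the loop keeps the best, then fills the rest by crossover and mutation, as described.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

def fitness(ind):
    """Higher is better: cross-validated accuracy for one set of dials."""
    model = SVC(kernel="rbf", C=10 ** ind["log_C"], gamma=10 ** ind["log_gamma"])
    return cross_val_score(model, X, y, cv=3).mean()

def random_individual():
    return {"log_C": rng.uniform(0, 3), "log_gamma": rng.uniform(-4, 0)}

def crossover(a, b):
    # Combine the winners' settings: take each dial from a random parent.
    return {k: (a if rng.random() < 0.5 else b)[k] for k in a}

def mutate(ind):
    key = "log_C" if rng.random() < 0.5 else "log_gamma"  # randomly tweak one dial
    return {**ind, key: ind[key] + rng.normal(scale=0.3)}

population = [random_individual() for _ in range(8)]
for generation in range(5):
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:4]                               # keep the best half
    children = [mutate(crossover(survivors[rng.integers(4)],
                                 survivors[rng.integers(4)])) for _ in range(4)]
    population = survivors + children                    # replace the worst half

best = max(population, key=fitness)
print(best, round(fitness(best), 3))
```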

00:07:32.620 --> 00:07:36.060
Exactly. But then there's population-based training,

00:07:36.300 --> 00:07:39.600
or PBT. It operates multiple independent learning

00:07:39.600 --> 00:07:42.600
processes. Okay, but hold on. I spotted a contradiction

00:07:42.600 --> 00:07:46.660
here from earlier. Oh, what's that? Well, we

00:07:46.660 --> 00:07:49.540
defined hyperparameters in section one as things

00:07:49.540 --> 00:07:51.899
that must be configured before the process starts.

00:07:52.279 --> 00:07:54.120
The oven temperature before the cake goes in.

00:07:54.339 --> 00:07:56.839
Right, that was the fundamental rule. But the

00:07:56.839 --> 00:08:00.720
source says PBT is adaptive. It updates hyperparameters

00:08:00.720 --> 00:08:03.199
during the training of the models using something

00:08:03.199 --> 00:08:06.160
called warm starting. Doesn't that break the

00:08:06.160 --> 00:08:08.040
fundamental rule we just established? That's

00:08:08.040 --> 00:08:10.279
a really sharp catch. And well, if we connect

00:08:10.279 --> 00:08:13.180
this to the bigger picture, PBT completely shifts

00:08:13.180 --> 00:08:15.740
the paradigm. It essentially eliminates what

00:08:15.740 --> 00:08:19.149
it considers a suboptimal strategy, you know,

00:08:19.310 --> 00:08:21.569
assigning constant hyperparameters for the whole

00:08:21.569 --> 00:08:24.730
training process. PBT allows the settings to

00:08:24.730 --> 00:08:27.389
evolve alongside the network weights. Ah, so

00:08:27.389 --> 00:08:29.389
it's turning the oven dial while the cake is

00:08:29.389 --> 00:08:31.709
baking based on how the cake is rising. Exactly.

00:08:32.429 --> 00:08:34.929
It warm starts new models from the better performers.
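
A toy sketch of that population-based training idea: several workers train in parallel, and every so often the weaker half warm-starts from a stronger worker's weights and perturbs its hyperparameter, so the learning rate evolves while training runs. Everything here, the quadratic "model", worker count, and perturbation factors, is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy PBT: each worker minimizes f(w) = ||w - target||^2 with gradient steps,
# and its learning rate is the dial that gets evolved *while* training runs.
target = np.array([3.0, -2.0])
n_workers, n_rounds, steps_per_round = 8, 10, 20

weights = [rng.normal(size=2) for _ in range(n_workers)]
lrs = rng.uniform(0.001, 0.4, size=n_workers)

def loss(w):
    return float(np.sum((w - target) ** 2))

for r in range(n_rounds):
    # Each worker trains for a while with its current hyperparameter.
    for i in range(n_workers):
        for _ in range(steps_per_round):
            weights[i] = weights[i] - lrs[i] * 2 * (weights[i] - target)
    # Exploit and explore: the bottom half warm-starts from the top half.
    order = np.argsort([loss(w) for w in weights])        # best first
    for bad, good in zip(order[n_workers // 2:], order[:n_workers // 2]):
        weights[bad] = weights[good].copy()               # copy weights (warm start)
        lrs[bad] = lrs[good] * rng.choice([0.8, 1.2])     # perturb the dial

best = int(np.argmin([loss(w) for w in weights]))
print("best learning rate:", round(float(lrs[best]), 4), "loss:", loss(weights[best]))
```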

00:08:35.250 --> 00:08:36.909
So instead of starting from scratch, it takes

00:08:36.909 --> 00:08:39.830
a good model, copies it, mutates the dials a

00:08:39.830 --> 00:08:42.870
bit, and keeps going. Wow. Okay, so these evolutionary

00:08:42.870 --> 00:08:45.610
populations are great, but... and the source

00:08:45.610 --> 00:08:48.110
touches on this, evaluating them takes immense

00:08:48.110 --> 00:08:50.009
computing power. Oh, absolutely. It's incredibly

00:08:50.009 --> 00:08:53.379
heavy. So how do data scientists optimize these

00:08:53.379 --> 00:08:55.639
massive neural networks without just running

00:08:55.639 --> 00:08:57.720
out of time and server costs? They have to scale

00:08:57.720 --> 00:08:59.980
up. And the two main ways they do that are gradient-

00:08:59.980 --> 00:09:02.600
based methods and early stopping. OK, let's

00:09:02.600 --> 00:09:04.879
start with gradient-based. So gradient-based

00:09:04.879 --> 00:09:07.100
optimization computes gradients with respect

00:09:07.100 --> 00:09:10.200
to the hyperparameters. It uses things like automatic

00:09:10.200 --> 00:09:12.860
differentiation and the implicit function theorem.
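
A tiny numeric sketch of that gradient-based idea: differentiate the validation loss with respect to a regularization hyperparameter through one unrolled training step. Real systems use automatic differentiation or the implicit function theorem for this; the chain rule is written out by hand here, and the data and model are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy linear model with a train/validation split.
X_tr, X_val = rng.normal(size=(50, 5)), rng.normal(size=(20, 5))
true_w = np.ones(5)
y_tr, y_val = X_tr @ true_w, X_val @ true_w

w = rng.normal(size=5)   # model parameters (what the training loop learns)
lam = 0.1                # hyperparameter: L2 regularization strength
lr = 0.05                # inner learning rate

# One inner training step on the regularized training loss:
#   L_train(w) = mean((X_tr @ w - y_tr)**2) + lam * ||w||^2
grad_train = (2 / len(X_tr)) * X_tr.T @ (X_tr @ w - y_tr) + 2 * lam * w
w_new = w - lr * grad_train

# Hypergradient: d(validation loss)/d(lam), chain rule through that one step.
# dw_new/dlam = -lr * d(grad_train)/dlam = -lr * 2 * w
dLval_dw = (2 / len(X_val)) * X_val.T @ (X_val @ w_new - y_val)
hypergrad = dLval_dw @ (-lr * 2 * w)
print("d(val loss)/d(lambda) =", hypergrad)
# A hyperparameter optimizer would now nudge lam downhill: lam -= step * hypergrad
```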

00:09:13.080 --> 00:09:15.399
Which means it can handle a lot of dials at once.

00:09:15.659 --> 00:09:18.620
A staggering amount. It can scale to millions

00:09:18.620 --> 00:09:20.840
of hyperparameters with constant memory. They

00:09:20.840 --> 00:09:22.620
use it in things like self-tuning networks,

00:09:23.019 --> 00:09:25.940
Delta-STN, and DARTS for neural architecture

00:09:25.940 --> 00:09:29.200
search. Millions of dials. That is, I can't even

00:09:29.200 --> 00:09:30.779
picture that. But what about the other method?

00:09:31.240 --> 00:09:34.679
Early stopping. That one is purpose-built for high

00:09:34.679 --> 00:09:37.580
computational costs, right? Yeah, exactly. Because

00:09:37.580 --> 00:09:41.059
if evaluating just one setup takes a week, you

00:09:41.059 --> 00:09:43.519
need a way to cut your losses early if it's doing

00:09:43.519 --> 00:09:45.990
poorly. Right. And to make sense of one of their

00:09:45.990 --> 00:09:48.710
methods, successive halving, I thought of this

00:09:48.710 --> 00:09:52.009
visual. Imagine an intense reality TV talent

00:09:52.009 --> 00:09:54.230
show. OK, I'm listening. You start with, say,

00:09:54.549 --> 00:09:57.200
eight arbitrary configurations, eight contestants.

00:09:57.840 --> 00:09:59.879
But instead of letting all eight perform their

00:09:59.879 --> 00:10:02.679
entire routine, you periodically stop the music.

00:10:02.840 --> 00:10:05.299
Ah, you cut the worst performer. Exactly. You

00:10:05.299 --> 00:10:07.480
cut the bottom half, and you only focus your

00:10:07.480 --> 00:10:09.419
attention and resources on the one showing promise,

00:10:09.539 --> 00:10:11.740
until only one remains. That's a great way to

00:10:11.740 --> 00:10:14.399
picture successive halving, or SHA. But there's

00:10:14.399 --> 00:10:17.340
a slight problem with regular SHA. If you wait

00:10:17.340 --> 00:10:19.320
for everyone to finish their song before cutting

00:10:19.320 --> 00:10:22.480
the losers, you waste time. Oh, because some

00:10:22.480 --> 00:10:24.820
configurations take longer to evaluate than others.

00:10:24.919 --> 00:10:38.179
Right. That's where the asynchronous version, ASHA, comes in. The second you drop into the bottom

00:10:38.179 --> 00:10:41.200
half, the trap door opens. You're out. You're

00:10:41.200 --> 00:10:44.179
out. And managing all of this is HyperBand. You

00:10:44.179 --> 00:10:46.639
can think of HyperBand as the producer of that

00:10:46.639 --> 00:10:49.960
reality show. It acts as a manager, invoking

00:10:49.960 --> 00:10:52.860
these pruning methods with varying aggressiveness.
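
scikit-learn ships a version of that successive-halving talent show as HalvingRandomSearchCV (still flagged experimental, so it needs the enabling import); the candidate ranges and dataset below are illustrative assumptions. Hyperband, as described above, essentially re-runs this kind of elimination with different levels of aggressiveness.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Start many candidates on a small budget, cut the weaker half each round,
# and give the survivors more resources until one configuration remains.
param_distributions = {"C": loguniform(1e0, 1e3), "gamma": loguniform(1e-3, 1e0)}
search = HalvingRandomSearchCV(SVC(kernel="rbf"), param_distributions,
                               factor=2, random_state=0)
search.fit(X, y)
print(search.best_params_)
```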

00:10:53.299 --> 00:10:55.220
So we've used all these incredible methods. We

00:10:55.220 --> 00:10:57.179
have Bayesian Battleship, Evolutionary Reality

00:10:57.179 --> 00:11:00.120
TV, all of it. We found the absolute perfect

00:11:00.120 --> 00:11:02.909
hyperparameters. We're done, right? We're successful.

00:11:02.990 --> 00:11:05.149
Well, actually, no. There's a massive risk waiting

00:11:05.149 --> 00:11:07.389
at the end of all this. A trap? A huge trap.

00:11:07.509 --> 00:11:10.690
And the source warns heavily about this, overfitting

00:11:10.690 --> 00:11:12.970
the hyperparameters to the validation set. OK,

00:11:12.970 --> 00:11:14.690
wait. What does that actually mean for you, the

00:11:14.690 --> 00:11:17.070
listener? Like, if I use cross-validation to

00:11:17.070 --> 00:11:19.809
pick my settings, why can't I trust that final

00:11:19.809 --> 00:11:21.269
score? I mean, I'm already rotating my data,

00:11:21.289 --> 00:11:23.190
right? This raises an important question, doesn't

00:11:23.190 --> 00:11:25.870
it? The problem is that the performance score

00:11:25.870 --> 00:11:28.870
on the validation set becomes too optimistic.

00:11:29.429 --> 00:11:32.789
Why is it too optimistic, though? Because if

00:11:32.789 --> 00:11:35.250
you tweak your settings based on the validation

00:11:35.250 --> 00:11:37.950
score over and over and over, you aren't really

00:11:37.950 --> 00:11:40.690
learning general rules anymore. You're inadvertently

00:11:40.690 --> 00:11:44.429
memorizing the validation set itself. Like a

00:11:44.429 --> 00:11:46.649
student taking the exact same practice test 50

00:11:46.649 --> 00:11:49.269
times, they get a perfect score because they

00:11:49.269 --> 00:11:51.409
memorized the questions, not because they know

00:11:51.409 --> 00:11:54.990
the subject. Exactly. So to fix this, the generalization

00:11:54.990 --> 00:11:57.730
performance must be evaluated on a completely

00:11:57.730 --> 00:12:00.850
independent test set, a totally untouched batch

00:12:00.850 --> 00:12:02.950
of data. Just to prove it actually works in the

00:12:02.950 --> 00:12:05.409
real world. Right. Or, the source mentions you

00:12:05.409 --> 00:12:07.590
can use an outer cross-validation procedure

00:12:07.590 --> 00:12:10.590
called nested cross-validation for an unbiased

00:12:10.590 --> 00:12:13.149
estimation. Nested cross-validation. Man, it

00:12:13.149 --> 00:12:15.539
really... just layers upon layers with this stuff.
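
A short sketch of nested cross-validation with scikit-learn: the inner loop picks the hyperparameters, and the outer loop scores the whole tuning procedure on folds it never touched, which is what keeps the final estimate honest. The grid reuses the earlier C and gamma values; the dataset is again an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: tune the dials. Outer loop: score the *whole tuning procedure*
# on held-out folds, giving an unbiased estimate of generalization.
param_grid = {"C": [10, 100, 1000], "gamma": [0.1, 0.2, 0.5, 1.0]}
inner_search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print("nested CV accuracy:", round(outer_scores.mean(), 3))
```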

00:12:15.740 --> 00:12:18.340
It really is. Well, just to summarize our journey

00:12:18.340 --> 00:12:20.940
today for everyone, we went from the brute force

00:12:20.940 --> 00:12:23.480
of grid search to the hyper-efficient gradient

00:12:23.480 --> 00:12:26.879
and population-based methods. And understanding

00:12:26.879 --> 00:12:29.639
this whole process, it's really the key to seeing

00:12:29.639 --> 00:12:32.639
behind the curtain of the AI tools you use every

00:12:32.639 --> 00:12:35.440
day. It definitely is. And you know, we've explored

00:12:35.440 --> 00:12:38.279
how we use algorithms to automatically optimize

00:12:38.279 --> 00:12:41.039
the parameters of other algorithms today. But

00:12:41.039 --> 00:12:43.440
as tools like automated machine learning get

00:12:43.440 --> 00:12:45.679
more advanced, it leads to a really fascinating

00:12:45.639 --> 00:12:48.360
paradox. What kind of paradox? Well, eventually,

00:12:48.659 --> 00:12:52.159
who or what optimizes the hyperparameters of

00:12:52.159 --> 00:12:54.220
the evolutionary algorithms doing the tuning?

00:12:54.539 --> 00:12:56.440
Oh, wow. I didn't even think about that. Right.

00:12:56.860 --> 00:12:59.000
Are we approaching a point where human intuition

00:12:59.000 --> 00:13:01.840
is entirely removed from the loop, leaving us

00:13:01.840 --> 00:13:04.960
with essentially turtles all the way down, machines

00:13:04.960 --> 00:13:07.100
optimizing machines optimizing machines? Turtles

00:13:07.100 --> 00:13:09.399
all the way down. That is a wild thought to leave

00:13:09.399 --> 00:13:12.039
off on. Machines building the machines to bake

00:13:12.039 --> 00:13:15.090
the machines. Well, that wraps up our deep dive

00:13:15.090 --> 00:13:17.049
for today. Thanks for joining us and we'll catch

00:13:17.049 --> 00:13:17.669
you in the next one.
