WEBVTT

00:00:00.000 --> 00:00:03.100
So I want you to imagine you walk into a cafe,

00:00:03.359 --> 00:00:08.160
right? And you order an extra hot coffee. And

00:00:08.160 --> 00:00:11.560
an artificially intelligent barista turns around

00:00:11.560 --> 00:00:15.160
and hands you a cup that is heated to literally

00:00:15.160 --> 00:00:18.359
5 million degrees. Which is slightly problematic

00:00:18.359 --> 00:00:20.940
for the customer. Just a bit. I mean, to a human

00:00:20.940 --> 00:00:23.559
being, that is just an apocalyptic failure of

00:00:23.559 --> 00:00:25.859
common sense. Yeah. You would instantly recognize

00:00:25.859 --> 00:00:28.140
that as completely absurd. Right. Right. But

00:00:28.140 --> 00:00:31.769
to the AI, it just perfectly executed the exact

00:00:31.769 --> 00:00:34.229
parameters it was given. Like, it did its job

00:00:34.229 --> 00:00:36.049
flawlessly. It did exactly what the math told

00:00:36.049 --> 00:00:39.149
it to do. Exactly. And the thing is, when we

00:00:39.149 --> 00:00:42.469
see an AI write a stunning poem or navigate a

00:00:42.469 --> 00:00:45.329
car through rush hour traffic, there's this overwhelming

00:00:45.329 --> 00:00:47.770
temptation to think of it as magic. Oh, absolutely.

00:00:47.890 --> 00:00:50.189
We really want to believe there's like a little

00:00:50.189 --> 00:00:52.310
digital ghost in the machine that intuitively

00:00:52.310 --> 00:00:55.049
understands the world. Yeah, we anthropomorphize

00:00:55.049 --> 00:00:57.229
the technology because, oh, it's just easier

00:00:57.229 --> 00:01:00.090
than confronting the reality of it. We project

00:01:00.090 --> 00:01:03.130
a human brain onto the software. We imagine it

00:01:03.130 --> 00:01:07.799
having this sudden eureka moment of true comprehension.

00:01:08.560 --> 00:01:11.040
But the reality is there is no comprehension.

00:01:11.200 --> 00:01:13.939
Not at all. None. There is only a mathematical

00:01:13.939 --> 00:01:16.120
path of least resistance. And when you actually

00:01:16.120 --> 00:01:18.480
peel back the curtain on how these models are

00:01:18.480 --> 00:01:21.900
built, the magic trick... just vanishes entirely.

00:01:22.599 --> 00:01:24.700
You don't find a digital brain. You just find

00:01:24.700 --> 00:01:28.359
data, just stacks and stacks of meticulously

00:01:28.359 --> 00:01:31.219
sorted data. It is the absolute definition of

00:01:31.219 --> 00:01:34.620
mundane architecture, but that specific highly

00:01:34.620 --> 00:01:37.359
curated architecture dictates literally everything.

00:01:37.879 --> 00:01:40.180
It is the sole dividing line between a machine

00:01:40.180 --> 00:01:42.959
that appears to be a total genius and a machine

00:01:42.959 --> 00:01:45.859
that tries to serve you plasma-hot coffee. So

00:01:45.859 --> 00:01:47.840
welcome to the deep dive. Our mission today is

00:01:47.840 --> 00:01:50.620
to demystify that hidden architecture. We're

00:01:50.620 --> 00:01:52.680
going to explore how machines actually learn,

00:01:53.540 --> 00:01:56.200
and we're going to do it without drowning you

00:01:56.200 --> 00:01:58.680
in a sea of dense mathematics. Which I think

00:01:58.680 --> 00:02:01.340
everyone will appreciate. Definitely. We're focusing

00:02:01.340 --> 00:02:03.519
entirely on the three fundamental pillars of

00:02:03.519 --> 00:02:06.340
machine learning data. That's the training, validation,

00:02:06.640 --> 00:02:10.360
and test sets. And I know data set sounds a bit

00:02:10.360 --> 00:02:13.439
dry, but understanding how data scientists slice

00:02:13.439 --> 00:02:15.840
up information is really the only way to understand

00:02:15.840 --> 00:02:20.439
why AI sometimes works like absolute magic and

00:02:20.439 --> 00:02:23.560
other times, you know, fails so spectacularly.

00:02:23.680 --> 00:02:26.120
Because if you don't grasp how the data is divided

00:02:26.120 --> 00:02:29.219
before the machine ever gets turned on, the behavior

00:02:29.219 --> 00:02:30.800
of the whole system will just always seem like

00:02:30.800 --> 00:02:33.379
this unpredictable black box. OK, let's unpack

00:02:33.379 --> 00:02:36.610
this. So before a machine can be tested, it obviously

00:02:36.610 --> 00:02:38.930
needs to study. Naturally. This is the first

00:02:38.930 --> 00:02:40.789
phase of any machine's learning journey. It's

00:02:40.789 --> 00:02:43.849
called the training data set. Right. And to picture

00:02:43.849 --> 00:02:46.189
this, I want you to think about teaching a toddler

00:02:46.189 --> 00:02:49.689
using a massive deck of flash cards. OK. I like

00:02:49.689 --> 00:02:51.810
that analogy. So on the front of the flash card

00:02:51.810 --> 00:02:55.349
is a picture. Let's say an apple. That is your

00:02:55.349 --> 00:02:57.509
input. And then on the back of the card is the

00:02:57.509 --> 00:03:00.469
answer, the word apple. The answer key. Exactly.

00:03:00.689 --> 00:03:02.969
And in machine learning, that answer key on the

00:03:02.969 --> 00:03:05.750
back is called the target or the label. Right.

00:03:05.810 --> 00:03:08.150
And that visual perfectly represents what we

00:03:08.150 --> 00:03:11.310
call supervised learning. The model, and whether

00:03:11.310 --> 00:03:13.930
that's a naive Bayes classifier, some massive

00:03:13.930 --> 00:03:17.289
artificial neural network, it's fed this training

00:03:17.289 --> 00:03:20.729
data set. It gets the flashcards. Exactly. It

00:03:20.729 --> 00:03:23.050
analyzes the input on the front of the flashcard,

00:03:23.330 --> 00:03:25.710
which, to the machine, is just a digital breakdown

00:03:25.710 --> 00:03:28.750
of pixels or text. Right. It makes a blind guess,

00:03:28.849 --> 00:03:31.389
and then it checks its guess against the target

00:03:31.389 --> 00:03:34.870
label on the back. But, I mean... The machine

00:03:34.870 --> 00:03:36.990
doesn't have a biological brain to remember the

00:03:36.990 --> 00:03:39.090
flash card. It can't just memorize the image

00:03:39.090 --> 00:03:41.689
of the apple like a toddler would. No, it can't.

00:03:41.770 --> 00:03:44.030
So how does looking at the back of the card actually

00:03:44.030 --> 00:03:46.129
change the machine? Well, it happens through

00:03:46.129 --> 00:03:48.949
a mechanism called optimization. Okay. Instead

00:03:48.949 --> 00:03:53.069
of a brain, I want you to imagine the AI is a

00:03:53.069 --> 00:03:55.849
massive soundboard with millions of little dials.

00:03:55.870 --> 00:03:58.449
Like in a recording studio. Exactly. And we call

00:03:58.449 --> 00:04:00.569
those dials parameters. They're essentially the

00:04:00.569 --> 00:04:02.729
mathematical weights of the connections inside

00:04:02.729 --> 00:04:05.150
the system. OK, so dials equal parameters. Right.

00:04:05.430 --> 00:04:08.530
So when the machine guesses wrong, an algorithm,

00:04:08.949 --> 00:04:11.729
often something called gradient descent, reaches

00:04:11.729 --> 00:04:14.969
in and twists those dials just a tiny fraction

00:04:14.969 --> 00:04:17.949
of a millimeter. It adjusts the weights. It adjusts

00:04:17.949 --> 00:04:20.209
the weights. It keeps guessing, checking the

00:04:20.209 --> 00:04:22.990
flash card, and twisting the dials over thousands

00:04:22.990 --> 00:04:25.410
and thousands of iterations. Until what? Until

00:04:25.410 --> 00:04:27.670
the mathematical output perfectly matches the

00:04:27.670 --> 00:04:30.149
label. So it's mechanically strengthening or

00:04:30.149 --> 00:04:33.430
weakening digital connections based purely on

00:04:33.430 --> 00:04:36.329
a feedback loop. That is exactly it. Let's ground

00:04:36.329 --> 00:04:38.569
this with a real world example from our sources.
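
To make that dial-twisting concrete before the marine-life example, here's a minimal sketch in Python. It's a toy with made-up flashcard pairs and a single dial, not any particular model architecture:

```python
# Toy sketch of the guess-check-adjust loop: one "dial" (a weight),
# tuned by gradient descent on invented flashcard pairs.
cards = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target label)

w = 0.0    # the dial starts in a neutral position
lr = 0.05  # learning rate: how hard each twist is

for step in range(200):          # real systems run thousands of iterations
    for x, target in cards:
        guess = w * x            # blind guess from the current dial setting
        error = guess - target   # check against the back of the card
        w -= lr * 2 * error * x  # gradient of squared error twists the dial

print(f"learned dial setting: {w:.3f}")  # settles near 2.0
```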

00:04:39.790 --> 00:04:43.079
Object detection using marine life. Oh, yeah,

00:04:43.259 --> 00:04:45.779
the starfish example. Right. So imagine you're

00:04:45.779 --> 00:04:49.319
training an AI to identify sea creatures. You

00:04:49.319 --> 00:04:52.399
feed it thousands of training images of starfish.

00:04:52.480 --> 00:04:56.000
OK. The algorithm twists those internal dials

00:04:56.000 --> 00:04:58.519
until features like a ring texture and a star-

00:04:58.519 --> 00:05:01.279
shaped outline heavily tip the scales toward

00:05:01.279 --> 00:05:03.860
the label starfish. Right. And at the exact same

00:05:03.860 --> 00:05:06.459
time, it might learn that most images of sea

00:05:06.459 --> 00:05:08.899
urchins heavily feature a striped texture and

00:05:08.899 --> 00:05:11.449
like an oval shape. Right. So it builds these

00:05:11.449 --> 00:05:14.269
purely mathematical associations between visual

00:05:14.269 --> 00:05:16.790
features and specific words. I have to push back

00:05:16.790 --> 00:05:19.230
on that though. Oh, why is that? Because nature

00:05:19.230 --> 00:05:21.470
is messy. Categories just aren't that clean.

00:05:21.810 --> 00:05:23.610
I mean, what happens if a sea urchin happens

00:05:23.610 --> 00:05:26.009
to have rings instead of stripes? Okay, yeah.

00:05:26.220 --> 00:05:29.639
Or what if some random oval rock just wanders

00:05:29.639 --> 00:05:31.740
into the background of the picture? A toddler

00:05:31.740 --> 00:05:34.160
with a flash card might just point and ask a

00:05:34.160 --> 00:05:36.839
question, but our AI is literally just doing

00:05:36.839 --> 00:05:39.279
math. And that is exactly where the machine gets

00:05:39.279 --> 00:05:42.160
confused, which is where the mechanics get incredibly

00:05:42.160 --> 00:05:45.519
nuanced. How so? Well... The AI is not using

00:05:45.519 --> 00:05:48.639
a single isolated dial for texture. It doesn't

00:05:48.639 --> 00:05:51.620
work like that. It relies on complex overlapping

00:05:51.620 --> 00:05:55.279
weight patterns. So if your training data includes

00:05:55.279 --> 00:05:58.720
just one instance of a ring-textured sea urchin,

00:05:59.100 --> 00:06:01.779
the system builds a sort of mechanical tripwire.

00:06:01.779 --> 00:06:04.620
A tripwire? Yeah, it creates a very weakly weighted

00:06:04.620 --> 00:06:07.379
association between the ring feature and the

00:06:07.379 --> 00:06:09.720
urchin label. Meaning, the next time it scans

00:06:09.720 --> 00:06:12.569
an image and detects anything with rings, it

00:06:12.569 --> 00:06:15.209
puts a tiny mathematical pebble on the sea urchin

00:06:15.209 --> 00:06:17.170
side of the scale. Exactly that. It's a weak

00:06:17.170 --> 00:06:20.629
signal. And if that random oval rock appears

00:06:20.629 --> 00:06:23.930
in the input image, even a rock that was never

00:06:23.930 --> 00:06:27.129
ever included in the training data, it triggers

00:06:27.129 --> 00:06:29.990
the oval-shaped tripwire. Which drops another

00:06:29.990 --> 00:06:33.490
weak pebble onto the sea urchin scale. Yes. And

00:06:33.490 --> 00:06:35.829
if enough of those weak signals accidentally

00:06:35.829 --> 00:06:38.610
combine, the scale tips completely. Oh, wow.

00:06:38.649 --> 00:06:41.649
The network produces a false positive, confidently

00:06:41.649 --> 00:06:44.170
declaring that a random rock is a sea urchin.
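
One way to picture those pebbles piling up is a plain weighted sum. The feature names, weights, and threshold below are invented for illustration; a real network spreads this across millions of overlapping weights:

```python
# Invented "tripwire" weights: weak associations learned from sparse examples.
urchin_weights = {"striped_texture": 0.9, "ring_texture": 0.2, "oval_shape": 0.3}

def urchin_score(features):
    # Each matching feature drops one mathematical pebble on the urchin scale.
    return sum(urchin_weights.get(f, 0.0) for f in features)

THRESHOLD = 0.4  # arbitrary tipping point for this sketch

rock = ["ring_texture", "oval_shape"]  # a rock never seen in training
print(urchin_score(rock) > THRESHOLD)  # True: weak signals combine into a false positive
```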

00:06:44.230 --> 00:06:47.529
Which means feeding the AI one batch of flashcards

00:06:47.529 --> 00:06:49.490
is just never going to be enough. You can't just

00:06:49.490 --> 00:06:51.470
do it once and walk away. Oh, definitely not.

00:06:51.629 --> 00:06:53.610
You have to continuously expand the training

00:06:53.610 --> 00:06:56.660
set with new diverse data to refine those dials

00:06:56.660 --> 00:06:59.139
and correct those false tripwires. And that ongoing

00:06:59.139 --> 00:07:01.060
process is known as incremental learning, right?

00:07:01.160 --> 00:07:03.779
You got it, incremental learning. But wait, if

00:07:03.779 --> 00:07:05.779
you just keep forcing the machine to look at

00:07:05.779 --> 00:07:08.620
the exact same deck of flashcards over and over,

00:07:09.199 --> 00:07:11.019
we run into a completely different issue. We

00:07:11.019 --> 00:07:13.680
absolutely do. The machine might just memorize

00:07:13.680 --> 00:07:16.660
the specific pixels of those exact cards rather

00:07:16.660 --> 00:07:18.839
than actually understanding the underlying concept

00:07:18.839 --> 00:07:21.360
of what makes an urchin an urchin. And that is

00:07:21.360 --> 00:07:23.779
the most persistent nightmare in machine learning.

00:07:24.110 --> 00:07:28.430
It's called overfitting. The model searches through

00:07:28.430 --> 00:07:31.689
the training data and finds highly specific empirical

00:07:31.689 --> 00:07:34.649
relationships that work perfectly for that one

00:07:34.649 --> 00:07:37.170
data set. But they completely fall apart in the

00:07:37.170 --> 00:07:39.629
real world. Exactly. It becomes obsessed with

00:07:39.629 --> 00:07:41.449
the training data. It ends up optimizing for

00:07:41.449 --> 00:07:44.050
the background noise instead of the actual signal.

00:07:44.389 --> 00:07:46.370
Okay, here's where it gets really interesting.

00:07:47.350 --> 00:07:50.610
Because to stop this obsession, developers have

00:07:50.610 --> 00:07:53.519
to bring in a reality check. The second pillar.

00:07:53.819 --> 00:07:56.680
Right, the second pillar. The validation data

00:07:56.680 --> 00:08:00.319
set. If the training set is studying with flashcards,

00:08:00.839 --> 00:08:03.199
you can think of the validation set like taking

00:08:03.199 --> 00:08:05.259
a practice exam. That's a great way to put it.

00:08:05.459 --> 00:08:07.860
The teacher doesn't officially grade you, but

00:08:07.860 --> 00:08:10.079
they do tell you if your fundamental study habits

00:08:10.079 --> 00:08:12.399
are completely broken. And that distinction between

00:08:12.399 --> 00:08:14.779
studying and the practice exam really highlights

00:08:14.779 --> 00:08:17.439
a critical division in the AI architecture. How

00:08:17.439 --> 00:08:19.959
do you mean? Well, the training set is used to

00:08:19.959 --> 00:08:22.379
adjust the parameters, those millions of internal

00:08:22.379 --> 00:08:24.759
dials we just talked about. But the validation

00:08:24.759 --> 00:08:27.439
data set is used to tune the hyperparameters.

00:08:27.660 --> 00:08:29.500
Okay, parameters versus hyperparameters. Let's

00:08:29.500 --> 00:08:31.120
make sure we clearly define that difference for

00:08:31.120 --> 00:08:32.500
the listener, because that sounds confusing.

00:08:32.919 --> 00:08:36.820
It can be. Think of parameters as the low-level

00:08:36.820 --> 00:08:40.200
knowledge, the dials. Hyperparameters represent

00:08:40.200 --> 00:08:42.600
the overall architecture of the soundboard itself.
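
As a tiny illustration of that divide, with invented names and values, the point is only which knobs the data adjusts versus which the developer chooses:

```python
# Hyperparameters: the architecture of the soundboard itself (chosen by the
# developer, then compared using the validation set). Values are hypothetical.
hyperparameters = {
    "hidden_layers": 3,     # how many layers of artificial neurons
    "units_per_layer": 64,  # how many dials per layer
    "learning_rate": 0.01,  # how hard gradient descent twists each dial
}

# Parameters: the dial settings themselves, faked here as zeros. Their shape
# comes from the hyperparameters; their values come from the training data.
parameters = [
    [0.0] * hyperparameters["units_per_layer"]
    for _ in range(hyperparameters["hidden_layers"])
]
print(len(parameters), len(parameters[0]))  # 3 layers, 64 dials each
```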

00:08:42.919 --> 00:08:44.779
Oh, like how many dials should there be in the

00:08:44.779 --> 00:08:47.080
first place? Exactly. How many dials, how many

00:08:47.080 --> 00:08:49.600
layers of artificial neurons do we need? Got

00:08:49.600 --> 00:08:52.600
it. You use the training data set to teach several

00:08:52.600 --> 00:08:56.259
different candidate models. Then you use the

00:08:56.259 --> 00:08:58.799
validation data set to test them against each

00:08:58.799 --> 00:09:02.029
other, to provide an unbiased evaluation. Right,

00:09:02.070 --> 00:09:04.870
to decide which architectural setup actually

00:09:04.870 --> 00:09:08.429
generalizes best to new information. And this

00:09:08.429 --> 00:09:10.309
relies on something called the holdout method.

00:09:10.309 --> 00:09:13.450
Yes. Because you have to physically quarantine

00:09:13.450 --> 00:09:16.190
this validation data so it remains completely

00:09:16.190 --> 00:09:18.350
independent from the training data. It has to

00:09:18.350 --> 00:09:21.690
be isolated. Right. The AI cannot see the practice

00:09:21.690 --> 00:09:24.710
exam while it's studying the flashcards. That

00:09:24.710 --> 00:09:27.419
would just be cheating. And that quarantine is

00:09:27.419 --> 00:09:30.539
heavily utilized for a vital regularization process

00:09:30.539 --> 00:09:32.879
called early stopping. Early stopping. Let's

00:09:32.879 --> 00:09:35.240
break that down. So as the model grinds through

00:09:35.240 --> 00:09:38.440
the training data, its error rate on those flashcards

00:09:38.440 --> 00:09:40.779
will naturally keep dropping. Because it's getting

00:09:40.779 --> 00:09:42.379
better and better at answering the cards. Right.

00:09:43.120 --> 00:09:46.539
But if you periodically pause and check its progress

00:09:46.539 --> 00:09:49.559
against the quarantine validation set, eventually

00:09:49.559 --> 00:09:51.960
something shifts. The error rate on the validation

00:09:51.960 --> 00:09:54.610
set stops dropping. Exactly. It stops dropping

00:09:54.610 --> 00:09:57.110
and actually begins to rise. Because it's starting

00:09:57.110 --> 00:09:59.909
to overfit. It's memorizing the training data

00:09:59.909 --> 00:10:02.509
so aggressively that it's losing the ability

00:10:02.509 --> 00:10:06.429
to generalize to the fresh examples in the validation

00:10:06.429 --> 00:10:09.129
set. You nailed it. And what's fascinating here

00:10:09.129 --> 00:10:11.470
is that the early stopping procedure dictates

00:10:11.470 --> 00:10:14.149
that training must immediately halt the moment

00:10:14.149 --> 00:10:16.309
the validation error grows. Let's pull the plug.
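
A bare-bones sketch of that halting rule; the training and evaluation steps are hypothetical stand-ins, and, as discussed next, real pipelines need messier rules than this clean version:

```python
# Minimal early stopping: halt the moment validation error grows, then
# roll back to the best iteration. `train_one_epoch` and `validation_error`
# are hypothetical callables standing in for the real work.
def early_stopping_train(model, train_one_epoch, validation_error, max_epochs=100):
    best_error, best_snapshot = float("inf"), None
    for epoch in range(max_epochs):
        train_one_epoch(model)         # twist the dials on the flashcards
        err = validation_error(model)  # check the quarantined practice exam
        if err < best_error:
            best_error, best_snapshot = err, dict(model)  # remember this iteration
        else:
            break                      # validation error grew: pull the plug
    return best_snapshot, best_error

# Hypothetical demo: validation error improves, then rises at epoch 3.
errors = [0.9, 0.7, 0.5, 0.6, 0.8]
print(early_stopping_train({"epoch": -1},
                           lambda m: m.__setitem__("epoch", m["epoch"] + 1),
                           lambda m: errors[m["epoch"]],
                           max_epochs=5))  # ({'epoch': 2}, 0.5)
```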

00:10:16.470 --> 00:10:19.549
Yes. You roll back the system and select the

00:10:19.549 --> 00:10:21.929
previous iteration of the model, the one with

00:10:21.929 --> 00:10:24.100
the absolute minimum error. Well, that sounds

00:10:24.100 --> 00:10:26.600
like a perfectly clean, simple mathematical rule.

00:10:26.659 --> 00:10:29.200
It does sound like that, but in practice it is

00:10:29.200 --> 00:10:31.820
a chaotic nightmare. Really? I would assume the

00:10:31.820 --> 00:10:34.139
error rate just drops in a nice smooth curve

00:10:34.139 --> 00:10:36.000
and then eventually pops back up like a check

00:10:36.000 --> 00:10:39.120
mark. Is it not that clean? Oh not at all. The

00:10:39.120 --> 00:10:42.299
validation error fluctuates wildly during training.

00:10:42.639 --> 00:10:45.679
It creates what we call multiple local minima.

00:10:45.759 --> 00:10:47.860
What does that look like? It dips, it spikes,

00:10:48.059 --> 00:10:50.419
it dips lower, spikes violently, it drops again.

00:10:50.480 --> 00:10:53.679
It's incredibly noisy. Wow, okay. This complication

00:10:53.679 --> 00:10:57.200
forces data scientists to invent complex ad hoc

00:10:57.200 --> 00:11:00.500
rules just to figure out when true overfitting

00:11:00.500 --> 00:11:02.940
has actually begun. Versus when the model is

00:11:02.940 --> 00:11:05.679
just, you know, experiencing a temporary statistical

00:11:05.679 --> 00:11:08.120
fluctuation in its learning curve. Exactly. It's

00:11:08.120 --> 00:11:11.039
a lot of guesswork sometimes. So, okay, the model

00:11:11.039 --> 00:11:14.330
has studied the flashcards. It's tuned its brain

00:11:14.330 --> 00:11:16.889
architecture by taking the practice exams. We've

00:11:16.889 --> 00:11:19.070
rolled it back to avoid overfitting, but how

00:11:19.070 --> 00:11:21.210
do we know it's actually ready to be deployed

00:11:21.210 --> 00:11:23.769
into the real world? It feels like it needs a

00:11:23.769 --> 00:11:26.009
final, completely blind trial. And this brings

00:11:26.009 --> 00:11:29.309
us to the third and final pillar, the test data

00:11:29.309 --> 00:11:32.649
set. The final exam. The final exam. This is

00:11:32.649 --> 00:11:34.909
a collection of examples used exclusively to

00:11:34.909 --> 00:11:37.129
assess the final performance of the fully trained

00:11:37.129 --> 00:11:40.190
classifier on completely unseen data. And the

00:11:40.190 --> 00:11:43.769
rule here is absolute, right? The test data must

00:11:43.769 --> 00:11:46.529
follow the exact same probability distribution

00:11:46.529 --> 00:11:49.950
as the training data, but it has to be entirely

00:11:49.950 --> 00:11:52.409
independent. It must represent the exact same

00:11:52.409 --> 00:11:55.549
statistical universe. Right. So if you train

00:11:55.549 --> 00:11:58.549
the model on images of marine life, the test

00:11:58.549 --> 00:12:01.409
set cannot be images of bicycles. That wouldn't

00:12:01.409 --> 00:12:04.419
make any sense. No. But it absolutely must be

00:12:04.419 --> 00:12:06.779
marine life images the model has never encountered

00:12:06.779 --> 00:12:08.700
during training and never encountered during

00:12:08.700 --> 00:12:10.580
the validation practice exams. Right, because

00:12:10.580 --> 00:12:12.580
this is how we guarantee an honest evaluation.
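
In code, that quarantine can be as simple as a holdout split like the sketch below; the 60/20/20 ratio is an illustrative choice, not a universal rule:

```python
import random

def holdout_split(examples, train_frac=0.6, val_frac=0.2, seed=0):
    # Shuffle once, then carve three non-overlapping slices: train adjusts
    # the parameters, validation tunes hyperparameters, and the test slice
    # stays untouched until the single final exam.
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = holdout_split(list(range(100)))
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```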

00:12:12.860 --> 00:12:15.799
Exactly. And we measure the success by looking

00:12:15.799 --> 00:12:18.480
at something called the mean squared error, or

00:12:18.480 --> 00:12:22.029
MSE. OK. Let's break down the MSE, because the

00:12:22.029 --> 00:12:24.470
squared part of that term is actually the secret

00:12:24.470 --> 00:12:27.490
sauce to how we punish bad AI behavior. It really

00:12:27.490 --> 00:12:30.110
is. It is the ultimate mathematical judge. When

00:12:30.110 --> 00:12:33.129
a model makes a prediction, it is rarely 100%

00:12:33.129 --> 00:12:35.970
perfect. There is always an error. But if we

00:12:35.970 --> 00:12:38.730
just averaged out all the regular errors, positive

00:12:38.730 --> 00:12:41.110
and negative, they might cancel each other out

00:12:41.110 --> 00:12:43.029
and make a terrible model look totally fine.

00:12:43.110 --> 00:12:46.909
By squaring the error, we do two things. First,

00:12:46.970 --> 00:12:49.830
we make all the numbers positive. Second, and

00:12:49.830 --> 00:12:51.929
much more importantly, we disproportionately

00:12:51.929 --> 00:12:54.909
penalize massive failures. Right, because let's

00:12:54.909 --> 00:12:58.029
do the math. If the AI guesses a value and is

00:12:58.029 --> 00:13:01.509
off by 2, you square that error and get 4, a

00:13:01.509 --> 00:13:04.070
pretty minor penalty. Exactly. But if it goes

00:13:04.070 --> 00:13:06.649
completely off the rails and is off by 10, you

00:13:06.649 --> 00:13:09.210
square that and suddenly get 100. Why? The math

00:13:09.210 --> 00:13:12.610
violently rejects catastrophic guesses. It does.
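
That squaring penalty is easy to check directly. A quick sketch of mean squared error on invented predictions:

```python
def mse(predictions, targets):
    # Square every error so misses can't cancel out and big misses dominate.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = [10, 10, 10]
print(mse([12, 10, 10], targets))  # one miss by 2:  4 / 3 ≈ 1.33
print(mse([20, 10, 10], targets))  # one miss by 10: 100 / 3 ≈ 33.3, a 25x penalty
```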

00:13:12.970 --> 00:13:15.230
Look at two hypothetical models. Let's say model

00:13:15.230 --> 00:13:17.570
A severely overfits the training data. Okay.

00:13:17.740 --> 00:13:20.080
On the flashcards, its mean squared error is

00:13:20.080 --> 00:13:22.879
a tiny, impressive 4. But when confronted with

00:13:22.879 --> 00:13:25.700
the unseen test set, that error spikes to 15.

00:13:25.879 --> 00:13:28.299
It increases by almost a factor of 4. Yes. Which

00:13:28.299 --> 00:13:30.539
is a massive red flag. It basically memorized

00:13:30.539 --> 00:13:32.940
the cards but totally failed the final exam.

00:13:33.059 --> 00:13:35.240
Now compare that to model B. Its training error

00:13:35.240 --> 00:13:37.620
is initially much higher, sitting at a 9. Okay.

00:13:37.820 --> 00:13:40.620
Worse on the flashcards. Right. But on the blind

00:13:40.620 --> 00:13:44.779
test set, it only goes up to 13. The error increases

00:13:44.779 --> 00:13:48.250
by less than a factor of 2. Model B is vastly

00:13:48.250 --> 00:13:50.750
superior. Because it generalizes to reality.

00:13:51.490 --> 00:13:54.070
Even if its initial flashcard score looked way

00:13:54.070 --> 00:13:56.690
worse than Model A's. Exactly. It understands

00:13:56.690 --> 00:13:58.789
the concepts better. I'm trying to picture this

00:13:58.789 --> 00:14:01.509
in a real world workflow, though, and I see a

00:14:01.509 --> 00:14:04.330
glaring practical problem. What's that? Well,

00:14:04.330 --> 00:14:07.009
this three-part slicing into training, validation,

00:14:07.490 --> 00:14:10.450
and test sets, it requires an enormous amount

00:14:10.450 --> 00:14:13.139
of data. Oh, it really does. Right. So if I'm

00:14:13.139 --> 00:14:15.679
building a highly specialized tool, say analyzing

00:14:15.679 --> 00:14:18.259
a rare type of medical scan, and I only have

00:14:18.259 --> 00:14:21.220
100 images total. Yeah, that's tough. I can't

00:14:21.220 --> 00:14:23.139
afford to quarantine 30 of them for a test set.

00:14:23.200 --> 00:14:26.320
I need the AI to study all 100 just to grasp

00:14:26.320 --> 00:14:28.679
the basics. So how do you test a model without

00:14:28.679 --> 00:14:31.200
starving it of the material it desperately needs

00:14:31.200 --> 00:14:34.000
to learn? That data scarcity is honestly one

00:14:34.000 --> 00:14:36.279
of the biggest hurdles in machine learning today.

00:14:36.299 --> 00:14:38.360
I can imagine. Because a simple three -way split

00:14:38.360 --> 00:14:41.220
on a tiny data set will almost guarantee overfitting

00:14:41.220 --> 00:14:44.419
and biased performance estimates. Right. So when

00:14:44.419 --> 00:14:47.179
holding out large chunks of data is a luxury

00:14:47.179 --> 00:14:50.419
you just cannot afford, the industry relies on

00:14:50.419 --> 00:14:53.179
a brilliant mathematical workaround called cross-

00:14:53.179 --> 00:14:56.100
validation. Cross-validation. Break the mechanics

00:14:56.100 --> 00:14:59.059
of that down for us. How do you cross-validate

00:14:59.059 --> 00:15:01.879
without breaking the quarantine rules? Well,

00:15:01.919 --> 00:15:04.500
let's take your 100 medical images. Instead of

00:15:04.500 --> 00:15:07.039
permanently locking away 20 images for a test

00:15:07.039 --> 00:15:10.120
set, you split your entire data set into five

00:15:10.120 --> 00:15:13.200
equal blocks, or what we call folds, of 20 images

00:15:13.200 --> 00:15:15.980
each. OK, five blocks of 20. You take one block,

00:15:16.259 --> 00:15:19.220
just one of them, and hide it. That is your temporary

00:15:19.220 --> 00:15:21.879
test set. Then you train your model on the remaining

00:15:21.879 --> 00:15:24.480
four blocks. Once it's fully trained, you test

00:15:24.480 --> 00:15:26.299
it on the hidden block and record the score.

00:15:26.580 --> 00:15:28.399
OK. And then you reset the whole thing? Exactly.

00:15:28.840 --> 00:15:30.559
You put that hidden block back into the pool.

00:15:31.000 --> 00:15:32.980
Then you pull out a different block of 20 to

00:15:32.980 --> 00:15:36.000
be the new hidden test set. And you train a fresh

00:15:36.000 --> 00:15:38.799
model from scratch on the remaining 80. You repeat

00:15:38.799 --> 00:15:41.799
this entire process five times, rotating the

00:15:41.799 --> 00:15:43.779
hidden block until every single piece of data

00:15:43.779 --> 00:15:46.440
has been used as a test set exactly once. And

00:15:46.440 --> 00:15:48.840
then what? Finally, you just average the test

00:15:48.840 --> 00:15:51.419
results across all five iterations. OK. That

00:15:51.419 --> 00:15:55.279
is incredibly clever. You allow the machine to

00:15:55.279 --> 00:15:58.889
eventually study 100% of the material, but you

00:15:58.889 --> 00:16:00.590
never let it study the material it's currently

00:16:00.590 --> 00:16:03.289
being tested on. It's extremely efficient. It's

00:16:03.289 --> 00:16:06.029
like taking a single deck of flashcards but constantly

00:16:06.029 --> 00:16:08.929
rotating which specific 10 cards you hide in

00:16:08.929 --> 00:16:10.990
your pocket for the exam. Yeah, right. And it

00:16:10.990 --> 00:16:13.809
effectively eliminates the bias of a single lucky

00:16:13.809 --> 00:16:16.710
or unlucky test split. That makes total sense.
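
A compact sketch of that five-fold rotation; `train_and_score` is a hypothetical callable standing in for training a fresh model and scoring it on the hidden fold:

```python
def cross_validate(examples, train_and_score, k=5):
    # Rotate which fold is hidden so every example is test data exactly once.
    fold_size = len(examples) // k
    scores = []
    for i in range(k):
        hidden = examples[i * fold_size:(i + 1) * fold_size]  # temporary test set
        visible = examples[:i * fold_size] + examples[(i + 1) * fold_size:]
        scores.append(train_and_score(visible, hidden))       # fresh model each round
    return sum(scores) / k                                    # average the k exam scores

# Dummy usage with 100 stand-in "images" and a placeholder scoring function:
print(cross_validate(list(range(100)), lambda train, test: len(test) / len(train)))  # 0.25
```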

00:16:17.230 --> 00:16:19.049
And there is also another technique from our

00:16:19.049 --> 00:16:21.429
sources called the bootstrap method. Oh, right,

00:16:21.590 --> 00:16:25.129
bootstrapping. And that relies on generating

00:16:25.129 --> 00:16:27.769
simulated data sets. Yes, it does. But how do

00:16:27.769 --> 00:16:30.350
you simulate data if you don't have enough to

00:16:30.350 --> 00:16:32.450
begin with? I mean, you can't just make up medical

00:16:32.450 --> 00:16:35.950
scans. No, you can't. What you do is sample your

00:16:35.950 --> 00:16:38.870
original data randomly, but with replacement.

00:16:39.009 --> 00:16:40.470
With replacement. Okay, what does that look like?

00:16:40.610 --> 00:16:43.730
Imagine your hundred medical images are balls

00:16:43.730 --> 00:16:47.070
in a hat. You draw one ball out, write down its

00:16:47.070 --> 00:16:48.809
data, and then you put the ball back in the hat.

00:16:49.149 --> 00:16:51.990
You repeat this 100 times to create a simulated

00:16:51.990 --> 00:16:54.809
data set of the exact same size. But wait, because

00:16:54.809 --> 00:16:56.669
you're putting the ball back every time, you're

00:16:56.669 --> 00:16:58.649
inevitably going to draw the same image twice.

00:16:59.370 --> 00:17:02.090
Or even three times. Exactly. Statistically,

00:17:02.389 --> 00:17:04.750
a simulated data set created this way will only

00:17:04.750 --> 00:17:07.970
contain about 63% of the unique original data

00:17:07.970 --> 00:17:10.470
points. Okay, so the duplicates fill in the rest

00:17:10.470 --> 00:17:13.430
of the space. Yes. But here is the magic trick:

00:17:13.789 --> 00:17:17.349
that leaves about 37% of your original data

00:17:17.349 --> 00:17:20.049
that was never drawn from the hat at all. Oh,

00:17:20.309 --> 00:17:22.710
the out-of-bag samples. Exactly. Those untouched

00:17:22.710 --> 00:17:27.269
images automatically become your pristine quarantined

00:17:27.269 --> 00:17:29.750
test set for that specific simulation. Yeah.

00:17:29.829 --> 00:17:31.509
You didn't even have to carve them out manually.

00:17:31.670 --> 00:17:33.670
The probability of the draw just did it for you.
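
The draw-with-replacement trick, and the roughly 63/37 split it produces, can be verified in a few lines, again on stand-in data:

```python
import random

rng = random.Random(0)
originals = list(range(100))  # stand-ins for 100 medical images

# Sample WITH replacement: draw a ball, record it, put it back in the hat.
bootstrap = [rng.choice(originals) for _ in originals]

in_bag = set(bootstrap)               # unique originals drawn at least once
out_of_bag = set(originals) - in_bag  # never drawn: a free quarantined test set
print(len(in_bag), len(out_of_bag))   # roughly 63 and 37, by the math of the draw
```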

00:17:34.089 --> 00:17:37.130
It is an incredibly elegant solution to data

00:17:37.130 --> 00:17:40.150
starvation. It really is. So, the methodology

00:17:40.150 --> 00:17:42.359
is mathematically rigorous. But I have to ask

00:17:42.359 --> 00:17:44.880
about a really confusing aspect of the industry

00:17:44.880 --> 00:17:47.380
itself. Uh-oh, what's that? Well, when you read

00:17:47.380 --> 00:17:49.359
through academic papers or talk to developers,

00:17:50.000 --> 00:17:52.480
they constantly use the terms validation set

00:17:52.480 --> 00:17:55.839
and test set interchangeably. Oh, yes. It makes

00:17:55.839 --> 00:17:57.859
it nearly impossible to follow their workflow

00:17:57.859 --> 00:18:00.640
sometimes. It is a profound frustration in the

00:18:00.640 --> 00:18:03.640
field. It's widely considered the most blatant

00:18:03.640 --> 00:18:06.480
example of terminological confusion pervading

00:18:06.480 --> 00:18:09.819
AI research. Why does it happen? The terms get

00:18:09.819 --> 00:18:12.079
flipped entirely based on the perspective of

00:18:12.079 --> 00:18:15.140
the person doing the work. OK. Internally, developers

00:18:15.140 --> 00:18:17.799
view the process of trying out different architectures

00:18:17.799 --> 00:18:21.119
as an experiment or a test. Oh, I see where this

00:18:21.119 --> 00:18:23.539
is going. Right. So they incorrectly call their

00:18:23.539 --> 00:18:26.440
development set the test set. Then they take

00:18:26.440 --> 00:18:28.910
their finalized model and they want to validate

00:18:28.910 --> 00:18:30.970
it before releasing it to the public. So they

00:18:30.970 --> 00:18:33.569
call the final unseen data the validation set.

00:18:33.910 --> 00:18:36.910
Exactly. They just casually reverse the dictionary

00:18:36.910 --> 00:18:39.410
definitions based on whether they are looking

00:18:39.410 --> 00:18:41.609
inward at the code or outward at the consumer.

00:18:41.650 --> 00:18:45.009
That is maddening. It is. But regardless of the

00:18:45.009 --> 00:18:47.869
messy terminology, the core mechanical concept

00:18:47.869 --> 00:18:50.839
remains absolute. The final set, whatever word

00:18:50.839 --> 00:18:53.380
you choose to label it with, must only be used

00:18:53.380 --> 00:18:55.920
in the final experiment. It can never be used

00:18:55.920 --> 00:18:58.420
to validate the model during training or fine-tune those

00:18:58.420 --> 00:19:01.319
hyperparameters. The moment you leak test data

00:19:01.319 --> 00:19:04.900
into the training phase, your evaluation is entirely

00:19:04.900 --> 00:19:08.079
compromised. So what does this all mean? We have

00:19:08.079 --> 00:19:11.240
this incredibly robust, mathematically sound

00:19:11.240 --> 00:19:14.279
three-part structure: training, validation, testing.

00:19:14.359 --> 00:19:17.500
We've got cross -validation, bootstrapping, early

00:19:17.500 --> 00:19:20.920
stopping, penalizing errors with MSE. And yet,

00:19:21.539 --> 00:19:23.819
we see machine learning models fail dramatically,

00:19:24.180 --> 00:19:26.480
sometimes comically, in the real world. Why?

00:19:26.880 --> 00:19:28.799
Well, if we connect this to the bigger picture.

00:19:29.259 --> 00:19:32.019
A machine's logic is fundamentally bounded by

00:19:32.019 --> 00:19:34.420
the data it is fed. Right. It has no common sense

00:19:34.420 --> 00:19:37.240
to fall back on. If the human-curated data pipeline

00:19:37.240 --> 00:19:39.799
contains flaws, omissions, or completely irrelevant

00:19:39.799 --> 00:19:42.740
inputs, that highly rigorous mathematical structure

00:19:42.740 --> 00:19:45.539
we just discussed doesn't fix the flaws. It simply

00:19:45.539 --> 00:19:47.740
optimizes for them. Precisely. It optimizes the

00:19:47.740 --> 00:19:50.380
flaws. So if your flashcards are missing critical

00:19:50.380 --> 00:19:53.299
information, you're just perfectly training a

00:19:53.299 --> 00:19:56.279
flawed worldview. Yes. Omissions in the training

00:19:56.279 --> 00:19:59.380
phase basically doom the entire process. This

00:19:59.380 --> 00:20:02.940
means things like obsolete data, ambiguous inputs,

00:20:03.579 --> 00:20:05.740
an inability to adapt to environmental changes,

00:20:06.180 --> 00:20:09.279
or the simple inability to ask a human for help

00:20:09.279 --> 00:20:11.809
when it's confused. Right. When a particular

00:20:11.809 --> 00:20:14.390
circumstance or variation is entirely omitted

00:20:14.390 --> 00:20:17.289
from the training data, the model has no mathematical

00:20:17.289 --> 00:20:19.789
pathway to understand it. It has no internal

00:20:19.789 --> 00:20:22.150
tripwire. Exactly. It can't understand it when

00:20:22.150 --> 00:20:24.250
it encounters it in the real world. And this

00:20:24.250 --> 00:20:26.210
actually happened in the real world with Apple's

00:20:26.210 --> 00:20:28.470
Face ID system, didn't it? Yeah. A 10-year-old

00:20:28.470 --> 00:20:31.390
boy managed to completely bypass the sophisticated

00:20:31.390 --> 00:20:34.589
biometric security to unlock his mother's iPhone

00:20:34.589 --> 00:20:38.079
X. Yeah, that was a huge story. And the mechanism

00:20:38.079 --> 00:20:40.660
behind that failure is a textbook omission of

00:20:40.660 --> 00:20:43.019
environmental conditions. How so? The mother

00:20:43.019 --> 00:20:45.319
had registered her face into the system under

00:20:45.319 --> 00:20:48.799
indoor nighttime lighting. Ah, the lighting was

00:20:48.799 --> 00:20:51.839
the missing variable. Exactly. That specific

00:20:51.839 --> 00:20:54.200
lighting condition was not appropriately included

00:20:54.200 --> 00:20:56.539
as a diverse variation in the system's training

00:20:56.539 --> 00:21:00.220
data. Wow. So the AI learned a highly specific,

00:21:00.619 --> 00:21:03.759
overly narrow pattern that under those exact

00:21:03.759 --> 00:21:06.759
shadowy conditions, mathematically overlapped

00:21:06.759 --> 00:21:09.440
with her son's facial features. Resulting in

00:21:09.440 --> 00:21:11.940
a complete failure of the security classifier.

00:21:12.140 --> 00:21:15.119
Completely. There is also the infamous object

00:21:15.119 --> 00:21:17.619
detection failure with the sheep on the grasslands.

00:21:17.960 --> 00:21:20.569
Oh, the sheep. Let's say you want to build an

00:21:20.569 --> 00:21:23.609
algorithm to detect sheep. So you feed your training

00:21:23.609 --> 00:21:26.289
data set thousands of pictures of sheep standing

00:21:26.289 --> 00:21:28.930
on grassy fields. And this is a classic example

00:21:28.930 --> 00:21:31.569
of an algorithm utilizing totally irrelevant

00:21:31.569 --> 00:21:33.470
information. Why is it irrelevant? The sheep is

00:21:33.470 --> 00:21:36.200
in the picture. Sure, but the model is always

00:21:36.200 --> 00:21:38.359
searching for the path of least mathematical

00:21:38.359 --> 00:21:41.559
resistance to reduce its error rate, right? Right.

00:21:41.660 --> 00:21:44.220
Analyzing the complex woolly outline of a sheep

00:21:44.220 --> 00:21:46.960
is mathematically difficult, but identifying

00:21:46.960 --> 00:21:50.119
a massive, consistent expanse of green pixels.

00:21:50.220 --> 00:21:52.720
Oh, that's incredibly easy. Yes. So it doesn't

00:21:52.720 --> 00:21:54.779
actually learn to identify the animal. It learns

00:21:54.779 --> 00:21:58.200
to identify the green background. Yeah. It associates

00:21:58.200 --> 00:22:01.829
the target label sheep with green grass. Exactly.

00:22:02.029 --> 00:22:04.049
Which means if you feed it an image of a dog

00:22:04.049 --> 00:22:07.210
on a grassy field, or a tractor, or literally

00:22:07.210 --> 00:22:10.430
just an empty patch of grass, the AI confidently

00:22:10.430 --> 00:22:12.829
outputs a false positive and labels it a sheep.

00:22:13.150 --> 00:22:15.670
Because in its limited training universe, the

00:22:15.670 --> 00:22:18.930
grassy background was the most reliable empirical

00:22:18.930 --> 00:22:22.109
predictor of the label. That's crazy. It is a

00:22:22.109 --> 00:22:24.970
failure of the human curators. They failed to

00:22:24.970 --> 00:22:27.609
provide diverse, varying backgrounds in the training

00:22:27.609 --> 00:22:30.380
and validation sets. If they had, it would have

00:22:30.380 --> 00:22:32.559
forced the model to ignore the grass and look

00:22:32.559 --> 00:22:34.940
at the actual object of interest. Which brings

00:22:34.940 --> 00:22:37.299
us full circle to my absolute favorite example,

00:22:37.339 --> 00:22:39.799
the one we started with today. The machine that

00:22:39.799 --> 00:22:42.359
delivers a cup of coffee heated to five million

00:22:42.359 --> 00:22:45.059
degrees. Right, the plasma coffee. It produced

00:22:45.059 --> 00:22:47.539
that catastrophic output based on its learned

00:22:47.539 --> 00:22:50.220
definition of extra hot. Because the human curators

00:22:50.220 --> 00:22:52.180
omitted the relevant environmental conditions.

00:22:52.380 --> 00:22:55.440
Exactly. Namely, the constraints of physics and

00:22:55.440 --> 00:22:58.509
human biology. The machine simply optimized the

00:22:58.509 --> 00:23:01.529
parameter for hot without any boundaries or contextual

00:23:01.529 --> 00:23:04.809
grounding. It just turned the dial for heat all

00:23:04.809 --> 00:23:07.569
the way up optimizing the metric no matter how

00:23:07.569 --> 00:23:10.569
absurd the real world result became. That's all

00:23:10.569 --> 00:23:12.769
it knows how to do. Which brings us to the end

00:23:12.769 --> 00:23:15.549
of today's deep dive. If there is one thing for

00:23:15.549 --> 00:23:17.930
you to take away today, it's that a machine's

00:23:17.930 --> 00:23:20.910
intelligence is strictly bounded by how its human

00:23:20.910 --> 00:23:24.630
creators divide, curate, and test its data. The

00:23:24.630 --> 00:23:27.690
training, validation, and test sets are basically

00:23:27.690 --> 00:23:30.750
the invisible walls of an AI's universe. The

00:23:30.750 --> 00:23:33.890
final output can only ever be as robust as the

00:23:33.890 --> 00:23:36.190
data sets used to build it. But I want to leave

00:23:36.190 --> 00:23:38.089
you with a final thought to mull over, building

00:23:38.089 --> 00:23:39.980
on the mechanics we've just uncovered. We've

00:23:39.980 --> 00:23:42.519
seen how spectacularly these models fail when

00:23:42.519 --> 00:23:45.099
their training data omits specific real -world

00:23:45.099 --> 00:23:47.940
conditions, like nighttime lighting or non-grassy

00:23:47.940 --> 00:23:50.759
backgrounds. But right now, our digital world

00:23:50.759 --> 00:23:53.359
is being flooded with AI-generated content:

00:23:53.630 --> 00:23:57.410
images, text, entirely synthesized data pipelines.

00:23:57.650 --> 00:24:00.170
Which raises a very, very alarming question about

00:24:00.170 --> 00:24:03.089
the integrity of future data sets. Exactly. What

00:24:03.089 --> 00:24:05.109
happens when tomorrow's machine learning models

00:24:05.109 --> 00:24:07.890
start using training and validation sets populated

00:24:07.890 --> 00:24:10.210
by the outputs of other AI models? It's a scary

00:24:10.210 --> 00:24:12.869
thought. Will these hidden biases, these false

00:24:12.869 --> 00:24:15.470
assumptions about grassy fields, and these bizarre

00:24:15.470 --> 00:24:18.450
logical errors compound in an endless feedback

00:24:18.450 --> 00:24:21.809
loop? If an AI learns entirely from the mistakes

00:24:21.809 --> 00:24:25.039
of another AI, are we pulling machines further

00:24:25.039 --> 00:24:27.599
and further away from reality? If the test data

00:24:27.599 --> 00:24:30.579
ceases to reflect the real world, the mathematical

00:24:30.579 --> 00:24:33.720
evaluation ceases to be honest. We started today

00:24:33.720 --> 00:24:35.660
by talking about that feeling of magic, that

00:24:35.660 --> 00:24:37.579
little ghost in the machine that just seems to

00:24:37.579 --> 00:24:39.980
know things. But it turns out the real magic

00:24:39.980 --> 00:24:42.660
isn't in the algorithm itself. It's in the messy,

00:24:43.019 --> 00:24:45.500
flawed, and distinctly human task of deciding

00:24:45.500 --> 00:24:48.460
what reality looks like one data point at a time.
