WEBVTT

00:00:00.000 --> 00:00:03.520
So picture this. A 10-year-old boy picks up

00:00:03.520 --> 00:00:06.799
his mother's iPhone X. He just casually stares

00:00:06.799 --> 00:00:09.820
at the screen, and the phone immediately unlocks.

00:00:10.419 --> 00:00:12.660
And it's not a glitch, and it's definitely not

00:00:12.660 --> 00:00:16.100
magic. The most advanced facial recognition software

00:00:16.100 --> 00:00:19.309
on the planet at that time was mathematically

00:00:19.309 --> 00:00:21.649
confident that a 10-year-old child was his

00:00:21.649 --> 00:00:23.949
adult mother. Which is wild to think about. It

00:00:23.949 --> 00:00:27.210
is. And to understand why that happened, and

00:00:27.210 --> 00:00:29.850
why artificial intelligence sometimes fails so

00:00:29.850 --> 00:00:32.109
spectacularly in the real world, we actually

00:00:32.109 --> 00:00:34.810
have to look inside the incredibly fragile way

00:00:34.810 --> 00:00:37.469
these systems learn. Yeah, because it is entirely

00:00:37.469 --> 00:00:40.009
about the data. I mean, when an algorithm makes

00:00:40.009 --> 00:00:41.950
a mistake, whether it's a harmless error from

00:00:41.950 --> 00:00:45.549
your smart coffee machine or a car misinterpreting

00:00:45.549 --> 00:00:47.649
a shadow on the highway, it almost always comes

00:00:47.600 --> 00:00:49.859
down to the exact mathematical foundation we

00:00:49.859 --> 00:00:52.619
used to teach it. If you want to know why algorithms

00:00:52.619 --> 00:00:54.799
get things wrong, you really have to look at

00:00:54.799 --> 00:00:56.759
how their knowledge is built from the ground

00:00:56.759 --> 00:00:59.439
up. So if you've ever wondered why your navigation

00:00:59.439 --> 00:01:01.539
app suddenly tells you to drive into a lake,

00:01:01.880 --> 00:01:05.019
the answer actually lies in a tiny mathematical

00:01:05.019 --> 00:01:07.340
mistake made months before you even open the

00:01:07.340 --> 00:01:09.859
app. Exactly. In this deep dive, we are opening

00:01:09.859 --> 00:01:12.560
up the black box. We are looking at the comprehensive

00:01:12.560 --> 00:01:15.340
data on how machine learning algorithms are trained,

00:01:15.840 --> 00:01:18.480
validated, and tested. And the mission today

00:01:18.480 --> 00:01:20.780
is, by the end of our time together, we're going

00:01:20.780 --> 00:01:23.739
to completely demystify this process so you can

00:01:23.739 --> 00:01:25.700
see exactly where the invisible brain behind

00:01:25.700 --> 00:01:28.189
your screen goes completely off the rails. What's

00:01:28.189 --> 00:01:30.549
really fascinating here is that in machine learning,

00:01:31.290 --> 00:01:34.290
the data used to build a mathematical model is

00:01:34.290 --> 00:01:37.129
separated into three highly distinct buckets.

00:01:37.230 --> 00:01:39.810
Okay. And if a data scientist mixes these buckets

00:01:39.810 --> 00:01:42.530
up, the whole system effectively collapses. So

00:01:42.530 --> 00:01:44.609
the first phase is entirely about establishing

00:01:44.609 --> 00:01:47.000
a baseline. We call this the training data set.
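
To make the three buckets concrete, here is a minimal Python sketch of that separation, assuming scikit-learn is available; the 60/20/20 ratio is an illustrative choice, not a rule from the conversation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for real data: 1,000 input vectors and their labels.
rng = np.random.default_rng(0)
X = rng.random((1000, 32))
y = rng.integers(0, 2, size=1000)

# Wall off 20% as the untouchable test set first.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Split the remainder into training (60% overall) and validation
# (20% overall): 0.25 of the remaining 80% is 20%.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```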

00:01:47.219 --> 00:01:49.519
OK, let's unpack this. The training data set,

00:01:49.540 --> 00:01:51.579
this is where the model is using what they call

00:01:51.579 --> 00:01:54.599
supervised learning. So I'm picturing like a

00:01:54.599 --> 00:01:57.500
student studying with a massive stack of flashcards.

00:01:59.000 --> 00:02:01.219
The question is on the front of the flashcard

00:02:01.219 --> 00:02:04.340
and the target or the label is the answer on

00:02:04.340 --> 00:02:06.459
the back. Yeah, let's take that analogy a step

00:02:06.459 --> 00:02:09.139
further. Imagine the student doesn't even know

00:02:09.139 --> 00:02:11.159
English yet. They're just looking at shapes.

00:02:11.360 --> 00:02:14.610
OK, totally blank slate. Completely. The training

00:02:14.610 --> 00:02:17.550
data set consists of millions of pairs. You have

00:02:17.550 --> 00:02:20.509
an input vector, like a picture, and the corresponding

00:02:20.509 --> 00:02:23.150
output vector, which is the label. The model

00:02:23.150 --> 00:02:26.009
looks at the picture, produces a completely random

00:02:26.009 --> 00:02:28.250
guess, and then checks the back of the flash

00:02:28.250 --> 00:02:30.370
card to see the actual target. And when it gets

00:02:30.370 --> 00:02:32.849
the answer wrong, which, I mean, it obviously

00:02:32.849 --> 00:02:34.909
will at first. Oh, almost every time. Right.

00:02:34.930 --> 00:02:37.610
So how does it actually learn? Does it just memorize

00:02:37.610 --> 00:02:41.090
that specific card? Well, no. It uses algorithms

00:02:41.090 --> 00:02:43.810
like gradient descent to physically adjust its

00:02:43.810 --> 00:02:46.930
internal parameters. So think of parameters as

00:02:46.930 --> 00:02:49.849
the literal weights of the connections between

00:02:49.849 --> 00:02:53.439
artificial neurons. If it guesses wrong... It

00:02:53.439 --> 00:02:56.219
tweaks those connections slightly so that next

00:02:56.219 --> 00:02:58.919
time the mathematical pathway leans a little

00:02:58.919 --> 00:03:01.319
closer to the right answer. And it does this

00:03:01.319 --> 00:03:04.139
millions of times, constantly tweaking its own

00:03:04.139 --> 00:03:06.460
internal wiring. So it's incrementally shifting

00:03:06.460 --> 00:03:09.219
its brain structure until it reliably guesses

00:03:09.219 --> 00:03:11.270
what's on the front of the card. Exactly. But

00:03:11.270 --> 00:03:13.930
the real world isn't as simple as a flash card.
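
For anyone who wants that tweaking loop spelled out, here is a minimal sketch of gradient descent on a single weight, assuming a toy one-feature model; the learning rate and data are invented for illustration.

```python
import numpy as np

# Toy flashcards: inputs x and targets y following y = 3x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 3.0 * x + rng.normal(0, 0.1, 200)

w = 0.0    # the single learnable parameter, starting as a blind guess
lr = 0.1   # learning rate: how hard each wrong answer tweaks the weight

for epoch in range(100):
    pred = w * x                    # guess the back of the flashcard
    error = pred - y                # how wrong the guess was
    grad = 2 * np.mean(error * x)   # slope of mean squared error w.r.t. w
    w -= lr * grad                  # nudge the pathway toward the answer

print(round(w, 2))  # approaches 3.0 as the tweaks accumulate
```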

00:03:14.270 --> 00:03:16.289
And looking at the research, there's this brilliant

00:03:16.289 --> 00:03:19.449
example of how this guessing and tweaking process

00:03:19.449 --> 00:03:22.349
can create some really bizarre logic. Let's talk

00:03:22.349 --> 00:03:24.530
about the starfish and the sea urchins. Yeah,

00:03:24.590 --> 00:03:26.770
this is a classic. So let's say engineers are

00:03:26.770 --> 00:03:30.110
training an AI specifically for ocean object

00:03:30.110 --> 00:03:33.139
detection. They feed the network thousands of

00:03:33.139 --> 00:03:36.139
images of starfish and sea urchins. Just massive

00:03:36.139 --> 00:03:39.319
piles of ocean photos. Right. Over time, the

00:03:39.319 --> 00:03:41.699
AI starts correlating specific visual features

00:03:41.699 --> 00:03:44.539
with its internal nodes. For the starfish, the

00:03:44.539 --> 00:03:48.360
AI learns to heavily weigh a ring texture and

00:03:48.360 --> 00:03:50.919
a star outline. Makes sense. Starfish equal ringed

00:03:50.919 --> 00:03:52.680
and star-shaped. And for the sea urchins, it

00:03:52.680 --> 00:03:54.599
learns to match them with a stripe texture and

00:03:54.599 --> 00:03:57.099
an oval shape. But during this massive training

00:03:57.099 --> 00:04:00.520
run, the AI is inevitably shown a rare anomalous

00:04:00.520 --> 00:04:03.520
image. It's a sea urchin that happens to have

00:04:03.520 --> 00:04:06.099
a ringed texture instead of a striped one. Just

00:04:06.099 --> 00:04:07.960
a total freak of nature that slipped into the

00:04:07.960 --> 00:04:10.719
flashcard deck. Exactly. And because of this

00:04:10.719 --> 00:04:13.759
rare image, the AI creates a very weakly weighted

00:04:13.759 --> 00:04:17.139
association between ringed texture and sea urchin.

00:04:17.819 --> 00:04:20.759
Oh, I see. It's not the primary rule, but that

00:04:20.759 --> 00:04:23.279
mathematical connection now exists in its brain.

00:04:23.789 --> 00:04:26.209
Fast forward to later in the training, we show

00:04:26.209 --> 00:04:29.329
the AI a brand new image. This image contains

00:04:29.329 --> 00:04:32.569
a normal starfish and a completely random seashell

00:04:32.569 --> 00:04:34.389
that was never included in any of the training

00:04:34.389 --> 00:04:37.490
data. So the AI has zero concept of what a shell

00:04:37.490 --> 00:04:40.170
is? None at all. Yeah. The network correctly

00:04:40.170 --> 00:04:43.009
detects the starfish. But that random shell happened

00:04:43.009 --> 00:04:45.069
to have an oval shape. Uh-oh. Yeah, this triggers

00:04:45.069 --> 00:04:47.149
a weak signal for the oval shape node in the

00:04:47.149 --> 00:04:50.110
AI's brain. Now combine that oval signal with

00:04:50.110 --> 00:04:52.490
the weak ring texture signal emanating from the

00:04:52.490 --> 00:04:54.850
starfish right next to it. Suddenly, those two

00:04:54.850 --> 00:04:57.089
weak signals cross-pollinate and combine to

00:04:57.089 --> 00:05:00.629
produce a false positive. Wow. So it confidently

00:05:00.629 --> 00:05:02.990
tells you there is a sea urchin in the image,

00:05:03.170 --> 00:05:05.149
even though it's looking at a shell and a starfish.
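
The cross-pollination is easy to mimic with made-up numbers. In this sketch, every weight and the threshold are invented purely to show two weak signals summing past a decision boundary; they are not taken from any real model.

```python
# Invented feature weights for the "sea urchin" output node.
urchin_weights = {
    "stripe_texture": 0.9,  # the strong, correct association
    "oval_shape":     0.3,  # weak link from ordinary urchin photos
    "ring_texture":   0.2,  # weak link left by the one anomalous urchin
}
THRESHOLD = 0.45            # arbitrary decision boundary for illustration

def urchin_score(features):
    """Sum the weighted evidence that an image contains a sea urchin."""
    return sum(urchin_weights.get(f, 0.0) for f in features)

# A starfish (ring texture) next to a never-seen shell (oval shape):
scene = ["ring_texture", "star_outline", "oval_shape"]
score = urchin_score(scene)   # 0.2 + 0.3 = 0.5
print(score > THRESHOLD)      # True: a confident false positive
```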

00:05:05.170 --> 00:05:08.129
Exactly. It just hallucinates an entirely different

00:05:08.129 --> 00:05:10.529
animal because it's piecing together fragmented

00:05:10.529 --> 00:05:13.829
features like an oval here, a ring there. And

00:05:13.829 --> 00:05:15.970
this highlights the biggest trap in machine learning.

00:05:16.790 --> 00:05:19.470
Just because an algorithm memorized the patterns

00:05:19.470 --> 00:05:22.339
in the training data doesn't mean it actually

00:05:22.339 --> 00:05:25.180
grasps the underlying reality of what it's looking

00:05:25.180 --> 00:05:28.459
at. Most approaches that search through training

00:05:28.459 --> 00:05:31.480
data for empirical relationships tend to fall

00:05:31.480 --> 00:05:34.620
into this trap. It's called overfitting. The

00:05:34.620 --> 00:05:37.000
model identifies and exploits apparent relationships

00:05:37.000 --> 00:05:39.300
in the training data that simply do not hold

00:05:39.300 --> 00:05:41.910
up in the general real world. It's the classic

00:05:41.910 --> 00:05:44.269
case of an A-plus student who only knows how

00:05:44.269 --> 00:05:47.149
to pass the test but has no actual critical thinking

00:05:47.149 --> 00:05:49.649
skills. Yes, perfectly said. Which means we can't

00:05:49.649 --> 00:05:51.829
just unleash this thing into the wild after it

00:05:51.829 --> 00:05:54.029
studies the flashcards. We have to check its

00:05:54.029 --> 00:05:56.129
actual comprehension. We need a practice exam.

00:05:56.269 --> 00:05:59.230
Right, which introduces the second bucket, the

00:05:59.230 --> 00:06:01.829
validation data set. Some engineers call it the

00:06:01.829 --> 00:06:04.230
development set or the dev set. This provides

00:06:04.230 --> 00:06:06.930
an unbiased evaluation of the model while it

00:06:06.930 --> 00:06:10.370
is still actively being tuned. But here is the

00:06:10.370 --> 00:06:13.189
critical distinction. We are no longer tuning

00:06:13.189 --> 00:06:15.829
the parameters. In this phase, we are tuning

00:06:15.829 --> 00:06:18.449
the model's hyperparameters. Wait, wait. I have

00:06:18.449 --> 00:06:21.810
to push back here. Parameters. Hyperparameters.

00:06:22.410 --> 00:06:24.649
Why are we splitting hairs with the terminology?

00:06:25.269 --> 00:06:27.990
Why can't the engineers just use the massive

00:06:27.990 --> 00:06:30.610
pile of training data to figure out the entire

00:06:30.610 --> 00:06:32.990
architecture of the AI? Because they govern two

00:06:32.990 --> 00:06:35.620
completely different things. The parameters are

00:06:35.620 --> 00:06:38.199
the internal weights. That's the knowledge the

00:06:38.199 --> 00:06:40.399
model learns on its own from the training data,

00:06:40.480 --> 00:06:42.639
like associating the star shape with the starfish.

00:06:42.759 --> 00:06:44.920
Okay, and the hyperparameters? Hyperparameters

00:06:44.920 --> 00:06:47.060
are the architectural choices set by the human

00:06:47.060 --> 00:06:49.680
engineers before the learning even begins. Oh,

00:06:49.699 --> 00:06:51.439
okay. Give me an example of an architectural

00:06:51.439 --> 00:06:53.959
choice. Well, in an artificial neural network,

00:06:54.300 --> 00:06:56.779
a hyperparameter would be the actual number of

00:06:56.779 --> 00:06:59.019
hidden layers in the network, or the mathematical

00:06:59.019 --> 00:07:01.519
width of those layers. It's the physical capacity

00:07:01.519 --> 00:07:03.720
of the brain. Got it. If you use the training

00:07:03.720 --> 00:07:06.259
data to decide on the architecture of the brain,

00:07:06.839 --> 00:07:09.399
you guarantee overfitting. You'll just build

00:07:09.399 --> 00:07:13.259
a brain perfectly, rigidly customized to memorize

00:07:13.259 --> 00:07:17.040
those specific flashcards. So if parameters are

00:07:17.040 --> 00:07:19.899
the knowledge the student learns, hyperparameters

00:07:19.899 --> 00:07:22.360
are the physical structure of the student's brain

00:07:22.360 --> 00:07:25.579
itself. We are tweaking the brain's anatomy to

00:07:25.579 --> 00:07:27.639
make sure it's capable of general understanding.
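
In code, the split between the two is visible in where each value comes from. Here is a sketch using scikit-learn's MLPClassifier; the layer sizes and learning rate are arbitrary example values.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hyperparameters: architectural choices fixed by a human before learning.
model = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # number and width of hidden layers
    learning_rate_init=0.001,     # how aggressively weights get tweaked
    max_iter=1000,
    random_state=0,
)

# Parameters: the weights the model learns on its own during fit().
model.fit(X, y)
print([w.shape for w in model.coefs_])  # the learned weight matrices
```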

00:07:27.800 --> 00:07:29.939
Exactly. And if we connect this to the bigger

00:07:29.939 --> 00:07:33.189
picture of overfitting, the validation set is

00:07:33.189 --> 00:07:35.949
how we measure if that brain structure is actually

00:07:35.949 --> 00:07:39.209
working. Yeah. So imagine grading two different

00:07:39.209 --> 00:07:41.389
AI models during this phase. OK, let's call them

00:07:41.389 --> 00:07:44.629
model A and model B. Sure. Model A draws a perfectly

00:07:44.629 --> 00:07:47.550
squiggly mathematical line that hits every single

00:07:47.550 --> 00:07:49.540
data point in the training set. It gets a near

00:07:49.540 --> 00:07:52.379
perfect score on the practice test. Model A crushed

00:07:52.379 --> 00:07:54.879
the flashcards. Crushed them. Now, Model B draws

00:07:54.879 --> 00:07:57.100
a smoother, more general line that misses a few

00:07:57.100 --> 00:07:59.319
points, so it only gets a C. If you only looked

00:07:59.319 --> 00:08:01.060
at the training data, you'd think Model A is

00:08:01.060 --> 00:08:03.660
a genius and Model B is mediocre. But then you

00:08:03.660 --> 00:08:07.199
give them both a pop quiz on brand new unseen

00:08:07.199 --> 00:08:10.480
material from the validation set. Model A, which

00:08:10.480 --> 00:08:12.839
memorized the exact placement of every single

00:08:12.839 --> 00:08:15.420
practice point, completely panics. Its error

00:08:15.420 --> 00:08:18.879
rate explodes. It fails the pop quiz miserably.
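
To put rough numbers on that collapse, here is a minimal sketch assuming a noisy linear trend: a degree-9 polynomial stands in for Model A's squiggle, and a straight line stands in for the smoother Model B discussed next. The exact errors will vary; the gap between them is the point.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 12)
y_train = 2 * x_train + rng.normal(0, 0.2, 12)   # the practice test
x_val = np.linspace(0.01, 0.99, 50)
y_val = 2 * x_val + rng.normal(0, 0.2, 50)       # the pop quiz

# Model A: degree-9 squiggle threads nearly every training point.
# Model B: degree-1 line captures only the general trend.
for name, degree in [("Model A", 9), ("Model B", 1)]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(name, f"train={train_err:.4f}", f"val={val_err:.4f}")
```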

00:08:18.980 --> 00:08:22.079
Oh, wow. Meanwhile, Model B, which only got a

00:08:22.079 --> 00:08:24.660
C on the practice test, gets almost the exact

00:08:24.660 --> 00:08:27.620
same C on the pop quiz. Because Model B learned

00:08:27.620 --> 00:08:30.779
the general trend of the data, rather than hyper-

00:08:30.779 --> 00:08:33.279
fixating on the specific noise of the practice

00:08:33.279 --> 00:08:36.620
rounds, Model B is resilient. And resiliency

00:08:36.620 --> 00:08:39.580
is the goal. The validation set helps engineers

00:08:39.580 --> 00:08:42.259
spot Model A before it's too late. They use a

00:08:42.259 --> 00:08:44.200
technique called early stopping. Early stopping.

00:08:44.460 --> 00:08:46.139
Yeah, they evaluate the candidate models through

00:08:46.139 --> 00:08:48.340
successive iterations. The very moment the error

00:08:48.340 --> 00:08:50.320
rate on the validation data set starts growing,

00:08:50.779 --> 00:08:52.580
the moment the AI starts panicking on the pop

00:08:52.580 --> 00:08:54.980
quiz, you halt the training immediately. That

00:08:54.980 --> 00:08:57.480
makes sense. That growth is the first mathematical

00:08:57.480 --> 00:08:59.539
proof that the model has stopped learning general

00:08:59.539 --> 00:09:02.019
rules and has started memorizing specific training

00:09:02.019 --> 00:09:04.460
data. So you stop the clock, and you revert the

00:09:04.460 --> 00:09:06.659
AI's brain back to the previous iteration where

00:09:06.659 --> 00:09:09.299
it was still generalizing well. Exactly.
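
Here is a generic sketch of that halt-and-revert loop; the function names, the patience of five rounds, and the epoch cap are all invented for illustration, not any particular library's API.

```python
import numpy as np

def train_with_early_stopping(step_fn, val_error_fn, w, patience=5):
    """Stop once validation error keeps growing, then hand back
    the best weights seen so far (the last generalizing 'brain')."""
    best_w, best_err, bad_rounds = w.copy(), float("inf"), 0
    for epoch in range(1000):
        w = step_fn(w)           # one round of training-set tweaks
        err = val_error_fn(w)    # grade this iteration on the pop quiz
        if err < best_err:
            best_w, best_err, bad_rounds = w.copy(), err, 0
        else:
            bad_rounds += 1      # the quiz score got worse
            if bad_rounds >= patience:
                break            # memorization has set in: halt
    return best_w                # revert to the previous good iteration
```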

00:09:09.360 --> 00:09:11.960
OK, that makes perfect sense. But looking at the

00:09:11.960 --> 00:09:15.460
logic here, if we are using the validation set

00:09:15.460 --> 00:09:17.440
to make all these crucial decisions about the

00:09:17.440 --> 00:09:19.600
hyperparameters, and we're using it to decide

00:09:19.600 --> 00:09:23.220
when to stop training, isn't that data now fundamentally

00:09:23.220 --> 00:09:27.070
compromised? How do you mean? Well, the AI has

00:09:27.070 --> 00:09:29.730
essentially seen the practice exam. The human

00:09:29.730 --> 00:09:31.730
engineers changed the architecture of its brain

00:09:31.730 --> 00:09:34.110
based specifically on how it performed on the

00:09:34.110 --> 00:09:36.169
validation set. It's no longer a blind test.

00:09:36.289 --> 00:09:38.769
That is an incredibly important point, and it's

00:09:38.769 --> 00:09:41.029
the exact reason why a third bucket is required.

00:09:41.149 --> 00:09:43.809
Ah, there we go. The validation data set is a

00:09:43.809 --> 00:09:46.730
hybrid. It is training data used for testing,

00:09:47.309 --> 00:09:49.149
but neither as part of the low-level training

00:09:49.149 --> 00:09:52.440
nor as part of the final evaluation. Because

00:09:52.440 --> 00:09:55.480
we used it to tune the hyperparameters, the model's

00:09:55.480 --> 00:09:57.480
performance on the validation set is inherently

00:09:57.480 --> 00:09:59.659
biased. Which means, despite all this rigorous

00:09:59.659 --> 00:10:01.679
tuning, we still don't actually know for sure

00:10:01.679 --> 00:10:04.460
if it's ready to safely drive a car or diagnose

00:10:04.460 --> 00:10:06.679
a disease in the real world. Which brings us

00:10:06.679 --> 00:10:10.220
to the final, absolute measure of an algorithm's

00:10:10.220 --> 00:10:13.039
capability, the test dataset. Here's where it

00:10:13.039 --> 00:10:15.830
gets really interesting. Because if the training

00:10:15.830 --> 00:10:18.090
set is you sitting at home studying the driver's

00:10:18.090 --> 00:10:20.730
manual and the validation set is your driving

00:10:20.730 --> 00:10:23.029
instructor taking you on practice routes to correct

00:10:23.029 --> 00:10:26.789
your bad habits, the test set is the final DMV

00:10:26.789 --> 00:10:29.769
exam. It is a completely blind, brand-new route

00:10:29.769 --> 00:10:32.230
with a terrifying instructor with a clipboard

00:10:32.230 --> 00:10:34.269
to prove you can actually operate the vehicle

00:10:34.269 --> 00:10:37.049
under pressure. And just like a real DMV exam,

00:10:37.590 --> 00:10:40.340
you cannot use the test route for practice. The

00:10:40.340 --> 00:10:42.740
test dataset, which is often called the holdout

00:10:42.740 --> 00:10:45.820
dataset, must be entirely independent of the

00:10:45.820 --> 00:10:48.139
training and validation data. Completely walled

00:10:48.139 --> 00:10:51.240
off. Yes. It is used exclusively to assess the

00:10:51.240 --> 00:10:54.120
performance of the fully built, fully tuned classifier

00:10:54.120 --> 00:10:57.299
on unseen data. So it is strictly off limits

00:10:57.299 --> 00:10:59.340
until the very final moment. You don't touch

00:10:59.340 --> 00:11:01.279
it, you don't peek at it, you don't tweak any

00:11:01.279 --> 00:11:03.220
layers based on it. Standard machine learning

00:11:03.220 --> 00:11:05.919
practice demands it. If you use the test data

00:11:05.919 --> 00:11:08.059
set for validating the training model or fine-

00:11:08.059 --> 00:11:10.259
tuning the hyperparameters, you destroy its

00:11:10.259 --> 00:11:13.360
value. It provides the only accurate, honest

00:11:13.360 --> 00:11:16.100
evaluation of how the model will perform tomorrow,

00:11:16.320 --> 00:11:18.980
next week, or next year in the real world. But

00:11:18.980 --> 00:11:21.039
let's look at the reality of data science for

00:11:21.039 --> 00:11:25.070
a second. Data isn't infinite. Gathering high-

00:11:25.070 --> 00:11:28.230
quality labeled data like millions of perfectly

00:11:28.230 --> 00:11:30.990
categorized images of street signs or medical

00:11:30.990 --> 00:11:34.029
scans is incredibly expensive and time-consuming.

00:11:34.389 --> 00:11:36.509
Oh absolutely. So what happens when researchers

00:11:36.509 --> 00:11:39.029
just don't have enough data to cleanly divide

00:11:39.029 --> 00:11:42.409
into three massive pristine buckets? It is a

00:11:42.409 --> 00:11:44.970
massive hurdle, especially in specialized fields

00:11:44.970 --> 00:11:47.129
like rare disease detection where you might only

00:11:47.129 --> 00:11:49.789
have a few hundred scans total. Right. If you

00:11:49.789 --> 00:11:52.149
have a low number of samples and you simply chop

00:11:52.149 --> 00:11:54.860
it in half for training and validation, the model

00:11:54.860 --> 00:11:56.899
won't have enough examples to learn from. It

00:11:56.899 --> 00:11:59.360
will inevitably overfit. So how do they fix it?

00:11:59.519 --> 00:12:02.720
To solve this, scientists rely on advanced statistical

00:12:02.720 --> 00:12:05.899
mechanics, primarily cross-validation and bootstrapping.

00:12:06.100 --> 00:12:07.620
Let's break those down because when you look

00:12:07.620 --> 00:12:09.539
at the mechanics, they sound like really intense

00:12:09.539 --> 00:12:12.460
mathematical workarounds. Let's start with bootstrapping.

00:12:12.559 --> 00:12:14.980
That involves random sampling with replacement,

00:12:15.259 --> 00:12:17.840
right? Yes. Bootstrapping generates numerous

00:12:17.840 --> 00:12:20.480
simulated data sets. It makes these new sets

00:12:20.480 --> 00:12:22.860
the exact same size as your original limited

00:12:22.860 --> 00:12:25.700
data pool. But it does this by randomly picking

00:12:25.700 --> 00:12:28.879
a data point, copying it into the new set, and

00:12:28.879 --> 00:12:30.720
then putting the original point back into

00:12:30.720 --> 00:12:33.399
the pool to potentially be picked again. Wait,

00:12:33.419 --> 00:12:35.960
I need to visualize this. If I have a deck of

00:12:35.960 --> 00:12:39.379
10 flash cards, I draw one at random. Let's say

00:12:39.379 --> 00:12:41.940
it's card number three. Okay. I write it down,

00:12:42.139 --> 00:12:44.299
and I put card number three back in the deck.

00:12:44.490 --> 00:12:47.350
So in my new simulated deck of ten cards, I might

00:12:47.350 --> 00:12:49.610
draw card number three four different times,

00:12:49.850 --> 00:12:51.870
and I might never draw card number seven at all.
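
That ten-card thought experiment is a couple of lines of NumPy; the seed is arbitrary, so the exact repeats and omissions will differ run to run.

```python
import numpy as np

rng = np.random.default_rng(3)
deck = np.arange(1, 11)                   # ten flashcards, numbered 1-10

# One bootstrap "simulated deck": ten draws with replacement.
simulated = rng.choice(deck, size=10, replace=True)
left_out = np.setdiff1d(deck, simulated)  # cards never drawn this round

print(simulated)  # some cards appear several times
print(left_out)   # the pristine cards, usable as a pop quiz
```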

00:12:52.029 --> 00:12:54.549
That is exactly how it works. But how does recycling

00:12:54.549 --> 00:12:58.029
the exact same limited data stop the AI from

00:12:58.029 --> 00:13:00.590
memorizing it? Aren't we just hammering the same

00:13:00.590 --> 00:13:03.129
flashcards into its brain? It sounds counterintuitive,

00:13:03.230 --> 00:13:05.389
but think about the cards you didn't draw. Because

00:13:05.389 --> 00:13:08.029
you sampled with replacement, mathematically,

00:13:08.590 --> 00:13:10.509
about a third of your original cards will be

00:13:10.509 --> 00:13:13.370
left out of any given simulated deck. Oh. Right.
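
A quick back-of-the-envelope check on that "about a third": with n cards drawn n times with replacement, the chance a given card is never picked is (1 - 1/n)^n, which tends to e^(-1), about 0.368, as n grows. So roughly 36.8 percent of the original cards sit out of any one simulated deck.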

00:13:13.690 --> 00:13:16.090
Those left-out cards are the secret weapon. They

00:13:16.090 --> 00:13:19.549
become a pristine, completely unseen validation

00:13:19.549 --> 00:13:21.970
set for that specific round of training. That

00:13:21.970 --> 00:13:24.990
is br— You train the model on the weird, repetitive

00:13:24.990 --> 00:13:27.669
deck, and then you test it on the cards it happened

00:13:27.669 --> 00:13:30.730
to miss. You do this hundreds of times, creating

00:13:30.730 --> 00:13:32.669
different weird decks and different pristine

00:13:32.669 --> 00:13:35.730
test sets and average the results. So you are

00:13:35.730 --> 00:13:38.190
artificially creating blind spots in the training

00:13:38.190 --> 00:13:40.950
so you can use those blind spots as pop quizzes

00:13:40.950 --> 00:13:44.090
later. Exactly. And then cross-validation is

00:13:44.090 --> 00:13:46.470
another powerful method that operates on a similar

00:13:46.470 --> 00:13:49.629
philosophy of squeezing every drop of value out

00:13:49.629 --> 00:13:51.799
of limited data. OK, how does that one work?

00:13:52.120 --> 00:13:54.080
Instead of random replacement, it splits the

00:13:54.080 --> 00:13:56.759
small data set into multiple rigid fractions,

00:13:57.100 --> 00:14:00.059
or folds. Let's say five folds. OK, so five equal

00:14:00.059 --> 00:14:02.360
piles of data. Right. You hold out pile number

00:14:02.360 --> 00:14:04.720
one as your test data. You train the model on

00:14:04.720 --> 00:14:07.320
piles two, three, four, and five. You get a result.

00:14:07.340 --> 00:14:09.659
Got it. Then you rotate. You hold out pile number

00:14:09.659 --> 00:14:12.320
two as the test data and train on one, three,

00:14:12.399 --> 00:14:15.000
four, and five. You rotate through every single

00:14:15.000 --> 00:14:17.139
pile until all the folds have been cross-validated.
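
The whole rotation is one call in scikit-learn. A minimal sketch, with an invented toy dataset and logistic regression standing in for the model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 8))             # a small, precious dataset
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # a toy learnable rule

# Five folds: every pile takes one turn as the held-out quiz.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)         # one score per rotation
print(scores.mean())  # the averaged estimate of real-world performance
```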

00:14:17.419 --> 00:14:20.100
So the exam is constantly shifting. Every single

00:14:20.100 --> 00:14:22.200
piece of data eventually gets its moment to be

00:14:22.200 --> 00:14:24.759
the blind pop quiz, and every piece of data gets

00:14:24.759 --> 00:14:27.259
to be part of the study guide. And you average

00:14:27.259 --> 00:14:30.279
out all those shifting results to estimate the

00:14:30.279 --> 00:14:33.299
final performance. It is highly effective at

00:14:33.299 --> 00:14:35.820
reducing bias and variability in small datasets.

00:14:36.679 --> 00:14:39.519
Researchers strongly advise against using a single

00:14:39.519 --> 00:14:42.460
static split on small datasets because it almost

00:14:42.460 --> 00:14:45.169
guarantees a biased performance estimate. OK,

00:14:45.250 --> 00:14:47.990
so the mechanics here are incredibly rigorous.

00:14:48.370 --> 00:14:51.049
The math behind bootstrapping and cross-validation

00:14:51.049 --> 00:14:54.549
is really precise. We have clear boundaries for

00:14:54.549 --> 00:14:57.509
training, validation, and testing. Yes, theoretically.

00:14:57.690 --> 00:15:00.169
Right. But looking at the actual landscape of

00:15:00.169 --> 00:15:03.409
AI development right now, there is a glaring,

00:15:03.549 --> 00:15:06.350
almost comical problem. Despite all this mathematical

00:15:06.350 --> 00:15:08.509
precision, the human beings running these systems

00:15:08.509 --> 00:15:11.509
are completely mixing up the words. It is a widely

00:15:11.509 --> 00:15:14.289
documented issue. There is a blatant terminological

00:15:14.289 --> 00:15:16.610
confusion that pervades artificial intelligence

00:15:16.610 --> 00:15:19.350
research today, both in academia and in the corporate

00:15:19.350 --> 00:15:21.350
sector. They are constantly swapping the definitions

00:15:21.350 --> 00:15:24.129
of test set and validation set. Constantly. A

00:15:24.129 --> 00:15:26.210
team of researchers might consider their internal

00:15:26.210 --> 00:15:28.690
process of tuning hyperparameters to be testing

00:15:28.690 --> 00:15:31.090
different models. So they label their development

00:15:31.090 --> 00:15:34.230
set, the test set. Then they argue that the final

00:15:34.230 --> 00:15:36.710
completed model needs to be validated before

00:15:36.710 --> 00:15:38.950
it's sold to the public. So they call the final

00:15:38.950 --> 00:15:42.480
holdout data the validation set. It is the exact

00:15:42.480 --> 00:15:44.779
reverse of the standard definitions we just spent

00:15:44.779 --> 00:15:47.600
the last 10 minutes outlining. Yep. And it causes

00:15:47.600 --> 00:15:49.860
massive friction when teams try to replicate

00:15:49.860 --> 00:15:52.340
each other's work or verify the safety of an

00:15:52.340 --> 00:15:54.620
algorithm. Which raises an alarming thought.

00:15:55.039 --> 00:15:57.360
If the top scientists and engineers building

00:15:57.360 --> 00:15:59.820
these multi-billion-dollar systems are getting

00:15:59.820 --> 00:16:03.480
the foundational terminology confused, what happens

00:16:03.480 --> 00:16:06.159
when these algorithms actually make it into the

00:16:06.159 --> 00:16:08.840
real world with deeply flawed training or testing

00:16:08.840 --> 00:16:11.539
procedures? Well, the real world results range

00:16:11.539 --> 00:16:15.669
from absurd to genuinely dangerous. When an algorithm

00:16:15.669 --> 00:16:18.429
fails in the wild, it is almost always due to

00:16:18.429 --> 00:16:20.870
omissions or irrelevant inputs during those initial

00:16:20.870 --> 00:16:23.610
three-bucket phases. We fail to give the AI

00:16:23.610 --> 00:16:26.409
the proper context of the complex world it's

00:16:26.409 --> 00:16:29.210
about to enter. Because an AI can't ask for context.

00:16:29.409 --> 00:16:31.090
It doesn't raise its hand and say, hey, the lighting

00:16:31.090 --> 00:16:33.169
in this room is weird. I've never seen this before.

00:16:33.470 --> 00:16:35.289
It just makes a mathematically backed guess

00:16:35.289 --> 00:16:37.669
based on whatever limited flashcards it studied.

00:16:37.970 --> 00:16:40.149
Let's look at the danger of irrelevant input.

00:16:40.460 --> 00:16:44.059
Imagine an engineering team building an AI for

00:16:44.059 --> 00:16:47.539
agricultural drone object detection. They want

00:16:47.539 --> 00:16:50.399
the drone to fly over a farm and count the sheep.

00:16:50.500 --> 00:16:53.019
Seems simple enough. They feed the training data

00:16:53.019 --> 00:16:55.659
set thousands of pristine pictures of sheep,

00:16:56.240 --> 00:16:59.480
but every single picture features sheep standing

00:16:59.480 --> 00:17:02.960
on vibrant green grassy fields. So the AI isn't

00:17:02.960 --> 00:17:05.339
just looking at the white woolly animal, it's

00:17:05.339 --> 00:17:08.119
absorbing every single pixel in those images.

00:17:08.240 --> 00:17:10.839
It processes the entire environment. So during

00:17:10.839 --> 00:17:13.279
training, the AI might unconsciously start using

00:17:13.279 --> 00:17:15.480
the background rather than the object of interest

00:17:15.480 --> 00:17:17.920
to make its decisions. It learns to associate

00:17:17.920 --> 00:17:20.559
a large patch of green pixels with the label

00:17:20.559 --> 00:17:22.759
sheep. Because in its limited universe green

00:17:22.759 --> 00:17:25.440
grass always equals sheep. And the devastating

00:17:25.440 --> 00:17:27.339
risk is that when you launch that drone in the

00:17:27.339 --> 00:17:29.980
real world and it flies over a grassy field that

00:17:29.980 --> 00:17:31.819
happens to have a large gray rock or a stray

00:17:31.819 --> 00:17:34.220
dog, the algorithm will confidently interpret

00:17:34.220 --> 00:17:36.880
those objects as sheep. It overfit to the irrelevant

00:17:36.880 --> 00:17:39.480
environmental input of the grass. It fundamentally

00:17:39.480 --> 00:17:42.059
misunderstood the assignment because the humans

00:17:42.059 --> 00:17:44.240
didn't provide enough variation in the test set.
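
You can reproduce the grass trap with two invented features. In this contrived sketch, "greenness" separates the training classes perfectly while "woolliness" overlaps a little, so a one-split decision tree latches onto the background; every number here is made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
n = 200
# Features per photo: [background_greenness, animal_woolliness].
sheep = np.column_stack([rng.uniform(0.8, 1.0, n),   # always green grass
                         rng.uniform(0.4, 1.0, n)])  # woolly, with noise
empty = np.column_stack([rng.uniform(0.0, 0.3, n),   # dirt, rock, water
                         rng.uniform(0.0, 0.6, n)])  # overlaps sheep range
X = np.vstack([sheep, empty])
y = np.array([1] * n + [0] * n)           # 1 = sheep, 0 = no sheep

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(clf.tree_.feature[0])      # 0: the single split uses greenness

# Deployment: a gray rock (no wool) sitting on vivid green grass.
print(clf.predict([[0.95, 0.05]]))  # [1]: confidently "sheep"
```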

00:17:44.640 --> 00:17:47.210
But it's not just agricultural drones. Let's

00:17:47.210 --> 00:17:49.609
go back to that iPhone X example we started with.

00:17:50.049 --> 00:17:52.349
The 10-year-old boy unlocking his mother's

00:17:52.349 --> 00:17:55.430
phone. That is a failure of omitting particular

00:17:55.430 --> 00:17:57.490
circumstances. It perfectly illustrates what

00:17:57.490 --> 00:18:00.470
happens when a test set doesn't adequately represent

00:18:00.470 --> 00:18:03.670
the chaos of the real world. Facial recognition

00:18:03.670 --> 00:18:06.210
is supposed to map the exact geometric contours

00:18:06.210 --> 00:18:09.390
of your face, but it relies heavily on environmental

00:18:09.390 --> 00:18:11.769
lighting to calculate those shadows and depths.

00:18:12.309 --> 00:18:14.369
So what exactly went wrong with the mother's

00:18:14.369 --> 00:18:16.440
phone? When the mother first purchased the phone

00:18:16.440 --> 00:18:18.759
and registered her face into the system, she

00:18:18.759 --> 00:18:21.440
did it under indoor nighttime lighting. Oh, I

00:18:21.440 --> 00:18:24.400
see. That specific artificial environmental condition

00:18:24.400 --> 00:18:27.279
became the baseline the system learned. The algorithm

00:18:27.279 --> 00:18:29.539
was not appropriately trained or validated with

00:18:29.539 --> 00:18:32.279
enough variation in lighting angles or time of

00:18:32.279 --> 00:18:35.240
day to firmly differentiate the geometry of her

00:18:35.240 --> 00:18:37.980
face in that specific shadow from the geometrically

00:18:37.980 --> 00:18:40.779
similar face of her 10-year-old son. Because

00:18:40.779 --> 00:18:42.900
the training data set omitted the circumstance

00:18:42.900 --> 00:18:45.579
of daylight or fluorescent light, the AI fell

00:18:45.579 --> 00:18:49.299
back on a weak mathematical association. It saw

00:18:49.299 --> 00:18:52.039
similar cheekbones in the dark, panicked, and

00:18:52.039 --> 00:18:54.240
granted the kid access to her bank accounts.

00:18:54.900 --> 00:18:57.119
It's a failure to include relevant environmental

00:18:57.119 --> 00:18:59.779
conditions. There's a famous fictional comic

00:18:59.779 --> 00:19:01.579
strip often referenced in data science circles

00:19:01.579 --> 00:19:04.359
where a computer makes a cup of coffee

00:19:04.359 --> 00:19:07.500
five million degrees hot. Five million degrees?

00:19:07.619 --> 00:19:10.619
Why? Because the system was given an unquestioned

00:19:10.619 --> 00:19:13.519
literal definition of the user requesting it

00:19:13.519 --> 00:19:16.579
extra hot. It's a failure in logic born from

00:19:16.579 --> 00:19:19.599
a total lack of environmental context. So what

00:19:19.599 --> 00:19:21.059
does this all mean for you listening at home?

00:19:21.230 --> 00:19:23.430
Think about how often you trust algorithms in

00:19:23.430 --> 00:19:25.589
your daily life right now. The recommendation

00:19:25.589 --> 00:19:27.950
algorithm feeding your news feed, the facial recognition

00:19:27.950 --> 00:19:30.509
guarding your digital life, the navigation system

00:19:30.509 --> 00:19:32.869
steering a two-ton vehicle at 60 miles an hour.

00:19:33.529 --> 00:19:36.450
Every single AI mistake, whether it's a funny

00:19:36.450 --> 00:19:39.549
glitch on a website or a self-driving car misinterpreting

00:19:39.549 --> 00:19:42.309
a shadow as a pedestrian, ultimately comes down

00:19:42.309 --> 00:19:45.549
to what was put into or left out of these three

00:19:45.549 --> 00:19:47.890
data sets. The output is only ever as good as

00:19:47.890 --> 00:19:50.710
the data. And the data is only as good as the

00:19:50.710 --> 00:19:53.470
strict methodology used to separate it into training,

00:19:53.930 --> 00:19:57.210
validation, and testing. So to recap our journey

00:19:57.210 --> 00:20:00.289
today, we build the raw foundation of an AI's

00:20:00.289 --> 00:20:02.589
knowledge with the training set, where it slowly

00:20:02.589 --> 00:20:05.430
tweaks its internal parameters. We refine the

00:20:05.430 --> 00:20:07.430
actual architecture of its brain and stop it

00:20:07.430 --> 00:20:10.450
from overfitting using the validation set. And

00:20:10.450 --> 00:20:12.750
finally, we prove it can actually survive the

00:20:12.750 --> 00:20:15.390
chaos of reality using the strictly off-limits

00:20:15.390 --> 00:20:17.910
test set. And, you know, if you take away anything

00:20:17.910 --> 00:20:20.269
from the complex mechanics of how machines learn,

00:20:20.809 --> 00:20:22.890
let it be a broader lesson about learning itself.

00:20:23.589 --> 00:20:26.049
The data clearly warns us about the dangers of

00:20:26.049 --> 00:20:28.970
algorithms failing due to obsolete data and their

00:20:28.970 --> 00:20:31.410
rigid inability to adapt to new environments.

00:20:31.910 --> 00:20:34.049
That is a really fascinating way to look at it.

00:20:34.210 --> 00:20:36.519
It leaves us with a deep recursion. If the most

00:20:36.519 --> 00:20:38.660
advanced artificial intelligence systems in the

00:20:38.660 --> 00:20:41.259
world fail spectacularly simply because their

00:20:41.259 --> 00:20:43.559
initial training sets didn't include enough variations

00:20:43.559 --> 00:20:46.599
of a changing complex environment, how often

00:20:46.599 --> 00:20:48.799
do we fail as humans because we're still operating

00:20:48.799 --> 00:20:51.119
on obsolete training data from our own childhoods?

00:20:51.240 --> 00:20:54.240
Are we actively updating our own internal test

00:20:54.240 --> 00:20:57.079
sets as we encounter new people, new ideas, and

00:20:57.079 --> 00:20:59.660
new environments? Or are we just like a flawed

00:20:59.660 --> 00:21:03.019
algorithm, overfitting to our past, stubbornly

00:21:03.019 --> 00:21:05.519
relying on the same old flashcards to navigate

00:21:05.519 --> 00:21:07.839
a world that has completely changed? We are all

00:21:07.839 --> 00:21:10.819
just trying not to blindly memorize the flashcards

00:21:10.819 --> 00:21:12.980
of our youth and actually learn how to navigate

00:21:12.980 --> 00:21:15.079
the road ahead. We'll leave you to think on that.

00:21:15.140 --> 00:21:15.839
Keep digging deeper.
