WEBVTT

00:00:00.000 --> 00:00:02.620
So if you want to understand the neural networks

00:00:02.620 --> 00:00:05.280
that are driving your car right now. Or even

00:00:05.280 --> 00:00:07.459
just the algorithm seamlessly sorting through

00:00:07.459 --> 00:00:09.199
all those thousands of photos on your phone.

00:00:09.359 --> 00:00:11.619
Yeah, exactly. If you want to understand any

00:00:11.619 --> 00:00:14.419
of that, you actually have to look at this staggeringly

00:00:14.419 --> 00:00:19.750
ugly 32 by 32 pixel blur of a frog from 2009.

00:00:20.030 --> 00:00:23.210
I know it sounds ridiculous, but it is the ultimate

00:00:23.210 --> 00:00:26.230
paradox of modern machine vision. Right. That

00:00:26.230 --> 00:00:29.670
tiny, like almost unrecognizable block of pixels

00:00:29.670 --> 00:00:33.189
is the exact foundational building block that

00:00:33.189 --> 00:00:35.590
taught the modern world how to see. OK, let's

00:00:35.590 --> 00:00:38.109
unpack this because today for our deep dive,

00:00:38.130 --> 00:00:40.950
we are pulling from a remarkably dense Wikipedia

00:00:40.950 --> 00:00:44.070
article detailing something called CIFAR-10.

00:00:45.670 --> 00:00:48.189
CIFAR-10. Right. Which stands for the Canadian

00:00:48.189 --> 00:00:50.329
Institute for Advanced Research. Yeah. And our

00:00:50.329 --> 00:00:53.049
mission today is really to demystify this obscure

00:00:53.049 --> 00:00:55.429
robotic-sounding term. We want to bring you into

00:00:55.429 --> 00:00:57.450
the room with us to look at the physical reality

00:00:57.450 --> 00:00:59.890
of this data set. Because CIFAR-10 isn't just

00:00:59.890 --> 00:01:02.109
a collection of pictures, you know. It is the

00:01:02.109 --> 00:01:04.709
arena where the modern artificial intelligence

00:01:04.709 --> 00:01:07.170
industry basically figured out its own architecture.

00:01:07.590 --> 00:01:10.459
Wow. To really appreciate its impact, you have

00:01:10.459 --> 00:01:14.040
to remember that before 2009, computer vision

00:01:14.040 --> 00:01:17.620
benchmarking was, well, it was a chaotic Wild

00:01:17.620 --> 00:01:20.760
West. Researchers were testing algorithms on

00:01:20.760 --> 00:01:23.900
their own private bespoke data sets. Oh, I see.

00:01:24.060 --> 00:01:26.780
Yeah. Which makes peer review essentially impossible.

00:01:27.280 --> 00:01:29.459
You can't claim your new neural network is superior

00:01:29.459 --> 00:01:31.200
if you're testing it on completely different

00:01:31.200 --> 00:01:33.120
data than your competitors. Right. That makes

00:01:33.120 --> 00:01:35.200
total sense. You can't compare apples and oranges.

00:01:35.579 --> 00:01:38.400
Exactly. The industry desperately needed a universal

00:01:38.400 --> 00:01:41.079
yardstick. Which brings us to the actual structure

00:01:41.079 --> 00:01:44.939
of CIFAR-10. It is a collection of 60,000 color

00:01:44.939 --> 00:01:48.180
images. Yeah, 60,000. But the genius is in the

00:01:48.180 --> 00:01:51.329
taxonomy. Those images are divided evenly into

00:01:51.329 --> 00:01:54.209
exactly 10 mutually exclusive classes. And those

00:01:54.209 --> 00:01:56.290
are very specific, right? Very specific. You've

00:01:56.290 --> 00:01:59.349
got airplanes, cars, birds, cats, deer, dogs,

00:01:59.530 --> 00:02:03.129
frogs, horses, ships, and trucks. Exactly 6,000

00:02:03.129 --> 00:02:05.930
images per class. What's fascinating here is

00:02:05.930 --> 00:02:08.930
the sheer brute force required to establish that

00:02:08.930 --> 00:02:11.090
taxonomy. Yeah, it didn't just digitally manifest

00:02:11.090 --> 00:02:13.930
out of nowhere. No, not at all. The source notes,

00:02:13.990 --> 00:02:16.990
it is a specifically labeled subset of this massive

00:02:16.990 --> 00:02:21.050
2008 project called the 80 Million Tiny Images

00:02:21.050 --> 00:02:24.069
data set; CIFAR-10 itself was published in 2009. Right.

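NOTE
A minimal sketch of the structure just described, for anyone following along in
code: 60,000 images, 10 classes, exactly 6,000 per class. Assumes the torchvision
package is available; any CIFAR-10 loader would show the same counts.
  from collections import Counter
  from torchvision.datasets import CIFAR10
  # CIFAR-10 ships pre-split: 50,000 training images plus 10,000 test images.
  train = CIFAR10(root="./data", train=True, download=True)
  test = CIFAR10(root="./data", train=False, download=True)
  print(len(train) + len(test))   # 60000 tiny 32x32 color images in total
  print(train.classes)            # ['airplane', 'automobile', 'bird', ..., 'truck']
  counts = Counter(train.targets) + Counter(test.targets)
  print(set(counts.values()))     # {6000}: the split is perfectly even
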
00:02:24.270 --> 00:02:26.610
So researchers had this overwhelming ocean of

00:02:26.610 --> 00:02:30.310
80 million scraped images. And to carve out this

00:02:30.310 --> 00:02:33.729
neat, mathematically tidy pond of 60,000 images,

00:02:34.289 --> 00:02:36.729
they had to rely on human capital. Which is wild

00:02:36.729 --> 00:02:39.509
to think about. Wait, so before a supercomputer

00:02:39.509 --> 00:02:42.409
could see, actual college students had to sit

00:02:42.409 --> 00:02:44.969
in a room somewhere and manually tag thousands

00:02:44.969 --> 00:02:47.310
upon thousands of frogs and trucks. They did.

00:02:47.330 --> 00:02:49.650
They literally paid students to sit at monitors

00:02:49.650 --> 00:02:52.110
and manually filter through that raw data. It's

00:02:52.110 --> 00:02:54.310
like kindergarten flashcards for baby algorithms.

00:02:54.449 --> 00:02:56.530
That is a great way to put it. And it highlights

00:02:56.530 --> 00:02:58.889
this inescapable truth about early machine learning,

00:02:58.909 --> 00:03:01.530
you know. Which is? To eventually eliminate human

00:03:01.530 --> 00:03:03.990
intervention from visual processing, researchers

00:03:03.990 --> 00:03:06.729
first required an industrial-scale mobilization

00:03:06.729 --> 00:03:09.460
of human eyeballs. Wow. The model requires

00:03:09.460 --> 00:03:12.300
a definitive answer key. Algorithms learn by

00:03:12.300 --> 00:03:16.199
example. So establishing those 10 distinct classes

00:03:16.199 --> 00:03:19.680
digitized human perception to teach the computer.

00:03:20.139 --> 00:03:23.900
It provided the first truly rigid universal benchmark.

00:03:24.259 --> 00:03:26.800
But OK, let's look at the actual images those

00:03:26.800 --> 00:03:29.199
students were labeling, because this is where

00:03:29.199 --> 00:03:31.979
the methodology seems entirely counterintuitive

00:03:31.979 --> 00:03:34.919
to me. You mean the quality? Yes. The resolution

00:03:34.919 --> 00:03:38.080
is shockingly poor. Like we said, they're exactly

00:03:38.080 --> 00:03:41.599
32 by 32 pixels. Tiny. If you take one of these

00:03:41.599 --> 00:03:43.759
images of a horse and blow it up on a modern

00:03:43.759 --> 00:03:46.180
monitor, it just looks like a brown staircase

00:03:46.180 --> 00:03:49.159
next to a green blob. It really does. Why build

00:03:49.159 --> 00:03:51.520
a universal yardstick out of images so blurry

00:03:51.520 --> 00:03:53.699
that even a human struggles to identify them?

00:03:54.080 --> 00:03:56.379
It's like practicing scales on a cheap,

00:03:56.379 --> 00:03:58.439
out-of-tune piano before moving to the concert grand.

00:03:58.560 --> 00:04:00.979
But if it's so low-res, how does a 32 by 32

00:04:00.979 --> 00:04:03.599
pixel blurry shape actually help an algorithm

00:04:03.599 --> 00:04:05.719
recognize high-def objects in the real world?

00:04:06.080 --> 00:04:08.080
Right. Well, it comes down to the severe hardware

00:04:08.080 --> 00:04:10.639
limitations of the era. If you were a researcher

00:04:10.639 --> 00:04:13.900
in, say, 2010, training a model on a primitive

00:04:13.900 --> 00:04:17.160
GPU, feeding it a high definition image meant

00:04:17.160 --> 00:04:19.839
your machine would either completely crash. Oh,

00:04:19.839 --> 00:04:22.740
jeez. Yeah. Or you'd be waiting like three weeks

00:04:22.740 --> 00:04:25.360
just for a single training epoch to finish. Which

00:04:25.360 --> 00:04:27.540
completely destroys your ability to experiment.

00:04:27.759 --> 00:04:30.000
I mean, if the architecture is flawed, you just

00:04:30.000 --> 00:04:32.019
wasted a month of compute time to find that out.

00:04:32.139 --> 00:04:34.170
Exactly. So the low resolution was actually a

00:04:34.170 --> 00:04:36.730
strategic feature. It's not a bug. Oh. Because

00:04:36.730 --> 00:04:40.230
a 32 by 32 pixel image has a microscopic file

00:04:40.230 --> 00:04:43.810
size, a computer can chew through 60,000 of

00:04:43.810 --> 00:04:47.310
them incredibly fast. It entirely removes the

00:04:47.310 --> 00:04:49.209
computational bottleneck. So it's all about the

00:04:49.209 --> 00:04:52.110
speed of iteration. Yes. CIFAR-10 became the

00:04:52.110 --> 00:04:54.209
ultimate sandbox because it allowed researchers

00:04:54.209 --> 00:04:57.610
to run a model, watch it fail, tweak the hyperparameters,

00:04:57.850 --> 00:05:00.420
and run it again all before lunch. So they weren't

00:05:00.420 --> 00:05:02.579
trying to build the ultimate high -definition

00:05:02.579 --> 00:05:05.420
digital eye just yet? No, no. They were trying

00:05:05.420 --> 00:05:08.279
to efficiently test the underlying math of the

00:05:08.279 --> 00:05:10.279
network. Which explains why the source

00:05:10.279 --> 00:05:13.180
mentions CIFAR-10 is heavily utilized by DAWNBench.

00:05:13.319 --> 00:05:16.639
Right, DAWNBench. Yeah, for you listening, DAWNBench

00:05:16.639 --> 00:05:19.379
tracks benchmark data for teams competing to

00:05:19.379 --> 00:05:22.540
run neural networks basically faster and cheaper.

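NOTE
Back-of-envelope arithmetic for the file-size and iteration-speed point, in plain
Python. The CIFAR figures are from the episode; the 1080p comparison frame is an
illustrative assumption.
  cifar_bytes = 32 * 32 * 3           # 3,072 bytes per raw RGB image
  print(60_000 * cifar_bytes / 1e6)   # ~184 MB: the whole benchmark fits in RAM
  hd_bytes = 1920 * 1080 * 3          # one raw 1080p frame, for scale
  print(hd_bytes // cifar_bytes)      # ~2025x more pixels to chew through per image
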
00:05:22.860 --> 00:05:25.639
Because efficiency is the shadow metric of AI

00:05:25.639 --> 00:05:28.319
development. DAWNBench isn't just a leaderboard

00:05:28.319 --> 00:05:30.879
for the most accurate model, it's a leaderboard

00:05:30.879 --> 00:05:33.819
for commercial viability. Right, because an algorithm

00:05:33.819 --> 00:05:37.800
that achieves perfect accuracy but costs, like...

00:05:37.720 --> 00:05:40.379
a million dollars in cloud computing resources

00:05:40.379 --> 00:05:43.620
to train is totally useless in the real world.

00:05:43.959 --> 00:05:46.379
Completely useless. Yeah. So CIFAR-10 proved

00:05:46.379 --> 00:05:48.019
you could optimize the network's architecture

00:05:48.019 --> 00:05:50.199
cheaply before scaling it up. And the architecture

00:05:50.199 --> 00:05:52.439
that dominated this data set, according to the

00:05:52.439 --> 00:05:54.740
source, was the Convolutional Neural Network,

00:05:54.959 --> 00:05:57.879
or CNN. The undisputed champion for a long time.

00:05:57.920 --> 00:06:00.259
Yeah, but the progression wasn't overnight. The

00:06:00.259 --> 00:06:02.339
Wikipedia article provides this incredible timeline.

00:06:02.720 --> 00:06:05.339
It's basically a decade-long scoreboard tracking

00:06:05.339 --> 00:06:08.319
the race to a 0% error rate. It serves as a

00:06:08.319 --> 00:06:10.540
literal fossil record for the evolution of deep

00:06:10.540 --> 00:06:12.639
learning. It really does. Let's trace this dramatic

00:06:12.639 --> 00:06:15.980
drop. The timeline kicks off in August 2010 with

00:06:15.980 --> 00:06:18.339
a paper on convolutional deep belief networks.

00:06:18.560 --> 00:06:20.439
OK. And what was their score? Their error rate

00:06:20.439 --> 00:06:24.240
was 21.1%. Right. Meaning roughly one out of

00:06:24.240 --> 00:06:27.459
every five times, the model looked at a 32-pixel

00:06:27.459 --> 00:06:30.139
truck and confidently classified it as a frog.

00:06:30.259 --> 00:06:32.600
Which sounds abysmal by today's standards. It

00:06:32.600 --> 00:06:35.620
does. But it was a vital proof of concept. It

00:06:35.620 --> 00:06:38.160
demonstrated that deep hierarchical layers could

00:06:38.160 --> 00:06:41.019
extract features from raw pixel data without

00:06:41.019 --> 00:06:43.459
researchers manually coding those features in

00:06:43.459 --> 00:06:46.199
advance. From there, the architectural leaps

00:06:46.199 --> 00:06:49.100
happened pretty rapidly. Like, by February 2013,

00:06:48.970 --> 00:06:52.750
a paper utilizing maxout networks drops the

00:06:52.750 --> 00:06:56.430
error rate to 9.38%. Oh, maxout. That was

00:06:56.430 --> 00:06:58.430
a brilliant piece of mathematical engineering.

00:06:58.610 --> 00:07:01.089
How so? Well, traditional neural networks struggled

00:07:01.089 --> 00:07:03.110
heavily with the vanishing gradient problem.

00:07:03.230 --> 00:07:05.449
Essentially, as the network got deeper, the learning

00:07:05.449 --> 00:07:07.730
signal would decay until the earlier layers stopped

00:07:07.730 --> 00:07:09.709
learning entirely. So they just hit a wall. Yeah.

00:07:09.949 --> 00:07:12.209
But maxout computed the maximum across a group

00:07:12.209 --> 00:07:14.410
of inputs, which allowed the network to effectively

00:07:14.410 --> 00:07:16.750
learn its own activation function. It made the

00:07:16.750 --> 00:07:19.420
models far more robust. And that paved the way

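NOTE
A sketch of the maxout idea in NumPy: take the maximum over a group of k linear
projections, so each unit effectively learns its own piecewise-linear activation.
The shapes and k here are illustrative assumptions, not the paper's configuration.
  import numpy as np
  def maxout(x, W, b):
      # x: (n_in,), W: (k, n_out, n_in), b: (k, n_out)
      z = np.einsum("koi,i->ko", W, x) + b  # k candidate linear outputs per unit
      return z.max(axis=0)                  # keep only the largest in each group
  rng = np.random.default_rng(0)
  x = rng.standard_normal(3072)             # one flattened 32x32x3 CIFAR image
  W = rng.standard_normal((5, 128, 3072)) * 0.01
  b = np.zeros((5, 128))
  print(maxout(x, W, b).shape)              # (128,)
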
00:07:19.420 --> 00:07:23.519
for even deeper architectures. By May 2016, wide

00:07:23.519 --> 00:07:26.579
residual networks pushed the error down to exactly

00:07:26.579 --> 00:07:31.120
4.0%. Now, we know standard ResNets use skip

00:07:31.120 --> 00:07:33.220
connections to let information bypass certain

00:07:33.220 --> 00:07:37.139
layers, but the wide aspect is a highly specific

00:07:37.139 --> 00:07:39.600
optimization for this particular dataset. Right,

00:07:39.600 --> 00:07:41.600
because rather than just stacking hundreds of

00:07:41.600 --> 00:07:43.759
layers endlessly, researchers realized that on

00:07:43.759 --> 00:07:46.459
a dataset as small and low-res as CIFAR-10,

00:07:46.600 --> 00:07:48.720
increasing the number of channels... The width

00:07:48.720 --> 00:07:51.180
of the network blocks. Yes, the width. That was

00:07:51.180 --> 00:07:53.899
highly effective. It allowed the model to recognize

00:07:53.899 --> 00:07:56.319
a wider variety of complex patterns in those

00:07:56.319 --> 00:07:59.120
tiny pixel blocks without the immense computational

00:07:59.120 --> 00:08:01.839
drag of an ultra-deep network. The snowball effect

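NOTE
A toy wide residual block in PyTorch, sketching the width-over-depth trade-off
being described. The widening factor k and channel counts are illustrative
assumptions, not the exact configuration from the wide residual networks paper.
  import torch
  import torch.nn as nn
  class WideBlock(nn.Module):
      def __init__(self, base_channels, k=8):   # k multiplies the channel count
          super().__init__()
          wide = base_channels * k
          self.body = nn.Sequential(
              nn.Conv2d(wide, wide, 3, padding=1), nn.BatchNorm2d(wide), nn.ReLU(),
              nn.Conv2d(wide, wide, 3, padding=1), nn.BatchNorm2d(wide))
      def forward(self, x):
          return torch.relu(x + self.body(x))   # skip connection around the wide body
  x = torch.randn(1, 128, 32, 32)               # a CIFAR-sized feature map, 16 * 8 channels
  print(WideBlock(16)(x).shape)                 # torch.Size([1, 128, 32, 32])
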
00:08:01.839 --> 00:08:05.100
just keeps rolling. November 2018, an architecture

00:08:05.100 --> 00:08:07.519
called GPipe gets the error rate to an astonishing

00:08:07.519 --> 00:08:11.319
1.00%. Almost perfect. Almost. And then we hit

00:08:11.319 --> 00:08:14.540
a massive paradigm shift in 2021, a paper beautifully

00:08:14.540 --> 00:08:18.519
titled An Image is Worth 16 by 16 Words: Transformers

00:08:18.519 --> 00:08:21.389
for Image Recognition at Scale. Their error rate

00:08:21.389 --> 00:08:24.269
plummeted to a mere 0.5%. Wow, that paper represents

00:08:24.269 --> 00:08:26.149
the crossover event of the decade, honestly.

00:08:26.470 --> 00:08:27.949
Because transformers were designed for natural

00:08:27.949 --> 00:08:30.329
language processing, right? Exactly. They are

00:08:30.329 --> 00:08:33.450
the architecture behind modern large language

00:08:33.450 --> 00:08:36.009
models. The traditional assumption was always

00:08:36.009 --> 00:08:38.509
that CNNs were the only effective way to process

00:08:38.509 --> 00:08:41.230
vision. Because CNNs build up an image locally,

00:08:41.500 --> 00:08:44.500
analyzing small clusters of pixels. Right. But

00:08:44.500 --> 00:08:47.299
the transformer treats the image like a paragraph

00:08:47.299 --> 00:08:52.059
of text. They slice the 32 by 32 image into tiny

00:08:52.059 --> 00:08:55.440
16 by 16 patches and feed them into the network

00:08:55.440 --> 00:08:57.879
sequentially, just like words in a sentence.

00:08:57.879 --> 00:09:00.679
That's wild. It allows the model to process the

00:09:00.679 --> 00:09:03.419
image globally from layer one. The network can

00:09:03.419 --> 00:09:05.259
immediately analyze the relationship between

00:09:05.259 --> 00:09:07.659
a patch of pixels in the top left corner and

00:09:07.659 --> 00:09:10.000
a patch in the bottom right, achieving that 99.5%

00:09:10.000 --> 00:09:12.809
accuracy, right? Here's where it gets really

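NOTE
A NumPy sketch of the patch-slicing step being described: the image becomes a short
sequence of flattened patches that the transformer reads like words. The 16-pixel
patch size follows the paper's title; on a 32x32 image it yields just four "words".
  import numpy as np
  image = np.zeros((32, 32, 3))   # one CIFAR-sized RGB image
  p = 16                          # patch side length
  patches = image.reshape(32 // p, p, 32 // p, p, 3)
  patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * 3)
  print(patches.shape)            # (4, 768): a four-word "sentence" of pixels
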
00:09:12.809 --> 00:09:14.850
interesting though. Oh! Yeah, the source material

00:09:14.850 --> 00:09:17.850
slaps a massive caveat on this entire timeline.

00:09:18.789 --> 00:09:21.009
It turns out you can't just compare these error

00:09:21.009 --> 00:09:23.330
rates side by side like a clean Olympic sprint.

00:09:23.750 --> 00:09:26.360
Ah, right, the pre-processing. Exactly. Not

00:09:26.360 --> 00:09:28.639
all of these research papers standardized their

00:09:28.639 --> 00:09:31.940
pre-processing techniques. Some relied heavily

00:09:31.940 --> 00:09:34.960
on data augmentation, specifically things like

00:09:34.960 --> 00:09:37.480
image flipping or image shifting. Which creates

00:09:37.480 --> 00:09:39.940
a fascinating historical discrepancy. Oh boy.

00:09:40.080 --> 00:09:42.299
Because you could have a newer paper with a 0.95%

00:09:42.299 --> 00:09:46.539
error rate, like the May 2023 paper on

00:09:46.539 --> 00:09:49.720
reduction of class activation uncertainty, that

00:09:49.720 --> 00:09:53.039
actually possesses a superior architecture to

00:09:53.039 --> 00:09:56.600
the 2021 paper with the 0.5% error rate, simply

00:09:56.600 --> 00:09:59.559
because the 2023 model was tested under vastly

00:09:59.559 --> 00:10:02.220
more brutal conditions. It's like comparing marathon

00:10:02.220 --> 00:10:04.740
world records, but discovering some runners were

00:10:04.740 --> 00:10:07.240
allowed to use rollerblades. That's a great analogy.

00:10:07.379 --> 00:10:09.919
Thanks. But seriously, if some researchers are

00:10:09.919 --> 00:10:11.919
using image flipping and others aren't, are we

00:10:11.919 --> 00:10:14.600
even looking at a fair race? If you have an image

00:10:14.600 --> 00:10:17.080
of a cat facing left in the data set and you

00:10:17.080 --> 00:10:19.460
computationally flip it horizontally so it faces

00:10:19.460 --> 00:10:21.759
right, you've essentially doubled your training

00:10:21.759 --> 00:10:24.860
data without needing more human labelers. But

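NOTE
The flip-and-shift augmentation under discussion, sketched in NumPy. Flipping is a
one-line array reversal; the 4-pixel shift budget is an illustrative assumption.
  import numpy as np
  def augment(img, shift=4):
      # img: a (32, 32, 3) array. The horizontal flip doubles the data "for free".
      flipped = img[:, ::-1, :]
      # Shift a few pixels off center: pad, then crop at a random offset.
      padded = np.pad(img, ((shift, shift), (shift, shift), (0, 0)), mode="reflect")
      y, x = np.random.randint(0, 2 * shift + 1, size=2)
      return flipped, padded[y:y + 32, x:x + 32, :]
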
00:10:24.860 --> 00:10:27.559
it also fundamentally changes what the AI is

00:10:27.559 --> 00:10:29.779
learning. Yeah, the technical threat researchers

00:10:29.779 --> 00:10:31.820
are fighting here is overfitting. Overfitting.

00:10:32.120 --> 00:10:35.960
Right. A neural network is inherently lazy. It

00:10:35.960 --> 00:10:38.779
will find the path of least mathematical resistance.

00:10:39.210 --> 00:10:42.230
If you feed it that left-facing cat repeatedly,

00:10:42.870 --> 00:10:45.389
it won't learn the conceptual idea of a feline.

00:10:45.529 --> 00:10:48.070
It'll just memorize it. Yes. It will simply memorize

00:10:48.070 --> 00:10:50.870
the exact spatial coordinates of those specific

00:10:50.870 --> 00:10:53.830
brown pixels. Which creates a hopelessly brittle

00:10:53.830 --> 00:10:56.250
model. I mean, the moment you deploy it in the

00:10:56.250 --> 00:10:58.490
real world and it encounters a right-facing

00:10:58.490 --> 00:11:01.169
cat, the confidence score drops to zero because

00:11:01.169 --> 00:11:03.190
it hasn't actually learned anything about cats.

00:11:03.409 --> 00:11:05.850
To prevent that brittle memorization, researchers

00:11:05.850 --> 00:11:09.029
actively mutate the data. They shift the image

00:11:09.029 --> 00:11:11.690
a few pixels off center, so the object is partially

00:11:11.690 --> 00:11:15.129
cropped. They alter the RGB values to simulate

00:11:15.129 --> 00:11:17.289
different lighting. The source even mentions

00:11:17.289 --> 00:11:19.230
a technique called cutout, where they literally

00:11:19.230 --> 00:11:22.090
mask out random contiguous sections of the image

00:11:22.090 --> 00:11:25.070
during training. They just drop a giant digital

00:11:25.070 --> 00:11:27.029
black square right over the dog's face. Which

00:11:27.029 --> 00:11:29.690
is such a brilliant form of regularization. How

00:11:29.690 --> 00:11:32.370
so? Well, by masking the most obvious identifying

00:11:32.370 --> 00:11:34.809
features, the network is forced to rely on holistic

00:11:34.809 --> 00:11:38.159
context. Ah, I see. Yeah. If the dog's face is

00:11:38.159 --> 00:11:40.820
missing, the model has to learn that the curve

00:11:40.820 --> 00:11:43.440
of the tail or the texture of the fur or the

00:11:43.440 --> 00:11:47.519
posture of the legs also define the class. It

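NOTE
A sketch of the cutout technique just mentioned: zero out a random contiguous
square during training so the network cannot lean on one easy visual cue. The
8-pixel mask size is an illustrative assumption.
  import numpy as np
  def cutout(img, size=8):
      # img: a (32, 32, 3) array. Drop a size x size square at a random center.
      out = img.copy()
      cy, cx = np.random.randint(0, 32, size=2)
      y0, y1 = max(cy - size // 2, 0), min(cy + size // 2, 32)
      x0, x1 = max(cx - size // 2, 0), min(cx + size // 2, 32)
      out[y0:y1, x0:x1, :] = 0.0   # the giant digital black square
      return out
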
00:11:47.519 --> 00:11:50.519
actively prevents the AI from relying on a single

00:11:50.519 --> 00:11:54.059
easy visual cue. That makes total sense. So the

00:11:54.059 --> 00:11:56.360
raw error rate on a leaderboard doesn't tell

00:11:56.360 --> 00:11:58.759
the whole story if you don't factor in how severely

00:11:58.759 --> 00:12:01.299
the data was mutilated during training. Exactly.

00:12:01.539 --> 00:12:03.860
Now, as these models began to effectively master

00:12:03.860 --> 00:12:06.679
CIFAR -10 and drive that error rate towards zero,

00:12:07.299 --> 00:12:09.799
the 60,000-image data set wasn't enough anymore.

00:12:09.940 --> 00:12:12.399
They outgrew it. They really did. The source

00:12:12.399 --> 00:12:15.080
outlined a whole ecosystem of related datasets

00:12:15.080 --> 00:12:17.639
that researchers had to build to test entirely

00:12:17.639 --> 00:12:19.700
new mathematical weaknesses. Because once your

00:12:19.700 --> 00:12:21.840
architecture aces the sandbox, you have to find

00:12:21.840 --> 00:12:23.899
out where its boundaries are. Right. And the

00:12:23.899 --> 00:12:25.759
heavyweight of the group mentioned is ImageNet,

00:12:26.080 --> 00:12:29.580
specifically the ILSVRC. This dataset scales

00:12:29.580 --> 00:12:32.980
the complexity exponentially. It features 1 million

00:12:32.980 --> 00:12:35.000
color images spread across a thousand different

00:12:35.000 --> 00:12:37.759
classes at a much higher resolution, averaging

00:12:37.759 --> 00:12:42.639
469 by 387 pixels. ImageNet is the ultimate test

00:12:42.639 --> 00:12:45.279
of computational scalability. It's massive. Yeah,

00:12:45.419 --> 00:12:48.259
if CIFAR-10 was where you optimized your architecture's

00:12:48.259 --> 00:12:50.700
efficiency, ImageNet was where you spent your

00:12:50.700 --> 00:12:53.899
massive computational budget to see if that architecture

00:12:53.899 --> 00:12:56.539
could survive the high-fidelity chaos of real

00:12:56.539 --> 00:12:59.100
-world complexity. But then you have CIFAR-100,

00:12:59.320 --> 00:13:01.179
which goes in the completely opposite direction.

00:13:01.279 --> 00:13:03.879
Oh, this one's brutal. It utilizes the exact

00:13:03.879 --> 00:13:08.580
same tiny 32 by 32 resolution and the exact same

00:13:08.580 --> 00:13:11.759
60,000 total images, but it spreads them across

00:13:11.759 --> 00:13:14.940
100 classes instead of 10. That means instead

00:13:14.940 --> 00:13:18.320
of 6,000 images per class, the model only gets

00:13:18.320 --> 00:13:21.080
600. Which introduces the incredibly difficult

00:13:21.080 --> 00:13:23.519
problem of data scarcity. Right. So if CIFAR-10

00:13:23.519 --> 00:13:26.539
is elementary school flashcards, and ImageNet

00:13:26.539 --> 00:13:28.879
is the final exams with its million high-res

00:13:28.879 --> 00:13:31.940
images, then CIFAR-100 is like a pop quiz where

00:13:31.940 --> 00:13:33.840
you barely get any study materials, right? That

00:13:33.840 --> 00:13:36.220
is exactly right. If you are developing an AI

00:13:36.220 --> 00:13:38.720
for medical diagnostics, say to detect a rare

00:13:38.720 --> 00:13:41.279
disease, you won't have a million x-rays to

00:13:41.279 --> 00:13:44.139
train on. You might have 500. CIFAR-100 forces

00:13:44.139 --> 00:13:46.519
the algorithm to extrapolate accurate patterns

00:13:46.519 --> 00:13:49.399
from a highly constrained sample size. Then

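NOTE
The same structural check as the earlier CIFAR-10 sketch, applied to CIFAR-100:
identical resolution and total size, ten times the classes, a tenth of the examples
per class. Again assumes torchvision is available.
  from collections import Counter
  from torchvision.datasets import CIFAR100
  train = CIFAR100(root="./data", train=True, download=True)
  test = CIFAR100(root="./data", train=False, download=True)
  print(len(train) + len(test))             # still 60000 images, still 32x32
  counts = Counter(train.targets) + Counter(test.targets)
  print(len(counts), set(counts.values()))  # 100 classes, only 600 images each
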
00:13:49.399 --> 00:13:52.100
we have SVHN, which stands for Street View House

00:13:52.100 --> 00:13:55.299
Numbers. This one has approximately 600,000

00:13:55.299 --> 00:13:58.700
images, also in the 32 by 32 resolution, but

00:13:58.700 --> 00:14:01.299
it is strictly limited to 10 classes, just the

00:14:01.299 --> 00:14:04.159
digits 0 through 9. And that shift from organic

00:14:04.159 --> 00:14:08.360
shapes to rigid geometry is crucial. SVHN strips

00:14:08.360 --> 00:14:10.759
away the biological unpredictability of frogs

00:14:10.759 --> 00:14:13.080
and deer. No more fuzzy tails to look for. Nope.

00:14:13.240 --> 00:14:16.000
It isolates the model's ability to perform structural

00:14:16.000 --> 00:14:18.700
recognition across wildly varying real-world

00:14:18.700 --> 00:14:20.799
lighting and angles. So if you imagine deploying

00:14:20.799 --> 00:14:23.320
a system, it proves how strictly the testing

00:14:23.320 --> 00:14:25.299
environment dictates the model's capability.

00:14:25.759 --> 00:14:27.840
I mean, if you train an architecture exclusively

00:14:27.840 --> 00:14:30.320
on SVHN, it will read a blurry house number perfectly.

00:14:30.600 --> 00:14:32.360
But if a bird happens to fly into the frame,

00:14:32.419 --> 00:14:35.440
the model's internal logic collapses. Yes. The

00:14:35.440 --> 00:14:37.700
boundary of the data set entirely defines the

00:14:37.700 --> 00:14:40.639
AI's reality. The tools we use to test the machine

00:14:40.639 --> 00:14:42.720
dictate exactly how the machine will learn to

00:14:42.720 --> 00:14:44.960
think. So what does this all mean for you listening?

00:14:45.679 --> 00:14:48.139
Well, when you open an app and it instantly recognizes

00:14:48.139 --> 00:14:50.779
a street sign or translates a menu in real time,

00:14:51.240 --> 00:14:53.820
you are witnessing the direct historical lineage

00:14:53.820 --> 00:14:56.460
of models that cut their teeth on these early

00:14:56.460 --> 00:15:00.159
data sets. The massive multi-billion parameter

00:15:00.159 --> 00:15:02.940
models of today trace their evolutionary roots

00:15:02.940 --> 00:15:06.059
directly back to college students labeling 32-pixel

00:15:06.059 --> 00:15:09.389
digital flashcards. It really is a remarkable

00:15:09.389 --> 00:15:12.090
progression of forced optimization. But looking

00:15:12.090 --> 00:15:14.649
closely at the source material, there is one

00:15:14.649 --> 00:15:17.409
final data set in this ecosystem that completely

00:15:17.409 --> 00:15:19.470
upends the philosophy we've been discussing today.

00:15:20.049 --> 00:15:22.149
The source briefly mentions a variant called

00:15:22.149 --> 00:15:25.330
CIFAR-10H. The H standing for human perceptual

00:15:25.330 --> 00:15:27.750
uncertainty. Exactly. For the entire history

00:15:27.750 --> 00:15:30.230
of the race to zero errors, the goal of benchmarking

00:15:30.230 --> 00:15:33.549
was objective truth. The model outputs dog, and

00:15:33.549 --> 00:15:35.750
the human-created answer key definitively says

00:15:35.750 --> 00:15:39.129
dog with 100% confidence. But CIFAR-10H replaces

00:15:39.129 --> 00:15:41.330
those rigid labels with probability distributions

00:15:41.330 --> 00:15:44.029
based on how humans actually react to blurry

00:15:44.029 --> 00:15:47.129
data. Oh, because if you show a 32-pixel blur

00:15:47.129 --> 00:15:49.970
to 10 different people, maybe seven of them say

00:15:49.970 --> 00:15:52.330
it's a cat, but three are convinced it's a dog.

00:15:52.470 --> 00:15:55.980
Yes. It bakes our cognitive hesitation directly

00:15:55.980 --> 00:15:59.080
into the training data. We spent a decade forcing

00:15:59.080 --> 00:16:01.019
these architectures to be absolutely certain,

00:16:01.200 --> 00:16:03.620
demanding they eliminate all error. But if we

00:16:03.620 --> 00:16:05.639
are now intentionally feeding them data sets

00:16:05.639 --> 00:16:07.919
modeled on our own ambiguity... Teaching them

00:16:07.919 --> 00:16:11.039
to hesitate, just like we do. Exactly. What happens

00:16:11.039 --> 00:16:13.320
when artificial intelligence is built not on

00:16:13.320 --> 00:16:15.019
human facts, but on human doubt?

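NOTE
A closing sketch of training against CIFAR-10H-style soft labels in PyTorch:
cross-entropy against a human probability distribution rather than a one-hot answer
key. The 0.7 cat / 0.3 dog split is the hypothetical example from the conversation.
  import torch
  import torch.nn.functional as F
  logits = torch.randn(1, 10, requires_grad=True)  # one model prediction, 10 classes
  soft = torch.zeros(1, 10)
  soft[0, 3], soft[0, 5] = 0.7, 0.3                # class 3 is cat, class 5 is dog
  loss = -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
  loss.backward()                                  # the gradient now encodes human doubt
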