WEBVTT

00:00:00.000 --> 00:00:03.060
You know, when you think about cutting edge artificial

00:00:03.060 --> 00:00:05.400
intelligence today, there's this expectation

00:00:05.400 --> 00:00:08.900
of these sleek, sterile server farms. Oh, absolutely.

00:00:09.179 --> 00:00:12.220
Just glowing racks of silicon everywhere. Right.

00:00:12.539 --> 00:00:15.099
Massive cooling fans, endless streams of code,

00:00:15.500 --> 00:00:18.280
just something completely divorced from the messy,

00:00:18.280 --> 00:00:21.059
you know, organic world of biology. Yeah, it

00:00:21.059 --> 00:00:23.739
definitely feels like a purely mathematical achievement.

00:00:24.030 --> 00:00:26.969
Like something born entirely inside a machine,

00:00:27.410 --> 00:00:29.570
completely disconnected from the natural world.

00:00:29.910 --> 00:00:32.229
But then you trace the family tree of the very

00:00:32.229 --> 00:00:34.909
technology that gave machines the ability to

00:00:34.909 --> 00:00:38.170
see. And suddenly, you aren't in a server farm

00:00:38.170 --> 00:00:39.869
at all. No, not at all. You're actually in a

00:00:39.869 --> 00:00:42.929
biology lab. Exactly. A biology lab in the 1950s

00:00:42.929 --> 00:00:45.170
looking at the brain of a cat. It really is the

00:00:45.170 --> 00:00:47.350
ultimate technological plot twist. I mean, the

00:00:47.350 --> 00:00:49.609
blueprint for modern computer vision was actually

00:00:49.609 --> 00:00:51.929
drawn by Mother Nature long before anyone ever

00:00:51.929 --> 00:00:54.369
typed a line of code. Welcome to the Deep Dive.

00:00:54.649 --> 00:00:57.530
Today, we are taking a stack of research on convolutional

00:00:57.530 --> 00:01:01.229
neural networks, or CNNs, and figuring out how

00:01:01.229 --> 00:01:03.810
a biological discovery in the 50s became the

00:01:03.810 --> 00:01:05.799
technology that lets your phone recognize your

00:01:05.799 --> 00:01:08.859
face. And powers self-driving cars. Yeah, and

00:01:08.859 --> 00:01:11.359
even discovers new medicines. Our mission here

00:01:11.359 --> 00:01:14.019
is to demystify how we actually taught machines

00:01:14.019 --> 00:01:16.719
to see and to do that we have to look at how

00:01:16.719 --> 00:01:19.260
nature taught us to see. Which is fascinating

00:01:19.260 --> 00:01:22.120
all on its own. Okay let's unpack this because

00:01:22.120 --> 00:01:24.260
the idea of computer vision can feel totally

00:01:24.260 --> 00:01:27.000
overwhelming but human vision is actually built

00:01:27.000 --> 00:01:29.500
on a surprisingly simple foundational trick.

00:01:30.200 --> 00:01:34.409
Think of it like looking at a massive complex

00:01:34.409 --> 00:01:36.769
mosaic on a wall. A mosaic is a really great

00:01:36.769 --> 00:01:38.769
way to frame it. Right, because when you stand

00:01:38.769 --> 00:01:41.269
right up close to a mosaic, you don't process

00:01:41.269 --> 00:01:44.390
the entire wall at once. Your eye catches a single

00:01:44.390 --> 00:01:47.829
dark blue square. Just one tiny piece. Yeah,

00:01:47.930 --> 00:01:49.530
then maybe you notice a light blue square right

00:01:49.530 --> 00:01:51.989
next to it. You're just looking at tiny, localized

00:01:51.989 --> 00:01:53.430
patches of color. And they don't mean anything

00:01:53.430 --> 00:01:56.530
on their own. Exactly. But as you step back,

00:01:56.870 --> 00:01:59.510
your brain starts combining all those individual

00:01:59.510 --> 00:02:02.810
meaningless tiles. The dark blue and light blue

00:02:02.810 --> 00:02:05.989
merge into a pattern and eventually you realize

00:02:05.989 --> 00:02:08.050
you were looking at a giant picture of an ocean.

00:02:08.430 --> 00:02:11.530
You build the whole from tiny overlapping pieces.

00:02:11.710 --> 00:02:14.250
And that maps perfectly onto the biological research

00:02:14.250 --> 00:02:16.990
that started all of this. Back in the 1950s and

00:02:16.990 --> 00:02:20.349
60s, there were these two researchers named Hubel

00:02:20.349 --> 00:02:22.849
and Wiesel. Okay. And they were studying the

00:02:22.849 --> 00:02:25.629
visual cortices of cats. They were trying to

00:02:25.629 --> 00:02:28.370
figure out exactly how the brain translates light

00:02:28.370 --> 00:02:31.310
into shapes. So what did they do? Well, they

00:02:31.310 --> 00:02:33.789
showed these cats various shapes and lines on

00:02:33.789 --> 00:02:37.210
a screen, and they literally measured the electrical

00:02:37.210 --> 00:02:40.219
spikes in specific neurons in the cat's brains.

00:02:40.969 --> 00:02:43.270
Yeah, and what they discovered fundamentally

00:02:43.270 --> 00:02:45.990
changed our understanding of sight. They found

00:02:45.990 --> 00:02:48.389
that individual neurons in the brain, they don't

00:02:48.389 --> 00:02:51.030
process the entire visual field at once. Like

00:02:51.030 --> 00:02:54.050
staring too close to the mosaic. Exactly. Instead,

00:02:54.090 --> 00:02:56.949
they respond to stimuli only in very restricted

00:02:56.949 --> 00:03:00.050
tiny regions. They called these regions receptive

00:03:00.050 --> 00:03:03.030
fields. So a single neuron is only responsible

00:03:03.030 --> 00:03:05.310
for its own little patch of reality. Right. It's

00:03:05.310 --> 00:03:07.050
completely blind to what's happening on the other

00:03:07.050 --> 00:03:09.189
side of the visual field. Precisely. And neighboring

00:03:09.189 --> 00:03:11.819
cells have overlapping receptive fields, so together

00:03:11.819 --> 00:03:14.520
they map out the whole visual space, just like

00:03:14.520 --> 00:03:17.240
your mosaic tiles overlapping to create the ocean.

00:03:17.580 --> 00:03:20.039
That makes a lot of sense. Hubel and Wiesel even

00:03:20.039 --> 00:03:23.319
identified two specific types of visual cells

00:03:23.319 --> 00:03:26.120
operating in this system. First, they found what

00:03:26.120 --> 00:03:29.340
they called simple cells. Okay, what do simple

00:03:29.340 --> 00:03:32.159
cells do? These cells get really excited, like

00:03:32.159 --> 00:03:34.900
they maximize their electrical output only when

00:03:34.900 --> 00:03:38.020
they see a straight edge or a line at a very

00:03:38.020 --> 00:03:40.520
specific angle within their little patch. So

00:03:40.520 --> 00:03:43.460
they are very picky. Incredibly picky. If the

00:03:43.460 --> 00:03:46.300
line is tilted even slightly wrong, the cell

00:03:46.300 --> 00:03:50.960
stays totally quiet. Then they found complex

00:03:50.960 --> 00:03:53.060
cells. And I'm guessing they are less picky?

00:03:53.360 --> 00:03:55.539
Yeah. They take the information from the simple

00:03:55.539 --> 00:03:57.400
cells, but they are a bit more flexible. They

00:03:57.400 --> 00:03:59.620
don't care exactly where that edge is located

00:03:59.620 --> 00:04:02.219
within the patch, just that the edge exists somewhere

00:04:02.219 --> 00:04:04.280
in their neighborhood. Oh, I see. So the simple

00:04:04.280 --> 00:04:07.259
cells are the strict rule followers looking for

00:04:07.259 --> 00:04:09.960
exact angles, and the complex cells are just

00:04:09.960 --> 00:04:12.240
looking at the general vibe of the area. That's

00:04:12.240 --> 00:04:14.710
a really good way to put it. And here is where

00:04:14.710 --> 00:04:18.670
the jump to technology happens. In 1980, a computer

00:04:18.670 --> 00:04:21.389
scientist named Kunihiko Fukushima took this

00:04:21.389 --> 00:04:24.189
exact biological discovery and translated it

00:04:24.189 --> 00:04:28.310
into a mathematical model. In 1980. Yeah, 1980.

00:04:28.470 --> 00:04:30.569
He introduced something called the neocognitron.

00:04:31.050 --> 00:04:32.970
Neocognitron. Sounds like a sci-fi villain.

00:04:33.069 --> 00:04:35.589
It really does, but what he did was directly

00:04:35.589 --> 00:04:39.160
mimic those cat brains by creating two alternating

00:04:39.160 --> 00:04:42.680
layers in his computer model. He made an S layer,

00:04:42.860 --> 00:04:45.360
which acted like the simple cells, looking at

00:04:45.360 --> 00:04:47.779
specific receptive fields and identifying basic

00:04:47.779 --> 00:04:50.920
patterns. Exactly. And he made a C layer acting

00:04:50.920 --> 00:04:53.660
like the complex cells, which consolidated the

00:04:53.660 --> 00:04:56.180
information and helped classify objects, even

00:04:56.180 --> 00:04:59.279
if they shifted position slightly. So this neocognitron

00:04:59.279 --> 00:05:02.040
is essentially the origin of the whole CNN architecture

00:05:02.040 --> 00:05:04.180
we use today. It is the foundational blueprint.

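The simple-cell idea, an S-layer unit that fires only for one orientation, can be sketched in a few lines of Python. The 3x3 filter and line patterns here are purely illustrative, not taken from the Neocognitron itself:

```python
# A toy "simple cell": a 3x3 filter that responds strongly only to a
# vertical line inside its small receptive field. Sizes and values are
# illustrative, not from Fukushima's actual model.

def simple_cell_response(patch, weights):
    """Weighted sum of a 3x3 patch -- the cell's activation."""
    return sum(w * p for row_w, row_p in zip(weights, patch)
                     for w, p in zip(row_w, row_p))

# Filter tuned to a bright vertical line down the middle column.
vertical_filter = [[-1, 2, -1],
                   [-1, 2, -1],
                   [-1, 2, -1]]

vertical_line   = [[0, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0]]
horizontal_line = [[0, 0, 0],
                   [1, 1, 1],
                   [0, 0, 0]]

print(simple_cell_response(vertical_line, vertical_filter))    # 6 (fires)
print(simple_cell_response(horizontal_line, vertical_filter))  # 0 (silent)
```

Tilt the line away from the filter's preferred orientation and the response drops to nothing, which is exactly the "incredibly picky" behavior described next.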
00:05:04.439 --> 00:05:06.509
Yes, it is wild. that the blueprint was mapped

00:05:06.509 --> 00:05:10.709
out in 1980. But the source material notes that

00:05:10.709 --> 00:05:12.550
when early computer scientists actually tried

00:05:12.550 --> 00:05:14.709
to build standard neural networks to process

00:05:14.709 --> 00:05:17.870
images using these ideas, they immediately hit

00:05:17.870 --> 00:05:20.990
a massive mathematical brick wall. A huge wall.

00:05:21.029 --> 00:05:23.269
They ran into something called the curse of dimensionality.

00:05:23.649 --> 00:05:25.790
Right. And looking at the numbers in the research,

00:05:25.889 --> 00:05:28.709
I see exactly why. Let's do the math on a simple

00:05:28.709 --> 00:05:33.170
100 by 100 pixel image. To us, that is incredibly

00:05:33.170 --> 00:05:35.769
tiny. Yeah, it's just a low resolution thumbnail.

00:05:36.189 --> 00:05:41.069
But for a computer, a 100 by 100 grid is 10,000

00:05:41.069 --> 00:05:43.870
individual pixels. Which doesn't sound too bad

00:05:43.870 --> 00:05:45.990
until you look at how a traditional neural network

00:05:45.990 --> 00:05:48.589
is wired. Right. In a standard fully connected

00:05:48.589 --> 00:05:51.689
network, every single artificial neuron in one

00:05:51.689 --> 00:05:54.750
layer has to connect to every single neuron in

00:05:54.750 --> 00:05:57.209
the next layer. OK, so if you have those 10,000

00:05:57.209 --> 00:05:59.399
pixels in the first layer. And let's say your

00:05:59.399 --> 00:06:02.600
next hidden layer has a thousand neurons. Every

00:06:02.600 --> 00:06:05.759
single one of those 1,000 neurons needs a dedicated

00:06:05.759 --> 00:06:08.500
connection to all 10,000 pixels. Do the math

00:06:08.500 --> 00:06:11.420
on that. That is 10 million individual weights,

00:06:11.740 --> 00:06:13.879
or connections, that the computer has to calculate.

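The arithmetic behind that brick wall is worth writing out. A quick Python sketch, using the exact sizes from the discussion:

```python
# Parameter count for one fully connected layer on a "tiny" 100x100
# image, as discussed above.

pixels = 100 * 100          # 10,000 input pixels
hidden = 1_000              # neurons in the next layer
weights = pixels * hidden   # every neuron connects to every pixel
print(weights)              # 10,000,000

# Three color channels (R, G, B) triple the input:
print(weights * 3)          # 30,000,000
```

Ten million weights for a grayscale thumbnail, thirty million in color, and that is a single layer before the network even gets deep.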
00:06:14.000 --> 00:06:16.480
For a tiny thumbnail. And if it is a color image

00:06:16.480 --> 00:06:18.620
with red, green, and blue channels, you multiply

00:06:18.620 --> 00:06:21.339
that by three, suddenly you have 30 million parameters

00:06:21.339 --> 00:06:24.139
for a blurry little picture. Exactly. And if

00:06:24.139 --> 00:06:26.519
you try to process a modern high-resolution

00:06:26.519 --> 00:06:29.000
image, say from a smartphone camera, the math

00:06:29.000 --> 00:06:31.740
explodes into the billions or even trillions

00:06:31.740 --> 00:06:33.899
of connections. The computers of the era just

00:06:33.899 --> 00:06:36.100
choked on the data. They completely choked. I

00:06:36.100 --> 00:06:37.689
see. It's kind of like the difference between

00:06:37.689 --> 00:06:41.069
a micromanager and a smart auditor. A traditional

00:06:41.069 --> 00:06:43.610
neural network is the micromanager. I like that

00:06:43.610 --> 00:06:46.069
analogy. It demands a direct individual line

00:06:46.069 --> 00:06:48.730
of communication with every single employee in

00:06:48.730 --> 00:06:51.790
a global company, asking for an update every

00:06:51.790 --> 00:06:53.990
single second. Right, checking every single pixel

00:06:53.990 --> 00:06:56.689
at all times. There are millions of emails flying

00:06:56.689 --> 00:06:59.189
around, it's too much noise, and the micromanager

00:06:59.189 --> 00:07:02.490
just collapses from exhaustion. But a convolutional

00:07:02.490 --> 00:07:05.649
neural network, a CNN, is like a smart auditor.

00:07:05.850 --> 00:07:08.310
Right. Instead of talking to everyone at once,

00:07:08.629 --> 00:07:10.430
the auditor takes a little rubber stamp, let's

00:07:10.430 --> 00:07:13.230
call it a filter, and walks down the hall, checking

00:07:13.230 --> 00:07:16.709
small 5x5 teams one at a time. That little rubber

00:07:16.709 --> 00:07:19.129
stamp is what we call a convolution filter or

00:07:19.129 --> 00:07:21.670
a kernel. Okay. Instead of a single neuron needing

00:07:21.670 --> 00:07:23.910
10,000 weights to look at the whole image at

00:07:23.910 --> 00:07:28.250
once, a CNN uses a tiny 5x5 filter. That filter

00:07:28.250 --> 00:07:31.629
only requires 25 weights. Wow, just 25? Just

00:07:31.629 --> 00:07:33.730
25. And the secret sauce that makes this whole

00:07:33.730 --> 00:07:36.129
thing work is that it uses shared weights. What

00:07:36.129 --> 00:07:39.319
does that mean, shared weights? As that 5x5 filter

00:07:39.319 --> 00:07:42.420
slides or convolves across the image, it uses

00:07:42.420 --> 00:07:46.819
the exact same 25 parameters to check every single

00:07:46.819 --> 00:07:49.399
patch of the picture. Oh, because if a pattern

00:07:49.399 --> 00:07:52.160
like a vertical line or the curve of an ear is

00:07:52.160 --> 00:07:54.379
important in the top left corner of an image,

00:07:54.939 --> 00:07:56.639
it's probably just as important if it appears

00:07:56.639 --> 00:07:58.639
in the bottom right corner. Exactly. The rule

00:07:58.639 --> 00:08:00.720
doesn't need to change just because the location

00:08:00.720 --> 00:08:04.430
changed. That is the core insight. This massive

00:08:04.430 --> 00:08:07.490
reduction in parameters from 30 million down

00:08:07.490 --> 00:08:11.050
to just a few dozen per filter does something

00:08:11.050 --> 00:08:13.310
crucial for the learning process. It keeps the

00:08:13.310 --> 00:08:16.310
computer from crashing. Well, yes, but also in

00:08:16.310 --> 00:08:19.290
early, fully connected networks, scientists struggled

00:08:19.290 --> 00:08:21.610
with the learning signal either dying out into

00:08:21.610 --> 00:08:24.629
nothing or blowing up into infinity as the system

00:08:24.629 --> 00:08:26.569
tried to adjust millions of weights at once.

00:08:27.019 --> 00:08:29.860
So it was unstable. Highly unstable. By forcing

00:08:29.860 --> 00:08:32.379
the network to share weights over fewer connections,

00:08:32.779 --> 00:08:35.940
the CNN regularizes the math. It stabilizes the

00:08:35.940 --> 00:08:38.000
whole learning process. That makes sense. The

00:08:38.000 --> 00:08:40.840
filter slides across the input, calculates how

00:08:40.840 --> 00:08:43.500
closely each patch matches its specific pattern,

00:08:43.960 --> 00:08:46.419
and generates an abstracted map of where it found

00:08:46.419 --> 00:08:49.179
that feature. We call that output a feature map.

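That sliding-filter pass can be sketched directly. A minimal valid convolution in Python; a 3x3 filter is used here for brevity, but the 5x5 case from the discussion works identically, just with 25 shared weights instead of 9:

```python
# A minimal sliding shared-weight filter (valid convolution, stride 1).
# The SAME kernel weights are reused at every position, and the output
# is a feature map of "how well did this patch match the pattern?"

def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the patch under the filter at (i, j).
            s = sum(kernel[a][b] * image[i + a][j + b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        feature_map.append(row)
    return feature_map

# Vertical-line detector: 9 shared weights total, not one weight per
# (pixel, neuron) pair as in a fully connected layer.
kernel = [[-1, 2, -1]] * 3
image = [[0, 0, 1, 0, 0]] * 5   # a vertical line in a 5x5 image

fmap = convolve2d(image, kernel)
print(fmap)   # peaks in the middle column, where the line is
```

The feature map comes out as three rows of [-3, 6, -3]: a strong response wherever the filter sits right on the line, regardless of which row it is in.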
00:08:49.399 --> 00:08:51.120
OK, so the auditor has walked through the building

00:08:51.120 --> 00:08:53.639
with the rubber stamp and created a feature map.

00:08:54.349 --> 00:08:57.409
But according to the source, once the CNN has

00:08:57.409 --> 00:08:59.769
scanned the image with its filter, it suddenly

00:08:59.769 --> 00:09:01.929
has way too much data again. It does. It generates

00:09:01.929 --> 00:09:04.509
a lot of maps. Right. And it needs a way to separate

00:09:04.509 --> 00:09:07.549
the signal from the noise. And this is where

00:09:07.549 --> 00:09:09.710
I really had to challenge the logic of what happens

00:09:09.710 --> 00:09:12.889
next. Oh, what's the issue? The text talks about

00:09:12.889 --> 00:09:16.230
a process called max pooling. It says, a standard

00:09:16.230 --> 00:09:19.149
two by two max pooling layer actually discards

00:09:19.149 --> 00:09:22.049
75% of the data it just looked at. It does do

00:09:22.049 --> 00:09:24.090
that. But if we are throwing away three quarters

00:09:24.090 --> 00:09:25.870
of the information we just worked so hard to

00:09:25.870 --> 00:09:28.690
extract, how does the AI still have any idea

00:09:28.690 --> 00:09:30.710
what it's looking at? This raises an important

00:09:30.710 --> 00:09:33.090
question because it seems entirely counterintuitive

00:09:33.090 --> 00:09:35.230
to throw away data. Totally counterintuitive.

00:09:35.529 --> 00:09:37.750
But let's look at the mechanics of how max pooling,

00:09:37.929 --> 00:09:40.490
which was introduced by a researcher named Yamaguchi

00:09:40.490 --> 00:09:43.889
in 1990, actually works. OK. Walk me through

00:09:43.889 --> 00:09:46.389
it. Imagine your feature map is divided into

00:09:46.389 --> 00:09:48.850
a grid of two by two squares. That's four pixels

00:09:48.850 --> 00:09:52.220
per square. Got it. Max pooling looks at those

00:09:52.220 --> 00:09:54.740
four numbers, takes the single highest value

00:09:54.740 --> 00:09:57.659
(the maximum), and literally deletes the other

00:09:57.659 --> 00:09:59.919
three. Just deletes them. It seems so reckless

00:09:59.919 --> 00:10:03.299
to just dump 75% of the resolution. But alongside

00:10:03.299 --> 00:10:05.659
max pooling, the network usually applies something

00:10:05.659 --> 00:10:08.580
called a ReLU layer. That's a rectified linear

00:10:08.580 --> 00:10:11.980
unit, first used by Fukushima back in 1969, actually.

00:10:12.120 --> 00:10:15.440
OK, what does ReLU do? The ReLU function acts

00:10:15.440 --> 00:10:18.059
as a strict bouncer at the door of the next layer.

00:10:18.320 --> 00:10:21.039
It looks at the data and says, if your mathematical

00:10:21.039 --> 00:10:24.039
value is negative, you are out. You become zero.

00:10:24.139 --> 00:10:26.840
Just flat out zero. Yep. It only lets positive

00:10:26.840 --> 00:10:29.700
values through. This removes negative noise,

00:10:29.820 --> 00:10:31.980
adds necessary mathematical complexity without

00:10:31.980 --> 00:10:34.500
bogging down the system, and massively speeds

00:10:34.500 --> 00:10:36.820
up training. So you have ReLU cleaning up

00:10:36.820 --> 00:10:39.419
the noise and max pooling shrinking the image

00:10:39.419 --> 00:10:42.159
down. OK, I understand the cleanup part, but

00:10:42.159 --> 00:10:43.960
I'm still stuck on throwing away the location

00:10:43.960 --> 00:10:46.899
data with max pooling. To understand why it works,

00:10:46.960 --> 00:10:48.899
we have to talk about something called translation

00:10:48.899 --> 00:10:51.539
invariance. Translation invariance? Which

00:10:51.539 --> 00:10:54.639
is just a fancy way of saying flexibility in

00:10:54.639 --> 00:10:59.340
position. Okay. Intuitively, the exact pixel

00:10:59.340 --> 00:11:02.480
-perfect location of a feature is far less important

00:11:02.480 --> 00:11:05.139
than its rough location relative to other features.

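The two operations just described, the ReLU bouncer followed by 2x2 max pooling, can be sketched together. The feature map values here are made up for illustration:

```python
# ReLU ("negative values become zero") followed by 2x2 max pooling
# ("keep only the strongest of each four values, discard the rest").

def relu(fmap):
    return [[max(0, v) for v in row] for row in fmap]

def max_pool_2x2(fmap):
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

feature_map = [[-3,  6, -1,  0],
               [ 1,  2,  0, -5],
               [ 0, -2,  4,  1],
               [-1,  0,  2,  3]]

pooled = max_pool_2x2(relu(feature_map))
print(pooled)   # [[6, 0], [0, 4]]
```

The 4x4 map of 16 numbers shrinks to a 2x2 map of 4: exactly the 75% reduction the discussion keeps circling back to, with only the strongest signal from each neighborhood surviving.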
00:11:05.259 --> 00:11:07.940
Like, it matters more that an eye is above a

00:11:07.940 --> 00:11:11.129
nose, rather than being at pixel coordinate 45

00:11:11.129 --> 00:11:14.490
by 60. Yes. Think about your phone's facial recognition.

00:11:14.929 --> 00:11:16.629
When you wake up in the morning and hold your

00:11:16.629 --> 00:11:18.690
phone slightly off center or your head is tilted

00:11:18.690 --> 00:11:22.110
on a pillow, the image your camera sees is completely

00:11:22.110 --> 00:11:25.210
different on a pixel by pixel level than the

00:11:25.210 --> 00:11:28.029
one you originally scanned. Ah, I get it. The

00:11:28.029 --> 00:11:30.029
phone doesn't care that my left eye moved down

00:11:30.029 --> 00:11:32.309
three pixels on the camera's grid. No, it doesn't.

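That "doesn't care about three pixels" behavior can be demonstrated directly. In this sketch (made-up values; the robustness only holds for shifts that stay within the same pooling window), a feature nudged by one pixel produces an identical pooled output:

```python
# Nudge a strong feature by one pixel within its 2x2 pooling window:
# the pooled map is unchanged, so downstream layers see the same thing.

def max_pool_2x2(fmap):
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

strong_at_0_0 = [[9, 0, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]]
strong_at_1_1 = [[0, 0, 0, 0],   # same feature, shifted one pixel
                 [0, 9, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]]

print(max_pool_2x2(strong_at_0_0) == max_pool_2x2(strong_at_1_1))  # True
```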
00:11:32.429 --> 00:11:34.669
It just cares that the eye pattern is still roughly

00:11:34.669 --> 00:11:36.490
above the nose pattern and next to the other

00:11:36.490 --> 00:11:39.590
eye pattern. Exactly. If the neural network memorized

00:11:39.590 --> 00:11:42.730
the exact pixel coordinate of everything, it

00:11:42.730 --> 00:11:44.529
would be useless in the real world because real

00:11:44.529 --> 00:11:47.149
world objects move. Right. By taking only the

00:11:47.149 --> 00:11:50.090
maximum value in that two by two cluster, max

00:11:50.090 --> 00:11:52.679
pooling essentially says... I found a strong

00:11:52.679 --> 00:11:55.639
signal for an edge in this general area. I don't

00:11:55.639 --> 00:11:57.740
care exactly which of the four pixels it was

00:11:57.740 --> 00:12:00.000
in. It's focusing on the big picture. It grants

00:12:00.000 --> 00:12:02.440
the network that flexibility, making it robust

00:12:02.440 --> 00:12:05.159
against variations in position. And as a bonus,

00:12:05.360 --> 00:12:07.919
it drastically reduces the memory footprint the

00:12:07.919 --> 00:12:10.259
computer has to handle. I am looking at the timeline

00:12:10.259 --> 00:12:12.700
of all this, and it is driving me crazy. How

00:12:12.700 --> 00:12:15.379
so? We have the biological blueprint in the 50s

00:12:15.379 --> 00:12:18.759
and 60s. Fukushima builds the neocognitron in

00:12:18.759 --> 00:12:22.000
1980. The math for training these networks is

00:12:22.000 --> 00:12:24.340
figured out in the late 80s. Yamaguchi brings

00:12:24.340 --> 00:12:27.120
in max pooling in 1990. It was a long buildup.

00:12:27.379 --> 00:12:29.440
Here's where it gets really interesting. The

00:12:29.440 --> 00:12:32.059
text even points out this wild fact. A researcher

00:12:32.059 --> 00:12:34.960
named Yann LeCun developed a CNN called LeNet-5

00:12:34.960 --> 00:12:38.360
that was successfully reading millions of handwritten

00:12:38.360 --> 00:12:41.179
bank checks a day for the NCR corporation back

00:12:41.179 --> 00:12:44.649
in 1996. Yeah, LeNet-5, it was incredibly successful

00:12:44.649 --> 00:12:47.750
for that specific constrained task. Right, so

00:12:47.750 --> 00:12:51.049
if they had the math, the layers, the pooling,

00:12:51.389 --> 00:12:55.490
and commercial viability in 1996, why did we

00:12:55.490 --> 00:12:59.429
wait until the 2010s for the massive world-changing

00:12:59.429 --> 00:13:02.769
AI boom we are living in now? It's a great question.

00:13:02.870 --> 00:13:05.190
The math was clearly there. The hardware must

00:13:05.190 --> 00:13:08.070
have been failing them. That is exactly the bottleneck.

00:13:08.200 --> 00:13:10.720
The math required to run hundreds of convolution

00:13:10.720 --> 00:13:14.200
filters across thousands of images is extraordinarily

00:13:14.200 --> 00:13:16.460
computationally intense. Because even with the

00:13:16.460 --> 00:13:18.620
shortcuts, it's still a lot of math. Tons of

00:13:18.620 --> 00:13:20.779
it. In the 90s, we were trying to run these networks

00:13:20.779 --> 00:13:23.700
on standard central processing units, or CPUs.

00:13:24.139 --> 00:13:27.000
Think of a CPU like a brilliant mathematics professor.

00:13:27.460 --> 00:13:29.899
The professor can solve incredibly complex problems,

00:13:29.960 --> 00:13:32.019
but they do them sequentially, one after another.

00:13:32.139 --> 00:13:34.519
They read a problem, solve it, and move to the

00:13:34.519 --> 00:13:37.200
next. So they are smart. but slow if there's

00:13:37.200 --> 00:13:39.740
a huge stack of simple problems. Right. That

00:13:39.740 --> 00:13:42.000
is great for running a spreadsheet or an operating

00:13:42.000 --> 00:13:45.179
system, but it is terrible for the massive simultaneous

00:13:45.179 --> 00:13:48.240
calculations required by a deep CNN. They were

00:13:48.240 --> 00:13:50.500
trying to dig the Panama Canal with a collection

00:13:50.500 --> 00:13:53.220
of really nice teaspoons. Exactly. It doesn't

00:13:53.220 --> 00:13:55.759
matter how good the teaspoon is, the sheer volume

00:13:55.759 --> 00:13:58.500
of dirt is the problem. So what changed? The

00:13:58.500 --> 00:14:00.840
breakthrough didn't come from AI researchers

00:14:00.840 --> 00:14:03.259
changing the math. It came entirely by accident

00:14:03.259 --> 00:14:05.919
from the video game industry. Video games. In

00:14:05.919 --> 00:14:08.659
the mid-2000s, video games were getting highly

00:14:08.659 --> 00:14:11.399
realistic, requiring millions of pixels and polygons

00:14:11.399 --> 00:14:14.059
to be rendered on screen at the exact same time.

00:14:14.100 --> 00:14:17.700
Oh, to draw all the 3D graphics. Yes. To do this,

00:14:18.019 --> 00:14:20.220
companies built graphics processing units, or

00:14:20.220 --> 00:14:23.820
GPUs. If a CPU is a brilliant professor doing

00:14:23.820 --> 00:14:27.059
one problem at a time, a GPU is an army of

00:14:27.059 --> 00:14:29.600
10,000 middle schoolers doing basic arithmetic

00:14:29.600 --> 00:14:32.059
all at the exact same time. Parallel processing.

00:14:32.419 --> 00:14:34.840
Instead of sliding one filter across an image

00:14:34.840 --> 00:14:37.879
step by step, a GPU can calculate the entire

00:14:37.879 --> 00:14:40.259
feature map in an instant. And researchers started

00:14:40.259 --> 00:14:43.019
to realize this was the key. In 2011, a team

00:14:43.019 --> 00:14:46.419
at an institute called IDSIA managed to accelerate

00:14:46.419 --> 00:14:49.700
CNN training by an astonishing 60 times compared

00:14:49.700 --> 00:14:53.899
to a CPU by using GPUs. 60 times faster. That

00:14:53.899 --> 00:14:55.779
is the difference between an experiment taking

00:14:55.779 --> 00:14:58.200
two months to run and taking one single day.

00:14:58.269 --> 00:15:01.110
It's a game changer. That fundamentally changes

00:15:01.110 --> 00:15:03.909
the entire iteration cycle of science. You can

00:15:03.909 --> 00:15:07.230
test a hypothesis, fail, tweak it, and try again

00:15:07.230 --> 00:15:09.250
before the weekend. It changed everything. And

00:15:09.250 --> 00:15:12.049
the catalytic event, the moment the AI boom officially

00:15:12.049 --> 00:15:15.350
started, was the 2012 ImageNet challenge. What

00:15:15.350 --> 00:15:17.769
was ImageNet? It was a massive competition to

00:15:17.769 --> 00:15:20.549
see which software could categorize millions

00:15:20.549 --> 00:15:23.769
of images best. A researcher named Alex Krizhevsky

00:15:23.769 --> 00:15:27.230
and his team unleashed a GPU-based CNN called

00:15:27.230 --> 00:15:30.039
AlexNet. And how did it do? It used all the building

00:15:30.039 --> 00:15:32.179
blocks we just talked about, convolutional layers,

00:15:32.620 --> 00:15:36.379
max pooling, ReLU activations, and GPUs, and

00:15:36.379 --> 00:15:38.320
it absolutely crushed the competition. It wasn't

00:15:38.320 --> 00:15:40.340
even close. Wow. It was an earth-shattering

00:15:40.340 --> 00:15:42.600
moment for computer science. We've spent a lot

00:15:42.600 --> 00:15:45.320
of time establishing how CNNs conquered images

00:15:45.320 --> 00:15:48.039
and visual data. But one of the most surprising

00:15:48.039 --> 00:15:50.620
revelations in this source material is that once

00:15:50.620 --> 00:15:53.820
researchers realized how good CNNs were at finding

00:15:53.820 --> 00:15:56.159
spatial relationships, they started using them

00:15:56.159 --> 00:15:58.340
for things that aren't visual. Because when you

00:15:58.340 --> 00:16:00.360
get right down to it, a convolutional neural

00:16:00.360 --> 00:16:03.019
network doesn't actually know what an image or

00:16:03.019 --> 00:16:05.159
a photograph is. It just sees numbers. Right,

00:16:05.259 --> 00:16:07.379
it just sees a grid of numbers and looks for

00:16:07.379 --> 00:16:09.759
local patterns between those numbers. So if you

00:16:09.759 --> 00:16:12.100
could figure out a way to turn something, anything

00:16:12.100 --> 00:16:15.620
in the real world into a grid, the CNN can look

00:16:15.620 --> 00:16:17.919
at it just like it looks at a picture of a cat.

00:16:18.120 --> 00:16:20.419
It's like translating reality into a spatial

00:16:20.419 --> 00:16:23.720
map. That realization opened the floodgates.

00:16:24.000 --> 00:16:26.080
If we connect this to the bigger picture, think

00:16:26.080 --> 00:16:29.000
about a Go board. The board game Go. Yeah. It is

00:16:29.000 --> 00:16:31.620
literally a two-dimensional grid of black and

00:16:31.620 --> 00:16:35.019
white stones. Researchers fed the board states

00:16:35.019 --> 00:16:38.320
of millions of games into a CNN, allowing the

00:16:38.320 --> 00:16:41.080
network to find spatial patterns in winning strategies.

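That board-as-grid encoding is easy to make concrete. A minimal sketch; the 1/-1/0 scheme is one common convention for illustration, not AlphaGo's actual input format:

```python
# A Go position as a 19x19 grid of numbers:
# 1 = black stone, -1 = white stone, 0 = empty point.
# To a CNN this is just another "image" to slide filters over.

SIZE = 19
board = [[0] * SIZE for _ in range(SIZE)]

board[3][3] = 1      # black stone near one corner
board[3][15] = -1    # white stone near another
board[15][15] = 1    # black stone

stones = sum(abs(v) for row in board for v in row)
print(stones)  # 3 stones placed on the grid
```

A convolution filter sliding over this grid picks up local stone shapes the same way it picks up edges in a photograph.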
00:16:41.240 --> 00:16:43.279
Oh, that makes total sense. That was fundamental

00:16:43.279 --> 00:16:45.879
to AlphaGo, the system that famously beat the

00:16:45.879 --> 00:16:48.620
world's best human player at a game once thought

00:16:48.620 --> 00:16:51.259
too complex for machines. But the text mentions

00:16:51.259 --> 00:16:54.080
drug discovery, too. How do you turn a disease

00:16:54.080 --> 00:16:57.159
into a grid? Well, a chemist looked at this technology

00:16:57.159 --> 00:16:59.840
and asked, wait, isn't a molecule just a three-dimensional

00:16:59.840 --> 00:17:02.620
grid of atoms? Carbon is at this

00:17:02.620 --> 00:17:05.180
coordinate. Oxygen is at that coordinate. In

00:17:05.180 --> 00:17:09.099
2015, a system called AtomNet used a 3D CNN to

00:17:09.099 --> 00:17:11.259
look at representations of chemical interactions.

00:17:11.519 --> 00:17:13.839
So instead of a 2D picture, it's a 3D space.

00:17:14.299 --> 00:17:17.309
Exactly. Just like a 2D CNN learns to compose

00:17:17.309 --> 00:17:19.730
small visual edges into the shape of an ear or

00:17:19.730 --> 00:17:22.130
an eye, AtomNet learned to discover the shapes

00:17:22.130 --> 00:17:24.549
of chemical features like hydrogen bonding. That's

00:17:24.549 --> 00:17:27.269
brilliant. By sliding 3D filters through the

00:17:27.269 --> 00:17:29.390
molecular grid, it was used to predict novel

00:17:29.390 --> 00:17:32.670
treatments for the Ebola virus and multiple sclerosis.

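The molecule-as-grid idea can be sketched in the same spirit. Everything here is illustrative: the grid size, the coordinates, and the one-channel-per-element scheme are assumptions for the sketch, not AtomNet's actual representation:

```python
# Dropping atoms into a 3D voxel grid, one channel per element type.
# A 3D CNN then slides a small cube of weights through these channels,
# just as a 2D CNN slides a square over a photo.

GRID = 8                    # an 8x8x8 voxel grid (illustrative)
elements = ["C", "O"]       # one 3D channel per element type

voxels = {e: [[[0] * GRID for _ in range(GRID)] for _ in range(GRID)]
          for e in elements}

# Hypothetical atom positions: (element, x, y, z) in voxel coordinates.
atoms = [("C", 1, 2, 3), ("C", 2, 2, 3), ("O", 2, 3, 3)]
for element, x, y, z in atoms:
    voxels[element][x][y][z] = 1

occupied = sum(v for grid in voxels.values()
                 for plane in grid for row in plane for v in row)
print(occupied)  # 3 occupied voxels
```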
00:17:33.329 --> 00:17:36.029
That is incredible. And what about time? The

00:17:36.029 --> 00:17:38.349
text mentions forecasting financial markets and

00:17:38.349 --> 00:17:41.450
weather patterns. How does a CNN look at time?

00:17:41.710 --> 00:17:44.170
Time is just a one-dimensional grid. I never

00:17:44.170 --> 00:17:45.990
thought of it like that. It's a line of data

00:17:45.990 --> 00:17:48.210
points moving forward. Researchers found that

00:17:48.210 --> 00:17:50.809
if you treat time dependencies like visual patterns

00:17:50.809 --> 00:17:53.630
using 1D convolutions sliding along a timeline,

00:17:54.150 --> 00:17:56.549
CNNs can actually outperform traditional forecasting

00:17:56.549 --> 00:17:58.490
networks. Because they just look for shapes in

00:17:58.490 --> 00:18:02.430
the data. Yes, they spot local patterns in stock

00:18:02.430 --> 00:18:05.450
dips or temperature changes, they are computationally

00:18:05.450 --> 00:18:08.049
more efficient, and they don't suffer from the

00:18:08.049 --> 00:18:10.670
learning signals dying out during training. So

00:18:10.670 --> 00:18:13.089
what does this all mean if we step back and look

00:18:13.089 --> 00:18:15.589
at the journey we've taken today? We started

00:18:15.589 --> 00:18:18.069
by peering into the visual cortex of a cat in

00:18:18.069 --> 00:18:21.490
the 1950s, watching biological cells react to

00:18:21.490 --> 00:18:24.150
straight lines and overlapping fields of vision.

00:18:24.269 --> 00:18:26.450
A long way from where we are now. We watched

00:18:26.450 --> 00:18:29.369
that biological mosaic get translated into mathematical

00:18:29.369 --> 00:18:31.970
layers. We saw how shared weights and sliding

00:18:31.970 --> 00:18:34.829
filters solve the curse of dimensionality, stopping

00:18:34.829 --> 00:18:36.849
the computers from collapsing under millions

00:18:36.849 --> 00:18:39.750
of connections. And we threw away the exact coordinates

00:18:39.750 --> 00:18:42.069
with max pooling. Right, which gave the system

00:18:42.069 --> 00:18:44.400
the flexibility to recognize a face even

00:18:44.400 --> 00:18:47.119
if it moved, focusing on the big picture. We

00:18:47.119 --> 00:18:49.519
traced how this math lay dormant for decades,

00:18:49.980 --> 00:18:52.400
quietly reading bank checks, until the video

00:18:52.400 --> 00:18:54.440
game industry accidentally built the perfect

00:18:54.440 --> 00:18:56.880
parallel processing engine to unleash it. Those

00:18:56.880 --> 00:19:00.259
GPUs really saved the day. And now, we are using

00:19:00.259 --> 00:19:03.440
that exact same architecture to design life-saving

00:19:03.440 --> 00:19:06.880
drugs and forecast financial markets simply by

00:19:06.880 --> 00:19:10.140
turning reality into a grid. It is a remarkable

00:19:10.140 --> 00:19:12.960
synthesis of biology, mathematics, and hardware

00:19:12.960 --> 00:19:15.630
engineering. At its core, you have to remember

00:19:15.630 --> 00:19:18.589
that a CNN is a system designed to discover what

00:19:18.589 --> 00:19:21.349
is truly important in a massive sea of data.

00:19:22.210 --> 00:19:24.750
In the early days of AI, humans tried to hand

00:19:24.750 --> 00:19:27.490
code every rule. If you see two circles and a

00:19:27.490 --> 00:19:31.220
triangle, it is a face. But with CNNs, we don't

00:19:31.220 --> 00:19:33.359
hand engineer the rules or the filters anymore.

00:19:33.539 --> 00:19:35.759
The machine does it. We just set up the architecture,

00:19:36.140 --> 00:19:38.599
feed it the grid of data, and the network learns

00:19:38.599 --> 00:19:40.779
how to see the patterns all on its own without

00:19:40.779 --> 00:19:43.599
human bias. But as much as we talk about how

00:19:43.599 --> 00:19:46.099
CNNs replicate human vision, I want to leave

00:19:46.099 --> 00:19:48.279
you with a fascinating contradiction pulled right

00:19:48.279 --> 00:19:50.109
from the source material. Let's hear it. We've

00:19:50.109 --> 00:19:52.569
built these incredibly powerful networks, and

00:19:52.569 --> 00:19:55.589
they're capable of instantly classifying incredibly

00:19:55.589 --> 00:19:58.230
fine-grained categories that totally confuse

00:19:58.230 --> 00:20:02.819
humans. They can distinguish between two nearly identical

00:20:02.819 --> 00:20:05.839
breeds of dogs in a fraction of a second, spotting

00:20:05.839 --> 00:20:08.500
spatial relationships in the fur patterns that

00:20:08.500 --> 00:20:11.180
our eyes skim right over. They're completely

00:20:11.180 --> 00:20:13.880
superhuman in that specific regard. Superhuman.

00:20:14.440 --> 00:20:16.900
Yet the text points out that these exact same

00:20:16.900 --> 00:20:19.680
advanced algorithms can easily get confused by

00:20:19.680 --> 00:20:22.500
things that make no sense to us. Like what? A

00:20:22.500 --> 00:20:26.019
CNN might confidently classify a school bus as

00:20:26.019 --> 00:20:28.279
a completely different object, just because a

00:20:28.279 --> 00:20:30.299
tiny ant is resting on the camera lens. Just

00:20:30.299 --> 00:20:33.539
a few pixels out of place. Exactly. Or gets completely

00:20:33.539 --> 00:20:35.559
thrown off by an image that has been slightly

00:20:35.559 --> 00:20:37.759
distorted with a simple digital camera filter.

00:20:38.660 --> 00:20:41.000
A human toddler wouldn't be fooled for a single

00:20:41.000 --> 00:20:44.440
second by a grainy photo of a bus. But the AI

00:20:44.440 --> 00:20:47.220
is baffled. It's a totally different way of processing

00:20:47.220 --> 00:20:49.700
the world. Which leaves us with a profound question

00:20:49.700 --> 00:20:52.039
to ponder long after you finish listening today.

00:20:52.700 --> 00:20:55.230
What does it actually mean for our future that

00:20:55.230 --> 00:20:57.430
the machines we specifically built to replicate

00:20:57.430 --> 00:21:00.190
human vision are rapidly developing an entirely

00:21:00.190 --> 00:21:03.769
alien way of seeing our world? Keep diving deep.
