WEBVTT

00:00:00.000 --> 00:00:03.740
So picture this. It's 2017, and a group of researchers

00:00:03.740 --> 00:00:07.160
walk up to just a standard red octagonal stop

00:00:07.160 --> 00:00:10.060
sign at a really busy intersection. And they

00:00:10.060 --> 00:00:11.880
take a few black and white stickers and just

00:00:11.880 --> 00:00:14.220
sort of place them carefully over a couple of

00:00:14.220 --> 00:00:15.939
the white letters. Just normal stickers. Yeah,

00:00:16.000 --> 00:00:19.940
exactly. To you or me or any human driver passing

00:00:19.940 --> 00:00:23.420
by, it just looks like minor meaningless vandalism.

00:00:23.579 --> 00:00:25.280
Like, you would obviously still hit the brakes.

00:00:25.280 --> 00:00:27.760
Of course. But to the artificial intelligence

00:00:27.760 --> 00:00:31.160
driving a two-ton autonomous vehicle, those

00:00:31.160 --> 00:00:34.179
specific stickers mathematically transform that

00:00:34.179 --> 00:00:37.579
stop sign into a speed limit 45 sign. Wow. Yeah,

00:00:37.960 --> 00:00:40.000
the car wouldn't stop. It would literally just

00:00:40.000 --> 00:00:42.600
accelerate into the intersection. It's a terrifying

00:00:42.600 --> 00:00:45.039
scenario, honestly, but it perfectly illustrates

00:00:45.039 --> 00:00:47.280
this central paradox we're dealing with

00:00:47.280 --> 00:00:50.939
today. We've built these machines that can perform

00:00:50.939 --> 00:00:54.179
seemingly miraculous superhuman feats. Right, right.

00:00:54.179 --> 00:00:57.219
Yet they can be completely derailed by just

00:00:57.219 --> 00:00:59.929
a few strategically placed stickers. And the craziest

00:00:59.929 --> 00:01:02.649
part is that the engineers who built the system

00:01:02.649 --> 00:01:06.430
often can't tell you exactly why the car decided

00:01:06.430 --> 00:01:09.849
to accelerate. Like the transparency of traditional

00:01:09.849 --> 00:01:12.450
machines where you turn a gear here and a piston

00:01:12.450 --> 00:01:14.609
moves there, that's completely evaporated. Yeah,

00:01:14.650 --> 00:01:16.430
it's totally gone. We're basically staring at

00:01:16.430 --> 00:01:19.909
a black box. So today we are opening that box.

00:01:20.170 --> 00:01:23.510
We're taking a comprehensive deep dive into the

00:01:23.510 --> 00:01:25.829
source material surrounding deep learning. It's

00:01:25.829 --> 00:01:28.750
about time. OK, let's unpack this. Our mission

00:01:28.750 --> 00:01:32.090
today is to cut through the heavy academic jargon,

00:01:32.450 --> 00:01:34.829
trace its surprisingly ancient history, figure

00:01:34.829 --> 00:01:37.530
out how it's discovering new medicines and reshaping

00:01:37.530 --> 00:01:40.310
your world, and uncover the hidden vulnerabilities

00:01:40.310 --> 00:01:43.329
that actually affect you directly. Because as

00:01:43.329 --> 00:01:45.450
you'll see, you are far more connected to the

00:01:45.450 --> 00:01:47.150
inner workings of this technology than you might

00:01:47.150 --> 00:01:49.370
realize. I mean, your daily habits are actually

00:01:49.370 --> 00:01:51.790
fueling it. Exactly. So let's start from the

00:01:51.790 --> 00:01:53.950
very beginning. What is the actual anatomy of

00:01:53.950 --> 00:01:55.849
this machine? Like, when we throw around the

00:01:55.849 --> 00:01:58.530
buzzword deep learning, what is physically happening

00:01:58.530 --> 00:02:01.370
inside the code? Well, at its core, deep learning

00:02:01.370 --> 00:02:04.590
is a specific class of machine learning. And

00:02:04.590 --> 00:02:07.049
the entire architecture is heavily inspired by

00:02:07.049 --> 00:02:09.370
biological neuroscience. Like the human brain.

00:02:09.560 --> 00:02:12.300
Right, specifically how the human brain processes

00:02:12.300 --> 00:02:14.659
information. It revolves around what we call

00:02:14.659 --> 00:02:18.060
artificial neural networks. So think of it as

00:02:18.060 --> 00:02:20.460
stacking artificial neurons into discrete layers.

00:02:21.180 --> 00:02:23.560
Data comes in, and it passes through these layers.

00:02:23.879 --> 00:02:26.659
Now, inside the code, these artificial neurons

00:02:26.659 --> 00:02:29.919
are connected by synapses, and every single connection

00:02:29.919 --> 00:02:31.939
has specific weights and biases. Hold on, let

00:02:31.939 --> 00:02:34.120
me stop you right there. In human biology, a

00:02:34.120 --> 00:02:37.419
synapse is the physical gap between two brain

00:02:37.419 --> 00:02:39.419
cells where chemicals jump across, right? Yeah,

00:02:39.560 --> 00:02:42.120
neurotransmitters. Right. So what is a synapse

00:02:42.120 --> 00:02:44.599
in lines of code, and what do you mean by weights

00:02:44.599 --> 00:02:47.000
and biases? That's an excellent question. So

00:02:47.000 --> 00:02:49.800
in code, a synapse is literally just a mathematical

00:02:49.800 --> 00:02:52.800
connection, passing a number from one artificial

00:02:52.800 --> 00:02:54.900
neuron to the next. That's it. Oh, okay.

00:02:55.000 --> 00:02:58.360
And a weight is essentially a volume knob. It

00:02:58.360 --> 00:03:00.259
determines how important that specific piece

00:03:00.259 --> 00:03:03.699
of incoming data is. Yeah, so if a neuron receives

00:03:03.699 --> 00:03:05.879
data through a connection with a high weight,

00:03:06.580 --> 00:03:09.580
it pays a lot of attention to it. A bias, on

00:03:09.580 --> 00:03:11.740
the other hand, is sort of like a threshold or

00:03:11.740 --> 00:03:14.080
a tripwire. A tripwire. Right, it determines

00:03:14.080 --> 00:03:16.879
how high the total signal needs to be before

00:03:16.879 --> 00:03:19.740
that artificial neuron decides to, you know,

00:03:19.840 --> 00:03:22.159
fire and pass its own signal onto the next layer.

00:03:22.379 --> 00:03:24.500
Okay, I want to make this really concrete for

00:03:24.500 --> 00:03:28.219
anyone listening. Imagine a massive factory assembly

00:03:28.219 --> 00:03:30.639
line and the goal of this factory is to look

00:03:30.639 --> 00:03:32.860
at a photograph and determine if it's a picture

00:03:32.860 --> 00:03:36.439
of a human face. I like this, okay. So raw pixels

00:03:36.439 --> 00:03:39.060
come in at the loading dock. Worker 1 is at the

00:03:39.060 --> 00:03:41.719
very start of the line. Their job is incredibly

00:03:41.719 --> 00:03:44.879
simple, right? Their bias, or their tripwire,

00:03:45.379 --> 00:03:48.439
is only triggered by basic lines and stark color

00:03:48.439 --> 00:03:51.199
contrasts. Just edges. Exactly, edges. That's

00:03:51.199 --> 00:03:52.960
all they care about. They highlight all the edges

00:03:52.960 --> 00:03:55.460
and pass their work to Worker 2. Precisely. And

00:03:55.460 --> 00:03:57.199
Worker 2 has a completely different set of weights

00:03:57.199 --> 00:03:59.259
and biases. Right. They aren't looking at the

00:03:59.259 --> 00:04:01.199
raw pixels anymore. They're looking at the edges

00:04:01.199 --> 00:04:05.400
that Worker 1 found. So Worker 2 combines those

00:04:05.400 --> 00:04:08.419
edges to spot geometric shapes. You know, a curve

00:04:08.419 --> 00:04:11.030
here, a triangle there. So the complexity just

00:04:11.030 --> 00:04:13.509
builds. Worker 3 takes those shapes and suddenly

00:04:13.509 --> 00:04:16.550
spots a nose or an eye. Exactly. And by the time

00:04:16.550 --> 00:04:19.449
all of this compiled abstract info reaches Worker

00:04:19.449 --> 00:04:21.490
4 at the very end of the line, they can look

00:04:21.490 --> 00:04:23.970
at the sum of those complex parts and confidently

00:04:23.970 --> 00:04:26.850
yell, it's a face! Yeah, and that flow you just

00:04:26.850 --> 00:04:29.250
described, that's called a feed-forward neural

00:04:29.250 --> 00:04:32.670
network. The data moves strictly in one direction,

00:04:32.889 --> 00:04:34.870
from the loading dock to the final inspector.

00:04:35.050 --> 00:04:38.129
Makes sense. But our sources also mention recurrent

00:04:38.129 --> 00:04:40.709
neural networks. How does that work? Well, in

00:04:40.709 --> 00:04:42.790
that version of your factory, worker 3 might

00:04:42.790 --> 00:04:45.550
spot an eye and actually send a message back

00:04:45.550 --> 00:04:48.110
up the line to worker 2, saying, hey, I just

00:04:48.110 --> 00:04:49.470
found an eye. You should probably look closer

00:04:49.470 --> 00:04:52.009
for a nose. Oh, wow. So it goes backward, too.

00:04:52.089 --> 00:04:54.269
Yeah, it introduces memory and context into the

00:04:54.269 --> 00:04:56.730
system. That is wild. But if I'm a programmer

00:04:56.730 --> 00:04:58.269
in the traditional sense, I would have to sit

00:04:58.269 --> 00:05:00.430
down and write the mathematical definition of

00:05:00.430 --> 00:05:03.649
an edge and a nose for every single worker, right?

00:05:03.970 --> 00:05:06.589
That sounds exhausting. Well, what's fascinating

00:05:06.589 --> 00:05:10.149
here is that you don't. Really? Yeah. That older

00:05:10.149 --> 00:05:12.750
traditional method is called handcrafted feature

00:05:12.750 --> 00:05:16.550
engineering, and it is incredibly tedious and

00:05:16.550 --> 00:05:19.370
highly prone to human error. I bet. The true

00:05:19.370 --> 00:05:22.410
magic of deep learning is that the system discovers

00:05:22.410 --> 00:05:25.290
these useful representations automatically. Wait.

00:05:25.649 --> 00:05:28.329
Automatically? How? You literally just feed it

00:05:28.329 --> 00:05:30.470
millions of pictures of faces and millions of

00:05:30.470 --> 00:05:32.970
pictures of not faces. And through trial and

00:05:32.970 --> 00:05:35.660
error, the network automatically adjusts its

00:05:35.660 --> 00:05:37.720
own volume knobs, its own weights and biases.

00:05:37.899 --> 00:05:41.000
No way. Yeah, it tweaks them until worker one

00:05:41.000 --> 00:05:43.259
naturally becomes an edge detector and worker

00:05:43.259 --> 00:05:45.759
three naturally becomes an eye detector. It figures

00:05:45.759 --> 00:05:48.319
out the optimal factory line entirely on its

00:05:48.319 --> 00:05:50.720
own. So it's basically a self-organizing factory.

00:05:51.060 --> 00:05:53.139
That's incredible. But that brings up the buzzword

00:05:53.139 --> 00:05:56.110
itself, right? Deep. Companies slap the word

00:05:56.110 --> 00:05:58.889
deep onto every software update these days. Is

00:05:58.889 --> 00:06:01.730
there an actual physical threshold where a neural

00:06:01.730 --> 00:06:04.790
network officially crosses over into being deep

00:06:04.790 --> 00:06:07.110
learning? There actually is a mathematical definition,

00:06:07.170 --> 00:06:10.250
yeah. The deep simply refers to the number of

00:06:10.250 --> 00:06:12.689
layers the data is transformed through. Researchers

00:06:12.689 --> 00:06:14.470
measure this using something called the credit

00:06:14.470 --> 00:06:17.970
assignment path or CAP depth. It's the chain

00:06:17.970 --> 00:06:20.470
of transformations from the input to the output.

00:06:21.069 --> 00:06:23.759
And while there isn't like a universal governing

00:06:23.759 --> 00:06:26.959
body dictating the exact number, the broad consensus

00:06:26.959 --> 00:06:29.620
in the field is that any system with a CAP depth

00:06:29.620 --> 00:06:32.959
higher than 2 is considered deep learning. Just

00:06:32.959 --> 00:06:35.139
more than 2, that doesn't sound very deep. Well,

00:06:35.240 --> 00:06:37.279
a shallow network with just one hidden layer

00:06:37.279 --> 00:06:39.920
can theoretically emulate any function if it's

00:06:39.920 --> 00:06:42.680
wide enough. But adding more layers, going deeper,

00:06:42.680 --> 00:06:46.040
allows the model to extract much richer, exponentially

00:06:46.040 --> 00:06:48.639
more complex features. Okay, so we have this

00:06:48.639 --> 00:06:51.339
multi-layered, self-organizing factory that

00:06:51.339 --> 00:06:54.199
adjusts its own code to process wildly complex

00:06:54.199 --> 00:06:56.680
data. Hearing that, it sounds like something

00:06:56.680 --> 00:06:59.279
dreamed up, you know, five minutes ago in a Silicon

00:06:59.279 --> 00:07:01.500
Valley basement by a startup. It really does.

00:07:02.180 --> 00:07:04.040
But reading through our sources for this deep

00:07:04.040 --> 00:07:07.100
dive, the blueprint for this factory is actually...

00:07:06.910 --> 00:07:10.850
Shockingly ancient. Oh, it really is. The foundational

00:07:10.850 --> 00:07:13.569
concepts go back an entire century. Yeah, a century.

00:07:13.709 --> 00:07:16.550
Yeah. We can trace the earliest non-learning

00:07:16.550 --> 00:07:19.670
precursors back to the 1920s with physicists

00:07:19.670 --> 00:07:22.910
Wilhelm Lenz and Ernst Ising. They created something

00:07:22.910 --> 00:07:25.410
called the Ising model. Wait, the 1920s? People

00:07:25.410 --> 00:07:27.910
were literally driving Model T Fords. What were

00:07:27.910 --> 00:07:29.670
physicists doing with neural networks? Well,

00:07:29.670 --> 00:07:31.490
they weren't trying to build AI. They were trying

00:07:31.490 --> 00:07:33.870
to understand ferromagnetism. Like how magnets

00:07:33.870 --> 00:07:37.449
work. Exactly. The Ising model looked at atoms

00:07:37.449 --> 00:07:40.370
as tiny magnets that could either point up or

00:07:40.370 --> 00:07:42.449
down, depending on the influence of their neighboring

00:07:42.449 --> 00:07:45.449
atoms. OK. If enough neighbors pointed up, the

00:07:45.449 --> 00:07:47.350
atom crossed a threshold and flipped up too.

00:07:47.529 --> 00:07:51.850
Ah, a threshold. Right. Decades later, computer

00:07:51.850 --> 00:07:54.470
scientists looked at that physical model of threshold

00:07:54.470 --> 00:07:58.889
flipping and realized, wait, that's exactly how

00:07:58.889 --> 00:08:02.350
biological neurons fire. That physics concept

00:08:02.350 --> 00:08:04.550
became the ancestor of the artificial neuron.

00:08:04.639 --> 00:08:07.220
So they were using physics to imagine

00:08:07.220 --> 00:08:09.439
decision-making networks before the electronic computer

00:08:09.439 --> 00:08:12.639
was even invented. That is a massive conceptual

00:08:12.639 --> 00:08:15.139
leap. It's huge. And from there, it became sort

00:08:15.139 --> 00:08:17.800
of a multi -generational relay race. How so?

00:08:17.949 --> 00:08:20.949
Well, in 1948, Alan Turing wrote an unpublished

00:08:20.949 --> 00:08:23.410
paper detailing ideas for artificial evolution

00:08:23.410 --> 00:08:25.829
and learning in networks. Turing, of course.

00:08:25.889 --> 00:08:29.350
Yeah. Then, by 1958, Frank Rosenblatt actually

00:08:29.350 --> 00:08:31.970
designed the Perceptron, which was an early network

00:08:31.970 --> 00:08:34.529
with an input layer, a hidden layer, and an output

00:08:34.529 --> 00:08:37.269
layer. Okay, getting closer. And in 1965, Alexey

00:08:37.269 --> 00:08:39.929
Ivakhnenko and V.G. Lapa published the first

00:08:39.929 --> 00:08:42.669
working deep learning algorithm. It was a method

00:08:42.669 --> 00:08:44.850
to train these networks no matter how deep they

00:08:44.850 --> 00:08:46.769
were. Okay, I have to throw a flag on the play

00:08:46.769 --> 00:08:49.340
right here. Go for it. If Ivakhnenko had a working

00:08:49.340 --> 00:08:52.580
algorithm in 1965 and they were publishing papers

00:08:52.580 --> 00:08:56.379
about eight-layer networks in the 70s, why was

00:08:56.379 --> 00:08:58.940
I playing the game Snake on a brick phone in

00:08:58.940 --> 00:09:01.700
the year 2000 instead of talking to ChatGPT?

00:09:01.960 --> 00:09:04.120
That is the big question. Right, like why was

00:09:04.120 --> 00:09:07.299
there a massive multi-decade gap where nothing

00:09:07.299 --> 00:09:09.559
seemed to happen? It's honestly the most crucial

00:09:09.559 --> 00:09:11.870
question in the history of the field. Yeah. Why

00:09:11.870 --> 00:09:14.090
did we endure what's known as a deep learning

00:09:14.090 --> 00:09:17.669
winter? A winter? Yeah. Because the early developers

00:09:17.669 --> 00:09:20.870
hit a colossal mathematical wall known as the

00:09:20.870 --> 00:09:23.470
vanishing gradient problem. What exactly is a

00:09:23.470 --> 00:09:25.289
vanishing gradient? That sounds like a sci-fi

00:09:25.289 --> 00:09:27.309
movie. To understand it, we have to look at how

00:09:27.309 --> 00:09:30.049
the factory line learns. It uses a process called

00:09:30.049 --> 00:09:32.889
backpropagation. Okay. Imagine the final inspector

00:09:32.889 --> 00:09:35.509
at the end of the line looks at the output, realizes

00:09:35.509 --> 00:09:38.370
the network labeled a dog as a toaster, and shouts

00:09:38.370 --> 00:09:41.539
back down the line, We're 90% wrong. Adjust

00:09:41.539 --> 00:09:43.620
your volume knobs. That's a pretty big mistake.

00:09:43.899 --> 00:09:47.080
Huge. But that error signal is calculated using

00:09:47.080 --> 00:09:49.580
a calculus concept called the chain rule, which

00:09:49.580 --> 00:09:51.580
basically involves multiplying fractions as you

00:09:51.580 --> 00:09:54.100
move backward through the layers. Okay, multiplying

00:09:54.100 --> 00:09:55.740
fractions. I think I see where this is going.

00:09:55.860 --> 00:09:57.899
Think of it like playing the game of telephone

00:09:57.899 --> 00:10:00.720
with 100 people. If the error adjustment is,

00:10:00.720 --> 00:10:04.480
say, one-tenth, you multiply 0.1... by point

00:10:04.480 --> 00:10:08.179
one for every layer you go back. Oh, wow. Yeah.

00:10:08.639 --> 00:10:10.740
By the time that error signal reaches the very

00:10:10.740 --> 00:10:13.340
first layer of the network, the number has become

00:10:13.340 --> 00:10:16.059
zero, zero, zero, zero, zero, zero, zero, zero.

00:10:16.139 --> 00:10:18.779
That's basically zero. Exactly. The gradient,

00:10:18.980 --> 00:10:21.679
the signal to change, has literally vanished into

00:10:21.679 --> 00:10:24.379
mathematical zero. The early layers never get

00:10:24.379 --> 00:10:26.679
the feedback, so they never learn. The network

00:10:26.679 --> 00:10:29.179
just freezes. So the math was literally zeroing

00:10:29.179 --> 00:10:31.570
itself out. They had the blueprint, but the physical

00:10:31.570 --> 00:10:33.929
laws of calculus were stopping the construction.

00:10:34.450 --> 00:10:36.789
Exactly. Now, researchers like Sepp Hochreiter

00:10:36.789 --> 00:10:39.649
identified this in the 1990s and eventually developed

00:10:39.649 --> 00:10:41.730
architectural solutions like long short-term

00:10:41.730 --> 00:10:44.889
memory or LSTMs to help the networks remember

00:10:44.889 --> 00:10:47.190
those signals. Oh, LSTMs. I've seen that term.

00:10:47.250 --> 00:10:49.730
Right. But even with the math fixed, they faced

00:10:49.730 --> 00:10:52.309
a devastating physical reality. They possessed

00:10:52.309 --> 00:10:54.889
incredibly limited computing power, and they

00:10:54.889 --> 00:10:57.370
had almost no massive data sets to train on.

00:10:57.480 --> 00:11:00.299
So, to go back to our factory analogy, they figured

00:11:00.299 --> 00:11:02.419
out how to build the conveyor belts, but they

00:11:02.419 --> 00:11:04.960
had absolutely no electricity to turn the factory

00:11:04.960 --> 00:11:07.600
on and no raw materials to actually put on the

00:11:07.600 --> 00:11:09.779
belts. That is the perfect way to frame it. The

00:11:09.779 --> 00:11:12.919
theory was sound, but the 20th century physically

00:11:12.919 --> 00:11:15.149
could not support it. Which naturally brings

00:11:15.149 --> 00:11:17.730
us to the explosion of the 2010s. Because the

00:11:17.730 --> 00:11:20.070
20th century provided the blueprints, the 21st

00:11:20.070 --> 00:11:22.110
century finally flipped the power switch. It

00:11:22.110 --> 00:11:24.509
really did. Let's talk about how we went from

00:11:24.509 --> 00:11:28.669
a frozen theoretical winter to algorithms beating

00:11:28.669 --> 00:11:32.029
humans at highly complex tasks. The paradigm

00:11:32.029 --> 00:11:34.950
shift really came down to a hardware revolution.

00:11:35.870 --> 00:11:39.309
Specifically, the leap from CPUs to GPUs. The

00:11:39.309 --> 00:11:41.429
graphics processing unit. OK, let's break down

00:11:41.429 --> 00:11:44.639
why this changed everything. For decades, traditional

00:11:44.639 --> 00:11:47.440
computing relied on the CPU, the central processing

00:11:47.440 --> 00:11:50.740
unit. Right. Think of a CPU as one absolute super

00:11:50.740 --> 00:11:53.399
genius mathematician. They can do insanely complex

00:11:53.399 --> 00:11:55.399
calculus, but they have to do it sequentially.

00:11:55.940 --> 00:11:58.100
One problem after another. Which is perfectly

00:11:58.100 --> 00:12:00.720
fine for running a word processor or basic operating

00:12:00.720 --> 00:12:03.490
system. Yeah. Right. But... Deep learning doesn't

00:12:03.490 --> 00:12:06.389
require complex calculus done one by one. Our

00:12:06.389 --> 00:12:08.769
factory line requires millions and millions of

00:12:08.769 --> 00:12:10.990
incredibly simple multiplication and addition

00:12:10.990 --> 00:12:13.370
problems adjusting all those tiny weights and

00:12:13.370 --> 00:12:15.909
biases simultaneously. Exactly. So moving to

00:12:15.909 --> 00:12:18.389
GPUs was like realizing that instead of having

00:12:18.389 --> 00:12:20.970
one super genius mathematician, it's actually

00:12:20.970 --> 00:12:23.809
much faster to hire 10,000 average high school

00:12:23.809 --> 00:12:26.370
students to do basic arithmetic all at the exact

00:12:26.370 --> 00:12:29.350
same time. It's parallel processing. Precisely.

00:12:29.649 --> 00:12:31.570
And GPUs were originally designed by the video

00:12:31.570 --> 00:12:33.870
game industry. Really? Video games? Yeah, to

00:12:33.870 --> 00:12:35.789
calculate the lighting and color of millions

00:12:35.789 --> 00:12:38.269
of individual pixels on your screen simultaneously.

00:12:39.210 --> 00:12:41.570
It turned out that architecture was the exact

00:12:41.570 --> 00:12:44.149
parallel processing power neural networks had

00:12:44.149 --> 00:12:47.080
been starving for. That is wild. And the scale

00:12:47.080 --> 00:12:49.059
of this hardware shift is just mind-boggling.

00:12:49.279 --> 00:12:51.580
According to OpenAI, between 2012, which is when

00:12:51.580 --> 00:12:54.080
a network called AlexNet crushed a major image

00:12:54.080 --> 00:12:56.940
recognition competition, and 2017, the amount

00:12:56.940 --> 00:12:59.379
of hardware computation directed at massive deep

00:12:59.379 --> 00:13:03.269
learning projects increased 300,000-fold. 300,000

00:13:03.269 --> 00:13:06.509
times more computing power in just five

00:13:06.509 --> 00:13:08.909
years. That's not a steady evolution. That's

00:13:08.909 --> 00:13:11.350
a sudden violent explosion. It was an explosion.

00:13:11.970 --> 00:13:13.669
And the internet provided the raw materials,

00:13:13.990 --> 00:13:16.570
just massive, unprecedented oceans of digital

00:13:16.570 --> 00:13:18.809
data to feed into the factory. Right. The data

00:13:18.809 --> 00:13:21.029
boom. If we connect this to the bigger picture,

00:13:21.549 --> 00:13:25.590
this combination of GPUs and big data is what

00:13:25.590 --> 00:13:28.690
completely altered the trajectory of modern science.

00:13:28.889 --> 00:13:30.809
Let's look at some of those breakthroughs from

00:13:30.809 --> 00:13:33.009
the sources, because this is where the abstract

00:13:33.009 --> 00:13:36.889
math becomes very, very real. Take DeepMind's

00:13:36.889 --> 00:13:40.570
AlphaFold. For 50 years, biologists were stuck

00:13:40.570 --> 00:13:43.009
on what they called the protein folding problem.

00:13:43.149 --> 00:13:45.509
Right. Because proteins are the building blocks

00:13:45.509 --> 00:13:47.950
of life. They start as a one-dimensional string

00:13:47.950 --> 00:13:50.429
of amino acids, and then they instantly crumple

00:13:50.429 --> 00:13:52.809
up into highly complex three-dimensional knots.

00:13:53.149 --> 00:13:55.850
And the specific 3D shape of that knot dictates

00:13:55.850 --> 00:13:57.710
exactly what the protein does in your body, right?

00:13:57.830 --> 00:14:00.409
Exactly. If scientists know the shape, they can

00:14:00.409 --> 00:14:02.649
design a drug to slot perfectly into it, like

00:14:02.649 --> 00:14:05.529
a key into a lock. But calculating how that string

00:14:05.529 --> 00:14:08.309
will fold using traditional physics used to take

00:14:08.309 --> 00:14:10.889
a PhD student like five years of lab work just

00:14:10.889 --> 00:14:13.529
to map a single protein. It was grueling work.

00:14:13.629 --> 00:14:16.669
But AlphaFold fed the data of known proteins

00:14:16.669 --> 00:14:19.470
into a deep learning network, and the AI learned

00:14:19.470 --> 00:14:22.370
the underlying rules. It can now predict the

00:14:22.370 --> 00:14:25.289
3D structure of almost any protein practically

00:14:25.289 --> 00:14:28.309
instantly. It solved a 50-year grand challenge

00:14:28.309 --> 00:14:30.549
of biology. It's astounding. And then you have

00:14:30.549 --> 00:14:33.269
systems like GraphCast. Right, for weather. Yes.

00:14:33.450 --> 00:14:36.470
Traditional global weather prediction is notoriously

00:14:36.470 --> 00:14:39.870
sluggish because it relies on supercomputers grinding

00:14:39.870 --> 00:14:42.549
through nightmare-inducing systems of partial

00:14:42.549 --> 00:14:45.049
differential equations, fluid dynamics, thermodynamics.

00:14:45.129 --> 00:14:48.509
Sounds awful. It is. But GraphCast bypassed the

00:14:48.509 --> 00:14:50.610
traditional calculus entirely. It just looked

00:14:50.610 --> 00:14:53.029
at decades of historical weather patterns, learned

00:14:53.029 --> 00:14:54.929
how the atmosphere behaves as a data sequence,

00:14:54.929 --> 00:14:57.990
and now it can predict highly detailed global

00:14:57.990 --> 00:15:00.929
weather up to 10 days in advance in under a minute.

00:15:01.009 --> 00:15:03.230
Under a minute for a global forecast. That's

00:15:03.230 --> 00:15:05.629
insane. And it doesn't stop there. In material

00:15:05.629 --> 00:15:08.470
science, Google DeepMind developed a system called

00:15:08.470 --> 00:15:12.769
GNoME. Right, G-N-O-M-E. It discovered over

00:15:12.769 --> 00:15:16.830
two million new stable crystal structures. Two

00:15:16.830 --> 00:15:20.210
million. Yes. It essentially expanded the entire

00:15:20.210 --> 00:15:23.190
catalog of known materials humanity can work

00:15:23.190 --> 00:15:26.029
with. And its predictions were actually validated

00:15:26.029 --> 00:15:29.809
by autonomous robots in a lab with a 71% success

00:15:29.809 --> 00:15:32.259
rate. Wow. I want to pause here for everyone

00:15:32.259 --> 00:15:35.000
listening because this is not ivory tower academic

00:15:35.000 --> 00:15:37.580
theory. This translates directly to your life.

00:15:38.019 --> 00:15:40.559
Absolutely. AlphaFold means exponentially faster

00:15:40.559 --> 00:15:42.940
drug discovery for diseases that might be affecting

00:15:42.940 --> 00:15:45.960
your family right now. Graphcast means highly

00:15:45.960 --> 00:15:48.480
precise, life-saving severe weather alerts

00:15:48.480 --> 00:15:51.120
for your specific neighborhood. And GNoME?

00:15:51.120 --> 00:15:53.990
Those two million new crystal structures

00:15:53.990 --> 00:15:55.809
are the new materials that will make up the next

00:15:55.809 --> 00:15:58.570
generation of highly efficient solar panels or,

00:15:58.570 --> 00:16:00.389
you know, the battery inside your next electric

00:16:00.389 --> 00:16:02.789
vehicle. Exactly. Deep learning is physically

00:16:02.789 --> 00:16:04.970
constructing the future. It is an incredible

00:16:04.970 --> 00:16:08.230
leap forward for humanity. But as with any massive

00:16:08.230 --> 00:16:11.190
technological shift, it comes with dark, highly

00:16:11.190 --> 00:16:13.899
complex complications. Which brings us right

00:16:13.899 --> 00:16:17.179
back to that stop sign at the intersection. Because

00:16:17.179 --> 00:16:19.620
deep learning is discovering new medicines and

00:16:19.620 --> 00:16:22.419
predicting global weather, it is so tempting

00:16:22.419 --> 00:16:26.120
to view it as this flawless, omniscient oracle.

00:16:26.379 --> 00:16:29.259
Oh, definitely. But it's not. Because we don't

00:16:29.259 --> 00:16:31.419
hand code the rules anymore, because the machine

00:16:31.419 --> 00:16:34.620
figures out its own factory line, we have a massive

00:16:34.620 --> 00:16:37.919
vulnerability. We cannot easily interpret how

00:16:37.919 --> 00:16:40.639
it's making its decisions. And this raises an

00:16:40.639 --> 00:16:43.480
important question. What happens when you integrate

00:16:43.480 --> 00:16:46.539
an uninterpretable system into critical infrastructure?

00:16:47.639 --> 00:16:50.460
And more dangerously, what happens when bad actors

00:16:50.460 --> 00:16:52.580
learn how to manipulate the math that we can't

00:16:52.580 --> 00:16:56.409
see? Right, because if an AI has superhuman image

00:16:56.409 --> 00:16:58.909
recognition and can map a protein in seconds,

00:16:59.330 --> 00:17:01.629
how does a human possibly trick it with a few

00:17:01.629 --> 00:17:04.170
stickers on a stop sign? Because the AI doesn't

00:17:04.170 --> 00:17:06.569
see the way you and I see. It doesn't understand

00:17:06.569 --> 00:17:08.829
what a stop sign actually is. It has no concept

00:17:08.829 --> 00:17:10.849
of it. Exactly. It doesn't know about traffic

00:17:10.849 --> 00:17:13.630
laws or metal octagons or the concept of stopping.

00:17:14.029 --> 00:17:16.470
It only knows that a specific mathematical arrangement

00:17:16.470 --> 00:17:18.910
of red and white pixels correlates with the label

00:17:18.910 --> 00:17:22.170
stop sign. And this creates a unique vulnerability

00:17:22.170 --> 00:17:24.619
called adversarial attacks. So the attackers

00:17:24.619 --> 00:17:27.740
aren't tricking the camera itself, they are weaponizing

00:17:27.740 --> 00:17:31.559
the AI's own math against it. Exactly. The researchers

00:17:31.559 --> 00:17:34.319
calculated exactly which pixels to alter on that

00:17:34.319 --> 00:17:37.480
stop sign to maximize the mathematical probability

00:17:37.480 --> 00:17:41.220
of a different label. To humans, the stickers

00:17:41.220 --> 00:17:44.700
are just random. But to the AI? To the AI, the

00:17:44.700 --> 00:17:47.619
altered pixel data forces the layers of the network

00:17:47.619 --> 00:17:50.420
into a completely wrong, yet highly confident

00:17:50.420 --> 00:17:52.980
conclusion. That is so creepy. And the sources

00:17:52.980 --> 00:17:56.000
mention another terrifying example. Researchers

00:17:56.000 --> 00:17:58.319
tricked a facial recognition system into thinking

00:17:58.319 --> 00:18:00.900
ordinary people were famous celebrities just

00:18:00.900 --> 00:18:03.299
by having the ordinary people wear a pair of

00:18:03.299 --> 00:18:06.039
specifically designed psychedelic-looking spectacles.

00:18:06.240 --> 00:18:08.799
Psychedelic glasses. Yeah. The patterns on the

00:18:08.799 --> 00:18:11.200
glasses hijacked the network's mathematical feature

00:18:11.200 --> 00:18:13.380
detectors. And the threats go beyond just visual

00:18:13.380 --> 00:18:16.039
illusions. There's also data poisoning. What's

00:18:16.039 --> 00:18:18.740
that? This is where bad actors smuggle false

00:18:18.740 --> 00:18:20.980
or malicious data into the massive data sets

00:18:20.980 --> 00:18:22.819
used to train the models in the first place.

00:18:22.920 --> 00:18:26.039
Oh, I see. Imagine secretly slipping defective

00:18:26.039 --> 00:18:29.019
raw materials onto the loading dock of our factory.

00:18:29.580 --> 00:18:33.259
The goal is to slowly corrupt the network's internal

00:18:33.259 --> 00:18:36.220
weights and biases over time, causing it to fail

00:18:36.220 --> 00:18:39.839
in very specific targeted scenarios. That's insidious.

00:18:39.940 --> 00:18:42.140
And because the entire system is a black box,

00:18:42.740 --> 00:18:44.779
it is incredibly difficult to know if your training

00:18:44.779 --> 00:18:48.119
data has been poisoned until the AI makes a catastrophic

00:18:48.119 --> 00:18:51.029
error in the real world. That is deeply unsettling.

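[Editor's note: the data-poisoning idea above can be sketched as a toy label-flipping attack. Everything here is illustrative — the function name, the "stop"/"speed_45" labels, and the flip-every-n-th rule are assumptions for demonstration; real poisoning attacks are typically subtler, such as embedding hidden backdoor triggers in the training data.]

```python
def poison_labels(dataset, target_label, poison_label, every_n=10):
    """Label-flipping sketch of data poisoning: quietly relabel
    every n-th example of one class so a model trained on the
    result misfires on that class in targeted scenarios."""
    poisoned, seen = [], 0
    for features, label in dataset:
        if label == target_label:
            seen += 1
            if seen % every_n == 0:
                label = poison_label  # the smuggled "defective raw material"
        poisoned.append((features, label))
    return poisoned

# 100 stop-sign examples and 100 speed-limit examples (toy features).
clean = [([i], "stop") for i in range(100)] + [([i], "speed_45") for i in range(100)]
dirty = poison_labels(clean, "stop", "speed_45", every_n=10)
flipped = sum(c != d for (_, c), (_, d) in zip(clean, dirty))
print(flipped)  # -> 10: only 5% of the whole set, easy to miss
```

This is why the hosts call it insidious: the corrupted labels are a small fraction of a massive dataset, and nothing looks wrong until the trained model fails on exactly the targeted class.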
00:18:51.190 --> 00:18:53.710
But there's another angle to this massive reliance

00:18:53.710 --> 00:18:55.470
on data that I want to bring up, and it's about

00:18:55.470 --> 00:18:57.990
the human cost. Right. Here's where it gets really

00:18:57.990 --> 00:18:59.549
interesting, because according to philosopher

00:18:59.549 --> 00:19:02.529
Rainer Muehlhoff, we aren't just the beneficiaries

00:19:02.529 --> 00:19:05.190
of this technology, we are the raw materials.

00:19:05.349 --> 00:19:08.640
Yes. This is a vital concept. Muehlhoff describes

00:19:08.640 --> 00:19:11.319
this as machinic capture. Machinic capture. So

00:19:11.319 --> 00:19:14.099
when we think of human labor in AI, we usually

00:19:14.099 --> 00:19:17.779
picture giant warehouses of poorly paid click

00:19:17.779 --> 00:19:20.299
workers explicitly hired to label thousands of

00:19:20.299 --> 00:19:22.900
images. But Muehlhoff identifies something far

00:19:22.900 --> 00:19:26.299
more pervasive, implicit human micro work. This

00:19:26.299 --> 00:19:29.079
is crucial for everyone listening. You are constantly

00:19:29.079 --> 00:19:31.940
providing unpaid labor to train these systems.

00:19:32.299 --> 00:19:35.119
Wait, really? Every single time you log into

00:19:35.119 --> 00:19:37.640
a secure website and have to solve an image

00:19:38.000 --> 00:19:41.420
CAPTCHA, you know, the grid of photos where it

00:19:41.420 --> 00:19:43.700
asks you to click all the squares containing

00:19:43.700 --> 00:19:46.180
a crosswalk or a traffic light or a bicycle.

00:19:46.420 --> 00:19:48.519
Wait, so when I click on those traffic lights

00:19:48.519 --> 00:19:51.079
to prove I'm not a robot, I'm actually doing

00:19:51.079 --> 00:19:53.960
unpaid data labeling labor to train a robot.

00:19:54.220 --> 00:19:57.009
Precisely. You are confirming the ground truth

00:19:57.009 --> 00:20:00.049
for an autonomous driving algorithm. Mind blown.

00:20:00.190 --> 00:20:02.589
And it's everywhere. When you tag a friend's

00:20:02.589 --> 00:20:05.589
face in a photo on Facebook, you are freely providing

00:20:05.589 --> 00:20:09.009
high quality labeled facial images to their deep

00:20:09.009 --> 00:20:11.650
learning system. Wow. When you use a gamified

00:20:11.650 --> 00:20:14.309
language app like Duolingo, your translation

00:20:14.309 --> 00:20:16.690
choices help refine their natural language processing

00:20:16.690 --> 00:20:19.569
networks. Unbelievable. When you wear a quantified

00:20:19.569 --> 00:20:22.359
self activity tracker to monitor your sleep

00:20:22.359 --> 00:20:25.440
or count your steps. That massive stream of biometric

00:20:25.440 --> 00:20:28.140
data can be mined to train predictive health

00:20:28.140 --> 00:20:31.000
models. It's brilliantly insidious. Muehlhoff points

00:20:31.000 --> 00:20:33.079
out that this capture of human labor is deeply

00:20:33.079 --> 00:20:35.460
embedded in our social motivations. We don't

00:20:35.460 --> 00:20:37.200
even realize we are standing on the assembly

00:20:37.200 --> 00:20:39.599
line. Not at all. We're just living our digital

00:20:39.599 --> 00:20:42.960
lives, trying to log into our bank or track a

00:20:42.960 --> 00:20:46.279
run. But every click, every tag, and every CAPTCHA

00:20:46.279 --> 00:20:50.279
is harvested. We are the massive unpaid sensor

00:20:50.279 --> 00:20:52.619
network feeding the billion dollar models. We

00:20:52.619 --> 00:20:55.079
are entirely integrated into the machine. We've

00:20:55.079 --> 00:20:57.039
covered an incredible amount of ground today.

00:20:57.279 --> 00:21:00.099
We started with physicists in the 1920s imagining

00:21:00.099 --> 00:21:03.519
threshold networks to understand magnets. We

00:21:03.519 --> 00:21:06.799
saw how those concepts evolved into layered artificial

00:21:06.799 --> 00:21:09.230
neural networks, which hit a mathematical wall

00:21:09.230 --> 00:21:12.750
and froze for decades. Then, a modern explosion

00:21:12.750 --> 00:21:15.130
in parallel GPU hardware and the rise of the

00:21:15.130 --> 00:21:17.730
internet provided the massive scale needed to

00:21:17.730 --> 00:21:20.369
thaw the winter. Exactly. And now, we have world-changing

00:21:20.369 --> 00:21:22.549
applications solving 50-year-old

00:21:22.549 --> 00:21:25.049
grand challenges in biology and material science.

00:21:25.529 --> 00:21:28.329
Yet these exact same systems carry hidden, easily

00:21:28.329 --> 00:21:31.029
hackable vulnerabilities, and they rely entirely

00:21:31.029 --> 00:21:33.950
on our invisible, everyday digital labor. It's

00:21:33.950 --> 00:21:36.759
a lot to take in. So... What does this all mean

00:21:36.759 --> 00:21:39.559
for you? It means that as you go about your day,

00:21:39.920 --> 00:21:42.299
you are living through a fundamental, unprecedented

00:21:42.299 --> 00:21:45.400
shift in how human knowledge and capability are

00:21:45.400 --> 00:21:49.259
generated. And you are an active, albeit unwittingly

00:21:49.259 --> 00:21:51.980
recruited, participant in training the very black

00:21:51.980 --> 00:21:55.029
box that is driving that shift. And to leave

00:21:55.029 --> 00:21:56.690
you with a final thought from the research to

00:21:56.690 --> 00:21:59.089
ponder on your own. Yeah, let's hear it. We've

00:21:59.089 --> 00:22:01.630
discussed how easily deep learning can make what

00:22:01.630 --> 00:22:04.769
appear to be incredibly dumb mistakes, like an

00:22:04.769 --> 00:22:07.009
autonomous car being completely fooled by a few

00:22:07.009 --> 00:22:09.769
stickers. Right. This happens because, despite

00:22:09.769 --> 00:22:11.950
its massive processing power and its ability

00:22:11.950 --> 00:22:14.549
to solve complex protein structures, the system

00:22:14.549 --> 00:22:17.720
lacks genuine contextual common sense. Researcher

00:22:17.720 --> 00:22:19.880
Ben Goertzel hypothesized that for these networks

00:22:19.880 --> 00:22:22.519
to truly evolve into something resembling artificial

00:22:22.519 --> 00:22:25.660
general intelligence, they must move beyond simply

00:22:25.660 --> 00:22:28.960
matching abstract statistical patterns. So predicting

00:22:28.960 --> 00:22:31.000
the next pixel or the next word isn't enough.

00:22:31.119 --> 00:22:32.660
They need to understand what they're actually

00:22:32.660 --> 00:22:35.700
looking at. Exactly. Goertzel suggests that they

00:22:35.700 --> 00:22:38.180
need to internally form something called an image

00:22:38.180 --> 00:22:40.960
grammar. They need to learn the actual physical

00:22:40.960 --> 00:22:44.039
rules of the world, gravity, object permanence,

00:22:44.259 --> 00:22:47.670
context. not just the surface level pixel correlations.

00:22:47.670 --> 00:22:49.430
That makes a lot of sense. So the challenge for

00:22:49.430 --> 00:22:53.309
you to consider is this. Will future AI architectures

00:22:53.309 --> 00:22:56.190
eventually learn this underlying grammar and

00:22:56.190 --> 00:22:58.329
the true common sense of our physical reality?

00:22:59.049 --> 00:23:01.730
Or, because of their fundamental design, will

00:23:01.730 --> 00:23:04.250
they always remain brilliant, highly capable,

00:23:04.609 --> 00:23:07.410
but ultimately fragile and easily fooled pattern

00:23:07.410 --> 00:23:09.759
matchers? It brings us right back to where we

00:23:09.759 --> 00:23:11.900
started. We have collectively built the most

00:23:11.900 --> 00:23:14.619
complex, powerful, and consequential machine

00:23:14.619 --> 00:23:17.259
in human history. But until we can look inside

00:23:17.259 --> 00:23:19.779
that black box and understand if it truly comprehends

00:23:19.779 --> 00:23:22.240
our reality or if it is just a mirror reflecting

00:23:22.240 --> 00:23:24.380
our own data back at us, well, we're still just

00:23:24.380 --> 00:23:25.519
guessing at how the gears are turning.
