WEBVTT

00:00:00.000 --> 00:00:01.980
When we think about artificial intelligence,

00:00:02.279 --> 00:00:05.879
we usually picture this flawless robotic mind.

00:00:05.980 --> 00:00:08.560
Right, this infallible steel trap. Yeah, exactly.

00:00:08.980 --> 00:00:11.439
A system that records every single piece of data

00:00:11.439 --> 00:00:14.859
it ever sees perfectly, forever. We imagine it

00:00:14.859 --> 00:00:17.899
literally never forgets anything. It's funny

00:00:17.899 --> 00:00:20.300
because that encyclopedic memory is the science

00:00:20.300 --> 00:00:23.420
fiction dream we've all been sold. But the engineering

00:00:23.420 --> 00:00:25.699
reality of building these systems is actually

00:00:25.699 --> 00:00:28.440
the exact opposite. Which is just wild to think

00:00:28.440 --> 00:00:30.879
about when you look at the actual history of

00:00:30.879 --> 00:00:34.939
how machines learn to process language or sequence

00:00:34.939 --> 00:00:38.200
data. Early AI didn't have a steel trap memory

00:00:38.200 --> 00:00:41.259
at all. No, not at all. It had catastrophic amnesia.

00:00:41.420 --> 00:00:43.600
It was constantly, endlessly forgetting what it

00:00:43.600 --> 00:00:45.920
was doing while it was literally in the middle

00:00:45.920 --> 00:00:48.460
of doing it. Yeah, it was a fundamental computational

00:00:48.460 --> 00:00:51.799
memory crisis. I mean, the system simply could

00:00:51.799 --> 00:00:54.500
not hold on to a thought. Well, welcome to today's

00:00:54.500 --> 00:00:57.640
deep dive. Our mission today is to decode the

00:00:57.640 --> 00:00:59.899
hidden brain architecture that essentially powers

00:00:59.899 --> 00:01:03.380
the entire AI revolution of the 2010s. The invisible

00:01:03.380 --> 00:01:06.260
engine behind it all. Exactly. We're looking

00:01:06.260 --> 00:01:09.120
at a foundational Wikipedia article on a technology

00:01:09.120 --> 00:01:13.340
called long short-term memory, or LSTM. We're

00:01:13.340 --> 00:01:15.560
going to figure out exactly how machines learned

00:01:15.560 --> 00:01:19.659
the very human art of balancing memory and amnesia.

00:01:19.819 --> 00:01:22.219
It's a fascinating journey. It really is. And

00:01:22.219 --> 00:01:24.500
to be clear with you listening right now, whether

00:01:24.500 --> 00:01:26.980
you're asking Siri for the weather, translating

00:01:26.980 --> 00:01:29.959
a menu with Google Translate, or relying on Alexa

00:01:29.959 --> 00:01:32.599
to set a timer, you are directly interacting

00:01:32.599 --> 00:01:35.040
with the legacy of this exact technology. You're

00:01:35.040 --> 00:01:37.439
using it every day. Yep. Today we are popping

00:01:37.439 --> 00:01:40.260
the hood on your smartphone's brain. And to understand

00:01:40.260 --> 00:01:42.400
how we engineered artificial memory, we actually

00:01:42.400 --> 00:01:44.260
have to take a step back and look at cognitive

00:01:44.260 --> 00:01:46.640
psychology. Oh, interesting. Yeah, specifically

00:01:46.640 --> 00:01:48.519
the relationship between long-term and short-

00:01:48.519 --> 00:01:50.980
term memory, which psychologists have been studying

00:01:50.980 --> 00:01:54.340
since, well, the early 20th century. The architects

00:01:54.340 --> 00:01:57.400
of this AI system deliberately named it in analogy

00:01:57.400 --> 00:02:00.319
to how human cognition partitions information.

00:02:00.480 --> 00:02:03.489
Right, long short-term memory. But before

00:02:03.489 --> 00:02:05.750
we can appreciate the absolute genius of the

00:02:05.750 --> 00:02:08.610
LSTM architecture, we have to understand the

00:02:08.610 --> 00:02:10.789
catastrophic failure it was designed to fix.

00:02:11.090 --> 00:02:13.270
We have to talk about the amnesia. Exactly. We

00:02:13.270 --> 00:02:16.409
need to talk about why traditional AI kept forgetting

00:02:16.409 --> 00:02:20.229
everything. Right. So before LSTM, the standard

00:02:20.229 --> 00:02:23.229
approach for handling sequence data, which is

00:02:23.229 --> 00:02:26.990
anything that unfolds over time, like say a paragraph

00:02:26.990 --> 00:02:30.129
of text or an audio recording of speech, was

00:02:30.129 --> 00:02:34.740
a classic recurrent neural network, or RNN. OK,

00:02:34.759 --> 00:02:37.699
an RNN. Yeah. And the premise of a classic RNN

00:02:37.699 --> 00:02:40.819
is that it should be able to track dependencies

00:02:40.819 --> 00:02:43.719
over time. If you feed it a paragraph one word

00:02:43.719 --> 00:02:45.860
at a time, it should mathematically be able to

00:02:45.860 --> 00:02:48.219
connect the context of the first word to the

00:02:48.219 --> 00:02:50.560
meaning of the last word. But they hit a massive

00:02:50.560 --> 00:02:52.500
brick wall called the vanishing gradient problem.

00:02:52.560 --> 00:02:54.879
They did. Yeah. So when you train a neural network,

00:02:54.960 --> 00:02:57.800
you use a process called backpropagation. The

00:02:57.800 --> 00:02:59.740
system makes a prediction, compares it to the

00:02:59.740 --> 00:03:02.159
correct answer, and calculates an error. Right.

00:03:02.479 --> 00:03:04.900
Then it sends a mathematical signal, a gradient,

00:03:04.900 --> 00:03:07.020
backward through the network to adjust the weights.

00:03:07.520 --> 00:03:09.259
It's effectively teaching the machine how to

00:03:09.259 --> 00:03:11.740
fix its mistake for next time. Makes sense. But

00:03:11.740 --> 00:03:14.280
in an RNN processing a sequence, that gradient

00:03:14.280 --> 00:03:16.479
has to travel backward through time, step by

00:03:16.479 --> 00:03:19.319
step, word by word. And here is where the math

00:03:19.319 --> 00:03:22.139
just completely breaks down. Because every time

00:03:22.139 --> 00:03:25.340
that signal moves back one step, it gets multiplied

00:03:25.340 --> 00:03:27.800
by the network's internal weights. Exactly. And

00:03:27.800 --> 00:03:29.740
if those weights are fractional numbers, meaning

00:03:29.740 --> 00:03:32.419
they're smaller than one, you're multiplying

00:03:32.419 --> 00:03:35.539
fractions by fractions over and over again. Which

00:03:35.539 --> 00:03:38.460
causes the value to shrink exponentially fast.

00:03:38.500 --> 00:03:41.979
I mean... If a sequence is 100 steps long, by

00:03:41.979 --> 00:03:43.819
the time the error signal reaches the beginning

00:03:43.819 --> 00:03:46.759
of the sequence, the gradient has literally vanished.

00:03:46.900 --> 00:03:49.680
It tends towards zero. And when the gradient

00:03:49.680 --> 00:03:53.039
becomes zero, the learning signal is dead. The

00:03:53.039 --> 00:03:55.840
model stops updating its weights for those early

00:03:55.840 --> 00:03:58.659
steps, which means it functionally forgets the

00:03:58.659 --> 00:04:01.319
early information it saw. It just drops it. Completely.

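NOTE
A tiny Python sketch of the shrinkage being described, assuming a single
recurrent weight of 0.9 applied at every backward step; real RNNs multiply
by weight matrices, but the exponential decay is the same idea.
weight = 0.9             # hypothetical recurrent weight, smaller than one
gradient = 1.0           # error signal at the final time step
for step in range(100):  # the signal travels 100 steps back through time
    gradient *= weight
print(gradient)          # ~2.7e-05: the learning signal has effectively vanished
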
00:04:01.680 --> 00:04:04.400
And this mathematical wall was first analyzed

00:04:04.400 --> 00:04:07.840
in depth in a 1991 German diploma thesis by a

00:04:07.840 --> 00:04:10.689
researcher named Sepp Hochreiter. OK, let's unpack this for

00:04:10.689 --> 00:04:13.050
a second. Imagine you are playing the game of

00:04:13.050 --> 00:04:15.430
telephone. Oh, OK. But instead of playing it

00:04:15.430 --> 00:04:17.629
with 10 people in a living room, you are playing

00:04:17.629 --> 00:04:20.449
it over thousands of miles with tens of thousands

00:04:20.449 --> 00:04:23.529
of people. That is a long game. Right. So the

00:04:23.529 --> 00:04:26.970
first person whispers a complex, detailed message.

00:04:27.850 --> 00:04:30.550
But every time the message is passed along, a

00:04:30.550 --> 00:04:32.629
tiny fractional amount of the volume is lost.

00:04:33.580 --> 00:04:36.079
By the time that message reaches the end of the

00:04:36.079 --> 00:04:39.060
line thousands of miles away, the original signal

00:04:39.060 --> 00:04:42.399
hasn't just been misunderstood. The volume has

00:04:42.399 --> 00:04:45.040
completely faded away to absolute silence. The

00:04:45.040 --> 00:04:47.180
end of the line gets nothing. You're on the right

00:04:47.180 --> 00:04:50.319
track with the telephone game, but in a neural

00:04:50.319 --> 00:04:52.540
network, it's not the volume of the audio fading.

00:04:52.829 --> 00:04:55.370
What's fascinating here is that it's the mathematical

00:04:55.370 --> 00:04:57.930
penalty for a mistake that shrinks to zero. Oh,

00:04:57.990 --> 00:05:00.589
the penalty shrinks. Exactly. The network makes

00:05:00.589 --> 00:05:03.629
a terrible prediction at step 100. But by the

00:05:03.629 --> 00:05:05.709
time it tries to pass the blame back to step

00:05:05.709 --> 00:05:08.009
one, the blame has been multiplied by so many

00:05:08.009 --> 00:05:10.350
fractions that step one essentially hears, everything

00:05:10.350 --> 00:05:12.449
is fine, don't change a thing. You did great.

00:05:12.629 --> 00:05:15.410
Right. And if the machine cannot carry that mathematical

00:05:15.410 --> 00:05:18.069
penalty across thousands of time steps, it can

00:05:18.069 --> 00:05:20.730
never learn complex, long-term human tasks.

00:05:21.209 --> 00:05:23.800
The system is functionally stuck living in a three-

00:05:23.800 --> 00:05:26.240
second window of the present. So the penalty

00:05:26.240 --> 00:05:28.319
signal in our telephone game is fading out to

00:05:28.319 --> 00:05:31.420
nothing, and researchers in the 1990s realized

00:05:31.420 --> 00:05:34.279
they couldn't just, you know, yell louder. No,

00:05:34.360 --> 00:05:36.560
the math wouldn't allow it. They needed a completely

00:05:36.560 --> 00:05:40.060
new architecture, a new way to selectively boost

00:05:40.060 --> 00:05:42.839
and block information so the signal could actually

00:05:42.839 --> 00:05:46.439
survive the journey. Enter the LSTM cell. Enter

00:05:46.439 --> 00:05:49.100
the LSTM. The core concept was published in a

00:05:49.100 --> 00:05:52.879
landmark 1997 paper by Sepp Hochreiter and Jürgen

00:05:52.879 --> 00:05:55.360
Schmidhuber, and they introduced an anatomy that

00:05:55.360 --> 00:05:57.800
was just revolutionary. What was different about

00:05:57.800 --> 00:05:59.920
it? Well, instead of a simple processing node,

00:06:00.000 --> 00:06:03.100
they built a memory cell. You can visualize the

00:06:03.100 --> 00:06:05.240
cell state as a kind of internal conveyor belt

00:06:05.240 --> 00:06:07.480
that runs straight down the entire chain of the

00:06:07.480 --> 00:06:10.620
sequence with only very minor linear interactions.

00:06:10.740 --> 00:06:12.779
It's essentially an information highway. But

00:06:12.779 --> 00:06:15.079
an open highway would just get completely clogged

00:06:15.079 --> 00:06:17.160
with junk, right? Exactly, which is why they

00:06:17.160 --> 00:06:19.720
added internal regulators called gates to control

00:06:19.720 --> 00:06:21.579
what gets onto the highway and what gets taken

00:06:21.579 --> 00:06:24.879
off. Specifically, an input gate and an output

00:06:24.879 --> 00:06:28.189
gate in the original 1997 design. And then the

00:06:28.189 --> 00:06:30.829
crucial refinement came two years later. Yes.

00:06:31.370 --> 00:06:35.389
In 1999, Felix Gers, Schmidhuber, and Fred Cummins

00:06:35.389 --> 00:06:38.509
introduced the forget gate. The bouncer of the

00:06:38.509 --> 00:06:40.610
memory club. That's a really helpful way to picture

00:06:40.610 --> 00:06:43.129
it, actually. You have these three gates, though

00:06:43.129 --> 00:06:44.790
it's important to understand that these gates

00:06:44.790 --> 00:06:47.290
aren't like physical doors. They're actually

00:06:47.290 --> 00:06:50.449
separate, parallel neural network layers that

00:06:50.449 --> 00:06:53.250
make independent mathematical bets. Right. So

00:06:53.250 --> 00:06:56.230
the forget gate looks at the previous hidden

00:06:56.230 --> 00:06:59.350
state and the current input and passes them

00:06:59.350 --> 00:07:01.970
through a mathematical function, specifically

00:07:01.970 --> 00:07:05.670
a sigmoid function. And the purpose of the sigmoid

00:07:05.670 --> 00:07:08.209
function is basically just to act like a volume

00:07:08.209 --> 00:07:10.149
knob. Pretty much, yeah. It takes any number

00:07:10.149 --> 00:07:12.769
you feed it and squashes it into a value between

00:07:12.769 --> 00:07:15.129
exactly zero and exactly one. And the reason

00:07:15.129 --> 00:07:17.930
that range is so critical is basic multiplication.

00:07:18.300 --> 00:07:21.139
The output of that sigmoid function is multiplied

00:07:21.139 --> 00:07:23.360
against the cell state, that conveyor belt. Because

00:07:23.360 --> 00:07:26.120
if you multiply any piece of data by zero, it

00:07:26.120 --> 00:07:28.300
disappears completely. It's gone. But if you

00:07:28.300 --> 00:07:31.100
multiply it by one, it stays exactly the same.

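NOTE
A quick sketch of the volume-knob behavior just described, in plain Python;
the sigmoid squashes any number into the range zero to one, and multiplying
a stored value by that output erases it, keeps it, or attenuates it.
import math
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))
memory_value = 5.0
print(memory_value * sigmoid(-10.0))  # ~0.0: gate closed, the value is dumped
print(memory_value * sigmoid(10.0))   # ~5.0: gate open, the value is kept
print(memory_value * sigmoid(0.0))    # 2.5: gate half open
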
00:07:31.660 --> 00:07:33.720
So the sigmoid function lets the network turn

00:07:33.720 --> 00:07:36.139
the data all the way up, silence it completely,

00:07:36.519 --> 00:07:38.500
or leave it somewhere in the middle. Zero means

00:07:38.500 --> 00:07:41.519
dump it, one means keep it. Exactly. And the

00:07:41.519 --> 00:07:44.620
input gate uses that exact same zero-to-one volume

00:07:44.620 --> 00:07:47.199
knob to decide which pieces of new information

00:07:47.199 --> 00:07:49.660
are actually worth adding to the conveyor belt.

00:07:49.860 --> 00:07:52.620
Okay. Making sense. And finally, the output gate

00:07:52.620 --> 00:07:55.279
uses the same math to decide what parts of the

00:07:55.279 --> 00:07:57.199
cell state should be revealed to the rest of

00:07:57.199 --> 00:07:59.639
the network for the current prediction. Let's

00:07:59.639 --> 00:08:01.720
use the natural language processing example from

00:08:01.720 --> 00:08:04.360
the sources to ground this, because mapping this

00:08:04.360 --> 00:08:06.980
data pipeline out really clarifies it for you

00:08:06.980 --> 00:08:09.180
listening. Good idea. Imagine the network is

00:08:09.180 --> 00:08:12.480
reading a sequence like this. Dave, as a result

00:08:12.480 --> 00:08:15.319
of his controversial claims, is now a pariah.

00:08:15.439 --> 00:08:17.779
OK, classic sentence structure. Right. So as

00:08:17.779 --> 00:08:19.839
the AI reads that sentence left to right, it

00:08:19.839 --> 00:08:23.439
encounters the word Dave. The input gate recognizes

00:08:23.439 --> 00:08:26.600
a specific feature. Say the subject is a singular

00:08:26.600 --> 00:08:30.060
male. It assigns that feature a value close to

00:08:30.060 --> 00:08:32.519
one, successfully placing it onto the conveyor

00:08:32.519 --> 00:08:34.899
belt. And the network needs to hold on to that

00:08:34.899 --> 00:08:38.200
feature as it processes all that messy middle

00:08:38.200 --> 00:08:40.700
context, right? As a result of his controversial

00:08:40.700 --> 00:08:43.220
claims part. Exactly. It maintains that long-

00:08:43.220 --> 00:08:45.460
term dependency on the conveyor belt without

00:08:45.460 --> 00:08:47.700
letting the math fade out. So when it finally

00:08:47.700 --> 00:08:50.200
reaches the pronoun, it has the context to predict

00:08:50.200 --> 00:08:54.519
his instead of their or hers. But then it hits

00:08:54.519 --> 00:08:58.059
the verb is. Which triggers the forget gate.

00:08:58.080 --> 00:09:00.960
Right. Because after the verb is, the grammatical

00:09:00.960 --> 00:09:04.379
context of Dave being a singular male is no longer

00:09:04.379 --> 00:09:06.399
pertinent to predicting the structure of the

00:09:06.399 --> 00:09:09.340
next clause. The network has moved on. So the

00:09:09.340 --> 00:09:12.139
forget gate layer outputs a value of zero for

00:09:12.139 --> 00:09:15.000
that specific feature. Yes. Multiplying it against

00:09:15.000 --> 00:09:17.059
the conveyor belt and effectively erasing it

00:09:17.059 --> 00:09:19.539
from the cell state to free up capacity for new

00:09:19.539 --> 00:09:22.320
dependencies. Here's where it gets really interesting

00:09:22.320 --> 00:09:25.259
though. Wait, so to build an AI system with incredible,

00:09:25.480 --> 00:09:28.019
durable memory, the most crucial upgrade they

00:09:28.019 --> 00:09:31.379
gave it in 1999 was mathematically programming

00:09:31.379 --> 00:09:34.379
it to proactively delete things. I know, it sounds

00:09:34.379 --> 00:09:36.440
counterintuitive, but if we connect this to the

00:09:36.440 --> 00:09:39.190
bigger picture, it makes perfect sense. Without

00:09:39.190 --> 00:09:42.289
the forget gate, the conveyor belt becomes completely

00:09:42.289 --> 00:09:44.850
overwhelmed by information overload. Just totally

00:09:44.850 --> 00:09:48.169
jammed up. Exactly. Its cell state becomes a

00:09:48.169 --> 00:09:51.389
cluttered attic of irrelevant data, and the network

00:09:51.389 --> 00:09:54.009
loses its ability to distinguish signal from

00:09:54.009 --> 00:09:57.210
noise. I mean, think about it. Just like you,

00:09:57.230 --> 00:10:00.210
the listener, navigating your daily life, an

00:10:00.210 --> 00:10:02.870
AI's knowledge is most valuable when it drops

00:10:02.870 --> 00:10:05.789
irrelevant context to focus exclusively on what

00:10:05.789 --> 00:10:08.100
matters right now. That's a great point. You

00:10:08.100 --> 00:10:09.879
don't need to retain the license plate number

00:10:09.879 --> 00:10:11.940
of a car you saw yesterday to safely navigate

00:10:11.940 --> 00:10:14.460
traffic today. The network effectively learns

00:10:14.460 --> 00:10:16.879
which information might be needed later on in

00:10:16.879 --> 00:10:19.580
a sequence and, vitally, when that information

00:10:19.580 --> 00:10:22.220
has expired. OK, so now we have these gates acting

00:10:22.220 --> 00:10:24.259
like bouncers. They're checking data at the door,

00:10:24.539 --> 00:10:26.360
deciding what gets onto the conveyor belt, what

00:10:26.360 --> 00:10:28.379
gets tossed out and what gets revealed. Right.

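NOTE
To make the three bouncers concrete, here is a minimal NumPy sketch of one
LSTM time step following the standard textbook equations; the sizes, names,
and random weights are invented for illustration, not taken from the source.
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def lstm_step(x, h_prev, c_prev, W, b):
    z = np.concatenate([h_prev, x])   # previous hidden state plus current input
    f = sigmoid(W["f"] @ z + b["f"])  # forget gate: 0 means dump it, 1 means keep it
    i = sigmoid(W["i"] @ z + b["i"])  # input gate: what new info gets on the belt
    o = sigmoid(W["o"] @ z + b["o"])  # output gate: what gets revealed
    c_new = f * c_prev + i * np.tanh(W["c"] @ z + b["c"])  # update the conveyor belt
    h_new = o * np.tanh(c_new)        # the part shown to the rest of the network
    return h_new, c_new
rng = np.random.default_rng(0)        # toy run: 4 input features, 3 hidden units
W = {k: 0.1 * rng.normal(size=(3, 7)) for k in "fioc"}
b = {k: np.zeros(3) for k in "fioc"}
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W, b)
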
00:10:28.539 --> 00:10:30.659
But that brings up a massive operational question.

00:10:31.200 --> 00:10:34.039
How does this system actually fix its own mistakes

00:10:34.039 --> 00:10:37.370
during training without the error signals vanishing?

00:10:37.950 --> 00:10:40.990
Ah, the training problem. Yeah. If the bouncers

00:10:40.990 --> 00:10:43.789
make a bad call and erase something important,

00:10:44.370 --> 00:10:47.110
how does the penalty signal get back to them

00:10:47.110 --> 00:10:49.830
without shrinking to zero like in the old RNNs?

00:10:49.950 --> 00:10:52.009
This is where we get to the true engine of the

00:10:52.009 --> 00:10:54.470
LSTM, something called the Constant Error Carousel,

00:10:54.490 --> 00:10:57.490
or CEC. The CEC? Yeah, and this was part of the

00:10:57.490 --> 00:11:00.750
original 1997 design. Because the central cell

00:11:00.750 --> 00:11:03.909
state is a linear highway, it acts as a protected

00:11:03.909 --> 00:11:07.169
channel. When errors are back-propagated during

00:11:07.169 --> 00:11:09.830
training, they don't get subjected to those endless

00:11:09.830 --> 00:11:12.090
fractional multiplications that cause the vanishing

00:11:12.090 --> 00:11:14.769
gradient. They bypass it. Exactly. Instead, they

00:11:14.769 --> 00:11:16.570
get trapped inside the cell's carousel. Trapped

00:11:16.570 --> 00:11:19.590
in a carousel. Okay, I'm visualizing a soundproof

00:11:19.590 --> 00:11:21.509
room where an alarm is going off. Okay, I like

00:11:21.509 --> 00:11:23.929
where this is going. In the old system, the alarm

00:11:23.929 --> 00:11:25.809
sound would quickly get muffled by the walls

00:11:25.809 --> 00:11:28.230
until it was just completely silent. Yeah. But

00:11:28.230 --> 00:11:30.389
with the constant error carousel, the alarm

00:11:30.389 --> 00:11:33.230
keeps ringing at maximum volume inside that room.

00:11:33.370 --> 00:11:36.830
It continuously loops, feeding the raw error

00:11:36.830 --> 00:11:39.190
signal back to the gates over and over again.

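NOTE
A toy contrast of the two error paths, assuming a fractional weight of 0.9
on the ordinary recurrent path and an effective factor of 1.0 along the
linear cell state; the second number is the carousel keeping the alarm ringing.
vanilla, carousel = 1.0, 1.0
for _ in range(100):
    vanilla *= 0.9    # old RNN path: multiplied by a fraction at every step
    carousel *= 1.0   # constant error carousel: the linear path leaves it intact
print(vanilla, carousel)  # ~2.7e-05 versus 1.0
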
00:11:39.690 --> 00:11:42.009
The bouncers are stuck in the room with a blaring

00:11:42.009 --> 00:11:44.529
alarm until they finally figure out how to adjust

00:11:44.529 --> 00:11:46.950
their mathematical rules to make the mistakes

00:11:46.950 --> 00:11:49.350
stop happening. That is a much more accurate

00:11:49.350 --> 00:11:51.629
visualization of the mechanics. The gradient

00:11:51.629 --> 00:11:55.350
flows freely through the cell, unshrunk, continuously

00:11:55.350 --> 00:11:57.769
forcing the gates to update their weights. And

00:11:57.769 --> 00:11:59.669
you know, the architecture didn't stop evolving

00:11:59.669 --> 00:12:01.889
there either. Oh, there's more. Yeah, in the

00:12:01.889 --> 00:12:04.789
year 2000, they introduced a variant called the

00:12:04.789 --> 00:12:08.370
peephole LSTM. Because in the standard design,

00:12:08.529 --> 00:12:10.370
the gates only looked at the outward-facing

00:12:10.370 --> 00:12:12.990
hidden state to make their decisions. But researchers

00:12:12.990 --> 00:12:15.590
realized the gates needed more context to operate

00:12:15.590 --> 00:12:17.789
efficiently. So they gave them literal peepholes?

00:12:18.470 --> 00:12:21.129
Mathematically speaking, yes. They added connections

00:12:21.129 --> 00:12:23.809
that allowed the input, output, and forget gates

00:12:23.809 --> 00:12:26.470
to look directly inside the memory cell at the

00:12:26.470 --> 00:12:28.529
constant error carousel. They get to observe

00:12:28.529 --> 00:12:31.879
the actual internal cell state. Wait, so if the

00:12:31.879 --> 00:12:34.679
peepholes let the gates look directly inside

00:12:34.679 --> 00:12:36.740
the cell to make better decisions, why did they

00:12:36.740 --> 00:12:39.700
need another massive upgrade in 2006? Weren't

00:12:39.700 --> 00:12:41.659
the bouncers already doing their job perfectly

00:12:41.659 --> 00:12:43.740
by then? Well, the bouncers were doing their

00:12:43.740 --> 00:12:46.940
job at managing the internal memory, sure, but

00:12:46.940 --> 00:12:49.320
the network as a whole still struggled with how

00:12:49.320 --> 00:12:52.779
it interacted with messy real-world data. Ah,

00:12:52.779 --> 00:12:55.940
the real world, always ruining perfect math.

00:12:56.159 --> 00:12:59.159
Exactly. The breakthrough in 2006 came from a

00:12:59.159 --> 00:13:01.529
researcher named Alex Graves and his colleagues.

00:13:01.830 --> 00:13:04.710
They introduced an error function called connectionist

00:13:04.710 --> 00:13:08.629
temporal classification, or CTC. CTC. Let's break

00:13:08.629 --> 00:13:11.200
down why that was necessary. So imagine you are

00:13:11.200 --> 00:13:13.779
trying to process a sequence of human speech.

00:13:14.379 --> 00:13:17.299
The raw audio waveforms almost never neatly line

00:13:17.299 --> 00:13:19.419
up with the text transcript. Right, because people

00:13:19.419 --> 00:13:21.299
talk at different speeds. Exactly. Someone might

00:13:21.299 --> 00:13:23.799
say the word hello very briskly, or they might

00:13:23.799 --> 00:13:25.899
draw it out for three seconds. Traditional neural

00:13:25.899 --> 00:13:28.039
network models require perfectly aligned training

00:13:28.039 --> 00:13:31.039
data. A human had to manually timestamp exactly

00:13:31.039 --> 00:13:33.600
when the H sound started and ended, when the

00:13:33.600 --> 00:13:36.200
E sound happened, and so on. That sounds impossibly

00:13:36.200 --> 00:13:39.360
tedious to scale. It was. It was a huge bottleneck.

00:13:39.419 --> 00:13:41.820
It's like trying to subtitle a foreign movie,

00:13:41.919 --> 00:13:43.980
where you have the written script, but the video

00:13:43.980 --> 00:13:46.019
and audio tracks are completely out of sync,

00:13:46.019 --> 00:13:48.899
and you have to manually match every single syllable

00:13:48.899 --> 00:13:51.159
to lip movements. That is exactly the problem.

00:13:51.399 --> 00:13:54.379
And CTC solved this. It acts as an alignment

00:13:54.379 --> 00:13:57.299
engine. It calculates the probabilities of all

00:13:57.299 --> 00:13:59.679
possible alignments between the input audio and

00:13:59.679 --> 00:14:02.419
the target text sequence simultaneously. Oh,

00:14:02.419 --> 00:14:05.279
wow. Yeah, it allowed the LSTM to stretch and

00:14:05.279 --> 00:14:07.539
squeeze the text until it perfectly matched the

00:14:07.539 --> 00:14:10.539
audio waves without needing perfectly segmented

00:14:10.539 --> 00:14:13.539
human labeled data. It learned to achieve both

00:14:13.539 --> 00:14:16.039
alignment and recognition on its own. That's

00:14:16.039 --> 00:14:19.200
incredible. It really is. And this specific mathematical

00:14:19.200 --> 00:14:21.720
loop combined with the protected cell state is

00:14:21.720 --> 00:14:24.519
what finally allowed the LSTM to handle sequences

00:14:24.519 --> 00:14:27.340
of thousands of continuous time steps effortlessly.

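NOTE
A minimal sketch of the CTC idea using PyTorch's stock CTCLoss, a modern
implementation of the same loss rather than the original 2006 code; the
shapes and symbol indices are invented. The key property is that only the
target symbol sequence is supplied, never per-frame timestamps.
import torch
import torch.nn as nn
T, N, C = 50, 1, 6  # 50 audio frames, batch of 1, 6 symbols (index 0 = blank)
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
targets = torch.tensor([[3, 1, 4, 2]])  # the unaligned transcript symbols
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.tensor([T]), torch.tensor([4]))
loss.backward()  # CTC sums over every possible alignment internally
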
00:14:27.840 --> 00:14:29.419
You know, it really is hard to wrap your head

00:14:29.419 --> 00:14:31.440
around how much sequential engineering went into

00:14:31.440 --> 00:14:33.840
this. Yeah. A student thesis identifying the

00:14:33.840 --> 00:14:36.419
math problem in 1991, the cell-state paper in

00:14:36.419 --> 00:14:40.179
1997, the crucial forget gates in 1999, peepholes

00:14:40.179 --> 00:14:44.200
in 2000, and CTC training in 2006. It took time.

00:14:44.519 --> 00:14:47.899
Decades. Decades of computer scientists passing

00:14:47.899 --> 00:14:50.440
the baton, trying to build a machine that could

00:14:50.440 --> 00:14:53.320
conquer time. And for a long time it remained

00:14:53.320 --> 00:14:55.820
primarily a fascinating mathematical pursuit

00:14:55.820 --> 00:14:58.899
for academics. A brilliant theoretical solution

00:14:58.899 --> 00:15:01.399
to a really tough problem. Right. Because the

00:15:01.399 --> 00:15:03.360
math was essentially waiting for the silicon

00:15:03.360 --> 00:15:06.039
to catch up. Which brings us to the 2010s. Yes.

00:15:06.279 --> 00:15:10.200
The hardware finally arrived. GPUs became powerful

00:15:10.200 --> 00:15:12.580
enough to run these massive parallel calculations

00:15:12.580 --> 00:15:15.940
efficiently. And suddenly, this robust error-

00:15:15.940 --> 00:15:18.480
correcting, forgetting-enabled memory wasn't

00:15:18.480 --> 00:15:20.679
just a fun math puzzle anymore. It became very,

00:15:20.679 --> 00:15:23.200
very real. It quietly and totally took over the

00:15:23.200 --> 00:15:25.740
commercial tech world. It's actually staggering

00:15:25.740 --> 00:15:28.000
to look at the timeline of how fast it snowballed.

00:15:28.080 --> 00:15:30.399
The commercial adoption was an absolute avalanche.

00:15:30.419 --> 00:15:32.600
Once the compute power unlocked the architecture,

00:15:33.039 --> 00:15:35.120
the performance jumps were unprecedented. Right.

00:15:35.600 --> 00:15:38.539
In 2015, Google deployed an LSTM, trained with

00:15:38.539 --> 00:15:41.059
that exact CTC method we just discussed, for

00:15:41.059 --> 00:15:43.679
their Google Voice speech recognition. And according

00:15:43.679 --> 00:15:46.120
to their engineering blogs at the time, deploying

00:15:46.120 --> 00:15:48.879
that single architectural upgrade cut their transcription

00:15:48.879 --> 00:15:53.200
errors by 49%. Wait, 49%? That's practically

00:15:53.200 --> 00:15:55.320
cutting errors in half, seemingly overnight,

00:15:55.820 --> 00:15:58.539
on a massive global product. Yeah, and the dominoes

00:15:58.539 --> 00:16:01.620
fell rapidly across the entire industry. By 2016,

00:16:02.039 --> 00:16:04.259
Google had integrated it into their Allo messaging

00:16:04.259 --> 00:16:07.100
app for smart replies. That same year, the Google

00:16:07.100 --> 00:16:09.299
neural machine translation system rolled out,

00:16:09.580 --> 00:16:12.159
utilizing LSTMs to reduce translation errors

00:16:12.159 --> 00:16:15.580
by a massive 60%. Meanwhile, Apple announced

00:16:15.580 --> 00:16:17.879
they were baking LSTM into the iPhone's QuickType

00:16:17.879 --> 00:16:20.899
keyboard, and they used it to power Siri's

00:16:20.899 --> 00:16:23.799
natural language understanding. Then, Amazon

00:16:23.799 --> 00:16:27.019
integrated a bidirectional LSTM, which, by

00:16:27.019 --> 00:16:29.019
the way, is a variant that processes a sequence

00:16:29.019 --> 00:16:31.679
both forwards and backwards simultaneously to

00:16:31.679 --> 00:16:34.379
gain deeper context. And Amazon used that to

00:16:34.379 --> 00:16:36.759
generate the highly realistic voices for Alexa's

00:16:36.759 --> 00:16:39.080
text-to-speech engine, Polly. The scale became

00:16:39.080 --> 00:16:42.299
almost incomprehensible. By 2017, Facebook reported

00:16:42.299 --> 00:16:44.879
they were performing approximately 4.5 billion

00:16:44.879 --> 00:16:47.720
automatic translations every single day, entirely

00:16:47.720 --> 00:16:49.860
powered by long short-term memory networks.

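NOTE
For the bidirectional variant mentioned above, a minimal sketch with
PyTorch's stock LSTM module; all sizes here are made up for illustration.
import torch
import torch.nn as nn
bilstm = nn.LSTM(input_size=16, hidden_size=32,
                 bidirectional=True, batch_first=True)
x = torch.randn(1, 10, 16)  # batch of 1, 10 time steps, 16 features per step
out, (h, c) = bilstm(x)
print(out.shape)  # torch.Size([1, 10, 64]): forward and backward passes concatenated
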
00:16:49.980 --> 00:16:54.379
4.5 billion. A day. A day. Microsoft achieved

00:16:54.379 --> 00:16:58.600
a 94.9% recognition accuracy on conversational

00:16:58.600 --> 00:17:01.139
speech using them. The architecture just became

00:17:01.139 --> 00:17:03.600
the invisible backbone of the modern internet.

00:17:03.879 --> 00:17:05.940
So what does this all mean? You're talking about

00:17:05.940 --> 00:17:08.640
language. Transcribing voicemails, translating

00:17:08.640 --> 00:17:11.539
French to English, Siri setting a calendar event.

00:17:11.539 --> 00:17:13.619
Right, right. That makes complete sense based

00:17:13.619 --> 00:17:17.000
on our Dave is a pariah example and the speech

00:17:17.000 --> 00:17:20.589
recognition CTC math. But the adaptability of

00:17:20.589 --> 00:17:22.710
this architecture goes way beyond words. Oh,

00:17:22.769 --> 00:17:25.009
absolutely. The sources highlight that as early

00:17:25.009 --> 00:17:28.930
as 2006, an LSTM system learned to control robotic

00:17:28.930 --> 00:17:31.769
arms to tie microscopic knots for robotic heart

00:17:31.769 --> 00:17:34.130
surgery. It did. It learned the physical movements

00:17:34.130 --> 00:17:36.890
necessary to manipulate sutures. And then...

00:17:36.839 --> 00:17:41.220
Cut to 2018 and 2019, OpenAI and DeepMind utilized

00:17:41.220 --> 00:17:44.539
LSTMs trained by policy gradients, which is essentially

00:17:44.539 --> 00:17:46.539
a form of reinforcement learning to play video

00:17:46.539 --> 00:17:48.559
games. And not just simple games like Pong. Right,

00:17:48.599 --> 00:17:50.559
we're talking about insanely complex real -time

00:17:50.559 --> 00:17:53.519
strategy games like Dota 2 and StarCraft 2. The

00:17:53.519 --> 00:17:56.240
LSTM models utterly crushed professional human

00:17:56.240 --> 00:17:58.599
gamers. They really did. They also used it to

00:17:58.599 --> 00:18:01.500
control a human -like robot hand, manipulating

00:18:01.500 --> 00:18:05.160
physical objects like a Rubik's Cube with unprecedented

00:18:05.160 --> 00:18:08.950
dexterity. So how does a system fundamentally

00:18:08.950 --> 00:18:12.009
designed to remember the gender of Dave in a

00:18:12.009 --> 00:18:15.569
sentence flawlessly pivot to executing a military

00:18:15.569 --> 00:18:18.789
flank in StarCraft 2 or tying knots in a beating

00:18:18.789 --> 00:18:22.309
human heart? Well this raises an important question

00:18:22.309 --> 00:18:24.849
and the answer is really the key to understanding

00:18:24.849 --> 00:18:27.049
modern machine learning as a discipline. Okay.

00:18:27.289 --> 00:18:30.190
Because to a machine, the letters in a word, the

00:18:30.190 --> 00:18:32.750
audio frequencies in a voice command, the joint

00:18:32.750 --> 00:18:35.509
angles of a robotic arm tying a surgical knot,

00:18:35.849 --> 00:18:38.410
and the strategic unit movements in a Dota 2

00:18:38.410 --> 00:18:41.890
match. They are all fundamentally the exact same

00:18:41.890 --> 00:18:44.309
thing. They are just data points. They are sequential

00:18:44.309 --> 00:18:47.349
data unfolding over time. The LSTM did not learn

00:18:47.349 --> 00:18:49.509
the English language. It did not learn the rules

00:18:49.509 --> 00:18:51.829
of video games or human anatomy. It learned the

00:18:51.829 --> 00:18:54.269
universal language of time and sequence. Oh,

00:18:54.269 --> 00:18:57.170
wow. Yeah. Whether the time step is a millisecond

00:18:57.170 --> 00:18:59.670
of audio, a frame of video game rendering, or

00:18:59.670 --> 00:19:02.470
a micro -adjustment of a robotic actuator, the

00:19:02.470 --> 00:19:04.710
LSTM's architecture makes it universally applicable.

00:19:04.809 --> 00:19:07.190
This is just dealing with the math of time. Exactly.

00:19:08.619 --> 00:19:11.740
Its ability to contain the past on that conveyor belt, selectively

00:19:11.740 --> 00:19:14.299
forget the irrelevant with its gates, and apply

00:19:14.299 --> 00:19:16.660
that synthesized context to the present prediction

00:19:16.660 --> 00:19:19.980
means it can model any reality that unfolds chronologically.

00:19:20.460 --> 00:19:22.420
That is profound when you really think about

00:19:22.420 --> 00:19:24.440
it. It's not a language machine, it's a causality

00:19:24.440 --> 00:19:28.829
machine. It learns that event A happened 50 time

00:19:28.829 --> 00:19:31.910
steps ago, and because of event A, event B should

00:19:31.910 --> 00:19:35.089
logically happen right now. Yes, the architecture

00:19:35.089 --> 00:19:37.670
models cause and effect by mathematically bridging

00:19:37.670 --> 00:19:40.230
the gap between those two points in time, regardless

00:19:40.230 --> 00:19:42.450
of what the underlying data actually represents.

00:19:42.869 --> 00:19:45.509
So let's recap this incredible journey. We started

00:19:45.509 --> 00:19:49.130
with a 1991 student thesis by Sepp Hochreiter,

00:19:49.269 --> 00:19:51.769
trying to solve a seemingly impossible amnesia

00:19:51.769 --> 00:19:54.990
problem: the vanishing gradient. Signals

00:19:54.990 --> 00:19:57.329
just faded into nothingness. Huge roadblock.

00:19:57.569 --> 00:19:59.430
Right. And through the invention of protected

00:19:59.430 --> 00:20:02.009
memory cells, the crucial forget gates acting

00:20:02.009 --> 00:20:03.890
as bouncers to prevent information overload,

00:20:04.430 --> 00:20:06.710
peephole connections for internal visibility, and

00:20:06.710 --> 00:20:08.890
error carousels that trap mistakes like a blaring

00:20:08.890 --> 00:20:11.089
alarm until they are fixed, researchers built

00:20:11.089 --> 00:20:13.410
an architecture that conquered time. They solved

00:20:13.410 --> 00:20:15.930
it. It went from a theoretical academic puzzle

00:20:15.930 --> 00:20:19.049
to the invisible engine running almost every

00:20:19.049 --> 00:20:21.329
major application on your phone right now in

00:20:21.329 --> 00:20:24.509
2026. And you know, what's truly remarkable is

00:20:24.509 --> 00:20:26.650
that despite the core architecture being invented

00:20:26.650 --> 00:20:30.450
back in 1997, it is far from obsolete. Really?

00:20:30.809 --> 00:20:33.759
Even with all the new AI models? Yeah. You might

00:20:33.759 --> 00:20:36.579
assume in the era of new technologies and massive

00:20:36.579 --> 00:20:38.920
foundational models that it would have been entirely

00:20:38.920 --> 00:20:43.380
replaced. But just recently, in 2024, a modern

00:20:43.380 --> 00:20:46.599
upgrade called xLSTM was published by a team

00:20:46.599 --> 00:20:48.980
led by Sepp Hochreiter himself. Full circle.

00:20:49.180 --> 00:20:51.859
Completely. They adapted the design so that it

00:20:51.859 --> 00:20:55.319
is highly parallelizable, much like modern transformer

00:20:55.319 --> 00:20:58.000
architectures, while still retaining the superior

00:20:58.000 --> 00:21:00.259
state tracking abilities that LSTMs are known

00:21:00.259 --> 00:21:02.619
for. So the architecture is still evolving and

00:21:02.619 --> 00:21:04.519
still driving the bleeding edge of sequence

00:21:04.519 --> 00:21:06.839
modeling. That's amazing. So the next time your

00:21:06.839 --> 00:21:08.660
phone perfectly predicts the end of your sentence

00:21:08.660 --> 00:21:10.740
or effortlessly translates a foreign language

00:21:10.740 --> 00:21:13.039
on the fly as you travel, you really should take

00:21:13.039 --> 00:21:15.319
a second to thank the forget gate. It remains

00:21:15.319 --> 00:21:17.539
one of the great unsung heroes of the digital

00:21:17.539 --> 00:21:20.450
age. It really is. We started this deep dive

00:21:20.450 --> 00:21:23.069
talking about the expectation of perfect steel

00:21:23.069 --> 00:21:26.410
trap memory in machines and how early AI had

00:21:26.410 --> 00:21:30.210
catastrophic amnesia instead. We learned that

00:21:30.210 --> 00:21:32.569
to cure that amnesia, to build a machine capable

00:21:32.569 --> 00:21:35.829
of deep contextual memory, we first had to teach

00:21:35.829 --> 00:21:37.910
it the mathematical value of forgetting. The

00:21:37.910 --> 00:21:40.690
power of letting go. Exactly. And that leaves

00:21:40.690 --> 00:21:43.410
us with a thought I want you, listening, to mull

00:21:43.410 --> 00:21:46.259
over as you go about your day. If the greatest

00:21:46.259 --> 00:21:48.740
leap in artificial intelligence relied on programming

00:21:48.740 --> 00:21:52.000
a machine to proactively discard useless information,

00:21:52.619 --> 00:21:54.299
well, how much of your own human intelligence

00:21:54.299 --> 00:21:56.880
is defined not by the sheer volume of facts you

00:21:56.880 --> 00:21:59.500
can remember, but by the things your brain actively

00:21:59.500 --> 00:22:02.140
chooses to let go of? A fascinating question

00:22:02.140 --> 00:22:04.180
to end on. Thanks for joining us on this Deep

00:22:04.180 --> 00:22:05.279
Dive. We'll see you next time.
