WEBVTT

00:00:00.000 --> 00:00:02.220
Imagine trying to hold a conversation with someone

00:00:02.220 --> 00:00:06.639
who has, like, severe, and I mean absolute, short-

00:00:06.639 --> 00:00:09.140
term amnesia, right? That would be incredibly

00:00:09.140 --> 00:00:11.660
difficult. Yeah, every single word you say to

00:00:11.660 --> 00:00:14.279
them, they hear it, they process it, but the moment

00:00:14.279 --> 00:00:18.370
you say the next word, the previous one just vanishes

00:00:18.370 --> 00:00:20.690
completely. It's completely wiped. Exactly. They

00:00:20.690 --> 00:00:23.390
experience every fraction of a second completely

00:00:23.390 --> 00:00:25.670
independently. I mean, it would make communication

00:00:25.670 --> 00:00:28.530
basically impossible. You simply cannot understand

00:00:28.530 --> 00:00:30.250
a sentence without holding onto the beginning

00:00:30.250 --> 00:00:32.109
of it by the time you reach the end. And for

00:00:32.109 --> 00:00:34.770
a long time, that was the exact problem we faced

00:00:34.770 --> 00:00:37.609
when trying to get artificial intelligence to

00:00:37.609 --> 00:00:40.270
understand us. Yeah, it really was. Because the

00:00:40.270 --> 00:00:42.890
basic neural networks, they were brilliant at

00:00:42.890 --> 00:00:46.659
looking at a single static picture and identifying

00:00:46.659 --> 00:00:49.259
a cat, but they were stuck in this isolated present

00:00:49.259 --> 00:00:51.859
moment. They had absolutely no concept of time

00:00:51.859 --> 00:00:55.100
passing or of sequences. Which meant they failed

00:00:55.100 --> 00:00:58.259
completely at tasks involving speech or language

00:00:58.259 --> 00:01:00.420
translation or even just predicting what happens

00:01:00.420 --> 00:01:04.420
next in a time series. Solving that required

00:01:04.420 --> 00:01:06.640
a completely different architecture. It really

00:01:06.640 --> 00:01:10.069
required giving the machine a memory. That is

00:01:10.069 --> 00:01:12.969
exactly our mission for you today. We are doing

00:01:12.969 --> 00:01:17.390
a deep dive into a massive comprehensive Wikipedia

00:01:17.390 --> 00:01:21.329
article on recurrent neural networks, or RNNs.

00:01:21.590 --> 00:01:24.730
It's a fascinating topic. It really is. So if

00:01:24.730 --> 00:01:26.790
you've ever wondered how your phone transcribes

00:01:26.790 --> 00:01:29.689
your voice, or how translation apps grasp the

00:01:29.689 --> 00:01:32.170
context of a whole paragraph, or really just

00:01:32.170 --> 00:01:35.370
how AI processes the simple passage of time,

00:01:35.629 --> 00:01:37.370
well, this is the master class. And we're going

00:01:37.370 --> 00:01:39.590
to cover a lot of ground today. We are. OK, let's

00:01:39.590 --> 00:01:41.849
unpack this. When I was prepping for this, the

00:01:41.849 --> 00:01:45.849
source kept contrasting RNNs with standard feedforward

00:01:45.849 --> 00:01:48.189
neural networks. Yes, that's the standard comparison.

00:01:48.310 --> 00:01:50.370
So as I understand it, a feedforward network

00:01:50.370 --> 00:01:52.829
just takes data, pushes it straight through its

00:01:52.829 --> 00:01:55.409
layers once, and spits out an answer. What makes

00:01:55.409 --> 00:01:58.290
an RNN fundamentally different? Well, the crucial

00:01:58.290 --> 00:02:00.689
difference is right in the name itself, recurrent.

00:02:00.849 --> 00:02:03.540
It implies a loop. In a standard feedforward

00:02:03.540 --> 00:02:06.060
network, the data travels strictly in one direction,

00:02:06.180 --> 00:02:09.979
from input to output. An RNN alters that fundamental

00:02:09.979 --> 00:02:13.280
geometry. The output of a neuron at one step

00:02:13.280 --> 00:02:16.439
is actually fed back into the network as an input

00:02:16.439 --> 00:02:19.460
for the absolute very next step. Oh, wow. So

00:02:19.460 --> 00:02:21.479
it's essentially eating its own output as it

00:02:21.479 --> 00:02:24.000
moves forward. Exactly. That loop is what we

00:02:24.000 --> 00:02:26.479
call a recurrent unit, and it maintains something

00:02:26.479 --> 00:02:29.960
called a hidden state. Hidden state? Right. You

00:02:29.960 --> 00:02:32.580
can think of this hidden state as the network's

00:02:32.580 --> 00:02:36.060
internal working memory. So at any given moment,

00:02:36.479 --> 00:02:38.620
the network isn't just looking at the new piece

00:02:38.620 --> 00:02:40.520
of data coming in. It's looking at the past,

00:02:40.680 --> 00:02:45.060
too. Yes. It is processing that new data mathematically

00:02:45.060 --> 00:02:48.000
combined with its hidden state, which contains,

00:02:48.000 --> 00:02:50.759
well, the echoes of everything it has seen before.
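
NOTE
Editor's aside: a minimal Python sketch of the hidden-state update described
above, assuming the classic Elman-style formulation with a tanh activation.
The weight names (W_xh, W_hh, b_h) are illustrative, not from the source.
import numpy as np
def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # Mathematically combine the new input with the previous hidden state,
    # "the echoes of everything it has seen before."
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
# Tiny usage example: 3-dim inputs, 4-dim memory.
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
h = rnn_step(rng.normal(size=3), np.zeros(4), W_xh, W_hh, b_h)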

00:02:51.060 --> 00:02:52.979
Let me try to visualize this with an analogy.

00:02:53.340 --> 00:02:55.520
It feels like reading a book. Okay, I like that.

00:02:55.710 --> 00:02:58.449
If I'm a basic feedforward AI, I read the word

00:02:58.449 --> 00:03:01.250
"the," process it, and then completely wipe my

00:03:01.250 --> 00:03:03.409
mental slate. Right, gone. Then I read the word

00:03:03.409 --> 00:03:05.990
"dog," and I have zero recollection that "the" came

00:03:05.990 --> 00:03:08.550
before it. It's just a standalone "dog." Exactly.

00:03:08.909 --> 00:03:11.530
But an RNN carries a running summary in its head.

00:03:12.009 --> 00:03:14.490
That running summary is the hidden state. Spot

00:03:14.490 --> 00:03:16.590
on. So by the time it reaches the end of the

00:03:16.590 --> 00:03:19.879
sentence and reads the word "barked," it understands

00:03:19.879 --> 00:03:23.780
that action in the context of the dog, the mailman,

00:03:24.020 --> 00:03:26.740
and everything that preceded it. That captures

00:03:26.740 --> 00:03:28.800
the dynamic perfectly. And what's fascinating

00:03:28.800 --> 00:03:31.759
here is how this architecture looks when you

00:03:31.759 --> 00:03:33.840
map it out visually. Oh yeah, the diagrams in

00:03:33.840 --> 00:03:35.699
the source material. Right. If you look at diagrams

00:03:35.699 --> 00:03:38.800
of RNNs, they often look like these massive sprawling

00:03:38.800 --> 00:03:41.879
networks with dozens of horizontal layers. But

00:03:41.879 --> 00:03:44.979
that visual is actually a clever illusion. An

00:03:44.979 --> 00:03:46.699
illusion. Wait, what are we actually looking

00:03:46.699 --> 00:03:49.159
at then? You aren't looking at a giant stack

00:03:49.159 --> 00:03:51.639
of different distinct layers. It is the exact

00:03:51.639 --> 00:03:54.960
same network, the same layer, just unfolded in

00:03:54.960 --> 00:03:57.800
time. Unfolded, okay. Yeah. You are seeing a

00:03:57.800 --> 00:04:00.280
snapshot of the same digital brain at second

00:04:00.280 --> 00:04:03.280
one, then second two, then second three, laid

00:04:03.280 --> 00:04:06.120
out side by side on the page. Oh, I see. It transforms

00:04:06.120 --> 00:04:09.020
the input, updates its internal memory, and gets

00:04:09.020 --> 00:04:11.020
ready for the next step using the exact same

00:04:11.020 --> 00:04:13.300
machinery. It just loops the same hardware over

00:04:13.300 --> 00:04:15.060
and over. That makes a lot of sense conceptually.
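
NOTE
Editor's aside: a sketch of the "unfolding" idea, reusing the hypothetical
rnn_step helper from the earlier note. The point is that one set of weights is
applied at every time step, so the unrolled diagram's "layers" are snapshots
of the same machinery at second one, second two, second three.
def run_sequence(xs, h0, W_xh, W_hh, b_h):
    h, snapshots = h0, []
    for x_t in xs:                             # step through time
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # same weights every step
        snapshots.append(h)                    # one "layer" per time step
    return snapshots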

00:04:15.159 --> 00:04:17.420
It's very elegant. But, you know, when I think

00:04:17.420 --> 00:04:20.779
about the origins of advanced AI, my mind immediately

00:04:20.779 --> 00:04:23.800
goes to Silicon Valley computer labs in the 2000s

00:04:23.800 --> 00:04:26.120
or maybe the 90s. Well, sure. Most people do.

00:04:26.560 --> 00:04:28.560
Reading this source, though... The inspiration

00:04:28.560 --> 00:04:30.639
for this looping computational memory didn't

00:04:30.639 --> 00:04:33.220
come from computer science at all. No, it didn't.

00:04:33.339 --> 00:04:35.420
The origins are much more organic and they stretch

00:04:35.420 --> 00:04:38.740
back much further. The very concept of recurrence

00:04:38.740 --> 00:04:41.459
in this context comes straight from human anatomy

00:04:41.459 --> 00:04:43.759
and early neuroscience. Here's where it gets

00:04:43.759 --> 00:04:46.759
really interesting. We're talking about observations

00:04:46.759 --> 00:04:49.779
of the human brain from over a century ago. Right.

00:04:50.060 --> 00:04:53.339
In 1901, the legendary neuroscientist Santiago

00:04:53.339 --> 00:04:56.500
Ramón y Cajal was studying the cerebellar cortex

00:04:56.500 --> 00:04:59.209
of the brain under a microscope. Just looking at physical

00:04:59.209 --> 00:05:02.509
brain tissue. Exactly. And he observed these

00:05:02.509 --> 00:05:05.209
literal loop -like anatomical structures, he

00:05:05.209 --> 00:05:07.430
called them recurrent semicircles, and they were

00:05:07.430 --> 00:05:10.009
formed by different types of biological cells.

00:05:10.050 --> 00:05:13.050
Wow. And then moving into the 1930s and 40s,

00:05:13.209 --> 00:05:15.290
researchers began to realize that the brain wasn't

00:05:15.290 --> 00:05:17.709
just a linear pathway where, you know, a stimulus

00:05:17.709 --> 00:05:19.829
goes in and a reaction comes out. Our source

00:05:19.829 --> 00:05:23.230
brings up Donald Hebb in the 1940s who proposed

00:05:23.230 --> 00:05:26.910
reverberating circuits. Yes, Hebb's work is foundational.

00:05:27.149 --> 00:05:29.569
He had this idea that electrical signals in the

00:05:29.569 --> 00:05:31.709
brain could get caught in a biological loop,

00:05:32.110 --> 00:05:34.949
bouncing around continuously, and that this physical

00:05:34.949 --> 00:05:37.750
reverberation might actually be the physical

00:05:37.750 --> 00:05:40.839
mechanism for short-term memory. Following that

00:05:40.839 --> 00:05:44.180
exact thread, in 1943, Warren McCulloch and Walter

00:05:44.180 --> 00:05:47.660
Pitts published a landmark paper mathematically

00:05:47.660 --> 00:05:51.199
modeling how neurons work. And crucially, they

00:05:51.199 --> 00:05:53.379
included artificial networks that had cycles

00:05:53.379 --> 00:05:56.000
or loops in them. They were attempting to use

00:05:56.000 --> 00:05:58.579
these closed loops to explain medical conditions.

00:05:58.600 --> 00:06:01.220
For instance, why a seizure happens in epilepsy

00:06:01.220 --> 00:06:05.220
or certain types of chronic pain where a nerve

00:06:05.220 --> 00:06:09.680
signal simply will not stop firing. So the idea

00:06:09.680 --> 00:06:12.759
of an AI memory loop starts with literal brain

00:06:12.759 --> 00:06:15.779
anatomy. But the source doesn't stop at biology.

00:06:16.100 --> 00:06:17.819
No, it takes a pretty sharp turn. It takes this

00:06:17.819 --> 00:06:20.459
massive leap into physics. Statistical mechanics

00:06:20.459 --> 00:06:22.120
specifically, which completely threw me for a

00:06:22.120 --> 00:06:24.459
loop. You're referring to the Ising model. This

00:06:24.459 --> 00:06:27.519
dates back to the 1920s, developed by Wilhelm

00:06:27.519 --> 00:06:31.160
Lenz and Ernst Ising. It was originally a statistical

00:06:31.160 --> 00:06:35.000
model to understand magnets. Magnets. Yes. Specifically,

00:06:35.480 --> 00:06:38.300
how magnetic materials reach physical equilibrium.

00:06:38.740 --> 00:06:41.480
OK, hold on a second. I need a bridge here. How

00:06:41.480 --> 00:06:45.079
on earth do 1920s magnets connect to an artificial

00:06:45.079 --> 00:06:47.620
intelligence trying to understand the context

00:06:47.620 --> 00:06:49.579
of a sentence? I know, it sounds totally unrelated

00:06:49.579 --> 00:06:51.579
until you look at the math of energy states.

00:06:51.600 --> 00:06:54.639
Cool. In a material, atoms have a magnetic spin,

00:06:55.019 --> 00:06:57.319
and they constantly influence the spin of their

00:06:57.319 --> 00:06:59.199
immediate neighbors. They're trying to settle

00:06:59.199 --> 00:07:01.680
into a low energy state, which

00:07:01.680 --> 00:07:04.000
physicists call a local minimum. Right, they

00:07:04.000 --> 00:07:07.079
want to be stable. Exactly. And then in the 1970s

00:07:07.079 --> 00:07:09.980
and 80s, physicists and computer scientists realized

00:07:09.980 --> 00:07:12.519
you could use that exact same mathematical framework

00:07:12.519 --> 00:07:15.399
to model a network of artificial neurons. So

00:07:15.399 --> 00:07:17.959
instead of atoms settling into a magnetic state,

00:07:18.060 --> 00:07:20.500
you have artificial neurons firing and influencing

00:07:20.500 --> 00:07:22.779
each other through their weights. The network iterates over

00:07:22.779 --> 00:07:25.259
and over until the whole system mathematically

00:07:25.259 --> 00:07:28.019
settles into a stable pattern. And that stable

00:07:28.019 --> 00:07:31.399
pattern is what we call a memory. That is wild.
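
NOTE
Editor's aside: a toy sketch of "settling into a stable pattern," assuming a
Hopfield-style network with Hebbian weights and +/-1 neuron states. It
illustrates the energy-minimization idea the speakers describe; it is not
code from the source.
import numpy as np
def settle(W, s, steps=10):
    for _ in range(steps):
        # Each pass, neurons flip toward the state their neighbors push them
        # to, lowering the network's "energy" until the pattern stabilizes.
        s = np.sign(W @ s)
    return s
memory = np.array([1, -1, 1, -1])     # the stored pattern
W = np.outer(memory, memory)          # Hebbian connection weights
np.fill_diagonal(W, 0)                # no self-connections
noisy_cue = np.array([1, 1, 1, -1])   # a corrupted version of the memory
print(settle(W, noisy_cue))           # settles back to [1, -1, 1, -1]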

00:07:31.860 --> 00:07:34.680
You have neuroscience mapping literal brain loops

00:07:34.680 --> 00:07:37.600
and physics calculating the energy states of

00:07:37.600 --> 00:07:40.899
magnetic atoms. And in the 1980s, these completely

00:07:40.899 --> 00:07:43.620
different disciplines crash into each other to

00:07:43.620 --> 00:07:46.199
form the foundation of modern recurrent neural

00:07:46.199 --> 00:07:49.339
networks. Yeah, by the late 80s and early 90s,

00:07:49.500 --> 00:07:51.560
with architectures like the Jordan and Elman

00:07:51.560 --> 00:07:54.100
networks, researchers were using these early

00:07:54.100 --> 00:07:56.540
RNNs to actually study cognitive psychology.

00:07:56.639 --> 00:07:59.680
They were mapping how human beings process sequential

00:07:59.680 --> 00:08:04.000
tasks. But as incredible as the theory was, the

00:08:04.000 --> 00:08:06.160
reality of building these things hit a massive

00:08:06.160 --> 00:08:09.300
roadblock. They built the RNNs, they gave them

00:08:09.300 --> 00:08:11.819
this looping mathematical memory, but the networks

00:08:11.819 --> 00:08:13.779
had a fatal flaw. They had terrible long-term

00:08:13.779 --> 00:08:16.220
memory. Yes, the vanishing gradient problem.

00:08:16.699 --> 00:08:19.079
This was the great amnesia that almost derailed

00:08:19.079 --> 00:08:22.160
the entire field of sequential AI. Let's break

00:08:22.160 --> 00:08:24.139
this down because "vanishing gradient" sounds like,

00:08:24.139 --> 00:08:26.459
I don't know, a bad sci-fi movie title. To start,

00:08:26.560 --> 00:08:28.660
what actually is the gradient in this context?

00:08:28.800 --> 00:08:30.560
Well, to understand the gradient, you have to

00:08:30.560 --> 00:08:32.899
look at how the network learns through a process

00:08:32.899 --> 00:08:35.659
called backpropagation through time. Let's say

00:08:35.659 --> 00:08:38.679
the network reads a sequence of 100 words and

00:08:38.679 --> 00:08:40.519
then makes a prediction about the next word.

00:08:40.830 --> 00:08:43.529
If the prediction is wrong, it calculates an

00:08:43.529 --> 00:08:45.990
error signal. Makes sense. It then has to go

00:08:45.990 --> 00:08:48.710
backward through time, step by step, through

00:08:48.710 --> 00:08:51.409
the hidden states to figure out which previous

00:08:51.409 --> 00:08:54.470
neuron made the mistake and adjust its mathematical

00:08:54.470 --> 00:08:56.990
weight. That error signal traveling backward,

00:08:57.230 --> 00:08:59.629
that is the gradient. Oh, I see. It's like tracking

00:08:59.629 --> 00:09:03.049
down who started a rumor by going backward through

00:09:03.049 --> 00:09:05.909
a chain of a hundred gossiping people, tweaking

00:09:05.909 --> 00:09:08.659
each person's reliability score as you go. That

00:09:08.659 --> 00:09:10.980
captures the sequence perfectly. But here is

00:09:10.980 --> 00:09:13.580
the specific mathematical wall they hit. Every

00:09:13.580 --> 00:09:15.480
time you take a step backward through time to

00:09:15.480 --> 00:09:17.519
adjust a weight, you use calculus. Ah, right.

00:09:17.559 --> 00:09:20.000
The chain rule. Exactly. Which means you are

00:09:20.000 --> 00:09:22.639
multiplying that error signal by a specific number.

00:09:23.139 --> 00:09:25.419
And because of how the network's activation functions

00:09:25.419 --> 00:09:27.879
work, that number is almost always a fraction.

00:09:28.179 --> 00:09:31.120
It's less than one. Oh. And if you multiply a

00:09:31.120 --> 00:09:33.080
fraction by a fraction by a fraction, the number

00:09:33.080 --> 00:09:37.820
shrinks rapidly. If you multiply 0.9 by itself

00:09:37.820 --> 00:09:40.740
dozens or hundreds of times as you go back through

00:09:40.740 --> 00:09:44.480
a long sequence, the value vanishes exponentially

00:09:44.480 --> 00:09:47.120
quickly. So it just disappears. Right. By the

00:09:47.120 --> 00:09:49.019
time the error signal reaches the beginning of

00:09:49.019 --> 00:09:51.960
the sequence, it's essentially zero. The network

00:09:51.960 --> 00:09:54.639
mathematically cannot learn from events that

00:09:54.639 --> 00:09:57.100
happened too many steps ago. It remembers the

00:09:57.100 --> 00:09:59.779
end of the sentence, but the beginning is effectively

00:09:59.779 --> 00:10:02.519
erased from its learning process. Yes.
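
NOTE
Editor's aside: the arithmetic behind the vanishing gradient, as a tiny demo.
The 0.9 stands in for the per-step derivative factors the chain rule
multiplies together; the exact factors vary, but anything below one shrinks
the signal exponentially.
signal = 1.0
for _ in range(100):   # 100 steps back through time
    signal *= 0.9      # each step multiplies the error by a fraction
print(signal)          # about 0.0000266: effectively zero at the start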

00:10:02.779 --> 00:10:05.500
Okay, playing devil's advocate here. If it has a memory

00:10:05.500 --> 00:10:08.080
problem, why not just give it a bigger hard drive?

00:10:08.480 --> 00:10:10.740
Just throw more modern computing power at it,

00:10:10.940 --> 00:10:13.240
make the memory banks larger. Because it's not

00:10:13.240 --> 00:10:16.120
a storage capacity issue, it's a signal degradation

00:10:16.120 --> 00:10:19.230
issue. Oh, okay. Think back to your gossip analogy

00:10:19.230 --> 00:10:21.990
or a game of telephone that stretches for a mile.

00:10:22.629 --> 00:10:24.649
It doesn't matter how big the brains of the people

00:10:24.649 --> 00:10:27.309
in the line are. By the time the whisper goes

00:10:27.309 --> 00:10:29.710
through a thousand people, the original acoustic

00:10:29.710 --> 00:10:32.529
message is degraded to noise. So the underlying

00:10:32.529 --> 00:10:34.850
architecture itself was destroying the signal.

00:10:35.149 --> 00:10:38.470
So how do you fix a game of telephone where the

00:10:38.470 --> 00:10:41.549
signal mathematically degrades? Because, obviously,

00:10:41.889 --> 00:10:43.909
they figured it out, or we wouldn't be using

00:10:43.909 --> 00:10:46.970
these systems today. They solved it with a brilliantly

00:10:46.970 --> 00:10:50.350
counterintuitive approach. In 1997, researchers

00:10:50.350 --> 00:10:53.210
Sepp Hochreiter and Jürgen Schmidhuber invented

00:10:53.210 --> 00:10:56.409
the LSTM. Right. And that stands for long short-

00:10:56.409 --> 00:10:58.450
term memory. Long short-term memory. Kind of

00:10:58.450 --> 00:11:01.690
an oxymoron, but okay. How did this architecture

00:11:01.690 --> 00:11:04.850
fix the vanishing signal? By giving the network

00:11:04.850 --> 00:11:07.509
the explicit ability to intentionally forget.

00:11:07.610 --> 00:11:10.600
Wait. The solution to a memory problem was to

00:11:10.600 --> 00:11:12.679
introduce forgetting. I know, it sounds crazy,

00:11:12.759 --> 00:11:15.320
but an LSTM unit contains internal mechanisms

00:11:15.320 --> 00:11:18.179
called gates. And the most crucial one is the

00:11:18.179 --> 00:11:20.480
forget gate. Okay, what does it do? Instead of

00:11:20.480 --> 00:11:22.879
blindly cramming every single new piece of data

00:11:22.879 --> 00:11:25.320
into the hidden state and letting the old important

00:11:25.320 --> 00:11:28.240
stuff get diluted by the math, the network actively

00:11:28.240 --> 00:11:30.659
decides what is useless and mathematically throws

00:11:30.659 --> 00:11:32.460
it away. How does it actually throw it away,

00:11:32.600 --> 00:11:34.600
mathematically speaking? It uses something called

00:11:34.600 --> 00:11:36.980
a sigmoid function, which basically acts like

00:11:36.980 --> 00:11:40.360
a valve. It takes the incoming data and squishes

00:11:40.360 --> 00:11:43.240
it into a value between zero and one. Okay. So

00:11:43.240 --> 00:11:46.519
if the network decides a word like "and" or "the"

00:11:46.519 --> 00:11:49.240
isn't important for the overall context, the

00:11:49.240 --> 00:11:52.600
forget gate outputs a zero. If you multiply any

00:11:52.600 --> 00:11:55.759
piece of memory by zero, it is instantly erased.

00:11:56.000 --> 00:11:59.379
Boom. Gone. Exactly. If it's important, it outputs

00:11:59.379 --> 00:12:02.100
a one and the memory is preserved perfectly.
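
NOTE
Editor's aside: a sketch of the forget-gate "valve" just described, assuming
the standard LSTM cell-state update. The full cell also adds new candidate
memory through an input gate and exposes it through an output gate; only the
forgetting step is shown, and the weight names are illustrative.
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squishes any value into (0, 1)
def apply_forget_gate(c_prev, h_prev, x_t, W_f, b_f):
    # The gate inspects the new input plus the previous hidden state and
    # emits one value per memory slot: near 0 erases it, near 1 keeps it.
    f = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
    return f * c_prev                  # multiply the old memory by the gate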

00:12:02.240 --> 00:12:04.279
Oh, I see. It's like a central conveyor belt

00:12:04.279 --> 00:12:06.389
running through a factory. Right. Instead of

00:12:06.389 --> 00:12:08.070
dumping every new piece of trash onto the belt

00:12:08.070 --> 00:12:10.889
and causing a massive jam, the forget gates act

00:12:10.889 --> 00:12:14.009
as robotic arms, actively sweeping useless items

00:12:14.009 --> 00:12:16.370
off the belt. This ensures the core components

00:12:16.370 --> 00:12:18.389
can travel down the line completely undisturbed.

00:12:18.509 --> 00:12:20.509
That is a much more accurate way to visualize

00:12:20.509 --> 00:12:23.450
it. By carefully regulating what flows onto the

00:12:23.450 --> 00:12:26.070
conveyor belt and what gets swept off, the LSTM

00:12:26.070 --> 00:12:28.769
creates a protected central channel. A protected

00:12:28.769 --> 00:12:31.129
channel. Yes, what they call a constant error

00:12:31.129 --> 00:12:33.769
carousel. In this channel, the gradient doesn't

00:12:33.769 --> 00:12:36.649
vanish. The error signal can flow backward through

00:12:36.649 --> 00:12:39.269
thousands, even millions of time steps without

00:12:39.269 --> 00:12:42.230
losing its potency. Wow. And the source also

00:12:42.230 --> 00:12:45.789
mentions GRUs, gated recurrent units. Where do

00:12:45.789 --> 00:12:48.340
they fit into the conveyor belt analogy? Introduced

00:12:48.340 --> 00:12:53.019
much later, in 2014, a GRU is basically a streamlined,

00:12:53.100 --> 00:12:55.860
more efficient version of the LSTM. Like a model

00:12:55.860 --> 00:12:58.019
upgrade? Yeah, it combines a couple of the gates

00:12:58.019 --> 00:13:00.799
to compute faster, but it performs almost identically

00:13:00.799 --> 00:13:03.919
well on most tasks. But the LSTM, that was the

00:13:03.919 --> 00:13:06.980
true paradigm shift. It solved the great amnesia.
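
NOTE
Editor's aside: a sketch of the streamlined GRU step, assuming the standard
2014 formulation (biases omitted for brevity; weight names illustrative). Its
update gate z and reset gate r fold the LSTM's gating directly into the
hidden state.
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
def gru_step(x_t, h_prev, W_z, W_r, W_h):
    v = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ v)                   # update gate: keep old vs. write new
    r = sigmoid(W_r @ v)                   # reset gate: how much past to consult
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))
    return (1 - z) * h_prev + z * h_cand   # blend old memory with new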

00:13:07.259 --> 00:13:09.539
So they cured the memory problem. They built

00:13:09.539 --> 00:13:12.039
a network that can remember the past and prioritize

00:13:12.039 --> 00:13:15.100
the important data. How did researchers actually

00:13:15.100 --> 00:13:16.940
start putting these building blocks together

00:13:16.940 --> 00:13:18.879
to do useful things? Well, this is where it gets

00:13:18.879 --> 00:13:21.259
really creative. Because reading through the

00:13:21.259 --> 00:13:23.840
configurations in the source, the architecture

00:13:23.840 --> 00:13:26.759
gets wild pretty fast. It does. Once you have

00:13:26.759 --> 00:13:29.100
a reliable memory cell like an LSTM, you can

00:13:29.100 --> 00:13:31.120
start stacking them and arranging them in fascinating

00:13:31.120 --> 00:13:33.720
ways. One of the most powerful configurations

00:13:33.720 --> 00:13:37.899
is the bidirectional RNN, processing data forwards

00:13:37.899 --> 00:13:40.039
and backwards at the same time. I remember reading

00:13:40.039 --> 00:13:42.200
this and thinking, why would you ever read a

00:13:42.200 --> 00:13:43.659
sentence backward? Why do you think? Well, if

00:13:43.659 --> 00:13:48.120
I say "the bank," you don't know if I mean a financial

00:13:48.120 --> 00:13:50.419
institution or the side of a river until I finish

00:13:50.419 --> 00:13:53.320
the sentence with "was muddy from the rain." Exactly.

00:13:53.779 --> 00:13:56.960
A standard RNN moving strictly from left to right

00:13:56.960 --> 00:14:00.059
has to guess what "bank" means before it sees the

00:14:00.059 --> 00:14:04.220
word "muddy." But a bidirectional RNN uses two

00:14:04.220 --> 00:14:06.659
separate networks. Right. One reads left to right,

00:14:06.740 --> 00:14:08.659
the other reads right to left. Hold on, let me

00:14:08.659 --> 00:14:11.259
push back on that. If the backward reading network

00:14:11.259 --> 00:14:14.000
has to see the end of the sentence to understand

00:14:14.000 --> 00:14:16.700
the beginning, how can that possibly work in

00:14:16.700 --> 00:14:18.980
real time? It can't. Because if I was speaking

00:14:18.980 --> 00:14:21.240
to a voice assistant, it doesn't know the end

00:14:21.240 --> 00:14:23.659
of my sentence until I actually say it. Isn't

00:14:23.659 --> 00:14:26.000
reading backward cheating? That's a crucial distinction.

00:14:26.570 --> 00:14:29.490
Bidirectional RNNs are not used for real-time,

00:14:29.529 --> 00:14:31.269
instantaneous prediction. You are absolutely

00:14:31.269 --> 00:14:32.990
right. They need the whole sequence available

00:14:32.990 --> 00:14:35.629
beforehand. They are used for offline processing,

00:14:36.169 --> 00:14:38.350
like translating a document you've already uploaded

00:14:38.350 --> 00:14:41.090
or analyzing the sentiment of a completely written

00:14:41.090 --> 00:14:43.669
review. The two networks meet in the middle,

00:14:44.070 --> 00:14:46.850
combine their hidden states, and the model understands

00:14:46.850 --> 00:14:49.889
the ambiguous word perfectly within its complete,

00:14:50.129 --> 00:14:52.769
finalized context. That makes perfect sense.
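
NOTE
Editor's aside: a sketch of the bidirectional setup, reusing the hypothetical
rnn_step helper and assuming the whole sequence is already available (offline
processing, as the speakers stress). Each position ends up with a forward
summary of everything before it and a backward summary of everything after it.
import numpy as np
def bidirectional(xs, h0, fwd_params, bwd_params):
    f, fwd = h0, []
    for x in xs:                            # one network reads left to right
        f = rnn_step(x, f, *fwd_params)
        fwd.append(f)
    b, bwd = h0, []
    for x in reversed(xs):                  # the other reads right to left
        b = rnn_step(x, b, *bwd_params)
        bwd.append(b)
    bwd.reverse()
    # "Meet in the middle": combine the two hidden states at each position.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]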

00:14:52.769 --> 00:14:54.429
It only works when the data is already sitting

00:14:54.429 --> 00:14:56.480
there. Yeah. Then there's the encoder -decoder

00:14:56.480 --> 00:14:59.200
configuration, which the source notes was the

00:14:59.200 --> 00:15:01.419
secret sauce for early neural machine translation.

00:15:01.580 --> 00:15:03.740
Yes, like translating French to English. Right.

00:15:03.840 --> 00:15:07.080
This was a massive leap in the early 2010s. An

00:15:07.080 --> 00:15:10.740
encoder -decoder uses two separate RNNs. The

00:15:10.740 --> 00:15:13.600
first one, the encoder, reads the entire French

00:15:13.600 --> 00:15:16.860
sentence. OK. It processes it step by step until

00:15:16.860 --> 00:15:18.980
it compresses the meaning of that entire sentence

00:15:18.980 --> 00:15:21.940
into a single, highly dense mathematical vector.

00:15:22.379 --> 00:15:25.509
It's essentially a pure language concept of the

00:15:25.509 --> 00:15:27.789
sentence. It digests the French into a pure thought

00:15:27.789 --> 00:15:31.370
vector. Beautifully put. Then it hands that dense

00:15:31.370 --> 00:15:34.090
concept over to the second RNN, the decoder.

00:15:34.149 --> 00:15:36.450
And what does the decoder do? The decoder takes

00:15:36.450 --> 00:15:39.389
that concept and starts unfolding it, generating

00:15:39.389 --> 00:15:42.009
the English sentence word by word based on that

00:15:42.009 --> 00:15:44.830
pure thought. It doesn't translate word for word; it translates the entire idea.
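
NOTE
Editor's aside: a schematic of the encoder-decoder data flow, reusing the
hypothetical rnn_step helper and assuming words are already vectors of the
same size as the hidden state; the real output layer that picks each word is
elided, so this shows the flow of the "thought vector," not a full model.
import numpy as np
def encode(source_vectors, h0, enc_params):
    h = h0
    for x in source_vectors:            # read the entire source sentence...
        h = rnn_step(x, h, *enc_params)
    return h                            # ...compressed into one dense vector
def decode(thought, dec_params, n_words):
    h, y, outputs = thought, np.zeros_like(thought), []
    for _ in range(n_words):            # unfold the thought word by word
        h = rnn_step(y, h, *dec_params)
        y = h                           # stand-in for picking the next word
        outputs.append(y)
    return outputs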

00:15:45.210 --> 00:15:47.710
And we interact

00:15:47.710 --> 00:15:49.799
with these architectures constantly. The source

00:15:49.799 --> 00:15:52.899
points out that these LSTM networks powered Google

00:15:52.899 --> 00:15:55.559
Voice Search and dictation on Android devices

00:15:55.559 --> 00:15:58.330
for years. They were the industry standard. But

00:15:58.330 --> 00:16:00.070
some of the other applications listed in this

00:16:00.070 --> 00:16:02.490
Wikipedia article prove it's not just about text.

00:16:02.610 --> 00:16:05.169
Not at all. Sequence data is everywhere in the

00:16:05.169 --> 00:16:07.649
physical world. Right. For example, they used

00:16:07.649 --> 00:16:10.850
LSTMs to learn the long-term structure of blues

00:16:10.850 --> 00:16:14.110
music. Yes, to compose original blues songs.

00:16:14.330 --> 00:16:16.830
Because music at its core is just a sequence

00:16:16.830 --> 00:16:19.669
of notes over time. Exactly. And it scales up

00:16:19.669 --> 00:16:21.809
to incredibly high-stakes physical environments

00:16:21.809 --> 00:16:25.169
as well. The article mentions an RNN system that

00:16:25.169 --> 00:16:28.009
learned how to tie knots for robotic heart surgery.

00:16:28.149 --> 00:16:30.690
Wow. But it makes sense. Tying a knot is entirely

00:16:30.690 --> 00:16:33.389
sequential. You can't skip step two and somehow

00:16:33.389 --> 00:16:36.070
arrive at step three. The precise order of movements

00:16:36.070 --> 00:16:38.370
is critical. Furthermore, they are being used

00:16:38.370 --> 00:16:40.809
to predict fusion plasma disruptions in nuclear

00:16:40.809 --> 00:16:44.309
reactors. Wait, really? Nuclear reactors? Yes.

00:16:44.529 --> 00:16:47.149
A nuclear reactor state is a continuous time

00:16:47.149 --> 00:16:50.809
series of sensor data. An RNN can analyze that

00:16:50.809 --> 00:16:53.169
ongoing sequence and predict a highly dangerous

00:16:53.169 --> 00:16:55.889
disruption before it actually occurs. That is

00:16:55.889 --> 00:16:58.610
incredible and maybe the most futuristic application

00:16:58.610 --> 00:17:02.549
listed. Decoding speech directly from the brain

00:17:02.549 --> 00:17:05.470
of a paralyzed person. A truly profound application.

00:17:05.650 --> 00:17:07.730
They take the sequential firing of neurons in

00:17:07.730 --> 00:17:10.470
the brain, feed that biological sequence into

00:17:10.470 --> 00:17:13.369
an RNN, and it outputs the sequence of words

00:17:13.369 --> 00:17:16.349
the person is trying to say. That is just staggering.

00:17:16.690 --> 00:17:19.230
It highlights the raw power of sequential processing.

00:17:19.549 --> 00:17:22.190
But taking a step back, training a system to

00:17:22.190 --> 00:17:24.849
do something that complex, without it cascading

00:17:24.849 --> 00:17:27.789
into a mess of compounding errors, requires some

00:17:27.789 --> 00:17:29.710
clever engineering tricks during the learning

00:17:29.710 --> 00:17:32.410
phase. Which naturally leads us to teacher forcing.

00:17:32.569 --> 00:17:35.430
Yes. When an RNN loops its own output back into

00:17:35.430 --> 00:17:37.769
itself, what happens if it hallucinates the wrong

00:17:37.769 --> 00:17:39.710
word at the very start of the sentence? Doesn't

00:17:39.710 --> 00:17:41.630
it just poison its own memory pool for the rest

00:17:41.630 --> 00:17:43.970
of the sequence? That is exactly the danger.

00:17:44.250 --> 00:17:46.990
If the decoder in a translation task generates

00:17:46.990 --> 00:17:49.799
the wrong second word, its hidden state

00:17:49.799 --> 00:17:52.319
is now corrupted by that error. So the whole

00:17:52.319 --> 00:17:55.920
thing is ruined? Basically. Every word it generates

00:17:55.920 --> 00:17:58.339
after that will likely be wrong, and the learning

00:17:58.339 --> 00:18:00.299
signal becomes useless because the whole sentence

00:18:00.299 --> 00:18:02.579
is garbage. Here's my analogy for this training

00:18:02.579 --> 00:18:05.400
method. Imagine you're teaching a kid to spell

00:18:05.400 --> 00:18:08.670
a 10-letter word, like "dictionary." OK. If they

00:18:08.670 --> 00:18:10.809
get the second letter wrong, they write a U instead

00:18:10.809 --> 00:18:13.809
of an I. If you just let them keep going, the

00:18:13.809 --> 00:18:15.950
next eight letters are going to be a confused

00:18:15.950 --> 00:18:17.990
mess, because they are trying to build on a fundamental

00:18:17.990 --> 00:18:20.609
mistake. Right. The errors compound exponentially.

00:18:20.789 --> 00:18:23.109
So teacher forcing is the teacher stepping in

00:18:23.109 --> 00:18:26.589
immediately. The kid writes U. The teacher intervenes

00:18:26.589 --> 00:18:29.589
and says, no, it's an I. Now try the third letter

00:18:29.589 --> 00:18:31.970
based on the correct I. That's a great way to

00:18:31.970 --> 00:18:34.250
explain it. During training, the network uses

00:18:34.250 --> 00:18:37.130
the actual correct previous word from the training

00:18:37.130 --> 00:18:40.630
data to predict the next word, rather than relying

00:18:40.630 --> 00:18:43.690
on its own potentially flawed guesses. It keeps

00:18:43.690 --> 00:18:46.390
the training process firmly on the rails. It

00:18:46.390 --> 00:18:48.890
allows the network to learn the true statistical

00:18:48.890 --> 00:18:51.529
relationship between the second and third word,

00:18:51.930 --> 00:18:53.910
rather than getting lost in the weeds of its

00:18:53.910 --> 00:18:56.369
own early hallucination. OK, so we've mapped

00:18:56.369 --> 00:18:59.390
out this incredible looping brain. We've seen

00:18:59.390 --> 00:19:03.049
how LSTMs cured its amnesia. We've explored how

00:19:03.049 --> 00:19:05.569
it translates languages and ties surgical knots.

00:19:05.569 --> 00:19:07.890
You've covered a lot. But if we've solved all

00:19:07.890 --> 00:19:10.650
these problems, why are we suddenly hearing about

00:19:10.650 --> 00:19:13.650
a completely different technology? Anyone following

00:19:13.650 --> 00:19:16.910
AI recently knows that large language models

00:19:16.910 --> 00:19:19.230
are completely dominating the space, and they

00:19:19.230 --> 00:19:21.990
aren't using RNNs. If we connect this to the

00:19:21.990 --> 00:19:25.029
bigger picture of AI today, the natural language

00:19:25.029 --> 00:19:27.369
processing field has been overtaken by a different

00:19:27.369 --> 00:19:29.769
architecture called transformers. The source

00:19:29.769 --> 00:19:32.569
is very clear about this shift. Transformers

00:19:32.569 --> 00:19:34.910
don't use recurrence, they don't loop. No, they

00:19:34.910 --> 00:19:37.049
don't. They use a mechanism called self-attention.

00:19:37.470 --> 00:19:40.789
To explain the difference, an RNN processes a

00:19:40.789 --> 00:19:43.230
sequence one word at a time, updating its hidden

00:19:43.230 --> 00:19:46.200
state. A transformer's self-attention mechanism

00:19:46.200 --> 00:19:48.960
works by looking at the entire sequence simultaneously.

00:19:49.160 --> 00:19:52.480
All at once. Yes. It mathematically weights the

00:19:52.480 --> 00:19:55.099
importance of every single word against every

00:19:55.099 --> 00:19:57.259
other word in the document all at once. So it

00:19:57.259 --> 00:19:59.000
doesn't have to step through it chronologically.
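
NOTE
Editor's aside: a minimal single-head self-attention sketch for contrast with
the recurrent loop, assuming the query/key/value projections are omitted
(Q = K = V = X). Every word is scored against every other word in one shot,
with no stepping through time.
import numpy as np
def self_attention(X):                       # X: one row vector per word
    scores = X @ X.T / np.sqrt(X.shape[1])   # all-pairs comparisons at once
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ X                 # each word: a weighted mix of all words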

00:19:59.259 --> 00:20:02.099
Exactly. And because they don't process sequentially,

00:20:02.599 --> 00:20:04.779
transformers are much easier to compute in parallel

00:20:04.779 --> 00:20:08.640
on massive arrays of graphics cards. Plus, they

00:20:08.640 --> 00:20:11.279
handle incredibly long-range dependencies even

00:20:11.279 --> 00:20:14.410
better than the conveyor belts of LSTMs. So, serious

00:20:14.410 --> 00:20:18.009
question for you. If transformers exist and they

00:20:18.009 --> 00:20:20.849
are dominating text and language right now with

00:20:20.849 --> 00:20:24.180
massive parallel processing, why should anyone

00:20:24.180 --> 00:20:26.700
listening to this care about RNNs? Are they just

00:20:26.700 --> 00:20:28.640
a stepping stone in computer science history?

00:20:28.740 --> 00:20:31.619
Oh, they are far from obsolete. Right. Yes. Transformers

00:20:31.619 --> 00:20:33.700
are incredibly powerful, but that look-at-everything-

00:20:33.700 --> 00:20:36.160
at-once self-attention mechanism is extremely

00:20:36.160 --> 00:20:38.819
computationally expensive. It requires massive

00:20:38.819 --> 00:20:41.259
amounts of memory and processing power. Right.

00:20:41.579 --> 00:20:44.559
RNNs, by contrast, are highly computationally

00:20:44.559 --> 00:20:46.180
efficient. Because they only have to look at

00:20:46.180 --> 00:20:48.779
one step at a time, carrying that single hidden

00:20:48.779 --> 00:20:51.000
state forward, rather than holding the entire

00:20:51.000 --> 00:20:54.160
document in active memory. Precisely. So if you

00:20:54.160 --> 00:20:57.339
are running an AI on an edge device, like a smartwatch

00:20:57.339 --> 00:21:00.059
or a tiny environmental sensor, where you have

00:21:00.059 --> 00:21:02.059
strictly limited battery and limited memory,

00:21:02.579 --> 00:21:04.859
an RNN is often much more practical. That makes

00:21:04.859 --> 00:21:07.299
sense. Furthermore, RNNs remain highly relevant

00:21:07.299 --> 00:21:09.880
for continuous data that is inherently sequential

00:21:09.880 --> 00:21:12.160
and endless. Like the nuclear reactor sensors

00:21:12.160 --> 00:21:15.000
we mentioned. Yes. Or stock market data. Continuous

00:21:15.000 --> 00:21:18.380
time series forecasting. Exactly. A transformer

00:21:18.380 --> 00:21:21.819
needs a defined window of data, say a thousand

00:21:21.829 --> 00:21:25.390
tokens of context, to look at all at once. An RNN

00:21:25.390 --> 00:21:28.309
can just sit there processing a continuous, never-ending

00:21:28.309 --> 00:21:31.549
stream of data, tick by tick, updating

00:21:31.549 --> 00:21:33.950
its internal state indefinitely without running

00:21:33.950 --> 00:21:36.589
out of memory. They are fundamentally built for

00:21:36.589 --> 00:21:38.549
the flow of time. So they aren't going anywhere.
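
NOTE
Editor's aside: a sketch of why RNNs suit endless streams, reusing the
hypothetical rnn_step helper. Only the fixed-size hidden state is kept, never
the history, so memory use is constant no matter how long the stream runs;
the alarm function is a placeholder for a learned danger score.
def monitor(stream, h, params, alarm):
    for reading in stream:                 # a never-ending sensor iterator
        h = rnn_step(reading, h, *params)  # update the state tick by tick
        if alarm(h):                       # e.g. threshold a danger score
            yield "disruption predicted"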

00:21:38.650 --> 00:21:40.630
We just have a specialized tool for sequential

00:21:40.630 --> 00:21:42.829
data. That's the best way to view it. Well, this

00:21:42.829 --> 00:21:44.609
has been an incredible journey through the source

00:21:44.609 --> 00:21:47.809
material. We started with Santiago Ramón y Cajal,

00:21:48.089 --> 00:21:50.430
looking at the biological loops in brain tissue

00:21:50.430 --> 00:21:53.930
in 1901. A long time ago. We took a detour through

00:21:53.930 --> 00:21:56.490
the physics of magnetic atoms settling into low

00:21:56.490 --> 00:21:59.069
energy states. We confronted the mathematical

00:21:59.069 --> 00:22:01.789
wall of the vanishing gradient and saw how the

00:22:01.789 --> 00:22:04.829
robotic sorting arms of the LSTM's forget gates

00:22:05.289 --> 00:22:07.710
saved the day. That's quite a story. And we explored

00:22:07.710 --> 00:22:10.029
how these time-traveling networks paved the way

00:22:10.029 --> 00:22:12.900
for modern applications, from real-time translation

00:22:12.900 --> 00:22:15.890
to robotic surgery. It really is a testament

00:22:15.890 --> 00:22:18.809
to how many different fields of human inquiry,

00:22:19.109 --> 00:22:21.410
neuroscience, statistical mechanics, computer

00:22:21.410 --> 00:22:24.069
science, had to seamlessly merge to make this

00:22:24.069 --> 00:22:26.750
technology work. It is. And for you listening,

00:22:27.190 --> 00:22:29.269
I hope the next time you hit the microphone button

00:22:29.269 --> 00:22:31.750
on your phone and dictate a text message, you

00:22:31.750 --> 00:22:33.890
take a second to appreciate the mechanics behind

00:22:33.890 --> 00:22:36.910
it. Definitely. Because you are relying on decades

00:22:36.910 --> 00:22:39.569
of physicists, neuroscientists, and engineers

00:22:39.569 --> 00:22:42.130
who figured out how to mathematically teach a

00:22:42.130 --> 00:22:45.119
machine the human concepts of time and memory.

00:22:45.359 --> 00:22:47.519
And that actually brings up a fascinating final

00:22:47.519 --> 00:22:49.779
thought to consider. Oh, yeah. Think about the

00:22:49.779 --> 00:22:51.799
fundamental difference we just discussed between

00:22:51.799 --> 00:22:55.599
an RNN and the newer transformers. An RNN processes

00:22:55.599 --> 00:22:59.259
information strictly sequentially. It moves from

00:22:59.259 --> 00:23:02.000
the past into the present, carrying its memories

00:23:02.000 --> 00:23:04.900
with it, step by step. Just like human conscious

00:23:04.900 --> 00:23:07.660
experience, we can only experience the present

00:23:07.660 --> 00:23:10.240
moment informed by the past. Right. It literally

00:23:10.240 --> 00:23:13.230
experiences time. But a transformer looks at

00:23:13.230 --> 00:23:16.150
all the data at once. It sits outside the flow

00:23:16.150 --> 00:23:18.309
of time, looking at the beginning, middle, and

00:23:18.309 --> 00:23:21.329
end of a sequence simultaneously. As we continue

00:23:21.329 --> 00:23:23.430
to build more advanced artificial intelligence,

00:23:23.650 --> 00:23:26.250
we are shifting from models that process reality,

00:23:26.789 --> 00:23:29.809
like our biological time-bound brains, toward

00:23:29.809 --> 00:23:32.410
models that process information like omnipresent

00:23:32.410 --> 00:23:35.410
observers. Wow. It is profound. It makes you

00:23:35.410 --> 00:23:38.029
wonder, what does that mean for the kind of intelligence

00:23:38.029 --> 00:23:40.569
we are actually creating? If it doesn't experience

00:23:40.569 --> 00:23:42.890
the linear flow of time, is its understanding

00:23:42.890 --> 00:23:45.869
of reality fundamentally alien to our own? It's

00:23:45.869 --> 00:23:47.650
certainly a long way from the amnesiac AI we

00:23:47.650 --> 00:23:50.349
started with, trapped in a single isolated moment.

00:23:51.049 --> 00:23:52.849
Something to chew on the next time you ask a

00:23:52.849 --> 00:23:54.990
chat bot a question. Thank you for joining us

00:23:54.990 --> 00:23:56.809
on this deep dive. We'll see you next time.
