WEBVTT

00:00:00.000 --> 00:00:05.059
So picture this. It's 1987, and a computer actually

00:00:05.059 --> 00:00:08.140
manages to translate a sentence from English

00:00:08.140 --> 00:00:11.119
to Spanish using an artificial neural network.

00:00:11.199 --> 00:00:12.679
Right, which was a massive deal at the time.

00:00:12.820 --> 00:00:16.260
Huge deal. But the catch, the network's entire

00:00:16.260 --> 00:00:19.339
vocabulary was exactly 31 words. Yeah, just 31.

00:00:19.620 --> 00:00:23.059
And it took decades of waiting for computer hardware

00:00:23.059 --> 00:00:25.760
to finally catch up to the sheer complexity of

00:00:25.760 --> 00:00:28.539
the math required to get past those 31 words.

00:00:29.949 --> 00:00:32.909
Welcome to today's deep dive. We are taking you

00:00:32.909 --> 00:00:35.450
on a really personalized journey to understand

00:00:35.450 --> 00:00:37.929
how the translation apps on your phone actually

00:00:37.929 --> 00:00:40.409
work. It's a fascinating evolution. It really

00:00:40.409 --> 00:00:43.469
is. Our mission today is to sort of demystify

00:00:43.469 --> 00:00:45.689
the mechanics of how computers learn to translate

00:00:45.689 --> 00:00:48.310
human languages. We're tracking that whole evolution

00:00:48.310 --> 00:00:51.750
from those clunky 1980s experiments to today's

00:00:51.750 --> 00:00:54.149
massive AI models and really trying to figure

00:00:54.149 --> 00:00:56.189
out what is actually happening when you click

00:00:56.189 --> 00:00:58.189
translate on a web page. Because it's definitely

00:00:58.189 --> 00:01:01.240
not what most people think. Exactly. Okay, let's

00:01:01.240 --> 00:01:04.099
unpack this. We all generally understand how

00:01:04.099 --> 00:01:07.040
a computer processes binary, right? Just ones

00:01:07.040 --> 00:01:10.459
and zeros. But how do we translate the abstract,

00:01:10.620 --> 00:01:13.180
messy human concept of meaning into something

00:01:13.180 --> 00:01:16.310
a machine can calculate? I mean... how is neural

00:01:16.310 --> 00:01:19.090
machine translation or NMT, as we'll call it,

00:01:19.290 --> 00:01:21.989
how is that different from just holding a massive

00:01:21.989 --> 00:01:24.230
digital dictionary and swapping the words out

00:01:24.230 --> 00:01:26.810
one by one? Well, that dictionary swap idea is

00:01:26.810 --> 00:01:28.890
probably the most common misconception. Yeah.

00:01:29.650 --> 00:01:31.790
But human language just simply defies that kind

00:01:31.790 --> 00:01:33.849
of one-to-one mapping. Right, because of grammar

00:01:33.849 --> 00:01:36.469
and stuff. Exactly. Grammar, syntax, and just

00:01:36.469 --> 00:01:39.230
context fundamentally change meaning. So neural

00:01:39.230 --> 00:01:41.150
machine translation doesn't swap words at all.

00:01:41.250 --> 00:01:43.129
It uses an artificial neural network to predict

00:01:43.129 --> 00:01:45.609
the probability of a sequence of words. It's

00:01:45.609 --> 00:01:48.109
usually modeling entire sentences in a single

00:01:48.109 --> 00:01:50.769
integrated model. Wow, entire sentences at once.

00:01:51.090 --> 00:01:53.030
Yeah. So instead of looking at isolated puzzle

00:01:53.030 --> 00:01:55.790
pieces, it processes the entire picture to predict

00:01:55.790 --> 00:01:57.750
the mathematically most probable translation.

00:01:58.030 --> 00:02:00.769
OK, so that makes it basically a game of probabilities.

00:02:01.569 --> 00:02:04.030
Like, given a specific Spanish sentence, the

00:02:04.030 --> 00:02:06.269
system just calculates the most statistically

00:02:06.269 --> 00:02:09.250
likely English equivalent. Precisely. And because

00:02:09.250 --> 00:02:11.889
it models the whole sentence, NMT has really

00:02:11.889 --> 00:02:13.810
become the dominant approach in the industry.

00:02:14.050 --> 00:02:16.729
I mean, for high resource languages. Like English

00:02:16.729 --> 00:02:18.469
or Spanish. Right, English, Spanish, French.

00:02:18.469 --> 00:02:21.909
Yeah. Languages where we have literal mountains

00:02:21.909 --> 00:02:25.389
of high quality bilingual text to train the system.

00:02:25.909 --> 00:02:29.580
For those, NMT can genuinely rival human translation.

00:02:29.780 --> 00:02:32.520
But it's not perfect everywhere, right? No, definitely

00:02:32.520 --> 00:02:35.060
not. It still struggles with low resource languages

00:02:35.060 --> 00:02:38.560
where that training data is scarce. And even

00:02:38.560 --> 00:02:40.699
then, it can occasionally produce translations

00:02:40.699 --> 00:02:43.580
that are just overly literal. Right, a bit robotic.

00:02:44.479 --> 00:02:47.360
To understand why it succeeds or why it sometimes

00:02:47.360 --> 00:02:49.659
produces that rigid translation, we really need

00:02:49.659 --> 00:02:52.099
to dig into how a neural network processes a

00:02:52.099 --> 00:02:54.379
sentence in the first place, because we are dealing

00:02:54.379 --> 00:02:56.500
with machines that don't inherently understand

00:02:56.500 --> 00:03:00.780
concepts like sarcasm or urgency. They just process

00:03:00.780 --> 00:03:03.020
numbers. Yeah, they process everything with vector

00:03:03.020 --> 00:03:05.960
mathematics. Vector mathematics. OK, break that

00:03:05.960 --> 00:03:09.659
down for me. Sure. So most NMT models use what's

00:03:09.659 --> 00:03:12.780
called an encoder decoder architecture. The first

00:03:12.780 --> 00:03:15.740
step is the encoder network. When you feed a

00:03:15.740 --> 00:03:18.819
source sentence into the system, the encoder

00:03:18.819 --> 00:03:22.099
doesn't actually see words. It converts those

00:03:22.099 --> 00:03:24.740
words into a mathematical vector. Okay. You can

00:03:24.740 --> 00:03:26.500
think of this as mapping every single word to

00:03:26.500 --> 00:03:29.240
a specific coordinate in a really high dimensional

00:03:29.240 --> 00:03:31.879
mathematical space. So instead of like a flat

00:03:31.879 --> 00:03:34.780
paper map with an X and Y axis, we are talking

00:03:34.780 --> 00:03:37.500
about a space with hundreds or even thousands

00:03:37.500 --> 00:03:40.240
of dimensions. Exactly. And the system plots

00:03:40.240 --> 00:03:43.419
a word like, say, king at a specific coordinate,

00:03:43.599 --> 00:03:46.460
and then queen ends up plotted geometrically

00:03:46.460 --> 00:03:48.379
right next to it because their semantic meanings

00:03:48.379 --> 00:03:51.120
are so similar. Yes, that's it perfectly. The

00:03:51.120 --> 00:03:53.219
distance and the direction between these coordinates

00:03:53.219 --> 00:03:56.199
represent their relationships. The vector space

00:03:56.199 --> 00:03:58.979
essentially captures the context. That's wild.

00:03:59.180 --> 00:04:01.900
It is. If you use the word bank in a sentence

00:04:01.900 --> 00:04:04.280
about a river, its vector coordinates are going

00:04:04.280 --> 00:04:06.699
to sit near words like water and shore. But

00:04:06.699 --> 00:04:09.979
if you use bank in a sentence about a mortgage,

00:04:10.580 --> 00:04:13.349
its coordinates dynamically shift to sit near

00:04:13.349 --> 00:04:16.629
money and vault. So the encoder processes the.

00:04:16.810 --> 00:04:19.430
entire source sentence and produces a matrix

00:04:19.430 --> 00:04:22.610
of these vectors. Which is basically a complex

00:04:22.610 --> 00:04:24.810
mathematical representation of the entire thought.
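To make the geometry concrete, here is a minimal Python sketch with hand-picked 3-dimensional vectors (real systems learn these values and use hundreds of dimensions; every number here is made up purely for illustration):

```python
import math

# Toy "coordinates" for a few words -- hand-picked, not learned.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.12],
    "river": [0.1, 0.2, 0.95],
}

def cosine_similarity(a, b):
    """Measure how closely two word vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Semantically similar words sit close together; unrelated ones do not.
print(cosine_similarity(vectors["king"], vectors["queen"]))  # close to 1.0
print(cosine_similarity(vectors["king"], vectors["river"]))  # much lower
```

Distance and direction between coordinates is all the "meaning" the machine ever sees.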

00:04:25.029 --> 00:04:27.189
Exactly. Okay, so the encoder takes the source

00:04:27.189 --> 00:04:29.949
thought, maps it out as coordinates in this thousand

00:04:29.949 --> 00:04:32.910
dimensional space, and then hands that map over

00:04:32.910 --> 00:04:35.110
to the next system. Right, that map is passed

00:04:35.110 --> 00:04:37.430
to the decoder network. Okay, the decoder. Yeah,

00:04:37.430 --> 00:04:39.730
the decoder takes that mathematical representation

00:04:39.730 --> 00:04:42.889
and produces the target words, usually generating

00:04:42.889 --> 00:04:45.050
them just one at a time. And what's fascinating

00:04:45.050 --> 00:04:48.279
here is that the decoder process is autoregressive.

00:04:48.839 --> 00:04:50.860
Autoregressive, meaning what exactly? It means

00:04:50.860 --> 00:04:53.360
it relies on its own previously predicted tokens

00:04:53.360 --> 00:04:56.000
to guess the next one. So it generates the first

00:04:56.000 --> 00:04:58.240
word, factors that word into its calculations,

00:04:58.500 --> 00:05:00.899
and then uses it to guess the second word. Oh,

00:05:00.899 --> 00:05:02.839
I see. Then it uses the first and second words

00:05:02.839 --> 00:05:05.399
to guess the third and only stops when it finally

00:05:05.399 --> 00:05:07.899
generates a special end-of-sentence token. Okay,

00:05:08.019 --> 00:05:10.879
it's like walking across a stream on stepping

00:05:10.879 --> 00:05:12.439
stones. Oh, that's a good way to put it. Yeah,

00:05:12.480 --> 00:05:15.339
like each step you take determines where you

00:05:15.339 --> 00:05:18.100
are balanced enough to step next. You have to

00:05:18.100 --> 00:05:20.339
commit to your current step, but you're constantly

00:05:20.339 --> 00:05:22.399
looking back at the bank you came from, the source

00:05:22.399 --> 00:05:24.399
sentence, right? Right. To make sure you're still

00:05:24.399 --> 00:05:26.199
heading in the right direction to reach the other

00:05:26.199 --> 00:05:28.620
side. That's a great analogy. Because if you

00:05:28.620 --> 00:05:32.759
take a misguided step early on, your entire path

00:05:32.759 --> 00:05:35.779
across that stream shifts. You're either forced

00:05:35.779 --> 00:05:38.040
to correct course, or you just end up in the

00:05:38.040 --> 00:05:40.819
wrong place entirely. Right. The autoregressive

00:05:40.819 --> 00:05:44.500
nature means every single output relies completely

00:05:44.500 --> 00:05:47.040
on the structural integrity of the previous output.
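The autoregressive loop described above can be sketched in a few lines of Python. The lookup-table "model" here is a hypothetical stand-in for the real neural decoder, and the Spanish tokens are made up for illustration:

```python
# Hypothetical stand-in for a trained decoder: maps the tokens generated
# so far to the next most probable token. A real decoder computes this
# with a neural network conditioned on the encoder's output.
def predict_next(generated):
    table = {
        (): "el",
        ("el",): "gato",
        ("el", "gato"): "duerme",
        ("el", "gato", "duerme"): "<eos>",
    }
    return table[tuple(generated)]

def decode():
    generated = []
    while True:
        token = predict_next(generated)  # each guess is fed back in
        if token == "<eos>":             # stop at the end-of-sentence token
            break
        generated.append(token)
    return generated

print(decode())  # ['el', 'gato', 'duerme']
```

Every output depends on the outputs before it, which is exactly why one bad early step can derail the whole sequence.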

00:05:47.579 --> 00:05:50.839
OK. So if the underlying math, this whole encoder-decoder

00:05:50.839 --> 00:05:53.100
vector mapping thing, if it's so elegant.

00:05:53.660 --> 00:05:56.459
Why did we spend the early 2000s dealing with

00:05:56.459 --> 00:05:59.959
those notoriously terrible robotic internet translations?

00:06:00.399 --> 00:06:02.180
Ah, well, that comes down to the hardware. Right,

00:06:02.300 --> 00:06:04.740
because the timeline here is fascinating. We

00:06:04.740 --> 00:06:06.620
established that Robert B. Allen was running

00:06:06.620 --> 00:06:08.980
feed-forward neural networks to translate English

00:06:08.980 --> 00:06:12.379
to Spanish way back in 1987. Yeah, but Allen's

00:06:12.379 --> 00:06:16.079
1987 model was just a proof of concept. The network's

00:06:16.079 --> 00:06:18.279
input and output layers had to be specifically

00:06:18.279 --> 00:06:21.540
sized for the longest possible sentence because

00:06:21.540 --> 00:06:24.199
those early networks couldn't handle arbitrary

00:06:24.199 --> 00:06:26.839
varying sentence lengths. Right. But his summary

00:06:26.839 --> 00:06:29.040
actually outlined the dual encoder decoder model

00:06:29.040 --> 00:06:32.439
we still use today. Which is crazy. And the momentum

00:06:32.439 --> 00:06:35.319
continued in the 1990s, right? Lonnie Chrisman

00:06:35.319 --> 00:06:38.060
developed those RAAM networks in 1991 that could

00:06:38.060 --> 00:06:40.839
encode variable-length sentences into fixed-size

00:06:40.680 --> 00:06:43.480
representations. Yes, and then Forcada and Ñeco

00:06:43.480 --> 00:06:46.560
in 1997. Right, they directly trained a source

00:06:46.560 --> 00:06:48.860
encoder and target decoder, so they literally

00:06:48.860 --> 00:06:51.600
had the theoretical architecture mapped out before

00:06:51.600 --> 00:06:53.480
the turn of the millennium. They had the blueprint,

00:06:53.800 --> 00:06:56.490
absolutely. but they lacked the raw material.

00:06:56.509 --> 00:06:59.709
The computing power. Exactly. The computing resources

00:06:59.709 --> 00:07:02.129
of the 1990s were just completely insufficient

00:07:02.129 --> 00:07:05.170
to process the massive bilingual datasets required

00:07:05.170 --> 00:07:07.550
to train a neural network on real-world texts.

00:07:07.970 --> 00:07:10.069
The computational complexity of calculating those

00:07:10.069 --> 00:07:12.529
high-dimensional vectors across millions of

00:07:12.529 --> 00:07:14.649
parameters, it was just way too high for the

00:07:14.649 --> 00:07:17.040
processors of that era. It sounds like trying

00:07:17.040 --> 00:07:20.100
to sequence the human genome using an abacus

00:07:20.100 --> 00:07:22.620
or something. Yeah, basically. The math is theoretically

00:07:22.620 --> 00:07:24.779
sound, but you just don't have the speed or the

00:07:24.779 --> 00:07:27.019
memory to actually finish the calculation in

00:07:27.019 --> 00:07:29.839
any kind of usable time frame. Right. And that

00:07:29.839 --> 00:07:32.459
hardware limitation forced the entire industry

00:07:32.459 --> 00:07:36.040
to pivot. So for two decades, a completely different

00:07:36.040 --> 00:07:38.339
method called statistical machine translation,

00:07:38.560 --> 00:07:42.439
or SMT, became the standard. OK, SMT. Yeah, SMT

00:07:42.439 --> 00:07:45.370
abandoned neural networks entirely. Instead,

00:07:45.750 --> 00:07:48.370
it used these massive statistical models that

00:07:48.370 --> 00:07:51.709
analyzed huge volumes of text to map the probabilities

00:07:51.709 --> 00:07:54.410
of specific phrases, which they called n-grams.

00:07:54.430 --> 00:07:56.910
Ah, so it didn't actually understand the semantic

00:07:56.910 --> 00:07:59.110
meaning of a word in a vector space like we talked

00:07:59.110 --> 00:08:01.540
about. No, not at all. It just counted how often

00:08:01.540 --> 00:08:04.500
word A appeared next to word B. Well, that completely

00:08:04.500 --> 00:08:06.899
explains why early internet translations felt

00:08:06.899 --> 00:08:09.560
so disjointed. It was essentially looking up

00:08:09.560 --> 00:08:11.800
a phrase, finding the most statistically common

00:08:11.800 --> 00:08:14.819
replacement, and just pasting it in without understanding

00:08:14.819 --> 00:08:17.120
the overarching thought of the paragraph. Exactly.
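The SMT counting idea is easy to sketch. This toy Python example builds bigram (2-gram) counts from a made-up corpus and picks the statistically most common next word, with no vector space and no notion of meaning:

```python
from collections import Counter

# A tiny made-up corpus; a real SMT system counted over billions of words.
corpus = "the cat sat on the mat the cat ran on the road".split()

# Count how often word A is immediately followed by word B.
bigrams = Counter(zip(corpus, corpus[1:]))

def most_likely_next(word):
    """Pick the statistically most common word to follow `word`."""
    candidates = {b: n for (a, b), n in bigrams.items() if a == word}
    return max(candidates, key=candidates.get)

print(most_likely_next("the"))  # 'cat' follows 'the' most often here
```

That is the whole trick: look up the most frequent continuation and paste it in, which is why long-range context got lost.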

00:08:17.620 --> 00:08:20.180
But SMT was the best we could do with the hardware

00:08:20.180 --> 00:08:22.480
constraints of the time. So what changed? Well,

00:08:22.579 --> 00:08:25.540
by 2013 and 2014, processors and specialized

00:08:25.540 --> 00:08:28.399
hardware, specifically GPUs, had finally caught

00:08:28.399 --> 00:08:32.159
up. And we saw the breakthrough of end-to-end

00:08:32.159 --> 00:08:34.799
neural machine translation, specifically the

00:08:34.799 --> 00:08:38.220
seq2seq, or sequence-to-sequence models. Right.

00:08:38.240 --> 00:08:40.799
Researchers like Kalchbrenner and Blunsom utilized

00:08:40.799 --> 00:08:42.799
convolutional neural networks, which basically

00:08:42.799 --> 00:08:45.860
apply sliding mathematical filters over the text

00:08:45.860 --> 00:08:48.980
to encode the source. And simultaneously, Cho

00:08:48.980 --> 00:08:51.740
and Sutskever utilized recurrent neural networks,

00:08:51.820 --> 00:08:54.519
or RNNs. And those process data sequentially,

00:08:54.899 --> 00:08:56.820
right? Maintaining a kind of internal memory

00:08:56.820 --> 00:08:58.980
state. Yes, exactly. Wait, let me pause you there.

00:08:59.299 --> 00:09:01.679
If the encoder is compressing a 50-word sentence

00:09:01.679 --> 00:09:03.779
into the exact same fixed-size mathematical

00:09:03.779 --> 00:09:06.139
box as a three-word sentence, doesn't it run

00:09:06.139 --> 00:09:08.679
out of room? Oh, that's a very perceptive question.

00:09:08.820 --> 00:09:10.559
Like, wouldn't it sort of forget the beginning

00:09:10.559 --> 00:09:12.860
of the paragraph by the time it reaches the end?

00:09:12.940 --> 00:09:14.980
You've hit on the exact phenomenon researchers

00:09:14.980 --> 00:09:18.220
call computational amnesia. Computational amnesia.

00:09:18.240 --> 00:09:21.340
That's a great term. It is. Squeezing a long,

00:09:21.500 --> 00:09:23.980
complex sentence into a single, fixed-length

00:09:23.980 --> 00:09:27.779
vector created a really severe bottleneck. The

00:09:27.779 --> 00:09:30.279
model performed exceptionally well on short phrases,

00:09:30.779 --> 00:09:33.019
but the translation quality just plummeted on

00:09:33.019 --> 00:09:36.360
longer texts because that early context was overwritten

00:09:36.360 --> 00:09:39.039
or diluted by the time the encoder reached the

00:09:39.039 --> 00:09:41.240
end of the sentence. So how do you retain the

00:09:41.240 --> 00:09:44.360
context of a massive paragraph without just overflowing

00:09:44.360 --> 00:09:46.740
the mathematical limits of the system? You have

00:09:46.740 --> 00:09:48.720
to change how the decoder looks at the information.

00:09:48.879 --> 00:09:52.600
Okay. So in 2014, Bahdanau and his colleagues

00:09:52.600 --> 00:09:56.580
introduced the attention mechanism. Yes. Instead

00:09:56.580 --> 00:09:58.700
of forcing the encoder to squeeze the entire

00:09:58.700 --> 00:10:01.620
source sentence into one rigid box, the attention

00:10:01.620 --> 00:10:03.820
mechanism allows the decoder to look back at

00:10:03.820 --> 00:10:06.240
the original source sentence at every single

00:10:06.240 --> 00:10:08.980
step of the decoding process. Oh, wow. It calculates

00:10:08.980 --> 00:10:11.460
a dynamic representation that focuses or, you

00:10:11.460 --> 00:10:13.440
know, pays attention to different specific words

00:10:13.440 --> 00:10:15.179
in the source sentence, depending on what target

00:10:15.179 --> 00:10:16.980
word it is currently trying to translate. OK.

00:10:17.000 --> 00:10:19.879
So if the system is translating a long paragraph

00:10:19.879 --> 00:10:22.700
about, I don't know, a red sports car driving

00:10:22.700 --> 00:10:25.320
dangerously fast down a mountain. Sure. When

00:10:25.320 --> 00:10:27.279
the decoder gets to the specific moment where

00:10:27.279 --> 00:10:30.080
it needs to output the word for car, the attention

00:10:30.080 --> 00:10:32.879
mechanism acts like a spotlight. Yes. It highlights

00:10:32.879 --> 00:10:36.460
the vectors for red, sports and fast in the source

00:10:36.460 --> 00:10:39.019
sentence to ensure the context is perfectly aligned

00:10:39.019 --> 00:10:41.559
for that exact microsecond of translation. That's

00:10:41.559 --> 00:10:44.080
it. Exactly. It mathematically weighs the relevance

00:10:44.080 --> 00:10:46.620
of every single source word against the current

00:10:46.620 --> 00:10:49.700
target word being generated. And this completely

00:10:49.700 --> 00:10:52.120
eradicated that fixed vector bottleneck we talked

00:10:52.120 --> 00:10:54.659
about. That's incredible. The real-world impact

00:10:54.659 --> 00:10:59.240
was immediate and just massive. By 2015, Baidu

00:10:59.240 --> 00:11:02.279
had launched the first large-scale NMT system.

00:11:02.580 --> 00:11:05.870
And then in 2016, Google followed with the Google

00:11:05.870 --> 00:11:08.009
neural machine translation system. OK, here's

00:11:08.009 --> 00:11:10.330
where it gets really interesting. The attention

00:11:10.330 --> 00:11:12.590
mechanism was initially designed as a patch,

00:11:12.690 --> 00:11:15.070
right? Yeah, pretty much. It was an add-on applied

00:11:15.070 --> 00:11:17.169
to those recurrent neural networks to cure their

00:11:17.169 --> 00:11:19.870
computational amnesia. But then researchers asked,

00:11:20.070 --> 00:11:21.750
well, what would happen if you stripped away

00:11:21.750 --> 00:11:25.049
the RNNs entirely and made attention the actual

00:11:25.049 --> 00:11:27.679
core foundation of the architecture? And that

00:11:27.679 --> 00:11:29.919
inquiry led directly to the transformer, which

00:11:29.919 --> 00:11:32.519
was introduced by Vaswani and colleagues in 2017.

00:11:32.740 --> 00:11:35.279
The transformer. Yes. The defining feature of

00:11:35.279 --> 00:11:38.220
the transformer is self-attention. Previous

00:11:38.220 --> 00:11:41.340
RNN models process data strictly sequentially.

00:11:41.840 --> 00:11:43.559
So word one, then word two, then word three.

00:11:43.600 --> 00:11:46.000
Right. The transformer discards the recurrent

00:11:46.000 --> 00:11:49.350
loops completely. Its encoder and decoder layers

00:11:49.350 --> 00:11:52.090
use self-attention to weigh and transform the

00:11:52.090 --> 00:11:55.009
entire sequence of words simultaneously. Everything

00:11:55.009 --> 00:11:57.529
is processed in parallel. Wait, hold on. If it

00:11:57.529 --> 00:12:00.049
processes the entire sentence in parallel, reading

00:12:00.049 --> 00:12:02.269
all the words at the exact same time, how does

00:12:02.269 --> 00:12:04.789
it know the grammar? How does it differentiate

00:12:04.789 --> 00:12:07.049
between the dog bit the man and the man bit the

00:12:07.049 --> 00:12:09.490
dog if it doesn't process the order of the words?

00:12:09.669 --> 00:12:12.029
Ah, that is the inherent challenge of parallel

00:12:12.029 --> 00:12:14.690
processing. To solve this, the transformer uses

00:12:14.690 --> 00:12:17.990
explicit positional encoding. Positional encoding.

00:12:18.230 --> 00:12:20.409
Right. Before the words are fed into the self-attention

00:10:20.409 --> 00:10:23.509
layers, the system adds a unique mathematical

00:12:23.509 --> 00:12:26.889
timestamp, usually using a mix of sine and cosine

00:12:26.889 --> 00:12:29.269
functions to each word's vector. Oh, clever.
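The sine/cosine timestamp can be sketched directly from that description. This follows the sinusoidal formula from the transformer paper, with a tiny model width of 4 for readability (real models use hundreds of dimensions):

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal 'timestamp' vector for word position `pos`.

    Even dimensions use sine, odd dimensions use cosine, each at a
    different frequency, so every position gets a distinct pattern.
    """
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Each timestamp is added to the word's embedding, so word order
# survives even though all words are processed in parallel.
print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
print(positional_encoding(1, 4))
```

Because every position yields a different vector, the model can recover order from the math alone.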

00:12:29.549 --> 00:12:31.529
Yeah, this modifies the vector's coordinates

00:12:31.529 --> 00:12:34.809
just enough to embed its precise location in

00:12:34.809 --> 00:12:37.029
the sentence. So the model reads everything at

00:12:37.029 --> 00:12:39.409
once, but it reads those mathematical name tags

00:12:39.409 --> 00:12:42.980
to reconstruct the exact syntax in order. So

00:12:42.980 --> 00:12:45.600
it basically retains the massive speed advantage

00:12:45.600 --> 00:12:48.139
of parallel processing without sacrificing the

00:12:48.139 --> 00:12:51.580
grammar. Exactly. And this architecture officially

00:12:51.580 --> 00:12:54.019
killed off the old statistical machine translation

00:12:54.019 --> 00:12:57.360
models, didn't it? It did. NMT outclassed SMT

00:12:57.360 --> 00:13:00.820
across the board. First, because NMT uses continuous

00:13:00.820 --> 00:13:03.720
vector representations, it solved the sparsity

00:13:03.720 --> 00:13:06.259
issue. Sparsity? Yeah. In SMT, if a word was

00:13:06.259 --> 00:13:08.139
rare and didn't have enough statistical data,

00:13:08.460 --> 00:13:10.279
the system simply failed to translate it. It

00:13:10.279 --> 00:13:13.570
just threw an error or gave up. But in NMT, even

00:13:13.570 --> 00:13:15.669
a rare word is plotted somewhere in that high-dimensional

00:13:13.570 --> 00:13:15.669
space, allowing the model to infer

00:13:18.009 --> 00:13:20.490
its meaning based on its proximity to more common

00:13:20.490 --> 00:13:22.929
words. Wow, so it understands the concept of

00:13:22.929 --> 00:13:24.950
the rare word, even if it hasn't seen it a million

00:13:24.950 --> 00:13:29.750
times. Exactly. Second, SMT relied on those n-gram

00:13:29.750 --> 00:13:32.490
chunks of a few words, which created a

00:13:32.490 --> 00:13:35.470
hard cutoff. If you used a 5-gram model, it

00:13:35.470 --> 00:13:38.190
was physically incapable of remembering a grammatical

00:13:38.190 --> 00:13:40.590
rule established six words ago. Right, because

00:13:40.590 --> 00:13:43.009
its memory was only five words long. Right. But

00:13:43.009 --> 00:13:45.389
the transformer, using self-attention, has

00:13:45.389 --> 00:13:47.970
no hard cutoff. It can connect a pronoun at the

00:13:47.970 --> 00:13:50.370
very end of a paragraph directly back to a noun

00:13:50.370 --> 00:13:52.110
at the very beginning. That's a game changer.

00:13:52.370 --> 00:13:54.950
And finally, despite the incredibly complex math,

00:13:55.210 --> 00:13:57.850
an NMT model actually uses significantly less

00:13:57.850 --> 00:14:00.710
memory than the massive, sprawling databases

00:14:00.710 --> 00:14:03.169
required for a high -level statistical model.

00:14:03.470 --> 00:14:06.129
Okay, so we have this incredibly powerful architecture

00:14:06.129 --> 00:14:08.909
in the transformer, but an architecture is essentially

00:14:08.909 --> 00:14:12.000
an empty brain. How do we train a matrix of parameters

00:14:12.000 --> 00:14:14.039
to know what a fluent translation actually looks

00:14:14.039 --> 00:14:16.700
like? Training an NMT model relies on a metric

00:14:16.700 --> 00:14:19.799
called cross entropy loss. Cross entropy loss.

00:14:20.379 --> 00:14:22.759
Sounds intense. It's really cool, actually. You

00:14:22.759 --> 00:14:25.580
feed the untrained model a massive data set of

00:14:25.580 --> 00:14:28.539
bilingual sentences. So, say, an English source

00:14:28.539 --> 00:14:31.580
and its perfect Spanish translation. The model

00:14:31.580 --> 00:14:34.240
attempts to translate the English sentence. And

00:14:34.240 --> 00:14:35.799
in the beginning, its prediction is essentially

00:14:35.799 --> 00:14:37.779
just random noise. Right. It doesn't know anything

00:14:37.779 --> 00:14:40.970
yet. Exactly. Cross entropy loss is a mathematical

00:14:40.970 --> 00:14:43.070
function that calculates exactly how far off

00:14:43.070 --> 00:14:45.330
the model's prediction was from the ground truth

00:14:45.330 --> 00:14:48.289
target. It outputs a probability distribution

00:14:48.289 --> 00:14:51.090
over the entire vocabulary, and the loss function

00:14:51.090 --> 00:14:53.269
measures the distance between the model's confident

00:14:53.269 --> 00:14:56.429
guess and the actual correct word. So the system

00:14:56.429 --> 00:14:59.309
tries, fails catastrophically, and the cross

00:14:59.309 --> 00:15:02.070
entropy math quantifies the exact magnitude and

00:15:02.070 --> 00:15:05.100
direction of that failure. Spot on. Once the

00:15:05.100 --> 00:15:07.700
failure is quantified, the model uses an algorithm

00:15:07.700 --> 00:15:10.279
called stochastic gradient descent. It calculates

00:15:10.279 --> 00:15:12.919
the slope of the error and iteratively adjusts

00:15:12.919 --> 00:15:15.080
its billions of internal parameters, you know,

00:15:15.139 --> 00:15:17.460
the dials and knobs of the neural network downhill

00:15:17.460 --> 00:15:20.659
toward zero error. It processes the data in small

00:15:20.659 --> 00:15:23.799
chunks called mini-batches, constantly tweaking

00:15:23.799 --> 00:15:26.379
the vector space to maximize the probability

00:15:26.379 --> 00:15:28.799
that it produces the exact target sentences.
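Cross-entropy loss is simple to sketch with toy probabilities: it is the negative log of the probability the model assigned to the correct word, so confident right answers score near zero and uniform guessing scores high. The vocabularies and numbers below are made up for illustration:

```python
import math

def cross_entropy(predicted_probs, correct_word):
    """Loss for one prediction step: -log P(correct word)."""
    return -math.log(predicted_probs[correct_word])

# An untrained model spreads probability almost uniformly...
untrained = {"hola": 0.25, "gato": 0.25, "adios": 0.25, "sol": 0.25}
# ...while a trained model concentrates it on the right answer.
trained = {"hola": 0.90, "gato": 0.04, "adios": 0.03, "sol": 0.03}

print(cross_entropy(untrained, "hola"))  # high loss, about 1.386
print(cross_entropy(trained, "hola"))    # low loss, about 0.105
```

Gradient descent then nudges the parameters in whichever direction shrinks this number, one mini-batch at a time.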

00:15:28.919 --> 00:15:31.559
That makes sense. But if we connect this to the

00:15:31.559 --> 00:15:34.470
bigger picture, the autoregressive nature of

00:15:34.470 --> 00:15:37.269
the decoder creates a massive hurdle during this

00:15:37.269 --> 00:15:39.649
training phase. Because it relies on its own

00:15:39.649 --> 00:15:43.789
past guesses. Yes. The decoder uses its own previous

00:15:43.789 --> 00:15:47.230
output to guess the next word. If the untrained

00:15:47.230 --> 00:15:50.600
model gets the very first word wrong, it feeds

00:15:50.600 --> 00:15:53.480
that incorrect word back into itself, virtually

00:15:53.480 --> 00:15:55.799
guaranteeing that the second, third, and fourth

00:15:55.799 --> 00:15:58.120
words will also be completely wrong. Okay, I

00:15:58.120 --> 00:16:00.580
get it. It's like a piano teacher sitting next

00:16:00.580 --> 00:16:03.379
to a student who is learning a complex chord

00:16:03.379 --> 00:16:05.320
progression. Okay, I like where this is going.

00:16:05.480 --> 00:16:07.399
If the student hits the very first chord wrong,

00:16:07.700 --> 00:16:09.279
they might try to play the rest of the song in

00:16:09.279 --> 00:16:12.080
that wrong key, right? They're building an entire

00:16:12.080 --> 00:16:14.860
sequence of bad muscle memory based on one initial

00:16:14.860 --> 00:16:17.610
mistake. To fix this, we use a technique called

00:16:17.610 --> 00:16:20.929
teacher forcing. Like, the piano teacher physically

00:16:20.929 --> 00:16:23.669
forces the student's hands onto the correct first

00:16:23.669 --> 00:16:26.269
chord and says, now from this correct position,

00:16:26.370 --> 00:16:28.970
play the next chord. The teacher forces them

00:16:28.970 --> 00:16:30.970
to the right starting point for every single

00:16:30.970 --> 00:16:33.850
step, so they learn the correct overarching sequence.
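The piano-teacher idea maps directly to code. In this toy Python sketch, the "model" is a hypothetical lookup table that gets the first word wrong; free-running decoding compounds that error, while teacher forcing feeds the ground-truth token back in and contains it:

```python
# Ground-truth target sentence from the (made-up) training data.
ground_truth = ["el", "gato", "duerme"]

def model_guess(previous_token):
    # An untrained model: its first guess is wrong on purpose here.
    bad_guesses = {None: "la", "el": "gato", "gato": "duerme"}
    return bad_guesses.get(previous_token, "???")

# Free-running: the model conditions on its own (wrong) previous guess,
# so one early mistake derails everything after it.
free_running, prev = [], None
for _ in ground_truth:
    prev = model_guess(prev)
    free_running.append(prev)

# Teacher forcing: each step starts from the *correct* previous token,
# regardless of what the model actually predicted.
teacher_forced = [model_guess(prev)
                  for prev in [None] + ground_truth[:-1]]

print(free_running)    # ['la', '???', '???'] -- compounding errors
print(teacher_forced)  # ['la', 'gato', 'duerme'] -- one isolated error
```

With the errors isolated per step, the loss signal stays informative and training converges instead of spiraling.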

00:16:34.090 --> 00:16:37.470
That is a brilliant analogy, yes. During training,

00:16:38.190 --> 00:16:40.070
regardless of what garbage the model actually

00:16:40.070 --> 00:16:43.669
predicts, the system forcibly feeds the correct

00:16:43.669 --> 00:16:46.690
ground truth token from the training data back

00:16:46.690 --> 00:16:49.169
in as the input for the next step. So it can't

00:16:49.169 --> 00:16:52.080
spiral. Right. Teacher forcing ensures the model

00:16:52.080 --> 00:16:54.440
doesn't spiral into compounding errors, allowing

00:16:54.440 --> 00:16:56.580
it to efficiently learn the mapping of the vector

00:16:56.580 --> 00:16:59.340
space. Incredible. OK, so up to this point, we

00:16:59.340 --> 00:17:01.840
focus strictly on dedicated NMT models, right?

00:17:02.379 --> 00:17:04.579
Systems explicitly designed with an encoder and

00:17:04.579 --> 00:17:07.240
a decoder trained on parallel bilingual data

00:17:07.240 --> 00:17:09.660
sets to do one specific job. Right, translation.

00:17:10.000 --> 00:17:12.339
But the landscape has shifted radically in just

00:17:12.339 --> 00:17:14.660
the last few years with the rise of generative

00:17:14.660 --> 00:17:17.420
large language models, or LLMs, like GPT-3 and

00:17:17.420 --> 00:17:20.740
GPT-4. It really has. The architecture of a

00:17:20.740 --> 00:17:23.079
generative LLM is fundamentally different from

00:17:23.079 --> 00:17:26.420
a dedicated NMT system. How so? Well, LLMs are

00:17:26.420 --> 00:17:29.339
typically decoder -only models. They completely

00:17:29.339 --> 00:17:31.400
discard the encoder network we talked about.

00:17:31.839 --> 00:17:34.880
And, more importantly, they are not explicitly

00:17:34.880 --> 00:17:37.859
trained on parallel translation datasets. Wait!

00:17:38.009 --> 00:17:40.769
Really? They are trained on a broader language

00:17:40.769 --> 00:17:43.170
modeling objective. They just simply predict

00:17:43.170 --> 00:17:46.029
the next logical word in a sequence drawn from

00:17:46.029 --> 00:17:48.950
a massive internet scale data set, the vast majority

00:17:48.950 --> 00:17:51.109
of which is in English. That's wild because there

00:17:51.109 --> 00:17:53.549
was a 2023 study by Hendy and colleagues that

00:17:53.549 --> 00:17:56.309
demonstrated that despite not being trained specifically

00:17:56.309 --> 00:17:59.789
to translate, these generative LLMs can produce

00:17:59.789 --> 00:18:02.849
highly fluent, competitive translations in a zero-shot

00:18:02.849 --> 00:18:05.130
setting. Meaning you don't feed them bilingual

00:18:05.130 --> 00:18:07.589
examples, right? You just prompt them to translate

00:18:07.589 --> 00:18:10.470
this, and they do. How is an autocomplete engine

00:18:10.470 --> 00:18:13.369
executing high-level translations? Unprecedented

00:18:13.369 --> 00:18:16.190
scale. To compete with a dedicated translation

00:18:16.190 --> 00:18:19.190
system, these LLMs utilize a brute-force approach

00:18:19.190 --> 00:18:22.549
of sheer size. By ingesting trillions of words

00:18:22.549 --> 00:18:24.710
across multiple languages during pre-training,

00:18:25.109 --> 00:18:27.410
the model develops an internal cross-lingual

00:18:27.410 --> 00:18:29.849
representation of language. So it's not translating

00:18:29.849 --> 00:18:32.210
like the old models? Not exactly, no. It isn't

00:18:32.210 --> 00:18:34.569
translating in the traditional sense. It is mathematically

00:18:34.569 --> 00:18:36.789
predicting that the most logical continuation

00:18:36.789 --> 00:18:39.670
of the sequence "translate hello to French" is

00:18:39.670 --> 00:18:42.670
the word bonjour. Wow. And the scale comparison

00:18:42.670 --> 00:18:45.130
is just staggering. I mean, mBART, which is a

00:18:45.130 --> 00:18:48.009
dedicated NMT model heavily fine-tuned specifically

00:18:48.009 --> 00:18:51.299
for translation, uses roughly 600 million

00:18:51.299 --> 00:18:53.740
parameters. Which is a lot. It is. The original

00:18:53.740 --> 00:18:57.200
transformer model from 2017 used 213 million,

00:18:57.480 --> 00:19:01.440
but GPT-3 utilizes 175 billion parameters. Yeah,

00:19:01.539 --> 00:19:03.319
the difference in magnitude means generative

00:19:03.319 --> 00:19:05.339
LLMs are computationally exorbitant to train

00:19:05.339 --> 00:19:07.799
and to run compared to an elegant, dedicated

00:19:07.799 --> 00:19:10.299
NMT system. So what does this all mean for you

00:19:10.299 --> 00:19:13.680
listening right now? We started in 1987 with

00:19:13.680 --> 00:19:16.299
a neural network that could barely juggle 31

00:19:16.299 --> 00:19:18.680
words because the hardware was decades behind

00:19:18.680 --> 00:19:21.819
the math. We suffered through the statistical

00:19:21.819 --> 00:19:25.039
phrase swapping era of early internet translations.

00:19:25.339 --> 00:19:28.480
We all remember those? Oh yeah. Then, the hardware

00:19:28.480 --> 00:19:31.039
finally caught up. We conquered computational

00:19:31.039 --> 00:19:33.740
amnesia with the attention mechanism, mapping

00:19:33.740 --> 00:19:36.200
entire concepts into high dimensional vector

00:19:36.200 --> 00:19:39.140
spaces, and we revolutionized parallel processing

00:19:39.140 --> 00:19:42.190
with the transformer. All of this highly complex,

00:19:42.549 --> 00:19:45.069
mathematically elegant engineering is running

00:19:45.069 --> 00:19:47.750
quietly in the background, trained step by step

00:19:47.750 --> 00:19:50.009
through teacher forcing, just to give you that

00:19:50.009 --> 00:19:51.910
effortless translation on your screen. And this

00:19:51.910 --> 00:19:53.970
raises an important question regarding how we

00:19:53.970 --> 00:19:56.309
evaluate these systems moving forward. Because

00:19:56.309 --> 00:19:58.529
while these models have reached heights that

00:19:58.529 --> 00:20:01.210
really rival human parity in high resource scenarios,

00:20:01.289 --> 00:20:04.329
we have to apply critical thinking to their limitations.

00:20:04.410 --> 00:20:08.170
They still suffer from domain shift. An NMT model

00:20:08.170 --> 00:20:11.130
trained heavily on, say, parliamentary proceedings

00:20:11.130 --> 00:20:13.690
will stumble if you ask it to translate a patient's

00:20:13.690 --> 00:20:16.750
casual text message to a doctor. The semantic

00:20:16.750 --> 00:20:19.250
vectors shift depending on the domain: medical,

00:20:19.650 --> 00:20:22.670
legal, colloquial, and recognizing those boundaries

00:20:22.670 --> 00:20:25.440
is critical. We just learned that a dedicated

00:20:25.440 --> 00:20:28.119
NMT model can translate beautifully with just

00:20:28.119 --> 00:20:30.480
a few hundred million parameters because it was

00:20:30.480 --> 00:20:33.019
explicitly designed and structured to do exactly

00:20:33.019 --> 00:20:35.759
that. But today the tech industry is increasingly

00:20:35.759 --> 00:20:39.140
relying on generative LLMs to perform those same

00:20:39.140 --> 00:20:42.819
translations using 175 billion parameters just

00:20:42.819 --> 00:20:45.720
to predict the next logical word. A massive shift.
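The scale gap described above is easy to make concrete with back-of-the-envelope arithmetic, using the approximate parameter counts mentioned in the discussion:

```python
# Approximate parameter counts from the discussion.
MBART_PARAMS = 600_000_000             # dedicated NMT model, ~600 million
TRANSFORMER_2017_PARAMS = 213_000_000  # original 2017 Transformer
GPT3_PARAMS = 175_000_000_000          # GPT-3, 175 billion

# GPT-3 is roughly 290x the size of mBART...
print(GPT3_PARAMS / MBART_PARAMS)          # ~291.7
# ...and roughly 820x the size of the original Transformer.
print(GPT3_PARAMS / TRANSFORMER_2017_PARAMS)  # ~821.6
```

So a generative model carries two to three orders of magnitude more parameters than the dedicated system whose translation quality it is competing with, which is exactly the efficiency question the hosts raise next.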

00:20:45.940 --> 00:20:48.170
It leaves you wondering... As we move forward,

00:20:48.289 --> 00:20:51.269
are we actually achieving a better, deeper structural

00:20:51.269 --> 00:20:53.789
understanding between human languages, or are

00:20:53.789 --> 00:20:56.250
we just throwing massive computationally expensive

00:20:56.250 --> 00:20:58.549
brute force at a problem we had already solved

00:20:58.549 --> 00:21:01.289
elegantly years ago? It is a profound architectural

00:21:01.289 --> 00:21:03.009
debate, and honestly, it's one that will define

00:21:03.009 --> 00:21:05.529
the next decade of artificial intelligence. So

00:21:05.529 --> 00:21:07.509
the next time you highlight a block of text on

00:21:07.509 --> 00:21:10.779
your phone and hit Translate, remember: you aren't

00:21:10.779 --> 00:21:14.160
invoking magic. You are triggering a highly sophisticated

00:21:14.160 --> 00:21:16.839
cascade of vector mathematics, self-attention

00:21:16.839 --> 00:21:19.759
mechanisms, and decades of relentless engineering.

00:21:20.240 --> 00:21:22.079
Thank you for joining us on this deep dive. Stay

00:21:22.079 --> 00:21:23.680
curious, and we'll see you next time.
