WEBVTT

00:00:00.000 --> 00:00:02.080
Okay, let's unpack this. I want you to imagine

00:00:02.080 --> 00:00:05.040
a single document. It's just 10 pages long. Right.

00:00:05.160 --> 00:00:08.279
It was uploaded to a server back in 2017, probably

00:00:08.279 --> 00:00:10.439
by some guy in a T-shirt, you know, drinking

00:00:10.439 --> 00:00:14.400
stale coffee. And yet that 10-page PDF is effectively

00:00:14.400 --> 00:00:17.280
the constitution, the blueprint, and maybe even

00:00:17.280 --> 00:00:20.839
the big bang of the entire modern AI era all

00:00:20.839 --> 00:00:23.100
rolled into one. It's not an exaggeration at

00:00:23.100 --> 00:00:24.859
all. If you look at the landscape of technology

00:00:24.859 --> 00:00:29.390
today, I mean, here in 2026, almost everything

00:00:29.390 --> 00:00:32.070
important in AI stems from this one moment. It's

00:00:32.070 --> 00:00:34.609
a dividing line. It is. There is before this

00:00:34.609 --> 00:00:36.609
paper and there is after this paper, period.

00:00:37.000 --> 00:00:38.520
We are, of course, talking about the research

00:00:38.520 --> 00:00:40.759
paper titled "Attention Is All You Need" by a team

00:00:40.759 --> 00:00:42.840
at Google. And I was looking at the stats you

00:00:42.840 --> 00:00:44.820
pulled for this deep dive. This is actually insane.

00:00:45.060 --> 00:00:47.700
It's mind-boggling. As of 2025, this paper sits

00:00:47.700 --> 00:00:50.240
in the top 10 most cited papers of the entire

00:00:50.240 --> 00:00:54.539
21st century. Right. Over 173,000 citations.

00:00:54.780 --> 00:00:56.899
And, you know, to put that in perspective, a

00:00:56.899 --> 00:00:59.640
truly groundbreaking paper in, say, biology or

00:00:59.640 --> 00:01:02.420
physics is lucky to get a few hundred citations.

00:01:02.520 --> 00:01:04.579
A few hundred. This didn't just nudge the field.

00:01:04.920 --> 00:01:07.700
It completely reinvented the physics of how

00:01:07.700 --> 00:01:09.980
computers process information. And that's exactly

00:01:09.980 --> 00:01:12.620
why we're here today. Because everyone knows

00:01:12.620 --> 00:01:16.000
the brand names. We all know ChatGPT, Gemini,

00:01:16.140 --> 00:01:18.620
Sora. We use them. We talk to them. We let them

00:01:18.620 --> 00:01:20.400
write our emails. Sure. They're household names.

00:01:20.599 --> 00:01:23.200
But almost no one understands the engine underneath

00:01:23.200 --> 00:01:26.560
the hood. Today, we are going to deconstruct

00:01:26.560 --> 00:01:28.799
the transformer. Which is the architecture that

00:01:28.799 --> 00:01:31.510
was introduced in this paper. Exactly. Our mission

00:01:31.510 --> 00:01:34.609
is to move past the buzzwords. We're going to

00:01:34.609 --> 00:01:37.310
look at the history, the technical details, to

00:01:37.310 --> 00:01:40.310
really understand mechanically why this specific

00:01:40.310 --> 00:01:42.750
architecture changed everything. And we should

00:01:42.750 --> 00:01:44.269
probably say we're going to get into the weeds

00:01:44.269 --> 00:01:46.849
a little bit. We're going to talk about vectors

00:01:46.849 --> 00:01:49.069
and matrices. But we promise to keep it English.

00:01:49.329 --> 00:01:52.829
We'll keep it grounded. But you can't really

00:01:52.829 --> 00:01:54.849
understand the revolution without understanding

00:01:54.849 --> 00:01:58.459
the machine itself. So to appreciate the aha

00:01:58.459 --> 00:02:01.219
moment, we have to understand the before times.

00:02:01.640 --> 00:02:06.640
Take us back to, say, pre-2017. If I wanted

00:02:06.640 --> 00:02:09.280
a computer to translate a sentence from English

00:02:09.280 --> 00:02:12.060
to German, how did it do it? Because we had Google

00:02:12.060 --> 00:02:15.460
Translate before 2017. We did, but it was clunky.

00:02:16.039 --> 00:02:18.159
The dominant technology at the time was something

00:02:18.159 --> 00:02:21.740
called recurrent neural networks, or RNNs. Okay.

00:02:21.879 --> 00:02:24.360
And specifically, a more advanced version called

00:02:24.360 --> 00:02:28.099
LSTMs, long short-term memory networks. Long

00:02:28.099 --> 00:02:30.259
short-term memory. That sounds like a contradiction.

00:02:30.360 --> 00:02:32.840
It was a brilliant solution for its time, designed

00:02:32.840 --> 00:02:36.240
to solve a very specific problem, memory. But

00:02:36.240 --> 00:02:38.719
here was the massive constraint. It was strictly

00:02:38.719 --> 00:02:41.020
sequential. Sequential, so one after the other.

00:02:41.099 --> 00:02:42.939
Exactly. Imagine you're trying to read a book,

00:02:42.939 --> 00:02:44.680
but you can't look at the whole page. Instead,

00:02:44.759 --> 00:02:47.159
you have to look at the text through a tiny slit

00:02:47.159 --> 00:02:49.740
in a piece of paper that only reveals one word

00:02:49.740 --> 00:02:51.620
at a time. Okay, I'm picturing that, like reading

00:02:51.620 --> 00:02:54.210
a scroll. You read the first word, you process

00:02:54.210 --> 00:02:56.389
it, you try to remember it, then you slide the

00:02:56.389 --> 00:02:59.289
slit to the second word. You process that, add

00:02:59.289 --> 00:03:00.870
it to your memory of the first word, then the

00:03:00.870 --> 00:03:03.449
third. You're forced to move left to right, step

00:03:03.449 --> 00:03:05.949
by step by step. So if I'm the computer, I literally

00:03:05.949 --> 00:03:08.830
cannot look at word number five until I have

00:03:08.830 --> 00:03:11.110
finished processing word number four. Precisely.

00:03:11.370 --> 00:03:14.430
And that created two massive problems that were

00:03:14.430 --> 00:03:17.590
basically strangling the entire field of AI.

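NOTE Editor's sketch, not part of the recording. The "slit in the paper" picture above can be written as a loop in which every step needs the state left by the previous step. This is a toy recurrent update with made-up scalar weights (w_h, w_x are assumptions, not values from any real model); it only illustrates why the work cannot be spread across many processors.
```python
import math
def rnn_step(hidden, x, w_h=0.5, w_x=1.0):
    # One toy recurrent update: the new state depends on the previous state.
    return math.tanh(w_h * hidden + w_x * x)
def run_rnn(inputs):
    # Each iteration needs the state produced by the one before it,
    # so this loop cannot be split across parallel workers.
    hidden = 0.0
    states = []
    for x in inputs:
        hidden = rnn_step(hidden, x)
        states.append(hidden)
    return states
states = run_rnn([0.1, 0.2, 0.3, 0.4, 0.5])
print(len(states))
```
Word five's state is only available after words one through four have been processed, which is the relay-race dependency the hosts go on to describe.
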
00:03:17.889 --> 00:03:21.400
Okay. What was the first? Speed. You couldn't

00:03:21.400 --> 00:03:23.439
parallelize the work. Hold on, let's dig into

00:03:23.439 --> 00:03:26.460
that. Parallelize is a big word. Why couldn't

00:03:26.460 --> 00:03:28.280
I just buy a thousand computers and have them

00:03:28.280 --> 00:03:30.800
all read the sentence at the same time? Because

00:03:30.800 --> 00:03:33.699
of the dependency. Think of it like a relay race.

00:03:34.159 --> 00:03:36.620
Runner B cannot start running until Runner A

00:03:36.620 --> 00:03:40.199
physically hands them the baton. It doesn't matter

00:03:40.199 --> 00:03:42.680
if you hire Usain Bolt for the second leg. He

00:03:42.680 --> 00:03:44.379
just has to stand there and wait. So if I have

00:03:44.379 --> 00:03:46.500
a really long sentence or a whole book, the computer's

00:03:46.500 --> 00:03:49.050
just chugging along one word at a time. Exactly.

00:03:49.449 --> 00:03:52.469
Modern GPUs, graphics processing units, are designed

00:03:52.469 --> 00:03:56.669
to do thousands of things at once. But RNNs forced

00:03:56.669 --> 00:03:59.449
them to do one thing at a time. It was incredibly

00:03:59.449 --> 00:04:01.889
inefficient. And the second problem? The second

00:04:01.889 --> 00:04:04.050
problem was even worse. It's a concept called

00:04:04.050 --> 00:04:06.590
the vanishing gradient. The vanishing gradient?

00:04:06.729 --> 00:04:08.930
That sounds like a horror movie title. It's the

00:04:08.930 --> 00:04:11.430
horror movie of linguistics, basically. It just

00:04:11.430 --> 00:04:14.240
means information loss. Since you're processing

00:04:14.240 --> 00:04:16.540
strictly in order, by the time the model gets

00:04:16.540 --> 00:04:19.480
to the end of a long paragraph, it tends to forget

00:04:19.480 --> 00:04:21.620
what happened at the beginning. Kind of like

00:04:21.620 --> 00:04:24.860
a game of telephone. A very good analogy. The

00:04:24.860 --> 00:04:27.220
model tries to cram the context of the first

00:04:27.220 --> 00:04:30.060
word into this little package of numbers, a state

00:04:30.060 --> 00:04:32.839
vector. Pass it to the second word, modify it,

00:04:32.879 --> 00:04:34.839
pass it to the third. And by the end? By the

00:04:34.839 --> 00:04:37.439
time you're 50 words in, the signal from word

00:04:37.439 --> 00:04:40.100
number one is barely a whisper. So if I have

00:04:40.100 --> 00:04:42.629
a sentence like "The girl who lived in the

00:04:42.629 --> 00:04:44.889
blue house down the street and liked to eat apples

00:04:44.889 --> 00:04:47.730
went to the store," by the time the computer gets

00:04:47.730 --> 00:04:49.589
to "went to the store," it might have forgotten who the

00:04:49.589 --> 00:04:52.129
girl was. Exactly. It knows someone went to the

00:04:52.129 --> 00:04:54.149
store, but that connection, that thread back

00:04:54.149 --> 00:04:57.350
to the subject, is weak. The context gets totally

00:04:57.350 --> 00:05:00.350
diluted in the bottleneck. The world needed a

00:05:00.350 --> 00:05:03.189
way to process the whole sentence at once. And

00:05:03.189 --> 00:05:05.430
that brings us to the heroes of our story. The

00:05:05.430 --> 00:05:09.050
team behind "Attention Is All You Need." Team Transformer,

00:05:09.050 --> 00:05:11.009
as they apparently call themselves. It was eight

00:05:11.009 --> 00:05:14.970
scientists at Google: Ashish Vaswani, Noam Shazeer,

00:05:15.089 --> 00:05:19.829
Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan

00:05:19.829 --> 00:05:23.779
Gomez, Łukasz Kaiser, and Illia Polosukhin. And

00:05:23.779 --> 00:05:25.480
here's a fun fact I found in the notes. Oh, yeah.

00:05:25.560 --> 00:05:28.420
They contributed so equally to this thing that

00:05:28.420 --> 00:05:30.720
the order of their names on the paper was randomized.

00:05:30.839 --> 00:05:33.360
Which is incredibly rare in academia. You know,

00:05:33.360 --> 00:05:35.199
usually the first name is the star and the last

00:05:35.199 --> 00:05:38.040
name is the boss. But here it really was a collective

00:05:38.040 --> 00:05:40.740
mind meld. They were just jamming on ideas. And

00:05:40.740 --> 00:05:43.839
looking at where they are now in 2026, the band

00:05:43.839 --> 00:05:46.459
has definitely broken up. Oh, completely. Every

00:05:46.459 --> 00:05:48.259
single one of them left Google. They all went

00:05:48.259 --> 00:05:51.240
on to found their own huge startups like Cohere,

00:05:51.300 --> 00:05:54.639
Character.AI, NEAR, or joined other major players.

00:05:54.839 --> 00:05:57.199
It's basically the PayPal mafia, but for this

00:05:57.199 --> 00:05:59.879
generation of AI. But before they split up, they

00:05:59.879 --> 00:06:02.319
had to name this thing. And for a paper that

00:06:02.319 --> 00:06:05.139
changed history, the naming process was surprisingly

00:06:05.139 --> 00:06:08.800
casual. "Transformer." Yeah, Jakob Uszkoreit just

00:06:08.800 --> 00:06:10.680
liked the sound of it. He thought it sounded

00:06:10.680 --> 00:06:13.860
powerful. In fact, our source material says that

00:06:13.860 --> 00:06:16.439
some of the early design documents actually featured

00:06:16.439 --> 00:06:18.720
characters from the Transformers toy franchise.

00:06:19.040 --> 00:06:22.480
No way. Yes. Optimus Prime and Megatron were

00:06:22.480 --> 00:06:24.800
apparently in the margins. And the title of the

00:06:24.800 --> 00:06:28.779
paper itself, "Attention Is All You Need." A nod

00:06:28.779 --> 00:06:31.670
to the Beatles, "All You Need Is Love." I love

00:06:31.670 --> 00:06:33.730
that. It adds a bit of personality to what is

00:06:33.730 --> 00:06:37.149
some very dense math. But the title was also

00:06:37.149 --> 00:06:40.069
a huge controversial claim. Attention is all

00:06:40.069 --> 00:06:42.050
you need. They're saying, hey, all that recurrent

00:06:42.050 --> 00:06:43.970
stuff, all those LSTMs we've been using for a

00:06:43.970 --> 00:06:45.829
decade? Throw them in the trash. Yeah. It's

00:06:45.829 --> 00:06:48.649
a radical hypothesis. Yeah. Jakob Uszkoreit suspected

00:06:48.649 --> 00:06:51.290
they could just ditch recurrence entirely. And

00:06:51.290 --> 00:06:53.350
our source notes that even his father, who's

00:06:53.350 --> 00:06:55.930
a famous computational linguist, was skeptical.

00:06:56.329 --> 00:06:58.730
He basically told his son, you can't just ignore

00:06:58.730 --> 00:07:01.319
the order of words. Language is order. But they

00:07:01.319 --> 00:07:04.680
did. So let's get into the machine. This is the

00:07:04.680 --> 00:07:07.259
part everyone struggles with. How does the transformer

00:07:07.259 --> 00:07:10.060
work if it doesn't read one word at a time? It

00:07:10.060 --> 00:07:13.379
uses a mechanism called self-attention. So instead

00:07:13.379 --> 00:07:15.480
of that slit-in-the-paper approach, imagine

00:07:15.480 --> 00:07:18.740
looking at the entire page instantly. The model

00:07:18.740 --> 00:07:20.899
looks at every single word in the sequence at

00:07:20.899 --> 00:07:23.620
the same time, and it calculates how much attention

00:07:23.620 --> 00:07:25.939
each word should pay to every other word. Okay,

00:07:26.000 --> 00:07:28.000
let's unpack that with an example. Let's use

00:07:28.000 --> 00:07:30.579
the sentence, "The animal didn't cross the street

00:07:30.579 --> 00:07:33.560
because it was too tired." Great sentence. Okay,

00:07:33.600 --> 00:07:36.079
when a human reads that, we get to the word "it,"

00:07:36.079 --> 00:07:39.399
"because it was too tired," and we instantly know

00:07:39.399 --> 00:07:41.899
that "it" refers to the animal. Right, because

00:07:41.899 --> 00:07:44.259
streets don't get tired. Exactly. We use common

00:07:44.259 --> 00:07:46.519
sense and context, but a computer doesn't have

00:07:46.519 --> 00:07:49.300
common sense. It just has numbers. So self-attention

00:07:49.300 --> 00:07:51.680
allows the model to connect the word "it" really

00:07:51.680 --> 00:07:54.439
strongly to the word animal and very weakly to

00:07:54.439 --> 00:07:57.079
the word street. It builds a web of relationships.

00:07:57.579 --> 00:08:00.670
Okay, but how? The paper talks about this query,

00:08:00.670 --> 00:08:03.629
key, and value system, the QKV. This is usually

00:08:03.629 --> 00:08:05.810
where people's eyes glaze over. I want you to

00:08:05.810 --> 00:08:07.230
explain this so my grandmother would get it.

00:08:07.310 --> 00:08:09.970
Challenge accepted. The best analogy is a library

00:08:09.970 --> 00:08:12.970
retrieval system. Okay. I'm in a library. Imagine

00:08:12.970 --> 00:08:16.269
every word in that sentence is a person in a

00:08:16.269 --> 00:08:20.470
library. Let's take the word "it." "It" is trying

00:08:20.470 --> 00:08:23.480
to figure out what it means. So "it" creates a

00:08:23.480 --> 00:08:25.560
query. Think of the query like a sticky note

00:08:25.560 --> 00:08:28.079
on its forehead that says, I am a pronoun looking

00:08:28.079 --> 00:08:30.680
for a noun that can get tired. Okay, so the query

00:08:30.680 --> 00:08:32.639
is the intent. What am I looking for? Exactly.

00:08:32.820 --> 00:08:34.779
Now, every other word in the sentence, animal,

00:08:34.960 --> 00:08:38.539
street, cross, is standing there holding a book.

00:08:39.059 --> 00:08:42.019
And on the spine of that book is a label. That

00:08:42.019 --> 00:08:44.320
label is the key. So the key is like the description

00:08:44.320 --> 00:08:46.600
of what that word offers. Yes. The word street

00:08:46.600 --> 00:08:49.559
has a key that says, I am a rigid paved surface.

00:08:50.069 --> 00:08:53.070
The word animal has a key that says, I am a biological

00:08:53.070 --> 00:08:55.870
entity that experiences fatigue. I see where

00:08:55.870 --> 00:08:57.809
this is going. My query looking for something

00:08:57.809 --> 00:09:00.409
that gets tired scans all the keys. And it finds

00:09:00.409 --> 00:09:04.070
a match. The query for "it" meshes perfectly with

00:09:04.070 --> 00:09:06.549
the key for animal. That mathematical match,

00:09:06.690 --> 00:09:09.250
it's called the dot product, creates a really

00:09:09.250 --> 00:09:11.309
high score. It's a high compatibility rating.

00:09:11.389 --> 00:09:13.269
It clashes with street, so that gets a low score.

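NOTE Editor's sketch, not part of the recording. The query-to-key "match" described above is a dot product. Below is a minimal illustration using hand-picked two-number vectors (the numbers and the meaning assigned to each dimension are assumptions for the example; a trained model learns its own vectors).
```python
def dot(a, b):
    # Multiply matching entries and add them up.
    return sum(x * y for x, y in zip(a, b))
# Hypothetical vectors chosen by hand for illustration only.
# Dimension 0 ~ "can get tired", dimension 1 ~ "is a paved surface".
query_it = [1.0, 0.0]
keys = {"animal": [0.9, 0.1], "street": [0.0, 1.0], "cross": [0.2, 0.2]}
scores = {word: dot(query_it, key) for word, key in keys.items()}
best = max(scores, key=scores.get)
print(best, scores)
```
With these toy numbers the query for "it" scores highest against the key for "animal" and lowest against "street", mirroring the library analogy.
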
00:09:13.570 --> 00:09:15.850
Okay, so I found my match. I know it belongs

00:09:15.850 --> 00:09:19.080
to animal. What happens then? Once you find the

00:09:19.080 --> 00:09:20.960
match, you take the book off the shelf and open

00:09:20.960 --> 00:09:24.220
it. The information inside, the actual meaning,

00:09:24.320 --> 00:09:27.539
the essence of that word is the value. You then

00:09:27.539 --> 00:09:30.220
absorb that value. So now the representation

00:09:30.220 --> 00:09:33.500
of the word it is updated with the value from

00:09:33.500 --> 00:09:35.340
the word animal. That is actually really helpful.

00:09:35.419 --> 00:09:37.460
So instead of just next word, next word, every

00:09:37.460 --> 00:09:39.919
word is constantly querying every other word

00:09:39.919 --> 00:09:41.960
to see, hey, are we related? Do you have context

00:09:41.960 --> 00:09:44.379
I need? Precisely. And it happens all at once.

00:09:44.559 --> 00:09:47.919
Animal is querying tired. Cross is querying street.

00:09:48.059 --> 00:09:50.759
It's like a massive simultaneous cocktail party

00:09:50.759 --> 00:09:52.879
where everyone is talking to the people most

00:09:52.879 --> 00:09:54.960
relevant to them. Now, in the paper, there's

00:09:54.960 --> 00:09:58.799
a scary-looking formula. Attention(Q, K, V) equals

00:09:58.799 --> 00:10:00.980
a bunch of math. Right. We won't read out the equation,

00:10:01.159 --> 00:10:03.539
but the concept is just what we described. You

00:10:03.539 --> 00:10:05.799
multiply the query and the key to get a raw score.

00:10:06.519 --> 00:10:09.360
Then you use a function called softmax. And think

00:10:09.360 --> 00:10:11.500
of softmax as like the referee. It looks at all

00:10:11.500 --> 00:10:13.740
the raw scores and turns them into clean percentages.

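NOTE Editor's sketch, not part of the recording. The paper's formula is softmax(Q K^T / sqrt(d_k)) V. The code below is a minimal single-query version of it in plain Python: raw dot-product scores are scaled, softmax turns them into weights that sum to 1, and the weights mix the value vectors. The key and value vectors are hand-picked assumptions, so the resulting percentages are illustrative and will not match the round numbers quoted in the conversation.
```python
import math
def softmax(scores):
    # Subtract the max first for numerical stability, then normalize to sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
def attend(query, keys, values):
    # Scaled dot-product attention for one query vector.
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, mixed
# Hypothetical hand-picked vectors standing in for "animal", "street", "cross".
keys = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, mixed = attend([4.0, 0.0], keys, values)
print([round(w, 3) for w in weights])
```
The first weight dominates, so the updated vector for the query is made almost entirely of the first value vector, which is the "absorb the value" step from the library analogy.
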
00:10:13.940 --> 00:10:15.919
So it's like you are 90% related to animal,

00:10:16.100 --> 00:10:20.940
5% to street. Exactly. And 5% to cross. Simple

00:10:20.940 --> 00:10:24.080
enough. But then they add another layer,

00:10:24.080 --> 00:10:26.899
multi-head attention. Because apparently paying attention

00:10:26.899 --> 00:10:29.500
once isn't enough. Right. Think of it as having

00:10:29.500 --> 00:10:32.240
multiple pairs of glasses or maybe multiple agents

00:10:32.240 --> 00:10:34.700
looking at the same sentence. If you only have

00:10:34.700 --> 00:10:37.539
one head, you might only focus on who is doing

00:10:37.539 --> 00:10:39.639
what action. But language is more complex than

00:10:39.639 --> 00:10:42.179
that. Exactly. So the transformer uses multiple

00:10:42.179 --> 00:10:45.200
heads. One head might be tracking grammar, like

00:10:45.200 --> 00:10:47.399
subject-verb agreement. Another head might be

00:10:47.399 --> 00:10:50.740
tracking gender, connecting "king" to "he." Another

00:10:50.740 --> 00:10:54.320
might be tracking tone or style. So it's like

00:10:54.320 --> 00:10:56.740
a committee. One person checks the grammar. Another

00:10:56.740 --> 00:10:59.500
person checks the definitions. A third checks

00:10:59.500 --> 00:11:02.159
the context. And they all report back at the

00:11:02.159 --> 00:11:05.720
same time. This creates a much richer, much more

00:11:05.720 --> 00:11:07.759
high-definition understanding of the sentence

00:11:07.759 --> 00:11:10.820
than an RNN ever could. Okay, I'm with you on

00:11:10.820 --> 00:11:12.379
the mechanics, but here's the part that still

00:11:12.379 --> 00:11:14.580
bothers me. You said the model reads the whole

00:11:14.580 --> 00:11:17.440
sentence at once, parallel processing. Yes. But

00:11:17.440 --> 00:11:21.019
order matters. "The dog bites the man" is a news

00:11:21.019 --> 00:11:23.740
story. "The man bites the dog" is a viral video.

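NOTE Editor's sketch, not part of the recording. The worry raised here can be checked directly: attention with no position information gives each word the same output no matter where it sits in the sentence. This simplified version uses each toy embedding as its own query, key, and value (an assumption; real models apply learned projections), and the embeddings themselves are made up.
```python
import math
def self_attention(vectors):
    # Plain self-attention with no learned projections and no position information.
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(len(q))])
    return outputs
# Hypothetical toy embeddings.
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}
def encode(sentence):
    words = sentence.split()
    return dict(zip(words, self_attention([emb[w] for w in words])))
a = encode("dog bites man")
b = encode("man bites dog")
# Compare each word's output vector across the two word orders.
same = all(math.isclose(x, y, abs_tol=1e-12) for w in a for x, y in zip(a[w], b[w]))
print(same)
```
Because the two orderings produce matching vectors for every word, something extra has to carry the order, which is the gap positional encoding fills.
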
00:11:23.960 --> 00:11:26.539
They're totally different. If I throw all those

00:11:26.539 --> 00:11:29.440
words into a bag at the same time, how does the

00:11:29.440 --> 00:11:32.200
model know who bit who? This was the biggest

00:11:32.200 --> 00:11:35.039
criticism Jakob Uszkoreit's father had. You've

00:11:35.039 --> 00:11:37.879
lost the order. And since they ditched recurrence,

00:11:37.960 --> 00:11:40.159
they needed a way to artificially stamp the order

00:11:40.159 --> 00:11:43.159
back onto the words. This is the concept of positional

00:11:43.159 --> 00:11:45.820
encoding. And the source says they used sine

00:11:45.820 --> 00:11:48.340
and cosine wave functions. Wait, hold on. Sine

00:11:48.340 --> 00:11:50.500
waves? Like from an oscilloscope? Why would a

00:11:50.500 --> 00:11:53.039
language model need a wave function? That sounds

00:11:53.039 --> 00:11:55.100
like physics, not grammar. It does sound weird.

00:11:55.220 --> 00:11:57.980
But just think about a wave. It repeats, but

00:11:57.980 --> 00:12:00.440
it changes its values slightly at every single

00:12:00.440 --> 00:12:03.059
step. Imagine you have a color gradient that

00:12:03.059 --> 00:12:06.799
shifts very, very slowly from, say, red to blue

00:12:06.799 --> 00:12:09.480
across the sentence. The first word is pure red.

00:12:09.879 --> 00:12:12.340
The last word is pure blue. And every word in

00:12:12.340 --> 00:12:14.320
between has a slightly different shade of purple.

00:12:14.759 --> 00:12:17.039
So the word carries its meaning, but it also

00:12:17.039 --> 00:12:19.120
carries this little color that tells the model

00:12:19.120 --> 00:12:21.500
where it sits in the line. Exactly that. The

00:12:21.500 --> 00:12:23.480
sine and cosine waves are just a mathematical

00:12:23.480 --> 00:12:26.299
way to generate those unique colors or fingerprints

00:12:26.299 --> 00:12:28.899
for every position. So the model looks at the

00:12:28.899 --> 00:12:32.320
word man. It sees the meaning adult male, but

00:12:32.320 --> 00:12:34.379
it also sees a mathematical stamp that says position

00:12:34.379 --> 00:12:36.419
three. So even though it processes everything

00:12:36.419 --> 00:12:39.279
at once, that stamp preserves the sequence. That

00:12:39.279 --> 00:12:42.700
feels like a hack, a brilliant hack, but a hack.

00:12:42.860 --> 00:12:45.820
In a way, it is. But it was an incredibly elegant

00:12:45.820 --> 00:12:48.159
solution that didn't require sequential processing.

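NOTE Editor's sketch, not part of the recording. The paper defines the position stamp as PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos of the same angle. The code below is a minimal plain-Python version; d_model = 8 and the word vector are toy assumptions chosen to keep the output small.
```python
import math
def positional_encoding(position, d_model):
    # Even slots hold sin, odd slots hold cos, with wavelengths that grow per pair.
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding[:d_model]
d_model = 8  # toy size for the example; the paper's base model uses 512
stamps = [positional_encoding(pos, d_model) for pos in range(50)]
# Add the stamp to a (hypothetical) word embedding so meaning and position travel together.
word_vector = [0.1] * d_model
stamped = [w + p for w, p in zip(word_vector, stamps[3])]
print([round(x, 3) for x in stamps[0]])
```
Every one of the 50 positions gets a distinct vector, which plays the role of the unique "color" or fingerprint in the hosts' analogy.
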
00:12:48.460 --> 00:12:51.299
It gave us the best of both worlds, the speed

00:12:51.299 --> 00:12:53.679
of parallelism with the structure of order. So we

00:12:53.679 --> 00:12:56.000
have self-attention, the cocktail party. We

00:12:56.000 --> 00:12:58.860
have multi-head attention, the committee, and

00:12:58.860 --> 00:13:01.500
positional encoding, the timestamps. That's the

00:13:01.500 --> 00:13:04.059
engine. Now let's talk performance, because the

00:13:04.059 --> 00:13:06.480
whole point of this was speed. And this is where

00:13:06.480 --> 00:13:09.059
the numbers just get staggering. The source material

00:13:09.059 --> 00:13:12.139
notes that the base models were trained on just

00:13:12.139 --> 00:13:17.100
eight NVIDIA P100 GPUs. Only eight. In today's

00:13:17.100 --> 00:13:20.460
world, we talk about clusters of 100,000 GPUs.

00:13:20.899 --> 00:13:23.980
Eight sounds like a gaming PC setup. It was so

00:13:23.980 --> 00:13:26.899
efficient. The base model took only 12 hours

00:13:26.899 --> 00:13:31.019
to train. The big model took 3.5 days. Okay,

00:13:31.080 --> 00:13:33.440
compare that to the before times. How long did

00:13:33.440 --> 00:13:35.539
the old stuff take? The source mentions that

00:13:35.539 --> 00:13:37.600
the previous Google neural machine translation

00:13:37.600 --> 00:13:41.200
system took nine months to develop. And the statistical

00:13:41.200 --> 00:13:43.759
approaches before that? 10 years of engineering.

00:13:44.059 --> 00:13:46.759
So we went from years to months to three and

00:13:46.759 --> 00:13:48.820
a half days. That is the power of parallelization.

00:13:49.399 --> 00:13:51.100
because you aren't waiting for the scroll to

00:13:51.100 --> 00:13:53.200
unwind anymore. You can just throw more compute

00:13:53.200 --> 00:13:55.179
at it and it actually works. You can scale it.

00:13:55.299 --> 00:13:57.159
Exactly. And originally, they just wanted to

00:13:57.159 --> 00:13:59.179
translate languages, right? The paper focuses

00:13:59.179 --> 00:14:01.360
on English to German. They didn't know they were

00:14:01.360 --> 00:14:03.500
building the foundation of AGI. That's the beautiful

00:14:03.500 --> 00:14:05.559
irony. They were just trying to get a better

00:14:05.559 --> 00:14:08.500
translation score. But they saw these early signs

00:14:08.500 --> 00:14:10.679
that this wasn't just a translator. The team

00:14:10.679 --> 00:14:14.059
decided to throw a curveball at the model. They

00:14:14.059 --> 00:14:15.840
tried it on something called English constituency

00:14:15.840 --> 00:14:19.480
parsing. Which is what? It's rigorous grammar

00:14:19.480 --> 00:14:22.779
analysis. Identifying noun phrases, verb phrases,

00:14:22.879 --> 00:14:25.240
the whole sentence tree structure. A totally

00:14:25.240 --> 00:14:27.399
different task from translation. And they didn't

00:14:27.399 --> 00:14:29.919
even tune the model for it. Nope. They just ran

00:14:29.919 --> 00:14:32.120
it. It worked brilliantly. And didn't they have

00:14:32.120 --> 00:14:34.379
it write a Wikipedia article? They did. As a

00:14:34.379 --> 00:14:37.279
test, they had it generate a fake Wikipedia article

00:14:37.279 --> 00:14:40.879
about Transformers. And it wrote this coherent,

00:14:40.980 --> 00:14:43.539
factual sounding article. Very meta. It was a

00:14:43.539 --> 00:14:45.720
proof of concept. They proved this architecture

00:14:45.720 --> 00:14:48.440
could generate coherent text, not just swap words

00:14:48.440 --> 00:14:51.000
around. They'd built a general purpose language

00:14:51.000 --> 00:14:53.799
engine. Which brings us to the aftermath. The

00:14:53.799 --> 00:14:56.759
paper was published in 2017. What happened next

00:14:56.759 --> 00:14:59.139
is basically the history of the last decade.

00:14:59.399 --> 00:15:02.370
It just triggered the AI boom. Almost immediately

00:15:02.370 --> 00:15:05.470
you saw the evolution. In 2018, we got BERT from

00:15:05.470 --> 00:15:07.669
Google, which totally revolutionized Google search.

00:15:07.889 --> 00:15:10.190
And then the generative pre-trained transformers.

00:15:10.269 --> 00:15:13.210
The GPT series, yeah. OpenAI looked at this architecture

00:15:13.210 --> 00:15:15.690
and basically said, what if we just make it bigger?

00:15:16.289 --> 00:15:18.370
Much, much bigger. The source notes that the

00:15:18.370 --> 00:15:21.990
sheer scale of the 2022 boom with ChatGPT was

00:15:21.990 --> 00:15:24.970
unexpected, even to the researchers. Well, the

00:15:24.970 --> 00:15:27.070
architecture turned out to scale remarkably

00:15:27.070 --> 00:15:29.850
well. Because it was parallel, you could just

00:15:29.850 --> 00:15:31.850
stack more layers, feed it the entire internet,

00:15:31.950 --> 00:15:33.730
and it just kept getting smarter. It didn't hit

00:15:33.730 --> 00:15:36.590
a ceiling like RNNs did. But here is where it

00:15:36.590 --> 00:15:38.370
gets really interesting for me. I want to connect

00:15:38.370 --> 00:15:40.389
this with a listener who thinks, okay, cool text

00:15:40.389 --> 00:15:43.529
generator. It's not just text anymore. We call

00:15:43.529 --> 00:15:45.450
it a language model. But the transformer doesn't

00:15:45.450 --> 00:15:47.809
seem to care what the language is. This is the

00:15:47.809 --> 00:15:50.909
multimodal shift. And this is key. See, the transformer

00:15:50.909 --> 00:15:54.269
treats everything as a token. A word is a token.

00:15:54.750 --> 00:15:58.139
But a patch of pixels in an image. That can also

00:15:58.139 --> 00:16:00.659
be a token. Explain that. How is a picture a

00:16:00.659 --> 00:16:03.100
language? Think about the query-key concept again.

00:16:03.220 --> 00:16:05.559
If I have a picture of a beach, I have a patch

00:16:05.559 --> 00:16:08.440
of blue pixels at the top. That patch is a token.

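NOTE Editor's sketch, not part of the recording. "A patch is a token" can be made concrete by cutting an image into small squares and flattening each square into one vector. The 4x4 grayscale grid and the 2x2 patch size below are assumptions for illustration; the function also assumes the image dimensions divide evenly by the patch size.
```python
def image_to_patch_tokens(image, patch):
    # Cut a 2-D grid of pixel values into patch x patch squares and flatten each
    # square into one list, so every patch becomes a single "token" vector.
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c] for r in range(top, top + patch) for c in range(left, left + patch)])
    return tokens
# Hypothetical 4x4 grayscale "image"; real models use larger color images and patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patch_tokens(image, 2)
print(len(tokens), tokens[0])
```
Once the image is a list of vectors like this, the same query, key, and value machinery used for words can be applied to the patches.
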
00:16:08.580 --> 00:16:11.580
It queries the pixels around it. It asks, hey,

00:16:11.659 --> 00:16:13.899
am I part of a blueberry or am I part of the

00:16:13.899 --> 00:16:16.440
sky? And the surrounding pixels reply, well,

00:16:16.620 --> 00:16:19.259
we are white and fluffy clouds and below us is

00:16:19.259 --> 00:16:22.519
an ocean. So the blue pixel says, OK, I am sky.

00:16:23.100 --> 00:16:25.740
It uses the exact same attention mechanism to

00:16:25.740 --> 00:16:28.320
understand the image. So that's why we have vision

00:16:28.320 --> 00:16:30.360
transformers, and that's why we have Sora creating

00:16:30.360 --> 00:16:33.940
video. Exactly. Sora treats chunks of video as

00:16:33.940 --> 00:16:37.019
tokens. It predicts the next frame token, just

00:16:37.019 --> 00:16:40.000
like ChatGPT predicts the next word token. And

00:16:40.000 --> 00:16:43.299
even biology. You mentioned AlphaFold. AlphaFold

00:16:43.299 --> 00:16:45.539
changed the world of biology by predicting protein

00:16:45.539 --> 00:16:48.460
structures. It treats amino acids like they're

00:16:48.460 --> 00:16:50.860
words in a sentence. It uses self -attention

00:16:50.860 --> 00:16:53.259
to see how the protein folds up on itself. This

00:16:53.259 --> 00:16:55.500
amino acid attracts that one over there. It's

00:16:55.500 --> 00:16:58.519
just finding relationships in data. It's incredible

00:16:58.519 --> 00:17:01.080
to think that a mechanism designed to translate

00:17:01.080 --> 00:17:04.619
"the cat sat on the mat" into German ended up solving

00:17:04.619 --> 00:17:07.140
protein folding and creating Hollywood-level

00:17:07.140 --> 00:17:09.539
video. It really speaks to the universality of

00:17:09.539 --> 00:17:12.559
the architecture. Attention, the ability to weigh

00:17:12.559 --> 00:17:14.200
relationships between data points, no matter

00:17:14.200 --> 00:17:16.619
how far apart they are, turns out to be a fundamental

00:17:16.619 --> 00:17:19.799
law of information processing. So what does this

00:17:19.799 --> 00:17:22.250
all mean for us? For the person listening who

00:17:22.250 --> 00:17:25.289
isn't an AI researcher. It means we've moved

00:17:25.289 --> 00:17:29.130
from a world of rigid rule-based computing to

00:17:29.130 --> 00:17:32.250
a world of relationship-based computing. Unpack

00:17:32.250 --> 00:17:35.230
that a bit. Before transformers, computers were

00:17:35.230 --> 00:17:37.690
just very bad at context. They were literal.

00:17:37.829 --> 00:17:40.529
They were linear. You made a typo. They crashed.

00:17:40.869 --> 00:17:43.410
Now we have machines that can see the whole page

00:17:43.410 --> 00:17:47.049
at once. They can understand that a word or a

00:17:47.049 --> 00:17:50.680
pixel or a gene only has meaning in relation

00:17:50.680 --> 00:17:53.319
to the things around it. It's a connector. It's

00:17:53.319 --> 00:17:55.539
a universal connector. The transformer gave us

00:17:55.539 --> 00:17:58.299
a method to map the relationships between any

00:17:58.299 --> 00:18:01.019
set of data points. And that is why it is the

00:18:01.019 --> 00:18:03.279
engine of the 21st century. It really is the

00:18:03.279 --> 00:18:05.960
constitution of the AI era. It defined the rules

00:18:05.960 --> 00:18:08.339
of engagement for how machines learn. And we're

00:18:08.339 --> 00:18:10.599
still just seeing the early applications of that.

00:18:10.880 --> 00:18:12.900
Remember, the paper foresaw question answering,

00:18:13.079 --> 00:18:15.059
but I don't think even the authors realized how

00:18:15.059 --> 00:18:17.420
quickly it would scale to things like robotics

00:18:17.420 --> 00:18:19.299
and reasoning. It's a good reminder that sometimes

00:18:19.299 --> 00:18:21.759
the biggest revolutions come from a 10-page

00:18:21.759 --> 00:18:25.220
PDF. And a group of people willing to challenge

00:18:25.220 --> 00:18:27.619
the conventional wisdom. Remember, everyone was

00:18:27.619 --> 00:18:29.460
saying, you need recurrence, you have to read

00:18:29.460 --> 00:18:31.799
in order, and they just said, no, attention is

00:18:31.799 --> 00:18:33.660
all you need. I want to leave the listener with

00:18:33.660 --> 00:18:35.880
a final thought, something to chew on. We've

00:18:35.880 --> 00:18:37.839
talked about how the transformer moved us from

00:18:37.839 --> 00:18:41.759
sequential processing, thinking linearly, to parallel

00:18:41.759 --> 00:18:44.240
processing, seeing the whole picture at once.

00:18:44.359 --> 00:18:47.079
It makes me wonder about our own brains. We tend

00:18:47.079 --> 00:18:49.259
to think linearly. We tell stories beginning

00:18:49.259 --> 00:18:51.759
to end. We read left to right. But the transformer

00:18:51.759 --> 00:18:54.400
proved that attention without strict order is

00:18:54.400 --> 00:18:56.880
faster and more powerful for processing data.

00:18:57.450 --> 00:19:00.589
That is a deep thought. It implies that linearity

00:19:00.589 --> 00:19:03.089
might actually be a bottleneck in our own cognition.

00:19:03.369 --> 00:19:06.309
Exactly. If attention was all we needed to solve

00:19:06.309 --> 00:19:09.390
language, what other simple mechanism are we

00:19:09.390 --> 00:19:11.970
overlooking? Is there a "Creativity Is All You

00:19:11.970 --> 00:19:14.930
Need" or a "Reasoning Is All You Need" paper just

00:19:14.930 --> 00:19:17.630
waiting to be written? What's the simple mechanic

00:19:17.630 --> 00:19:20.170
that unlocks that next level of intelligence?

00:19:20.609 --> 00:19:22.670
If we find it, it'll probably be another 10-page

00:19:22.670 --> 00:19:25.069
paper that nobody notices at first. And we'll

00:19:25.069 --> 00:19:28.829
be here to deep dive into it when it drops. Thanks

00:19:28.829 --> 00:19:30.369
for listening, everyone. Keep learning.