WEBVTT

00:00:00.000 --> 00:00:03.359
Imagine taking the mathematical concept of a

00:00:03.359 --> 00:00:06.400
king. You subtract the underlying math of a man.

00:00:06.599 --> 00:00:08.480
You add the mathematical concept of a woman.

00:00:08.900 --> 00:00:11.419
And the final equation just perfectly equals

00:00:11.419 --> 00:00:14.660
the word queen. Right. And I mean, that isn't

00:00:14.660 --> 00:00:18.589
some weird riddle. It's literally algebra. Semantic

00:00:18.589 --> 00:00:21.850
algebra being performed right now by the algorithms

00:00:21.850 --> 00:00:24.210
powering the phone in your pocket. Yeah, which

00:00:24.210 --> 00:00:26.030
is just wild to think about. It sounds like straight

00:00:26.030 --> 00:00:28.530
up science fiction. It really does. But it's

00:00:28.530 --> 00:00:30.769
actually the foundational mechanism of how machines

00:00:30.769 --> 00:00:34.640
have learned to, well... process our world. And

00:00:34.640 --> 00:00:36.759
you interact with these systems every single

00:00:36.759 --> 00:00:40.240
day. I mean, we all do, whether it's the autocorrect

00:00:40.240 --> 00:00:43.340
fixing your texts or maybe a chat bot retrieving

00:00:43.340 --> 00:00:46.820
a lost password for you. Or even the route optimization

00:00:46.820 --> 00:00:49.179
tool, you know, the one constantly recalculating

00:00:49.179 --> 00:00:51.399
your commute based on traffic patterns. Exactly.

00:00:51.899 --> 00:00:53.899
But we rarely stop to ask the really obvious

00:00:53.899 --> 00:00:56.320
question, which is how did a machine, literally

00:00:56.320 --> 00:00:59.399
just a box of silicon and wire, actually learn

00:00:59.399 --> 00:01:02.710
to speak? Yeah. How did it learn to predict the

00:01:02.710 --> 00:01:05.969
exact word you were about to type? Right. So

00:01:05.969 --> 00:01:08.609
today our mission for this deep dive is to journey

00:01:08.609 --> 00:01:12.030
through the massive decades long evolution of

00:01:12.030 --> 00:01:14.150
language models. And we're using a really comprehensive

00:01:14.150 --> 00:01:17.549
Wikipedia article on language models as our core

00:01:17.549 --> 00:01:19.890
guide for this. We are. We're going to trace

00:01:19.890 --> 00:01:24.370
the path all the way from the 1950s era of strict,

00:01:24.670 --> 00:01:29.010
rigid grammar rules right up to the massive multi

00:01:29.010 --> 00:01:31.450
-billion parameter neural networks that are basically

00:01:31.450 --> 00:01:34.189
reshaping the modern economy. It's a huge shift.

00:01:34.189 --> 00:01:37.150
Okay, let's unpack this. Where do we even begin

00:01:37.150 --> 00:01:39.370
teaching a machine to understand something as

00:01:39.370 --> 00:01:42.069
incredibly messy as human language? Well, we

00:01:42.069 --> 00:01:44.750
really have to start in the 1950s, specifically

00:01:44.750 --> 00:01:46.750
with the pioneering work of a linguist named

00:01:46.750 --> 00:01:49.829
Noam Chomsky. Oh, right, Chomsky. Yeah, so the

00:01:49.829 --> 00:01:51.670
initial philosophy for teaching machines back

00:01:51.670 --> 00:01:54.290
then was heavily, heavily rule-based. It was

00:01:54.290 --> 00:01:56.450
built around Chomsky's theory of formal grammars.

00:01:56.650 --> 00:01:58.969
Formal grammars, meaning what, exactly? Well,

00:01:59.069 --> 00:02:00.750
the prevailing thought of the era was highly

00:02:00.750 --> 00:02:03.109
structured. Researchers basically believe that

00:02:03.109 --> 00:02:05.409
if you could just teach a computer the exact

00:02:05.409 --> 00:02:07.590
comprehensive structural rules of a language

00:02:07.590 --> 00:02:10.949
like the precise syntax. Exactly, the syntax, the

00:02:10.949 --> 00:02:14.150
grammar, the correct way to conjugate literally

00:02:14.150 --> 00:02:16.770
every verb. Yeah, they thought if it knows all

00:02:16.770 --> 00:02:19.990
the rules it could theoretically generate and

00:02:19.990 --> 00:02:22.789
understand any text. I mean, I guess that makes

00:02:22.789 --> 00:02:25.310
complete sense on paper. On paper, sure. But when

00:02:25.310 --> 00:02:28.650
you actually think about it that approach treats

00:02:28.650 --> 00:02:32.319
the computer like an incredibly strict rigid

00:02:32.319 --> 00:02:34.659
English teacher. Oh, absolutely. You know the

00:02:34.659 --> 00:02:36.960
type, right? The one who stands at the chalkboard

00:02:36.960 --> 00:02:40.139
diagramming sentences and like fails you if you

00:02:40.139 --> 00:02:42.340
use a dangling modifier or end a sentence with

00:02:42.340 --> 00:02:43.939
a preposition. Yeah, the ones who don't let you

00:02:43.939 --> 00:02:46.960
start a sentence with and. Exactly. But the problem

00:02:46.960 --> 00:02:50.280
is language. as you and I actually use it in

00:02:50.280 --> 00:02:52.800
the real world, is chaotic. It's totally chaotic.

00:02:52.879 --> 00:02:56.520
It's full of slang, broken rules, sarcasm, idioms.

00:02:56.780 --> 00:02:58.620
Like, if I tell a rule-based computer from the

00:02:58.620 --> 00:03:01.300
50s that someone kicked the bucket, the machine

00:03:01.300 --> 00:03:02.939
is going to literally start looking around for

00:03:02.939 --> 00:03:05.319
a plastic pail. Right. It has no context for

00:03:05.319 --> 00:03:07.659
the metaphor, which is the exact reason that

00:03:07.659 --> 00:03:09.919
rule-based approach eventually hit an absolute

00:03:09.919 --> 00:03:12.889
wall. It just couldn't scale. Not at all. Researchers

00:03:12.889 --> 00:03:14.969
realized you just couldn't write enough individual

00:03:14.969 --> 00:03:18.090
rules to capture the infinite, messy variability

00:03:18.090 --> 00:03:20.490
of human communication. It was just too brittle.

00:03:20.770 --> 00:03:24.409
So what did they do? So by 1980, the field's

00:03:24.409 --> 00:03:28.110
entire philosophy shifted. IBM actually started

00:03:28.110 --> 00:03:31.090
conducting what our source text calls Shannon

00:03:31.090 --> 00:03:33.969
-style experiments. Shannon-style. That sounds

00:03:33.969 --> 00:03:35.669
like they started looking at the problem from

00:03:35.669 --> 00:03:38.629
a totally different angle. If they aren't teaching

00:03:38.629 --> 00:03:40.750
the rules anymore, what are they doing? They

00:03:40.750 --> 00:03:43.509
started observing human behavior instead. In

00:03:43.509 --> 00:03:45.949
these Shannon-style experiments, researchers

00:03:45.949 --> 00:03:48.750
essentially watched human subjects predict and

00:03:48.750 --> 00:03:51.669
correct text. Oh wow, so like watching a person

00:03:51.669 --> 00:03:54.530
guess what comes next. Exactly. They observed

00:03:54.530 --> 00:03:57.889
how a person naturally almost subconsciously

00:03:57.889 --> 00:04:00.210
guesses what the next letter or word in a sentence

00:04:00.210 --> 00:04:02.550
is going to be based purely on context. It's

00:04:02.550 --> 00:04:04.270
just feeling it out based on what they've read

00:04:04.270 --> 00:04:06.969
before. Right. And IBM used those observations

00:04:06.969 --> 00:04:09.349
to figure out how to improve language modeling

00:04:09.349 --> 00:04:11.789
for machines. They moved completely away from

00:04:11.789 --> 00:04:13.870
trying to be a perfect grammarian and shifted

00:04:13.870 --> 00:04:16.649
to pure statistical models. That is a massive

00:04:16.649 --> 00:04:19.589
philosophical pivot. It changed everything. This

00:04:19.589 --> 00:04:21.850
led to the breakthrough of something called word

00:04:21.850 --> 00:04:25.230
n-grams. An n-gram model basically calculates

00:04:25.230 --> 00:04:27.930
the probability of the next word appearing based

00:04:27.930 --> 00:04:30.949
solely on a fixed window of previous words. OK,

00:04:31.009 --> 00:04:33.069
so let me make sure I've got this. If the machine

00:04:33.069 --> 00:04:35.790
only looks at one previous word to make its guess,

00:04:35.910 --> 00:04:38.389
it's called a bigram model. You got it. And

00:04:38.389 --> 00:04:40.230
if it looks at the two previous words, it's a

00:04:40.230 --> 00:04:42.410
trigram model. Exactly right. And then it scales

00:04:42.410 --> 00:04:44.730
all the way up to an n-gram model, which just

00:04:44.730 --> 00:04:46.649
looks at n minus one words. Yeah, the window

00:04:46.649 --> 00:04:49.649
just gets larger. And our source also mentions

00:04:49.649 --> 00:04:51.930
they had to introduce special tokens for this.

00:04:52.170 --> 00:04:55.550
Tokens. Like what? Like a little s inside angle

00:04:55.550 --> 00:04:58.670
brackets. They use those just to give the machine

00:04:58.670 --> 00:05:01.290
a strict mathematical signal of where a sentence

00:05:01.290 --> 00:05:04.850
formally starts and ends. Okay, so the machine

00:05:04.850 --> 00:05:07.370
is no longer actually reading, it's just calculating

00:05:07.370 --> 00:05:10.569
odds. Pure odds. If it sees the word peanut,

00:05:10.970 --> 00:05:13.949
and it knows from scanning a vast database of

00:05:13.949 --> 00:05:16.250
books that the word butter follows peanut, say,

00:05:16.370 --> 00:05:18.870
a huge percentage of the time. It simply assigns

00:05:18.870 --> 00:05:21.870
a super high mathematical probability to butter

00:05:21.870 --> 00:05:25.730
being the next word. Okay, so if the 1950s approach

00:05:25.730 --> 00:05:29.740
was the strict English teacher... The 1980s n-gram

00:05:29.740 --> 00:05:33.360
approach is basically like a savvy casino gambler.
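
NOTE A rough Python sketch of the bigram "gambler" idea described above. The tiny corpus, the <s> and </s> boundary tokens, and the function name are invented here purely for illustration; real systems do the same counting over enormous text collections.
from collections import Counter
corpus = [
    "<s> i like peanut butter </s>",
    "<s> peanut butter is sticky </s>",
    "<s> i like jam </s>",
]
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)                  # how often each word appears
    bigrams.update(zip(tokens, tokens[1:]))  # how often each adjacent pair appears
def bigram_prob(prev_word, word):
    """Estimate P(word | prev_word) from raw counts."""
    if unigrams[prev_word] == 0:
        return 0.0
    return bigrams[(prev_word, word)] / unigrams[prev_word]
print(bigram_prob("peanut", "butter"))   # 1.0: "butter" always followed "peanut" here
print(bigram_prob("like", "jam"))        # 0.5: "like" was followed by "jam" half the time
print(bigram_prob("peanut", "giraffe"))  # 0.0: never seen, the zero-probability problem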

00:05:33.560 --> 00:05:35.620
Oh, I like that. Like the machine doesn't actually

00:05:35.620 --> 00:05:37.500
know what peanut butter is, it doesn't care about

00:05:37.500 --> 00:05:40.240
the rules of culinary grammar, and it certainly

00:05:40.240 --> 00:05:42.019
has never tasted a sandwich. Definitely not.

00:05:42.060 --> 00:05:43.879
It's just sitting there at the blackjack table

00:05:43.879 --> 00:05:46.920
playing the mathematical odds on what card, or

00:05:46.920 --> 00:05:49.160
in this case, what word shows up next, based

00:05:49.160 --> 00:05:52.540
entirely on historical data. The casino analogy

00:05:52.540 --> 00:05:56.759
captures the mechanism beautifully, because it

00:05:56.759 --> 00:06:00.860
is a pure game of statistical probability. But

00:06:00.860 --> 00:06:03.339
any gambler will tell you that relying entirely

00:06:03.339 --> 00:06:05.860
on historical odds introduces a pretty fatal

00:06:05.860 --> 00:06:08.310
vulnerability. Which is? What happens when the

00:06:08.310 --> 00:06:10.750
mathematical model encounters a sequence of words

00:06:10.750 --> 00:06:13.410
it has absolutely never seen before in its training

00:06:13.410 --> 00:06:16.230
data. Oh. Wait, think about how your brain works

00:06:16.230 --> 00:06:20.089
for a second. If I say a totally brand new, bizarre

00:06:20.089 --> 00:06:21.689
sentence to you listening right now, something

00:06:21.689 --> 00:06:25.470
completely random like, uh, the purple giraffe

00:06:25.470 --> 00:06:27.910
tap danced on my toaster. Which I've definitely

00:06:27.910 --> 00:06:30.600
never heard before. Right. But you still understand

00:06:30.600 --> 00:06:32.839
the sentence. You know what those words mean,

00:06:33.199 --> 00:06:35.120
even if they've literally never been combined

00:06:35.120 --> 00:06:37.139
in the history of the English language. Exactly.

00:06:37.759 --> 00:06:40.420
We can process that novelty. But for the gambler

00:06:40.420 --> 00:06:43.560
machine, because giraffe and toaster have never

00:06:43.560 --> 00:06:45.720
appeared next to each other in its historical

00:06:45.720 --> 00:06:48.899
database, the mathematical probability of that

00:06:48.899 --> 00:06:52.360
combination is literally zero. And that breaks

00:06:52.360 --> 00:06:55.240
the entire algorithm. In computer science, this

00:06:55.240 --> 00:06:57.800
is known as the zero probability problem. It

00:06:57.800 --> 00:07:01.300
just crashes. Pretty much. Language models calculate

00:07:01.300 --> 00:07:03.779
the likelihood of a whole sentence by multiplying

00:07:03.779 --> 00:07:06.800
the probabilities of each word sequence together.

00:07:06.839 --> 00:07:10.279
Okay. So if even one of those sequences has a

00:07:10.279 --> 00:07:13.399
probability of zero, the entire equation multiplies

00:07:13.399 --> 00:07:16.220
out to zero. The model essentially throws its

00:07:16.220 --> 00:07:18.660
hands up and just fails. Because anything multiplied

00:07:18.660 --> 00:07:21.439
by zero is just zero. Exactly. The gambler goes

00:07:21.439 --> 00:07:23.600
completely bust just because it saw a purple

00:07:23.600 --> 00:07:26.139
giraffe. What's fascinating here is how researchers

00:07:26.139 --> 00:07:28.279
engineered a way around this mathematical dead

00:07:28.279 --> 00:07:31.100
end. How do you fix a zero? They developed techniques

00:07:31.100 --> 00:07:33.939
collectively called smoothing. The most basic

00:07:33.939 --> 00:07:36.240
form of this is called add-one smoothing. Add

00:07:36.240 --> 00:07:39.370
one? Yeah, the engineers essentially forced the

00:07:39.370 --> 00:07:41.889
algorithm to pretend it has seen every possible

00:07:41.889 --> 00:07:45.470
word combination at least once. It assigns a

00:07:45.470 --> 00:07:48.810
baseline count of 1 to all unseen n-grams, so

00:07:48.810 --> 00:07:51.689
the math never hits absolute zero. Oh, wow. It's

00:07:51.689 --> 00:07:53.949
literally like spotting the gambler a free chip

00:07:53.949 --> 00:07:55.730
just to keep them in the game. That's a great

00:07:55.730 --> 00:07:57.730
way to put it. But our source mentions the math

00:07:57.730 --> 00:07:59.990
gets much more sophisticated than just adding

00:07:59.990 --> 00:08:02.149
a one, right? Like, there are techniques called

00:08:02.149 --> 00:08:04.769
Good-Turing discounting or back-off models.

00:08:04.949 --> 00:08:07.089
Yeah, it gets incredibly complex. Those more

00:08:07.089 --> 00:08:09.990
advanced smoothing techniques operate on a brilliant

00:08:09.990 --> 00:08:13.490
principle of, well, redistribution. Redistribution

00:08:13.490 --> 00:08:17.060
of what? The odds. Exactly. The model takes a

00:08:17.060 --> 00:08:19.420
tiny, tiny fraction of a percent of probability

00:08:19.420 --> 00:08:21.959
away from the highly common word combinations

00:08:21.959 --> 00:08:25.319
it has seen a million times. OK. And it redistributes

00:08:25.319 --> 00:08:27.860
that probability into a sort of rainy day fund

00:08:27.860 --> 00:08:30.839
for the unseen combinations. So it smooths out

00:08:30.839 --> 00:08:33.419
the whole distribution curve. So clever. It is.
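
NOTE A minimal sketch of the add-one (Laplace) smoothing just described, continuing the toy bigram-count idea: every possible pair gets one pretend observation, so nothing ever comes out exactly zero. The counts, the three-word vocabulary, and the function name are made up for illustration.
from collections import Counter
unigrams = Counter({"peanut": 2, "butter": 2, "giraffe": 1})
bigrams = Counter({("peanut", "butter"): 2})
vocab_size = len(unigrams)
def smoothed_prob(prev_word, word):
    """P(word | prev_word) with one extra pretend count for every possible pair."""
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + vocab_size)
print(smoothed_prob("peanut", "butter"))   # (2 + 1) / (2 + 3) = 0.6
print(smoothed_prob("peanut", "giraffe"))  # (0 + 1) / (2 + 3) = 0.2, no longer zero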

00:08:33.860 --> 00:08:36.340
But even with smoothing, they faced another massive

00:08:36.340 --> 00:08:39.559
hurdle, which was called data sparsity. Data

00:08:39.559 --> 00:08:42.620
sparsity, meaning like no matter how many books

00:08:42.620 --> 00:08:45.120
you feed the machine, human language is just

00:08:45.120 --> 00:08:47.620
so vast that the data is always going to be spread

00:08:47.620 --> 00:08:50.100
way too thin to make highly accurate predictions.

00:08:50.320 --> 00:08:51.799
Right. If you're only looking at consecutive

00:08:51.799 --> 00:08:54.480
words, you're missing out on so much context.

00:08:54.679 --> 00:08:57.179
So how do they fix that? To solve the sparsity

00:08:57.179 --> 00:09:00.120
issue, researchers created the skip-gram model.

00:09:00.279 --> 00:09:02.159
Instead of only calculating the odds of words

00:09:02.159 --> 00:09:04.240
that appear strictly right next to each other,

00:09:04.580 --> 00:09:07.419
the model was suddenly allowed to, well, skip

00:09:07.419 --> 00:09:09.879
gaps. OK, I want to make sure you all listening

00:09:09.879 --> 00:09:13.100
really grasp this mechanism, because the skip-gram

00:09:13.100 --> 00:09:15.700
model is a pretty wild concept. Let's use the

00:09:15.700 --> 00:09:18.320
mandatory example provided right in our source

00:09:18.320 --> 00:09:19.919
text to illustrate this. Yeah, it's a perfect

00:09:19.919 --> 00:09:22.059
example. Take the classic sentence: The rain

00:09:22.059 --> 00:09:25.320
in Spain falls mainly on the plain. In the older

00:09:25.320 --> 00:09:28.200
traditional n-gram model, the machine only

00:09:28.200 --> 00:09:30.399
looks at consecutive pairs, right? So it sees

00:09:30.399 --> 00:09:34.659
the rain in Spain. It's very narrow. Super narrow.

00:09:35.039 --> 00:09:37.700
But when researchers implemented a one-skip,

00:09:37.740 --> 00:09:40.500
two-gram model, the machine is suddenly allowed

00:09:40.500 --> 00:09:43.860
to jump over one word to form its pairs. It starts

00:09:43.860 --> 00:09:46.860
capturing subsequences that skip a beat. Oh,

00:09:47.299 --> 00:09:49.940
so from that exact same sentence, the machine

00:09:49.940 --> 00:09:54.059
suddenly extracts pairs like 'the in', 'rain Spain',

00:09:54.279 --> 00:09:56.929
'in falls', 'Spain mainly', 'falls on'. You're pulling

00:09:56.929 --> 00:09:59.610
way more data out of the same text. Wait, let

00:09:59.610 --> 00:10:01.190
me make sure I'm visualizing this correctly.

00:10:01.730 --> 00:10:04.450
Is the skip-gram model essentially like reading

00:10:04.450 --> 00:10:06.690
a sentence but covering up every other word with

00:10:06.690 --> 00:10:09.070
your hand, just to see if the overall context

00:10:09.070 --> 00:10:11.370
still provides a relationship? Yeah, that's a

00:10:11.370 --> 00:10:13.350
highly effective way to visualize it. By doing

00:10:13.350 --> 00:10:16.029
that, rain and falls get mathematically linked,

00:10:16.210 --> 00:10:18.169
even though the word in was sitting right between

00:10:18.169 --> 00:10:20.649
them. Exactly. By forcing the machine to understand

00:10:20.649 --> 00:10:23.490
relationships across a slightly wider alternating

00:10:23.490 --> 00:10:26.669
distance, it basically overcomes data sparsity.

00:10:26.909 --> 00:10:29.710
It extracts exponentially more relational data

00:10:29.710 --> 00:10:32.409
like more connective tissue out of the exact

00:10:32.409 --> 00:10:34.710
same sentence. That is incredibly clever engineering.
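
NOTE A minimal Python sketch of the 1-skip-2-gram extraction described above: it keeps the ordinary adjacent pairs and adds pairs that jump over exactly one word. Only the function name is invented; the sentence is the one from the episode.
def one_skip_two_grams(sentence):
    tokens = sentence.split()
    pairs = []
    for i, word in enumerate(tokens):
        if i + 1 < len(tokens):
            pairs.append((word, tokens[i + 1]))  # adjacent pair, the ordinary bigram
        if i + 2 < len(tokens):
            pairs.append((word, tokens[i + 2]))  # pair that skips one word
    return pairs
print(one_skip_two_grams("the rain in Spain falls mainly on the plain"))
# Output begins: ('the', 'rain'), ('the', 'in'), ('rain', 'in'), ('rain', 'Spain'), ('in', 'Spain'), ('in', 'falls'), ...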

00:10:34.950 --> 00:10:37.210
It really was a huge leap. But I mean, I'm looking

00:10:37.210 --> 00:10:40.049
at this timeline, and even with skip-grams and

00:10:40.049 --> 00:10:44.549
smoothing, These 1980s and 90s models still fundamentally

00:10:44.549 --> 00:10:48.009
viewed words as isolated, discrete puzzle pieces.

00:10:48.090 --> 00:10:50.570
They did? They were still just calculating the

00:10:50.570 --> 00:10:53.370
odds of symbols appearing near each other. How

00:10:53.370 --> 00:10:55.990
did the machine eventually learn that words actually

00:10:55.990 --> 00:11:00.200
have, you know, meanings? Well, that conceptual

00:11:00.200 --> 00:11:02.720
leap brings us to the 2000s, which really marked

00:11:02.720 --> 00:11:05.139
one of the most significant paradigm shifts in

00:11:05.139 --> 00:11:07.620
computer science. What changed? The field basically

00:11:07.620 --> 00:11:09.820
realized that counting discrete words just wasn't

00:11:09.820 --> 00:11:12.220
enough anymore. They moved towards what are called

00:11:12.220 --> 00:11:14.860
continuous representations. And this birthed

00:11:14.860 --> 00:11:17.799
the era of word embeddings. Word embeddings.

00:11:18.279 --> 00:11:20.279
Yeah, when I read this part of the source material,

00:11:20.500 --> 00:11:22.139
it genuinely started to feel like we were entering

00:11:22.139 --> 00:11:24.840
the matrix. It is a profoundly abstract concept

00:11:24.840 --> 00:11:27.210
for sure. Researchers figured out how to translate

00:11:27.210 --> 00:11:30.250
words into real-valued vectors inside a massive,

00:11:30.710 --> 00:11:32.929
multi-dimensional mathematical space. Okay,

00:11:32.950 --> 00:11:36.269
for you listening, try to imagine a gigantic,

00:11:36.429 --> 00:11:40.669
invisible 3D graph floating in space, and every

00:11:40.669 --> 00:11:43.250
single word in the human vocabulary is plotted

00:11:43.250 --> 00:11:46.289
as a specific physical coordinate on that graph.

00:11:46.490 --> 00:11:49.250
And the true genius of word embeddings is exactly

00:11:49.250 --> 00:11:51.190
how those coordinates are assigned. How do they

00:11:51.190 --> 00:11:53.610
decide where to put them? Words that share similar

00:11:53.610 --> 00:11:56.809
meanings or contexts are mapped physically closer

00:11:56.809 --> 00:11:59.070
together in this mathematical vector space. Oh.

00:11:59.529 --> 00:12:02.690
So the coordinate for the word dog is physically

00:12:02.690 --> 00:12:05.490
sitting much, much closer to the coordinate for

00:12:05.490 --> 00:12:08.049
puppy than it is to the coordinate for, say,

00:12:08.549 --> 00:12:10.419
refrigerator. That is insane. They literally

00:12:10.419 --> 00:12:13.059
figured out how to map semantic meaning to spatial

00:12:13.059 --> 00:12:16.039
math. Yes. And the architecture goes way beyond

00:12:16.039 --> 00:12:18.580
simple synonyms, too. This continuous mathematical

00:12:18.580 --> 00:12:22.279
space actually preserves complex, common relationships

00:12:22.279 --> 00:12:25.139
between words like plurality, tense, and even

00:12:25.139 --> 00:12:27.320
gender. Right. Our source highlights a really

00:12:27.320 --> 00:12:29.639
famous equation that proved this concept actually

00:12:29.639 --> 00:12:31.620
worked. Yeah, let's break that down. Let's say

00:12:31.620 --> 00:12:34.419
the letter V stands for the vector, the exact

00:12:34.419 --> 00:12:37.620
mathematical coordinate of a word. The model

00:12:37.620 --> 00:12:40.440
discovered that the vector for king minus the

00:12:40.440 --> 00:12:44.039
vector for male plus the vector for female approximately

00:12:44.039 --> 00:12:46.899
equals the vector for queen. So what does this

00:12:46.899 --> 00:12:52.700
all mean? V king minus V male plus V female equals V queen. Let me walk

00:12:52.700 --> 00:12:56.039
through this. The machine locates the exact mathematical

00:12:56.039 --> 00:12:58.600
coordinate for the concept of a king. Right.

00:12:58.700 --> 00:13:01.139
It subtracts the mathematical distance representing

00:13:01.139 --> 00:13:03.980
maleness, adds the distance representing femaleness,

00:13:04.120 --> 00:13:06.200
and when it looks at whatever coordinate it lands

00:13:06.200 --> 00:13:08.679
on. The absolute nearest neighbor hovering in

00:13:08.679 --> 00:13:11.360
that exact space happens to be the word queen.
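
NOTE A toy numpy sketch of the king minus man plus woman arithmetic. The 3-dimensional vectors below are invented so the math is visible; real embeddings have hundreds of learned dimensions, and real toolkits also exclude the query words when searching for the nearest neighbor.
import numpy as np
vectors = {
    "king":  np.array([0.9, 0.9, 0.1]),   # roughly: royal + male
    "queen": np.array([0.9, 0.1, 0.9]),   # roughly: royal + female
    "man":   np.array([0.1, 0.9, 0.1]),   # male
    "woman": np.array([0.1, 0.1, 0.9]),   # female
}
target = vectors["king"] - vectors["man"] + vectors["woman"]
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
nearest = max(vectors, key=lambda w: cosine(vectors[w], target))
print(nearest)  # queen: the stored vector closest to king - man + woman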

00:13:11.519 --> 00:13:14.539
The machine is doing algebra with human concepts.

00:13:14.740 --> 00:13:17.500
It is executing semantic algebra. It captures

00:13:17.500 --> 00:13:19.139
compositionality. I mean, the algorithm doesn't

00:13:19.139 --> 00:13:21.600
possess a human brain. It has absolutely no lived

00:13:21.600 --> 00:13:24.100
experience of royalty or gender. Obviously. But

00:13:24.100 --> 00:13:26.379
it has mapped out the relationship between those

00:13:26.379 --> 00:13:29.220
concepts perfectly, using literally nothing but

00:13:29.220 --> 00:13:32.059
distance and direction in a vector space. That

00:13:32.059 --> 00:13:33.940
concept just blows my mind every time I think

00:13:33.940 --> 00:13:36.639
about it. It's beautiful math. But practically

00:13:36.639 --> 00:13:39.500
speaking... If you are plotting hundreds of thousands

00:13:39.500 --> 00:13:42.600
of words into multi-dimensional space and running

00:13:42.600 --> 00:13:45.379
algebraic equations on all of them simultaneously

00:13:45.379 --> 00:13:49.080
to predict a sentence, you need astronomical

00:13:49.080 --> 00:13:51.179
computing power. Well, you do. Massive amounts.

00:13:51.500 --> 00:13:53.519
And the early computing architectures designed

00:13:53.519 --> 00:13:55.860
to process these continuous space embeddings

00:13:55.860 --> 00:13:59.419
were called recurrent neural networks, or RNNs.

00:13:59.679 --> 00:14:03.000
RNNs? Yeah. They were crucial because they solved

00:14:03.000 --> 00:14:05.940
a massive bottleneck known in the field as The

00:14:05.940 --> 00:14:08.519
Curse of Dimensionality. The Curse of Dimensionality.

00:14:08.620 --> 00:14:10.399
That sounds like the title of a terrible sci

00:14:10.399 --> 00:14:12.899
-fi thriller. It kind of does. But it's a very

00:14:12.899 --> 00:14:15.279
real mathematical nightmare. What is it? It basically

00:14:15.279 --> 00:14:17.600
dictates that as your vocabulary grows, say,

00:14:17.679 --> 00:14:20.580
from 10,000 words to 100,000 words, the number

00:14:20.580 --> 00:14:23.460
of possible sequence combinations increases exponentially.
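
NOTE A back-of-the-envelope illustration of that growth, using the vocabulary sizes mentioned in the conversation: the number of distinct trigrams you would have to count grows like the vocabulary size cubed.
for vocab_size in (10_000, 100_000):
    print(f"{vocab_size:,} words -> {vocab_size ** 3:,} possible trigrams")
# 10,000 words -> 1,000,000,000,000 possible trigrams
# 100,000 words -> 1,000,000,000,000,000 possible trigrams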

00:14:23.620 --> 00:14:26.379
Right. Because every new word adds millions of

00:14:26.379 --> 00:14:28.620
new potential sentences. Exactly. It quickly

00:14:28.620 --> 00:14:31.379
becomes a number so massive that no computer

00:14:31.379 --> 00:14:33.519
could ever calculate it using the old n-gram

00:14:33.519 --> 00:14:37.600
methods. So RNNs bypass this curse by representing

00:14:37.600 --> 00:14:40.500
words not as fixed sequences, but as non-linear

00:14:40.500 --> 00:14:42.710
combinations of weights inside a neural network.

00:14:43.070 --> 00:14:46.049
Ah, so they found a way to mathematically squish

00:14:46.049 --> 00:14:48.429
the infinite possibilities down into something

00:14:48.429 --> 00:14:51.289
actually manageable. Essentially, yes. But even

00:14:51.289 --> 00:14:54.330
those RNNs had their limitations, right? Because

00:14:54.330 --> 00:14:57.210
they still had to process information in a specific

00:14:57.210 --> 00:15:01.090
linear way. Which brings us to 2017. Here's where

00:15:01.090 --> 00:15:03.889
it gets really interesting. 2017 is the year

00:15:03.889 --> 00:15:06.289
the entire trajectory of artificial intelligence

00:15:06.289 --> 00:15:09.549
changed. Completely. What happened? This is when

00:15:09.549 --> 00:15:11.870
researchers introduced the transformer architecture.

00:15:12.529 --> 00:15:15.049
It completely replaced the recurrence mechanism

00:15:15.049 --> 00:15:18.149
in RNNs with a brand new mechanism called self

00:15:18.149 --> 00:15:20.269
-attention. Self-attention. Let's break that

00:15:20.269 --> 00:15:21.789
down for the listener because the difference

00:15:21.789 --> 00:15:24.409
in mechanism is literally everything here. Let's

00:15:24.409 --> 00:15:26.860
do it. An older RNN model was kind of like reading

00:15:26.860 --> 00:15:29.919
a book linearly. Right. It reads word by word

00:15:29.919 --> 00:15:32.039
left to right. And by the time it reaches the

00:15:32.039 --> 00:15:35.080
end of a really long paragraph, it sort of strains

00:15:35.080 --> 00:15:37.899
to remember what the subject was at the very

00:15:37.899 --> 00:15:41.100
beginning. It loses the thread. Exactly. But

00:15:41.100 --> 00:15:44.779
transformers, utilizing this new self-attention

00:15:44.779 --> 00:15:48.419
mechanism, operate entirely differently. It's

00:15:48.419 --> 00:15:51.000
like having the superpower to instantly scan

00:15:51.000 --> 00:15:53.940
an entire page at once. Yes. It immediately draws

00:15:54.000 --> 00:15:56.139
invisible highlighter lines connecting every

00:15:56.139 --> 00:15:58.340
single word to every other word simultaneously,

00:15:58.600 --> 00:16:01.100
regardless of where they sit on the page. So

00:16:01.100 --> 00:16:03.840
it instantly knows the word bank at the bottom

00:16:03.840 --> 00:16:06.120
of the page refers to the word river at the top

00:16:06.120 --> 00:16:08.240
of the page. So it knows we're talking about

00:16:08.240 --> 00:16:10.940
water, not a financial institution. Exactly.
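
NOTE A bare-bones numpy sketch of the self-attention idea described above, with a single head and no learned projection matrices: every position scores every other position and takes a weighted mix of the whole sequence at once. The 4x3 matrix of token vectors is invented for illustration.
import numpy as np
def self_attention(x):
    """x: (seq_len, dim) token vectors; returns one context-mixed vector per position."""
    scores = x @ x.T / np.sqrt(x.shape[-1])        # every token scores every token
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
    return weights @ x                             # weighted mix of all positions
tokens = np.array([[1.0, 0.0, 0.2],   # pretend this is "river"
                   [0.0, 1.0, 0.1],   # "bank"
                   [0.9, 0.1, 0.3],   # "water"
                   [0.1, 0.2, 1.0]])  # "flows"
print(self_attention(tokens).round(2))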

00:16:11.320 --> 00:16:12.799
The technical term for what you're describing

00:16:12.799 --> 00:16:16.159
is parallelization. Parallelization. Yeah, because

00:16:16.159 --> 00:16:18.879
transformers process all the data simultaneously

00:16:18.879 --> 00:16:21.580
rather than sequentially left to right. They

00:16:21.580 --> 00:16:24.090
can handle massive context. They don't forget

00:16:24.090 --> 00:16:25.990
the beginning of the paragraph anymore. Right.

00:16:26.090 --> 00:16:28.549
And more importantly, parallel processing allowed

00:16:28.549 --> 00:16:31.309
for scalable training on an unprecedented volume

00:16:31.309 --> 00:16:34.450
of data. For the first time, researchers could

00:16:34.450 --> 00:16:38.289
realistically scrape and process the entire public

00:16:38.289 --> 00:16:40.509
Internet. Which is just wild. They literally

00:16:40.509 --> 00:16:43.149
fed the machine the Internet. They did. And this

00:16:43.149 --> 00:16:46.250
is the birth of large language models, or LLMs,

00:16:46.629 --> 00:16:49.190
the foundation for modern models like GPT and

00:16:49.190 --> 00:16:53.149
BERT. The scale is just staggering. We evolved

00:16:53.149 --> 00:16:56.710
from simple statistical algorithms to deep neural

00:16:56.710 --> 00:17:00.210
networks consisting of billions and now trillions

00:17:00.210 --> 00:17:02.610
of parameters. And when you scale up the data

00:17:02.610 --> 00:17:04.730
and the parameters to that extreme, our source

00:17:04.730 --> 00:17:07.069
text notes that these models started showcasing

00:17:07.069 --> 00:17:10.339
what scientists call emergent behaviors. Yeah,

00:17:10.500 --> 00:17:12.359
unexpected capabilities. They suddenly started

00:17:12.359 --> 00:17:14.740
doing things they were never explicitly programmed

00:17:14.740 --> 00:17:17.200
to do, right? They demonstrated phenomena like

00:17:17.200 --> 00:17:19.460
few-shot learning, meaning the model could learn

00:17:19.460 --> 00:17:22.420
a completely new task, like translating a new

00:17:22.420 --> 00:17:24.740
language, after being shown just a tiny handful

00:17:24.740 --> 00:17:27.500
of examples. Wow. They also displayed compositional

00:17:27.500 --> 00:17:29.460
reasoning, where they were combining multiple

00:17:29.460 --> 00:17:32.519
distinct concepts to solve complex logic problems

00:17:32.519 --> 00:17:34.660
they had never seen before. OK, but let's pause

00:17:34.660 --> 00:17:36.559
for a second and look at the reality of what

00:17:36.559 --> 00:17:38.980
we've built here. We've constructed a massively

00:17:38.980 --> 00:17:41.759
powerful parallel-processing brain that read

00:17:41.759 --> 00:17:44.440
the entire public internet to learn how to speak.

00:17:44.680 --> 00:17:47.579
Yes, we did. But you and I both know the internet

00:17:47.579 --> 00:17:50.940
is... well, it's the internet. It's a mess. It

00:17:50.940 --> 00:17:53.420
contains absolute brilliance, yes, but it also

00:17:53.420 --> 00:17:56.880
contains vast inaccuracies, weird strange rabbit

00:17:56.880 --> 00:18:00.019
holes, and terrible toxic advice. Oh, definitely.

00:18:00.559 --> 00:18:03.599
And the model inherently digests and inherits

00:18:03.599 --> 00:18:06.660
all of that raw data. So how do engineers take

00:18:06.660 --> 00:18:09.559
a machine trained on the total chaos of the internet

00:18:09.559 --> 00:18:12.440
and make it factual, helpful, and safe for a

00:18:12.440 --> 00:18:15.059
user to actually interact with? Well, that is

00:18:15.059 --> 00:18:17.359
the multi-billion-dollar engineering challenge

00:18:17.359 --> 00:18:20.099
of our current era. I bet. And the primary solution

00:18:20.099 --> 00:18:22.380
the industry has implemented is a process called

00:18:22.380 --> 00:18:26.380
RLHF. RLHF, which stands for Reinforcement

00:18:26.380 --> 00:18:28.559
Learning from Human Feedback. OK, let's dig into

00:18:28.559 --> 00:18:31.200
the mechanics of that. How do you reinforce a

00:18:31.200 --> 00:18:34.079
trillion parameter neural network? Well, it utilizes

00:18:34.079 --> 00:18:37.500
policy gradient algorithms to fine-tune the LLM.

00:18:37.519 --> 00:18:40.190
The model generates a potential output, which

00:18:40.190 --> 00:18:42.910
is dictated by its internal policy, and that

00:18:42.910 --> 00:18:44.990
output is then graded against reward signals

00:18:44.990 --> 00:18:47.750
derived from human preference judgments or automated

00:18:47.750 --> 00:18:49.730
systems. Wait, okay, let me simplify that. Think

00:18:49.730 --> 00:18:51.970
of it like a video game high score, right? The

00:18:51.970 --> 00:18:54.950
machine spits out an answer to a prompt. A human

00:18:54.950 --> 00:18:57.170
evaluator looks at it and essentially gives it

00:18:57.170 --> 00:18:59.650
a thumbs up for being factual and helpful, or

00:18:59.650 --> 00:19:02.309
a thumbs down for being inaccurate or unhelpful.
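
NOTE A drastically simplified, bandit-style Python sketch of the "grade the answer, nudge the policy toward the thumbs-up" loop being described. Real RLHF fine-tunes billions of parameters against a learned reward model; the canned replies, the feedback function, and the learning rate here are all invented stand-ins.
import numpy as np
replies = ["helpful answer", "vague answer", "wrong answer"]
logits = np.zeros(len(replies))             # the "policy": preferences over replies
def feedback(reply):                        # stand-in for the human thumbs up / down
    return 1.0 if reply == "helpful answer" else -1.0
rng = np.random.default_rng(0)
for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    choice = rng.choice(len(replies), p=probs)  # the model "generates" an output
    reward = feedback(replies[choice])          # the output gets graded
    grad = -probs
    grad[choice] += 1.0                         # gradient of log p(choice) w.r.t. logits
    logits += 0.1 * reward * grad               # nudge the policy toward the reward
probs = np.exp(logits) / np.exp(logits).sum()
print(dict(zip(replies, probs.round(2))))       # most of the mass ends up on "helpful answer"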

00:19:02.710 --> 00:19:05.750
The algorithm then slightly tweaks its internal

00:19:05.750 --> 00:19:09.029
policy, its rulebook for choosing words, so it's

00:19:09.029 --> 00:19:11.230
mathematically more likely to chase that

00:19:11.230 --> 00:19:13.529
thumbs up high score the next time it gets a

00:19:13.529 --> 00:19:16.650
similar prompt. That's a great analogy. It iteratively

00:19:16.650 --> 00:19:19.430
optimizes the model's output distribution. This

00:19:19.430 --> 00:19:22.150
is how developers drastically improve factuality

00:19:22.150 --> 00:19:24.549
and align the model's responses with what users

00:19:24.549 --> 00:19:26.990
actually need. Got it. But of course, once you

00:19:26.990 --> 00:19:29.369
train it with RLHF, you have to definitively

00:19:29.369 --> 00:19:32.430
prove the system is actually smart. So researchers

00:19:32.430 --> 00:19:35.289
rigorously evaluate these models against comprehensive

00:19:35.289 --> 00:19:38.079
standardized benchmarks. Right. Our source mentions

00:19:38.079 --> 00:19:40.299
a few of these specific testing frameworks like

00:19:40.299 --> 00:19:43.440
MMLU, the GLUE benchmark, and the SQuAD question

00:19:43.440 --> 00:19:45.500
answering test. Yes. These are basically the

00:19:45.500 --> 00:19:47.700
standardized SATs for artificial intelligence,

00:19:47.859 --> 00:19:50.079
right? Testing everything from basic reading

00:19:50.079 --> 00:19:54.240
comprehension to advanced physics. Exactly. But

00:19:54.240 --> 00:19:57.299
if we connect this to the bigger picture, relying

00:19:57.299 --> 00:20:00.480
so heavily on these standardized benchmarks introduces

00:20:00.480 --> 00:20:03.319
a pretty major controversy in the computer science

00:20:03.319 --> 00:20:07.029
community. A controversy over tests. Yeah. It's

00:20:07.029 --> 00:20:09.210
a phenomenon known as hill climbing. Wait, if

00:20:09.210 --> 00:20:11.150
developers are constantly tweaking these models

00:20:11.150 --> 00:20:14.890
specifically to beat the MMLU or the GLUE benchmark,

00:20:15.809 --> 00:20:17.849
isn't there a massive risk that they are just

00:20:17.849 --> 00:20:19.769
teaching to the test? You hit the nail on the

00:20:19.769 --> 00:20:22.369
head. Like, are these models actually achieving

00:20:22.369 --> 00:20:25.369
genuine generalization and getting smarter? Or

00:20:25.369 --> 00:20:27.329
are they just like clever high school students

00:20:27.329 --> 00:20:29.650
who figured out how to ace a standardized test

00:20:29.650 --> 00:20:31.829
without actually understanding the underlying

00:20:31.829 --> 00:20:34.539
material? That is precisely what hill climbing

00:20:34.539 --> 00:20:37.559
means, and you've hit on the exact debate dividing

00:20:37.559 --> 00:20:40.960
AI researchers right now. Are we witnessing robust

00:20:40.960 --> 00:20:43.200
capability improvements in machine intelligence,

00:20:43.759 --> 00:20:46.680
or are we just iteratively engineering incredible

00:20:46.680 --> 00:20:49.200
test -takers that are basically overfitting to

00:20:49.200 --> 00:20:51.559
the evaluation metrics? That's a huge distinction.

00:20:51.779 --> 00:20:54.599
It is. The source material makes it remarkably

00:20:54.599 --> 00:20:57.309
clear, actually. While these language models

00:20:57.309 --> 00:20:59.769
sometimes match or even exceed human performance

00:20:59.769 --> 00:21:03.589
on specific tasks, it is absolutely not clear

00:21:03.589 --> 00:21:06.329
whether they serve as plausible cognitive models.

00:21:06.630 --> 00:21:08.670
Meaning, even though they can speak to us perfectly,

00:21:09.509 --> 00:21:12.789
they do not think like we do. Not at all. The

00:21:12.789 --> 00:21:14.789
text points out a really fascinating paradox.

00:21:15.000 --> 00:21:18.480
When analyzing recurrent neural networks, researchers

00:21:18.480 --> 00:21:20.980
found that the models sometimes learn intricate

00:21:20.980 --> 00:21:23.660
mathematical patterns that humans are completely

00:21:23.660 --> 00:21:27.380
blind to. Really? Yeah. And conversely, those

00:21:27.380 --> 00:21:30.019
exact same models will utterly fail to learn

00:21:30.019 --> 00:21:32.519
simple contextual patterns that a human typically

00:21:32.519 --> 00:21:35.380
grasps with effortless ease. Okay, wow. Let's

00:21:35.380 --> 00:21:37.019
summarize this incredible journey for everyone

00:21:37.019 --> 00:21:39.259
listening because we covered a lot of ground.

00:21:39.319 --> 00:21:42.650
We really did. We started in the 1950s with Noam

00:21:42.650 --> 00:21:45.230
Chomsky trying to force rigid grammar rules into

00:21:45.230 --> 00:21:47.630
a machine, sort of like a strict teacher. Right.

00:21:47.869 --> 00:21:50.450
When that failed, we pivoted to the 1980s, where

00:21:50.450 --> 00:21:53.230
IBM's statistical guesses and the gambler's n

00:21:53.230 --> 00:21:55.589
-grams took over, eventually learning to skip

00:21:55.589 --> 00:21:57.789
words to find connections with skip-grams. And

00:21:57.789 --> 00:22:00.470
solving the zero problem along the way. Exactly.

00:22:00.970 --> 00:22:03.730
Then came the 2000s, where words were transformed

00:22:03.730 --> 00:22:05.990
into mathematical embeddings, doing semantic

00:22:05.990 --> 00:22:09.089
algebra in a 3D vector space. The king minus

00:22:09.089 --> 00:22:12.690
man plus woman equals queen. Yes. And finally,

00:22:12.869 --> 00:22:14.789
the neural revolution gave us self-attending

00:22:14.789 --> 00:22:17.849
transformers trained on the entirety of the internet

00:22:17.849 --> 00:22:21.029
and fine-tuned by human feedback to chase a

00:22:21.029 --> 00:22:24.150
high score of helpfulness. It is a phenomenal,

00:22:24.490 --> 00:22:27.609
decades -long evolution. It's combining linguistics,

00:22:28.009 --> 00:22:30.309
mathematics, and raw computational architecture.

00:22:30.480 --> 00:22:33.039
And it matters to you. It matters deeply to everyone

00:22:33.039 --> 00:22:35.259
listening. Absolutely. Every single time you

00:22:35.259 --> 00:22:37.539
use a chat bot to help draft a difficult email

00:22:37.539 --> 00:22:40.160
or rely on search information retrieval to find

00:22:40.160 --> 00:22:43.140
an obscure fact, or just let your phone's route

00:22:43.140 --> 00:22:45.240
optimization guide you around a traffic jam,

00:22:45.420 --> 00:22:47.900
you are interacting with the ghosts of this exact

00:22:47.900 --> 00:22:50.220
evolution. You're using the math. You are utilizing

00:22:50.220 --> 00:22:52.500
the skip-grams, the multi-dimensional vectors,

00:22:52.500 --> 00:22:54.759
and the self-attention mechanisms built over

00:22:54.759 --> 00:22:57.220
70 years. This raises an important question,

00:22:57.559 --> 00:22:59.619
though. Oh. One that lingers at the very end

00:22:59.619 --> 00:23:01.640
of our source material, and it's something I

00:23:01.640 --> 00:23:03.839
think we all need to reckon with. Lay it on us.

00:23:04.160 --> 00:23:06.920
If these machines don't learn the way human beings

00:23:06.920 --> 00:23:09.660
learn, I mean, if their internal architecture

00:23:09.660 --> 00:23:11.799
allows them to see vast statistical patterns

00:23:11.799 --> 00:23:14.579
we are blind to, and yet they completely miss

00:23:14.579 --> 00:23:17.920
the obvious intuitive things that a human toddler

00:23:17.920 --> 00:23:21.579
can grasp instinctively, what does it mean for

00:23:21.579 --> 00:23:25.279
our future? Man. It's an unsettling thought to

00:23:25.279 --> 00:23:28.660
leave on. It is. Because as humans, we expect

00:23:28.660 --> 00:23:31.380
shared perception. When we look at an X-ray,

00:23:31.460 --> 00:23:33.759
we see the broken bone and we expect the machine

00:23:33.759 --> 00:23:36.539
to see the exact same reality we do. Right. We

00:23:36.539 --> 00:23:39.299
expect it to process like us. But we are increasingly

00:23:39.299 --> 00:23:42.619
relying on an entirely alien form of intelligence

00:23:42.619 --> 00:23:45.059
to process and filter the world's information

00:23:45.059 --> 00:23:48.000
for us. We spent 70 years teaching the machine

00:23:48.000 --> 00:23:49.880
to speak our language, but it turns out under

00:23:49.880 --> 00:23:52.059
the hood, it's not speaking our language at

00:23:52.059 --> 00:23:53.880
all. It's just doing the math. Just doing the

00:23:53.880 --> 00:23:55.759
math. Something to ponder the next time your

00:23:55.759 --> 00:23:57.920
phone autocompletes your thought before you've

00:23:57.920 --> 00:24:00.380
even finished typing it. Thanks for joining us

00:24:00.380 --> 00:24:01.920
on this deep dive. We'll see you next time.
