WEBVTT

00:00:00.000 --> 00:00:03.899
Have you ever stopped to think about just how

00:00:03.899 --> 00:00:06.580
incredibly messy human language actually is?

00:00:06.679 --> 00:00:08.800
No, it's a nightmare. Right. I mean, we're

00:00:08.800 --> 00:00:13.080
constantly dealing with slang and sarcasm, metaphors,

00:00:13.660 --> 00:00:15.660
words that change meaning entirely just depending

00:00:15.660 --> 00:00:17.940
on the tone of your voice. Exactly. And you and

00:00:17.940 --> 00:00:20.399
I navigate this like poetry of everyday life

00:00:20.399 --> 00:00:23.000
without even breaking a sweat. But how do you

00:00:23.000 --> 00:00:27.019
turn that fluidity of human vocabulary into cold,

00:00:27.399 --> 00:00:30.280
hard mathematics? That is the big question. It

00:00:30.280 --> 00:00:33.280
is. And today we are opening up a stack of research

00:00:33.280 --> 00:00:36.579
for a deep dive into a 2014 breakthrough called

00:00:36.579 --> 00:00:39.740
GloVe. We really want to see the exact moment

00:00:39.740 --> 00:00:41.640
artificial intelligence figured this out. It

00:00:41.640 --> 00:00:44.479
really is one of the most fascinating problems

00:00:44.479 --> 00:00:47.460
in computer science, because for decades, computers

00:00:47.460 --> 00:00:49.740
were effectively just playing a giant, very fast

00:00:49.740 --> 00:00:52.140
game of matching letters. Just pattern recognition.

00:00:52.460 --> 00:00:53.979
Right. They didn't understand what the words

00:00:53.979 --> 00:00:55.479
meant at all. They just knew that the letters

00:00:55.479 --> 00:00:59.039
D-O-G matched other instances of D-O-G. Right,

00:00:59.079 --> 00:01:01.520
right. So if you asked an early computer a question

00:01:01.520 --> 00:01:04.060
about a canine, it wouldn't even know to look

00:01:04.060 --> 00:01:07.159
for the word dog because it had no concept that

00:01:07.159 --> 00:01:09.459
those two things were related. Wow, yeah. So

00:01:09.459 --> 00:01:11.180
to get to the modern artificial intelligence

00:01:11.180 --> 00:01:14.200
we interact with today, there had to be a bridge,

00:01:14.659 --> 00:01:17.319
you know, a way to translate the abstract concept

00:01:17.319 --> 00:01:20.719
of semantic meaning into pure numbers. OK, let's

00:01:20.719 --> 00:01:23.939
unpack this. Imagine you are trying to teach

00:01:23.939 --> 00:01:27.319
an alien our language. I love this analogy.

00:01:27.579 --> 00:01:29.500
Right, but instead of handing them a dictionary

00:01:29.819 --> 00:01:33.239
with actual definitions, you only give them a

00:01:33.239 --> 00:01:35.780
massive spreadsheet showing how often certain

00:01:35.780 --> 00:01:38.120
words sit next to each other in a sentence. Just

00:01:38.120 --> 00:01:40.299
raw tallies. Exactly, just the raw tallies. And

00:01:40.299 --> 00:01:42.579
that is essentially what GloVe is. That is a

00:01:42.579 --> 00:01:45.359
brilliant way to conceptualize it. So GloVe,

00:01:45.439 --> 00:01:48.319
which stands for Global Vectors, was developed

00:01:48.319 --> 00:01:50.780
as an open source project at Stanford University,

00:01:50.959 --> 00:01:53.900
and it launched in 2014. OK. At the time, it

00:01:53.900 --> 00:01:56.040
was designed to compete with another famous model

00:01:56.040 --> 00:01:58.760
called Word2Vec. And it fundamentally changed

00:01:58.760 --> 00:02:01.079
natural language processing, or NLP. It was like

00:02:01.079 --> 00:02:03.799
the turning point. It was. It was the crucial

00:02:03.799 --> 00:02:05.980
stepping stone that proved we could actually

00:02:05.980 --> 00:02:08.620
quantify meaning. But before we get into the

00:02:08.620 --> 00:02:14.219
heavy math of vectors and algorithms, we need

00:02:14.219 --> 00:02:16.479
to take a step back. Because the source material

00:02:16.479 --> 00:02:18.139
points out something I found just incredible.

00:02:18.319 --> 00:02:22.219
The 1950s connection. Yes. The 2014 code for

00:02:22.219 --> 00:02:25.110
this algorithm is actually rooted in a linguistic

00:02:25.110 --> 00:02:28.169
philosophy from the 1950s. It is, and this provides

00:02:28.169 --> 00:02:30.289
the entire foundation for the math we are going

00:02:30.289 --> 00:02:33.080
to talk about. There is a famous quote from the

00:02:33.080 --> 00:02:37.120
English linguist J. R. Firth from 1957. Oh, well,

00:02:37.120 --> 00:02:39.900
lay it on us. He said, you shall know a word

00:02:39.900 --> 00:02:42.379
by the company it keeps. You shall know a word

00:02:42.379 --> 00:02:43.919
by the company it keeps. I mean, it sounds almost

00:02:43.919 --> 00:02:46.500
like an old proverb. It does, but it perfectly

00:02:46.500 --> 00:02:48.919
encapsulates the theory behind GloVe. You see,

00:02:48.979 --> 00:02:51.379
GloVe is what we call an unsupervised learning

00:02:51.379 --> 00:02:53.879
algorithm that creates vector representations

00:02:53.879 --> 00:02:56.879
of words. Right. The ultimate goal is to map

00:02:56.879 --> 00:02:59.319
words into a meaningful space. Okay, let's pause

00:02:59.319 --> 00:03:00.759
right there because if you are sitting there

00:03:00.759 --> 00:03:03.020
listening to this on your commute, hearing the

00:03:03.020 --> 00:03:06.840
phrase meaningful space might sound a bit abstract.

00:03:07.060 --> 00:03:10.180
Fair enough, yeah. Think of a vector as an address

00:03:10.180 --> 00:03:13.560
in a massive, floating, multidimensional city.

00:03:13.659 --> 00:03:16.750
Okay, city of words. Exactly. The algorithm's

00:03:16.750 --> 00:03:19.289
job is to assign every single word an address

00:03:19.289 --> 00:03:21.889
in this city so that the mathematical distance

00:03:21.889 --> 00:03:24.430
between two words is directly related to their

00:03:24.430 --> 00:03:26.889
semantic similarity. So words that mean similar

00:03:26.889 --> 00:03:29.689
things end up physically closer together. Yes.

00:03:30.169 --> 00:03:33.009
Like car and truck, they end up physically closer

00:03:33.009 --> 00:03:36.009
together in this mathematical space. And GloVe

00:03:36.009 --> 00:03:38.430
achieves this by combining two different model

00:03:38.430 --> 00:03:42.530
families, global matrix factorization and local

00:03:42.530 --> 00:03:45.870
context window methods. OK, hold on. Global

00:03:45.870 --> 00:03:48.810
word-word co-occurrence statistics. I mean, I know

00:03:48.810 --> 00:03:50.469
we want to understand the mechanics, but we need

00:03:50.469 --> 00:03:52.509
to keep this grounded for the listener who just

00:03:52.509 --> 00:03:55.250
wants that aha moment. Right, right. How do we

00:03:55.250 --> 00:03:57.229
actually count the company a word keeps in plain

00:03:57.229 --> 00:03:59.610
English? Well, what's fascinating here is that

00:03:59.610 --> 00:04:02.270
the underlying mechanism is incredibly intuitive

00:04:02.270 --> 00:04:04.789
once you visualize it. The algorithm doesn't

00:04:04.789 --> 00:04:07.610
read a book the way you or I do, you know, comprehending

00:04:07.610 --> 00:04:10.110
the narrative or the characters. It just scans

00:04:10.110 --> 00:04:12.930
the text looking for neighbors. So let's visualize

00:04:12.930 --> 00:04:16.829
it. Think of the algorithm as a spotlight moving

00:04:16.829 --> 00:04:19.490
across a sentence. The room is totally dark,

00:04:19.759 --> 00:04:22.079
and the spotlight only illuminates a few words

00:04:22.079 --> 00:04:24.480
at a time as it moves from left to right. I like

00:04:24.480 --> 00:04:26.579
that. And that spotlight is what the researchers

00:04:26.579 --> 00:04:29.300
call the context window. Precisely. And the size

00:04:29.300 --> 00:04:31.459
of that spotlight matters. Let's look at the

00:04:31.459 --> 00:04:33.720
exact example provided in the Wikipedia text

00:04:33.720 --> 00:04:35.540
to see how this counting works. Okay, let's do

00:04:35.540 --> 00:04:39.699
it. Imagine the sentence is, GloVe, coined from global

00:04:39.699 --> 00:04:43.459
vectors, is a model for distributed word representation.

00:04:43.980 --> 00:04:46.259
Let's say our context length, our spotlight,

00:04:46.540 --> 00:04:49.319
is set to three words. So if the spotlight is

00:04:49.319 --> 00:04:51.759
centered on the word model, does it just log

00:04:51.759 --> 00:04:53.920
every single word in the sentence as its company?

00:04:54.379 --> 00:04:56.459
Not quite. There is a strict distance limit.

00:04:56.670 --> 00:04:58.889
If we center on the word model, which is the

00:04:58.889 --> 00:05:00.930
eighth word in that sentence, we only look three

00:05:00.930 --> 00:05:03.009
words to the left and three words to the right.

00:05:03.110 --> 00:05:05.569
Okay. So the 11th word in the sentence is word.

00:05:05.569 --> 00:05:08.290
Because word is within three spaces of model,

00:05:08.889 --> 00:05:11.470
the algorithm records that model is in the context

00:05:11.470 --> 00:05:14.769
of word. They are company. But the 12th word

00:05:14.769 --> 00:05:19.189
in the sentence is representation. So that is

00:05:19.189 --> 00:05:23.410
four spaces away from model. Exactly. So representation

00:05:23.410 --> 00:05:26.889
falls outside the spotlight. It literally does not

00:05:26.889 --> 00:05:29.069
exist to the algorithm in this specific instance.

00:05:29.110 --> 00:05:31.329
It just logs a zero. Just a zero for that pairing.

00:05:31.810 --> 00:05:33.889
And there is another really important rule here.

00:05:34.250 --> 00:05:36.990
A word is never considered to be in its own context.

00:05:37.350 --> 00:05:39.689
Wait, really? Yeah, so the word model is not

00:05:39.689 --> 00:05:42.110
in the context of the word model. The only way

00:05:42.110 --> 00:05:44.970
a word counts as its own company is if it literally

00:05:44.970 --> 00:05:47.850
repeats within the same spotlight window. Oh,

00:05:47.930 --> 00:05:50.089
like if I said... I don't think that that is

00:05:50.089 --> 00:05:52.910
a problem. The first that is right next to the

00:05:52.910 --> 00:05:55.490
second that. So they fall into each other's context.
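
NOTE
The spotlight counting described in the cues above can be sketched in a few lines.
This is a minimal illustration, not the GloVe tool itself: the function name
cooccurrence_counts and the plain whitespace tokenization are assumptions made here,
and every pair inside the window counts as 1, whereas the published GloVe paper
down-weights pairs that sit further apart.
```python
from collections import Counter
# Example sentence quoted in the episode, lower-cased and split on spaces.
SENTENCE = "glove coined from global vectors is a model for distributed word representation".split()
def cooccurrence_counts(tokens, window=3):
    """Count (center, neighbor) pairs within `window` positions on either side."""
    counts = Counter()
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a position is never in its own context
                counts[(center, tokens[j])] += 1
    return counts
counts = cooccurrence_counts(SENTENCE)
print(counts[("model", "word")])            # model and word are 3 positions apart
print(counts[("model", "representation")])  # 4 positions apart, so never counted
repeat = cooccurrence_counts("i do not think that that is a problem".split())
print(repeat[("that", "that")])             # each "that" sees the other one
```
Running the same function over a whole corpus and summing the counters would give
the co-occurrence matrix the hosts call the spreadsheet.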

00:05:55.610 --> 00:05:57.269
You've got it. So the algorithm does this for

00:05:57.269 --> 00:05:59.649
the entire corpus, the entire collection of text

00:05:59.649 --> 00:06:02.310
it's studying, which could be billions of words.

00:06:02.410 --> 00:06:06.569
It counts up every single time word A appears

00:06:06.569 --> 00:06:09.250
in the context of word B, and this creates a

00:06:09.250 --> 00:06:12.129
massive spreadsheet, a matrix of co-occurrence

00:06:12.129 --> 00:06:14.329
statistics. So counting is just the first step,

00:06:14.430 --> 00:06:18.089
right? Right. Next comes the probabilistic modeling.

00:06:18.399 --> 00:06:21.839
And here is where we find a crucial nuance.

00:06:21.839 --> 00:06:24.959
Co-occurrence is not a two-way street. I am so

00:06:24.959 --> 00:06:27.079
glad you brought this up because this asymmetry

00:06:27.079 --> 00:06:30.920
is wild to me. The probability of seeing word

00:06:30.920 --> 00:06:34.139
A when you are looking at word B is not the same

00:06:34.139 --> 00:06:36.720
as the reverse. It's really not. The source gives

00:06:36.720 --> 00:06:39.060
a brilliant example of this with the words much

00:06:39.060 --> 00:06:43.139
and ado. Ado is this archaic word. I mean,

00:06:43.300 --> 00:06:45.160
in modern English you almost never hear it unless

00:06:45.160 --> 00:06:48.060
it's in the phrase, much ado about nothing. Exactly.

00:06:48.360 --> 00:06:51.300
So if the algorithm spots the word ado, the probability

00:06:51.300 --> 00:06:53.300
that the word much is standing right next to

00:06:53.300 --> 00:06:56.600
it is practically 100%. But if we flip it around...

00:06:56.430 --> 00:06:59.550
the logic changes entirely. The word much is

00:06:59.550 --> 00:07:01.709
incredibly common. We say much better, too much,

00:07:01.810 --> 00:07:04.110
how much? So if the algorithm spots the word

00:07:04.110 --> 00:07:06.689
much, the probability that ado is the word next

00:07:06.689 --> 00:07:09.750
to it is near zero. The relationship is entirely

00:07:09.750 --> 00:07:12.189
asymmetric. Okay, so our alien spreadsheet now
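
NOTE
The much/ado asymmetry comes from dividing the same pair tally by two very different
totals. The sketch below uses invented tallies (the numbers 95, 100 and 50,000 are
assumptions for illustration, not figures from any corpus), and the helper name
conditional_probability is hypothetical.
```python
# Hypothetical tallies, invented purely for illustration.
pair_count = {("ado", "much"): 95, ("much", "ado"): 95}
context_total = {"ado": 100, "much": 50_000}
def conditional_probability(word, given):
    """P(word appears in the context of `given`) = pair tally / total context tally of `given`."""
    return pair_count.get((given, word), 0) / context_total[given]
p_much_given_ado = conditional_probability("much", given="ado")
p_ado_given_much = conditional_probability("ado", given="much")
print(p_much_given_ado)  # close to 1: ado almost always travels with much
print(p_ado_given_much)  # close to 0: much usually keeps other company
```
The pair tally is identical in both directions; only the denominator changes, which
is why the relationship is not a two-way street.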

00:07:12.189 --> 00:07:15.189
has these massive asymmetric tallies of how often

00:07:15.189 --> 00:07:17.810
every single word hangs out with every other

00:07:17.810 --> 00:07:21.170
word. But tallies are just raw numbers. Here's

00:07:21.170 --> 00:07:24.240
where it gets really interesting. How does a

00:07:24.240 --> 00:07:27.160
computer use raw tallies to understand the actual

00:07:27.160 --> 00:07:29.660
meaning of a word? That's the leap. Yeah, like

00:07:29.660 --> 00:07:31.819
how does it know the difference between say ice

00:07:31.819 --> 00:07:34.699
and steam? Well, to answer that, the researchers

00:07:34.699 --> 00:07:37.519
provided a brilliant data table from a corpus

00:07:37.519 --> 00:07:41.220
of six billion words. Six billion. Six billion.

00:07:41.819 --> 00:07:44.220
They looked at the probabilities of four specific

00:07:44.220 --> 00:07:46.800
words appearing in the context of ice versus

00:07:46.800 --> 00:07:50.240
steam. The four context words were solid, gas,

00:07:50.560 --> 00:07:52.720
water, and fashion. Okay, let's walk through

00:07:52.720 --> 00:07:54.680
that because this is the real magic trick of

00:07:54.680 --> 00:07:56.920
the whole algorithm. It really is. So first,

00:07:57.000 --> 00:07:59.980
they looked at the word water. Both ice and steam

00:07:59.980 --> 00:08:02.339
are forms of water, right? So they both appear

00:08:02.339 --> 00:08:04.160
frequently in the context of the word water.

00:08:04.699 --> 00:08:07.180
If the computer only looked at that raw frequency,

00:08:07.680 --> 00:08:10.779
ice and steam would be mathematically indistinguishable.

00:08:10.939 --> 00:08:13.639
OK, and what about the word fashion? Well, neither

00:08:13.639 --> 00:08:16.319
ice nor steam has much to do with fashion. So

00:08:16.319 --> 00:08:18.639
the probabilities for both are incredibly low.

00:08:18.819 --> 00:08:21.300
Again, indistinguishable. So looking at words

00:08:21.300 --> 00:08:23.819
they both relate to, or words neither relates

00:08:23.819 --> 00:08:27.230
to, that doesn't help at all. Exactly. The breakthrough

00:08:27.230 --> 00:08:29.490
comes from looking at the ratio of their probabilities.

00:08:29.569 --> 00:08:32.230
A ratio. Yes. The algorithm takes the probability

00:08:32.230 --> 00:08:35.590
of a word appearing near ice and divides it by

00:08:35.590 --> 00:08:38.370
the probability of that same word appearing near

00:08:38.370 --> 00:08:40.789
steam. And this is where the words solid and gas

00:08:40.789 --> 00:08:43.169
come in. Right. When they calculated the ratio

00:08:43.169 --> 00:08:45.809
for the word solid, the result was a relatively

00:08:45.809 --> 00:08:49.389
large number, 8.9. Okay. This tells the algorithm

00:08:49.389 --> 00:08:51.970
that solid is much more closely related to ice

00:08:51.970 --> 00:08:54.809
than it is to steam. Oh, wow. But when they calculated

00:08:54.809 --> 00:08:57.610
the ratio for the word gas, the result was a

00:08:57.610 --> 00:09:02.629
tiny decimal, 0.085. So that tells the algorithm

00:09:02.629 --> 00:09:05.250
that gas is heavily skewed towards steam and

00:09:05.250 --> 00:09:07.950
away from ice. Precisely. But for words like

00:09:07.950 --> 00:09:10.490
water or fashion, where the occurrence rates

00:09:10.490 --> 00:09:12.870
are similar, the ratio just hovers right around

00:09:12.870 --> 00:09:14.970
one. I just marvel at the elegance of this. I

00:09:14.970 --> 00:09:16.789
mean, the computer has absolutely no idea what

00:09:16.789 --> 00:09:18.970
physics is. None. It doesn't know what temperature

00:09:18.970 --> 00:09:21.529
is. It has never felt ice or been burned by steam.

00:09:21.789 --> 00:09:25.110
It simply uses ratio math to realize that one

00:09:25.110 --> 00:09:28.269
of these words behaves like a solid and the other

00:09:28.269 --> 00:09:31.710
behaves like a gas, meaning it's just ratios.
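
NOTE
The ratio test for ice versus steam can be sketched as follows. The probabilities are
invented for illustration and are not the figures from the researchers' table; the
thresholds 2 and 0.5 are likewise arbitrary assumptions chosen only to label the output.
```python
# Hypothetical conditional probabilities P(context word | target word), invented for illustration.
p_given_ice = {"solid": 2.0e-4, "gas": 6.0e-5, "water": 3.0e-3, "fashion": 2.0e-5}
p_given_steam = {"solid": 2.5e-5, "gas": 8.0e-4, "water": 2.5e-3, "fashion": 2.0e-5}
def probability_ratio(context_word):
    """Ratio P(context | ice) / P(context | steam)."""
    return p_given_ice[context_word] / p_given_steam[context_word]
ratios = {word: probability_ratio(word) for word in p_given_ice}
for word, ratio in ratios.items():
    if ratio > 2:
        verdict = "leans toward ice"
    elif ratio < 0.5:
        verdict = "leans toward steam"
    else:
        verdict = "does not separate them"
    print(f"{word:8s} {ratio:8.3f}  {verdict}")
```
Words shared by both targets and words shared by neither land near 1, so only the
discriminating context words move the ratio far from 1 in either direction.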

00:09:32.000 --> 00:09:35.879
It is deeply elegant. But here is the catch.

00:09:36.019 --> 00:09:38.779
There's always a catch. Always. Having a spreadsheet

00:09:38.779 --> 00:09:41.340
with billions of words and trillions of ratios

00:09:41.340 --> 00:09:43.720
is practically useless because it's just too

00:09:43.720 --> 00:09:46.940
massive. It takes up way too much memory and

00:09:46.940 --> 00:09:50.159
computational power. We need to squish that giant

00:09:50.159 --> 00:09:53.200
spreadsheet down into a tight, highly dense map

00:09:53.200 --> 00:09:55.200
of coordinates. Right. Going back to your analogy

00:09:55.200 --> 00:09:57.519
of the multidimensional city, we need to assign

00:09:57.519 --> 00:10:00.429
them actual addresses. Yes. And to do that, the

00:10:00.429 --> 00:10:03.830
model uses a technique called multinomial logistic

00:10:03.830 --> 00:10:06.350
regression. Wait, you're losing me. We just said

00:10:06.350 --> 00:10:08.690
meaning is ratios. Why are we bringing in regression

00:10:08.690 --> 00:10:11.049
equations? Because the regression is the compression

00:10:11.049 --> 00:10:13.990
tool. Essentially, the idea is to learn a specific

00:10:13.990 --> 00:10:17.629
vector, an address, for each word, such that

00:10:17.629 --> 00:10:20.110
when you plug those vectors into a formula, the

00:10:20.110 --> 00:10:22.330
math spits out a number that approximates the

00:10:22.330 --> 00:10:24.230
logarithm of those co-occurrence probabilities

00:10:24.230 --> 00:10:26.429
we just talked about. Okay, and just to clarify

00:10:26.429 --> 00:10:28.330
for everyone listening, when we say logarithm

00:10:28.330 --> 00:10:32.139
here, think of it as a way to scale down massive

00:10:32.139 --> 00:10:34.879
unwieldy numbers into manageable steps so the

00:10:34.879 --> 00:10:36.659
computer doesn't get overwhelmed by the sheer

00:10:36.659 --> 00:10:39.740
scale of the data. Exactly. But if you just run

00:10:39.740 --> 00:10:42.700
a standard regression equation, your math is

00:10:42.700 --> 00:10:45.019
going to get hijacked by noise. Noise. Like what

00:10:45.019 --> 00:10:47.700
kind of noise? Imagine someone makes a typo in

00:10:47.700 --> 00:10:50.700
a blog post, and suddenly two words that have

00:10:50.700 --> 00:10:53.360
absolutely no real relationship are sitting next

00:10:53.360 --> 00:10:55.320
to each other in the spotlight window. Oh, I

00:10:55.320 --> 00:10:58.179
see. If the algorithm treats that single occurrence

00:10:58.179 --> 00:11:00.720
with too much weight, it completely throws off

00:11:00.720 --> 00:11:02.980
the semantic meaning. So how does GloVe fix the

00:11:02.980 --> 00:11:06.480
noise? By using a weighted function. They tweak

00:11:06.480 --> 00:11:08.970
the math. so that the weight of the relationship

00:11:08.970 --> 00:11:12.769
ramps up slowly as the absolute number of

00:11:12.769 --> 00:11:15.750
co-occurrences increases. OK. They cap the maximum

00:11:15.750 --> 00:11:19.309
count at 100 occurrences, meaning once two words

00:11:19.309 --> 00:11:22.370
appear together 100 times, the algorithm considers

00:11:22.370 --> 00:11:24.350
the relationship fully established. It just stops

00:11:24.350 --> 00:11:26.309
counting them. It doesn't give extra weight to

00:11:26.309 --> 00:11:28.610
the hundred and first time or the millionth time.
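
NOTE
A minimal sketch of the capped weighting and the regression error it scales. The cap
of 100 is quoted in the episode and the exponent of three-fourths is mentioned a
little later; the names weight and pair_loss, the bias terms, and the use of the log
of the raw count as the target follow the usual write-up of the GloVe objective as a
weighted least-squares fit, and should be read as assumptions rather than as the
reference implementation.
```python
import math
X_MAX = 100      # cap quoted in the episode
ALPHA = 0.75     # exponent of three-fourths
def weight(count):
    """f(x) = (x / X_MAX) ** ALPHA below the cap, and exactly 1.0 at or above it."""
    if count >= X_MAX:
        return 1.0
    return (count / X_MAX) ** ALPHA
def pair_loss(w, w_ctx, b, b_ctx, count):
    """Weighted squared error between the model score and log(count) for one pair.
    Only pairs with count > 0 are meant to be passed in, since log(0) is undefined."""
    score = sum(a * c for a, c in zip(w, w_ctx)) + b + b_ctx
    return weight(count) * (score - math.log(count)) ** 2
print(weight(1))          # a one-off pair, such as a typo, gets little weight
print(weight(100))        # the cap is reached
print(weight(1_000_000))  # no extra weight beyond the cap
```
Training would sum pair_loss over every nonzero cell of the co-occurrence matrix and
adjust the vectors and biases to make that total small.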

00:11:28.690 --> 00:11:31.309
Why cap it at exactly 100? Well, think about

00:11:31.309 --> 00:11:34.389
a word like the. If the word the appears a billion

00:11:34.389 --> 00:11:37.259
times next to dog, we don't want it having a

00:11:37.259 --> 00:11:39.720
billion times the mathematical weight of a rarer,

00:11:39.820 --> 00:11:42.340
much more descriptive word like golden. Oh, that

00:11:42.340 --> 00:11:44.340
makes total sense. So they used a fractional

00:11:44.340 --> 00:11:46.379
power curve, specifically setting a parameter

00:11:46.379 --> 00:11:49.659
called alpha to three-fourths. So it acts like a

00:11:49.659 --> 00:11:51.620
governor on an engine. Yeah. It lets the weight

00:11:51.620 --> 00:11:54.399
grow quickly at first, but then flattens it out

00:11:54.399 --> 00:11:56.720
so common words don't drown out the meaningful

00:11:56.720 --> 00:11:59.759
ones. Exactly. All while making sure rare typos

00:11:59.759 --> 00:12:02.230
don't break the system either. Precisely. And

00:12:02.230 --> 00:12:04.649
this results in something truly magical. Because

00:12:04.649 --> 00:12:07.429
we have compressed all these ratios into spatial

00:12:07.429 --> 00:12:09.610
coordinates, you can actually do geometry with

00:12:09.610 --> 00:12:13.669
definitions. Yes. This is the classic aha moment

00:12:13.669 --> 00:12:16.690
of word embeddings. Because once you have these

00:12:16.690 --> 00:12:19.269
vectors, you can literally do math with words.

00:12:19.370 --> 00:12:21.990
You really can. If you take the vector address

00:12:21.990 --> 00:12:24.730
for king, subtract the vector for man, and add

00:12:24.730 --> 00:12:27.009
the vector for woman, the resulting coordinate

00:12:27.009 --> 00:12:29.029
you arrive at in this multidimensional space

00:12:29.029 --> 00:12:32.049
is the vector for queen. That's crazy. The algorithm

00:12:32.049 --> 00:12:34.789
mapped the latent concept of gender and royalty

00:12:34.789 --> 00:12:37.590
strictly through spatial distance. That blows

00:12:37.590 --> 00:12:40.509
my mind every single time I hear it. And there's

00:12:40.509 --> 00:12:42.909
a funny little quirk about the final math here,

00:12:43.009 --> 00:12:46.730
too. Two vectors. Yeah. The model actually generates

00:12:46.730 --> 00:12:49.950
two separate vectors for every single word based

00:12:49.950 --> 00:12:52.490
on how the matrix is split. But when it came

00:12:52.490 --> 00:12:55.330
time to actually use the model, the researchers

00:12:55.330 --> 00:12:57.789
found empirically that just adding those two

00:12:57.789 --> 00:13:00.730
vectors together created the best final representation.
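
NOTE
The king, man, woman arithmetic from a few cues earlier can be tried on toy data.
The two-dimensional vectors below are invented so the arithmetic is easy to follow
(real GloVe vectors have many more dimensions and no labeled axes), and the helper
name analogy is hypothetical. In practice the result is the nearest stored vector to
the computed point, with the three input words excluded, rather than an exact hit.
```python
import math
# Hypothetical 2-D vectors (axis 0 ~ royalty, axis 1 ~ maleness), invented for illustration.
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.5, 0.5],
}
def analogy(a, b, c):
    """Return the word whose vector is nearest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(vectors[w], target))
print(analogy("king", "man", "woman"))  # queen
```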

00:13:01.210 --> 00:13:03.269
Right. It wasn't necessarily a grand theoretical

00:13:03.269 --> 00:13:05.830
design, it just worked better in practice to

00:13:05.830 --> 00:13:07.789
add them up. If we connect this to the bigger

00:13:07.789 --> 00:13:11.070
picture, that pragmatic approach is exactly what

00:13:11.070 --> 00:13:14.549
made gloves so wildly successful. Once this model

00:13:14.549 --> 00:13:17.210
was trained, it wasn't just a theoretical academic

00:13:17.210 --> 00:13:19.950
exercise. It was put to use in the real world

00:13:19.950 --> 00:13:22.389
almost immediately. Give me some examples. Where

00:13:22.389 --> 00:13:24.970
was this alien spreadsheet actually being used?

00:13:25.090 --> 00:13:27.889
The applications were surprisingly diverse. Because

00:13:27.889 --> 00:13:30.409
it mapped semantic relationships so well, it

00:13:30.409 --> 00:13:32.649
was used to find relations between zip codes

00:13:32.649 --> 00:13:35.409
and cities, or to link companies to their specific

00:13:35.409 --> 00:13:37.830
products. Oh, wow. It became a core component

00:13:37.830 --> 00:13:40.950
for finding synonyms in massive databases. Have

00:13:40.950 --> 00:13:43.559
you ever heard of spaCy? The Natural Language

00:13:43.559 --> 00:13:47.080
Processing Library, right? Yes. spaCy used GloVe

00:13:47.080 --> 00:13:50.179
to build semantic word embedding features. If

00:13:50.179 --> 00:13:52.980
you queried a word, spaCy would use GloVe's vectors

00:13:52.980 --> 00:13:55.320
to compute the top list of words that matched

00:13:55.320 --> 00:13:58.120
it. And it did this using distance measures like

00:13:58.120 --> 00:14:00.960
Euclidean distance and cosine similarity. Let's

00:14:00.960 --> 00:14:03.240
visualize that, because cosine similarity sounds

00:14:03.240 --> 00:14:06.019
really intimidating. It does. But imagine drawing

00:14:06.019 --> 00:14:09.340
a line from a center point to the word ice, and

00:14:09.340 --> 00:14:11.730
another line from the center to steam. Cosine

00:14:11.730 --> 00:14:13.850
similarity is just measuring the angle between

00:14:13.850 --> 00:14:16.029
those two lines to see how closely they point

00:14:16.029 --> 00:14:18.269
in the same direction. Exactly. While Euclidean

00:14:18.269 --> 00:14:20.809
distance is just taking a ruler and measuring

00:14:20.809 --> 00:14:23.029
the straight physical gap between the two points.
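
NOTE
The two distance measures named above can be written out directly. This is a plain
Python sketch using only the standard library; the coordinates for ice and steam are
invented for illustration and the function names are assumptions, not spaCy's API.
```python
import math
def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction, 0.0 means perpendicular."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))
def euclidean_distance(u, v):
    """Straight-line gap between the two points."""
    return math.dist(u, v)
# Hypothetical 2-D coordinates, invented for illustration.
ice = [3.0, 4.0]
steam = [6.0, 8.0]   # same direction as ice, twice as far from the origin
print(cosine_similarity(ice, steam))   # 1.0
print(euclidean_distance(ice, steam))  # 5.0
```
The toy pair shows why the two measures can disagree: the points are five units
apart on the ruler, yet the angle between them is zero.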

00:14:23.190 --> 00:14:25.990
It's literal geometry applied to language. And

00:14:25.990 --> 00:14:27.730
because of that, it was even used in healthcare.

00:14:28.230 --> 00:14:30.029
Yes, I really wanted to dig into this because

00:14:30.029 --> 00:14:31.850
the source mentioned it's used in psychology.

00:14:32.669 --> 00:14:35.710
How does a math equation help detect psychological

00:14:35.710 --> 00:14:38.649
distress? By analyzing the structural distance

00:14:38.649 --> 00:14:41.230
of the words patients use in interviews. You

00:14:41.230 --> 00:14:44.230
see, a healthy, organized mind tends to use language

00:14:44.230 --> 00:14:46.549
where the concepts group together in expected,

00:14:46.809 --> 00:14:49.840
semantically cohesive ways. But researchers found

00:14:49.840 --> 00:14:52.480
that they could use GloVe to flag underlying

00:14:52.480 --> 00:14:56.120
distress by mapping a patient's vocabulary. If

00:14:56.120 --> 00:14:58.559
a patient was jumping between concepts that were

00:14:58.559 --> 00:15:01.580
mathematically very far apart in GloVe's multidimensional

00:15:01.580 --> 00:15:05.600
space, it indicated a level of cognitive disorganization.

00:15:05.600 --> 00:15:08.440
Oh, wow. Yeah, a distress that a doctor might

00:15:08.440 --> 00:15:11.460
completely miss from just a surface-level conversation.

00:15:12.000 --> 00:15:14.620
The geometry of their vocabulary revealed their

00:15:14.620 --> 00:15:17.679
actual state of mind. That is just astounding.

00:15:17.929 --> 00:15:20.610
A doctor using an algorithm to map a patient's

00:15:20.610 --> 00:15:23.129
cognitive state based on the physical geometry

00:15:23.129 --> 00:15:25.970
of their words. It's incredible. But as revolutionary

00:15:25.970 --> 00:15:29.690
as it was, GloVe had a massive blind spot, like

00:15:29.690 --> 00:15:32.669
a fatal flaw in its design. The source points

00:15:32.669 --> 00:15:35.090
out that it is essentially helpless when it comes

00:15:35.090 --> 00:15:38.629
to homographs. Ah, yes. Homographs. Words that

00:15:38.629 --> 00:15:40.990
have the exact same spelling but entirely different

00:15:40.990 --> 00:15:43.590
meanings. Right. Because GloVe is an unsupervised

00:15:43.590 --> 00:15:46.210
algorithm that assigns a single static vector

00:15:46.210 --> 00:15:49.289
based purely on the spelling of a word, it mashes

00:15:49.289 --> 00:15:51.409
those different meanings together into one mathematical

00:15:51.409 --> 00:15:53.470
point. It's like having a contactless in your

00:15:53.470 --> 00:15:54.990
phone where you happen to know two completely

00:15:54.990 --> 00:15:57.080
different people named John Smith. Oh, that's

00:15:57.080 --> 00:15:58.840
a good way to look at it. One is your plumber

00:15:58.840 --> 00:16:01.679
in Chicago, and the other is your cousin in London.

00:16:02.519 --> 00:16:06.279
But your phone's software is so rigid that it

00:16:06.279 --> 00:16:09.080
forces them into a single contact file. Right.

00:16:09.340 --> 00:16:11.919
So now you have this Frankenstein John Smith

00:16:11.919 --> 00:16:14.740
contact that mixes up their addresses, their

00:16:14.740 --> 00:16:17.820
jobs, and their personalities. If you ask the

00:16:17.820 --> 00:16:20.080
algorithm about John Smith, it's going to give

00:16:20.080 --> 00:16:23.399
you a very confused, blended answer. That analogy

00:16:23.399 --> 00:16:26.279
hits the nail on the head. If you feed GloVe

00:16:26.279 --> 00:16:29.379
the word bank, it creates one vector that averages

00:16:29.379 --> 00:16:31.720
out the context of a river bank and a financial

00:16:31.720 --> 00:16:34.600
bank. It just blends them. Yes. It cannot distinguish

00:16:34.600 --> 00:16:36.600
between the two based on the immediate sentence

00:16:36.600 --> 00:16:38.720
structure because the vector address is permanently

00:16:38.720 --> 00:16:42.500
fixed. And that limitation is precisely why GloVe,

00:16:42.759 --> 00:16:44.919
despite its brilliance, was eventually superseded.

00:16:45.100 --> 00:16:47.220
Right. Because technology moves incredibly fast.

00:16:47.759 --> 00:16:50.320
The source notes that by 2022, the industry had

00:16:50.320 --> 00:16:52.620
essentially left GloVe behind. The state of the

00:16:52.620 --> 00:16:55.480
art in natural language processing moved on to

00:16:55.480 --> 00:16:58.360
transformer-based models. Like BERT. Exactly,

00:16:58.460 --> 00:17:01.600
like BERT. These newer models solved the homograph

00:17:01.600 --> 00:17:04.579
problem by adding multiple neural network attention

00:17:04.579 --> 00:17:07.019
layers on top of the word embeddings. So they

00:17:07.019 --> 00:17:09.039
don't just assign a static vector anymore? No.

00:17:09.180 --> 00:17:11.700
They dynamically adjust the word's mathematical

00:17:11.700 --> 00:17:14.480
meaning based on the entire sentence around it,

00:17:14.480 --> 00:17:16.740
every single time it's used. So what does this

00:17:16.740 --> 00:17:19.720
all mean? We started this deep dive looking at

00:17:19.720 --> 00:17:22.890
the messy poetic nature of human vocabulary.

00:17:23.369 --> 00:17:26.309
And what we found in GloVe was the vital stepping

00:17:26.309 --> 00:17:29.410
stone that proved a concept. It really did. It

00:17:29.410 --> 00:17:32.390
proved that meaning, this abstract, uniquely

00:17:32.390 --> 00:17:35.250
human experience, could actually be solved as

00:17:35.250 --> 00:17:38.230
a geometry problem. By simply counting how often

00:17:38.230 --> 00:17:40.750
words keep company and compressing the ratios

00:17:40.750 --> 00:17:42.890
of those relationships into a map of coordinates,

00:17:43.450 --> 00:17:45.190
we taught machines to understand the difference

00:17:45.190 --> 00:17:47.319
between ice and steam. It's quite a legacy.

00:17:47.640 --> 00:17:49.319
And you should care about this because every

00:17:49.319 --> 00:17:51.420
time your phone autocompletes a text message

00:17:51.420 --> 00:17:53.859
or every time you chat with a modern AI assistant,

00:17:54.339 --> 00:17:56.079
you are interacting with the direct descendants

00:17:56.079 --> 00:17:58.900
of GloVe's global vectors. This raises an important

00:17:58.900 --> 00:18:01.150
question, though. We've seen how GloVe learns

00:18:01.150 --> 00:18:03.970
meaning purely by observing the statistical proximity

00:18:03.970 --> 00:18:06.829
of words in a massive text corpus. Right. But

00:18:06.829 --> 00:18:09.410
that corpus, that giant alien spreadsheet, was

00:18:09.410 --> 00:18:13.150
written by us. Uh-oh. Yeah. So if an algorithm's

00:18:13.150 --> 00:18:15.430
entire understanding of the world is based on

00:18:15.430 --> 00:18:18.549
the company our words keep online, what historical

00:18:18.549 --> 00:18:22.289
biases, hidden cultural prejudices, or completely

00:18:22.289 --> 00:18:25.230
irrational human associations are permanently

00:18:25.230 --> 00:18:27.769
baked into those mathematical distances without

00:18:27.769 --> 00:18:30.569
us even realizing it? Wow, if the alien's only

00:18:30.569 --> 00:18:33.009
dictionary is our spreadsheet, they learn our

00:18:33.009 --> 00:18:35.230
flaws right along with our facts. Precisely.

00:18:35.329 --> 00:18:37.670
It's a fascinating mathematical permanence of

00:18:37.670 --> 00:18:39.950
human quirks, something to really think about

00:18:39.950 --> 00:18:42.069
the next time your AI suggests the perfect word.