WEBVTT

00:00:00.000 --> 00:00:02.299
Have you ever wondered how a computer actually

00:00:02.299 --> 00:00:04.879
reads? Oh, that's a profound question, actually.

00:00:05.019 --> 00:00:07.080
Right. Because, you know, it doesn't have a brain.

00:00:07.200 --> 00:00:09.500
Yeah. It doesn't understand poetry or emotion.

00:00:10.060 --> 00:00:12.220
So how does it actually know what a word means?

00:00:12.339 --> 00:00:15.140
Well, for decades it didn't. A word given to

00:00:15.140 --> 00:00:17.199
a machine was literally just a meaningless string

00:00:17.199 --> 00:00:20.140
of characters. Like a barcode. Exactly like a

00:00:20.140 --> 00:00:24.879
barcode. The computer had no idea that dog and

00:00:24.879 --> 00:00:27.199
puppy were related at all. They were just two

00:00:27.199 --> 00:00:29.199
completely different files. But obviously that

00:00:29.199 --> 00:00:32.280
changed. To understand how predictive text works

00:00:32.280 --> 00:00:35.320
on your phone today, or how translation algorithms

00:00:35.320 --> 00:00:38.359
got so eerily accurate, we're taking a deep dive

00:00:38.359 --> 00:00:41.560
into a foundational 2013 Google breakthrough.

00:00:41.880 --> 00:00:43.880
It was an absolute paradigm shift. It really

00:00:43.880 --> 00:00:46.380
was. And we're drawing our insights today from

00:00:46.380 --> 00:00:48.859
a really detailed Wikipedia breakdown of a model

00:00:48.859 --> 00:00:51.719
called Word2Vec. A classic. Yep. Our mission

00:00:51.719 --> 00:00:54.280
here is to skip the dense computer science jargon,

00:00:54.719 --> 00:00:57.460
keep the mind-blowing aha moments, and explore

00:00:57.460 --> 00:00:59.960
how researchers essentially taught machines the

00:00:59.960 --> 00:01:02.340
meaning of language. By turning words into math.

00:01:02.799 --> 00:01:05.120
Exactly. But to get to that point, we first have

00:01:05.120 --> 00:01:07.879
to understand how this algorithm transforms a

00:01:07.879 --> 00:01:11.019
piece of text into a specific location on a very

00:01:11.019 --> 00:01:13.540
complex map. Right, the spatial concept. Yeah.

00:01:14.099 --> 00:01:17.439
At its core, Word2Vec is what's called a shallow

00:01:17.439 --> 00:01:20.730
two-layer neural network. It digests a massive

00:01:20.730 --> 00:01:23.890
corpus of text. We're talking millions or billions

00:01:23.890 --> 00:01:26.750
of words. Just a huge amount of data. And it

00:01:26.750 --> 00:01:29.189
maps every unique word to a high dimensional

00:01:29.189 --> 00:01:32.480
vector space. Okay, so let's unpack this. The

00:01:32.480 --> 00:01:34.700
documentation mentions, these spaces typically

00:01:34.700 --> 00:01:37.560
have anywhere from 100 to 1,000 dimensions.
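
NOTE
A rough Python sketch (not from the episode): what "mapping a word to a high-dimensional vector" looks like in code. The vocabulary, dimension count, and random values are made up for illustration; a trained model would supply real coordinates.
  import numpy as np
  vocab = ["dog", "puppy", "refrigerator"]            # tiny stand-in vocabulary
  word_to_index = {w: i for i, w in enumerate(vocab)}
  dims = 300                                          # real spaces use roughly 100 to 1,000 dimensions
  rng = np.random.default_rng(0)
  embeddings = rng.normal(size=(len(vocab), dims))    # one row, i.e. one coordinate, per word
  dog_vector = embeddings[word_to_index["dog"]]       # the "star" for the word dog
  print(dog_vector.shape)                             # (300,)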

00:01:37.799 --> 00:01:41.140
Which is literally impossible for a human brain

00:01:41.140 --> 00:01:43.359
to visualize. Right. Trying to picture a 1,000-

00:01:43.359 --> 00:01:46.000
dimensional grid will just melt your brain. So

00:01:46.000 --> 00:01:48.760
instead, think of it like a giant unimaginably

00:01:48.760 --> 00:01:51.400
complex galaxy. I like that analogy. In this

00:01:51.400 --> 00:01:53.939
galaxy, every single word gets its own specific

00:01:53.939 --> 00:01:56.620
coordinate. It's a star. And because of how the

00:01:56.620 --> 00:01:59.439
algorithm processes the text, you end up with

00:01:59.439 --> 00:02:01.939
like, the action-verb solar system way over in

00:02:01.939 --> 00:02:03.500
one corner of the galaxy. Oh, and the capital

00:02:03.500 --> 00:02:05.219
cities are clustered together in a completely

00:02:05.219 --> 00:02:08.300
different corner. Yes. But what's fascinating

00:02:08.300 --> 00:02:11.740
here is the specific mechanism the model uses

00:02:11.740 --> 00:02:14.719
to figure out where those stars belong. Right.

00:02:14.780 --> 00:02:17.060
It relies on a mathematical principle called

00:02:17.060 --> 00:02:20.460
cosine similarity. I've heard that term, but

00:02:20.460 --> 00:02:22.659
how does it actually work geographically in our

00:02:22.659 --> 00:02:25.550
galaxy? So imagine drawing a straight line from

00:02:25.550 --> 00:02:27.870
the center of our semantic galaxy out to the

00:02:27.870 --> 00:02:30.449
star for the word walk. OK, got it. Then you

00:02:30.449 --> 00:02:32.509
draw another line from the center to the star

00:02:32.509 --> 00:02:36.689
for the word ran. Cosine similarity simply measures

00:02:36.689 --> 00:02:39.310
the angle between those two lines. Oh, wow. Yeah.

00:02:39.550 --> 00:02:41.750
If the angle is very small, the lines are pointing

00:02:41.750 --> 00:02:44.590
in almost the exact same direction, meaning the

00:02:44.590 --> 00:02:47.110
words are highly related. So it's judging a word

00:02:47.110 --> 00:02:49.550
by the company it keeps in the text. Precisely.

00:02:49.669 --> 00:02:51.610
You might say, I went for a walk to the store

00:02:51.610 --> 00:02:54.210
or I ran to the store. The surrounding context

00:02:54.210 --> 00:02:57.349
is virtually identical. Right, so the math literally

00:02:57.349 --> 00:03:00.409
pushes their stars closer together. Their angle

00:03:00.409 --> 00:03:02.629
shrinks. That makes so much sense. And the same

00:03:02.629 --> 00:03:05.229
mechanism applies to, like, geographical relationships

00:03:05.229 --> 00:03:08.750
too, right? Absolutely. The lines drawn to Berlin

00:03:08.750 --> 00:03:11.849
and Germany will have a very narrow angle between

00:03:11.849 --> 00:03:15.500
them. It even works for transitional words: "but"

00:03:15.780 --> 00:03:18.439
and "however" are mathematically mapped as neighbors

00:03:18.439 --> 00:03:20.879
because they serve the exact same contextual

00:03:20.879 --> 00:03:23.360
function in a sentence. Because they're always

00:03:23.360 --> 00:03:25.780
surrounded by the same types of opposing clauses.
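
NOTE
A minimal sketch of the cosine-similarity idea just described, in Python. The three-dimensional vectors are invented for illustration only; real word vectors have hundreds of dimensions.
  import numpy as np
  def cosine_similarity(u, v):
      # cosine of the angle between two "lines from the center of the galaxy"
      return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
  walk = np.array([0.9, 0.1, 0.2])
  ran = np.array([0.8, 0.2, 0.1])
  berlin = np.array([0.1, 0.9, 0.7])
  print(cosine_similarity(walk, ran))     # near 1.0: tiny angle, related words
  print(cosine_similarity(walk, berlin))  # much lower: wide angle, unrelated words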

00:03:26.379 --> 00:03:29.319
Exactly. Okay. But knowing that every word gets

00:03:29.319 --> 00:03:31.900
a coordinate based on its neighbors still leaves

00:03:31.900 --> 00:03:34.539
a massive question. How does it calculate the

00:03:34.539 --> 00:03:37.360
coordinates from scratch? Yes. The breakdown

00:03:37.360 --> 00:03:40.319
of Word2Vec shows it doesn't just use one method.

00:03:40.460 --> 00:03:43.539
It relies on two distinct model architectures

00:03:43.539 --> 00:03:46.000
to learn these representations. The two engines

00:03:46.000 --> 00:03:48.860
under the hood. They're known as continuous bag

00:03:48.860 --> 00:03:52.659
of words, or CBOW, and the continuous skip-gram

00:03:52.659 --> 00:03:56.259
model. CBOW and skip-gram. Right. Both of them

00:03:56.259 --> 00:03:59.199
iterate over the vast text corpus, sliding a

00:03:59.199 --> 00:04:01.939
window over a few words at a time, but they approach

00:04:01.939 --> 00:04:04.000
the math from completely opposite directions.

00:04:04.259 --> 00:04:07.139
So when you look at how CBOW operates, it's essentially

00:04:07.139 --> 00:04:09.939
playing a massive game of Mad Libs. It's a fill

00:04:09.939 --> 00:04:11.560
-in-the-blank test. That's a great way to put

00:04:11.560 --> 00:04:14.560
it. It takes the surrounding context words, combines

00:04:14.560 --> 00:04:17.439
their mathematical values, and uses that combined

00:04:17.439 --> 00:04:20.480
context to predict the missing middle word. Right.

00:04:20.560 --> 00:04:23.360
So if the sentence is the cat blank on the mat,

00:04:24.160 --> 00:04:29.360
CBOW looks at the, cat, on, and mat, and mathematically

00:04:29.360 --> 00:04:32.060
guesses that the missing center word is probably

00:04:32.060 --> 00:04:35.040
sat. And we actually call it a bag of words because

00:04:35.040 --> 00:04:37.779
it assumes the order of those context words doesn't

00:04:37.779 --> 00:04:41.240
even matter. Wait, really? Yeah. Mat, the, on, cat

00:04:41.319 --> 00:04:43.779
would yield the exact same prediction. That is

00:04:43.779 --> 00:04:45.660
wild. And then you have the other architecture,

00:04:46.019 --> 00:04:48.899
skip-gram, which flips the entire process inside

00:04:48.899 --> 00:04:52.300
out. Right. So if CBOW is playing Mad Libs, skip-gram

00:04:52.300 --> 00:04:54.439
is like being given a single word and being asked

00:04:54.439 --> 00:04:56.639
to hallucinate the entire sentence around it.
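
NOTE
A short sketch, with an invented toy sentence, of how the two architectures slice the same text into training examples with a sliding window: CBOW pairs a bag of context words with the center word, skip-gram pairs the center word with each neighbor.
  sentence = ["the", "cat", "sat", "on", "the", "mat"]
  window = 2
  cbow_pairs, skipgram_pairs = [], []
  for i, center in enumerate(sentence):
      lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
      context = [sentence[j] for j in range(lo, hi) if j != i]
      cbow_pairs.append((context, center))                   # context words -> predict the center
      skipgram_pairs.extend((center, c) for c in context)    # center word -> predict each neighbor
  print(cbow_pairs[2])       # (['the', 'cat', 'on', 'the'], 'sat')
  print(skipgram_pairs[:3])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]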

00:04:56.939 --> 00:04:59.810
Exactly. You feed the machine the word sat. And

00:04:59.810 --> 00:05:01.589
it has to predict that the surrounding neighborhood

00:05:01.589 --> 00:05:04.689
of words is likely the, cat, on, and mat. You

00:05:04.689 --> 00:05:06.769
give it the center point, and it hallucinates

00:05:06.769 --> 00:05:09.449
the neighborhood. Yep. And skip-gram introduces

00:05:09.449 --> 00:05:12.550
a really interesting technical nuance here. It

00:05:12.550 --> 00:05:14.529
doesn't treat all the surrounding words equally.

00:05:14.649 --> 00:05:17.649
It actually weighs nearby context words more

00:05:17.649 --> 00:05:20.250
heavily than distant ones. So the word immediately

00:05:20.250 --> 00:05:22.810
next to sat has much more influence on the math

00:05:22.810 --> 00:05:25.569
than a word five spots away. Which makes intuitive

00:05:25.569 --> 00:05:29.470
sense. The word cat is far more related to sat

00:05:29.470 --> 00:05:32.149
than the word the at the very beginning of the

00:05:32.149 --> 00:05:34.610
sentence. Precisely. So is one of these engines.

00:05:35.410 --> 00:05:37.610
objectively better than the other? Well, the

00:05:37.610 --> 00:05:40.790
authors noted a specific trade-off. CBOW is

00:05:40.790 --> 00:05:43.449
generally much faster to train because it's kind

00:05:43.449 --> 00:05:46.689
of smoothing over the context, but skip-gram,

00:05:46.790 --> 00:05:49.410
because it forces the model to predict multiple

00:05:49.410 --> 00:05:51.730
different context words from a single starting

00:05:51.730 --> 00:05:54.910
point, does a much better job representing infrequent

00:05:54.910 --> 00:05:57.430
or rare words. So it squeezes more signal out

00:05:57.430 --> 00:06:00.350
of less data. You got it. Let's linger on the

00:06:00.350 --> 00:06:02.600
math powering these predictions for a second

00:06:02.600 --> 00:06:04.899
though, because the architecture is described

00:06:04.899 --> 00:06:08.500
in the source as linear-linear softmax. Which

00:06:08.500 --> 00:06:10.519
is quite the mouthful. Yeah, and I want to make

00:06:10.519 --> 00:06:13.120
sure you, listening to this, can really visualize

00:06:13.120 --> 00:06:15.699
what the machine is actually doing when it guesses

00:06:15.699 --> 00:06:17.920
a word. Okay, let's break it down. Let's stick

00:06:17.920 --> 00:06:20.740
with the cat sat on the mat. In standard bag

00:06:20.740 --> 00:06:22.959
of words models, the context is just a simple

00:06:22.959 --> 00:06:25.980
tally, right? "The" appears twice, "cat" appears

00:06:25.980 --> 00:06:29.639
once. Right. But in continuous bag of words,

00:06:29.779 --> 00:06:32.759
the system takes that simple tally and multiplies

00:06:32.759 --> 00:06:35.399
it by a massive matrix. And multiplying by a

00:06:35.399 --> 00:06:38.259
matrix sounds incredibly abstract. So think of

00:06:38.259 --> 00:06:40.980
this matrix like a giant audio mixing board in

00:06:40.980 --> 00:06:43.100
a recording studio. Oh, I love the mixing board

00:06:43.100 --> 00:06:45.819
analogy. Every single word in the English language

00:06:45.819 --> 00:06:48.939
has its own slider on this board. When the machine

00:06:48.939 --> 00:06:51.060
is trying to predict the missing word in the

00:06:51.060 --> 00:06:53.759
cat blank on the mat, it pushes up the sliders

00:06:53.759 --> 00:06:58.189
for the, cat, on, and mat. And by pushing up those

00:06:58.189 --> 00:07:01.290
specific sliders, the mixing board outputs a

00:07:01.290 --> 00:07:04.769
massive list of numbers, basically one raw score

00:07:04.769 --> 00:07:07.129
for every single word in its vocabulary. Thousands

00:07:07.129 --> 00:07:09.910
of numbers. Exactly. And finally, the system

00:07:09.910 --> 00:07:12.250
applies a mathematical function called softmax.

00:07:12.730 --> 00:07:15.529
Softmax simply takes all those raw scores and

00:07:15.529 --> 00:07:17.750
squashes them into a probability distribution

00:07:17.750 --> 00:07:20.970
that adds up to 100%. So the machine looks at

00:07:20.970 --> 00:07:25.480
the output and says, I am 90% sure the missing

00:07:25.480 --> 00:07:29.819
word is sat, 5% sure it's slept, and 0.001

00:07:29.819 --> 00:07:33.019
% sure it's refrigerator. Exactly. But here is

00:07:33.019 --> 00:07:35.620
the crucial part. How does it learn? If it guesses

00:07:35.620 --> 00:07:38.680
refrigerator and gets it wrong, it reaches back

00:07:38.680 --> 00:07:42.040
into that giant mixing board and slightly adjusts

00:07:42.040 --> 00:07:44.620
the internal connections. The weights. The weights,

00:07:44.800 --> 00:07:47.139
yes. It adjusts them so that sat gets a higher

00:07:47.139 --> 00:07:49.360
score next time. It does this billions of times

00:07:49.360 --> 00:07:52.240
until the sliders are perfectly tuned. And whether

00:07:52.240 --> 00:07:55.360
the system uses CBOW to predict the center word

00:07:55.360 --> 00:07:57.879
or skip-gram to predict the surrounding words,

00:07:58.519 --> 00:08:01.579
that linear-linear softmax structure, that mixing

00:08:01.579 --> 00:08:03.899
board is essentially the same, isn't it? It is.
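
NOTE
A minimal numpy sketch (a toy under stated assumptions, not the reference implementation) of that "mixing board" forward pass: average the context rows of an input matrix, multiply by an output matrix to get one raw score per vocabulary word, then softmax into probabilities. The matrices are random and untrained.
  import numpy as np
  rng = np.random.default_rng(0)
  vocab = ["the", "cat", "sat", "on", "mat"]
  V, D = len(vocab), 8
  W_in = rng.normal(0, 0.1, (V, D))    # input weights: one "slider" row per word
  W_out = rng.normal(0, 0.1, (D, V))   # output weights: one raw score per vocabulary word
  def softmax(scores):
      e = np.exp(scores - scores.max())
      return e / e.sum()               # squash raw scores into probabilities summing to 1
  idx = {w: i for i, w in enumerate(vocab)}
  context = ["the", "cat", "on", "mat"]                  # "the cat ___ on the mat"
  hidden = W_in[[idx[w] for w in context]].mean(axis=0)  # combine the context words' values
  probs = softmax(hidden @ W_out)
  print(dict(zip(vocab, probs.round(3))))   # training nudges W_in/W_out until "sat" wins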

00:08:03.980 --> 00:08:06.459
The only difference is the specific goal it is

00:08:06.459 --> 00:08:08.980
trying to maximize during training. Okay, so

00:08:08.980 --> 00:08:10.800
once these engines finish crunching the data,

00:08:11.199 --> 00:08:13.339
once they finish adjusting millions of sliders

00:08:13.339 --> 00:08:16.439
on that mixing board, the resulting model accidentally

00:08:16.439 --> 00:08:18.800
unlocks something that mimics human reasoning.

00:08:19.160 --> 00:08:22.480
Yes. This is the famous 2013 discovery by Tomas

00:08:22.480 --> 00:08:25.060
Mikolov and his team. The magic trick. They realized

00:08:25.060 --> 00:08:27.300
the model had implicitly learned semantic and

00:08:27.300 --> 00:08:29.459
syntactic patterns, and more importantly, those

00:08:29.459 --> 00:08:31.740
patterns could be reproduced using basic algebra.
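
NOTE
A toy sketch of the vector algebra the hosts walk through next. The coordinates are invented so the arithmetic works out; a trained model supplies the real ones, and the answer is taken as the nearest remaining word by cosine similarity.
  import numpy as np
  def cosine(u, v):
      return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
  vectors = {
      "man": np.array([1.0, 0.2, 0.1]),
      "woman": np.array([1.0, 0.9, 0.1]),
      "brother": np.array([0.9, 0.2, 0.8]),
      "sister": np.array([0.9, 0.9, 0.8]),
      "mat": np.array([0.0, 0.1, 0.0]),
  }
  query = vectors["brother"] - vectors["man"] + vectors["woman"]
  candidates = [w for w in vectors if w not in {"brother", "man", "woman"}]
  print(max(candidates, key=lambda w: cosine(query, vectors[w])))   # "sister" with these toy numbers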

00:08:31.879 --> 00:08:33.720
Which brings us back to our opening equation.

00:08:33.909 --> 00:08:36.269
You take the vector, the specific coordinate

00:08:36.269 --> 00:08:38.809
for the word brother, subtract the coordinate

00:08:38.809 --> 00:08:41.350
for man, and add the coordinate for woman. And

00:08:41.350 --> 00:08:44.730
the math just works. The resulting location in

00:08:44.730 --> 00:08:47.710
that thousand-dimensional galaxy lands almost

00:08:47.710 --> 00:08:51.049
exactly on the star for the word sister. It's

00:08:51.049 --> 00:08:54.350
incredible and it works for geography too. Paris

00:08:54.350 --> 00:08:57.789
minus France plus Germany lands you right next

00:08:57.789 --> 00:09:01.289
to Berlin. Wow. It even captured syntactic changes

00:09:01.289 --> 00:09:05.149
like verb tenses, walking minus walk plus run

00:09:05.149 --> 00:09:08.159
equals running. So the model had captured multiple

00:09:08.159 --> 00:09:10.740
different degrees of similarity between words

00:09:10.740 --> 00:09:13.259
purely by analyzing their proximity in the training

00:09:13.259 --> 00:09:15.539
text. Pure proximity, yeah. Okay, wait. I have

00:09:15.539 --> 00:09:17.600
to push back on this. Go for it. If the model

00:09:17.600 --> 00:09:20.080
is just adjusting sliders on a soundboard based

00:09:20.080 --> 00:09:22.740
on a sliding window of text, does it actually

00:09:22.740 --> 00:09:24.960
understand the concept of a brother or a sister?

00:09:25.080 --> 00:09:27.419
That's the big question. Or is this just like

00:09:27.419 --> 00:09:30.080
the ultimate game of statistical guilt by association?

00:09:30.519 --> 00:09:32.259
I mean, it doesn't know what a sibling is. It

00:09:32.259 --> 00:09:34.279
just knows what letters usually stand next to

00:09:34.279 --> 00:09:38.090
it. And that skepticism is entirely warranted.

00:09:38.350 --> 00:09:40.190
The scientific community heavily debated this

00:09:40.190 --> 00:09:42.690
exact point. I can imagine. Researchers like

00:09:42.690 --> 00:09:45.669
Yoav Goldberg and Omer Levy analyzed why Word2Vec

00:09:45.669 --> 00:09:47.970
was so successful, and they actually argued that

00:09:47.970 --> 00:09:50.269
the explanations given for this algebraic magic

00:09:50.269 --> 00:09:53.679
trick were, and I quote, very hand-wavy. Hand

00:09:53.679 --> 00:09:55.580
-wavy. A highly technical computer science term.

00:09:55.960 --> 00:09:58.100
Right. But they pointed out that the model simply

00:09:58.100 --> 00:10:00.639
aligns perfectly with a concept in linguistics

00:10:00.639 --> 00:10:03.860
called J.R. Firth's distributional hypothesis.

00:10:03.879 --> 00:10:06.340
Which is what? The hypothesis states that the

00:10:06.340 --> 00:10:08.580
meaning of a word is defined entirely by its

00:10:08.580 --> 00:10:12.100
use in context. Word2vec just proved you can

00:10:12.100 --> 00:10:14.480
mathematically map that context at a massive

00:10:14.480 --> 00:10:17.340
scale. Oh, I see. It performs inference for a

00:10:17.340 --> 00:10:20.399
simple generative model for text. So it's brilliant

00:10:20.399 --> 00:10:23.039
statistics mapping the g... of language, but

00:10:23.039 --> 00:10:25.379
it is not cognitive understanding. Here's where

00:10:25.379 --> 00:10:27.659
it gets really interesting though. Even if it

00:10:27.659 --> 00:10:30.759
is just statistical guilt by association, if

00:10:30.759 --> 00:10:33.360
we can mathematically map the context of English

00:10:33.360 --> 00:10:36.279
words to discover hidden relationships, what

00:10:36.279 --> 00:10:39.019
happens if we apply this exact same math to languages

00:10:39.019 --> 00:10:41.399
that aren't spoken by humans at all? The ripple

00:10:41.399 --> 00:10:44.610
effect. The underlying architecture proved so

00:10:44.610 --> 00:10:47.570
robust that researchers immediately started looking

00:10:47.570 --> 00:10:50.350
beyond human vocabulary. It scaled into biology.

00:10:51.029 --> 00:10:52.889
Exactly. Because a DNA sequence is essentially

00:10:52.889 --> 00:10:55.710
just a long string of repeating letters. It's

00:10:55.710 --> 00:10:59.149
a text corpus. So researchers created bio-vectors,

00:10:59.149 --> 00:11:02.730
or BioVec. They applied this exact same sliding

00:11:02.730 --> 00:11:06.389
window logic to biological sequences. They mapped

00:11:06.389 --> 00:11:09.629
DNA, RNA, and protein sequences. Which is mind

00:11:09.629 --> 00:11:12.470
-blowing. It really is. By treating amino acid

00:11:12.470 --> 00:11:15.389
sequences like words in a sentence, biovectors

00:11:15.389 --> 00:11:18.149
could characterize biological sequences and find

00:11:18.149 --> 00:11:21.149
underlying biochemical and biophysical patterns.

00:11:21.450 --> 00:11:23.870
So it deduced the functional meaning of a protein

00:11:23.870 --> 00:11:25.830
sequence just by looking at the company it kept.

00:11:25.889 --> 00:11:28.860
Yeah. There's even a variant called DNA2vec that

00:11:28.860 --> 00:11:31.500
proved you could use cosine similarity. You know,

00:11:31.580 --> 00:11:34.139
that angle between two lines to measure the actual

00:11:34.139 --> 00:11:37.200
physical similarity of genetic sequences. Unbelievable.
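
NOTE
A hedged sketch of the general idea, assuming the gensim library (4.x) is available: chop sequences into overlapping k-mer "words" and train Word2Vec on them. BioVec and dna2vec have their own k-mer schemes and training details; this only shows the shape of the approach.
  from gensim.models import Word2Vec
  def kmers(seq, k=3):
      # e.g. "ATGCGT" -> ["ATG", "TGC", "GCG", "CGT"]
      return [seq[i:i + k] for i in range(len(seq) - k + 1)]
  sequences = ["ATGCGTACGTTAGC", "ATGCGTTCGTAAGC"]          # toy stand-ins for a real corpus
  corpus = [kmers(s) for s in sequences]
  model = Word2Vec(corpus, vector_size=16, window=5, min_count=1, sg=1)
  print(model.wv.similarity("ATG", "TGC"))                  # cosine similarity between two k-mers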

00:11:37.460 --> 00:11:39.940
And the scaling didn't stop at microscopic biology

00:11:39.940 --> 00:11:42.220
either. It scaled up linguistically. Instead

00:11:42.220 --> 00:11:44.799
of just mapping single words, researchers developed

00:11:44.799 --> 00:11:48.240
Doc2vec. Right, which generates distributed representations

00:11:48.240 --> 00:11:50.940
for variable-length texts. Meaning it captures

00:11:50.940 --> 00:11:53.899
entire sentences, paragraphs, or whole documents.

00:11:54.100 --> 00:11:56.529
Exactly. Rather than just finding the coordinate

00:11:56.529 --> 00:12:00.269
for the word Berlin, Doc2Vec can find a coordinate

00:12:00.269 --> 00:12:03.429
for an entire encyclopedia article about Berlin.

00:12:04.250 --> 00:12:07.509
It uses architectures incredibly similar to CBOW

00:12:07.509 --> 00:12:11.110
and skip-gram, but it adds a unique document identifier

00:12:11.110 --> 00:12:13.629
as an extra piece of context on that mixing board.
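
NOTE
A minimal Doc2Vec sketch, assuming gensim 4.x; the documents and tags are invented. Each document gets its own tag, and that tag ends up with its own coordinate alongside the word vectors.
  from gensim.models.doc2vec import Doc2Vec, TaggedDocument
  docs = [
      TaggedDocument(words="berlin is the capital of germany".split(), tags=["doc_berlin"]),
      TaggedDocument(words="paris is the capital of france".split(), tags=["doc_paris"]),
  ]
  model = Doc2Vec(docs, vector_size=32, window=3, min_count=1, epochs=40)
  print(model.dv["doc_berlin"][:5])                     # the document's own coordinate
  print(model.dv.similarity("doc_berlin", "doc_paris"))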

00:12:13.690 --> 00:12:16.090
OK, that makes sense. And from there, the technology

00:12:16.090 --> 00:12:19.350
evolved into Top2Vec, which estimates representations

00:12:19.350 --> 00:12:22.389
of entire abstract topics. Wait, how do you even

00:12:22.389 --> 00:12:24.710
calculate the coordinate for an abstract topic?

00:12:24.809 --> 00:12:27.610
You can't just draw a circle around a vague concept

00:12:27.610 --> 00:12:29.990
in a thousand-dimensional space. That's tricky.

00:12:30.129 --> 00:12:31.950
You have to squash it down somehow to make sense

00:12:31.950 --> 00:12:34.940
of it, right? That is exactly what happens. Top2Vec

00:12:34.940 --> 00:12:37.620
takes those document embeddings and reduces their

00:12:37.620 --> 00:12:40.980
dimensions using a technique called UMAP. UMAP.

00:12:41.480 --> 00:12:45.700
Yeah. Think of UMAP like taking a complex 3D globe

00:12:45.700 --> 00:12:48.220
of the Earth and flattening it onto a 2D piece

00:12:48.220 --> 00:12:50.720
of paper without distorting the relative distances

00:12:50.720 --> 00:12:53.100
between the cities too much. So it literally

00:12:53.100 --> 00:12:55.539
flattens the semantic galaxy into a readable

00:12:55.539 --> 00:12:59.200
map. Yes. And once the space is flattened, it

00:12:59.200 --> 00:13:01.679
scans that map using a clustering algorithm called

00:13:01.679 --> 00:13:06.100
HDBSCAN. If UMAP draws the map, HDBSCAN is

00:13:06.100 --> 00:13:07.919
the tool that looks for the densely populated

00:13:07.919 --> 00:13:11.379
cities on that map. It finds tight clusters of

00:13:11.379 --> 00:13:15.220
similar documents, calculates the centroid of

00:13:15.220 --> 00:13:17.899
that cluster, and calls that the topic vector.
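
NOTE
A rough sketch of the pipeline described here, assuming the umap-learn and hdbscan packages; this is not the Top2Vec library itself, and the random vectors stand in for real document embeddings (with random data the clusterer may simply label everything as noise).
  import numpy as np
  import umap, hdbscan
  doc_vectors = np.random.default_rng(0).normal(size=(200, 300))   # stand-in document embeddings
  flattened = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(doc_vectors)
  labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(flattened)   # -1 marks noise points
  for topic in sorted(set(labels) - {-1}):
      centroid = doc_vectors[labels == topic].mean(axis=0)   # "center of gravity" of the cluster
      print(topic, centroid[:3])   # the word vectors nearest this centroid would name the topic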

00:13:18.220 --> 00:13:20.440
So whatever word embeddings are situated closest

00:13:20.440 --> 00:13:22.779
to that centroid become the title of the topic.

00:13:22.899 --> 00:13:25.200
You got it. It's literally finding the center

00:13:25.200 --> 00:13:28.639
of gravity for a conversation. And that mechanism

00:13:28.639 --> 00:13:31.500
also found a massive application in medicine,

00:13:32.019 --> 00:13:34.340
specifically in radiology, with the creation

00:13:34.340 --> 00:13:37.120
of intelligent word embeddings, or IWE. Which

00:13:37.120 --> 00:13:39.019
was huge because one of the biggest challenges

00:13:39.019 --> 00:13:41.340
with the original Word2Vec is how it handles

00:13:41.340 --> 00:13:43.820
unknown or out-of-vocabulary words. Right.

00:13:43.960 --> 00:13:46.139
If it has never seen a word in its training data,

00:13:46.200 --> 00:13:48.799
it just assigns it a random, essentially useless

00:13:48.799 --> 00:13:51.419
vector. And in medicine, a radiologist's report

00:13:51.419 --> 00:13:55.389
is chaotic. They use bizarre abbreviations, ungrammatical

00:13:55.389 --> 00:13:58.129
telegraphic phrases, highly specialized jargon.

00:13:58.629 --> 00:14:01.269
A doctor at one hospital might write, patient

00:14:01.269 --> 00:14:03.649
shows, and then some cryptic shorthand or whatever,

00:14:04.190 --> 00:14:06.269
while a doctor across the country uses completely

00:14:06.269 --> 00:14:08.450
different shorthand for the exact same issue.

00:14:08.950 --> 00:14:11.730
Exactly. So researchers combined Word2Vec with

00:14:11.730 --> 00:14:14.730
a semantic dictionary mapping technique to create

00:14:14.730 --> 00:14:18.299
IWE. Ah. They mapped all those different variations

00:14:18.299 --> 00:14:20.960
into the same vector space. Because of this,

00:14:21.159 --> 00:14:24.659
an IWE model trained on messy data from one specific

00:14:24.659 --> 00:14:27.860
hospital could successfully translate and understand

00:14:27.860 --> 00:14:30.500
reports from a completely different institution.
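
NOTE
A toy illustration only, with hypothetical shorthand: the published IWE pipeline is considerably more involved than a lookup table, but the basic move of folding many local phrasings into one canonical token before embedding looks roughly like this.
  synonym_map = {
      "r/o pna": "rule_out_pneumonia",             # hypothetical shorthand-to-canonical mappings
      "pneumonia suspected": "rule_out_pneumonia",
      "lll opacity": "left_lower_lobe_opacity",
  }
  def normalize(report):
      text = report.lower().replace(",", " ")
      for variant, canonical in synonym_map.items():
          text = text.replace(variant, canonical)
      return text.split()
  print(normalize("LLL opacity, r/o PNA"))   # both hospitals' phrasings become the same tokens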

00:14:30.779 --> 00:14:32.779
So it generalized the meaning of radiological

00:14:32.779 --> 00:14:34.919
jargon across the entire medical field. It did.

00:14:35.039 --> 00:14:38.080
I mean, Word2Vec reshaped genomics, it transformed

00:14:38.080 --> 00:14:40.740
topic modeling, and it decoded medical jargon.

00:14:40.799 --> 00:14:43.179
It was an absolute revolution. Undoubtedly. Yet

00:14:43.179 --> 00:14:45.080
if you look at the literature today, the straight

00:14:45.100 --> 00:14:47.419
word2vec approach is officially described as

00:14:47.419 --> 00:14:50.379
dated. It is. So what does this all mean? How

00:14:50.379 --> 00:14:53.059
does a breakthrough that profound become a relic

00:14:53.059 --> 00:14:55.460
in less than a decade? Well, if we connect this

00:14:55.460 --> 00:14:57.860
to the bigger picture, it speaks to the hyper

00:14:57.860 --> 00:15:01.340
accelerated pace of AI research. Though the history

00:15:01.340 --> 00:15:04.850
of word2vec is full of irony. Oh, how so? When Tomas

00:15:04.850 --> 00:15:07.490
Mikolov and his team first submitted the original

00:15:07.490 --> 00:15:11.289
Word2Vec paper in 2013, it was actually rejected

00:15:11.289 --> 00:15:13.429
by the reviewers for the International Conference

00:15:13.429 --> 00:15:16.210
on Learning Representations. Wait, really? The

00:15:16.210 --> 00:15:18.690
AI establishment completely dismissed the paper

00:15:18.690 --> 00:15:21.049
that arguably kicked off the modern embedding

00:15:21.049 --> 00:15:23.370
revolution? It happens more often than you'd

00:15:23.370 --> 00:15:26.110
think. That is hilarious. But today, the industry

00:15:26.110 --> 00:15:28.570
has moved on to transformer-based models like

00:15:28.570 --> 00:15:31.610
ELMo and BERT. Transformers add multiple neural

00:15:31.610 --> 00:15:34.789
network attention layers on top of a word embedding

00:15:34.789 --> 00:15:37.830
model. Instead of just looking at a fixed sliding

00:15:37.830 --> 00:15:40.470
window of context, they dynamically weigh the

00:15:40.470 --> 00:15:42.370
relevance of every word in a sentence against

00:15:42.370 --> 00:15:44.350
every other word, no matter how far apart they

00:15:44.350 --> 00:15:46.690
are. Oh, wow. But the foundation turning those

00:15:46.690 --> 00:15:49.789
words into math still relies on the trail Word2vec

00:15:49.789 --> 00:15:52.230
blazed. There's another fascinating wrinkle to

00:15:52.230 --> 00:15:55.049
why Word2vec's supremacy faded, though. It wasn't

00:15:55.049 --> 00:15:57.450
just about Transformers being better. It turns

00:15:57.450 --> 00:15:59.490
out Word2vec's initial magic might have been

00:15:59.490 --> 00:16:03.019
a bit of an illusion. Ah yes, the 2015 study

00:16:03.019 --> 00:16:06.139
by Levy and his colleagues. Right. They demonstrated

00:16:06.139 --> 00:16:09.299
that much of Word2vec's allegedly superior performance

00:16:09.299 --> 00:16:12.100
wasn't actually a result of the model architecture

00:16:12.100 --> 00:16:16.500
itself. Which was a huge realization. CBOW and

00:16:16.500 --> 00:16:19.240
skip-gram weren't necessarily the secret sauce.

00:16:19.539 --> 00:16:21.980
The superior performance was heavily driven by

00:16:21.980 --> 00:16:25.000
the choice of specific hyperparameters. And hyperparameters

00:16:25.000 --> 00:16:27.139
are basically the little dials and knobs the

00:16:27.139 --> 00:16:29.059
researchers tune before they start training the

00:16:29.059 --> 00:16:31.820
model. Exactly. For instance, subsampling high

00:16:31.820 --> 00:16:35.360
-frequency words. Words like "the" or "and" appear

00:16:35.360 --> 00:16:37.340
so often they provide almost no informational

00:16:37.340 --> 00:16:40.340
value. Right. Tuning the dial to remove or subsample

00:16:40.340 --> 00:16:42.559
them speeds up training and clarifies the data.
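
NOTE
A small sketch of the commonly cited subsampling rule from the original papers (shown here in one of its simpler forms, as an assumption): each occurrence of a word is kept with probability sqrt(t / f), where f is the word's frequency fraction and t is the tunable threshold.
  import math
  threshold = 1e-5                       # the "t" dial; values around 1e-5 are typical
  def keep_probability(word_fraction):
      return min(1.0, math.sqrt(threshold / word_fraction))
  print(keep_probability(0.05))          # "the": about 0.014, so it is usually thrown away
  print(keep_probability(0.00001))       # a rare word: 1.0, always kept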

00:16:43.240 --> 00:16:46.399
Another dial is the context window size. The

00:16:46.399 --> 00:16:48.759
authors found a sweet spot of a 10-word window

00:16:48.759 --> 00:16:52.039
for skip-gram and a five-word window for CBOW.

00:16:52.279 --> 00:16:54.419
Looking at five words to the left and five words

00:16:54.419 --> 00:16:57.120
to the right. Yep. Then there is the training

00:16:57.120 --> 00:16:59.279
algorithm itself. You can use something called

00:16:59.279 --> 00:17:02.000
hierarchical softmax. OK, what is that? Instead

00:17:02.000 --> 00:17:04.660
of calculating the probability for all 50,000

00:17:04.660 --> 00:17:06.319
words in the dictionary every single time it

00:17:06.319 --> 00:17:09.680
guesses a word, it uses a Huffman tree. A Huffman

00:17:09.680 --> 00:17:12.359
tree. Think of it like playing 20 questions.

00:17:12.559 --> 00:17:16.079
Is the word a noun? Yes. Is it an animal? Yes.

00:17:16.160 --> 00:17:18.380
It drastically cuts down the math. Oh, that makes

00:17:18.380 --> 00:17:20.660
the training significantly faster. It really

00:17:20.660 --> 00:17:23.339
does. Alternatively, you can use a technique

00:17:23.339 --> 00:17:26.359
called negative sampling. Instead of updating the mixing

00:17:26.359 --> 00:17:28.559
board sliders for the one correct word and the

00:17:28.559 --> 00:17:32.940
49,999 wrong words, it updates the correct word

00:17:32.940 --> 00:17:35.119
and just a handful of negative samples. Like

00:17:35.119 --> 00:17:37.599
maybe five random wrong words? Exactly. It's

00:17:37.599 --> 00:17:40.240
a massive shortcut. And Levy's team showed that

00:17:40.240 --> 00:17:42.759
if you took these specific hyperparameter settings,

00:17:42.839 --> 00:17:45.319
these perfectly tuned dials, and applied them

00:17:45.319 --> 00:17:48.160
to older, more traditional natural language processing

00:17:48.160 --> 00:17:51.480
approaches, those older methods suddenly performed

00:17:51.480 --> 00:17:54.589
just as well as Word2Vec. So it wasn't that the

00:17:54.589 --> 00:17:57.730
neural network was infinitely smarter. The Google

00:17:57.730 --> 00:18:00.369
team had just figured out the absolute perfect

00:18:00.369 --> 00:18:02.930
combination of settings for the data. It was

00:18:02.930 --> 00:18:06.430
a master class in parameterization. The documentation

00:18:06.430 --> 00:18:08.990
details how dramatically these parameters affect

00:18:08.990 --> 00:18:12.990
model quality. Overall accuracy inherently increases

00:18:12.990 --> 00:18:15.210
with the number of words used in training and

00:18:15.210 --> 00:18:17.329
the number of dimensions in the vector space.

00:18:17.630 --> 00:18:19.970
But there's a massive computational cost to turning

00:18:19.970 --> 00:18:23.470
those dials up, isn't there? Huge. The team reported

00:18:23.470 --> 00:18:25.930
that doubling the amount of training data results

00:18:25.930 --> 00:18:29.509
in an increase in computational complexity, mathematically

00:18:29.509 --> 00:18:32.329
equivalent to doubling the number of vector dimensions.

00:18:32.450 --> 00:18:35.130
Yikes. Yeah, the curve gets steep incredibly

00:18:35.130 --> 00:18:37.849
fast. This perfectly explains why companies spend

00:18:37.849 --> 00:18:39.910
billions of dollars on data centers to train

00:18:39.910 --> 00:18:42.289
these things. Every time you turn the dial up

00:18:42.289 --> 00:18:44.450
to make the model a little smarter, the math

00:18:44.450 --> 00:18:46.650
gets exponentially harder. It does. In fact,

00:18:46.809 --> 00:18:50.170
one study from 2017 found Word2Vec only really

00:18:50.170 --> 00:18:53.420
outperforms older techniques, like latent semantic

00:18:53.420 --> 00:18:56.259
analysis, if it's trained on a medium to large

00:18:56.259 --> 00:18:58.980
corpus, meaning more than 10 million words. If

00:18:58.980 --> 00:19:01.240
you have a small data set, the older models actually

00:19:01.240 --> 00:19:03.789
work better. Because there is no single magic

00:19:03.789 --> 00:19:06.690
algorithm in AI. It is always an interplay between

00:19:06.690 --> 00:19:09.630
the architecture, the sheer volume of data, and

00:19:09.630 --> 00:19:12.650
the precise tuning of the hyperparameters. For

00:19:12.650 --> 00:19:15.869
a skip-gram model on a medium corpus, the absolute

00:19:15.869 --> 00:19:18.910
sweet spot seemed to be 50 dimensions, a window

00:19:18.910 --> 00:19:22.150
size of 15, and 10 negative samples. It's really

00:19:22.150 --> 00:19:24.190
like finding the exact recipe for a cake. You

00:19:24.190 --> 00:19:26.269
can't just throw ingredients in a bowl. The oven

00:19:26.269 --> 00:19:28.269
temperature and the baking time matter just as

00:19:28.269 --> 00:19:30.569
much as the flour itself. That's a perfect analogy.
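
NOTE
A hedged gensim sketch (assuming gensim 4.x; parameter names differ in older releases) wiring up the dials mentioned above: dimensions, window size, skip-gram vs. CBOW, negative sampling vs. hierarchical softmax, and frequent-word subsampling. The one-sentence corpus is a stand-in.
  from gensim.models import Word2Vec
  corpus = [["the", "cat", "sat", "on", "the", "mat"]]
  model = Word2Vec(
      corpus,
      vector_size=50,   # number of dimensions
      window=15,        # context window size
      sg=1,             # 1 = skip-gram, 0 = CBOW
      negative=10,      # negative sampling with 10 "wrong" words per update
      hs=0,             # set hs=1 (and negative=0) for hierarchical softmax instead
      sample=1e-5,      # subsampling threshold for very frequent words
      min_count=1,
  )
  print(model.wv["cat"][:5])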

00:19:31.240 --> 00:19:33.700
So to bring this all full circle, to you listening to this

00:19:33.700 --> 00:19:37.920
deep dive right now, why does a 2013 algorithm

00:19:37.920 --> 00:19:40.680
that is now officially considered dated actually

00:19:40.680 --> 00:19:42.940
matter to your daily life? It's a valid question.

00:19:43.200 --> 00:19:46.019
It matters because the legacy of Word2vec is

00:19:46.019 --> 00:19:48.940
literally at your fingertips. The next time you

00:19:48.940 --> 00:19:51.319
use a predictive text feature on your phone or

00:19:51.319 --> 00:19:53.640
you watch your browser seamlessly translate a

00:19:53.640 --> 00:19:56.240
foreign webpage or you read an article about

00:19:56.240 --> 00:19:59.079
AI analyzing thousands of medical reports to

00:19:59.079 --> 00:20:03.119
find a cure, you are looking at the direct descendants

00:20:03.119 --> 00:20:06.680
of word2vec. It fundamentally shifted the paradigm

00:20:06.680 --> 00:20:09.660
from rule-based linguistics to distributional

00:20:09.660 --> 00:20:12.779
semantics. It proved mathematically that meaning

00:20:12.779 --> 00:20:15.799
is, in fact, just the company a word keeps. And

00:20:15.799 --> 00:20:17.319
that leads me to a final thought I want to leave

00:20:17.319 --> 00:20:19.220
you with. We've spent this whole time talking

00:20:19.220 --> 00:20:22.480
about words and DNA and documents, but let's

00:20:22.480 --> 00:20:24.579
apply this logic to something a little closer

00:20:24.579 --> 00:20:27.440
to home. Oh boy. If an algorithm can perfectly

00:20:27.440 --> 00:20:30.380
deduce the deep semantic meaning of a word simply

00:20:30.380 --> 00:20:32.759
by mathematically mapping its closest neighbors,

00:20:33.319 --> 00:20:35.720
could that exact same distributional principle

00:20:35.720 --> 00:20:39.079
be applied to us? That is a wild thought. Right.

00:20:39.299 --> 00:20:41.279
If a computer mapped all your digital connections,

00:20:41.460 --> 00:20:43.200
your interactions, your geographic proximities,

00:20:43.279 --> 00:20:45.539
your friends, would it understand your meaning

00:20:45.539 --> 00:20:48.220
better than you do? Are you just the sum of the

00:20:48.220 --> 00:20:50.339
company you keep? Something to think about. Thank

00:20:50.339 --> 00:20:52.480
you for joining us on this deep dive. We'll catch

00:20:52.480 --> 00:20:53.039
you next time.
