WEBVTT

00:00:00.000 --> 00:00:03.200
So if you sit down at your computer or you pull

00:00:03.200 --> 00:00:06.320
out your phone and you type a question into a

00:00:06.320 --> 00:00:08.699
search engine or a large language model, there

00:00:08.699 --> 00:00:10.740
is this kind of everyday magic that happens.

00:00:10.759 --> 00:00:12.630
Yeah. It really does feel like magic sometimes.

00:00:12.769 --> 00:00:14.730
Right. Because you type out a sentence in just

00:00:14.730 --> 00:00:16.449
plain English, and the machine just gets it.

00:00:16.449 --> 00:00:19.469
It understands you. But I mean, computers do

00:00:19.469 --> 00:00:21.329
not read English. They don't understand words

00:00:21.329 --> 00:00:23.170
at all. No, not even a little bit. They only

00:00:23.170 --> 00:00:25.250
understand math. Exactly. They only understand

00:00:25.250 --> 00:00:28.550
numbers. So how do you bridge that massive gap

00:00:28.550 --> 00:00:33.869
between human thought and cold, hard computation?

00:00:34.369 --> 00:00:37.270
Well, it is arguably the most important bridge

00:00:37.479 --> 00:00:40.200
we have ever built in computer science. And to

00:00:40.200 --> 00:00:42.340
give you an idea of just how mind-bending the

00:00:42.340 --> 00:00:44.159
math on the other side of that bridge actually

00:00:44.159 --> 00:00:47.060
is, think about this. If you take the mathematical

00:00:47.060 --> 00:00:49.840
coordinates for the concept of a king, subtract

00:00:49.840 --> 00:00:53.020
the concept of a man, and then add a woman, a

00:00:53.020 --> 00:00:55.859
computer will instantly spit out the exact coordinates

00:00:55.859 --> 00:00:58.619
for the word queen. Oh, wow, that is absolutely

00:00:58.619 --> 00:01:01.000
wild. It's literally doing arithmetic with human

00:01:01.000 --> 00:01:03.679
concepts. Yeah, it is. Well, welcome to today's

00:01:03.679 --> 00:01:05.879
Deep Dive, where we are going to explore exactly

00:01:05.879 --> 00:01:08.819
how that happens. We are looking at a massive

00:01:08.819 --> 00:01:11.459
stack of research on a fascinating natural language

00:01:11.459 --> 00:01:14.540
processing concept called word embedding. And

00:01:14.540 --> 00:01:16.780
this is really for you, the learner out there

00:01:16.780 --> 00:01:18.459
who just wants to know how the digital world

00:01:18.459 --> 00:01:20.780
ticks. It's a great topic. It is. Our mission

00:01:20.780 --> 00:01:24.000
today is to demystify the invisible mathematical

00:01:24.000 --> 00:01:26.819
architecture that allows computers to quote unquote,

00:01:27.280 --> 00:01:30.420
understand human language. Because this is the

00:01:30.420 --> 00:01:34.079
secret sauce behind the entire modern AI revolution.

00:01:34.659 --> 00:01:37.280
OK, let's unpack this. I'm ready. It really is

00:01:37.280 --> 00:01:39.519
the foundational pillar of modern natural language

00:01:39.519 --> 00:01:43.859
processing. So at its core, a word embedding

00:01:43.859 --> 00:01:47.920
maps a word to a real-valued vector. Right,

00:01:48.019 --> 00:01:50.180
which sounds very technical. It does, but to

00:01:50.180 --> 00:01:51.560
translate that from computer science speak into

00:01:51.560 --> 00:01:53.680
just plain language, a vector is essentially

00:01:53.680 --> 00:01:56.799
just an ordered list of numbers. And what these

00:01:56.799 --> 00:01:59.560
numbers do is they define a specific coordinate

00:01:59.560 --> 00:02:03.319
in this massive mathematical space. The fundamental

00:02:03.319 --> 00:02:07.079
goal here is to encode meaning into those numbers

00:02:07.079 --> 00:02:10.120
in such a way that words with similar definitions

00:02:10.120 --> 00:02:13.000
are physically grouped closer together in that

00:02:13.000 --> 00:02:15.379
vector space. I love trying to visualize this

00:02:15.379 --> 00:02:17.759
because a list of numbers sounds so dry. But

00:02:17.759 --> 00:02:20.300
if you think about it spatially, it's just incredible.

00:02:21.080 --> 00:02:25.419
I like to picture this vector space as like a

00:02:25.419 --> 00:02:28.439
giant multi-dimensional map of towns. And it's

00:02:28.439 --> 00:02:30.280
a great way to look at it. Right. So if we have

00:02:30.280 --> 00:02:32.680
a town called Happy, the town right next door

00:02:32.680 --> 00:02:34.560
is going to be joyful. Exactly. They share the

00:02:34.560 --> 00:02:36.680
same local weather, the same neighborhood. But

00:02:36.680 --> 00:02:39.000
if you look for the town called Sad, well, that's

00:02:39.000 --> 00:02:40.280
going to be all the way on the opposite coast.
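
NOTE
Editor's sketch of the "map of towns" idea, assuming hand-picked toy vectors.
The words, the three dimensions, and the names TOWNS, cosine_similarity, and
nearest_town are hypothetical illustrations, not taken from any real model.
Closeness is measured here with cosine similarity, one common choice.
```python
import math
# Hypothetical 3-dimensional toy vectors, hand-picked for illustration only;
# real embeddings are learned from data and usually have hundreds of dimensions.
TOWNS = {
    "happy": [0.9, 0.8, 0.1],
    "joyful": [0.8, 0.9, 0.2],
    "sad": [-0.9, -0.7, 0.1],
}
def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
def nearest_town(word):
    """Return the other word whose vector points in the most similar direction."""
    others = [w for w in TOWNS if w != word]
    return max(others, key=lambda w: cosine_similarity(TOWNS[word], TOWNS[w]))
```
With these toy numbers, nearest_town("happy") picks "joyful", and the
similarity between "happy" and "sad" comes out negative, the opposite coast.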

00:02:40.460 --> 00:02:42.939
It's geographically distant because it's semantically

00:02:42.939 --> 00:02:45.419
distant. Yeah. We are literally turning meaning

00:02:45.419 --> 00:02:49.319
into geography. And that spatial geography analogy

00:02:49.319 --> 00:02:51.819
is incredibly accurate to how the math actually

00:02:51.819 --> 00:02:54.900
functions behind the scenes. And to really understand

00:02:54.900 --> 00:02:58.580
how a machine plots out that map today, we first

00:02:58.580 --> 00:03:01.500
kind of have to understand how humans have historically

00:03:01.500 --> 00:03:04.340
tried to define what meaning even is. Right,

00:03:04.340 --> 00:03:06.400
because that's a philosophical question almost.

00:03:06.539 --> 00:03:09.460
It is. And there is a famous quote from 1957

00:03:09.460 --> 00:03:12.580
from a linguist named John Rupert Firth. He said,

00:03:12.719 --> 00:03:15.659
a word is characterized by the company it keeps.

00:03:15.979 --> 00:03:19.479
The company it keeps. I love that. So meaning

00:03:19.479 --> 00:03:22.000
isn't something a word just possesses in a vacuum.

00:03:22.280 --> 00:03:24.620
No, not at all. It's completely defined by the

00:03:24.620 --> 00:03:26.819
surrounding words that constantly hang out with

00:03:26.819 --> 00:03:28.900
it. Exactly. I mean, if you see the word bark

00:03:28.900 --> 00:03:32.020
next to tree and branches, it implies one physical

00:03:32.020 --> 00:03:33.939
reality, right? Sure. But if you see it next

00:03:33.939 --> 00:03:36.060
to dog and leash, it implies a totally different

00:03:36.060 --> 00:03:39.000
one. What's fascinating here is that this foundational

00:03:39.000 --> 00:03:42.000
thought from cognitive psychology and early linguistics

00:03:42.000 --> 00:03:45.240
from the 1950s, it still dictates how modern

00:03:45.240 --> 00:03:47.259
neural networks operate today. That's great.

00:03:47.360 --> 00:03:50.400
We are literally teaching machines to read by

00:03:50.400 --> 00:03:52.919
having them study the company that words keep

00:03:52.919 --> 00:03:55.520
across just massive data sets. Wait, hold on.

00:03:55.520 --> 00:03:57.400
I have to ask about the timeline here because

00:03:57.400 --> 00:04:00.520
Firth's quote is from 1957. Yeah. If they had

00:04:00.520 --> 00:04:03.340
this whole "company a word keeps" theory back

00:04:03.340 --> 00:04:05.639
during the Cold War, why did this technology

00:04:05.639 --> 00:04:09.240
just sit in a drawer for 50 years? Like, if researchers

00:04:09.240 --> 00:04:11.539
knew the secret to mapping meaning, why didn't

00:04:11.539 --> 00:04:14.180
we have smart search engines and chat bots back

00:04:14.180 --> 00:04:16.519
in the 80s or 90s? Well, that is the perfect

00:04:16.519 --> 00:04:18.660
question. And I can tell you it wasn't for lack

00:04:18.660 --> 00:04:21.860
of trying. Oh, they tried. Oh, yeah. Early computer

00:04:21.860 --> 00:04:24.100
scientists did try to build these vector spaces

00:04:24.100 --> 00:04:27.259
in the 60s and 70s for information retrieval,

00:04:27.259 --> 00:04:30.779
but they ran headfirst into a massive computational

00:04:30.779 --> 00:04:33.779
roadblock, which is known as the curse of dimensionality.

00:04:34.000 --> 00:04:35.939
The curse of dimensionality. That sounds terrifying.

00:04:36.060 --> 00:04:37.899
What were scientists actually dealing with there?

00:04:38.379 --> 00:04:41.199
Well, imagine you were trying to map the company

00:04:41.199 --> 00:04:43.860
a word keeps by building a giant spreadsheet.

00:04:44.199 --> 00:04:46.420
OK, I'm picturing a spreadsheet. You want to

00:04:46.420 --> 00:04:49.199
count every single time a word appears next to

00:04:49.199 --> 00:04:50.759
every other word in the English language. That's

00:04:50.759 --> 00:04:53.829
a lot of words. Exactly. In those early models,

00:04:54.370 --> 00:04:57.709
every single unique word in your vocabulary required

00:04:57.709 --> 00:05:00.370
its own distinct column, its own dimension. So

00:05:00.370 --> 00:05:02.730
if you had a working vocabulary of 100,000 words...

00:05:02.730 --> 00:05:04.930
Your mathematical space had 100,000 dimensions.

00:05:05.050 --> 00:05:08.350
Right. And the result was a wildly sparse space.

00:05:08.870 --> 00:05:11.250
Meaning if you look at the row for the word cat,

00:05:11.769 --> 00:05:13.529
you might have a few numbers in the columns for

00:05:13.529 --> 00:05:17.769
meow or milk. But the other 99,998 columns in

00:05:17.769 --> 00:05:20.480
that spreadsheet... were just zeros. Oh, wow.

00:05:20.579 --> 00:05:22.879
So they were trying to brute force a spreadsheet

00:05:22.879 --> 00:05:25.839
where almost every single cell was totally empty?

00:05:25.980 --> 00:05:27.899
Pretty much, yeah. And the computers of the 70s

00:05:27.899 --> 00:05:31.040
and 80s just choked on the sheer volume of nothingness.

00:05:31.319 --> 00:05:33.899
The calculations became impossibly heavy. There

00:05:33.899 --> 00:05:36.399
was just too much empty space to process efficiently.
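
NOTE
Editor's sketch of the giant spreadsheet, assuming a three-sentence toy corpus.
CORPUS, cooccurrence_counts, and sparsity are hypothetical names, and the
one-word window is an arbitrary choice made for illustration.
```python
from collections import Counter
# A tiny hypothetical corpus; real systems would count over millions of sentences.
CORPUS = [
    "the cat drinks milk",
    "the cat says meow",
    "the dog chases the ball",
]
def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered (word, neighbor) pair occurs within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts
def sparsity(sentences, window=1):
    """Return the fraction of cells in the full word-by-word table that stay at zero."""
    vocab = {w for s in sentences for w in s.split()}
    filled = len(cooccurrence_counts(sentences, window))
    return 1 - filled / (len(vocab) ** 2)
```
Even with this 9-word vocabulary, sparsity(CORPUS) is about 0.78: only 18 of
the 81 cells are ever filled. The empty share grows as the vocabulary grows.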

00:05:36.660 --> 00:05:38.819
So the obvious question was, how do we compress

00:05:38.819 --> 00:05:40.620
this data without losing the underlying meaning?

00:05:40.680 --> 00:05:43.339
Right. In the late 1980s, scientists introduced

00:05:43.339 --> 00:05:46.240
something called latent semantic analysis. They

00:05:46.240 --> 00:05:48.459
used a mathematical technique, specifically singular

00:05:48.459 --> 00:05:52.399
value decomposition, to squash that massive sparse

00:05:52.399 --> 00:05:55.139
space into a much smaller, denser number of dimensions.

00:05:55.800 --> 00:05:57.920
Singular value decomposition. That is a heavy

00:05:57.920 --> 00:06:00.060
string of words. What is actually happening to

00:06:00.060 --> 00:06:02.300
our map of towns when they do that? Think of

00:06:02.300 --> 00:06:06.230
it like... shining a very bright flashlight on

00:06:06.230 --> 00:06:09.589
a highly complex three-dimensional object and

00:06:09.589 --> 00:06:11.370
looking at its two-dimensional shadow on the

00:06:11.370 --> 00:06:13.970
wall. Okay, I can picture that. You are technically

00:06:13.970 --> 00:06:16.589
losing a dimension of data, right? Yeah. But

00:06:16.589 --> 00:06:19.930
you still keep the core shape and the vital structural

00:06:19.930 --> 00:06:22.589
relationships of the original object. I see.
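
NOTE
Editor's sketch of the "shadow on the wall" compression, assuming a toy
4-by-4 count table. Power iteration with deflation stands in here for a
library SVD routine; it finds the same strongest directions on this small
example. WORDS, COUNTS, top_directions, and reduce_rows are hypothetical names.
```python
import math
# Hypothetical word-by-context count table (rows are words, columns are contexts).
WORDS = ["cat", "kitten", "car", "truck"]
COUNTS = [
    [4.0, 3.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
]
def dot(a, b):
    """Return the dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
def top_directions(rows, k, iterations=200):
    """Find the k strongest right singular vectors by power iteration with deflation."""
    n = len(rows[0])
    # Gram matrix A^T A; its eigenvectors are the right singular vectors of A.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    directions = []
    for _component in range(k):
        v = [float(i + 1) for i in range(n)]  # fixed, non-uniform starting guess
        for _step in range(iterations):
            w = [dot(gram_row, v) for gram_row in gram]
            # Deflation: remove whatever the earlier directions already explain.
            for d in directions:
                proj = dot(w, d)
                w = [x - proj * y for x, y in zip(w, d)]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        directions.append(v)
    return directions
def reduce_rows(rows, k):
    """Project every row onto the k strongest directions, giving dense k-dimensional vectors."""
    directions = top_directions(rows, k)
    return [[dot(row, d) for d in directions] for row in rows]
```
Calling reduce_rows(COUNTS, 2) squeezes four columns into two. In this toy
table "cat" and "kitten" land on the same point while "car" and "truck" land
on another, so the grouping survives even though half the columns are gone.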

00:06:22.810 --> 00:06:25.649
It forced the data to compress, finding hidden

00:06:25.649 --> 00:06:29.269
or latent semantic relationships. Then around

00:06:29.269 --> 00:06:32.470
the year 2000, a researcher named Yoshua Bengio

00:06:32.750 --> 00:06:35.189
pushed this even further. He and his colleagues

00:06:35.189 --> 00:06:37.069
used early neural networks to learn what they

00:06:37.069 --> 00:06:39.819
called a distributed representation. Okay. This

00:06:39.819 --> 00:06:41.959
drastically reduced that high dimensionality

00:06:41.959 --> 00:06:44.459
into a tight, dense vector of maybe just a few

00:06:44.459 --> 00:06:46.279
hundred dimensions. And getting it down to a

00:06:46.279 --> 00:06:48.079
few hundred dimensions is where the magic really

00:06:48.079 --> 00:06:49.819
starts to take off, right? Absolutely. Because

00:06:49.819 --> 00:06:52.160
then we hit 2013, which is widely considered

00:06:52.160 --> 00:06:55.500
a massive, pivotal milestone in this field. A

00:06:55.500 --> 00:06:58.339
team at Google, led by Tomas Mikolov, created

00:06:58.339 --> 00:07:02.240
a toolkit called Word2Vec. Yes, Word2Vec. It

00:07:02.240 --> 00:07:04.060
seems like this was the moment the dam finally

00:07:04.060 --> 00:07:07.860
broke. Because Word2Vec trained these dense vector

00:07:07.860 --> 00:07:10.379
maps significantly faster than anything before

00:07:10.379 --> 00:07:12.980
it, it really moved word embeddings from being

00:07:12.980 --> 00:07:16.660
this new, expensive academic experiment into

00:07:16.660 --> 00:07:18.480
something you could actually use in practical

00:07:18.480 --> 00:07:21.000
software. It democratized the map making process

00:07:21.000 --> 00:07:23.560
for language, absolutely. Word2Vec proved that

00:07:23.560 --> 00:07:25.660
you could train high quality word embeddings

00:07:25.660 --> 00:07:29.399
on huge data sets very efficiently. And the way

00:07:29.399 --> 00:07:31.500
it trains is brilliant. It essentially plays

00:07:31.500 --> 00:07:34.379
a massive game of fill in the blank. The algorithm

00:07:34.379 --> 00:07:37.360
looks at a sentence, hides one word, and then

00:07:37.360 --> 00:07:39.319
looks at the local context window, just a few

00:07:39.319 --> 00:07:41.339
words immediately before and after the blank,

00:07:41.720 --> 00:07:43.839
and tries to guess the missing word. Oh, wow.

00:07:43.959 --> 00:07:46.040
So it's actively testing itself on the company

00:07:46.040 --> 00:07:48.860
the word keeps. Yes. And every time it guesses

00:07:48.860 --> 00:07:51.740
wrong, it reaches into its own internal wiring

00:07:51.740 --> 00:07:54.199
and slightly adjusts the mathematical dials,

00:07:54.199 --> 00:07:56.220
the weights, to make a better guess next time.
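
NOTE
Editor's sketch of the fill-in-the-blank game, assuming whitespace-split
tokens. It only builds the (context, hidden word) training examples and does
no learning. Predicting a hidden word from its neighbors corresponds to the
continuous bag-of-words variant of Word2Vec; the skip-gram variant goes the
other way and predicts neighbors from the word. The function name and the
default window size are hypothetical.
```python
def fill_in_the_blank_examples(sentence, window=2):
    """Return (context_words, hidden_word) pairs, one per position in the sentence."""
    tokens = sentence.lower().split()
    examples = []
    for i, hidden in enumerate(tokens):
        before = tokens[max(0, i - window):i]
        after = tokens[i + 1:i + 1 + window]
        examples.append((before + after, hidden))
    return examples
```
For "the cat sat on the mat" with window=1, the third example hides "sat" and
offers the context ["cat", "on"]. A trainer would nudge its weights whenever
its guess for the hidden word is wrong.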

00:07:56.399 --> 00:07:58.920
That's so cool. Over millions and millions of

00:07:58.920 --> 00:08:01.420
iterations, reading massive chunks of the internet,

00:08:02.180 --> 00:08:04.800
those internal dial settings actually become

00:08:04.800 --> 00:08:07.399
the coordinate numbers. The network's attempt

00:08:07.399 --> 00:08:10.040
to predict neighbors generates the map. Okay,

00:08:10.360 --> 00:08:12.899
so Word2Vec is incredibly fast and it builds

00:08:12.899 --> 00:08:16.540
this beautiful, dense map. But human language

00:08:16.540 --> 00:08:19.600
isn't just vast, it's incredibly messy. It's

00:08:19.600 --> 00:08:22.259
ambiguous. It really is. Here's where it gets

00:08:22.259 --> 00:08:25.579
really interesting. Think about a word like club.

00:08:26.180 --> 00:08:28.519
Okay. If you say to me, the club I tried yesterday

00:08:28.519 --> 00:08:30.579
was great, what are you actually talking about?

00:08:30.620 --> 00:08:32.100
Right, it could be anything. Are you talking

00:08:32.100 --> 00:08:34.600
about a turkey club sandwich? Are you talking

00:08:34.600 --> 00:08:37.399
about a loud nightclub downtown? Or are you talking

00:08:37.399 --> 00:08:39.860
about a nine-iron golf club? This brings us to

00:08:39.860 --> 00:08:42.259
a major hurdle in linguistics. This is known

00:08:42.259 --> 00:08:45.279
as polysemy, words with multiple related meanings,

00:08:45.279 --> 00:08:48.019
and homonymy, words that sound the same but have

00:08:48.019 --> 00:08:50.320
totally different meanings. Yeah. And this was

00:08:50.320 --> 00:08:53.299
the fatal flaw of early static embeddings like

00:08:53.299 --> 00:08:56.480
Word2Vec. They conflated all of those distinct

00:08:56.480 --> 00:08:59.159
meanings into a single point on our map. That's

00:08:59.159 --> 00:09:02.320
a huge geographical trap. If the algorithm tries

00:09:02.320 --> 00:09:05.860
to map club, and it averages out the sandwich,

00:09:06.240 --> 00:09:08.840
the golf club, and the nightclub, it's going

00:09:08.840 --> 00:09:10.539
to drop the pin somewhere in the middle of the

00:09:10.539 --> 00:09:13.240
ocean. Exactly. It forces a single word to be

00:09:13.240 --> 00:09:15.740
a single vector, which means it is geographically

00:09:15.740 --> 00:09:17.700
lost because it's trying to be in three places

00:09:17.700 --> 00:09:19.779
at once. Which means it fails to understand the

00:09:19.779 --> 00:09:22.019
specific sentence you are typing. So the field

00:09:22.019 --> 00:09:25.490
had to adapt. Researchers began developing

00:09:25.490 --> 00:09:28.350
multi-sense embeddings. Approaches like multi-sense

00:09:28.350 --> 00:09:31.909
Skip-Gram, or MSSG, tried to handle this, but a

00:09:31.909 --> 00:09:35.570
really robust solution was MSSA, or Most Suitable

00:09:35.570 --> 00:09:37.750
Sense Annotation. Okay, how did that work? It

00:09:37.750 --> 00:09:40.539
brought in outside knowledge. MSSA relies on

00:09:40.539 --> 00:09:42.879
pre-existing lexical databases, things like

00:09:42.879 --> 00:09:45.340
WordNet. These are essentially massive structured

00:09:45.340 --> 00:09:47.559
digital dictionaries that already know the different

00:09:47.559 --> 00:09:49.960
definitions of a word exist. Wait, if club has

00:09:49.960 --> 00:09:51.740
three different meanings, how does the computer

00:09:51.740 --> 00:09:53.759
know which one I mean in real time before it

00:09:53.759 --> 00:09:56.500
does the math? By looking at a predefined sliding

00:09:56.500 --> 00:09:59.679
window of context around the word. MSSA uses

00:09:59.679 --> 00:10:02.100
a knowledge-based approach to label the specific

00:10:02.100 --> 00:10:04.500
sense of the word before it calculates the math.

00:10:04.669 --> 00:10:07.590
Oh, I get it. Once the computer knows you are

00:10:07.590 --> 00:10:09.769
talking about the sandwich, because words like

00:10:09.769 --> 00:10:12.809
turkey and bacon are nearby, then it produces

00:10:12.809 --> 00:10:16.129
a distinct multi-sense embedding just for the

00:10:16.129 --> 00:10:18.230
sandwich definition. So it's essentially teaching

00:10:18.230 --> 00:10:20.750
the computer to ask, wait, which definition are

00:10:20.750 --> 00:10:23.179
we using right now? before it plots the point.

00:10:23.919 --> 00:10:26.559
Exactly. That makes total sense. But even that

00:10:26.559 --> 00:10:29.600
isn't the final form of this technology. The

00:10:29.600 --> 00:10:32.500
research moves into the late 2010s, introducing

00:10:32.500 --> 00:10:35.740
the modern marvels that power the math of AI

00:10:35.740 --> 00:10:39.019
tools we use today, like ELMo and BERT. Yeah,

00:10:39.019 --> 00:10:41.299
this was the true paradigm shift because even

00:10:41.299 --> 00:10:43.620
with WordNet, you are still relying on a fixed

00:10:43.620 --> 00:10:46.200
dictionary. Right. Models like BERT, which stands

00:10:46.200 --> 00:10:48.659
for Bidirectional Encoder Representations from

00:10:48.659 --> 00:10:51.259
Transformers, they create the embedding on the

00:10:51.259 --> 00:10:54.370
fly, at the token level. Every single occurrence

00:10:54.370 --> 00:10:57.110
of a word gets its own unique mathematical coordinate

00:10:57.110 --> 00:10:59.809
based entirely on the specific sentence it appears

00:10:59.809 --> 00:11:02.429
in at that exact moment. So if I write a whole

00:11:02.429 --> 00:11:04.470
paragraph about playing a round of golf and I

00:11:04.470 --> 00:11:07.000
use the word club five different times in slightly

00:11:07.000 --> 00:11:09.879
different ways, BERT isn't just pulling a prepackaged

00:11:09.879 --> 00:11:12.379
golf club vector from a dictionary file. Not

00:11:12.379 --> 00:11:15.539
at all. It actively reads the entire sentence

00:11:15.539 --> 00:11:17.960
surrounding that specific instance of the word

00:11:17.960 --> 00:11:20.399
club. It weighs the importance of every other

00:11:20.399 --> 00:11:23.059
word in that sentence simultaneously and generates

00:11:23.059 --> 00:11:24.720
a brand new coordinate right then and there.
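
NOTE
Editor's sketch of on-the-fly vectors, assuming toy 2-dimensional static
vectors. This is a single unlearned dot-product attention step, a much
simplified stand-in for what a model like BERT does with many learned layers.
STATIC and contextual_vector are hypothetical names.
```python
import math
# Hypothetical static 2-dimensional vectors: axis 0 is "sport", axis 1 is "food".
STATIC = {
    "golf": [1.0, 0.0],
    "swing": [1.0, 0.0],
    "sandwich": [0.0, 1.0],
    "bacon": [0.0, 1.0],
    "club": [0.5, 0.5],
}
def contextual_vector(tokens, position):
    """Blend the vectors of all tokens, weighted by softmax of dot-product similarity."""
    query = STATIC[tokens[position]]
    scores = [sum(q * k for q, k in zip(query, STATIC[t])) for t in tokens]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dims = len(query)
    return [sum(w * STATIC[t][d] for w, t in zip(weights, tokens)) for d in range(dims)]
```
The same word "club" gets a sport-leaning vector in ["golf", "club", "swing"]
and a food-leaning vector in ["club", "sandwich", "bacon"], because each
result is recomputed from the sentence around it.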

00:11:24.840 --> 00:11:27.480
That is amazing. It places occurrences of the

00:11:27.480 --> 00:11:30.159
same exact word in totally different regions

00:11:30.159 --> 00:11:32.519
of its embedding space depending on the nuance

00:11:32.519 --> 00:11:35.710
of the surrounding sentence. It finally mastered

00:11:35.710 --> 00:11:39.610
Firth's rule from 1957. It truly judges the word

00:11:39.610 --> 00:11:41.950
entirely by the company it is keeping at that

00:11:41.950 --> 00:11:44.340
exact second. That dynamic on-the-fly mapping

00:11:44.340 --> 00:11:47.139
is why chatbots suddenly got incredibly good

00:11:47.139 --> 00:11:49.440
at understanding context over the last few years.

00:11:49.519 --> 00:11:51.759
Yeah, it was a game changer. OK, so we've mapped

00:11:51.759 --> 00:11:54.539
out how these vectors decode the grammar and

00:11:54.539 --> 00:11:57.139
the meaning of human language. But if language

00:11:57.139 --> 00:12:00.240
is really just a sequence of symbols, what other

00:12:00.240 --> 00:12:02.559
sequential data can we translate into a vector

00:12:02.559 --> 00:12:05.019
space? Because the research takes a wild turn

00:12:05.019 --> 00:12:07.000
here. It really does. It introduces something

00:12:07.000 --> 00:12:10.059
called biovectors, developed by researchers Asgari

00:12:10.059 --> 00:12:13.059
and Mofrad. They realized we can treat biological

00:12:13.059 --> 00:12:16.279
sequences like DNA, RNA, and amino acids as words.

00:12:16.440 --> 00:12:19.000
It is a brilliant, unconventional application

00:12:19.000 --> 00:12:22.080
of the technology. They created ProtVec for

00:12:22.080 --> 00:12:25.159
proteins, which are amino acid sequences, and

00:12:25.159 --> 00:12:27.700
GeneVec for gene sequences. They essentially

00:12:27.700 --> 00:12:30.279
treat chunks of these biological building blocks

00:12:30.279 --> 00:12:33.519
as n-grams. And just to clarify, an n-gram is

00:12:33.519 --> 00:12:36.220
basically just a contiguous sequence or chunk

00:12:36.220 --> 00:12:38.600
of items from a given sample, right? Like taking

00:12:38.600 --> 00:12:40.519
a long string of letters and breaking it into

00:12:40.519 --> 00:12:42.700
three-letter words. Precisely. They took massive

00:12:42.700 --> 00:12:45.240
databases of genetic code, broke them down into

00:12:45.240 --> 00:12:48.100
these n-gram words, and asked the algorithm

00:12:48.100 --> 00:12:50.740
to read the code of life using the exact same

00:12:50.740 --> 00:12:53.340
mathematical tools we use to read emails. But

00:12:53.340 --> 00:12:55.799
how does reading a sequence of amino acids like

00:12:55.799 --> 00:12:59.659
text actually translate to physical biological

00:12:59.659 --> 00:13:02.580
reality? Think back to the text analogy. Just

00:13:02.580 --> 00:13:05.200
as the word bark constantly appearing next to

00:13:05.200 --> 00:13:08.399
tree implies a semantic relationship, a certain

00:13:08.399 --> 00:13:10.259
sequence of amino acids constantly appearing

00:13:10.259 --> 00:13:12.789
next to another sequence implies a physical relationship.
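
NOTE
Editor's sketch of turning a biological sequence into n-gram "words",
assuming a made-up amino acid string. Splitting once per starting offset is
one simple way to do it and is an assumption here, not a confirmed detail of
the researchers' pipeline. shifted_ngrams is a hypothetical name.
```python
def shifted_ngrams(sequence, n=3):
    """Split a sequence into non-overlapping n-grams, once for each starting offset."""
    splits = []
    for offset in range(n):
        chunk_starts = range(offset, len(sequence) - n + 1, n)
        splits.append([sequence[i:i + n] for i in chunk_starts])
    return splits
```
For the made-up string "MKTAYIAK", offset 0 yields ["MKT", "AYI"], offset 1
yields ["KTA", "YIA"], and offset 2 yields ["TAY", "IAK"]. Each list can then
be treated like a sentence of three-letter words.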

00:13:12.809 --> 00:13:15.470
Oh, interesting. It implies they are highly likely

00:13:15.470 --> 00:13:17.909
to fold together physically in a three-dimensional

00:13:17.909 --> 00:13:21.269
protein structure. The neural network maps out

00:13:21.269 --> 00:13:24.590
the towns of genetic code and discovers the biophysical

00:13:24.590 --> 00:13:27.669
rules of how they interact without a human biologist

00:13:27.669 --> 00:13:30.350
having to explicitly program those rules into

00:13:30.350 --> 00:13:33.519
the machine. That is staggering. It learns the

00:13:33.519 --> 00:13:35.659
physical language of biology just by looking

00:13:35.659 --> 00:13:38.460
at the company the genes keep. And biology isn't

00:13:38.460 --> 00:13:40.539
the only non-human language we are translating.

00:13:41.100 --> 00:13:43.399
There are also incredible applications in game

00:13:43.399 --> 00:13:46.840
design. Oh, yes. Researchers Rabii and Cook proposed

00:13:46.840 --> 00:13:49.519
a way to discover emergent gameplay by using

00:13:49.519 --> 00:13:51.919
logs of player data. Yes, another fascinating

00:13:51.919 --> 00:13:55.000
translation. To do this, you transcribe the actions

00:13:55.000 --> 00:13:57.559
that occur during a game like a chess match into

00:13:57.559 --> 00:14:00.419
a formal language. Every move, like moving a

00:14:00.419 --> 00:14:03.960
pawn from E2 to E4, is logged as a word in a

00:14:03.960 --> 00:14:06.460
sequence. A full game is a sentence. Millions

00:14:06.460 --> 00:14:08.559
of games become a massive book. Once you have

00:14:08.559 --> 00:14:10.820
that resulting text, you run it through the same

00:14:10.820 --> 00:14:13.639
Word2Vec embedding process. If we connect

00:14:13.639 --> 00:14:16.259
this to the bigger picture, the resulting vectors

00:14:16.259 --> 00:14:18.500
capture expert knowledge about games that is

00:14:18.500 --> 00:14:20.500
not explicitly stated in the game's official

00:14:20.500 --> 00:14:22.679
rulebook. Right, because the rulebook of chess

00:14:22.679 --> 00:14:26.059
just tells you how a knight is allowed to move.

00:14:26.399 --> 00:14:29.220
It doesn't tell you the strategic value of controlling

00:14:29.220 --> 00:14:31.659
the center of the board. Exactly. But the algorithm

00:14:31.659 --> 00:14:34.779
learns that advanced strategy simply by observing

00:14:34.779 --> 00:14:38.259
millions of sequences of moves. It learns the

00:14:38.259 --> 00:14:41.600
unspoken strategies, the emergent dynamics, just

00:14:41.600 --> 00:14:44.259
by mapping out the coordinates of the game logs

00:14:44.259 --> 00:14:47.139
and seeing which moves constantly hang out together

00:14:47.139 --> 00:14:49.840
in winning games. Which brings us to a critical

00:14:49.840 --> 00:14:53.440
realization. If word embeddings are incredibly

00:14:53.440 --> 00:14:56.019
efficient at capturing the unwritten emergent

00:14:56.019 --> 00:14:59.320
rules of chess and the unspoken biophysical laws

00:14:59.320 --> 00:15:02.620
of proteins, we have to confront how they capture

00:15:02.620 --> 00:15:05.240
the unwritten rules of human society. So what

00:15:05.240 --> 00:15:07.600
does this all mean? If they learn exactly what

00:15:07.600 --> 00:15:09.639
they observe, what happens when we look at the

00:15:09.639 --> 00:15:12.019
data we are feeding these massive models? Well,

00:15:12.039 --> 00:15:14.159
this is a major area of study within natural

00:15:14.159 --> 00:15:16.409
language processing. The researchers point out

00:15:16.409 --> 00:15:18.809
that word embeddings inevitably contain the biases

00:15:18.809 --> 00:15:21.029
and stereotypes contained within their training

00:15:21.029 --> 00:15:24.809
data sets. A landmark 2016 paper by Bolukbasi

00:15:24.809 --> 00:15:27.929
and several colleagues analyzed a publicly available

00:15:27.929 --> 00:15:30.990
Word2Vec embedding that had been trained on a

00:15:30.990 --> 00:15:33.649
massive corpus of Google News texts. And it is

00:15:33.649 --> 00:15:35.769
worth pointing out that Google News text is written

00:15:35.769 --> 00:15:38.049
by professional journalists. This wasn't trained

00:15:38.049 --> 00:15:40.610
on random toxic internet forum comments. It was

00:15:40.610 --> 00:15:43.179
considered a highly professional clean data set.

00:15:43.379 --> 00:15:46.320
It was. But when the researchers used this trained

00:15:46.320 --> 00:15:49.299
embedding to extract word analogies using that

00:15:49.299 --> 00:15:52.730
exact same king minus man plus woman. Yeah. math

00:15:52.730 --> 00:15:54.529
we talked about at the beginning, they found

00:15:54.529 --> 00:15:57.350
deeply disproportionate word associations. According

00:15:57.350 --> 00:15:59.529
to the paper, the model generated analogies like,

00:16:00.070 --> 00:16:02.669
man is to computer programmer as woman is to

00:16:02.669 --> 00:16:05.909
homemaker. The math literally calculated a stereotype.
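
NOTE
Editor's sketch of the analogy arithmetic, assuming hand-built toy vectors.
The occupation vectors are deliberately skewed to imitate a biased corpus;
they are not measurements from the Google News embedding. In practice the
arithmetic rarely lands exactly on a word, so the answer is the nearest
remaining word, with the input words excluded. VECTORS, cosine, and analogy
are hypothetical names.
```python
import math
# Hypothetical 3-dimensional toy vectors: (royalty, gender, occupation).
# The gender values on the two occupations are skewed on purpose.
VECTORS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "programmer": [0.0, 0.6, 1.0],
    "homemaker": [0.0, -0.6, 1.0],
}
def cosine(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by computing b - a + c and finding the nearest word."""
    target = [vb - va + vc for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))
```
With these toy numbers, analogy("man", "king", "woman") returns "queen", and
the very same arithmetic makes analogy("man", "programmer", "woman") return
"homemaker". The skew in the input vectors becomes a skew in the answer.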

00:16:05.950 --> 00:16:08.669
It mapped the town of man physically closer to

00:16:08.669 --> 00:16:10.950
programmer and the town of woman physically closer

00:16:10.950 --> 00:16:13.830
to homemaker. And it did this purely because

00:16:13.830 --> 00:16:16.730
of the historical structural biases present in

00:16:16.730 --> 00:16:19.429
the millions of news articles it read. It absorbed

00:16:19.429 --> 00:16:22.860
societal bias and encoded it as a factual geographical

00:16:22.860 --> 00:16:25.559
rule of language. This raises an important question

00:16:25.559 --> 00:16:28.759
regarding how these tools are deployed. Research

00:16:28.759 --> 00:16:31.779
by Jieyu Zhou and others demonstrates that if these

00:16:31.779 --> 00:16:34.279
trained word embeddings are applied to real-world

00:16:34.279 --> 00:16:37.580
tasks without careful oversight, the algorithms

00:16:37.580 --> 00:16:40.600
do not just passively reflect the bias found

00:16:40.600 --> 00:16:43.279
in that unaltered training data. They actively

00:16:43.279 --> 00:16:46.120
perpetuate it. Yes. The researchers concluded

00:16:46.120 --> 00:16:49.120
that word embeddings can even amplify these existing

00:16:49.120 --> 00:16:51.679
biases when they are integrated into automated

00:16:51.679 --> 00:16:54.659
systems, like resume screening tools or search

00:16:54.659 --> 00:16:57.059
engine rankings. So according to these studies,

00:16:57.080 --> 00:16:59.700
we are feeding them our history. They're encoding

00:16:59.700 --> 00:17:02.139
that history into rigid mathematical coordinates

00:17:02.139 --> 00:17:04.500
and then applying those rules to our future.

00:17:04.700 --> 00:17:07.059
It really reframes how we look at the data we

00:17:07.059 --> 00:17:09.099
provide them. It really does. The researchers

00:17:09.099 --> 00:17:11.160
who published these papers stress that it requires

00:17:11.160 --> 00:17:14.369
a massive ongoing effort in the field to actively

00:17:14.369 --> 00:17:16.329
debias these embeddings. Oh, how do they even

00:17:16.329 --> 00:17:19.269
do that? They essentially have to mathematically

00:17:19.269 --> 00:17:22.730
adjust the map after it is drawn, shifting the

00:17:22.730 --> 00:17:25.509
coordinates to ensure that the vector for a neutral

00:17:25.509 --> 00:17:29.269
profession like programmer is equidistant to

00:17:29.269 --> 00:17:32.490
both man and woman. It is a complex technical

00:17:32.490 --> 00:17:34.630
challenge that the industry is still grappling

00:17:34.630 --> 00:17:37.380
with today. We have covered a massive amount

00:17:37.380 --> 00:17:39.220
of ground today. You've taken quite a journey

00:17:39.220 --> 00:17:42.079
with us. We started all the way back with John

00:17:42.079 --> 00:17:45.900
Rupert Firth's 1957 linguistic theories, the

00:17:45.900 --> 00:17:48.640
foundational idea that words are completely defined

00:17:48.640 --> 00:17:51.500
by their neighbors. A long time ago. Yeah. We

00:17:51.500 --> 00:17:53.660
watched the computer scientists of the 80s and

00:17:53.660 --> 00:17:56.299
90s battle the terrifying curse of dimensionality,

00:17:56.720 --> 00:17:59.019
finally shrinking those massive spreadsheets

00:17:59.019 --> 00:18:01.740
using singular value decomposition and breaking

00:18:01.740 --> 00:18:04.160
through with Google's Word2Vec predicting hidden

00:18:04.160 --> 00:18:07.349
words. We navigated the tricky multi-meaning

00:18:07.349 --> 00:18:10.430
puzzle of the word club and saw how modern tools

00:18:10.430 --> 00:18:13.970
like BERT use dynamic on-the-fly context windows

00:18:13.970 --> 00:18:16.589
to solve it. We even explored how this exact

00:18:16.589 --> 00:18:18.890
same math is decoding the physical folding of

00:18:18.890 --> 00:18:21.490
proteins, capturing the hidden strategies of chess,

00:18:21.930 --> 00:18:24.430
and reflecting the biases hidden in our own professional

00:18:24.430 --> 00:18:27.529
datasets. It represents a profound shift in how

00:18:27.529 --> 00:18:30.509
we interact with technology. Word embeddings

00:18:30.509 --> 00:18:33.269
are the hidden geography of meaning. They are

00:18:33.269 --> 00:18:35.650
the underlying infrastructure shaping the digital

00:18:35.650 --> 00:18:38.309
tools you interact with every single day, turning

00:18:38.309 --> 00:18:40.569
abstract human thought into computable math.

00:18:41.269 --> 00:18:43.130
Absolutely. But I want to leave you with a final

00:18:43.130 --> 00:18:46.150
lingering question to explore on your own. Think

00:18:46.150 --> 00:18:47.990
about the digital footprint you create every

00:18:47.990 --> 00:18:50.809
single day. The sequence of apps you open, the

00:18:50.809 --> 00:18:53.730
locations your GPS tracks, the transactions you

00:18:53.730 --> 00:18:56.190
make. Okay. If word embeddings are powerful enough

00:18:56.190 --> 00:18:58.609
to capture the unwritten rules of chess, the

00:18:58.609 --> 00:19:00.690
folding of proteins and the invisible patterns

00:19:00.690 --> 00:19:03.390
of human bias just by looking at the company

00:19:03.390 --> 00:19:06.990
a word keeps, what unspoken rules about your

00:19:06.990 --> 00:19:09.549
life and habits might be mapped out if the daily

00:19:09.549 --> 00:19:11.809
digital trails you leave behind were embedded

00:19:11.809 --> 00:19:15.019
into a vector space? What a thought. Your whole

00:19:15.019 --> 00:19:17.480
life, your routines and secrets mapped out in

00:19:17.480 --> 00:19:19.900
a multi-dimensional space. Maybe the town of

00:19:19.900 --> 00:19:22.059
morning coffee is mathematically right next door

00:19:22.059 --> 00:19:25.240
to the town of checking emails. That giant

00:19:25.240 --> 00:19:27.220
multi-dimensional map of towns isn't just for language

00:19:27.220 --> 00:19:30.319
anymore, it's mapping us. Thank you for joining

00:19:30.319 --> 00:19:32.799
us on this deep dive. Stay curious and we will

00:19:32.799 --> 00:19:33.700
catch you next time.