WEBVTT

00:00:00.000 --> 00:00:03.799
Imagine holding an apple in your hand, like right

00:00:03.799 --> 00:00:06.919
now. You can instantly see its height, its width,

00:00:07.280 --> 00:00:09.800
its depth. Great. You can rotate it. Exactly.

00:00:09.880 --> 00:00:12.220
You just turn it around, judge its weight, and

00:00:12.220 --> 00:00:14.519
understand its physical shape in the real world.

00:00:14.900 --> 00:00:18.399
And your brain, it evolved to do this flawlessly.

00:00:18.640 --> 00:00:20.260
Yeah, it's highly specialized for three-dimensional

00:00:20.260 --> 00:00:23.620
space. But what if I asked you to visualize an

00:00:23.620 --> 00:00:28.160
apple that exists in, like, a thousand dimensions

00:00:28.160 --> 00:00:30.699
at the exact same time? You, I mean, you physically

00:00:30.699 --> 00:00:32.960
cannot do it. Right. The human brain just hits

00:00:32.960 --> 00:00:35.799
a biological brick wall there. We're completely

00:00:35.799 --> 00:00:38.380
trapped in our default operating system of, you

00:00:38.380 --> 00:00:40.759
know, up, down, left, right, forward, and backward.

00:00:40.820 --> 00:00:43.119
And that is a massive problem when you step into

00:00:43.119 --> 00:00:46.479
the world of modern data. Because today, we're

00:00:46.479 --> 00:00:49.780
generating datasets where a single object doesn't

00:00:49.780 --> 00:00:53.950
just have three properties. It has hundreds or

00:00:53.950 --> 00:00:56.270
even thousands of variables attached to it all

00:00:56.270 --> 00:00:58.570
at once. It really is the ultimate cognitive

00:00:58.570 --> 00:01:00.369
bottleneck of our era because we've built these

00:01:00.369 --> 00:01:02.250
machines that can easily collect and process

00:01:02.250 --> 00:01:04.909
thousands of dimensions of information, but we

00:01:04.909 --> 00:01:07.769
can't see it. Exactly. If human beings cannot

00:01:07.769 --> 00:01:10.129
actually see those relationships, we can't intuitively

00:01:10.129 --> 00:01:13.030
understand them. So we're just left staring at

00:01:13.030 --> 00:01:15.840
these giant spreadsheets of numbers, completely

00:01:15.840 --> 00:01:19.500
blind to the hidden structures inside. We desperately

00:01:19.500 --> 00:01:22.120
need a bridge between that mathematical complexity

00:01:22.120 --> 00:01:24.760
of the data and, well, the visual limitations

00:01:24.760 --> 00:01:27.219
of our own minds. Which brings us to the mission

00:01:27.219 --> 00:01:30.219
of our deep dive today. We are looking at a really

00:01:30.219 --> 00:01:32.879
dense, highly technical set of source materials

00:01:32.879 --> 00:01:35.599
detailing a statistical method with, frankly,

00:01:36.439 --> 00:01:38.959
a very intimidating name. Oh, yeah. It's called

00:01:38.959 --> 00:01:41.900
t-distributed stochastic neighbor embedding.

00:01:42.170 --> 00:01:45.030
Or, thankfully, t-SNE for short. Much better.

00:01:45.170 --> 00:01:47.909
Yeah, way better. So our goal for you, the listener,

00:01:48.310 --> 00:01:50.390
is to decode this complex math. We're going to

00:01:50.390 --> 00:01:53.349
figure out how it acts as this massive dimensional

00:01:53.349 --> 00:01:55.650
translator, taking those impossible layers of

00:01:55.650 --> 00:01:57.909
data and just crushing them down into two or

00:01:57.909 --> 00:02:00.109
three-dimensional maps that you and I can actually

00:02:00.109 --> 00:02:03.349
look at and comprehend. OK, let's unpack this.

00:02:03.590 --> 00:02:05.930
To truly understand how t-SNE pulls off

00:02:05.930 --> 00:02:07.810
this translation, we should probably look at

00:02:07.810 --> 00:02:10.590
its foundation. The algorithm builds on something

00:02:10.590 --> 00:02:12.949
called stochastic neighbor embedding, which was

00:02:12.949 --> 00:02:15.330
originally developed by Geoffrey Hinton and Sam

00:02:15.330 --> 00:02:18.469
Roweis. And then later, Laurens van der Maaten,

00:02:18.650 --> 00:02:22.110
working alongside Hinton, proposed the t-distributed

00:02:22.110 --> 00:02:24.610
variant that, you know, really became the industry

00:02:24.610 --> 00:02:27.550
standard. In the formal language of the field,

00:02:27.689 --> 00:02:31.110
it is a nonlinear dimensionality reduction technique.

00:02:31.330 --> 00:02:33.909
Wait, I have to stop you there because nonlinear

00:02:33.909 --> 00:02:36.289
dimensionality reduction is exactly the kind

00:02:36.289 --> 00:02:39.500
of phrase that makes people like immediately

00:02:39.500 --> 00:02:41.560
turn off their brains. Fair enough. It's a mouthful.

00:02:41.680 --> 00:02:44.280
Yeah. Let me try an analogy here to ground this

00:02:44.280 --> 00:02:48.139
for the listener. So imagine you are tasked with

00:02:48.139 --> 00:02:50.780
organizing a massive party. OK. Let's say there

00:02:50.780 --> 00:02:53.400
are a million guests crammed into a giant convention

00:02:53.400 --> 00:02:55.879
center. And your job is to group them together

00:02:55.879 --> 00:02:58.319
on a flat, two-dimensional map of the room.

00:02:58.560 --> 00:03:01.280
A seating chart from hell. Right. Exactly. But

00:03:01.280 --> 00:03:02.939
you aren't just grouping them by their age or

00:03:02.939 --> 00:03:04.680
their hometown. Right. You have to group them

00:03:04.680 --> 00:03:07.599
by, say, 500 different personality traits simultaneously.

00:03:07.770 --> 00:03:11.210
And that is your high dimensional space. Each

00:03:11.210 --> 00:03:13.590
individual guest at the party is a data point.

00:03:13.969 --> 00:03:17.169
And those 500 distinct traits, those are the

00:03:17.169 --> 00:03:20.610
500 dimensions. Right. But here's where my brain

00:03:20.610 --> 00:03:23.330
completely breaks trying to draw that map. Like

00:03:23.330 --> 00:03:25.430
if I put all the heavy metal fans in one corner,

00:03:25.569 --> 00:03:28.590
that's great. Sure. But what if half of those

00:03:28.590 --> 00:03:30.969
heavy metal fans are also obsessed with baking

00:03:30.969 --> 00:03:33.509
sourdough bread, and the other half are amateur

00:03:33.509 --> 00:03:36.469
botanists. Yeah, they get split up. Exactly.

00:03:36.810 --> 00:03:38.949
And what if the botanists share sleep habits

00:03:38.949 --> 00:03:41.129
with the classical music fans who are all the

00:03:41.129 --> 00:03:42.810
way on the other side of the room? You can't

00:03:42.810 --> 00:03:45.370
just squish 500 overlapping interests onto a

00:03:45.370 --> 00:03:47.750
flat piece of paper without tearing those relationships

00:03:47.750 --> 00:03:50.370
apart. You've just hit on the exact mathematical

00:03:50.370 --> 00:03:53.430
problem TSNE is designed to solve. Oh, wow. It

00:03:53.430 --> 00:03:56.349
acts as the ultimate party planner for this impossible

00:03:56.349 --> 00:03:58.569
scenario. What's fascinating here is that the

00:03:58.569 --> 00:04:01.490
algorithm completely abandons the idea of fixed

00:04:01.490 --> 00:04:04.330
rigid geometry. What do you mean by rigid geometry?

00:04:04.789 --> 00:04:07.849
Like, instead of trying to use a ruler to measure

00:04:07.849 --> 00:04:10.229
the distance between these people's traits, it

00:04:10.229 --> 00:04:13.669
uses probability. It asks a much more fluid question.

00:04:13.810 --> 00:04:16.629
It says, based on these 500 traits, what are

00:04:16.629 --> 00:04:18.629
the odds that these two specific people would

00:04:18.629 --> 00:04:20.750
find each other in this massive crowd and start

00:04:20.750 --> 00:04:23.959
talking? Oh, OK. So it trades a ruler for a set

00:04:23.959 --> 00:04:26.360
of odds. Exactly. How does that actually play

00:04:26.360 --> 00:04:28.579
out in the math, though? The source material

00:04:28.579 --> 00:04:31.139
breaks the process down into two main stages,

00:04:31.199 --> 00:04:33.560
starting with that high dimensional space. Yeah.

00:04:33.579 --> 00:04:36.560
So first, TSNE looks at the original data of

00:04:36.560 --> 00:04:40.100
those 500 trait party guests. And it computes

00:04:40.100 --> 00:04:42.480
probabilities that are proportional to how similar

00:04:42.480 --> 00:04:45.639
the objects are. OK. To do this, it centers a

00:04:45.639 --> 00:04:48.339
Gaussian distribution, which you can just picture

00:04:48.339 --> 00:04:51.230
as a classic bell curve. Right, a standard bell

00:04:51.230 --> 00:04:54.370
curve. Over every single data point. The algorithm

00:04:54.370 --> 00:04:56.350
essentially says, if I were to randomly pick

00:04:56.350 --> 00:04:58.509
a neighbor for the specific point based on the

00:04:58.509 --> 00:05:00.810
shape of this bell curve, what is the conditional

00:05:00.810 --> 00:05:02.810
probability I would pick this other specific

00:05:02.810 --> 00:05:05.189
point? So objects that share a lot of traits

00:05:05.189 --> 00:05:07.709
get a very high probability of being neighbors.
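
NOTE
Editor's sketch (not from the source audio): one way that stage-one step could look in numpy for a single point i, assuming X is an (n, d) array of high-dimensional points and sigma is the width of the bell curve centered on point i.
    import numpy as np
    def conditional_p(X, i, sigma):
        # squared Euclidean distances from point i to every other point
        d2 = np.sum((X - X[i]) ** 2, axis=1)
        # Gaussian (bell-curve) affinity: large for close points, tiny for far ones
        p = np.exp(-d2 / (2.0 * sigma ** 2))
        p[i] = 0.0            # a point is never its own neighbor
        return p / p.sum()    # conditional probabilities p_j|i, summing to 1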

00:05:08.250 --> 00:05:10.930
Objects with nothing in common get a microscopic

00:05:10.930 --> 00:05:13.589
probability. It sounds like it's building this

00:05:13.589 --> 00:05:17.370
massive invisible web of percentages in the high

00:05:17.370 --> 00:05:20.129
dimensional space. Like everyone has a specific

00:05:20.129 --> 00:05:21.810
percentage chance of standing next to everyone

00:05:21.810 --> 00:05:25.709
else. That is stage one completed. Now, the algorithm

00:05:25.709 --> 00:05:28.670
has to move to stage two. It takes a blank, flat,

00:05:28.769 --> 00:05:31.129
two-dimensional map and scatters new, low-dimensional

00:05:31.129 --> 00:05:33.250
points onto it. Just randomly. Just scatters

00:05:33.250 --> 00:05:35.750
them out. The ultimate goal is to move those

00:05:35.750 --> 00:05:38.290
flat points around until the web of probabilities

00:05:38.290 --> 00:05:40.870
on the flat map matches the web of probabilities

00:05:40.870 --> 00:05:43.769
in the 500 -dimensional space as closely as physically

00:05:43.769 --> 00:05:46.250
possible. Okay. Now, to connect those two stages,

00:05:46.389 --> 00:05:49.209
the text introduces a mechanism called minimizing

00:05:49.209 --> 00:05:54.199
the Kullback-Leibler divergence, or KL divergence,

00:05:54.199 --> 00:05:56.300
using gradient descent. Yes. Let me see if I

00:05:56.300 --> 00:05:58.920
can visualize this one. Is KL divergence basically

00:05:58.920 --> 00:06:01.100
acting like a strict referee? A referee. Yeah,

00:06:01.399 --> 00:06:03.899
like the referee is constantly watching the flat

00:06:03.899 --> 00:06:06.220
map and comparing it to the high dimensional

00:06:06.220 --> 00:06:08.959
space. And if the map puts two people next to

00:06:08.959 --> 00:06:10.519
each other who actually have nothing in common,

00:06:10.980 --> 00:06:13.399
the referee throws a flag and assigns a penalty

00:06:13.399 --> 00:06:16.649
score. I like that. And gradient descent is just

00:06:16.649 --> 00:06:19.529
the mathematical process of tweaking the map

00:06:19.529 --> 00:06:22.709
over and over to minimize the referee's penalties.
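
NOTE
Editor's sketch (not from the source audio): the referee's penalty as a formula. Assuming P and Q are matching arrays of neighbor probabilities for the high-dimensional data and for the flat map, the KL divergence grows whenever Q puts little probability where P puts a lot.
    import numpy as np
    def kl_divergence(P, Q, eps=1e-12):
        # large penalty when P says "neighbors" but the map (Q) says otherwise
        return np.sum(P * np.log((P + eps) / (Q + eps)))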

00:06:23.009 --> 00:06:25.990
The referee analogy for KL divergence is spot-on.

00:06:25.990 --> 00:06:28.670
I mean, technically it is an asymmetric

00:06:28.670 --> 00:06:31.569
measure of how one probability distribution diverges

00:06:31.569 --> 00:06:34.329
from a second expected distribution, but... Right.

00:06:34.550 --> 00:06:37.129
Referee is easier. But let's upgrade your idea

00:06:37.129 --> 00:06:39.649
of gradient descent. It's often described as

00:06:39.649 --> 00:06:42.550
taking tiny steps down a mathematical hill, but

00:06:42.550 --> 00:06:45.079
in the context of TSNE... it's much more visceral

00:06:45.079 --> 00:06:47.920
than that. Imagine the points on your flat map

00:06:47.920 --> 00:06:50.920
are all connected by a complex system of physical

00:06:50.920 --> 00:06:53.939
springs. Springs, like actual coiled tension

00:06:53.939 --> 00:06:57.180
springs. Exactly. When the KL divergence referee

00:06:57.180 --> 00:06:59.600
throws a flag because two points are placed poorly,

00:07:00.079 --> 00:07:02.319
it's like adding massive tension to the springs

00:07:02.319 --> 00:07:05.399
connecting them. So gradient descent is the process

00:07:05.399 --> 00:07:08.800
of those springs violently snapping and yanking

00:07:08.800 --> 00:07:11.420
the points across the map to relieve the tension.

00:07:11.639 --> 00:07:14.339
That's a great image. Yeah. The algorithm continually

00:07:14.339 --> 00:07:16.360
calculates the gradient, which is the direction

00:07:16.360 --> 00:07:18.959
of the pull, and updates the map, letting the

00:07:18.959 --> 00:07:21.199
points push and pull each other until the entire

00:07:21.199 --> 00:07:23.600
system settles into a state of minimal tension.
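
NOTE
Editor's sketch (not from the source audio): one rough "spring" update, assuming P holds the target joint neighbor probabilities and Y the (n, 2) map positions. The map affinities here use a plain Gaussian for simplicity; the heavy-tailed swap that defines t-SNE comes up later in the conversation.
    import numpy as np
    def layout_step(P, Y, lr=200.0):
        diff = Y[:, None, :] - Y[None, :, :]        # displacement between every pair of map points
        d2 = np.sum(diff ** 2, axis=-1)             # squared distances on the flat map
        W = np.exp(-d2)
        np.fill_diagonal(W, 0.0)
        Q = W / W.sum()                             # current neighbor odds on the map
        force = ((P - Q)[..., None] * diff).sum(axis=1)   # net spring force on each point
        return Y - lr * force                       # nudge the map to relieve tension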

00:07:24.079 --> 00:07:27.980
That sounds incredibly chaotic and mathematically

00:07:27.980 --> 00:07:30.319
exhausting for a computer to calculate. Like

00:07:30.319 --> 00:07:32.120
every single point is pulling on every other

00:07:32.120 --> 00:07:34.500
point constantly. It is intensely demanding.

00:07:34.839 --> 00:07:36.939
The source actually notes that for a data set

00:07:36.939 --> 00:07:40.899
with n elements, standard TSNE runs in what computer

00:07:40.899 --> 00:07:44.019
scientists call O of n squared time and requires

00:07:44.019 --> 00:07:46.879
O of n squared space. Meaning if you double the

00:07:46.879 --> 00:07:49.370
amount of data, the computational cost doesn't

00:07:49.370 --> 00:07:52.009
just double, it quadruples. Precisely. If you

00:07:52.009 --> 00:07:54.310
have a thousand data points, the computer is

00:07:54.310 --> 00:07:56.829
managing a million calculations. Right. If you

00:07:56.829 --> 00:07:58.529
scale that up to a million data points, you're

00:07:58.529 --> 00:08:00.709
suddenly asking the machine to handle a trillion

00:08:00.709 --> 00:08:03.430
calculations. The memory and processing power

00:08:03.430 --> 00:08:05.850
required become just a massive bottleneck as

00:08:05.850 --> 00:08:08.790
the data set grows. So assuming you have, like,

00:08:08.949 --> 00:08:11.430
a supercomputer that can handle the springs snapping

00:08:11.430 --> 00:08:14.529
back and forth a trillion times, what does this

00:08:14.529 --> 00:08:17.610
all mean for the actual map? Like how does it

00:08:17.610 --> 00:08:20.310
avoid just lumping everyone together in the middle?

00:08:20.430 --> 00:08:22.540
What do you mean? I mean, if everything shares

00:08:22.540 --> 00:08:24.899
at least some tiny similarity with everything

00:08:24.899 --> 00:08:27.600
else, wouldn't the tension just crush all the

00:08:27.600 --> 00:08:30.740
points into one dense overlapping black hole

00:08:30.740 --> 00:08:32.720
in the center of the screen? That black hole

00:08:32.720 --> 00:08:34.860
scenario is exactly what would happen without

00:08:34.860 --> 00:08:37.340
a few very clever mathematical interventions.

00:08:37.940 --> 00:08:40.639
To prevent the map from collapsing in on itself,

00:08:41.220 --> 00:08:44.279
TSNE relies on a user -defined parameter called

00:08:44.279 --> 00:08:47.559
perplexity. Perplexity? Yeah. Which is, ironically,

00:08:47.639 --> 00:08:49.700
how I feel when I try to read statistical mathematics.

00:08:49.720 --> 00:08:52.970
Yeah, it's a great name. In this context, perplexity

00:08:52.970 --> 00:08:55.250
is a number you choose before running the algorithm,

00:08:55.629 --> 00:08:57.990
typically set somewhere between 5 and 50. Okay.
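
NOTE
Editor's sketch (not from the source audio): how that chosen perplexity might pin down the bell-curve width for one point. The search below bisects on sigma until the distribution's effective number of neighbors (2 raised to its Shannon entropy, as described next) matches the requested perplexity.
    import numpy as np
    def sigma_for_perplexity(d2_i, perplexity=30.0, iters=50):
        # d2_i: squared distances from one point to all the others
        lo, hi = 1e-10, 1e10
        for _ in range(iters):
            sigma = (lo + hi) / 2.0
            p = np.exp(-d2_i / (2.0 * sigma ** 2))
            p = p / (p.sum() + 1e-12)
            entropy = -np.sum(p * np.log2(p + 1e-12))
            if 2.0 ** entropy > perplexity:
                hi = sigma        # too many effective neighbors: narrow the curve
            else:
                lo = sigma        # too few: widen it
        return sigma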

00:08:58.269 --> 00:09:00.429
The technical definition involves matching the

00:09:00.429 --> 00:09:02.649
Shannon entropy of the conditional probabilities,

00:09:03.210 --> 00:09:05.389
but functionally you can interpret perplexity

00:09:05.389 --> 00:09:07.649
as a smooth measure of the effective number of

00:09:07.649 --> 00:09:09.990
neighbors a point has. Okay, let me translate

00:09:09.990 --> 00:09:11.809
that real quick. You're basically giving the

00:09:11.809 --> 00:09:14.049
algorithm an attention span. You're telling it,

00:09:14.129 --> 00:09:16.759
hey, only focus on the closest 30 people to this

00:09:16.759 --> 00:09:19.440
point and safely ignore the thousands of other

00:09:19.440 --> 00:09:21.620
people who are further away. That is exactly

00:09:21.620 --> 00:09:23.879
what it does. And the algorithm adapts to the

00:09:23.879 --> 00:09:26.639
density of the data to make it happen. It adjusts

00:09:26.639 --> 00:09:29.159
the bandwidth of those Gaussian bell curves we

00:09:29.159 --> 00:09:31.399
talked about earlier. So they change size. Yeah.

00:09:31.559 --> 00:09:34.419
In a really crowded part of the high-dimensional

00:09:34.419 --> 00:09:37.600
space, it shrinks the bell curve to focus only

00:09:37.600 --> 00:09:40.919
on the immediate neighbors. In a sparse, empty

00:09:40.919 --> 00:09:43.759
area, it widens the bell curve to cast a larger

00:09:43.759 --> 00:09:46.919
net. That's smart. It ensures that every single

00:09:46.919 --> 00:09:49.519
point gets to care about the exact same effective

00:09:49.519 --> 00:09:51.759
number of neighbors, regardless of how packed

00:09:51.759 --> 00:09:54.000
the data is. So it adjusts its own zoom lens

00:09:54.000 --> 00:09:55.840
depending on the crowd. That's really cool. But

00:09:55.840 --> 00:09:57.919
the source material mentions another huge obstacle

00:09:57.919 --> 00:10:00.379
the algorithm has to overcome, something incredibly

00:10:00.379 --> 00:10:03.399
ominous-sounding, the curse of dimensionality.

00:10:03.480 --> 00:10:08.080
Ah, yes. The curse of dimensionality is a foundational

00:10:08.080 --> 00:10:11.139
nightmare in data science. When you use regular

00:10:11.139 --> 00:10:13.200
Euclidean distance. Like just measuring with

00:10:13.200 --> 00:10:15.879
a ruler. Right, the simple straight line distance

00:10:15.879 --> 00:10:18.519
we use in the physical world to measure the space

00:10:18.519 --> 00:10:21.620
between two objects. It completely loses its

00:10:21.620 --> 00:10:24.379
meaning in high dimensions. How does distance

00:10:24.379 --> 00:10:26.840
lose its meaning? I mean, a mile's a mile, isn't

00:10:26.840 --> 00:10:29.450
it? Well, in three dimensions, yes. But in high

00:10:29.450 --> 00:10:32.590
dimensions, volume expands exponentially. Think

00:10:32.590 --> 00:10:35.649
of a multi -dimensional orange. OK. As you add

00:10:35.649 --> 00:10:38.610
hundreds of dimensions, the math dictates that

00:10:38.610 --> 00:10:41.009
almost all of the volume of that orange gets

00:10:41.009 --> 00:10:44.070
pushed to the very outer surface. If it were

00:10:44.070 --> 00:10:47.129
a 500-dimensional orange, it would be 99% peel

00:10:47.129 --> 00:10:50.549
and almost zero juice. Wait, really? So the data

00:10:50.549 --> 00:10:53.429
points all migrate to the edges? Yes. And because

00:10:53.429 --> 00:10:55.669
everything is pushed to the vast outer crust

00:10:55.669 --> 00:10:58.889
of the space, almost all points seem completely

00:10:58.889 --> 00:11:01.730
equidistant from each other. If distances lose

00:11:01.730 --> 00:11:04.110
their ability to discriminate, the probabilities

00:11:04.110 --> 00:11:06.409
of things being neighbors become too similar.

00:11:06.990 --> 00:11:08.909
Asymptotically, they just converge to a constant

00:11:08.909 --> 00:11:11.409
number. So if everyone is equally far away from

00:11:11.409 --> 00:11:13.129
you, you don't actually have any neighbors. The

00:11:13.129 --> 00:11:15.090
whole concept of a neighborhood just breaks down.
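
NOTE
Editor's sketch (not from the source audio): a tiny experiment that shows the effect just described. For random points, the relative spread of pairwise distances (standard deviation over mean) collapses as the number of dimensions grows, so "near" and "far" stop meaning much.
    import numpy as np
    rng = np.random.default_rng(0)
    for dims in (3, 30, 300, 3000):
        X = rng.random((200, dims))                       # 200 random points in a unit cube
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # squared pairwise distances
        d = np.sqrt(np.clip(d2[np.triu_indices(200, k=1)], 0.0, None))
        print(dims, round(float(d.std() / d.mean()), 3))  # shrinks toward 0 as dims grow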

00:11:15.470 --> 00:11:18.850
Exactly. To fix this on the final flat map, van der Maaten

00:11:18.850 --> 00:11:21.940
and Hinton did something ingenious. They completely

00:11:21.940 --> 00:11:24.659
abandoned the Gaussian bell curve for the low

00:11:24.659 --> 00:11:26.679
dimensional map. OK, what did they use instead?

00:11:26.860 --> 00:11:29.539
They swapped it out for a heavy-tailed Student

00:11:29.539 --> 00:11:33.340
t-distribution, specifically a Cauchy distribution

00:11:33.340 --> 00:11:36.700
with one degree of freedom. This is actually

00:11:36.700 --> 00:11:40.210
where the t in TSNE comes from. Oh. Okay, let's

00:11:40.210 --> 00:11:42.870
visualize a heavy tail. If a normal bell curve

00:11:42.870 --> 00:11:45.450
looks like a hill, where the slopes swoop all

00:11:45.450 --> 00:11:47.889
the way down to touch the floor, this Cauchy

00:11:47.889 --> 00:11:49.590
distribution looks like a hill where the slopes

00:11:49.590 --> 00:11:51.769
hover above the ground. Right. Like they stretch

00:11:51.769 --> 00:11:54.970
out infinitely without ever touching zero. What

00:11:54.970 --> 00:11:57.490
does that actually achieve though? It artificially

00:11:57.490 --> 00:12:00.470
inflates the space. Because the tails hover above

00:12:00.470 --> 00:12:03.070
the ground, there is a much higher mathematical

00:12:03.070 --> 00:12:05.870
probability of finding data points way out on

00:12:05.870 --> 00:12:08.629
the fringes. It essentially gives the algorithm

00:12:08.629 --> 00:12:11.529
the permission it needs to push dissimilar objects

00:12:11.529 --> 00:12:13.909
much, much further apart on the flat map than

00:12:13.909 --> 00:12:15.730
they strictly are in the high-dimensional space.
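
NOTE
Editor's sketch (not from the source audio): the swap that puts the t in t-SNE. On the flat map, similarity falls off as 1 / (1 + distance squared), the heavy-tailed Student-t (Cauchy) curve, instead of the Gaussian used in the high-dimensional space.
    import numpy as np
    def low_dim_affinities(Y):
        # Y: (n, 2) positions on the flat map
        sq = np.sum(Y ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T    # squared map distances
        W = 1.0 / (1.0 + d2)          # heavy-tailed kernel: hovers above zero far out
        np.fill_diagonal(W, 0.0)      # a point is not its own neighbor
        return W / W.sum()            # joint probabilities q_ij for the map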

00:12:16.070 --> 00:12:19.009
Oh, I see. It gives the springs permission to

00:12:19.009 --> 00:12:22.309
stretch. It artificially shoves groups away from

00:12:22.309 --> 00:12:24.710
each other to keep the visual clean, preventing

00:12:24.710 --> 00:12:27.090
that overlapping black hole we were worried about.

00:12:27.250 --> 00:12:30.529
Yes, it creates those beautiful, distinct visual

00:12:30.529 --> 00:12:34.519
islands of data that TSNE is famous for. Here's

00:12:34.519 --> 00:12:36.399
where it gets really interesting, though, because,

00:12:36.500 --> 00:12:38.480
I mean, we've been deep in the theoretical math

00:12:38.480 --> 00:12:42.220
for a while now. Gaussian kernels, Cauchy distributions,

00:12:42.720 --> 00:12:45.100
O of n squared complexity. We've been in the

00:12:45.100 --> 00:12:47.299
weeds. Yeah, but this isn't just an academic

00:12:47.299 --> 00:12:49.830
exercise happening on a chalkboard. The real

00:12:49.830 --> 00:12:52.409
-world applications of this algorithm are astonishing.

00:12:52.610 --> 00:12:54.769
Like, when you take this ability to find hidden

00:12:54.769 --> 00:12:57.409
islands of data and apply it to actual human

00:12:57.409 --> 00:13:00.090
problems, it is quite literally saving lives.

00:13:00.330 --> 00:13:03.090
Oh, definitely. The range of fields relying on

00:13:03.090 --> 00:13:05.809
TSNE to see hidden patterns is just staggering.

00:13:06.049 --> 00:13:08.429
Take medicine, for example. The source highlights

00:13:08.429 --> 00:13:11.610
researchers using TSNE for analyzing EEG signals

00:13:11.610 --> 00:13:14.110
to detect epileptic seizures. Which makes perfect

00:13:14.110 --> 00:13:16.950
sense when you think about it. An EEG is recording

00:13:16.950 --> 00:13:19.340
the chaotic, high-dimensional electrical noise

00:13:19.340 --> 00:13:22.580
of the entire human brain: millions of overlapping

00:13:22.580 --> 00:13:25.100
signals. To a human eye looking at a raw chart,

00:13:25.440 --> 00:13:28.620
it's just a wall of static. But t-SNE can map

00:13:28.620 --> 00:13:30.919
that electrical chaos and visually group the

00:13:30.919 --> 00:13:34.039
moments in time that look similar. It pulls the

00:13:34.039 --> 00:13:36.700
distinct, hidden signature of a seizure out of

00:13:36.700 --> 00:13:39.480
the static and plots it as an isolated island

00:13:39.480 --> 00:13:42.080
on a map. It's incredible. The text also emphasizes

00:13:42.080 --> 00:13:45.139
its use in exploring breast cancer data, specifically

00:13:45.139 --> 00:13:48.340
computer-aided diagnosis, or CADx. Wait, how

00:13:48.340 --> 00:13:50.980
does the algorithm map a tumor? Well, tumor doesn't

00:13:50.980 --> 00:13:53.320
just have one or two features. When doctors analyze

00:13:53.320 --> 00:13:55.539
tissue, they're looking at cellular density,

00:13:56.000 --> 00:13:58.279
the thickness of cell walls, genetic markers,

00:13:58.779 --> 00:14:01.419
dozens or hundreds of microscopic variables.

00:14:01.759 --> 00:14:04.340
Ah, okay. That is a high dimensional space. Exactly.

00:14:04.779 --> 00:14:07.120
TSNE flattens all of those variables. It allows

00:14:07.120 --> 00:14:09.700
a researcher to plot a new patient's tumor profile

00:14:09.700 --> 00:14:12.340
on a screen and see instantly if it visually

00:14:12.340 --> 00:14:14.820
lands on the island of known benign tumors or

00:14:14.820 --> 00:14:17.340
if it clusters with the malignant ones. It takes

00:14:17.340 --> 00:14:19.840
the microscopic and makes it entirely visual.

00:14:20.000 --> 00:14:23.039
And it completely jumps industries, too. In technology

00:14:23.039 --> 00:14:25.899
and computer security, researchers use it to

00:14:25.899 --> 00:14:28.639
analyze the behavior of off-the-shelf antivirus

00:14:28.639 --> 00:14:32.200
engines, like grouping how different software

00:14:32.200 --> 00:14:35.580
reacts to millions of lines of malicious code.

00:14:35.820 --> 00:14:38.279
It's heavily utilized in arts and culture too.

00:14:38.639 --> 00:14:41.340
Natural language processing uses t-SNE to

00:14:41.340 --> 00:14:44.120
generate word embeddings from 19th century literature.

00:14:44.360 --> 00:14:47.240
I love this example so much. It maps the vocabulary

00:14:47.240 --> 00:14:50.000
habits of authors. It plots how often they use

00:14:50.000 --> 00:14:52.559
specific prepositions or adjectives. Suddenly

00:14:52.559 --> 00:14:54.679
you can take an anonymous piece of text, run

00:14:54.679 --> 00:14:57.360
it through t-SNE, and watch it visually

00:14:57.360 --> 00:14:59.639
cluster right next to Charles Dickens. Because

00:14:59.639 --> 00:15:02.519
the algorithm recognized his subconscious high-dimensional

00:15:02.519 --> 00:15:05.419
word choice fingerprint. Yes. It's also been used to

00:15:05.419 --> 00:15:07.659
learn features from music audio using deep belief

00:15:07.659 --> 00:15:10.159
networks. Earth scientists use it for geological

00:15:10.159 --> 00:15:13.179
domain interpretation, identifying complex material

00:15:13.179 --> 00:15:15.480
types in geochemical data scattered across the

00:15:15.480 --> 00:15:17.539
globe. From the electrical storms in our brains

00:15:17.539 --> 00:15:20.379
to 19th century literature to the literal dirt

00:15:20.379 --> 00:15:22.679
under our feet. The primary takeaway here is

00:15:22.679 --> 00:15:26.419
that TSNE is not just a math trick. It is a fundamental

00:15:26.419 --> 00:15:29.500
lens. When professionals across these entirely

00:15:29.500 --> 00:15:31.659
different disciplines are drowning in variables,

00:15:32.320 --> 00:15:35.519
TSNE acts as a translator. It converts abstract

00:15:35.519 --> 00:15:38.320
numbers into spatial intuition, giving researchers

00:15:38.320 --> 00:15:40.879
a kind of superhuman sight. But, and this is

00:15:40.879 --> 00:15:44.399
a massive, but, the source material has an entire

00:15:44.399 --> 00:15:47.820
section dedicated to the outputs of TSNE that

00:15:47.820 --> 00:15:51.259
serves as a giant flashing warning label. Yes.

00:15:51.620 --> 00:15:53.379
This raises an important question, maybe the

00:15:53.379 --> 00:15:55.080
most important question of our entire discussion.

00:15:55.580 --> 00:15:57.500
Given everything we know about how the algorithm

00:15:57.500 --> 00:16:00.120
artificially inflates space and uses heavy tails,

00:16:00.889 --> 00:16:03.590
how much should we trust our own eyes when we

00:16:03.590 --> 00:16:05.649
look at a TSNE map? Wait, so you're telling me

00:16:05.649 --> 00:16:07.950
the visual clusters we see on these maps, those

00:16:07.950 --> 00:16:10.590
beautiful distinct islands of data that are helping

00:16:10.590 --> 00:16:13.029
doctors and geologists, might be completely fake?

00:16:13.129 --> 00:16:15.070
They absolutely can be, and this is where critical

00:16:15.070 --> 00:16:17.470
thinking becomes mandatory. According to the

00:16:17.470 --> 00:16:20.230
source, TSNE plots often seem to display clusters,

00:16:20.649 --> 00:16:23.190
but those visual groupings are heavily manipulated

00:16:23.190 --> 00:16:25.549
by the chosen parameterization. You mean the

00:16:25.549 --> 00:16:28.419
settings? Right. Remember that perplexity number

00:16:28.419 --> 00:16:31.799
you have to choose, the attention span. If you

00:16:31.799 --> 00:16:34.860
set that number wrong for your specific data

00:16:34.860 --> 00:16:39.360
set, t-SNE can and will produce visual clusters

00:16:39.360 --> 00:16:43.259
even in data that has absolutely no clear grouping

00:16:43.259 --> 00:16:46.570
in reality. So it will just invent clusters, because

00:16:46.570 --> 00:16:48.789
it's trying so hard to push things apart with

00:16:48.789 --> 00:16:51.230
that heavy-tailed t-distribution. Like, it

00:16:51.230 --> 00:16:53.950
wants to make a map so badly, it will draw borders

00:16:53.950 --> 00:16:56.049
where none exist. Yes, it will show you false

00:16:56.049 --> 00:16:59.409
findings. But the distortion gets even more counterintuitive

00:16:59.409 --> 00:17:02.230
than that. Let's say you do have real mathematically

00:17:02.230 --> 00:17:05.089
verified clusters. On a TSNE map, you might see

00:17:05.089 --> 00:17:08.029
one cluster that is huge and spread out and another

00:17:08.029 --> 00:17:10.849
that is tiny and dense. You might see two clusters

00:17:10.849 --> 00:17:12.710
right next to each other and a third one way

00:17:12.710 --> 00:17:14.529
off in the corner of the screen. Naturally looking

00:17:14.529 --> 00:17:16.049
at that, my brain immediately at least says,

00:17:16.170 --> 00:17:18.809
OK, the big cluster has more variance inside

00:17:18.809 --> 00:17:20.509
of it, and the two clusters close together are

00:17:20.509 --> 00:17:22.789
closely related, while the one far away is an

00:17:22.789 --> 00:17:25.289
outlier. And your brain would be completely fundamentally

00:17:25.289 --> 00:17:28.349
wrong. The source explicitly states that the

00:17:28.349 --> 00:17:31.430
size of clusters produced by TSNE is mathematically

00:17:31.430 --> 00:17:34.390
not informative, and neither is the distance

00:17:34.390 --> 00:17:36.150
between different clusters. Are you kidding?

00:17:36.430 --> 00:17:39.930
Nope. The algorithm distorts the macrogeography

00:17:39.930 --> 00:17:42.109
of the map to preserve the local neighborhoods.

00:17:42.430 --> 00:17:45.170
The heavy tail inflates space unpredictably.

00:17:45.390 --> 00:17:49.109
You cannot use a ruler on a t-SNE plot. You cannot

00:17:49.109 --> 00:17:52.430
say cluster A is closer to cluster B than cluster

00:17:52.430 --> 00:17:56.109
C. That is wild. The big picture is essentially

00:17:56.109 --> 00:17:58.910
a highly distorted illusion. So it successfully

00:17:58.910 --> 00:18:02.109
brings the data down into two dimensions, but

00:18:02.109 --> 00:18:04.289
it fundamentally warps the global distances just

00:18:04.289 --> 00:18:06.410
to make the local neighborhoods look good. Exactly.

00:18:06.490 --> 00:18:08.910
So if the distances are meaningless and the clusters

00:18:08.910 --> 00:18:11.930
might be fake, what is the solution? How do researchers

00:18:11.930 --> 00:18:14.049
actually use this without getting fooled by their

00:18:14.049 --> 00:18:16.589
own data? The solution is a process the text

00:18:16.589 --> 00:18:19.289
calls interactive exploration. You cannot just

00:18:19.289 --> 00:18:21.430
run the t-SNE algorithm once, print out the picture,

00:18:21.509 --> 00:18:22.970
and publish your paper. You have to actively

00:18:22.970 --> 00:18:24.750
play with the parameters. You have to run it

00:18:24.750 --> 00:18:27.349
with different perplexities, validate the findings

00:18:27.349 --> 00:18:30.109
against other analytical methods, and truly understand

00:18:30.109 --> 00:18:32.890
the math beneath the image. Which totally explains

00:18:32.890 --> 00:18:35.190
why the software section of the article highlights

00:18:35.190 --> 00:18:37.529
tools that are specifically designed for this

00:18:37.529 --> 00:18:40.420
kind of iteration. Because of that brutal O of

00:18:40.420 --> 00:18:42.700
N squared computation time we mentioned earlier,

00:18:43.099 --> 00:18:45.019
developers have built approximation methods.
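
NOTE
Editor's sketch (not from the source audio): what this looks like in practice with the scikit-learn implementation mentioned below. Parameter names are scikit-learn's; the loop over several perplexities is the "interactive exploration" habit just described, and X here is a random stand-in for real high-dimensional data.
    import numpy as np
    from sklearn.manifold import TSNE
    X = np.random.rand(1000, 50)
    for perplexity in (5, 30, 50):    # clusters that survive every setting are more trustworthy
        Y = TSNE(n_components=2, perplexity=perplexity,
                 method="barnes_hut",  # the approximation discussed in this section
                 init="pca", random_state=0).fit_transform(X)
        print(perplexity, Y.shape)     # a (1000, 2) map to plot and compare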

00:18:45.200 --> 00:18:48.519
Right, to speed things up. Yeah. There is a C++

00:18:48.519 --> 00:18:50.960
implementation called Barnes-Hut, available

00:18:50.960 --> 00:18:53.579
directly from van der Maaten, which groups distant

00:18:53.579 --> 00:18:56.200
points together as a single massive point to

00:18:56.200 --> 00:18:59.660
save calculation time. Yes, and the R package,

00:18:59.960 --> 00:19:03.799
Rtsne, or ELKI, which also utilizes that Barnes-Hut

00:19:03.799 --> 00:19:06.259
-Hutt approximation to make the algorithm viable

00:19:06.259 --> 00:19:08.619
for massive data sets. And Python users, they

00:19:08.619 --> 00:19:11.220
rely on scikit -learn, which is hugely popular

00:19:11.220 --> 00:19:13.779
and handles both exact solutions and the Barnes-Hut

00:19:13.779 --> 00:19:16.319
approximation. There's TensorBoard, the

00:19:16.319 --> 00:19:19.240
Julia package TSne.jl. The point is, all of

00:19:19.240 --> 00:19:21.480
these tools are being built not to generate a

00:19:21.480 --> 00:19:24.880
single static image, but for progressive visual

00:19:24.880 --> 00:19:27.180
analytics. Exactly. They are built so you can

00:19:27.180 --> 00:19:29.380
turn the dials in real time and watch how the

00:19:29.380 --> 00:19:32.160
map changes. Because if the map radically transforms

00:19:32.160 --> 00:19:34.380
and your beautiful clusters just disappear the

00:19:34.380 --> 00:19:36.559
moment you change the perplexity from 20 to 30,

00:19:37.180 --> 00:19:38.819
those clusters probably weren't real in the first

00:19:38.819 --> 00:19:42.140
place. Exactly. The truth of the data has to

00:19:42.140 --> 00:19:45.039
be robust across different parameters. It has

00:19:45.039 --> 00:19:46.700
been mathematically proven that with the right

00:19:46.700 --> 00:19:50.220
parameter choices, TSNE can recover well -separated

00:19:50.220 --> 00:19:52.539
clusters, but you have to do the rigorous work

00:19:52.539 --> 00:19:54.940
to find those correct settings. It demands an

00:19:54.940 --> 00:19:58.160
active, highly skeptical user. So to summarize

00:19:58.160 --> 00:20:01.119
our deep dive today, t-SNE is this brilliantly

00:20:01.119 --> 00:20:04.019
complex, beautifully flawed translator. It takes

00:20:04.019 --> 00:20:06.180
the incomprehensible scale of high dimensional

00:20:06.180 --> 00:20:08.539
space, whether that's hundreds of variables in

00:20:08.539 --> 00:20:11.480
a breast cancer tumor or 500 personality traits

00:20:11.480 --> 00:20:14.200
at a massive convention. And it uses probabilities

00:20:14.200 --> 00:20:17.000
and heavy tail distributions to squish that reality

00:20:17.000 --> 00:20:19.420
down into a 2D map we can actually comprehend.

00:20:19.609 --> 00:20:22.470
It really is a tool that has fundamentally revolutionized

00:20:22.470 --> 00:20:25.190
data visualization, but it demands that we do

00:20:25.190 --> 00:20:27.769
not take its visual outputs at face value. It

00:20:27.769 --> 00:20:29.910
serves as a perfect example of why we can never

00:20:29.910 --> 00:20:32.609
let algorithms do our thinking for us. It provides

00:20:32.609 --> 00:20:35.009
a unique perspective, but it does not provide

00:20:35.009 --> 00:20:37.369
an absolute truth. Right. Which brings me back

00:20:37.369 --> 00:20:39.289
to where we started. We talked about how our

00:20:39.289 --> 00:20:41.710
brains evolved for three-dimensional space,

00:20:42.170 --> 00:20:44.789
how we are biologically wired to look at the

00:20:44.789 --> 00:20:46.769
physical world and immediately find structure.

00:20:46.960 --> 00:20:49.279
We look up at a random scattering of stars and

00:20:49.279 --> 00:20:51.579
we see a hunter with a belt. We see faces in

00:20:51.579 --> 00:20:54.400
the clouds. We are deeply wired pattern recognition

00:20:54.400 --> 00:20:57.539
machines. We will trust our eyes over abstract

00:20:57.539 --> 00:21:00.519
numbers almost every single time. And that leaves

00:21:00.519 --> 00:21:02.819
a lingering question for you to mull over. We've

00:21:02.819 --> 00:21:05.200
just learned that a TSNE map will show you cluster

00:21:05.200 --> 00:21:07.839
sizes and distances that look incredibly meaningful.

00:21:08.380 --> 00:21:10.859
But mathematically? They are just an illusion

00:21:10.859 --> 00:21:14.079
of the translation process. So in an era where

00:21:14.079 --> 00:21:17.660
we increasingly rely on complex algorithms to

00:21:17.660 --> 00:21:20.440
compress impossible hyperdimensional reality

00:21:20.440 --> 00:21:22.420
into something our primate brains can digest,

00:21:23.099 --> 00:21:25.839
are we actually using these visualizations to

00:21:25.839 --> 00:21:28.500
uncover the hard truth of the data? Or are we

00:21:28.500 --> 00:21:30.460
just using them to paint a picture that our human

00:21:30.460 --> 00:21:33.380
brains are biologically desperate to see? A very

00:21:33.380 --> 00:21:36.039
necessary reminder to always question the lens

00:21:36.039 --> 00:21:38.220
through which we view the world. Thank you for

00:21:38.220 --> 00:21:41.259
joining us on this custom deep dive. Keep exploring,

00:21:41.779 --> 00:21:44.160
keep tweaking those parameters, and most importantly,

00:21:44.400 --> 00:21:46.059
keep questioning the data around you.
