WEBVTT

00:00:00.000 --> 00:00:02.919
Imagine, just for a second, that you are staring

00:00:02.919 --> 00:00:05.980
at a massive, like, incredibly chaotic spreadsheet.

00:00:06.419 --> 00:00:08.960
Oh, yeah, like a terrifying wall of data. Exactly.

00:00:09.220 --> 00:00:12.400
We are talking thousands upon thousands of rows.

00:00:12.480 --> 00:00:15.759
Maybe it's, you know, financial time series data

00:00:15.759 --> 00:00:17.739
from the stock markets. Or geographic crime

00:00:17.739 --> 00:00:21.000
statistics, right? Yeah, like 10,000 individual

00:00:21.000 --> 00:00:23.399
coordinates of burglaries across a major city

00:00:23.399 --> 00:00:27.120
over the last decade. Or perhaps it's complex

00:00:27.120 --> 00:00:29.960
biological data, like thousands of slightly different

00:00:29.960 --> 00:00:32.500
genetic sequences. It's just a sea of noise.

00:00:32.560 --> 00:00:34.380
Right. You're looking at this digital noise and

00:00:34.380 --> 00:00:36.560
you need to instantly understand its underlying

00:00:36.560 --> 00:00:39.299
structure. You need to find the hidden family

00:00:39.299 --> 00:00:41.880
trees buried inside those raw numbers to figure

00:00:41.880 --> 00:00:44.219
out what is related to what. Which is not easy.

00:00:44.460 --> 00:00:46.700
No. So how do computers actually do that? How

00:00:46.700 --> 00:00:49.820
do they take total chaos and, like, uncover the hidden

00:00:49.820 --> 00:00:52.520
order? Well, it's the ultimate needle in a haystack

00:00:52.520 --> 00:00:54.619
problem, really. And it's something we rely on

00:00:54.619 --> 00:00:57.039
constantly without even realizing it. Yeah. What

00:00:57.039 --> 00:01:00.060
we are exploring today is a foundational method

00:01:00.060 --> 00:01:03.200
in data mining and statistics called hierarchical

00:01:03.200 --> 00:01:07.859
clustering analysis, or HCA. HCA. Right. It's

00:01:07.859 --> 00:01:10.079
essentially how algorithms organize the world

00:01:10.079 --> 00:01:14.379
entirely based on math and distance. In our current

00:01:14.379 --> 00:01:18.879
reality of absolute information overload, organizing

00:01:18.879 --> 00:01:22.140
raw unstructured data into actionable hierarchies

00:01:22.140 --> 00:01:24.780
is crucial. Definitely. This is the invisible

00:01:24.780 --> 00:01:26.739
engine running behind the scenes of tools you

00:01:26.739 --> 00:01:29.420
might use every day, whether that's open source

00:01:29.420 --> 00:01:32.780
software like Python's SciPy library or massive

00:01:32.780 --> 00:01:34.819
commercial implementations in enterprise software

00:01:34.819 --> 00:01:38.260
like MATLAB or SAS. OK, let's unpack this. Because

00:01:38.260 --> 00:01:40.739
the premise of how it actually works is wild

00:01:40.739 --> 00:01:43.019
to me. It is pretty counterintuitive. Yeah. You

00:01:43.019 --> 00:01:46.019
don't even strictly need the raw observations

00:01:46.019 --> 00:01:48.340
or the specific details of the data to do this.

00:01:48.379 --> 00:01:51.079
You just need a matrix of distances between things.
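(For listeners following along in code: a minimal sketch of that idea with SciPy, which the episode mentions elsewhere. The points here are made up; the key detail is that the linkage step consumes only a condensed matrix of pairwise distances, never the raw observations.)

```python
# Toy example (made-up points): hierarchical clustering runs on
# pairwise distances alone, never on the raw observations.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
distances = pdist(points, metric="euclidean")  # condensed distance matrix
merges = linkage(distances, method="average")  # clusters from distances only

print(distances.shape)  # 6 pairwise distances for 4 points
print(merges.shape)     # (n - 1) merge steps, 4 columns each
```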

00:01:51.260 --> 00:01:53.719
Exactly. Just the distances. It's like, OK, imagine

00:01:53.719 --> 00:01:56.379
walking into a giant gymnasium full of thousands

00:01:56.379 --> 00:01:58.560
of strangers. Your job is to organize them into

00:01:58.560 --> 00:02:01.659
distinct, tightly-knit families. But you aren't

00:02:01.659 --> 00:02:03.900
allowed to talk to them. Not a single word. Right.

00:02:04.040 --> 00:02:05.859
You can't look at their faces. You can't ask

00:02:05.859 --> 00:02:07.560
about their backgrounds. You can't even see what

00:02:07.560 --> 00:02:10.099
they are wearing. You are blind to all of that.

00:02:10.400 --> 00:02:12.719
So what do you have? You just have a tape measure.

00:02:13.120 --> 00:02:15.960
You measure exactly how far apart every single

00:02:15.960 --> 00:02:17.759
person is standing from everyone else in the

00:02:17.759 --> 00:02:21.379
room. And you build the complex family trees

00:02:21.379 --> 00:02:24.800
purely from those physical distances. You're

00:02:24.800 --> 00:02:27.939
on the exact right track there, but it's even

00:02:27.939 --> 00:02:30.330
more abstract than that. Really. Yep, the algorithm

00:02:30.330 --> 00:02:32.310
literally does not care what the objects are.

00:02:32.650 --> 00:02:35.349
It is entirely blind to the who or the what.

00:02:35.490 --> 00:02:38.949
Okay. It only cares about the mathematical space

00:02:38.949 --> 00:02:41.990
between point A and point B. If you can define

00:02:41.990 --> 00:02:43.909
a distance, whether that's physical inches between

00:02:43.909 --> 00:02:46.889
people in a room or the percentage difference

00:02:46.889 --> 00:02:49.210
between two DNA sequences. Or the variance in

00:02:49.210 --> 00:02:51.550
two stock prices. Exactly. The algorithm can

00:02:51.550 --> 00:02:54.069
cluster it. So before we break out the algorithmic

00:02:54.069 --> 00:02:56.150
tape measure and start grouping people, we need

00:02:56.150 --> 00:02:59.050
an actual game plan. Right. From the deep dive

00:02:59.050 --> 00:03:01.610
into the source material, there are two grand

00:03:01.610 --> 00:03:03.849
strategies for doing this. We can either build

00:03:03.849 --> 00:03:06.759
up or we can break down. Precisely. Those are

00:03:06.759 --> 00:03:09.379
known as agglomerative clustering and divisive

00:03:09.379 --> 00:03:11.599
clustering. Okay, so agglomerative is the bottom

00:03:11.599 --> 00:03:13.860
-up approach, right? And it seems to be the most

00:03:13.860 --> 00:03:16.439
common. It is, yeah. It starts with every single

00:03:16.439 --> 00:03:19.020
data point, every person in our gymnasium as

00:03:19.020 --> 00:03:21.979
their own isolated individual cluster. An island

00:03:21.979 --> 00:03:25.159
of one. An island of one. And then, step by step,

00:03:25.500 --> 00:03:28.479
the algorithm finds the two closest individuals

00:03:28.479 --> 00:03:30.819
and merges them into a pair. And then what? Then

00:03:30.819 --> 00:03:32.819
it finds the next closest pair, or adds a third

00:03:32.819 --> 00:03:34.939
person to the first pair. It just keeps doing

00:03:34.939 --> 00:03:37.020
that, building larger and larger groups until

00:03:37.020 --> 00:03:39.300
everyone is eventually lumped into one giant

00:03:39.300 --> 00:03:41.819
cluster, or until you specifically tell it to

00:03:41.819 --> 00:03:44.719
stop. Spot on. And then you have the inverse

00:03:44.719 --> 00:03:47.039
strategy, which is divisive clustering. The top

00:03:47.039 --> 00:03:49.819
-down approach. Exactly. Instead of starting

00:03:49.819 --> 00:03:52.199
with individuals, you start with the entire room

00:03:52.199 --> 00:03:55.599
of people grouped into one massive single cluster.

00:03:55.800 --> 00:03:59.580
Like the whole gymnasium. Right. Then, recursively,

00:03:59.939 --> 00:04:02.500
the algorithm fractures that giant group into

00:04:02.500 --> 00:04:05.699
two smaller subsets. Then it splits those subsets

00:04:05.699 --> 00:04:09.500
again, and again, slicing the data down until

00:04:09.500 --> 00:04:11.599
you are left with individual data points. Which

00:04:11.599 --> 00:04:14.840
seems like a lot of work. It is. Divisive methods

00:04:14.840 --> 00:04:17.759
are less common in everyday applications, but

00:04:17.759 --> 00:04:20.720
they are incredibly powerful if your primary

00:04:20.720 --> 00:04:24.600
goal is to identify the large major fault lines

00:04:24.600 --> 00:04:27.220
in your data first, rather than focusing on the

00:04:27.220 --> 00:04:29.639
granular tiny relationships at the bottom. Exactly.
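(A deliberately naive Python sketch of that bottom-up loop, with made-up 1-D values and single-linkage distances between clusters; real libraries are far more efficient, but the merge logic is the same: everyone starts alone, and the closest pair merges until one group remains.)

```python
# Naive agglomerative clustering on toy 1-D data: start with singletons,
# repeatedly merge the closest pair of clusters until one remains.
points = [1.0, 1.2, 5.0, 5.3, 9.0]
clusters = [[p] for p in points]

def single_link(a, b):
    # cluster distance = closest pair of members (single linkage)
    return min(abs(x - y) for x in a for y in b)

while len(clusters) > 1:
    # find the indices of the two closest clusters
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    clusters[i] = clusters[i] + clusters[j]  # merge them
    del clusters[j]

print(clusters)  # one cluster containing every point
```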

00:04:29.819 --> 00:04:31.439
I like to picture the agglomerative approach

00:04:31.439 --> 00:04:33.959
like individuals forming a friendship. Two people

00:04:33.959 --> 00:04:36.579
hit it off, then that pair joins up with another

00:04:36.579 --> 00:04:39.379
pair they meet at a party to form a broader friend

00:04:39.379 --> 00:04:41.240
group. That's a great way to think about it.

00:04:41.439 --> 00:04:43.920
And eventually, over years, all those interconnected

00:04:43.920 --> 00:04:47.990
friend groups merge to form a whole sprawling

00:04:47.990 --> 00:04:50.990
community. Yes, very organic. Divisive clustering,

00:04:51.110 --> 00:04:54.250
on the other hand, is like a massive global corporation

00:04:54.250 --> 00:04:57.170
that decides to fracture. The global conglomerate

00:04:57.170 --> 00:05:00.230
splits into a North American division and a European

00:05:00.230 --> 00:05:02.329
division. Right. Big structural changes. And

00:05:02.329 --> 00:05:04.649
then those divisions split into smaller regional

00:05:04.649 --> 00:05:06.910
departments, and then those fracture down into

00:05:06.910 --> 00:05:10.209
individual isolated teams. That is a highly accurate

00:05:10.209 --> 00:05:13.089
way to visualize the two architectures. You are

00:05:13.089 --> 00:05:15.870
either building the pyramid brick by single brick

00:05:15.870 --> 00:05:18.560
from the ground up. Or you are starting with

00:05:18.560 --> 00:05:20.980
a solid mountain and carving the pyramid out

00:05:20.980 --> 00:05:23.560
of it from the peak down. But wait, I have to

00:05:23.560 --> 00:05:25.540
push back on the math here for that top -down

00:05:25.540 --> 00:05:28.000
divisive method. Okay, let's hear it. If we were

00:05:28.000 --> 00:05:30.819
starting with a mountain and trying to find the

00:05:30.819 --> 00:05:33.720
absolute perfect place to make that first massive

00:05:33.720 --> 00:05:36.620
slice, doesn't the algorithm have to calculate

00:05:36.620 --> 00:05:40.079
every possible way to divide the group? Because

00:05:40.079 --> 00:05:42.779
the text says if you use an exhaustive search,

00:05:42.980 --> 00:05:45.759
the mathematical time complexity is O of 2 to

00:05:45.759 --> 00:05:49.550
the power of n. Yes. Big O of 2 to the n. For

00:05:49.550 --> 00:05:51.670
anyone who remembers their high school algebra,

00:05:52.370 --> 00:05:55.209
an exponential equation like 2 to the power of

00:05:55.209 --> 00:05:57.769
n is basically a computational nightmare for

00:05:57.769 --> 00:05:59.930
a machine, isn't it? Oh, absolutely. If you have

00:05:59.930 --> 00:06:02.290
just 100 people in a room, 2 to the power of

00:06:02.290 --> 00:06:04.449
100 is, well, it's a number so large it breaks

00:06:04.449 --> 00:06:06.870
your brain. It's more than the number of stars

00:06:06.870 --> 00:06:09.610
in the observable universe. Exactly. So how is

00:06:09.610 --> 00:06:11.910
that even possible? What's fascinating here is

00:06:11.910 --> 00:06:15.069
that you've hit on the exact reason why we almost

00:06:15.069 --> 00:06:17.490
never use brute-force divisive clustering in

00:06:17.490 --> 00:06:19.350
the real world. Because the computer would just

00:06:19.350 --> 00:06:22.129
crash. Worse than crash. You are completely right.

00:06:22.310 --> 00:06:24.949
O of 2 to the n is mathematically explosive.
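(To make that concrete: a cluster of n items can be split into two non-empty groups in 2^(n-1) - 1 distinct ways, so the exhaustive search space explodes long before n reaches 100. A quick arithmetic check:)

```python
# Number of distinct two-way splits of an n-item cluster.
def split_count(n):
    return 2 ** (n - 1) - 1

print(split_count(4))    # 7 possible first splits for just 4 items
print(split_count(100))  # roughly 6.3 x 10**29 splits for 100 items
```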

00:06:25.259 --> 00:06:28.500
If a computer tried to exhaustively check every

00:06:28.500 --> 00:06:32.040
single possible way to split even a tiny data

00:06:32.040 --> 00:06:34.699
set of a hundred people, the sun would literally

00:06:34.699 --> 00:06:37.360
burn out and the universe would end before the

00:06:37.360 --> 00:06:39.139
computer finished calculating the first split.

00:06:39.360 --> 00:06:42.360
Wow! Okay, so we definitely don't do that. No,

00:06:42.459 --> 00:06:45.240
we can't check every combination. Instead...

00:06:44.810 --> 00:06:47.670
Data scientists have to rely on faster heuristics.

00:06:47.769 --> 00:06:50.350
Like shortcuts? Basically. We use algorithms

00:06:50.350 --> 00:06:53.209
like k-means or a highly specific, clever heuristic

00:06:53.209 --> 00:06:56.129
called the DIANA algorithm to smartly guess where

00:06:56.129 --> 00:06:58.490
the natural fault lines are and make those splits

00:06:58.490 --> 00:07:00.769
without checking every impossible permutation.

00:07:01.129 --> 00:07:04.069
OK, so we have our game plan. We are either smartly

00:07:04.069 --> 00:07:06.189
hollowing out the mountain from the top or building

00:07:06.189 --> 00:07:08.410
up from the bricks at the bottom. Right. But

00:07:08.410 --> 00:07:11.290
how does this computer actually decide who belongs

00:07:11.290 --> 00:07:13.550
together? What are the rules of attraction here?

00:07:13.720 --> 00:07:16.000
That is the core of the whole process. I know

00:07:16.000 --> 00:07:18.540
we need a measure of dissimilarity, like Euclidean

00:07:18.540 --> 00:07:21.060
distance, which is literally just a straight

00:07:21.060 --> 00:07:23.100
line measurement between point A and point B.

00:07:23.139 --> 00:07:25.879
Like a literal ruler. Right. But once we have

00:07:25.879 --> 00:07:28.180
individuals in groups, how do we measure the

00:07:28.180 --> 00:07:30.939
distance between two whole groups? This is where

00:07:30.939 --> 00:07:33.439
the real magic happens. To measure the distance

00:07:33.439 --> 00:07:36.420
between groups, you need a linkage criterion.

00:07:36.600 --> 00:07:39.500
A linkage criterion. Yes. The linkage criterion

00:07:39.500 --> 00:07:42.220
is the mathematical rule that tells the algorithm

00:07:42.220 --> 00:07:44.649
how to calculate the distance, not just between

00:07:44.649 --> 00:07:47.709
two isolated individuals, but between two growing

00:07:47.709 --> 00:07:51.089
clusters of individuals. OK. And the rule you

00:07:51.089 --> 00:07:53.670
choose completely changes the physical shape

00:07:53.670 --> 00:07:55.610
of the clusters you end up with. Here's where

00:07:55.610 --> 00:07:58.089
it gets really interesting. Because if I'm grouping

00:07:58.089 --> 00:08:00.810
these strangers in the gymnasium, the most logical

00:08:00.810 --> 00:08:03.790
thing to my human brain is to just look for the

00:08:03.790 --> 00:08:06.449
absolute shortest distance between any two people

00:08:06.449 --> 00:08:08.290
in different groups, right? Sure. That makes

00:08:08.290 --> 00:08:11.569
intuitive sense. Just find the two closest individuals.

00:08:11.949 --> 00:08:13.930
If Bob from group A is standing really close

00:08:13.930 --> 00:08:16.629
to Sarah from group B, we just merge both of

00:08:16.629 --> 00:08:18.470
their entire extended friend groups together

00:08:18.470 --> 00:08:21.069
based on that one connection. Boom. Single linkage.

00:08:21.790 --> 00:08:24.189
That seems foolproof. It sounds foolproof until

00:08:24.189 --> 00:08:26.709
you see what it actually builds. Uh oh. That

00:08:26.709 --> 00:08:29.430
is indeed called single linkage, or minimum distance

00:08:29.430 --> 00:08:32.149
clustering. But if you rely on that, you run

00:08:32.149 --> 00:08:34.980
into a massive problem known as chaining. Chaining.

00:08:35.259 --> 00:08:37.740
Yeah. Because single linkage only requires one

00:08:37.740 --> 00:08:40.120
single point of proximity to bridge two massive

00:08:40.120 --> 00:08:44.000
clusters, you end up with these elongated, weird,

00:08:44.600 --> 00:08:47.440
snake-like chains of data points. Oh, I see.
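(The chaining effect is easy to reproduce with SciPy on made-up, evenly spaced points on a line: single linkage stitches the whole chain together at the nearest-neighbor distance, while complete linkage has to account for the full span before the final merge.)

```python
# Evenly spaced points form a "chain" that single linkage strings together.
import numpy as np
from scipy.cluster.hierarchy import linkage

chain = np.arange(10, dtype=float).reshape(-1, 1)  # points 0..9 on a line
single = linkage(chain, method="single")
complete = linkage(chain, method="complete")

print(single[-1, 2])    # final merge height under single linkage: 1.0
print(complete[-1, 2])  # final merge height under complete linkage: far larger
```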

00:08:47.580 --> 00:08:49.259
Think of it like building a bridge out of a bunch

00:08:49.259 --> 00:08:52.039
of random wooden planks. To connect two piles

00:08:52.039 --> 00:08:54.059
of planks, single linkage just finds the two

00:08:54.059 --> 00:08:56.320
longest planks that barely touch at the very

00:08:56.320 --> 00:08:59.039
tips. So they are technically connected. Technically,

00:08:59.200 --> 00:09:01.279
but the people on the far left end of the new

00:09:01.279 --> 00:09:03.259
mega group have absolutely nothing in common

00:09:03.259 --> 00:09:05.899
with the people on the far right end. The cluster

00:09:05.899 --> 00:09:09.610
is drawn out and stringy. Ah, OK, so it connects

00:09:09.610 --> 00:09:12.610
them, but the group itself lacks any real cohesion,

00:09:13.250 --> 00:09:15.029
which leads us to the opposite approach, right?

00:09:15.070 --> 00:09:17.549
Complete linkage. Right. If single linkage is

00:09:17.549 --> 00:09:19.350
about the minimum distance, complete linkage

00:09:19.350 --> 00:09:21.809
looks at the maximum distance. The maximum. With

00:09:21.809 --> 00:09:23.970
complete linkage, the algorithm looks at the

00:09:23.970 --> 00:09:26.629
two people in group A and group B who are the

00:09:26.629 --> 00:09:28.769
absolute furthest apart from each other. Wait,

00:09:28.929 --> 00:09:32.289
really? Yes. For those two groups to merge, even

00:09:32.289 --> 00:09:35.299
their most distant opposite members must be relatively

00:09:35.299 --> 00:09:38.039
close together. That sets a profoundly higher

00:09:38.039 --> 00:09:41.320
bar for merging. It does. Returning to your construction

00:09:41.320 --> 00:09:44.000
metaphor, if single linkage is a rickety bridge

00:09:44.000 --> 00:09:46.679
touching at the tips, complete linkage is like

00:09:46.679 --> 00:09:49.259
building a foundational wall. Every brick matters.

00:09:49.500 --> 00:09:52.039
Exactly. Every single brick needs to be tightly

00:09:52.039 --> 00:09:54.200
packed and close to every other brick in the

00:09:54.200 --> 00:09:56.899
block before you combine them. The result is

00:09:56.899 --> 00:09:59.379
that complete linkage tends to produce highly

00:09:59.379 --> 00:10:03.080
compact, tight, spherical clusters. OK, so if

00:10:03.080 --> 00:10:05.639
a city planner is looking at that map of 10,000

00:10:05.639 --> 00:10:07.980
burglaries and they want to know exactly where

00:10:07.980 --> 00:10:10.600
to build three new police stations to cover the

00:10:10.600 --> 00:10:13.259
densest hot spots, they want complete linkage.

00:10:13.799 --> 00:10:16.639
They want tight, cohesive spheres of activity,

00:10:17.179 --> 00:10:20.519
not long meandering snakes of data. That is a

00:10:20.519 --> 00:10:23.019
perfect real -world application of it. And there's

00:10:23.019 --> 00:10:25.399
so many other nuanced ways to measure this group

00:10:25.399 --> 00:10:28.000
distance too, like Ward's method. Oh, Ward's

00:10:28.000 --> 00:10:29.879
method is fascinating. From my understanding,

00:10:30.039 --> 00:10:31.980
Ward's method doesn't just look at the closest

00:10:31.980 --> 00:10:34.759
or furthest points, it looks at the minimum increase

00:10:34.759 --> 00:10:37.840
of the sum of squares. Which, to translate that

00:10:37.840 --> 00:10:39.899
out of math jargon, essentially means it tries

00:10:39.899 --> 00:10:42.759
to minimize the variance within a cluster. It

00:10:42.759 --> 00:10:45.820
only merges two groups if doing so keeps the

00:10:45.820 --> 00:10:48.559
new combined group as tightly packed and uniform

00:10:48.559 --> 00:10:51.100
as possible. Exactly. Ward's method is brilliant

00:10:51.100 --> 00:10:54.720
for finding evenly sized, highly homogeneous groups.
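(A hand-rolled sketch of Ward's merge cost with made-up 1-D groups; production implementations use recurrence formulas instead, but the quantity being minimized is exactly this increase in the sum of squares.)

```python
# Ward's criterion sketch: the cost of a merge is the increase in the
# total within-cluster sum of squares that the merge causes.
def sum_of_squares(group):
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)

def ward_cost(a, b):
    # extra variance introduced by merging clusters a and b
    return sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)

tight_pair = ([1.0, 1.1], [1.2, 1.3])    # similar groups: cheap to merge
distant_pair = ([1.0, 1.1], [9.0, 9.1])  # dissimilar groups: expensive
print(ward_cost(*tight_pair) < ward_cost(*distant_pair))  # True
```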

00:10:55.039 --> 00:10:58.019
And then you have methods like UPGMA. UPGMA.

00:10:58.120 --> 00:11:00.259
Which stands for Unweighted Pair Group Method

00:11:00.259 --> 00:11:02.500
with Arithmetic Mean. Rolls right off the tongue.

00:11:02.740 --> 00:11:04.840
Ah, it's a mouthful, but the concept is elegant.

00:11:06.059 --> 00:11:08.639
UPGMA calculates the average distance between

00:11:08.639 --> 00:11:11.100
all possible pairs of points across the two clusters.

00:11:11.559 --> 00:11:13.980
Oh, so it's not just the closest or furthest.

00:11:14.139 --> 00:11:16.220
Right, it's the average of every single connection.

00:11:16.490 --> 00:11:19.490
This is actually the go-to method for evolutionary

00:11:19.490 --> 00:11:22.870
biologists. Why is that? If they are looking

00:11:22.870 --> 00:11:25.669
at thousands of genetic mutations and trying

00:11:25.669 --> 00:11:28.409
to build a phylogenetic tree, a family tree of

00:11:28.409 --> 00:11:32.149
species, UPGMA gives them a beautifully balanced

00:11:32.149 --> 00:11:35.490
average rate of mutation between different genetic

00:11:35.490 --> 00:11:37.970
lineages. It really emphasizes a crucial point

00:11:37.970 --> 00:11:40.950
though. These algorithms aren't just magic objective

00:11:40.950 --> 00:11:43.549
black boxes. Not at all. The human's choice of

00:11:43.549 --> 00:11:46.490
the rule dictates the truth the machine finds.

00:11:47.090 --> 00:11:49.690
If you tell it to prioritize the closest neighbors,

00:11:50.009 --> 00:11:53.009
you get one reality. If you tell it to prioritize

00:11:53.009 --> 00:11:55.649
the overall tightness of the group, you get a

00:11:55.649 --> 00:11:57.809
completely different reality from the exact same

00:11:57.809 --> 00:12:00.190
raw data. It's all about the lens you choose

00:12:00.190 --> 00:12:02.610
to view the data through. So let's say the algorithm

00:12:02.610 --> 00:12:05.149
has crunched the numbers, it's rigorously applied

00:12:05.149 --> 00:12:08.009
our chosen linkage criteria, and it has successfully

00:12:08.009 --> 00:12:10.809
grouped the data. But a machine's mathematical

00:12:10.809 --> 00:12:13.720
output is just raw data. How does it translate

00:12:13.720 --> 00:12:16.259
that into a map that our human brains can actually

00:12:16.259 --> 00:12:19.600
read and use? It generates a visual output called

00:12:19.600 --> 00:12:22.779
a dendrogram. A dendrogram. It's a literal tree

00:12:22.779 --> 00:12:25.519
-like diagram that maps out the entire hierarchy.

00:12:26.259 --> 00:12:28.820
Imagine a tree with its singular root at the

00:12:28.820 --> 00:12:31.480
bottom and its branches reaching up, splitting

00:12:31.480 --> 00:12:34.240
over and over until they terminate in individual

00:12:34.240 --> 00:12:36.340
leaves at the very top. So the leaves are the

00:12:36.340 --> 00:12:39.600
raw data? Exactly. The leaves are your raw, individual

00:12:39.600 --> 00:12:42.700
data points, and the branches show exactly where

00:12:42.700 --> 00:12:45.820
and when they merged. And the y-axis, the vertical

00:12:45.820 --> 00:12:47.980
height of the tree, represents the mathematical

00:12:47.980 --> 00:12:51.340
distance. The higher up a branch merges, the

00:12:51.340 --> 00:12:53.620
further apart those groups were in reality. What

00:12:53.620 --> 00:12:55.679
I love about the dendrogram is that you can take

00:12:55.679 --> 00:12:58.220
a metaphorical pair of hedge clippers and just

00:12:58.220 --> 00:13:00.659
cut the tree horizontally at any height you want.
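(That horizontal cut is exactly what SciPy's fcluster does. A sketch with hand-picked points, three tight pairs spaced far apart, cut once near the root and once near the leaves:)

```python
# Cutting one dendrogram at two heights gives two different clusterings.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0], [0.1, 0.0],
                   [5.0, 0.0], [5.1, 0.0],
                   [10.0, 0.0], [10.1, 0.0]])
tree = linkage(points, method="complete")

high_cut = fcluster(tree, t=1.0, criterion="distance")   # cut near the root
low_cut = fcluster(tree, t=0.05, criterion="distance")   # cut near the leaves

print(len(set(high_cut)))  # 3 broad clusters (the three pairs)
print(len(set(low_cut)))   # 6 singleton clusters
```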

00:13:00.779 --> 00:13:02.700
It's a great feature. If you cut it really high

00:13:02.700 --> 00:13:05.639
up near the root, you get just two or three massive

00:13:05.639 --> 00:13:08.039
broad clusters. But if you move your clippers

00:13:08.039 --> 00:13:10.820
lower and cut down near the leaves, you get many

00:13:10.820 --> 00:13:14.039
smaller, highly specific, tight clusters. You

00:13:14.039 --> 00:13:16.000
get to choose the precise number of clusters

00:13:16.000 --> 00:13:18.200
you want just by drawing a line across the tree.

00:13:18.440 --> 00:13:22.539
It's an incredibly intuitive visual tool. I want

00:13:22.539 --> 00:13:25.529
to circle back to how we actually build that

00:13:25.529 --> 00:13:27.830
tree when we are working top -down. Right, the

00:13:27.830 --> 00:13:30.090
divisive approach. Getting to that dendrogram

00:13:30.090 --> 00:13:33.110
using divisive clustering requires some truly

00:13:33.110 --> 00:13:36.629
clever mechanics to avoid that sun-burning-out

00:13:36.629 --> 00:13:38.769
computation problem we mentioned earlier. The

00:13:38.769 --> 00:13:40.710
universe -ending problem. Yeah, we want to avoid

00:13:40.710 --> 00:13:43.549
that. We touched on the DIANA algorithm, which

00:13:43.549 --> 00:13:46.169
stands for divisive analysis. Okay. If you look

00:13:46.169 --> 00:13:49.210
under the hood at how DIANA actually works, it's

00:13:49.210 --> 00:13:51.509
brilliant. It doesn't so much divide a cluster

00:13:51.509 --> 00:13:55.570
as it hollows it out. Yes. The mechanism behind

00:13:55.570 --> 00:13:58.690
this is amazing. It literally sounds like a political

00:13:58.690 --> 00:14:01.590
rebellion inside the data. A rebellion. Walk

00:14:01.590 --> 00:14:04.210
me through that. Well, think about a giant political

00:14:04.210 --> 00:14:07.279
party. DIANA looks inside this massive cluster

00:14:07.279 --> 00:14:10.279
and finds the one single data point that is the

00:14:10.279 --> 00:14:12.639
most dissimilar to everyone else. The outlier.

00:14:12.840 --> 00:14:15.879
The one object that is deeply unhappy and just

00:14:15.879 --> 00:14:18.659
doesn't fit in. It pops that object out and puts

00:14:18.659 --> 00:14:21.679
it into its own brand new splinter group. Then

00:14:21.679 --> 00:14:23.600
the algorithm looks at all the remaining objects

00:14:23.600 --> 00:14:27.059
in the old cluster and asks, are any of you actually

00:14:27.059 --> 00:14:29.059
more similar to the splinter group than you are

00:14:29.059 --> 00:14:30.960
to the establishment? Then, if they are, they

00:14:30.960 --> 00:14:34.000
migrate over. One data point gets fed up, starts

00:15:34.000 --> 00:15:36.200
the splinter group, and recruits everyone else

00:14:36.200 --> 00:14:40.139
who also wants to leave. And the math governing

00:14:40.139 --> 00:14:42.740
that algorithmic rebellion is fascinating. How

00:14:42.740 --> 00:14:45.340
does it actually calculate the rebellion? DIANA

00:14:45.340 --> 00:14:48.700
uses a specific equation during this migration

00:14:48.700 --> 00:14:52.159
phase, calculating a value called d of i for

00:14:52.159 --> 00:14:55.899
every single object. This equation literally

00:14:55.899 --> 00:14:58.379
measures how strongly an object wants to leave

00:14:58.379 --> 00:15:01.340
its current cluster. It calculates the object's

00:15:01.340 --> 00:15:04.139
average dissimilarity to its current peers and

00:15:04.139 --> 00:15:06.700
subtracts its average dissimilarity to the new

00:15:06.700 --> 00:15:09.500
rebel splinter group. It's algorithmic peer pressure.
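(A simplified Python sketch of that defection score, with made-up 1-D dissimilarities. Note one simplification: in DIANA proper, the average over the current cluster excludes the object itself; this toy version sidesteps that by scoring candidates that sit outside both groups.)

```python
# Toy defection score: positive means the object is closer (on average)
# to the splinter group than to its current cluster, so it migrates.
def avg_dissimilarity(x, group):
    return sum(abs(x - y) for y in group) / len(group)

def defection_score(x, current, splinter):
    return avg_dissimilarity(x, current) - avg_dissimilarity(x, splinter)

current = [1.0, 1.2, 1.4]  # the "establishment" cluster
splinter = [8.0, 8.2]      # the rebel splinter group

print(defection_score(9.0, current, splinter))  # positive: this object defects
print(defection_score(1.1, current, splinter))  # negative: it stays put
```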

00:15:09.679 --> 00:15:13.039
It really is. But here is the genius part. The

00:15:13.039 --> 00:15:15.759
desire to leave is attenuated, it's mathematically

00:15:15.759 --> 00:15:18.480
dampened, if the object wouldn't actually fit

00:15:18.480 --> 00:15:20.500
into the splinter group either. Oh, interesting.

00:15:20.840 --> 00:15:23.799
The equation ensures that an object only defects

00:15:23.799 --> 00:15:27.000
if it truly belongs with the rebels. Otherwise,

00:15:27.259 --> 00:15:29.779
it stays put, perhaps waiting to start its own

00:15:29.779 --> 00:15:31.539
separate splinter group later down the line.

00:15:31.620 --> 00:15:34.360
That is smart. It's a highly sophisticated way

00:15:34.360 --> 00:15:36.960
of letting the internal tensions of the data

00:15:36.960 --> 00:15:39.750
dictate the fractures naturally. So what does

00:15:39.750 --> 00:15:41.990
this all mean for the people actually doing this

00:15:41.990 --> 00:15:44.870
work? This sounds incredibly powerful. I can

00:15:44.870 --> 00:15:47.370
easily see why it's used for analyzing everything

00:15:47.370 --> 00:15:50.649
from social network friend graphs to evolutionary

00:15:50.649 --> 00:15:53.750
biology. It's indispensable. But there is always

00:15:53.750 --> 00:15:56.330
a catch when we do these deep dives. Why isn't

00:15:56.330 --> 00:15:59.789
hierarchical clustering the only tool data scientists

00:15:59.789 --> 00:16:02.789
ever use? If it's this good, why use anything

00:16:02.789 --> 00:16:05.580
else? It comes down to computational cost. Ah,

00:16:05.720 --> 00:16:07.960
back to the math. Always back to the math. We

00:16:07.960 --> 00:16:10.179
talked about how O of 2 to the n is a nightmare

00:16:10.179 --> 00:16:12.799
for divisive clustering, but even the standard

00:16:12.799 --> 00:16:15.100
agglomerative approach, the bottom-up method

00:16:15.100 --> 00:16:17.399
that builds from the leaves to the root, has

00:16:17.399 --> 00:16:20.019
significant limitations. Really? I thought the

00:16:20.019 --> 00:16:22.899
bottom -up was the easy one. Easier, but not

00:16:22.899 --> 00:16:26.120
free. The standard algorithm has a time complexity

00:16:26.120 --> 00:16:29.659
of O of n cubed and requires a memory space of

00:16:29.659 --> 00:16:32.299
omega n squared. Okay, for the non-computer

00:16:32.299 --> 00:16:34.139
scientists listening, putting that in perspective,

00:16:34.759 --> 00:16:37.580
O of n cubed means if you double the amount of

00:16:37.580 --> 00:16:40.879
data, the time it takes the computer to run doesn't

00:16:40.879 --> 00:16:43.559
just double. No, not at all. It takes eight times

00:16:43.559 --> 00:16:45.879
as long. Exactly. If you're wondering why Spotify

00:16:45.879 --> 00:16:48.580
doesn't just recalculate its entire global user

00:16:48.580 --> 00:16:50.940
hierarchy every single time you like a new song,

00:16:51.279 --> 00:16:54.860
this is why. To cluster millions of users hierarchically

00:16:54.860 --> 00:16:57.379
from scratch would literally melt their servers.

00:16:57.679 --> 00:17:00.679
Precisely. And the omega n squared memory requirement

00:17:00.679 --> 00:17:04.240
is also a massive bottleneck. That is quadratic

00:17:04.240 --> 00:17:07.160
growth, not exponential, but it is still huge.

00:17:07.279 --> 00:17:09.480
How huge? It means the memory space the computer

00:17:09.480 --> 00:17:11.720
needs just to hold the distance matrix, just

00:17:11.720 --> 00:17:13.819
to remember how far apart everyone is before

00:17:13.819 --> 00:17:16.440
it even starts grouping them, grows by the square

00:17:16.440 --> 00:17:19.200
of the data. So for 100,000 users, the computer

00:17:19.200 --> 00:17:21.940
has to store the distances between 10 billion

00:17:21.940 --> 00:17:25.200
different pairs. 10 billion. It makes standard

00:17:25.200 --> 00:17:27.660
hierarchical agglomerative clustering entirely

00:17:27.660 --> 00:17:30.559
too slow and too memory heavy for large datasets.
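(The quadratic memory bottleneck is just arithmetic. A quick check using the episode's own numbers, counting entries in a full n-by-n distance matrix:)

```python
# Entries in a full n-by-n distance matrix (about half are unique pairs).
def matrix_entries(n):
    return n * n

print(matrix_entries(1_000))    # a million entries: trivial
print(matrix_entries(100_000))  # ten billion entries: a serious bottleneck
```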

00:17:31.380 --> 00:17:34.079
But human engineers are clever, and we've found

00:17:34.079 --> 00:17:37.849
workarounds. There are optimal, efficient algorithmic

00:17:37.849 --> 00:17:40.849
variations, right? Yes, thankfully. Like SLINK,

00:17:41.009 --> 00:17:43.490
which is optimized specifically for single linkage,

00:17:43.789 --> 00:17:46.710
and CLINK for complete linkage. Right, those

00:17:46.710 --> 00:17:49.069
bring the time complexity down significantly

00:17:49.069 --> 00:17:52.089
to O of n squared. Or data scientists use spatial

00:17:52.089 --> 00:17:54.990
tricks, like quad trees. Yes, quad trees are

00:17:54.990 --> 00:17:57.490
a great optimization. For those trying to visualize

00:17:57.490 --> 00:18:00.250
a quad tree, imagine taking a giant paper map

00:18:00.250 --> 00:18:03.009
of a city full of data points. Instead of measuring

00:18:03.009 --> 00:18:04.869
the distance between every single point,

00:18:05.250 --> 00:18:07.930
you draw a cross over the map, dividing it into

00:18:07.930 --> 00:18:10.710
four large squares. The quadrants. Right. If

00:18:10.710 --> 00:18:12.869
a square is empty, you ignore it. If a square

00:18:12.869 --> 00:18:15.230
has points, you draw another cross inside it,

00:18:15.269 --> 00:18:17.849
making four smaller squares. You recursively

00:18:17.849 --> 00:18:20.670
divide the 2D space into quadrants, which allows

00:18:20.670 --> 00:18:22.849
the computer to instantly find which points are

00:18:22.849 --> 00:18:24.630
in the same general neighborhood. Without having

00:18:24.630 --> 00:18:26.789
to measure the distance to a point on the complete

00:18:26.789 --> 00:18:29.450
opposite side of the map. Exactly. It's a fantastic

00:18:29.450 --> 00:18:32.369
spatial shortcut. It is. But even with quad trees

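To make that recursive subdivision concrete, here is a toy point quadtree in Python (an illustrative sketch, not from the episode; the class name and the capacity threshold of four points per square are assumptions):

```python
# A toy point quadtree over a square region: each node splits into four
# quadrants once it holds more than CAPACITY points. Empty quadrants are
# skipped entirely during a neighborhood query.
CAPACITY = 4

class QuadTree:
    def __init__(self, x, y, w, h):
        self.x, self.y, self.w, self.h = x, y, w, h  # region origin + size
        self.points = []
        self.children = None  # the four sub-quadrants, once split

    def insert(self, px, py):
        if not (self.x <= px < self.x + self.w and
                self.y <= py < self.y + self.h):
            return False  # point lies outside this quadrant
        if self.children is None:
            if len(self.points) < CAPACITY:
                self.points.append((px, py))
                return True
            self._split()  # "draw another cross inside it"
        return any(c.insert(px, py) for c in self.children)

    def _split(self):
        hw, hh = self.w / 2, self.h / 2
        self.children = [
            QuadTree(self.x,      self.y,      hw, hh),
            QuadTree(self.x + hw, self.y,      hw, hh),
            QuadTree(self.x,      self.y + hh, hw, hh),
            QuadTree(self.x + hw, self.y + hh, hw, hh),
        ]
        for p in self.points:  # push existing points down a level
            any(c.insert(*p) for c in self.children)
        self.points = []

    def query(self, qx, qy, qw, qh, found=None):
        """Collect points inside the query rectangle."""
        if found is None:
            found = []
        if (qx >= self.x + self.w or qx + qw <= self.x or
                qy >= self.y + self.h or qy + qh <= self.y):
            return found  # no overlap: ignore this whole quadrant
        for (px, py) in self.points:
            if qx <= px < qx + qw and qy <= py < qy + qh:
                found.append((px, py))
        if self.children:
            for c in self.children:
                c.query(qx, qy, qw, qh, found)
        return found

tree = QuadTree(0, 0, 1, 1)
for p in [(0.1, 0.1), (0.12, 0.15), (0.9, 0.9), (0.8, 0.85), (0.11, 0.12)]:
    tree.insert(*p)
# Only the bottom-left neighborhood is searched; the far corner is skipped.
nearby = tree.query(0.0, 0.0, 0.3, 0.3)
```

The query never measures a distance to the points in the opposite corner of the map: their quadrant fails the overlap test and is pruned in one step.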
00:18:32.369 --> 00:18:36.039
, SLINK, and CLINK, there is another more fundamental

00:18:36.039 --> 00:18:38.339
quirk to the math that data scientists have to

00:18:38.339 --> 00:18:40.779
watch out for, depending on the linkage criteria

00:18:40.779 --> 00:18:44.319
they choose. Oh, another catch. It's a structural

00:18:44.319 --> 00:18:47.119
problem called reversals, or departures from

00:18:47.119 --> 00:18:50.059
ultrametricity. Departures from ultrametricity.

00:18:50.079 --> 00:18:52.700
That sounds like a sci -fi novel. Yeah. What

00:18:52.700 --> 00:18:56.319
exactly is a reversal? Let's say we are using

00:18:56.319 --> 00:18:58.740
centroid linkage, where we measure the distance

00:18:58.740 --> 00:19:01.220
between the literal center of mass of two groups.

00:19:01.319 --> 00:19:04.440
Okay. Can a reversal mean the algorithm basically

00:19:04.440 --> 00:19:07.099
contradicts itself? That is exactly what it does,

00:19:07.240 --> 00:19:09.660
and it is a major headache. If we connect this

00:19:09.660 --> 00:19:11.980
to the bigger picture, the entire point of the

00:19:11.980 --> 00:19:14.599
dendrogram tree is that vertical height represents

00:19:14.599 --> 00:19:17.160
distance and time. Right. The higher up, the

00:19:17.160 --> 00:19:19.420
further apart. As you move up the tree toward

00:19:19.420 --> 00:19:21.980
the root, you are merging clusters that are further

00:19:21.980 --> 00:19:25.339
and further apart. A reversal fundamentally breaks

00:19:25.339 --> 00:19:28.200
that physical logic. How so? Imagine drawing

00:19:28.200 --> 00:19:31.059
a standard family tree. A reversal is the mathematical

00:19:31.059 --> 00:19:32.920
equivalent of drawing the tree and somehow finding

00:19:32.920 --> 00:19:35.519
out that a grandfather is chronologically younger

00:19:35.519 --> 00:19:37.799
than his own grandson. The branches cross back

00:19:37.799 --> 00:19:41.640
over themselves. The algorithm merges two clusters

00:19:41.640 --> 00:19:44.660
high up on the tree, but the mathematical distance

00:19:44.660 --> 00:19:46.859
between them is somehow smaller than a merge

00:19:46.859 --> 00:19:48.980
it already completed lower down on the branches.

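The reversal described here can be reproduced with nothing more than an equilateral triangle (a minimal pure-Python sketch, not from the episode; the merge order follows the agglomerative procedure discussed above):

```python
import math

# Three points at the corners of an equilateral triangle with side 1.
pts = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# First agglomerative merge: the closest pair (every pair is 1.0 apart).
first_height = dist(pts[0], pts[1])  # 1.0

# Centroid linkage now measures from the merged pair's center of mass,
# which sits at the midpoint of the first two points.
centroid = ((pts[0][0] + pts[1][0]) / 2, (pts[0][1] + pts[1][1]) / 2)
second_height = dist(centroid, pts[2])  # sqrt(3)/2, about 0.866

# The LATER merge happens at a LOWER height: a reversal (inversion).
print(second_height < first_height)  # True
```

The second merge sits below the first on the dendrogram even though it happened later, which is exactly the grandfather-younger-than-grandson picture.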
00:19:49.119 --> 00:19:52.000
Oh wow. It implies a kind of mathematical time

00:19:52.000 --> 00:19:54.490
travel that breaks the visual layout. That seems

00:19:54.490 --> 00:19:56.589
like it would totally break your ability to just

00:19:56.589 --> 00:19:58.890
take your hedge clippers and cut the tree horizontally,

00:19:59.390 --> 00:20:01.690
because a single straight line wouldn't represent

00:20:01.690 --> 00:20:05.029
a consistent distance anymore. Exactly. It completely

00:20:05.029 --> 00:20:08.269
destroys the clean monotonic hierarchy we rely

00:20:08.269 --> 00:20:11.210
on to interpret the data, and it forces data

00:20:11.210 --> 00:20:13.390
scientists to be highly intentional. You have

00:20:13.390 --> 00:20:16.210
to pay attention. You can't just blindly apply

00:20:16.210 --> 00:20:19.660
a linkage criterion, like centroid linkage, without

00:20:19.660 --> 00:20:21.720
understanding the mechanics beneath it, because

00:20:21.720 --> 00:20:23.819
it might warp the very structure of the reality

00:20:23.819 --> 00:20:26.240
you're trying to visualize. It really emphasizes

00:20:26.240 --> 00:20:29.829
that you have to know your tools. Every choice

00:20:29.829 --> 00:20:32.390
from the distance metric to the linkage rule

00:20:32.390 --> 00:20:35.349
changes the final outcome. So to bring this all

00:20:35.349 --> 00:20:37.789
together for you listening, hierarchical clustering

00:20:37.789 --> 00:20:40.970
gives us a profound mathematical way to uncover

00:20:40.970 --> 00:20:43.809
the hidden structures of the world. By calculating

00:20:43.809 --> 00:20:46.369
nothing but the distances between things, whether

00:20:46.369 --> 00:20:48.650
those are crime scenes, DNA strands, or people

00:20:48.650 --> 00:20:51.490
in a room, we can build up intricate communities

00:20:51.490 --> 00:20:54.170
from the bottom up using agglomerative clustering.

00:20:54.650 --> 00:20:58.150
Or we can carve up massive, chaotic data sets

00:20:58.089 --> 00:21:01.490
from the top down using divisive algorithms like

00:21:01.490 --> 00:21:03.900
DIANA. We can force the data to show us long

00:21:03.900 --> 00:21:06.539
connected networks with single linkage or tight

00:21:06.539 --> 00:21:09.299
spherical groupings with complete linkage. It

00:21:09.299 --> 00:21:12.220
takes that terrifying wall of raw confusing data

00:21:12.220 --> 00:21:14.440
and draws a family tree that our human brains

00:21:14.440 --> 00:21:16.940
can actually comprehend. It brings incredible

00:21:16.940 --> 00:21:20.920
order to chaos. And yet there is one highly specific

00:21:20.920 --> 00:21:23.380
mechanism in the agglomerative process that I

00:21:23.380 --> 00:21:25.839
think leaves us with an incredibly profound question.

00:21:25.900 --> 00:21:28.400
What's that? During the bottom up grouping process,

00:21:28.640 --> 00:21:30.559
there can be a tie in the minimum distance. OK, meaning

00:21:30.559 --> 00:21:32.559
two completely different pairs of points are

00:21:32.559 --> 00:21:34.900
the exact same distance apart down to the decimal.

00:21:35.240 --> 00:21:37.660
Exactly. The algorithm has to choose which pair

00:21:37.660 --> 00:21:40.359
to merge first. And to do that, it might simply

00:21:40.359 --> 00:21:43.119
choose a pair randomly to break the tie. A literal

00:21:43.119 --> 00:21:45.920
coin flip inside the machine. A coin flip. But

00:21:45.920 --> 00:21:48.400
because every subsequent merge depends on the

00:21:48.400 --> 00:21:51.720
groups formed before it, that single random choice

00:21:51.720 --> 00:21:54.740
generates structurally different dendrograms.

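A tiny example makes the tie-break effect visible (an illustrative sketch, not from the episode; instead of an actual coin flip, the `pick` policy deterministically plays out both outcomes):

```python
# Toy bottom-up (agglomerative) single-linkage clustering of 1-D points.
# `pick` is the tie-breaking policy: a real implementation might choose
# randomly, so here we simply simulate both sides of the "coin flip".

def single_linkage(points, pick):
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        def gap(ca, cb):  # single-linkage distance between two clusters
            return min(abs(a - b) for a in ca for b in cb)
        pairs = [(ca, cb) for i, ca in enumerate(clusters)
                 for cb in clusters[i + 1:]]
        best = min(gap(ca, cb) for ca, cb in pairs)
        tied = [p for p in pairs if gap(*p) == best]
        ca, cb = tied[pick(len(tied))]   # the tie-break decides the merge
        clusters.remove(ca)
        clusters.remove(cb)
        clusters.append(ca | cb)
        merges.append((ca, cb))
    return tuple(merges)

# Three points on a line: d(0,1) == d(1,2) == 1.0, a perfect tie.
data = [0.0, 1.0, 2.0]
tree_a = single_linkage(data, lambda n: 0)      # "heads": merge 0 and 1 first
tree_b = single_linkage(data, lambda n: n - 1)  # "tails": merge 1 and 2 first
print(tree_a != tree_b)  # True: same data, same rules, different trees
```

One tie-break at the very first step cascades: every subsequent merge sees different clusters, so the two dendrograms disagree on the entire structure.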
00:21:54.960 --> 00:21:58.440
The entire final tree changes shape. Wow. One

00:21:58.440 --> 00:22:00.380
tiebreaker changes the whole family tree. It

00:22:00.380 --> 00:22:02.059
raises an important question, something for you

00:22:02.059 --> 00:22:05.039
to really ponder after this deep dive. If the

00:22:05.039 --> 00:22:08.180
exact same raw data analyzed by the exact same

00:22:08.180 --> 00:22:10.759
logical rules can yield completely different

00:22:10.759 --> 00:22:13.720
, structurally distinct hierarchies just

00:22:13.720 --> 00:22:15.759
because a machine flipped a coin to break a tie,

00:22:16.660 --> 00:22:18.740
how much of the order we see in the digital world

00:22:18.740 --> 00:22:21.359
is objectively real? And how much of it is just

00:22:21.359 --> 00:22:24.339
a random byproduct of an algorithm trying to

00:22:24.339 --> 00:22:27.480
force a messy, tied up reality into a neat, clean

00:22:27.480 --> 00:22:30.059
tree? That is heavy. Next time you walk into

00:22:30.059 --> 00:22:31.700
a room full of strangers, just think about how

00:22:31.700 --> 00:22:33.559
many radically different ways a computer could

00:22:33.559 --> 00:22:35.680
organize your lives, all depending on a single

00:22:35.680 --> 00:22:38.039
digital coin toss. Thanks for taking this deep

00:22:38.039 --> 00:22:38.680
dive with us.
