WEBVTT

00:00:00.000 --> 00:00:03.060
You know, sometimes you look at your digital

00:00:03.060 --> 00:00:06.080
life and it... It honestly feels like there's

00:00:06.080 --> 00:00:08.640
this invisible sorting hat just quietly putting

00:00:08.640 --> 00:00:10.939
everything into neat little boxes. Oh, absolutely.

00:00:11.240 --> 00:00:14.800
Yeah, it creates this, um, this illusion of intuition,

00:00:14.800 --> 00:00:17.079
I guess. Right, like you take a hundred photos

00:00:17.079 --> 00:00:19.500
on a weekend trip and suddenly your phone is

00:00:19.500 --> 00:00:21.760
perfectly grouped, you know, all the pictures

00:00:21.760 --> 00:00:25.019
of your dog into one distinct folder. Yeah, or

00:00:25.019 --> 00:00:27.579
you log onto a retailer's website and it feels

00:00:27.579 --> 00:00:30.079
like they know exactly which specific demographic

00:00:30.079 --> 00:00:32.659
bucket you fall into. It's unnerving, like they're

00:00:32.659 --> 00:00:35.119
pitching you products that are just way too accurate

00:00:35.119 --> 00:00:37.140
to your tastes. It really makes you feel like

00:00:37.140 --> 00:00:40.299
the machine just naturally understands the

00:00:40.299 --> 00:00:42.880
nuanced shape of your life, you know, your habits,

00:00:43.079 --> 00:00:45.420
your preferences. But I mean, it's not magic.

00:00:45.619 --> 00:00:48.799
It's geometry. And today we are taking a deep

00:00:48.799 --> 00:00:51.659
dive into the very blueprint of that geometry.

00:00:52.119 --> 00:00:54.600
It's a fascinating subject. It really is. We're

00:00:54.600 --> 00:00:56.859
pulling from this comprehensive Wikipedia deep

00:00:56.859 --> 00:01:01.009
dive on a mathematical workhorse called k-means

00:01:01.009 --> 00:01:03.170
clustering. Yeah, which is basically the engine

00:01:03.170 --> 00:01:05.829
behind so much of that sorting. Exactly. And

00:01:05.829 --> 00:01:08.549
our mission today is to figure out exactly how

00:01:08.549 --> 00:01:11.650
this invisible algorithm forces order onto our

00:01:11.650 --> 00:01:15.890
chaotic lives, translate its incredibly dense

00:01:15.890 --> 00:01:19.269
math into a clear mental model, and just reveal

00:01:19.269 --> 00:01:21.909
how it physically organizes the digital world

00:01:21.909 --> 00:01:25.750
around you. It's a phenomenal topic because it

00:01:25.750 --> 00:01:28.629
really exposes the tension between raw, messy

00:01:28.629 --> 00:01:31.510
human data and the machine's desperate need for

00:01:31.510 --> 00:01:34.269
order. Yes. The machine needs its boxes. Exactly.

00:01:34.549 --> 00:01:37.310
You are taking a chaotic universe of information

00:01:37.310 --> 00:01:40.689
and commanding the math to draw hard, rigid boundaries

00:01:40.689 --> 00:01:43.209
just to make sense of it all. OK, let's unpack

00:01:43.209 --> 00:01:45.250
this because before we can understand where this

00:01:45.250 --> 00:01:47.010
algorithm is secretly operating in your life,

00:01:47.090 --> 00:01:49.250
whether it's grouping your photos or targeting

00:01:49.250 --> 00:01:51.129
your ads, we really have to understand what it's

00:01:51.129 --> 00:01:53.569
actually doing to data on a fundamental level. Right,

00:01:53.750 --> 00:01:56.209
the mechanics of it. Yeah. Let's start with that

00:01:56.209 --> 00:01:58.489
name. K-means clustering. I mean, the jargon

00:01:58.489 --> 00:02:00.890
is thick right out of the gate. It sounds intimidating

00:02:00.890 --> 00:02:03.950
for sure, but it's surprisingly literal once

00:02:03.950 --> 00:02:06.150
you strip away the math speak. Okay, lay it on

00:02:06.150 --> 00:02:09.169
me. So the K is simply a placeholder variable.

00:02:09.409 --> 00:02:12.550
It literally just stands for the exact number

00:02:12.729 --> 00:02:15.789
of clusters or groups that you wanna sort your

00:02:15.789 --> 00:02:18.810
data into. And you, the human operator, have

00:02:18.810 --> 00:02:22.069
to decide K. Like if you have a massive database

00:02:22.069 --> 00:02:24.550
of customers and you wanna sort them into, say,

00:02:25.030 --> 00:02:27.810
five distinct groups for five different email

00:02:27.810 --> 00:02:30.669
campaigns. Then K equals five. Exactly, K equals

00:02:30.669 --> 00:02:33.009
five. So K is just the number of boxes we are

00:02:33.009 --> 00:02:37.270
forcing the data into. And the means part, I'm

00:02:37.270 --> 00:02:39.090
assuming that's not about the algorithm being

00:02:39.090 --> 00:02:42.340
like... aggressive or mean? No, not quite. Means

00:02:42.340 --> 00:02:44.780
refers to the mathematical center point of those

00:02:44.780 --> 00:02:47.240
groups, the centroid. Centroid, that sounds like

00:02:47.240 --> 00:02:49.759
a transformer. It kind of does, but if you imagine

00:02:49.759 --> 00:02:52.560
like plotting a scatter plot of a thousand data

00:02:52.560 --> 00:02:56.580
points on a graph, the mean is the literal physical

00:02:56.580 --> 00:02:59.419
center of mass for a specific group of those

00:02:59.419 --> 00:03:01.219
points. So it's like the geographic heart of

00:03:01.219 --> 00:03:02.879
the cluster. That's a perfect way to put it,

00:03:02.960 --> 00:03:04.939
yes. And the way the algorithm actually finds

00:03:04.939 --> 00:03:07.479
these geographical hearts is through what the

00:03:07.479 --> 00:03:09.360
source material calls the standard algorithm,

00:03:09.530 --> 00:03:12.469
which is also known as Lloyd's algorithm. Right.

00:03:12.550 --> 00:03:14.969
Named after Stuart Lloyd. Yeah. And it's essentially

00:03:14.969 --> 00:03:17.909
this continuous iterative two-step dance. You

00:03:17.909 --> 00:03:19.930
have step one, which is the assignment step,

00:03:20.289 --> 00:03:23.289
and step two, the update step. Let's visualize

00:03:23.289 --> 00:03:25.490
that assignment step first because it's pretty

00:03:25.490 --> 00:03:29.349
wild. The algorithm takes your K number. Let's

00:03:29.349 --> 00:03:32.080
say you chose three. And it literally throws

00:03:32.080 --> 00:03:35.139
down three center points onto your graph of data.

00:03:35.139 --> 00:03:37.659
Just drops them in randomly. Like, boom, boom,

00:03:37.699 --> 00:03:40.960
boom. Exactly. Then it looks at every single

00:03:40.960 --> 00:03:43.680
piece of data on that board and assigns it to

00:03:43.680 --> 00:03:45.840
whichever of those three center points it is

00:03:45.840 --> 00:03:48.599
physically closest to. And what happens mathematically

00:03:48.599 --> 00:03:51.520
when it does that is absolutely wild. Like, the

00:03:51.520 --> 00:03:54.419
source notes that this assignment process mathematically

00:03:54.419 --> 00:03:57.379
partitions the entire data space into geometric

00:03:57.379 --> 00:04:00.870
regions called Voronoi cells. Which, if you've

00:04:00.870 --> 00:04:03.629
never seen a Voronoi diagram, picture a beautiful

00:04:03.629 --> 00:04:05.849
stained glass window. That stained glass visual

00:04:05.849 --> 00:04:07.969
is perfect, actually, because it highlights the

00:04:07.969 --> 00:04:11.319
rigidity of the math. When the algorithm assigns

00:04:11.319 --> 00:04:14.259
a data point to its closest center, it is effectively

00:04:14.259 --> 00:04:18.019
drawing a hard microscopic boundary line exactly

00:04:18.019 --> 00:04:20.680
halfway between two competing centers. Oh, wow.

00:04:20.759 --> 00:04:23.360
Yeah, it creates these polygonal tiles that fit

00:04:23.360 --> 00:04:26.680
perfectly together with zero overlap. Like you

00:04:26.680 --> 00:04:29.220
are either inside the glass tile for center A,

00:04:29.540 --> 00:04:31.360
or you're inside the glass tile for center B.

00:04:31.360 --> 00:04:33.860
There is no gray area. But the algorithm doesn't

00:04:33.860 --> 00:04:36.779
just stop once it draws that stained glass window.
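That hard halfway boundary is nothing more than a nearest-center comparison. A tiny Python sketch of the idea (the function name is ours, not the source's):

```python
def nearest_center(point, centers):
    # A point belongs to the Voronoi cell of whichever center is closest;
    # comparing squared Euclidean distances is enough to decide.
    dists = [sum((p - c) ** 2 for p, c in zip(point, center))
             for center in centers]
    return dists.index(min(dists))

# Two competing centers at x=0 and x=4: the hard boundary sits at x=2,
# exactly halfway between them.
centers = [(0.0, 0.0), (4.0, 0.0)]
```

A point at (1.9, 0) falls in cell 0 and one at (2.1, 0) in cell 1; there is no gray area, and a point sitting exactly on the line simply goes to the first center by tie-break.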

00:04:37.199 --> 00:04:40.329
That brings us to step two, the update step.

00:04:40.529 --> 00:04:42.810
Right, because the center has changed. Exactly.

00:04:43.110 --> 00:04:45.009
Because you just assigned a whole bunch of random

00:04:45.009 --> 00:04:47.930
data points to a cluster, the actual literal

00:04:47.930 --> 00:04:51.110
center of mass of that new group has probably

00:04:51.110 --> 00:04:52.970
shifted away from where you originally dropped

00:04:52.970 --> 00:04:55.009
your center point. It definitely has. So the

00:04:55.009 --> 00:04:57.569
algorithm has to recalculate the mean. It literally

00:04:57.569 --> 00:05:01.209
drags the center point to the true middle of

00:05:01.209 --> 00:05:04.509
its newly assigned cluster. Which, um... immediately

00:05:04.509 --> 00:05:06.769
invalidates the stained glass window you just

00:05:06.769 --> 00:05:09.569
drew. Exactly. I was trying to wrap my head around

00:05:09.569 --> 00:05:11.610
this endless loop of assigning and updating,

00:05:12.129 --> 00:05:14.589
and I came up with an analogy. Well, let's hear

00:05:14.589 --> 00:05:17.410
it. Okay. Imagine you are hosting this incredibly

00:05:17.410 --> 00:05:21.569
chaotic dinner party in a massive open ballroom.

00:05:21.589 --> 00:05:24.829
Okay. And you need to sort your hundreds of milling

00:05:24.829 --> 00:05:27.930
guests into K number of tables. Okay. Let's stick

00:05:27.930 --> 00:05:30.329
with K equals three. Three giant round tables.

00:05:30.509 --> 00:05:33.620
A true logistical nightmare for any host. Total

00:05:33.620 --> 00:05:36.079
nightmare. People are just standing around randomly.

00:05:36.800 --> 00:05:39.639
So you physically drop three tables in random

00:05:39.639 --> 00:05:42.079
spots across the ballroom. Just drag them out

00:05:42.079 --> 00:05:44.720
there. Yeah. Then you get on a microphone and

00:05:44.720 --> 00:05:47.100
tell everyone, hey, go sit at the table that

00:05:47.100 --> 00:05:49.779
you are currently standing closest to. That is

00:05:49.779 --> 00:05:51.680
our assignment step. And everyone scrambles,

00:05:51.699 --> 00:05:53.620
and the room is instantly divided into three

00:05:53.620 --> 00:05:56.680
distinct crowds based purely on proximity. Right.

00:05:56.879 --> 00:05:59.379
But once everyone sits down, you look around

00:05:59.379 --> 00:06:02.220
and realize the conversational energy is completely

00:06:02.220 --> 00:06:05.220
off-center. Naturally. Like the actual dense

00:06:05.220 --> 00:06:08.720
core of the crowd at table one is heavily skewed,

00:06:09.379 --> 00:06:11.240
say 20 feet to the left of where the physical

00:06:11.240 --> 00:06:13.100
table actually is. Okay, I see where this is

00:06:13.100 --> 00:06:15.220
going. So to make sure everyone can hear each

00:06:15.220 --> 00:06:17.980
other, you grab the physical table and you literally

00:06:17.980 --> 00:06:20.939
drag it across the floor to the exact center

00:06:20.939 --> 00:06:23.079
of where that specific group is currently sitting.

00:06:23.620 --> 00:06:26.360
You update the center. But by dragging table

00:06:26.360 --> 00:06:30.079
one 20 feet to the left, you just changed the geography

00:06:30.079 --> 00:06:32.699
of the entire ballroom. Exactly. Because you

00:06:32.699 --> 00:06:35.120
moved the tables, a guest who was sitting on

00:06:35.120 --> 00:06:37.639
the edge of table one might suddenly look around

00:06:37.639 --> 00:06:40.980
and realize, wait. Because table two also got

00:06:40.980 --> 00:06:43.439
dragged over this way. I'm actually physically

00:06:43.439 --> 00:06:46.240
closer to table two now. Right, the boundary shifted.

00:06:46.240 --> 00:06:49.519
Yeah, so they stand up, abandon table one, and

00:06:49.519 --> 00:06:52.220
walk over to table two. They are reassigning themselves

00:06:52.220 --> 00:06:54.779
based on the new geometry. Yes, and you just keep

00:06:54.779 --> 00:06:56.920
doing this over and over, people reassigning themselves

00:06:56.920 --> 00:06:59.579
to the closest table and you dragging the physical

00:06:59.579 --> 00:07:01.600
tables to the new center of whatever the group

00:07:01.600 --> 00:07:04.779
looks like. Until it stabilizes? Exactly, you repeat

00:07:04.779 --> 00:07:07.500
this exhausting dance until finally the tables

00:07:07.500 --> 00:07:10.620
stop moving and absolutely nobody needs to

00:07:10.620 --> 00:07:13.209
stand up and switch tables anymore. That analogy

00:07:13.209 --> 00:07:16.670
captures the mechanics flawlessly. And it actually

00:07:16.670 --> 00:07:19.589
brings us to the underlying math that drives

00:07:19.589 --> 00:07:22.430
all that table moving. Oh, boy. OK, bring on

00:07:22.430 --> 00:07:24.610
the math. Don't worry, it's not too bad. The

00:07:24.610 --> 00:07:26.850
algorithm's ultimate objective is built on a

00:07:26.850 --> 00:07:29.209
concept called within-cluster sum of squares,

00:07:29.629 --> 00:07:33.029
or WCSS. OK, within-cluster sum of squares sounds

00:07:33.029 --> 00:07:35.370
like a terrifying question on a graduate level

00:07:35.370 --> 00:07:38.149
math exam. It does, yeah. But your dinner party

00:07:38.149 --> 00:07:41.350
actually decodes it perfectly. The algorithm's

00:07:41.350 --> 00:07:44.639
entire underlying goal is to minimize variance.
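That two-step dance and the variance it is trying to shrink fit in a few lines. A bare-bones Python sketch of Lloyd's loop (it assumes no cluster ever loses all its points, an edge case real implementations have to guard):

```python
def lloyd(points, centers, max_iter=100):
    # Bare-bones Lloyd's algorithm: assign, update, repeat until stable.
    labels = []
    for _ in range(max_iter):
        # Assignment step: every point joins its closest center.
        labels = [min(range(len(centers)),
                      key=lambda j: sum((p - c) ** 2
                                        for p, c in zip(pt, centers[j])))
                  for pt in points]
        # Update step: drag each center to the mean of its new members.
        new_centers = []
        for j in range(len(centers)):
            members = [pt for pt, l in zip(points, labels) if l == j]
            new_centers.append(tuple(sum(x) / len(members)
                                     for x in zip(*members)))
        if new_centers == list(centers):  # nobody switched tables: done
            break
        centers = new_centers
    return centers, labels

def wcss(points, centers, labels):
    # Within-cluster sum of squares: every point's squared distance to
    # its own cluster's center, summed -- the objective being minimized.
    return sum(sum((p - c) ** 2 for p, c in zip(pt, centers[l]))
               for pt, l in zip(points, labels))

# Two obvious blobs; start one center in each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = lloyd(points, [(0.0, 0.0), (10.0, 10.0)])
```

Here the centers settle onto the two blobs' means within a couple of iterations, and the WCSS comes out to about 2.67, the tightest grouping available for this toy data.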

00:07:45.819 --> 00:07:47.920
It wants to make sure that everyone sitting at

00:07:47.920 --> 00:07:50.779
a given table is as physically close to the center

00:07:50.779 --> 00:07:53.420
of their table as mathematically possible, and

00:07:53.420 --> 00:07:56.100
it calculates this using what are called squared

00:07:56.100 --> 00:07:59.079
Euclidean distances. Why are we squaring the

00:07:59.079 --> 00:08:01.860
distance? If I am five feet away from the table,

00:08:01.939 --> 00:08:04.680
why does the math need to square that to 25? Because

00:08:04.680 --> 00:08:08.379
squaring heavily penalizes outliers. Oh, interesting.

00:08:08.560 --> 00:08:10.959
Yeah. Think about your dinner party. If a guest

00:08:10.959 --> 00:08:12.779
is sitting two feet away from the center of the

00:08:12.779 --> 00:08:15.980
table, two squared is four penalty points. The

00:08:15.980 --> 00:08:17.699
algorithm doesn't mind that too much. Right.

00:08:17.740 --> 00:08:20.360
Four is a small penalty. But if a guest is stranded

00:08:20.360 --> 00:08:22.519
way out in the hallway, 10 feet away from the

00:08:22.519 --> 00:08:26.079
table, 10 squared is 100 penalty points. Oh,

00:08:26.120 --> 00:08:29.540
wow. So it ramps up fast. Exactly. The math panics.

00:08:29.620 --> 00:08:33.039
It hates large numbers. So it will... aggressively

00:08:33.039 --> 00:08:36.360
drag the table toward that stranded guest to

00:08:36.360 --> 00:08:39.000
minimize that massive squared penalty. Okay,

00:08:39.000 --> 00:08:41.059
that makes total sense. Yeah, minimizing the

00:08:41.059 --> 00:08:44.460
WCSS simply means the algorithm has manipulated

00:08:44.460 --> 00:08:47.460
the room until it found the absolute tightest,

00:08:47.460 --> 00:08:50.779
most compact groups possible. It is so elegantly

00:08:50.779 --> 00:08:52.960
simple when you visualize the mechanics like

00:08:52.960 --> 00:08:56.080
that. But, you know, knowing how elegantly it

00:08:56.080 --> 00:08:58.700
forces order onto a room really makes you wonder

00:08:58.700 --> 00:09:00.740
who actually came up with this. Yeah, the history

00:09:00.740 --> 00:09:03.509
is fascinating. And more importantly, why, despite

00:09:03.509 --> 00:09:06.149
being run on literal supercomputers today, it

00:09:06.149 --> 00:09:08.389
almost never finds the perfect mathematical answer.

00:09:08.649 --> 00:09:10.990
Well, the history here is surprisingly old for

00:09:10.990 --> 00:09:12.950
something that serves as a foundational pillar

00:09:12.950 --> 00:09:15.090
of modern machine learning. How old are we talking?

00:09:15.250 --> 00:09:17.269
The standard algorithm was first proposed by

00:09:17.269 --> 00:09:19.409
a researcher named Stuart Lloyd at Bell Labs

00:09:19.409 --> 00:09:23.590
all the way back in 1957. 1957. Yeah. Though

00:09:23.590 --> 00:09:25.470
interestingly, he didn't formally publish it

00:09:25.470 --> 00:09:28.429
in a journal until 1982. What was he even trying

00:09:28.429 --> 00:09:31.269
to do in 1957 that required sorting data like

00:09:31.269 --> 00:09:33.230
this? I mean, they barely had computers. Right,

00:09:33.289 --> 00:09:35.649
he was working on pulse code modulation. Okay,

00:09:35.769 --> 00:09:38.220
what is that? Basically, this is the early mathematics

00:09:38.220 --> 00:09:41.240
of taking a smooth, continuous analog signal

00:09:41.240 --> 00:09:43.840
like a human voice traveling over a telephone

00:09:43.840 --> 00:09:47.179
wire and chopping it up into discrete, manageable

00:09:47.179 --> 00:09:50.139
digital chunks. Ah, so turning sound into data.

00:09:50.399 --> 00:09:53.299
Exactly. He needed a mathematical way to find

00:09:53.299 --> 00:09:55.440
the most representative means of those sound

00:09:55.440 --> 00:09:58.340
waves to digitize them efficiently. And it wasn't

00:09:58.340 --> 00:10:00.899
even called k-means back then, right? The source

00:10:00.899 --> 00:10:03.399
says James MacQueen finally coined the term

00:10:03.399 --> 00:10:06.480
k-means a decade later in 1967. That is correct.

00:10:07.039 --> 00:10:10.299
But here is the massive catch with Stuart Lloyd's

00:10:10.299 --> 00:10:11.860
brilliant algorithm. I knew there was a catch.

00:10:12.190 --> 00:10:14.889
Finding the absolute mathematically perfect optimal

00:10:14.889 --> 00:10:18.370
solution for k-means, meaning the guaranteed absolute

00:10:18.370 --> 00:10:20.889
lowest possible variance across your entire data

00:10:20.889 --> 00:10:23.509
set is what computer scientists call NP-hard.

00:10:23.730 --> 00:10:25.970
NP-hard. Which translates to what, practically

00:10:25.970 --> 00:10:28.289
speaking? Just incredibly difficult to solve.

00:10:28.509 --> 00:10:31.289
Beyond difficult, it implies a combinatorial

00:10:31.289 --> 00:10:33.990
explosion. A combinatorial explosion. OK. Even

00:10:33.990 --> 00:10:36.070
if you just have a simple two-dimensional flat

00:10:36.070 --> 00:10:38.990
graph with a few dozen data points, checking

00:10:38.990 --> 00:10:42.009
every single possible cluster combination to

00:10:42.009 --> 00:10:44.850
guarantee you have the absolute best one can

00:10:44.850 --> 00:10:48.179
take exponential time. Wait, seriously? Just

00:10:48.179 --> 00:10:51.259
for a few dozen points. Imagine trying to arrange

00:10:51.259 --> 00:10:53.960
a seating chart for a wedding with 10,000 guests,

00:10:54.299 --> 00:10:56.659
and you have to mathematically check every single

00:10:56.659 --> 00:10:59.539
possible combination of who sits where before

00:10:59.539 --> 00:11:01.340
you make a decision. That sounds like actual

00:11:01.340 --> 00:11:04.659
torture. It is. As you add more dimensions or

00:11:04.659 --> 00:11:06.919
more clusters, it would literally take the age

00:11:06.919 --> 00:11:09.039
of the universe to compute the perfect answer.
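That combinatorial explosion is easy to make concrete. The number of ways to split n points into k non-empty clusters is a Stirling number of the second kind, and it blows up almost immediately. A quick sketch (`partitions` is our name for the standard recurrence):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def partitions(n, k):
    # Stirling number of the second kind S(n, k): the number of ways to
    # split n labeled points into k non-empty clusters. Recurrence: the
    # n-th point either joins one of k existing clusters or opens a new one.
    if k == 0:
        return 1 if n == 0 else 0
    if n < k:
        return 0
    return k * partitions(n - 1, k) + partitions(n - 1, k - 1)
```

Even `partitions(25, 3)` already comes to 141,197,991,025 — over 141 billion candidate groupings for a mere 25 points and 3 clusters, which is why checking them all is off the table.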

00:11:09.259 --> 00:11:11.679
Wait, I have to push back here. Sure. If finding

00:11:11.679 --> 00:11:14.639
the perfect, mathematically true answer takes

00:11:14.639 --> 00:11:16.960
an eternity, even for modern cloud computing

00:11:16.960 --> 00:11:20.399
servers, why is this algorithm so overwhelmingly

00:11:20.399 --> 00:11:22.980
popular? Why do we use it to compress images

00:11:22.980 --> 00:11:25.480
and target marketing if it can't actually guarantee

00:11:25.480 --> 00:11:29.080
the right answer? That is the brilliant, pragmatic

00:11:29.080 --> 00:11:31.639
compromise of k-means. It relies heavily on

00:11:31.639 --> 00:11:34.960
heuristics. Heuristics. Rule-of-thumb stuff. Exactly.

00:11:35.700 --> 00:11:38.440
Because finding the global optimum, the absolute

00:11:38.440 --> 00:11:40.879
best answer in the universe, is computationally

00:11:40.879 --> 00:11:43.899
impossible, the algorithm happily settles for

00:11:43.899 --> 00:11:46.460
a local optimum. A local optimum. Yeah. It finds

00:11:46.460 --> 00:11:48.940
an answer that is good enough for its immediate

00:11:48.940 --> 00:11:51.419
surroundings, and then it simply stops looking.

00:11:51.639 --> 00:11:53.419
It's the good enough algorithm. It literally

00:11:53.419 --> 00:11:55.360
just throws its hands up and says, hey, this

00:11:55.360 --> 00:11:57.720
was fine. Let's go to lunch. If we connect this

00:11:57.720 --> 00:12:00.320
to the bigger picture, that is exactly what you

00:12:00.320 --> 00:12:03.649
want in the real world of massive petabyte-scale

00:12:03.649 --> 00:12:06.330
data sets. You just need it to work fast. Right.

00:12:06.330 --> 00:12:09.710
You don't usually need the perfect fundamental

00:12:09.710 --> 00:12:12.330
mathematical truth of the universe. You just

00:12:12.330 --> 00:12:14.710
need a highly functional grouping and you need

00:12:14.710 --> 00:12:17.710
it in three seconds. Makes sense. Lloyd's algorithm

00:12:17.710 --> 00:12:20.529
is famous because it usually converges on a local

00:12:20.529 --> 00:12:23.909
optimum in just a dozen or so iterations of that

00:12:23.909 --> 00:12:26.929
table moving dance. It is incredibly ruthlessly

00:12:26.929 --> 00:12:29.750
efficient. But OK, if it is settling for a local

00:12:29.750 --> 00:12:33.039
optimum, a good enough answer based on its immediate

00:12:33.039 --> 00:12:36.500
surroundings, then where it starts, the search

00:12:36.500 --> 00:12:39.440
must matter immensely. Oh, absolutely. Going

00:12:39.440 --> 00:12:42.139
back to the dinner party, if I accidentally drop

00:12:42.139 --> 00:12:44.600
all three tables bunched up in the far right

00:12:44.600 --> 00:12:47.200
corner of the ballroom, the final grouping of

00:12:47.200 --> 00:12:49.600
guests is going to look vastly different than

00:12:49.600 --> 00:12:51.679
if I had spread the tables evenly across the

00:12:51.679 --> 00:12:53.940
room to begin with. You have zeroed in on one

00:12:53.940 --> 00:12:56.940
of the most heavily researched areas of k-means

00:12:56.940 --> 00:13:00.000
clustering, initialization. Where you drop the

00:13:00.000 --> 00:13:02.740
tables. Exactly. Where you place those starting

00:13:02.740 --> 00:13:05.720
centers drastically dictates the final outcome.

00:13:06.440 --> 00:13:09.460
The source details a few highly specific ways

00:13:09.460 --> 00:13:11.960
data scientists try to handle this. Like what?

00:13:12.360 --> 00:13:14.740
For instance, there's the Forgy method. Yeah. This

00:13:14.740 --> 00:13:17.659
is where you randomly pick k actual data points

00:13:17.659 --> 00:13:19.840
from your data set and use them as your starting

00:13:19.840 --> 00:13:22.159
centers. OK. So in the dinner party, that would

00:13:22.159 --> 00:13:25.080
be like closing my eyes, pointing at three random

00:13:25.080 --> 00:13:27.679
guests and just dropping the heavy tables directly

00:13:27.679 --> 00:13:31.179
onto their toes. Ouch. But yes, exactly. Because

00:13:31.179 --> 00:13:33.620
you are picking existing data points, it tends

00:13:33.620 --> 00:13:36.539
to spread the initial means out somewhat naturally

00:13:36.539 --> 00:13:39.139
across where the data actually lives. Right,

00:13:39.240 --> 00:13:40.940
because guests aren't usually standing in the

00:13:40.940 --> 00:13:43.480
walls. Exactly. But then there is an alternative

00:13:43.480 --> 00:13:45.220
approach called the random partition method.

00:13:45.620 --> 00:13:48.259
This one takes a totally different route. It

00:13:48.259 --> 00:13:51.000
randomly assigns every single point in the data

00:13:51.000 --> 00:13:54.049
set to a cluster first, before there are even

00:13:54.049 --> 00:13:57.169
any centers, and then calculates the center of

00:13:57.169 --> 00:13:59.710
those completely random jumbled assignments.
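Both starting strategies fit in a few lines. A sketch following the source's descriptions of the Forgy and random partition methods (the function names and the toy guest grid are ours):

```python
import random

def forgy_init(points, k, rng):
    # Forgy method: pick k actual data points as the starting centers,
    # so the initial means land where the data actually lives.
    return [tuple(p) for p in rng.sample(points, k)]

def random_partition_init(points, k, rng):
    # Random partition: jumble every point into a random cluster first,
    # then use each jumbled group's mean as its starting center. Those
    # means all begin bunched near the overall center of the data.
    labels = [rng.randrange(k) for _ in points]
    centers = []
    for j in range(k):
        group = [p for p, l in zip(points, labels) if l == j]
        centers.append(tuple(sum(x) / len(group) for x in zip(*group)))
    return centers

# A 10x10 grid of guests; the ballroom's overall center is (4.5, 4.5).
grid = [(float(i % 10), float(i // 10)) for i in range(100)]
```

Forgy hands back actual grid points scattered across the room, while the random-partition centers all start within a short walk of (4.5, 4.5) and have to drift outward from there. Runs from different random seeds can then compete on final variance.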

00:14:00.169 --> 00:14:02.409
Wait, if everything is totally jumbled across

00:14:02.409 --> 00:14:05.389
the entire room, wouldn't the center of those

00:14:05.389 --> 00:14:08.769
random groups all just average out to the exact

00:14:08.769 --> 00:14:11.590
middle of the ballroom? Yes, they all start bunched

00:14:11.590 --> 00:14:13.809
up in the exact center of the room and slowly

00:14:13.809 --> 00:14:15.830
have to drift outwards to find their crowds.

00:14:16.299 --> 00:14:19.059
That seems inefficient. It's a slow drift, and

00:14:19.059 --> 00:14:21.700
it is actually incredibly risky. If you start

00:14:21.700 --> 00:14:24.019
them all tightly packed in the middle, they might

00:14:24.019 --> 00:14:26.659
get trapped in a really poor local optimum and

00:14:26.659 --> 00:14:29.519
fail to capture groups on the outer edges. That

00:14:29.519 --> 00:14:32.080
is why researchers rarely run k-means just once.

00:14:32.200 --> 00:14:34.460
They will run the algorithm 50 times with 50

00:14:34.460 --> 00:14:36.559
different random starting points and then just

00:14:36.559 --> 00:14:38.580
pick the resulting map that has the tightest

00:14:38.580 --> 00:14:41.879
mathematical variance. Okay, so because this

00:14:41.879 --> 00:14:44.580
algorithm relies on these aggressive mathematical

00:14:44.580 --> 00:14:47.720
shortcuts, these good enough heuristics and random

00:14:47.720 --> 00:14:50.399
starting points, it must have some serious blind

00:14:50.399 --> 00:14:52.580
spots. It definitely does. And here's where it

00:14:52.580 --> 00:14:56.120
gets really interesting. When k-means encounters

00:14:56.120 --> 00:14:59.269
the messy, unpredictable shapes of real-world

00:14:59.269 --> 00:15:02.690
data, it can make some hilarious and deeply problematic

00:15:02.690 --> 00:15:05.710
mistakes. Oh, it absolutely does. The algorithm

00:15:05.710 --> 00:15:08.529
has strict, uncompromising assumptions built

00:15:08.529 --> 00:15:11.009
into its geometry. Like what kind of assumptions?

00:15:11.110 --> 00:15:14.590
It assumes that clusters are sphere-like, perfectly

00:15:14.590 --> 00:15:17.009
round, symmetrical bubbles, and it assumes that

00:15:17.009 --> 00:15:19.429
they are all of roughly similar size. Spheres

00:15:19.429 --> 00:15:21.970
of equal sizes. Yeah. And when real-world data

00:15:21.970 --> 00:15:24.649
violates those geometric assumptions, k-means

00:15:24.649 --> 00:15:27.629
fails completely and aggressively. The Wikipedia

00:15:27.629 --> 00:15:30.090
source gives a fantastic, almost comical example

00:15:30.090 --> 00:15:32.610
of this called the mouse dataset. Yes, the mouse.

00:15:32.889 --> 00:15:35.870
Imagine plotting thousands of data points on

00:15:35.870 --> 00:15:38.750
a graph and they happen to form the exact shape

00:15:38.750 --> 00:15:41.149
of a Mickey Mouse head. You know, you have a

00:15:41.149 --> 00:15:43.529
massive, incredibly dense circle of points at

00:15:43.529 --> 00:15:46.690
the bottom representing the head and two much

00:15:46.690 --> 00:15:49.669
smaller, less dense circles at the top left and

00:15:49.669 --> 00:15:52.750
top right representing the ears. It is a brilliant

00:15:52.750 --> 00:15:55.750
visual representation of three distinct clusters.

00:15:56.370 --> 00:15:59.309
Any human being looks at that graph and instantly,

00:15:59.830 --> 00:16:01.850
intuitively says, oh, there are three groups

00:16:01.850 --> 00:16:05.330
here. One giant head, two tiny ears. Exactly.

00:16:05.509 --> 00:16:08.309
It's obvious. But if you tell k-means to find

00:16:08.309 --> 00:16:11.200
three clusters in that data, it completely butchers

00:16:11.200 --> 00:16:13.340
the mouse. It really does. Because it strictly

00:16:13.340 --> 00:16:15.279
assumes that clusters must be of similar size

00:16:15.279 --> 00:16:17.539
and perfectly spherical. It cannot comprehend

00:16:17.539 --> 00:16:19.480
that the head is massive and the ears are small,

00:16:19.799 --> 00:16:22.039
so it just violently chops the entire image into

00:16:22.039 --> 00:16:24.720
three roughly equal-sized chunks. Yeah, it's

00:16:24.720 --> 00:16:27.320
brutal. It will slice a Voronoi boundary line

00:16:27.320 --> 00:16:28.960
right through the middle of the massive head,

00:16:29.559 --> 00:16:31.120
stealing thousands of points from the face just

00:16:31.120 --> 00:16:33.179
to make the ear clusters bigger and mathematically

00:16:33.179 --> 00:16:36.450
equal. It is entirely blind to the density of

00:16:36.450 --> 00:16:39.029
the points or the distinct unequal shapes of

00:16:39.029 --> 00:16:42.190
reality. It only sees its desperate need to draw

00:16:42.190 --> 00:16:44.809
equidistant lines. Right, it just wants balance.

00:16:45.090 --> 00:16:46.950
And it struggles just as much with shapes that

00:16:46.950 --> 00:16:49.909
aren't perfectly round. The text brings up another

00:16:49.909 --> 00:16:53.629
famous benchmark in machine learning, the Iris

00:16:53.629 --> 00:16:56.029
flower dataset. Right, I've heard of this one.

00:16:56.110 --> 00:16:59.110
This is a classic dataset containing petal and

00:16:59.110 --> 00:17:02.029
sepal measurements of three distinct real-world

00:17:02.029 --> 00:17:05.990
species of Iris. But if you feed that data into

00:17:05.990 --> 00:17:08.549
k-means and tell it to find three clusters,

00:17:09.089 --> 00:17:11.609
it routinely fails to separate the actual botanical

00:17:11.609 --> 00:17:14.190
species correctly. Yeah, and the reason why reveals

00:17:14.190 --> 00:17:16.789
the limitation of the sphere assumption. In the

00:17:16.789 --> 00:17:19.150
data, two of those iris species have somewhat

00:17:19.150 --> 00:17:21.069
similar measurements, so their data points kind

00:17:21.069 --> 00:17:23.009
of bleed into each other on the graph. So they

00:17:23.009 --> 00:17:25.089
don't look like neat little circles. Exactly.

00:17:25.309 --> 00:17:27.210
Instead of forming two separate circles, they

00:17:27.210 --> 00:17:31.019
form one long, stretched-out oblong shape.

00:17:31.019 --> 00:17:34.059
K-means looks at that long oval-shaped cluster.

00:17:34.539 --> 00:17:36.559
And because its fundamental math desperately

00:17:36.559 --> 00:17:39.940
wants to draw perfect spheres, it will literally

00:17:39.940 --> 00:17:43.579
chop that visible continuous oval perfectly in

00:17:43.579 --> 00:17:46.799
half. It is literally forcing its rigid worldview

00:17:46.799 --> 00:17:49.220
onto the data rather than listening to what the

00:17:49.220 --> 00:17:51.720
data's natural shape is actually telling it.

00:17:51.819 --> 00:17:54.900
Which highlights the absolute biggest, most glaring

00:17:54.900 --> 00:17:57.940
flaw of k-means clustering, the k conundrum.

00:17:57.960 --> 00:18:01.019
Oh, right, the k conundrum. You, the human operator,

00:18:01.019 --> 00:18:03.460
have to tell the algorithm exactly how many clusters

00:18:03.460 --> 00:18:06.299
exist before it is even allowed to look at the

00:18:06.299 --> 00:18:09.059
data. That seems entirely backward. If the whole

00:18:09.059 --> 00:18:11.319
point of using an advanced machine learning algorithm

00:18:11.319 --> 00:18:13.380
is to discover hidden patterns in my massive

00:18:13.380 --> 00:18:15.720
data set, how on earth am I supposed to know

00:18:15.720 --> 00:18:17.619
how many patterns there are before I even start?

00:18:17.720 --> 00:18:20.519
It is a massive structural limitation. If you

00:18:20.519 --> 00:18:23.480
blindly guess k equals 3 on that iris data set,

00:18:23.960 --> 00:18:26.569
it chops a natural cluster in half. If you guess

00:18:26.569 --> 00:18:29.390
K equals 2, it would actually capture the geometric

00:18:29.390 --> 00:18:31.990
structure much better, grouping the two overlapping

00:18:31.990 --> 00:18:34.690
species together into one big bubble. But then

00:18:34.690 --> 00:18:37.049
you fundamentally miss the biological fact that

00:18:37.049 --> 00:18:40.269
there are three classes of flowers. So how do

00:18:40.269 --> 00:18:44.190
data scientists deal with this K conundrum? Do

00:18:44.190 --> 00:18:46.849
they just guess and... hope for the best? Well,

00:18:46.950 --> 00:18:49.450
they use a variety of diagnostic checks to try

00:18:49.450 --> 00:18:52.410
and mathematically estimate k. The source outlines

00:18:52.410 --> 00:18:55.190
several techniques. Like what? There is the elbow

00:18:55.190 --> 00:18:57.990
method, where you run the algorithm for k equals

00:18:57.990 --> 00:19:00.769
1, then 2, then 3, and you plot the variance

00:19:00.769 --> 00:19:03.329
on a line graph. You look for a sharp bend or

00:19:03.329 --> 00:19:06.349
an elbow in the line to find the sweet spot of

00:19:06.349 --> 00:19:08.390
diminishing returns. OK, that sounds reasonable.

00:19:08.529 --> 00:19:10.970
But the text explicitly notes this is considered

00:19:10.970 --> 00:19:13.950
highly unreliable in practice because real -world

00:19:13.950 --> 00:19:16.779
data rarely produces a sharp, obvious elbow.

00:19:16.980 --> 00:19:18.920
It's more of a gentle curve. So if the elbow

00:19:18.920 --> 00:19:21.180
is too blurry, what else is there? There's the

00:19:21.180 --> 00:19:23.799
silhouette score, which evaluates how tightly

00:19:23.799 --> 00:19:26.559
knit an object is to its own cluster compared

00:19:26.559 --> 00:19:28.579
to how close it is to the neighboring cluster.

00:19:28.880 --> 00:19:31.720
Or you have the gap statistic, which compares

00:19:31.720 --> 00:19:34.319
your data's clustering against a random null

00:19:34.319 --> 00:19:37.700
distribution. Wait, a null distribution? Are

00:19:37.700 --> 00:19:39.480
you saying the algorithm essentially creates

00:19:39.480 --> 00:19:42.880
a totally random fake data set, just to see if

00:19:42.880 --> 00:19:44.880
its real clusters are actually statistically

00:19:44.880 --> 00:19:47.819
meaningful or just a random mathematical coincidence?

00:19:48.039 --> 00:19:50.559
That is a perfect translation. Yes. It generates

00:19:50.559 --> 00:19:53.680
a totally uniform box of random noise, clusters

00:19:53.680 --> 00:19:56.140
it, and compares the variance to your actual

00:19:56.140 --> 00:19:58.980
data. Oh, wow. If your data clusters way tighter

00:19:58.980 --> 00:20:00.559
than the random noise, you've probably found

00:20:00.559 --> 00:20:03.400
a good k. But the reality is, all of these are

00:20:03.400 --> 00:20:05.539
just sophisticated mathematical ways of trying

00:20:05.539 --> 00:20:08.380
to justify the k you've chosen. None of them

00:20:08.380 --> 00:20:10.700
are magic bullets. OK, so let me summarize this.
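As a rough illustration of the elbow method, here is a pure-Python sketch (our own toy implementation and made-up blob centers, not from the source). The within-cluster variance is computed for k from 1 to 6 on data that really does contain three blobs, and the drop flattens sharply after k equals 3:

```python
import math
import random

def lloyd(points, k, iters=50, seed=0):
    """One Lloyd's-algorithm run; returns the final within-cluster
    sum of squared distances (the 'variance' an elbow plot tracks)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        for i in range(k):
            cell = [p for p, lab in zip(points, labels) if lab == i]
            if cell:
                centroids[i] = tuple(sum(x) / len(cell) for x in zip(*cell))
    return sum(math.dist(p, centroids[lab]) ** 2
               for p, lab in zip(points, labels))

def inertia(points, k, restarts=5):
    # Best of several restarts, so one unlucky init can't mislead the plot.
    return min(lloyd(points, k, seed=s) for s in range(restarts))

# Toy data with three genuinely separate blobs.
rng = random.Random(2)
blobs = [(rng.gauss(cx, 0.3), rng.gauss(cy, 0.3))
         for cx, cy in [(0, 0), (5, 0), (2.5, 4)] for _ in range(60)]

curve = {k: inertia(blobs, k) for k in range(1, 7)}
for k, v in curve.items():
    print(k, round(v, 1))
# The variance plunges up to k=3, then barely moves: that bend is the elbow.
```

With blobs this clean the bend is obvious; the transcript's caveat is that real data rarely cooperates this well.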

00:20:10.940 --> 00:20:13.680
The algorithm stubbornly forces all data into

00:20:13.680 --> 00:20:16.519
perfect spheres. It completely struggles with

00:20:16.519 --> 00:20:19.099
overlapping shapes. It violently chops mouse

00:20:19.099 --> 00:20:22.140
heads into thirds to maintain equal sizes. And

00:20:22.140 --> 00:20:24.160
it forces you to guess the answer before you

00:20:24.160 --> 00:20:26.160
even start the test. That's a pretty damning

00:20:26.160 --> 00:20:28.319
summary, but yes. If it is this mathematically

00:20:28.319 --> 00:20:31.180
flawed and rigid, why on earth is it used everywhere

00:20:31.180 --> 00:20:33.920
in modern technology? Because when it is applied

00:20:33.920 --> 00:20:36.359
correctly and you understand its limitations,

00:20:36.880 --> 00:20:39.819
it is an incredibly powerful tool for simplification.

00:20:40.359 --> 00:20:43.240
It reduces complex high -dimensional reality

00:20:43.240 --> 00:20:46.359
into manageable chunks better than almost anything

00:20:46.359 --> 00:20:48.660
else. Let's talk about those real -world applications

00:20:48.660 --> 00:20:50.319
because this is where the algorithm actually

00:20:50.319 --> 00:20:53.619
touches the listener's daily life. The first

00:20:53.619 --> 00:20:57.059
major one mentioned is vector quantization, which

00:20:57.059 --> 00:21:00.579
is essentially a very fancy academic way of saying

00:21:00.579 --> 00:21:03.130
image compression. What's fascinating here is

00:21:03.130 --> 00:21:06.309
how k -means treats something visual, like colors,

00:21:06.809 --> 00:21:09.750
as pure mathematical coordinates. Colors as math.

00:21:09.950 --> 00:21:12.400
Okay. Imagine a high -definition photograph on

00:21:12.400 --> 00:21:15.079
your phone. It might have millions of distinct,

00:21:15.299 --> 00:21:18.200
subtle color variations. Millions of pixels,

00:21:18.380 --> 00:21:20.420
each representing a slightly different data point

00:21:20.420 --> 00:21:22.940
of red, green, and blue. Right, ton of data.

00:21:23.220 --> 00:21:25.579
Storing the exact hex code for every single pixel

00:21:25.579 --> 00:21:27.440
takes a massive amount of memory. But if I'm

00:21:27.440 --> 00:21:30.160
a web developer and I need that high -res image

00:21:30.160 --> 00:21:32.740
to load instantly on a user's slow cell connection,

00:21:33.180 --> 00:21:35.200
I can't be sending millions of unique colors

00:21:35.200 --> 00:21:37.519
over the network. It would take forever. Exactly.

00:21:37.579 --> 00:21:40.660
So you run the image through K -means. You tell

00:21:40.660 --> 00:21:43.779
the algorithm, I only want 64 colors to represent

00:21:43.779 --> 00:21:47.339
this entire massive image. So k equals 64. Yep.

00:21:48.119 --> 00:21:51.119
The algorithm maps out every pixel, and it finds

00:21:51.119 --> 00:21:54.180
the 64 mathematical centers of all the colors

00:21:54.180 --> 00:21:57.460
in that photo. It groups millions of similar

00:21:57.460 --> 00:22:00.579
subtle shades of reds, blues, and greens into

00:22:00.579 --> 00:22:03.339
64 distinct Voronoi cells. And then it strips

00:22:03.339 --> 00:22:06.019
away all the nuance. It says: every single pixel

00:22:06.019 --> 00:22:08.839
that got assigned to this specific subtle shade

00:22:08.839 --> 00:22:11.559
of ocean blue is now going to be forced to become

00:22:11.559 --> 00:22:14.619
exactly this one uniform centroid shade of blue.

00:22:14.759 --> 00:22:17.559
Yes. It drastically reduces the file size because

00:22:17.559 --> 00:22:19.559
the computer no longer has to remember millions

00:22:19.559 --> 00:22:22.039
of colors. It only has to store a tiny palette

00:22:22.039 --> 00:22:24.759
of K colors and a simple map of which pixel gets

00:22:24.759 --> 00:22:27.140
painted which color. That's brilliant. And human

00:22:27.140 --> 00:22:28.819
eyes often can't even tell the difference because

00:22:28.819 --> 00:22:30.920
the algorithmic centroids capture the visual

00:22:30.920 --> 00:22:33.960
essence of the image so accurately. That is wild.
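That palette idea can be sketched in a few lines of pure Python. This is a toy stand-in (synthetic "pixels" and our own helper names, not a real image codec), but the mechanics are the same: cluster the pixel colors, then repaint every pixel with its centroid.

```python
import math
import random

def kmeans(points, k, iters=30, seed=0):
    """Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        for i in range(k):
            cell = [p for p, lab in zip(points, labels) if lab == i]
            if cell:
                centroids[i] = tuple(sum(x) / len(cell) for x in zip(*cell))
    return centroids, labels

# A fake "photo": 1,000 RGB pixels, three base hues plus per-pixel noise.
rng = random.Random(3)
base = [(200, 30, 30), (30, 160, 60), (40, 60, 200)]
pixels = [tuple(min(255, max(0, c + rng.randint(-25, 25)))
                for c in rng.choice(base))
          for _ in range(1000)]

palette, labels = kmeans(pixels, k=8)
palette = [tuple(round(c) for c in p) for p in palette]  # back to ints

# The "compressed" image is just the tiny palette plus one index per pixel.
quantized = [palette[lab] for lab in labels]
print(len(set(pixels)), "distinct colors before ->",
      len(set(quantized)), "after")
```

Storing eight palette entries plus a 3-bit index per pixel is the vector-quantization saving in miniature.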

00:22:34.160 --> 00:22:37.200
It is essentially creating a custom, hyper -efficient

00:22:37.200 --> 00:22:40.359
paint -by -numbers kit for every single image

00:22:40.359 --> 00:22:42.359
on the internet. That's exactly what it is. What

00:22:42.359 --> 00:22:45.200
about applications beyond just pixels and images?

00:22:46.240 --> 00:22:48.720
The source also dives into cluster analysis,

00:22:49.299 --> 00:22:53.000
specifically market segmentation. This is a classic

00:22:53.000 --> 00:22:56.240
foundational business application. Major retailers

00:22:56.240 --> 00:22:59.440
have massive sprawling databases of customer

00:22:59.440 --> 00:23:01.259
behavior. Oh yeah, they know everything. They

00:23:01.259 --> 00:23:04.579
really do. They know your age, your income, your

00:23:04.579 --> 00:23:07.359
purchase history, how many milliseconds you hover

00:23:07.359 --> 00:23:10.640
over a specific product. They run k -means to

00:23:10.640 --> 00:23:13.380
group these millions of chaotic individual customers.

00:23:13.740 --> 00:23:16.019
Sorting us into buckets. Right. They might test

00:23:16.019 --> 00:23:18.380
the data and find that k equals four gives them

00:23:18.380 --> 00:23:21.480
four highly actionable customer profiles: the

00:23:21.480 --> 00:23:24.240
bargain hunters, the impulse buyers, the brand

00:23:24.240 --> 00:23:26.720
loyalists, and the seasonal shoppers. And because

00:23:26.720 --> 00:23:29.759
K -means draws those rigid, hard Voronoi boundary

00:23:29.759 --> 00:23:31.900
lines we talked about, you could have a situation

00:23:31.900 --> 00:23:34.220
where two neighbors buy almost the exact same

00:23:34.220 --> 00:23:36.680
things, but because one neighbor buys slightly

00:23:36.680 --> 00:23:39.299
more coffee, they cross a millimeter over that

00:23:39.299 --> 00:23:41.640
mathematical boundary line. Exactly. Suddenly,

00:23:41.779 --> 00:23:44.200
the algorithm places them in two entirely different

00:23:44.200 --> 00:23:46.680
Voronoi cells. One gets targeted with late -night

00:23:46.680 --> 00:23:49.619
flash sale emails as an impulse buyer, and the

00:23:49.619 --> 00:23:53.440
other gets loyalty rewards. It simplifies millions

00:23:53.440 --> 00:23:57.240
of nuanced, complex individuals into K number

00:23:57.240 --> 00:24:00.279
of manageable, highly targeted marketing campaigns.
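The hard-boundary effect described above is easy to demonstrate in a toy sketch. The two-feature "customers" and their feature names here are hypothetical, purely for illustration:

```python
import math
import random

def lloyd(points, k, iters=50, seed=0):
    """Lloyd's algorithm; returns (inertia, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        for i in range(k):
            cell = [p for p, lab in zip(points, labels) if lab == i]
            if cell:
                centroids[i] = tuple(sum(x) / len(cell) for x in zip(*cell))
    inertia = sum(math.dist(p, centroids[lab]) ** 2
                  for p, lab in zip(points, labels))
    return inertia, centroids

def segment(customer, centroids):
    # Hard Voronoi assignment: nearest centroid, all-or-nothing.
    return min(range(len(centroids)),
               key=lambda i: math.dist(customer, centroids[i]))

# Hypothetical customers described by (coffee orders/month, store visits/month).
rng = random.Random(4)
shoppers = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in [(2.0, 8.0), (12.0, 8.0)] for _ in range(80)]
_, centroids = min(lloyd(shoppers, k=2, seed=s) for s in range(5))

# Two neighbors with nearly identical habits, a hair apart in coffee orders:
alice, bob = (6.4, 8.0), (7.6, 8.0)
print("alice -> segment", segment(alice, centroids))
print("bob   -> segment", segment(bob, centroids))
```

The two points differ by almost nothing, yet they fall on opposite sides of the perpendicular-bisector boundary between the centroids, so they land in different segments.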

00:24:01.099 --> 00:24:03.579
The text also mentions a really fascinating modern

00:24:03.579 --> 00:24:06.059
application, feature learning and natural language

00:24:06.059 --> 00:24:09.029
processing, or NLP. Yes, this is quite advanced

00:24:09.029 --> 00:24:11.410
and incredibly relevant right now. With all the

00:24:11.410 --> 00:24:14.809
AI models everywhere. Exactly. In NLP, before

00:24:14.809 --> 00:24:17.609
you can train a massive complex deep learning

00:24:17.609 --> 00:24:20.670
model or an AI neural network to understand human

00:24:20.670 --> 00:24:23.599
text, say, figuring out if a specific word in

00:24:23.599 --> 00:24:26.500
a sentence is a person, a place, or a corporate

00:24:26.500 --> 00:24:29.759
entity, you can use k -means as a sort of pre

00:24:29.759 --> 00:24:32.359
-processing rough draft. So instead of forcing

00:24:32.359 --> 00:24:34.160
the neural network to read the dictionary and

00:24:34.160 --> 00:24:36.579
learn everything from scratch, k -means acts

00:24:36.579 --> 00:24:39.069
like a cheat sheet. That is exactly what it does.

00:24:39.210 --> 00:24:41.549
You mathematically map words as physical points

00:24:41.549 --> 00:24:44.049
in a multidimensional space. You use k -means

00:24:44.049 --> 00:24:46.089
to cluster words that frequently hang out together

00:24:46.089 --> 00:24:48.410
in similar sentences. It groups them together

00:24:48.410 --> 00:24:50.670
into Voronoi cells. That makes total sense. The

00:24:50.670 --> 00:24:52.529
neural network, then, doesn't have to learn the

00:24:52.529 --> 00:24:54.650
definition of every single word in isolation.
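A toy sketch of that idea, with hand-made 2-D "embeddings" of our own invention (real word vectors have hundreds of dimensions and come from a trained model):

```python
import math
import random

def lloyd(points, k, iters=50, seed=0):
    """Lloyd's algorithm; returns (inertia, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        for i in range(k):
            cell = [p for p, lab in zip(points, labels) if lab == i]
            if cell:
                centroids[i] = tuple(sum(x) / len(cell) for x in zip(*cell))
    inertia = sum(math.dist(p, centroids[lab]) ** 2
                  for p, lab in zip(points, labels))
    return inertia, labels

# Toy 2-D coordinates standing in for learned word embeddings:
# finance-flavored words sit together, nature words sit elsewhere.
vecs = {
    "bank":  (9.0, 8.5), "money": (9.2, 9.1), "vault":  (8.7, 8.8),
    "river": (1.0, 1.2), "tree":  (0.8, 0.5), "meadow": (1.4, 0.9),
}
words, points = list(vecs), list(vecs.values())
_, labels = min(lloyd(points, k=2, seed=s) for s in range(5))

clusters = {}
for word, lab in zip(words, labels):
    clusters.setdefault(lab, []).append(word)
for members in clusters.values():
    print(sorted(members))
```

The clusters become the "cheat sheet": words in the same Voronoi cell can be treated as one coarse feature before the heavier model ever runs.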

00:24:55.089 --> 00:24:57.529
It can look at the clusters k -means built and

00:24:57.529 --> 00:25:01.349
say, ah, bank, money, and vault are all in the

00:25:01.349 --> 00:25:03.980
same neighborhood. They function similarly. It

00:25:03.980 --> 00:25:06.500
provides a foundational layer of understanding

00:25:06.500 --> 00:25:09.240
for the AI. So bring this all back to you listening

00:25:09.240 --> 00:25:12.500
right now. When you are scrolling online and

00:25:12.500 --> 00:25:16.299
you receive a hyper -specific targeted ad that

00:25:16.299 --> 00:25:19.519
feels like it read your mind, or when a website

00:25:19.519 --> 00:25:22.119
loads a beautiful complex image instantly on

00:25:22.119 --> 00:25:24.980
your phone, or when an AI seems to fundamentally

00:25:24.980 --> 00:25:28.019
understand the context of your prompt, you are

00:25:28.019 --> 00:25:30.640
directly benefiting from or being targeted by

00:25:30.640 --> 00:25:33.319
a k -means centroid. You've been pulled into the

00:25:33.319 --> 00:25:36.180
orbit of a Voronoi cell. You are a complex data

00:25:36.180 --> 00:25:37.779
point that has been assigned to a mathematical

00:25:37.779 --> 00:25:40.079
mean. It's kind of humbling, honestly. It really

00:25:40.079 --> 00:25:42.579
is. So what does this all mean? We started back

00:25:42.579 --> 00:25:45.319
in 1957 with Stuart Lloyd at Bell Labs trying

00:25:45.319 --> 00:25:47.720
to chop up analog sound waves for pulse code

00:25:47.720 --> 00:25:51.240
modulation. A long time ago. Yeah. We explored

00:25:51.240 --> 00:25:54.759
the elegant, albeit NP -hard, geometry of Voronoi

00:25:54.759 --> 00:25:57.740
cells and the exhausting chaotic dinner party

00:25:57.740 --> 00:26:00.359
of assigning guests and dragging tables to minimize

00:26:00.359 --> 00:26:02.660
variance. And don't forget the mouse. Right. We saw

00:26:02.660 --> 00:26:05.019
how the algorithm completely butchers a mouse

00:26:05.019 --> 00:26:07.500
-shaped data set because of its rigid obsession

00:26:07.500 --> 00:26:10.059
with perfect spheres. And we uncovered how those

00:26:10.059 --> 00:26:12.660
very same rigid assumptions make it the hyper

00:26:12.660 --> 00:26:14.940
-efficient backbone of modern image compression

00:26:14.940 --> 00:26:18.039
and targeted marketing today. It truly is a testament

00:26:18.039 --> 00:26:20.619
to the power of heuristic thinking. In a modern

00:26:20.619 --> 00:26:23.180
world absolutely drowning in petabytes of data,

00:26:23.640 --> 00:26:26.559
finding a good enough local optimum is often

00:26:26.559 --> 00:26:29.539
far more valuable and computationally feasible

00:26:29.539 --> 00:26:32.779
than endlessly searching for perfect universal

00:26:32.779 --> 00:26:35.039
truth. It really is. But I want to leave you

00:26:35.039 --> 00:26:37.140
with one final thought, building on something

00:26:37.140 --> 00:26:39.680
tucked away near the very end of our source material.

00:26:39.880 --> 00:26:42.000
Oh, what's that? The text explicitly mentions

00:26:42.000 --> 00:26:44.480
that because k -means is so rigid, researchers

00:26:44.480 --> 00:26:46.599
eventually had to develop alternative algorithms

00:26:46.599 --> 00:26:50.069
to fix its geometric flaws. Algorithms like mean

00:26:50.069 --> 00:26:52.009
shift, which doesn't require you to guess the

00:26:52.009 --> 00:26:54.750
number of k -clusters beforehand, or the Gaussian

00:26:54.750 --> 00:26:57.470
mixture model. Which is a fascinating alternative.

00:26:57.529 --> 00:27:00.730
Right. Unlike k -means, which draws a hard rigid

00:27:00.730 --> 00:27:03.650
circle and says you are either 100 % in or out,

00:27:04.250 --> 00:27:06.670
a Gaussian mixture model looks at probability.

00:27:06.769 --> 00:27:09.470
It says, hey, this data point has a 70 % chance

00:27:09.470 --> 00:27:11.509
of belonging to this long stretched out oval

00:27:11.509 --> 00:27:14.509
group and a 30 % chance of belonging to the one

00:27:14.509 --> 00:27:17.240
next to it. It allows the boundaries to be soft

00:27:17.240 --> 00:27:19.720
and the shapes to stretch and warp based on the

00:27:19.720 --> 00:27:21.700
actual shape of the data. It allows the data

00:27:21.700 --> 00:27:23.839
to dictate the shape of the categories rather

00:27:23.839 --> 00:27:26.380
than forcing the categories onto the data. Exactly.
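That 70/30 idea can be shown directly. Here is a minimal sketch of the soft, Bayes-rule assignment step of a 1-D, equal-weight Gaussian mixture; the component parameters are made up for illustration:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a 1-D normal distribution."""
    return (math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
            / (sigma * math.sqrt(2 * math.pi)))

def responsibilities(x, components):
    """Soft assignment: the posterior probability that x came from each
    component, by Bayes' rule, assuming equal mixing weights."""
    likes = [gauss_pdf(x, mu, sigma) for mu, sigma in components]
    total = sum(likes)
    return [like / total for like in likes]

# Two made-up 1-D components: one long and stretched, one tight.
components = [(0.0, 3.0), (6.0, 1.0)]
for x in [0.0, 3.5, 6.0]:
    probs = responsibilities(x, components)
    print(x, [round(p, 2) for p in probs])
# A point between the two gets a split verdict (roughly 80/20 here),
# not a hard in-or-out answer.
```

Where k -means would snap each point to exactly one centroid, the mixture model reports graded membership, so the boundary between groups is a gradient rather than a wall.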

00:27:26.519 --> 00:27:28.859
And that is what I want you to ponder. In a world

00:27:28.859 --> 00:27:32.140
completely obsessed with categorization, k -means

00:27:32.140 --> 00:27:35.539
forces data into a strict predefined number of

00:27:35.539 --> 00:27:38.500
boxes based on rigid mathematical boundary lines.

00:27:38.619 --> 00:27:41.930
Hard lines. Yeah. But human behavior, your behavior,

00:27:42.130 --> 00:27:44.289
the overlapping, incredibly messy data of the

00:27:44.289 --> 00:27:47.170
real world rarely fits into neat, equal -sized

00:27:47.170 --> 00:27:50.230
spheres. As you navigate the digital world today,

00:27:50.650 --> 00:27:53.089
and that invisible sorting hat quietly does its

00:27:53.089 --> 00:27:55.910
work, ask yourself, are the personalized digital

00:27:55.910 --> 00:27:57.769
experiences you're getting truly tailored to

00:27:57.769 --> 00:28:00.789
your unique, complex shape? Or were you just

00:28:00.789 --> 00:28:02.789
forced into a predetermined k -box because you

00:28:02.789 --> 00:28:04.509
happened to fall a millimeter on the wrong side

00:28:04.509 --> 00:28:06.329
of a marketer's mathematical boundary line?
