WEBVTT

00:00:00.000 --> 00:00:02.600
So imagine for a second that you are blindfolded,

00:00:02.759 --> 00:00:05.019
right? And you're just dropped out of a helicopter

00:00:05.019 --> 00:00:07.299
into some completely unknown area of the city.

00:00:07.459 --> 00:00:10.980
That is quite the start to a Tuesday. Right.

00:00:11.380 --> 00:00:13.580
But bear with me. You have absolutely no idea

00:00:13.580 --> 00:00:16.440
where you are. And to figure out if you've landed

00:00:16.440 --> 00:00:19.440
in, say, a commercial district or a residential

00:00:19.440 --> 00:00:22.609
neighborhood? Well, you don't sit there and try

00:00:22.609 --> 00:00:25.969
to construct some complex mental map of urban

00:00:25.969 --> 00:00:28.350
zoning laws. No, of course not. You just reach

00:00:28.350 --> 00:00:30.489
out. Exactly. You just reach out, walk around

00:00:30.489 --> 00:00:34.070
slightly, and identify the five closest buildings

00:00:34.070 --> 00:00:36.070
to you. Makes sense. And if you feel around and

00:00:36.070 --> 00:00:38.890
find three houses and two office buildings, you

00:00:38.890 --> 00:00:41.170
logically deduce, hey, I must be in a residential

00:00:41.170 --> 00:00:44.289
area. Yeah. You rely entirely on your immediate

00:00:44.289 --> 00:00:46.950
surroundings. Yeah. And today, we're going straight

00:00:46.950 --> 00:00:50.759
to the core of how computers use this exact,

00:00:50.759 --> 00:00:54.399
very primitive human instinct to drive modern

00:00:54.399 --> 00:00:57.259
artificial intelligence. It's honestly a phenomenal

00:00:57.259 --> 00:01:00.140
starting point, because when we strip away all

00:01:00.140 --> 00:01:02.520
the intimidating jargon surrounding machine learning,

00:01:02.899 --> 00:01:05.640
the actual mechanics of how software makes decisions,

00:01:05.659 --> 00:01:08.640
they often resemble the most basic, intuitive

00:01:08.640 --> 00:01:11.000
ways we humans make sense of our environment.

00:01:11.700 --> 00:01:14.120
Like, we assume algorithms require some

00:01:14.120 --> 00:01:17.359
hyper-futuristic logic, but the foundations are just

00:01:17.359 --> 00:01:20.120
highly practical. And that is exactly our mission

00:01:20.120 --> 00:01:22.939
for you today. We're doing a deep dive into the

00:01:22.939 --> 00:01:25.040
source material on the K-Nearest Neighbors

00:01:25.040 --> 00:01:29.319
algorithm, or KNN, as it's usually called. KNN,

00:01:29.319 --> 00:01:32.140
yes. It is one of the absolute oldest classification

00:01:32.140 --> 00:01:34.900
methods out there, which I found crazy, like

00:01:34.900 --> 00:01:37.780
dating back to 1951. And we really want to cut

00:01:37.780 --> 00:01:40.319
through the dense technical concepts here to

00:01:40.319 --> 00:01:42.640
understand not just what this algorithm does,

00:01:43.079 --> 00:01:45.640
but why it matters when we rely on computers

00:01:45.640 --> 00:01:48.280
to categorize literally everything from text

00:01:48.280 --> 00:01:50.180
documents to live video streams. Absolutely.

00:01:50.459 --> 00:01:52.799
OK, let's unpack this. I just used that blindfold

00:01:52.799 --> 00:01:54.879
analogy to describe reaching out to five buildings.

00:01:55.359 --> 00:01:57.319
In the literature we're looking at, this seems

00:01:57.319 --> 00:01:59.500
to map directly onto the core function of the

00:01:59.500 --> 00:02:01.859
algorithm, right? Oh, it maps perfectly. I mean,

00:02:02.019 --> 00:02:04.739
the documentation formally calls this a plurality

00:02:04.739 --> 00:02:07.819
vote. Plurality vote, OK. Yeah. So in your scenario,

00:02:08.080 --> 00:02:10.979
the five closest buildings represent what the

00:02:10.979 --> 00:02:14.259
algorithm calls the K nearest training examples.

00:02:14.919 --> 00:02:17.919
That K is really just a positive integer. Usually

00:02:17.919 --> 00:02:19.400
a pretty small number, right? Right, usually

00:02:19.400 --> 00:02:22.580
small. So the algorithm looks at an unknown object,

00:02:22.639 --> 00:02:25.379
which is you, standing there blindfolded, finds

00:02:25.379 --> 00:02:27.819
its closest neighbors, and simply assigns it

00:02:27.819 --> 00:02:30.000
to the class that is most common among those

00:02:30.000 --> 00:02:33.379
neighbors. So if k equals 1? If k is 1, it just

00:02:33.379 --> 00:02:35.879
takes the exact identity of the single closest

00:02:35.879 --> 00:02:39.180
building. No voting needed. That is, well, the

00:02:39.180 --> 00:02:41.759
history here is fascinating too, because like

00:02:41.759 --> 00:02:44.080
I mentioned, this isn't some modern Silicon Valley

00:02:44.080 --> 00:02:46.740
invention. Not at all. The sources note the fundamental method

00:02:46.860 --> 00:02:49.879
was actually developed by statisticians, Evelyn

00:02:49.879 --> 00:02:53.099
Fix and Joseph Hodges, back in 1951. Yes, and

00:02:53.099 --> 00:02:55.879
it was later expanded upon by Thomas Cover. Right.
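
NOTE
A minimal Python sketch of the plurality vote just described, for readers following along. The tiny "buildings" dataset, the function name, and k=5 are illustrative assumptions, not taken from the source.
import math
from collections import Counter
def knn_classify(query, examples, k=5):
    # examples: (feature_vector, label) pairs the algorithm has simply memorized
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]  # plurality vote among the k nearest neighbors
buildings = [((0.1, 0.2), "house"), ((0.2, 0.1), "house"), ((0.3, 0.3), "house"),
             ((0.9, 0.8), "office"), ((0.8, 0.9), "office")]
print(knn_classify((0.25, 0.25), buildings))  # "house": three houses beat two offices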

00:02:56.340 --> 00:02:58.560
But looking at how it actually functions today,

00:02:58.780 --> 00:03:00.900
the categorical voting makes total sense, like

00:03:00.900 --> 00:03:04.139
three houses beat two offices, easy. But I'm

00:03:04.139 --> 00:03:06.400
trying to picture what happens if we aren't sorting

00:03:06.400 --> 00:03:08.500
things into rigid categories. Okay, what do you

00:03:08.500 --> 00:03:11.919
mean? Well, say I want the algorithm to predict

00:03:11.919 --> 00:03:15.340
a specific continuous number, like... estimating

00:03:15.340 --> 00:03:17.740
the property value of a new house based on its

00:03:17.740 --> 00:03:20.919
neighbors. A house can't be, you know, category

00:03:20.919 --> 00:03:23.620
$500,000 and just vote on it. The logic seems

00:03:23.620 --> 00:03:25.580
like it would just hit a wall. You'd think so,

00:03:25.719 --> 00:03:27.819
yeah. But what's fascinating here is that the

00:03:27.819 --> 00:03:30.060
algorithm adapts to continuous variables through

00:03:30.060 --> 00:03:33.379
a process called KNN regression. Regression.

00:03:33.740 --> 00:03:36.199
Yeah, it's sometimes referred to as nearest neighbor

00:03:36.199 --> 00:03:39.159
smoothing. So instead of taking a plurality vote

00:03:39.159 --> 00:03:41.599
of categories, the algorithm just shifts its

00:03:41.599 --> 00:03:44.180
math. It calculates the actual average value

00:03:44.180 --> 00:03:46.800
of those k nearest neighbors. Oh, wow. So it's

00:03:46.800 --> 00:03:48.919
not actually predicting a new value from thin

00:03:48.919 --> 00:03:52.139
air. It's just finding a localized compromise

00:03:52.139 --> 00:03:55.039
of what already exists right around it. Precisely.
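
NOTE
A sketch of the KNN regression just described: the only change from classification is that the k nearest neighbors' numeric values are averaged instead of voted on. The toy house data (square footage, bedrooms, price) are assumptions, and the features are left unscaled here purely for brevity.
import math
def knn_regress(query, examples, k=5):
    # examples: (feature_vector, numeric_value) pairs, e.g. (house features, sale price)
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    return sum(value for _, value in nearest) / k  # mean of the k nearest known values
houses = [((1400, 3), 300_000), ((1500, 3), 320_000), ((1600, 3), 340_000),
          ((3000, 5), 700_000), ((3200, 5), 750_000)]
print(knn_regress((1550, 3), houses, k=3))  # averages the three most similar houses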

00:03:55.500 --> 00:03:57.780
So if you want to know the property value of

00:03:57.780 --> 00:04:00.819
a new house, and you set your K to five, the

00:04:00.819 --> 00:04:03.639
algorithm finds the five most physically similar

00:04:03.639 --> 00:04:06.400
houses in your historical data. Right. It takes

00:04:06.400 --> 00:04:08.719
their known property values, averages them out,

00:04:08.900 --> 00:04:11.599
and assigns that average to the new house. The

00:04:11.599 --> 00:04:14.819
neighborhood logic remains identical. It just

00:04:14.819 --> 00:04:18.040
outputs a mean rather than a majority. OK, but

00:04:18.040 --> 00:04:20.819
wait. That raises a massive logistical problem,

00:04:21.019 --> 00:04:23.540
though. How so? You just said it finds the five

00:04:23.540 --> 00:04:27.600
most physically similar houses. How does a computer

00:04:27.600 --> 00:04:30.720
actually measure similarity or nearness? I mean,

00:04:30.720 --> 00:04:32.180
if you're measuring a house, you have square

00:04:32.180 --> 00:04:34.079
footage, which is in the thousands, and then

00:04:34.079 --> 00:04:36.139
you have the number of bedrooms, which is usually

00:04:36.139 --> 00:04:39.060
less than 10. I see where you're going with this.

00:04:39.160 --> 00:04:41.420
Or even worse. If the algorithm is analyzing

00:04:41.420 --> 00:04:43.439
countries, it might look at life expectancy in

00:04:43.439 --> 00:04:46.740
years versus GDP in trillions of dollars. If

00:04:46.740 --> 00:04:49.300
a computer is just crunching raw numbers, that

00:04:49.300 --> 00:04:51.540
GDP figure is going to completely swallow the

00:04:51.540 --> 00:04:53.839
life expectancy figure, right? Just because trillions

00:04:53.839 --> 00:04:56.079
are mathematically way larger than double digits.

00:04:56.300 --> 00:04:58.800
You've zeroed in on one of the most critical

00:04:58.800 --> 00:05:02.339
vulnerabilities of KNN. The computer does not

00:05:02.339 --> 00:05:05.290
know what a dollar or a year is. Right. It's

00:05:05.290 --> 00:05:07.709
just numbers. Exactly. It just sees a massive

00:05:07.709 --> 00:05:10.689
number versus a tiny number. So if the features

00:05:10.689 --> 00:05:13.129
represent different physical units or they just

00:05:13.129 --> 00:05:15.610
come in vastly different scales, the accuracy

00:05:15.610 --> 00:05:19.329
of the algorithm severely degrades. The distance

00:05:19.329 --> 00:05:22.029
calculation just gets completely warped by whichever

00:05:22.029 --> 00:05:24.689
feature has the largest absolute numbers. So

00:05:24.689 --> 00:05:26.589
we have to somehow force all these different

00:05:26.589 --> 00:05:29.110
measurements to speak the same numerical language

00:05:29.110 --> 00:05:31.800
before the algorithm even looks at them. Exactly.

00:05:32.199 --> 00:05:34.300
The data has to undergo what we call

00:05:34.300 --> 00:05:36.720
feature-wise normalization. Normalization, got it. Yeah,

00:05:36.800 --> 00:05:38.740
we scale the features down so they carry the

00:05:38.740 --> 00:05:41.199
appropriate weight. Often that means compressing

00:05:41.199 --> 00:05:43.639
everything into a proportional range, like between

00:05:43.639 --> 00:05:46.339
0 and 1. Oh, OK. So a trillion dollars becomes

00:05:46.339 --> 00:05:49.339
like 0.8, and 80 years of life expectancy becomes

00:05:49.339 --> 00:05:52.500
0.8 as well. Right. And researchers put massive

00:05:52.500 --> 00:05:55.500
effort into this pre-processing stage. The text

00:05:55.500 --> 00:05:57.860
actually mentions they frequently use evolutionary

00:05:57.860 --> 00:06:00.139
algorithms to optimize how features are scaled,

00:06:00.259 --> 00:06:02.759
or they scale them based on the mutual information

00:06:02.759 --> 00:06:05.019
they share with the training classes. OK. So

00:06:05.019 --> 00:06:07.319
assuming we've scaled everything properly so

00:06:07.319 --> 00:06:10.970
the GDP doesn't crush the life expectancy. What

00:06:10.970 --> 00:06:14.129
kind of ruler is the algorithm actually using

00:06:14.129 --> 00:06:17.410
to measure this distance? Well, for continuous

00:06:17.410 --> 00:06:20.430
variables like height, weight, or distance, it

00:06:20.430 --> 00:06:22.990
typically uses the standard straight line measurement

00:06:22.990 --> 00:06:25.850
we all learned in high school geometry, Euclidean

00:06:25.850 --> 00:06:28.970
distance. Ah, good old Euclidean distance. Yes.

00:06:29.629 --> 00:06:31.930
But the algorithm is highly adaptable, depending

00:06:31.930 --> 00:06:34.290
on the data. If you are working with discrete

00:06:34.290 --> 00:06:37.050
variables, like, say, classifying text documents,

00:06:37.410 --> 00:06:40.350
it uses an overlap metric, also known as Hamming

00:06:40.350 --> 00:06:42.129
distance. Hamming distance. Okay, I'm picturing

00:06:42.129 --> 00:06:44.290
that like grading a multiple-choice test. You

00:06:44.290 --> 00:06:46.569
aren't measuring a physical line between two

00:06:46.569 --> 00:06:48.449
essays. You're just counting exactly how many

00:06:48.449 --> 00:06:50.370
specific positions match up and how many don't.
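
NOTE
A small sketch of the feature-wise scaling and the two distance "rulers" discussed here. The country figures are made-up illustrations, and a real pipeline would also guard against a constant column (max equal to min).
def min_max_scale(values):
    # Compress one feature column into the 0-1 range so no unit dominates the distance.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
def euclidean(a, b):
    # Straight-line distance for continuous features (assumes both points are already scaled).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
def hamming(a, b):
    # Overlap metric for discrete features: count the positions that disagree.
    return sum(x != y for x, y in zip(a, b))
gdp      = min_max_scale([0.2e12, 1.5e12, 21e12])  # GDP in dollars
life_exp = min_max_scale([55, 72, 80])             # life expectancy in years
print(euclidean((gdp[0], life_exp[0]), (gdp[2], life_exp[2])))
print(hamming("karolin", "kathrin"))               # 3 mismatched positions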

00:06:50.569 --> 00:06:53.589
That is a highly accurate way to visualize it.

00:06:53.629 --> 00:06:56.310
It measures the minimum number of substitutions

00:06:56.310 --> 00:06:58.750
required to change one string of data into the

00:06:58.750 --> 00:07:00.389
other. That makes a lot of sense. And it goes

00:07:00.389 --> 00:07:03.589
even deeper. In bioinformatics, specifically

00:07:03.589 --> 00:07:06.389
with highly complex gene expression microarray

00:07:06.389 --> 00:07:09.459
data, researchers use correlation coefficients

00:07:09.459 --> 00:07:12.420
as their distance metric. Like what kind of coefficients?

00:07:12.500 --> 00:07:14.579
Things like Pearson and Spearman correlation.

00:07:15.000 --> 00:07:17.300
As long as you give it the appropriate mathematical

00:07:17.300 --> 00:07:20.120
measuring stick, the algorithm can find a neighborhood

00:07:20.120 --> 00:07:23.610
in almost any data set. You know, there's a behavioral

00:07:23.610 --> 00:07:25.889
quirk of this algorithm mentioned in the sources

00:07:25.889 --> 00:07:28.290
that I found just incredibly relatable. Oh, yeah.

00:07:28.470 --> 00:07:33.350
It calls KNN a lazy algorithm because it defers

00:07:33.350 --> 00:07:36.649
all computation until the absolute last second.

00:07:37.089 --> 00:07:39.230
It essentially just memorizes the training data

00:07:39.230 --> 00:07:41.629
without doing anything with it. It's very true.

00:07:42.060 --> 00:07:44.420
It totally reminds me of a college student who

00:07:44.420 --> 00:07:47.220
refuses to study all semester, brings an enormous

00:07:47.220 --> 00:07:49.699
stack of textbooks to an open book final exam,

00:07:50.220 --> 00:07:52.759
and then frantically tries to look up the closest

00:07:52.759 --> 00:07:55.519
matching answers in real time while the clock

00:07:55.519 --> 00:07:57.990
is ticking. The lazy learning categorization

00:07:57.990 --> 00:08:00.730
is completely accurate. I mean, unlike other

00:08:00.730 --> 00:08:03.230
machine learning methods, there is no explicit

00:08:03.230 --> 00:08:06.149
training phase where the algorithm builds some

00:08:06.149 --> 00:08:09.329
complex internal model or decision tree beforehand.

00:08:09.410 --> 00:08:11.769
It just sits there. It just stores the feature

00:08:11.769 --> 00:08:14.970
vectors and class labels. That's it. The actual

00:08:14.970 --> 00:08:17.509
mathematical function is only approximated locally

00:08:17.509 --> 00:08:20.370
when a query is actually made. But because it's

00:08:20.370 --> 00:08:22.509
doing this open book frantic search at the very

00:08:22.509 --> 00:08:25.350
last second, relying entirely on a local vote,

00:08:25.769 --> 00:08:27.689
here's where it gets really interesting. Doesn't

00:08:27.689 --> 00:08:31.069
that expose a huge flaw if the historical data

00:08:31.069 --> 00:08:33.269
is skewed? How do you mean? Let's say we have

00:08:33.269 --> 00:08:35.769
a data set that is flooded with blue squares,

00:08:36.389 --> 00:08:39.250
just thousands of them, and it only has a tiny

00:08:39.250 --> 00:08:42.409
handful of red triangles. If I drop a new point

00:08:42.409 --> 00:08:44.549
right next to a red triangle, but there are 100

00:08:44.549 --> 00:08:46.870
blue squares nearby just because they are everywhere

00:08:46.870 --> 00:08:50.049
in the system, those blue squares are gonna win

00:08:50.049 --> 00:08:53.769
the vote by sheer volume. The nuance of that

00:08:53.769 --> 00:08:56.049
smaller group gets totally drowned out by the

00:08:56.049 --> 00:08:58.570
tyranny of the majority. That skewed class distribution

00:08:58.570 --> 00:09:01.789
is a very real problem. The majority class will

00:09:01.789 --> 00:09:03.929
dominate the predictions simply because they

00:09:03.929 --> 00:09:06.570
show up more often in that k-nearest neighborhood.

00:09:06.830 --> 00:09:09.309
So how do we fix it? Well, the mathematical solution

00:09:09.309 --> 00:09:12.850
to this is quite elegant. We use a weighted nearest

00:09:12.850 --> 00:09:15.570
neighbor classifier. OK, so we stop treating

00:09:15.570 --> 00:09:18.450
every neighbor's vote as equal. How do we decide

00:09:18.450 --> 00:09:22.129
whose vote matters more? We assign weights based

00:09:22.129 --> 00:09:25.389
on proximity. Instead of every one of the neighbors

00:09:25.389 --> 00:09:27.330
getting an equal vote, which would be a weight

00:09:27.330 --> 00:09:31.269
of one over K, we multiply their vote by a weight

00:09:31.269 --> 00:09:33.629
proportional to the inverse of their distance.

00:09:33.789 --> 00:09:35.690
Inverse distance. OK, meaning the closer you

00:09:35.690 --> 00:09:37.870
are to the query point, the louder your voice

00:09:37.870 --> 00:09:40.269
gets in the vote. Exactly. So a red triangle

00:09:40.269 --> 00:09:43.049
sitting right next to me gets a massive megaphone,

00:09:43.210 --> 00:09:45.509
while the 50 blue squares sitting further away

00:09:45.509 --> 00:09:48.590
get their volume turned way down, even if they're

00:09:48.620 --> 00:09:50.679
technically inside my neighborhood. Exactly.
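
NOTE
A sketch of the inverse-distance weighting just described: each neighbor's vote is scaled by one over its distance, so nearby points get the megaphone. The tiny epsilon guarding against division by zero on exact matches is an implementation assumption.
import math
from collections import defaultdict
def weighted_knn_classify(query, examples, k=5):
    nearest = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    scores = defaultdict(float)
    for features, label in nearest:
        scores[label] += 1.0 / (math.dist(query, features) + 1e-9)  # closer = louder vote
    return max(scores, key=scores.get)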

00:09:50.879 --> 00:09:52.980
It prevents distant crowds from shouting down

00:09:52.980 --> 00:09:55.659
the immediate local reality. And the sources

00:09:55.659 --> 00:09:57.759
also mentioned another fascinating abstraction

00:09:57.759 --> 00:10:00.360
to solve this, called a self-organizing map,

00:10:00.860 --> 00:10:05.139
or SOM. SOM. How does an SOM physically change

00:10:05.139 --> 00:10:09.269
the data to fix the skew? Think of it as creating

00:10:09.269 --> 00:10:11.690
regional representatives. Instead of looking

00:10:11.690 --> 00:10:14.389
at every single individual point, which gives

00:10:14.389 --> 00:10:16.690
the advantage to the most crowded demographic,

00:10:16.929 --> 00:10:19.590
the SOM creates nodes. Nodes, okay. Yeah, and

00:10:19.590 --> 00:10:22.090
each node acts as a single representative center

00:10:22.090 --> 00:10:24.730
for a cluster of similar points, regardless of

00:10:24.730 --> 00:10:27.370
how densely packed that original cluster was.

00:10:27.509 --> 00:10:30.070
Oh, wow. It essentially levels the playing field

00:10:30.070 --> 00:10:33.190
by condensing the data into evenly weighted prototypes

00:10:33.190 --> 00:10:36.879
before the KNN vote is even applied. That makes

00:10:36.879 --> 00:10:40.379
the voting process infinitely fair. But that

00:10:40.379 --> 00:10:42.080
brings us to the actual size of the neighborhood.

00:10:42.639 --> 00:10:45.100
How do we pick the magic number K? That is a

00:10:45.100 --> 00:10:46.840
good question. Right. Because if I ask three

00:10:46.840 --> 00:10:49.600
people, I get a very specific local answer. But

00:10:49.600 --> 00:10:52.639
if my K is 300, my immediate neighborhood gets

00:10:52.639 --> 00:10:55.059
completely diluted by the broader city. You've

00:10:55.059 --> 00:10:57.179
hit on the core tension of tuning the algorithm.

00:10:57.980 --> 00:11:00.279
Larger values of k definitely reduce the effect

00:11:00.279 --> 00:11:02.220
of random noise because you're taking a broader

00:11:02.220 --> 00:11:04.480
consensus. But you are absolutely right that

00:11:04.480 --> 00:11:07.220
it makes the boundaries between different categories

00:11:07.220 --> 00:11:09.340
a lot blurrier. So is there a rule of thumb?

00:11:09.580 --> 00:11:12.080
Well, if you are dealing with a binary classification

00:11:12.080 --> 00:11:14.200
problem, meaning there are only two possible

00:11:14.200 --> 00:11:16.759
categories to choose from, you must choose an

00:11:16.759 --> 00:11:19.320
odd number for K. Oh, to prevent a tied vote.

00:11:19.820 --> 00:11:21.820
Exactly! Because if K is 4, you could end up

00:11:21.820 --> 00:11:24.080
with a 2 to 2 split and the algorithm just gets

00:11:24.080 --> 00:11:27.389
stuck. Precisely. And to find the empirically

00:11:27.389 --> 00:11:30.509
optimal k, researchers often utilize what's called

00:11:30.509 --> 00:11:33.509
the bootstrap method. They repeatedly resample

00:11:33.509 --> 00:11:35.769
the data, estimating the accuracy of different

00:11:35.769 --> 00:11:38.529
k values, until they find the mathematical sweet

00:11:38.529 --> 00:11:40.929
spot that balances noise reduction with distinct

00:11:40.929 --> 00:11:43.409
boundaries. OK, I want to circle back to the

00:11:43.409 --> 00:11:45.970
lazy learning concept for a second, because I

00:11:45.970 --> 00:11:47.970
just can't shake the logistical nightmare of

00:11:47.970 --> 00:11:51.350
it, the open book test problem. Yes. If the algorithm

00:11:51.350 --> 00:11:53.970
doesn't build a streamlined model in advance,

00:11:54.470 --> 00:11:56.730
and it just calculates the distance to every

00:11:56.730 --> 00:11:58.909
single stored example at the very last minute,

00:11:59.169 --> 00:12:01.730
what happens when the data set is massive? Yeah,

00:12:01.909 --> 00:12:03.450
that's a problem. I mean, if I have a million

00:12:03.450 --> 00:12:06.830
data points, and each point has dozens of features,

00:12:07.529 --> 00:12:09.509
searching through all of that in real time seems

00:12:09.509 --> 00:12:12.120
like it would just crash a computer. Oh, it absolutely

00:12:12.120 --> 00:12:14.840
would. And this introduces one of the most dramatically

00:12:14.840 --> 00:12:17.899
named concepts in all of data science, the curse

00:12:17.899 --> 00:12:20.779
of dimensionality. The curse of dimensionality.

00:12:21.500 --> 00:12:24.159
Honestly, it sounds like an Indiana Jones artifact,

00:12:24.259 --> 00:12:26.519
but I'm guessing it's a mathematical trap. This

00:12:26.519 --> 00:12:28.600
raises an important question, right? Because

00:12:28.600 --> 00:12:31.879
it is a severe trap. When you start adding too

00:12:31.879 --> 00:12:34.120
many dimensions to your data, usually once you

00:12:34.120 --> 00:12:36.799
get past 10 dimensions or features, the very

00:12:36.799 --> 00:12:39.940
concept of Euclidean distance begins to physically

00:12:39.940 --> 00:12:41.820
break down. Wait, I'm struggling to picture how

00:12:41.820 --> 00:12:43.960
distance just breaks down. A foot is a foot.

00:12:44.120 --> 00:12:45.759
Regardless of how many things you're measuring,

00:12:45.840 --> 00:12:48.360
right? Not in high dimensional space. Imagine

00:12:48.360 --> 00:12:51.399
a standard 3D room. OK. There's a clear center

00:12:51.399 --> 00:12:54.600
and there are corners. But as you add dimensions,

00:12:54.840 --> 00:12:57.539
say you stretch it into a 100 dimensional room,

00:12:58.159 --> 00:13:01.120
the mathematical volume of that space expands

00:13:01.120 --> 00:13:04.179
exponentially. The geometry works out such that

00:13:04.179 --> 00:13:06.159
almost all the data points get pushed to the

00:13:06.159 --> 00:13:08.559
outer crust of that space. The crust. Yes, the

00:13:08.559 --> 00:13:11.379
center becomes virtually empty. Oh, wow. So if

00:13:11.379 --> 00:13:13.700
my query point is in the middle of this 100-dimensional

00:13:13.700 --> 00:13:16.279
room, practically all the historical data points

00:13:16.279 --> 00:13:18.620
are pushed out to the walls, meaning they're

00:13:18.620 --> 00:13:20.940
almost exactly the same distance away from me.

00:13:21.120 --> 00:13:23.759
Exactly. When every point is roughly equidistant

00:13:23.759 --> 00:13:26.580
from the search query, the concept of a nearest

00:13:26.580 --> 00:13:28.740
neighbor becomes mathematically meaningless.

00:13:29.000 --> 00:13:31.100
Because everyone is equally far away. Right.

00:13:31.309 --> 00:13:33.590
The algorithm gets paralyzed because nothing

00:13:33.590 --> 00:13:35.730
stands out as being closer than anything else.
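
NOTE
A quick numerical illustration (not from the source) of the distance concentration just described: for uniform random data, the ratio of the nearest to the farthest distance from a query creeps toward 1 as dimensions grow, so "nearest" stops meaning much.
import numpy as np
rng = np.random.default_rng(0)
for dims in (2, 100, 1000):
    points = rng.random((10_000, dims))   # "historical" data scattered in the unit cube
    query = rng.random(dims)
    d = np.linalg.norm(points - query, axis=1)
    print(dims, round(d.min() / d.max(), 3))  # approaches 1.0 in high dimensions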

00:13:36.029 --> 00:13:38.009
So if the data has too many features, the algorithm

00:13:38.009 --> 00:13:40.330
essentially goes blind. So how do we break the

00:13:40.330 --> 00:13:43.570
curse? We have to aggressively simplify the data

00:13:43.570 --> 00:13:46.889
using dimension reduction. Before we ever apply

00:13:46.889 --> 00:13:50.250
the KNN algorithm, we transform the data to extract

00:13:50.250 --> 00:13:52.789
only the most critical information, creating

00:13:52.789 --> 00:13:55.830
a low dimensional embedding. The sources cite

00:13:55.830 --> 00:13:57.929
techniques like principal component analysis

00:13:57.929 --> 00:14:02.379
or PCA and linear discriminant analysis. I hear

00:14:02.379 --> 00:14:04.620
terms like principal component analysis a lot

00:14:04.620 --> 00:14:07.460
in AI discussions. What is that physically doing

00:14:07.460 --> 00:14:10.179
to the data to solve this dimensionality problem?

00:14:10.519 --> 00:14:12.860
Think of it like taking a 2D shadow of a complex

00:14:12.860 --> 00:14:16.389
object. If you have a highly complex multidimensional

00:14:16.389 --> 00:14:19.409
shape, PCA essentially shines a mathematical

00:14:19.409 --> 00:14:21.610
light on it to cast a flattened shadow. OK, I

00:14:21.610 --> 00:14:23.889
can picture that. It squashes the data down so

00:14:23.889 --> 00:14:26.570
you only keep the most distinct important outlines,

00:14:26.669 --> 00:14:29.129
the principal components, and it throws away

00:14:29.129 --> 00:14:31.970
all the useless depth and noise that's just confusing

00:14:31.970 --> 00:14:35.490
the algorithm. The text gives a great real-world

00:14:35.490 --> 00:14:37.809
example of this in a computer vision pipeline

00:14:37.809 --> 00:14:40.529
for face recognition. Like how facial recognition

00:14:40.529 --> 00:14:45.440
unlocks a smartphone. Yes. You can't run KNN on millions

00:14:45.440 --> 00:14:47.940
of raw high-definition pixels. That's way too

00:14:47.940 --> 00:14:50.519
many dimensions. Right. So the pipeline starts

00:14:50.519 --> 00:14:52.600
with a Haar face detection algorithm just to

00:14:52.600 --> 00:14:55.120
locate the face, uses mean shift tracking to

00:14:55.120 --> 00:14:58.419
follow it, and then crucially uses a PCA projection.

00:14:58.639 --> 00:15:01.100
It compresses that incredibly complex visual

00:15:01.100 --> 00:15:04.240
data down into a manageable feature space. Oh.

00:15:04.360 --> 00:15:06.259
So only after the face is flattened into its

00:15:06.259 --> 00:15:08.820
core mathematical shadow is the KNN classification

00:15:08.820 --> 00:15:11.320
applied to identify who it actually is. Exactly.
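
NOTE
A rough sketch of the "flatten, then classify" idea in that pipeline, using a plain SVD-based PCA projection followed by a brute-force KNN vote. The face_vectors / new_face / ids names in the usage comment are hypothetical placeholders, not the source's actual system.
import numpy as np
def fit_pca(X, n_components):
    # Learn a low-dimensional embedding: center the data, keep the top principal directions.
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]
    return lambda A: (A - mean) @ components.T  # projects any new data the same way
def knn_predict(query_low, X_low, labels, k=5):
    # labels: NumPy array of class labels aligned with the rows of X_low.
    # Plurality vote among the k nearest points in the compressed feature space.
    order = np.argsort(np.linalg.norm(X_low - query_low, axis=1))[:k]
    values, counts = np.unique(labels[order], return_counts=True)
    return values[np.argmax(counts)]
# project = fit_pca(face_vectors, 50)
# knn_predict(project(new_face), project(face_vectors), ids)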

00:15:12.479 --> 00:15:15.620
OK, so that handles the dimensionality problem by squashing the features. But what about

00:15:15.620 --> 00:15:18.480
the sheer volume of the data? Like, squashing

00:15:18.480 --> 00:15:20.600
the features helps, but if I still have a billion

00:15:20.600 --> 00:15:22.519
individual records, the last-minute open book

00:15:22.519 --> 00:15:24.740
search still takes way too long. Do we just,

00:15:24.740 --> 00:15:26.899
I don't know, randomly delete half the historical

00:15:26.899 --> 00:15:29.700
data and hope for the best? Not randomly, no.

00:15:29.840 --> 00:15:32.200
We use data reduction techniques to systematically

00:15:32.200 --> 00:15:35.519
eliminate stored examples. A classic method mentioned

00:15:35.519 --> 00:15:37.799
in the text is the condensed nearest neighbor

00:15:37.799 --> 00:15:41.059
rule, or CNN, also known as the Hart algorithm.

00:15:41.320 --> 00:15:44.340
OK, so how does the Hart algorithm decide what

00:15:44.340 --> 00:15:47.240
gets kept and what gets thrown in the trash without

00:15:47.240 --> 00:15:49.600
ruining the accuracy? It looks for what we call

00:15:49.600 --> 00:15:52.559
prototypes. Essentially, if a data point is sitting

00:15:52.559 --> 00:15:54.919
completely surrounded by other points of its

00:15:54.919 --> 00:15:57.659
exact same class, it's considered an absorbed

00:15:57.659 --> 00:16:00.360
point. Absorbed. Yeah. It doesn't offer any new

00:16:00.360 --> 00:16:02.779
information for drawing the boundary lines between

00:16:02.779 --> 00:16:05.299
categories. It's redundant. It's safe and obvious.
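
NOTE
A compact sketch of the condensed nearest neighbor (Hart) idea described here: keep adding any point that the current prototype set misclassifies, so "absorbed" interior points get dropped while informative points survive. Real implementations add ordering and tie-breaking details omitted in this sketch.
import math
def condense(examples):
    prototypes = [examples[0]]
    changed = True
    while changed:
        changed = False
        for features, label in examples:
            nearest = min(prototypes, key=lambda p: math.dist(features, p[0]))
            if nearest[1] != label:            # misclassified by the stored prototypes
                prototypes.append((features, label))
                changed = True
    return prototypes                          # often a small fraction of the original data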

00:16:05.740 --> 00:16:08.580
So the algorithm deletes it. It prioritizes keeping

00:16:08.580 --> 00:16:11.000
points that have a high border ratio. Border

00:16:11.000 --> 00:16:14.220
ratio. Meaning we want to keep the data points

00:16:14.220 --> 00:16:17.769
that sit right on the turbulent, messy edge between

00:16:17.769 --> 00:16:20.350
two different classes. Precisely. Those boundary

00:16:20.350 --> 00:16:22.789
cases are the most informative for making future

00:16:22.789 --> 00:16:25.490
decisions. The safe echo chambers in the middle

00:16:25.490 --> 00:16:28.950
of a cluster are just noise. By tossing them

00:16:28.950 --> 00:16:31.269
out, this algorithm can sometimes reduce a data

00:16:31.269 --> 00:16:33.789
set to just 15 or 20 percent of its original

00:16:33.789 --> 00:16:36.590
size while maintaining full accuracy. Keep the

00:16:36.590 --> 00:16:39.049
edge cases, throw away the echo chamber. I absolutely

00:16:39.049 --> 00:16:40.990
love that. That's really efficient. Now, we've

00:16:40.990 --> 00:16:43.690
spent this entire time talking about how KNN

00:16:43.690 --> 00:16:46.730
uses similarity to find its neighbors. But

00:16:46.730 --> 00:16:49.129
the source material mentioned a variant that

00:16:49.129 --> 00:16:51.429
flips this entirely on its head, and I found

00:16:51.429 --> 00:16:53.629
it fascinating. The furthest neighbor variant.

00:16:54.730 --> 00:16:57.409
Instead of finding what's closest, the algorithm

00:16:57.409 --> 00:16:59.970
actively seeks out maximal dissimilarity. It

00:16:59.970 --> 00:17:02.409
looks for the total opposites. Why would we ever

00:17:02.409 --> 00:17:04.869
want an algorithm to actively seek out things

00:17:04.869 --> 00:17:08.079
that don't belong? It has a highly creative application

00:17:08.079 --> 00:17:10.819
in recommender systems. It's used to build what's

00:17:10.819 --> 00:17:13.380
called an inverted neighborhood model. So instead

00:17:13.380 --> 00:17:16.680
of a streaming platform doing the standard, you

00:17:16.680 --> 00:17:19.259
know... People who like this movie also like

00:17:19.259 --> 00:17:22.359
these exact same movies. What is the furthest

00:17:22.359 --> 00:17:26.079
neighbor doing? It identifies the users who are

00:17:26.079 --> 00:17:29.339
the absolute most dissimilar to you. And it uses

00:17:29.339 --> 00:17:32.000
their preferences inversely to generate your

00:17:32.000 --> 00:17:34.200
recommendations. That's wild. Now, historically,

00:17:34.539 --> 00:17:37.220
furthest neighbor methods achieve lower predictive

00:17:37.220 --> 00:17:39.779
accuracy. They aren't going to perfectly guess

00:17:39.779 --> 00:17:42.519
your next favorite movie. But what they do achieve

00:17:42.519 --> 00:17:45.599
is a massive increase in novelty and diversity.

00:17:45.660 --> 00:17:48.130
Oh, to break you out of the echo chamber. Exactly.

00:17:48.609 --> 00:17:51.130
They are designed to break you out of your algorithmic

00:17:51.130 --> 00:17:53.809
filter bubble by recommending things that the

00:17:53.809 --> 00:17:55.809
traditional nearest neighbor method would just

00:17:55.809 --> 00:17:58.690
completely ignore for being too weird or unpopular.

00:17:58.869 --> 00:18:00.529
An algorithm mathematically designed to give

00:18:00.529 --> 00:18:03.009
you a surprise. That's great. Are there other

00:18:03.009 --> 00:18:05.269
ways we use this to find the weirdos in the data?

00:18:05.519 --> 00:18:09.019
Yes, actually. KNN is also highly effective for

00:18:09.019 --> 00:18:11.920
anomaly detection. The mechanism is incredibly

00:18:11.920 --> 00:18:14.579
simple. If we measure the distance to your k-th

00:18:14.859 --> 00:18:17.660
nearest neighbor, and that distance is massively

00:18:17.660 --> 00:18:20.839
large, it tells us that your local density is

00:18:20.839 --> 00:18:23.650
very low. You are sitting out there... entirely

00:18:23.650 --> 00:18:26.210
by yourself. You're a statistical outlier. Exactly.
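
NOTE
A small sketch of the outlier score just described: each point's anomaly score is its distance to its k-th nearest neighbor, so isolated points in low-density regions get large scores. The brute-force all-pairs distance matrix is an implementation shortcut that only suits small datasets.
import numpy as np
def knn_outlier_scores(X, k=5):
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
    dist.sort(axis=1)          # column 0 is each point's distance to itself (0.0)
    return dist[:, k]          # distance to the k-th nearest neighbor, skipping self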

00:18:26.670 --> 00:18:29.349
Despite being such a simple foundational model,

00:18:29.529 --> 00:18:32.130
using KNN for finding anomalies and outliers

00:18:32.130 --> 00:18:34.630
works remarkably well. It even holds its own

00:18:34.630 --> 00:18:37.069
against much more complex modern data mining

00:18:37.069 --> 00:18:39.549
methods. So what does this all mean? Let's tie

00:18:39.549 --> 00:18:41.849
all of this together. We've gone from blindfolded

00:18:41.849 --> 00:18:44.809
people feeling houses to flattening faces into

00:18:44.809 --> 00:18:47.109
mathematical shadows. We have covered a lot of

00:18:47.109 --> 00:18:50.650
ground. At its core, KNN is this brilliantly

00:18:50.650 --> 00:18:52.789
simple concept from the 1950s, the idea that

00:18:52.789 --> 00:18:55.230
you are who you hang out with, mathematically

00:18:55.230 --> 00:18:58.150
brought to life. We've explored how it uses plurality

00:18:58.150 --> 00:19:01.029
voting, how it survives skewed data by turning

00:19:01.029 --> 00:19:03.329
up the volume on the closest neighbors, and how

00:19:03.329 --> 00:19:05.230
it survives the absolute nightmare that is the

00:19:05.230 --> 00:19:07.609
curse of dimensionality by squashing data down

00:19:07.609 --> 00:19:10.349
to its most useful prototypes. And if we connect

00:19:10.349 --> 00:19:12.950
this to the bigger picture... understanding these

00:19:12.950 --> 00:19:15.769
mechanisms, how algorithms handle noisy data,

00:19:16.250 --> 00:19:19.210
skewed populations and border cases, it's vital.

00:19:19.390 --> 00:19:22.029
It isn't just an academic exercise. It gives

00:19:22.029 --> 00:19:24.789
you a framework to think critically about how

00:19:24.789 --> 00:19:27.730
software is categorizing your own digital footprint

00:19:27.730 --> 00:19:30.849
every single day and why it might be making the

00:19:30.849 --> 00:19:33.410
assumptions it makes. Absolutely. And as we wrap,

00:19:33.470 --> 00:19:35.490
I want to leave you with one final thought to

00:19:35.490 --> 00:19:38.420
mull over, and this builds directly on something

00:19:38.420 --> 00:19:40.480
kind of hidden in the source material regarding

00:19:40.480 --> 00:19:42.839
the one nearest neighbor classifier. Oh, the

00:19:42.839 --> 00:19:44.740
simplest version. Right. This is the absolute

00:19:44.740 --> 00:19:47.279
simplest version, where k equals exactly 1. You

00:19:47.279 --> 00:19:49.900
only look at the single closest point. Ah, you

00:19:49.900 --> 00:19:52.220
were referring to the Bayes error rate property.

00:19:52.299 --> 00:19:55.059
It is one of the most profound mathematical proofs

00:19:55.059 --> 00:19:57.779
in classification logic. It really is. Because

00:19:57.779 --> 00:20:00.359
the math shows that as your historical data set

00:20:00.359 --> 00:20:03.380
grows and approaches infinity, simply using the

00:20:03.380 --> 00:20:05.960
one nearest neighbor rule guarantees an error

00:20:05.960 --> 00:20:08.160
rate that is no worse than twice the Bayes error

00:20:08.160 --> 00:20:10.200
rate. Which is huge. Right. For context, the

00:20:10.200 --> 00:20:13.410
Bayes error is the theoretical minimum possible

00:20:13.410 --> 00:20:16.390
error you can ever mathematically achieve. Think

00:20:16.390 --> 00:20:18.130
about the philosophical weight of that for a

00:20:18.130 --> 00:20:20.950
second. We constantly assume that making the

00:20:20.950 --> 00:20:24.289
best decision requires a complex consensus. Yeah,

00:20:24.470 --> 00:20:27.390
a committee. Exactly. We assume we need a weighted

00:20:27.390 --> 00:20:29.990
average of hundreds of different opinions or

00:20:29.990 --> 00:20:33.950
a massive convoluted decision tree. But this

00:20:33.950 --> 00:20:36.329
mathematical proof says that's not necessarily

00:20:36.329 --> 00:20:39.279
true. It challenges our entire assumption about

00:20:39.279 --> 00:20:41.700
what constitutes intelligence in decision-making.

00:20:42.079 --> 00:20:43.980
Right. With enough historical experience, with

00:20:43.980 --> 00:20:46.920
a massive enough memory of past data, simply

00:20:46.920 --> 00:20:49.799
looking at the single closest past example gets

00:20:49.799 --> 00:20:52.720
you incredibly close to the absolute best possible

00:20:52.720 --> 00:20:54.539
decision you could ever mathematically make.
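
NOTE
Written out, the asymptotic guarantee being paraphrased here is usually stated (for M classes, with R^* the Bayes error and R the limiting 1-NN error) as
R^* \le R \le R^*\left(2 - \tfrac{M}{M-1}\,R^*\right) \le 2R^*,
so with unlimited data the single nearest neighbor is provably within a factor of two of the best any classifier could do.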

00:20:54.700 --> 00:20:56.920
It's pretty mind-blowing. Does that mean having

00:20:56.920 --> 00:20:59.920
a massive, perfectly preserved memory of past

00:20:59.920 --> 00:21:02.680
experiences is actually more effective than having

00:21:02.680 --> 00:21:05.960
a highly complex reasoning process? If you're

00:21:05.960 --> 00:21:08.000
blindfolded in the city, maybe you don't need

00:21:08.000 --> 00:21:10.119
to feel five buildings to know where you are.

00:21:10.700 --> 00:21:12.819
If you've been to enough places, maybe you only

00:21:12.819 --> 00:21:15.019
need to touch one wall. Thank you so much for

00:21:15.019 --> 00:21:16.819
joining us on this deep dive. Keep questioning

00:21:16.819 --> 00:21:19.019
the algorithms, keep exploring the data, and

00:21:19.019 --> 00:21:20.019
we'll catch you next time.
