WEBVTT

00:00:00.000 --> 00:00:03.859
Imagine walking into, like, a massive sold-out

00:00:03.859 --> 00:00:07.000
stadium. Oh, yeah, like 80,000 people. Exactly.

00:00:07.240 --> 00:00:10.060
100,000 people all shouting at the exact same

00:00:10.060 --> 00:00:13.679
time. It's just this, this physical wall of white

00:00:13.679 --> 00:00:15.500
noise. Right. You can't hear anything. You can't

00:00:15.500 --> 00:00:17.739
make out a single conversation, let alone, you

00:00:17.739 --> 00:00:20.699
know, a single word. It is completely overwhelming.

00:00:21.140 --> 00:00:24.480
But what if you had this magical dial on your

00:00:24.480 --> 00:00:26.800
headphones? OK, I like where this is going. You

00:00:26.800 --> 00:00:29.379
turn it one click, and suddenly that chaotic

00:00:29.379 --> 00:00:31.920
background noise just drops. You just hear the

00:00:31.920 --> 00:00:35.119
collective, like, underlying mood of the crowd.

00:00:35.780 --> 00:00:37.840
And then you turn it one more click, and you

00:00:37.840 --> 00:00:40.859
clearly hear the three main chants echoing around

00:00:40.859 --> 00:00:43.560
the arena. You've somehow extracted the actual

00:00:43.560 --> 00:00:46.119
meaning hidden inside all that screaming. And

00:00:46.119 --> 00:00:48.119
the crazy thing is that dial actually exists

00:00:48.119 --> 00:00:50.340
mathematically. Right. I mean, you're basically

00:00:50.340 --> 00:00:53.320
filtering out the statistical noise to find the

00:00:53.320 --> 00:00:56.740
true underlying structure. It's what data

00:00:56.740 --> 00:00:58.740
scientists call finding the signal. And that

00:00:58.740 --> 00:01:00.340
is exactly what we are doing today, because,

00:01:00.500 --> 00:01:02.679
look, you the listener, we all live in an era

00:01:02.679 --> 00:01:05.700
of just absolute information overload. Oh, totally.

00:01:05.879 --> 00:01:08.260
We're drowning in it. You are bombarded with

00:01:08.260 --> 00:01:11.280
massive data sets every single day, whether you

00:01:11.280 --> 00:01:14.060
realize it or not. And the tool that acts like

00:01:14.060 --> 00:01:17.079
that magical headphone dial? It's called Principal

00:01:17.079 --> 00:01:20.280
Component Analysis, or PCA. Yeah, PCA. It's this

00:01:20.280 --> 00:01:23.519
120-year-old mathematical engine that secretly

00:01:23.519 --> 00:01:27.140
dictates how your intelligence is measured, how

00:01:27.140 --> 00:01:30.140
your country is ranked globally, even how algorithms

00:01:30.140 --> 00:01:32.000
know what you're going to buy. Yeah, it's everywhere.

00:01:32.120 --> 00:01:35.019
But it has a massive blind spot. So today, we're

00:01:35.019 --> 00:01:37.239
taking a deep dive into a stack of research,

00:01:37.760 --> 00:01:40.120
including a really comprehensive overview of

00:01:40.120 --> 00:01:43.409
PCA, to understand how modern science makes sense

00:01:43.409 --> 00:01:46.290
of the chaos. And critically, where that math

00:01:46.290 --> 00:01:48.689
goes terribly wrong. Right. Because it does go

00:01:48.689 --> 00:01:50.750
wrong. It really does. But, you know, it is the

00:01:50.750 --> 00:01:52.810
hidden architecture behind so much of our world.

00:01:53.209 --> 00:01:55.689
When you have a data set with hundreds or, I

00:01:55.689 --> 00:01:58.370
don't know, thousands of variables, human brains

00:01:58.370 --> 00:02:01.390
just shut down. Yeah. I mean, I can barely handle

00:02:01.390 --> 00:02:04.810
a spreadsheet with 20 columns. Exactly. We can't

00:02:04.810 --> 00:02:08.889
visualize a thousand dimensions. So PCA is the

00:02:08.889 --> 00:02:11.129
mathematical technique that takes that massive

00:02:11.129 --> 00:02:13.830
high-dimensional data and basically translates

00:02:13.830 --> 00:02:16.469
it into a lower-dimensional space. While keeping

00:02:16.469 --> 00:02:18.669
the most important info intact, right? Right.

00:02:18.669 --> 00:02:20.849
You don't want to lose the actual signal. See,

00:02:20.849 --> 00:02:23.490
jumping straight into high-dimensional mathematical

00:02:23.490 --> 00:02:26.469
spaces is usually where my brain just blue-screens.

00:02:26.490 --> 00:02:28.330
Fair enough, yeah. Before we get into the heavy

00:02:28.330 --> 00:02:30.389
mechanics of how a computer actually calculates

00:02:30.389 --> 00:02:34.129
this stuff, we really need to visualize what

00:02:34.129 --> 00:02:37.009
dimensionality reduction looks like in physical

00:02:37.009 --> 00:02:39.770
space. Well, the source material offers a really

00:02:39.770 --> 00:02:42.810
great geometric way to think about this. It describes

00:02:42.810 --> 00:02:45.969
PCA as fitting a p-dimensional ellipsoid to

00:02:45.969 --> 00:02:49.889
the data. An ellipsoid, so like a 3D oval, like

00:02:49.889 --> 00:02:51.669
a rugby ball or a zeppelin or something? Yeah,

00:02:51.669 --> 00:02:54.370
think of a swarm of bees. If you track their

00:02:54.370 --> 00:02:57.990
positions, they might form this giant fuzzy cloud

00:02:57.990 --> 00:03:00.370
in the shape of that rugby ball. Right, they're

00:03:00.370 --> 00:03:02.889
swarming around the queen or something. Exactly.

00:03:03.370 --> 00:03:06.250
And each axis of that ellipsoid represents what

00:03:06.250 --> 00:03:09.949
we call a principal component. Now, what's fascinating

00:03:09.949 --> 00:03:13.189
here is that if one axis of that rugby ball is

00:03:13.189 --> 00:03:16.229
really short or narrow, it means the variance,

00:03:16.509 --> 00:03:19.009
like the actual spread of the data along that

00:03:19.009 --> 00:03:21.389
direction is super small. Because the bees just

00:03:21.389 --> 00:03:24.449
aren't moving much in that specific direction.

00:03:24.610 --> 00:03:27.469
Right. So you can just drop that axis entirely

00:03:27.469 --> 00:03:30.150
to simplify the picture. You just delete it.

00:03:30.159 --> 00:03:32.060
You basically just ignore it. You reduce the

00:03:32.060 --> 00:03:34.819
dimensions, but you retain the axes where the

00:03:34.819 --> 00:03:37.120
swarm is spread out the widest. Because that's

00:03:37.120 --> 00:03:39.780
where the most important information, or movement,

00:03:40.080 --> 00:03:42.439
actually lives. Exactly. Okay, let's unpack this

00:03:42.439 --> 00:03:44.379
with another analogy. Because when I was reading

00:03:44.379 --> 00:03:46.560
the source, I was thinking of it like casting

00:03:46.560 --> 00:03:49.219
a shadow of a complex three -dimensional object

00:03:49.219 --> 00:03:52.300
onto a flat 2D wall. Oh, that's a really good

00:03:52.300 --> 00:03:53.919
way to picture it. Like if you have a physical

00:03:53.919 --> 00:03:56.719
teapot. Right. And you shine a flashlight on

00:03:56.719 --> 00:03:59.020
it straight from the front, the shadow on the

00:03:59.020 --> 00:04:01.000
wall might just look like a solid boring circle.

00:04:01.139 --> 00:04:03.280
Yeah, you wouldn't even know it's a teapot. Exactly.

00:04:03.780 --> 00:04:05.840
You've successfully reduced the dimensions from

00:04:05.840 --> 00:04:09.259
3D to 2D, but you've chosen a terrible angle.

00:04:09.759 --> 00:04:12.139
You lose all the information about what the object

00:04:12.139 --> 00:04:14.699
actually is. The essence of the object is completely

00:04:14.699 --> 00:04:17.670
lost in that specific projection. But... If you

00:04:17.670 --> 00:04:20.189
just rotate that teapot until the spout and the

00:04:20.189 --> 00:04:22.009
handle are perfectly silhouetted against the

00:04:22.009 --> 00:04:24.810
wall, you instantly recognize it. Right. You

00:04:24.810 --> 00:04:27.649
are rotating the object to find the angle that

00:04:27.649 --> 00:04:30.589
gives you the maximum spread, the most recognizable

00:04:30.589 --> 00:04:33.790
silhouette. In PCA terms, you're finding the

00:04:33.790 --> 00:04:36.170
plane with the maximum variance. Yeah, and the

00:04:36.170 --> 00:04:38.430
algorithm does exactly that, but sequentially.

00:04:38.730 --> 00:04:41.589
Step by step. Right. So the first principal component,

00:04:41.649 --> 00:04:44.689
that first axis, is a derived variable that explains

00:04:44.689 --> 00:04:47.149
the absolute most variance in the data. It's

00:04:47.149 --> 00:04:49.250
the longest axis of our rugby ball. Exactly.

00:04:49.389 --> 00:04:52.769
It's the best possible silhouette. Then the algorithm

00:04:52.769 --> 00:04:55.110
looks for the second principal component. This

00:04:55.110 --> 00:04:57.990
one explains the next most variance, but with

00:04:57.990 --> 00:05:01.240
a strict geometric rule. It has to be completely

00:05:01.240 --> 00:05:03.959
orthogonal. Orthogonal meaning? At a perfect

00:05:03.959 --> 00:05:06.500
right angle to the first one. Ah, okay. So it

00:05:06.500 --> 00:05:08.740
has to be completely uncorrelated to the first

00:05:08.740 --> 00:05:11.639
component. Right. It's capturing brand new information

00:05:11.639 --> 00:05:13.920
that the first one missed without any overlapping.

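NOTE The orthogonality rule described here can be checked numerically. A minimal sketch, assuming NumPy and invented data (it peeks ahead to the eigen-decomposition machinery the conversation gets to later):

```python
import numpy as np

rng = np.random.default_rng(0)
# A correlated 2-D cloud: the second variable tracks the first, plus noise.
x = rng.normal(0.0, 1.0, 300)
X = np.column_stack([x, 0.8 * x + rng.normal(0.0, 0.3, 300)])

# Principal components are the eigenvectors of the covariance matrix,
# ordered by eigenvalue, i.e. by how much variance each one explains.
vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

pc1, pc2 = vecs[:, 0], vecs[:, 1]
# PC2 sits at a perfect right angle to PC1 (dot product ~ 0),
# and it explains strictly less variance (vals[0] > vals[1]).
print(pc1 @ pc2)
```
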
00:05:14.139 --> 00:05:16.800
And it proceeds like that, step by step, finding

00:05:16.800 --> 00:05:19.540
new orthogonal directions until all the variance

00:05:19.540 --> 00:05:21.720
in your dataset is totally explained. So it takes

00:05:21.720 --> 00:05:24.800
a messy dataset where everything is highly correlated

00:05:24.800 --> 00:05:27.920
and tangled up. And transforms it into a neat,

00:05:28.240 --> 00:05:30.699
independent set of principal components. Okay,

00:05:31.079 --> 00:05:33.399
so visualizing rotating a teapot is one thing.

00:05:33.759 --> 00:05:36.540
But how does a computer actually rotate a spreadsheet

00:05:36.540 --> 00:05:39.660
with, like, a thousand columns? Yeah, the math

00:05:39.660 --> 00:05:42.810
part. The source mentions it operates using covariance

00:05:42.810 --> 00:05:45.269
matrices and eigenvectors. Yes. Let's translate

00:05:45.269 --> 00:05:47.930
that into plain English. How does the math actually

00:05:47.930 --> 00:05:50.470
find this maximum variance? Well, let's build

00:05:50.470 --> 00:05:53.149
one together. Imagine a data set with just two

00:05:53.149 --> 00:05:55.810
variables, height and shoe size. Okay, pretty

00:05:55.810 --> 00:05:58.629
standard. If you plot 100 people on a graph with

00:05:58.629 --> 00:06:00.790
height on the vertical axis and shoe size on

00:06:00.790 --> 00:06:03.370
the horizontal axis, the dots don't just form

00:06:03.370 --> 00:06:06.389
a random circle. No, they'd form like a diagonal

00:06:06.389 --> 00:06:10.470
line. Right. A diagonal cigar-shaped swarm pointing

00:06:10.470 --> 00:06:13.889
up and to the right. Because taller people naturally

00:06:13.889 --> 00:06:16.470
tend to have larger feet. The two variables are

00:06:16.470 --> 00:06:18.850
moving together. Exactly. And that relationship

00:06:18.850 --> 00:06:22.430
is what a covariance matrix captures. It's basically

00:06:22.430 --> 00:06:25.209
a mathematical grid that calculates exactly how

00:06:25.209 --> 00:06:28.410
much every single variable in your data set changes

00:06:28.410 --> 00:06:30.769
in tandem with every other variable. Got it.

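NOTE The covariance grid just described can be built in a few lines. A rough sketch, assuming NumPy; the height and shoe-size numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 invented people: height in cm, shoe size loosely tracking height.
height = rng.normal(170.0, 10.0, 100)
shoe = 0.25 * height + rng.normal(0.0, 1.0, 100)
X = np.column_stack([height, shoe])

# The covariance matrix: a symmetric grid where entry (i, j) measures
# how much variable i changes in tandem with variable j.
C = np.cov(X, rowvar=False)
print(C)  # the positive off-diagonal entry is the "moving together" signal
```
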
00:06:30.829 --> 00:06:33.069
And once you have that grid of relationships,

00:06:33.449 --> 00:06:36.759
the math calculates the eigenvectors. OK. The

00:06:36.759 --> 00:06:40.060
scary word. What is an eigenvector actually doing

00:06:40.060 --> 00:06:43.800
to our cigar-shaped swarm of dots? Honestly,

00:06:43.980 --> 00:06:46.279
an eigenvector is essentially just the computer

00:06:46.279 --> 00:06:48.680
drawing a brand new line straight through the

00:06:48.680 --> 00:06:51.600
longest part of that cigar shape. Oh, that's

00:06:51.600 --> 00:06:53.839
it. That's it. Instead of sticking with the original

00:06:53.839 --> 00:06:57.540
height and shoe size axes, PCA creates a new

00:06:57.540 --> 00:07:00.519
diagonal axis that captures both. So this new

00:07:00.519 --> 00:07:03.009
line is our first principal component. Yes.

00:07:03.290 --> 00:07:06.009
It represents a hidden latent concept, like maybe

00:07:06.009 --> 00:07:08.170
we just call it overall size. OK, that makes

00:07:08.170 --> 00:07:09.769
sense. And the eigenvalue is just the number

00:07:09.769 --> 00:07:11.910
that tells you exactly how much of the total variance

00:07:11.910 --> 00:07:14.310
is captured by that new line. OK, that feels

00:07:14.310 --> 00:07:18.069
like pure elegant logic. Yeah. The math just

00:07:18.069 --> 00:07:21.410
finds the natural variance. It does. But the

00:07:21.410 --> 00:07:24.329
source material points out a massive trap here

00:07:24.329 --> 00:07:28.360
that kind of completely shatters that illusion

00:07:28.360 --> 00:07:32.160
of mathematical objectivity. Ah yes, the scaling

00:07:32.160 --> 00:07:35.800
trap. The scaling trap. It turns out PCA is incredibly

00:07:35.800 --> 00:07:38.920
sensitive to how human beings format the numbers

00:07:38.920 --> 00:07:41.540
before the algorithm even boots up. If we connect

00:07:41.540 --> 00:07:43.949
this to the bigger picture, the scaling trap

00:07:43.949 --> 00:07:46.709
reveals how fragile this kind of data analysis

00:07:46.709 --> 00:07:49.470
can actually be. The source explicitly notes

00:07:49.470 --> 00:07:51.810
that if you multiply just one variable in your

00:07:51.810 --> 00:07:54.269
data set by 100... Just one column. Right, just

00:07:54.269 --> 00:07:56.529
one column. The first principal component will

00:07:56.529 --> 00:07:58.829
almost entirely align with that single variable.

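NOTE The sensitivity described here is easy to reproduce. A small sketch, assuming NumPy; the data and the factor of 100 are illustrative. It shows the first principal component snapping to an artificially inflated column, and standardization undoing the damage:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two variables with comparable spread and a genuine relationship.
a = rng.normal(0.0, 1.0, 200)
b = 0.5 * a + 0.5 * rng.normal(0.0, 1.0, 200)
X = np.column_stack([a, b])

def first_pc(data):
    # First principal component = leading eigenvector of the covariance matrix.
    vals, vecs = np.linalg.eigh(np.cov(data, rowvar=False))
    return vecs[:, np.argmax(vals)]

# Inflate just one column by 100 (say, a unit change): PC1 snaps to it.
X_inflated = X.copy()
X_inflated[:, 0] *= 100.0
print(abs(first_pc(X_inflated)[0]))  # close to 1: the big column dominates

# The fix: standardize to zero mean and unit variance before running PCA.
Z = (X_inflated - X_inflated.mean(axis=0)) / X_inflated.std(axis=0)
print(abs(first_pc(Z)[0]))  # balanced again: both variables contribute
```

In practice this z-scoring step is exactly the "level the playing field" prep work discussed later in the episode.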
00:07:58.970 --> 00:08:01.569
Because the algorithm is blindly hunting for

00:08:01.569 --> 00:08:03.810
variance. Yeah. Like if you artificially inflate

00:08:03.810 --> 00:08:05.870
a number, the algorithm acts like a dog seeing

00:08:05.870 --> 00:08:07.730
a squirrel. Yeah, that's exactly it. It just

00:08:07.730 --> 00:08:09.850
chases the biggest number and ignores the actual

00:08:09.850 --> 00:08:12.370
relationships. The text gives... a brilliant,

00:08:12.370 --> 00:08:15.189
simple illustration of this involving temperature.

00:08:15.709 --> 00:08:17.889
Imagine you're analyzing a data set that includes,

00:08:17.889 --> 00:08:20.509
say, temperature and humidity. If you record

00:08:20.509 --> 00:08:22.370
that temperature in Fahrenheit, the numbers are

00:08:22.370 --> 00:08:24.730
going to be larger and have a wider numeric spread

00:08:24.730 --> 00:08:26.949
than if you record the exact same physical temperatures

00:08:26.949 --> 00:08:29.689
in Celsius. Right, because the gap between freezing

00:08:29.689 --> 00:08:33.629
and boiling in Fahrenheit is 180 degrees, but

00:08:33.629 --> 00:08:36.879
in Celsius, it's only 100. Exactly. It's the

00:08:36.879 --> 00:08:39.000
exact same physical heat just written with bigger

00:08:39.000 --> 00:08:42.539
digits. Right. So if you run PCA on the Fahrenheit

00:08:42.539 --> 00:08:44.980
data set, you will get a completely different

00:08:44.980 --> 00:08:47.039
principal component than if you run it on the

00:08:47.039 --> 00:08:50.279
Celsius data. Wow. The algorithm will mathematically

00:08:50.279 --> 00:08:53.039
conclude that temperature is a much more dominant

00:08:53.039 --> 00:08:56.120
important factor in the Fahrenheit data set purely

00:08:56.120 --> 00:08:58.840
because the digits themselves are larger. That

00:08:58.840 --> 00:09:01.259
is wild. If your variables have different units

00:09:01.259 --> 00:09:05.059
like mixing temperature in degrees, mass in kilograms,

00:09:05.360 --> 00:09:08.220
distance in millimeters, PCA becomes completely

00:09:08.220 --> 00:09:10.759
arbitrary. The variable with the largest absolute

00:09:10.759 --> 00:09:12.919
numbers just dominates the shadow on the wall.

00:09:13.299 --> 00:09:15.639
Completely. See, we tend to treat algorithms

00:09:15.639 --> 00:09:19.279
as these unbiased arbiters of truth. You know,

00:09:19.360 --> 00:09:21.740
if a computer spits out a graph, we just assume

00:09:21.740 --> 00:09:24.379
it's objective science. Right, we trust the machine.

00:09:24.679 --> 00:09:27.159
But if the human running the numbers doesn't

00:09:27.159 --> 00:09:29.970
perfectly standardize the data beforehand, the

00:09:29.970 --> 00:09:32.769
truth the algorithm finds is totally warped.

00:09:32.950 --> 00:09:35.929
Like, a casual choice between Celsius and Fahrenheit

00:09:35.929 --> 00:09:38.350
completely changes what the computer tells you

00:09:38.350 --> 00:09:40.250
is the fundamental structure of your data. Which

00:09:40.250 --> 00:09:42.870
is why that human prep work is mandatory. You

00:09:42.870 --> 00:09:45.669
have to mean center the data. Mean centering.

00:09:45.809 --> 00:09:48.009
Meaning what, exactly? It means subtracting the

00:09:48.009 --> 00:09:50.409
average so everything is centered on zero. And

00:09:50.409 --> 00:09:52.610
you also have to standardize the units so every

00:09:52.610 --> 00:09:54.809
variable starts with equal variance. So you level

00:09:54.809 --> 00:09:56.509
the playing field before the computer looks at

00:09:56.509 --> 00:09:59.299
it. Exactly. Without that human intervention,

00:09:59.539 --> 00:10:02.940
the math is entirely blind. It's honestly terrifying

00:10:02.940 --> 00:10:06.519
that a simple unit choice can warp the data that

00:10:06.519 --> 00:10:09.340
badly. Because given how fragile and sensitive

00:10:09.340 --> 00:10:12.059
this algorithm is to human input, you'd think

00:10:12.059 --> 00:10:14.299
we'd only use it in highly controlled lab experiments,

00:10:14.460 --> 00:10:16.600
right? I would think so. But we actually use

00:10:16.600 --> 00:10:20.500
it to define incredibly messy, subjective human

00:10:20.500 --> 00:10:23.419
concepts. The source outlines applications that

00:10:23.419 --> 00:10:26.360
directly impact almost every aspect of society.

00:10:26.539 --> 00:10:30.419
Yeah, because PCA handles chaotic, multivariable

00:10:30.419 --> 00:10:33.879
environments so well, it really excels anywhere

00:10:33.879 --> 00:10:36.100
humans are trying to pin down abstract ideas.

00:10:36.340 --> 00:10:37.919
Take psychology, for example. OK, let's talk

00:10:37.919 --> 00:10:40.379
about psychology. In the early 1900s, researchers

00:10:40.379 --> 00:10:42.860
hypothesized that human intelligence wasn't just

00:10:42.860 --> 00:10:46.500
a single metric. It was a complex mix of spatial

00:10:46.500 --> 00:10:49.200
reasoning, verbal skills, mathematical deduction,

00:10:49.519 --> 00:10:52.460
memory. But you can't put a physical ruler up

00:10:52.460 --> 00:10:55.159
to someone's brain and measure deduction. Right.

00:10:55.279 --> 00:10:58.379
You can only give them tests. Right. So in 1904,

00:10:58.639 --> 00:11:00.879
Charles Spearman developed factor analysis, which

00:11:00.879 --> 00:11:03.419
is mathematically intertwined with PCA, to find

00:11:03.419 --> 00:11:06.159
the underlying components of intelligence. 1904.

00:11:06.419 --> 00:11:09.279
Yeah, and by 1924, a psychologist named Thurstone

00:11:09.279 --> 00:11:12.539
was using it to look for 56 distinct factors.

00:11:12.779 --> 00:11:15.940
56. Right. And if a person scored highly on a

00:11:15.940 --> 00:11:18.379
vocabulary test, a reading comprehension test,

00:11:18.580 --> 00:11:21.440
and a verbal analogy test, the math would recognize

00:11:21.440 --> 00:11:23.919
that those variables all moved in tandem. Just

00:11:23.919 --> 00:11:27.549
like heightened shoe size. Exactly. PCA draws

00:11:27.549 --> 00:11:30.990
a new axis through that variance and calls that

00:11:30.990 --> 00:11:34.889
hidden latent variable verbal intelligence. And

00:11:34.889 --> 00:11:36.730
this mathematical transformation is literally

00:11:36.730 --> 00:11:39.490
the bedrock of the standard IQ tests we still

00:11:39.490 --> 00:11:42.110
use today. We are using a mathematical concept

00:11:42.110 --> 00:11:44.730
from 1901 to define what it means to be smart.

00:11:44.870 --> 00:11:46.950
Pretty much, yeah. And we do the exact same thing

00:11:46.950 --> 00:11:49.570
in global economics, which is crazy to me. The

00:11:49.570 --> 00:11:52.320
United Nations uses the Human Development Index,

00:11:52.440 --> 00:11:55.299
the HDI, to rank countries. Right. But how do

00:11:55.299 --> 00:11:58.259
you mathematically rank development? A spreadsheet

00:11:58.259 --> 00:12:00.480
for a country has columns for life expectancy,

00:12:00.899 --> 00:12:04.080
education levels, per capita income, infrastructure.

00:12:04.220 --> 00:12:06.899
It's a huge mix of data. You cannot add a year

00:12:06.899 --> 00:12:09.659
of life expectancy to a dollar of GDP. It's comparing

00:12:09.659 --> 00:12:12.360
apples to tractors. Well, by applying PCA to

00:12:12.360 --> 00:12:14.639
all those diverse indicators, you extract the

00:12:14.639 --> 00:12:16.700
principal components. The text notes that the

00:12:16.700 --> 00:12:18.320
City Development Index actually started with

00:12:18.320 --> 00:12:22.049
about 200 indicators from 254 global cities. 200

00:12:22.049 --> 00:12:24.549
columns of noise. Exactly, 200 columns of noise.

00:12:24.549 --> 00:12:27.929
But PCA boiled that massive complexity down. It

00:12:27.929 --> 00:12:29.889
found that the coefficients roughly correlated

00:12:29.889 --> 00:12:32.490
to the average costs of providing services. Oh,

00:12:32.490 --> 00:12:36.070
wow. Yeah, creating a single powerful metric for

00:12:36.070 --> 00:12:39.090
effective physical and social investment. It found

00:12:39.090 --> 00:12:41.950
the core signal of development hiding inside

00:12:41.950 --> 00:12:44.370
millions of disparate data points. Here's where

00:12:44.370 --> 00:12:47.350
it gets really interesting, though. It gets even

00:12:47.350 --> 00:12:50.429
more granular than that. It's used heavily in

00:12:50.429 --> 00:12:52.610
market research. Oh, absolutely. The source brought

00:12:52.610 --> 00:12:55.990
up the 2013 Oxford Internet Survey. They asked

00:12:55.990 --> 00:12:59.149
2,000 people a massive battery of questions

00:12:59.149 --> 00:13:01.330
about their attitudes and beliefs regarding the

00:13:01.330 --> 00:13:03.070
internet. I can only imagine the data there.

00:13:03.269 --> 00:13:05.850
Right. Questions like, do you use the internet

00:13:05.850 --> 00:13:09.169
to avoid thinking about your problems? Or do

00:13:09.169 --> 00:13:11.730
you use it to find romantic partners? Do you

00:13:11.730 --> 00:13:15.139
use it to pay bills faster? A giant spreadsheet

00:13:15.139 --> 00:13:17.559
of thousands of survey answers is essentially

00:13:17.559 --> 00:13:19.679
that stadium of people shouting, you just can't

00:13:19.679 --> 00:13:23.279
read it. So the researchers ran PCA on that exact

00:13:23.279 --> 00:13:25.700
survey data. Yep. The math looked at how all

00:13:25.700 --> 00:13:28.059
those answers covary. It found that people

00:13:28.059 --> 00:13:30.879
who answered yes to avoiding work also tended

00:13:30.879 --> 00:13:33.360
to answer yes to playing online games. Makes

00:13:33.360 --> 00:13:36.440
sense. It clustered these behaviors and extracted

00:13:36.440 --> 00:13:39.500
four principal dimensions that completely summarize

00:13:39.500 --> 00:13:41.759
our relationship with the Internet. Four dimensions

00:13:41.759 --> 00:13:44.539
out of all those questions. Just four. The text

00:13:44.539 --> 00:13:48.679
lists them as escape, social networking, efficiency,

00:13:49.159 --> 00:13:51.720
and problem creating. You feed the machine thousands

00:13:51.720 --> 00:13:54.740
of arbitrary survey bubbles, and the math draws

00:13:54.740 --> 00:13:58.799
four new axes. It reveals that human beings fundamentally

00:13:58.799 --> 00:14:02.440
use the internet to escape reality, talk to friends,

00:14:02.759 --> 00:14:05.909
work faster, or cause trouble. It's amazing.

00:14:06.110 --> 00:14:08.210
It literally found the psychological silhouettes

00:14:08.210 --> 00:14:11.090
of our digital lives. It did. And if we move

00:14:11.090 --> 00:14:14.769
from psychology to pure biology, PCA is also

00:14:14.769 --> 00:14:17.429
heavily utilized in neuroscience. Neuroscience.

00:14:17.509 --> 00:14:20.009
Wait, so we're going from ranking national infrastructure

00:14:20.009 --> 00:14:23.230
to mapping physical brains. Yes. When neuroscientists

00:14:23.230 --> 00:14:25.649
record brain activity, a single electrode might

00:14:25.649 --> 00:14:27.769
pick up the electrical action potentials, the

00:14:27.769 --> 00:14:30.149
microscopic spikes from several different neurons

00:14:30.149 --> 00:14:32.490
firing all at once. So it's literally the stadium

00:14:32.490 --> 00:14:35.309
analogy, but inside your head. Yes, exactly.

00:14:35.470 --> 00:14:38.629
It's a jumble of electrical noise. So neuroscientists

00:14:38.629 --> 00:14:41.129
use a technique called spike-triggered covariance,

00:14:41.289 --> 00:14:43.730
which relies heavily on PCA to sort those spikes

00:14:43.730 --> 00:14:46.080
out. Okay, how does that work? They reduce the

00:14:46.080 --> 00:14:48.100
dimensionality of the waveforms to figure out

00:14:48.100 --> 00:14:50.960
exactly which specific properties of a visual

00:14:50.960 --> 00:14:54.659
or auditory stimulus are causing a specific individual

00:14:54.659 --> 00:14:57.519
neuron to fire. I am just marveling at the versatility

00:14:57.519 --> 00:15:00.399
here. Karl Pearson invents a mathematical formula

00:15:00.399 --> 00:15:03.500
in 1901 to solve abstract mechanics problems.

00:15:03.519 --> 00:15:07.009
Right. And today... That exact same formula maps

00:15:07.009 --> 00:15:09.909
firing neurons, ranks the development of nations,

00:15:10.429 --> 00:15:13.210
measures your IQ, and figures out why we're addicted

00:15:13.210 --> 00:15:16.309
to social media. It is an omnipresent lens. It

00:15:16.309 --> 00:15:18.710
is a remarkable lens, but... And this is crucial,

00:15:18.850 --> 00:15:20.990
lenses can distort reality just as easily as

00:15:20.990 --> 00:15:23.549
they focus it. The source material is very clear

00:15:23.549 --> 00:15:26.409
about the limitations of PCA. When this mathematical

00:15:26.409 --> 00:15:29.090
tool is misused, or its fundamental assumptions

00:15:29.090 --> 00:15:32.049
are ignored, it causes severe real-world friction.

00:15:32.330 --> 00:15:34.210
OK, let's talk about when the magic dial on the

00:15:34.210 --> 00:15:36.889
headphones breaks. Because if this math governs

00:15:36.889 --> 00:15:39.110
so much of our world, we really need to know

00:15:39.110 --> 00:15:41.710
where it fails. Well, the most fundamental flaw

00:15:41.710 --> 00:15:44.450
is that PCA relies entirely on a linear model.

00:15:45.070 --> 00:15:48.169
Linear meaning straight lines. Right. It assumes

00:15:48.169 --> 00:15:50.629
the data behaves in straight lines, like our

00:15:50.629 --> 00:15:53.669
cigar-shaped scatterplot from earlier. But the

00:15:53.669 --> 00:15:56.750
text warns that if a data set has a hidden nonlinear

00:15:56.750 --> 00:15:59.750
pattern, say, the data points form a complex

00:15:59.750 --> 00:16:03.549
curve, or a U-shape, or a spiral. What does PCA

00:16:03.549 --> 00:16:06.120
do, then? It won't just fail to find the pattern.

00:16:06.200 --> 00:16:08.559
It will draw a straight line right through the

00:16:08.559 --> 00:16:10.740
middle of the spiral and steer your analysis

00:16:10.740 --> 00:16:13.299
completely backward. You'll draw entirely the

00:16:13.299 --> 00:16:16.200
wrong conclusions. Yikes. And there are also

00:16:16.200 --> 00:16:18.860
specific scientific fields where the basic requirements

00:16:18.860 --> 00:16:22.100
of PCA literally break the laws of physics. Yes.

00:16:22.259 --> 00:16:24.320
Astronomy is a fantastic example from the text.

00:16:24.480 --> 00:16:26.240
We talked earlier about how you absolutely have

00:16:26.240 --> 00:16:28.700
to mean center your data, right? Where you subtract

00:16:28.700 --> 00:16:30.700
the average so everything centers around zero.

00:16:30.779 --> 00:16:33.500
Right. That is mandatory prep work. Or the eigenvector

00:16:33.320 --> 00:16:35.120
will just point toward the center of the data

00:16:35.120 --> 00:16:37.659
instead of the actual variance. But in astrophysics,

00:16:37.700 --> 00:16:40.519
you are measuring light. You are analyzing signals

00:16:40.519 --> 00:16:43.600
from stars and galaxies. And astrophysical signals

00:16:43.600 --> 00:16:46.019
are strictly non-negative. Because you cannot

00:16:46.019 --> 00:16:48.960
have negative light. Exactly. But when astronomers

00:16:48.960 --> 00:16:52.840
run PCA and do that mandatory mean removal process,

00:16:53.539 --> 00:16:56.200
the math subtracts the average from everything.

00:16:56.659 --> 00:17:00.149
So if a star emits five units of light, and the

00:17:00.149 --> 00:17:03.409
average is 10, the math says that star now emits

00:17:03.409 --> 00:17:05.650
negative five units of light. It creates what

00:17:05.650 --> 00:17:09.210
the text calls unphysical negative fluxes. The

00:17:09.210 --> 00:17:11.809
algorithm basically invents a scenario that cannot

00:17:11.809 --> 00:17:14.920
exist in reality just to satisfy its own equations.

00:17:15.259 --> 00:17:16.880
This raises an important question, though. I

00:17:16.880 --> 00:17:19.279
mean, the source notes that astronomers often

00:17:19.279 --> 00:17:22.059
have to abandon PCA entirely and use a different

00:17:22.059 --> 00:17:24.859
method called non-negative matrix factorization.

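NOTE The alternative mentioned here can be sketched in a few lines. A toy non-negative matrix factorization using the classic multiplicative updates, assuming NumPy; the "spectra" are random stand-ins, not real astronomy data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy stand-in for spectra: 50 observations x 20 bins, strictly non-negative.
X = rng.random((50, 20))

# Minimal NMF via multiplicative updates: X ~ W @ H with W, H >= 0.
k, eps = 3, 1e-9
W = rng.random((50, k)) + eps
H = rng.random((k, 20)) + eps
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

# Unlike mean-centered PCA, the factors can never go negative,
# so no "unphysical negative fluxes" are invented.
print(W.min(), H.min())
```

The multiplicative updates only ever rescale non-negative entries, which is why the physical boundary is respected by construction.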
00:17:25.059 --> 00:17:27.940
NMF, right. Right, NMF. Precisely because it

00:17:27.940 --> 00:17:29.880
respects the physical boundary that you can't

00:17:29.880 --> 00:17:32.630
have less than zero light. But think about that.

00:17:32.750 --> 00:17:35.589
If a rigid algorithm forces astrophysicists to

00:17:35.589 --> 00:17:38.230
literally invent negative light, we have to ask

00:17:38.230 --> 00:17:40.789
what it is doing to data in the softer sciences,

00:17:41.190 --> 00:17:43.150
where the boundaries of reality aren't as clear.

00:17:43.369 --> 00:17:45.390
That is a very good point. The source highlights

00:17:45.390 --> 00:17:47.690
a major controversy in population genetics that

00:17:47.690 --> 00:17:49.930
illustrates this exact danger. Oh, the Elhaik

00:17:49.930 --> 00:17:53.029
paper. Yes. This part of the source was fascinating.

00:17:53.750 --> 00:17:57.450
So in August 2022, a molecular biologist named

00:17:57.450 --> 00:18:01.349
Eran Elhaik published a paper analyzing 12 different

00:18:01.349 --> 00:18:04.890
PCA applications in population genetics. For

00:18:04.890 --> 00:18:07.589
context for you listening, PCA is constantly

00:18:07.589 --> 00:18:10.950
used to map genetic variation and trace ancient

00:18:10.950 --> 00:18:13.369
human migrations. Right, you take thousands of

00:18:13.369 --> 00:18:15.970
genetic markers, run PCA, and it clusters people

00:18:15.970 --> 00:18:18.670
together to show who migrated where. But Elhaik

00:18:18.670 --> 00:18:21.130
analyzed these highly cited studies, and according

00:18:21.130 --> 00:18:23.880
to the source, he called the results erroneous,

00:18:24.099 --> 00:18:26.380
contradictory, and absurd. Wow. He didn't hold

00:18:26.380 --> 00:18:28.460
back. No, he didn't. He argued that the results

00:18:28.460 --> 00:18:30.359
achieved in these genetic studies were characterized

00:18:30.359 --> 00:18:32.819
by cherry picking and circular reasoning. Which

00:18:32.819 --> 00:18:35.160
goes back to the prep work, right? Exactly. Because

00:18:35.160 --> 00:18:37.759
PCA is so incredibly sensitive to how the data

00:18:37.759 --> 00:18:39.819
is scaled and which variables are included, it's

00:18:39.819 --> 00:18:42.700
very easy to manipulate the method. If a researcher

00:18:42.700 --> 00:18:45.019
wants to prove that two ancient populations are

00:18:45.019 --> 00:18:47.740
distinct, they can pre-select the specific genetic

00:18:47.740 --> 00:18:49.900
markers that happen to vary between those groups,

00:18:50.319 --> 00:18:53.319
run PCA, and the algorithm will obediently draw

00:18:53.319 --> 00:18:56.420
a massive line dividing them. So is PCA just

00:18:56.420 --> 00:18:59.220
being used as a math-washing tool here? Math

00:18:59.220 --> 00:19:02.420
-washing. That's a great term. I mean, researchers

00:19:02.420 --> 00:19:06.299
can use this incredibly complex, dense math to

00:19:06.299 --> 00:19:09.359
make their cherry-picked subjective data look

00:19:09.359 --> 00:19:12.779
like rigorously objective science. To a layperson

00:19:12.779 --> 00:19:15.500
or even a peer reviewer, if you say the principal

00:19:15.500 --> 00:19:18.160
component analysis proves these groups are genetically

00:19:18.160 --> 00:19:21.259
distinct, it just sounds unassailable. It sounds

00:19:21.259 --> 00:19:23.539
like an absolute fact. You assume the computer

00:19:23.539 --> 00:19:26.099
did the work, not the biased human who prepped

00:19:26.099 --> 00:19:29.119
the data. The math itself isn't lying, but the

00:19:29.119 --> 00:19:31.619
human choices made before the math dictate the

00:19:31.619 --> 00:19:34.099
outcome. And the source also mentions researchers

00:19:34.099 --> 00:19:36.380
at Kansas State University who found another

00:19:36.380 --> 00:19:38.799
flaw. Oh, the autocorrelation thing. Yeah, they

00:19:38.799 --> 00:19:41.400
found that PCA can be seriously biased by sampling

00:19:41.400 --> 00:19:43.940
errors if something called autocorrelation isn't

00:19:43.940 --> 00:19:46.920
handled correctly. How does autocorrelation bias

00:19:46.920 --> 00:19:49.819
the results? So autocorrelation is when data

00:19:49.819 --> 00:19:52.279
points influence each other sequentially. Think

00:19:52.279 --> 00:19:54.000
of taking a temperature reading every single

00:19:54.000 --> 00:19:56.980
minute. The temperature at 1:00 p.m. and 1

00:19:56.980 --> 00:19:59.220
:01 p.m. are going to be basically identical.
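Both halves of that point can be sketched in a few lines of Python, with hypothetical per-minute numbers: adjacent readings of a drifting series really are nearly identical, and PCA, fed that series as if its rows were independent samples, piles almost all of the variance onto it.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500  # hypothetical: one reading per minute

# Autocorrelated series: each minute's temperature is last minute's
# value plus a tiny nudge (a random walk), so readings aren't independent.
temperature = np.cumsum(rng.normal(0.0, 1.0, n))

# A genuinely independent series with the same per-reading noise scale.
independent = rng.normal(0.0, 1.0, n)

# Adjacent temperature readings are nearly identical.
lag1 = np.corrcoef(temperature[:-1], temperature[1:])[0, 1]
print(lag1)  # close to 1.0

# Hand both columns to PCA as if every row were an independent sample.
data = np.column_stack([temperature, independent])
centered = data - data.mean(axis=0)
eigenvalues = np.linalg.svd(centered, compute_uv=False) ** 2

# The autocorrelated block of variance swamps the first component.
print(eigenvalues[0] / eigenvalues.sum())  # close to 1.0
```

The first component ends up pointing almost entirely at the temperature column, not because it carries more signal, but because the same slow drift was counted over and over.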

00:19:59.440 --> 00:20:01.160
Right, it doesn't change that fast. They aren't

00:20:01.160 --> 00:20:03.539
independent variables. They are correlated with

00:20:03.539 --> 00:20:06.539
themselves over time. If you feed that into PCA

00:20:06.539 --> 00:20:09.140
without correcting for it, the algorithm sees

00:20:09.140 --> 00:20:11.140
all those identical minute-by-minute readings

00:20:11.140 --> 00:20:13.740
as a massive block of variance and just skews

00:20:13.740 --> 00:20:15.700
the principal components toward it. It thinks

00:20:15.700 --> 00:20:17.980
it found a massive signal, but it's really just

00:20:17.980 --> 00:20:20.480
the same data repeating. Exactly. It's a reminder

00:20:20.480 --> 00:20:23.240
to always question the assumptions baked into

00:20:23.240 --> 00:20:25.160
the data models we encounter in the news. We've

00:20:25.160 --> 00:20:27.599
gone from casting shadows of teapots to measuring

00:20:27.599 --> 00:20:31.410
IQs to negative starlight to... It's been quite

00:20:31.410 --> 00:20:33.930
a journey. So what does this all mean? For you

00:20:33.930 --> 00:20:37.509
listening, think of PCA as a mental model. It

00:20:37.509 --> 00:20:39.970
is a lens for focusing on what truly matters,

00:20:40.470 --> 00:20:44.089
finding the axis of maximum variance in a sea

00:20:44.089 --> 00:20:47.089
of statistical noise. Yeah. In your own life,

00:20:47.150 --> 00:20:49.049
when you are overwhelmed with information at

00:20:49.049 --> 00:20:52.329
work or in the news, try to ask yourself, what

00:20:52.329 --> 00:20:54.950
is the principal component here? What is the

00:20:54.950 --> 00:20:57.549
one underlying variable that is actually driving

00:20:57.549 --> 00:21:00.819
all this noise? It's honestly a brilliant shortcut

00:21:00.819 --> 00:21:03.099
to reducing the overwhelm. It's an incredibly

00:21:03.099 --> 00:21:05.319
powerful framework for navigating the modern

00:21:05.319 --> 00:21:08.779
world. But, you know, there is one final really

00:21:08.779 --> 00:21:10.839
provocative concept from the source material

00:21:10.839 --> 00:21:13.640
regarding PCA and information theory. Oh yes,

00:21:14.019 --> 00:21:16.099
referencing Linsker's work. Exactly, referencing

00:21:16.099 --> 00:21:18.819
a researcher named Linsker. The underlying math

00:21:18.819 --> 00:21:22.420
proves something profound. Dimensionality reduction

00:21:22.420 --> 00:21:25.640
always results in a loss of information. Always.

00:21:25.779 --> 00:21:28.759
It only minimizes that loss under very specific,

00:21:29.119 --> 00:21:32.279
perfectly bell-shaped data distributions, what

00:21:32.279 --> 00:21:35.319
statisticians call Gaussian noise models. If

00:21:35.319 --> 00:21:37.579
your data isn't perfectly symmetrical Gaussian

00:21:37.579 --> 00:21:40.500
noise, then whenever you reduce dimensions, something

00:21:40.500 --> 00:21:42.980
real is permanently destroyed. You cannot cast

00:21:42.980 --> 00:21:45.660
a 2D shadow without permanently losing the 3D

00:21:45.660 --> 00:21:47.940
reality of the object. The shadow is simpler

00:21:47.940 --> 00:21:50.200
to understand, but it is fundamentally less real.
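That information loss doesn't need heavy math to see. A minimal numeric sketch (the shapes, scales, and seed here are arbitrary): cast a genuinely 3D point cloud down to its top two principal components, then try to rebuild it from the shadow.

```python
import numpy as np

rng = np.random.default_rng(7)

# A hypothetical 3D "object": a cloud of points with genuine variation
# along all three axes, so no flat plane contains it exactly.
points = rng.normal(size=(200, 3)) * np.array([3.0, 2.0, 1.0])

# Cast the 2D "shadow": project onto the top two principal components.
mean = points.mean(axis=0)
centered = points - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
shadow = centered @ vt[:2].T          # each point keeps only 2 coordinates

# Try to rebuild the 3D object from its shadow alone.
rebuilt = shadow @ vt[:2] + mean
error = np.abs(points - rebuilt).max()
print(error > 0.0)  # True -- the third dimension is gone for good
```

The projection is the best possible rank-2 approximation in the least-squares sense, and it still cannot get the points back: whatever varied along the discarded axis is permanently destroyed.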

00:21:50.410 --> 00:21:52.869
That's beautifully put. So here is a final thought

00:21:52.869 --> 00:21:55.990
to mull over. If we spend our entire lives constantly

00:21:55.990 --> 00:21:58.589
trying to use mental PCA, you know, trying to

00:21:58.589 --> 00:22:00.869
distill the overwhelming chaotic data of our

00:22:00.869 --> 00:22:03.130
daily experiences, our relationships, our politics

00:22:03.130 --> 00:22:06.130
into simple digestible principal components.

00:22:06.490 --> 00:22:09.609
What vital, quiet nuances are we permanently

00:22:09.609 --> 00:22:12.269
filtering out? By turning down the dial on the

00:22:12.269 --> 00:22:14.849
stadium noise to only hear the loudest, most

00:22:14.849 --> 00:22:18.609
dominant chants, what beautiful complex whispered

00:22:18.609 --> 00:22:20.990
conversations are we permanently losing as mere

00:22:20.990 --> 00:22:23.250
noise? Sometimes the noise isn't just noise.

00:22:23.910 --> 00:22:26.009
Sometimes it's the whole point. Until next time,

00:22:26.150 --> 00:22:26.869
keep diving deep.
