WEBVTT

00:00:00.000 --> 00:00:02.879
If someone, you know, handed you a spreadsheet

00:00:02.879 --> 00:00:05.719
with 10 million rows of numbers, you'd probably

00:00:05.719 --> 00:00:08.279
call it a nightmare. Oh, absolutely. I mean,

00:00:08.500 --> 00:00:10.320
most people would just run the other way. Right.

00:00:10.740 --> 00:00:12.779
But to an artificial intelligence, that same

00:00:12.779 --> 00:00:16.899
spreadsheet is well, it's the literal blueprint

00:00:16.899 --> 00:00:20.100
of reality. Welcome to the deep dive. We are

00:00:20.100 --> 00:00:21.980
so glad you're joining us today. Yeah, it's great

00:00:21.980 --> 00:00:23.440
to have you here. We're looking at something

00:00:23.440 --> 00:00:26.079
today that is practically the air we breathe

00:00:26.079 --> 00:00:29.059
in the modern world. Exactly. And yet. It's almost

00:00:29.059 --> 00:00:31.100
entirely invisible to us. I mean, we hear the

00:00:31.100 --> 00:00:33.659
word data constantly, right? It's everywhere. It's

00:00:33.659 --> 00:00:35.859
the ultimate buzzword in every boardroom. Right.

00:00:35.859 --> 00:00:38.320
It drives our global economy. And it's this invisible

00:00:38.320 --> 00:00:41.840
force deciding what media you consume. But today,

00:00:41.960 --> 00:00:45.200
our mission is to completely demystify the foundational

00:00:45.200 --> 00:00:48.320
container of the information age. Which is such

00:00:48.320 --> 00:00:50.640
a vital mission, honestly. We think so. We are

00:00:50.640 --> 00:00:52.960
taking the concept of a data set and breaking

00:00:52.960 --> 00:00:56.600
down exactly what it is, how it's built, and

00:00:56.600 --> 00:00:59.439
how it secretly powers everything from medical

00:00:59.439 --> 00:01:01.840
breakthroughs to your morning commute. Okay,

00:01:01.979 --> 00:01:04.239
let's unpack this. Let's start with the absolute

00:01:04.239 --> 00:01:06.700
bedrock definition because I think people overcomplicate

00:01:06.700 --> 00:01:09.859
it. Fundamentally, a data set is simply a collection

00:01:09.859 --> 00:01:12.680
of data. Just a collection. Right. And I know

00:01:12.680 --> 00:01:15.239
that sounds deceptively simple. It's almost like

00:01:15.239 --> 00:01:17.680
describing a massive library as, you know, just

00:01:17.849 --> 00:01:20.590
a room full of paper. Yeah, which totally undersells

00:01:20.590 --> 00:01:23.069
what a library actually is. Exactly. You have

00:01:23.069 --> 00:01:25.510
to realize that these collections are the empirical

00:01:25.510 --> 00:01:28.950
foundations of modern discovery. I mean, without

00:01:28.950 --> 00:01:31.170
the data set, there is no scientific method.

00:01:31.430 --> 00:01:33.969
There's no machine learning, no statistical analysis.

00:01:34.390 --> 00:01:36.930
It is the raw material of human experience and

00:01:36.930 --> 00:01:40.069
the natural world, but captured, standardized,

00:01:40.290 --> 00:01:42.790
and organized into a format that mathematics

00:01:42.790 --> 00:01:47.400
can actually process. Organized is really the

00:01:47.400 --> 00:01:49.980
critical operational word there because reality,

00:01:50.019 --> 00:01:52.959
as we experience it, is completely chaotic. Highly

00:01:52.959 --> 00:01:55.260
chaotic. Right. If you just have a giant unstructured

00:01:55.260 --> 00:01:57.620
pile of random facts and figures, well, that's

00:01:57.620 --> 00:02:00.219
not a data set. That's just noise. Yeah, it's

00:02:00.219 --> 00:02:03.659
useless. To extract value, you really need architecture.

00:02:04.040 --> 00:02:06.620
Now most of you listening are probably already

00:02:06.620 --> 00:02:08.879
intimately familiar with the most common structural

00:02:08.879 --> 00:02:11.000
reality we're talking about here. You mean tabular

00:02:11.000 --> 00:02:13.659
data? Tabular data, spreadsheets, I mean we all

00:02:13.659 --> 00:02:15.719
know how they work. You have rows representing

00:02:15.719 --> 00:02:18.800
individual records or observations and then columns

00:02:18.800 --> 00:02:21.400
representing the specific variables or attributes

00:02:21.400 --> 00:02:23.900
of those records. Right. But rather than just

00:02:23.900 --> 00:02:26.879
thinking of it as some mundane office tool, I

00:02:26.879 --> 00:02:29.460
like to think of a tabular data set like a high-

00:02:29.460 --> 00:02:32.259
resolution digital photograph. Oh, I like that.

00:02:32.259 --> 00:02:34.780
How so? Well, a digital photo captures a moment

00:02:34.780 --> 00:02:38.120
in reality through millions of individual pixels,

00:02:38.120 --> 00:02:41.659
right? And every single pixel is defined by specific

00:02:41.659 --> 00:02:44.180
numerical values for red, green, and blue. OK,

00:02:44.259 --> 00:02:46.400
yeah. A tabular data set functions the exact

00:02:46.400 --> 00:02:48.939
same way. But instead of capturing light, it

00:02:48.939 --> 00:02:51.639
captures complex societal or scientific reality.
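
NOTE
A minimal sketch of that rows-and-columns anatomy, assuming Python with
pandas; the column names and values are invented for illustration. Each
dict below is one observation (a "pixel"), each key one standardized
variable (a column).
  import pandas as pd
  rows = [
      {"person_id": 1, "height_cm": 172.0, "age_years": 34, "city": "Lisbon"},
      {"person_id": 2, "height_cm": 158.5, "age_years": 29, "city": "Porto"},
      {"person_id": 3, "height_cm": 181.2, "age_years": 41, "city": "Lisbon"},
  ]
  df = pd.DataFrame(rows)
  print(df.shape)   # (3, 4): 3 observations x 4 variables
  print(df.dtypes)  # numeric and nominal columns coexist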

00:02:52.270 --> 00:02:54.870
So each individual observation is a pixel, and

00:02:54.870 --> 00:02:57.169
the standardized variables are those RGB values.

00:02:57.229 --> 00:02:59.789
That makes total sense. Right. And if you assemble

00:02:59.789 --> 00:03:02.729
enough of these structural pixels, a highly accurate,

00:03:03.229 --> 00:03:05.909
measurable picture of reality emerges. I love

00:03:05.909 --> 00:03:08.430
that high resolution photograph metaphor, because

00:03:08.430 --> 00:03:10.270
it really highlights the main challenge here,

00:03:10.330 --> 00:03:13.509
which is digitizing reality. It's not easy. No,

00:03:13.550 --> 00:03:15.900
it's not. Some things are easy to translate into

00:03:15.900 --> 00:03:18.680
that digital picture. The numerical variables.

00:03:19.199 --> 00:03:21.039
I mean, if we are measuring a population, things

00:03:21.039 --> 00:03:24.900
like height in centimeters or age in years, those

00:03:24.900 --> 00:03:27.300
translate perfectly into numbers. So far, so

00:03:27.300 --> 00:03:30.400
good. Exactly. But reality isn't just numbers.

00:03:30.759 --> 00:03:33.879
A massive part of building a data set is translating

00:03:33.879 --> 00:03:36.740
what we call nominal data, which are non -numerical

00:03:36.740 --> 00:03:39.879
categories, into a mathematical framework. Precisely.

00:03:39.960 --> 00:03:42.599
Consider variables like a person's profession,

00:03:43.699 --> 00:03:46.159
their geographical location, or even a medical

00:03:46.159 --> 00:03:47.879
diagnosis. Right. You can't just put those in

00:03:47.879 --> 00:03:50.620
a calculator. Exactly. You cannot calculate the

00:03:50.620 --> 00:03:53.120
mathematical average of the color blue, and you

00:03:53.120 --> 00:03:55.319
definitely can't find a standard deviation of

00:03:55.319 --> 00:03:57.740
a zip code. Right. That would be absurd. It would

00:03:57.740 --> 00:04:01.000
be. And this is where standardizing variables

00:04:01.000 --> 00:04:03.539
into strict levels of measurement becomes the

00:04:03.539 --> 00:04:06.199
central challenge of data science. You have to

00:04:06.199 --> 00:04:09.719
translate abstract human concepts into a rigorously

00:04:09.719 --> 00:04:12.400
defined language that statistical algorithms

00:04:12.400 --> 00:04:14.520
can actually understand. So you have to put rules

00:04:14.520 --> 00:04:18.000
on it. Yes. Because if you don't define the parameters

00:04:18.000 --> 00:04:20.920
of your nominal data perfectly, the algorithm

00:04:20.920 --> 00:04:23.879
will try to treat a zip code like a regular integer

00:04:23.879 --> 00:04:27.100
and your entire mathematical picture of reality

00:04:27.100 --> 00:04:29.899
completely shatters. Man, which is why the structure

00:04:29.899 --> 00:04:33.339
is so vital. It really is like the ultimate God-

00:04:33.339 --> 00:04:35.519
level spreadsheet of life. That's a great way

00:04:35.519 --> 00:04:38.399
to put it. Yeah, I mean imagine you are a single

00:04:38.399 --> 00:04:41.220
row on a spreadsheet and every column is a different

00:04:41.220 --> 00:04:43.060
trait about you. Your height, your favorite color.
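
NOTE
One common remedy, though not the only one, is one-hot encoding, which
keeps an algorithm from treating a category code like a quantity. A
minimal pandas sketch with invented columns:
  import pandas as pd
  df = pd.DataFrame({
      "zip_code": ["10115", "90210", "10115"],      # nominal, despite the digits
      "favorite_color": ["blue", "green", "blue"],  # nominal
      "height_cm": [172.0, 158.5, 181.2],           # genuinely numeric
  })
  # Expand each category into its own 0/1 indicator column, so no
  # arithmetic is ever done on the raw codes themselves.
  encoded = pd.get_dummies(df, columns=["zip_code", "favorite_color"])
  print(encoded.columns.tolist())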

00:04:43.420 --> 00:04:45.699
It's taking the chaos of human existence and

00:04:45.699 --> 00:04:47.779
forcing it into neat little boxes. That's exactly

00:04:47.779 --> 00:04:49.220
what it's doing. And it's important to point

00:04:49.220 --> 00:04:51.959
out that while tabular data is the classic pixel-

00:04:51.959 --> 00:04:54.490
by-pixel format, data sets don't always have

00:04:54.490 --> 00:04:56.709
to be spreadsheets. No, not at all. They can

00:04:56.709 --> 00:04:59.829
also just be massive collections of documents

00:04:59.829 --> 00:05:02.990
or files, right? Like a server containing thousands

00:05:02.990 --> 00:05:05.730
of raw unstructured audio recordings of human

00:05:05.730 --> 00:05:09.149
speech is technically also a data set. It is.

00:05:09.430 --> 00:05:12.129
And navigating those unstructured data sets requires

00:05:12.129 --> 00:05:14.250
entirely different computational techniques.

00:05:14.430 --> 00:05:16.939
Like what? Well, things like natural language

00:05:16.939 --> 00:05:19.199
processing, just to extract the variables we

00:05:19.199 --> 00:05:21.319
actually care about from all that audio. Wow.
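
NOTE
A toy sketch of that unstructured-to-structured move in Python, using
already-transcribed text as a stand-in for raw audio (the speech-to-text
step itself is assumed):
  transcripts = ["the quick brown fox", "the lazy dog", "the fox again"]
  records = []
  for i, text in enumerate(transcripts):
      words = text.split()
      # Extract the variables we care about, one structured row per file.
      records.append({"file_id": i,
                      "n_words": len(words),
                      "mentions_fox": "fox" in words})
  print(records)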

00:05:21.639 --> 00:05:24.779
And the sheer scale of this discipline is...

00:05:24.779 --> 00:05:26.740
I mean, it's almost hard to wrap your head around.

00:05:26.920 --> 00:05:29.079
Oh, the scale is unbelievable. Yeah. In the open

00:05:29.079 --> 00:05:31.319
data movement, where governments and institutions

00:05:31.319 --> 00:05:33.459
make their information freely available to the

00:05:33.459 --> 00:05:36.360
public, they use the data set as a literal unit

00:05:36.360 --> 00:05:38.980
of measurement for transparency. Which is fantastic

00:05:38.980 --> 00:05:42.259
for research. It is. Take the European data portal,

00:05:42.540 --> 00:05:46.319
data.europa.eu. They aggregate over one million

00:05:46.319 --> 00:05:49.980
distinct data sets. One million independent collections

00:05:49.980 --> 00:05:52.779
of information. And remember, a single one of

00:05:52.779 --> 00:05:55.230
those data sets could easily contain tens of

00:05:55.230 --> 00:05:57.870
millions of individual rows. So we're talking

00:05:57.870 --> 00:06:00.629
billions, maybe trillions of data points. Oh,

00:06:00.629 --> 00:06:03.670
easily. The volume of standardized information

00:06:03.670 --> 00:06:07.170
we are sitting on right now as a species is staggering.

00:06:07.399 --> 00:06:09.959
It really is mind-blowing, but structuring data

00:06:09.959 --> 00:06:12.319
like this creates an entirely new problem. It

00:06:12.319 --> 00:06:14.939
always does. Because the high-resolution photograph

00:06:14.939 --> 00:06:18.120
metaphor works perfectly in theory. Every pixel

00:06:18.120 --> 00:06:20.819
is accounted for. Every variable is standardized.

00:06:21.180 --> 00:06:24.620
But the real world is rarely a perfect photograph.

00:06:24.839 --> 00:06:28.240
Reality is notoriously uncooperative. Very uncooperative.

00:06:28.439 --> 00:06:31.019
So let's talk about the messy reality of data

00:06:31.019 --> 00:06:33.920
collection and the math used to fix it. Because

00:06:33.920 --> 00:06:36.279
where do these massive data sets actually come

00:06:36.279 --> 00:06:38.149
from in the first place? Well, traditionally

00:06:38.149 --> 00:06:41.430
in statistics, they come from actual physical

00:06:41.430 --> 00:06:44.110
observations obtained by sampling a statistical

00:06:44.110 --> 00:06:46.790
population. Okay, meaning real people doing real

00:06:46.790 --> 00:06:49.389
things. Right. Researchers design a study, they

00:06:49.389 --> 00:06:51.730
measure environmental factors, they poll demographics,

00:06:52.089 --> 00:06:54.430
or they record astronomical phenomena. So they're

00:06:54.430 --> 00:06:57.550
out in the field. Exactly. Every single row corresponds

00:06:57.550 --> 00:06:59.670
to a hard observation of one element in that

00:06:59.670 --> 00:07:02.870
population. But interestingly, modern data sets

00:07:02.870 --> 00:07:05.050
don't always come from the physical world anymore.

00:07:05.259 --> 00:07:07.779
Yeah, they can be generated entirely by algorithms.

00:07:08.279 --> 00:07:10.920
Wait, generated by algorithms? Like fake data?

00:07:11.079 --> 00:07:13.720
We call it synthetic data. And it's a massive

00:07:13.720 --> 00:07:16.100
field right now. Oh, wow. It's often generated

00:07:16.100 --> 00:07:19.480
for the explicit purpose of testing complex software

00:07:19.480 --> 00:07:22.939
systems without compromising security. Oh, I

00:07:22.939 --> 00:07:24.500
think I see where you're going with this. Yeah.

00:07:24.920 --> 00:07:27.720
Let's say you build a revolutionary new database

00:07:27.720 --> 00:07:31.290
system for a national hospital network. You absolutely

00:07:31.290 --> 00:07:34.389
need to stress test your software with millions

00:07:34.389 --> 00:07:37.269
of patient records to see if it crashes. Right,

00:07:37.269 --> 00:07:40.110
but you can't just dump real people's medical

00:07:40.110 --> 00:07:42.730
history into an untested system. You obviously

00:07:42.730 --> 00:07:45.670
cannot use real, sensitive patient medical histories.

00:07:46.290 --> 00:07:49.069
So an algorithm generates a massive synthetic

00:07:49.069 --> 00:07:51.649
data set. That perfectly mimics the real thing.
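
NOTE
A toy sketch of the idea in Python with numpy and pandas. The fields and
distributions here are invented; a real synthetic-data generator would
first fit them to the statistics of the genuine records.
  import numpy as np
  import pandas as pd
  rng = np.random.default_rng(seed=42)
  n = 1_000_000  # stress-test volume, with no real patient involved
  synthetic = pd.DataFrame({
      "age": rng.integers(0, 100, size=n),
      "systolic_bp": rng.normal(120, 15, size=n).round(),
      "diagnosis_code": rng.choice(["A10", "B20", "C30"], size=n),
  })
  print(synthetic.head())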

00:07:51.810 --> 00:07:53.689
Exactly. It perfectly mimics the statistical

00:07:53.689 --> 00:07:56.189
properties of human health records. But it is

00:07:56.189 --> 00:07:58.149
completely artificial, so you can safely test

00:07:58.149 --> 00:08:00.470
the software. Okay, well, synthetic data makes

00:08:00.470 --> 00:08:02.509
perfect sense for testing, but let's go back

00:08:02.509 --> 00:08:05.250
to the human side, the actual physical observations,

00:08:05.509 --> 00:08:08.449
because when humans are involved, things break.

00:08:08.569 --> 00:08:10.860
They absolutely do. People skip questions on

00:08:10.860 --> 00:08:13.879
surveys, sensors malfunction during weather balloon

00:08:13.879 --> 00:08:17.120
flights, files get corrupted. We constantly run

00:08:17.120 --> 00:08:19.939
into the reality of missing or suspicious values.

00:08:19.959 --> 00:08:22.139
It's unavoidable. And from my understanding,

00:08:22.500 --> 00:08:25.459
these missing values have to be flagged. And

00:08:25.459 --> 00:08:28.579
very often, researchers use an imputation method

00:08:28.579 --> 00:08:32.519
to complete the data set. Yes, imputation. It's

00:08:32.519 --> 00:08:35.039
a cornerstone of data cleaning. OK, I have to

00:08:35.039 --> 00:08:37.639
push back on this concept of imputation. Sure.

00:08:38.159 --> 00:08:40.860
Wait, if imputation is used to complete missing

00:08:40.860 --> 00:08:44.039
or suspicious data, I mean, aren't statisticians

00:08:44.039 --> 00:08:46.419
essentially just guessing to fill in the blanks?

00:08:46.460 --> 00:08:48.320
It can look that way from the outside. Right.

00:08:48.340 --> 00:08:51.259
So how does that not completely corrupt the integrity

00:08:51.259 --> 00:08:54.039
of the data set? I mean, if my row on the spreadsheet

00:08:54.039 --> 00:08:56.559
is missing my income bracket and an algorithm

00:08:56.559 --> 00:08:59.340
just fills it in for me, you aren't recording

00:08:59.340 --> 00:09:01.559
reality anymore. You're altering it. You're painting

00:09:01.559 --> 00:09:03.659
over the photograph. What's fascinating here

00:09:03.659 --> 00:09:05.940
is how advanced statistics actually accounts

00:09:05.940 --> 00:09:09.019
for the shape of the unknown. It is absolutely

00:09:09.019 --> 00:09:12.059
not wild guessing. It's not. No, it is a highly

00:09:12.059 --> 00:09:14.720
calculated, mathematically rigorous probability.

00:09:15.259 --> 00:09:17.879
And it works because of the broader structural

00:09:17.879 --> 00:09:20.279
properties of the data set itself. Okay, walk

00:09:20.279 --> 00:09:21.720
me through the mechanics of that because I'm

00:09:21.720 --> 00:09:24.759
skeptical. How do you probabilistically guess

00:09:24.759 --> 00:09:27.620
my missing income without it being a mathematical

00:09:27.620 --> 00:09:31.090
lie? Well, every valid data set is defined by

00:09:31.090 --> 00:09:33.450
overarching statistical measures. Two of the

00:09:33.450 --> 00:09:35.909
most critical are standard deviation and kurtosis.

00:09:36.149 --> 00:09:38.730
OK. Standard deviation and kurtosis. Right. And

00:09:38.730 --> 00:09:41.110
these aren't just obscure terms from a college

00:09:41.110 --> 00:09:43.950
math class. They are the architectural blueprint

00:09:43.950 --> 00:09:46.669
of the data. Standard deviation tells us how

00:09:46.669 --> 00:09:48.429
spread out the numbers are from the average.

00:09:48.570 --> 00:09:51.029
Give me an example of that. Sure. If we're looking

00:09:51.029 --> 00:09:53.269
at the ages of students in a high school, the

00:09:53.269 --> 00:09:56.120
standard deviation is extremely tight, because

00:09:56.120 --> 00:09:59.259
almost everyone is between 14 and 18. But if we're

00:09:59.259 --> 00:10:01.480
looking at the ages of people in a large shopping

00:10:01.480 --> 00:10:04.259
mall. The standard deviation is huge. You have

00:10:04.259 --> 00:10:06.299
toddlers and you have octogenarians. Okay, right.

00:10:06.500 --> 00:10:08.960
So you understand the general spread of the reality

00:10:08.960 --> 00:10:11.740
you've captured. Exactly. And kurtosis describes

00:10:11.740 --> 00:10:13.799
the specific shape of the data's distribution,

00:10:14.340 --> 00:10:16.980
focusing primarily on the tails or the extremes.

00:10:17.100 --> 00:10:19.480
The tails. Yeah, kurtosis tells you the probability

00:10:19.480 --> 00:10:22.340
of finding extreme outliers. Is it a smooth normal

00:10:22.340 --> 00:10:25.879
bell curve? Or is it a distribution with incredibly

00:10:25.879 --> 00:10:29.340
fat tails, meaning wild extreme values are actually

00:10:29.340 --> 00:10:31.960
quite common in this specific population? I see.
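
NOTE
A quick sketch of both measures on made-up samples, assuming Python with
numpy and scipy. scipy reports excess kurtosis, roughly 0 for a normal
bell curve and clearly positive for fat tails.
  import numpy as np
  from scipy import stats
  rng = np.random.default_rng(0)
  school_ages = rng.integers(14, 19, size=500)  # tight spread: 14 to 18
  mall_ages = rng.integers(1, 90, size=500)     # wide spread: toddlers to 89
  print(school_ages.std(), mall_ages.std())     # spread around the average
  bell = rng.normal(size=5000)                  # smooth bell curve
  fat_tailed = rng.standard_t(df=3, size=5000)  # wild outliers are common
  print(stats.kurtosis(bell), stats.kurtosis(fat_tailed))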

00:10:32.100 --> 00:10:34.720
So standard deviation gives you the spread, and

00:10:34.720 --> 00:10:37.679
kurtosis tells you how likely it is to find weird

00:10:37.679 --> 00:10:41.230
extreme anomalies. Yes. And by analyzing these

00:10:41.230 --> 00:10:43.850
broader properties, the standard deviation, the

00:10:43.850 --> 00:10:45.929
kurtosis, and the correlation between different

00:10:45.929 --> 00:10:48.470
variables, researchers can look at the verified

00:10:48.470 --> 00:10:51.509
data they do have and make logically sound deductions

00:10:51.509 --> 00:10:53.710
about the missing pieces. So they use what's

00:10:53.710 --> 00:10:55.929
there to figure out what is missing. Exactly.

00:10:56.169 --> 00:10:57.929
If they know your zip code, your profession,

00:10:58.049 --> 00:11:00.269
and your level of education, and they understand

00:11:00.269 --> 00:11:02.230
the precise statistical shape of the rest of

00:11:02.230 --> 00:11:05.169
the population, they can infer the most mathematically

00:11:05.169 --> 00:11:07.950
probable value for your missing income cell.
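
NOTE
One concrete way to do this, though not the only one, is scikit-learn's
IterativeImputer, which predicts each missing cell from the joint
structure of the other columns. A minimal sketch with toy numbers and
invented column names:
  import numpy as np
  import pandas as pd
  from sklearn.experimental import enable_iterative_imputer  # noqa: F401
  from sklearn.impute import IterativeImputer
  df = pd.DataFrame({
      "age": [34, 29, 41, 52, 38],
      "years_education": [16, 14, 18, 12, 16],
      "income_k": [72, 58, np.nan, 61, 80],  # one respondent skipped this
  })
  # The missing cell is inferred from the other variables, not guessed.
  imputed = IterativeImputer(random_state=0).fit_transform(df)
  print(pd.DataFrame(imputed, columns=df.columns).round(1))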

00:11:08.289 --> 00:11:11.039
Oh, wow. So it's less like guessing a random

00:11:11.039 --> 00:11:13.620
number out of thin air and much more like looking

00:11:13.620 --> 00:11:15.960
at a nearly finished jigsaw puzzle. That's a

00:11:15.960 --> 00:11:17.840
perfect way to look at it. Yeah. I mean, even

00:11:17.840 --> 00:11:21.000
if a piece is missing, by looking at the exact

00:11:21.000 --> 00:11:24.179
contours of the pieces around it, you know precisely

00:11:24.179 --> 00:11:26.460
what shape and color the missing piece has to

00:11:26.460 --> 00:11:28.460
be. That's it. Exactly. The integrity of the

00:11:28.460 --> 00:11:31.259
puzzle remains completely intact because the

00:11:31.259 --> 00:11:34.659
imputed value is statistically locked in by the

00:11:34.659 --> 00:11:37.100
verified reality of the rest of the data. That

00:11:37.100 --> 00:11:40.320
is wild. And the software tools used to execute

00:11:40.320 --> 00:11:44.440
this are incredibly robust. Take SPSS, for example,

00:11:44.559 --> 00:11:47.129
which is one of the most widely used statistical

00:11:47.129 --> 00:11:49.850
software suites in the world. I've heard of SPSS.

00:11:49.970 --> 00:11:52.490
Yeah, it deliberately presents its data in this

00:11:52.490 --> 00:11:56.289
highly rigid classical tabular format to facilitate

00:11:56.289 --> 00:11:59.769
this exact kind of rigorous cleaning, imputation,

00:11:59.929 --> 00:12:02.549
and analysis. It literally forces the researcher

00:12:02.549 --> 00:12:04.690
to confront the structure. Okay, well that actually

00:12:04.690 --> 00:12:06.389
makes me feel a lot better about imputation.

00:12:06.450 --> 00:12:09.549
Good. So we have collected the raw data. We have

00:12:09.549 --> 00:12:12.149
wrestled with the messy reality of missing variables.

00:12:12.570 --> 00:12:15.070
We've used standard deviation and kurtosis to

00:12:14.919 --> 00:12:17.580
perform imputation and statistically heal the

00:12:17.580 --> 00:12:20.120
broken pieces of the puzzle. Right. Once these

00:12:20.120 --> 00:12:22.600
massive collections of information are cleaned,

00:12:22.980 --> 00:12:25.620
completed, and verified, what do they actually

00:12:25.620 --> 00:12:29.080
do? Like, how are they deployed? Well,

00:12:29.340 --> 00:12:32.100
once verified, they transition from being mere

00:12:32.100 --> 00:12:34.779
records of the past to becoming the predictive

00:12:34.779 --> 00:12:37.679
engines of the modern world. They touch literally

00:12:37.679 --> 00:12:41.240
every sector of human endeavor. Everything. Everything.

00:12:41.919 --> 00:12:44.799
In the hard sciences, data sets provide the empirical

00:12:44.799 --> 00:12:47.700
bedrock for everything. We're talking about mapping

00:12:47.700 --> 00:12:50.539
the human genome in biology, modeling climate

00:12:50.539 --> 00:12:52.759
change in environmental science, and tracking

00:12:52.759 --> 00:12:55.200
particle collisions in physics. And in government,

00:12:55.460 --> 00:12:57.679
I assume those open data portals we talked about

00:12:57.679 --> 00:13:01.019
earlier exist for a reason. Oh, definitely. They

00:13:01.019 --> 00:13:03.039
publish this information to promote democratic

00:13:03.039 --> 00:13:05.360
transparency, but honestly, more importantly,

00:13:05.840 --> 00:13:08.460
to facilitate urban and social planning. Right,

00:13:08.600 --> 00:13:10.360
because a city planner doesn't just guess where

00:13:10.360 --> 00:13:13.039
a new subway line should go. Exactly. They use

00:13:13.039 --> 00:13:15.740
massive geographical and transit data sets to

00:13:15.740 --> 00:13:18.000
determine exactly which neighborhoods are underserved

00:13:18.000 --> 00:13:20.559
or which traffic intersections are statistically

00:13:20.559 --> 00:13:22.879
the most dangerous based on collision variables.

00:13:23.019 --> 00:13:25.279
And the corporate sector? The corporate sector

00:13:25.279 --> 00:13:28.120
uses them for deep market analysis, operational

00:13:28.120 --> 00:13:31.779
efficiency, and supply chain logistics. And then

00:13:31.779 --> 00:13:34.440
healthcare networks rely on massive patient data

00:13:34.440 --> 00:13:37.539
sets to conduct clinical research, identify disease

00:13:37.539 --> 00:13:40.279
patterns, and ultimately improve patient survival

00:13:40.279 --> 00:13:42.940
outcomes. Wow. But here's where it gets really

00:13:42.940 --> 00:13:45.740
interesting. Artificial intelligence and machine

00:13:45.740 --> 00:13:48.940
learning. Yes. Because all of those fields, science,

00:13:49.120 --> 00:13:52.179
government, business, they are rapidly being

00:13:52.179 --> 00:13:54.940
taken over by algorithmic decision-making. And

00:13:54.940 --> 00:13:57.440
data sets are the only reason AI exists. They

00:13:57.440 --> 00:13:59.389
are the foundation of it all. They are essential

00:13:59.389 --> 00:14:01.929
for training, validating, and testing algorithms

00:14:01.929 --> 00:14:05.269
for complex tasks like image recognition, natural

00:14:05.269 --> 00:14:07.110
language processing, and predictive modeling.

00:14:07.429 --> 00:14:09.169
If we connect this to the bigger picture, you

00:14:09.169 --> 00:14:11.529
begin to see why data is referred to as the new

00:14:11.529 --> 00:14:14.090
oil. It really is the fuel that powers machine

00:14:14.090 --> 00:14:16.629
cognition. Think of a data set as the ultimate

00:14:16.629 --> 00:14:18.870
rigorous textbook for an artificial intelligence.

00:14:18.970 --> 00:14:21.610
I mean, we tend to anthropomorphize AI, assuming

00:14:21.610 --> 00:14:23.769
it just magically knows how to read a hastily

00:14:23.769 --> 00:14:26.049
scribbled handwritten note. Right. Or how to

00:14:26.049 --> 00:14:28.190
instantly recognize a partially obscured stop

00:14:28.190 --> 00:14:32.059
sign on a dark, rainy road. Exactly. But it doesn't.

00:14:32.159 --> 00:14:34.759
It has to study. It has to ingest millions of

00:14:34.759 --> 00:14:36.759
individual examples from a training data set

00:14:36.759 --> 00:14:40.559
to slowly mathematically deduce the visual patterns

00:14:40.559 --> 00:14:42.779
of what a stop sign actually is. It takes massive

00:14:42.779 --> 00:14:45.059
amounts of repetition. And then, and this is

00:14:45.059 --> 00:14:47.820
the crucial part, it uses a completely separate

00:14:47.820 --> 00:14:50.299
sequestered portion of that data set to pass

00:14:50.299 --> 00:14:52.799
its final exam. Yes, the validation phase. It's

00:14:52.799 --> 00:14:55.720
called validation and testing. The AI has to

00:14:55.720 --> 00:14:58.019
prove it can identify a stop sign it has never

00:14:58.019 --> 00:15:01.480
seen before. And that workflow, you know, training

00:15:01.480 --> 00:15:04.259
on one subset of data and testing on another,

00:15:05.000 --> 00:15:07.519
is the absolute gold standard for proving an

00:15:07.519 --> 00:15:09.620
algorithm actually works in the real world. It

00:15:09.620 --> 00:15:12.059
has to prove it. Right. Whether it's a doctor

00:15:12.059 --> 00:15:14.840
relying on an algorithm to spot microscopic tumors

00:15:14.840 --> 00:15:17.779
on an MRI or a self-driving car navigating a

00:15:17.779 --> 00:15:20.419
busy crosswalk, data sets are the invisible engine

00:15:20.419 --> 00:15:22.639
driving evidence-based decisions. We have

00:15:22.639 --> 00:15:25.360
moved fundamentally past human intuition. We

00:15:25.360 --> 00:15:27.879
are relying on structured empirical evidence.
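
NOTE
A minimal sketch of that train-then-sequestered-test workflow in
scikit-learn, using its small built-in digits set as a stand-in for any
labeled data set:
  from sklearn.datasets import load_digits
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split
  X, y = load_digits(return_X_y=True)  # small scanned-digit images
  # Hold out a sequestered test set the model never studies.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, random_state=0)
  model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
  # The "final exam": accuracy on digits it has never seen before.
  print(model.score(X_test, y_test))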

00:15:28.039 --> 00:15:30.820
It's the ultimate societal shift from "I think"

00:15:31.139 --> 00:15:34.279
to "the data objectively shows." But if data sets

00:15:34.279 --> 00:15:37.179
are the textbooks teaching our artificial intelligence,

00:15:37.659 --> 00:15:39.899
and they are the foundation driving all this

00:15:39.899 --> 00:15:42.460
modern science, there must be some legendary

00:15:42.460 --> 00:15:44.419
textbooks out there, right? Well, there certainly

00:15:44.419 --> 00:15:46.899
are. Foundational texts that the entire industry

00:15:46.899 --> 00:15:49.840
agrees on. Absolutely. The statistical and machine

00:15:49.840 --> 00:15:52.379
learning literature has its own highly respected

00:15:52.379 --> 00:15:55.580
Hall of Fame. There are classic, universally

00:15:55.580 --> 00:15:58.460
cited data sets that built the very foundation

00:15:58.460 --> 00:16:01.059
of modern statistical modeling. Let's talk about

00:16:01.059 --> 00:16:03.580
a few of them because they are fascinating. First

00:16:03.580 --> 00:16:07.259
up, we have the Iris Flower data set. A true

00:16:07.259 --> 00:16:09.580
classic. This was introduced by the legendary

00:16:09.580 --> 00:16:12.179
statistician Ronald Fisher all the way back in

00:16:12.179 --> 00:16:16.019
1936. It is considered a classic multivariate

00:16:16.019 --> 00:16:18.279
data set. Right. And just to clarify for everyone,

00:16:18.720 --> 00:16:20.879
multivariate simply means it involves observing

00:16:20.879 --> 00:16:23.710
and analyzing multiple variables at the exact same

00:16:23.710 --> 00:16:27.330
time to find complex correlations. OK. So Fisher

00:16:27.330 --> 00:16:29.029
wasn't just measuring the length of a petal.

00:16:29.190 --> 00:16:31.090
He was measuring the length and the width of

00:16:31.090 --> 00:16:33.669
both the petals and the sepals across different

00:16:33.669 --> 00:16:35.870
species of iris flowers simultaneously. Wow.
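
NOTE
Fisher's data set ships with scikit-learn, so a couple of lines let you
inspect the four interconnected dimensions directly:
  from sklearn.datasets import load_iris
  iris = load_iris(as_frame=True)
  print(iris.frame.head())         # sepal/petal length and width per flower
  print(list(iris.target_names))   # setosa, versicolor, virginica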

00:16:36.090 --> 00:16:38.269
Yeah, he was proving that you could mathematically

00:16:38.269 --> 00:16:42.350
classify a biological species based on interconnected

00:16:42.350 --> 00:16:47.210
numerical dimensions. A true pioneer. And from

00:16:47.210 --> 00:16:50.250
1936 flowers, we jump to something vastly more

00:16:50.250 --> 00:16:56.129
modern. The MNIST database. This is an incredibly famous collection

00:16:56.129 --> 00:17:00.169
of handwritten digits: tens of thousands of scanned

00:17:00.169 --> 00:17:02.490
images of the numbers zero through nine written

00:17:02.490 --> 00:17:04.809
out by high school students and Census Bureau

00:17:04.809 --> 00:17:07.769
employees. It's huge. It is the ultimate training

00:17:07.769 --> 00:17:10.309
ground for image processing and classification

00:17:10.309 --> 00:17:12.829
algorithms. I mean, if you have ever deposited

00:17:12.829 --> 00:17:14.690
a check by snapping a picture of it with your

00:17:14.690 --> 00:17:17.269
banking app, the optical character recognition

00:17:17.269 --> 00:17:19.910
reading your sloppy handwriting owes its existence

00:17:19.910 --> 00:17:22.849
to the MNIST database. It really does. But, you

00:17:22.849 --> 00:17:24.829
know, this raises an important question, though,

00:17:25.109 --> 00:17:27.109
a vital one for understanding the industry. What's

00:17:27.109 --> 00:17:30.119
that? Why are data scientists today in laboratories

00:17:30.119 --> 00:17:33.059
pushing the absolute bleeding edge of artificial

00:17:33.059 --> 00:17:35.539
intelligence, still using a biological flower

00:17:35.539 --> 00:17:39.460
dataset from 1936, or a database of old pixelated

00:17:39.460 --> 00:17:41.420
handwritten digits from the MNIST collection? That

00:17:41.420 --> 00:17:43.980
is a great point. Right. I mean, we are drowning

00:17:43.980 --> 00:17:46.980
in new information. Why not just generate new,

00:17:47.180 --> 00:17:50.400
hyper-advanced data? Yeah. If the European portal

00:17:50.400 --> 00:17:53.059
alone has a million fresh datasets, why keep

00:17:53.059 --> 00:17:56.349
dusting off the old stuff? Because these classics

00:17:56.349 --> 00:17:59.630
serve as universal benchmarks. It is entirely

00:17:59.630 --> 00:18:02.190
about standardization and peer review. OK, explain

00:18:02.190 --> 00:18:05.390
that. Well, if you build a brand new revolutionary

00:18:05.390 --> 00:18:08.049
image classification algorithm tomorrow, you

00:18:08.049 --> 00:18:10.509
could easily test it on 100,000 photos you took

00:18:10.509 --> 00:18:14.089
yourself. Sure. But if you publish a paper claiming

00:18:14.089 --> 00:18:17.390
your algorithm is 99% accurate on your own personal

00:18:17.390 --> 00:18:20.150
photos, the scientific community will ignore

00:18:20.150 --> 00:18:23.029
you. It means absolutely nothing to them. Why?

00:18:23.289 --> 00:18:25.309
Because they have no empirical frame of reference.

00:18:25.769 --> 00:18:27.230
Oh, because they don't know if your personal

00:18:27.230 --> 00:18:29.130
photos were just incredibly easy for a computer

00:18:29.130 --> 00:18:31.210
to classify. Exactly. You might have just taken

00:18:31.210 --> 00:18:33.509
pictures of perfectly lit stop signs. Exactly,

00:18:33.589 --> 00:18:36.289
you haven't proven anything. But if you run your

00:18:36.289 --> 00:18:38.910
new algorithm against the MNIST handwritten digits,

00:18:39.549 --> 00:18:42.410
you are testing it on the exact same brutal exam

00:18:42.410 --> 00:18:44.869
that thousands of other algorithms have taken

00:18:44.869 --> 00:18:48.000
over the last few decades. Uh, I see. It allows

00:18:48.000 --> 00:18:51.440
you to directly mathematically compare your new

00:18:51.440 --> 00:18:54.680
model's performance against the entire historical

00:18:54.680 --> 00:18:57.880
lineage of previous models. It is the only way

00:18:57.880 --> 00:19:01.559
to objectively prove that your new AI is actually

00:19:01.559 --> 00:19:04.119
an evolutionary step forward. It is the standardized

00:19:04.119 --> 00:19:06.579
test of the machine learning world. Without the

00:19:06.579 --> 00:19:08.779
benchmark, you can't measure progress. I love

00:19:08.779 --> 00:19:11.250
that. It's essential. Now, before we wrap up,

00:19:11.369 --> 00:19:13.809
I want to jump in and highlight one more classic

00:19:13.809 --> 00:19:16.430
from the historical Hall of Fame, because the

00:19:16.430 --> 00:19:19.009
mechanics of this one absolutely blew my mind.

00:19:19.450 --> 00:19:22.910
Anscombe's Quartet. Yes, Anscombe's Quartet is

00:19:22.910 --> 00:19:25.710
a masterpiece of statistical literature, a very

00:19:25.710 --> 00:19:28.220
humbling one. It is incredible. Anscombe's Quartet

00:19:28.220 --> 00:19:31.119
is a small collection of data created in 1973

00:19:31.119 --> 00:19:34.079
by a statistician named Francis Anscombe. It

00:19:34.079 --> 00:19:36.079
consists of four distinct data sets, and let

00:19:36.079 --> 00:19:37.599
me explain the setup here because it is crucial.

00:19:37.619 --> 00:19:39.740
Oh, please do. If you just look at the raw numbers

00:19:39.740 --> 00:19:41.599
of these four data sets in a spreadsheet, and

00:19:41.599 --> 00:19:44.180
if you run the standard summary statistics,

00:19:44.299 --> 00:19:46.420
if you calculate the mean, the variance, the

00:19:46.420 --> 00:19:48.119
correlation between the variables, and the linear

00:19:48.119 --> 00:19:50.440
regression line, all the math tells you these

00:19:50.440 --> 00:19:52.829
four data sets are identical. They look exactly

00:19:52.829 --> 00:19:56.049
the same on paper. Exactly. The summary statistics

00:19:56.049 --> 00:19:58.529
map perfectly down to multiple decimal points.

00:19:58.849 --> 00:20:01.130
The pure math says these four populations are

00:20:01.130 --> 00:20:03.769
the exact same. The summary metrics are completely

00:20:03.769 --> 00:20:05.670
indistinguishable. The moment you take those

00:20:05.670 --> 00:20:08.029
four data sets and actually map them out visually,

00:20:08.029 --> 00:20:10.930
when you plot the coordinates on a scatter

00:20:10.930 --> 00:20:13.809
graph, you see that they are completely wildly

00:20:13.809 --> 00:20:15.769
different realities. It's such a shock the first

00:20:15.769 --> 00:20:19.049
time you see it. It really is. Graph one is a

00:20:19.049 --> 00:20:21.470
chaotic but standard scatter of dots showing

00:20:21.470 --> 00:20:24.750
a general upward trend. Graph two is a perfect,

00:20:24.990 --> 00:20:28.410
smooth, mathematically precise curve. Graph three

00:20:28.410 --> 00:20:31.009
is a fiercely tight, straight line of dots with

00:20:31.009 --> 00:20:33.970
one massive, insane outlier skewing the math.

00:20:34.130 --> 00:20:36.549
Right. And graph four is a single vertical line

00:20:36.549 --> 00:20:39.309
of data points with one outlier way off to the

00:20:39.309 --> 00:20:42.410
side. Anscombe's Quartet proves definitively

00:20:42.410 --> 00:20:44.569
that the raw summary numbers on a spreadsheet

00:20:44.569 --> 00:20:47.069
can completely lie to you until you physically

00:20:47.069 --> 00:20:49.670
map them out. It fundamentally altered how we

00:20:49.670 --> 00:20:52.150
approach data. You cannot blindly trust summary

00:20:52.150 --> 00:20:54.569
math like averages and standard deviations in

00:20:54.569 --> 00:20:57.269
a vacuum. You have to visualize the geometry

00:20:57.269 --> 00:20:59.210
of the data. You have to see the shape. It's

00:20:59.210 --> 00:21:02.309
like looking at a massive data set of human beings,

00:21:02.869 --> 00:21:05.069
seeing that two specific people have the exact

00:21:05.069 --> 00:21:08.130
same height, weight, and annual income, and assuming

00:21:08.130 --> 00:21:10.569
their lives are identical. Yeah, which we know

00:21:10.569 --> 00:21:13.849
isn't true. Right, because when you graph their

00:21:13.849 --> 00:21:16.670
daily habits out, you realize one is a professional

00:21:16.670 --> 00:21:19.049
athlete who trains all day, and the other is

00:21:19.049 --> 00:21:20.849
a corporate accountant who just happens to hit

00:21:20.849 --> 00:21:24.630
the gym really hard on weekends. The visual context

00:21:24.630 --> 00:21:27.589
changes the entire interpretation of reality.
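
NOTE
The quartet ships with seaborn as a downloadable example dataset, so a
short sketch can reproduce both the matching summary math and the four
very different pictures:
  import seaborn as sns
  import matplotlib.pyplot as plt
  df = sns.load_dataset("anscombe")  # columns: dataset, x, y
  # Nearly identical summary math across all four data sets...
  print(df.groupby("dataset")[["x", "y"]].agg(["mean", "std"]).round(2))
  # ...but four completely different shapes once you actually look.
  sns.lmplot(data=df, x="x", y="y", col="dataset")
  plt.show()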

00:21:27.890 --> 00:21:30.089
And that brings us full circle back to the central

00:21:30.089 --> 00:21:32.710
challenge of the data set, really, which is the

00:21:32.710 --> 00:21:35.829
complexity of accurately capturing the chaotic

00:21:35.829 --> 00:21:38.630
reality of the real world. It really does. So

00:21:38.630 --> 00:21:40.509
let's zoom out and review the map of what we've

00:21:40.509 --> 00:21:42.970
covered today. Sounds good. We started this deep

00:21:42.970 --> 00:21:45.490
dive looking at a deceptively simple definition,

00:21:46.009 --> 00:21:48.690
a collection of data. We explored the tabular

00:21:48.690 --> 00:21:51.710
anatomy, the rows representing individual observations

00:21:51.710 --> 00:21:53.970
like digital pixels, and the columns capturing

00:21:53.970 --> 00:21:56.829
standardized variables. And we learned how translating

00:21:56.829 --> 00:22:00.009
messy, nominal human concepts into mathematical

00:22:00.009 --> 00:22:03.430
frameworks is the ultimate challenge. Yes, we

00:22:03.430 --> 00:22:06.390
saw how statisticians wrestle with the imperfect

00:22:06.390 --> 00:22:10.069
reality of missing information using deep structural

00:22:10.069 --> 00:22:13.009
properties like standard deviation and kurtosis

00:22:13.009 --> 00:22:16.710
to probabilistically heal broken data sets through

00:22:16.710 --> 00:22:20.730
imputation. We also saw how those refined, mathematically

00:22:20.730 --> 00:22:23.430
verified collections of information transition

00:22:23.430 --> 00:22:26.170
into the engines of the modern world. They are

00:22:26.170 --> 00:22:28.910
the underlying fuel driving scientific discovery,

00:22:29.390 --> 00:22:31.930
precision urban planning, and serving as the foundational

00:22:31.950 --> 00:22:34.089
textbooks for the artificial intelligence models

00:22:34.089 --> 00:22:36.609
that are actually shaping our future. And we

00:22:36.609 --> 00:22:38.230
walked through the Statistical Hall of Fame.

00:22:38.450 --> 00:22:40.730
We discovered why multivariate flower measurements

00:22:40.730 --> 00:22:43.609
from 1936 and old handwritten digits are still

00:22:43.609 --> 00:22:46.170
utilized as the universal benchmarks to test

00:22:46.170 --> 00:22:48.390
the most advanced neural networks on the planet.

00:22:48.450 --> 00:22:50.809
Which is still so cool to me. It's amazing. And

00:22:50.809 --> 00:22:53.049
thanks to the mind -bending reality of Anscombe's

00:22:53.049 --> 00:22:55.869
Quartet, we learned that we always, always need

00:22:55.869 --> 00:22:58.430
to visualize the data because summary math can

00:22:58.430 --> 00:23:00.950
hide wildly different truths. So what does this

00:23:00.950 --> 00:23:03.680
all mean? Well, it means that the invisible architecture

00:23:03.680 --> 00:23:06.680
of our modern society is built entirely on these

00:23:06.680 --> 00:23:10.359
rows and columns. But I would like to leave you

00:23:10.359 --> 00:23:12.599
with a more personal thought to mull over. Oh,

00:23:12.619 --> 00:23:14.579
I like the sound of that. Throughout this entire

00:23:14.579 --> 00:23:17.740
deep dive, as you have been listening, you have

00:23:17.740 --> 00:23:21.640
been actively generating your own personal invisible

00:23:21.640 --> 00:23:24.099
data set. Every link you click, every change

00:23:24.099 --> 00:23:26.940
in your GPS location, every transaction you make,

00:23:27.240 --> 00:23:29.059
the exact amount of time you spend listening

00:23:29.059 --> 00:23:31.380
to audio. It's all being tracked. It is all being

00:23:31.380 --> 00:23:34.440
recorded, standardized, and slotted into a massive

00:23:34.440 --> 00:23:37.660
tabular structure somewhere. You are a continuously

00:23:37.660 --> 00:23:40.900
updating row on thousands of different spreadsheets.

00:23:42.000 --> 00:23:43.859
So if someone were to take your personal life

00:23:43.859 --> 00:23:46.039
data set, all your unique habits, your movements,

00:23:46.140 --> 00:23:48.559
your choices, and graph it out like Anscombe's

00:23:48.559 --> 00:23:51.380
Quartet, what hidden statistical fallacies or

00:23:51.380 --> 00:23:54.450
surprising truths would it reveal about you? That's

00:23:54.450 --> 00:23:57.069
a deep question. Would the summary math of your

00:23:57.069 --> 00:23:59.849
life, your average income, your demographic bracket,

00:24:00.289 --> 00:24:02.890
would that tell the real story or would the visual

00:24:02.890 --> 00:24:05.849
graph of your daily actions show a shape that

00:24:05.849 --> 00:24:07.670
is entirely different from what the algorithms

00:24:07.670 --> 00:24:11.009
assume? That is a phenomenal and frankly slightly

00:24:11.009 --> 00:24:13.750
terrifying question to end on. Thank you so much

00:24:13.750 --> 00:24:15.769
for joining us on this deep dive into the hidden

00:24:15.769 --> 00:24:18.430
world of datasets. We hope it changed the way

00:24:18.430 --> 00:24:20.910
you see the digital architecture constantly operating

00:24:20.910 --> 00:24:22.970
around you. Keep questioning the numbers you

00:24:22.970 --> 00:24:25.269
see. Remember to look at the whole picture and

00:24:25.269 --> 00:24:26.309
always stay curious.
