WEBVTT

00:00:00.000 --> 00:00:03.319
So imagine a massive research hospital, right?

00:00:03.740 --> 00:00:06.219
And they want to study this sprawling database

00:00:06.219 --> 00:00:09.019
of patient records to find a cure for some rare

00:00:09.019 --> 00:00:12.900
disease. They want to know basically everything,

00:00:13.160 --> 00:00:17.899
your symptoms, your daily habits, your location

00:00:17.899 --> 00:00:20.379
data, genetic markers, all of it. Yeah, the whole

00:00:20.379 --> 00:00:23.170
picture. Exactly. And, you know, you want them

00:00:23.170 --> 00:00:26.190
to find that cure, but how exactly do they crunch

00:00:26.190 --> 00:00:28.710
all that highly sensitive data without just,

00:00:28.710 --> 00:00:31.890
I mean, exposing your deepest, most personal

00:00:31.890 --> 00:00:34.030
secrets to the entire world? Yeah, it's kind

00:00:34.030 --> 00:00:36.229
of the ultimate modern paradox, isn't it? It

00:00:36.229 --> 00:00:38.689
really is. Because as a society, we desperately

00:00:38.689 --> 00:00:41.390
need this large-scale data, you know, to improve

00:00:41.390 --> 00:00:43.979
health care or city planning or tech. Yeah. But

00:00:43.979 --> 00:00:46.939
as individuals, we need absolute privacy to just

00:00:46.939 --> 00:00:49.280
function safely. Right, exactly. And so to figure

00:00:49.280 --> 00:00:51.380
out how we actually solve that paradox, we are

00:00:51.380 --> 00:00:54.570
pulling from a single, incredibly comprehensive

00:00:54.570 --> 00:00:57.009
source today. We're doing a deep dive into the

00:00:57.009 --> 00:00:59.229
Wikipedia article on differential privacy. Yeah,

00:00:59.229 --> 00:01:02.090
it's a fascinating topic. It is. And our mission

00:01:02.090 --> 00:01:04.390
for this deep dive is to sort of unpack this

00:01:04.390 --> 00:01:08.150
mathematically rigorous framework, understand

00:01:08.150 --> 00:01:10.870
how it lets researchers safely share general

00:01:10.870 --> 00:01:14.709
data and, spoiler alert, discover why the math

00:01:14.709 --> 00:01:17.469
behind it is absolute genius. But the actual

00:01:17.469 --> 00:01:20.069
physical computers running that math might still

00:01:20.069 --> 00:01:22.250
totally betray you. Yeah, that last part. is

00:01:22.250 --> 00:01:25.989
always a wild realization for people. But to

00:01:25.989 --> 00:01:28.329
set the stage here, we really need to define

00:01:28.329 --> 00:01:30.689
what we're actually looking at. Because differential

00:01:30.689 --> 00:01:33.670
privacy, or DP, it's not a software program you

00:01:33.670 --> 00:01:36.129
just download. OK. And it's not an encryption

00:01:36.129 --> 00:01:38.840
key either. It is. Well, it's a mathematical

00:01:38.840 --> 00:01:40.859
constraint that's applied to algorithms. Oh,

00:01:40.879 --> 00:01:42.500
interesting. So it's more of a set of rules.

00:01:42.700 --> 00:01:45.439
Exactly. At its core, it's a framework that allows

00:01:45.439 --> 00:01:47.540
someone holding a database to share aggregate,

00:01:47.739 --> 00:01:49.579
high-level patterns about a group of people,

00:01:50.019 --> 00:01:51.500
while strictly limiting the information that

00:01:51.500 --> 00:01:53.659
gets leaked about any specific individual in

00:01:53.659 --> 00:01:56.019
that group. Right. And it achieves this by basically

00:01:56.019 --> 00:01:58.879
injecting carefully calibrated noise into the

00:01:58.879 --> 00:02:01.849
statistical computations. OK, let's unpack this.

00:02:02.370 --> 00:02:06.109
Because before we can really appreciate the genius

00:02:06.109 --> 00:02:09.509
of injecting calibrated noise, we have to look

00:02:09.509 --> 00:02:12.229
at why this framework was even invented in the

00:02:12.229 --> 00:02:14.729
first place. Oh, absolutely. We have to understand

00:02:14.729 --> 00:02:18.449
why traditional anonymization completely, spectacularly

00:02:18.449 --> 00:02:21.349
fails. Because intuitively, you'd think, ah,

00:02:21.469 --> 00:02:23.569
hey, just delete my name and social security

00:02:23.569 --> 00:02:25.669
number from the spreadsheet, and I'm safe. Right.

00:02:25.669 --> 00:02:27.550
I mean, for a long time, the scientific and data

00:02:27.550 --> 00:02:29.770
communities actually thought that kind of basic

00:02:29.680 --> 00:02:32.740
anonymity was enough. They really did. They did.

00:02:33.099 --> 00:02:37.539
But if we go back to 1979, researchers Dorothy

00:02:37.539 --> 00:02:40.400
Denning, Peter J. Denning, and Mayer D. Schwartz

00:02:40.400 --> 00:02:42.800
introduced this concept they called the tracker.

00:02:43.020 --> 00:02:45.780
The tracker sounds very sci-fi. Yeah, a bit.

00:02:46.199 --> 00:02:48.500
This was basically a theoretical adversary who

00:02:48.500 --> 00:02:51.219
could learn confidential information from a statistical

00:02:51.219 --> 00:02:53.960
database just by cross-referencing a series

00:02:53.960 --> 00:02:56.900
of highly targeted queries. Oh, wow. Yeah. They

00:02:56.900 --> 00:02:59.219
realized early on that every single time you

00:02:59.219 --> 00:03:01.379
ask a database a question, even a totally general

00:03:01.379 --> 00:03:03.919
statistical question, it leaks a tiny bit of

00:03:03.919 --> 00:03:06.400
privacy. It's like playing the game Guess Who?

00:03:06.590 --> 00:03:08.590
Oh, yeah, exactly. You aren't asking for the

00:03:08.590 --> 00:03:11.270
person's name, but if you ask enough broad questions

00:03:11.270 --> 00:03:13.810
about glasses and hair color, eventually there

00:03:13.810 --> 00:03:16.090
is literally only one face left standing on the

00:03:16.090 --> 00:03:19.409
board. Precisely. And this creeping realization

00:03:19.409 --> 00:03:21.969
kind of culminated in a massive breakthrough

00:03:21.969 --> 00:03:26.370
in 2003 by researchers Irit Dinur and Kobbi

00:03:26.370 --> 00:03:29.430
Nissim. OK. They proved what is now known as

00:03:29.430 --> 00:03:31.830
the fundamental law of information recovery.

00:03:32.189 --> 00:03:35.770
The fundamental law. Yes. They showed, with rigorous

00:03:35.770 --> 00:03:38.770
mathematical proof, that it is strictly impossible

00:03:38.770 --> 00:03:41.590
to publish arbitrary queries on a private statistical

00:03:41.590 --> 00:03:44.689
database without revealing private information.

00:03:44.889 --> 00:03:47.189
Wait, impossible? Provably impossible. In fact,

00:03:47.289 --> 00:03:49.250
they proved that a surprisingly small number

00:03:49.250 --> 00:03:52.310
of completely random queries can just reveal

00:03:52.310 --> 00:03:55.169
the entire contents of a database. So if an attacker

00:03:55.169 --> 00:03:58.620
asks enough seemingly innocent questions. The whole

00:03:58.620 --> 00:04:00.819
vault just swings open. There's no way to stop

00:04:00.819 --> 00:04:03.419
it. None. The source material actually gives

00:04:03.419 --> 00:04:05.919
a fantastic, really simple example of how an

00:04:05.919 --> 00:04:08.240
attacker does this without ever explicitly asking

00:04:08.240 --> 00:04:10.539
for someone's specific data. I'd love to hear

00:04:10.539 --> 00:04:13.180
it. So imagine a medical database with just six

00:04:13.180 --> 00:04:17.600
people in it. Let's say Ross, Monica, Joey, Phoebe,

00:04:17.759 --> 00:04:20.069
Chandler, and Rachel. All right, I feel like

00:04:20.069 --> 00:04:22.170
we're trapped in a 90s sitcom right now, but

00:04:22.170 --> 00:04:24.870
go on. Yeah, I know. So let's say we have a column

00:04:24.870 --> 00:04:27.269
tracking whether these individuals have diabetes.

00:04:27.850 --> 00:04:30.750
A 1 means they have it, a 0 means they don't. Got

00:04:30.750 --> 00:04:34.089
it. Now imagine our attacker wants to know Chandler's

00:04:34.089 --> 00:04:37.370
medical status. Chandler is in row five of this

00:04:37.370 --> 00:04:39.860
database. Now, the attacker isn't allowed to

00:04:39.860 --> 00:04:42.360
just ask the database, does Chandler have diabetes?

00:04:42.540 --> 00:04:44.480
Right. The system would flag that as a privacy

00:04:44.480 --> 00:04:47.139
violation and just block it. Exactly. But the

00:04:47.139 --> 00:04:48.939
attacker is allowed to ask for partial sums.

00:04:49.199 --> 00:04:51.180
Right. Broad aggregate data, which is the kind

00:04:51.180 --> 00:04:53.779
of stuff researchers or policymakers ask for

00:04:53.779 --> 00:04:56.339
all the time to understand public health. Exactly.

00:04:56.699 --> 00:04:59.259
So the attacker asks the database, what is the

00:04:59.259 --> 00:05:01.259
sum of the diabetes column for the first five

00:05:01.259 --> 00:05:04.220
rows? The database does the math and answers

00:05:04.220 --> 00:05:07.689
three. Okay. Then the attacker asks a slightly

00:05:07.689 --> 00:05:10.610
different question. What is the sum of the diabetes

00:05:10.610 --> 00:05:13.689
column for the first four rows? The database

00:05:13.689 --> 00:05:16.810
answers two. Oh wow, I see it. Yeah, the difference

00:05:16.810 --> 00:05:19.750
between those two answers is exactly one. Meaning

00:05:19.750 --> 00:05:22.660
row five, which we know is Chandler, must be a

00:05:22.660 --> 00:05:26.000
one. He has diabetes. Yes. The attacker just

00:05:26.000 --> 00:05:28.079
proved Chandler has a medical condition without

00:05:28.079 --> 00:05:31.240
ever explicitly asking about him. That's crazy.

00:05:31.480 --> 00:05:34.220
Because if Chandler's status had been zero, the

00:05:34.220 --> 00:05:36.000
two queries would have returned the exact same

00:05:36.000 --> 00:05:38.560
number and the difference would be zero. Right.

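NOTE
A minimal sketch of this differencing attack in Python; the toy table
mirrors the example above and is purely illustrative:
    # Toy medical table: 1 = has diabetes, 0 = does not.
    rows = {"Ross": 1, "Monica": 0, "Joey": 1, "Phoebe": 0,
            "Chandler": 1, "Rachel": 0}
    values = list(rows.values())
    # Only broad aggregate queries (partial sums) are allowed.
    sum_first_five = sum(values[:5])   # database answers 3
    sum_first_four = sum(values[:4])   # database answers 2
    # Subtracting the two aggregates isolates row five: Chandler.
    chandler_bit = sum_first_five - sum_first_four
    print(chandler_bit)  # 1, so Chandler has the condition
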
00:05:39.160 --> 00:05:42.420
This basically proves the fundamental law. Exact

00:05:42.420 --> 00:05:46.100
answers to data queries inevitably, unavoidably

00:05:46.100 --> 00:05:48.620
leak individual data. Okay, let me see if I'm

00:05:48.620 --> 00:05:50.540
getting this. It's like trying to figure out

00:05:50.540 --> 00:05:52.899
if your friend is at a crowded party. Okay. You

00:05:52.899 --> 00:05:54.839
don't ask the host if your friend is there because

00:05:54.839 --> 00:05:57.399
maybe the host is sworn to secrecy. Instead,

00:05:57.439 --> 00:05:59.740
you ask the host exactly how many slices of pizza

00:05:59.740 --> 00:06:02.319
were eaten. Oh, I like this. Right. And because

00:06:02.319 --> 00:06:04.220
you happen to know exactly how much everyone

00:06:04.220 --> 00:06:07.319
else at the party eats, the missing slice gives

00:06:07.319 --> 00:06:10.120
your friend away. The math is just inescapable.

00:06:10.399 --> 00:06:13.279
That is a perfect way to visualize it. The surrounding

00:06:13.279 --> 00:06:17.240
data isolates the target. And what's fascinating

00:06:17.240 --> 00:06:20.399
here is that once Dinur and Nissim established

00:06:20.399 --> 00:06:23.180
this fundamental law, the scientific community

00:06:23.180 --> 00:06:26.220
had a massive reckoning. I bet. Because if exact

00:06:26.220 --> 00:06:29.170
answers always leak data... Well, the only possible

00:06:29.170 --> 00:06:31.670
solution is to stop giving exact answers. Wait,

00:06:31.689 --> 00:06:33.930
if exact answers are fundamentally broken, how

00:06:33.930 --> 00:06:36.129
on earth did researchers respond? Did they just,

00:06:36.129 --> 00:06:38.949
like, stop querying databases altogether? No,

00:06:38.949 --> 00:06:41.149
they had to invent an entirely new way to answer

00:06:41.149 --> 00:06:44.250
questions. And that leads us to a massive breakthrough

00:06:44.250 --> 00:06:47.670
in 2006. Okay, what happened in 2006? A paper

00:06:47.670 --> 00:06:50.410
by Cynthia Dwork, Frank McSherry, Kobbi Nissim,

00:06:50.410 --> 00:06:53.209
and Adam D. Smith introduced the notion of

00:06:53.209 --> 00:06:55.910
epsilon differential privacy. Epsilon differential

00:06:55.910 --> 00:06:58.519
privacy. Yeah. This work was so revolutionary

00:06:58.519 --> 00:07:01.139
it later won them the Gödel Prize in theoretical

00:07:01.139 --> 00:07:03.120
computer science. Okay, epsilon differential

00:07:03.120 --> 00:07:04.879
privacy. I mean, that sounds super intimidating.

00:07:04.980 --> 00:07:06.920
What does it actually mean for, you know, you

00:07:06.920 --> 00:07:09.160
and me? So they created a strict mathematical

00:07:09.160 --> 00:07:11.779
definition for privacy loss. What the source

00:07:11.779 --> 00:07:15.160
calls pure differential privacy guarantees that

00:07:15.160 --> 00:07:17.160
changing a single entry in a database, meaning

00:07:17.160 --> 00:07:19.399
adding your data, removing your data, or changing

00:07:19.399 --> 00:07:22.920
your data, only creates a tiny mathematically

00:07:22.920 --> 00:07:25.480
constrained change in the probability distribution

00:07:25.480 --> 00:07:28.319
of the output. Okay, translate that into plain

00:07:28.319 --> 00:07:30.660
English for us. Like, what is the core intuition

00:07:30.660 --> 00:07:33.800
we need to grasp here? The intuition is brilliant

00:07:33.800 --> 00:07:36.819
in its simplicity, really. Yeah. It essentially

00:07:36.819 --> 00:07:39.740
promises you this. You are given roughly the

00:07:39.740 --> 00:07:41.819
exact same amount of privacy whether your personal

00:07:41.819 --> 00:07:44.220
data is included in the database or whether you

00:07:44.220 --> 00:07:46.139
stayed home and didn't participate at all. Really?

00:07:46.259 --> 00:07:48.420
Yeah. These statistical functions run on the

00:07:48.420 --> 00:07:50.959
database won't be substantially affected by the

00:07:50.959 --> 00:07:53.279
addition, removal, or change of any one individual.

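NOTE
The formal guarantee being paraphrased here: a randomized mechanism M is
epsilon-differentially private if, for every pair of databases D1 and D2
differing in a single person's record, and every set S of possible outputs,
    Pr[M(D1) ∈ S] ≤ e^ε · Pr[M(D2) ∈ S]
A small epsilon means the output distribution barely shifts whether or not
any one person's row is present.
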
00:07:53.519 --> 00:07:56.139
That is profound. You're basically telling citizens that

00:07:56.139 --> 00:07:59.399
participating in this medical study or this census

00:07:59.399 --> 00:08:02.480
carries almost zero additional risk to your privacy

00:08:02.480 --> 00:08:05.240
compared to just not participating. Yeah. But

00:08:05.240 --> 00:08:08.300
wait, I'm stuck. If you're injecting random noise

00:08:08.300 --> 00:08:11.019
into my medical data to blur the results, aren't

00:08:11.019 --> 00:08:13.980
you just ruining the study? Like, how can a researcher

00:08:13.980 --> 00:08:17.420
use a database full of lies? That is the million

00:08:17.420 --> 00:08:20.529
dollar question. And it's exactly why the 2006

00:08:20.529 --> 00:08:23.649
paper was titled Calibrating Noise to Sensitivity.

00:08:24.329 --> 00:08:26.970
Calibrating noise? Right. The noise they inject

00:08:26.970 --> 00:08:29.709
isn't just some random destructive fog. It is

00:08:29.709 --> 00:08:32.850
highly calibrated based on a concept called sensitivity.

00:08:33.470 --> 00:08:35.409
It depends entirely on how many people's data

00:08:35.409 --> 00:08:38.009
are involved in a specific query. Okay, walk

00:08:38.009 --> 00:08:40.389
us through how sensitivity works in practice.

00:08:40.669 --> 00:08:43.190
Think about it like this. If a database only

00:08:43.190 --> 00:08:46.159
contains data from one single person, that person

00:08:46.159 --> 00:08:48.679
contributes 100% to the result of any query.

00:08:49.200 --> 00:08:51.039
The sensitivity is extremely high. Makes sense.

00:08:51.220 --> 00:08:53.820
But if the database has 100 people, each person

00:08:53.820 --> 00:08:57.000
only contributes 1%. Right. The key insight of

00:08:57.000 --> 00:08:59.899
differential privacy is that as you query fewer

00:08:59.899 --> 00:09:02.379
and fewer people, you have to add more and more

00:09:02.379 --> 00:09:04.460
noise to the result to maintain the same level

00:09:04.460 --> 00:09:07.299
of privacy. Fewer people equals higher sensitivity,

00:09:07.620 --> 00:09:09.860
which demands more noise to protect them.

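NOTE
To make sensitivity concrete: it is the most any one person can change a
query's answer. A count like "how many of these rows are 1" moves by at
most 1 when one row changes, so its sensitivity is 1; an average over n
people moves by at most (max value)/n, so sensitivity, and with it the
required noise, shrinks as the crowd grows.
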
00:09:09.860 --> 00:09:12.120
Here's where it gets really interesting, because establishing

00:09:12.120 --> 00:09:15.159
that we must add noise to protect sensitivity

00:09:15.159 --> 00:09:19.019
is one thing. But exactly how do you inject that

00:09:19.019 --> 00:09:21.039
noise without destroying the underlying truth

00:09:21.039 --> 00:09:24.039
of the data? Like, how do we lie for the greater

00:09:24.039 --> 00:09:27.259
good? So the source outlines a few specific mechanisms

00:09:27.259 --> 00:09:29.820
for this. Let's start with a classic one developed

00:09:29.820 --> 00:09:32.279
originally in the social sciences. It's called

00:09:32.279 --> 00:09:35.059
randomized response. Randomized response. Yeah,

00:09:35.059 --> 00:09:37.360
think of it as the coin toss method. Okay. I

00:09:37.360 --> 00:09:39.039
want to try to work through this one. How does

00:09:39.039 --> 00:09:42.600
the coin toss protect me? Well, imagine you are

00:09:42.600 --> 00:09:45.299
conducting a survey and you need to ask people

00:09:45.299 --> 00:09:48.240
about a highly sensitive or even like an illegal

00:09:48.240 --> 00:09:50.039
behavior. Right. Obviously, people are going

00:09:50.039 --> 00:09:52.019
to lie if you just ask them directly on a form.

00:09:52.200 --> 00:09:54.740
Exactly. So you give every participant a coin

00:09:54.740 --> 00:09:57.779
and these exact instructions. You tell them toss

00:09:57.779 --> 00:10:00.580
the coin. If it lands on heads, toss it again,

00:10:01.059 --> 00:10:02.980
but ignore the result and just answer the question

00:10:02.980 --> 00:10:05.259
honestly. Okay. If your first toss is tails,

00:10:05.259 --> 00:10:08.320
toss it again. But this time, answer yes if it's

00:10:08.320 --> 00:10:11.100
heads and no if it's tails. OK, wait. So if I

00:10:11.100 --> 00:10:13.840
answer yes on the survey, the researcher looking

00:10:13.840 --> 00:10:17.179
at the paper has absolutely no idea if I'm confessing

00:10:17.179 --> 00:10:20.320
to the illegal behavior or if I just flip tails

00:10:20.320 --> 00:10:24.419
and then heads. I have complete plausible deniability.

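NOTE
A runnable sketch of the coin-toss mechanism and the decoding step the
hosts walk through next; the population size and true rate are made up:
    import random
    def randomized_response(truth: bool) -> bool:
        if random.random() < 0.5:      # first toss heads:
            return truth               # answer honestly
        return random.random() < 0.5   # tails: second toss answers
    n, true_rate = 100_000, 0.30
    answers = [randomized_response(random.random() < true_rate)
               for _ in range(n)]
    p_yes = sum(answers) / n
    # P(yes) = 1/4 * (1 - p) + 3/4 * p = 1/4 + p/2, so invert:
    print(2 * (p_yes - 0.25))  # close to 0.30, yet no single answer is trustworthy
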
00:10:24.659 --> 00:10:27.720
Exactly. Every single person who answers has

00:10:27.720 --> 00:10:31.409
an ironclad mathematical alibi. But if we connect

00:10:31.409 --> 00:10:34.070
this to the bigger picture, the researcher can

00:10:34.070 --> 00:10:36.289
still find the aggregate truth of the crowd.

00:10:36.610 --> 00:10:38.730
How does that work? Let's break down the math.

00:10:38.929 --> 00:10:40.750
Out of the whole group, we know statistically

00:10:40.750 --> 00:10:43.029
that a quarter of all responses will be a forced

00:10:43.029 --> 00:10:46.269
yes from the coin tosses and a quarter will be

00:10:46.269 --> 00:10:48.690
a forced no. Right, because of the probabilities.

00:10:48.909 --> 00:10:51.190
Right. The remaining half of the responses are

00:10:51.190 --> 00:10:54.450
the true honest answers. So the expected positive

00:10:54.450 --> 00:10:57.350
responses are one quarter times one minus the

00:10:57.350 --> 00:10:59.600
true proportion, plus three quarters times the

00:10:59.600 --> 00:11:01.659
true proportion. Oh, I see. Because the researcher

00:11:01.659 --> 00:11:03.779
knows the exact probability of the coin tosses, so

00:11:03.779 --> 00:11:06.159
they know that 25% of the yes answers are

00:11:06.159 --> 00:11:08.480
fake. They can just use basic algebra to work

00:11:08.480 --> 00:11:11.039
backwards. Yes, exactly. They just subtract the

00:11:11.039 --> 00:11:13.460
fake noise to figure out the true proportion

00:11:13.460 --> 00:11:16.320
of the group that engages in the behavior, even

00:11:16.320 --> 00:11:18.940
though they still have absolutely no idea which

00:11:18.940 --> 00:11:21.960
specific individual said yes, honestly. It's

00:11:21.960 --> 00:11:24.059
beautiful, isn't it? Now, that's for surveys

00:11:24.059 --> 00:11:27.529
where humans actually flip coins. But for

00:11:27.529 --> 00:11:30.789
non-survey data, like algorithms processing billions

00:11:30.789 --> 00:11:33.389
of database records instantly, they use something

00:11:33.389 --> 00:11:36.230
called the Laplace mechanism. The Laplace mechanism.

00:11:36.409 --> 00:11:39.070
How does an algorithm flip a coin? It uses a

00:11:39.070 --> 00:11:41.529
mathematical formula. It adds noise drawn from

00:11:41.529 --> 00:11:44.269
a specific probability curve called the Laplace

00:11:44.269 --> 00:11:46.690
distribution, which is like a sharply peaked bell curve

00:11:46.690 --> 00:11:49.450
centered on zero. The amount of privacy protection,

00:11:49.610 --> 00:11:51.370
which mathematicians represent with the Greek

00:11:51.370 --> 00:11:54.350
letter epsilon, is determined by taking the sensitivity

00:11:54.350 --> 00:11:56.610
of the function we talked about earlier, and

00:11:56.610 --> 00:11:58.830
dividing it by the scale of the noise.

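NOTE
A minimal sketch of the Laplace mechanism under these definitions; the
epsilon value is illustrative, and the counting query has sensitivity 1:
    import numpy as np
    def noisy_count(true_count: float, epsilon: float,
                    sensitivity: float = 1.0) -> float:
        # Noise scale b = sensitivity / epsilon, so epsilon = sensitivity / b.
        b = sensitivity / epsilon
        return true_count + np.random.laplace(loc=0.0, scale=b)
    # The earlier differencing attack now recovers mostly noise:
    print(noisy_count(3, epsilon=0.5) - noisy_count(2, epsilon=0.5))
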
00:11:58.830 --> 00:12:00.730
Okay, let me see if I can visualize the Laplace

00:12:00.730 --> 00:12:03.370
mechanism. It's like an automated version of

00:12:03.370 --> 00:12:05.470
plausible deniability. Yeah, that's a good way

00:12:05.470 --> 00:12:07.350
to put it. It almost sounds like a choir director

00:12:07.350 --> 00:12:09.149
trying to figure out if the group as a whole

00:12:09.149 --> 00:12:11.490
can hit a really difficult high note. Oh, I like

00:12:11.490 --> 00:12:14.210
where this is going. So the director tells half

00:12:14.210 --> 00:12:17.129
the choir to sing the wrong note on purpose.

00:12:17.870 --> 00:12:20.049
The director can still hear the overall pitch

00:12:20.049 --> 00:12:22.710
and volume of the room to know if the group is

00:12:22.710 --> 00:12:25.750
generally capable of hitting the note. but no

00:12:25.750 --> 00:12:28.009
single person can ever be accused of being a

00:12:28.009 --> 00:12:29.929
terrible singer because they can just say, hey,

00:12:29.950 --> 00:12:31.690
I was assigned to sing the wrong note. That is

00:12:31.690 --> 00:12:34.809
a brilliant analogy. The overall signal of the

00:12:34.809 --> 00:12:38.210
room remains completely intact while the individual

00:12:38.210 --> 00:12:41.429
noise provides perfect cover. Right. But this

00:12:41.429 --> 00:12:44.509
raises an incredibly important question. If this

00:12:44.509 --> 00:12:46.850
mechanism is so effective at hiding individuals,

00:12:47.409 --> 00:12:50.090
what stops our attacker from earlier from just

00:12:50.090 --> 00:12:52.429
asking the exact same noisy question over and

00:12:52.429 --> 00:12:55.320
over again? Oh, right. Because if I flip a coin

00:12:55.320 --> 00:12:58.000
once, you don't know my true answer. But if you

00:12:58.000 --> 00:13:00.059
lock me in a room and make me flip a coin a thousand

00:13:00.059 --> 00:13:02.519
times for the same question, the laws of averages

00:13:02.519 --> 00:13:04.399
will eventually strip away the noise and betray

00:13:04.399 --> 00:13:06.200
me, won't they? They absolutely will. The noise

00:13:06.200 --> 00:13:09.139
cancels itself out over time. And that brings

00:13:09.139 --> 00:13:11.240
us to a foundational concept in differential

00:13:11.240 --> 00:13:14.159
privacy called composability. Composability?

00:13:14.240 --> 00:13:16.960
Yeah, it introduces the idea of a privacy loss

00:13:16.960 --> 00:13:19.820
budget. If you query a differentially private

00:13:19.820 --> 00:13:23.049
mechanism a certain number of times, call it t

00:13:23.049 --> 00:13:26.649
times, the privacy guarantee degrades. The protection

00:13:26.649 --> 00:13:29.889
becomes epsilon times t. This is known as sequential

00:13:29.889 --> 00:13:33.429
composition. Basically, every single time you

00:13:33.429 --> 00:13:35.809
ask the database a question, you spend a little

00:13:35.809 --> 00:13:37.769
bit of your privacy budget. It's literally like

00:13:37.769 --> 00:13:39.990
withdrawing money from a bank account. Exactly.
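NOTE
A toy budget ledger showing sequential composition (t queries at epsilon
each cost t times epsilon); the class and its API are hypothetical:
    class PrivacyBudget:
        def __init__(self, total_epsilon: float) -> None:
            self.remaining = total_epsilon
        def spend(self, epsilon: float) -> None:
            # Sequential composition: privacy losses add across queries.
            if epsilon > self.remaining:
                raise RuntimeError("Budget exhausted: lock the database.")
            self.remaining -= epsilon
    budget = PrivacyBudget(total_epsilon=1.0)
    for _ in range(4):
        budget.spend(0.25)  # four queries drain the whole budget
    # budget.spend(0.25)    # a fifth query would raise
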

00:13:40.129 --> 00:13:42.090
Once the budget is spent, the database just has

00:13:42.090 --> 00:13:44.490
to be locked down and shut off, or else you risk

00:13:44.490 --> 00:13:47.049
exposing the individuals inside. Yep. That's

00:13:47.049 --> 00:13:49.740
the reality of it. So with these strict budgets

00:13:49.740 --> 00:13:51.840
and all this complex noise, I mean, who is actually

00:13:51.840 --> 00:13:54.679
using this in the real world? Is this just theoretical

00:13:54.679 --> 00:13:56.860
math sitting in a university lab somewhere? Well,

00:13:56.860 --> 00:14:00.000
not at all. The source details over a dozen major

00:14:00.000 --> 00:14:02.379
real -world deployments. This is actively running

00:14:02.379 --> 00:14:05.200
our digital lives right now. Really? Yeah. The

00:14:05.200 --> 00:14:07.259
US Census Bureau started using it way back in

00:14:07.259 --> 00:14:10.320
2008 to protect commuting patterns. And they

00:14:10.320 --> 00:14:13.620
very notably used differential privacy for the

00:14:13.679 --> 00:14:16.620
2021 redistricting data that came out of the

00:14:16.620 --> 00:14:20.299
2020 census. That is massive. The census literally

00:14:20.299 --> 00:14:22.720
determines political representation and billions

00:14:22.720 --> 00:14:25.299
in federal funding. I imagine injecting noise

00:14:25.299 --> 00:14:27.480
into that data caused some serious friction.

00:14:27.480 --> 00:14:29.820
Oh, it caused a total uproar among demographers.

00:14:29.879 --> 00:14:32.860
They rely on exact block-by-block data. But the

00:14:32.860 --> 00:14:34.740
Census Bureau realized that with modern computing,

00:14:35.259 --> 00:14:38.120
their old anonymization methods were completely

00:14:38.120 --> 00:14:40.080
vulnerable to the reconstruction attacks we talked

00:14:40.080 --> 00:14:42.159
about earlier. The tracker attacks. Right. They

00:14:42.159 --> 00:14:44.769
had to inject noise. even if it meant a neighborhood

00:14:44.769 --> 00:14:47.350
block might statistically look like it has three

00:14:47.350 --> 00:14:49.970
adults and 14 children. It's a crazy trade-off,

00:14:50.129 --> 00:14:52.409
and Big Tech is doing this too, right? Everywhere.

00:14:52.850 --> 00:14:55.629
In 2014, Google launched a system called RAPPOR

00:14:55.629 --> 00:14:58.169
to collect telemetry data like what software

00:14:58.169 --> 00:15:01.269
is crashing or acting maliciously without compromising

00:15:01.269 --> 00:15:03.710
individual users. Makes sense. Apple integrated

00:15:03.710 --> 00:15:06.970
differential privacy into iOS 10 back in 2016

00:15:06.970 --> 00:15:09.529
for their personal assistant. Microsoft used

00:15:09.529 --> 00:15:13.340
it for Windows telemetry in 2017. And in 2020,

00:15:13.899 --> 00:15:16.620
Facebook released a massive 55 trillion cell

00:15:16.620 --> 00:15:19.100
dataset to researchers studying elections and

00:15:19.100 --> 00:15:22.500
democracy. Wait, 55 trillion cells? Why so massive?

00:15:22.700 --> 00:15:24.899
Because they wanted researchers to be able to

00:15:24.899 --> 00:15:27.559
study how election interference spreads across

00:15:27.559 --> 00:15:30.820
demographics. But they absolutely could not release

00:15:30.820 --> 00:15:34.440
the actual browsing habits or political affiliations

00:15:34.440 --> 00:15:38.740
of individual Facebook users. By applying Laplace

00:15:38.740 --> 00:15:41.799
noise to the data set, they could offer macroscopic

00:15:41.799 --> 00:15:45.080
insights into democracy while shielding the microscopic

00:15:45.080 --> 00:15:48.919
user data. So what does this all mean for policymakers

00:15:48.919 --> 00:15:52.299
and for us? Because if Apple, Google, and the

00:15:52.299 --> 00:15:54.580
federal government are making explicit choices

00:15:54.580 --> 00:15:57.139
about the trade-off between privacy and accuracy,

00:15:57.580 --> 00:15:59.419
aren't they just turning a dial behind closed

00:15:59.419 --> 00:16:02.159
doors? That's a very valid concern. Right. Because

00:16:02.159 --> 00:16:04.320
if they inject too little noise because they

00:16:04.320 --> 00:16:06.259
want their data to be highly accurate, aren't

00:16:06.259 --> 00:16:08.580
they giving us a false sense of security? You've

00:16:08.580 --> 00:16:10.759
hit on the central public-purpose tension of

00:16:10.759 --> 00:16:13.320
differential privacy. The main concern is that

00:16:13.320 --> 00:16:16.240
explicit trade-off. More privacy, meaning a

00:16:16.240 --> 00:16:18.940
smaller epsilon and more noise, lowers the utility

00:16:18.940 --> 00:16:22.120
and accuracy of the data set. Less privacy improves

00:16:22.120 --> 00:16:25.840
accuracy, but drastically increases risk. Differential

00:16:25.840 --> 00:16:28.840
privacy is brilliant because it gives you a quantified

00:16:28.840 --> 00:16:31.600
mathematical measure of privacy loss. It gives

00:16:31.600 --> 00:16:33.860
you an upper bound. But you're exactly right.

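NOTE
A rough sense of the dial, assuming a count query with sensitivity 1: the
Laplace noise scale is 1/epsilon, so at epsilon = 1 a published count is
typically off by about ±1, while at epsilon = 0.1 it is typically off by
about ±10. Smaller epsilon, stronger privacy, blurrier data.
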
00:16:34.000 --> 00:16:36.220
If a company or a government sets that parameter

00:16:36.220 --> 00:16:39.659
loosely to favor their own data needs, it encourages

00:16:39.659 --> 00:16:42.399
greater data sharing while masking the actual

00:16:42.399 --> 00:16:45.100
privacy risk to the public. It really is a dial

00:16:45.100 --> 00:16:47.379
they get to turn. And we just have to trust that

00:16:47.379 --> 00:16:49.580
they've set the dial to a setting that actually

00:16:49.580 --> 00:16:52.360
protects us. Yes, unfortunately. But even if

00:16:52.360 --> 00:16:54.299
we trust them, even if they set the dial perfectly,

00:16:54.379 --> 00:16:56.679
and even if the math is flawless on the whiteboard,

00:16:56.740 --> 00:16:59.019
we have to talk about the plot twist of this

00:16:59.019 --> 00:17:02.679
deep dive. Ah, yes. The hardware problem. Exactly.

00:17:02.879 --> 00:17:04.740
Because math... doesn't exist in a vacuum. It

00:17:04.740 --> 00:17:07.779
runs on physical hardware. This is where pristine

00:17:07.779 --> 00:17:10.819
theoretical math collides head on with messy

00:17:10.819 --> 00:17:14.039
computer engineering. Because differential privacy

00:17:14.039 --> 00:17:16.920
techniques are implemented on real silicon computers,

00:17:17.519 --> 00:17:19.900
they are vulnerable to attacks that the math

00:17:19.900 --> 00:17:22.380
simply cannot compensate for. It's perfect math

00:17:22.380 --> 00:17:24.519
meeting imperfect computers. How does the computer

00:17:24.519 --> 00:17:26.799
actually break the math? Let's look at something

00:17:26.799 --> 00:17:30.420
called floating-point arithmetic leakage. Differentially

00:17:30.779 --> 00:17:33.559
private algorithms, like the Laplace mechanism

00:17:33.559 --> 00:17:36.759
we discussed, are built on probability distributions

00:17:36.759 --> 00:17:40.859
that assume continuous numbers. Imagine a smooth,

00:17:41.299 --> 00:17:43.599
infinite, unbroken line of possible numbers.

00:17:43.799 --> 00:17:45.859
Right. In pure math, between the numbers 1 and

00:17:45.859 --> 00:17:48.019
2, there are infinite fractions and decimals.

00:17:48.279 --> 00:17:50.440
Right. But computers don't have infinite memory.

00:17:50.740 --> 00:17:53.220
They can't store infinite decimals. They use

00:17:53.220 --> 00:17:55.740
discrete floating -point arithmetic, which is

00:17:55.740 --> 00:17:59.059
basically binary scientific notation to approximate

00:17:59.059 --> 00:18:02.160
real numbers. OK. They snap to a grid. It's what

00:18:02.160 --> 00:18:05.000
programmers call a leaky abstraction. Because

00:18:05.000 --> 00:18:07.740
of this, if you do a naive software implementation

00:18:07.740 --> 00:18:10.279
of the Laplace mechanism, it actually covers

00:18:10.279 --> 00:18:12.759
less than 80% of all the double-precision floating

00:18:12.759 --> 00:18:14.599
point numbers the computer can generate. Wait,

00:18:14.680 --> 00:18:16.859
wait. So the computer literally skips over certain

00:18:16.859 --> 00:18:19.019
decimal values, like there are gaps in the noise.

00:18:19.180 --> 00:18:21.259
Exactly. There are gaps in the grid. And because

00:18:21.259 --> 00:18:23.859
of that physical hardware limitation, an attacker

00:18:23.859 --> 00:18:25.720
can look at the output, notice that it falls

00:18:25.720 --> 00:18:27.900
into one of those gaps, and immediately know

00:18:27.900 --> 00:18:30.640
the data is skewed. The source notes that just

00:18:30.640 --> 00:18:33.420
one single sample from a naive implementation

00:18:33.420 --> 00:18:36.700
can allow an attacker to distinguish between

00:18:36.700 --> 00:18:40.380
two adjacent data sets, meaning data sets that

00:18:40.380 --> 00:18:42.819
differ by only one person, with a probability

00:18:42.819 --> 00:18:46.519
of more than 35%. That is wild. The math says

00:18:46.519 --> 00:18:48.759
you are perfectly safe. But the silicon chip

00:18:48.759 --> 00:18:51.299
says, actually, I can't process that decimal,

00:18:51.359 --> 00:18:53.779
so I'm leaking your data. Yeah, it's pretty alarming.

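NOTE
A small demonstration of the grid itself, not of the full published
attack: representable doubles get sparser as values grow, so naively
added noise can snap to the grid or vanish entirely.
    import numpy as np
    print(np.spacing(1.0))    # gap to the next double: ~2.2e-16
    print(np.spacing(1e9))    # ~1.2e-07: the grid is much coarser here
    # Noise smaller than the local gap is silently rounded away:
    print(1e9 + 1e-9 == 1e9)  # True: the "noisy" answer is exact again
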
00:18:54.039 --> 00:18:56.779
And the source material mentions another hardware

00:18:56.779 --> 00:18:58.920
vulnerability that feels, I don't know, straight

00:18:58.920 --> 00:19:02.279
out of a Cold War spy movie, timing side-channel

00:19:02.279 --> 00:19:05.779
attacks. Yes. This one is really insidious. Even

00:19:05.779 --> 00:19:08.039
if the output number is perfectly noisy, mathematically

00:19:08.039 --> 00:19:10.440
sound, and falls on the right grid, an attacker

00:19:10.440 --> 00:19:12.559
could just sit back and monitor the time it takes

00:19:12.559 --> 00:19:15.000
the computer's CPU to calculate the answer. Wait,

00:19:15.000 --> 00:19:17.539
how does a stopwatch exfiltrate my medical data?

00:19:17.819 --> 00:19:20.599
So unlike basic integer math, like 2 plus 2,

00:19:20.599 --> 00:19:23.319
which usually takes the CPU the exact same amount

00:19:23.319 --> 00:19:25.809
of time every single time. Floating-point arithmetic

00:19:25.809 --> 00:19:29.329
has input-dependent timing variability. Depending

00:19:29.329 --> 00:19:32.369
on the specific numbers being crunched, the processor

00:19:32.369 --> 00:19:34.890
takes slightly longer or slightly shorter to

00:19:34.890 --> 00:19:37.869
finish the job. For instance, handling what computer

00:19:37.869 --> 00:19:41.009
scientists call subnormals, which are incredibly

00:19:41.009 --> 00:19:43.130
tiny floating-point numbers

00:19:43.130 --> 00:19:46.950
near zero, can take the CPU up to 100 times longer

00:19:46.950 --> 00:19:49.250
than typical calculations. Okay, I think I've

00:19:49.250 --> 00:19:51.230
got an analogy for this. Let's hear it. It's

00:19:51.230 --> 00:19:53.630
like playing poker against a mathematical genius

00:19:53.630 --> 00:19:57.369
who has a flawless, unbreakable poker face. You

00:19:57.369 --> 00:19:59.250
can't read their expression at all, the math

00:19:59.250 --> 00:20:01.789
is perfect. Right. But you notice it takes them

00:20:01.789 --> 00:20:04.390
exactly three seconds longer to place a bet every

00:20:04.390 --> 00:20:06.789
time they have a terrible hand. The face didn't

00:20:06.789 --> 00:20:09.569
give them away, the clock on the wall did. That

00:20:09.569 --> 00:20:12.349
is exactly what a timing side channel attack

00:20:12.349 --> 00:20:15.509
is. And in differential privacy, where you are

00:20:15.509 --> 00:20:17.509
trying to hide a single bit of information, like

00:20:17.509 --> 00:20:20.250
whether Chandler has diabetes or not, a targeted

00:20:20.250 --> 00:20:23.930
timing attack can be used to observe the CPU,

00:20:24.569 --> 00:20:27.119
measure the delay, and exfiltrate the very bit

00:20:27.119 --> 00:20:29.880
the system was designed to protect. That's terrifying.

00:20:30.119 --> 00:20:32.640
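NOTE
A rough benchmark of the subnormal slow path; the exact ratio is
CPU-dependent, and hardware with flush-to-zero enabled may show no gap:
    import numpy as np
    import timeit
    normals = np.full(1_000_000, 1.0)
    subnormals = np.full(1_000_000, 1e-310)  # subnormal doubles near zero
    t_fast = timeit.timeit(lambda: normals * 1.5, number=50)
    t_slow = timeit.timeit(lambda: subnormals * 1.5, number=50)
    print(t_slow / t_fast)  # often well above 1: timing leaks the inputs
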
A catastrophic compromise, really. So to summarize

00:20:32.640 --> 00:20:35.400
this incredible journey we've been on, we started

00:20:35.400 --> 00:20:38.559
with the fundamental law that every single data

00:20:38.559 --> 00:20:42.700
query leaks privacy. Yes. We explored the brilliant

00:20:42.700 --> 00:20:45.460
mathematical solution of differential privacy,

00:20:45.599 --> 00:20:48.900
injecting calibrated noise to give individuals

00:20:48.900 --> 00:20:52.400
plausible deniability while preserving the aggregate

00:20:52.400 --> 00:20:55.579
trends. Right. It is literally what allowed society

00:20:55.579 --> 00:20:57.720
to learn about the forest without inspecting

00:20:57.720 --> 00:21:01.220
the individual trees. But we end with the sobering

00:21:01.220 --> 00:21:03.299
reality that the physical computers running this

00:21:03.299 --> 00:21:05.920
perfect math have engineering vulnerabilities

00:21:05.920 --> 00:21:09.099
like floating-point limits and CPU timing that

00:21:09.099 --> 00:21:11.599
can completely shatter those privacy guarantees.

00:21:11.839 --> 00:21:14.680
It is a constant exhausting arms race between

00:21:14.680 --> 00:21:17.299
theoretical information theory and physical computer

00:21:17.299 --> 00:21:20.480
engineering. But before we wrap up, there's one

00:21:20.480 --> 00:21:23.079
final incredibly provocative concept from the

00:21:23.079 --> 00:21:24.410
source material that we haven't touched on yet.

00:21:24.549 --> 00:21:27.190
It's called group privacy. What happens when

00:21:27.190 --> 00:21:30.289
we look at groups instead of individuals? So

00:21:30.289 --> 00:21:32.029
everything we've discussed so far, the privacy

00:21:32.029 --> 00:21:34.750
budget in Epsilon, is explicitly designed to

00:21:34.750 --> 00:21:36.890
protect one single individual, the difference

00:21:36.890 --> 00:21:39.309
of one row in a database. But what if you were

00:21:39.309 --> 00:21:42.430
a researcher who wants to protect a specific

00:21:42.430 --> 00:21:45.470
group of C individuals, say a whole family of

00:21:45.470 --> 00:21:48.630
five, or a small marginalized community in a

00:21:48.630 --> 00:21:51.490
specific town? OK, yeah. The math shows that

00:21:51.490 --> 00:21:55.019
to protect a group of size C, the risk of exposure

00:21:55.019 --> 00:21:58.180
scales exponentially. Meaning what for our privacy

00:21:58.180 --> 00:22:00.480
budget? The source notes that the probability

00:22:00.480 --> 00:22:03.299
bounds grow by a factor of e to the power of

00:22:03.299 --> 00:22:05.960
epsilon times c. Okay, that's a lot of math.
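NOTE
The group-privacy bound spelled out: if M is epsilon-differentially
private for databases differing in one row, then for two databases
differing in c rows,
    Pr[M(D1) ∈ S] ≤ e^(c·ε) · Pr[M(D2) ∈ S]
so giving a group of size c the same protection one person gets means
running the mechanism at ε/c, the "divide your budget by C" cost
described next.
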

00:22:06.019 --> 00:22:08.140
Let me translate that out of math jargon. It

00:22:08.140 --> 00:22:10.460
basically means you have to drastically slash

00:22:10.460 --> 00:22:12.579
your privacy budget. You essentially have to divide

00:22:12.579 --> 00:22:16.119
your epsilon budget by C. Oh, wow. Yeah. So if

00:22:16.119 --> 00:22:18.240
you want to protect a specific group of 10 people

00:22:18.240 --> 00:22:20.200
with the exact same mathematical guarantee you

00:22:20.200 --> 00:22:22.740
give to one person, you burn through your data

00:22:22.740 --> 00:22:25.700
utility 10 times faster. Wow. It just drains

00:22:25.700 --> 00:22:28.099
the account instantly. Yes, exactly. So here

00:22:28.099 --> 00:22:30.380
is a final thought for you to ponder as you go

00:22:30.380 --> 00:22:33.920
about your day. As our society's data sets get

00:22:33.920 --> 00:22:36.589
larger and larger, but the questions we want

00:22:36.589 --> 00:22:39.430
to ask get more and more specific. Can we ever

00:22:39.430 --> 00:22:42.130
truly study small, marginalized groups or specific

00:22:42.130 --> 00:22:45.349
families without completely exhausting our mathematical

00:22:45.349 --> 00:22:48.049
privacy budget and rendering the data totally

00:22:48.049 --> 00:22:51.089
useless? It's a huge question. Does the hard,

00:22:51.289 --> 00:22:53.809
unforgiving math of differential privacy ultimately

00:22:53.809 --> 00:22:56.269
force us to choose between protecting minority

00:22:56.269 --> 00:22:58.589
groups from exposure and actually understanding

00:22:58.589 --> 00:23:01.210
them well enough to help them? It is a profound,

00:23:01.730 --> 00:23:04.480
unresolved limitation of the math, really. And

00:23:04.480 --> 00:23:06.579
it has massive implications for the future of

00:23:06.579 --> 00:23:08.680
social science and public policy. It really does.

00:23:08.759 --> 00:23:10.339
I invite you to mull that over until our next

00:23:10.339 --> 00:23:11.579
deep dive. Thanks for listening.
