WEBVTT

00:00:00.000 --> 00:00:03.839
Imagine for a second that you have a perfect

00:00:03.839 --> 00:00:07.160
mathematical toolkit. It's shiny, it's precise,

00:00:07.160 --> 00:00:09.599
and it has successfully cracked open some of

00:00:09.599 --> 00:00:11.400
the biggest mysteries in the history of science.

00:00:12.039 --> 00:00:14.800
But then you walk up to a totally new door. And

00:00:14.800 --> 00:00:17.260
what's behind it? Well, maybe behind this door

00:00:17.260 --> 00:00:20.940
is, you know, the complex genetic history of

00:00:20.940 --> 00:00:24.460
humanity, or the chaotic branching spread of

00:00:24.460 --> 00:00:27.660
a global epidemic, or even the incredibly messy

00:00:27.660 --> 00:00:30.519
bouncing signals of radio wave propagation. Oh,

00:00:30.539 --> 00:00:33.060
wow. Yeah, the really big stuff. Exactly. So

00:00:33.060 --> 00:00:34.880
you pull out your perfect toolkit. You try to

00:00:34.880 --> 00:00:36.799
pick the lock. And the lock is just too complex.

00:00:36.920 --> 00:00:39.340
The math literally gets stuck. It just completely

00:00:39.340 --> 00:00:41.399
fails. Yeah. So what do you do? You don't just

00:00:41.399 --> 00:00:43.679
throw up your hands and give up. You learn how

00:00:43.679 --> 00:00:46.619
to approximate. Welcome to today's deep dive.

00:00:46.759 --> 00:00:49.679
I am very excited for this one. Me too. Our mission

00:00:49.679 --> 00:00:51.979
today is exploring a comprehensive breakdown

00:00:51.979 --> 00:00:54.899
of something called approximate Bayesian computation,

00:00:55.359 --> 00:00:59.200
or ABC. And I know that title sounds incredibly

00:00:59.200 --> 00:01:02.000
dense. It really does. But I promise you, this

00:01:02.000 --> 00:01:05.040
is actually a master class in how modern scientists

00:01:05.040 --> 00:01:08.099
solve the impossible problems in our messy, unpredictable

00:01:08.099 --> 00:01:11.959
world. The goal for you today is to walk away

00:01:11.959 --> 00:01:14.650
understanding how researchers bypass absolute

00:01:14.650 --> 00:01:18.950
mathematical dead ends using some, well, incredibly

00:01:18.950 --> 00:01:21.950
clever simulations. It's really true. We are

00:01:21.950 --> 00:01:25.230
looking at a framework that fundamentally changed

00:01:25.230 --> 00:01:28.010
what kind of questions statisticians were even

00:01:28.010 --> 00:01:30.250
allowed to ask. Allowed to ask. Yeah, because

00:01:30.250 --> 00:01:32.230
for a long time if you couldn't write out the

00:01:32.230 --> 00:01:35.519
exact perfect equation for a phenomenon... You

00:01:35.519 --> 00:01:37.140
essentially couldn't study it rigorously, you

00:01:37.140 --> 00:01:40.159
were just locked out. So approximate Bayesian

00:01:40.159 --> 00:01:43.299
computation, or ABC, is really about trading

00:01:43.299 --> 00:01:46.159
a little bit of theoretical purity for a massive,

00:01:46.159 --> 00:01:48.400
massive amount of practical power. It allows

00:01:48.400 --> 00:01:50.500
us to look at systems that are far too tangled

00:01:50.500 --> 00:01:52.620
for traditional math. OK, let's unpack this.

00:01:52.680 --> 00:01:55.200
Let's unpack this lockpicking hack. Imagine you

00:01:55.200 --> 00:01:57.659
are trying to reverse engineer a master chef's

00:01:57.659 --> 00:01:59.799
secret sauce. I like where this is going. Right.

00:01:59.939 --> 00:02:01.500
You can't just walk into the kitchen and ask

00:02:01.500 --> 00:02:03.379
them for the exact recipe that is completely

00:02:03.379 --> 00:02:05.920
off limits. It'd throw you out. Exactly. So you

00:02:05.920 --> 00:02:08.599
just keep making your own sauces, you tweak the

00:02:08.599 --> 00:02:11.000
ingredients, you taste yours, you taste the original,

00:02:11.099 --> 00:02:12.960
and you throw away the ones that taste terrible.

00:02:13.080 --> 00:02:15.259
Naturally. And you just keep the ones that are

00:02:15.259 --> 00:02:18.240
close enough until you find the closest possible

00:02:18.240 --> 00:02:22.060
match. That essentially is what ABC does with

00:02:22.060 --> 00:02:25.099
incredibly complex data. That is a remarkably

00:02:25.099 --> 00:02:27.479
accurate way to picture the mechanics of it.

00:02:28.080 --> 00:02:31.080
Spot on. But to really understand why scientists

00:02:31.080 --> 00:02:33.439
are forced to approximate this sauce in the first

00:02:33.439 --> 00:02:35.740
place, we have to look at the original recipe

00:02:35.740 --> 00:02:37.539
book they were trying to use. Which is Bayes'

00:02:37.539 --> 00:02:40.259
Theorem, right? Exactly. Bayes' Theorem. Anyone

00:02:40.259 --> 00:02:42.400
who follows statistics knows Bayesian inference

00:02:42.400 --> 00:02:46.680
is all about updating your beliefs based on new

00:02:46.680 --> 00:02:49.180
evidence. You start with a prior. Which is what

00:02:49.180 --> 00:02:51.560
you believe before seeing any data at all. Right.

00:02:51.680 --> 00:02:53.340
Then you look at the data through something called

00:02:53.340 --> 00:02:55.719
the likelihood function. And finally, you get

00:02:55.719 --> 00:02:58.210
your posterior, which is your updated belief.

00:02:58.530 --> 00:03:00.650
OK, so if I think a coin is fair, that is my

00:03:00.650 --> 00:03:02.810
prior, right? And then if I flip it 10 times

00:03:02.810 --> 00:03:05.150
and get 10 heads in a row, the likelihood of

00:03:05.150 --> 00:03:07.789
that happening with a genuinely fair coin is

00:03:07.789 --> 00:03:10.509
tiny. Extremely tiny. So my posterior belief

00:03:10.509 --> 00:03:13.469
updates to say, hey, this coin is probably rigged.

00:03:13.569 --> 00:03:16.750
We're all fairly familiar with that cycle. Yeah,

00:03:16.789 --> 00:03:19.310
and for simple models like a coin flip, deriving

00:03:19.310 --> 00:03:22.430
an exact analytical formula for that likelihood

00:03:22.430 --> 00:03:25.050
function is neat and tidy. I mean, you can write

00:03:25.050 --> 00:03:28.650
it on a napkin. Sure. But here is the core bottleneck

00:03:28.650 --> 00:03:32.310
that sparked this entire field. What happens

00:03:32.310 --> 00:03:35.909
when your model is not a coin flip? Right, because

00:03:35.909 --> 00:03:38.490
the real world isn't a coin flip. Exactly. What

00:03:38.490 --> 00:03:41.110
if your model is a massive ecological network

00:03:41.110 --> 00:03:44.830
or the cellular pathways of a disease spreading

00:03:44.830 --> 00:03:47.419
through a huge population? That sounds mathematically

00:03:47.419 --> 00:03:50.719
terrifying. It is. To calculate the exact likelihood

00:03:50.719 --> 00:03:53.120
for those complex, highly connected systems,

00:03:53.620 --> 00:03:55.479
you would have to mathematically account for

00:03:55.479 --> 00:03:58.500
every single interacting variable at once. Oh,

00:03:58.539 --> 00:04:00.719
wow. It is like trying to calculate the exact

00:04:00.719 --> 00:04:03.099
trajectory of every individual water molecule

00:04:03.099 --> 00:04:05.439
in a hurricane. That's impossible. Right. It

00:04:05.439 --> 00:04:08.979
is either mathematically elusive, or it requires

00:04:08.979 --> 00:04:11.580
so much computational power that the universe

00:04:11.580 --> 00:04:13.740
would end before a supercomputer finished the

00:04:13.740 --> 00:04:16.720
math. So the likelihood is the linchpin. If we

00:04:16.720 --> 00:04:18.759
can't calculate how likely the data is under

00:04:18.759 --> 00:04:21.879
our model... the whole Bayesian machine grinds

00:04:21.879 --> 00:04:24.079
to a halt. Completely to a halt. That must have

00:04:24.079 --> 00:04:26.800
been incredibly frustrating. You have these brilliant

00:04:26.800 --> 00:04:29.500
researchers wanting to study messy real world

00:04:29.500 --> 00:04:32.879
biology or climate science, but they are stuck

00:04:32.879 --> 00:04:35.839
playing with neat, analytically tractable models

00:04:35.839 --> 00:04:37.980
just because the math works. Yeah, it was a huge

00:04:37.980 --> 00:04:40.000
problem. It's like searching for your lost keys

00:04:40.000 --> 00:04:41.660
under the street lamp just because the light

00:04:41.660 --> 00:04:43.139
is better there, even though you drop them in

00:04:43.139 --> 00:04:46.660
the dark alley. That's a great analogy. And what's

00:04:46.660 --> 00:04:48.860
fascinating here is that a statistician named

00:04:48.860 --> 00:04:52.089
Donald Rubin pointed out this exact absurdity

00:04:52.089 --> 00:04:55.350
back in 1984. Oh, really? In the 80s? Yeah. He

00:04:55.350 --> 00:04:57.850
argued that applied statisticians shouldn't just

00:04:57.850 --> 00:04:59.990
let the limitations of the likelihood function

00:04:59.990 --> 00:05:03.529
dictate the scope of their curiosity. He envisioned

00:05:03.529 --> 00:05:05.589
computational methods that could estimate the

00:05:05.589 --> 00:05:08.350
posterior without ever needing to calculate the

00:05:08.350 --> 00:05:11.579
exact math of the likelihood. Just bypassing

00:05:11.579 --> 00:05:13.959
the roadblock entirely. Exactly. Opening up a

00:05:13.959 --> 00:05:17.199
much wider range of real-world models. And since

00:05:17.199 --> 00:05:19.480
traditional Bayes hit that computational brick

00:05:19.480 --> 00:05:23.500
wall, scientists needed a workaround. Which I

00:05:23.500 --> 00:05:26.680
guess brings us back to our chef sauce, the ABC

00:05:26.680 --> 00:05:29.779
rejection algorithm. Yes. The basic rejection

00:05:29.779 --> 00:05:32.560
algorithm is the simplest, most fundamental form

00:05:32.560 --> 00:05:35.860
of this entire concept. It completely bypasses

00:05:35.860 --> 00:05:38.029
the likelihood function. Let me break down the

00:05:38.029 --> 00:05:40.209
actual mechanism for you. Please do. Step one,

00:05:40.350 --> 00:05:42.389
you sample a random parameter point from your

00:05:42.389 --> 00:05:44.910
prior distribution. So you essentially make an

00:05:44.910 --> 00:05:47.470
educated guess about how the system works. Exactly.

00:05:47.610 --> 00:05:50.509
Just a guess. Then step two, you run a computer

00:05:50.509 --> 00:05:53.589
simulation using that guessed parameter to generate

00:05:53.589 --> 00:05:56.550
a fake simulated data set. Okay, so I guess a

00:05:56.550 --> 00:05:59.189
recipe, and then I cook the sauce. Spot on. Step

00:05:59.189 --> 00:06:01.509
three, you compare your simulated data set to

00:06:01.509 --> 00:06:04.189
the actual observed data set from the real world.

00:06:04.529 --> 00:06:06.850
Use a distance measure to calculate how mathematically

00:06:06.850 --> 00:06:09.470
far apart they are. So I taste both sauces to

00:06:09.470 --> 00:06:12.009
see how different they are. Right. And step four

00:06:12.009 --> 00:06:14.290
is the critical one. If the distance between

00:06:14.290 --> 00:06:17.970
your simulated data and the real data is less

00:06:17.970 --> 00:06:21.649
than or equal to a specific tolerance level.

00:06:21.850 --> 00:06:24.410
Like a tolerance level? Yeah, which the math calls

00:06:24.410 --> 00:06:27.050
epsilon. So if it's within that tolerance, you

00:06:27.050 --> 00:06:30.050
accept that parameter as a good guess. If the

00:06:30.050 --> 00:06:32.189
distance is greater than the tolerance, you throw

00:06:32.189 --> 00:06:35.230
it in the trash. Just dump it. Exactly. You repeat

00:06:35.230 --> 00:06:38.069
this millions of times, and the parameters you

00:06:38.069 --> 00:06:40.629
accept form an approximate sample of your desired

00:06:40.629 --> 00:06:43.100
posterior distribution.
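
A minimal Python sketch of that rejection loop might look like the following. The toy simulator, the prior bounds, the mean-based distance, and the tolerance value are all illustrative assumptions, not details taken from any particular study.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=100):
    # Toy simulator: n noisy observations centred on the parameter theta.
    return rng.normal(loc=theta, scale=1.0, size=n)

def distance(simulated, observed):
    # Compare data sets through a simple summary (their means), not raw values.
    return abs(simulated.mean() - observed.mean())

observed = simulate(theta=2.5)   # stand-in for the real-world data set
epsilon = 0.1                    # tolerance level
accepted = []

for _ in range(50_000):
    theta = rng.uniform(-10, 10)             # Step 1: guess from the prior
    fake = simulate(theta)                   # Step 2: simulate a fake data set
    if distance(fake, observed) <= epsilon:  # Step 3: measure the mismatch
        accepted.append(theta)               # Step 4: keep only close-enough guesses

# The accepted guesses form an approximate sample from the posterior of theta.
print(len(accepted), np.mean(accepted))
```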

00:06:43.100 --> 00:06:45.060
What blows my mind is that the history of this idea didn't just start

00:06:45.060 --> 00:06:48.480
in the 1980s. No, it goes way back. Right. While

00:06:48.480 --> 00:06:51.439
Rubin conceptualized it as a modern thought experiment,

00:06:51.959 --> 00:06:54.220
Francis Galton actually built a physical version

00:06:54.220 --> 00:06:57.439
of this in the late 1800s. The quincunx. Yeah,

00:06:57.439 --> 00:06:59.579
he used a mechanical device called a two-stage

00:06:59.579 --> 00:07:02.160
quincunx. It was basically a board with pegs

00:07:02.160 --> 00:07:04.279
and he would drop beads through it. Like a plinko

00:07:04.279 --> 00:07:07.519
board. Exactly like plinko. The way the beads

00:07:07.519 --> 00:07:10.160
bounced and settled at the bottom was a physical

00:07:10.160 --> 00:07:12.870
simulation of probability. He was physically

00:07:12.870 --> 00:07:15.129
dropping beads to implement an ABC rejection

00:07:15.129 --> 00:07:18.470
scheme, accepting or rejecting outcomes to approximate

00:07:18.470 --> 00:07:20.550
a parameter without writing out the underlying

00:07:20.550 --> 00:07:23.209
equation. Galton's machine was incredibly ahead

00:07:23.209 --> 00:07:26.189
of its time. But the real breakthrough for modern

00:07:26.189 --> 00:07:28.389
science, where this jumped from a statistical

00:07:28.389 --> 00:07:31.129
curiosity to a vital tool, happened much later.

00:07:31.189 --> 00:07:34.839
When was that? In 1997, Simon Tavaré and his

00:07:34.839 --> 00:07:37.220
colleagues were looking at the genealogy of DNA.

00:07:37.879 --> 00:07:40.160
They were trying to find coalescence times, basically,

00:07:40.279 --> 00:07:42.779
tracing mutations back to find the time of the

00:07:42.779 --> 00:07:45.519
most recent common ancestor for a set of DNA

00:07:45.519 --> 00:07:48.839
sequences. And the math was too hard. The analytical

00:07:48.839 --> 00:07:51.500
math for their demographic models was completely

00:07:51.500 --> 00:07:53.920
impossible. So instead of trying to write an

00:07:53.920 --> 00:07:57.079
equation, they simulated thousands of fake evolutionary

00:07:57.079 --> 00:08:00.579
family trees. And they accepted or rejected those

00:08:00.579 --> 00:08:03.680
fake trees based on comparing the number of mutations

00:08:03.680 --> 00:08:06.439
in their synthetic DNA to the real DNA. That

00:08:06.439 --> 00:08:09.379
is brilliant. It was a game changer. Then in

00:08:09.379 --> 00:08:12.740
1999, Jonathan Pritchard used this same hack

00:08:12.740 --> 00:08:15.899
to model human Y chromosomes. And finally,

00:08:16.060 --> 00:08:20.319
in 2002, Mark Beaumont formalized the term approximate

00:08:20.319 --> 00:08:23.399
Bayesian computation. Here's where it gets really

00:08:23.399 --> 00:08:25.220
interesting. Let's talk about that tolerance

00:08:25.220 --> 00:08:28.699
level, epsilon, because there is a massive paradox

00:08:28.699 --> 00:08:30.600
here for you to consider as a listener. Oh, the

00:08:30.600 --> 00:08:33.039
tolerance paradox, yes. If you set your tolerance

00:08:33.039 --> 00:08:36.340
to zero, meaning you demand an absolutely exact,

00:08:36.659 --> 00:08:39.379
perfect match between your simulated data and

00:08:39.379 --> 00:08:42.000
the real-world data, you are going to reject

00:08:42.000 --> 00:08:44.559
almost everything. Everything. Because the real

00:08:44.559 --> 00:08:47.460
world is just too complex. Exactly. The probability

00:08:47.460 --> 00:08:49.779
of a computer generating a perfectly identical

00:08:49.779 --> 00:08:52.580
data set to real life is practically zero. But

00:08:52.580 --> 00:08:54.940
on the flip side, if your tolerance is too high,

00:08:55.500 --> 00:08:57.360
like if you are too forgiving, you just accept

00:08:57.360 --> 00:08:59.419
every single simulation. Which is totally useless.

00:08:59.559 --> 00:09:01.360
Right. You learn absolutely nothing and your

00:09:01.360 --> 00:09:03.340
final result just looks exactly like your initial

00:09:03.340 --> 00:09:06.679
guess. Finding that Goldilocks tolerance, an epsilon

00:09:06.679 --> 00:09:09.919
greater than zero but not too large, is the entire

00:09:09.919 --> 00:09:13.289
art of ABC. It validates the approach because

00:09:13.289 --> 00:09:16.470
it provides a sample of parameter values approximately

00:09:16.470 --> 00:09:19.230
distributed according to the truth without ever

00:09:19.230 --> 00:09:21.429
touching a likelihood function. This sounds great

00:09:21.429 --> 00:09:23.929
in theory until I try to scale it up. What do

00:09:23.929 --> 00:09:27.350
you mean? If I am studying global weather patterns

00:09:27.350 --> 00:09:30.309
or sequencing a massive genome, my dataset is

00:09:30.309 --> 00:09:33.590
gigantic. It is highly dimensional. If I try

00:09:33.590 --> 00:09:36.669
to compare a massive simulated dataset to a massive

00:09:36.669 --> 00:09:39.570
real dataset, even with a generous tolerance,

00:09:40.029 --> 00:09:42.490
the probability of them being even remotely close

00:09:42.490 --> 00:09:45.009
drops back down to near zero. Yes, and there

00:09:45.009 --> 00:09:48.330
you run face-first into what statisticians ominously

00:09:48.330 --> 00:09:51.070
call the curse of dimensionality. Sounds like

00:09:51.070 --> 00:09:53.649
a movie title. It really does. But it's a huge

00:09:53.649 --> 00:09:56.669
problem. Comparing full, highly complex data

00:09:56.669 --> 00:09:59.610
sets is wildly inefficient computationally. The

00:09:59.610 --> 00:10:01.669
acceptance rate plummets to a fraction of a percent.

00:10:01.870 --> 00:10:04.110
So how do they fix it? To solve this, scientists

00:10:04.110 --> 00:10:06.730
use summary statistics. Instead of comparing

00:10:06.730 --> 00:10:09.509
the raw full data set, they distill the massive

00:10:09.509 --> 00:10:11.950
data down into a set of lower dimensional numbers.

00:10:11.950 --> 00:10:14.289
Oh, I see. They extract only the most relevant

00:10:14.289 --> 00:10:17.009
information and compare those summaries instead.

00:10:17.590 --> 00:10:20.190
So instead of comparing every single pixel in

00:10:20.190 --> 00:10:22.429
two high-resolution photographs to see if they

00:10:22.429 --> 00:10:25.409
match, you just compare like the average brightness

00:10:25.409 --> 00:10:28.529
and the primary color profile. Exactly. In a

00:10:28.529 --> 00:10:31.090
perfect mathematical world, you would use what

00:10:31.090 --> 00:10:33.799
are called sufficient statistics. Sufficient

00:10:33.799 --> 00:10:36.919
meaning? A sufficient statistic captures absolutely

00:10:36.919 --> 00:10:40.320
all the vital information in the data about the

00:10:40.320 --> 00:10:42.340
parameter you care about. Think about a simple

00:10:42.340 --> 00:10:45.919
coin flip again. OK. If you flip a coin 100 times,

00:10:46.500 --> 00:10:48.740
a sufficient statistic is just the total number

00:10:48.740 --> 00:10:52.279
of heads. You don't need to know the exact chronological

00:10:52.279 --> 00:10:54.639
sequence of heads, tails, tails, heads. Right.

00:10:54.659 --> 00:10:56.620
The order doesn't matter. Exactly. Just the total

00:10:56.620 --> 00:10:58.539
number gives you 100% of the information you

00:10:58.539 --> 00:11:00.669
need to know if the coin is fair. If you use

00:11:00.669 --> 00:11:03.950
a sufficient statistic, you gain incredible computational

00:11:03.950 --> 00:11:06.809
efficiency without introducing any error at all.
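
As a small, hypothetical illustration of that point: for repeated coin flips, the count of heads is the sufficient statistic, so two sequences that look nothing alike in order are indistinguishable once summarized.

```python
import numpy as np

rng = np.random.default_rng(1)

flips_a = rng.random(100) < 0.5      # one sequence of 100 fair-coin flips
flips_b = np.sort(flips_a)[::-1]     # the same flips in a completely different order

def summary(flips):
    # Sufficient statistic for the coin's bias: the total number of heads.
    return int(flips.sum())

print(summary(flips_a), summary(flips_b))   # identical summaries
# In ABC, the distance would be computed on these summaries,
# not on the full 100-element sequences.
```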

00:11:07.129 --> 00:11:09.730
But the source outlines a massive catch here.

00:11:10.210 --> 00:11:12.669
Outside of a very specific mathematical family

00:11:12.669 --> 00:11:14.889
called the exponential family of distributions,

00:11:15.830 --> 00:11:18.070
finding a finite set of sufficient statistics

00:11:18.070 --> 00:11:20.809
for complex models is practically impossible.

00:11:21.409 --> 00:11:23.649
It really is. So researchers are almost always

00:11:23.649 --> 00:11:26.320
forced to use insufficient summary statistics.

00:11:26.799 --> 00:11:29.100
These are informative metrics, but they inherently

00:11:29.100 --> 00:11:31.580
throw away some of the data's complexity. Okay,

00:11:31.679 --> 00:11:33.759
I have to push back on this entirely. Go for

00:11:33.759 --> 00:11:36.500
it. Wait, if we're specifically studying complex

00:11:36.500 --> 00:11:38.440
systems, aren't we shooting ourselves in the

00:11:38.440 --> 00:11:40.899
foot by throwing away data to create these summary

00:11:40.899 --> 00:11:43.620
statistics? Doesn't this information loss inflate

00:11:43.620 --> 00:11:46.299
our credible intervals and introduce bias? That

00:11:46.299 --> 00:11:49.100
is a very valid critique. This raises an important

00:11:49.100 --> 00:11:51.019
question about how to choose these statistics.

00:11:51.500 --> 00:11:53.720
Because the inexactness of the tolerance introduces

00:11:53.720 --> 00:11:56.460
one layer of bias, and the insufficiency of the

00:11:56.460 --> 00:11:59.179
summary statistics introduces a second, entirely

00:11:59.179 --> 00:12:01.460
different layer of bias. So you're compounding

00:12:01.460 --> 00:12:04.840
the errors? Potentially, yes. How we choose these

00:12:04.840 --> 00:12:06.840
statistics is arguably the most dangerous part

00:12:06.840 --> 00:12:09.919
of the process. If you choose poorly, your results

00:12:09.919 --> 00:12:12.360
will be overly broad or just flat-out wrong.

00:12:12.559 --> 00:12:16.059
So what are the remedies? The text outlines several

00:12:16.059 --> 00:12:18.259
remedies researchers use to defend against this.

00:12:18.919 --> 00:12:21.519
One major framework focuses on minimizing something

00:12:21.519 --> 00:12:25.080
called the expected posterior entropy, or EPE.

00:12:25.389 --> 00:12:28.250
Let's unpack EPE because it sounds intimidating.

00:12:28.909 --> 00:12:31.250
Entropy in this context is basically a measure

00:12:31.250 --> 00:12:33.809
of uncertainty or surprise, right? Yes, exactly.

00:12:34.330 --> 00:12:36.450
Minimizing the expected posterior entropy

00:12:36.450 --> 00:12:38.970
means you are trying to choose summary statistics

00:12:38.970 --> 00:12:41.830
that give you the sharpest, most confident updated

00:12:41.830 --> 00:12:44.590
belief possible. Making the data as clear as

00:12:44.590 --> 00:12:46.929
possible. Right. You are mathematically testing

00:12:46.929 --> 00:12:49.289
different combinations of summaries to find the

00:12:49.289 --> 00:12:51.929
ones that maximize the useful signal and minimize

00:12:51.929 --> 00:12:53.990
the unhelpful noise. That makes sense. There

00:12:53.990 --> 00:12:57.409
is also a semi-automatic ABC approach. Before

00:12:57.409 --> 00:12:59.750
they even run the main experiment, researchers

00:12:59.750 --> 00:13:02.570
use linear regression on preliminary simulated

00:13:02.570 --> 00:13:05.649
data to figure out which summaries best approximate

00:13:05.649 --> 00:13:07.889
the truth. So it is highly sophisticated prep

00:13:07.889 --> 00:13:10.389
work just to decide how to summarize the data

00:13:10.389 --> 00:13:12.470
before the real test even begins. Exactly.
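
A rough sketch of that idea, in the spirit of semi-automatic ABC: run pilot simulations with known parameters, regress the parameter on several candidate summaries, and use the fitted linear combination as the single summary for the main run. The simulator and candidate summaries below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(theta, n=50):
    # Hypothetical simulator returning a raw data vector.
    return rng.normal(theta, 1.0, size=n)

def candidate_summaries(data):
    # Several plausible, possibly redundant summaries of the raw data.
    return np.array([data.mean(), np.median(data), data.std(), data.max()])

# Pilot run: simulate with parameters drawn from the prior, so theta is known.
pilot_thetas = rng.uniform(-5, 5, size=2000)
pilot_summaries = np.array([candidate_summaries(simulate(t)) for t in pilot_thetas])

# Regress theta on the candidate summaries; the fitted linear combination
# then serves as the single summary statistic for the main ABC run.
design = np.column_stack([np.ones(len(pilot_summaries)), pilot_summaries])
coef, *_ = np.linalg.lstsq(design, pilot_thetas, rcond=None)

def learned_summary(data):
    return float(np.concatenate([[1.0], candidate_summaries(data)]) @ coef)
```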

00:13:12.570 --> 00:13:14.350
You have to tune the instrument before you can

00:13:14.350 --> 00:13:16.990
play it. Let's ground this theory in reality.

00:13:17.409 --> 00:13:19.789
To see how this actually works in the wild, the

00:13:19.789 --> 00:13:22.269
source material breaks down how biologists use

00:13:22.269 --> 00:13:25.769
this to study fruit flies. Specifically, Drosophila

00:13:25.769 --> 00:13:29.049
melanogaster and a transcription factor called

00:13:29.049 --> 00:13:32.379
the sonic hedgehog gene. The behavior of this

00:13:32.379 --> 00:13:35.820
gene is incredibly difficult to measure directly.

00:13:36.139 --> 00:13:38.860
It can be modeled using a dynamic bistable hidden

00:13:38.860 --> 00:13:42.000
Markov model. Okay, dense name. Very dense. But

00:13:42.000 --> 00:13:44.779
the mechanism is actually intuitive. Imagine

00:13:44.779 --> 00:13:47.639
the biological system has only two states, like

00:13:47.639 --> 00:13:49.820
a faulty light switch. It is either in state

00:13:49.820 --> 00:13:53.679
A or state B. The probability of the system transitioning

00:13:53.679 --> 00:13:56.019
from one state to the other at any given moment

00:13:56.019 --> 00:13:58.220
is the hidden parameter we are trying to find

00:13:58.220 --> 00:14:00.759
our theta. So we can't see the switch directly.

00:14:00.980 --> 00:14:03.460
We can only guess its position based on the light

00:14:03.460 --> 00:14:05.980
it gives off. And our lab equipment isn't perfect,

00:14:06.019 --> 00:14:08.200
so there's a chance we measure the state incorrectly.

00:14:08.700 --> 00:14:10.940
Let's say our measurements are only 80% accurate.

00:14:11.179 --> 00:14:14.320
Right. Our gamma is 0.8. So we observe this

00:14:14.320 --> 00:14:16.340
gene in a real fruit fly for a short sequence

00:14:16.340 --> 00:14:19.360
of time, say, 20 microscopic frames. It spends

00:14:19.360 --> 00:14:21.539
most of the time in state A, but occasionally

00:14:21.539 --> 00:14:24.100
flickers to state B. OK. Because observing the

00:14:24.100 --> 00:14:26.539
raw sequence of 20 frames is highly dimensional,

00:14:27.139 --> 00:14:30.019
the researchers need a summary statistic. They

00:14:30.019 --> 00:14:32.139
decide to simply count the number of times the

00:14:32.139 --> 00:14:34.559
state switches. So just counting the flips. Exactly.

00:14:34.899 --> 00:14:37.580
In our real fruit fly observation, the gene switches

00:14:37.580 --> 00:14:39.720
states exactly six times. So our summary statistic

00:14:39.720 --> 00:14:42.980
is six. Next, we set our tolerance. We set our

00:14:42.980 --> 00:14:45.700
epsilon to 2. This means if a computer simulation

00:14:45.700 --> 00:14:48.120
produces a sequence with four, five, six, seven,

00:14:48.139 --> 00:14:50.679
or eight switches, we accept the parameter that

00:14:50.679 --> 00:14:53.740
generated it. If it produces 15 switches, we

00:14:53.740 --> 00:14:56.620
throw it out. Perfect. So we start the simulations

00:14:56.620 --> 00:14:59.139
with random guesses for the transition probability,

00:14:59.320 --> 00:15:01.860
our theta. Let's say our computer randomly guesses

00:15:01.860 --> 00:15:05.620
a probability of 0.43. Okay. It runs the simulated

00:15:05.620 --> 00:15:08.039
fruit fly, and the fake sequence happens to have

00:15:08.039 --> 00:15:10.620
exactly six switches. The distance to our real

00:15:10.620 --> 00:15:13.620
data is zero. 0 is less than our tolerance of

00:15:13.620 --> 00:15:17.200
2, so we accept 0.43 as a highly plausible parameter.

00:15:17.720 --> 00:15:20.639
But then, our next random guess is a much higher

00:15:20.639 --> 00:15:23.960
probability, say 0.68. Right. The simulation

00:15:23.960 --> 00:15:26.120
runs, and because the transition probability

00:15:26.120 --> 00:15:29.519
is so high, the sequence is incredibly noisy.

00:15:30.120 --> 00:15:32.120
The faulty light switch flips back and forth

00:15:32.120 --> 00:15:34.960
constantly, resulting in 13 switches. A lot of

00:15:34.960 --> 00:15:37.240
flickering? Yeah. The distance from our real

00:15:37.240 --> 00:15:41.100
data is 7. Since 7 is far greater than our tolerance

00:15:41.100 --> 00:15:44.779
of 2, we reject 0.68 entirely. Exactly. You

00:15:44.779 --> 00:15:47.360
run this thousands of times. Instead of trying

00:15:47.360 --> 00:15:49.840
to calculate the microscopic genetic flips using

00:15:49.840 --> 00:15:52.840
impossible analytical math, you just keep the

00:15:52.840 --> 00:15:55.159
parameters from the simulations that act like

00:15:55.159 --> 00:15:57.840
the real thing. So what does this all mean? In

00:15:57.840 --> 00:16:00.259
this specific experiment, the algorithm hones

00:16:00.259 --> 00:16:02.759
in on a cluster of values that accurately reflect

00:16:02.759 --> 00:16:06.759
the hidden true transition rate. But the source

00:16:06.759 --> 00:16:09.039
text uses this exact experiment to deliver a

00:16:09.039 --> 00:16:11.519
massive warning about summary statistics. This

00:16:11.519 --> 00:16:14.379
is a red flag. A huge red flag. In this fruit

00:16:14.379 --> 00:16:16.759
fly example, simply counting the number of switches

00:16:16.759 --> 00:16:19.240
is an insufficient statistic. Counting switches

00:16:19.240 --> 00:16:21.360
tells you how many times it flips, but it throws

00:16:21.360 --> 00:16:23.460
away all the temporal data about when it flipped.

00:16:23.620 --> 00:16:26.759
Right, the timing is lost. Exactly. The text

00:16:26.759 --> 00:16:29.299
proves mathematically that even if you drop your

00:16:29.299 --> 00:16:31.860
tolerance down to zero, meaning you only accept

00:16:31.860 --> 00:16:34.460
simulations that produce exactly six switches,

00:16:35.019 --> 00:16:38.019
no more, no less, there is still a significant

00:16:38.019 --> 00:16:40.919
deviation from the theoretical true answer. Wow.

00:16:41.080 --> 00:16:44.480
Even at zero tolerance? Yes. Because the summary

00:16:44.480 --> 00:16:47.259
statistic was insufficient, it skewed the result

00:16:47.399 --> 00:16:50.139
even at perfect tolerance. The only way to fix

00:16:50.139 --> 00:16:53.820
that underlying bias is to observe much longer

00:16:53.820 --> 00:16:56.600
sequences of data. That fruit fly example proves

00:16:56.600 --> 00:16:58.860
that even if your simulation perfectly matches

00:16:58.860 --> 00:17:01.340
your summary statistic, your underlying parameter

00:17:01.340 --> 00:17:03.860
could still be completely wrong. It's sobering.
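
Condensed into a minimal sketch, the whole walkthrough, simulating the two-state system, counting observed switches, and accepting guesses whose count lands within two of the observed six, looks roughly like this. The starting state and a few other details are assumptions filled in for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_switch_count(theta, gamma=0.8, n_frames=20):
    # theta: probability the hidden state flips between frames (the parameter we want).
    # gamma: probability each frame is measured correctly (the 80% accuracy above).
    state = 0  # assumed starting state; the source does not specify it
    observed = []
    for _ in range(n_frames):
        if rng.random() < theta:
            state = 1 - state                               # hidden switch
        obs = state if rng.random() < gamma else 1 - state  # noisy measurement
        observed.append(obs)
    return int(np.abs(np.diff(observed)).sum())             # observed switch count

observed_switches = 6   # summary statistic from the real observation
epsilon = 2             # tolerance on the switch count

accepted = []
for _ in range(20_000):
    theta = rng.uniform(0, 1)   # guess a transition probability from a flat prior
    if abs(simulate_switch_count(theta) - observed_switches) <= epsilon:
        accepted.append(theta)

print(np.mean(accepted))        # cluster of plausible transition probabilities
```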

00:17:03.940 --> 00:17:06.799
It really is. That means ABC isn't just a magic

00:17:06.799 --> 00:17:09.660
wand. It introduces entirely new ways for scientists

00:17:09.660 --> 00:17:11.799
to accidentally lie to themselves. The source

00:17:11.799 --> 00:17:14.140
outlines several major pitfalls that researchers

00:17:14.140 --> 00:17:16.740
have to navigate. The most dangerous is model

00:17:16.740 --> 00:17:19.450
comparison. How so? Well, ABC is just used to

00:17:19.450 --> 00:17:22.130
find a single parameter. It's often used to compute

00:17:22.130 --> 00:17:24.910
Bayes factors to rank entirely different scientific

00:17:24.910 --> 00:17:27.109
models against each other. A Bayes factor is

00:17:27.109 --> 00:17:29.190
essentially a ratio of how well two different

00:17:29.190 --> 00:17:31.849
models predict the same data, right? Exactly.

00:17:32.529 --> 00:17:35.750
But if you compute that ratio using insufficient

00:17:35.750 --> 00:17:38.670
summary statistics instead of the raw data, you

00:17:38.670 --> 00:17:41.450
are comparing how well the models predict the

00:17:41.450 --> 00:17:44.400
summary, not the raw reality. Oh, that's dangerous.

00:17:44.680 --> 00:17:47.140
The math shows that this can render your entire

00:17:47.140 --> 00:17:50.619
model ranking completely misleading. That is

00:17:50.619 --> 00:17:53.240
terrifying for a researcher. You could publish

00:17:53.240 --> 00:17:55.779
a paper declaring that Model A is the definitive

00:17:55.779 --> 00:17:58.839
explanation for a disease over Model B, but it's

00:17:58.839 --> 00:18:00.900
only because the arbitrary way you summarize

00:18:00.900 --> 00:18:03.920
the data accidentally favored Model A. It happens.
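
A rough sketch of how such a ratio is often approximated in ABC: draw parameters from each model's prior, count how often each model's simulations land within the tolerance of the observed summary, and take the ratio of acceptance rates (assuming equal prior model probabilities). The two toy simulators here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def abc_acceptance_rate(simulate_summary, observed_summary, epsilon, n_sims=20_000):
    # Fraction of prior simulations whose summary lands within the tolerance.
    hits = 0
    for _ in range(n_sims):
        theta = rng.uniform(0, 1)   # flat prior over each model's parameter
        if abs(simulate_summary(theta) - observed_summary) <= epsilon:
            hits += 1
    return hits / n_sims

# Two hypothetical competing models that produce the same kind of summary.
model_a = lambda theta: rng.binomial(20, theta)
model_b = lambda theta: rng.poisson(20 * theta)

observed = 6
rate_a = abc_acceptance_rate(model_a, observed, epsilon=1)
rate_b = abc_acceptance_rate(model_b, observed, epsilon=1)

# With equal prior model probabilities, this ratio approximates the Bayes factor,
# but it measures how well each model reproduces the chosen summary, not the raw data.
print(rate_a / rate_b)
```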

00:18:04.579 --> 00:18:06.799
Another massive pitfall is a return to the curse

00:18:06.799 --> 00:18:09.460
of dimensionality. but this time in the parameter

00:18:09.460 --> 00:18:11.880
space itself. Meaning having too many unknowns.

00:18:12.160 --> 00:18:15.859
Right. If your climate model has dozens of different

00:18:15.859 --> 00:18:18.279
unknown parameters you're trying to guess simultaneously,

00:18:18.799 --> 00:18:20.819
the combination of guesses becomes infinite.

00:18:21.180 --> 00:18:23.519
Your acceptance rates will plummet. Right. You

00:18:23.519 --> 00:18:25.380
reach a point where if a model is consistently

00:18:25.380 --> 00:18:27.359
rejected, you don't actually know if the model

00:18:27.359 --> 00:18:29.400
is fundamentally wrong or if you just haven't

00:18:29.400 --> 00:18:31.759
explored the massive parameter space enough to

00:18:31.759 --> 00:18:33.759
find the right combination of variables. It's

00:18:33.759 --> 00:18:35.990
like trying to find a needle in a haystack. But

00:18:35.990 --> 00:18:38.210
every time you add a new variable to your model,

00:18:38.490 --> 00:18:41.450
the haystack multiplies in size by 10. Eventually

00:18:41.450 --> 00:18:43.529
you just run out of computing power before you

00:18:43.529 --> 00:18:45.730
find the needle. Exactly. And then there is the

00:18:45.730 --> 00:18:48.910
issue of priors. Because this whole process relies

00:18:48.910 --> 00:18:51.490
so heavily on sampling from your initial beliefs,

00:18:52.089 --> 00:18:55.269
if you use highly subjective guesses to set the

00:18:55.269 --> 00:18:58.150
bounds of your parameters, you can ruin the inference

00:18:58.150 --> 00:19:00.490
right from the start. To protect against all

00:19:00.490 --> 00:19:03.529
of this, scientists have developed strict quality

00:19:03.529 --> 00:19:06.339
controls. They do not just run the code and hope

00:19:06.339 --> 00:19:09.619
for the best. One fascinating mechanistic remedy

00:19:09.619 --> 00:19:14.859
is called noisy ABC. Explain how noisy ABC actually

00:19:14.859 --> 00:19:17.700
works, because it sounds completely counterintuitive

00:19:17.700 --> 00:19:20.299
to add noise to an already messy process. It

00:19:20.299 --> 00:19:22.599
does sound backward. When you use a hard tolerance

00:19:22.599 --> 00:19:25.019
level, say, accepting everything under a distance

00:19:25.019 --> 00:19:27.200
of two and rejecting everything over it, you

00:19:27.200 --> 00:19:29.440
create a harsh mathematical cliff. OK, a sharp

00:19:29.440 --> 00:19:32.359
drop off. Right. And this harsh cut off introduces

00:19:32.359 --> 00:19:36.640
a specific kind of bias. Noisy ABC intentionally

00:19:36.640 --> 00:19:39.319
adds a calculated mathematical static to the

00:19:39.319 --> 00:19:42.619
summary statistics themselves. This static smooths

00:19:42.619 --> 00:19:45.339
out that harsh cliff, turning a hard cutoff into

00:19:45.339 --> 00:19:48.680
a gentle curve. Counter-intuitively, by adding

00:19:48.680 --> 00:19:51.440
this specific noise, it systematically compensates

00:19:51.440 --> 00:19:53.640
for the bias introduced by the tolerance level,

00:19:53.920 --> 00:19:57.240
leading to a more accurate final posterior. They

00:19:57.240 --> 00:19:59.960
also heavily rely on cross-validation. This

00:19:59.960 --> 00:20:02.059
is where they generate artificial data sets from

00:20:02.059 --> 00:20:04.140
their prior beliefs, where they actively know

00:20:04.140 --> 00:20:07.140
the secret true parameter values that generated

00:20:07.140 --> 00:20:09.880
the data. Yes, a control test. Then they run

00:20:09.880 --> 00:20:12.420
their own ABC algorithm blindly on that fake

00:20:12.420 --> 00:20:14.779
data to see if it can successfully recover those

00:20:14.779 --> 00:20:17.839
secret true values. If the algorithm can't find

00:20:17.839 --> 00:20:19.900
the truth in a perfectly controlled environment,

00:20:20.279 --> 00:20:22.039
you know you can't trust it out in the wild.
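
A minimal sketch of that control test: generate fake observations from known parameter values, run the same rejection scheme blindly, and check whether the known values are recovered. The simulator, prior, and tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_summary(theta):
    # Hypothetical simulator reduced to a single summary statistic (a sample mean).
    return rng.normal(theta, 1.0, size=50).mean()

def abc_estimate(observed_summary, epsilon=0.1, n_sims=20_000):
    accepted = [t for t in rng.uniform(-5, 5, size=n_sims)
                if abs(simulate_summary(t) - observed_summary) <= epsilon]
    return np.mean(accepted) if accepted else float("nan")

# Cross-validation: make fake data whose true parameters we secretly know,
# then check whether the blind ABC run recovers those values.
for true_theta in rng.uniform(-5, 5, size=5):
    fake_observed = simulate_summary(true_theta)
    print(f"true={true_theta:+.2f}  recovered={abc_estimate(fake_observed):+.2f}")
```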

00:20:22.299 --> 00:20:24.559
Exactly. And researchers are no longer writing

00:20:24.559 --> 00:20:26.559
all this code from scratch. There's been a huge

00:20:26.559 --> 00:20:29.359
push to standardize the implementation through

00:20:29.359 --> 00:20:31.980
advanced peer-reviewed software packages, things

00:20:31.980 --> 00:20:36.160
like PyABC, DIYABC, and ELFI. That must save so

00:20:36.160 --> 00:20:39.079
much time. It does. These platforms have built-in

00:20:39.079 --> 00:20:41.880
sanity checks and utilize what are known as

00:20:41.880 --> 00:20:44.359
sequential Monte Carlo methods. Sequential Monte

00:20:44.359 --> 00:20:46.920
Carlo is brilliant. Instead of just throwing

00:20:46.920 --> 00:20:49.220
millions of random darts at a board and hoping

00:20:49.220 --> 00:20:51.680
one hits the bullseye, which wastes massive amounts

00:20:51.680 --> 00:20:54.779
of computing power, these advanced programs work

00:20:54.779 --> 00:20:57.490
iteratively. Step by step. Right. They look at

00:20:57.490 --> 00:20:59.829
where the previous darts landed, adjust their

00:20:59.829 --> 00:21:02.769
aim based on the ones that got close, and slowly

00:21:02.769 --> 00:21:05.049
shrink the size of the overall dartboard as they

00:21:05.049 --> 00:21:08.390
go. It explores the parameter space far more

00:21:08.390 --> 00:21:10.690
efficiently than the basic rejection algorithm.
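
A deliberately simplified sketch of that iterative idea: keep a population of guesses, shrink the tolerance in stages, and propose new guesses by perturbing previous survivors. A full sequential Monte Carlo scheme, as in the packages mentioned above, also carries importance weights and adapts its perturbations; those details are omitted here, and the simulator and schedule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def simulate_summary(theta):
    # Hypothetical simulator reduced to a summary statistic.
    return rng.normal(theta, 1.0, size=50).mean()

observed = 2.5
tolerances = [2.0, 1.0, 0.5, 0.25, 0.1]       # the shrinking "dartboard"
population = rng.uniform(-10, 10, size=500)   # initial guesses from the prior

for eps in tolerances:
    new_population = []
    while len(new_population) < len(population):
        # Pick a previous near-miss and perturb it slightly (adjusting the aim).
        theta = rng.choice(population) + rng.normal(0, 0.5)
        if abs(simulate_summary(theta) - observed) <= eps:
            new_population.append(theta)
    population = np.array(new_population)

print(population.mean(), population.std())
```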

00:21:11.049 --> 00:21:13.910
It's incredibly elegant. Now, opponents of this

00:21:13.910 --> 00:21:16.809
entire framework will often complain that researchers

00:21:16.809 --> 00:21:20.069
are only testing a small number of subjectively

00:21:20.069 --> 00:21:23.789
chosen models, and that technically, all of those

00:21:23.789 --> 00:21:26.089
models are probably wrong in an absolute sense

00:21:26.089 --> 00:21:28.910
because they're approximations. But that critique

00:21:28.910 --> 00:21:31.289
seems to miss the point of why we use this in

00:21:31.289 --> 00:21:33.910
the first place. When you are dealing with incredibly

00:21:33.910 --> 00:21:37.309
complex chaotic systems, finding a mathematically

00:21:37.309 --> 00:21:40.289
perfect true explanation is a total fantasy.

00:21:40.430 --> 00:21:43.329
It doesn't exist. Right. Assessing the predictive

00:21:43.329 --> 00:21:46.490
ability of a statistical model as an approximation

00:21:46.490 --> 00:21:50.130
of a complex phenomenon is far more valuable

00:21:50.130 --> 00:21:54.029
than standard, rigid hypothesis testing. It is

00:21:54.029 --> 00:21:56.190
not about finding the perfect explanation. It

00:21:56.190 --> 00:21:59.309
is about finding the most incredibly useful explanation.

00:21:59.650 --> 00:22:01.789
If we connect this to the bigger picture, it

00:22:01.789 --> 00:22:05.210
is the recognition that in complex systems, approximation

00:22:05.210 --> 00:22:07.930
is not a failure of precision. It is literally

00:22:07.930 --> 00:22:10.970
the only valid way forward. And that really brings

00:22:10.970 --> 00:22:12.809
us to the core of what we've discovered today.

00:22:13.349 --> 00:22:16.490
Approximate Bayesian computation is so much more

00:22:16.490 --> 00:22:18.869
than just a computational hack for lazy computers.

00:22:19.329 --> 00:22:21.910
Absolutely. It represents a profound philosophical

00:22:21.910 --> 00:22:24.670
shift in how we understand the world. By letting

00:22:24.670 --> 00:22:26.930
go of the need for exact analytical math and

00:22:26.930 --> 00:22:29.690
by bravely embracing simulation, tolerance, and

00:22:29.690 --> 00:22:32.089
summary statistics, scientists are peeking into

00:22:32.089 --> 00:22:34.109
the mechanics of phenomena that would otherwise

00:22:34.109 --> 00:22:36.150
remain permanently locked away. It's like having

00:22:36.150 --> 00:22:38.809
a new set of eyes. It really is. Whether you

00:22:38.809 --> 00:22:41.259
are trying to map the ancient migration routes

00:22:41.259 --> 00:22:43.900
of human genetics, tracking the spread of a modern

00:22:43.900 --> 00:22:46.480
epidemic, or modeling the invisible dance of

00:22:46.480 --> 00:22:49.319
radio waves, this methodology proves a vital

00:22:49.319 --> 00:22:52.460
point. You do not need perfect information to

00:22:52.460 --> 00:22:54.829
find the truth. You just need a really smart

00:22:54.829 --> 00:22:57.690
way to measure how close your guesses are and

00:22:57.690 --> 00:23:00.049
the computing power to let the simulations run.

00:23:00.589 --> 00:23:03.529
Before we wrap up today's deep dive, I want to

00:23:03.529 --> 00:23:05.509
leave you with a final thought to mull over.

00:23:06.410 --> 00:23:08.690
We started by talking about a perfect mathematical

00:23:08.690 --> 00:23:11.990
toolkit and a lock that was too complex. We learned

00:23:11.990 --> 00:23:14.349
that to pick that lock, our most advanced tools

00:23:14.349 --> 00:23:16.910
for understanding reality require us to intentionally

00:23:16.910 --> 00:23:20.150
introduce tolerance. They fundamentally require

00:23:20.150 --> 00:23:22.609
us to accept insufficient summaries of the raw

00:23:22.440 --> 00:23:25.359
data, intentionally throwing away complexity

00:23:25.359 --> 00:23:27.700
just so our minds and our computers can process

00:23:27.700 --> 00:23:30.160
it. It makes you wonder at what point does our

00:23:30.160 --> 00:23:32.279
mathematical approximation of the real world

00:23:32.279 --> 00:23:34.359
become the reality we actually interact with?

00:23:34.920 --> 00:23:37.240
If the map is always an intentional approximation,

00:23:37.460 --> 00:23:39.240
at what point does it completely replace the

00:23:39.240 --> 00:23:39.599
territory?
