WEBVTT

00:00:00.000 --> 00:00:04.639
Imagine for a second that you were tasked with

00:00:04.639 --> 00:00:08.439
finding the absolute perfect settings for an

00:00:08.439 --> 00:00:11.359
incredibly complex system. Oh yeah, like standing

00:00:11.359 --> 00:00:13.599
in front of a massive control board. Exactly.

00:00:13.660 --> 00:00:15.339
You're tweaking dials, you're flipping switches,

00:00:15.400 --> 00:00:17.260
trying to squeeze out the best possible performance,

00:00:17.679 --> 00:00:21.379
but there is a massive paralyzing catch. Right,

00:00:21.399 --> 00:00:24.480
the cost. Right. Every single guess you make,

00:00:24.760 --> 00:00:26.559
every time you turn a dial just a fraction of

00:00:26.559 --> 00:00:28.679
a millimeter to see what happens, it costs an

00:00:28.679 --> 00:00:31.059
immense amount of time, or a staggering amount

00:00:31.059 --> 00:00:33.780
of money, or just a colossal drain on computing

00:00:33.780 --> 00:00:36.020
power. You basically can't just try everything.

00:00:36.479 --> 00:00:38.719
The stakes are way too high to just guess and

00:00:38.719 --> 00:00:41.619
check. So how do you find the absolute best answer

00:00:41.619 --> 00:00:45.539
with the fewest possible attempts? Well, welcome

00:00:45.539 --> 00:00:48.460
to today's deep dive. Today we are cracking open

00:00:48.460 --> 00:00:50.619
the mathematical engine that solves this exact

00:00:50.619 --> 00:00:52.979
problem. It's a fascinating one, too. It really

00:00:52.979 --> 00:00:55.399
is. It's a sequential design strategy called

00:00:55.399 --> 00:00:58.159
Bayesian optimization. And we are pulling from

00:00:58.159 --> 00:01:00.619
a massive comprehensive source document on the

00:01:00.619 --> 00:01:03.460
subject to really understand how it works. Our

00:01:03.460 --> 00:01:05.659
mission today is basically to figure out how

00:01:05.659 --> 00:01:08.420
machines and, you know, the scientists who program

00:01:08.420 --> 00:01:11.840
them, make these highly efficient optimal choices

00:01:11.840 --> 00:01:14.560
when they are flying completely blind. And the

00:01:14.560 --> 00:01:16.959
stakes for understanding this framework couldn't

00:01:16.959 --> 00:01:19.700
be higher for you or really anyone following

00:01:19.700 --> 00:01:22.060
modern technology. Absolutely. If you look at

00:01:22.060 --> 00:01:25.000
the visual backdrop behind me today, these shifting

00:01:25.000 --> 00:01:27.120
interconnected star charts that kind of dissolve

00:01:27.120 --> 00:01:30.159
into complex math equations, that really represents

00:01:30.159 --> 00:01:32.920
the unseen landscape we're navigating here. It's

00:01:32.920 --> 00:01:35.439
a great visual for it. Because with the 21st

00:01:35.439 --> 00:01:37.780
century explosion of artificial intelligence,

00:01:38.500 --> 00:01:41.579
this specific mathematical framework has quietly

00:01:41.579 --> 00:01:44.140
become the secret engine behind the scenes. It's

00:01:44.140 --> 00:01:46.530
everywhere. It is the invisible hand guiding

00:01:46.530 --> 00:01:48.549
everything from the training of massive neural

00:01:48.549 --> 00:01:52.569
networks to the design of entirely new physical

00:01:52.569 --> 00:01:55.349
materials in a laboratory. It's literally the

00:01:55.349 --> 00:01:57.209
science of making the most out of the unknown.

00:01:57.650 --> 00:01:59.670
Okay, let's unpack this. Before we get into how

00:01:59.670 --> 00:02:02.329
this technology evolved to practically run the

00:02:02.329 --> 00:02:05.250
modern AI world, we first need to understand

00:02:05.250 --> 00:02:07.250
the fundamental problem it was built to solve

00:02:07.250 --> 00:02:09.729
in the first place. Right, the core issue. Yeah,

00:02:09.770 --> 00:02:12.250
this concept of the black box function. The black

00:02:12.250 --> 00:02:15.620
box is the perfect starting point. So Bayesian

00:02:15.620 --> 00:02:18.280
optimization is deployed when you have an objective

00:02:18.280 --> 00:02:21.360
function that is continuous, but its internal

00:02:21.360 --> 00:02:24.159
structure is completely, completely hidden from

00:02:24.159 --> 00:02:26.500
you. Like you literally cannot see the gears

00:02:26.500 --> 00:02:29.599
turning inside. Exactly. And furthermore, it's

00:02:29.599 --> 00:02:32.580
incredibly difficult or computationally expensive

00:02:32.580 --> 00:02:35.939
to evaluate. All you can do is observe the output

00:02:35.939 --> 00:02:39.099
for a given input. So you feed the box a specific

00:02:39.099 --> 00:02:41.409
set of numbers, and then you just wait. You wait,

00:02:41.469 --> 00:02:44.110
sometimes for hours or even days, and you see

00:02:44.110 --> 00:02:45.969
what single number comes out the other side.
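NOTE
A minimal sketch of what a "black box" objective looks like in code, under the assumptions described here: you can call it, it is slow and costly, it returns a single score, and it exposes no gradients. The function name, the settings, and the hidden formula are illustrative stand-ins, not from the source.
```python
import math
import time

def black_box(settings):
    """Hypothetical expensive objective: feed in a configuration, wait,
    and get back one number. No gradients, no view of the internals."""
    time.sleep(1.0)  # stand-in for hours of simulation, training, or lab work
    x, y = settings  # e.g. two dials on the control board
    # The optimizer never sees this formula; it only sees the returned score.
    return math.exp(-((x - 0.3) ** 2) - (y - 0.7) ** 2) + 0.05 * math.sin(8 * x)

print(black_box((0.2, 0.5)))  # one costly evaluation, one number out
```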

00:02:46.370 --> 00:02:49.270
And importantly, you cannot evaluate its derivatives.

00:02:49.650 --> 00:02:51.789
Meaning, you don't get those handy mathematical

00:02:51.789 --> 00:02:53.750
shortcuts that tell you which direction is up

00:02:53.750 --> 00:02:55.650
or down the slope of the function. Right. You

00:02:55.650 --> 00:02:57.509
can't just follow the curve. Without derivatives,

00:02:57.530 --> 00:03:00.330
you have no local compass. And as a practical

00:03:00.330 --> 00:03:03.090
limitation to keep in mind, the math underlying

00:03:03.090 --> 00:03:05.370
this framework generally works best when you

00:03:05.370 --> 00:03:08.030
are dealing with 20 dimensions or fewer. Which,

00:03:08.250 --> 00:03:10.810
just to be clear, trying to optimize something

00:03:10.810 --> 00:03:13.349
across 20 different intersecting dimensions is

00:03:13.349 --> 00:03:16.129
still incredibly complex to visualize. Oh, absolutely.

00:03:16.270 --> 00:03:18.409
It's mind-bending. But I think I have a way

00:03:18.409 --> 00:03:21.550
for you, the listener, to picture what this black

00:03:21.550 --> 00:03:25.270
box really feels like. Imagine you are trying

00:03:25.270 --> 00:03:29.509
to find the absolute highest peak in a vast,

00:03:29.810 --> 00:03:32.780
rugged mountain range. OK, I like this. But you're

00:03:32.780 --> 00:03:34.800
completely blindfolded. You can't see the peaks

00:03:34.800 --> 00:03:36.780
in the distance. You can't see the valleys below

00:03:36.780 --> 00:03:39.680
you. The only way you can know your current altitude

00:03:39.680 --> 00:03:43.500
is by stopping, setting up a massive camp, and

00:03:43.500 --> 00:03:46.319
taking an incredibly expensive, time-consuming

00:03:46.319 --> 00:03:49.030
GPS reading. Right, because you can't just walk

00:03:49.030 --> 00:03:51.509
in a neat little grid taking readings every five

00:03:51.509 --> 00:03:53.629
feet. Exactly. You'd run out of expedition money

00:03:53.629 --> 00:03:56.210
and time almost immediately. You need a much

00:03:56.210 --> 00:03:58.469
smarter strategy to guess where the highest peak

00:03:58.469 --> 00:04:02.030
might be based on just the very few scattered

00:04:02.030 --> 00:04:04.590
readings you've already taken. That analogy captures

00:04:04.590 --> 00:04:07.710
the tension perfectly. You can't wander aimlessly,

00:04:07.710 --> 00:04:10.250
and you certainly can't afford to be exhaustive.

00:04:10.389 --> 00:04:12.849
So what's the move? This is where the Bayesian

00:04:12.849 --> 00:04:15.680
strategy steps in to save the day. Because the

00:04:15.680 --> 00:04:18.100
true landscape of that mountain range, the function

00:04:18.100 --> 00:04:20.839
itself is unknown. We treat it mathematically

00:04:20.839 --> 00:04:24.000
as a random function. And what we do first is

00:04:24.000 --> 00:04:27.600
place a prior over it. A prior? So mapping that

00:04:27.600 --> 00:04:29.939
back to the mountain analogy, that would be our

00:04:29.939 --> 00:04:32.060
initial pre-existing belief about what the mountain

00:04:32.060 --> 00:04:34.019
range probably looks like before we even take

00:04:34.019 --> 00:04:36.279
a single step. That is the essence of it, yeah.

00:04:36.360 --> 00:04:38.660
Like assuming mountains generally slope upward

00:04:38.660 --> 00:04:41.199
or that peaks might cluster together. Right.

00:04:41.660 --> 00:04:44.540
The prior captures our initial mathematical beliefs

00:04:44.540 --> 00:04:46.680
about the behavior of the function before we've

00:04:46.680 --> 00:04:48.660
taken any of those costly measurements. Then

00:04:48.660 --> 00:04:50.379
you take a reading. You get some actual data.

00:04:50.560 --> 00:04:53.220
Yes. You gather your first piece of real data

00:04:53.220 --> 00:04:56.220
in math terms, a function evaluation. And based

00:04:56.220 --> 00:04:59.500
on that concrete new data you update your prior

00:04:59.500 --> 00:05:02.079
to form what the framework calls a posterior

00:05:02.079 --> 00:05:05.259
distribution. So the posterior is basically our

00:05:05.259 --> 00:05:08.579
updated slightly less blind map of the landscape.

00:05:08.839 --> 00:05:11.160
Exactly. It takes our assumptions and bends them

00:05:11.160 --> 00:05:13.480
around the cold hard facts of the GPS reading

00:05:13.480 --> 00:05:16.339
we just took. It represents our newly updated

00:05:16.339 --> 00:05:18.839
understanding of the world. Every single time

00:05:18.839 --> 00:05:21.360
you take a costly measurement, your posterior

00:05:21.360 --> 00:05:24.439
distribution shifts, warps, and refines itself.
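NOTE
A minimal sketch of that prior-to-posterior update, assuming scikit-learn is available. The Matern kernel, the five "GPS readings," and the grid are illustrative choices, not details from the source.
```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

X_observed = np.array([[0.10], [0.35], [0.50], [0.80], [0.95]])  # where we measured
y_observed = np.array([1.2, 2.9, 2.1, 3.4, 1.7])                 # costly altitude readings
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_observed, y_observed)          # prior + data -> posterior over the landscape
X_grid = np.linspace(0, 1, 200).reshape(-1, 1)
mean, std = gp.predict(X_grid, return_std=True)  # updated belief and its uncertainty
print(X_grid[np.argmax(mean)][0], std.max())     # best guess so far; widest gap in the map
```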

00:05:24.560 --> 00:05:26.980
It's adapting. Yes. You aren't just looking for

00:05:26.980 --> 00:05:30.000
the peak. You are mathematically mapping out

00:05:30.000 --> 00:05:33.300
the uncertainty itself. You're zeroing in on

00:05:33.300 --> 00:05:36.079
where the highest peak is most likely to be hiding

00:05:36.079 --> 00:05:38.180
in the dark. Now that we understand the core

00:05:38.180 --> 00:05:41.699
goal, this idea of intelligently mapping an unseen

00:05:41.699 --> 00:05:44.579
landscape while blindfolded, I really want to

00:05:44.579 --> 00:05:46.699
look at how mathematicians actually crack the

00:05:46.699 --> 00:05:49.319
code on this. The history is pretty fascinating.

00:05:49.459 --> 00:05:51.079
Yeah, because we have this great theoretical

00:05:51.079 --> 00:05:54.180
concept. But in the early 1960s, an American

00:05:54.180 --> 00:05:56.600
applied mathematician named Harold J. Kushner

00:05:56.600 --> 00:05:59.920
started laying the actual groundwork. Kushner's

00:05:59.920 --> 00:06:03.660
work in 1964 was a massive leap forward. He published

00:06:03.660 --> 00:06:06.300
a paper proposing a new method for locating the

00:06:06.300 --> 00:06:09.060
maximum point of an arbitrary multi -peak curve

00:06:09.060 --> 00:06:12.079
in a noisy environment. Imagine a roller coaster

00:06:12.079 --> 00:06:14.899
track, hidden in dense fog, where your altimeter

00:06:14.899 --> 00:06:16.800
occasionally gives you slightly wrong readings.

00:06:17.000 --> 00:06:19.420
Exactly, and he was trying to find the highest

00:06:19.420 --> 00:06:22.259
point on that track. Now, he wasn't explicitly

00:06:22.259 --> 00:06:24.939
calling it Bayesian optimization just yet, but

00:06:24.939 --> 00:06:27.519
he provided the crucial theoretical foundation

00:06:27.519 --> 00:06:30.199
for making decisions under that kind of mathematical

00:06:30.199 --> 00:06:32.800
fog. But the framework really takes its modern

00:06:32.800 --> 00:06:37.319
shape a bit later, in 1978, thanks to a Lithuanian

00:06:37.319 --> 00:06:40.399
scientist named Jonas Mockus. He published a

00:06:40.399 --> 00:06:42.720
paper discussing how to use Bayesian methods

00:06:42.720 --> 00:06:45.699
specifically to seek extreme values under uncertainty.

00:06:45.959 --> 00:06:49.100
And he proposed a mechanism called the expected

00:06:49.100 --> 00:06:51.819
improvement principle. Yes, the EI principle.

00:06:51.980 --> 00:06:53.839
This seems like the moment the light bulb really

00:06:53.839 --> 00:06:56.259
went on for the field. What's fascinating here

00:06:56.259 --> 00:06:59.220
is that Mockus's expected improvement principle

00:06:59.220 --> 00:07:02.180
isn't just a historical footnote, it is literally

00:07:02.180 --> 00:07:04.839
still one of the core sampling strategies driving

00:07:04.839 --> 00:07:08.160
algorithms today. Really? Still today? Oh yeah.

00:07:08.300 --> 00:07:10.480
Before Mockus, you might just look for any point

00:07:10.480 --> 00:07:12.480
that might be higher than your current spot.

00:07:13.180 --> 00:07:15.959
Expected improvement actually calculates a magnitude

00:07:15.959 --> 00:07:18.300
of the potential gain. So it doesn't just ask,

00:07:18.480 --> 00:07:21.089
will this next step be higher? Right. It asks,

00:07:21.490 --> 00:07:24.290
how much higher could it possibly be, multiplied

00:07:24.290 --> 00:07:26.310
by the probability that it actually is higher?
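NOTE
One common closed form of expected improvement under a Gaussian posterior; a sketch of the principle just described, not a formula quoted from the source. Here f* is the best value observed so far, mu and sigma are the posterior mean and standard deviation at x, and Phi and phi are the standard normal CDF and PDF (assuming sigma(x) > 0, maximization).
```latex
\mathrm{EI}(x) = \mathbb{E}\left[\max\big(f(x) - f^{*},\, 0\big)\right]
             = \big(\mu(x) - f^{*}\big)\,\Phi(z) + \sigma(x)\,\varphi(z),
\qquad z = \frac{\mu(x) - f^{*}}{\sigma(x)}
```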

00:07:26.889 --> 00:07:29.889
It fundamentally cracked the code on how to balance

00:07:29.889 --> 00:07:33.410
the risk of exploring new areas versus optimizing

00:07:33.410 --> 00:07:35.829
the function efficiently based on what you already

00:07:35.829 --> 00:07:38.430
know. And then the timeline jumps again to 1998.

00:07:39.089 --> 00:07:41.509
Donald R. Jones and his colleagues took Mockus's

00:07:41.509 --> 00:07:44.009
work and brought in the Gaussian

00:07:44.009 --> 00:07:47.680
process as a surrogate model, elaborating heavily on that expected

00:07:47.680 --> 00:07:49.560
improvement principle. They really brought it

00:07:49.560 --> 00:07:51.379
into a new era. They started bringing this out

00:07:51.379 --> 00:07:53.819
of pure theoretical math and into computer science

00:07:53.819 --> 00:07:55.459
and engineering. But wait, I have a question

00:07:55.459 --> 00:07:57.579
about this. Sure, what is it? If they are bringing

00:07:57.579 --> 00:08:01.120
this highly complex, constantly updating topographical

00:08:01.120 --> 00:08:04.420
map into computer science in the late 90s, wouldn't

00:08:04.420 --> 00:08:07.040
the computers of 1998 basically catch on fire

00:08:07.040 --> 00:08:09.639
trying to process this? Oh yeah. I mean, the

00:08:09.639 --> 00:08:11.959
computing power back then must have been a massive

00:08:11.959 --> 00:08:14.279
bottleneck for something this math heavy. You're

00:08:14.279 --> 00:08:17.000
totally right. The computational complexity of

00:08:17.000 --> 00:08:19.860
Bayesian optimization was absolutely a wall they

00:08:19.860 --> 00:08:22.540
slammed into. The processing power available

00:08:22.540 --> 00:08:25.139
in 1998 just couldn't handle the sheer weight

00:08:25.139 --> 00:08:27.980
of the math for anything beyond relatively simple

00:08:27.980 --> 00:08:30.180
problems. It was an incredibly elegant theory

00:08:30.180 --> 00:08:33.840
locked inside sluggish, incapable hardware. It

00:08:33.840 --> 00:08:36.440
was a sports car stuck in endless traffic. It

00:08:36.440 --> 00:08:38.299
just had to wait for the roads to clear. And

00:08:38.299 --> 00:08:40.379
those roads clear dramatically in the 21st century.

00:08:40.480 --> 00:08:42.960
They really did. With the sudden massive rise

00:08:42.960 --> 00:08:44.740
of artificial intelligence, deep learning,

00:08:44.919 --> 00:08:47.879
and bionic robots, Bayesian optimization finally

00:08:47.879 --> 00:08:50.399
had the horsepower it required. Suddenly we had

00:08:50.399 --> 00:08:53.179
the massive parallel compute power necessary

00:08:53.179 --> 00:08:56.419
to run these complex posterior updates in real

00:08:56.419 --> 00:08:59.600
time. Exactly. And it rapidly became a vital

00:08:59.600 --> 00:09:01.919
foundational tool for what the industry calls

00:09:01.919 --> 00:09:05.399
hyperparameter tuning. Major tech players like

00:09:05.399 --> 00:09:09.279
Google, Meta, OpenAI began actively embedding

00:09:09.279 --> 00:09:11.200
Bayesian optimization into their deep learning

00:09:11.200 --> 00:09:13.159
frameworks. Just to drastically improve their

00:09:13.159 --> 00:09:15.179
search efficiency. Right. Because training a

00:09:15.179 --> 00:09:18.039
massive AI model is the ultimate blindfolded

00:09:18.039 --> 00:09:20.799
walk in the mountains. Every single time they

00:09:20.799 --> 00:09:22.820
test a new configuration for a neural network,

00:09:23.440 --> 00:09:25.379
it takes huge amounts of server time. Thousands

00:09:25.379 --> 00:09:28.139
of dollars in electricity. And days of waiting.

00:09:28.539 --> 00:09:31.279
They desperately needed Mockus and Kushner's

00:09:31.279 --> 00:09:34.240
math to find the best AI settings without going

00:09:34.240 --> 00:09:36.620
bankrupt running endless trial and error tests.
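NOTE
A minimal sketch of that kind of hyperparameter search with one off-the-shelf Bayesian optimizer, assuming the scikit-optimize package is available. The objective below is a cheap stand-in for a real, expensive training run, and the parameter names and ranges are illustrative assumptions.
```python
from skopt import gp_minimize
from skopt.space import Integer, Real

def validation_loss(params):                # the expensive black box: one full training run
    learning_rate, num_layers = params
    # Stand-in for training a network and measuring its validation loss.
    return 1e4 * (learning_rate - 0.01) ** 2 + 0.02 * abs(num_layers - 4)

result = gp_minimize(
    validation_loss,
    dimensions=[Real(1e-4, 1e-1, prior="log-uniform"),  # learning rate
                Integer(1, 8)],                         # number of layers
    n_calls=25,        # only 25 costly evaluations allowed
    random_state=0,
)
print(result.x, result.fun)                 # best settings found and their loss
```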

00:09:37.659 --> 00:09:40.299
It's crazy to think about, but the theoretical

00:09:40.299 --> 00:09:43.519
mathematicians of the 1960s and 70s built the

00:09:43.519 --> 00:09:46.399
exact perfect engine that the AI pioneers of

00:09:46.399 --> 00:09:49.000
the 2000s didn't even know they needed yet. It's

00:09:49.000 --> 00:09:51.200
a beautiful full circle moment in computer science.

00:09:51.580 --> 00:09:53.860
So with companies like OpenAI utilizing this

00:09:53.860 --> 00:09:56.639
today to build the tools we use, we need to look

00:09:56.639 --> 00:09:58.500
under the hood of how it actually operates in

00:09:58.500 --> 00:10:00.600
practice. Let's do it. We know we have this updated

00:10:00.600 --> 00:10:02.879
posterior distribution, our actively updating map

00:10:02.879 --> 00:10:05.379
of the mountain, but how does the system actually

00:10:05.379 --> 00:10:08.240
use that map to decide where to place the very

00:10:08.240 --> 00:10:10.539
next guess? That's the million dollar question.

00:10:10.940 --> 00:10:13.799
Right. If I'm on the mountain, how does the system

00:10:13.799 --> 00:10:15.679
tell me whether I should take a helicopter to

00:10:15.679 --> 00:10:17.840
a completely uncharted dark side of the mountain

00:10:17.840 --> 00:10:21.759
range, which is exploration, or just set up camp

00:10:21.759 --> 00:10:24.000
right next to the highest peak I've already found

00:10:24.000 --> 00:10:27.039
and check a few feet to the left, which is exploitation?

00:10:27.559 --> 00:10:29.919
This raises an important question, and it is

00:10:29.919 --> 00:10:32.360
the exact tension at the heart of the entire

00:10:32.360 --> 00:10:35.799
system. Exploration versus exploitation. Okay.

00:10:35.950 --> 00:10:38.789
The mechanism that makes this highly critical

00:10:38.789 --> 00:10:41.889
decision is called an acquisition function. You'll

00:10:41.889 --> 00:10:44.250
sometimes see it referred to as an infill sampling

00:10:44.250 --> 00:10:46.710
criterion in the literature. Acquisition function,

00:10:46.850 --> 00:10:49.870
got it. These are strict mathematical rules that

00:10:49.870 --> 00:10:52.350
use your current updated map to determine the

00:10:52.350 --> 00:10:54.830
absolute best coordinates for your very next

00:10:54.830 --> 00:10:57.429
blindfolded jump. It's the decision engine. It

00:10:57.429 --> 00:10:59.789
looks at the map and points the finger. And the

00:10:59.789 --> 00:11:01.710
nuance here is brilliant because there isn't

00:11:01.710 --> 00:11:04.210
just one single way to point that finger. You

00:11:04.210 --> 00:11:06.389
choose different acquisition functions depending

00:11:06.389 --> 00:11:09.110
on your specific appetite for risk. Like what?

00:11:09.309 --> 00:11:11.350
What are our options? Yeah. Well, for instance,

00:11:11.350 --> 00:11:14.169
you have the probability of improvement. This

00:11:14.169 --> 00:11:16.509
function just looks for the likeliest step up,

00:11:16.509 --> 00:11:18.909
no matter how small. It's highly exploitative.

00:11:19.450 --> 00:11:21.610
Then you have Mockus's expected improvement,

00:11:21.789 --> 00:11:24.230
which we discussed. Right. Trying to balance

00:11:24.230 --> 00:11:26.549
the actual size of the potential gain against

00:11:26.549 --> 00:11:29.350
the risk. Exactly. Finding the biggest possible

00:11:29.350 --> 00:11:32.629
leap, not just a tiny inch upward. Oh, and you

00:11:32.629 --> 00:11:35.669
also have Bayesian expected losses. Okay. Wow.

00:11:35.789 --> 00:11:37.730
Yeah, and then you have something like the upper

00:11:37.730 --> 00:11:40.629
confidence bound, or UCB. This one is fascinating

00:11:40.629 --> 00:11:43.429
because it specifically rewards ignorance. Wait,

00:11:43.429 --> 00:11:46.070
rewards ignorance? How does a math equation reward

00:11:46.070 --> 00:11:49.169
ignorance? It's clever. UCB looks at the map and

00:11:49.169 --> 00:11:51.250
intentionally calculates the uncertainty of a

00:11:51.250 --> 00:11:53.600
region. It essentially says: We have absolutely

00:11:53.600 --> 00:11:55.899
no idea what is over in that dark corner of the

00:11:55.899 --> 00:11:58.559
map. Because our uncertainty is so high, the

00:11:58.559 --> 00:12:01.860
potential for a massive undiscovered peak existing

00:12:01.860 --> 00:12:04.740
there is technically also high. So it forces

00:12:04.740 --> 00:12:07.860
you to go look. It aggressively forces exploration

00:12:07.860 --> 00:12:10.440
so you don't get stuck on a tiny foothill thinking

00:12:10.440 --> 00:12:12.259
it's Mount Everest just because you haven't looked

00:12:12.259 --> 00:12:14.500
anywhere else. And there are others too, like

00:12:14.500 --> 00:12:17.179
Thompson sampling, which uses randomized sampling

00:12:17.179 --> 00:12:19.419
from the posterior to naturally balance that

00:12:19.419 --> 00:12:22.000
exploration-exploitation trade-off. And how

00:12:22.000 --> 00:12:24.580
do they actually maximize these functions? Is

00:12:24.580 --> 00:12:27.139
there a specific technique? Yeah, they are typically

00:12:27.139 --> 00:12:29.740
maximized using numerical optimization techniques.
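NOTE
A rough sketch of two of the acquisition rules mentioned above, probability of improvement and the upper confidence bound, computed from a posterior mean and standard deviation, plus one way to maximize an acquisition function numerically (L-BFGS-B via SciPy). The gp object is assumed to be a fitted surrogate with a predict(X, return_std=True) method, as in the earlier sketch; the kappa and xi values are illustrative.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def probability_of_improvement(mu, sigma, best, xi=0.01):
    # Chance of beating the best value seen so far (highly exploitative).
    return norm.cdf((mu - best - xi) / sigma)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    # Optimism in the face of ignorance: large sigma (uncertainty) raises the score.
    return mu + kappa * sigma

def next_point(gp, bounds=(0.0, 1.0)):
    # Pick the next blindfolded jump by maximizing UCB (minimizing its negative).
    def neg_acquisition(x):
        mu, sigma = gp.predict(np.atleast_2d(x), return_std=True)
        return -upper_confidence_bound(mu[0], sigma[0])
    res = minimize(neg_acquisition, x0=np.array([0.5]), bounds=[bounds], method="L-BFGS-B")
    return res.x
```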

00:12:30.059 --> 00:12:32.940
Things like Newton's method or the BFGS algorithm

00:12:32.940 --> 00:12:35.919
are very common here. Wait, if we keep taking

00:12:35.919 --> 00:12:38.720
GPS readings on this mountain and we are using

00:12:38.720 --> 00:12:40.840
this proxy model we mentioned earlier, the Gaussian

00:12:40.840 --> 00:12:45.179
process, to constantly redraw this smooth, perfect

00:12:45.179 --> 00:12:49.460
3D topographical map, won't our map eventually

00:12:49.460 --> 00:12:51.940
become a problem? What do you mean? Won't it

00:12:51.940 --> 00:12:54.379
become so insanely detailed and massive that

00:12:54.379 --> 00:12:56.500
it takes longer to read the map than to just

00:12:56.500 --> 00:12:58.360
climb the mountain? Like if you feed a Gaussian

00:12:58.360 --> 00:13:00.779
process a million data points, doesn't it just

00:13:00.779 --> 00:13:03.000
completely buckle under its own weight? That

00:13:03.000 --> 00:13:05.820
is a fantastic point, and yes, that is the exact

00:13:05.820 --> 00:13:08.340
limitation of the standard approach. I knew it.

00:13:08.460 --> 00:13:11.580
When there is a massive amount of data, the mathematical

00:13:11.580 --> 00:13:14.080
training of a Gaussian process becomes brutally

00:13:14.080 --> 00:13:17.580
slow. The computational cost skyrockets because

00:13:17.580 --> 00:13:20.480
it is trying to maintain a continuous, perfectly

00:13:20.480 --> 00:13:23.700
smoothed mathematical landscape connecting every

00:13:23.700 --> 00:13:26.120
single point to every other point. Which makes

00:13:26.120 --> 00:13:28.779
it really difficult for standard Bayesian optimization

00:13:28.779 --> 00:13:32.129
to work well in highly complex fields, right?

00:13:32.149 --> 00:13:35.289
Oh, yeah, like drug development or sprawling

00:13:35.289 --> 00:13:37.110
medical experiments where the data sets aren't

00:13:37.110 --> 00:13:39.450
just large, they are gargantuan. So what do they

00:13:39.450 --> 00:13:42.149
do if the Gaussian map gets too heavy to unfold?

00:13:42.570 --> 00:13:44.889
Do they just abandon the Bayesian approach entirely?

00:13:44.990 --> 00:13:47.230
Not at all. They just swap out the proxy model

00:13:47.230 --> 00:13:49.490
for something lighter. The sources highlight

00:13:49.490 --> 00:13:52.710
a really clever, computationally cheaper alternative

00:13:52.710 --> 00:13:55.990
called the tree-structured Parzen estimator, or TPE. How

00:13:55.990 --> 00:13:58.190
does the Parzen estimator bypass that data overload?

00:13:58.289 --> 00:14:00.419
How does it draw the map differently? Instead

00:14:00.419 --> 00:14:02.860
of trying to maintain that perfect continuous

00:14:02.860 --> 00:14:05.620
topographical landscape, the Parzen estimator

00:14:05.620 --> 00:14:08.419
essentially stops trying to draw the mountain

00:14:08.419 --> 00:14:11.080
and instead just sorts your data into buckets.

00:14:11.600 --> 00:14:14.879
Yeah, it constructs two separate distributions.

00:14:15.019 --> 00:14:17.700
It looks at all your past GPS readings and says,

00:14:18.139 --> 00:14:21.100
let's put the top 10% of our highest readings

00:14:21.100 --> 00:14:23.919
into bucket A, the high performers, and let's

00:14:23.919 --> 00:14:27.259
throw the remaining 90% into bucket B, the low

00:14:27.259 --> 00:14:29.440
performers. So it's just splitting the data into

00:14:29.440 --> 00:14:31.539
good and bad rather than mapping every single

00:14:31.539 --> 00:14:34.419
inch. Right. And by simply comparing the mathematical

00:14:34.419 --> 00:14:36.799
likelihood of those two distinct buckets, it

00:14:36.799 --> 00:14:39.539
can highly efficiently find the location that

00:14:39.539 --> 00:14:42.080
maximizes expected improvement. Oh, that's smart.
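NOTE
A rough sketch of the two-bucket idea just described, using kernel density estimates as the two distributions. This is a simplification of a real tree-structured Parzen estimator; the 10% split, the candidate grid, and the noisy stand-in readings are illustrative assumptions.
```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(x_tried, y_tried, candidates, top_fraction=0.10):
    """Split past readings into a 'good' bucket and a 'bad' bucket, then pick the
    candidate that looks most like the good bucket and least like the bad one."""
    cutoff = np.quantile(y_tried, 1.0 - top_fraction)     # threshold for the top 10%
    dens_good = gaussian_kde(x_tried[y_tried >= cutoff])  # bucket A: high performers
    dens_bad = gaussian_kde(x_tried[y_tried < cutoff])    # bucket B: everything else
    ratio = dens_good(candidates) / dens_bad(candidates)  # proportional to expected improvement
    return candidates[np.argmax(ratio)]

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)                                # 60 past "GPS readings"
y = -(x - 0.7) ** 2 + rng.normal(0, 0.01, 60)            # noisy scores, peak near 0.7
print(tpe_suggest(x, y, candidates=np.linspace(0, 1, 201)))  # next spot to try
```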

00:14:42.200 --> 00:14:45.039
It asks where on the map is bucket A most likely

00:14:45.039 --> 00:14:48.139
to occur compared to bucket B. It finds the next

00:14:48.139 --> 00:14:50.240
best spot without getting bogged down by the

00:14:50.240 --> 00:14:52.879
sheer crushing weight of calculating a smooth

00:14:52.879 --> 00:14:55.700
curve through all the data combined. It's a brilliant

00:14:55.700 --> 00:14:57.860
shortcut. Here's where it gets really interesting.

00:14:58.179 --> 00:15:00.279
We have these pristine mathematical models. We

00:15:00.279 --> 00:15:03.360
have Gaussian processes laying down smooth fabric.

00:15:03.740 --> 00:15:06.320
We have Parzen tree estimators sorting data into

00:15:06.320 --> 00:15:09.659
neat buckets. But the real world is rarely pristine.

00:15:09.820 --> 00:15:13.039
Never pristine. Exactly. What happens when the

00:15:13.039 --> 00:15:15.299
GPS reader breaks or the wind blows the climber

00:15:15.299 --> 00:15:18.029
off the mountain? How does Bayesian optimization

00:15:18.029 --> 00:15:21.419
survive outside the vacuum of pure theory? That

00:15:21.419 --> 00:15:24.039
brings us to a subfield known as exotic Bayesian

00:15:24.039 --> 00:15:27.139
optimization. Exotic Bayesian optimization. It

00:15:27.139 --> 00:15:28.799
sounds like a math problem wearing a Hawaiian

00:15:28.799 --> 00:15:31.539
shirt. It's a great term because standard optimization

00:15:31.539 --> 00:15:34.240
assumes that every single point you want to test

00:15:34.240 --> 00:15:37.480
is relatively easy to evaluate and that the answers

00:15:37.480 --> 00:15:40.600
you get back are clean, static, and reliable.

00:15:40.799 --> 00:15:43.320
But the real world isn't like that. Great. Exotic

00:15:43.320 --> 00:15:45.960
problems occur when the real world injects chaos

00:15:45.960 --> 00:15:48.259
into the system. Chaos like what? Give me an

00:15:48.259 --> 00:15:51.000
example. Imagine there is inherent unavoidable

00:15:51.000 --> 00:15:53.860
noise in the data you are collecting. Your sensors

00:15:53.860 --> 00:15:55.820
are faulty. Or maybe you are running parallel

00:15:55.820 --> 00:15:59.000
evaluations, like sending out five blindfolded

00:15:59.000 --> 00:16:01.240
climbers at the exact same time, and their results

00:16:01.240 --> 00:16:04.159
are arriving back to base camp entirely out of

00:16:04.159 --> 00:16:06.539
order. Oh, that would be a nightmare. Or what

00:16:06.539 --> 00:16:09.139
if the quality of your evaluations relies on

00:16:09.139 --> 00:16:11.440
a strict trade -off between the difficulty of

00:16:11.440 --> 00:16:14.240
the test and the accuracy of the result? Sometimes

00:16:14.240 --> 00:16:17.019
there are random shifting environmental conditions

00:16:17.019 --> 00:16:20.070
throwing off the baseline. Or what if the evaluation

00:16:20.070 --> 00:16:22.809
suddenly requires calculating partial derivatives

00:16:22.809 --> 00:16:25.710
on the fly? Exactly. When the clean assumptions

00:16:25.710 --> 00:16:28.110
of the standard model completely break down,

00:16:28.429 --> 00:16:31.070
the optimization becomes exotic. And despite

00:16:31.070 --> 00:16:34.129
all that real-world messiness, the applications

00:16:34.129 --> 00:16:36.389
for this framework are just everywhere. Once

00:16:36.389 --> 00:16:38.470
you know what it is, you see it hiding in every

00:16:38.470 --> 00:16:40.830
industry. It's not just a theoretical exercise.

00:16:41.029 --> 00:16:43.750
It is a testament to how versatile the core philosophy

00:16:43.750 --> 00:16:45.470
really is. I mean, we're talking about learning

00:16:45.470 --> 00:16:48.360
to rank algorithms for search engines. Yes,

00:16:48.659 --> 00:16:51.039
and visual design by crowds. Which is fascinating,

00:16:51.139 --> 00:16:53.779
using this math to optimize aesthetics based

00:16:53.779 --> 00:16:56.480
on messy subjective human feedback. And then

00:16:56.480 --> 00:16:58.639
there's sensor networks, reinforcement learning.

00:16:58.899 --> 00:17:01.500
Even experimental particle physics uses this

00:17:01.500 --> 00:17:04.420
to design the actual physical geometry of particle

00:17:04.420 --> 00:17:07.000
detectors. It's in chemistry and material design.

00:17:07.839 --> 00:17:10.660
There's a specific mention in the sources of

00:17:10.660 --> 00:17:13.940
using Bayesian optimization to develop

00:17:13.940 --> 00:17:16.259
ultra-high specific-strength carbon nanolattices.

00:17:16.519 --> 00:17:19.460
Literally finding the perfect microscopic structure

00:17:19.460 --> 00:17:22.339
for new super materials. But if we connect this

00:17:22.339 --> 00:17:24.859
to the bigger picture, one of the most compelling

00:17:24.859 --> 00:17:27.759
and visual applications is in modern robotics.

00:17:28.039 --> 00:17:30.680
Oh, specifically, automatic gait optimization.

00:17:30.940 --> 00:17:32.940
Getting a robot to walk without falling over.

00:17:33.140 --> 00:17:35.799
Exactly. Helping a bionic robot learn to walk

00:17:35.799 --> 00:17:39.299
under extreme uncertainty. The robot is the blindfolded

00:17:39.299 --> 00:17:41.839
climber. It doesn't know the exact friction of

00:17:41.839 --> 00:17:44.180
the laboratory floor. It doesn't know the exact

00:17:44.180 --> 00:17:46.519
weight distribution of its own limbs while they

00:17:46.519 --> 00:17:50.279
are in motion. It has to try a step, observe

00:17:50.279 --> 00:17:52.079
the result, which is almost always violently

00:17:52.079 --> 00:17:55.200
falling over, and then update its posterior distribution

00:17:55.200 --> 00:17:57.579
with that failure. And then use an acquisition

00:17:57.579 --> 00:17:59.940
function to decide exactly how much power to

00:17:59.940 --> 00:18:01.799
send to its knee and ankle joints for the very

00:18:01.799 --> 00:18:04.099
next step. It is literally learning to walk through

00:18:04.099 --> 00:18:06.980
Bayesian inference. Every single fall makes the

00:18:06.980 --> 00:18:09.619
map clearer. That is incredible. But there was

00:18:09.619 --> 00:18:12.019
one specific deep dive example in the material

00:18:12.019 --> 00:18:14.529
that I found highly relatable, because it's a

00:18:14.529 --> 00:18:16.490
technology you and I use almost every single

00:18:16.490 --> 00:18:19.589
day. Facial recognition. Yes. The optimization

00:18:19.589 --> 00:18:22.930
of the histogram of oriented gradients, or the

00:18:22.930 --> 00:18:26.670
HOG algorithm. Right. For those unfamiliar, HOG

00:18:26.670 --> 00:18:29.029
is a really popular feature extraction method

00:18:29.029 --> 00:18:31.710
for computer vision. It essentially looks at

00:18:31.710 --> 00:18:34.150
an image and tries to find the edges and outlines

00:18:34.150 --> 00:18:36.990
of an object like a human face by analyzing the

00:18:36.990 --> 00:18:39.690
gradients of light to dark pixels. But its accuracy

00:18:39.690 --> 00:18:41.769
relies heavily on how its internal parameters

00:18:41.769 --> 00:18:44.250
are set. And optimizing those parameters manually

00:18:44.250 --> 00:18:47.150
is notoriously frustrating. Yeah. Like, how many

00:18:47.150 --> 00:18:49.309
pixels should it group together? What's the threshold

00:18:49.309 --> 00:18:51.950
for an edge? It's a classic black box problem.

00:18:52.069 --> 00:18:54.390
You tweak a setting, run 1,000 images through

00:18:54.390 --> 00:18:56.369
it, and see if it recognized the faces better

00:18:56.369 --> 00:18:59.029
or worse. It would take forever. It does. But

00:18:59.029 --> 00:19:01.369
scientists proposed a novel approach using that

00:19:01.369 --> 00:19:03.829
lighter, less expensive proxy model we talked

00:19:03.829 --> 00:19:06.710
about earlier. The tree-structured Parzen estimator,

00:19:07.029 --> 00:19:10.759
the TPE. Yes, the bucket method. They applied

00:19:10.759 --> 00:19:13.319
the TPE bucket method to the facial recognition

00:19:13.319 --> 00:19:16.599
software. By using that specific Bayesian technique,

00:19:16.720 --> 00:19:19.619
they were able to simultaneously optimize both

00:19:19.619 --> 00:19:23.079
the complex internal parameters of the HOG algorithm

00:19:23.079 --> 00:19:25.740
and the size of the images being processed. That's

00:19:25.740 --> 00:19:28.960
huge. Because TPE could handle the complex overlapping

00:19:28.960 --> 00:19:31.339
variables without buckling under the computational

00:19:31.339 --> 00:19:34.240
weight, it drastically pushed the accuracy of

00:19:34.240 --> 00:19:36.400
the facial recognition forward. And the source

00:19:36.400 --> 00:19:38.859
notes that this optimized approach has massive

00:19:38.859 --> 00:19:40.980
potential to be adapted for all kinds of other

00:19:40.980 --> 00:19:42.920
computer vision applications, right? Exactly.
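NOTE
An illustrative sketch of TPE-style tuning with the hyperopt library, assuming it is installed. The HOG-style parameter names, their ranges, and the stand-in error function are assumptions for illustration, not the actual settings from the study described here.
```python
from hyperopt import fmin, hp, tpe

def detection_error(params):
    # Stand-in for running the face detector on a validation set and returning
    # its error rate; a real objective would involve thousands of images.
    return abs(params["pixels_per_cell"] - 8) / 8.0 + abs(params["image_scale"] - 0.5)

space = {
    "pixels_per_cell": hp.quniform("pixels_per_cell", 4, 16, 2),  # HOG cell size (assumed)
    "image_scale": hp.uniform("image_scale", 0.25, 1.0),          # input resize factor (assumed)
}
best = fmin(fn=detection_error, space=space, algo=tpe.suggest, max_evals=50)
print(best)   # TPE's best-found settings
```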

00:19:43.119 --> 00:19:45.599
Pushing the boundaries of handcrafted

00:19:45.599 --> 00:19:48.640
parameter-based algorithms far beyond what human trial

00:19:48.640 --> 00:19:51.299
and error could ever achieve. So what does this

00:19:51.299 --> 00:19:54.240
all mean? We pull back from the carbon nanolattices,

00:19:54.380 --> 00:19:56.960
the falling robots, and the computer vision algorithms.

00:19:57.759 --> 00:20:00.099
The ultimate takeaway for you listening to this

00:20:00.099 --> 00:20:02.599
is that Bayesian optimization is the premier

00:20:02.599 --> 00:20:04.920
framework for making decisions under uncertainty.

00:20:05.279 --> 00:20:07.920
It really is. It mathematically proves that you

00:20:07.920 --> 00:20:10.240
do not need to know everything to make the best

00:20:10.240 --> 00:20:13.240
possible choice. You don't need to take a million

00:20:13.240 --> 00:20:16.220
GPS readings. You just need to learn with ruthless

00:20:16.220 --> 00:20:18.460
efficiency from the very few choices you can

00:20:18.460 --> 00:20:21.059
make constantly updating your map of the world.

00:20:21.200 --> 00:20:23.720
That is the absolute essence of it. It's about

00:20:23.720 --> 00:20:26.740
maximizing the value of your own ignorance. And

00:20:26.740 --> 00:20:28.940
honestly, looking at how elegantly this math

00:20:28.940 --> 00:20:32.539
updates its beliefs, it raises a final somewhat

00:20:32.539 --> 00:20:34.779
provocative thought for you to mull over. Oh,

00:20:34.779 --> 00:20:37.420
I'm ready. Where are we taking this? Since Bayesian

00:20:37.420 --> 00:20:40.019
optimization is fundamentally about mathematically

00:20:40.019 --> 00:20:43.279
updating our beliefs, turning our initial priors

00:20:43.279 --> 00:20:45.680
into updated posteriors based strictly on the

00:20:45.680 --> 00:20:48.599
arrival of new evidence, could we theoretically

00:20:48.599 --> 00:20:52.599
use this to map human stubbornness? Whoa. What

00:20:52.599 --> 00:20:54.599
do you mean by that? Mapping stubbornness. Think

00:20:54.599 --> 00:20:57.559
about it. We all operate with our own internal

00:20:57.559 --> 00:21:00.819
personal acquisition functions. Every single

00:21:00.819 --> 00:21:03.339
day we are faced with the choice to either explore

00:21:03.339 --> 00:21:06.500
new unknown ideas that might challenge us or

00:21:06.500 --> 00:21:08.900
exploit the comfortable known beliefs we already

00:21:08.900 --> 00:21:11.480
hold. Imagine if we could actually measure that

00:21:11.480 --> 00:21:13.519
mathematically. Imagine if we could calculate

00:21:13.519 --> 00:21:17.200
your personal posterior distribution to see exactly

00:21:17.200 --> 00:21:19.750
where you fail to adapt to new data. Where do

00:21:19.750 --> 00:21:22.450
you ignore the GPS reading simply because you

00:21:22.450 --> 00:21:25.230
prefer the warm comfort of exploitation over

00:21:25.230 --> 00:21:27.769
the intellectual growth and the inherent risk

00:21:27.769 --> 00:21:30.430
of exploration. Exactly. So the real question

00:21:30.430 --> 00:21:33.089
is, are you constantly updating your map when

00:21:33.089 --> 00:21:35.750
new facts arrive? Or are you just sitting in

00:21:35.750 --> 00:21:37.849
the dark on the side of the mountain, refusing

00:21:37.849 --> 00:21:40.289
to take the blindfold off because you like the

00:21:40.289 --> 00:21:42.230
spot you're sitting in? It's a tough question

00:21:42.230 --> 00:21:45.329
to answer. That is a wild, brilliant thought

00:21:45.329 --> 00:21:47.680
to leave on. Thank you for joining us on this

00:21:47.680 --> 00:21:50.640
deep dive. Keep updating your priors, keep exploring

00:21:50.640 --> 00:21:52.640
the dark parts of the map, and we will catch

00:21:52.640 --> 00:21:53.500
you next time.
