WEBVTT

00:00:00.000 --> 00:00:04.620
Imagine a supercomputer is given a billion euros.

00:00:04.839 --> 00:00:08.560
Okay, and its single task is to fix unemployment

00:00:08.560 --> 00:00:12.900
across like 200 different cities. It has no empathy,

00:00:13.099 --> 00:00:15.419
doesn't care about human suffering, doesn't watch

00:00:15.419 --> 00:00:18.300
the local news, and it cannot be swayed by a

00:00:18.300 --> 00:00:21.219
politician's sob story. It only cares about one

00:00:21.219 --> 00:00:24.899
single rigid mathematical formula, a formula

00:00:24.899 --> 00:00:28.899
that dictates exactly how much pain the system

00:00:28.899 --> 00:00:31.579
is experiencing and how to minimize it. Today,

00:00:31.699 --> 00:00:34.280
we are looking at the exact formulas that govern

00:00:34.280 --> 00:00:36.939
scenarios just like that. It is a really profound

00:00:36.939 --> 00:00:38.979
shift in perspective, honestly. I mean, human

00:00:38.979 --> 00:00:41.000
regret is a feeling, right? You take a wrong

00:00:41.000 --> 00:00:43.539
exit, you grimace, you maybe mutter under your

00:00:43.539 --> 00:00:45.579
breath. Yeah, exactly. But you don't pull out

00:00:45.579 --> 00:00:47.899
a calculator. When we move into the realm of

00:00:47.899 --> 00:00:49.840
statistics and machine learning, well, we have

00:00:49.840 --> 00:00:51.979
to translate that qualitative human frustration

00:00:51.979 --> 00:00:55.450
into a rigid quantitative rule. And that is our

00:00:55.450 --> 00:00:57.909
mission for today's Deep Dive. We are exploring

00:00:57.909 --> 00:01:00.990
the concept of loss functions, drawing from the

00:01:00.990 --> 00:01:03.210
comprehensive Wikipedia article on the subject.

00:01:03.450 --> 00:01:06.450
We are going to demystify the mathematical engines

00:01:06.450 --> 00:01:09.489
that run optimization problems, economics, and

00:01:09.489 --> 00:01:12.189
statistical models. We really want to understand

00:01:12.189 --> 00:01:15.950
how math quantifies being wrong. Right. So that

00:01:15.950 --> 00:01:20.290
algorithms, statisticians, and economists can

00:01:20.290 --> 00:01:22.349
actually make better decisions. Exactly. And

00:01:22.349 --> 00:01:24.879
whether you realize it or not, these functions

00:01:24.879 --> 00:01:26.659
are governing the world around you right now,

00:01:26.780 --> 00:01:29.079
like from the GPS in your phone to the way your

00:01:29.079 --> 00:01:30.840
insurance premiums are calculated. Oh, absolutely.

00:01:31.000 --> 00:01:33.140
OK, let's unpack this. Before we can use a loss

00:01:33.140 --> 00:01:35.840
function, I mean, we really have to define what

00:01:35.840 --> 00:01:37.700
it actually is and where it comes from. Yeah,

00:01:37.739 --> 00:01:40.519
so at its most basic level, a loss function,

00:01:40.780 --> 00:01:42.799
which is sometimes called a cost function or

00:01:42.799 --> 00:01:45.459
an error function, is essentially a mathematical

00:01:45.459 --> 00:01:48.569
translation. OK. It takes an event, or like the

00:01:48.569 --> 00:01:50.870
values of certain variables, and maps them onto

00:01:50.870 --> 00:01:53.930
a real number. A real number. Right. And that

00:01:53.930 --> 00:01:57.090
real number intuitively represents a cost associated

00:01:57.090 --> 00:02:00.250
with the event. And in any optimization problem,

00:02:00.629 --> 00:02:02.730
well, the universal goal is to minimize that

00:02:02.730 --> 00:02:04.689
loss function. You want to push it down. Exactly.

00:02:04.769 --> 00:02:06.670
Push that number as close to zero as possible.

00:02:06.930 --> 00:02:08.689
Just to establish the flip side of that coin,

00:02:09.110 --> 00:02:12.770
the source mentions objective functions or reward

00:02:12.770 --> 00:02:14.960
functions. Right, the opposite. Yeah. That's

00:02:14.960 --> 00:02:16.860
when you have a mathematical formula where the

00:02:16.860 --> 00:02:19.639
goal is to maximize the outcome, like maximizing

00:02:19.639 --> 00:02:22.199
profit or utility or a population's fitness.

00:02:22.379 --> 00:02:25.000
Yeah. Yeah. But today, we are focusing purely

00:02:25.000 --> 00:02:27.280
on the pain. We are focusing on the loss. We

00:02:27.280 --> 00:02:30.719
are indeed. And while it sounds like a very modern

00:02:30.719 --> 00:02:33.979
machine learning era concept, the history actually

00:02:33.979 --> 00:02:37.379
stretches back centuries. Oh, really? Yeah. The

00:02:37.379 --> 00:02:40.219
concept of mapping an error to a cost goes back

00:02:40.219 --> 00:02:42.580
to Pierre-Simon Laplace, the brilliant 18th

00:02:42.580 --> 00:02:47.060
century mathematician. Wow. OK. But it was reintroduced

00:02:47.060 --> 00:02:49.879
into modern statistics in the mid 20th century

00:02:49.879 --> 00:02:53.319
by a man named Abraham Wald. We also see it pop

00:02:53.319 --> 00:02:55.759
up in actuarial science with Harald Cramér in

00:02:55.759 --> 00:02:58.319
the 1920s. Actuarial science. So like insurance.

00:02:58.599 --> 00:03:00.979
Exactly. He was using these concepts to model

00:03:00.979 --> 00:03:03.460
the risk of ruin for insurance companies. He

00:03:03.460 --> 00:03:06.360
was calculating the exact mathematical loss if

00:03:06.360 --> 00:03:09.099
benefits paid out exceeded the premiums coming

00:03:09.099 --> 00:03:11.729
in over time. That makes total sense. To make

00:03:11.729 --> 00:03:14.150
this really concrete for you listening, the best

00:03:14.150 --> 00:03:17.189
way I can visualize a loss function is to imagine

00:03:17.189 --> 00:03:21.550
a highly judgmental GPS. A judgmental GPS. I

00:03:21.550 --> 00:03:23.530
love that. Right. Because a standard GPS just

00:03:23.530 --> 00:03:26.550
says, Recalculating. It knows you missed a turn,

00:03:26.669 --> 00:03:29.310
but it treats all missed turns equally. Right.

00:03:29.490 --> 00:03:31.629
A loss function GPS doesn't just say you missed

00:03:31.629 --> 00:03:35.229
a turn. It calculates the exact agonizing cost

00:03:35.229 --> 00:03:38.129
of that specific detour. Oh, that's brutal. It

00:03:38.129 --> 00:03:40.629
chimes in and says, you just wasted 12 ounces

00:03:40.629 --> 00:03:42.990
of gasoline, added four minutes of drive time,

00:03:43.009 --> 00:03:45.889
and increased your baseline stress level by 14%.

00:03:45.889 --> 00:03:49.349
It gives a heavy numerical value to your specific

00:03:49.349 --> 00:03:52.000
failure. That captures the essence of it beautifully.

00:03:52.419 --> 00:03:55.080
It also highlights why this is a completely necessary

00:03:55.080 --> 00:03:57.439
concept. How so? Well, computers and statistical

00:03:57.439 --> 00:04:00.680
models cannot optimize based on a vague human

00:04:00.680 --> 00:04:03.020
sense of, oops, I took a wrong turn. Yeah, they

00:04:03.020 --> 00:04:05.080
don't have feelings. Exactly. They need a scoreboard.

00:04:05.199 --> 00:04:07.159
They need to know if mistake A is twice as bad

00:04:07.159 --> 00:04:11.360
as mistake B, or like 10 times as bad. That translation

00:04:11.360 --> 00:04:13.620
is what allows an algorithm to update its behavior

00:04:13.620 --> 00:04:16.259
and actually learn. But the way you keep score

00:04:16.259 --> 00:04:18.759
completely changes how the game is played, right?

00:04:18.759 --> 00:04:20.899
Absolutely. Once you decide you need to measure

00:04:20.899 --> 00:04:22.740
a mistake, you have to decide how to measure

00:04:22.740 --> 00:04:25.300
it. And according to the source material, there

00:04:25.300 --> 00:04:28.939
are different mathematical flavors of these functions.

00:04:29.040 --> 00:04:30.720
Yeah, let's start with the absolute simplest

00:04:30.720 --> 00:04:35.220
flavor, the zero-one loss function. OK. In information

00:04:35.220 --> 00:04:37.720
theory, this is often known as Hamming distortion.

00:04:38.160 --> 00:04:41.699
It is strictly binary. If your estimate exactly

00:04:41.699 --> 00:04:45.220
matches the target, the loss is zero. Okay. And

00:04:45.220 --> 00:04:47.699
if you're wrong? If you were wrong by any margin

00:04:47.699 --> 00:04:49.819
whatsoever, the loss is one. Let me make sure

00:04:49.819 --> 00:04:52.259
I'm visualizing this. It's essentially a pure

00:04:52.259 --> 00:04:54.860
pass-fail system. Yes. If we are playing a game

00:04:54.860 --> 00:04:57.579
where you guess my weight perfectly, you get

00:04:57.579 --> 00:05:00.740
a zero. You win. But if you're off by half a

00:05:00.740 --> 00:05:02.480
pound, you get a one. Right. And if you're off

00:05:02.480 --> 00:05:05.279
by 50 pounds, you still get a one. The algorithm

00:05:05.279 --> 00:05:08.879
feels the exact same amount of pain for a tiny

00:05:08.879 --> 00:05:11.120
rounding error as it does for a catastrophic

00:05:11.120 --> 00:05:13.579
miscalculation. That is the core of it. It doesn't

00:05:13.579 --> 00:05:16.800
care about the magnitude of the error, only the

00:05:16.800 --> 00:05:18.899
presence of it. Think of a computer password.
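In code, that pass-fail scoring is a one-liner. A minimal Python sketch of the zero-one loss, with illustrative weight guesses:

```python
def zero_one_loss(estimate, target):
    """0-1 loss (Hamming distortion): 0 for an exact match, 1 otherwise."""
    return 0 if estimate == target else 1

# Any miss costs the same, regardless of magnitude.
print(zero_one_loss(150, 150))    # exact guess: 0
print(zero_one_loss(150.5, 150))  # off by half a pound: 1
print(zero_one_loss(200, 150))    # off by 50 pounds: still 1
```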

00:05:20.100 --> 00:05:21.879
You don't get partial credit for getting eight

00:05:21.879 --> 00:05:24.180
out of nine characters correct. Oh, yeah. That's

00:05:24.180 --> 00:05:27.319
a great point. You are just denied entry. The

00:05:27.319 --> 00:05:31.779
loss is one. But in many fields, like linear

00:05:31.779 --> 00:05:34.240
regression or the design of scientific experiments,

00:05:34.600 --> 00:05:37.339
that binary system is just useless. Because it's

00:05:37.339 --> 00:05:40.879
too rigid. Exactly. We need to measure how far

00:05:40.879 --> 00:05:44.439
off we are. That brings us to a very common approach.

00:05:45.620 --> 00:05:48.480
The quadratic loss function, also known as the

00:05:48.480 --> 00:05:50.839
squared error loss. Right. Our source explains

00:05:50.839 --> 00:05:52.620
this is where you take the difference between

00:05:52.620 --> 00:05:54.620
your estimate and the target and you square it.

00:05:54.680 --> 00:05:57.660
Yes. So if you're off by 2, your loss is 4. If

00:05:57.660 --> 00:06:01.569
you're off by 3, your loss is 9. It's symmetric,

00:06:01.790 --> 00:06:04.509
meaning if you overshoot the target, it causes

00:06:04.509 --> 00:06:06.930
the same mathematical pain as undershooting it.

00:06:07.069 --> 00:06:09.110
Correct. But I have to say, looking at this,

00:06:09.129 --> 00:06:10.829
I want to push back a little bit. Please do.

00:06:11.069 --> 00:06:13.029
Why are we squaring it? Why not just look at

00:06:13.029 --> 00:06:14.829
the absolute difference? That's a very common

00:06:14.829 --> 00:06:17.110
question. Like, if I'm off by four, why isn't

00:06:17.110 --> 00:06:19.930
the penalty just four? Squaring the error seems

00:06:19.930 --> 00:06:22.449
incredibly dramatic. It feels like we are massively

00:06:22.449 --> 00:06:25.430
over-punishing outliers. You're not wrong. Right.

00:06:25.790 --> 00:06:28.269
If an algorithm is making 100 guesses, And most

00:06:28.269 --> 00:06:30.990
of them are off by one or two, but one wild guess

00:06:30.990 --> 00:06:34.069
is off by 10. Suddenly that one guess is generating

00:06:34.069 --> 00:06:37.129
a loss of 100. It totally dominates the average.
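That outlier-domination effect is easy to check numerically. A sketch assuming an illustrative split of 60 guesses off by 1, 39 off by 2, and one wild guess off by 10:

```python
# 100 guesses: most off by 1 or 2, one wild guess off by 10.
errors = [1] * 60 + [2] * 39 + [10]

squared = [e ** 2 for e in errors]
absolute = [abs(e) for e in errors]

print(sum(squared) / len(squared))    # mean squared loss: 3.16
print(sum(absolute) / len(absolute))  # mean absolute loss: 1.48
# Under squared loss the single outlier contributes 100 of the
# total 316, so one guess carries roughly a third of the pain.
print(max(squared) / sum(squared))
```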

00:06:37.449 --> 00:06:40.149
You've hit on the exact flaw of the quadratic

00:06:40.149 --> 00:06:43.009
loss function. I knew it. Because of that squaring

00:06:43.009 --> 00:06:45.949
mechanism, it assigns vastly more importance

00:06:45.949 --> 00:06:49.889
to large outliers than to the everyday average

00:06:49.889 --> 00:06:52.470
data. Yeah, that seems like a problem. It is.

00:06:53.029 --> 00:06:55.750
If you have a data set with a few massive anomalies,

00:06:56.029 --> 00:06:58.829
the quadratic loss will basically bend over backwards

00:06:58.829 --> 00:07:01.730
to minimize those specific anomalies, which can

00:07:01.730 --> 00:07:04.029
warp your whole model. Then why is it the default

00:07:04.029 --> 00:07:06.610
for so many models? Why do we use it? Because

00:07:06.610 --> 00:07:08.689
of the underlying mechanics of optimization.

00:07:08.930 --> 00:07:11.290
We have to remember how these algorithms actually

00:07:11.290 --> 00:07:15.089
work. They're trying to find the absolute minimum

00:07:15.089 --> 00:07:17.829
of a function, the lowest possible point of error.

00:07:18.009 --> 00:07:21.680
To do that, algorithms essentially act like blind

00:07:21.680 --> 00:07:23.939
hikers trying to find the bottom of a valley.

00:07:24.240 --> 00:07:26.079
Blind hikers. Yeah. They can't see the valley,

00:07:26.139 --> 00:07:27.920
so they feel the slope of the ground under their

00:07:27.920 --> 00:07:29.860
feet to figure out which way is downhill. OK,

00:07:29.920 --> 00:07:32.060
I'm following. They use calculus, like derivatives,

00:07:32.060 --> 00:07:35.519
to feel the slope? Precisely. Now, imagine absolute

00:07:35.519 --> 00:07:38.459
loss, just taking the straightforward difference

00:07:38.459 --> 00:07:42.009
without squaring it. On a graph, that creates

00:07:42.009 --> 00:07:44.889
a sharp mathematical V shape. Like a literal

00:07:44.889 --> 00:07:48.009
letter V? Yes. At the very bottom of that V,

00:07:48.110 --> 00:07:51.529
where the error is zero, there is a sharp jagged

00:07:51.529 --> 00:07:55.310
point. In calculus terms, a sharp point like

00:07:55.310 --> 00:07:58.029
that is not differentiable. Meaning what, exactly?

00:07:58.170 --> 00:08:00.209
Meaning there is no continuous slope to feel.

00:08:00.519 --> 00:08:03.259
When our blind algorithmic hiker reaches the

00:08:03.259 --> 00:08:06.480
bottom of that V, it gets confused. It can't

00:08:06.480 --> 00:08:08.579
calculate a derivative and the optimization process

00:08:08.579 --> 00:08:11.300
just breaks down. Oh, so the algorithm literally

00:08:11.300 --> 00:08:14.279
trips over the sharp point. It does. The quadratic

00:08:14.279 --> 00:08:17.360
loss function, however, creates a smooth, continuous

00:08:17.360 --> 00:08:19.720
U-shaped parabola. Ah, because it's squared.
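Because the squared curve has a slope everywhere, the blind hiker's downhill steps can be sketched directly. The target, step size, and iteration count here are illustrative:

```python
# Gradient descent on quadratic loss L(x) = (x - target)**2.
# The derivative 2*(x - target) exists everywhere, so the
# "blind hiker" always has a slope to follow.
target = 5.0
x = 0.0   # starting guess
lr = 0.1  # step size

for _ in range(100):
    grad = 2 * (x - target)  # slope of the loss at x
    x -= lr * grad           # step downhill

print(round(x, 4))  # converges to the minimum at 5.0
```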

00:08:20.040 --> 00:08:22.740
Exactly. Because it's squared, the curve gently

00:08:22.740 --> 00:08:25.490
rounds out at the bottom. It is globally continuous

00:08:25.490 --> 00:08:27.889
and differentiable everywhere. The hiker can

00:08:27.889 --> 00:08:29.990
slide smoothly right down to the minimum. It

00:08:29.990 --> 00:08:32.070
is beautifully tractable for things like linear

00:08:32.070 --> 00:08:34.590
regression and least squares. So we basically

00:08:34.590 --> 00:08:36.889
sacrifice a little bit of common sense regarding

00:08:36.889 --> 00:08:39.929
outliers just to make the math smoother for the

00:08:39.929 --> 00:08:41.830
computers. We make a trade -off for computational

00:08:41.830 --> 00:08:44.549
efficiency, yes. That's wild. Though there are

00:08:44.549 --> 00:08:46.450
alternatives mentioned in the source like Huber

00:08:46.450 --> 00:08:49.850
loss or log-cosh loss that try to blend the

00:08:49.850 --> 00:08:52.889
two: staying smooth at the bottom, but not

00:08:52.889 --> 00:08:56.490
over-punishing outliers at the extremes. Right. But

00:08:56.490 --> 00:08:58.330
before we move on from the flavors of failure,

00:08:58.429 --> 00:09:00.750
there's one more fascinating concept we really

00:09:00.750 --> 00:09:03.789
have to discuss. Leonard J. Savage's concept

00:09:03.789 --> 00:09:07.070
of regret. Oh yeah, this one felt deeply philosophical

00:09:07.070 --> 00:09:09.889
to me. Savage's regret takes this out of pure

00:09:09.889 --> 00:09:12.610
math and almost into psychology. It's a brilliant

00:09:12.610 --> 00:09:15.860
framing. Savage argued that a loss function shouldn't

00:09:15.860 --> 00:09:18.159
just measure how far off your guess was from

00:09:18.159 --> 00:09:20.779
a target. Okay. He said the loss should be the

00:09:20.779 --> 00:09:22.639
difference between the consequence of the decision

00:09:22.639 --> 00:09:25.159
you actually made and the consequence of the

00:09:25.159 --> 00:09:27.360
best possible decision you could have made if

00:09:27.360 --> 00:09:29.620
you had possessed a crystal ball. It measures

00:09:29.620 --> 00:09:32.399
the agony of hindsight. Let's say you're listening

00:09:32.399 --> 00:09:34.039
to this and thinking about the stock market.

00:09:34.250 --> 00:09:37.509
You invest in a safe index fund and make $500.

00:09:37.950 --> 00:09:41.409
So, a win. Yeah. A standard objective function

00:09:41.409 --> 00:09:44.149
might say, great, you maximized your profit by

00:09:44.149 --> 00:09:47.809
500, loss is zero. Right. But Savage's regret

00:09:47.809 --> 00:09:49.730
looks at the fact that you could have invested

00:09:49.730 --> 00:09:54.490
in a breakout tech stock and made $5,000. And therefore

00:09:54.490 --> 00:09:59.470
your loss function is $4,500. The pain isn't

00:09:59.470 --> 00:10:02.789
what you lost, it's the theoretical optimum that

00:10:02.789 --> 00:10:05.789
you missed out on under perfectly known circumstances.
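Savage's regret is simple to write down: the best payoff you could have had in hindsight, minus the payoff you actually got. A sketch using the numbers from the stock example (the $0 cash option is an added illustration):

```python
def savage_regret(chosen_payoff, payoffs):
    """Regret: best achievable payoff in hindsight minus what you got."""
    return max(payoffs) - chosen_payoff

# Index fund made $500; the tech stock would have made $5,000.
outcomes = {"index_fund": 500, "tech_stock": 5000, "cash": 0}
print(savage_regret(outcomes["index_fund"], outcomes.values()))  # 4500
```

Only the decision that was optimal in hindsight gets a regret of zero; everything else is scored by the gap.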

00:10:06.309 --> 00:10:08.610
Man, that is depressing. But you know, it's easy

00:10:08.610 --> 00:10:10.950
to square a number or calculate hindsight when

00:10:10.950 --> 00:10:13.049
we're just talking about dots on a graph or a

00:10:13.049 --> 00:10:15.649
theoretical stock trade. Sure. But human society

00:10:15.649 --> 00:10:18.409
is incredibly messy. How do we build these functions

00:10:18.409 --> 00:10:22.259
to reflect actual real world priorities? like

00:10:22.259 --> 00:10:25.779
square a city's budget deficit or a university's

00:10:25.779 --> 00:10:27.799
lack of funding. What's fascinating here is that

00:10:27.799 --> 00:10:31.240
the map somehow has to elicit complex human preferences

00:10:31.240 --> 00:10:33.899
and turn them into a scalar valued function.

00:10:34.320 --> 00:10:37.080
Ragnar Frisch, a pioneer in economics, actually

00:10:37.080 --> 00:10:39.519
highlighted this exact problem in his Nobel Prize

00:10:39.519 --> 00:10:42.950
lecture. Oh, did he? He did. He pointed out that

00:10:42.950 --> 00:10:45.409
while sometimes the physical problem dictates

00:10:45.409 --> 00:10:48.190
the function when it comes to social or economic

00:10:48.190 --> 00:10:51.210
planning, a human decision maker's preference

00:10:51.210 --> 00:10:55.190
has to be explicitly extracted. The source mentions

00:10:55.190 --> 00:10:58.830
a researcher named Andranik Tangian who figured

00:10:58.830 --> 00:11:01.610
out a highly practical way to do this using something

00:11:01.610 --> 00:11:04.750
he called indifference points. Yes, exactly.

00:11:04.809 --> 00:11:07.470
How does that actually work in practice? So Tangian

00:11:07.470 --> 00:11:10.149
conducted computer-assisted interviews with

00:11:10.149 --> 00:11:12.509
policy decision makers. He would present them

00:11:12.509 --> 00:11:14.649
with trade-offs. Like a game of would you rather?

00:11:15.070 --> 00:11:17.309
Sort of. For example, he might ask a politician,

00:11:17.909 --> 00:11:20.169
would you accept a 5% budget cut to the science

00:11:20.169 --> 00:11:22.690
department if it meant a 2% increase for the

00:11:22.690 --> 00:11:25.620
library? Okay. If the politician says no, he

00:11:25.620 --> 00:11:28.159
tweaks the numbers. How about a 4% cut for a

00:11:28.159 --> 00:11:31.379
3% increase? He keeps adjusting these hypothetical

00:11:31.379 --> 00:11:33.899
scenarios until the decision-maker literally

00:11:33.899 --> 00:11:36.279
shrugs and says, I don't care, both of those

00:11:36.279 --> 00:11:38.379
outcomes seem equally acceptable to me. That

00:11:38.379 --> 00:11:41.480
shrug is the indifference point. Exactly. It

00:11:41.480 --> 00:11:43.799
is the mathematical moment where preference shifts.
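One way to picture that elicitation loop is a bisection search for the shrug point. The preference function below is entirely invented for illustration; it is not Tangian's actual model:

```python
# Hypothetical sketch of Tangian-style elicitation: adjust the library
# increase until the decision maker is indifferent to a fixed 5% science
# cut. The preference weighting here is made up for illustration.
def preference(science_cut, library_gain):
    # > 0: accepts the trade, < 0: rejects it (invented linear weighting)
    return 2.0 * library_gain - 1.2 * science_cut

def find_indifference(science_cut, lo=0.0, hi=10.0, tol=1e-6):
    """Bisect on the library gain until preference crosses zero."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if preference(science_cut, mid) > 0:
            hi = mid   # trade accepted: try a smaller sweetener
        else:
            lo = mid   # trade rejected: offer more
    return (lo + hi) / 2

# The library gain that makes a 5% science cut a shrug: 3.0
print(round(find_indifference(5.0), 3))
```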

00:11:44.190 --> 00:11:47.409
By mapping dozens or hundreds of these indifference

00:11:47.409 --> 00:11:50.149
points, Tangian could mathematically trace a

00:11:50.149 --> 00:11:52.850
curve that represented the human values of the

00:11:52.850 --> 00:11:54.970
decision makers. That's incredible. It gets better.

00:11:55.149 --> 00:11:57.929
He then used those curves to build complex objective

00:11:57.929 --> 00:12:00.549
functions for massive real -world applications.

00:12:01.190 --> 00:12:03.929
For instance, he used this method to optimally

00:12:03.929 --> 00:12:06.970
distribute budgets across 16 different Westphalian

00:12:06.970 --> 00:12:09.669
universities. Which is just crazy to think about.

00:12:09.730 --> 00:12:11.129
If you are listening to this and you've ever

00:12:11.129 --> 00:12:13.389
wondered why your local university got a certain

00:12:13.389 --> 00:12:16.470
grant or why a specific town got regional funding,

00:12:16.830 --> 00:12:18.929
it's not always just politicians arguing in a

00:12:18.929 --> 00:12:22.149
room. No, not at all. It is quite literally abstract

00:12:22.149 --> 00:12:24.649
math attempting to optimize an objective function

00:12:24.460 --> 00:12:26.860
built from human indifference points. Tangian

00:12:26.860 --> 00:12:29.740
also applied this to distribute European subsidies

00:12:29.740 --> 00:12:32.419
aimed at equalizing unemployment rates across

00:12:32.419 --> 00:12:36.600
271 distinct German regions. Wow. The loss function

00:12:36.600 --> 00:12:39.000
in that scenario was the inequality of unemployment.

00:12:39.600 --> 00:12:42.000
The optimization problem was figuring out exactly

00:12:42.000 --> 00:12:44.759
how to route millions of euros to smooth out

00:12:44.759 --> 00:12:47.720
that specific loss curve. It really grounds these

00:12:47.720 --> 00:12:51.059
abstract ideas. But here is the thing. It is

00:12:51.059 --> 00:12:53.240
one thing to budget a university when you have

00:12:53.240 --> 00:12:55.779
all the data in front of you. You know the student

00:12:55.779 --> 00:12:58.740
population. You know the tax revenue. Life usually

00:12:58.740 --> 00:13:02.080
isn't like that. How do these loss functions

00:13:02.080 --> 00:13:05.159
operate when we are trying to navigate the unknown,

00:13:05.580 --> 00:13:08.320
when there is a total fog of future uncertainty?

00:13:08.649 --> 00:13:11.250
That brings us into statistical decision theory,

00:13:11.710 --> 00:13:14.610
which generally splits into two massive philosophical

00:13:14.610 --> 00:13:17.889
camps. The frequentist approach and the Bayesian

00:13:17.889 --> 00:13:20.330
approach. Oh boy. Yeah. Both are trying to make

00:13:20.330 --> 00:13:22.470
a decision based on the expected value of the

00:13:22.470 --> 00:13:25.690
loss function, but they define expected in radically

00:13:25.690 --> 00:13:27.820
different ways. Let's break that down because

00:13:27.820 --> 00:13:30.799
the Wikipedia definitions were, frankly, incredibly

00:13:30.799 --> 00:13:33.220
dense. Let's start with the frequentist approach.

00:13:33.860 --> 00:13:36.340
The source says it evaluates the risk function

00:13:36.340 --> 00:13:38.799
by looking at the expected value with respect

00:13:38.799 --> 00:13:41.679
to the probability distribution of observed data.

00:13:42.259 --> 00:13:44.440
What does that actually mean if I'm trying to

00:13:44.440 --> 00:13:47.960
solve a problem? Imagine you are trying to determine

00:13:47.960 --> 00:13:51.039
if a coin is weighted or rigged. A frequentist

00:13:51.200 --> 00:13:53.820
looks at the problem by considering all the possible

00:13:53.820 --> 00:13:56.399
alternate universes of data that might have been

00:13:56.399 --> 00:13:59.000
generated. Alternate universes? Yeah. They consider

00:13:59.000 --> 00:14:01.639
the average loss over every conceivable sample

00:14:01.639 --> 00:14:04.039
of coin flips you might have drawn, assuming the

00:14:04.039 --> 00:14:08.320
coin has a fixed true state, like being 60% weighted

00:14:08.320 --> 00:14:12.220
to heads. They are averaging out the risk over

00:14:12.409 --> 00:14:14.870
infinite theoretical repetitions of the experiment.
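That averaging over hypothetical datasets can be computed exactly for the coin example. A sketch, assuming squared-error loss and the observed-fraction estimator:

```python
from math import comb

# Frequentist risk: average the loss over every dataset the fixed
# "true" coin (60% heads) could have generated, weighted by how
# probable each dataset is.
p_true, n = 0.6, 10

def binom_pmf(k, n, p):
    """Probability of seeing k heads in n flips of a p-weighted coin."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Estimator: observed fraction of heads. Loss: squared error.
risk = sum(binom_pmf(k, n, p_true) * (k / n - p_true) ** 2
           for k in range(n + 1))

print(round(risk, 4))  # 0.024, i.e. p(1-p)/n for this estimator
```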

00:14:15.090 --> 00:14:17.149
So it's heavily reliant on hypothetical data,

00:14:17.570 --> 00:14:19.389
alternate universes of coin flips that I didn't

00:14:19.389 --> 00:14:21.389
actually witness. Exactly. How does the Bayesian

00:14:21.389 --> 00:14:24.169
approach differ? The Bayesian approach calculates

00:14:24.169 --> 00:14:27.990
expected loss using what we call prior and posterior

00:14:27.990 --> 00:14:31.049
distributions. This is known as Bayes risk. Okay,

00:14:31.250 --> 00:14:33.230
stick with the coin analogy for me. Right. A

00:14:33.230 --> 00:14:35.429
Bayesian starts with a prior belief, say, a strong

00:14:35.429 --> 00:14:37.669
assumption that most coins are fair. Then they

00:14:37.669 --> 00:14:39.850
flip the coin 10 times and it lands on heads

00:14:40.159 --> 00:14:42.860
eight times. Okay. They use that actual observed

00:14:42.860 --> 00:14:45.759
data to update their belief, creating a posterior

00:14:45.759 --> 00:14:48.500
distribution. The massive advantage here is that

00:14:48.500 --> 00:14:51.500
the Bayesian framework only requires you to optimize

00:14:51.500 --> 00:14:53.740
your decision based on the data you currently

00:14:53.740 --> 00:14:56.220
hold in your hand. I see. You don't have to worry

00:14:56.220 --> 00:14:58.620
about the infinite alternate universes of data

00:14:58.620 --> 00:15:00.659
you didn't collect. Not at all. You just say,

00:15:01.039 --> 00:15:03.480
given my prior belief and the eight heads I just

00:15:03.480 --> 00:15:06.500
saw, what is my optimal decision right now to

00:15:06.500 --> 00:15:10.059
minimize loss? That's the beauty of it. Choosing

00:15:10.059 --> 00:15:12.580
a frequentist optimal decision rule that accounts

00:15:12.580 --> 00:15:15.759
for all possible observations is mathematically

00:15:15.759 --> 00:15:19.259
a much more difficult and rigid problem. The

00:15:19.259 --> 00:15:21.759
Bayesian approach anchors itself strictly to

00:15:21.759 --> 00:15:23.820
the data you actually possess. Okay, so that

00:15:23.820 --> 00:15:26.039
is how statisticians define the boundaries of

00:15:26.039 --> 00:15:29.179
uncertainty. But what about the actual decision

00:15:29.179 --> 00:15:31.659
rules you apply when navigating that uncertainty?
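The posterior update just described is only a couple of lines. A sketch, with an illustrative Beta prior standing in for the "most coins are fair" belief; under squared-error loss, the Bayes-optimal estimate is the posterior mean:

```python
# Bayesian expected loss uses only the data in hand: a Beta prior
# updated by 8 heads in 10 flips. Prior strength is illustrative.
prior_a, prior_b = 5, 5  # "coins are usually fair" prior belief
heads, tails = 8, 2      # the data actually observed

post_a, post_b = prior_a + heads, prior_b + tails  # posterior Beta(13, 7)
estimate = post_a / (post_a + post_b)              # posterior mean

print(round(estimate, 3))  # 0.65: pulled from 0.8 toward the fair prior
```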

00:15:31.919 --> 00:15:34.080
Right, the practical application. Yeah, the source

00:15:34.080 --> 00:15:36.179
outlines a few, but one jumped out at me immediately.

00:15:36.539 --> 00:15:38.919
The minimax rule. Choosing the decision with

00:15:38.919 --> 00:15:42.059
the lowest worst-case loss. The minimax rule.

00:15:42.440 --> 00:15:44.840
It is an absolute cornerstone of decision theory.

00:15:45.120 --> 00:15:46.799
Here's where it gets really interesting. I was

00:15:46.799 --> 00:15:48.620
trying to think of how to explain minimax and

00:15:48.620 --> 00:15:51.320
it hit me. It's like packing for a vacation by

00:15:51.320 --> 00:15:53.960
only looking at the absolute worst possible weather

00:15:53.960 --> 00:15:55.919
forecast. OK, let's hear it. You're flying to

00:15:55.919 --> 00:15:59.100
Florida for a beach trip, but you see a 1% chance

00:15:59.100 --> 00:16:03.840
of a freak, historic blizzard. So you pack a

00:16:03.840 --> 00:16:06.840
full-body, sub-zero snowsuit just in case.

00:16:06.879 --> 00:16:09.500
Oh, that's perfect. Because if you pack the snowsuit,

00:16:09.700 --> 00:16:12.039
your absolute worst-case scenario, freezing to

00:16:12.039 --> 00:16:14.889
death in Florida, is minimized. Your ultimate

00:16:14.889 --> 00:16:17.690
loss is just the annoyance of lugging a heavy

00:16:17.690 --> 00:16:20.429
bag through the airport. That is a highly illustrative

00:16:20.429 --> 00:16:23.210
analogy. It captures the sheer pessimism of the

00:16:23.210 --> 00:16:25.429
minimax rule. It's extremely pessimistic. You

00:16:25.429 --> 00:16:28.110
are deliberately looking for the maximum possible

00:16:28.110 --> 00:16:31.350
loss across all scenarios and then picking the

00:16:31.350 --> 00:16:33.429
option that makes that maximum loss as small

00:16:33.429 --> 00:16:36.049
as possible. You minimize the maximum: minimax.
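Minimax is a one-pass scan over a loss table. A sketch with invented loss figures for the packing story:

```python
# Minimax: for each action, find its worst-case loss across states,
# then pick the action whose worst case is smallest.
# The loss numbers are illustrative, not from the source.
losses = {
    "pack_snowsuit": {"sunny": 5, "blizzard": 5},    # lug the bag either way
    "pack_light":    {"sunny": 0, "blizzard": 100},  # freeze in the blizzard
}

worst_case = {action: max(by_state.values())
              for action, by_state in losses.items()}
choice = min(worst_case, key=worst_case.get)

print(worst_case)  # {'pack_snowsuit': 5, 'pack_light': 100}
print(choice)      # 'pack_snowsuit'
```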

00:16:36.149 --> 00:16:38.490
It's the ultimate mathematical defense mechanism

00:16:38.490 --> 00:16:40.789
against catastrophe. And this raises a crucial

00:16:40.789 --> 00:16:43.899
point about how economic agents meaning humans

00:16:43.899 --> 00:16:46.919
behave under uncertainty. In economics, this

00:16:46.919 --> 00:16:48.980
decision-making is often modeled using the von

00:16:48.980 --> 00:16:51.759
Neumann-Morgenstern utility function. That's

00:16:51.759 --> 00:16:54.700
a mouthful. It is. It seeks to maximize expected

00:16:54.700 --> 00:16:57.080
utility, but the math shifts depending entirely

00:16:57.080 --> 00:17:00.539
on human psychology. Imagine a game show. You

00:17:00.539 --> 00:17:04.319
can take a guaranteed $50,000, or you can flip

00:17:04.319 --> 00:17:07.099
a coin for a chance to win $100,000 or nothing.
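The two gambles can be compared numerically. A sketch using a square-root utility as an illustrative stand-in for risk aversion (the utility choice is an assumption, not from the source):

```python
from math import sqrt

# Both options have the same expected value, so a risk-neutral agent
# is indifferent; a concave utility (sqrt, purely illustrative)
# makes the sure thing win.
sure_thing = [(1.0, 50_000)]            # (probability, payoff) pairs
coin_flip = [(0.5, 100_000), (0.5, 0)]

def expected_value(outcomes):
    return sum(p * x for p, x in outcomes)

def expected_utility(outcomes):
    return sum(p * sqrt(x) for p, x in outcomes)

print(expected_value(sure_thing), expected_value(coin_flip))  # equal EVs
print(expected_utility(sure_thing) > expected_utility(coin_flip))  # True
```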

00:17:07.480 --> 00:17:10.480
OK, standard game show setup. Right. The mathematical

00:17:10.480 --> 00:17:13.299
expected value of both choices is exactly

00:17:13.299 --> 00:17:17.140
$50,000. If an agent is risk-neutral, their loss

00:17:17.140 --> 00:17:19.900
function sees those choices as identical. Because

00:17:19.900 --> 00:17:23.039
the math balances out. Exactly. But most humans

00:17:23.039 --> 00:17:26.240
are risk-averse. The pain of losing the guaranteed

00:17:26.240 --> 00:17:29.079
$50,000 far outweighs the joy of potentially

00:17:29.079 --> 00:17:31.559
winning $100,000. Oh, yeah, I'd take the 50

00:17:31.559 --> 00:17:34.480
grand in a heartbeat. See? For a risk-averse

00:17:34.480 --> 00:17:37.599
agent, the loss function is warped by fear, and

00:17:37.599 --> 00:17:40.160
the math has to bend to accommodate that psychological

00:17:40.160 --> 00:17:42.559
reality. So we've looked at all these incredibly

00:17:42.559 --> 00:17:45.140
elegant mathematical frameworks, smooth quadratic

00:17:45.140 --> 00:17:47.579
parabolas, Bayesian expected loss, von

00:17:47.579 --> 00:17:50.339
Neumann-Morgenstern utility curves. Yes. But do these

00:17:50.339 --> 00:17:53.079
elegant formulas actually survive contact with

00:17:53.079 --> 00:17:54.720
reality? Because the real world doesn't feel

00:17:54.720 --> 00:17:56.980
like a smooth parabola. That is perhaps the most

00:17:56.980 --> 00:17:59.859
heated debate in this entire field. And our source

00:17:59.859 --> 00:18:02.799
highlights two prominent thinkers who forcefully

00:18:02.799 --> 00:18:04.980
argue that classical symmetrical mathematical

00:18:04.980 --> 00:18:08.200
models often fail disastrously in the real world.

00:18:08.460 --> 00:18:11.960
Really? W. Edwards Deming, the legendary statistician

00:18:11.960 --> 00:18:14.660
and quality control expert, and Nassim Nicholas

00:18:14.660 --> 00:18:16.900
Tillip, who writes extensively about extreme

00:18:16.900 --> 00:18:19.460
probability and risk. Oh, I've heard of Taleb.

00:18:20.059 --> 00:18:22.740
Their core argument is that empirical reality

00:18:22.740 --> 00:18:25.519
should be the sole basis for selecting a loss

00:18:25.519 --> 00:18:27.779
function, right? Exactly. And empirical reality

00:18:27.779 --> 00:18:31.619
is rarely mathematically nice. It's rarely continuous

00:18:31.619 --> 00:18:34.220
or differentiable or symmetric. Let's think back

00:18:34.220 --> 00:18:36.039
to the quadratic loss function you pushed back

00:18:36.039 --> 00:18:38.880
on earlier. That function is symmetric. Being

00:18:38.880 --> 00:18:41.259
four units above the target generates the exact

00:18:41.259 --> 00:18:43.880
same mathematical pain as being four units below

00:18:43.880 --> 00:18:46.400
the target. Deming and Taleb point out that in

00:18:46.400 --> 00:18:49.200
real life, asymmetry is the rule, not the exception.

00:18:49.480 --> 00:18:51.640
The source gave a brilliant everyday example

00:18:51.640 --> 00:18:54.319
of this. Missing a plane. Oh, this is a great

00:18:54.319 --> 00:18:57.099
one. Let's say your target is to arrive at the

00:18:57.099 --> 00:18:59.240
airport gate exactly at the moment it closes.

00:18:59.680 --> 00:19:02.299
If you arrive 10 minutes early, your error is

00:19:02.299 --> 00:19:04.740
10 minutes. Your loss. You wait around in a plastic

00:19:04.740 --> 00:19:06.960
chair. Maybe you buy an overpriced coffee. Right.

00:19:07.099 --> 00:19:09.700
Minor inconvenience. But if you arrive 10 minutes

00:19:09.700 --> 00:19:12.960
late, your error is still 10 minutes. On a quadratic

00:19:12.960 --> 00:19:16.180
loss graph, those two errors are perfectly symmetric.

00:19:16.380 --> 00:19:18.779
But the real -world loss is completely asymmetric.

00:19:18.920 --> 00:19:21.559
Beyond asymmetric. It's a massive discontinuity.

00:19:21.880 --> 00:19:23.940
If you are 10 minutes late, you miss the flight

00:19:23.940 --> 00:19:26.619
entirely. You lose hundreds of dollars, you ruin

00:19:26.619 --> 00:19:29.059
your vacation, you literally sleep on the airport

00:19:29.059 --> 00:19:32.759
floor. Arriving slightly late is infinitely more

00:19:32.759 --> 00:19:35.319
costly than arriving slightly early. We see this
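[Editor's note: a minimal Python sketch of the airport example above, not from the source. The cost figures and the simple linear waiting cost are illustrative assumptions; the point is the shape of the two curves, not the numbers.]

```python
# Comparing a symmetric quadratic loss with an asymmetric, discontinuous
# "missed flight" loss for the same size of error (minutes vs. gate close).

def quadratic_loss(error_minutes):
    """Symmetric: 10 minutes early costs exactly as much as 10 minutes late."""
    return error_minutes ** 2

def airport_loss(error_minutes, ticket_cost=400):
    """Asymmetric and discontinuous (illustrative numbers):
    negative error = early (small linear cost of waiting around),
    positive error = late (the flight is gone, whole ticket lost)."""
    if error_minutes <= 0:
        return -error_minutes * 1.0   # minor inconvenience per minute waited
    return ticket_cost                # late by ANY amount: catastrophic jump

print(quadratic_loss(-10), quadratic_loss(10))  # 100 100 -> symmetric
print(airport_loss(-10), airport_loss(10))      # 10.0 400 -> asymmetric
```

The discontinuity at zero is the whole story: the quadratic curve cannot represent the cliff between "waiting with an overpriced coffee" and "sleeping on the airport floor."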

00:19:35.319 --> 00:19:38.599
lethal asymmetry everywhere. Consider pharmacology

00:19:38.599 --> 00:19:41.759
and drug dosing. The target is the perfect dose

00:19:41.759 --> 00:19:44.920
to cure the patient. If you give too little, an

00:19:44.920 --> 00:19:47.819
error below the target, the cost is a lack of

00:19:47.819 --> 00:19:50.839
efficacy. The patient's symptoms persist. Not

00:19:50.839 --> 00:19:53.079
ideal, but okay. But if you give too much, an

00:19:53.079 --> 00:19:55.359
error above the target by the exact same amount,

00:19:55.799 --> 00:19:58.359
the cost isn't just a lingering cough. The cost

00:19:58.359 --> 00:20:02.960
is severe toxicity. It could be lethal. You absolutely

00:20:02.960 --> 00:20:05.299
cannot square those errors and pretend they generate

00:20:05.299 --> 00:20:07.980
the same amount of loss. Or consider structural

00:20:07.980 --> 00:20:11.279
engineering or even ecological stress. The source

00:20:11.279 --> 00:20:14.180
mentions traffic, water pipes or ecosystems.

00:20:14.619 --> 00:20:17.059
Yes, those are prime examples of discontinuous

00:20:17.059 --> 00:20:20.539
loss. A bridge's structural beams or a city's

00:20:20.539 --> 00:20:22.859
traffic grid can often tolerate increased load

00:20:22.859 --> 00:20:25.579
or stress with very little noticeable change.

00:20:25.880 --> 00:20:28.819
So the system absorbs the error? Exactly. The

00:20:28.819 --> 00:20:31.579
loss function barely ticks upward. But it only

00:20:31.579 --> 00:20:33.779
absorbs it up to a critical threshold. And then?

00:20:33.779 --> 00:20:36.660
And then suddenly a single additional car causes

00:20:36.660 --> 00:20:39.420
total gridlock or the beam snaps and the bridge

00:20:39.420 --> 00:20:42.200
collapses catastrophically. The loss goes from

00:20:42.200 --> 00:20:44.839
near zero to infinity in an instant. So what
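[Editor's note: a sketch of the threshold loss described above, not from the source. The capacity and the tiny pre-threshold cost are made-up illustrative values.]

```python
def threshold_loss(load, capacity=100.0):
    """Discontinuous loss: the system absorbs stress almost for free
    below a critical threshold, then fails catastrophically above it."""
    if load <= capacity:
        return 0.01 * load      # barely ticks upward as load increases
    return float("inf")         # the beam snaps / total gridlock

print(threshold_loss(99))    # 0.99 -> near zero
print(threshold_loss(101))   # inf  -> catastrophe
```

A smooth quadratic fit to the sub-threshold region would be totally blind to the cliff at the capacity limit, which is exactly the failure mode the speakers warn about.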

00:20:44.839 --> 00:20:47.019
does this all mean? Like when our neat mathematical

00:20:47.019 --> 00:20:49.880
formulas say one thing and the messy asymmetric

00:20:49.880 --> 00:20:52.259
reality does another. If we connect this to the

00:20:52.259 --> 00:20:54.460
bigger picture. It means that choosing a loss

00:20:54.460 --> 00:20:56.720
function is not just an arbitrary math problem

00:20:56.720 --> 00:20:59.819
confined to a textbook. It requires deep, applied

00:20:59.819 --> 00:21:02.160
knowledge of the actual domain you are working

00:21:02.160 --> 00:21:05.680
in. If you apply a smooth, symmetric, continuous

00:21:05.680 --> 00:21:08.940
math formula to an asymmetric, discontinuous

00:21:08.940 --> 00:21:11.700
reality, you will build systems that are totally

00:21:11.700 --> 00:21:14.740
blind to catastrophe. That's terrifying. The

00:21:14.740 --> 00:21:16.420
source notes something remarkable, actually.

00:21:16.680 --> 00:21:20.740
Your concept of the best estimate actually changes

00:21:20.740 --> 00:21:23.059
depending on the loss function you choose to

00:21:23.059 --> 00:21:26.160
apply. Wait, really? The truth changes? The mathematical

00:21:26.160 --> 00:21:29.299
truth changes, yes. If you use a squared error

00:21:29.299 --> 00:21:32.640
loss, the mathematical mean, or average, is the

00:21:32.640 --> 00:21:35.319
best estimator to minimize loss. Okay. But if

00:21:35.319 --> 00:21:37.720
you shift to an absolute difference loss, suddenly

00:21:37.720 --> 00:21:39.859
the median is the mathematically proven best

00:21:39.859 --> 00:21:42.619
way to minimize expected loss. Your entire view
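[Editor's note: a quick Python demonstration of the claim above. The skewed sample and the brute-force grid search are illustrative choices, but the result is the standard one: squared-error loss is minimized at the mean, absolute-error loss at the median.]

```python
# The "best estimate" depends on the loss rule you apply.
data = [1, 2, 3, 4, 100]   # deliberately skewed sample

def total_loss(estimate, loss):
    """Total loss of a single estimate over the whole sample."""
    return sum(loss(x - estimate) for x in data)

# Brute-force search over a fine grid of candidate estimates.
candidates = [x / 10 for x in range(0, 1001)]
best_squared = min(candidates, key=lambda c: total_loss(c, lambda e: e * e))
best_absolute = min(candidates, key=lambda c: total_loss(c, abs))

print(best_squared)   # 22.0 -> the mean of the data
print(best_absolute)  # 3.0  -> the median of the data
```

Same data, two different rules of regret, two very different "truths": the outlier at 100 drags the squared-error optimum all the way to 22, while the absolute-error optimum sits calmly at the median.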

00:21:42.619 --> 00:21:44.960
of reality shifts depending on the rule of regret

00:21:44.960 --> 00:21:47.500
you choose to apply to it. It completely reframes

00:21:47.500 --> 00:21:49.799
how you look at every decision. We started this

00:21:49.799 --> 00:21:51.940
deep dive looking at Pierre-Simon Laplace and

00:21:51.940 --> 00:21:53.819
Abraham Wald trying to figure out how to put

00:21:53.819 --> 00:21:55.720
a hard number on a mistake. We covered a lot

00:21:55.720 --> 00:21:58.720
of ground. We really did. We explored algorithms

00:21:58.720 --> 00:22:01.099
deciding university budgets in Germany based

00:22:01.099 --> 00:22:05.000
on human shrugs and indifference points. We navigated

00:22:05.000 --> 00:22:08.140
the fog of Bayesian probability, packed our minimax

00:22:08.140 --> 00:22:10.640
snowsuits, and journeyed all the way to the sheer

00:22:10.640 --> 00:22:13.640
terrifying asymmetry of missing a flight or snapping

00:22:13.640 --> 00:22:16.220
a bridge. It serves as a powerful reminder that

00:22:16.220 --> 00:22:19.269
behind every algorithm, behind every economic

00:22:19.269 --> 00:22:22.130
policy, and behind every machine learning model

00:22:22.130 --> 00:22:24.509
curating your digital life, there's a loss function.

00:22:24.609 --> 00:22:28.369
Yeah. And someone, somewhere, decided what counts

00:22:28.369 --> 00:22:30.710
as a mistake, and exactly how heavily to punish

00:22:30.710 --> 00:22:32.069
it. And that leaves me with a thought I want

00:22:32.069 --> 00:22:33.789
to pass on to you, the listener. Think about

00:22:33.789 --> 00:22:36.710
your own daily choices. What implicit loss function

00:22:36.710 --> 00:22:39.329
is running in your head right now? Are you operating

00:22:39.329 --> 00:22:42.069
on a symmetric quadratic loss where you treat

00:22:42.069 --> 00:22:44.309
every single mistake, whether it's burning toast

00:22:44.309 --> 00:22:46.769
or blowing a job interview, with equal escalating

00:22:46.769 --> 00:22:49.829
mathematical weight? Or are you secretly driven

00:22:49.829 --> 00:22:52.529
by an asymmetric loss function where the sheer

00:22:52.529 --> 00:22:54.849
terror of arriving just one minute too late to

00:22:54.849 --> 00:22:57.789
an opportunity is quietly dictating your entire

00:22:57.789 --> 00:23:00.509
life? How might changing your own personal mathematical

00:23:00.509 --> 00:23:02.809
rules of regret change the very real decisions

00:23:02.809 --> 00:23:03.650
you make tomorrow?
