WEBVTT

00:00:00.000 --> 00:00:02.459
Imagine for a second, you know, that you're making

00:00:02.459 --> 00:00:05.559
a highly complex decision. Right. But the catch

00:00:05.559 --> 00:00:08.460
is the outcome of every single choice you make

00:00:08.460 --> 00:00:11.439
is just slightly unpredictable. Which is, I mean,

00:00:11.640 --> 00:00:13.820
pretty much all of life, right? Exactly. Think

00:00:13.820 --> 00:00:16.239
about navigating your career over the next decade

00:00:16.239 --> 00:00:20.170
or, if you want to get a bit more technical, imagine

00:00:20.170 --> 00:00:23.609
trying to manage the routing of millions of data

00:00:23.609 --> 00:00:26.449
packets in a global telecom network. Yeah, you

00:00:26.449 --> 00:00:28.809
can make a choice. You can take an action, but

00:00:28.809 --> 00:00:33.299
you can never be 100% certain of what the exact

00:00:33.299 --> 00:00:35.420
result will be. Because the environment always

00:00:35.420 --> 00:00:37.600
pushes back. Exactly. No matter how precise your

00:00:37.600 --> 00:00:39.399
intentions are, there's always this inherent

00:00:39.399 --> 00:00:42.859
element of randomness or, you know, noise between

00:00:42.859 --> 00:00:45.880
your action and the final outcome. So navigating

00:00:45.880 --> 00:00:48.619
life or even designing an artificial intelligence

00:00:48.619 --> 00:00:51.259
to do it for us is essentially like playing a

00:00:51.259 --> 00:00:53.240
board game where the dice are slightly loaded.

00:00:53.320 --> 00:00:55.159
Yeah. And the rules occasionally shift while

00:00:55.159 --> 00:00:57.299
you're looking the other way. So how do we actually

00:00:57.299 --> 00:00:59.710
map that out mathematically? Today we are going

00:00:59.710 --> 00:01:02.490
on a deep dive into the Markov decision process,

00:01:03.490 --> 00:01:06.569
or MDP for short. Yeah, it's such an elegant

00:01:06.569 --> 00:01:09.370
concept. It remains the foundational mathematical

00:01:09.370 --> 00:01:11.950
framework for sequential decision making under

00:01:11.950 --> 00:01:14.769
uncertainty. And we are pulling from a really

00:01:14.769 --> 00:01:17.569
comprehensive Wikipedia article on the subject

00:01:17.569 --> 00:01:19.980
for this deep dive. We are. The framework was

00:01:19.980 --> 00:01:22.379
originally formalized back in the 1950s within

00:01:22.379 --> 00:01:25.079
the field of operations research, and it borrows

00:01:25.079 --> 00:01:27.459
its name from the Russian mathematician Andrei

00:01:27.459 --> 00:01:30.620
Markov, specifically relying on the Markov property.

00:01:30.829 --> 00:01:33.609
Right, the Markov property, which is this assumption

00:01:33.609 --> 00:01:35.890
that the future depends only on your present

00:01:35.890 --> 00:01:38.549
state. It completely ignores the sequence of

00:01:38.549 --> 00:01:40.849
events that preceded it. Your current situation

00:01:40.849 --> 00:01:43.590
is literally all that matters for the next step.
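
(In symbols, the standard statement of the Markov property looks like this; the notation is conventional, not quoted from the episode.)

$$
\Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = \Pr(s_{t+1} \mid s_t, a_t)
$$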

00:01:43.989 --> 00:01:46.189
You don't need to consult a history book to calculate

00:01:46.189 --> 00:01:49.250
your next best move. Exactly. The past is irrelevant

00:01:49.250 --> 00:01:51.989
to the math. So our mission today is to give

00:01:51.989 --> 00:01:54.969
you, the listener, a shortcut to being deeply

00:01:54.969 --> 00:01:57.069
informed on how this actually works under the

00:01:57.069 --> 00:01:59.700
hood. We want to take these incredibly dense

00:01:59.700 --> 00:02:02.659
concepts like value iteration, Q-learning, and

00:02:02.659 --> 00:02:05.000
continuous time differential equations and just

00:02:05.000 --> 00:02:07.200
translate them into a relatable framework. A

00:02:07.200 --> 00:02:10.039
framework that really explains how systems, and

00:02:10.039 --> 00:02:13.580
frankly humans, learn to survive in totally unpredictable

00:02:13.580 --> 00:02:16.840
environments. Okay, let's unpack this. Since

00:02:16.840 --> 00:02:18.439
we're talking to an audience that already knows

00:02:18.439 --> 00:02:21.120
their way around basic modeling, we can fly through

00:02:21.120 --> 00:02:23.240
the standard blueprint pretty quickly. Right,

00:02:23.300 --> 00:02:25.560
the classic four-tuple. Yeah, the four-tuple.

00:02:25.780 --> 00:02:28.699
States... Actions, probabilities, and rewards.

00:02:29.300 --> 00:02:32.759
S, A, P, and R. Which is basically the board,

00:02:32.939 --> 00:02:35.219
the pieces, and the rules. Right. S is where

00:02:35.219 --> 00:02:38.099
you are. A is your menu of available moves. And

00:02:38.099 --> 00:02:42.180
R is the immediate payoff. But the tricky variable

00:02:42.180 --> 00:02:44.860
here is P, right? The probability transition.
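
(A minimal sketch of the four-tuple as code; the states, actions, and numbers below are invented purely for illustration.)

```python
# The (S, A, P, R) four-tuple for a tiny, made-up MDP.
S = ["low", "high"]                    # states: the board
A = ["wait", "work"]                   # actions: the menu of moves

# P[s][a] maps each possible next state to its probability.
P = {
    "low":  {"wait": {"low": 1.0},
             "work": {"low": 0.3, "high": 0.7}},
    "high": {"wait": {"high": 0.9, "low": 0.1},
             "work": {"high": 1.0}},
}

# R[s][a] is the immediate payoff for taking action a in state s.
R = {
    "low":  {"wait": 0.0, "work": -1.0},
    "high": {"wait": 1.0, "work": 2.0},
}
```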

00:02:45.020 --> 00:02:47.400
Yeah, P is where things get messy. Because how

00:02:47.400 --> 00:02:49.319
does the math actually handle the likelihood

00:02:49.319 --> 00:02:51.879
of landing in a specific future state? Well,

00:02:51.879 --> 00:02:54.000
what's fascinating here is how the mathematics

00:02:54.000 --> 00:02:56.500
of that probability transition, that P variable,

00:02:56.900 --> 00:02:59.120
has to warp depending on the nature of your environment.

00:02:59.479 --> 00:03:02.520
Warp how? So if your state space is discrete,

00:03:02.520 --> 00:03:05.180
like the finite squares on a chessboard,

00:03:05.639 --> 00:03:07.919
the probability is calculated using a simple

00:03:07.919 --> 00:03:10.219
counting measure. You're essentially just adding

00:03:10.219 --> 00:03:12.919
up distinct separate outcomes to find your odds.
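
(A sketch of that counting idea, reusing the invented transition table above: the odds of an event are just a sum over its distinct outcomes.)

```python
# P(. | s="low", a="work") from the toy table: distinct, countable outcomes.
next_dist = {"low": 0.3, "high": 0.7}

# Odds of landing in a target set of states: add up the separate outcomes.
target = {"high"}
odds = sum(prob for state, prob in next_dist.items() if state in target)
print(odds)  # 0.7
```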

00:03:12.960 --> 00:03:15.900
It's very clean. But what if it's not a chessboard?

00:03:15.960 --> 00:03:18.620
What if it's fluid? Right. If your states are

00:03:18.620 --> 00:03:21.520
continuous, say you're tracking the precise temperature

00:03:21.520 --> 00:03:24.599
of a server rack down to the decimal, the math

00:03:24.599 --> 00:03:27.719
shifts. You have to use the Lebesgue measure.

00:03:27.900 --> 00:03:30.759
Lebesgue measure. Yeah. Instead of counting individual

00:03:30.759 --> 00:03:33.939
drops of water, you're integrating over a continuous

00:03:33.939 --> 00:03:37.360
area to calculate the volume of a fluid. The

00:03:37.360 --> 00:03:39.219
probability becomes a curve that you actually

00:03:39.219 --> 00:03:43.080
have to find the area under.
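
(A sketch of the continuous case with invented numbers: the next state is a temperature, the transition law is a density, and the odds are an area found by numerical integration.)

```python
import math

def density(x, mean=21.0, sd=0.5):
    # An invented Gaussian density for the next temperature reading.
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# P(20.5 <= next temperature <= 21.5): the area under the curve.
lo, hi, n = 20.5, 21.5, 10_000
dx = (hi - lo) / n
area = sum(density(lo + (i + 0.5) * dx) for i in range(n)) * dx
print(round(area, 3))  # ~0.683, the familiar one-standard-deviation band
```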

00:03:43.080 --> 00:03:45.379
OK, let's ground this in a classic example from control theory

00:03:45.379 --> 00:03:48.360
to make it real. The pole balancing model. Oh,

00:03:48.659 --> 00:03:50.599
classic. Imagine you're trying to balance a long

00:03:50.599 --> 00:03:54.120
broom on the palm of your hand. The state, S,

00:03:54.360 --> 00:03:56.300
is made up of continuous measurements, right?

00:03:56.379 --> 00:03:58.360
Yeah, the position of your hand, the speed it's

00:03:58.360 --> 00:04:00.819
moving, the angle of the broom, the angular velocity

00:04:00.819 --> 00:04:04.419
of its fall. Exactly. And the action, A, is applying

00:04:04.419 --> 00:04:07.219
force to the left or the right. And your reward,

00:04:07.460 --> 00:04:09.960
R, is simply a one if the broom stays up and

00:04:09.960 --> 00:04:12.520
a zero if it drops. Right. But wait, I want to

00:04:12.520 --> 00:04:14.379
challenge something here. If pole

00:04:14.379 --> 00:04:17.240
balancing is driven by the strict deterministic

00:04:17.240 --> 00:04:20.439
laws of classical mechanics, why do we need that

00:04:20.439 --> 00:04:24.379
P variable at all? Isn't physics entirely predictable?

00:04:25.480 --> 00:04:27.920
Theoretically, yes. Gravity and momentum follow

00:04:27.920 --> 00:04:30.379
very strict equations. Right. But in practice,

00:04:30.800 --> 00:04:33.180
physical execution is messy. Think about it.

00:04:33.560 --> 00:04:35.459
If we're talking about a robotic cart trying

00:04:35.459 --> 00:04:38.680
to balance that pole, the sensors reading the

00:04:38.680 --> 00:04:41.319
angle might have a millisecond of latency. Oh,

00:04:41.319 --> 00:04:44.920
sure. Or the motor might experience a tiny, microscopic

00:04:44.920 --> 00:04:47.680
voltage drop, causing it to push slightly weaker

00:04:47.680 --> 00:04:50.579
than the math commanded it to. I see. The MDP

00:04:50.579 --> 00:04:53.579
framework uses that probability transition, P,

00:04:53.800 --> 00:04:56.100
to mathematically absorb the noise of reality.
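
(A sketch of that idea with invented constants: the commanded move is deterministic, and P shows up as Gaussian noise smeared around the intent.)

```python
import random

def noisy_step(position, commanded_move, noise_sd=0.05):
    # Intended physics: position + commanded_move. Sensor latency and motor
    # variation are absorbed as Gaussian noise around that intent.
    return position + commanded_move + random.gauss(0.0, noise_sd)

random.seed(0)
print([round(noisy_step(0.0, 2.0), 3) for _ in range(5)])
# Five commanded "move 2.0" actions land near, but never exactly on, 2.0.
```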

00:04:56.620 --> 00:04:58.680
You might command the cart to move exactly two

00:04:58.680 --> 00:05:01.060
inches to the right, but P dictates the bell

00:05:01.060 --> 00:05:02.899
curve of where it actually ends up. OK, that

00:05:02.899 --> 00:05:04.860
makes total sense. Yeah. So the four-tuple gives

00:05:04.860 --> 00:05:06.980
us the board, the pieces, and the physics engine

00:05:06.980 --> 00:05:09.629
of our game. But to actually win, we need an

00:05:09.629 --> 00:05:12.290
objective. We need a strategy. Right. Which the

00:05:12.290 --> 00:05:14.350
framework calls a policy, denoted by the Greek

00:05:14.350 --> 00:05:16.769
letter pi. Yeah, a policy is simply a mapping

00:05:16.769 --> 00:05:19.589
function. Think of it as a massive lookup table

00:05:19.589 --> 00:05:22.269
telling the agent exactly which action to take

00:05:22.269 --> 00:05:24.750
for every single conceivable state. And the goal

00:05:24.750 --> 00:05:27.949
is to find the optimal policy, the one that maximizes

00:05:27.949 --> 00:05:30.769
the sum of all future rewards. Exactly. But here's

00:05:30.769 --> 00:05:32.990
the thing. If you're looking at a potentially

00:05:32.990 --> 00:05:36.250
infinite horizon of future decisions, you can't

00:05:36.250 --> 00:05:38.949
just treat a reward 10 years from now the same

00:05:38.949 --> 00:05:41.529
as a reward today. You really can't. Which brings

00:05:41.529 --> 00:05:44.589
us to the discount factor, gamma. Gamma, a number

00:05:44.589 --> 00:05:47.949
between zero and one. Yes. Is setting gamma below

00:05:47.949 --> 00:05:51.310
one essentially like reverse compound interest

00:05:51.310 --> 00:05:54.509
or like financial inflation? How do you mean?

00:05:54.790 --> 00:05:56.889
Like deciding between eating one marshmallow

00:05:56.889 --> 00:05:59.050
right now versus two marshmallows in an hour.

00:05:59.470 --> 00:06:01.589
A dollar today has more purchasing power than

00:06:01.589 --> 00:06:04.389
a dollar in a decade. So we mathematically force

00:06:04.389 --> 00:06:07.139
the AI to care about the near term. That is a

00:06:07.139 --> 00:06:09.240
highly accurate way to look at it. It acts as

00:06:09.240 --> 00:06:12.000
an inflation rate on the value of success. If

00:06:12.000 --> 00:06:15.139
gamma is low, say 0.1, the inflation is brutal.

00:06:15.560 --> 00:06:17.480
The algorithm only cares about the immediate

00:06:17.480 --> 00:06:19.920
payoff because future rewards are deemed worthless.
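
(A sketch of that inflation effect: a constant reward of 1 per step over 100 steps, totaled under a brutal gamma and a patient one. Numbers invented.)

```python
def discounted_return(rewards, gamma):
    # Each reward t steps away is deflated by gamma**t before it is summed.
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0] * 100
print(round(discounted_return(rewards, 0.1), 2))   # ~1.11: only the next step or two matter
print(round(discounted_return(rewards, 0.99), 2))  # ~63.4: distant rewards still count
# With gamma < 1, the infinite-horizon sum converges toward 1 / (1 - gamma).
```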

00:06:20.199 --> 00:06:24.079
But if gamma is 0.99, it's willing to invest

00:06:24.079 --> 00:06:26.980
in long -term strategies. But why not just set

00:06:26.980 --> 00:06:29.959
gamma to exactly one? Wouldn't we just value

00:06:29.959 --> 00:06:33.019
all future rewards equally? Why do we specifically

00:06:33.019 --> 00:06:35.579
need to be short-sighted? Because of the math

00:06:35.579 --> 00:06:38.620
of infinity. If an algorithm runs forever and

00:06:38.620 --> 00:06:41.560
it values all future rewards equally, the total

00:06:41.560 --> 00:06:45.379
expected reward adds up to infinity. You can't

00:06:45.379 --> 00:06:47.959
mathematically compare two strategies if they

00:06:47.959 --> 00:06:50.839
both just yield an infinite score. Right. Infinity

00:06:50.839 --> 00:06:54.629
equals infinity. Exactly. By enforcing a discount

00:06:54.629 --> 00:06:57.529
factor, the value of those distant future rewards

00:06:57.529 --> 00:07:00.209
eventually decays to zero, which allows the total

00:07:00.209 --> 00:07:02.970
sum to converge on a finite number. It literally

00:07:02.970 --> 00:07:05.589
forces the math to be solvable. It does, though

00:07:05.589 --> 00:07:08.029
I should mention there are alternatives. In learning

00:07:08.029 --> 00:07:09.870
theory, researchers sometimes use what's called

00:07:09.870 --> 00:07:12.250
the H-step return. The H-step return? Yeah.

00:07:12.449 --> 00:07:14.569
Instead of an inflation rate slowly fading the

00:07:14.569 --> 00:07:17.529
future out, the H-step return enforces a hard,

00:07:17.810 --> 00:07:20.870
strict deadline. Like a cutoff. Exactly. It tells

00:07:20.870 --> 00:07:23.100
the agent to maximize the reward over the next

00:07:23.100 --> 00:07:25.160
H steps, weighting them all equally, and

00:07:25.160 --> 00:07:27.420
ignoring absolutely everything that happens after

00:07:27.420 --> 00:07:30.420
that deadline.
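
(For contrast, a sketch of the H-step return: equal weight inside the window, nothing after it.)

```python
def h_step_return(rewards, H):
    # A hard deadline: sum the next H rewards equally, ignore the rest.
    return sum(rewards[:H])

print(h_step_return([1.0] * 100, 8))  # 8.0: nothing past the deadline counts
```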

00:07:30.420 --> 00:07:32.680
Wow. And there is a reassuring mathematical guarantee built into this framework

00:07:32.680 --> 00:07:35.300
too, right? Oh, regarding deterministic policies.

00:07:35.379 --> 00:07:38.360
Yeah. If your environment is entirely deterministic,

00:07:39.019 --> 00:07:41.800
meaning taking action x in state y will yield

00:07:41.800 --> 00:07:45.300
the exact same reward 100% of the time, there

00:07:45.300 --> 00:07:48.060
will always be an optimal policy that is also

00:07:48.060 --> 00:07:51.470
purely deterministic. Yes. The agent will never

00:07:51.470 --> 00:07:53.730
need to randomly roll the dice on its actions

00:07:53.730 --> 00:07:57.350
to succeed. But, you know, knowing an optimal

00:07:57.350 --> 00:07:59.649
policy exists mathematically is very different

00:07:59.649 --> 00:08:01.709
from a computer actually calculating it. Oh,

00:08:01.949 --> 00:08:04.389
totally different. If an AI is playing out millions

00:08:04.389 --> 00:08:06.529
of futures, where is it storing all that data

00:08:06.529 --> 00:08:10.050
to compare it? So, to solve finite MDPs, computers

00:08:10.050 --> 00:08:13.129
rely on dynamic programming, which requires allocating

00:08:13.129 --> 00:08:15.430
memory for two massive data structures. Arrays,

00:08:15.470 --> 00:08:18.189
right. Yeah. A value array which estimates the

00:08:18.189 --> 00:08:20.350
expected long-term reward for being in a specific

00:08:20.350 --> 00:08:23.300
state, and a policy array, which stores the best

00:08:23.300 --> 00:08:25.819
known action for that state. And how do we actually

00:08:25.819 --> 00:08:27.579
fill those arrays? Well, the first foundational

00:08:27.579 --> 00:08:30.620
method is value iteration, formulated by Richard

00:08:30.620 --> 00:08:33.919
Bellman back in 1957. It uses backward induction.

00:08:34.159 --> 00:08:35.919
And the crazy thing is, the algorithm doesn't

00:08:35.919 --> 00:08:38.720
even attempt to store a specific policy at first.

00:08:39.200 --> 00:08:42.980
It literally populates the value array with completely

00:08:42.980 --> 00:08:47.009
random guesses. Wait. Just garbage random numbers.

00:08:47.370 --> 00:08:50.830
Total garbage. Then, it sweeps through the entire

00:08:50.830 --> 00:08:53.309
state space, looking at the neighboring states,

00:08:53.750 --> 00:08:56.269
and recursively updates its guesses based on

00:08:56.269 --> 00:08:59.190
the highest possible expected returns of those

00:08:59.190 --> 00:09:01.590
neighbors. But wait, if it starts with garbage

00:09:01.590 --> 00:09:04.389
random guesses, how does recursively updating

00:09:04.389 --> 00:09:07.129
garbage eventually lead to the perfect mathematical

00:09:07.129 --> 00:09:09.870
truth? It works because of the Banach fixed-point

00:09:09.870 --> 00:09:12.870
theorem. The Banach fixed-point theorem. Yes. As

00:09:12.870 --> 00:09:15.169
long as your discount factor, your gamma, is

00:09:15.169 --> 00:09:17.809
strictly less than one, the theorem proves that

00:09:17.809 --> 00:09:20.049
the mathematical distance between the algorithm's

00:09:20.049 --> 00:09:22.850
current guess and the true optimal value inevitably

00:09:22.850 --> 00:09:25.269
shrinks with every single sweep. So the errors

00:09:25.269 --> 00:09:28.169
just get compressed out. Exactly. It iterates

00:09:28.169 --> 00:09:30.149
until the numbers converge and stop changing

00:09:30.149 --> 00:09:32.210
entirely. That's incredible.
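
(A minimal value-iteration sketch over the invented toy tables from earlier: garbage initial guesses, swept until the numbers stop changing.)

```python
import random

S = ["low", "high"]
A = ["wait", "work"]
P = {"low":  {"wait": {"low": 1.0}, "work": {"low": 0.3, "high": 0.7}},
     "high": {"wait": {"high": 0.9, "low": 0.1}, "work": {"high": 1.0}}}
R = {"low": {"wait": 0.0, "work": -1.0}, "high": {"wait": 1.0, "work": 2.0}}
gamma = 0.9

random.seed(0)
V = {s: random.random() for s in S}   # populate the value array with garbage

while True:
    delta = 0.0
    for s in S:
        # Back up the best expected return reachable from s via its neighbors.
        best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                   for a in A)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-6:                  # converged: the sweep changed nothing
        break

print(V)
```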

00:09:32.210 --> 00:09:34.350
And then we have the second major method, right? Policy iteration.

00:09:34.690 --> 00:09:37.509
Introduced by Ronald Howard in 1960. Yeah. And

00:09:37.509 --> 00:09:40.090
instead of just guessing values, policy iteration

00:09:40.090 --> 00:09:42.610
constantly alternates between two distinct steps.

00:09:42.769 --> 00:09:46.309
Right. It evaluates and improves. First, it locks

00:09:46.309 --> 00:09:48.730
in its current policy and calculates exactly

00:09:48.730 --> 00:09:51.669
how much long-term reward that specific strategy

00:09:51.669 --> 00:09:54.309
will yield. Okay. Once it knows the value of

00:09:54.309 --> 00:09:57.049
its current behavior, it scans the states to

00:09:57.049 --> 00:09:59.730
see if flipping a single action would increase

00:09:59.730 --> 00:10:02.429
that value. And if it does? If it finds a better

00:10:02.429 --> 00:10:05.190
action, it updates the policy array and then

00:10:05.190 --> 00:10:07.649
starts the evaluation phase all over again.
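
(A matching policy-iteration sketch on the same invented tables: evaluate the locked-in policy, then flip any action that raises the value.)

```python
S = ["low", "high"]
A = ["wait", "work"]
P = {"low":  {"wait": {"low": 1.0}, "work": {"low": 0.3, "high": 0.7}},
     "high": {"wait": {"high": 0.9, "low": 0.1}, "work": {"high": 1.0}}}
R = {"low": {"wait": 0.0, "work": -1.0}, "high": {"wait": 1.0, "work": 2.0}}
gamma = 0.9

def q(s, a, V):
    # Expected return of taking a in s, then following the current values.
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())

policy = {s: "wait" for s in S}          # an arbitrary starting policy

while True:
    # 1. Evaluation: compute the value of the current policy (by iteration).
    V = {s: 0.0 for s in S}
    for _ in range(1000):
        V = {s: q(s, policy[s], V) for s in S}
    # 2. Improvement: flip any single action that increases the value.
    new_policy = {s: max(A, key=lambda a, s=s: q(s, a, V)) for s in S}
    if new_policy == policy:             # no flip helps: policy is optimal
        break
    policy = new_policy

print(policy)
```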

00:10:07.649 --> 00:10:10.090
The history behind this is amazing to me. Ronald

00:10:10.090 --> 00:10:13.639
Howard didn't invent policy iteration for some

00:10:13.639 --> 00:10:16.679
high-tech robotics lab. No, he didn't. He invented

00:10:16.679 --> 00:10:19.159
it to solve a massive logistical nightmare for

00:10:19.159 --> 00:10:21.639
the Sears catalog. It's such a great story.

00:10:21.799 --> 00:10:23.299
He needed to figure out exactly who should get

00:10:23.299 --> 00:10:25.940
a physical catalog mailed to them to maximize

00:10:25.940 --> 00:10:28.240
profits. And to map that to our four-tuple,

00:10:28.580 --> 00:10:30.580
Howard treated the customer's purchasing history

00:10:30.580 --> 00:10:32.740
over the last six months as the state. Right.

00:10:33.080 --> 00:10:36.059
The action was a simple binary choice. Spend

00:10:36.059 --> 00:10:38.960
the 50 cents to mail a catalog or don't. And

00:10:38.960 --> 00:10:41.679
the probability? The transition probability was

00:10:41.679 --> 00:10:44.539
the statistical likelihood that sending the catalog

00:10:44.539 --> 00:10:47.080
would prompt the customer to buy a $10 item,

00:10:47.480 --> 00:10:49.559
transitioning them into a new state of purchasing

00:10:49.559 --> 00:10:53.320
frequency. That is wild. Yeah. By bouncing back

00:10:53.320 --> 00:10:55.639
and forth between evaluating the mailing strategy

00:10:55.639 --> 00:10:58.460
and improving it, he found the exact mathematical

00:10:58.460 --> 00:11:01.399
threshold where the cost of mailing broke even

00:11:01.399 --> 00:11:04.100
with the expected future revenue. But here is

00:11:04.100 --> 00:11:06.440
where I think the pure math hits a brick wall.

00:11:06.620 --> 00:11:09.629
OK, lay it on me. This dynamic programming operates

00:11:09.629 --> 00:11:12.750
in polynomial time, which is generally efficient,

00:11:13.490 --> 00:11:15.549
but it suffers from the curse of dimensionality.

00:11:17.419 --> 00:11:20.059
The curse. Because if I add just a few more variables

00:11:20.059 --> 00:11:22.500
to my state space, say we are routing those telecom

00:11:22.500 --> 00:11:24.879
packets we mentioned earlier, and we add latency,

00:11:25.259 --> 00:11:27.100
packet size, and server temperature to the mix.

00:11:27.299 --> 00:11:30.259
The number of possible states just explodes exponentially.
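
(The arithmetic behind that explosion, with invented numbers: discretize each of k state variables into n levels and the table needs n**k entries.)

```python
n_levels, n_variables = 100, 6          # e.g. latency, packet size, temp...
print(n_levels ** n_variables)          # 1,000,000,000,000 states to store
```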

00:11:30.600 --> 00:11:32.840
Exactly. The arrays become too massive for any

00:11:32.840 --> 00:11:35.419
computer's RAM to hold. Yeah. So how is this framework

00:11:35.419 --> 00:11:38.320
actually useful in modern, complex applications

00:11:38.320 --> 00:11:40.700
if the state space basically breaks the hardware

00:11:40.700 --> 00:11:43.080
immediately? Well, if we connect this to the

00:11:43.080 --> 00:11:45.340
bigger picture, you are highlighting exactly

00:11:45.340 --> 00:11:48.720
why engineers rarely use pure exhaustive dynamic

00:11:48.720 --> 00:11:51.720
programming today. We don't. No, we use targeted

00:11:51.720 --> 00:11:54.940
approximations. Techniques like prioritized sweeping.

00:11:55.179 --> 00:11:57.639
Prioritized sweeping? Yeah, instead of updating

00:11:57.639 --> 00:11:59.899
the value of every single state in the array

00:11:59.899 --> 00:12:02.980
equally, the algorithm calculates a mathematical

00:12:02.980 --> 00:12:06.259
error margin for each state. It preferentially

00:12:06.259 --> 00:12:09.480
spends its computational energy updating the

00:12:09.480 --> 00:12:11.559
states with the largest errors. The states where

00:12:11.559 --> 00:12:14.480
the action is actually happening. Exactly. Ignoring

00:12:14.480 --> 00:12:16.700
the millions of corner cases that will likely

00:12:16.700 --> 00:12:19.259
never occur.
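
(A simplified sketch of that idea on the earlier toy tables: measure each state's Bellman error and spend every update on the worst offender. Real implementations keep a priority queue over predecessor states.)

```python
S = ["low", "high"]
A = ["wait", "work"]
P = {"low":  {"wait": {"low": 1.0}, "work": {"low": 0.3, "high": 0.7}},
     "high": {"wait": {"high": 0.9, "low": 0.1}, "work": {"high": 1.0}}}
R = {"low": {"wait": 0.0, "work": -1.0}, "high": {"wait": 1.0, "work": 2.0}}
gamma = 0.9

def bellman_error(s, V):
    # How far V[s] is from its own one-step backup, plus the backed-up value.
    best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
               for a in A)
    return abs(best - V[s]), best

V = {s: 0.0 for s in S}
while True:
    errors = {s: bellman_error(s, V) for s in S}
    s, (err, best) = max(errors.items(), key=lambda kv: kv[1][0])
    if err < 1e-6:        # every state is consistent; nothing left to sweep
        break
    V[s] = best           # update only the largest-error state

print(V)
```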

00:12:19.279 --> 00:12:22.269
Or we use Monte Carlo tree search, right? Which simply samples a few likely paths

00:12:22.269 --> 00:12:24.990
into the future rather than exhaustively calculating

00:12:24.990 --> 00:12:27.110
every single permutation of the universe. Spot

00:12:27.110 --> 00:12:29.929
on. But, you know, value iteration and policy

00:12:29.929 --> 00:12:32.309
iteration still fundamentally require you to

00:12:32.309 --> 00:12:34.649
know the rules of the game upfront. You need

00:12:34.649 --> 00:12:37.090
the transition probabilities. Right. The P variable.

00:12:37.250 --> 00:12:40.129
Yeah. What happens if you are a robotic agent

00:12:40.129 --> 00:12:43.070
dropped into an environment with zero instructions,

00:12:43.450 --> 00:12:45.809
totally blind to the physics of the world? That

00:12:45.809 --> 00:12:48.570
requires abandoning pure dynamic programming

00:12:49.070 --> 00:12:50.929
and moving to simulators and reinforcement learning.

00:12:51.070 --> 00:12:52.990
Exactly. If you don't have the probability matrix,

00:12:53.049 --> 00:12:55.250
you have to interact with the environment to

00:12:55.250 --> 00:12:57.090
reverse engineer it. So you use a simulator.

00:12:57.269 --> 00:12:59.730
You can use an episodic simulator, which drops

00:12:59.730 --> 00:13:02.750
you into a state and lets you take actions until

00:13:02.750 --> 00:13:05.669
you hit a failure or a success, generating a

00:13:05.669 --> 00:13:07.990
full trajectory of experience. OK. So that's

00:13:07.990 --> 00:13:10.049
like playing a level of a video game from start

00:13:10.049 --> 00:13:12.710
to finish, just learning the physics engine by

00:13:12.710 --> 00:13:15.429
trial and error. Exactly. Or you use a generative

00:13:15.429 --> 00:13:18.899
model, often written as G of S comma A. Right.

00:13:19.679 --> 00:13:22.679
This is a single step simulator. You hand it

00:13:22.679 --> 00:13:26.440
any state and any action, and it instantly generates

00:13:26.440 --> 00:13:28.679
the subsequent state and reward for you. Oh,

00:13:28.679 --> 00:13:31.360
wow. Yeah. It allows the algorithm to sample

00:13:31.360 --> 00:13:33.500
the environment's reactions without having to

00:13:33.500 --> 00:13:36.399
linearly play through the entire timeline just

00:13:36.399 --> 00:13:38.759
to reach that specific scenario.
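
(A sketch of that single-step interface with invented dynamics: hand G any state and action, get back a sampled next state and reward, no episode required.)

```python
import random

def G(state, action):
    # Hidden one-step dynamics: the caller never sees the probabilities,
    # only a sampled (next_state, reward) pair.
    drift = 1.0 if action == "right" else -1.0
    next_state = state + drift + random.gauss(0.0, 0.1)
    reward = 1.0 if abs(next_state) < 5.0 else 0.0
    return next_state, reward

random.seed(0)
print(G(0.0, "right"))   # a sampled (next_state, reward), on demand
```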

00:13:38.759 --> 00:13:41.759
So this brings us to reinforcement learning, or RL. But I'm

00:13:41.759 --> 00:13:44.539
stuck on one thing here. If the agent doesn't

00:13:44.539 --> 00:13:47.850
know the probability map, how on earth is it

00:13:47.850 --> 00:13:50.210
learning a mathematical policy just by taking

00:13:50.210 --> 00:13:53.669
actions? I mean, isn't button mashing just chaos?

00:13:54.549 --> 00:13:57.070
It can be. Here's where it gets really interesting.

00:13:57.210 --> 00:14:00.629
How does a random success turn into a formalized

00:14:00.629 --> 00:14:03.710
strategy? It relies on sample efficiency and

00:14:03.710 --> 00:14:06.950
a massive breakthrough algorithm called Q-learning.

00:14:07.049 --> 00:14:09.789
Q-learning? Yeah. Instead of trying to reconstruct

00:14:09.789 --> 00:14:12.509
that massive probability matrix of how the world

00:14:12.509 --> 00:14:15.389
works, Q-learning bypasses the physics entirely.

00:14:15.590 --> 00:14:17.649
It just ignores the rules. Pretty much. It builds

00:14:17.649 --> 00:14:20.330
a Q-table, where the Q stands for the quality

00:14:20.330 --> 00:14:23.730
of a specific state-action pair. But mechanically,

00:14:24.039 --> 00:14:26.519
how does the math update when it actually finds

00:14:26.519 --> 00:14:29.019
a reward? Through what's called a temporal difference

00:14:29.019 --> 00:14:30.840
update. Okay, walk me through that. Let's say

00:14:30.840 --> 00:14:33.220
the agent takes an action and unexpectedly hits

00:14:33.220 --> 00:14:35.860
a massive reward. It doesn't just record that

00:14:35.860 --> 00:14:37.940
the current state is good. The algorithm takes

00:14:37.940 --> 00:14:40.019
a fraction of that reward and mathematically

00:14:40.019 --> 00:14:43.159
pushes it backward to the Q-value of the state

00:14:43.159 --> 00:14:45.399
action pair that immediately preceded it. Like

00:14:45.399 --> 00:14:48.399
leaving a breadcrumb trail. Exactly like a mathematical

00:14:48.399 --> 00:14:51.559
breadcrumb trail. Over thousands of episodes,

00:14:51.840 --> 00:14:54.100
the rewards propagate backward through the Q-table,

00:14:54.100 --> 00:14:56.960
lighting up the best sequence of actions

00:14:56.960 --> 00:14:59.460
without the agent ever needing to calculate the

00:14:59.460 --> 00:15:01.720
underlying probability of the environment. It's

00:15:01.720 --> 00:15:05.000
learning entirely by consequence. Pure consequence.
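
(A bare-bones Q-learning sketch in an invented corridor world: no transition matrix anywhere, just the temporal-difference update pushing reward backward.)

```python
import random
from collections import defaultdict

random.seed(0)
actions = ["left", "right"]
Q = defaultdict(float)            # Q[(state, action)]: learned quality

def step(s, a):
    # Hidden environment: a six-cell corridor with the reward at cell 5.
    s2 = min(5, s + 1) if a == "right" else max(0, s - 1)
    return s2, (1.0 if s2 == 5 else 0.0)

alpha, gamma = 0.5, 0.9
for _ in range(2000):             # thousands of episodes of button mashing
    s = 0
    while s != 5:
        a = random.choice(actions)
        s2, r = step(s, a)
        best_next = max(Q[(s2, a2)] for a2 in actions)
        # Temporal-difference update: push a fraction of the surprise
        # backward onto the state-action pair that led here.
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

print(max(actions, key=lambda a: Q[(0, a)]))  # "right", learned by consequence
```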

00:15:05.159 --> 00:15:07.519
But even with Q-learning, we are assuming

00:15:07.519 --> 00:15:09.779
the agent knows exactly what state it's currently

00:15:09.779 --> 00:15:13.539
in. What if the sensors are broken? Or what if

00:15:13.539 --> 00:15:15.559
you're playing poker and the opponent's cards

00:15:15.559 --> 00:15:18.720
are hidden? That introduces the first major extension

00:15:18.720 --> 00:15:22.259
of the framework, partial observability or POMDP.

00:15:22.360 --> 00:15:25.159
The POMDP. Yeah. When an agent cannot directly

00:15:25.159 --> 00:15:27.679
observe its true state, it has to maintain a

00:15:27.679 --> 00:15:29.879
probability distribution over all the possible

00:15:29.879 --> 00:15:32.320
states it might be in. This is called a belief

00:15:32.320 --> 00:15:34.080
state. A belief state, okay. Every time it takes

00:15:34.080 --> 00:15:36.559
an action and receives an observation, it uses

00:15:36.559 --> 00:15:39.220
Bayes' theorem to update its belief state, and

00:15:39.220 --> 00:15:41.399
then it makes its next decision based on that

00:15:41.399 --> 00:15:43.480
distribution of probabilities rather than a concrete

00:15:43.480 --> 00:15:46.279
fact. Man, the math must get heavy.
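
(A sketch of one belief-state update with invented toy tables: predict where the move could have led, then re-weight with Bayes' theorem using the observation.)

```python
states = ["hall", "room"]
belief = {"hall": 0.5, "room": 0.5}    # the agent starts maximally unsure

# T[s][s2]: transition probabilities under the action just taken (invented).
T = {"hall": {"hall": 0.7, "room": 0.3}, "room": {"hall": 0.2, "room": 0.8}}
# O[s2][obs]: chance of seeing obs from each candidate true state (invented).
O = {"hall": {"light": 0.9, "dark": 0.1}, "room": {"light": 0.2, "dark": 0.8}}

def update_belief(belief, obs):
    # Predict the post-action distribution, weight it by the evidence,
    # then normalize so the belief sums to one.
    new = {s2: O[s2][obs] * sum(T[s][s2] * belief[s] for s in states)
           for s2 in states}
    total = sum(new.values())
    return {s2: p / total for s2, p in new.items()}

print(update_belief(belief, "dark"))   # probability mass shifts toward "room"
```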

00:15:46.279 --> 00:15:49.399
And what about constrained MDPs, or CMDPs? In the real

00:15:49.399 --> 00:15:51.600
world, you are rarely just trying to maximize

00:15:51.600 --> 00:15:54.879
one single reward. CMDPs are heavily utilized

00:15:54.879 --> 00:15:57.159
in robotics and autonomous driving, actually.

00:15:57.220 --> 00:15:59.840
Really? Yeah. Think about it. You want an autonomous

00:15:59.840 --> 00:16:02.519
car to reach its destination as fast as possible,

00:16:02.940 --> 00:16:06.279
but you have rigid constraints. It cannot exceed

00:16:06.279 --> 00:16:08.899
the speed limit, and it absolutely cannot hit

00:16:08.899 --> 00:16:11.419
a pedestrian. Right, obviously. The math here

00:16:11.419 --> 00:16:13.899
fundamentally changes. You can't solve it with

00:16:13.899 --> 00:16:16.139
standard dynamic programming anymore. You have

00:16:16.139 --> 00:16:18.820
to use linear programming to balance the multiple

00:16:18.820 --> 00:16:22.259
competing cost functions.
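
(The textbook shape of that constrained problem, in symbols rather than from the episode: maximize the usual discounted reward while each expected discounted cost stays under its budget.)

$$
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t} r(s_t, a_t)\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t} \gamma^{t} c_i(s_t, a_t)\right] \le d_i \;\;\text{for each constraint } i
$$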

00:16:22.519 --> 00:16:24.840
Wow. And fascinatingly, the optimal policy in a constrained environment

00:16:24.840 --> 00:16:27.200
actually changes depending on what state you

00:16:27.200 --> 00:16:29.480
start in. Because your starting battery life

00:16:29.480 --> 00:16:32.039
or starting position restricts the mathematical

00:16:32.039 --> 00:16:34.100
envelope of what moves are safely available to

00:16:34.100 --> 00:16:36.460
you. Exactly. But I think the most mind-bending

00:16:36.460 --> 00:16:38.860
extension we need to cover is the continuous

00:16:38.860 --> 00:16:42.360
time Markov decision process. Oh yeah. In a board game

00:16:42.360 --> 00:16:45.460
or even a digital simulator, time ticks in discrete

00:16:45.460 --> 00:16:48.799
turns, right? Step one, step two. Yeah, but during

00:16:48.799 --> 00:16:51.220
something like a viral epidemic or a continuous

00:16:51.220 --> 00:16:54.100
flow of data packets in a telecom queue, time

00:16:54.100 --> 00:16:57.080
is constantly flowing. How do you mathematically

00:16:57.080 --> 00:17:00.019
map a decision space when actions can happen

00:17:00.019 --> 00:17:02.440
at literally any millisecond? This raises an

00:17:02.440 --> 00:17:04.420
important question about how we actually digitize

00:17:04.420 --> 00:17:08.390
reality. In a standard MDP, time is a staircase,

00:17:08.609 --> 00:17:11.170
and you sum up discrete rewards over distinct

00:17:11.170 --> 00:17:14.730
steps. But in continuous time, the system dynamics

00:17:14.730 --> 00:17:17.069
are governed by ordinary differential equations,

00:17:17.589 --> 00:17:20.609
ODEs. So time becomes a smooth ramp. Exactly.

00:17:21.009 --> 00:17:23.450
The mathematics replace discrete summations with

00:17:23.450 --> 00:17:26.210
continuous integrals. You are calculating the

00:17:26.210 --> 00:17:28.410
shifting area under a curve. But if you aren't

00:17:28.410 --> 00:17:31.109
waiting for a discrete clock to tick, how do

00:17:31.109 --> 00:17:34.109
you calculate the optimal move? You use the Hamilton

00:17:34.109 --> 00:17:36.690
-Jacobi-Bellman equation, or the HJB equation.

00:17:36.690 --> 00:17:38.849
Okay, the HJB. Instead of looking ahead to step

00:17:38.849 --> 00:17:42.069
t plus 1, the HJB equation takes the limit as

00:17:42.069 --> 00:17:45.130
the change in time approaches zero. It uses partial

00:17:45.130 --> 00:17:47.269
derivatives to find the continuous rate of change.
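
(One standard deterministic form of those two ideas, using conventional symbols: ρ is a continuous discount rate and f gives the ODE dynamics. The objective swaps the discrete sum for an integral, and the HJB equation pins down the value's instantaneous rate of change.)

$$
V(s) = \max_{\pi} \int_{0}^{\infty} e^{-\rho t}\, r(s_t, a_t)\, dt, \qquad \dot{s} = f(s, a)
$$

$$
\rho\, V(s) = \max_{a}\Big[ r(s, a) + f(s, a) \cdot \nabla_{s} V(s) \Big]
$$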

00:17:47.349 --> 00:17:49.809
So it's constantly updating. Yes, it provides

00:17:49.809 --> 00:17:52.650
an optimal control vector that continuously steers

00:17:52.650 --> 00:17:55.670
the system based on instantaneous feedback rather

00:17:55.670 --> 00:17:58.329
than waiting for a turn to end. Like you're actively

00:17:58.329 --> 00:18:01.210
steering a ship through flowing water, constantly

00:18:01.210 --> 00:18:03.549
adjusting the rudder rather than moving a static

00:18:03.549 --> 00:18:06.170
piece on a grid. That's a great analogy. Okay,

00:18:06.390 --> 00:18:08.049
let's bring this all together. We started with

00:18:08.049 --> 00:18:11.569
the rigid four-tuple: states, actions, probabilities,

00:18:11.829 --> 00:18:14.210
and rewards. We looked at the math of balancing

00:18:14.210 --> 00:18:17.089
a pole, we saw how Ronald Howard conquered the

00:18:17.089 --> 00:18:20.150
Sears catalog mailing list, and we moved through

00:18:20.150 --> 00:18:23.049
the Wild West of Q-learning and continuous

00:18:23.049 --> 00:18:26.019
time differential equations. So what does this

00:18:26.019 --> 00:18:28.099
all mean for you, the listener? To put it simply,

00:18:28.420 --> 00:18:30.759
the theoretical mathematics of artificial intelligence

00:18:30.759 --> 00:18:33.519
really mirror the algorithms you run in your

00:18:33.519 --> 00:18:36.220
own head. Every single day, you are operating

00:18:36.220 --> 00:18:38.559
as an agent in a partially observable environment.

00:18:38.660 --> 00:18:41.440
You truly are. Every time you choose a new route

00:18:41.440 --> 00:18:44.079
to work or try a new communication style in a

00:18:44.079 --> 00:18:46.440
meeting, you are taking an action under deep

00:18:46.440 --> 00:18:49.660
uncertainty. You are updating your personal internal

00:18:49.660 --> 00:18:53.079
Q values based on the unexpected rewards or penalties

00:18:53.079 --> 00:18:55.799
you encounter. You are leaving yourself a breadcrumb

00:18:55.799 --> 00:18:58.240
trail of what works and what doesn't, slowly

00:18:58.240 --> 00:19:01.059
refining your optimal policy for life. And your

00:19:01.059 --> 00:19:03.839
personal discount factor, your gamma, dictates

00:19:03.839 --> 00:19:05.859
whether you are trading your future potential

00:19:05.859 --> 00:19:08.680
for an immediate payout or investing in actions

00:19:08.680 --> 00:19:11.059
that might not yield a reward for a decade. I

00:19:11.059 --> 00:19:13.359
love that. And I want to leave you with one final

00:19:13.359 --> 00:19:16.980
provocative thought. There is a deeply abstract

00:19:16.980 --> 00:19:20.230
mathematical field called category theory. And

00:19:20.230 --> 00:19:23.289
when applied to MDPs, it creates a context-dependent

00:19:23.289 --> 00:19:26.730
Markov decision process. Right. In this interpretation,

00:19:27.230 --> 00:19:29.309
moving from one context to another doesn't just

00:19:29.309 --> 00:19:32.130
change your state. It fundamentally alters the

00:19:32.130 --> 00:19:34.670
entire set of available actions and the laws

00:19:34.670 --> 00:19:37.069
of probability governing them. Meaning, the rules

00:19:37.069 --> 00:19:39.470
of the game, the very physics of your choices

00:19:39.470 --> 00:19:41.730
change entirely depending on the mathematical

00:19:41.730 --> 00:19:43.849
object or environment you place yourself in.

00:19:44.170 --> 00:19:46.640
I want you to ponder that in your own life. Think

00:19:46.640 --> 00:19:48.460
about that board game with the loaded dice we

00:19:48.460 --> 00:19:50.339
talked about at the very beginning. How does

00:19:50.339 --> 00:19:53.539
simply changing your context, moving to a new

00:19:53.539 --> 00:19:56.759
city, switching careers, or even just changing

00:19:56.759 --> 00:19:59.900
who you surround yourself with, completely rewrite

00:19:59.900 --> 00:20:02.420
your mathematical landscape for future possibilities?

00:20:03.039 --> 00:20:05.420
Most of the time, we are so focused on calculating

00:20:05.420 --> 00:20:08.299
the best move within our current game, we forget

00:20:08.299 --> 00:20:10.440
that we have the power to change the board entirely.
