WEBVTT

00:00:00.000 --> 00:00:04.339
You know, whether you are asking Siri to set

00:00:04.339 --> 00:00:07.419
a timer, or investing your life savings in the

00:00:07.419 --> 00:00:10.140
stock market, or even having your DNA analyzed,

00:00:10.519 --> 00:00:12.960
you are interacting with a ghost. A mathematical

00:00:12.960 --> 00:00:15.199
ghost, to be precise. Right, yeah, not a literal

00:00:15.199 --> 00:00:18.570
ghost, but there's this... this invisible architecture

00:00:18.570 --> 00:00:20.789
operating right beneath the surface of our digital

00:00:20.789 --> 00:00:23.670
lives. And it's constantly making these highly

00:00:23.670 --> 00:00:26.550
educated guesses about things it can't directly

00:00:26.550 --> 00:00:28.949
see. It is entirely behind the scenes. Exactly.

00:00:29.050 --> 00:00:31.469
So today we're opening the hood on that invisible

00:00:31.469 --> 00:00:34.009
architecture. We're doing a deep dive into the

00:00:34.009 --> 00:00:36.549
hidden Markov model. This is the mathematical

00:00:36.549 --> 00:00:38.950
engine that allows a machine to deduce the invisible

00:00:38.950 --> 00:00:41.829
forces shaping our world purely by looking at,

00:00:41.829 --> 00:00:44.679
well, the shadows they leave behind. We're using

00:00:44.679 --> 00:00:47.399
a comprehensive breakdown from Wikipedia as our

00:00:47.399 --> 00:00:50.340
source map today. It is a fantastic topic. The

00:00:50.340 --> 00:00:53.000
mission today is really to demystify how these

00:00:53.000 --> 00:00:55.640
algorithms actually work in practice. Yeah, so,

00:00:55.719 --> 00:00:57.759
okay, let's unpack this. At its core, a hidden

00:00:57.759 --> 00:01:01.140
Markov model or an HMM is basically a system

00:01:01.140 --> 00:01:03.439
split into two distinct layers, right? That's

00:01:03.439 --> 00:01:05.969
right, two layers. You have one layer, which

00:01:05.969 --> 00:01:08.209
is just a sequence of events we can actually

00:01:08.209 --> 00:01:10.290
observe in the real world. And then you have

00:01:10.290 --> 00:01:12.790
the second layer, which is a sequence of hidden

00:01:12.790 --> 00:01:15.409
states. We can't see them, but they are directly

00:01:15.409 --> 00:01:18.569
causing those observable events. And the primary

00:01:18.569 --> 00:01:21.370
objective of this model is to reverse-engineer

00:01:21.370 --> 00:01:24.510
reality. It tries to learn about that hidden

00:01:24.510 --> 00:01:27.090
layer strictly by observing the visible layer.

00:01:27.109 --> 00:01:29.989
Like playing detective. Exactly like that. And

00:01:29.989 --> 00:01:33.069
the engine that makes this possible, the core

00:01:33.069 --> 00:01:35.859
rule, is what we call the Markov property. Which

00:01:35.859 --> 00:01:37.540
is kind of a weird rule when you think about

00:01:37.540 --> 00:01:40.280
it. It is. It's a strict mathematical rule stating

00:01:40.280 --> 00:01:42.500
that the current hidden state is influenced only

00:01:42.500 --> 00:01:45.260
by the state immediately preceding it. It operates

00:01:45.260 --> 00:01:48.439
with zero long-term memory. Zero. None at all.

00:01:48.480 --> 00:01:51.060
Right. What happened two steps ago or ten steps

00:01:51.060 --> 00:01:53.239
ago doesn't mathematically matter to the current

00:01:53.239 --> 00:01:55.620
state. Which, I mean, if you think about how

00:01:55.620 --> 00:01:57.840
human memory works or even real-world physics,

00:01:57.900 --> 00:02:00.299
that sounds totally counterintuitive at first.

00:02:00.620 --> 00:02:03.340
Like, why force the system to have amnesia? Because

00:02:03.340 --> 00:02:06.030
without that amnesia, the calculations would

00:02:06.030 --> 00:02:09.430
literally break modern computers. By assuming

00:02:09.430 --> 00:02:12.289
that only the immediate past matters, the Markov

00:02:12.289 --> 00:02:15.629
property takes this chaotic, infinitely complex

00:02:15.629 --> 00:02:19.169
universe and simplifies it into calculable probabilities.
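
In symbols, that rule is the first-order Markov property. Writing X_t for the hidden state at step t:

P(X_t | X_{t-1}, X_{t-2}, ..., X_1) = P(X_t | X_{t-1})

Conditioning on the entire history buys you nothing beyond the single previous state.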

00:02:19.349 --> 00:02:21.610
Oh, I see. So it's a shortcut. A necessary one.

00:02:21.759 --> 00:02:24.759
If a system had to remember the entire history

00:02:24.759 --> 00:02:26.860
of everything that led up to a single moment,

00:02:27.099 --> 00:02:29.699
just to guess what happens next, the math becomes

00:02:29.699 --> 00:02:32.960
intractable. The Markov property trims the fat

00:02:32.960 --> 00:02:38.180
so the algorithm can actually run. I mean, the

00:02:38.180 --> 00:02:40.319
source material gives us two fantastic scenarios.

00:02:40.879 --> 00:02:42.659
Let's start with the Alice and Bob weather game.

00:02:42.800 --> 00:02:45.159
A classic example. Yeah, it's great. So imagine

00:02:45.159 --> 00:02:47.639
two friends, Alice and Bob, who live far apart

00:02:47.639 --> 00:02:50.800
and talk on the phone every day. Bob is a creature

00:02:50.800 --> 00:02:52.960
of habit. He really only does three things. He

00:02:52.960 --> 00:02:55.099
walks in the park, he goes shopping, or he cleans

00:02:55.099 --> 00:02:57.500
his apartment. And his choice is determined exclusively

00:02:57.500 --> 00:02:59.639
by the weather where he lives. Right, exclusively.

00:02:59.800 --> 00:03:01.819
The catch is that Alice has no idea what the

00:03:01.819 --> 00:03:04.900
weather is actually like where Bob lives. So

00:03:04.900 --> 00:03:06.860
the weather is the hidden state. And the only

00:03:06.860 --> 00:03:09.460
things Alice has access to are the observations.

00:03:10.139 --> 00:03:12.139
So Bob telling her, you know, hey, I cleaned

00:03:12.139 --> 00:03:15.419
my apartment today or I went for a walk. Exactly.

00:03:16.439 --> 00:03:19.400
But Alice isn't totally in the dark here. She

00:03:19.400 --> 00:03:21.860
knows the general rules of his city. Like, she

00:03:21.860 --> 00:03:24.280
knows it rains a lot there. She knows that if

00:03:24.280 --> 00:03:26.900
it's rainy today, there is maybe a 30% chance

00:03:26.900 --> 00:03:29.900
tomorrow will be sunny. And she knows his habits.

00:03:29.979 --> 00:03:31.780
Right. She knows that if it's raining, there's

00:03:31.780 --> 00:03:34.340
a 50% chance he'll stay in and clean. But if

00:03:34.340 --> 00:03:36.800
it's sunny, there's a 60% chance he'll go for

00:03:36.800 --> 00:03:39.340
a walk. So using just those observations and

00:03:39.340 --> 00:03:41.740
those probabilities, Alice has to reconstruct

00:03:41.740 --> 00:03:44.389
the hidden weather patterns day by day.
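
To make Alice's setup concrete, here is a minimal sketch of the model in Python. The 30%, 50%, and 60% figures are the ones quoted above; the starting belief and the remaining entries are illustrative assumptions, chosen only so that each row of probabilities sums to 1.

states = ("Rainy", "Sunny")
observations = ("walk", "shop", "clean")

# Assumed starting belief: Alice knows it rains a lot where Bob lives.
start_p = {"Rainy": 0.6, "Sunny": 0.4}

# Transition probabilities: P(tomorrow's weather | today's weather).
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},  # the 30% rain-to-sun chance above
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},  # assumed values
}

# Emission probabilities: P(Bob's activity | the hidden weather).
emit_p = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},  # the 50% cleaning chance
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},  # the 60% walking chance
}

These four tables are the entire model: with two hidden states, trans_p is the n-by-n (here 2-by-2) transition matrix discussed later, and emit_p holds the emission probabilities. The text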

00:03:44.389 --> 00:03:46.810
provides another, perhaps more mechanical example

00:03:46.810 --> 00:03:49.669
to illustrate this architecture. Imagine a genie

00:03:49.669 --> 00:03:52.349
in a hidden room. Okay, a genie. In this room,

00:03:52.349 --> 00:03:54.650
there are several urns, and each urn contains

00:03:54.650 --> 00:03:57.909
a specific known mix of uniquely labeled balls.

00:03:58.569 --> 00:04:01.409
The genie randomly picks an urn, draws a ball,

00:04:01.590 --> 00:04:03.750
and drops it onto a conveyor belt that leads

00:04:03.750 --> 00:04:06.169
out of the room. So you, standing outside the

00:04:06.169 --> 00:04:08.150
room, you just see the sequence of colored balls

00:04:08.150 --> 00:04:10.129
rolling down the conveyor belt. You never see

00:04:10.129 --> 00:04:13.780
the urns. Never. You only see the output. Watching

00:04:13.780 --> 00:04:16.079
this model feels a lot like sitting in Plato's

00:04:16.079 --> 00:04:18.600
cave. You know, like you're watching the shadows

00:04:18.600 --> 00:04:21.279
on a cave wall to try and guess the shape of

00:04:21.279 --> 00:04:24.639
the physical object casting them. You only ever

00:04:24.639 --> 00:04:27.500
see the byproduct of the reality, never the reality

00:04:27.500 --> 00:04:30.240
itself. Plato's cave is exactly the right way

00:04:30.240 --> 00:04:32.819
to think about it. The sequence of balls rolling

00:04:32.819 --> 00:04:35.500
out of the room is the shadow on the wall, and

00:04:35.500 --> 00:04:38.220
the hidden math is trying to reverse-engineer

00:04:38.220 --> 00:04:41.839
the true object casting it. To do that, the architecture

00:04:41.839 --> 00:04:44.759
of a hidden Markov model relies on two specific

00:04:44.759 --> 00:04:48.120
sets of rules. First you have transition probabilities.

00:04:48.259 --> 00:04:50.079
Transition probabilities, okay. These are the

00:04:50.079 --> 00:04:52.439
rules for moving from one hidden urn to the next

00:04:52.439 --> 00:04:54.459
or, you know, one weather state to the next.

00:04:54.839 --> 00:04:56.939
If you have a specific number of hidden states,

00:04:56.980 --> 00:05:00.639
say, n states, you have an n-by-n matrix (n squared entries) of

00:05:00.639 --> 00:05:03.199
transition probabilities mapping every possible

00:05:03.199 --> 00:05:06.160
move. From any one state to any other. Correct.

00:05:06.329 --> 00:05:09.050
Then you have the emission probabilities. These

00:05:09.050 --> 00:05:11.670
dictate how likely a hidden state is to produce

00:05:11.670 --> 00:05:15.310
a specific observation, like Bob's 50% chance

00:05:15.310 --> 00:05:18.129
of cleaning in the rain or the 80% chance that

00:05:18.129 --> 00:05:20.810
urn number three produces a red ball. Wait, I

00:05:20.810 --> 00:05:22.550
have to push back here on behalf of the listener

00:05:22.550 --> 00:05:25.329
for a second. Go ahead. If the genie is choosing

00:05:25.329 --> 00:05:28.310
urns based only on the single previous urn...

00:05:28.379 --> 00:05:31.579
or if the weather today only depends on yesterday,

00:05:32.360 --> 00:05:34.600
because of that strict Markov property we talked

00:05:34.600 --> 00:05:37.699
about, doesn't that make the system incredibly

00:05:37.699 --> 00:05:39.759
short-sighted? It does seem that way. I mean,

00:05:39.920 --> 00:05:43.019
how can this accurately model complex reality

00:05:43.019 --> 00:05:46.279
if it has zero long-term memory. It feels like

00:05:46.279 --> 00:05:48.220
trying to predict a novel by only looking at

00:05:48.220 --> 00:05:50.439
the previous word. What's fascinating here is

00:05:50.439 --> 00:05:52.800
that while it feels limiting, this is where the

00:05:52.800 --> 00:05:55.040
math gets incredibly elegant. Even though the

00:05:55.040 --> 00:05:57.639
model only looks one step back, these strict

00:05:57.639 --> 00:06:00.800
interlocking probabilistic rules compound. They

00:06:00.800 --> 00:06:03.439
compound? Well, it is not just looking at one

00:06:03.439 --> 00:06:06.279
transition in a vacuum. It is analyzing a long

00:06:06.279 --> 00:06:09.199
chain of them. When you chain enough of these

00:06:09.199 --> 00:06:12.060
short-term dependencies together through transition

00:06:12.060 --> 00:06:14.920
and emission matrices, they become incredibly

00:06:14.920 --> 00:06:17.980
powerful at mapping highly complex, seemingly

00:06:17.980 --> 00:06:21.319
long-term patterns in the data. Oh, wow. A single

00:06:21.319 --> 00:06:23.220
word doesn't tell you the plot of the novel,

00:06:23.800 --> 00:06:26.240
but observing the transition probabilities of

00:06:26.240 --> 00:06:29.180
10,000 words in a row absolutely maps out the

00:06:29.180 --> 00:06:31.199
grammar and tone of the book. Okay, that makes

00:06:31.199 --> 00:06:33.220
sense. So we've got the rules of the game. Yeah.

00:06:33.360 --> 00:06:35.319
We know how the matrices are set up. Now how

00:06:35.319 --> 00:06:37.399
do we actually win the game? Right. How do we

00:06:37.399 --> 00:06:39.899
solve the puzzle? Exactly. How do we use this

00:06:39.899 --> 00:06:43.000
model to solve the puzzle of the hidden truth?

00:06:43.540 --> 00:06:46.339
The source lays out three main inference tasks

00:06:46.339 --> 00:06:49.600
for these latent or hidden variables. There are

00:06:49.600 --> 00:06:51.980
three major questions we can ask the model to

00:06:51.980 --> 00:06:54.139
solve. The first is called filtering. Filtering,

00:06:54.220 --> 00:06:57.019
right. This is when we want to find the distribution

00:06:57.019 --> 00:06:59.699
over the hidden states at the very end of a sequence.

00:07:00.540 --> 00:07:02.579
We use something called the forward algorithm

00:07:02.579 --> 00:07:06.060
to figure out what state the process is in right

00:07:06.060 --> 00:07:09.040
at this exact moment, based on accumulating the

00:07:09.040 --> 00:07:10.920
probabilities of all the observations leading

00:07:10.920 --> 00:07:13.420
up to it. OK, so that's the present moment.
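
Sticking with Alice and Bob, here is a minimal sketch of that forward pass in Python, built on the illustrative tables defined earlier (a sketch, not code from the source):

def forward_filter(obs_seq, states, start_p, trans_p, emit_p):
    # Belief over the hidden states after the first observation.
    alpha = {s: start_p[s] * emit_p[s][obs_seq[0]] for s in states}
    for obs in obs_seq[1:]:
        # Accumulate every way of arriving in state s, then weight by
        # how likely s is to have emitted this observation.
        alpha = {s: emit_p[s][obs] * sum(alpha[p] * trans_p[p][s] for p in states)
                 for s in states}
    total = sum(alpha.values())  # normalize into a proper distribution
    return {s: a / total for s, a in alpha.items()}

# forward_filter(("walk", "clean", "shop"), states, start_p, trans_p, emit_p)
# returns Alice's belief about today's weather after three phone calls.

Then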

00:07:13.420 --> 00:07:15.759
there's the second task, which is smoothing.

00:07:16.160 --> 00:07:18.759
Yes, smoothing. This is when you're looking backward.

00:07:19.560 --> 00:07:22.019
You want to find the distribution of a hidden

00:07:22.019 --> 00:07:24.420
state somewhere in the middle of a past sequence.

00:07:25.180 --> 00:07:28.579
To do this, the model uses the forward-backward

00:07:28.579 --> 00:07:30.980
algorithm. Which is very clever mathematically.

00:07:31.220 --> 00:07:33.139
Yeah, it calculates the probabilities from the

00:07:33.139 --> 00:07:35.040
beginning of the sequence up to that middle point

00:07:35.040 --> 00:07:37.279
and also from the end of the sequence backward

00:07:37.279 --> 00:07:39.899
to that middle point to basically pinch the exact

00:07:39.899 --> 00:07:43.110
probability.
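
Here is what that pinch looks like in the same illustrative Python setup; the forward pass repeats the filtering logic above, and the backward pass runs the same idea in reverse:

def smooth(obs_seq, states, start_p, trans_p, emit_p, t):
    # Forward pass: probability of the observations up to each step,
    # ending in each possible state.
    alpha = [{s: start_p[s] * emit_p[s][obs_seq[0]] for s in states}]
    for obs in obs_seq[1:]:
        alpha.append({s: emit_p[s][obs] * sum(alpha[-1][p] * trans_p[p][s]
                                              for p in states) for s in states})
    # Backward pass: probability of the observations after each step,
    # given each possible state.
    beta = [{s: 1.0 for s in states}]
    for obs in reversed(obs_seq[1:]):
        beta.insert(0, {s: sum(trans_p[s][n] * emit_p[n][obs] * beta[0][n]
                               for n in states) for s in states})
    # The pinch: combine both passes at step t and normalize.
    post = {s: alpha[t][s] * beta[t][s] for s in states}
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}

And finally, the third task is finding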

00:07:43.110 --> 00:07:45.670
the most likely explanation. This doesn't just

00:07:45.670 --> 00:07:48.449
look at one point in time. It asks, what is the

00:07:48.449 --> 00:07:50.990
joint probability of the entire sequence of hidden

00:07:50.990 --> 00:07:53.449
states that generated our observations? The whole

00:07:53.449 --> 00:07:55.670
thing. The whole sequence. And for this, we use

00:07:55.670 --> 00:07:58.189
the famous Viterbi algorithm. OK, here's where

00:07:58.189 --> 00:08:00.430
it gets really interesting. Let's put this into

00:08:00.430 --> 00:08:03.740
real-world terms for you. Think of these three

00:08:03.740 --> 00:08:06.699
tools like tracking a friend on a cross-country

00:08:06.699 --> 00:08:09.680
drive. I like this analogy. Yeah, so filtering

00:08:09.680 --> 00:08:12.079
is like using their current speed and direction

00:08:12.079 --> 00:08:16.180
to guess exactly where they are right now. Smoothing

00:08:16.399 --> 00:08:18.839
is trying to figure out which specific gas station

00:08:18.839 --> 00:08:21.220
they stopped at three hours ago based on their

00:08:21.220 --> 00:08:23.620
overall trajectory. Right. But the most likely

00:08:23.620 --> 00:08:26.459
explanation, the Viterbi algorithm, is like looking

00:08:26.459 --> 00:08:29.800
at a pile of crumpled gas receipts and reconstructing

00:08:29.800 --> 00:08:32.879
the entire cross-country road trip map from

00:08:32.879 --> 00:08:35.980
start to finish. The road trip analogy perfectly

00:08:35.980 --> 00:08:38.419
captures the scale of the Viterbi algorithm,

00:08:38.419 --> 00:08:40.960
and it brings up a crucial mechanical distinction.

00:08:41.240 --> 00:08:43.720
Which is? You might wonder... Why can't we just

00:08:43.720 --> 00:08:46.379
use the filtering tool over and over again? Why

00:08:46.379 --> 00:08:48.539
not just find the most likely state for step

00:08:48.539 --> 00:08:51.279
one, then step two, then step three, and stitch

00:08:51.279 --> 00:08:53.500
them together to make our map? Right. I was actually

00:08:53.500 --> 00:08:55.779
just thinking that. If I know the most likely

00:08:55.779 --> 00:08:58.100
weather for Monday, Tuesday, and Wednesday individually.

00:08:58.440 --> 00:09:00.539
Shouldn't that be the most likely sequence for

00:09:00.539 --> 00:09:03.559
the week? Not necessarily. The most likely individual

00:09:03.559 --> 00:09:06.419
states don't always form the most likely continuous

00:09:06.419 --> 00:09:09.080
sequence. Wait, really? Why not? Because the

00:09:09.080 --> 00:09:11.940
transitions between certain states might be mathematically

00:09:11.940 --> 00:09:15.620
impossible or highly improbable. If you try to

00:09:15.620 --> 00:09:18.740
guess a 10-word sentence, the number of possible

00:09:18.740 --> 00:09:21.820
grammatical combinations is astronomical. If

00:09:21.820 --> 00:09:24.519
a machine brute-forced it by checking every single

00:09:24.519 --> 00:09:26.440
path through the matrix, it would take years.

00:09:26.600 --> 00:09:29.340
Oh, because of all the branches. Exactly. The

00:09:29.340 --> 00:09:31.899
brilliance of the Viterbi algorithm is dynamic

00:09:31.899 --> 00:09:34.740
programming. Instead of mapping every route,

00:09:35.059 --> 00:09:38.120
it only remembers the single best path to reach

00:09:38.120 --> 00:09:41.200
the current word, instantly discarding all the

00:09:41.200 --> 00:09:44.259
suboptimal routes. It cuts a process that could

00:09:44.110 --> 00:09:47.090
take centuries down to milliseconds. So it's

00:09:47.090 --> 00:09:49.330
ruthlessly efficient. It prunes the dead ends

00:09:49.330 --> 00:09:51.549
immediately instead of following them to the

00:09:51.549 --> 00:09:53.990
finish line. Exactly.
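
As a rough sketch of that pruning, again on the illustrative Alice-and-Bob tables (real systems work in log space to avoid numerical underflow):

def viterbi(obs_seq, states, start_p, trans_p, emit_p):
    # best[s] = probability of the single best path that ends in state s.
    best = {s: start_p[s] * emit_p[s][obs_seq[0]] for s in states}
    backptr = []  # per step, the best predecessor of each state
    for obs in obs_seq[1:]:
        # Keep only the strongest route into each state; every other
        # route is discarded on the spot.
        choice = {s: max(states, key=lambda p: best[p] * trans_p[p][s])
                  for s in states}
        best = {s: best[choice[s]] * trans_p[choice[s]][s] * emit_p[s][obs]
                for s in states}
        backptr.append(choice)
    # Trace the surviving back-pointers from the best final state.
    path = [max(best, key=best.get)]
    for choice in reversed(backptr):
        path.insert(0, choice[path[0]])
    return path  # the single most likely whole sequence of hidden states

The source gives a great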

00:09:53.990 --> 00:09:56.309
example regarding part-of-speech tagging in

00:09:56.309 --> 00:09:58.909
language processing. If you are trying to understand

00:09:58.909 --> 00:10:01.169
a sentence, you don't just want the probability

00:10:01.169 --> 00:10:03.230
of what part of speech a single word is. Right.

00:10:03.230 --> 00:10:05.909
You need context. To actually make sense of the

00:10:05.909 --> 00:10:08.230
grammar, you need the entire sequence of parts

00:10:08.230 --> 00:10:10.669
of speech to align logically. You need the whole

00:10:10.669 --> 00:10:13.960
road trip. Which means you require the Viterbi

00:10:13.960 --> 00:10:16.659
algorithm to find that most likely explanation

00:10:16.659 --> 00:10:19.559
for the whole sentence, ensuring that a noun actually

00:10:19.559 --> 00:10:21.899
follows an adjective in a way that makes linguistic

00:10:21.899 --> 00:10:25.220
sense. That makes total sense. We use the Viterbi

00:10:25.220 --> 00:10:27.259
when the context of the whole sequence is what

00:10:27.259 --> 00:10:31.269
actually matters. But all of this puzzle-solving

00:10:31.269 --> 00:10:33.330
assumes we already know the rules, right? It

00:10:33.330 --> 00:10:35.110
does assume that. Like, it assumes that we know

00:10:35.110 --> 00:10:37.929
Alice's probabilities for Bob's weather, or the

00:10:37.929 --> 00:10:40.289
exact mix of colored balls in the genie's urns.

00:10:41.049 --> 00:10:43.529
How does a machine learn the rules of the hidden

00:10:43.529 --> 00:10:47.080
world from scratch if no one inputs them? And

00:10:47.080 --> 00:10:49.259
where does this actually touch your daily life?

00:10:49.440 --> 00:10:51.919
Well, the history here is quite rich. The core

00:10:51.919 --> 00:10:54.360
math was developed in the late 1960s by Leonard

00:10:54.360 --> 00:10:56.919
E. Baum and his colleagues, but it really took

00:10:56.919 --> 00:10:59.539
off in the mid-1970s when it was applied to

00:10:59.539 --> 00:11:01.500
speech recognition. Speech recognition, okay.

00:11:01.720 --> 00:11:05.779
And then it exploded in the 1980s with bioinformatics,

00:11:06.139 --> 00:11:10.179
specifically analyzing DNA sequences. Today,

00:11:10.379 --> 00:11:12.480
the applications are endless. We are talking

00:11:12.480 --> 00:11:15.820
computational finance, protein folding, handwriting

00:11:15.820 --> 00:11:18.940
recognition, even predicting solar irradiance

00:11:18.940 --> 00:11:21.539
variability. And Siri. The source specifically

00:11:21.539 --> 00:11:23.879
mentions Siri's speech recognition. Yes. Siri

00:11:23.879 --> 00:11:26.500
is a perfect everyday example. But let's dig

00:11:26.500 --> 00:11:28.220
into the mechanics of that. We're saying the

00:11:28.220 --> 00:11:30.259
algorithm learns the transition and emission

00:11:30.259 --> 00:11:34.000
probabilities, but how can it possibly learn

00:11:34.000 --> 00:11:36.799
the rules of the hidden states if those states

00:11:36.799 --> 00:11:40.409
are, by definition, hidden. It is a paradox, isn't

00:11:40.409 --> 00:11:42.629
it? And I really need to logic this out. If it's

00:11:42.629 --> 00:11:44.470
guessing the rules based on the data, but it

00:11:44.470 --> 00:11:46.230
needs the rules to understand the data, isn't

00:11:46.230 --> 00:11:48.409
it just spinning its wheels? How does tweaking

00:11:48.409 --> 00:11:50.710
the probabilities actually lock it into the correct

00:11:50.710 --> 00:11:53.429
pattern instead of just, you know, a mathematically

00:11:53.429 --> 00:11:55.730
convenient hallucination? It is the ultimate

00:11:55.730 --> 00:11:57.809
chicken and egg problem. If we connect this to

00:11:57.809 --> 00:12:00.870
the bigger picture: to solve it, we rely on algorithms

00:12:00.870 --> 00:12:03.289
like Baum-Welch, which is a type of expectation

00:12:03.289 --> 00:12:07.600
maximization algorithm, or EM. For more complex

00:12:07.600 --> 00:12:10.759
time series predictions, systems might use Markov

00:12:10.759 --> 00:12:14.539
chain Monte Carlo or MCMC sampling. Okay, lots

00:12:14.539 --> 00:12:17.740
of acronyms. True, but the core logic of expectation

00:12:17.740 --> 00:12:19.860
maximization directly answers your question.

00:12:20.519 --> 00:12:23.860
It uses iterative local maximum likelihood estimates.

00:12:24.080 --> 00:12:25.860
Okay, let's translate that for everyone. Think

00:12:25.860 --> 00:12:28.419
of it like trying to tune a blurry radio station

00:12:28.419 --> 00:12:30.889
in the dark. Great way to visualize it. You don't

00:12:30.889 --> 00:12:32.649
know the exact frequency. Those are the hidden

00:12:32.649 --> 00:12:35.070
rules. But you can hear the static, which are

00:12:35.070 --> 00:12:37.950
the errors in your output. Expectation maximization

00:12:37.950 --> 00:12:40.009
is basically the process of turning the dial

00:12:40.009 --> 00:12:42.629
just a millimeter to the left. Did the audio

00:12:42.629 --> 00:12:45.309
get clearer? Yes. Keep going. Did it get worse?

00:12:45.529 --> 00:12:48.330
Turn back. It iterates. Right. It iterates this

00:12:48.330 --> 00:12:51.029
microtuning until the music comes through crystal

00:12:51.029 --> 00:12:54.350
clear. That is precisely how it functions. The

00:12:54.350 --> 00:12:57.009
algorithm splits the work into two distinct steps.

00:12:57.549 --> 00:13:00.980
First is the expectation step. Given its current

00:13:00.980 --> 00:13:03.840
fuzzy guess of the rules, it asks, what is the

00:13:03.840 --> 00:13:06.940
most likely sequence of hidden states? It makes

00:13:06.940 --> 00:13:08.960
a rough map. Okay, step one is the rough map.

00:13:09.159 --> 00:13:11.960
Then comes the maximization step. Given that

00:13:11.960 --> 00:13:14.360
rough map, it asks, how should we mathematically

00:13:14.360 --> 00:13:17.019
update our transition and emission rules to make

00:13:17.019 --> 00:13:19.820
this specific map even more likely to occur?

00:13:20.000 --> 00:13:22.220
Oh, I see. It updates the rules, which changes

00:13:22.220 --> 00:13:24.039
the map, which updates the rules again. It loops

00:13:24.039 --> 00:13:26.539
this process over and over until the probabilities

00:13:26.539 --> 00:13:29.059
stabilize, the static fades, and the pattern

00:13:29.059 --> 00:13:31.159
locks in.
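
A compact, unscaled Baum-Welch loop as a sketch of that process (illustrative only: real implementations rescale the forward and backward values to avoid underflow, and train on many observation sequences rather than one):

import numpy as np

def baum_welch(obs, n_states, n_symbols, n_iters=50, seed=0):
    # obs: a sequence of integer symbol indices. Start from random rules;
    # every row is normalized so it sums to 1.
    rng = np.random.default_rng(seed)
    start = rng.random(n_states)
    start /= start.sum()
    trans = rng.random((n_states, n_states))
    trans /= trans.sum(axis=1, keepdims=True)
    emit = rng.random((n_states, n_symbols))
    emit /= emit.sum(axis=1, keepdims=True)
    obs = np.asarray(obs)
    T = len(obs)
    for _ in range(n_iters):
        # Expectation step: forward/backward passes under the current guess.
        alpha = np.zeros((T, n_states))
        alpha[0] = start * emit[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        beta = np.ones((T, n_states))
        for t in range(T - 2, -1, -1):
            beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
        gamma = alpha * beta                 # the "rough map" of hidden states
        gamma /= gamma.sum(axis=1, keepdims=True)
        xi = np.zeros((n_states, n_states))  # expected transition counts
        for t in range(T - 1):
            x = alpha[t][:, None] * trans * (emit[:, obs[t + 1]] * beta[t + 1])
            xi += x / x.sum()
        # Maximization step: re-estimate the rules to make that map more likely.
        start = gamma[0]
        trans = xi / xi.sum(axis=1, keepdims=True)
        for k in range(n_symbols):
            emit[:, k] = gamma[obs == k].sum(axis=0)
        emit /= emit.sum(axis=1, keepdims=True)
    return start, trans, emit

So it literally pulls itself up by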

00:13:31.159 --> 00:13:33.360
its own mathematical bootstraps. It does. And

00:13:33.360 --> 00:13:34.759
that brings it right back to you, the listener.

00:13:35.080 --> 00:13:37.580
Every single time your phone's voice assistant

00:13:37.580 --> 00:13:40.639
understands your spoken words, translating the

00:13:40.639 --> 00:13:42.860
messy audio waves of your voice into a hidden

00:13:42.860 --> 00:13:46.100
sequence of grammatical text, it is solving this

00:13:46.100 --> 00:13:49.860
exact probabilistic puzzle. It's using expectation

00:13:49.860 --> 00:13:52.820
maximization to tune the radio dial of your speech.

00:13:53.399 --> 00:13:55.519
It is quite literally mapping the audio shadows

00:13:55.519 --> 00:13:57.740
back to the linguistic urns in fractions of a

00:13:57.740 --> 00:14:01.299
second. But wait, there is a fatal flaw in everything

00:14:01.299 --> 00:14:04.490
we've talked about so far. A flaw? Yeah. Reality

00:14:04.490 --> 00:14:07.009
doesn't happen in neat discrete steps like drawing

00:14:07.009 --> 00:14:09.610
a single ball from an urn or flipping a coin

00:14:09.610 --> 00:14:12.529
between rainy and sunny. The real world is continuous.

00:14:12.909 --> 00:14:15.269
You know, it's messy. The stock market doesn't

00:14:15.269 --> 00:14:18.570
just go up or down in rigid boxes. It fluctuates

00:14:18.570 --> 00:14:20.909
wildly across a continuous spectrum. That is

00:14:20.909 --> 00:14:23.029
very true. So what happens to our Markov model

00:14:23.230 --> 00:14:25.509
when the data doesn't fit into clean little boxes?

00:14:25.830 --> 00:14:28.350
It evolves. The source outlines several major

00:14:28.350 --> 00:14:31.450
extensions to handle exactly this kind of complexity.

00:14:31.710 --> 00:14:33.850
When you are dealing with continuous state spaces

00:14:33.850 --> 00:14:36.629
like tracking the continuous real-time trajectory

00:14:36.629 --> 00:14:38.789
of a rocket through the atmosphere rather than

00:14:38.789 --> 00:14:41.370
discrete weather states, the math shifts. Oh,

00:14:41.370 --> 00:14:43.590
so? We use things like Kalman filters, which

00:14:43.590 --> 00:14:46.129
adapt the Markov model to handle data that flows

00:14:46.129 --> 00:14:48.769
without rigid breaks. But the truly significant

00:14:48.769 --> 00:14:51.529
shift in modern AI has been the move toward discriminative

00:14:51.529 --> 00:14:54.779
approaches. The source heavily emphasizes this shift.

00:14:55.279 --> 00:14:58.519
It details maximum entropy Markov models, or MEMMs,

00:14:58.519 --> 00:15:01.259
and linear-chain conditional random fields, or

00:15:01.259 --> 00:15:03.700
CRFs. But let's not just name-drop the jargon.

00:15:04.340 --> 00:15:06.440
How do these actually differ from the classic

00:15:06.440 --> 00:15:09.600
hidden urns? The difference is in how they view

00:15:09.600 --> 00:15:12.840
the world. Traditional HMMs are generative models.

00:15:13.299 --> 00:15:15.899
They try to mathematically recreate exactly how

00:15:15.899 --> 00:15:17.860
the hidden states generated the observations

00:15:17.860 --> 00:15:21.200
from scratch. They model the entire joint distribution.

00:15:21.279 --> 00:15:24.580
OK, recreating the whole world. Yes. A discriminative

00:15:24.580 --> 00:15:26.820
model, like a conditional random field, doesn't

00:15:26.820 --> 00:15:29.039
care about recreating the whole world. It just

00:15:29.039 --> 00:15:31.559
models the conditional distribution. It only

00:15:31.559 --> 00:15:33.639
cares about drawing the correct boundaries to

00:15:33.639 --> 00:15:36.080
classify the observations.
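
In notation, that is the difference between learning the joint distribution and learning only the conditional one:

P(hidden states, observations)   versus   P(hidden states | observations)

Think of a traditional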

00:15:36.080 --> 00:15:38.519
generative HMM as looking at the words in a

00:15:38.519 --> 00:15:40.620
sentence through a narrow straw, right? You only

00:15:40.620 --> 00:15:43.159
see one word at a time, strictly limited by that

00:15:43.159 --> 00:15:45.580
one-step Markov property we talked about. Yes,

00:15:45.779 --> 00:15:48.399
the amnesia. Right, the amnesia. A conditional

00:15:48.399 --> 00:15:51.080
random field takes the straw away. It allows

00:15:51.080 --> 00:15:53.480
the model to look at the whole sequence simultaneously.

00:15:54.120 --> 00:15:56.399
It recognizes context, like the fact that the

00:15:56.399 --> 00:15:58.649
word bank... it means something entirely different

00:15:58.649 --> 00:16:01.190
if the word river is nearby rather than if the

00:16:01.190 --> 00:16:03.850
word money is nearby. Exactly. By taking the

00:16:03.850 --> 00:16:06.750
straw away, CRFs solve what is known as the label

00:16:06.750 --> 00:16:09.769
bias problem. They allow engineers to inject

00:16:09.769 --> 00:16:12.389
domain -specific knowledge and look at combinations

00:16:12.389 --> 00:16:15.269
of nearby observations all at once. Which is

00:16:15.269 --> 00:16:18.190
huge for complex data. It is. We are also seeing

00:16:18.190 --> 00:16:20.110
the integration of recurrent neural networks,

00:16:20.509 --> 00:16:23.090
specifically reservoir networks, which feed temporal

00:16:23.090 --> 00:16:26.269
dynamics into the model. This helps the HMM handle

00:16:26.269 --> 00:16:29.350
non-stationary data. That's data where the

00:16:29.350 --> 00:16:31.730
underlying rules of the universe are constantly

00:16:31.730 --> 00:16:34.549
shifting over time. Which brings us to a massive

00:16:34.549 --> 00:16:38.330
update from the source material. In 2023, two

00:16:38.330 --> 00:16:40.750
breakthrough algorithms were introduced, the

00:16:40.750 --> 00:16:43.029
discriminative forward-backward and discriminative

00:16:43.029 --> 00:16:45.509
Viterbi algorithms. A very recent development.

00:16:45.990 --> 00:16:48.649
Yeah, super recent. They bypassed the need to

00:16:48.649 --> 00:16:51.169
model the joint distribution entirely, using only

00:16:51.169 --> 00:16:55.840
conditional distributions. But wait! Uh, doesn't

00:16:55.840 --> 00:16:58.159
that rewrite the rulebook? We spent this whole

00:16:58.159 --> 00:16:59.940
deep dive saying we needed the full generative

00:16:59.940 --> 00:17:02.340
model that learns the specific emission probabilities

00:17:02.340 --> 00:17:05.079
to figure out the hidden states. And now we don't.

00:17:05.220 --> 00:17:08.019
It is a profound paradigm shift in the mathematics.

00:17:08.359 --> 00:17:10.799
By skipping the joint law, by not forcing the

00:17:10.799 --> 00:17:13.299
algorithm to learn the entire generative rulebook

00:17:13.299 --> 00:17:16.859
of the universe, these 2023 algorithms make the

00:17:16.859 --> 00:17:19.400
model vastly more computationally efficient.

00:17:19.460 --> 00:17:21.480
Oh, because it's doing less math. Precisely.

00:17:21.839 --> 00:17:24.119
You get the sequence-solving power of the Viterbi

00:17:24.119 --> 00:17:26.819
algorithm without the immense computational baggage

00:17:26.819 --> 00:17:29.400
of a traditional generative model. It makes the

00:17:29.400 --> 00:17:31.779
HMM incredibly versatile for cutting-edge AI

00:17:31.779 --> 00:17:34.279
applications where speed, scale and efficiency

00:17:34.279 --> 00:17:36.660
are paramount. You are getting the road trip

00:17:36.660 --> 00:17:38.720
map without having to mathematically simulate

00:17:38.720 --> 00:17:41.460
the engine of the car. That is wild. I mean,

00:17:41.460 --> 00:17:43.940
we've gone from a theoretical genie drawing balls

00:17:43.940 --> 00:17:47.440
from hidden urns in the 1960s to teaching Siri

00:17:47.440 --> 00:17:50.599
how to recognize the messy audio of your voice

00:17:50.599 --> 00:17:54.160
to these 2023 algorithms that slice through the

00:17:54.160 --> 00:17:56.660
underlying math faster and more efficiently than

00:17:56.660 --> 00:17:59.180
ever before. It's an incredible evolution. So,

00:17:59.220 --> 00:18:01.279
you know, the next time you are looking at a

00:18:01.279 --> 00:18:02.799
sequence of events, whether it's the erratic

00:18:02.799 --> 00:18:04.619
jumping in the stock market, the sequence of

00:18:04.619 --> 00:18:07.180
nucleotides in a DNA test, or just trying to

00:18:07.180 --> 00:18:09.450
guess if your friend is going to clean their

00:18:09.450 --> 00:18:11.930
apartment based on the weather. Remember, there

00:18:11.930 --> 00:18:14.549
is a hidden Markov chain operating right beneath

00:18:14.549 --> 00:18:17.089
the surface, calculating the odds. And before

00:18:17.089 --> 00:18:20.109
we wrap up, there is one final, almost philosophical

00:18:20.109 --> 00:18:21.990
concept from the measure theory section of the

00:18:21.990 --> 00:18:23.869
text that I think you should take with you. Ooh,

00:18:23.869 --> 00:18:26.220
go for it. Let's hear it. We established at the

00:18:26.220 --> 00:18:28.400
very beginning that the hidden part of a Markov

00:18:28.400 --> 00:18:31.000
model strictly depends only on the immediate

00:18:31.000 --> 00:18:34.000
past. It operates with mathematical amnesia.

00:18:34.220 --> 00:18:36.700
Right, the one-step rule. However, the measure

00:18:36.700 --> 00:18:38.740
theory proves that the observable sequence, the

00:18:38.740 --> 00:18:41.160
part we actually see, can be non-Markovian.

00:18:41.500 --> 00:18:45.200
Wait, meaning the shadows do have a memory, even

00:18:45.200 --> 00:18:48.039
if the object casting them doesn't. That is exactly

00:18:48.039 --> 00:18:51.160
what it means. For example, if you observe a

00:18:51.160 --> 00:18:53.539
long enough sequence of a specific outcome on

00:18:53.539 --> 00:18:56.859
the visible layer, let's call it outcome B, you

00:18:56.859 --> 00:18:59.119
might become increasingly mathematically certain

00:18:59.119 --> 00:19:02.829
that the underlying hidden state is A. This implies

00:19:02.829 --> 00:19:05.210
that the visible, observable part of the system

00:19:05.210 --> 00:19:07.930
can actually be affected by something infinitely

00:19:07.930 --> 00:19:10.869
far back in the past, even if the hidden engine

00:19:10.869 --> 00:19:14.109
driving it only looks one step back. Whoa! So

00:19:14.109 --> 00:19:15.950
even if the hidden mechanics driving the universe

00:19:15.950 --> 00:19:18.490
only care about what happened yesterday, our

00:19:18.490 --> 00:19:20.789
visible world, the shadows we see on the cave

00:19:20.789 --> 00:19:23.369
wall, might actually carry the mathematical fingerprints

00:19:23.369 --> 00:19:26.150
of the infinite past. Exactly. The observables

00:19:26.150 --> 00:19:28.250
remember what the hidden states forget. That

00:19:28.250 --> 00:19:30.630
is so cool. It's a mathematical quirk with profound

00:19:30.630 --> 00:19:33.630
philosophical weight. The ghost in the machine

00:19:33.630 --> 00:19:36.349
might only be looking one step ahead, but the

00:19:36.349 --> 00:19:38.950
shadows it casts stretch all the way back to

00:19:38.950 --> 00:19:40.809
the beginning. A perfect way to summarize it.

00:19:41.009 --> 00:19:43.009
Now that is a thought to leave you with. Until

00:19:43.009 --> 00:19:45.250
next time, keep looking for the true shapes casting

00:19:45.250 --> 00:19:45.710
the shadows.
