WEBVTT

00:00:00.000 --> 00:00:02.660
Imagine you were planning a cross-country road

00:00:02.660 --> 00:00:06.080
trip, and I don't just mean plugging a destination

00:00:06.080 --> 00:00:07.599
into your phone and taking the main highways.

00:00:07.820 --> 00:00:09.539
Right, that would be too easy. Yeah, exactly.

00:00:09.740 --> 00:00:12.740
I mean, you decide for some completely wild reason

00:00:12.740 --> 00:00:15.500
that you are going to map out absolutely every

00:00:15.500 --> 00:00:18.359
single possible route, like every dirt road,

00:00:18.719 --> 00:00:21.500
every wrong turn, every scenic bypass, every

00:00:21.500 --> 00:00:23.640
single time you could possibly pull over for

00:00:23.640 --> 00:00:27.030
gas or coffee or... You know, just stretch your

00:00:27.030 --> 00:00:29.769
legs. I mean, it sounds completely exhausting

00:00:29.769 --> 00:00:32.850
just thinking about it. You'd never actually

00:00:32.850 --> 00:00:34.869
leave the driveway. You really wouldn't. Now

00:00:34.869 --> 00:00:37.549
imagine doing that, but the map of those choices

00:00:37.549 --> 00:00:41.109
is so unimaginably massive that the number of

00:00:41.109 --> 00:00:43.450
possible routes is actually larger than the number

00:00:43.450 --> 00:00:45.950
of atoms in the observable universe. Oh, wow.

00:00:46.170 --> 00:00:48.250
Yeah. You are trying to find the single best

00:00:48.250 --> 00:00:50.929
path to your destination, but you literally cannot

00:00:50.929 --> 00:00:53.250
see the whole map. It is physically impossible

00:00:53.250 --> 00:00:55.740
to draw it all out. Which perfectly frames the

00:00:55.740 --> 00:00:58.119
exact problem computer scientists faced when

00:00:58.119 --> 00:01:00.479
trying to get machines to play highly complex

00:01:00.479 --> 00:01:04.540
board games. The game tree, which is simply the

00:01:04.540 --> 00:01:06.840
mathematical map of all possible moves and counter

00:01:06.840 --> 00:01:10.280
moves in a game, it can become unfathomably huge.

00:01:10.819 --> 00:01:13.219
And that impossible map is our mission for today's

00:01:13.219 --> 00:01:16.079
Deep Dive. We are taking a stack of fascinating

00:01:16.079 --> 00:01:18.760
sources, primarily centered around the Wikipedia

00:01:18.760 --> 00:01:21.799
foundational document for a concept called Monte

00:01:21.799 --> 00:01:24.480
Carlo Tree Search, or MCTS. Right, a very famous

00:01:24.480 --> 00:01:26.640
algorithm. We are going to figure out the mathematical

00:01:26.640 --> 00:01:28.799
magic trick that allowed machines to conquer

00:01:28.799 --> 00:01:31.280
the most complex games in human history, like

00:01:31.280 --> 00:01:34.819
the ancient board game Go, without needing...

00:01:34.890 --> 00:01:37.349
explicit step-by-step human theories to do

00:01:37.349 --> 00:01:39.409
it. It is an incredible story. And, you know,

00:01:39.409 --> 00:01:41.650
while we are talking about artificial intelligence

00:01:41.650 --> 00:01:43.909
and board games, the underlying mechanics here

00:01:43.909 --> 00:01:45.769
are profound for anyone listening. Yeah, it's

00:01:45.769 --> 00:01:48.209
bigger than games. Exactly. This isn't just about

00:01:48.209 --> 00:01:51.030
AI playing games. It's about the fundamental

00:01:51.030 --> 00:01:53.709
principles of decision making under severe uncertainty.

00:01:54.010 --> 00:01:57.170
It's about how you or a machine can confidently

00:01:57.170 --> 00:01:59.670
choose a path when you can't possibly know where

00:01:59.670 --> 00:02:02.689
every single path leads. OK, let's unpack this.

00:02:03.030 --> 00:02:05.599
To really understand how modern AI navigates

00:02:05.599 --> 00:02:08.479
that impossible complexity, we first need to

00:02:08.479 --> 00:02:12.300
establish why traditional computing hit a brick

00:02:12.300 --> 00:02:15.259
wall. Like, people might remember that IBM's

00:02:15.259 --> 00:02:18.180
Deep Blue beat world chess champion Garry Kasparov,

00:02:18.240 --> 00:02:21.280
back in 1997. Right, a huge moment. So why did

00:02:21.280 --> 00:02:23.780
it take another two decades for a computer to

00:02:23.780 --> 00:02:25.860
master Go? Well, it comes down to something called

00:02:25.860 --> 00:02:28.560
the branching factor. In chess, on any given

00:02:28.560 --> 00:02:31.099
turn, you have an average of maybe 35 possible

00:02:31.099 --> 00:02:34.159
valid moves. OK, 35. But in Go, you are placing

00:02:34.159 --> 00:02:37.759
stones on a massive 19 by 19 grid. So the average

00:02:37.759 --> 00:02:41.199
branching factor is around 250. Oh, wow. So every

00:02:41.199 --> 00:02:44.919
single turn, the map of choices splits into 250

00:02:44.919 --> 00:02:47.020
new paths. And then each of those splits into

00:02:47.020 --> 00:02:50.840
250 more. Exactly. The game tree explodes exponentially.
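The scale of that explosion is easy to sanity-check with a few lines of arithmetic. A quick sketch, using the approximate per-turn averages just quoted (about 35 for chess, about 250 for Go):

```python
# Back-of-the-envelope look at the branching-factor explosion, using the
# approximate per-turn averages quoted above: ~35 legal moves in chess,
# ~250 in Go.
def positions_after(branching_factor: int, moves: int) -> int:
    """Rough count of distinct lines of play after `moves` turns."""
    return branching_factor ** moves

print(f"chess, 10 moves deep: {positions_after(35, 10):.2e}")
print(f"go,    10 moves deep: {positions_after(250, 10):.2e}")
print(f"go / chess ratio:     {positions_after(250, 10) / positions_after(35, 10):.0e}")
```

Ten moves in, Go's tree is already hundreds of millions of times larger than chess's, and the gap keeps multiplying with every additional move.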

00:02:51.280 --> 00:02:55.039
In chess, early AI could use brute force to look

00:02:55.039 --> 00:02:57.960
several moves ahead and then use a static evaluation

00:02:57.960 --> 00:02:59.919
function. What does that mean exactly? Static

00:02:59.919 --> 00:03:02.080
evaluation. It essentially looked at the board

00:03:02.080 --> 00:03:04.680
and said, you know, I have more valuable pieces

00:03:04.680 --> 00:03:08.740
than you, so I must be winning. But Go is so

00:03:08.740 --> 00:03:11.680
fluid and abstract that a single stone placed

00:03:11.680 --> 00:03:14.819
on turn 10 might quietly determine the outcome

00:03:14.819 --> 00:03:18.840
of a massive battle on turn 150. Right. You cannot

00:03:18.840 --> 00:03:21.419
easily evaluate who is winning just by looking

00:03:21.419 --> 00:03:24.919
at a static board state. You have to play it

00:03:24.919 --> 00:03:26.479
out. But we just established that playing it

00:03:26.479 --> 00:03:28.300
out perfectly is impossible because there are

00:03:28.300 --> 00:03:31.039
more paths than atoms in the universe. So how

00:03:31.039 --> 00:03:33.639
did scientists solve a problem that is literally

00:03:33.639 --> 00:03:36.060
too big to calculate? By entirely giving up on

00:03:36.060 --> 00:03:38.360
calculating it perfectly, actually, and relying

00:03:38.360 --> 00:03:40.909
on statistics instead. And the foundation for

00:03:40.909 --> 00:03:43.830
that goes all the way back to the 1940s with

00:03:43.830 --> 00:03:45.870
something called the Monte Carlo method. The

00:03:45.870 --> 00:03:48.629
1940s. So this predates modern computers by quite

00:03:48.629 --> 00:03:51.689
a bit. It does. The core idea of the Monte Carlo

00:03:51.689 --> 00:03:54.610
method is to use random sampling to solve deterministic

00:03:54.610 --> 00:03:56.430
problems that are just too difficult to solve

00:03:56.430 --> 00:03:58.610
with traditional equations. OK, give me an example

00:03:58.610 --> 00:04:01.919
of that. Think about it like this. If you have

00:04:01.919 --> 00:04:04.340
a weirdly shaped pond and you want to know its

00:04:04.340 --> 00:04:07.379
exact surface area, you could try to do incredibly

00:04:07.379 --> 00:04:11.610
complex calculus. Or... You could draw the square

00:04:11.610 --> 00:04:15.050
around the pond, throw 10,000 darts randomly

00:04:15.050 --> 00:04:17.670
into that square, and just count the percentage

00:04:17.670 --> 00:04:19.829
of darts that land in the water versus the dirt.

00:04:20.129 --> 00:04:22.529
Oh, I see. You throw enough random darts, the

00:04:22.529 --> 00:04:24.569
statistical average gives you a highly accurate

00:04:24.569 --> 00:04:27.230
answer without doing any of the heavy math. Exactly.
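The dart-throwing trick fits in a few lines of code. A minimal sketch, assuming a circular pond of radius 1 centered in a 2-by-2 square, so the true area (π) is known and the estimate can be checked:

```python
import random

def estimate_pond_area(num_darts: int, seed: int = 0) -> float:
    """Estimate the area of a unit-radius circular pond inside a 2x2 square
    by throwing random darts and counting how many land in the water."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_darts):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:   # dart landed in the water
            hits += 1
    return 4.0 * hits / num_darts  # square's area (4) times the hit fraction

print(estimate_pond_area(10_000))  # close to pi ~ 3.14159
```

The error shrinks in proportion to the square root of the number of darts: quadrupling the darts roughly halves the error, with no calculus anywhere.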

00:04:27.350 --> 00:04:29.329
That makes sense. But if they figured that out

00:04:29.329 --> 00:04:31.870
in the 40s, why did it take so long to apply

00:04:31.870 --> 00:04:34.449
it to games? Because applying it to sequential

00:04:34.449 --> 00:04:37.290
decision making like a game required a conceptual

00:04:37.290 --> 00:04:40.819
leap. and frankly, more computing power. The

00:04:40.819 --> 00:04:43.600
sources point to 1987 as a major breakthrough.

00:04:44.220 --> 00:04:46.899
A researcher named Bruce Abramson wrote his PhD

00:04:46.899 --> 00:04:49.480
thesis combining traditional game search methods

00:04:49.480 --> 00:04:52.660
with an expected outcome model. Meaning, instead

00:04:52.660 --> 00:04:54.759
of the computer trying to statically judge if

00:04:54.759 --> 00:04:57.220
a board position was good or bad, it just started

00:04:57.220 --> 00:05:01.430
throwing darts. Precisely. Abramson said, What

00:05:01.430 --> 00:05:03.870
if, from this current position, the computer

00:05:03.870 --> 00:05:05.870
just plays out the rest of the game completely

00:05:05.870 --> 00:05:08.730
randomly to the very end and sees who wins? He

00:05:08.730 --> 00:05:11.209
tested it on tic-tac-toe, Othello, and even

00:05:11.209 --> 00:05:15.089
chess. Wow. Then, in 1992, another researcher,

00:05:15.290 --> 00:05:18.250
B. Brügmann, applied this specifically to a Go

00:03:18.250 --> 00:03:21.310
-playing program. Which eventually led to Rémi

00:05:21.310 --> 00:05:23.730
Coulom officially coining the term Monte Carlo

00:05:23.730 --> 00:05:27.550
tree search in 2006. Yeah. And this foundational

00:05:27.550 --> 00:05:30.029
concept, what they call pure Monte Carlo game

00:05:30.029 --> 00:05:33.180
search, wild to visualize. It really is. It's

00:05:33.180 --> 00:05:35.160
essentially like trying to decide the best opening

00:05:35.160 --> 00:05:38.100
move in a game by cloning yourself 10,000 times,

00:05:38.500 --> 00:05:40.199
having all those clones run out and make completely

00:05:40.199 --> 00:05:42.660
random moves until someone wins or loses, and

00:05:42.660 --> 00:05:44.819
then simply picking the initial move that resulted

00:05:44.819 --> 00:05:47.240
in the most surviving clones. That is a highly

00:05:47.240 --> 00:05:49.579
entertaining but actually structurally accurate

00:05:49.579 --> 00:05:52.379
way to visualize pure Monte Carlo search. I have

00:05:52.379 --> 00:05:55.240
to push back here though. Is it really that simple?

00:05:55.449 --> 00:05:58.569
just chaotic randomness until a pattern emerges.

00:05:59.329 --> 00:06:01.910
Because human strategy is so complex, it feels

00:06:01.910 --> 00:06:04.189
like there has to be a catch. Well, the catch

00:06:04.189 --> 00:06:07.209
is computational time. But mathematically, the

00:06:07.209 --> 00:06:09.730
brilliance of pure Monte Carlo search is that

00:06:09.730 --> 00:06:12.589
for games with a finite number of moves and a

00:06:12.589 --> 00:06:14.970
finite length, like the board filling game Hex,

00:06:14.990 --> 00:06:18.600
for example, it genuinely works. Really? Just

00:06:18.600 --> 00:06:20.959
pure randomness. Yeah. As the number of random

00:06:20.959 --> 00:06:23.759
simulations approaches infinity, the algorithm

00:06:23.759 --> 00:06:26.819
naturally converges to optimal play. It eventually

00:06:26.819 --> 00:06:29.740
finds the perfect move without needing to understand

00:06:29.740 --> 00:06:31.939
the strategy of the game at all. It just knows

00:06:31.939 --> 00:06:34.500
the mechanical rules of how pieces move, and

00:06:34.500 --> 00:06:36.660
the law of large numbers does the rest. Okay,

00:06:36.920 --> 00:06:39.379
so cloning yourself 10,000 times sounds great

00:06:39.379 --> 00:06:41.500
in theory when you're writing a research paper.

00:06:42.000 --> 00:06:44.399
But you... sitting at a computer or playing a

00:06:44.399 --> 00:06:46.399
game don't have infinite time. Right, you don't.

00:06:46.439 --> 00:06:47.819
You have a few seconds before the turn clock

00:00:47.819 --> 00:00:50.980
runs out. So how does a computer actually organize

00:06:50.980 --> 00:06:53.680
this chaotic simulation process efficiently?

00:06:54.120 --> 00:06:56.379
So it isn't just wasting time sprinting down

00:06:56.379 --> 00:06:59.439
terrible paths. It requires a highly specific

00:06:59.439 --> 00:07:02.060
architecture. The algorithm needs a structure

00:07:02.060 --> 00:07:05.100
to build out its game tree smartly, prioritizing

00:07:05.100 --> 00:07:08.060
the paths that actually show promise. This is

00:07:08.060 --> 00:07:11.000
the anatomy of a decision within MCTS. There

00:07:11.000 --> 00:07:14.360
are four distinct steps to a single round. Let's

00:07:14.360 --> 00:07:16.480
walk through those steps. I think a scout analogy

00:07:16.480 --> 00:07:19.060
works perfectly here for you listening. Imagine

00:07:19.060 --> 00:07:22.339
you are an explorer trying to find treasure in

00:07:22.339 --> 00:07:25.699
a massive, uncharted forest. Step one of the

00:07:25.699 --> 00:07:28.199
algorithm is called selection. Selection, yes.

00:07:28.240 --> 00:07:30.259
This is your scout walking down the path from

00:07:30.259 --> 00:07:32.040
base camp that you have already partially mapped

00:07:32.040 --> 00:07:34.819
out. You use a specific mathematical formula

00:07:34.819 --> 00:07:37.879
to walk past the trees, choosing the most promising

00:07:37.879 --> 00:07:40.420
route until you reach the very edge of your known

00:07:40.420 --> 00:07:42.920
map. In computer science, that edge is called

00:07:42.920 --> 00:07:45.399
a leaf node. And once that scout reaches the

00:07:45.399 --> 00:07:47.439
leaf node, the frontier of the known map, the

00:07:47.439 --> 00:07:49.959
algorithm moves to step two, which is expansion.

00:07:50.259 --> 00:07:52.920
Expansion is the scout taking out a machete and

00:07:52.920 --> 00:07:54.800
hacking through the brush to find just one new

00:07:54.800 --> 00:07:57.319
clearing. You add one new potential move, one

00:07:57.319 --> 00:08:00.000
new node to your map. Then comes step three,

00:08:00.000 --> 00:08:01.860
simulation, which is sometimes called playout

00:08:01.860 --> 00:08:04.339
or rollout. Right, and this is where the original

00:08:04.339 --> 00:08:06.379
Monte Carlo magic happens. Yes, this is where

00:08:06.379 --> 00:08:08.680
you tell your scout to just start sprinting wildly

00:08:08.680 --> 00:08:11.160
in a completely random direction. Just completely

00:08:11.160 --> 00:08:14.160
in the dark. Exactly. No machete, no map, just

00:08:14.160 --> 00:08:16.639
run until you either find the treasure, hit a

00:08:16.639 --> 00:08:19.819
wall, or fall into a pit. The computer plays

00:08:19.819 --> 00:08:22.019
the game out using purely random moves until

00:08:22.019 --> 00:08:24.600
a win, loss, or draw is reached. Which leads

00:08:24.600 --> 00:08:26.759
to the fourth, and arguably the most crucial

00:08:26.759 --> 00:08:30.389
step. Backpropagation. Backpropagation is the

00:08:30.389 --> 00:08:32.389
scout running all the way back to base camp,

00:08:32.889 --> 00:08:35.450
following the exact path they just took to tell

00:08:35.450 --> 00:08:37.950
the boss the results of that wild sprint. They

00:08:37.950 --> 00:08:40.570
update the data on a map. Yeah. Selection, expansion,

00:08:40.929 --> 00:08:44.230
simulation, backpropagation over and over, thousands

00:08:44.230 --> 00:08:46.509
of times a second. What's fascinating here is

00:08:46.509 --> 00:08:49.409
how that backpropagation step mathematically

00:08:49.409 --> 00:08:52.009
shapes the growth of the tree. Let's look at

00:08:52.009 --> 00:08:54.850
how the scoreboard actually updates. Imagine

00:08:54.850 --> 00:08:57.529
the computer is playing a two-player game: black

00:08:57.529 --> 00:09:00.870
pieces versus white pieces. During backpropagation,

00:09:01.090 --> 00:09:03.570
the computer tracks the ratio of wins to total

00:09:03.570 --> 00:09:06.389
playouts for every single intersection or node

00:09:06.389 --> 00:09:09.870
along that path. Now, if the white player loses

00:09:09.870 --> 00:09:13.169
that random simulation, every single node along

00:09:13.169 --> 00:09:15.750
the path increments its total simulation count.

00:09:16.289 --> 00:09:19.169
The visits go up by one, but only the black nodes

00:09:19.169 --> 00:09:21.750
get credited with a win. Wait, so even if white

00:09:21.750 --> 00:09:24.250
is the one who played a specific move on that

00:09:24.250 --> 00:09:27.730
path, because white ultimately lost the simulation,

00:09:28.269 --> 00:09:30.629
black gets the point on the scoreboard for its

00:09:30.629 --> 00:09:33.309
own preceding move. Precisely. You are keeping

00:09:33.309 --> 00:09:35.330
score from the perspective of the player who

00:09:35.330 --> 00:09:37.730
just moved. This brilliant little accounting

00:09:37.730 --> 00:09:40.029
trick ensures that during the selection phase,

00:09:40.549 --> 00:09:42.809
each player's choices naturally expand toward

00:09:42.809 --> 00:09:45.129
the most promising moves for that specific player.

00:09:45.789 --> 00:09:48.269
As a result, the tree grows asymmetrically. Oh,

00:09:48.330 --> 00:09:50.250
that makes so much sense. It doesn't waste precious

00:09:50.250 --> 00:09:52.789
time fully mapping out the terrible, obviously

00:09:52.789 --> 00:09:55.830
losing moves. It naturally burrows deeper and

00:09:55.830 --> 00:09:58.269
deeper into the lines of play where both players

00:09:58.269 --> 00:10:00.649
are making strong, highly contested choices.
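The four steps and the perspective-flipping scoreboard can be sketched together on a toy game. An illustrative Nim variant (take 1-3 counters, last take wins) stands in for the real game here, and the selection rule is deliberately naive (greedy on win ratio); the principled selection formula, UCT, comes up later in the conversation:

```python
import random

class Node:
    """One clearing on the scout's map: a Nim position (pile of counters,
    take 1-3 per turn, whoever takes the last counter wins)."""
    def __init__(self, pile, parent=None, move=None):
        self.pile = pile
        self.parent = parent
        self.move = move                                  # move that led here
        self.children = []
        self.untried = list(range(1, min(3, pile) + 1))   # moves not yet expanded
        self.visits = 0
        self.wins = 0   # scored for the player who just moved into this node

def mcts_round(root, rng):
    # 1. Selection: walk the known map toward the most promising frontier.
    node = root
    while not node.untried and node.children:
        node = max(node.children, key=lambda c: c.wins / c.visits)
    # 2. Expansion: hack out exactly one new clearing.
    if node.untried:
        move = node.untried.pop(rng.randrange(len(node.untried)))
        child = Node(node.pile - move, parent=node, move=move)
        node.children.append(child)
        node = child
    # 3. Simulation: sprint randomly to the end of the game.
    pile, flips = node.pile, 0
    while pile > 0:
        pile -= rng.randint(1, min(3, pile))
        flips += 1
    # After an even number of playout moves, the player who moved into
    # `node` took the last counter (flips == 0 means they already had).
    won = (flips % 2 == 0)
    # 4. Backpropagation: run back to base camp, flipping perspective
    #    at each level, since the other player made the move one level up.
    while node is not None:
        node.visits += 1
        node.wins += won
        won = not won
        node = node.parent

def best_move(pile, rounds=4000, seed=1):
    rng = random.Random(seed)
    root = Node(pile)
    for _ in range(rounds):
        mcts_round(root, rng)
    return max(root.children, key=lambda c: c.visits).move  # most-trusted move

print(best_move(3))  # taking all 3 counters wins on the spot
```

Because wins are credited to the player who just moved, each player's choices pull the tree toward moves that are good for them, and the visit counts pile up asymmetrically along contested lines, exactly the burrowing behavior described above.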

00:10:00.789 --> 00:10:03.529
Exactly. It focuses the computing power exactly

00:10:03.529 --> 00:10:05.769
where the battle is actually happening. But wait.

00:10:06.029 --> 00:10:08.370
If backpropagation is constantly updating the

00:10:08.370 --> 00:10:10.350
scoreboard and telling the scout where the gold

00:10:10.350 --> 00:10:13.110
is, how does the algorithm decide whether to

00:10:13.110 --> 00:10:16.210
keep mining that known gold vein or to go look

00:10:16.210 --> 00:10:18.669
for a completely different, maybe bigger gold

00:10:18.669 --> 00:10:22.139
vein in a different part of the forest? Now we

00:10:22.139 --> 00:10:24.600
are hitting on the ultimate dilemma in computer

00:10:24.600 --> 00:10:28.019
science and frankly in human psychology. Exploration

00:10:28.019 --> 00:10:30.539
versus exploitation. This is one of the most

00:10:30.539 --> 00:10:33.000
relatable concepts in the source material. Think

00:10:33.000 --> 00:10:35.480
about it like going out to dinner. Exploitation

00:10:35.480 --> 00:10:37.980
is deciding to go to that taco place down the

00:10:37.980 --> 00:10:40.299
street because you have been there ten times

00:10:40.299 --> 00:10:42.639
and you know it's a solid nine out of ten. You

00:10:42.639 --> 00:10:44.539
have the data. It's a safe bet. Right. You know

00:10:44.539 --> 00:10:46.700
what you're getting. But exploration on the other

00:10:46.700 --> 00:10:49.580
hand is trying the brand new sushi spot that

00:10:49.580 --> 00:10:52.559
just opened across town. It might be a transcendent

00:10:52.559 --> 00:10:54.960
10 out of 10, but it also might give you terrible

00:10:54.960 --> 00:10:57.139
food poisoning. You have absolutely no data.

00:10:57.860 --> 00:11:01.019
So how does a machine solve human FOMO, the fear

00:11:01.019 --> 00:11:03.679
of missing out? This raises an important question,

00:11:03.899 --> 00:11:06.240
and it's one that researchers Levente Kocsis and

00:11:06.240 --> 00:11:09.840
Csaba Szepesvári solved in 2006. Building off

00:11:09.840 --> 00:11:12.740
earlier work from 2002 regarding Markov decision

00:11:12.740 --> 00:11:15.779
processes, they introduced a mathematical formula

00:11:15.779 --> 00:11:18.779
called UCT, which stands for Upper Confidence

00:11:18.779 --> 00:11:23.299
Bounds Applied to Trees. UCT is essentially the

00:11:23.299 --> 00:11:26.179
mathematical cure for FOMO. I love that. Break

00:11:26.179 --> 00:11:29.039
the UCT math down for us. How does it balance

00:11:29.039 --> 00:11:32.120
the tacos and the sushi? So, the UCT formula

00:11:32.120 --> 00:11:34.639
assigns a numerical score to every possible next

00:11:34.639 --> 00:11:37.139
move, and the computer always picks the move

00:11:37.139 --> 00:11:39.620
with the highest score. The formula is essentially

00:11:39.620 --> 00:11:42.120
two halves added together. The first half is

00:11:42.120 --> 00:11:44.399
the exploitation component. It simply looks at

00:11:44.399 --> 00:11:46.860
the average win ratio of the move. If a move

00:11:46.860 --> 00:11:49.279
wins a lot, this number is high. It aggressively

00:11:49.279 --> 00:11:51.720
rewards going to the taco place. Okay, simple

00:11:51.720 --> 00:11:53.879
enough. But if it only did that, it would just

00:11:53.879 --> 00:11:55.419
find one good move and never look at anything

00:11:55.419 --> 00:11:58.059
else. Exactly, which is why the second half of

00:11:58.059 --> 00:12:00.580
the formula is the exploration component. This

00:12:00.580 --> 00:12:03.240
part is a fraction. The numerator, the top number,

00:12:03.419 --> 00:12:05.580
is based on the total number of times the parent

00:12:05.580 --> 00:12:08.240
node has been visited. The denominator, the bottom

00:12:08.240 --> 00:12:10.600
number, is the number of times this specific

00:12:10.600 --> 00:12:13.299
child node has been visited. So if the computer

00:12:13.299 --> 00:12:15.879
has run 100 simulations through the main base

00:12:15.879 --> 00:12:19.080
camp, but has only sent a scout down this one

00:12:19.080 --> 00:12:22.039
specific new path twice. Then you have a large

00:12:22.039 --> 00:12:24.320
number divided by a very small number, which

00:12:24.320 --> 00:12:27.000
creates a massive result. As the total number

00:12:27.000 --> 00:12:29.519
of simulations grows, the exploration value of

00:12:29.519 --> 00:12:31.980
those unvisited, neglected nodes gets artificially

00:12:31.980 --> 00:12:34.820
inflated. Mathematically, the unknown sushi place

00:12:34.820 --> 00:12:37.220
starts screaming louder and louder until its

00:12:37.220 --> 00:12:39.580
score actually eclipses the safe taco place,

00:12:39.899 --> 00:12:42.179
and the algorithm is practically forced to send

00:12:42.179 --> 00:12:44.740
a scout to go try it. That's so cool! Yeah, it

00:12:44.740 --> 00:12:46.919
forces the computer to periodically double-check

00:12:46.919 --> 00:12:49.740
its blind spots. That is absolutely brilliant.
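The two halves of the UCT score as just described, win ratio plus an exploration bonus built from parent and child visit counts, take only a few lines. A sketch; the constant c = √2 is the common textbook default, an assumption rather than a value from the episode:

```python
import math

def uct_score(child_wins: int, child_visits: int, parent_visits: int,
              c: float = math.sqrt(2)) -> float:
    """UCT = exploitation + exploration.
    First term: the move's average win rate (the reliable taco place).
    Second term: grows as a node is neglected relative to its parent
    (the untried sushi place 'screaming louder')."""
    if child_visits == 0:
        return float("inf")            # unvisited moves get tried first
    exploitation = child_wins / child_visits
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploitation + exploration

# The taco/sushi example: the parent node has been explored 100 times.
tacos = uct_score(child_wins=9, child_visits=10, parent_visits=100)  # 90%, well sampled
sushi = uct_score(child_wins=1, child_visits=2, parent_visits=100)   # 50%, barely sampled
print(f"tacos: {tacos:.3f}, sushi: {sushi:.3f}")
```

With those numbers the barely-sampled move outscores the 90% winner, which is exactly the moment the algorithm is "forced to send a scout" to the sushi place; shrinking c tilts the balance back toward exploitation.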

00:12:49.960 --> 00:12:52.019
It mathematically guarantees that the system

00:12:52.019 --> 00:12:55.220
won't just get stuck in a comfortable rut. It

00:12:55.220 --> 00:12:58.500
inherently values curiosity. And armed with this

00:12:58.500 --> 00:13:01.440
UCT formula, Monte Carlo Tree Search basically

00:13:01.440 --> 00:13:04.039
took over the gaming world. The milestones from

00:13:04.039 --> 00:13:06.240
the sources are incredible. They really are.

00:13:06.440 --> 00:13:10.059
In 2008, a program called MoGo reached master

00:13:10.059 --> 00:13:14.679
level in 9x9 Go. By 2012, a program called Zen

00:13:14.919 --> 00:13:18.320
beat an amateur in the full 19 by 19 game. And

00:13:18.320 --> 00:13:22.659
the absolute pinnacle arrived in 2016. DeepMind

00:13:22.659 --> 00:13:25.840
developed AlphaGo, which combined this UCT-guided

00:13:25.840 --> 00:13:28.659
Monte Carlo tree search with deep neural networks.

00:13:28.679 --> 00:13:31.960
Like, the big match. Yes. AlphaGo played a five

00:13:31.960 --> 00:13:34.379
-game match against Lee Sedol, one of the greatest

00:13:34.379 --> 00:13:36.840
professional Go players in human history, and

00:13:36.840 --> 00:13:39.720
defeated him four to one. It was widely considered

00:13:39.720 --> 00:13:41.840
a watershed moment in artificial intelligence.

00:13:42.039 --> 00:13:44.440
Because the advantages of MCTS are perfectly

00:13:44.440 --> 00:13:46.940
suited for it. It doesn't need an explicit evaluation

00:13:46.940 --> 00:13:49.580
function. It handles massive branching factors

00:13:49.580 --> 00:13:52.019
beautifully because it grows asymmetrically, and

00:13:52.019 --> 00:13:54.679
it can be stopped at any time to just yield the

00:13:54.679 --> 00:13:57.440
best move it has found so far. But here's where

00:13:57.440 --> 00:14:00.100
it gets really interesting. You just mentioned

00:14:00.100 --> 00:14:04.159
AlphaGo beat Lee Sedol 4 to 1, which means Lee Sedol,

00:14:04.360 --> 00:14:08.740
a human brain, won game four. He did. Wait. If

00:14:08.740 --> 00:14:11.620
AlphaGo is this omniscient AI that evaluates

00:14:11.620 --> 00:14:14.980
millions of futures a second and uses perfect

00:14:14.980 --> 00:14:18.139
UCT math to balance exploration and exploitation,

00:14:19.019 --> 00:14:21.620
how on earth did a human beat it? What happened

00:14:21.620 --> 00:14:24.009
in game four? You are pointing out the Achilles

00:14:24.009 --> 00:14:26.309
heel of Monte Carlo tree search. In the field,

00:14:26.330 --> 00:14:28.809
we call them trap states. Trap states. Yeah.

00:14:28.929 --> 00:14:30.610
Remember how we said the algorithm's greatest

00:14:30.610 --> 00:14:33.129
strength is its asymmetrical growth. It focuses

00:14:33.129 --> 00:14:35.690
its computing power only on the promising paths

00:14:35.690 --> 00:14:38.309
and heavily prunes away the sequences it deems

00:14:38.309 --> 00:14:40.649
less relevant or statistically losing. Right.

00:14:40.649 --> 00:14:43.090
It ignores the bad moves to save time. But what

00:14:43.090 --> 00:14:45.250
if a move looks superficially bad statistically?

00:14:45.370 --> 00:14:48.389
It loses 99% of the random rollouts. But it

00:14:48.389 --> 00:14:50.529
actually sets up a devastating, highly subtle,

00:14:50.590 --> 00:14:53.980
long -term trap 20 moves later. Oh, no way. Because

00:14:53.980 --> 00:14:56.399
MCTS stops exploring paths that initially look

00:14:56.399 --> 00:14:59.879
poor, a deeply subtle trap laid by a genius level

00:14:59.879 --> 00:15:02.340
human can fall completely off the algorithm's

00:15:02.340 --> 00:15:05.820
search radar. Oh. So it literally engineers its

00:15:05.820 --> 00:15:09.720
own tunnel vision. AlphaGo looked at Lee Sedol's

00:15:09.720 --> 00:15:12.700
incredibly weird setup move in game four, ran

00:15:12.700 --> 00:15:15.700
a few simulations, said, you know, statistically,

00:15:15.820 --> 00:15:17.480
this looks like a weak line of play. I'm going

00:15:17.480 --> 00:15:20.000
to ignore it and go mine my gold vein over here.

00:15:20.159 --> 00:15:22.059
Exactly. And by the time the trap was finally

00:15:22.059 --> 00:15:24.740
sprung, dozens of moves later, it was too late

00:15:24.740 --> 00:15:27.379
for the computer to recover. Right. It confidently

00:15:27.379 --> 00:15:29.840
pruned the very branch that held its own defeat.

00:15:30.039 --> 00:15:34.019
Wow. So to fix these trap states and to deal

00:15:34.019 --> 00:15:36.039
with the massive computational time required

00:15:36.039 --> 00:15:38.620
to run millions of games, software engineers

00:15:38.620 --> 00:15:41.899
had to figure out ways to supercharge MCTS. They

00:15:41.899 --> 00:15:44.320
had to speed run the algorithm without ruining

00:15:44.320 --> 00:15:46.080
the mechanics that made it special in the first

00:15:46.080 --> 00:15:48.299
place. And the sources outlined several major

00:15:48.299 --> 00:15:50.500
improvements engineers developed to do just that.

00:15:50.919 --> 00:15:52.779
One of the primary shifts was moving from light

00:15:52.779 --> 00:15:54.960
playouts to heavy playouts. Light playouts

00:15:54.960 --> 00:15:56.940
being the original concept, right? Like completely

00:15:56.940 --> 00:16:00.389
random moves, sprints in the dark. Yes. Heavy

00:16:00.389 --> 00:16:03.009
playouts, on the other hand, apply heuristics,

00:16:03.330 --> 00:16:05.710
human rules of thumb, to influence the choices

00:16:05.710 --> 00:16:08.789
during the random sprint. Like prioritizing the

00:16:08.789 --> 00:16:11.850
last-good-reply heuristic, or teaching the simulation

00:16:11.850 --> 00:16:14.889
to recognize specific, known patterns of stones

00:16:14.889 --> 00:16:16.970
in the game of Go and react to them. But hold on,

00:16:16.990 --> 00:16:19.129
I have to step in here. Aren't we just cheating

00:16:19.129 --> 00:16:22.700
at that point? How so? If the entire beauty of

00:16:22.700 --> 00:16:24.980
this system was that the machine learns on its

00:16:24.980 --> 00:16:28.460
own from pure unadulterated randomness without

00:16:28.460 --> 00:16:32.080
human bias, doesn't injecting expert human knowledge

00:16:32.080 --> 00:16:34.500
back into the simulations defeat the whole purpose?

00:16:34.759 --> 00:16:36.639
If we connect this to the bigger picture, it

00:16:36.639 --> 00:16:39.299
reveals a genuinely fascinating paradox mentioned

00:16:39.299 --> 00:16:41.659
in the research. You would logically think that

00:16:41.659 --> 00:16:44.179
forcing the AI to play perfect expert human moves

00:16:44.179 --> 00:16:46.000
during its simulations would make its overall

00:16:46.000 --> 00:16:48.379
strategy stronger. Yeah, of course. But it turns

00:16:48.379 --> 00:16:50.620
out playing sub-optimally during those random

00:16:50.639 --> 00:16:53.080
simulations actually sometimes makes a Monte

00:16:53.080 --> 00:16:55.440
Carlo tree search program play stronger overall.

00:16:55.740 --> 00:16:58.179
Wait, making mistakes on purpose makes it better.

00:16:58.580 --> 00:17:02.100
Yes. The AI desperately needs a little bit of

00:17:02.100 --> 00:17:05.160
error to remain flexible. If you inject too much

00:17:05.160 --> 00:17:07.400
expert human knowledge and make the simulations

00:17:07.400 --> 00:17:10.940
too rigid and too perfect, the algorithm ironically

00:17:10.940 --> 00:17:14.039
loses the ability to discover those weird, unpredictable

00:17:14.039 --> 00:17:16.619
paths that human experts haven't thought of.

00:17:16.839 --> 00:17:19.599
It needs the noise. It needs the chaos to find

00:17:19.599 --> 00:17:22.140
the brilliant exceptions to the rule. That is

00:17:22.140 --> 00:17:24.559
incredible. The flaw is the feature. Exactly.
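A heavy playout policy is often implemented as exactly this kind of controlled noise: follow a cheap rule of thumb most of the time, but keep a slice of pure randomness. An illustrative sketch, where the 25% noise level and the generic `heuristic_move` argument are assumptions, not details from the sources:

```python
import random

def heavy_playout_move(legal_moves, heuristic_move, rng, epsilon=0.25):
    """One move of a 'heavy' playout: usually follow a cheap human rule of
    thumb, but keep a slice of pure randomness so the search can still
    stumble onto lines the heuristic would never consider.
    The 25% noise level is an illustrative assumption."""
    if heuristic_move in legal_moves and rng.random() > epsilon:
        return heuristic_move          # exploit the rule of thumb
    return rng.choice(legal_moves)     # keep the noise

rng = random.Random(0)
moves = ["a", "b", "c"]
picks = [heavy_playout_move(moves, "a", rng) for _ in range(10_000)]
print(picks.count("a") / len(picks))   # roughly 0.75 + 0.25/3 ~ 0.83
```

Tuning `epsilon` is exactly the paradox above: set it to zero and the playouts become rigid and "too perfect," losing the chaos the search needs to find the exceptions.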

00:17:24.960 --> 00:17:27.819
And to speed the computations up even more, researchers

00:17:27.819 --> 00:17:30.759
implemented techniques like RAVE, rapid action

00:17:30.759 --> 00:17:32.980
value estimation. What's that? It's basically

00:17:32.980 --> 00:17:35.880
a shortcut. If placing a piece in a certain spot

00:17:35.880 --> 00:17:38.519
is a really good idea later in a random simulation,

00:17:39.099 --> 00:17:41.059
the algorithm assumes it might be a good idea

00:17:41.059 --> 00:17:43.680
to play it right now. It updates the value of

00:17:43.680 --> 00:17:46.299
that move across the whole tree, regardless of

00:17:46.299 --> 00:17:48.480
the exact sequence of prior moves. Oh, that makes

00:17:48.480 --> 00:17:51.180
sense. It's all about extracting the absolute

00:17:51.180 --> 00:17:53.799
maximum amount of actionable information from

00:17:53.799 --> 00:17:55.880
every single random sprint. Right. And of course,

00:17:55.980 --> 00:17:57.940
the most brute force way to supercharge this

00:17:57.940 --> 00:18:00.380
is parallelization: running multiple computing

00:18:00.380 --> 00:18:02.819
threads at once. But that presents its own mechanical

00:18:02.819 --> 00:18:04.920
nightmare. Because if you have multiple scouts

00:18:04.920 --> 00:18:07.359
running through the forest, how do they all share

00:18:07.359 --> 00:18:10.859
the same map? The source material lists leaf,

00:18:11.279 --> 00:18:14.079
root, and tree parallelization. But how does

00:18:14.079 --> 00:18:16.140
that actually work without breaking the system?

00:18:16.319 --> 00:18:18.960
Well, think about tree parallelization, where

00:18:18.960 --> 00:18:21.539
multiple processors are trying to build and update

00:18:21.539 --> 00:18:25.279
the exact same game tree simultaneously. Imagine

00:18:25.279 --> 00:18:27.900
four of your scouts running back to base camp

00:18:27.900 --> 00:18:30.819
and all trying to draw on the exact same physical

00:18:30.819 --> 00:18:32.940
piece of paper at the same time. That would be

00:18:32.940 --> 00:18:35.359
a mess. They're going to bump hands, scribble

00:18:35.359 --> 00:18:37.180
over each other's lines, and completely ruin

00:18:37.180 --> 00:18:40.400
the map. Exactly. In computer science, to prevent

00:18:40.400 --> 00:18:42.599
processors from overwriting each other's data,

00:18:42.569 --> 00:18:45.849
you have to use mutexes, mutual exclusion locks.

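That lock-update-unlock cycle can be sketched with Python's standard `threading` module. The `SharedNode` class and its fields are hypothetical, just to show the pattern of a per-node mutex guarding shared statistics.

```python
import threading

class SharedNode:
    """A tree node shared by several worker threads; its statistics are
    protected by a mutex (mutual exclusion lock)."""
    def __init__(self):
        self.visits = 0
        self.wins = 0
        self.lock = threading.Lock()  # the 'one hand on the map at a time' rule

    def record(self, won):
        # Lock, update, unlock: without this, two threads could read the same
        # stale counts and one of the two updates would be silently lost.
        with self.lock:
            self.visits += 1
            self.wins += won

def worker(node, results):
    for result in results:
        node.record(result)

node = SharedNode()
threads = [threading.Thread(target=worker, args=(node, [1, 0] * 500))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(node.visits)  # 4000: all 4 x 1000 updates survived the contention
```

The cost is also visible here: every `record` call serializes on the lock, which is precisely the "everyone stands around waiting to draw" problem discussed next.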
00:18:46.309 --> 00:18:48.609
One processor essentially locks the data node,

00:18:49.029 --> 00:18:51.329
updates it, and unlocks it before the next one

00:18:51.329 --> 00:18:53.529
can touch it. But if they're constantly locking

00:18:53.529 --> 00:18:55.910
the map, everyone else has to stand around waiting

00:18:55.910 --> 00:18:58.509
to draw, which completely defeats the purpose

00:18:58.509 --> 00:19:01.230
of being fast. Which is why alternatives like

00:19:01.230 --> 00:19:04.150
root parallelization are so popular. Instead

00:19:04.150 --> 00:19:06.410
of sharing one map, you just build completely

00:19:06.410 --> 00:19:09.390
separate base camps. You let four different processors

00:19:09.390 --> 00:19:11.490
build their own completely independent trees,

00:19:11.829 --> 00:19:13.529
and at the end of the time limit, they all just

00:19:13.529 --> 00:19:16.230
vote on the best opening move. No shared map,

00:19:16.410 --> 00:19:18.589
no clashing hands. So what does this all mean?

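Root parallelization as just described can be sketched like this. Everything game-specific here is invented for illustration (the move names, the win probabilities, the playout counts), and the four independent searches run sequentially for clarity where a real implementation would use separate processes or threads.

```python
import random
from collections import Counter

MOVES = ["corner", "center", "edge"]

def independent_search(seed, n_playouts=300):
    """Stand-in for one processor growing its own private tree from scratch:
    each 'base camp' runs its own playouts and reports the move with the
    best observed win rate."""
    rng = random.Random(seed)
    wins, plays = Counter(), Counter()
    for _ in range(n_playouts):
        move = rng.choice(MOVES)
        plays[move] += 1
        # Hypothetical game: pretend 'center' wins a bit more often.
        p_win = 0.6 if move == "center" else 0.4
        wins[move] += rng.random() < p_win
    return max(MOVES, key=lambda m: wins[m] / plays[m])

# Four completely independent searches: no shared tree, no locks.
favourites = [independent_search(seed) for seed in range(4)]
# The camps only meet at the very end, to vote on the opening move.
best_move, votes = Counter(favourites).most_common(1)[0]
print(best_move, votes)
```

The design trade is the one the hosts name: each tree is shallower than one big shared tree would be, but there is zero locking overhead, so it scales almost perfectly with the number of processors.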
00:19:18.990 --> 00:19:21.190
We started this deep dive with the impossible

00:19:21.190 --> 00:19:23.930
task of mapping a cross -country trip where the

00:19:23.930 --> 00:19:26.390
potential routes outnumber the atoms in the universe.

00:19:26.490 --> 00:19:28.869
A truly impossible task. We looked at a problem

00:19:28.869 --> 00:19:31.609
that brute force and human calculation could

00:19:31.609 --> 00:19:34.400
never solve. And what we found is that Monte

00:19:34.400 --> 00:19:37.779
Carlo tree search takes the overwhelmingly infinite

00:19:37.779 --> 00:19:41.660
and makes it solvable. Not through perfect godlike

00:19:41.660 --> 00:19:44.119
knowledge of the whole map, but through structured

00:19:44.119 --> 00:19:47.299
bounded exploration. It uses just a little bit

00:19:47.299 --> 00:19:49.839
of randomness guided by the elegant math of what

00:19:49.839 --> 00:19:52.559
works, constantly balancing the safety of the

00:19:52.559 --> 00:19:55.220
gold it knows against the potential of the wild

00:19:55.220 --> 00:19:57.660
frontier it hasn't explored yet. It truly is

00:19:57.660 --> 00:20:00.220
a profound mechanism for navigating the unknown.

00:20:00.859 --> 00:20:03.359
And the sources specifically note that MCTS is

00:20:03.359 --> 00:20:05.579
increasingly being used in applications outside

00:20:05.579 --> 00:20:08.460
of games, which leaves us with a highly provocative

00:20:08.460 --> 00:20:10.720
thought to consider. Love those. Let's hear it.

00:20:10.859 --> 00:20:13.660
If an algorithm can beautifully balance exploitation

00:20:13.660 --> 00:20:15.920
and exploration to navigate the near infinite

00:20:15.920 --> 00:20:19.319
futures of a go board, could we one day use MCTS

00:20:19.319 --> 00:20:21.900
style simulations to map out the consequences

00:20:21.900 --> 00:20:25.019
of our own massive global societal decisions?

00:20:25.500 --> 00:20:29.000
Are we as a society stuck exploiting the comfortable

00:20:29.000 --> 00:20:31.660
paths we already know? And is it time we purposely

00:20:31.660 --> 00:20:34.500
injected a little UCT exploration into our own

00:20:34.500 --> 00:20:36.779
lives to find the better future that might be

00:20:36.779 --> 00:20:38.680
hidden right off our current radar? To purposely

00:20:38.680 --> 00:20:40.799
wander off the known map, just see what's out

00:20:40.799 --> 00:20:42.480
there. I love that. Thank you for joining us

00:20:42.480 --> 00:20:44.500
on this deep dive. As always, keep questioning

00:20:44.500 --> 00:20:46.279
the map and stay endlessly curious.
