WEBVTT

00:00:00.000 --> 00:00:03.180
You know, whether you are furiously prepping

00:00:03.180 --> 00:00:05.860
for a highly technical data science meeting right

00:00:05.860 --> 00:00:09.039
now, or maybe you just have this kind of insatiable

00:00:09.039 --> 00:00:11.460
curiosity about how modern artificial intelligence

00:00:11.460 --> 00:00:14.300
actually makes decisions under the hood, today's

00:00:14.300 --> 00:00:17.059
Deep Dive is designed specifically for you. Yeah,

00:00:17.059 --> 00:00:19.339
we're looking at a system that basically takes

00:00:19.339 --> 00:00:23.620
pure injected chaos and somehow, against all

00:00:23.620 --> 00:00:26.920
logic, turns it into razor sharp insight. Right.

00:00:27.070 --> 00:00:29.550
It genuinely pushes the boundaries of how we

00:00:29.550 --> 00:00:32.270
think about problem solving. I mean, we are trained

00:00:32.270 --> 00:00:34.289
to think that adding randomness to a logical

00:00:34.289 --> 00:00:37.289
process degrades it. Oh, absolutely. But in

00:00:37.289 --> 00:00:39.950
this case, throwing blindfolds and intentional

00:00:39.950 --> 00:00:42.549
noise at an algorithm actually forces it to become

00:00:42.549 --> 00:00:44.649
significantly more precise. It's fascinating.

00:00:44.729 --> 00:00:47.590
It really is. So today our mission is to completely

00:00:47.590 --> 00:00:49.630
demystify this foundational machine learning

00:00:49.630 --> 00:00:52.649
method. We are using a really comprehensive Wikipedia

00:00:52.649 --> 00:00:55.329
article on the random forest algorithm as our

00:00:55.329 --> 00:00:57.030
source material. And there's a lot to cover in

00:00:57.030 --> 00:00:59.960
there. Oh, tons. We're going to explore how this

00:00:59.960 --> 00:01:02.960
evolved from a theoretical concept in the 90s

00:01:02.960 --> 00:01:05.739
into this mathematical powerhouse. We'll unpack

00:01:05.739 --> 00:01:08.280
the notorious black box of how it evaluates its

00:01:08.280 --> 00:01:11.640
own decisions. And we're also going to reveal

00:01:11.640 --> 00:01:14.599
some truly surprising hidden connections it shares

00:01:14.599 --> 00:01:17.159
with entirely different algorithms. Yeah, the

00:01:17.159 --> 00:01:18.959
connections to other machine learning concepts

00:01:18.959 --> 00:01:22.299
are probably my favorite part. Same here. OK,

00:01:22.439 --> 00:01:26.069
let's unpack this. And to do that, we have to

00:01:26.069 --> 00:01:28.769
start with the fundamental building block. Because

00:01:28.769 --> 00:01:30.909
before we can talk about a forest, we need to

00:01:30.909 --> 00:01:33.260
look at a single decision tree. Right, because

00:01:33.260 --> 00:01:35.739
the entire genius of the forest only makes sense

00:01:35.739 --> 00:01:38.019
once you understand the fatal flaw of the single

00:01:38.019 --> 00:01:40.739
tree. Exactly. Now decision trees themselves

00:01:40.739 --> 00:01:43.500
are incredibly popular. The source material notes

00:01:43.500 --> 00:01:45.780
they are essentially the standard off-the-shelf

00:01:45.780 --> 00:01:48.079
procedure for data mining. For good reason, honestly.

00:01:48.159 --> 00:01:50.540
Yeah. For one, they are invariant under scaling.

00:01:50.700 --> 00:01:53.280
You don't have to perfectly normalize or standardize

00:01:53.280 --> 00:01:56.239
all your data to tight little ranges before you

00:01:56.239 --> 00:01:58.540
feed it in. Which saves a massive amount of pre

00:01:58.540 --> 00:02:01.469
-processing time. Right. And they are also incredibly

00:02:01.469 --> 00:02:04.430
robust against irrelevant features and, maybe

00:02:04.430 --> 00:02:06.810
most importantly for developers, they produce

00:02:06.810 --> 00:02:10.069
highly inspectable models. You can visually trace

00:02:10.069 --> 00:02:12.110
the path from the root node all the way down

00:02:12.110 --> 00:02:14.590
to the leaf and see exactly the logic the model

00:02:14.590 --> 00:02:17.330
used. The interpretability is a massive selling

00:02:17.330 --> 00:02:20.689
point. But there's a catch. As researchers Hastie,

00:02:20.930 --> 00:02:23.229
Tibshirani, and Friedman point out in the text,

00:02:23.310 --> 00:02:26.229
there is a severe limitation. They are, in their

00:02:26.229 --> 00:02:30.039
words, seldom accurate. Which is... Obviously

00:02:30.039 --> 00:02:32.139
a catastrophic problem for a prediction model.

00:02:32.300 --> 00:02:35.500
Yeah, a bit of a deal breaker. The core mathematical

00:02:35.500 --> 00:02:38.500
issue arises when these trees are grown very

00:02:38.500 --> 00:02:40.560
deep. What do you mean by deep, exactly? Well,

00:02:40.560 --> 00:02:42.620
when a decision tree is allowed to keep asking

00:02:42.620 --> 00:02:45.379
questions, just making splits all the way down

00:02:45.379 --> 00:02:47.560
until it has isolated almost every individual

00:02:47.560 --> 00:02:50.620
data point, it begins to learn highly irregular

00:02:50.620 --> 00:02:53.280
idiosyncratic patterns. Ah, I see. In machine

00:02:53.280 --> 00:02:55.120
learning terms, this is described as having low

00:02:55.120 --> 00:02:57.759
bias, but very high variance. You know, I was

00:02:57.759 --> 00:03:00.099
thinking about this high variance problem. And

00:03:00.099 --> 00:03:02.539
it feels like a student who perfectly memorizes

00:03:02.539 --> 00:03:05.530
a specific practice test. Like they know that

00:03:05.530 --> 00:03:08.310
on page 3, the answer to the second math problem

00:03:08.310 --> 00:03:11.520
is 42. Right. They just rote memorize it. Exactly.

00:03:11.580 --> 00:03:13.879
So they get 100 % on the practice test, but then

00:03:13.879 --> 00:03:16.960
they completely bomb the actual final exam because

00:03:16.960 --> 00:03:18.960
they didn't learn the underlying mathematical

00:03:18.960 --> 00:03:21.639
formulas. They just memorize the specific noise

00:03:21.639 --> 00:03:24.219
and the quirks of that one practice test. They

00:03:24.219 --> 00:03:27.180
overfit. That is a perfect analogy. If we look

00:03:27.180 --> 00:03:30.039
at the math behind that overfitting, the tree

00:03:30.039 --> 00:03:32.659
isn't finding the true underlying signal of the

00:03:32.659 --> 00:03:36.120
data. It's just drawing incredibly jagged complex

00:03:36.120 --> 00:03:38.560
boundaries around the noise it was trained on.

00:03:38.729 --> 00:03:40.669
So it's hyper -fixating on the wrong things.

00:03:40.969 --> 00:03:43.150
Exactly. So if you feed it slightly different

00:03:43.150 --> 00:03:45.530
training data, the resulting tree looks completely

00:03:45.530 --> 00:03:47.729
different. That instability is the variance.
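
To make that instability concrete, here is a minimal, hedged scikit-learn sketch (synthetic data and settings are illustrative, not from the source): a tree grown to full depth nearly memorizes its training set but scores noticeably worse on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification problem (purely illustrative).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No depth limit: the tree keeps splitting until its leaves are (nearly) pure.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", deep_tree.score(X_train, y_train))  # typically ~1.0
print("test accuracy: ", deep_tree.score(X_test, y_test))    # noticeably lower
```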

00:03:47.830 --> 00:03:50.210
Which brings us to the actual topic today. Right.

00:03:50.210 --> 00:03:52.590
And that leads us to the core thesis of the random

00:03:52.590 --> 00:03:57.189
forest. The goal is to take multiple deep decision

00:03:57.189 --> 00:04:00.030
trees. These over -memorizing students. Yes.

00:04:00.110 --> 00:04:02.009
Take a bunch of those students, train them on

00:04:02.009 --> 00:04:03.930
slightly different variations of the same data,

00:04:04.250 --> 00:04:06.960
and average their predictions together. You sacrifice

00:04:06.960 --> 00:04:09.800
a tiny bit of bias and you completely lose that

00:04:09.800 --> 00:04:13.159
lovely interpretability, but you violently drive

00:04:13.159 --> 00:04:16.439
down the variance. Wow, okay. So we move from

00:04:16.439 --> 00:04:18.959
a single overfitting tree to planting a whole

00:04:18.959 --> 00:04:22.290
forest. But if we trace the evolution of this

00:04:22.290 --> 00:04:24.310
idea, the source notes, it didn't just happen

00:04:24.310 --> 00:04:26.470
overnight, right? No, not at all. It took years

00:04:26.470 --> 00:04:29.730
of theory. Right. Because back in 1995, researcher

00:04:29.730 --> 00:04:32.569
Tin Kam Ho created the first algorithm for this

00:04:32.569 --> 00:04:35.129
using something called the random subspace method.

00:04:35.470 --> 00:04:37.410
And she was actually building on even earlier

00:04:37.410 --> 00:04:40.009
theories by Eugene Kleinberg called stochastic

00:04:40.009 --> 00:04:42.360
discrimination. Yeah, the core realization there

00:04:42.360 --> 00:04:44.819
was that if you restrict what an algorithm can

00:04:44.819 --> 00:04:47.139
see— Like literally blinding it to certain parts

00:04:47.139 --> 00:04:49.300
of the data. Exactly. If you blind it, it actually

00:04:49.300 --> 00:04:51.800
gets smarter. And that idea was later expanded

00:04:51.800 --> 00:04:54.579
upon by Leo Breiman, who introduced a core mechanism

00:04:54.579 --> 00:04:57.540
that makes modern random forests work. And Adele

00:04:57.540 --> 00:05:00.720
Cutler was involved in extending this, too. Breiman

00:05:00.720 --> 00:05:03.060
and Cutler actually trademarked random forests

00:05:03.060 --> 00:05:06.920
in 2006. Oh, wait. It's trademarked. Yeah. Currently

00:05:06.920 --> 00:05:09.959
owned by Minitab Inc. But the core technique

00:05:09.959 --> 00:05:13.079
they popularized is called bagging, which is shorthand

00:05:13.079 --> 00:05:15.540
for bootstrap aggregating. OK. Let's break down

00:05:15.540 --> 00:05:18.000
how bagging actually works mechanically. Sure.

00:05:18.100 --> 00:05:20.860
So if you have a training data set, say 1,000

00:05:20.860 --> 00:05:23.579
rows of data, the algorithm doesn't just hand

00:05:23.579 --> 00:05:26.360
all 1,000 rows to every tree. That would just

00:05:26.360 --> 00:05:29.379
make identical trees. Right. Instead, it repeatedly

00:05:29.379 --> 00:05:32.519
samples that data with replacement to build a

00:05:32.519 --> 00:05:34.860
unique training set for each individual tree.

00:05:35.139 --> 00:05:39.069
Wait. With replacement, meaning? It's like drawing

00:05:39.069 --> 00:05:41.209
names from a hat, but after you draw a name and

00:05:41.209 --> 00:05:43.810
write it down, you put this slip of paper back

00:05:43.810 --> 00:05:46.170
into the hat before you draw again. Correct.

00:05:46.529 --> 00:05:48.629
Because you are putting the data points back,

00:05:48.949 --> 00:05:51.610
a single tree's training set might contain duplicates

00:05:51.610 --> 00:05:53.930
of certain rows while completely missing others.
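
As a quick illustration of that hat-drawing idea, here is a minimal Python sketch (the row count and variable names are made up for the example). It draws row indices with replacement and reports how much of the original data one tree actually sees.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_rows = 1000  # hypothetical size of the training set

# One bootstrap sample: draw n_rows indices *with replacement*.
bootstrap_indices = rng.integers(0, n_rows, size=n_rows)

unique_rows = np.unique(bootstrap_indices)
out_of_bag = np.setdiff1d(np.arange(n_rows), unique_rows)

print(f"unique rows seen by this tree: {len(unique_rows) / n_rows:.1%}")  # ~63%
print(f"rows left out (out-of-bag):    {len(out_of_bag) / n_rows:.1%}")   # ~37%
```

Each row is missed by one bootstrap draw with probability (1 - 1/n)^n, which approaches 1/e, about 0.37, so roughly 63% of the rows show up at least once, the figure quoted next.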

00:05:54.149 --> 00:05:57.009
Ah, okay. That introduces some randomness. Exactly.

00:05:57.290 --> 00:05:59.449
Well done. Statistically, drawing with replacement

00:05:59.449 --> 00:06:02.670
means each tree only sees about 63 % of the unique

00:06:02.670 --> 00:06:04.990
data points from the original set. The remaining

00:06:04.990 --> 00:06:08.120
37 % are left out entirely for... that specific

00:06:08.120 --> 00:06:10.420
tree. Okay, I actually have to push back on this

00:06:10.420 --> 00:06:12.740
logic for a second. Go for it. Even if we are

00:06:12.740 --> 00:06:15.000
sampling with replacement and shuffling the rows

00:06:15.000 --> 00:06:18.000
around, we are still generally feeding the same

00:06:18.000 --> 00:06:20.439
overall massive data set to a bunch of trees.

00:06:20.579 --> 00:06:23.759
If there is one incredibly dominant pattern in

00:06:23.759 --> 00:06:27.300
the data, won't every single tree just latch

00:06:27.300 --> 00:06:30.300
onto that exact same pattern first? That's a

00:06:30.300 --> 00:06:32.899
very valid concern. Right, like they'll all look

00:06:32.899 --> 00:06:35.199
slightly different at the bottom, but the core

00:06:35.199 --> 00:06:37.899
decisions at the top will be identical, so averaging

00:06:37.899 --> 00:06:40.439
them won't actually help that much. What's fascinating

00:06:40.439 --> 00:06:43.579
here is that Breiman anticipated that exact bottleneck.

00:06:43.819 --> 00:06:46.939
If you just use standard bagging, the trees remain

00:06:46.939 --> 00:06:50.199
highly correlated. They all make the same primary

00:06:50.199 --> 00:06:52.839
mistakes. So how do you fix it? The solution

00:06:52.839 --> 00:06:55.279
to break that correlation is called feature bagging.

00:06:55.579 --> 00:06:58.120
How does feature bagging change the math? Well,

00:06:58.279 --> 00:07:00.199
standard bagging samples the rows of your data

00:07:00.199 --> 00:07:02.879
set. Feature bagging samples the columns, the

00:07:02.879 --> 00:07:05.639
variables themselves. Oh, I see. Yeah. At every

00:07:05.639 --> 00:07:07.779
single node where a tree is trying to decide

00:07:07.779 --> 00:07:10.500
how to split the data, the algorithm steps in

00:07:10.500 --> 00:07:12.920
and says, no, you cannot look at all the features.

00:07:13.240 --> 00:07:15.740
You can only look at the small random subset

00:07:15.740 --> 00:07:18.420
of them. Wait, really? How small of a subset?

00:07:18.660 --> 00:07:21.660
For classification tasks, it typically restricts

00:07:21.660 --> 00:07:24.040
the tree to a subset equal to the square root

00:07:24.040 --> 00:07:26.879
of the total features. For regression, it's usually

00:07:26.879 --> 00:07:29.699
about a third. Oh, wow. I see the mechanism now.
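
In scikit-learn (the library mentioned later in the conversation), that per-split restriction is exposed as the max_features parameter. A hedged sketch with illustrative settings, not a prescription:

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: consider only about sqrt(total features) at each split.
clf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)

# Regression: a common rule of thumb is roughly one third of the features.
reg = RandomForestRegressor(n_estimators=500, max_features=1 / 3, random_state=0)
```

Library defaults have shifted across versions, so it is worth setting max_features explicitly whenever the subset size matters.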

00:07:29.839 --> 00:07:33.740
Yes. Let's say we're trying to predict if it

00:07:33.740 --> 00:07:37.540
will rain and one feature is the sky is completely

00:07:37.540 --> 00:07:40.420
black. Right. That is a massive dominant predictor.

00:07:40.459 --> 00:07:42.860
The most obvious one, yeah. Right. Without feature

00:07:42.860 --> 00:07:45.259
bagging, every single tree would just split on

00:07:45.259 --> 00:07:47.420
black sky right at the top node. They would all

00:07:47.420 --> 00:07:50.220
become clones of each other. Exactly. By forcing

00:07:50.220 --> 00:07:52.899
the trees to look at random restricted subsets

00:07:52.899 --> 00:07:55.699
of columns, you effectively de -correlate them.

00:07:56.019 --> 00:07:57.620
You force some trees to figure out how to predict

00:07:57.620 --> 00:07:59.579
the rain without being allowed to look at the

00:07:59.579 --> 00:08:01.600
black sky variable at all. That's brilliant.

00:08:01.720 --> 00:08:04.279
So they have to find secondary, tertiary, and

00:08:04.279 --> 00:08:07.360
completely hidden patterns in the humidity or

00:08:07.360 --> 00:08:10.470
the wind direction. You got it. When you average

00:08:10.470 --> 00:08:13.350
those uniquely constrained, highly diverse trees

00:08:13.350 --> 00:08:16.550
together, you create a true wisdom of the crowd

00:08:16.550 --> 00:08:19.730
effect that is far more resilient than any single

00:08:19.730 --> 00:08:23.189
tree. That is incredibly clever. Okay, so now

00:08:23.189 --> 00:08:25.550
we have a functioning forest. It's making highly

00:08:25.550 --> 00:08:27.810
accurate predictions by combining all these uniquely

00:08:27.810 --> 00:08:30.970
constrained trees. But human beings are naturally

00:08:30.970 --> 00:08:33.509
distrustful of black boxes. Oh, of course we

00:08:33.509 --> 00:08:36.210
are. We want to know the why. Right. If I feed

00:08:36.210 --> 00:08:38.629
a massive data set into this, I don't just want

00:08:38.629 --> 00:08:40.649
the final prediction. I want to know how the

00:08:40.649 --> 00:08:43.049
forest arrived at it. Like, if I'm using this

00:08:43.049 --> 00:08:44.710
for medical data, I don't just want to know if

00:08:44.710 --> 00:08:46.929
someone is sick. I want to know which symptom

00:08:46.929 --> 00:08:49.809
was the biggest red flag. Which clues were actually

00:08:49.809 --> 00:08:51.970
driving the math? Well, random forests have built

00:08:51.970 --> 00:08:54.850
-in mechanisms to rank variable importance. The

00:08:54.850 --> 00:08:57.289
first and arguably most robust method originally

00:08:57.289 --> 00:08:59.590
outlined by Breiman is permutation importance.

00:08:59.750 --> 00:09:01.649
Permutation, okay. Remember a moment ago when

00:09:01.649 --> 00:09:03.590
we talked about the bootstrap bagging process?

00:09:04.009 --> 00:09:06.129
How drawing with replacement naturally leaves

00:09:06.129 --> 00:09:09.450
out about 37 % of the data for any given tree.

00:09:09.710 --> 00:09:11.980
Right, the hat drawing method means some rows

00:09:11.980 --> 00:09:14.679
never get picked for tree number 42. Exactly.

00:09:15.100 --> 00:09:17.480
Those left out rows are called the out of bag

00:09:17.480 --> 00:09:20.480
samples. They are incredibly valuable because

00:09:20.480 --> 00:09:22.799
they act as a built -in testing set. Oh, built

00:09:22.799 --> 00:09:25.330
-in testing. That's handy. Right. To figure out

00:09:25.330 --> 00:09:28.230
how important a specific feature is, say, blood

00:09:28.230 --> 00:09:31.610
pressure in a medical data set, you first run

00:09:31.610 --> 00:09:33.870
those out -of -bag samples through their respective

00:09:33.870 --> 00:09:37.090
trees and record the baseline prediction error.

00:09:37.289 --> 00:09:39.590
Okay, establishing a baseline. Then you take

00:09:39.590 --> 00:09:41.610
the blood pressure column in those out -of -bag

00:09:41.610 --> 00:09:44.730
samples and you randomly scramble all the values.

00:09:45.149 --> 00:09:47.149
You permute them. Wait, you essentially break

00:09:47.149 --> 00:09:49.750
that specific column of data on purpose to see

00:09:49.750 --> 00:09:52.529
what happens. You break it and then you run those

00:09:52.529 --> 00:09:55.139
scrambled samples through the trees again. If

00:09:55.139 --> 00:09:57.399
the error jumps up massively, it proves that

00:09:57.399 --> 00:09:59.539
the model was heavily relying on blood pressure

00:09:59.539 --> 00:10:02.460
to make accurate predictions. It's highly important.
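
Here is a minimal sketch of that scramble-and-compare loop, assuming a fitted classifier and a held-out NumPy array rather than the per-tree out-of-bag samples Breiman originally used (all names are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def single_feature_permutation_importance(model, X_test, y_test, column, rng):
    """Drop in accuracy after scrambling one feature column."""
    baseline = accuracy_score(y_test, model.predict(X_test))

    X_scrambled = X_test.copy()
    # Break just this feature on purpose, leaving everything else intact.
    X_scrambled[:, column] = rng.permutation(X_scrambled[:, column])

    scrambled = accuracy_score(y_test, model.predict(X_scrambled))
    return baseline - scrambled  # a large drop means the model leaned on this feature
```

scikit-learn also ships a ready-made version of this idea as sklearn.inspection.permutation_importance.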

00:10:02.519 --> 00:10:05.179
And if it doesn't jump? If the error barely changes,

00:10:05.360 --> 00:10:07.379
it means the model wasn't really using that feature

00:10:07.379 --> 00:10:09.740
anyway. Okay, so permutation breaks the data

00:10:09.740 --> 00:10:13.039
intentionally. But didn't the tree already do

00:10:13.039 --> 00:10:15.679
a massive amount of math on which features were

00:10:15.679 --> 00:10:18.740
best when it was initially calculating the splits?

00:10:19.300 --> 00:10:21.559
Can't we just look at its own homework to see

00:10:21.559 --> 00:10:24.179
what it valued? We can, actually, and that is the

00:10:24.179 --> 00:10:27.059
second built-in method: mean decrease in impurity.

00:10:27.059 --> 00:10:29.539
Mean decrease in impurity, okay. When a tree is

00:10:29.539 --> 00:10:31.759
building itself, it evaluates a split by seeing

00:10:31.759 --> 00:10:34.659
how cleanly it separates the classes. Imagine

00:10:34.659 --> 00:10:38.159
a node with 10 patient records. If a split divides

00:10:38.159 --> 00:10:40.159
them, so one side has nine sick patients and

00:10:40.159 --> 00:10:43.340
the other has one, that is a very pure split.
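
As a small worked sketch of that purity idea (the metric named in a moment is the Gini impurity; the node contents below are invented in the spirit of the ten-patient example):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 = perfectly pure)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

mixed_node = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # 5 sick, 5 healthy
pure_node = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])   # 9 sick, 1 healthy

print(gini(mixed_node))  # 0.50 -- as impure as a two-class node gets
print(gini(pure_node))   # 0.18 -- much closer to a clean separation
```

Mean decrease in impurity, described next, credits a feature with the impurity reduction of every split it was chosen for, summed over all trees.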

00:10:43.519 --> 00:10:45.480
Right. It effectively segregated the data into

00:10:45.480 --> 00:10:48.279
distinct categories. Exactly. And the algorithm

00:10:48.279 --> 00:10:50.600
measures this purity using mathematical metrics

00:10:50.600 --> 00:10:54.039
like the Gini coefficient, entropy, or mean squared

00:10:54.039 --> 00:10:57.259
error for regression. OK. And then? Mean decrease

00:10:57.259 --> 00:11:00.679
in impurity simply tallies up how much a specific

00:11:00.679 --> 00:11:03.639
feature reduced the impurity across every single

00:11:03.639 --> 00:11:05.860
split it was involved in, across every tree in

00:11:05.860 --> 00:11:07.700
the entire forest. Oh, that makes sense. It's

00:11:07.700 --> 00:11:10.019
incredibly fast because, as you said, it's just

00:11:10.019 --> 00:11:12.120
aggregating the homework the algorithm already

00:11:12.120 --> 00:11:14.240
did during training. So what does this all mean

00:11:14.240 --> 00:11:17.220
for the end user? It means you have two distinct

00:11:17.220 --> 00:11:20.840
tools to peer inside the black box. But I imagine

00:11:20.840 --> 00:11:23.679
they aren't flawless. No, they both have significant

00:11:23.679 --> 00:11:27.320
drawbacks. Permutation importance unfairly favors

00:11:27.320 --> 00:11:30.100
features that have more distinct values. Oh,

00:11:30.100 --> 00:11:32.899
really? Yeah, and it breaks down when you have

00:11:32.899 --> 00:11:35.259
collinear features, variables that are highly

00:11:35.259 --> 00:11:37.679
correlated with each other, like height and leg

00:11:37.679 --> 00:11:39.639
length. Right, because if you scramble one, the

00:11:39.639 --> 00:11:41.879
model just leans on the other. Exactly, making

00:11:41.879 --> 00:11:44.360
both look less important than they actually are.

00:11:44.759 --> 00:11:46.860
And what's the catch with the impurity method?

00:11:47.539 --> 00:11:50.080
Impurity metrics, which happen to be the default

00:11:50.080 --> 00:11:52.720
in popular programming libraries like scikit

00:11:52.720 --> 00:11:55.620
-learn, suffer from high cardinality bias. High

00:11:55.620 --> 00:11:58.399
cardinality meaning, like, variables that have

00:11:58.399 --> 00:12:00.659
a massive number of unique values, like an ID

00:12:00.659 --> 00:12:03.679
number or a zip code. Exactly. The math of the

00:12:03.679 --> 00:12:06.039
Gini coefficient naturally favors features

00:12:06.039 --> 00:12:09.080
with lots of unique values, inflating their importance

00:12:09.080 --> 00:12:12.039
artificially. That sounds dangerous. It is. Worse,

00:12:12.360 --> 00:12:14.480
impurity importance only reflects statistics

00:12:14.480 --> 00:12:17.179
gathered during the training phase. It does not

00:12:17.179 --> 00:12:19.360
necessarily reflect how useful a feature will

00:12:19.360 --> 00:12:22.799
actually be when the model is hit with new unseen

00:12:22.799 --> 00:12:26.259
test data out in the real world. Wow. That is

00:12:26.259 --> 00:12:28.519
a huge caveat for any data scientist to keep

00:12:28.519 --> 00:12:30.840
in mind. OK, we've opened the black box. We've

00:12:30.840 --> 00:12:32.740
scrutinized the variable importance. Now I want

00:12:32.740 --> 00:12:34.639
to step back and look at the algorithm's DNA.

00:12:34.700 --> 00:12:37.529
Let's do it. Because the source material reveals

00:12:37.529 --> 00:12:39.970
that this supposedly unique ensemble of trees

00:12:39.970 --> 00:12:42.889
is secretly related to other foundational machine

00:12:42.889 --> 00:12:45.149
learning concepts. This is where the theory gets

00:12:45.149 --> 00:12:47.850
genuinely beautiful. Let's look at the relationship

00:12:47.850 --> 00:12:52.429
to k -nearest neighbors, or KNN. In 2002, researchers

00:12:52.429 --> 00:12:54.990
Lin and Jeon mathematically proved that random

00:12:54.990 --> 00:12:58.289
forests and KNN are actually both operating as

00:12:58.289 --> 00:13:01.289
weighted neighborhood schemes. Now standard k

00:13:01.289 --> 00:13:03.250
-nearest neighbors is pretty intuitive. It essentially

00:13:03.250 --> 00:13:05.169
says, tell me who your neighbors are and I'll

00:13:05.169 --> 00:13:07.000
tell you who you are. Exactly. Exactly. If you

00:13:07.000 --> 00:13:10.360
plot data on a graph, KNN draws a rigid physical

00:13:10.360 --> 00:13:12.840
boundary around a new data point and looks at

00:13:12.840 --> 00:13:14.919
the closest physical neighbors to classify it.

00:13:15.159 --> 00:13:18.120
But Lin and Jeon realized a random forest is doing

00:13:18.120 --> 00:13:21.039
the exact same thing just with a fundamentally

00:13:21.039 --> 00:13:23.840
different definition of what a neighbor is. How

00:13:23.840 --> 00:13:26.700
so? Well, instead of looking at physical Euclidean

00:13:26.700 --> 00:13:29.659
distance on a graph, a random forest defines

00:13:29.659 --> 00:13:32.379
a neighbor as a data point that ends up in the

00:13:32.379 --> 00:13:35.120
exact same final leaf note of a decision tree.

00:13:35.370 --> 00:13:39.309
as you do. Oh, that is a fascinating shift in

00:13:39.309 --> 00:13:41.710
perspective. It's like finding out your two completely

00:13:41.710 --> 00:13:43.950
separate groups of friends actually hang out

00:13:43.950 --> 00:13:46.570
without you. Yeah. KNN and random forests approach

00:13:46.570 --> 00:13:49.110
it so differently, but arrive at the same concept.

00:13:49.190 --> 00:13:52.090
It really is. Standard KNN is like drawing a

00:13:52.090 --> 00:13:54.250
rigid physical circle around your house and saying,

00:13:54.610 --> 00:13:56.809
everyone in this geographic circle is my neighbor.

00:13:57.120 --> 00:13:59.899
But a random forest defines a neighbor by shared

00:13:59.899 --> 00:14:02.659
traits. You both like jazz, own golden retrievers,

00:14:02.679 --> 00:14:05.200
and work night shifts. The neighborhood isn't

00:14:05.200 --> 00:14:07.759
a physical circle. It morphs dynamically based

00:14:07.759 --> 00:14:09.720
on what traits actually matter to the trees.
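
Lin and Jeon's observation can be written as one formula covering both methods; this is a sketch of the standard weighted-neighborhood form, with notation assumed rather than quoted from the source:

```latex
\hat{y}(x') = \sum_{i=1}^{n} W(x_i, x')\, y_i,
\qquad
W_{\text{forest}}(x_i, x') = \frac{1}{M} \sum_{j=1}^{M}
  \frac{\mathbf{1}\{\, x_i \text{ shares a leaf with } x' \text{ in tree } j \,\}}
       {\#\{\text{training points in that leaf}\}}
```

For k-nearest neighbors the weight is simply 1/k for the k physically closest points and 0 otherwise; the forest replaces that rigid circle with leaf membership.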

00:14:10.100 --> 00:14:12.840
This raises an important question though. Why

00:14:12.840 --> 00:14:15.200
is the forest's version of a neighborhood often

00:14:15.200 --> 00:14:17.419
mathematically superior? Yeah, why is it smarter?

00:14:17.659 --> 00:14:20.519
Lin and Jeon showed that because the random forest

00:14:20.519 --> 00:14:22.919
adapts the shape of its neighborhood based on

00:14:22.919 --> 00:14:25.620
the local importance of each feature, it doesn't

00:14:25.620 --> 00:14:28.879
get confused by irrelevant noise. It only groups

00:14:28.879 --> 00:14:31.240
you with data points that share the characteristics

00:14:31.240 --> 00:14:33.620
that are actually driving the outcome. You know,

00:14:33.639 --> 00:14:35.220
we've been talking about this whole process,

00:14:35.360 --> 00:14:38.000
assuming the forest knows what it's looking for,

00:14:38.500 --> 00:14:41.279
that we've handed it data neatly labeled with

00:14:41.279 --> 00:14:44.149
the correct answers during training. Right, supervised

00:14:44.149 --> 00:14:46.870
learning. But the real superpower mentioned in

00:14:46.870 --> 00:14:49.169
the source is that it doesn't always need us

00:14:49.169 --> 00:14:51.409
to hold its hand. It can operate completely in

00:14:51.409 --> 00:14:54.350
the dark, performing unsupervised learning. Oh,

00:14:54.429 --> 00:14:56.490
this is such a cool trick. You can effectively

00:14:56.490 --> 00:14:59.129
force a random forest into working without any

00:14:59.129 --> 00:15:01.690
labels at all. How? You do this by generating

00:15:01.690 --> 00:15:04.429
a massive set of synthetic data. You take your

00:15:04.429 --> 00:15:07.049
real data and you independently shuffle the values

00:15:07.049 --> 00:15:10.029
in every single column. This totally destroys

00:15:10.029 --> 00:15:12.769
any underlying structure or correlation between

00:15:12.769 --> 00:15:14.889
the features. OK, so you make a garbage data

00:15:14.889 --> 00:15:17.789
set. Right. Then you ask the random forest to

00:15:17.789 --> 00:15:20.009
classify whether a data point is from the real

00:15:20.009 --> 00:15:22.730
set or the shuffled synthetic set. So it's forced

00:15:22.730 --> 00:15:24.960
to learn that deep structural rules of the real

00:15:24.960 --> 00:15:27.659
data just to distinguish reality from random

00:15:27.659 --> 00:15:30.080
noise. Exactly. The source mentions this creates

00:15:30.080 --> 00:15:33.700
a dissimilarity measure called Addcl1. What exactly

00:15:33.700 --> 00:15:36.860
is that math doing? Addcl1 basically tracks how

00:15:36.860 --> 00:15:39.440
often two real data points end up in the same

00:15:39.440 --> 00:15:42.139
leaf node while the forest is trying to separate

00:15:42.139 --> 00:15:45.620
the real data from the fake data. If two patient

00:15:45.620 --> 00:15:48.000
records constantly land in the same leaf node,

00:15:48.399 --> 00:15:50.779
the algorithm mathematically concludes they are

00:15:50.779 --> 00:15:53.059
highly similar even if it doesn't know what disease

00:15:53.059 --> 00:15:56.000
they have. That's wild. Because random forests

00:15:56.000 --> 00:15:59.039
handle messy mixed variable types effortlessly,

00:15:59.639 --> 00:16:01.899
this specific dissimilarity measure has been

00:16:01.899 --> 00:16:04.620
famously used to find hidden clusters of patients

00:16:04.620 --> 00:16:07.259
based on incredibly complex tissue marker data.

00:16:07.600 --> 00:16:09.919
It handles outliers beautifully, too. Amazing.
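
A hedged sketch of that synthetic-data trick and the resulting leaf-sharing similarity, in scikit-learn terms (the data, sizes, and the simple pairwise proximity at the end are illustrative, not the exact Addcl1 construction from the literature):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(seed=0)
X_real = rng.normal(size=(200, 5))  # stand-in for real, unlabeled data

# Synthetic "garbage" data: shuffle each column independently,
# which destroys the correlations between features.
X_fake = np.column_stack(
    [rng.permutation(X_real[:, j]) for j in range(X_real.shape[1])]
)

X = np.vstack([X_real, X_fake])
y = np.concatenate([np.ones(len(X_real)), np.zeros(len(X_fake))])  # real vs. synthetic

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Proximity: the fraction of trees in which two real points land in the same leaf.
leaves = forest.apply(X_real)                  # shape (n_points, n_trees)
proximity_0_1 = (leaves[0] == leaves[1]).mean()
print(proximity_0_1)
```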

00:16:10.179 --> 00:16:12.259
Well, if random forests are this versatile and

00:16:12.259 --> 00:16:14.789
this mathematically adaptable, What happens when

00:16:14.789 --> 00:16:17.149
researchers push the logic to the absolute extreme?

00:16:17.409 --> 00:16:19.710
You get some highly theoretical boundary pushing

00:16:19.710 --> 00:16:22.090
variations. One of the most striking is extra

00:16:22.090 --> 00:16:24.690
trees, which stands for extremely randomized

00:16:24.690 --> 00:16:26.929
trees. Here's where it gets really interesting

00:16:26.929 --> 00:16:30.769
to me. In standard random forests, we've established

00:16:30.769 --> 00:16:33.549
that we inject randomness by sampling the data

00:16:33.549 --> 00:16:36.350
rows and restricting the feature columns. But

00:16:36.350 --> 00:16:38.730
when a tree actually sits down to make a split,

00:16:39.330 --> 00:16:42.289
it calculates the perfectly optimal mathematically

00:16:42.289 --> 00:16:46.889
best cut point for that specific feature. ExtraTrees

00:16:46.889 --> 00:16:49.330
throws that optimization completely out the window.

00:16:49.649 --> 00:16:52.370
It abandons it entirely. ExtraTrees doesn't even

00:16:52.370 --> 00:16:54.549
bother with bootstrap bagging. It uses the whole

00:16:54.549 --> 00:16:56.830
learning sample. But for the feature splits,

00:16:57.029 --> 00:16:59.169
it picks completely random cut points. Wait,

00:16:59.169 --> 00:17:01.070
just totally random? Totally random. It just

00:17:01.070 --> 00:17:03.629
blindly guesses a few random split values for

00:17:03.629 --> 00:17:05.970
a feature, tests them, and picks the best of

00:17:05.970 --> 00:17:08.210
those random guesses, rather than calculating

00:17:08.210 --> 00:17:10.869
the true optimal split. It's so ironic. You add

00:17:10.869 --> 00:17:13.410
even more noise. You literally strip away the

00:17:13.410 --> 00:17:15.230
algorithm's ability to find the mathematically

00:17:15.230 --> 00:17:18.269
perfect split. And it still yields highly competitive,

00:17:18.549 --> 00:17:21.190
incredible results. It proves just how much heavy

00:17:21.190 --> 00:17:23.230
lifting the ensemble averaging is actually doing.
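
In scikit-learn that extremely randomized variant is available as ExtraTreesClassifier; a hedged side-by-side sketch on synthetic data (settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

# Random forest: bootstrap the rows, then search for the best cut point per candidate feature.
rf = RandomForestClassifier(n_estimators=300, random_state=0)

# Extra trees: the whole learning sample (bootstrap=False by default) and random cut points.
et = ExtraTreesClassifier(n_estimators=300, random_state=0)

print("random forest:", cross_val_score(rf, X, y, cv=5).mean())
print("extra trees:  ", cross_val_score(et, X, y, cv=5).mean())
```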

00:17:23.369 --> 00:17:26.380
It really does. Then you have researchers tackling

00:17:26.380 --> 00:17:29.559
the nightmare of high -dimensional data. If you

00:17:29.559 --> 00:17:32.119
feed a random forest thousands of features but

00:17:32.119 --> 00:17:34.019
only three of them are actually informative,

00:17:34.619 --> 00:17:37.380
the forest can drown in the sheer volume of noise.

00:17:37.539 --> 00:17:39.819
Because the random feature subsets will almost

00:17:39.819 --> 00:17:42.700
always be filled with garbage variables. Exactly.

00:17:42.920 --> 00:17:46.099
So how do you fix the math when the noise outnumbers

00:17:46.099 --> 00:17:49.099
the signal? Several ways. There is pre -filtering,

00:17:49.200 --> 00:17:51.099
which is simply using a separate statistical

00:17:51.099 --> 00:17:53.559
test to strip out the garbage variables before

00:17:53.559 --> 00:17:56.440
the forest even sees them. Then you have enriched

00:17:56.440 --> 00:17:59.440
random forests. Think of an enriched random forest

00:17:59.440 --> 00:18:02.220
like a rigged lottery. Instead of every feature

00:18:02.220 --> 00:18:04.519
getting an equal number of ping -pong balls in

00:18:04.519 --> 00:18:07.380
the drawing machine, the algorithm identifies

00:18:07.380 --> 00:18:10.140
the historically smarter features and gives them

00:18:10.140 --> 00:18:11.880
extra balls. That's a great way to picture it.
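
A toy sketch of that rigged lottery, with invented feature names and weights (this is just the sampling step, not the full enriched-random-forest algorithm):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
features = np.array(["humidity", "wind", "pressure", "zip_code", "noise"])

# Hypothetical importance weights from an earlier screening pass:
# the historically smarter features get more balls in the machine.
weights = np.array([0.40, 0.30, 0.20, 0.05, 0.05])

# Candidate features offered to one split, drawn in proportion to those weights.
candidates = rng.choice(features, size=2, replace=False, p=weights)
print(candidates)
```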

00:18:11.880 --> 00:18:14.500
It uses weighted random sampling to increase

00:18:14.500 --> 00:18:16.299
the odds that the good features get selected

00:18:16.299 --> 00:18:19.440
for a split. Right. And similarly, there are

00:18:19.440 --> 00:18:22.019
tree -weighted random forests. Instead of treating

00:18:22.019 --> 00:18:24.460
every tree in the forest as an equal voter, you

00:18:24.460 --> 00:18:26.539
treat it like a corporate boardroom. Where some

00:18:26.539 --> 00:18:29.259
votes matter more. Exactly. The trees that prove

00:18:29.259 --> 00:18:31.450
to be highly accurate on the out-of-bag

00:18:31.450 --> 00:18:34.269
samples get more voting power in the final prediction

00:18:34.269 --> 00:18:38.089
than the mediocre trees. But, you know, if we

00:18:38.089 --> 00:18:40.190
are talking about extreme mathematics, the biggest

00:18:40.190 --> 00:18:43.130
theoretical leap is KeRF, the kernel random forest.

00:18:43.309 --> 00:18:45.009
Now, if you're listening to this and wondering

00:18:45.009 --> 00:18:47.549
why we are suddenly talking about kernel methods

00:18:47.549 --> 00:18:51.369
in a decision tree deep dive, bear with us, because

00:18:51.369 --> 00:18:55.349
this is the bridge between dirty real -world

00:18:55.349 --> 00:18:58.819
data and elegant math. Yes. If we connect this

00:18:58.819 --> 00:19:01.259
to the bigger picture, KeRF isn't just a party

00:19:01.259 --> 00:19:03.460
trick. It's a massive theoretical breakthrough.

00:19:03.700 --> 00:19:06.500
How so? Well, kernel methods are rigorous mathematical

00:19:06.500 --> 00:19:09.000
tools used to map data into infinitely higher

00:19:09.000 --> 00:19:11.380
dimensions to separate it. They are theoretically

00:19:11.380 --> 00:19:14.119
beautiful and allow mathematicians to prove concrete

00:19:14.119 --> 00:19:16.259
guarantees about an algorithm's performance.

00:19:16.299 --> 00:19:18.900
OK. And trees. Decision trees, on the other hand,

00:19:19.000 --> 00:19:22.019
are gritty, discrete, and computationally messy.

00:19:22.420 --> 00:19:24.660
They work practically, but they are hard to analyze

00:19:24.660 --> 00:19:26.809
mathematically. But Leo Breiman noticed a link,

00:19:27.029 --> 00:19:29.369
and a researcher named Scornet formally defined

00:19:29.369 --> 00:19:32.769
it, right? Yes. Scornet modified the definition

00:19:32.769 --> 00:19:35.970
of a random forest slightly, creating the centered

00:19:35.970 --> 00:19:40.250
KeRF and uniform KeRF. By doing this, he proved

00:19:40.250 --> 00:19:42.450
that random forests can actually be rewritten

00:19:42.450 --> 00:19:45.549
mathematically as kernel methods. Wow. And because

00:19:45.549 --> 00:19:48.529
he built that bridge, he was able to mathematically

00:19:48.529 --> 00:19:51.170
prove the upper bounds on their rates of consistency.
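
The bridge rests on writing the forest's prediction with an explicit kernel. A sketch of the usual KeRF form, with notation assumed rather than quoted from the source: the connection function counts how often two points fall in the same cell.

```latex
K_{M,n}(x, z) = \frac{1}{M} \sum_{j=1}^{M} \mathbf{1}\{\, z \in A_n(x, \Theta_j) \,\},
\qquad
\tilde{m}_{M,n}(x) = \frac{\sum_{i=1}^{n} y_i \, K_{M,n}(x, x_i)}
                          {\sum_{i=1}^{n} K_{M,n}(x, x_i)}
```

Here $A_n(x, \Theta_j)$ is the cell (leaf) containing $x$ in the $j$-th randomized tree, so $K_{M,n}$ behaves like a data-adaptive kernel and the forest becomes a kernel-weighted average.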

00:19:51.710 --> 00:19:54.430
Meaning? He could mathematically guarantee that

00:19:54.430 --> 00:19:57.430
given enough data, the forest will actually converge

00:19:57.430 --> 00:19:59.869
on the absolute truth. It isn't just a computational

00:19:59.869 --> 00:20:02.670
hack anymore. It is theoretically sound. Exactly.

00:20:02.890 --> 00:20:05.490
So we've built this incredibly robust, mathematically

00:20:05.490 --> 00:20:08.390
guaranteed, highly adaptable forest. It handles

00:20:08.390 --> 00:20:10.849
noise. It gives us variable importance. It bridges

00:20:10.849 --> 00:20:13.049
massive theoretical gaps in machine learning,

00:20:13.769 --> 00:20:15.950
which brings us to the inevitable catch. There's

00:20:15.950 --> 00:20:18.759
always a catch. If it is this powerful, Why isn't

00:20:18.759 --> 00:20:21.900
it the only algorithm developers ever use? First,

00:20:21.940 --> 00:20:25.339
we return to the massive loss of intrinsic interpretability.

00:20:26.140 --> 00:20:28.400
We talked about variable importance tools earlier,

00:20:28.420 --> 00:20:30.359
but those only tell you what mattered. They do

00:20:30.359 --> 00:20:32.740
not tell you how this specific decision was made.

00:20:33.119 --> 00:20:35.000
Following the path of a single decision tree

00:20:35.000 --> 00:20:37.900
is trivial. Following the intersecting paths

00:20:37.900 --> 00:20:40.859
of 500 deeply randomized trees is physically

00:20:40.859 --> 00:20:44.359
impossible for a human mind. Imagine trying to

00:20:44.359 --> 00:20:47.079
explain to someone why they were denied a bank

00:20:47.079 --> 00:20:49.500
loan. If you use a single decision tree, you

00:20:49.500 --> 00:20:52.119
can say, well, your credit score was below 700,

00:20:52.500 --> 00:20:55.000
and your income was below this specific threshold,

00:20:55.339 --> 00:20:58.319
so the math says the loan is denied. It's a clear,

00:20:58.720 --> 00:21:01.420
understandable why. Which people need. Exactly.

00:21:01.799 --> 00:21:04.000
But if you use a random forest, you basically

00:21:04.000 --> 00:21:05.960
just have to give them the collective shrug of

00:21:05.960 --> 00:21:09.900
500 trees. The forest has spoken. You can't explain

00:21:09.900 --> 00:21:12.799
the exact path, and when end users can't understand

00:21:12.799 --> 00:21:15.420
a model, they lose trust in it entirely. And

00:21:15.420 --> 00:21:17.559
there are genuine performance drawbacks too.

00:21:17.700 --> 00:21:19.819
If your underlying data features are perfectly

00:21:19.819 --> 00:21:22.000
linearly correlated with your target variable,

00:21:22.519 --> 00:21:24.700
a random forest might actually perform worse

00:21:24.700 --> 00:21:27.799
than a basic computationally cheap linear model

00:21:27.799 --> 00:21:30.960
like a multinomial logistic regression or naive

00:21:30.960 --> 00:21:33.859
Bayes. Overkill, essentially. Yeah, and it also

00:21:33.859 --> 00:21:35.579
struggles heavily when dealing with multiple

00:21:35.579 --> 00:21:38.099
categorical variables that have completely different

00:21:38.099 --> 00:21:40.819
numbers of levels. But the source material notes

00:21:40.819 --> 00:21:43.220
that researchers have found a brilliant solution

00:21:43.220 --> 00:21:46.119
to the interpretability problem. If the massive

00:21:46.119 --> 00:21:49.140
forest is too opaque, they use model compression.

00:21:49.500 --> 00:21:52.900
Yes, they transform that massive, impenetrable,

00:21:52.900 --> 00:21:56.180
random forest back into a minimal, single decision

00:21:56.180 --> 00:21:58.640
tree. They call it a born -again tree. Wait,

00:21:58.839 --> 00:22:01.460
how does a single tree replicate a whole forest

00:22:01.460 --> 00:22:03.359
without just reverting to the overfitting problem

00:22:03.359 --> 00:22:06.140
we started with? Because the born -again tree

00:22:06.140 --> 00:22:08.880
isn't trained on the original noisy data. It

00:22:08.880 --> 00:22:12.039
is trained to perfectly reproduce the exact decision

00:22:12.039 --> 00:22:14.880
function of the massive ensemble. Oh! It learns

00:22:14.880 --> 00:22:17.740
the smooth, variance -reduced logic of the forest,

00:22:18.180 --> 00:22:21.519
but compresses that complex logic into one human

00:22:21.519 --> 00:22:24.019
-readable tree structure. That is the ultimate

00:22:24.019 --> 00:22:25.839
having -your -cake -and -eating -it -too scenario

00:22:25.839 --> 00:22:28.460
for a data scientist. You retain the predictive

00:22:28.460 --> 00:22:30.640
power and get all the mathematical variance reduction

00:22:30.640 --> 00:22:33.279
of the massive random forest, but you hand the

00:22:33.279 --> 00:22:36.099
end user a perfectly interpretable single born

00:22:36.099 --> 00:22:38.880
-again tree. It is a profoundly elegant full

00:22:38.880 --> 00:22:41.259
circle moment for the algorithm. It really is.
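
A hedged sketch of that compression idea in scikit-learn terms: fit a single shallow tree to the forest's own predictions instead of the raw labels. This is a simple distillation stand-in, not the exact born-again construction from the literature, and every setting below is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Teach one readable tree to mimic the forest's smoothed decision function,
# not the original noisy labels.
y_forest = forest.predict(X)
born_again = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y_forest)

print("agreement with the forest:", born_again.score(X, y_forest))
```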

00:22:41.579 --> 00:22:44.720
Yeah. So to recap our journey today, we started

00:22:44.720 --> 00:22:48.200
with a single overfitting tree. Our student memorizing

00:22:48.200 --> 00:22:51.059
the noise of the test. We multiplied it. We injected

00:22:51.059 --> 00:22:53.539
intentional randomness through bootstrap bagging

00:22:53.539 --> 00:22:56.220
and feature subsets to de -correlate the errors,

00:22:56.440 --> 00:22:58.440
making the crowd significantly wiser than the

00:22:58.440 --> 00:23:01.450
individual. We peeked inside the black box using

00:23:01.450 --> 00:23:03.950
permutations and impurity metrics to find which

00:23:03.950 --> 00:23:07.029
variables mattered most. We explored its hidden,

00:23:07.190 --> 00:23:09.269
shape -shifting connections to k -nearest neighbors

00:23:09.269 --> 00:23:12.569
and kernel math, and finally, we compressed all

00:23:12.569 --> 00:23:14.930
that messy collective wisdom back down into a

00:23:14.930 --> 00:23:17.880
single, understandable born -again tree. It is

00:23:17.880 --> 00:23:20.700
a master class in how embracing a little controlled

00:23:20.700 --> 00:23:23.220
chaos can actually lead to much clearer, more

00:23:23.220 --> 00:23:25.480
robust answers. Absolutely. And it leaves me

00:23:25.480 --> 00:23:27.119
with one final thought to chew on. We've spent

00:23:27.119 --> 00:23:29.759
this time seeing how we can take the messy, deeply

00:23:29.759 --> 00:23:32.839
randomized, noisy collective wisdom of a massive

00:23:32.839 --> 00:23:35.920
random forest and mathematically distill it into

00:23:35.920 --> 00:23:38.759
a single, perfectly interpretable born again

00:23:38.759 --> 00:23:40.940
rule book. Could we ever do the same with human

00:23:40.940 --> 00:23:43.990
systems? Imagine if the chaotic, decentralized,

00:23:44.230 --> 00:23:46.569
incredibly noisy decisions of a whole society

00:23:46.569 --> 00:23:49.230
could somehow be mathematically distilled into

00:23:49.230 --> 00:23:51.829
a single perfectly logical set of understandable

00:23:51.829 --> 00:23:54.269
rules. It makes you wonder if our own chaotic

00:23:54.269 --> 00:23:56.690
processes are just waiting for the right mathematical

00:23:56.690 --> 00:23:59.329
algorithm to compress them into clarity. Thank

00:23:59.329 --> 00:24:01.390
you for joining us on this deep dive into the

00:24:01.390 --> 00:24:02.890
source material. Stay curious.
