WEBVTT

00:00:00.000 --> 00:00:03.430
So, there is this invisible engine, right? It's

00:00:03.430 --> 00:00:05.530
basically powering your web searches. It's driving

00:00:05.530 --> 00:00:08.509
these massive scientific discoveries, and it's

00:00:08.509 --> 00:00:10.890
making some of the most complex data predictions

00:00:10.890 --> 00:00:12.810
in the world possible. Yeah, it's everywhere.

00:00:12.990 --> 00:00:15.550
It really is. It's a machine learning technique

00:00:15.550 --> 00:00:18.750
called gradient boosting. And today, we are going

00:00:18.750 --> 00:00:20.769
to, well, we're going to tear it apart to see

00:00:20.769 --> 00:00:23.210
exactly how it works. Definitely. We are relying

00:00:23.210 --> 00:00:26.449
on a remarkably comprehensive Wikipedia article

00:00:26.449 --> 00:00:28.850
on the subject to guide us through this. And

00:00:28.850 --> 00:00:31.820
our mission for this deep dive is to demystify

00:00:31.820 --> 00:00:34.880
the internal mechanics of this algorithm. Explore

00:00:34.880 --> 00:00:37.259
how it relentlessly learns from its own mistakes

00:00:37.259 --> 00:00:40.899
and understand why it consistently outperforms

00:00:40.899 --> 00:00:43.259
other popular methods like random forests out

00:00:43.259 --> 00:00:45.840
there in the real world. Right. And for those

00:00:45.840 --> 00:00:48.780
of you who are visualizing our space today, the

00:00:48.780 --> 00:00:51.640
backdrop behind this isn't just like... a generic

00:00:51.640 --> 00:00:54.960
digital grid. No, not at all. It is this sprawling

00:00:54.960 --> 00:00:58.039
interconnected decision tree. It's glowing with

00:00:58.039 --> 00:01:00.000
thousands of different pathways, you know, nodes

00:01:00.000 --> 00:01:02.520
and terminal leaves. And we chose this because

00:01:02.520 --> 00:01:05.000
today is an exploration of the architecture of

00:01:05.000 --> 00:01:07.620
machine intelligence. Exactly. We are looking

00:01:07.620 --> 00:01:10.859
at the specific mathematical ways we teach machines

00:01:10.859 --> 00:01:14.560
to, well, to perfect their understanding of really

00:01:14.560 --> 00:01:17.670
complex data. Right. And to understand why gradient

00:01:17.670 --> 00:01:20.609
boosting is so universally powerful, we first

00:01:20.609 --> 00:01:22.930
have to understand how it actually builds its

00:01:22.930 --> 00:01:25.590
knowledge base. Because it doesn't do it by attempting

00:01:25.590 --> 00:01:29.129
to be perfect right out of the gate, right? It

00:01:29.129 --> 00:01:33.930
does it by hyper-focusing on its own failures.

00:01:34.209 --> 00:01:36.730
Yeah, that sequential focus on failure is, I

00:01:36.730 --> 00:01:38.849
mean, it's the defining characteristic. Right.

00:01:39.129 --> 00:01:41.549
At its most basic definition, gradient boosting

00:01:41.549 --> 00:01:43.629
is what we call an ensemble technique. So it

00:01:43.629 --> 00:01:46.189
combines what the literature actually calls weak

00:01:46.189 --> 00:01:49.750
learners into a single highly accurate strong

00:01:49.750 --> 00:01:52.150
prediction model. Weak learners. Yeah. And when

00:01:52.150 --> 00:01:54.069
we say weak learners, we are typically talking

00:01:54.069 --> 00:01:57.310
about very simple decision trees. These are models

00:01:57.310 --> 00:01:59.269
that make extremely few assumptions about the

00:01:59.269 --> 00:02:01.730
data. They're super basic. Exactly. On their

00:02:01.730 --> 00:02:03.829
own, their predictive power is barely better

00:02:03.829 --> 00:02:07.230
than just random guessing. OK, let's unpack this,

00:02:07.329 --> 00:02:09.550
because combining weak learners isn't exactly

00:02:09.550 --> 00:02:12.009
a new concept in machine learning. No, it's not.

00:02:12.229 --> 00:02:15.389
But the way gradient boosting combines them is

00:02:15.389 --> 00:02:17.990
totally unique. So I'm trying to visualize this

00:02:17.990 --> 00:02:20.629
for you all listening. Is it like a relay race?

00:02:20.810 --> 00:02:22.750
A relay race. Like, let's say we have a team

00:02:22.750 --> 00:02:25.810
of just average runners, right? And the first

00:02:25.810 --> 00:02:29.610
runner runs their leg, but they fall, say, 10

00:02:29.610 --> 00:02:31.550
meters short of the finish line. Right. They

00:02:31.550 --> 00:02:35.270
mess up. Yeah. Does the second runner take the

00:02:35.270 --> 00:02:38.569
baton and run a full race? Or is their only job

00:02:38.569 --> 00:02:41.810
to run that exact 10 meter gap that the first

00:02:41.810 --> 00:02:44.270
runner missed? Well, it's close to a relay race,

00:02:44.310 --> 00:02:46.370
but we need to tweak the mechanics of that analogy

00:02:46.370 --> 00:02:48.969
just a bit. OK, tweak away. So the second runner

00:02:48.969 --> 00:02:50.949
isn't running on the same track as the first

00:02:50.949 --> 00:02:53.009
runner. They're actually running on a track made

00:02:53.009 --> 00:02:55.330
entirely of the first runner's error. Oh, wow.

00:02:55.590 --> 00:02:58.189
OK. Yeah. In traditional statistics, the difference

00:02:58.189 --> 00:03:00.430
between an actual value and a predicted value

00:03:00.430 --> 00:03:03.430
is just called a residual. Right. But here, we're

00:03:03.430 --> 00:03:05.930
dealing with what they call pseudo residuals.

00:03:06.169 --> 00:03:08.610
The algorithm applies this thing called iterative

00:03:08.610 --> 00:03:10.830
functional gradient descent to find them. OK,

00:03:10.849 --> 00:03:13.490
wait. I want to pause right there. Because functional

00:03:13.490 --> 00:03:16.219
gradient descent sounds incredibly heavy. It

00:03:16.219 --> 00:03:18.580
is a bit of a mouthful. Yeah. And most people

00:03:18.580 --> 00:03:20.580
who are, you know, familiar with basic machine

00:03:20.580 --> 00:03:23.419
learning, they know standard gradient descent.

00:03:23.560 --> 00:03:26.000
Right. The classic hill climbing or hill descending

00:03:26.000 --> 00:03:28.060
thing. Exactly. You're standing on a mountain

00:03:28.060 --> 00:03:31.900
of error and you take steps down the slope by

00:03:31.900 --> 00:03:34.759
tweaking your mathematical weights until you

00:03:34.759 --> 00:03:37.620
reach the bottom, which is like the lowest possible

00:03:37.620 --> 00:03:40.879
error. Yeah. So how is functional gradient descent

00:03:40.879 --> 00:03:43.530
different from that? It's a crucial distinction,

00:03:43.750 --> 00:03:46.610
actually. So standard gradient descent operates

00:03:46.610 --> 00:03:49.210
in what we call parameter space. Parameter space.

00:03:49.349 --> 00:03:53.590
Right. You have a single fixed model, say a neural

00:03:53.590 --> 00:03:56.370
network, with a set number of connections. And

00:03:56.370 --> 00:03:59.270
you're basically just turning the numeric dials,

00:03:59.270 --> 00:04:01.669
the parameters, to lower the error. OK, turning

00:04:01.669 --> 00:04:04.310
dials. Got it. But functional gradient descent

00:04:04.310 --> 00:04:06.830
operates in function space. Function space. Yeah.

00:04:06.969 --> 00:04:09.129
You aren't just turning dials on an existing

00:04:09.129 --> 00:04:12.110
machine. At every step down that mountain, you

00:04:12.110 --> 00:04:15.650
are building an entirely new machine, a new function,

00:04:15.830 --> 00:04:17.889
which in this case is a brand new decision tree,

00:04:18.290 --> 00:04:20.410
and you're adding it to the assembly line. So

00:04:20.410 --> 00:04:22.810
it's building the track as it goes. Exactly.

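To make that concrete, here's a minimal sketch of the loop in Python, assuming squared-error loss, synthetic data, and scikit-learn decision trees as the weak learners; it's an illustration of the idea, not a production implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)

prediction = np.full(len(y), y.mean())     # F_0: the initial constant model
trees = []
for _ in range(100):                       # each round builds a new "machine"
    residuals = y - prediction             # pseudo-residuals for squared loss
    tree = DecisionTreeRegressor(max_leaf_nodes=4).fit(X, residuals)
    trees.append(tree)
    prediction += 0.1 * tree.predict(X)    # small step; shrinkage comes up later

```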
00:04:23.189 --> 00:04:25.949
What's fascinating here is that looking at the

00:04:25.949 --> 00:04:29.029
problem in function space allows the algorithm

00:04:29.029 --> 00:04:32.350
to optimize almost any differentiable loss function.

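In symbols, following the standard formulation: the pseudo-residual for example i at round m is the negative gradient of the loss with respect to the current model's output,

```latex
r_{im} = -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}
```

and for squared-error loss L = (y - F)^2 / 2 this reduces to the ordinary residual y_i - F_{m-1}(x_i).
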
00:04:32.449 --> 00:04:34.790
Wait, meaning as long as the error can be mathematically

00:04:34.790 --> 00:04:37.009
measured, right? And it has a slope we can calculate,

00:04:37.029 --> 00:04:40.550
like a derivative, the algorithm can just figure

00:04:40.550 --> 00:04:43.230
out how to minimize it. Precisely. I mean, a

00:04:43.230 --> 00:04:45.709
common differentiable loss function is mean squared

00:04:45.709 --> 00:04:48.689
error. Right. Where you are just trying to minimize

00:04:48.689 --> 00:04:51.029
the square of the difference between your prediction

00:04:51.029 --> 00:04:54.509
and reality. Right. But because it isn't locked

00:04:54.509 --> 00:04:57.439
into just one way of measuring error, gradient

00:04:57.439 --> 00:05:00.040
boosting is incredibly adaptable. It can do anything.

00:05:00.319 --> 00:05:02.519
Basically, it handles regression, classification,

00:05:02.920 --> 00:05:06.319
ranking algorithms. It's really a universal problem

00:05:06.319 --> 00:05:08.500
solver. Which brings us to a really interesting

00:05:08.500 --> 00:05:10.579
historical pivot. Oh, yeah. Because you don't

00:05:10.579 --> 00:05:13.079
just wake up one day and invent functional gradient

00:05:13.079 --> 00:05:15.480
descent. No, you don't. The source text mentions

00:05:15.480 --> 00:05:17.860
a very specific evolution of this idea. Yeah,

00:05:17.920 --> 00:05:19.660
it was a fascinating collaborative evolution.

00:05:19.779 --> 00:05:21.899
It was occurring right around the turn of the

00:05:21.899 --> 00:05:24.860
millennium. Late 90s, early 2000s. Exactly. It

00:05:24.860 --> 00:05:26.899
started with a statistician named Leo Breiman

00:05:26.899 --> 00:05:32.639
in 1997, I believe. OK. Before Breiman, boosting

00:05:32.639 --> 00:05:35.620
algorithms, things like AdaBoost, they were viewed

00:05:35.620 --> 00:05:38.819
mostly as clever empirical tricks. Just hacks,

00:05:39.040 --> 00:05:41.759
basically. Yeah, just hacks for tweaking weights

00:05:41.759 --> 00:05:45.480
on training data. But Breiman, he had the mathematical

00:05:45.480 --> 00:05:48.379
intuition to look at boosting and realize, wait.

00:05:48.519 --> 00:05:50.899
This is actually a form of optimization. Oh.

00:05:51.079 --> 00:05:52.860
He recognized that it was stepping down a cost

00:05:52.860 --> 00:05:55.279
function. So he saw the underlying physics of

00:05:55.279 --> 00:05:58.060
what was previously just seen as like a neat

00:05:58.060 --> 00:06:01.120
engineering trick. Yes, exactly. He saw the math

00:06:01.120 --> 00:06:03.160
beneath the trick. Then Jerome Friedman took

00:06:03.160 --> 00:06:05.560
Breiman's observation and, well, he basically

00:06:05.560 --> 00:06:08.300
built the engine. He formalized it. Yeah. In

00:06:08.300 --> 00:06:11.720
1999 and 2001, Friedman published papers developing

00:06:11.720 --> 00:06:14.500
explicit regression gradient boosting algorithms.

00:06:14.639 --> 00:06:17.100
He gave it that formal structure. Right. But

00:06:17.100 --> 00:06:19.560
simultaneously, there was a group of researchers,

00:06:20.259 --> 00:06:23.060
Llew Mason, Jonathan Baxter, Peter Bartlett, and

00:06:23.060 --> 00:06:26.300
Marcus Frean. OK. And they were generalizing

00:06:26.300 --> 00:06:28.959
this concept. They were the ones who formally

00:06:28.959 --> 00:06:31.519
introduced the view of boosting as an iterative

00:06:31.519 --> 00:06:34.379
functional gradient descent algorithm. So they

00:06:34.379 --> 00:06:37.620
broadened it. They did. They proved that it wasn't

00:06:37.620 --> 00:06:40.980
just a specific trick for a specific tree, but

00:06:40.980 --> 00:06:43.939
a universal framework, one that could optimize

00:06:43.939 --> 00:06:47.519
any cost function. Okay, so Breiman saw the spark,

00:06:47.839 --> 00:06:49.959
Friedman built the engine, and Mason and his

00:06:49.959 --> 00:06:52.079
colleagues, they basically wrote the physics

00:06:52.079 --> 00:06:54.879
textbook on why the engine could fly. That's

00:06:54.879 --> 00:06:56.779
a great way to put it, yeah. Now that we have

00:06:56.779 --> 00:06:58.800
the history and the theory, I want to look at

00:06:58.800 --> 00:07:00.860
the physical shape of the weak learners inside

00:07:00.860 --> 00:07:03.540
this engine. Sure. The text specifically focuses

00:07:03.540 --> 00:07:06.120
on gradient tree boosting, which uses CARTs,

00:07:06.579 --> 00:07:08.519
right? Yeah. Classification and regression trees.

00:07:09.449 --> 00:07:12.629
If these are weak learners, how big is a typical

00:07:12.629 --> 00:07:15.050
tree, like physically? Well, the physical size

00:07:15.050 --> 00:07:17.350
of the tree is controlled by a parameter designated

00:07:17.350 --> 00:07:20.949
as J. J. Got it. Right. J represents the number

00:07:20.949 --> 00:07:23.449
of terminal nodes, or leaves, at the bottom of

00:07:23.449 --> 00:07:25.550
the tree. So if J equals 2, you basically have

00:07:25.550 --> 00:07:27.230
a single split. We actually call it a decision

00:07:27.230 --> 00:07:29.709
stump. A stump? Yeah, a stump. It asks one question.

00:07:30.189 --> 00:07:33.779
Like, is the user over 18? Yes or no? That's

00:07:33.779 --> 00:07:36.600
it. But I mean, a decision stump seems too weak,

00:07:36.800 --> 00:07:39.600
right? If J equals 2, there is absolutely no

00:07:39.600 --> 00:07:42.019
interaction between variables. Correct. And we

00:07:42.019 --> 00:07:44.779
know variables interact constantly in real data,

00:07:45.139 --> 00:07:47.500
like a user's age interacts with their income,

00:07:47.860 --> 00:07:49.740
which interacts with their geographic location.

00:07:49.920 --> 00:07:51.959
A single split just can't capture that at all.

00:07:52.120 --> 00:07:54.790
No, it can't, which is why J equals 2 is rarely

00:07:54.790 --> 00:07:57.970
used in complex applications. To capture variable

00:07:57.970 --> 00:08:00.550
interaction, you need a larger J. The source

00:08:00.550 --> 00:08:02.649
actually notes that Hastie and his colleagues

00:08:02.649 --> 00:08:05.889
found a J between 4 and 8 usually works well

00:08:05.889 --> 00:08:08.930
for gradient boosting. Wait, really? Why that

00:08:08.930 --> 00:08:11.410
specific range? Why not like a massive tree with

00:08:11.410 --> 00:08:13.810
a J of 100? Because of the math of interactions.

00:08:14.189 --> 00:08:16.649
A tree with eight terminal leaves typically has

00:08:16.649 --> 00:08:19.629
a depth of about three levels. That means a single

00:08:19.629 --> 00:08:22.160
pass down the tree can evaluate three different

00:08:22.160 --> 00:08:25.480
variables together, like age, income, and location.

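A quick way to see that J-to-interaction-depth link, sketched with scikit-learn; the synthetic data and column meanings here are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # stand-ins for age, income, location
y = X[:, 0] * X[:, 1] + X[:, 2]       # a two-way interaction plus a main effect

tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, y)  # J = 8
print(tree.get_depth())  # often around 3: roughly one variable per level

```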
00:08:25.740 --> 00:08:28.160
In real world data, the underlying patterns are

00:08:28.160 --> 00:08:31.920
usually driven by low-order interactions. Three-

00:08:31.920 --> 00:08:34.480
way or four-way interactions capture almost

00:08:34.480 --> 00:08:36.940
all the meaningful signal. Oh, interesting. Yeah,

00:08:37.000 --> 00:08:39.899
the text notes that going over a J of 10 is almost

00:08:39.899 --> 00:08:42.700
never mathematically necessary because real-world

00:08:42.700 --> 00:08:45.200
variables rarely interact in 10-dimensional

00:08:45.200 --> 00:08:47.200
ways that actually matter. So it's capturing

00:08:47.200 --> 00:08:49.580
enough interaction to be useful but remaining

00:08:49.580 --> 00:08:52.059
simple enough to still be a mathematically weak

00:08:52.059 --> 00:08:53.919
learner. Exactly. You don't want it to be too

00:08:53.919 --> 00:08:56.080
smart. But I have to push back on this design

00:08:56.080 --> 00:08:58.159
a little bit. OK, go ahead. We have this algorithm

00:08:58.159 --> 00:09:01.600
that relentlessly hunts down the pseudo residuals,

00:09:01.659 --> 00:09:04.740
the exact errors made by the previous four to

00:09:04.740 --> 00:09:07.620
eight node tree. If we run this process, say,

00:09:07.799 --> 00:09:11.039
1,000 times, adding a new tree to fix the tiny

00:09:11.039 --> 00:09:13.399
gap left by the last one, isn't it just going

00:09:13.399 --> 00:09:16.100
to memorize the training data? Like it feels

00:09:16.100 --> 00:09:18.159
like a student who memorizes all the answers

00:09:18.159 --> 00:09:21.019
to a specific practice test but then completely

00:09:21.019 --> 00:09:23.059
fails the actual exam because they didn't learn

00:09:23.059 --> 00:09:25.779
the underlying concepts. That is without a doubt

00:09:25.779 --> 00:09:28.059
the defining vulnerability of the algorithm.

00:09:28.500 --> 00:09:30.679
You're describing overfitting. Overfitting.

00:09:30.840 --> 00:09:34.159
Gradient boosting by its very nature is just

00:09:34.159 --> 00:09:37.039
a relentless mathematical bloodhound. Right.

00:09:37.340 --> 00:09:40.230
If you let it run unchecked it will perfectly

00:09:40.230 --> 00:09:42.809
fit the training data. Right. Capturing all the

00:09:42.809 --> 00:09:46.190
random noise and anomalies and its ability to

00:09:46.190 --> 00:09:49.590
generalize to unseen real world data will just

00:09:49.590 --> 00:09:51.950
plummet. So how do we stop it? I mean, we see

00:09:51.950 --> 00:09:54.370
learning rates used in almost all deep neural

00:09:54.370 --> 00:09:57.190
networks to stop overfitting, but how does the

00:09:57.190 --> 00:09:59.809
concept of shrinkage interact specifically with

00:09:59.809 --> 00:10:02.490
this sequential tree building process? Good question.

00:10:03.169 --> 00:10:05.090
Shrinkage is basically the gradient boosting

00:10:05.090 --> 00:10:07.549
equivalent of a learning rate. It's usually denoted

00:10:07.549 --> 00:10:11.610
by the parameter ν (nu). In a completely unregularized

00:10:11.610 --> 00:10:14.210
model, the learning rate is 1. That means each

00:10:14.210 --> 00:10:16.789
new tree attempts to correct 100% of the residual

00:10:16.789 --> 00:10:18.909
error left by the previous tree. It goes all

00:10:18.909 --> 00:10:21.690
in. Exactly. Shrinkage reduces that. It scales

00:10:21.690 --> 00:10:23.889
down the contribution of each new tree by a factor,

00:10:24.090 --> 00:10:26.409
which is typically less than 0.1. Here's where

00:10:26.409 --> 00:10:28.629
it gets really interesting, because the math

00:10:28.629 --> 00:10:31.909
suggests that purposefully preventing the algorithm

00:10:31.909 --> 00:10:35.070
from fully fixing the error at each step actually

00:10:35.070 --> 00:10:38.490
makes the final model far more accurate. It does.

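Written out, the shrunken update at each round is

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \qquad 0 < \nu \le 1
```

where h_m is the new tree fitted to the pseudo-residuals and ν is the shrinkage factor, typically below 0.1 as just noted.
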
00:10:38.490 --> 00:10:42.629
A good way to visualize shrinkage is to think about

00:10:42.629 --> 00:10:45.330
adjusting a thermostat in a really drafty house.

00:10:45.490 --> 00:10:49.190
OK. If the room is 10 degrees too cold and your

00:10:49.190 --> 00:10:52.629
thermostat acting with a learning rate of 1 blasts

00:10:52.629 --> 00:10:55.429
the furnace at maximum capacity to correct the

00:10:55.429 --> 00:10:58.690
entire 10 degree error immediately, it's going

00:10:58.690 --> 00:11:00.509
to overshoot. Right. It gets way too hot. The

00:11:00.509 --> 00:11:02.710
room becomes sweltering, then it shuts off, then

00:11:02.710 --> 00:11:04.970
it gets freezing again. It's erratic. So a small

00:11:04.970 --> 00:11:06.970
learning rate is like saying, hey, only correct

00:11:06.970 --> 00:11:09.330
10% of the temperature difference. Yes. Turn

00:11:09.330 --> 00:11:11.980
the heat up by 1 degree. Wait, evaluate the new

00:11:11.980 --> 00:11:14.000
temperature, and then make another one-degree

00:11:14.000 --> 00:11:16.700
adjustment. Exactly. It forces the algorithm

00:11:16.700 --> 00:11:19.940
to take tiny, cautious steps down the error gradient.

00:11:20.159 --> 00:11:22.240
It builds a smoother, more generalized model

00:11:22.240 --> 00:11:24.659
that doesn't overreact to noisy data points.

00:11:24.740 --> 00:11:27.139
That makes total sense. But there is a direct

00:11:27.139 --> 00:11:29.279
mathematical trade-off here. There always is.

00:11:29.559 --> 00:11:31.879
Right. If you lower the learning rate ν,

00:11:31.980 --> 00:11:34.379
you must increase M, which is the total number

00:11:34.379 --> 00:11:36.240
of trees in the ensemble. Oh, because you're

00:11:36.240 --> 00:11:38.940
taking smaller steps. Cautious steps mean you

00:11:38.940 --> 00:11:41.000
need a lot more steps to reach the bottom of

00:11:41.000 --> 00:11:44.139
the mountain. And that dramatically increases

00:11:44.139 --> 00:11:47.159
the computational power and time required to

00:11:47.159 --> 00:11:50.490
train the model. Okay, so... Beyond just slowing

00:11:50.490 --> 00:11:52.950
the learning rate down, how else do we stop the

00:11:52.950 --> 00:11:55.389
model from memorizing that practice test? Well,

00:11:55.409 --> 00:11:57.889
Jerome Friedman introduced a really brilliant

00:11:57.889 --> 00:12:00.990
structural modification. It's called stochastic

00:12:00.990 --> 00:12:04.590
gradient boosting. Stochastic? Yeah. He was inspired

00:12:04.590 --> 00:12:06.929
by Leo Breiman's earlier work on a concept called

00:12:06.929 --> 00:12:11.210
bagging. In standard gradient boosting, every

00:12:11.210 --> 00:12:14.149
single new tree is trained on the entire data

00:12:14.149 --> 00:12:17.450
set. But in stochastic gradient boosting, at

00:12:17.450 --> 00:12:20.409
each iteration, the base learner is fit on a

00:12:20.409 --> 00:12:23.049
random subsample of the training data. A random

00:12:23.049 --> 00:12:24.629
subsample? Yeah, it's drawn without replacement.

00:12:24.830 --> 00:12:26.269
Okay, drawn without replacement. Let's clarify

00:12:26.269 --> 00:12:27.950
that for the listener really quick. Sure. It

00:12:27.950 --> 00:12:30.190
means that if you have a data set of, say, 100

00:12:30.190 --> 00:12:34.169
rows, and you randomly pull row 42 to use in

00:12:34.169 --> 00:12:37.509
your training sample, then row 42 is temporarily

00:12:37.509 --> 00:12:39.789
removed from the pool. You can't draw it again

00:12:39.960 --> 00:12:42.700
for that specific tree. The fraction of the data

00:12:42.700 --> 00:12:46.259
used is defined by the parameter f. And a typical

00:12:46.259 --> 00:12:50.019
value for f is 0.5. Meaning each new tree in

00:12:50.019 --> 00:12:53.059
the sequence is totally blind to half of the

00:12:53.059 --> 00:12:55.759
training data. Yes, exactly. This feels a bit

00:12:55.759 --> 00:12:58.299
like dropout in neural networks but applied to

00:12:58.299 --> 00:13:00.919
the rows of data instead of the neurons. That's

00:13:00.919 --> 00:13:03.919
a solid comparison. It's like a student studying

00:13:03.919 --> 00:13:06.659
for a massive exam by purposefully only looking

00:13:06.659 --> 00:13:09.399
at a random half of their flashcards on any given...

00:13:09.289 --> 00:13:11.850
I like that. By never letting the student see

00:13:11.850 --> 00:13:14.549
the whole deck at once, you force them to learn

00:13:14.549 --> 00:13:17.710
the broad, robust concepts because they just

00:13:17.710 --> 00:13:19.990
can't rely on memorizing the sequential order

00:13:19.990 --> 00:13:22.870
of the deck. That flashcard analogy perfectly

00:13:22.870 --> 00:13:25.899
captures the mechanism. By injecting this randomness,

00:13:26.019 --> 00:13:28.059
the trees become slightly de-correlated from

00:13:28.059 --> 00:13:30.539
one another. Right. No single outlier in the

00:13:30.539 --> 00:13:32.820
data set can heavily influence the entire model

00:13:32.820 --> 00:13:35.240
because the outlier is mathematically invisible

00:13:35.240 --> 00:13:37.879
half the time. Oh, wow. That's so clever. It

00:13:37.879 --> 00:13:40.899
is. And as a secondary benefit, because you are

00:13:40.899 --> 00:13:43.700
only computing the gradient on 50% of the data

00:13:43.700 --> 00:13:46.639
at each step, the algorithm trains much faster

00:13:46.639 --> 00:13:50.240
per iteration. Nice.

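Pulling these knobs together, here's a hedged sketch of how they map onto scikit-learn's GradientBoostingRegressor; the parameter values and the synthetic data are illustrative assumptions, not recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=500,    # M: more trees to offset the small learning rate
    learning_rate=0.05,  # shrinkage factor nu
    subsample=0.5,       # f = 0.5: each tree sees a random half of the rows
    max_leaf_nodes=8,    # J in the 4-8 range discussed earlier
    random_state=0,
).fit(X, y)

```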
00:13:50.240 --> 00:13:52.700
Are there any mathematical penalties applied to the leaves themselves? I

00:13:52.700 --> 00:13:55.379
know in other models we sometimes penalize complexity.

00:13:55.950 --> 00:13:59.269
Yes, there are direct structural and mathematical

00:13:59.269 --> 00:14:02.409
penalties. First off, engineers can set a minimum

00:14:02.409 --> 00:14:04.509
threshold for the number of observations allowed

00:14:04.509 --> 00:14:07.090
in a terminal node. OK. If a tree tries to make

00:14:07.090 --> 00:14:09.389
a split that results in a leaf containing only

00:14:09.389 --> 00:14:12.090
like two or three data points, the algorithm

00:14:12.090 --> 00:14:14.830
just rejects the split. It just says no. Yeah,

00:14:14.830 --> 00:14:17.149
it refuses to make rules for highly isolated

00:14:17.149 --> 00:14:19.429
edge cases. That prevents it from getting too

00:14:19.429 --> 00:14:22.950
hyper-specific. Exactly. And secondly, they apply

00:14:22.950 --> 00:14:25.620
complexity penalties to the math itself, most

00:14:25.620 --> 00:14:28.679
notably an L2 penalty on the leaf weights. Okay,

00:14:28.720 --> 00:14:31.419
what does an L2 penalty actually do to the leaf?

00:14:31.759 --> 00:14:33.779
Well, without getting bogged down in the deep

00:14:33.779 --> 00:14:36.899
algebra. Sure. An L2 penalty essentially adds

00:14:36.899 --> 00:14:39.379
a squared term to the loss function, and this

00:14:39.379 --> 00:14:42.120
term grows larger as the final predicted values

00:14:42.120 --> 00:14:44.019
of the leaves grow larger. So what's the effect

00:14:44.019 --> 00:14:47.000
of that? Functionally, it acts like a gravitational

00:14:47.000 --> 00:14:50.460
pull towards zero. How? It shrinks the extreme

00:14:50.460 --> 00:14:53.320
output values of the individual leaves. Oh, I

00:14:53.320 --> 00:14:55.840
see. So if a specific tree thinks it has found

00:14:55.840 --> 00:14:58.340
this massive definitive pattern and wants to

00:14:58.340 --> 00:15:01.879
output a massive correction, the L2 penalty dampens

00:15:01.879 --> 00:15:04.779
that enthusiasm. It reins it in. Exactly. It

00:15:04.779 --> 00:15:06.679
ensures that no single tree in the ensemble is

00:15:06.679 --> 00:15:09.080
allowed to yell too loudly. They all have to

00:15:09.080 --> 00:15:11.080
whisper their predictions together.

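One concrete form of that pull toward zero is the regularized leaf value used by XGBoost-style implementations, stated here as an illustration of the idea:

```latex
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
```

where I_j is the set of examples landing in leaf j, g_i and h_i are the first and second derivatives of the loss, and a larger L2 strength λ shrinks the leaf's output toward zero.
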
00:15:11.080 --> 00:15:14.620
So what does this all mean? We have this rigorously penalized,

00:15:14.960 --> 00:15:17.860
stochastic, functionally optimized mathematical

00:15:17.860 --> 00:15:21.100
beast. We've totally tamed it. Where is it actually

00:15:21.100 --> 00:15:23.399
operating in the wild? For you listening, where

00:15:23.399 --> 00:15:25.399
do you interact with this? Let's start with how

00:15:25.399 --> 00:15:27.889
you navigate the internet. Commercial web search

00:15:27.889 --> 00:15:30.090
engines and companies like Yahoo and Yandex have

00:15:30.090 --> 00:15:32.590
been very open about this. They rely heavily

00:15:32.590 --> 00:15:35.429
on variants of gradient boosting for their machine-

00:15:35.429 --> 00:15:37.950
learned ranking engines. Search results? Yeah.

00:15:38.250 --> 00:15:40.950
When you type a query, gradient boosting models

00:15:40.950 --> 00:15:43.370
sift through thousands of latent variables to

00:15:43.370 --> 00:15:45.409
determine the exact relevance ranking of the

00:15:45.409 --> 00:15:47.529
results you end up seeing. But it scales far

00:15:47.529 --> 00:15:50.309
beyond just ranking web pages, right? Oh, absolutely.

00:15:50.409 --> 00:15:52.629
It scales to the fundamental physics of the universe.

00:15:52.850 --> 00:15:56.169
Crazy. At the Large Hadron Collider, researchers

00:15:56.169 --> 00:15:59.070
utilized variants called gradient boosting deep

00:15:59.070 --> 00:16:02.110
neural networks. They used these models to successfully

00:16:02.110 --> 00:16:05.230
reproduce the incredibly complex data analyses

00:16:05.230 --> 00:16:07.570
that originally led to the discovery of the Higgs

00:16:07.570 --> 00:16:10.889
boson. OK, so the exact same underlying mathematical

00:16:10.889 --> 00:16:13.710
logic that decides if a user wants to click on

00:16:13.710 --> 00:16:16.529
a recipe for chocolate chip cookies is also being

00:16:16.529 --> 00:16:20.039
used to map the God particle. Exactly. Same math,

00:16:20.159 --> 00:16:22.240
different application. It's also used heavily

00:16:22.240 --> 00:16:25.320
in earth sciences. Oh, really? Yeah. Geologists

00:16:25.320 --> 00:16:27.700
use gradient boosting to evaluate the quality

00:16:27.700 --> 00:16:30.399
of tight sandstone reservoirs deep underground.

00:16:31.120 --> 00:16:33.860
They use it by analyzing these super complex

00:16:33.860 --> 00:16:37.200
seismic and well-log data sets. Wow. I think

00:16:37.200 --> 00:16:39.559
it's important to mention that if you go looking

00:16:39.559 --> 00:16:41.440
for gradient boosting in your own work or in

00:16:41.440 --> 00:16:44.100
a software library, it often operates under a few

00:16:44.100 --> 00:16:46.100
different aliases. It does. You'll rarely just

00:16:46.100 --> 00:16:48.379
see it called gradient boosting in a code repository.

00:16:48.570 --> 00:16:51.250
It's often referred to as GBM, which stands for

00:16:51.250 --> 00:16:55.169
Gradient Boosting Machine, or MART, which is

00:16:55.169 --> 00:16:58.429
Multiple Additive Regression Trees. MART. And

00:16:58.429 --> 00:17:01.070
if you read ecology papers, they often call it

00:17:01.070 --> 00:17:04.589
BRT, Boosted Regression Trees. Lots of acronyms.

00:17:04.869 --> 00:17:08.049
Always. And today, if you are working in data

00:17:08.049 --> 00:17:09.970
science, you will almost certainly use highly

00:17:09.970 --> 00:17:12.650
optimized open source software packages like

00:17:12.650 --> 00:17:16.309
XGBoost or LightGBM.

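For reference, the same recipe through XGBoost's scikit-learn-style wrapper; a sketch, with placeholder values and hypothetical training-data names:

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=500,    # M boosting rounds
    learning_rate=0.05,  # shrinkage nu
    subsample=0.5,       # stochastic row-sampling fraction f
    reg_lambda=1.0,      # the L2 penalty on leaf weights discussed earlier
)
# model.fit(X_train, y_train) on your own data

```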
00:17:16.309 --> 00:17:19.380
Those are the big ones right now. Yeah. If we connect this to the bigger picture,

00:17:19.980 --> 00:17:22.019
one of the primary reasons data scientists lean

00:17:22.019 --> 00:17:24.960
on these specific packages isn't just for prediction.

00:17:25.230 --> 00:17:27.049
What else is it for? It's for something called

00:17:27.049 --> 00:17:29.029
feature importance ranking. Oh, interesting.

00:17:29.170 --> 00:17:31.269
How does that work? Like, if we have an ensemble

00:17:31.269 --> 00:17:34.529
of thousands of tiny, weak trees, how does it

00:17:34.529 --> 00:17:36.869
tell us which features, which variables actually

00:17:36.869 --> 00:17:39.190
matter the most? It uses statistical metrics,

00:17:39.430 --> 00:17:41.849
primarily one called entropy. Entropy? Yeah,

00:17:41.890 --> 00:17:44.569
entropy is a measure of impurity or chaos in

00:17:44.569 --> 00:17:47.049
your data. Every time a tree makes a split, say

00:17:47.049 --> 00:17:49.690
splitting the data by age, it is trying to reduce

00:17:49.690 --> 00:17:51.609
that chaos. Right, organize it a bit better.

00:17:51.849 --> 00:17:55.029
Because a gradient boosting model builds thousands

00:17:55.029 --> 00:17:58.089
of trees, data scientists can look across the

00:17:58.089 --> 00:18:01.750
entire ensemble and mathematically average the

00:18:01.750 --> 00:18:04.049
total drop in entropy caused by each variable.

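In code, that averaged importance is a one-liner to read out. A hedged sketch with scikit-learn, whose trees use an impurity measure in the same spirit (its regression trees default to a variance-style criterion rather than entropy, but the averaging idea is the same):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

for i, score in enumerate(clf.feature_importances_):
    print(f"feature {i}: {score:.3f}")  # higher = larger average impurity drop

```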
00:18:04.309 --> 00:18:06.490
Oh, I see where this is going. Yeah. If splitting

00:18:06.490 --> 00:18:09.029
by location consistently cleans up the data and

00:18:09.029 --> 00:18:11.950
reduces the error across, say, 1,500 different

00:18:11.950 --> 00:18:14.670
trees, the algorithm just flags location as a

00:18:14.670 --> 00:18:17.519
highly important feature. That is invaluable.

00:18:17.720 --> 00:18:19.839
I mean, it's basically a built-in metal detector

00:18:19.839 --> 00:18:23.339
for finding the actual signal in a massive, noisy

00:18:23.339 --> 00:18:25.819
data set. It really is. But, you know, we have

00:18:25.819 --> 00:18:27.440
to address the elephant in the room here. OK,

00:18:27.480 --> 00:18:30.890
let's do it. Superhuman accuracy, feature importance,

00:18:31.329 --> 00:18:33.630
universal adaptability, what is the ultimate

00:18:33.630 --> 00:18:35.690
trade-off? Because there's always a catch. Always

00:18:35.690 --> 00:18:38.930
a catch. The trade-off is a severe, almost total

00:18:38.930 --> 00:18:41.349
loss of intelligibility and interpretability.

00:18:41.849 --> 00:18:44.750
It is the classic black box problem. Meaning,

00:18:45.049 --> 00:18:46.890
we know the answer it gives us is highly accurate,

00:18:46.970 --> 00:18:49.589
but we have absolutely no earthly idea how it

00:18:49.589 --> 00:18:51.750
arrived at that specific conclusion for that

00:18:51.750 --> 00:18:54.390
specific data point. Precisely. I mean, if you

00:18:54.390 --> 00:18:57.339
have a single decision tree, a single weak learner,

00:18:57.339 --> 00:18:59.519
tracing its logic is trivial. You just follow

00:18:59.519 --> 00:19:01.799
the arrows. Right. You start at the root. Yeah.

00:19:02.140 --> 00:19:06.019
Is the user over 18? Yes. Do they live in a city?

00:19:06.380 --> 00:19:09.859
No. The path is inherently human readable. Totally.

00:19:10.039 --> 00:19:12.380
But when you apply gradient boosting, you are

00:19:12.380 --> 00:19:14.799
dealing with hundreds or thousands of trees,

00:19:15.200 --> 00:19:17.819
all sequentially passing pseudo residuals back

00:19:17.819 --> 00:19:20.740
and forth. Just a mess of math. Exactly. All

00:19:20.740 --> 00:19:23.380
making tiny fractional adjustments based on learning

00:19:23.380 --> 00:19:26.920
rates and L2 penalties. It is practically impossible

00:19:26.920 --> 00:19:28.960
for a human brain to trace that mathematical

00:19:28.960 --> 00:19:31.460
web. Yeah, it just can't do it. We lose the ability

00:19:31.460 --> 00:19:35.039
to verify the why behind the what. And additionally,

00:19:35.359 --> 00:19:37.720
maintaining that black box requires an incredibly

00:19:37.720 --> 00:19:39.759
high computational demand. So it's expensive

00:19:39.759 --> 00:19:42.500
and opaque. Yes. And this raises an important

00:19:42.500 --> 00:19:44.900
question. As these models are deployed in high-

00:19:44.900 --> 00:19:47.119
stakes fields like medicine and finance, we

00:19:47.119 --> 00:19:49.039
need to be able to trust their reasoning, not

00:19:49.039 --> 00:19:50.980
just their results. Right. If a model denies

00:19:50.980 --> 00:19:54.470
you a loan, you want to know why. Exactly. So

00:19:54.470 --> 00:19:56.890
can we have both the superhuman accuracy of an

00:19:56.890 --> 00:19:58.950
ensemble and the human readability of a single

00:19:58.950 --> 00:20:01.630
tree? Well, the text actually mentions a really

00:20:01.630 --> 00:20:04.410
fascinating emerging field of research attempting

00:20:04.410 --> 00:20:07.009
to solve exactly that problem. What's that? It's

00:20:07.009 --> 00:20:10.349
a workaround called model compression. Yes. Researchers

00:20:10.349 --> 00:20:12.609
are developing mathematical techniques to take

00:20:12.609 --> 00:20:17.190
a massive impenetrable black box XGBoost model

00:20:17.190 --> 00:20:20.690
and functionally compress it into a single born-

00:20:20.690 --> 00:20:23.490
again decision tree. Born again. I love that

00:20:23.490 --> 00:20:27.130
term. It's descriptive. This new single tree

00:20:27.130 --> 00:20:30.450
is engineered to approximate the exact same decision

00:20:30.450 --> 00:20:33.589
boundaries as the massive ensemble, but it distills

00:20:33.589 --> 00:20:36.210
it back down into a single flowchart. A flowchart

00:20:36.210 --> 00:20:39.250
that a human doctor or like a financial regulator

00:20:39.250 --> 00:20:41.650
can actually read and audit. Exactly.

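A toy version of the idea, stated as an assumption for illustration rather than the published born-again procedure: train one readable tree on the big ensemble's own predictions so the single tree mimics the ensemble's decision surface.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=2000)

teacher = GradientBoostingRegressor(n_estimators=300, random_state=0).fit(X, y)

# Fit the "student" tree to the teacher's outputs, not to y, so the single
# tree approximates the ensemble's decision boundaries.
student = DecisionTreeRegressor(max_depth=4).fit(X, teacher.predict(X))
print(export_text(student))  # a flowchart a human can read and audit

```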
00:20:41.650 --> 00:20:43.750
Best of both worlds. Distilling thousands of failure-

00:20:43.750 --> 00:20:46.490
driven trees into a single born-again path

00:20:46.490 --> 00:20:49.059
of logic. That is incredible. And that actually

00:20:49.059 --> 00:20:50.799
brings us to a final thought I want to leave

00:20:50.799 --> 00:20:53.099
you all with today. OK. One that sort of steps

00:20:53.099 --> 00:20:55.539
away from the pure math and looks at the philosophy

00:20:55.539 --> 00:20:58.440
of how this algorithm actually works. I'm intrigued.

00:20:58.660 --> 00:21:00.579
Well, human education is traditionally built

00:21:00.579 --> 00:21:03.519
on showing students the complete correct picture,

00:21:03.980 --> 00:21:05.960
like memorizing the right facts, learning the

00:21:05.960 --> 00:21:08.500
right formulas from the start. But gradient boosting

00:21:08.500 --> 00:21:11.019
achieves its near perfect understanding of the

00:21:11.019 --> 00:21:14.039
world purely through the lens of failure. It

00:21:14.039 --> 00:21:16.880
learns exclusively by mapping the exact shape

00:21:16.880 --> 00:21:19.500
of what it got wrong over and over again in these

00:21:19.500 --> 00:21:23.059
tiny fractional steps. It makes you wonder. As

00:21:23.059 --> 00:21:24.740
we try to build machines that think more like

00:21:24.740 --> 00:21:27.660
us, maybe the ultimate secret to mastering complex

00:21:27.660 --> 00:21:30.039
problems isn't about aiming for immediate perfection.

00:21:30.140 --> 00:21:32.880
Wow. Maybe true intelligence is just the willingness

00:21:32.880 --> 00:21:35.960
to ruthlessly systematically map your own mistakes.

00:21:36.660 --> 00:21:38.279
Thank you for joining us on this deep dive.
