WEBVTT

00:00:00.000 --> 00:00:05.660
It is March 26, 2026, and every single time your

00:00:05.660 --> 00:00:10.480
spam filter traps a phishing email or a doctor

00:00:11.029 --> 00:00:13.490
predicts a patient's survival rate. Or an AI

00:00:13.490 --> 00:00:15.369
parses your voice command, yeah. Right. They

00:00:15.369 --> 00:00:17.730
are all relying on a piece of math born in the

00:00:17.730 --> 00:00:20.010
1830s. It's pretty wild to think about. It is.

00:00:20.250 --> 00:00:23.149
So welcome to today's deep dive. We are tackling

00:00:23.149 --> 00:00:25.870
logistic regression. And our mission today is

00:00:25.870 --> 00:00:28.910
to, well, basically shortcut you past all the

00:00:28.910 --> 00:00:31.370
dense mathematical jargon to understand this

00:00:31.370 --> 00:00:33.729
hidden engine. The engine that's secretly predicting

00:00:33.729 --> 00:00:36.229
our behavior. Exactly. Because we live in a world

00:00:36.229 --> 00:00:38.770
defined by absolute outcomes, like you pass a

00:00:38.770 --> 00:00:41.579
test or you fail. A transaction is fraudulent or it's totally

00:00:41.579 --> 00:00:43.240
legitimate. Right, we're surrounded by these

00:00:43.240 --> 00:00:46.140
binary zeros and ones. But the standard math

00:00:46.140 --> 00:00:48.600
most of us learned in high school, it fundamentally

00:00:48.600 --> 00:00:50.740
breaks down when you try to predict those absolutes.

00:00:50.820 --> 00:00:52.859
Oh, it completely shatters. I mean, if we want

00:00:52.859 --> 00:00:54.799
to understand why standard linear regression

00:00:54.799 --> 00:00:57.579
fails so spectacularly here, we can just look

00:00:57.579 --> 00:00:59.859
at a classic statistical scenario. OK, lay it on

00:00:59.859 --> 00:01:03.119
me. So imagine 20 students, and they each spend

00:01:03.119 --> 00:01:06.219
anywhere from, say, zero to six hours studying

00:01:06.219 --> 00:01:09.379
for a final exam. Sure. Now, we aren't trying

00:01:09.379 --> 00:01:11.680
to predict their exact numerical grade, like

00:01:11.680 --> 00:01:15.060
a 0 to 100. We only care about the binary outcome.

00:01:15.599 --> 00:01:18.200
They either pass, which is a 1, or they fail,

00:01:18.359 --> 00:01:21.379
which is a 0. And if you try to use standard

00:01:21.379 --> 00:01:24.060
linear regression for that, like if you literally

00:01:24.060 --> 00:01:27.260
take a ruler and draw a straight sloped line

00:01:27.260 --> 00:01:29.579
through a scatter plot of those ones and zeros,

00:01:29.840 --> 00:01:31.680
your line is just going to shoot straight past

00:01:31.680 --> 00:01:34.650
the data. Right. Trying to predict a pass or

00:01:34.650 --> 00:01:37.409
fail with a straight line is like saying a student

00:01:37.409 --> 00:01:40.430
who studies for 10 hours has a 120% chance of

00:01:40.430 --> 00:01:42.510
passing. Yeah. Or someone who doesn't study at

00:01:42.510 --> 00:01:45.250
all has a negative 20% chance. It's mathematical

00:01:45.250 --> 00:01:47.989
nonsense. You can't have a negative probability.

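NOTE
A minimal sketch of this failure, in Python, with invented study-hours data: an ordinary least-squares line fit to ones and zeros happily predicts "probabilities" above 1 and below 0.
import numpy as np
# Toy data (assumed values): hours studied vs. pass (1) / fail (0).
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
slope, intercept = np.polyfit(hours, passed, 1)  # fit a straight line by least squares
print(slope * 10 + intercept)  # "probability" for 10 hours: well above 1
print(slope * 0 + intercept)   # "probability" for 0 hours: below 0
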
00:01:48.150 --> 00:01:50.969
You really can't. A straight line simply doesn't

00:01:50.969 --> 00:01:54.069
understand boundaries. It extends infinitely

00:01:54.069 --> 00:01:57.109
in both directions, you know? Yeah. But probability,

00:01:57.409 --> 00:02:00.230
by its very definition, has to remain strictly

00:02:00.230 --> 00:02:04.099
confined between 0 and 1. You are either 0%

00:02:04.099 --> 00:02:07.319
likely to do something, 100% likely, or somewhere

00:02:07.319 --> 00:02:09.340
within that absolute range. There's no such thing

00:02:09.340 --> 00:02:13.219
as being 120% certain. Exactly. So the straight

00:02:13.219 --> 00:02:15.539
line has to be abandoned. We need a completely

00:02:15.539 --> 00:02:18.020
different geometric shape. Enter the standard

00:02:18.020 --> 00:02:20.960
logistic function. Or the sigmoid function. Right.

00:02:21.080 --> 00:02:23.699
So what does that actually look like? Well, imagine

00:02:23.699 --> 00:02:26.300
taking that rigid straight line on your graph

00:02:26.300 --> 00:02:29.439
and just physically bending the ends. OK. You

00:02:29.439 --> 00:02:32.340
pull the top end down so it flattens out and never

00:02:32.340 --> 00:02:35.139
quite touches the ceiling of 100%, and then you

00:02:35.139 --> 00:02:37.620
pull the bottom end up so it flattens out and

00:02:37.620 --> 00:02:41.060
never drops below 0%. Leaving you with this, like,

00:02:41.060 --> 00:02:44.960
elegant S-shaped curve. Exactly, and this specific

00:02:44.960 --> 00:02:48.620
function can take any real number input from

00:02:48.620 --> 00:02:51.419
negative infinity to positive infinity and gracefully

00:02:51.419 --> 00:02:53.979
squash it down so the output is forever trapped

00:02:53.979 --> 00:02:56.810
between 0 and 1. OK, let's unpack this, because

00:02:56.810 --> 00:02:59.150
getting from a straight infinite line to a bounded

00:02:59.150 --> 00:03:02.250
S-curve requires a bridge. You can't just plug

00:03:02.250 --> 00:03:04.150
standard data into an S-curve and expect it

00:03:04.150 --> 00:03:06.050
to work without some kind of mathematical translation.

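NOTE
A minimal sketch of the squashing behavior, before we get to the logit itself: the standard logistic (sigmoid) function maps any real input into the open interval (0, 1).
import math
def sigmoid(z):
    # 1 / (1 + e^(-z)): the standard logistic function
    return 1.0 / (1.0 + math.exp(-z))
for z in (-100, -2, 0, 2, 100):
    print(z, sigmoid(z))  # outputs approach 0 and 1 but never reach them
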
00:03:06.409 --> 00:03:08.530
And that translation mechanism is a concept called

00:03:08.530 --> 00:03:10.969
the logit, right? The logistic unit, yes. The

00:03:10.969 --> 00:03:13.650
logit basically ties the log odds of an

00:03:13.650 --> 00:03:17.789
event to a clean probability. But before we

00:03:17.789 --> 00:03:19.490
even get to the logarithm part, we really have

00:03:19.490 --> 00:03:22.090
to talk about odds. Yeah, because odds and probability

00:03:22.090 --> 00:03:24.930
are frequently conflated, but they measure entirely

00:03:24.930 --> 00:03:26.750
different relationships. Break that down for

00:03:26.750 --> 00:03:30.189
us. Sure. So probability is the ratio of success

00:03:30.189 --> 00:03:33.150
to the total number of attempts. Like if you

00:03:33.150 --> 00:03:36.210
roll a standard six-sided die, your probability

00:03:36.210 --> 00:03:39.430
of rolling a four is one out of six. Odds, on

00:03:39.430 --> 00:03:41.770
the other hand, are the ratio of success to failure.

00:03:42.509 --> 00:03:45.090
So the odds of rolling that four are one to five.

00:03:45.569 --> 00:03:47.949
One success for every five failures. So if you

00:03:47.949 --> 00:03:50.180
take a coin flip, the probability of getting

00:03:50.180 --> 00:03:54.199
heads is 50%, or 0.5, but the odds are one to

00:03:54.199 --> 00:03:56.780
one. One success for every one failure. Perfect

00:03:56.780 --> 00:03:58.840
example. And if you have an 80% probability

00:03:58.840 --> 00:04:00.960
of passing that exam we talked about, your odds

00:04:00.960 --> 00:04:03.500
are four to one. Four successes for every one

00:04:03.500 --> 00:04:06.280
failure. Now we apply the logarithm. The logit

00:04:06.280 --> 00:04:08.120
function basically takes the natural logarithm

00:04:08.120 --> 00:04:10.199
of those odds. And why do we do that? Because

00:04:10.199 --> 00:04:13.180
odds have an asymmetrical limit. They can go

00:04:13.180 --> 00:04:15.500
all the way up to positive infinity, but they

00:04:15.500 --> 00:04:18.399
can't drop below zero. I mean, you can't have

00:04:18.399 --> 00:04:20.160
negative odds. Right. That wouldn't make sense.

00:04:20.279 --> 00:04:22.720
But when you take the natural logarithm of the

00:04:22.720 --> 00:04:25.639
odds, you magically stretch that scale out in

00:04:25.639 --> 00:04:28.279
both directions. Oh, wow. Yeah. A 1 to 1 odds

00:04:28.279 --> 00:04:31.500
ratio, a 50% probability, becomes exactly 0.

00:04:32.220 --> 00:04:35.220
Anything less than 50% becomes a negative number

00:04:35.220 --> 00:04:38.560
stretching down to negative infinity. And anything

00:04:38.560 --> 00:04:41.579
more than 50% becomes a positive number stretching

00:04:41.579 --> 00:04:44.180
to positive infinity. So we've essentially tricked

00:04:44.180 --> 00:04:47.060
the math. We really have. By calculating the

00:04:47.060 --> 00:04:49.879
log odds, we've created a perfectly infinite

00:04:49.879 --> 00:04:52.959
symmetrical scale that a computer can just run

00:04:52.959 --> 00:04:55.339
a standard linear equation on under the hood.

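NOTE
A minimal sketch of that trick: converting probabilities to odds and then to log odds, showing 0.5 landing exactly at 0 and complementary probabilities landing at mirror-image values.
import math
def log_odds(p):
    # odds = successes : failures; the logit is the natural log of the odds
    return math.log(p / (1 - p))
for p in (0.1, 0.5, 0.8, 0.9):
    print(p, p / (1 - p), log_odds(p))
# 0.5 -> log odds 0; 0.1 and 0.9 -> -2.197 and +2.197, a symmetric infinite scale
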
00:04:55.420 --> 00:04:58.199
Exactly. But the output we actually look at remains

00:04:58.199 --> 00:05:00.620
perfectly squashed into that beautiful S-curve

00:05:00.620 --> 00:05:03.180
of probability. It's brilliant because it allows

00:05:03.180 --> 00:05:06.240
us to use the predictable, easily calculable

00:05:06.240 --> 00:05:09.079
mechanics of linear math while still honoring

00:05:09.079 --> 00:05:11.519
the strict boundaries of the real world. So the

00:05:11.519 --> 00:05:14.160
math gives us a way to compress infinity down

00:05:14.160 --> 00:05:17.300
into a neat percentage. But I mean, if I'm a

00:05:17.300 --> 00:05:19.540
disaster planner or a doctor looking at those

00:05:19.540 --> 00:05:22.459
numbers, how do I actually translate that percentage

00:05:22.459 --> 00:05:24.720
into a concrete decision? Like, what does this

00:05:24.720 --> 00:05:27.120
engine actually spit out? It spits out coefficients.

00:05:27.220 --> 00:05:29.740
OK. In this mathematical context, a coefficient

00:05:29.740 --> 00:05:32.699
is essentially a multiplier or a weight assigned

00:05:32.699 --> 00:05:35.480
to a specific variable. It dictates exactly how

00:05:35.480 --> 00:05:38.899
much influence one piece of data has on the final

00:05:38.899 --> 00:05:41.350
outcome. Can we look at the odds ratio to see

00:05:41.350 --> 00:05:43.829
this in action? Definitely. Let's say we have

00:05:43.829 --> 00:05:46.509
a model analyzing building evacuations during

00:05:46.509 --> 00:05:49.449
a hurricane, and we're working in base 10. The

00:05:49.449 --> 00:05:53.029
model has a variable for number of evacuation

00:05:53.029 --> 00:05:55.709
warnings received. If the coefficient for that

00:05:55.709 --> 00:05:58.069
specific variable is 2, it means that receiving

00:05:58.069 --> 00:06:00.310
just one additional warning increases the odds

00:06:00.310 --> 00:06:03.029
of a person evacuating by a factor of 10 squared.

00:06:03.279 --> 00:06:06.199
10 to the power of 2, so an increase of 100 times.

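NOTE
A sketch of that arithmetic, with one hedge: the example stipulates base 10, but most statistical software reports coefficients in base e, where the odds multiplier would be e raised to the coefficient instead.
coef = 2  # hypothetical weight on "evacuation warnings received"
odds_multiplier = 10 ** coef  # base 10, as stipulated in this example
print(odds_multiplier)  # 100: one extra warning multiplies the odds by 100
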
00:06:06.300 --> 00:06:08.779
An increase in the odds by a factor of 100 purely

00:06:08.779 --> 00:06:11.060
from nudging that one variable up by a single

00:06:11.060 --> 00:06:12.980
unit. Wait, wait. I want to pause here to make

00:06:12.980 --> 00:06:14.720
sure we don't fall into the very trap we just

00:06:14.720 --> 00:06:18.480
discussed. So, multiplying the odds by 100 absolutely

00:06:18.480 --> 00:06:21.680
does not mean the actual probability multiplies

00:06:21.680 --> 00:06:24.339
by 100. Right, exactly. Because if you have a

00:06:24.339 --> 00:06:26.759
10% probability of evacuating, you can't multiply

00:06:26.759 --> 00:06:29.800
that by 100 and have a 1,000% probability. The

00:06:29.800 --> 00:06:32.420
S-curve prevents that. The S-curve is the great

00:06:32.420 --> 00:06:35.600
equalizer here. If you are sitting right in the

00:06:35.600 --> 00:06:38.899
middle of the curve, like, at a 50% probability,

00:06:39.230 --> 00:06:42.009
a massive jump in odds will drastically shoot

00:06:42.009 --> 00:06:45.009
your probability upward. OK. But if you are already

00:06:45.009 --> 00:06:47.670
near the top of that S-curve, say, you already

00:06:47.670 --> 00:06:50.110
have a 95% probability of evacuating because

00:06:50.110 --> 00:06:52.550
you live right on the coast, your odds can still

00:06:52.550 --> 00:06:55.170
multiply by 100. But your actual probability

00:06:55.170 --> 00:06:58.399
might only nudge up to like, 98 or 99 percent.

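NOTE
A quick sanity check of that compression, with illustrative probabilities (the exact endpoint depends on where you start on the curve): the same 100x odds multiplier moves a 50% probability enormously and a 95% probability hardly at all.
def update_probability(p, odds_multiplier):
    new_odds = (p / (1 - p)) * odds_multiplier  # scale the odds
    return new_odds / (1 + new_odds)            # convert back to a probability
print(update_probability(0.50, 100))  # 0.50 -> ~0.990, a huge jump
print(update_probability(0.95, 100))  # 0.95 -> ~0.9995, barely a nudge
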
00:06:58.860 --> 00:07:01.259
Ah, the closer you get to absolute certainty,

00:07:01.740 --> 00:07:03.920
the harder it is to move the needle on the probability

00:07:03.920 --> 00:07:06.779
scale. The S-curve basically compresses those

00:07:06.779 --> 00:07:09.660
massive skyrocketing jumps in odds into tiny

00:07:09.660 --> 00:07:12.259
fractional adjustments at the extreme ends. It

00:07:12.259 --> 00:07:15.879
perfectly mimics diminishing returns. And what's

00:07:15.879 --> 00:07:18.079
fascinating here is how this exact mathematical

00:07:18.079 --> 00:07:21.850
behavior mirrors human decision making across

00:07:21.850 --> 00:07:24.790
wildly different environments. Right. Medical

00:07:24.790 --> 00:07:27.790
professionals rely on this structure to calculate

00:07:27.790 --> 00:07:31.209
TRISS, the Trauma and Injury Severity Score.

00:07:31.449 --> 00:07:33.629
Oh, I've heard of that. Yeah. It takes variables

00:07:33.629 --> 00:07:36.329
like patient age, blood pressure, and respiratory

00:07:36.329 --> 00:07:39.269
rate, assigns coefficients to each, and maps

00:07:39.269 --> 00:07:41.949
them onto the S curve to predict patient mortality.

00:07:42.449 --> 00:07:45.209
It literally gives emergency rooms a concrete

00:07:45.209 --> 00:07:48.720
probability to guide triage. And it scales to

00:07:48.720 --> 00:07:50.680
completely different disciplines without changing

00:07:50.680 --> 00:07:53.399
the underlying math. Yeah. Like political scientists

00:07:53.399 --> 00:07:56.699
use it to model voter behavior. If you are predicting

00:07:56.699 --> 00:07:59.079
whether a Nepalese voter will choose the Nepali

00:07:59.079 --> 00:08:01.360
Congress or the Communist Party of Nepal, you

00:08:01.360 --> 00:08:03.480
just map demographic variables onto the curve

00:08:03.480 --> 00:08:06.160
to find the probability of a vote. That political

00:08:06.160 --> 00:08:08.779
example actually highlights an important expansion

00:08:08.779 --> 00:08:10.720
of the model. We've mostly been talking about

00:08:10.720 --> 00:08:13.540
binary outcomes, pass or fail, live or die. But

00:08:13.540 --> 00:08:15.319
the voter has more than two choices. Right. There

00:08:15.319 --> 00:08:18.019
are multiple parties. Exactly. This is where

00:08:18.019 --> 00:08:20.319
multinomial logistic regression comes into play.

00:08:20.379 --> 00:08:23.600
It handles situations with multiple unranked

00:08:23.600 --> 00:08:26.740
categories. The math calculates the odds of choosing

00:08:26.740 --> 00:08:29.819
one specific party over a baseline reference

00:08:29.819 --> 00:08:31.939
party. Oh, interesting. And if the categories

00:08:31.939 --> 00:08:34.870
do have a strict order, like a patient rating

00:08:34.870 --> 00:08:37.710
their pain on a scale from 1 to 10, statisticians

00:08:37.710 --> 00:08:40.970
use ordinal logistic regression. So the S-curve

00:08:40.970 --> 00:08:43.669
structure can really stretch to fit whatever

00:08:43.669 --> 00:08:47.490
categorical reality we need it to. But this leaves

00:08:47.490 --> 00:08:49.850
a massive mechanical gap in our understanding.

00:08:50.330 --> 00:08:52.990
How so? Well, we know what the S-curve does.

00:08:53.350 --> 00:08:55.350
We know how to read the odds and coefficients

00:08:55.350 --> 00:08:58.009
it produces. But how does a computer actually

00:08:58.009 --> 00:09:00.990
find the perfect S-curve for a messy set of

00:09:00.990 --> 00:09:04.059
raw data? Ah. OK. Well, if we were using linear

00:09:04.059 --> 00:09:06.240
regression, the computer would just minimize

00:09:06.240 --> 00:09:08.179
the squared error. Right. It would literally

00:09:08.179 --> 00:09:10.100
measure the physical distance between the data

00:09:10.100 --> 00:09:12.299
points on the graph and the straight line, and

00:09:12.299 --> 00:09:14.080
adjust the line until that total distance is

00:09:14.080 --> 00:09:16.220
as small as possible. You plug the numbers into

00:09:16.220 --> 00:09:18.139
a closed-form equation, and you get an immediate

00:09:18.139 --> 00:09:20.379
perfect answer. But we don't have a straight

00:09:20.379 --> 00:09:22.879
line. We have an S-curve, and the data points

00:09:22.879 --> 00:09:26.059
are all clustered at the absolute top and absolute

00:09:26.059 --> 00:09:28.769
bottom of the graph. Measuring physical distance

00:09:28.769 --> 00:09:30.850
doesn't really work the same way here. It doesn't.

00:09:30.850 --> 00:09:32.990
It requires an entirely different approach called

00:09:32.990 --> 00:09:36.970
maximum likelihood estimation, or MLE. Instead

00:09:36.970 --> 00:09:39.870
of minimizing physical distance, MLE calculates

00:09:39.870 --> 00:09:42.009
something known as log loss, or cross-entropy.

00:09:42.220 --> 00:09:44.860
Right. And the concept of log loss is often described

00:09:44.860 --> 00:09:47.059
using the word surprisal, which I love. It's

00:09:47.059 --> 00:09:49.440
literally measuring how shocked the mathematical

00:09:49.440 --> 00:09:52.860
model is by reality. Yes. Think back to our students.

00:09:53.000 --> 00:09:55.500
Let's say the computer draws a tentative S-curve.

00:09:55.700 --> 00:09:57.700
And based on that curve, it predicts a student

00:09:57.700 --> 00:10:01.340
who barely studied has a 99% probability of

00:10:01.340 --> 00:10:03.899
failing. OK. But then we look at the actual data,

00:10:04.120 --> 00:10:07.179
and that student passed. The model's surprisal

00:10:07.179 --> 00:10:09.899
approaches infinity. It is incredibly shocked.

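NOTE
A minimal sketch of surprisal: the model's shock at an outcome is the negative log of the probability it assigned to what actually happened.
import math
def surprisal(p_assigned_to_actual_outcome):
    return -math.log(p_assigned_to_actual_outcome)
print(surprisal(0.99))   # ~0.01: the model predicted this; barely surprised
print(surprisal(0.01))   # ~4.6: the model said 99% fail, the student passed
print(surprisal(1e-12))  # ~27.6: surprisal grows without bound as p -> 0
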
00:10:10.159 --> 00:10:12.730
Exactly. The algorithm's sole objective is to

00:10:12.730 --> 00:10:15.730
minimize its own surprise. It wants its predictions

00:10:15.730 --> 00:10:18.230
to align as closely as possible with the actual

00:10:18.230 --> 00:10:21.210
observed event. But because there is no simple

00:10:21.210 --> 00:10:23.470
closed-form equation to solve for the perfect

00:10:23.470 --> 00:10:26.629
S-curve all at once, the computer has to guess

00:10:26.629 --> 00:10:30.240
and check. It uses iterative numerical methods

00:10:30.240 --> 00:10:32.720
to basically find the bottom of the error curve.

00:10:32.980 --> 00:10:35.000
A common analogy for this is Newton's method,

00:10:35.299 --> 00:10:37.559
like imagine you're blindfolded standing on the

00:10:37.559 --> 00:10:39.600
side of a mountain and your goal is to reach

00:10:39.600 --> 00:10:42.080
the lowest point in the valley, the point of

00:10:42.080 --> 00:10:45.019
minimum surprise. You can't see the valley. All

00:10:45.019 --> 00:10:47.019
you can do is feel the slope of the ground right

00:10:47.019 --> 00:10:49.519
beneath your feet. You figure out which direction

00:10:49.519 --> 00:10:52.059
goes down, and you take a step. And then you

00:10:52.059 --> 00:10:54.500
feel the slope again, calculate the new downward

00:10:54.500 --> 00:10:56.879
trajectory, and take another step. The computer

00:10:56.879 --> 00:11:00.080
calculates a tentative curve, measures the surprisal,

00:11:00.519 --> 00:11:02.519
calculates the derivative, which is the slope,

00:11:02.720 --> 00:11:05.220
to find the direction of less surprise, adjusts

00:11:05.220 --> 00:11:08.059
the coefficients slightly, and repeats the process.

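NOTE
A minimal sketch of that loop, using plain gradient descent (a simpler cousin of the Newton-style methods just mentioned) on the invented study-hours data from the earlier sketch.
import numpy as np
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
w, b, lr = 0.0, 0.0, 0.1  # start with a flat, maximally uncertain curve
for step in range(5000):
    p = 1 / (1 + np.exp(-(w * hours + b)))  # the current S-curve's predictions
    grad_w = np.mean((p - passed) * hours)  # slope of the surprise with respect to w
    grad_b = np.mean(p - passed)            # slope of the surprise with respect to b
    w -= lr * grad_w  # take a small step downhill
    b -= lr * grad_b
print(w, b)  # the converged S-curve's coefficient and intercept
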
00:11:08.480 --> 00:11:11.539
Step by step, it slowly crawls down the mountain

00:11:11.539 --> 00:11:14.480
until the ground levels out and it can't minimize

00:11:14.480 --> 00:11:16.659
the surprise any further. And that is when the

00:11:16.659 --> 00:11:20.179
model has converged. Yes. But that blindfolded

00:11:20.179 --> 00:11:22.580
walk down the mountain can go completely wrong

00:11:22.580 --> 00:11:25.460
if the terrain is deceptive. The mathematical

00:11:25.460 --> 00:11:29.080
crawling process can fail. Wait, really? How?

00:11:29.240 --> 00:11:31.820
Well, the model might never converge. It might

00:11:31.820 --> 00:11:34.759
just keep calculating forever. One primary reason

00:11:34.759 --> 00:11:37.610
this happens is a lack of data. There's a general

00:11:37.610 --> 00:11:40.210
heuristic called the rule of 10, which basically

00:11:40.210 --> 00:11:42.789
states you need at least 10 actual events per

00:11:42.789 --> 00:11:45.269
explanatory variable to get stable coefficients.

00:11:46.309 --> 00:11:48.529
If your model has five variables, you need at

00:11:48.529 --> 00:11:51.250
least 50 occurrences of the outcome you are tracking.

00:11:51.649 --> 00:11:53.429
Otherwise, the math doesn't have enough terrain

00:11:53.429 --> 00:11:55.610
to find the valley. And another fatal trap is

00:11:55.610 --> 00:11:57.730
a concept called complete separation, right?

00:11:57.950 --> 00:12:01.009
Yes. Complete separation occurs when your predictors

00:12:01.009 --> 00:12:04.149
perfectly predict the outcome. OK, wait. On the

00:12:04.149 --> 00:12:06.350
surface, perfectly predicting the outcome sounds

00:12:06.350 --> 00:12:08.669
like the ultimate goal. If a student studies for

00:12:08.669 --> 00:12:11.669
exactly six hours, they pass 100% of the time.

00:12:11.990 --> 00:12:14.389
Why is perfection breaking the math? We have

00:12:14.389 --> 00:12:17.490
to remember the algorithm's goal: minimizing

00:12:17.490 --> 00:12:20.929
surprise. If six hours of studying guarantees

00:12:20.929 --> 00:12:23.830
a pass, the only way the model can experience

00:12:23.830 --> 00:12:27.350
absolute zero surprise is to turn the S-curve

00:12:27.350 --> 00:12:30.500
into an infinitely steep vertical wall right

00:12:30.500 --> 00:12:33.000
at the six hour mark. Oh, I see. It wants the

00:12:33.000 --> 00:12:35.360
probability to instantly snap from zero to one.

00:12:35.610 --> 00:12:38.809
But the formula relies on real numbers, and you

00:12:38.809 --> 00:12:41.129
cannot graph a perfectly vertical line with real

00:12:41.129 --> 00:12:43.269
numbers in this function. The algorithm just

00:12:43.269 --> 00:12:44.990
keeps stepping down the mountain, pushing the

00:12:44.990 --> 00:12:46.690
coefficient higher and higher, trying to build

00:12:46.690 --> 00:12:48.950
that vertical wall. It pushes toward infinity,

00:12:49.350 --> 00:12:51.490
the math panics, and the entire model crashes.

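NOTE
A minimal sketch of that failure with invented, perfectly separated data: every steeper curve scores strictly better, so the fitting loop chases the coefficient toward infinity.
import numpy as np
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
passed = np.array([0, 0, 0, 1, 1, 1])  # perfectly separated at 3.5 hours
for w in (1, 10, 100):  # ever-steeper S-curves centered at the split
    p = 1 / (1 + np.exp(-w * (hours - 3.5)))
    p = np.clip(p, 1e-300, 1 - 1e-16)  # keep the logs finite for printing
    loss = -np.mean(passed * np.log(p) + (1 - passed) * np.log(1 - p))
    print(w, loss)  # the loss keeps shrinking; no finite optimum exists
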
00:12:51.710 --> 00:12:53.789
So what does this all mean? Like, why should

00:12:53.789 --> 00:12:56.389
anyone outside of a data science lab care about

00:12:56.389 --> 00:12:59.129
maximum likelihood estimation or the algorithm

00:12:59.129 --> 00:13:01.269
minimizing its surprisal? Because it's everywhere.

00:13:01.490 --> 00:13:03.230
Exactly, because this isn't just a statistical

00:13:03.230 --> 00:13:06.110
quirk. This process of iterative guessing, measuring

00:13:06.110 --> 00:13:08.929
shock, and adjusting coefficients is the fundamental

00:13:08.929 --> 00:13:11.210
mechanism of supervised machine learning. It

00:13:11.210 --> 00:13:13.830
really is. When a tech company announces they

00:13:13.830 --> 00:13:17.909
have trained a new AI to filter your inbox or

00:13:17.909 --> 00:13:20.730
diagnose a lung scan, this is what that training

00:13:20.730 --> 00:13:24.129
looks like under the hood. The AI is iteratively

00:13:24.129 --> 00:13:26.970
adjusting its internal S-curves to minimize

00:13:26.970 --> 00:13:29.370
its surprise when tested against real-world

00:13:29.370 --> 00:13:32.210
training data. It is actively minimizing the

00:13:32.210 --> 00:13:36.049
cross-entropy loss function. This specific mathematical

00:13:36.049 --> 00:13:39.789
process is literally the bedrock upon which modern

00:13:39.789 --> 00:13:41.870
artificial intelligence is built. Which makes

00:13:41.870 --> 00:13:44.049
the timeline of this math almost unbelievable.

00:13:44.309 --> 00:13:46.250
We are talking about the engine of modern AI

00:13:46.250 --> 00:13:49.330
but the origins of this formula predate the American

00:13:49.330 --> 00:13:51.309
Civil War. Yeah, they have absolutely nothing

00:13:51.309 --> 00:13:53.309
to do with computers or machine learning. Right.

00:13:53.470 --> 00:13:55.809
The history of logistic regression is honestly

00:13:55.809 --> 00:13:57.909
a testament to the universality of mathematics.

00:13:58.570 --> 00:14:01.230
It begins in the 1830s with a Belgian mathematician

00:14:01.230 --> 00:14:04.490
named Pierre François Verhulst. And he wasn't

00:14:04.490 --> 00:14:06.970
looking at data sets to predict behavior. He

00:14:06.970 --> 00:14:09.429
was looking at biological population growth.

00:14:09.710 --> 00:14:11.970
Because at the time, the assumption was that

00:14:11.970 --> 00:14:14.629
populations just grew exponentially, right? But

00:14:14.629 --> 00:14:17.250
Verhulst looked at the reality of resources and

00:14:17.250 --> 00:14:19.809
realized populations eventually hit a ceiling.

00:14:20.330 --> 00:14:23.509
Yes. He introduced the concept of carrying capacity.

00:14:24.009 --> 00:14:26.509
A population grows rapidly at first, but as food

00:14:26.509 --> 00:14:29.490
and space become scarce, that growth slows down

00:14:29.490 --> 00:14:31.669
and eventually levels off. And when he modeled

00:14:31.669 --> 00:14:34.429
this constraint mathematically, he produced the

00:14:34.429 --> 00:14:37.720
very first logistic function. He gave the world

00:14:37.720 --> 00:14:40.799
the S curve. He did. And the scientific process

00:14:40.799 --> 00:14:43.159
that followed wasn't a straight line of development.

00:14:43.360 --> 00:14:46.620
It was chaotic, isolated, and incredibly cross

00:14:46.620 --> 00:14:49.440
-disciplinary. It wasn't invented by a tech bro

00:14:49.440 --> 00:14:52.120
in a garage. It was built by chemists, biologists

00:14:52.120 --> 00:14:54.679
and medical researchers over a century. Absolutely.

00:14:55.279 --> 00:14:58.320
In 1883, a chemist named Wilhelm Ostwald was

00:14:58.320 --> 00:15:01.220
studying autocatalysis: situations where a chemical

00:15:01.220 --> 00:15:03.820
reaction creates a product that basically accelerates

00:15:03.820 --> 00:15:06.279
the reaction itself. But eventually the raw materials

00:15:06.279 --> 00:15:08.940
run out. So he independently discovered the exact

00:15:08.940 --> 00:15:11.539
same S-curve to model chemical constraints?

00:15:11.740 --> 00:15:15.539
Yes. And decades later, in 1920, Raymond Pearl

00:15:15.539 --> 00:15:18.159
and Lowell Reed rediscovered it yet again, completely

00:15:18.159 --> 00:15:20.620
unaware of Verhulst, and applied it back to population

00:15:20.620 --> 00:15:22.799
growth. It proves that this mathematical shape

00:15:22.799 --> 00:15:25.179
is an intrinsic property of the natural world,

00:15:25.500 --> 00:15:27.580
governing everything from yeast in a petri dish

00:15:27.580 --> 00:15:30.679
to chemical reactions. It does. But convincing

00:15:30.679 --> 00:15:33.240
the broader statistical community to adopt it

00:15:33.240 --> 00:15:36.769
was a massive hurdle. In the 1930s, the battleground

00:15:36.769 --> 00:15:40.070
was bioassays, testing the lethal potency of

00:15:40.070 --> 00:15:42.649
various drugs and toxins. Okay. The dominant

00:15:42.649 --> 00:15:45.389
model of the era was the probit model, championed

00:15:45.389 --> 00:15:47.250
by heavyweights like Chester Bliss and Ronald

00:15:47.250 --> 00:15:49.889
Fisher. The probit model used a normal distribution

00:15:49.889 --> 00:15:53.570
curve. It worked, but the math was dense and

00:15:53.570 --> 00:15:55.889
computationally heavy. Right. Here's where it

00:15:55.889 --> 00:15:59.730
gets really interesting. In 1944, a researcher

00:15:59.730 --> 00:16:02.549
named Joseph Berkson steps into the fray. He

00:16:02.549 --> 00:16:05.509
takes Verhulst's population growth curve, recognizes

00:16:05.509 --> 00:16:08.090
its mathematical elegance, and coins the term

00:16:08.090 --> 00:16:11.509
logit as a direct, punchy alternative to the probit

00:16:11.509 --> 00:16:13.509
model. Yeah, and he spent the next several decades

00:16:13.509 --> 00:16:15.830
relentlessly arguing that his logit model was

00:16:15.830 --> 00:16:18.330
superior. And Berkson eventually won the war.

00:16:18.590 --> 00:16:21.590
By the 1970s, the logit model reached parity

00:16:21.590 --> 00:16:23.990
with the probit model and then largely surpassed

00:16:23.990 --> 00:16:26.429
it in widespread use. The deciding factor was

00:16:26.429 --> 00:16:28.669
really just computational simplicity. The math

00:16:28.669 --> 00:16:31.250
of the logit model, particularly its use of logarithms

00:16:31.250 --> 00:16:33.929
and odds, was significantly easier for the computers

00:16:33.929 --> 00:16:36.669
of that era to process. But predicting how much

00:16:36.669 --> 00:16:39.529
toxin kills a beetle is a long way from predicting

00:16:39.529 --> 00:16:43.440
human behavior. How did Berkson's bioassay math

00:16:43.440 --> 00:16:45.980
become the foundation for predicting voter choice

00:16:45.980 --> 00:16:48.899
and consumer habits? That leap occurred in 1973,

00:16:49.120 --> 00:16:51.240
driven by an economist named Daniel McFadden.

00:16:51.259 --> 00:16:54.200
Okay. He achieved a massive theoretical breakthrough

00:16:54.200 --> 00:16:58.440
by linking multinomial logistic regression directly

00:16:58.440 --> 00:17:02.700
to the economic theory of discrete choice. McFadden

00:17:02.700 --> 00:17:05.119
basically proved that you could use this exact

00:17:05.119 --> 00:17:09.019
S -curve math to model how a rational human actor

00:17:09.019 --> 00:17:11.319
chooses the option that provides them with the

00:17:11.319 --> 00:17:13.640
greatest utility. Oh, wow. Yeah. He moved the

00:17:13.640 --> 00:17:16.099
math from measuring physical limits like carrying

00:17:16.099 --> 00:17:19.039
capacity to measuring abstract human preference.

00:17:19.619 --> 00:17:21.940
That foundation basically exploded the use of

00:17:21.940 --> 00:17:24.400
logistic regression across the social sciences,

00:17:24.799 --> 00:17:27.319
economics, and marketing. From yeast, to chemicals,

00:17:27.619 --> 00:17:29.819
to beetles, to the stock market. Exactly. And

00:17:29.819 --> 00:17:31.599
if we connect this to the bigger picture, the

00:17:31.599 --> 00:17:33.559
convergence becomes even more profound. How so?

00:17:33.839 --> 00:17:36.559
The formula Berkson was fighting for in the 1940s,

00:17:37.000 --> 00:17:39.980
the equation that maps inputs through a sigmoid

00:17:39.980 --> 00:17:42.200
function to generate a probability between 0

00:17:42.200 --> 00:17:45.799
and 1, is functionally identical to a single

00:17:45.799 --> 00:17:48.700
layer perceptron in a modern artificial neural

00:17:48.700 --> 00:17:51.829
network. That is wild. The exact same mathematical

00:17:51.829 --> 00:17:54.250
logic that predicted how a population of animals

00:17:54.250 --> 00:17:56.789
reaches its physical limit is currently computing

00:17:56.789 --> 00:17:59.549
the very first layer of an artificial intelligence's

00:17:59.549 --> 00:18:01.490
thought process. A single layer neural network

00:18:01.490 --> 00:18:04.410
computes its output using that S-curve. And

00:18:04.410 --> 00:18:06.670
crucially, the mathematical derivative of the

00:18:06.670 --> 00:18:08.910
sigmoid function is incredibly easy for a computer

00:18:08.910 --> 00:18:11.549
to calculate. Right. That ease of calculation

00:18:11.549 --> 00:18:14.089
is what makes the backpropagation process possible.

00:18:14.400 --> 00:18:16.839
It is what allows modern neural networks to adjust

00:18:16.839 --> 00:18:18.940
their weights, step down the mountain, and actually

00:18:18.940 --> 00:18:20.839
learn. Okay, let's bring you back to the surface.

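NOTE
A minimal sketch of the derivative property just mentioned, sigma'(z) = sigma(z) * (1 - sigma(z)), checked against a numerical slope. This cheap identity is part of what keeps backpropagation tractable.
import math
def sigmoid(z):
    return 1 / (1 + math.exp(-z))
z, h = 0.7, 1e-6
analytic = sigmoid(z) * (1 - sigmoid(z))               # the closed-form derivative
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # finite-difference check
print(analytic, numeric)  # the two agree to many decimal places
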
00:18:21.319 --> 00:18:23.579
We have covered an immense expanse of theory

00:18:23.579 --> 00:18:26.039
today. We started by dismantling the straight,

00:18:26.140 --> 00:18:28.759
rigid line of linear regression and learned why

00:18:28.759 --> 00:18:31.940
mapping binary absolutes requires bending reality

00:18:31.940 --> 00:18:35.019
into an elegant S-curve. We broke down the mechanics

00:18:35.019 --> 00:18:38.099
of the logit, exploring how converting probabilities

00:18:38.099 --> 00:18:42.059
into log odds allows us to trick the math into

00:18:42.059 --> 00:18:44.339
using linear equations for bounded outcomes.

00:18:44.579 --> 00:18:47.200
Right, and we examined how coefficients act as

00:18:47.200 --> 00:18:50.299
weights, multiplying odds, while the S-curve

00:18:50.299 --> 00:18:52.980
naturally compresses those massive jumps as we

00:18:52.980 --> 00:18:55.299
approach absolute certainty. We looked under

00:18:55.299 --> 00:18:57.980
the hood of machine learning to see maximum likelihood

00:18:57.980 --> 00:19:01.180
estimation in action, watching a computer blindly

00:19:01.180 --> 00:19:03.619
step down a mountain to minimize its surprisal.

00:19:03.900 --> 00:19:06.339
We explored why perfect predictions break the

00:19:06.339 --> 00:19:09.180
model by demanding infinitely steep walls. And

00:19:09.180 --> 00:19:12.519
we traced a chaotic 200 -year history watching

00:19:12.519 --> 00:19:14.859
a simple equation for population constraints

00:19:14.859 --> 00:19:17.640
evolve into the discrete choice models of modern

00:19:17.640 --> 00:19:20.200
economics and the neural pathways of artificial

00:19:20.200 --> 00:19:22.799
intelligence. It really is the hidden architecture

00:19:22.799 --> 00:19:24.700
of decision-making. It truly is. Which brings

00:19:24.700 --> 00:19:27.299
me to a final, slightly mind-bending thought

00:19:27.299 --> 00:19:28.839
from the research to leave you with. Oh, this

00:19:28.839 --> 00:19:31.349
is a good one. There's a specific mathematical

00:19:31.349 --> 00:19:33.470
interpretation of logistic regression called

00:19:33.470 --> 00:19:36.329
the latent variable model. This interpretation

00:19:36.329 --> 00:19:38.930
assumes that there isn't actually a hard binary

00:19:38.930 --> 00:19:42.049
in nature. Right. It assumes there is an unobserved

00:19:42.049 --> 00:19:44.930
hidden continuous variable, a hidden spectrum

00:19:44.930 --> 00:19:48.009
of utility or desire or intent paired with the

00:19:48.009 --> 00:19:50.930
random noise of the universe. Under this theory,

00:19:51.589 --> 00:19:54.500
the binary outcome, the 1 or the 0, the yes or

00:19:54.500 --> 00:19:57.220
the no, is simply an indicator of whether that

00:19:57.220 --> 00:19:59.579
hidden internal variable has finally crossed

00:19:59.579 --> 00:20:02.160
a specific, invisible threshold. Think about

00:20:02.160 --> 00:20:04.400
the last absolute choice you made today, yes

00:20:04.400 --> 00:20:07.059
or no, buy or don't buy, click or don't click.

00:20:07.579 --> 00:20:10.460
We view our daily choices as definitive absolutes.

00:20:11.180 --> 00:20:14.000
But if the latent variable model is right, underneath,

00:20:14.180 --> 00:20:17.019
your simple yes or no is a hidden, swirling,

00:20:17.019 --> 00:20:19.960
continuous spectrum of utility, constantly battling

00:20:19.960 --> 00:20:22.140
against random noise. It's a crazy thought. It

00:20:22.140 --> 00:20:24.500
is. Next time you make a definitive choice, ask

00:20:24.500 --> 00:20:26.920
yourself, what does your hidden continuous variable

00:20:26.920 --> 00:20:29.019
look like right now? And how much of your final

00:20:29.019 --> 00:20:31.400
decision was just the math of the standard logistic

00:20:31.400 --> 00:20:31.920
distribution?

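NOTE
A minimal sketch of the latent variable reading, with a hypothetical hidden utility: add logistic-distributed noise to the utility, count threshold crossings, and the share of "yes" outcomes matches the sigmoid probability.
import math, random
def sigmoid(z):
    return 1 / (1 + math.exp(-z))
utility, N, crossings = 0.8, 100_000, 0  # hypothetical hidden mean utility
for _ in range(N):
    u = random.random()
    eps = math.log((u + 1e-12) / (1 - u + 1e-12))  # logistic noise via inverse CDF
    crossings += (utility + eps) > 0  # "yes" iff hidden utility clears the threshold
print(crossings / N)     # simulated share of yes decisions, roughly 0.69
print(sigmoid(utility))  # the logistic regression probability: ~0.69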