WEBVTT

00:00:00.000 --> 00:00:04.179
Humans have this almost obsessive biological

00:00:04.179 --> 00:00:06.679
need to draw lines. Oh, absolutely. Right, like

00:00:06.679 --> 00:00:08.919
we want to put literally everything we encounter

00:00:08.919 --> 00:00:12.080
into these neat distinct little categories just

00:00:12.080 --> 00:00:14.220
to survive. Yeah, it's how we make sense of the

00:00:14.220 --> 00:00:17.760
world. Exactly. Like this berry is safe, but

00:00:17.760 --> 00:00:21.920
that berry over there is poisonous. Or this email

00:00:21.920 --> 00:00:24.079
is an important message from my boss, and this

00:00:24.079 --> 00:00:26.519
one is just spam. Right. We're constantly drawing

00:00:26.519 --> 00:00:28.739
boundaries to filter out the chaos. We really

00:00:28.739 --> 00:00:32.929
are. Welcome to today's custom tailored deep

00:00:32.929 --> 00:00:35.329
dive. I am so glad you're joining us for this

00:00:35.329 --> 00:00:37.869
one. Me too. This is a great topic. Yeah, because

00:00:37.869 --> 00:00:41.229
today our mission is to unpack a really comprehensive

00:00:41.229 --> 00:00:43.929
Wikipedia article on something called support

00:00:43.929 --> 00:00:46.829
vector machines or SVMs. Which, I mean, if you're

00:00:46.829 --> 00:00:48.909
into statistical learning theory, this is the

00:00:48.909 --> 00:00:51.329
holy grail. It really is. And the goal here for

00:00:51.329 --> 00:00:52.929
you, the listener, is to kind of cut through

00:00:52.929 --> 00:00:55.689
all that dense, heavy math and extract those

00:00:55.689 --> 00:00:59.369
intuitive, those aha moments about how machines

00:00:59.369 --> 00:01:01.689
actually learn to categorize the world. Right,

00:01:01.689 --> 00:01:04.469
to make these advanced concepts totally accessible

00:01:04.469 --> 00:01:07.569
without losing, you know, the fascinating nuances.

00:01:07.569 --> 00:01:11.109
Exactly. Because, sure, human intuition is great

00:01:11.109 --> 00:01:13.430
at drawing lines, but when you ask a literal

00:01:13.430 --> 00:01:16.370
computer to draw that same boundary. Yeah, human

00:01:16.370 --> 00:01:18.609
intuition doesn't exactly translate into code.

00:01:18.609 --> 00:01:21.310
It totally fails. Yeah, you have to translate

00:01:21.310 --> 00:01:24.730
the very abstract concept of drawing a line into

00:01:24.730 --> 00:01:27.790
pure rigid mathematics. Right. OK, let's unpack

00:01:27.790 --> 00:01:31.049
this. We have to go back to 1964. Ah, the golden

00:01:31.049 --> 00:01:33.909
era of statistical learning theory. Right. Vladimir Vapnik and

00:01:33.909 --> 00:01:36.549
Alexey Chervonenkis, working in Moscow at the time.

00:01:37.230 --> 00:01:39.569
They basically introduced a mathematical framework

00:01:39.569 --> 00:01:42.870
that revolutionized how computers do this. They

00:01:42.870 --> 00:01:45.109
created support vector machines. And honestly,

00:01:45.310 --> 00:01:47.409
it remains one of the most elegantly constructed

00:01:47.409 --> 00:01:49.689
models in the history of machine learning. It

00:01:49.689 --> 00:01:51.989
really is the bedrock. But since you're listening

00:01:51.989 --> 00:01:54.129
to this deep dive, you already know the basics

00:01:54.129 --> 00:01:55.909
of machine learning. You know what an algorithm

00:01:55.909 --> 00:01:58.090
is. Yeah, we don't need to do a beginner's tutorial

00:01:58.090 --> 00:02:00.510
here. No, we are bypassing the elementary stuff.

00:02:00.709 --> 00:02:03.109
We are dissecting the mechanical brilliance of

00:02:03.109 --> 00:02:05.650
SVMs. The fun stuff. Right, so before we get

00:02:05.650 --> 00:02:08.590
into the wild, non-linear warping stuff later,

00:02:08.909 --> 00:02:10.849
we have to start with the foundational geometry.

00:02:11.430 --> 00:02:13.789
Because the goal of any linear classifier is

00:02:13.789 --> 00:02:15.770
just to separate two different classes of data.

00:02:16.370 --> 00:02:19.009
Correct. So, like, imagine you plot your data

00:02:19.009 --> 00:02:22.250
points on a table. Let's say red marbles and

00:02:22.250 --> 00:02:24.870
blue marbles scattered on a table. Perfect. A

00:02:24.870 --> 00:02:27.610
standard linear classifier just finds, well,

00:02:27.789 --> 00:02:29.949
any straight line that successfully divides the

00:02:29.949 --> 00:02:32.430
red marbles from the blue ones. Just a thin graphite

00:02:32.430 --> 00:02:34.370
pencil line drawn right between them. Exactly.

00:02:35.009 --> 00:02:37.590
But an SVM doesn't just draw any line. Right. It's

00:02:37.590 --> 00:02:39.870
looking for the, what is it, the maximum margin

00:02:39.870 --> 00:02:43.439
hyperplane. Yes. That margin is the whole defining

00:02:43.439 --> 00:02:45.659
characteristic of the algorithm. It's not just

00:02:45.659 --> 00:02:47.919
drawing a boundary, it's pushing that boundary

00:02:47.919 --> 00:02:51.520
as far away from the closest marbles of both

00:02:51.520 --> 00:02:54.280
colors as mathematically possible. So instead

00:02:54.280 --> 00:02:57.060
of a thin pencil line, it's like the SVM is trying

00:02:57.060 --> 00:02:59.639
to slide the absolute thickest possible plank

00:02:59.639 --> 00:03:02.520
of wood between the red and blue marbles. I love

00:03:02.520 --> 00:03:06.379
that analogy, yes. A thick wooden plank. And

00:03:06.379 --> 00:03:09.080
so the specific marbles that end up resting right

00:03:09.080 --> 00:03:11.270
up against the edges of that plank. The ones

00:03:11.270 --> 00:03:13.689
touching the wood. Those are your support vectors.

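NOTE
[Editor's note] A minimal code sketch of the "thick plank" idea, not from the episode: it assumes scikit-learn, made-up marble coordinates, and a huge C to approximate a hard margin.
    from sklearn.svm import SVC
    import numpy as np
    # Two separable clusters of "marbles": class 0 (red) and class 1 (blue).
    X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
    y = np.array([0, 0, 0, 1, 1, 1])
    clf = SVC(kernel="linear", C=1e6).fit(X, y)  # huge C ~ hard margin
    # Only the marbles resting against the "plank" define the boundary:
    print(clf.support_vectors_)       # the support vectors
    print(clf.coef_, clf.intercept_)  # the hyperplane w.x + b = 0
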
00:03:13.870 --> 00:03:16.490
Oh, right. Hence the name. Exactly. And the name

00:03:16.490 --> 00:03:20.069
is very literal. Those specific marbles, the

00:03:20.069 --> 00:03:22.469
support vectors, they structurally support the

00:03:22.469 --> 00:03:25.110
entire geometric model. Wait, so what about all

00:03:25.110 --> 00:03:27.490
the other marbles? You could literally remove

00:03:27.490 --> 00:03:29.750
them. If you deleted all the other data points

00:03:29.750 --> 00:03:32.210
in your data set, the ones safely pushed back

00:03:32.210 --> 00:03:34.830
behind the margin, the boundary wouldn't shift

00:03:34.830 --> 00:03:37.629
a single millimeter. Wow. Yeah, those support

00:03:37.629 --> 00:03:39.770
vectors hold the entire mathematical equation

00:03:39.770 --> 00:03:44.060
in place. But, I mean, why go through the immense

00:03:44.060 --> 00:03:46.759
computational trouble of maximizing that margin?

00:03:47.219 --> 00:03:49.780
Like, if a thin pencil line separates the red

00:03:49.780 --> 00:03:52.360
and blue marbles today, why does the machine

00:03:52.360 --> 00:03:54.699
care how thick the boundary is? That comes down

00:03:54.699 --> 00:03:57.300
to generalization error. Okay, meaning what exactly?

00:03:57.360 --> 00:03:59.080
Well, in machine learning, if your algorithm

00:03:59.080 --> 00:04:01.360
just memorizes the training data, it's useless

00:04:01.360 --> 00:04:03.099
when you deploy it in the real world. Right,

00:04:03.360 --> 00:04:06.300
overfitting. Exactly! If it draws this highly

00:04:06.300 --> 00:04:09.020
specific, squiggly pencil line that perfectly

00:04:09.020 --> 00:04:11.539
weaves through your current marbles, it memorized

00:04:11.539 --> 00:04:13.719
the past... but it can't predict the future.

00:04:14.139 --> 00:04:17.600
Because the real world is messy. Tomorrow, someone

00:04:17.600 --> 00:04:19.620
is going to toss new marbles onto the table,

00:04:19.899 --> 00:04:22.180
and they won't land in the exact same spots.

00:04:22.540 --> 00:04:25.399
Precisely. So by maximizing the margin, by using

00:04:25.399 --> 00:04:28.920
that thick wooden plank, the SVM intentionally

00:04:28.920 --> 00:04:31.759
creates this massive mathematical buffer zone.

00:04:31.779 --> 00:04:35.199
Oh, I see. Yeah. So when completely new, unseen

00:04:35.199 --> 00:04:38.019
data arrives tomorrow, that buffer zone ensures

00:04:38.019 --> 00:04:40.379
the model can handle slight deviations without

00:04:40.379 --> 00:04:43.439
totally failing. It trades a tiny bit of initial

00:04:43.439 --> 00:04:45.759
flexibility for a massive increase in future

00:04:45.759 --> 00:04:48.040
stability. It's basically building in a structural

00:04:48.040 --> 00:04:50.399
safety tolerance. Exactly. But okay, hold on.

00:04:50.699 --> 00:04:52.759
That entire concept relies on a mathematically

00:04:52.759 --> 00:04:55.160
perfect world. Yes, it does. It assumes that

00:04:55.160 --> 00:04:57.319
a flawless separation between the two classes

00:04:57.319 --> 00:04:59.980
is even possible. But what happens when a red

00:04:59.980 --> 00:05:02.660
marble actually rolls over into the blue territory?

00:05:02.819 --> 00:05:05.480
Well, that's where the original 1964 algorithm

00:05:05.480 --> 00:05:09.000
had a fatal flaw. If the classes overlap, a rigid

00:05:09.000 --> 00:05:11.560
hard margin just breaks down. It just crashes.

00:05:11.959 --> 00:05:15.100
Basically, yeah. The math cannot compute a perfect

00:05:15.100 --> 00:05:17.560
wooden plank if the marbles are tangled together.

00:05:17.680 --> 00:05:20.620
It locked up the moment it hit real-world chaos.

00:05:20.879 --> 00:05:23.709
So they had to fix it. They did. Decades later,

00:05:24.209 --> 00:05:26.910
actually. In 1995, Corinna Cortes and Vapnik

00:05:26.910 --> 00:05:29.410
introduced the soft margin. The soft margin.

00:05:29.610 --> 00:05:32.129
Yeah, they re-engineered the math to finally

00:05:32.129 --> 00:05:34.610
embrace imperfection. And they did this using

00:05:34.610 --> 00:05:37.310
a specific tuning parameter, right? Yeah. Represented

00:05:37.310 --> 00:05:39.350
as the letter C. That's the one. So let me push

00:05:39.350 --> 00:05:41.009
back on this a bit to make sure I understand.

00:05:41.589 --> 00:05:43.589
Is this parameter basically just a mathematical

00:05:43.589 --> 00:05:45.800
dial for tolerance? That's a great way to put

00:05:45.800 --> 00:05:48.399
it. Yeah. So like we are mathematically trading

00:05:48.399 --> 00:05:52.740
off how wide our street or wooden plank is versus

00:05:52.740 --> 00:05:55.079
how many marbles we tolerate being on the wrong

00:05:55.079 --> 00:05:58.860
side. Exactly. If we set C incredibly high, the

00:05:58.860 --> 00:06:01.819
model acts just like that old rigid hard margin.

00:06:01.879 --> 00:06:04.420
It panics about every single error. Right. It

00:06:04.420 --> 00:06:07.339
heavily penalizes any single marble that crosses

00:06:07.339 --> 00:06:10.120
the boundary. It'll make the margin dangerously

00:06:10.120 --> 00:06:12.560
thin just to get everything perfectly right.

00:06:12.680 --> 00:06:15.939
But if we dial C down... The model relaxes a bit.

00:06:16.399 --> 00:06:18.959
It learns even if a perfect rule is impossible.

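NOTE
[Editor's note] A hedged sketch of the C "tolerance dial" on overlapping data; scikit-learn and the specific values are the editor's illustration, not from the source.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.datasets import make_blobs
    # Overlapping classes: a perfect separation is impossible here.
    X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)
    for C in (1000.0, 0.01):  # rigid (hard-margin-like) vs. relaxed
        clf = SVC(kernel="linear", C=C).fit(X, y)
        # High C: thinner margin, few tolerated violations.
        # Low C: wider margin, a few marbles allowed on the wrong side.
        print(C, clf.n_support_, clf.score(X, y))
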
00:06:19.420 --> 00:06:21.779
Yes. It says, I'll accept a few marbles on the

00:06:21.779 --> 00:06:24.459
wrong side as long as my overall plank stays

00:06:24.459 --> 00:06:27.639
wide and robust. That's so smart. And mathematically,

00:06:27.939 --> 00:06:31.060
SVMs manage this balancing act using a specialized

00:06:31.060 --> 00:06:33.500
penalty function called hinge loss. Hinge loss?

00:06:33.639 --> 00:06:35.620
Yeah. It operates within a broader framework

00:06:35.620 --> 00:06:38.160
called empirical risk minimization. OK. Let's

00:06:38.160 --> 00:06:40.100
dig into the mechanics of hinge loss for a second,

00:06:40.220 --> 00:06:43.399
because this is where the SVM really diverges

00:06:43.399 --> 00:06:46.240
from other famous models, right? Oh, significantly.

00:06:46.540 --> 00:06:49.300
If we connect this to the bigger picture, most

00:06:49.300 --> 00:06:52.360
traditional statistical models, like, say, logistic

00:06:52.360 --> 00:06:55.759
regression, they use what we call log loss. Right.

00:06:56.100 --> 00:06:58.740
They're constantly trying to estimate entire

00:06:58.740 --> 00:07:01.319
probability distributions, so they calculate

00:07:01.319 --> 00:07:04.199
exactly how far away every single data point

00:07:04.199 --> 00:07:06.800
is from the boundary, and they continuously update

00:07:06.800 --> 00:07:08.600
the probability of that point. Which means the

00:07:08.600 --> 00:07:10.939
model is just doing exhausting math for every

00:07:10.939 --> 00:07:13.579
single point in the data set forever. Exactly.

00:07:13.639 --> 00:07:15.379
It's constantly trying to push probabilities

00:07:15.379 --> 00:07:18.620
closer to 100%. It's never truly satisfied. It's

00:07:18.620 --> 00:07:21.199
a huge computational load. But hinge loss is

00:07:21.199 --> 00:07:24.360
different. Very different. Hinge loss is remarkably

00:07:24.360 --> 00:07:26.660
efficient because it actually knows when to stop

00:07:26.660 --> 00:07:29.660
calculating. Think of it like a physical mechanical

00:07:29.660 --> 00:07:32.779
hinge on a heavy door. Okay. The hinge only engages

00:07:32.779 --> 00:07:35.120
and provides resistance when you push the door

00:07:35.120 --> 00:07:38.079
past a specific angle. So if a marble is correctly

00:07:38.079 --> 00:07:40.399
classified and it's sitting safely outside the

00:07:40.399 --> 00:07:44.139
margin, the hinge loss is just zero. Zero. The

00:07:44.139 --> 00:07:47.899
SVM simply ignores it. It totally ignores it. Completely.

00:07:47.899 --> 00:07:50.639
The model does not care if a point is one inch

00:07:50.639 --> 00:07:52.819
past the safe margin or a thousand miles past

00:07:52.819 --> 00:07:55.920
it. The penalty is zero. Wow. Yeah, the hinge loss

00:07:55.920 --> 00:07:58.259
only triggers, it only calculates a mathematical

00:07:58.259 --> 00:08:01.379
penalty, if a point actually breaches the buffer

00:08:01.379 --> 00:08:03.500
zone or crosses completely over to the wrong

00:08:03.500 --> 00:08:06.199
side. That is incredibly elegant. It just converges

00:08:06.199 --> 00:08:08.579
on the absolute simplest geometric boundary it

00:08:08.579 --> 00:08:11.560
needs and throws away all the unnecessary calculations.

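NOTE
[Editor's note] The hinge loss described above is max(0, 1 - y*f(x)) for labels y in {-1, +1}; a tiny sketch from the editor, not from the source.
    import numpy as np
    def hinge_loss(y, f):
        # y: true label in {-1, +1}; f: raw decision value w.x + b.
        return np.maximum(0.0, 1.0 - y * f)
    print(hinge_loss(+1, 2.5))     # 0.0 -- safely past the margin: ignored
    print(hinge_loss(+1, 1000.0))  # 0.0 -- "a thousand miles past" is still zero
    print(hinge_loss(+1, 0.3))     # 0.7 -- inside the buffer zone: penalized
    print(hinge_loss(+1, -1.0))    # 2.0 -- wrong side: penalized harder
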
00:08:11.839 --> 00:08:14.500
Yeah, it's highly efficient. But, okay, even

00:08:14.500 --> 00:08:17.079
with a soft margin tolerating a few errors, we

00:08:17.079 --> 00:08:19.560
are still ultimately talking about drawing flat

00:08:19.560 --> 00:08:22.040
lines and sliding flat planks of wood. We are.

00:08:22.220 --> 00:08:24.699
But what if the geometry of the data is completely

00:08:24.699 --> 00:08:28.220
nonlinear? Like, imagine a data set where the

00:08:28.220 --> 00:08:31.100
red marbles form a tight cluster right in the

00:08:31.100 --> 00:08:33.679
center of the table. And the blue marbles form

00:08:33.679 --> 00:08:36.500
a perfect ring entirely surrounding them. Ah,

00:08:36.580 --> 00:08:39.100
the concentric circles problem. Right. You literally

00:08:39.100 --> 00:08:41.899
cannot draw a straight line through a circle

00:08:41.899 --> 00:08:45.100
to separate the inside from the outside. No plank

00:08:45.100 --> 00:08:47.259
of wood is going to work no matter how soft the

00:08:47.259 --> 00:08:50.120
margin is. You can't. And for decades, linear

00:08:50.120 --> 00:08:52.720
classifiers were practically useless for complex

00:08:52.720 --> 00:08:55.000
data sets because they were trapped in that flat

00:08:55.000 --> 00:08:57.740
2D geometry. Well, here's where it gets really

00:08:57.740 --> 00:09:00.320
interesting. You can't draw a straight line between

00:09:00.320 --> 00:09:03.940
a ring and a cluster on a flat 2D table. But

00:09:03.940 --> 00:09:06.500
imagine if you could go under the table and just

00:09:06.500 --> 00:09:08.600
forcefully punch the center of it from underneath.

00:09:08.759 --> 00:09:10.840
I like where this is going. Right. So the table

00:09:10.840 --> 00:09:13.919
bows upward, the red marbles in the center fly

00:09:13.919 --> 00:09:16.460
up into 3D space toward the ceiling, while the

00:09:16.460 --> 00:09:18.879
blue ring stays lower down near the floor. Yes.

00:09:18.940 --> 00:09:22.139
Suddenly, you can just slide a flat, rigid sheet

00:09:22.139 --> 00:09:24.980
of metal horizontally right between the hovering

00:09:24.980 --> 00:09:27.940
red marbles and the lower blue marbles. Exactly.

00:09:28.080 --> 00:09:31.299
You've solved an impossible 2D problem by forcing

00:09:31.299 --> 00:09:34.240
the data into 3D space. Like a 4D chess move.

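NOTE
[Editor's note] A sketch of the "punch the table" lift, done with an explicit (not kernelized) extra feature z = x^2 + y^2; scikit-learn's make_circles stands in for the ring-and-cluster data. Editor's illustration.
    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC
    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
    # Lift each 2D point into 3D: the inner cluster "flies up".
    Z = np.c_[X, (X ** 2).sum(axis=1)]
    # A flat plane (linear kernel) now separates ring from cluster.
    clf = SVC(kernel="linear").fit(Z, y)
    print(clf.score(Z, y))  # ~1.0 on this easy synthetic data
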
00:09:34.279 --> 00:09:36.960
It really is. You are describing the process

00:09:36.960 --> 00:09:39.360
of mapping data into a high dimensional feature

00:09:39.360 --> 00:09:42.519
space. This was a monumental breakthrough in

00:09:42.519 --> 00:09:46.539
1992 by Bernhard Boser, Isabelle Guyon, and Vladimir Vapnik.

00:09:46.600 --> 00:09:47.980
And this is what they call the kernel trick,

00:09:48.019 --> 00:09:49.919
right? That's the one. But wait, I have to stop

00:09:49.919 --> 00:09:52.700
you there. Mapping thousands of data points into

00:09:52.700 --> 00:09:54.720
higher dimensions sounds like a computational

00:09:54.720 --> 00:09:57.750
nightmare. Like going from 2D to 3D is one thing,

00:09:58.370 --> 00:10:00.590
but what if the math requires mapping the data

00:10:00.590 --> 00:10:04.889
into 40 dimensions? 40,000 dimensions? To find

00:10:04.889 --> 00:10:07.909
a clean separation. Oh, physically calculating

00:10:07.909 --> 00:10:10.570
the exact new coordinates for every single data

00:10:10.570 --> 00:10:13.230
point in a 40,000-dimensional space. That would

00:10:13.230 --> 00:10:15.730
instantly crash practically any standard system.

00:10:15.789 --> 00:10:17.690
Right. The time it would take would be astronomical.

00:10:17.789 --> 00:10:20.230
Totally. But what's fascinating here is how the

00:10:20.230 --> 00:10:22.690
kernel trick bypasses that physical calculation

00:10:22.690 --> 00:10:25.090
entirely. It's a shortcut. A massive one. It

00:10:25.090 --> 00:10:27.649
uses dot products. OK, let's break down exactly

00:10:27.649 --> 00:10:30.929
how a dot product acts as a shortcut here, because

00:10:30.929 --> 00:10:33.820
that's the secret sauce. Right, so a dot product

00:10:33.820 --> 00:10:35.740
is essentially just a mathematical operation

00:10:35.740 --> 00:10:38.000
that measures the similarity between two vectors.

00:10:38.440 --> 00:10:40.960
Basically, how much two data points align with

00:10:40.960 --> 00:10:44.000
each other in space. The brilliance of the SVM

00:10:44.000 --> 00:10:46.139
algorithm is that it can be completely rewritten

00:10:46.139 --> 00:10:49.059
so that it only ever needs to know the dot product

00:10:49.059 --> 00:10:52.139
between pairs of data points. Wait, really? Yeah,

00:10:52.299 --> 00:10:54.860
it literally never needs to know the absolute

00:10:54.860 --> 00:10:57.860
specific coordinates of the points themselves.

00:10:57.960 --> 00:10:59.899
So the algorithm doesn't need to know exactly

00:10:59.899 --> 00:11:03.419
where marble A and marble B are physically located

00:11:03.419 --> 00:11:06.259
in that 40,000-dimensional space. It only needs

00:11:06.259 --> 00:11:08.720
to know the geometric relationship or similarity

00:11:08.720 --> 00:11:11.620
between them. Exactly. And certain mathematical

00:11:11.620 --> 00:11:14.620
functions, which we call kernels, they can calculate

00:11:14.620 --> 00:11:17.519
what that dot product would be in a massive high

00:11:17.519 --> 00:11:20.799
dimensional space using only the original lower

00:11:20.799 --> 00:11:23.879
dimensional data as the input. That is just mathematical

00:11:23.879 --> 00:11:26.179
sleight of hand. That's magic. It really feels

00:11:26.179 --> 00:11:28.820
like it, whether you are using a polynomial kernel

00:11:28.820 --> 00:11:32.120
to map geometric curves or a Gaussian radial

00:11:32.120 --> 00:11:34.299
basis function. Which maps data into infinite

00:11:34.299 --> 00:11:36.919
dimensions. Right. Effectively infinite dimensions,

00:11:37.000 --> 00:11:39.279
just by measuring the distance of every point

00:11:39.279 --> 00:11:42.480
from specific landmarks. The kernel computes

00:11:42.480 --> 00:11:45.019
that similarity mathematically. It never actually

00:11:45.019 --> 00:11:48.000
physically transforms or moves the data points.

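NOTE
[Editor's note] A sketch of a kernel evaluated entirely in the original space. The Gaussian RBF value below equals the inner product the two points would have in an effectively infinite-dimensional feature space, yet no high-dimensional coordinates are ever computed; gamma is an arbitrary choice by the editor.
    import numpy as np
    def rbf_kernel(a, b, gamma=0.5):
        # Similarity of two ORIGINAL low-dimensional points.
        return np.exp(-gamma * np.sum((a - b) ** 2))
    x1 = np.array([1.0, 2.0])
    x2 = np.array([1.5, 1.8])
    print(rbf_kernel(x1, x2))  # one number: all the SVM ever needs
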
00:11:48.179 --> 00:11:50.299
So it operates completely down in the original

00:11:50.299 --> 00:11:53.019
input space, but it reaps all the geometric

00:11:53.019 --> 00:11:55.620
benefits of infinite dimensions. You get all

00:11:55.620 --> 00:11:58.159
the power, none of the computational cost. OK,

00:11:58.159 --> 00:12:00.779
so we have this absolute beast of an

00:12:00.009 --> 00:12:02.769
algorithm. It constructs massive safety margins,

00:12:03.289 --> 00:12:05.970
it handles messy overlapping data with hinge

00:12:05.970 --> 00:12:09.610
loss, and it computationally warps dimensions

00:12:09.610 --> 00:12:12.029
for free. It's a powerhouse. So what does this

00:12:12.029 --> 00:12:14.940
all mean? Practically, where does this specific

00:12:14.940 --> 00:12:17.340
architecture actually thrive out in the real

00:12:17.340 --> 00:12:19.840
world? Well, because SVMs only care about the

00:12:19.840 --> 00:12:21.840
support vectors and completely ignore the rest

00:12:21.840 --> 00:12:24.879
of the correctly classified data, they are incredibly

00:12:24.879 --> 00:12:27.940
resilient in environments with really sparse,

00:12:27.960 --> 00:12:30.259
highly complex data. Like what? For instance,

00:12:30.500 --> 00:12:32.440
text and hypertext categorization. That's a huge

00:12:32.440 --> 00:12:34.980
one. This reduces the need for labeled instances

00:12:34.980 --> 00:12:37.360
dramatically. Right, because text categorization

00:12:37.360 --> 00:12:39.379
usually relies on something like a bag of words

00:12:39.379 --> 00:12:42.139
model, where every single unique word in the

00:12:42.139 --> 00:12:44.250
dictionary represents a different mathematical

00:12:44.250 --> 00:12:46.830
dimension. Exactly. Which means your feature

00:12:46.830 --> 00:12:49.610
space has tens of thousands of dimensions. Most

00:12:49.610 --> 00:12:52.649
algorithms would just drown in that much irrelevant

00:12:52.649 --> 00:12:55.769
data. Right. But an SVM, because it uses hinge

00:12:55.769 --> 00:12:59.070
loss, it simply ignores all the irrelevant words.

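NOTE
[Editor's note] A hedged sketch of sparse bag-of-words text classification with a linear SVM; the toy corpus and labels are invented by the editor.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    docs = ["the striker scored a late goal",
            "the keeper saved the penalty kick",
            "shares fell as the market closed lower",
            "the bank raised interest rates again"]
    labels = ["sports", "sports", "finance", "finance"]
    # Every unique word becomes a dimension; most weights end up near zero.
    model = make_pipeline(CountVectorizer(), LinearSVC()).fit(docs, labels)
    print(model.predict(["the goal came from a corner kick"]))  # likely ['sports']
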
00:12:59.350 --> 00:13:01.889
It focuses strictly on the few critical words,

00:13:01.929 --> 00:13:04.490
the support vectors, that actually define the

00:13:04.490 --> 00:13:07.210
boundary between, say, a sports article and a

00:13:07.210 --> 00:13:09.750
financial report. That makes total sense. And

00:13:09.750 --> 00:13:11.550
the source also mentions image segmentation,

00:13:11.399 --> 00:13:14.179
analyzing satellite SAR data,

00:13:14.679 --> 00:13:17.460
handwriting recognition. Yes, all high-dimensional

00:13:17.460 --> 00:13:19.240
problems. But the one that really stood out to

00:13:19.240 --> 00:13:22.240
me was the biological sciences. The text specifically

00:13:22.240 --> 00:13:25.240
calls out that SVMs are used to classify protein

00:13:25.240 --> 00:13:27.500
structures and they frequently achieve up to

00:13:27.500 --> 00:13:30.159
90% accuracy. It's amazing. The high-dimensional

00:13:30.159 --> 00:13:32.519
mapping is perfectly suited for analyzing the

00:13:32.519 --> 00:13:35.379
complex chemical properties of biological compounds.

00:13:35.879 --> 00:13:38.059
But biology introduces a strict requirement,

00:13:38.360 --> 00:13:40.559
right? Scientists can't just blindly trust a

00:13:40.559 --> 00:13:43.919
machine's prediction. No, absolutely not. If a

00:13:43.919 --> 00:13:47.039
model predicts a compound is toxic, the chemist

00:13:47.039 --> 00:13:50.820
needs to know exactly why. Which molecular feature

00:13:50.820 --> 00:13:53.259
triggered that classification? Yeah. But how

00:13:53.259 --> 00:13:55.360
do you extract that reasoning from an infinite

00:13:55.360 --> 00:13:58.360
dimensional hyperplane? It's tough. Researchers

00:13:58.360 --> 00:14:02.120
rely on techniques like permutation tests based

00:14:02.120 --> 00:14:05.159
on the SVM's feature weights. Okay, how does

00:14:05.159 --> 00:14:07.970
that work? Well, when a linear SVM draws its

00:14:07.970 --> 00:14:11.190
boundary, every feature of the data has a mathematical

00:14:11.190 --> 00:14:13.750
weight attached to it. That weight indicates

00:14:13.750 --> 00:14:16.269
how heavily that specific feature influenced

00:14:16.269 --> 00:14:18.750
the angle of the hyperplane. But just looking

00:14:18.750 --> 00:14:20.929
at the weight can be deceptive, right? If the

00:14:20.929 --> 00:14:23.190
features are heavily correlated or tangled together.

00:14:23.470 --> 00:14:25.769
Exactly. Which is why they run a permutation

00:14:25.769 --> 00:14:28.470
test. The researchers will intentionally scramble

00:14:28.470 --> 00:14:30.750
or randomize the data for one specific feature

00:14:30.750 --> 00:14:33.509
across all the samples and then run the SVM again.

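NOTE
[Editor's note] A sketch of the scrambling idea, done by hand on synthetic data (the editor's illustration; scikit-learn also ships sklearn.inspection.permutation_importance). Here the trained model is re-scored after each scramble rather than refit.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    X, y = make_classification(n_samples=300, n_features=5, random_state=0)
    clf = SVC(kernel="linear").fit(X, y)
    base = clf.score(X, y)
    rng = np.random.default_rng(0)
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])  # scramble one feature across all samples
        print(f"feature {j}: accuracy drop {base - clf.score(Xp, y):.3f}")
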
00:14:33.629 --> 00:14:35.850
Oh, I see. So if the model's accuracy suddenly

00:14:35.850 --> 00:14:38.289
plummets, you know that specific feature was

00:14:38.289 --> 00:14:40.610
structurally vital. Right. And if you scramble

00:14:40.610 --> 00:14:43.090
a feature and the accuracy barely changes, that

00:14:43.090 --> 00:14:45.549
feature was essentially irrelevant. That's a

00:14:45.549 --> 00:14:47.250
clever way to reverse-engineer the internal

00:14:47.250 --> 00:14:50.460
logic. But, you know... Applying these models

00:14:50.460 --> 00:14:53.779
to complex fields exposes a pretty massive vulnerability

00:14:53.779 --> 00:14:56.259
in the whole SVM architecture. The elephant in

00:14:56.259 --> 00:14:58.500
the room. Right. For all this incredible dimension

00:14:58.500 --> 00:15:01.580
warping power, SVMs are inherently binary. They

00:15:01.580 --> 00:15:04.419
only do two-class tasks. They do. The math of

00:15:04.419 --> 00:15:07.100
a hyperplane only divides space into two halves.

00:15:07.600 --> 00:15:10.220
It can only answer yes or no, group A or group

00:15:10.220 --> 00:15:14.240
B. But the real world is rarely binary. Like,

00:15:14.399 --> 00:15:16.659
what if an image recognition system needs to

00:15:16.659 --> 00:15:19.539
sort photos into five different categories? Dogs,

00:15:19.759 --> 00:15:23.259
cats, birds, cars, and airplanes. You can't divide

00:15:23.259 --> 00:15:25.740
five groups with a single flat line. You can't.

00:15:25.879 --> 00:15:28.220
And this raises a major issue about the limitations

00:15:28.220 --> 00:15:30.879
of the math. To sort multiple classes, engineers

00:15:30.879 --> 00:15:33.720
are forced to build these elaborate, clunky workarounds

00:15:33.720 --> 00:15:36.659
on top of the binary core. The source mentions

00:15:36.659 --> 00:15:39.100
the one versus all strategy, where you literally

00:15:39.100 --> 00:15:41.279
have to build five separate binary classifiers,

00:15:41.360 --> 00:15:44.059
like one classifier tests: is this a dog, or is

00:15:44.059 --> 00:15:45.539
it literally anything else? And the next one

00:15:45.539 --> 00:15:48.639
is, is this a cat or anything else? Yeah, or

00:15:48.639 --> 00:15:51.159
they use the even more computationally heavy

00:15:51.159 --> 00:15:54.600
one versus one strategy. That's where every single

00:15:54.600 --> 00:15:57.580
class fights every other class in a massive round

00:15:57.580 --> 00:15:59.679
robin tournament. That sounds exhausting. It

00:15:59.679 --> 00:16:04.080
is. Is it a dog or a cat, a dog or a bird, a

00:16:04.080 --> 00:16:07.429
cat or a car? That just seems absurdly inefficient.

00:16:08.509 --> 00:16:10.750
You're forcing this really elegant algorithm

00:16:10.750 --> 00:16:13.909
to run dozens of redundant calculations. It's

00:16:13.909 --> 00:16:16.250
a huge bottleneck. It's why they had to develop

00:16:16.250 --> 00:16:18.590
structural solutions like directed acyclic graphs

00:16:18.590 --> 00:16:20.889
to try and streamline the multiclass problem.

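NOTE
[Editor's note] A sketch of both workarounds via scikit-learn's wrappers; the three-class iris dataset stands in for the five-class photo example. Editor's illustration.
    from sklearn.datasets import load_iris
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import LinearSVC
    X, y = load_iris(return_X_y=True)
    # One-vs-all: one binary SVM per class ("dog vs. literally anything else").
    ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
    # One-vs-one: a round robin, one SVM per PAIR of classes, k(k-1)/2 total.
    ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
    print(len(ovr.estimators_))  # 3 classifiers for 3 classes
    print(len(ovo.estimators_))  # 3 here; 5 classes would need 10
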
00:16:21.090 --> 00:16:23.110
Right. And beyond the multiclass issue, there

00:16:23.110 --> 00:16:25.870
is another profound drawback here. The source

00:16:25.870 --> 00:16:28.529
notes the parameters are difficult to interpret

00:16:28.529 --> 00:16:31.509
and it lacks calibrated class probabilities. Yes,

00:16:31.529 --> 00:16:33.970
that's a big one for how we interact with AI

00:16:33.970 --> 00:16:36.870
today because like a logistic regression model

00:16:36.870 --> 00:16:39.110
will give you a clean percentage. It'll tell you,

00:16:39.110 --> 00:16:41.769
I am 88 percent confident this transaction is

00:16:41.769 --> 00:16:44.250
fraudulent But an SVM doesn't do that because

00:16:44.250 --> 00:16:46.789
it relies entirely on flat geometry and hinge

00:16:46.789 --> 00:16:50.009
loss. It just says the data point is on the fraudulent

00:16:50.009 --> 00:16:52.049
side of the hyperplane. It doesn't give you a

00:16:52.049 --> 00:16:54.789
probability. It just gives you a definitive location.

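NOTE
[Editor's note] A sketch contrasting the SVM's native geometric output with a bolted-on probability; SVC's probability=True option applies Platt scaling, one of the "secondary techniques" mentioned just below. Editor's illustration on synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC
    X, y = make_classification(n_samples=200, random_state=0)
    clf = SVC(kernel="linear").fit(X, y)
    # Native output: a signed score; the sign is the side of the hyperplane.
    print(clf.decision_function(X[:3]))
    # Secondary calibration step, not native geometry:
    prob = SVC(kernel="linear", probability=True).fit(X, y)
    print(prob.predict_proba(X[:3]))
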
00:16:55.589 --> 00:16:57.970
Exactly. You can try to map the distance from

00:16:57.970 --> 00:17:01.230
the boundary into a probability score using secondary

00:17:01.230 --> 00:17:03.870
techniques, but natively. Natively, it's purely

00:17:03.870 --> 00:17:06.109
geometric. You either fall inside the boundary

00:17:06.109 --> 00:17:08.789
or outside it. Which means at the end of the

00:17:08.789 --> 00:17:10.869
day, we are essentially dealing with a powerful

00:17:10.869 --> 00:17:13.130
black box. We really are. I mean, we understand

00:17:13.130 --> 00:17:15.190
the high level mechanics. We know how the kernels

00:17:15.190 --> 00:17:17.650
warp the space. We know the hinge loss optimizes

00:17:17.650 --> 00:17:21.089
the margin. But when an SVM maps millions of

00:17:21.089 --> 00:17:23.809
data points into a high dimensional space and

00:17:23.809 --> 00:17:26.819
spits out a rigid binary decision? Yeah, human

00:17:26.819 --> 00:17:29.460
intuition simply cannot visualize the boundary

00:17:29.460 --> 00:17:32.500
it just drew. Right. We get unparalleled accuracy,

00:17:32.759 --> 00:17:35.500
but we sacrifice transparent, human-readable

00:17:35.500 --> 00:17:38.380
reasoning. It refuses to explain its work. It

00:17:38.380 --> 00:17:40.160
is the ultimate trade-off. But, you know, before

00:17:40.160 --> 00:17:42.579
we wrap up, there is one final extension of this

00:17:42.579 --> 00:17:44.660
framework from the source that I think is fascinating.

00:17:44.759 --> 00:17:47.180
It was formally introduced by Vapnik in 1998,

00:17:47.180 --> 00:17:50.900
and it kind of blurs the line between rigid algorithms

00:17:50.900 --> 00:17:53.140
and intuitive reasoning even further. Oh, you're

00:17:53.140 --> 00:17:54.819
talking about transductive support vector machines.

00:17:55.140 --> 00:17:57.539
Yes, transduction. How does transductive learning

00:17:57.539 --> 00:18:00.400
diverge from the standard SVM model we've been

00:18:00.400 --> 00:18:02.319
talking about this whole time? Well, everything

00:18:02.319 --> 00:18:04.839
we've covered so far is based on inductive learning.

00:18:05.339 --> 00:18:07.559
Inductive learning looks at the labeled training

00:18:07.559 --> 00:18:10.759
data, infers a general universal mathematical

00:18:10.759 --> 00:18:14.000
rule, and then blindly applies that static rule

00:18:14.000 --> 00:18:17.460
to new unknown data later on. It builds a philosophy

00:18:17.460 --> 00:18:19.859
of the world and just forces every new experience

00:18:19.859 --> 00:18:22.880
to conform to it. Exactly. But a transductive

00:18:22.880 --> 00:18:26.180
SVM operates completely differently. It is fed

00:18:26.180 --> 00:18:29.559
a mix of the labeled training data, and the specific

00:18:29.559 --> 00:18:32.480
unlabeled test data at the exact same time. Wait,

00:18:32.480 --> 00:18:34.180
really? It looks at the unlabeled data before

00:18:34.180 --> 00:18:37.079
it even finishes drawing the maximum margin hyperplane?

00:18:37.279 --> 00:18:40.299
Yes. Now, it obviously doesn't know the answers,

00:18:40.400 --> 00:18:43.119
the labels, for the test data, but it can see

00:18:43.119 --> 00:18:45.880
their geometric distribution in the space. Wow.

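NOTE
[Editor's note] scikit-learn has no transductive SVM; as a loosely related flavor of "let the unlabeled data shape the boundary," here is a semi-supervised self-training sketch. This is NOT Vapnik's TSVM, just the editor's nearest runnable illustration.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.semi_supervised import SelfTrainingClassifier
    from sklearn.svm import SVC
    X, y = make_classification(n_samples=300, random_state=0)
    y_partial = y.copy()
    y_partial[50:] = -1  # -1 marks unlabeled points, visible during training
    base = SVC(kernel="rbf", probability=True)  # wrapper needs predict_proba
    clf = SelfTrainingClassifier(base).fit(X, y_partial)
    print(clf.score(X[50:], y[50:]))  # boundary shaped with unlabeled data in view
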
00:18:46.039 --> 00:18:48.480
Yeah. It uses the density and the structural

00:18:48.480 --> 00:18:51.640
layout of the unknown data to help guide exactly

00:18:51.640 --> 00:18:53.980
where the boundary should be drawn. It's literally

00:18:53.980 --> 00:18:56.000
using the structure of the unknown to inform

00:18:56.000 --> 00:18:58.259
its understanding of the known. It's a brilliant

00:18:58.259 --> 00:19:00.980
shift in optimization. The machine is no longer

00:19:00.980 --> 00:19:03.640
trying to build a universal rule for all possible

00:19:03.640 --> 00:19:06.940
future data. It is specifically optimizing its

00:19:06.940 --> 00:19:10.059
margin to correctly classify the exact set of

00:19:10.059 --> 00:19:11.839
test points sitting right in front of it. That

00:19:11.839 --> 00:19:15.160
is wild. You aren't forcing a preconceived rule

00:19:15.160 --> 00:19:17.940
onto a new environment. You are letting the reality

00:19:17.940 --> 00:19:20.440
of the new environment actively influence the

00:19:20.440 --> 00:19:22.539
rule. Right. It factors in the context of the

00:19:22.539 --> 00:19:25.500
new data before it makes a decision. Mathematically,

00:19:25.960 --> 00:19:28.500
transduction allows the SVM to find boundaries

00:19:28.500 --> 00:19:31.380
in complex spaces that an inductive model would

00:19:31.380 --> 00:19:33.240
completely miss. Because the inductive model

00:19:33.240 --> 00:19:35.500
wouldn't realize how the new data was clustered

00:19:35.500 --> 00:19:38.359
until it was already too late. Exactly. Which

00:19:38.359 --> 00:19:41.369
honestly leaves you with something truly fascinating

00:19:41.369 --> 00:19:43.470
to ponder today. Oh, definitely. Because if a

00:19:43.470 --> 00:19:46.349
machine can categorize new, unknown information

00:19:46.349 --> 00:19:49.490
simply by analyzing its geometric relationship

00:19:49.490 --> 00:19:51.650
to a few known examples in a high-dimensional

00:19:51.650 --> 00:19:54.910
space, actively allowing the unknown context

00:19:54.910 --> 00:19:58.009
to shift its boundaries, where exactly is the

00:19:58.009 --> 00:20:00.789
line between mathematical classification and

00:20:00.789 --> 00:20:03.490
human intuition? It's a profound thought. We

00:20:03.490 --> 00:20:06.710
assume intuition is this uniquely human biological

00:20:06.710 --> 00:20:09.359
trait. Right. But when you walk into a new room

00:20:09.359 --> 00:20:12.960
and instantly read the vibe, or when you immediately

00:20:12.960 --> 00:20:16.200
categorize a stranger based on a few subtle

00:20:16.200 --> 00:20:18.660
contextual cues, are you just running a highly

00:20:18.660 --> 00:20:22.799
evolved biological kernel trick? Exactly. Are

00:20:22.799 --> 00:20:26.039
our brains just rapidly mapping contextual clues

00:20:26.039 --> 00:20:28.740
into higher dimensions and drawing transductive

00:20:28.740 --> 00:20:31.019
hyperplanes without us even realizing it? We

00:20:31.019 --> 00:20:33.279
may be operating on geometric boundaries far

00:20:33.279 --> 00:20:35.460
more often than we care to admit. It's definitely

00:20:35.460 --> 00:20:37.000
something to think about the next time you jump

00:20:37.000 --> 00:20:39.079
to a conclusion. Yeah. Well, thank you for joining

00:20:39.079 --> 00:20:41.259
us on this deep dive. It's been a great conversation.

00:20:41.480 --> 00:20:43.799
It really has. Just remember, whether you are

00:20:43.799 --> 00:20:45.779
filtering out the noise in your daily life, trying

00:20:45.779 --> 00:20:49.059
to classify complex data, or calculating infinite

00:20:49.059 --> 00:20:52.200
dimensional hyperplanes, the goal is always just

00:20:52.200 --> 00:20:54.099
finding the clearest path through the noise.

00:20:54.380 --> 00:20:56.079
Keep questioning where the boundaries are drawn

00:20:56.079 --> 00:20:58.680
and maybe give yourself a little soft margin.
