WEBVTT

00:00:00.000 --> 00:00:02.319
You know, when you picture how a machine learns,

00:00:02.580 --> 00:00:05.889
you probably imagine this massive, cavernous

00:00:05.889 --> 00:00:08.310
library. Right, totally. Just millions of books,

00:00:08.429 --> 00:00:11.929
billions of images, endless spreadsheets of data

00:00:11.929 --> 00:00:14.050
just being shoveled into a computer until it

00:00:14.050 --> 00:00:16.949
finally figures out how to recognize a cat or

00:00:16.949 --> 00:00:19.649
translate a sentence or diagnose a disease. Yeah,

00:00:19.649 --> 00:00:21.390
we tend to treat artificial intelligence like

00:00:21.390 --> 00:00:24.370
a speed reader that never sleeps. Right. Assuming

00:00:24.370 --> 00:00:26.230
the only way to teach it is by force feeding

00:00:26.230 --> 00:00:30.289
it absolutely everything we have. But welcome

00:00:30.289 --> 00:00:33.270
to today's deep dive. Because if you're joining

00:00:33.270 --> 00:00:35.909
us, you are in for a pretty fascinating shift

00:00:35.909 --> 00:00:38.070
in perspective. Yeah, it really turns that whole

00:00:38.070 --> 00:00:40.649
idea upside down. It does. Today, our mission

00:00:40.649 --> 00:00:43.329
is to explore a comprehensive Wikipedia overview

00:00:43.329 --> 00:00:46.149
on a concept called active learning. We're going

00:00:46.149 --> 00:00:49.590
to look at how we can train AI, not by force

00:00:49.590 --> 00:00:52.350
feeding it massive amounts of data, but by teaching

00:00:52.350 --> 00:00:55.429
the AI to actually ask humans the right questions.

00:00:55.590 --> 00:00:58.009
Which is, I mean, it's a complete reversal of

00:00:58.009 --> 00:00:59.770
the brute force approach to education we're used

00:00:59.770 --> 00:01:02.310
to. Okay, let's unpack this. The fundamental

00:01:02.310 --> 00:01:04.530
problem we're looking at here is the staggering

00:01:04.530 --> 00:01:07.189
cost of human labor. Right, because for a long

00:01:07.189 --> 00:01:10.010
time, standard supervised learning relied entirely

00:01:10.010 --> 00:01:13.250
on the assumption that volume is the key to accuracy.

00:01:13.549 --> 00:01:15.590
Like just gather up all the answers up front,

00:01:15.689 --> 00:01:18.189
hand them to the algorithm, and just hope it

00:01:18.189 --> 00:01:20.709
eventually spots the underlying patterns. Exactly.

00:01:21.209 --> 00:01:24.090
But as the source material points out, that method

00:01:24.090 --> 00:01:27.870
hits a massive wall in the real world. We live

00:01:27.870 --> 00:01:30.750
in a world with oceans of raw, unlabeled data.

00:01:30.920 --> 00:01:34.120
It is literally everywhere. But paying human

00:01:34.120 --> 00:01:37.140
beings to sit there and manually label that data?

00:01:37.319 --> 00:01:39.060
Oh, it's a nightmare. Yeah, whether you're paying

00:01:39.060 --> 00:01:42.159
doctors to read thousands of x-rays or linguists

00:01:42.159 --> 00:01:44.560
to tag parts of speech in a million sentences,

00:01:44.920 --> 00:01:47.939
it's incredibly expensive, and it is agonizingly

00:01:47.939 --> 00:01:50.159
slow. So the question becomes, instead of paying

00:01:50.159 --> 00:01:52.900
experts to label a million random examples, what

00:01:52.900 --> 00:01:55.280
if we could teach the AI to look at the data

00:01:55.280 --> 00:01:57.900
and figure out the exact specific questions it

00:01:57.900 --> 00:02:00.480
needs to ask to learn the fastest? And that's

00:02:00.480 --> 00:02:02.120
the core definition of active learning, right?

00:02:02.140 --> 00:02:04.459
Yeah. It is an iterative supervised learning

00:02:04.459 --> 00:02:07.099
process. But instead of just passively receiving

00:02:07.099 --> 00:02:10.300
data, the algorithm interactively queries a human

00:02:10.300 --> 00:02:12.780
expert, which the literature calls the teacher

00:02:12.780 --> 00:02:16.039
or the oracle. Oracle, I love that. Yeah, it sounds

00:02:16.039 --> 00:02:19.020
very mystical. But the crucial detail here is

00:02:19.020 --> 00:02:21.580
that the algorithm isn't just asking for random

00:02:21.580 --> 00:02:24.919
labels. It is mathematically targeting the specific

00:02:24.919 --> 00:02:26.860
data points that will be the most informative

00:02:26.860 --> 00:02:28.639
to its current state of understanding, which

00:02:28.639 --> 00:02:32.120
means it learns faster, vastly faster. By choosing

00:02:32.120 --> 00:02:35.219
its own examples, the algorithm can learn a complex

00:02:35.219 --> 00:02:38.180
concept with a tiny fraction of the data points

00:02:38.180 --> 00:02:40.780
it would need in normal supervised learning.

00:02:40.939 --> 00:02:44.039
Wow. It speeds up the development process so

00:02:44.039 --> 00:02:46.580
dramatically that, in certain highly complex

00:02:46.580 --> 00:02:49.479
scenarios, an active learning model can essentially

00:02:49.479 --> 00:02:52.639
stand in for a quantum computer or a supercomputer.

00:02:52.919 --> 00:02:55.419
Wait, okay, that is a staggering claim. Just

00:02:55.419 --> 00:02:57.300
by asking the right questions, you bypass the

00:02:57.300 --> 00:02:59.240
need for a supercomputer? In specific cases,

00:02:59.379 --> 00:03:02.539
yeah. That's wild. But I guess if we want to

00:03:02.539 --> 00:03:05.319
understand how the AI asks for help, we first

00:03:05.319 --> 00:03:07.520
have to understand how it maps out its own ignorance,

00:03:07.620 --> 00:03:09.500
you know? We have to look at the mathematics

00:03:09.500 --> 00:03:12.300
of the unknown. Right, and let's ground this

00:03:12.300 --> 00:03:15.020
in a real-world scenario from the sources. Imagine

00:03:15.020 --> 00:03:18.039
you are a biological researcher trying to engineer

00:03:18.300 --> 00:03:22.159
a new highly specific protein. Okay. The total

00:03:22.159 --> 00:03:25.139
universe of all possible data under consideration

00:03:25.139 --> 00:03:28.159
like every protein you could possibly test, is

00:03:28.159 --> 00:03:30.819
represented mathematically by the letter T. Okay

00:03:30.819 --> 00:03:33.719
so T is the massive pile of everything. Everything.

00:03:33.939 --> 00:03:35.879
It includes the handful of proteins you've already

00:03:35.879 --> 00:03:38.580
tested in the lab plus the millions of variations

00:03:38.580 --> 00:03:40.680
you haven't even looked at yet. Correct. And

00:03:40.680 --> 00:03:43.180
remember active learning moves in cycles or iterations.

00:03:43.240 --> 00:03:46.159
Right. During every single iteration which the

00:03:46.159 --> 00:03:49.479
math labels as I, the algorithm's job is to take

00:03:49.479 --> 00:03:52.340
that massive universe of data, T, and break it

00:03:52.340 --> 00:03:54.800
down into three distinct subsets. Okay,

00:03:54.840 --> 00:03:57.599
so slicing up the pie. Exactly. The first subset

00:03:57.599 --> 00:03:59.599
is the data where the label is already known.

00:03:59.699 --> 00:04:03.219
We call this T sub K, I. So in our protein lab,

00:04:03.319 --> 00:04:06.300
T sub K, I is the small stack of lab reports

00:04:06.300 --> 00:04:08.819
sitting on your desk. The AI already has the

00:04:08.819 --> 00:04:10.840
answers for those. It knows exactly what those

00:04:10.840 --> 00:04:13.300
specific proteins do. Yes. Then you have the

00:04:13.300 --> 00:04:16.759
second subset, which is T sub U. U for unknown.

00:04:17.180 --> 00:04:20.240
Right. This is the vast ocean of data where the

00:04:20.240 --> 00:04:22.839
label is unknown. The millions of proteins we

00:04:22.839 --> 00:04:25.240
haven't tested, the AI is totally clueless about

00:04:25.240 --> 00:04:27.660
these. Which brings us to the third and really

00:04:27.660 --> 00:04:32.439
the most critical subset, T sub C, I. And C stands

00:04:32.439 --> 00:04:35.920
for chosen. Yes. These are the highly specific

00:04:35.920 --> 00:04:38.500
targeted data points that the AI plucks out of

00:04:38.500 --> 00:04:40.879
the unknown pile, hands to the human oracle,

00:04:41.100 --> 00:04:43.100
and basically says, please test this one next.
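One iteration of the loop just described can be sketched in a few lines of Python. This is an illustrative sketch, not code from the source: the names `query_oracle` and `choose` stand in for whatever labeling interface and query strategy a real system would use.

```python
# Hypothetical sketch of one active-learning iteration i: the full data
# universe T splits into known (T_K), unknown (T_U), and chosen (T_C)
# subsets, and the chosen points are handed to the human oracle.

def active_learning_iteration(T, labels, query_oracle, choose, batch_size=1):
    """Partition T for iteration i and query the oracle on the chosen subset."""
    T_K = [x for x in T if x in labels]        # labels already known
    T_U = [x for x in T if x not in labels]    # labels still unknown
    T_C = choose(T_U, batch_size)              # most informative unknowns
    for x in T_C:
        labels[x] = query_oracle(x)            # "please test this one next"
    return T_K, T_U, T_C

# Toy run: the "oracle" says a protein is active when its id is even.
T = list(range(10))
labels = {0: True, 1: False}                   # two lab reports already done
T_K, T_U, T_C = active_learning_iteration(
    T, labels,
    query_oracle=lambda x: x % 2 == 0,
    choose=lambda unknown, k: unknown[:k],     # placeholder query strategy
)
```

Each pass through the loop moves a few points from the unknown pile into the known pile, which is the whole cycle the hosts describe.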

00:04:43.699 --> 00:04:46.939
Man, defining that chosen subset, figuring out

00:04:46.939 --> 00:04:50.300
exactly what goes into T sub C, that feels like

00:04:50.300 --> 00:04:52.420
the entire magic trick of this whole field. Oh,

00:04:52.439 --> 00:04:54.339
it is the entire battlefield of active learning

00:04:54.339 --> 00:04:56.899
research. The algorithm's sole objective is to

00:04:56.899 --> 00:04:59.259
extract the absolute maximum amount of learning

00:04:59.259 --> 00:05:01.860
from the minimum number of human queries. I kind

00:05:01.860 --> 00:05:03.920
of like to think of it like a student studying

00:05:03.920 --> 00:05:07.459
for a grueling final exam. Okay, how so? Well...

00:05:06.910 --> 00:05:08.889
A standard machine learning model is the student

00:05:08.889 --> 00:05:11.149
who insists on rereading the entire textbook

00:05:11.149 --> 00:05:13.089
from page one, even the chapters they already

00:05:13.089 --> 00:05:15.610
know perfectly. Right. Highly inefficient. Yeah.

00:05:16.290 --> 00:05:18.290
An active learning model is the smart student.

00:05:18.850 --> 00:05:21.290
They take a practice test, figure out, oh, I'm

00:05:21.290 --> 00:05:23.689
terrible at calculus but great at algebra, and

00:05:23.689 --> 00:05:26.550
then they only take their absolute hardest calculus

00:05:26.550 --> 00:05:28.870
questions to the professor during office hours.

00:05:29.410 --> 00:05:31.610
That's actually a great analogy. But here is

00:05:31.610 --> 00:05:33.589
what I struggle with, and I really want to push

00:05:33.589 --> 00:05:36.430
back on this a bit. Go for it. If the AI doesn't

00:05:36.430 --> 00:05:39.209
know the answers to the unknown data, like if

00:05:39.209 --> 00:05:42.209
it's completely clueless about millions of proteins,

00:05:43.050 --> 00:05:45.230
how on earth does it know which questions are

00:05:45.230 --> 00:05:48.050
actually worth asking? Isn't there a massive

00:05:48.050 --> 00:05:50.990
risk that the AI just gets overwhelmed by totally

00:05:50.990 --> 00:05:54.069
useless, confusing, or uninformative examples?

00:05:54.350 --> 00:05:56.329
You're hitting on the biggest nightmare for developers

00:05:56.329 --> 00:05:58.750
in this space. If the algorithm asks the wrong

00:05:58.750 --> 00:06:01.139
questions, the whole system collapses. Because

00:06:01.139 --> 00:06:03.639
it's just spinning its wheels. Exactly. It wastes

00:06:03.639 --> 00:06:05.800
the oracle's time, it spends money, and it learns

00:06:05.800 --> 00:06:08.860
absolutely nothing. Avoiding uninformative examples

00:06:08.860 --> 00:06:11.819
is the whole crux of the field. The machine needs

00:06:11.819 --> 00:06:14.420
a way to evaluate the value of its own ignorance.

00:06:14.680 --> 00:06:16.439
Because if you're deploying this at scale, it's

00:06:16.439 --> 00:06:18.319
not just bothering one professor during office

00:06:18.319 --> 00:06:21.529
hours. Not at all. When you scale this up to

00:06:21.529 --> 00:06:24.129
large real-world projects where an AI needs

00:06:24.129 --> 00:06:26.550
to define its chosen subset at an industrial

00:06:26.550 --> 00:06:29.689
level, it often plugs into human crowdsourcing

00:06:29.689 --> 00:06:32.310
frameworks like Amazon Mechanical Turk, right?

00:06:32.449 --> 00:06:35.149
We are talking about platforms where you might

00:06:35.149 --> 00:06:38.279
have hundreds or thousands of human beings in

00:06:38.279 --> 00:06:40.259
the active learning loop, sitting at their computers,

00:06:40.720 --> 00:06:43.660
and you are paying them per label. Right, actual

00:06:43.660 --> 00:06:46.740
cash. Yes. If your algorithm is feeding them

00:06:46.740 --> 00:06:49.439
useless, redundant questions, you are literally

00:06:49.439 --> 00:06:51.819
burning through your research budget for zero

00:06:51.819 --> 00:06:54.779
gain. So the AI is hunting for the perfect questions

00:06:54.779 --> 00:06:57.240
to ask us, because every question costs money.

00:06:57.399 --> 00:06:59.779
And according to the source material, there are

00:06:59.779 --> 00:07:02.500
three specific scenarios or environments where

00:07:02.500 --> 00:07:04.939
it conducts this hunt. Three ways it can ask

00:07:04.939 --> 00:07:06.639
a question. Let's look at the first one. Pool

00:07:06.639 --> 00:07:08.839
-based sampling. Pool-based sampling is currently

00:07:08.839 --> 00:07:11.600
the most well -known method out there. In this

00:07:11.600 --> 00:07:14.120
environment, the AI gets to evaluate the entire

00:07:14.120 --> 00:07:16.620
pool of unlabeled data at once before it makes

00:07:16.620 --> 00:07:18.519
any decisions. So it looks at the whole data

00:07:18.519 --> 00:07:21.160
set, say all million untested proteins, and it

00:07:21.160 --> 00:07:24.279
assigns a confidence score to every single one.
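The scoring step being described is classic least-confidence sampling, and a minimal sketch looks like this. The dictionary of class probabilities is an illustrative stand-in for a real model's predictions.

```python
# Hypothetical sketch of pool-based sampling: score every point in the
# unlabeled pool by the model's confidence (its top predicted class
# probability) and pick the points it is least sure about.

def least_confident(pool_probs, n_queries):
    """pool_probs: {point_id: [class probabilities]} for the whole pool.
    Returns the n_queries ids with the lowest top-class probability."""
    confidence = {pid: max(probs) for pid, probs in pool_probs.items()}
    ranked = sorted(confidence, key=confidence.get)  # least confident first
    return ranked[:n_queries]

pool_probs = {
    "protein_a": [0.98, 0.02],   # model is nearly certain
    "protein_b": [0.55, 0.45],   # model is on the fence
    "protein_c": [0.80, 0.20],
}
to_label = least_confident(pool_probs, n_queries=1)
```

Note that the whole pool must be scored and held at once, which is exactly the memory cost discussed next.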

00:07:24.519 --> 00:07:27.360
Yes. So it's assigning a mathematical measurement

00:07:27.360 --> 00:07:30.199
of how well it thinks it understands that specific

00:07:30.199 --> 00:07:32.779
data point. Yeah, even without having the actual

00:07:32.779 --> 00:07:36.220
label. So it scores everything, sorts that massive

00:07:36.220 --> 00:07:39.100
list, selects the instances where its confidence

00:07:39.100 --> 00:07:42.120
is the absolute lowest, and then sends those

00:07:42.120 --> 00:07:45.220
specific, highly ambiguous points to the human

00:07:45.220 --> 00:07:47.879
teacher. Exactly. It sounds incredibly thorough,

00:07:48.180 --> 00:07:50.600
like it surveys the entire landscape before making

00:07:50.600 --> 00:07:53.360
a move. But evaluating a million things at once?

00:07:53.699 --> 00:07:56.699
That has to take a toll. Oh, it does. The theoretical

00:07:56.699 --> 00:07:59.620
limit to pool-based sampling is computer memory.

00:08:00.769 --> 00:08:02.910
Evaluating and holding confidence scores for

00:08:02.910 --> 00:08:05.529
billions of data points simultaneously requires

00:08:05.529 --> 00:08:08.050
massive computational overhead. I can imagine.

00:08:08.250 --> 00:08:10.370
Think about how your own laptop slows down when

00:08:10.370 --> 00:08:13.470
you have too many tabs open. Now imagine an AI

00:08:13.470 --> 00:08:15.750
trying to hold millions of complex mathematical

00:08:15.750 --> 00:08:17.730
evaluations in its active memory. Yeah, that

00:08:17.730 --> 00:08:19.629
would fry my computer. And even if you have the

00:08:19.629 --> 00:08:21.329
world's biggest computer, you still have to deal

00:08:21.329 --> 00:08:23.009
with the human on the other end. And that's the

00:08:23.009 --> 00:08:26.250
practical limit, human fatigue. The oracle is

00:08:26.250 --> 00:08:29.470
a person. They get tired, their attention wanes,

00:08:29.509 --> 00:08:32.070
and they require a paycheck. No matter how many

00:08:32.070 --> 00:08:34.870
brilliant, ambiguous points the machine finds

00:08:34.870 --> 00:08:38.090
in its massive pool, the human can only process

00:08:38.090 --> 00:08:40.289
a limited number before they start making mistakes.

00:08:40.669 --> 00:08:43.710
Okay, so if looking at the whole pool crashes

00:08:43.710 --> 00:08:46.330
your computer's memory, how do we fix that? Do

00:08:46.330 --> 00:08:49.789
we just, like, force the AI to put blinders on

00:08:49.789 --> 00:08:52.210
and look at data one piece at a time? That is

00:08:52.210 --> 00:08:54.370
exactly what the second scenario does. It's called

00:08:54.370 --> 00:08:57.590
stream-based selective sampling. Right, so if

00:08:57.590 --> 00:09:00.090
the pool-based method is a guy standing on a

00:09:00.090 --> 00:09:02.570
balcony looking at a massive crowd of data all

00:09:02.570 --> 00:09:05.490
at once, the stream-based method is a guy standing

00:09:05.490 --> 00:09:08.110
at a turnstile checking tickets one by one. That's

00:09:08.110 --> 00:09:10.330
a great way to put it. The AI looks at the data

00:09:10.330 --> 00:09:12.509
sequentially. For every single item that passes

00:09:12.509 --> 00:09:14.909
through, it makes a split -second decision. Do

00:09:14.909 --> 00:09:17.009
I label this myself, or do I ask the teacher?
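That split-second turnstile decision can be sketched with a simple confidence threshold. The threshold value and the toy model here are illustrative assumptions, not from the source.

```python
# Hypothetical sketch of stream-based selective sampling: each arriving
# point gets an immediate decision -- self-label it if the model is
# confident enough, otherwise ask the human teacher.

def process_stream(stream, predict_proba, oracle, threshold=0.75):
    """stream yields points one at a time; there is no global view."""
    labels, queries = {}, []
    for x in stream:
        probs = predict_proba(x)
        if max(probs) >= threshold:
            labels[x] = probs.index(max(probs))  # trust the model's guess
        else:
            labels[x] = oracle(x)                # too uncertain: ask the human
            queries.append(x)
    return labels, queries

# Toy model: confident on even numbers, uncertain on odd ones.
predict = lambda x: [0.9, 0.1] if x % 2 == 0 else [0.6, 0.4]
labels, queries = process_stream(range(4), predict, oracle=lambda x: 1)
```

Because the decision is made point by point, a poorly chosen threshold sends more questions to the teacher than a pool-based method would, which is the trade-off discussed below.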

00:09:17.259 --> 00:09:20.139
But wait, if the AI is making these rapid-fire

00:09:20.139 --> 00:09:22.519
decisions one by one, especially early on when

00:09:22.519 --> 00:09:24.820
it hasn't learned much yet, it has no idea what's

00:09:24.820 --> 00:09:26.720
coming next in the stream. Isn't it just flying

00:09:26.720 --> 00:09:29.820
blind? Yes. That is the major concession you

00:09:29.820 --> 00:09:32.139
make with stream -based sampling. Because it

00:09:32.139 --> 00:09:33.860
only sees a data point right in front of its

00:09:33.860 --> 00:09:36.700
nose, it doesn't know if a vastly more informative

00:09:36.700 --> 00:09:38.940
data point is coming 10 steps down the line.

00:09:38.940 --> 00:09:42.559
Oh, man. It lacks a global view of the data set's

00:09:42.559 --> 00:09:45.399
distribution. So it might panic and ask the human

00:09:45.399 --> 00:09:47.919
to label a slightly confusing data point now,

00:09:48.440 --> 00:09:51.039
completely unaware that a perfectly clear, incredibly

00:09:51.039 --> 00:09:52.960
educational data point was about to come through

00:09:52.960 --> 00:09:56.990
the turnstile a minute later. Exactly. And because

00:09:56.990 --> 00:09:59.309
it can't efficiently capitalize on the structure

00:09:59.309 --> 00:10:01.649
of the data, the algorithm tends to be a bit

00:10:01.649 --> 00:10:05.370
more needy. Needy AI. Yeah. It asks more questions

00:10:05.370 --> 00:10:07.830
overall, meaning the human teacher ends up spending

00:10:07.830 --> 00:10:10.669
more effort supplying labels compared to the

00:10:10.669 --> 00:10:13.029
pool-based approach. You save computer memory,

00:10:13.330 --> 00:10:16.190
but you might waste more human time. OK, so pool

00:10:16.190 --> 00:10:18.629
-based looks at everything but eats up memory.

00:10:18.919 --> 00:10:21.039
Stream-based looks at one thing at a time, but

00:10:21.039 --> 00:10:23.139
wastes human effort because it lacks context.

00:10:23.240 --> 00:10:25.940
But what if you are in a situation where you

00:10:25.940 --> 00:10:28.700
don't even have a pool or a stream? What if your

00:10:28.700 --> 00:10:31.399
starting data is almost non-existent? Then you

00:10:31.399 --> 00:10:33.419
need a different approach entirely. Here's where

00:10:33.419 --> 00:10:36.200
it gets really interesting. The third scenario.

00:10:36.899 --> 00:10:39.179
Membership query synthesis. This is a radical

00:10:39.179 --> 00:10:41.500
departure from the other two. In this scenario,

00:10:41.740 --> 00:10:44.519
the AI doesn't just look at existing data. It

00:10:44.519 --> 00:10:47.860
literally plays Frankenstein. It does. It uses

00:10:47.860 --> 00:10:50.759
generative AI techniques to synthesize entirely

00:10:50.759 --> 00:10:53.620
artificial data from scratch, and then asks the

00:10:53.620 --> 00:10:56.600
human to label its creation. The example from

00:10:56.600 --> 00:11:00.419
the sources is wild. The animal one. Yes. If

00:11:00.419 --> 00:11:02.620
the AI is trying to learn the visual boundary

00:11:02.620 --> 00:11:05.059
between humans and animals, it might generate

00:11:05.059 --> 00:11:08.159
a bizarre clipped synthetic image of a leg, a

00:11:08.159 --> 00:11:10.220
leg that doesn't actually exist anywhere in the

00:11:10.220 --> 00:11:12.340
real world, and present it to the human oracle

00:11:12.340 --> 00:11:16.080
asking, is this appendage human or animal? Theoretically,

00:11:16.220 --> 00:11:19.059
it's a brilliant concept. The AI can probe the

00:11:19.059 --> 00:11:21.879
exact boundaries of its own knowledge by manifesting

00:11:21.879 --> 00:11:23.899
the precise data point that would clear up its

00:11:23.899 --> 00:11:26.360
confusion. But the constraints of reality make

00:11:26.360 --> 00:11:29.379
this exceptionally difficult. The massive challenge

00:11:29.379 --> 00:11:32.139
of query synthesis is that the synthetic data

00:11:32.139 --> 00:11:35.419
must strictly obey the natural laws and constraints

00:11:35.419 --> 00:11:37.990
of the real world. Because the AI doesn't actually

00:11:37.990 --> 00:11:40.129
understand what a leg is, it just understands

00:11:40.129 --> 00:11:42.570
pixel values and mathematical distributions.

00:11:42.830 --> 00:11:44.529
Precisely. And the source material provides a

00:11:44.529 --> 00:11:47.389
fantastic medical example of where this generative

00:11:47.389 --> 00:11:51.309
process can fail completely. Imagine an AI generating

00:11:51.309 --> 00:11:54.529
a synthetic blood test to learn about diagnosing

00:11:54.529 --> 00:11:57.750
diseases. OK. In a real white blood cell differential

00:11:57.750 --> 00:12:00.210
test, a doctor looks at different components

00:12:00.210 --> 00:12:03.129
of white blood cells. Because these components

00:12:03.129 --> 00:12:05.970
are percentages of a whole, they must perfectly

00:12:05.970 --> 00:12:09.389
add up to exactly 100%. Right. Basic math. A

00:12:09.389 --> 00:12:11.950
human doctor knows this instinctively. But if

00:12:11.950 --> 00:12:14.629
the AI lacks that strict mathematical constraint,

00:12:15.070 --> 00:12:17.230
it might generate a synthetic blood test where

00:12:17.230 --> 00:12:20.370
the white blood cells add up to 110%. It's generating

00:12:20.370 --> 00:12:22.850
an impossibility. It's handing a doctor a patient

00:12:22.850 --> 00:12:25.330
profile that violates the laws of mathematics.
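The constraint being violated here is easy to check in code before a synthetic sample ever reaches the oracle. This is a minimal illustrative sketch; the tolerance value is an assumption.

```python
# Hypothetical reality check for membership query synthesis: reject any
# synthetic white-blood-cell differential whose components fail to sum
# to 100%, since they are percentages of a whole.

def is_physiologically_valid(differential, tolerance=0.5):
    """differential: {cell type: percentage}. Valid only if the parts
    total 100% (within a small illustrative tolerance)."""
    return abs(sum(differential.values()) - 100.0) <= tolerance

valid = {"neutrophils": 60.0, "lymphocytes": 30.0, "monocytes": 6.0,
         "eosinophils": 3.0, "basophils": 1.0}
impossible = {"neutrophils": 70.0, "lymphocytes": 30.0, "monocytes": 10.0}
```

A real system would layer on many more constraints of this kind, such as the enzyme-ratio rules discussed next, so the generator never wastes the oracle's time on nonsense.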

00:12:25.669 --> 00:12:27.889
And it gets even more complicated when variables

00:12:27.889 --> 00:12:30.710
are biologically dependent on one another. The

00:12:30.710 --> 00:12:33.929
source mentions two specific enzymes, ALT and

00:12:33.929 --> 00:12:36.490
AST, which doctors use to measure liver function.

00:12:36.549 --> 00:12:39.429
Okay, I've heard of those. There are strict physiological

00:12:39.429 --> 00:12:41.570
rules about how these enzymes behave together.

00:12:42.129 --> 00:12:44.950
If a patient is chronically ill, those enzymes

00:12:44.950 --> 00:12:49.009
elevate in specific ratios. So if the AI hallucinated

00:12:49.009 --> 00:12:51.809
a chronically ill patient profile where the AST

00:12:51.809 --> 00:12:54.409
enzyme is perfectly normal, sitting at the lower

00:12:54.409 --> 00:12:57.169
limit of 8 units per liter, but the ALT enzyme

00:12:57.169 --> 00:13:00.279
is wildly above normal? That's not just a weird

00:13:00.279 --> 00:13:03.019
patient. No, it's impossible. That is a physiological

00:13:03.019 --> 00:13:05.500
contradiction. It's like generating a synthetic

00:13:05.500 --> 00:13:07.620
house where the second floor is floating 10 feet

00:13:07.620 --> 00:13:10.139
above the first floor without any walls. The

00:13:10.139 --> 00:13:12.759
individual pieces exist, but the physics connecting

00:13:12.759 --> 00:13:16.179
them are fundamentally broken. Exactly. The AI's

00:13:16.179 --> 00:13:18.340
mathematical imagination has to be tightly bound

00:13:18.340 --> 00:13:21.919
by biological and physical reality. If it isn't,

00:13:22.100 --> 00:13:24.340
the query synthesis method just wastes the human

00:13:24.340 --> 00:13:26.580
expert time by asking them to diagnose nonsense.

00:13:26.840 --> 00:13:29.759
Okay, so the AI has its scenario. It's either

00:13:29.759 --> 00:13:32.539
evaluating a massive pool, watching a sequential

00:13:32.539 --> 00:13:35.440
stream, or synthesizing its own data from scratch.

00:13:35.840 --> 00:13:37.759
But that brings us to the core engine of the

00:13:37.759 --> 00:13:40.100
whole operation. The strategy. Right. What is

00:13:40.100 --> 00:13:43.200
the actual internal mathematical logic the AI

00:13:43.200 --> 00:13:45.679
uses to point at a piece of data and say, ah!

00:13:45.950 --> 00:13:48.149
This is the one. This is the data point that

00:13:48.149 --> 00:13:50.409
will teach me the most. You're talking about

00:13:50.409 --> 00:13:52.529
query strategies. Yeah. This is the internal

00:13:52.529 --> 00:13:55.129
brain of active learning. The literature organizes

00:13:55.129 --> 00:13:57.929
these into a few fascinating mathematical categories.

00:13:58.129 --> 00:14:00.710
And looking at the sources, some of these strategies

00:14:00.710 --> 00:14:03.289
sound incredibly intuitive, like expected error

00:14:03.289 --> 00:14:06.830
reduction. But how does an AI know a piece of

00:14:06.830 --> 00:14:09.590
data will reduce its errors if it doesn't even

00:14:09.590 --> 00:14:12.250
know what the data is yet? How does that simulation

00:14:12.250 --> 00:14:14.870
actually run? Well, it's a game of probabilities.

00:14:15.159 --> 00:14:18.320
The AI looks at an unlabeled data point and calculates

00:14:18.320 --> 00:14:21.139
the probability of all possible answers. It basically

00:14:21.139 --> 00:14:23.679
says to itself, OK, there is a 70% chance the

00:14:23.679 --> 00:14:26.120
human will label this a cat and a 30% chance

00:14:26.120 --> 00:14:28.799
they will label it a dog. OK. If they say cat,

00:14:29.059 --> 00:14:31.259
I calculate that my overall model accuracy will

00:14:31.259 --> 00:14:34.159
improve by 5%. If they say dog, my accuracy improves

00:14:34.159 --> 00:14:37.600
by 2%. It mathematically multiplies those probabilities

00:14:37.600 --> 00:14:40.220
to get an expected average drop in its error

00:14:40.220 --> 00:14:42.720
rate. Oh. Yeah, it runs that complex probability

00:14:42.720 --> 00:14:46.919
simulation for every single unlabeled point and

00:14:46.919 --> 00:14:49.019
asks for the one with the highest expected payoff.
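The probability-weighted payoff just described works out neatly with the hosts' own toy numbers: a 70% chance of "cat" worth a 5-point accuracy gain and a 30% chance of "dog" worth a 2-point gain. The candidate names here are illustrative.

```python
# Expected error reduction in miniature: weight each possible oracle
# answer by its probability and by how much accuracy would improve,
# then query the point with the highest expected payoff.

def expected_gain(outcomes):
    """outcomes: list of (probability, accuracy_gain) pairs for one point."""
    return sum(p * gain for p, gain in outcomes)

candidates = {
    "image_1": [(0.7, 5.0), (0.3, 2.0)],   # 0.7*5 + 0.3*2 = 4.1
    "image_2": [(0.5, 3.0), (0.5, 3.0)],   # 0.5*3 + 0.5*3 = 3.0
}
best = max(candidates, key=lambda pid: expected_gain(candidates[pid]))
```

Running this simulation over every unlabeled point is what makes the strategy powerful, and also what makes it computationally expensive.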

00:14:49.159 --> 00:14:52.039
That is mind blowing. It's mapping out all the

00:14:52.039 --> 00:14:54.159
alternate futures of what the human might say

00:14:54.159 --> 00:14:57.379
and choosing the future with the best educational

00:14:57.379 --> 00:15:00.289
return on investment. And the sources also mentioned

00:15:00.289 --> 00:15:02.009
variance reduction, which is similar, right?

00:15:02.210 --> 00:15:05.330
Yes. But instead of focusing on error, variance

00:15:05.330 --> 00:15:07.330
reduction seeks out the data point that will

00:15:07.330 --> 00:15:09.809
minimize the mathematical spread or variance

00:15:09.809 --> 00:15:13.149
of the AI's own uncertainty. It wants to tighten

00:15:13.149 --> 00:15:15.250
its confidence intervals across the board. OK.

00:15:15.549 --> 00:15:17.929
But my absolute favorite strategy mentioned in

00:15:17.929 --> 00:15:20.009
the deep dive sources is something called query

00:15:20.009 --> 00:15:22.809
by committee. Oh, the committee approach is highly

00:15:22.809 --> 00:15:26.620
effective. Yeah. I picture it like a hung jury

00:15:26.620 --> 00:15:29.080
in a courtroom. Instead of just training one

00:15:29.080 --> 00:15:31.740
single AI model, you train a variety of different

00:15:31.740 --> 00:15:33.759
models on the limited data you already have.

00:15:33.960 --> 00:15:36.139
They become your committee. Then you feed them

00:15:36.139 --> 00:15:38.500
an unlabeled piece of data and ask them all to

00:15:38.500 --> 00:15:40.700
vote on what they think it is. If all the models

00:15:40.700 --> 00:15:42.759
in the committee agree, if they all look at an

00:15:42.759 --> 00:15:45.000
image and say, That's definitely a stop sign.
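The committee vote being described, where unanimous votes are trusted and split votes raise the alarm, can be sketched as follows. The vote data is an illustrative toy, not from the source.

```python
# Hypothetical sketch of query by committee: several models vote on each
# unlabeled point; a unanimous committee is trusted, a split committee
# flags the point for the human oracle.
from collections import Counter

def needs_oracle(votes):
    """votes: labels predicted by the committee for one data point.
    Returns True when the committee disagrees."""
    return len(Counter(votes)) > 1

committee_votes = {
    "sign_1": ["stop", "stop", "stop", "stop"],                 # unanimous
    "sign_2": ["stop", "speed_limit", "stop", "speed_limit"],   # hung jury
}
to_ask = [pid for pid, votes in committee_votes.items() if needs_oracle(votes)]
```

Points that split the committee sit on the boundary between the models' hypotheses, which is exactly where a human label buys the most information.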

00:15:45.379 --> 00:15:47.759
Great. You don't need to ask the human. But if

00:15:47.759 --> 00:15:49.860
the committee is completely split. Right. If

00:15:49.860 --> 00:15:52.320
half the models scream, it's a stop sign, and

00:15:52.320 --> 00:15:54.299
the other half scream, it's a speed limit sign,

00:15:54.960 --> 00:15:57.620
that massive internal disagreement triggers an

00:15:57.620 --> 00:16:00.340
alarm. That specific data point is the one you

00:16:00.340 --> 00:16:02.440
send to the human oracle, because it is clearly

00:16:02.440 --> 00:16:04.740
sitting right on the chaotic boundary of your

00:16:04.740 --> 00:16:07.039
model's understanding. What's fascinating here

00:16:07.039 --> 00:16:09.340
is how that boundary actually looks when we map

00:16:09.340 --> 00:16:11.799
it out mathematically. And that brings us to

00:16:11.799 --> 00:16:14.580
a specific geometrical query strategy called

00:16:14.580 --> 00:16:17.340
the minimum marginal hyperplane. OK, minimum

00:16:17.340 --> 00:16:19.200
marginal hyperplane. That sounds like a warp

00:16:19.200 --> 00:16:21.019
drive from a sci-fi movie. We definitely need

00:16:21.019 --> 00:16:23.000
to break that down. It sounds intimidating, I

00:16:23.000 --> 00:16:26.809
know, but it's quite elegant visually. To understand

00:16:26.809 --> 00:16:29.549
it, we first need to quickly define what a support

00:16:29.549 --> 00:16:33.929
vector machine, or SVM, is. An SVM is a classic

00:16:33.929 --> 00:16:36.809
classification algorithm. Its entire job is to

00:16:36.809 --> 00:16:39.269
look at data and draw a rigid boundary between

00:16:39.269 --> 00:16:42.120
different categories. Okay, so imagine a massive

00:16:42.120 --> 00:16:45.019
multi -dimensional room. On one side of the room,

00:16:45.120 --> 00:16:47.720
you have all your labeled known cats. On the

00:16:47.720 --> 00:16:49.940
other side, you have all your labeled known dogs.

00:16:50.399 --> 00:16:52.940
The SVM's job is to draw a line straight down

00:16:52.940 --> 00:16:55.120
the middle of the room to separate them perfectly.

00:16:55.559 --> 00:16:57.820
Exactly. And in a high dimensional mathematical

00:16:57.820 --> 00:17:00.379
space, that dividing line isn't just a simple

00:17:00.379 --> 00:17:03.240
line drawn on a piece of paper. It's a flat surface

00:17:03.240 --> 00:17:06.380
called a hyperplane. I am picturing a giant sheet

00:17:06.380 --> 00:17:08.839
of glass cutting the room completely in half.

00:17:09.160 --> 00:17:11.319
Cats on the left, dogs on the right. That's a

00:17:11.319 --> 00:17:14.180
perfect visual. Now you introduce your unknown,

00:17:14.400 --> 00:17:16.539
unlabeled data points. They scatter all over

00:17:16.539 --> 00:17:19.180
the room. The algorithm calculates the physical

00:17:19.180 --> 00:17:22.119
distance from every single unlabeled point to

00:17:22.119 --> 00:17:24.440
that glass wall, the separating hyperplane. OK.
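That distance calculation, which the hosts go on to call the margin W, can be sketched without any ML library at all: for a hyperplane w·x + b = 0, a point's distance is |w·x + b| / ‖w‖. The boundary and point names here are illustrative.

```python
# Hypothetical sketch of the minimum marginal hyperplane strategy: compute
# each unlabeled point's distance to the separating hyperplane and query
# the point pressed closest against the "glass".
import math

def margin(point, w, b):
    """Unsigned distance from a point to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    return abs(dot + b) / math.sqrt(sum(wi * wi for wi in w))

w, b = (1.0, 0.0), 0.0           # toy boundary: the vertical line x = 0
unlabeled = {"deep_in_cat_land": (5.0, 2.0),
             "fence_sitter": (0.2, 1.0),
             "deep_in_dog_land": (-4.0, 3.0)}
closest = min(unlabeled, key=lambda pid: margin(unlabeled[pid], w, b))
```

With a trained SVM, the same idea amounts to querying the points with the smallest absolute decision-function value.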

00:17:24.720 --> 00:17:26.700
That distance is mathematically referred to as

00:17:26.700 --> 00:17:29.900
the margin, or W. So if an unlabeled data point

00:17:29.900 --> 00:17:32.599
has a massive W, it means it is sitting far,

00:17:32.700 --> 00:17:35.099
far away in the back corner of the room, deep

00:17:35.099 --> 00:17:38.039
in cat territory. The AI is highly confident

00:17:38.039 --> 00:17:40.349
it's a cat. It doesn't need to bother the human

00:17:40.349 --> 00:17:43.710
oracle to ask. Precisely. But the minimum marginal

00:17:43.710 --> 00:17:46.430
hyperplane method does the exact opposite. It

00:17:46.430 --> 00:17:48.849
hunts specifically for the data points with the

00:17:48.849 --> 00:17:52.250
smallest W. These are the points whose noses

00:17:52.250 --> 00:17:53.910
are practically pressed right up against the

00:17:53.910 --> 00:17:56.470
glass. They are the ultimate fence-sitters. The

00:17:56.470 --> 00:17:59.839
SVM is agonizingly uncertain about them because

00:17:59.839 --> 00:18:02.019
they are resting right on the razor's edge of

00:18:02.019 --> 00:18:04.220
the two categories. So by sending those specific

00:18:04.220 --> 00:18:06.579
points, the ones with the smallest margin to

00:18:06.579 --> 00:18:09.640
the human oracle for labeling, the AI rapidly

00:18:09.640 --> 00:18:12.819
clarifies the exact position and angle of that

00:18:12.819 --> 00:18:15.299
dividing wall. It's just brilliant. It completely

00:18:15.299 --> 00:18:18.220
ignores the easy stuff and zeros in exclusively

00:18:18.220 --> 00:18:20.980
on the most difficult borderline cases. It is
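[Editor's note: the margin-hunting step described above can be sketched in a few lines of Python. This is an illustrative sketch, not from the source: it assumes scikit-learn, invents a toy two-cluster "cats vs. dogs" dataset, and relies on `LinearSVC.decision_function`, which returns each point's signed distance to the separating hyperplane.]

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy labeled pool: two clusters ("cats" on the left, "dogs" on the right).
rng = np.random.default_rng(0)
X_labeled = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y_labeled = np.array([0] * 20 + [1] * 20)

# Unlabeled points scattered all over the room.
X_unlabeled = rng.uniform(-4, 4, (200, 2))

# Fit the separating hyperplane -- the "sheet of glass".
svm = LinearSVC().fit(X_labeled, y_labeled)

# |W| = distance of each unlabeled point from the hyperplane.
margins = np.abs(svm.decision_function(X_unlabeled))

# Query the 5 points pressed closest against the glass (smallest W).
query_idx = np.argsort(margins)[:5]
print(query_idx)           # indices to send to the human oracle
print(margins[query_idx])  # their (smallest) margins
```

[In a real loop, the oracle's labels for those five points would be added to the labeled pool and the SVM refit, sharpening the wall's position each round.]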

00:18:20.980 --> 00:18:23.849
incredibly efficient. But with all these different

00:18:23.849 --> 00:18:26.049
strategies we've discussed, expected error reduction,

00:18:26.410 --> 00:18:28.670
variance reduction, query by committee, minimum

00:18:28.670 --> 00:18:30.750
marginal hyperplanes, it raises an important

00:18:30.750 --> 00:18:33.529
question. How does an engineer even know which

00:18:33.529 --> 00:18:36.109
one to pick? Yeah. If you are building a model

00:18:36.109 --> 00:18:39.509
to read complex satellite imagery, do you use

00:18:39.509 --> 00:18:41.569
a committee of models or do you use a hyperplane?

00:18:41.849 --> 00:18:44.230
I imagine historically that just required a lot

00:18:44.230 --> 00:18:46.150
of trial and error by the human engineer. It

00:18:46.150 --> 00:18:49.390
did. The engineer had to guess which heuristic

00:18:49.390 --> 00:18:52.319
would work best for their specific data. But

00:18:52.319 --> 00:18:54.480
the cutting edge of the field right now is stepping

00:18:54.480 --> 00:18:57.960
beyond manual selection. We're moving into meta-

00:18:57.960 --> 00:18:59.779
learning. Algorithms learning how to learn.

00:19:00.019 --> 00:19:02.619
Yes. We're now using machine learning algorithms

00:19:02.619 --> 00:19:06.240
to evaluate the data environment itself and autonomously

00:19:06.240 --> 00:19:10.039
learn the optimal active learning strategy. We're

00:19:10.039 --> 00:19:12.779
literally training AI to figure out the best

00:19:12.779 --> 00:19:16.160
possible way to ask humans for help. That's incredible.
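[Editor's note: one way to picture this "learning how to learn" idea is to treat each query strategy as an arm of a multi-armed bandit, rewarding whichever strategy yields the biggest accuracy gain per label. Everything below is a hedged toy sketch, not the method from the source: the strategy names, reward numbers, and the epsilon-greedy rule are all illustrative assumptions.]

```python
import random

# Hypothetical query strategies treated as bandit "arms" (names illustrative).
strategies = ["uncertainty", "query_by_committee", "min_margin"]
rewards = {s: 0.0 for s in strategies}  # cumulative reward per strategy
counts = {s: 0 for s in strategies}     # times each strategy was chosen

rng = random.Random(0)

def choose_strategy(epsilon=0.1):
    """Epsilon-greedy: mostly exploit the best average reward so far,
    occasionally explore a random strategy."""
    untried = [s for s in strategies if counts[s] == 0]
    if untried:
        return untried[0]              # try every arm at least once
    if rng.random() < epsilon:
        return rng.choice(strategies)  # explore
    return max(strategies, key=lambda s: rewards[s] / counts[s])  # exploit

# Simulated labeling rounds: reward = accuracy gain from each queried label.
# The per-strategy mean gains below are made up purely to drive the demo.
true_gain = {"uncertainty": 0.02, "query_by_committee": 0.03, "min_margin": 0.05}
for _ in range(300):
    s = choose_strategy()
    rewards[s] += rng.gauss(true_gain[s], 0.01)
    counts[s] += 1

favored = max(strategies, key=lambda s: counts[s])
print(favored, counts)
```

[In a real meta-learning system, the reward would come from measured validation improvement after each queried label rather than these synthetic numbers.]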

00:19:16.359 --> 00:19:18.599
It is a recursive layer of optimization that

00:19:18.599 --> 00:19:21.019
is pushing the efficiency of these systems to

00:19:21.019 --> 00:19:23.220
unprecedented levels. So what does this all mean

00:19:23.220 --> 00:19:25.799
for you listening to this right now? When you

00:19:25.799 --> 00:19:27.900
step back from the hyperplanes and the probability

00:19:27.900 --> 00:19:30.359
formulas... What we've really explored today

00:19:30.359 --> 00:19:32.500
is a profound shift in our relationship with

00:19:32.500 --> 00:19:34.920
technology. Yeah, a huge shift. We started with

00:19:34.920 --> 00:19:38.079
the bottleneck of expensive human labor, staring

00:19:38.079 --> 00:19:42.240
down oceans of raw, unlabeled data. And the solution

00:19:42.240 --> 00:19:44.720
wasn't just building a faster, hungrier machine

00:19:44.720 --> 00:19:47.700
that consumes more power. The solution was transforming

00:19:47.700 --> 00:19:51.039
the AI into an inquisitive student, a student

00:19:51.039 --> 00:19:53.539
that mathematically calculates its own uncertainty,

00:19:54.039 --> 00:19:56.299
that can intelligently manage its own memory,

00:19:56.619 --> 00:19:59.160
that respects the constraints of budgetary reality,

00:19:59.500 --> 00:20:02.380
and that strategically queries human experts

00:20:02.380 --> 00:20:04.920
to build optimal models faster than we ever thought

00:20:04.920 --> 00:20:07.740
possible. Exactly. It means the future of technology

00:20:07.740 --> 00:20:10.460
isn't just about passively feeding data to machines.

00:20:10.559 --> 00:20:13.519
It's a dynamic, interactive dialogue between

00:20:13.519 --> 00:20:16.519
human expertise and machine efficiency. It is

00:20:16.519 --> 00:20:19.420
a powerful dialogue, but if we connect this to

00:20:19.420 --> 00:20:22.079
the bigger picture, it does leave us with a critical,

00:20:22.240 --> 00:20:24.579
lingering question, one that the field is only

00:20:24.579 --> 00:20:27.220
just beginning to grapple with. Ooh, what's that?

00:20:27.519 --> 00:20:30.099
If an active learning AI is explicitly programmed

00:20:30.099 --> 00:20:32.220
to hunt down only the most confusing, boundary

00:20:32.220 --> 00:20:34.720
-pushing, and ambiguous edge cases to learn from,

00:20:35.240 --> 00:20:37.200
what happens to its understanding of the mundane?

00:20:37.380 --> 00:20:40.240
If an AI is only ever trained by looking at the

00:20:40.240 --> 00:20:44.400
most bizarre, overlapping, complex examples at

00:20:44.400 --> 00:20:47.480
the very margins of our world, could it eventually

00:20:47.480 --> 00:20:50.059
develop a fundamentally skewed perspective of

00:20:50.059 --> 00:20:52.640
what normal reality actually looks like? That

00:20:52.640 --> 00:20:55.799
is wild. It's like if a doctor only ever studied

00:20:55.799 --> 00:20:58.259
the rarest, most incredibly complicated diseases

00:20:58.259 --> 00:21:00.900
on Earth, and then suddenly couldn't recognize

00:21:00.900 --> 00:21:03.279
a common cold because they never spent any time

00:21:03.279 --> 00:21:05.559
looking at one. Exactly. The smartest student

00:21:05.559 --> 00:21:08.299
in the room, only asking the hardest questions,

00:21:08.759 --> 00:21:10.779
but maybe losing sight of the baseline entirely.

00:21:11.189 --> 00:21:13.670
That is a fascinating and slightly terrifying

00:21:13.670 --> 00:21:15.609
place to leave it. Thank you for joining us on

00:21:15.609 --> 00:21:17.690
this deep dive into the source material. We will

00:21:17.690 --> 00:21:18.369
catch you next time.
