WEBVTT

00:00:00.000 --> 00:00:04.660
Imagine a self-driving car, right? Just cruising

00:00:04.660 --> 00:00:07.339
down a busy city street at like 40 miles an hour.

00:00:07.480 --> 00:00:09.960
Yeah, processing everything in real time. Exactly.

00:00:10.259 --> 00:00:13.099
Its onboard AI was trained on millions of miles

00:00:13.099 --> 00:00:15.580
of footage. So it instantly recognizes sedans,

00:00:15.919 --> 00:00:18.059
pedestrians, bicyclists, traffic lights. All

00:00:18.059 --> 00:00:20.440
the standard stuff you'd expect. Right. But suddenly,

00:00:20.640 --> 00:00:24.140
a teenager on a motorized single-wheel electric

00:00:24.140 --> 00:00:27.320
hoverboard glides right across the crosswalk.

00:00:27.500 --> 00:00:30.480
Oh. Wow. Yeah. And the AI's training data set

00:00:30.480 --> 00:00:33.039
contains zero examples of a single-wheel electric

00:00:33.039 --> 00:00:35.359
hoverboard. Under traditional machine learning

00:00:35.359 --> 00:00:38.500
paradigms, the system should just fail. It definitely

00:00:38.500 --> 00:00:40.560
should. I mean, it shouldn't know how to classify

00:00:40.560 --> 00:00:43.600
the object or predict its speed or understand

00:00:43.600 --> 00:00:45.579
its trajectory at all. But the craziest part

00:00:45.579 --> 00:00:48.140
is the most advanced systems today don't crash.

00:00:48.350 --> 00:00:50.450
They deduce what they're looking at in a fraction

00:00:50.450 --> 00:00:53.189
of a millisecond. Which is, frankly, a remarkable

00:00:53.189 --> 00:00:56.130
leap in computational logic. We're looking at

00:00:56.130 --> 00:00:59.149
extreme domain adaptation here. Because in standard

00:00:59.149 --> 00:01:02.289
generalization, a classifier correctly identifies

00:01:02.289 --> 00:01:04.890
new samples of classes it explicitly saw during

00:01:04.890 --> 00:01:07.469
training. But in the scenario you just described,

00:01:08.069 --> 00:01:10.010
no samples from the target class were given during

00:01:10.010 --> 00:01:12.579
the training phase. None at all. Right. The machine

00:01:12.579 --> 00:01:14.959
is observing samples at test time from classes

00:01:14.959 --> 00:01:17.560
it has never been trained on, and it still successfully

00:01:17.560 --> 00:01:20.060
predicts what class they belong to. Well, welcome

00:01:20.060 --> 00:01:23.049
to the deep dive. Today, we have a very specific

00:01:23.049 --> 00:01:25.569
mission for you, the listener. We are pulling

00:01:25.569 --> 00:01:28.549
from a highly detailed source a comprehensive

00:01:28.549 --> 00:01:31.829
Wikipedia article on a totally paradigm-shifting

00:01:31.829 --> 00:01:34.609
concept in artificial intelligence called

00:01:34.609 --> 00:01:38.109
zero-shot learning, or ZSL. It's a fascinating topic.

00:01:38.209 --> 00:01:40.989
It really is. Our goal today is to basically

00:01:40.989 --> 00:01:43.349
shortcut your way to understanding how AI is

00:01:43.349 --> 00:01:46.530
mimicking a uniquely human ability, and that's

00:01:46.530 --> 00:01:48.129
recognizing something it has literally never

00:01:48.129 --> 00:01:50.670
laid eyes on. Because we all know standard

00:01:50.670 --> 00:01:53.909
convolutional neural networks require massive, meticulously

00:01:53.909 --> 00:01:57.069
labeled data sets to avoid overfitting. Exactly.

00:01:57.370 --> 00:02:00.530
So bypassing target class data entirely, I mean

00:02:00.530 --> 00:02:02.810
that completely rewrites the rules of deep learning.

00:02:03.609 --> 00:02:06.569
Okay, let's unpack this. How is it mathematically

00:02:06.569 --> 00:02:10.229
or even computationally possible to recognize

00:02:10.229 --> 00:02:12.699
what you haven't actually seen? Well, before

00:02:12.699 --> 00:02:14.599
we get into the complex vector mathematics, we

00:02:14.599 --> 00:02:17.039
should probably anchor this with the primary

00:02:17.039 --> 00:02:19.240
real-world example the source provides. Oh,

00:02:19.340 --> 00:02:21.879
the zebra analogy. Yes, the classic zebra analogy.

00:02:22.139 --> 00:02:25.439
So imagine an AI trained exclusively on images

00:02:25.439 --> 00:02:29.000
of horses. OK. It processes the pixel data, the

00:02:29.000 --> 00:02:31.479
edges, the shapes, and it mathematically understands

00:02:31.479 --> 00:02:33.979
the visual structure of a horse. However, it

00:02:33.979 --> 00:02:36.680
has never processed a single pixel of a zebra.

00:02:36.879 --> 00:02:39.400
So zebra simply does not exist in its visual

00:02:39.400 --> 00:02:41.319
vocabulary. Precisely. But instead of showing

00:02:41.319 --> 00:02:44.280
it a picture of a zebra, you give the AI auxiliary

00:02:44.280 --> 00:02:46.419
textual information. You just feed it a text

00:02:46.419 --> 00:02:49.860
string that says, a zebra is a striped horse.

00:02:49.860 --> 00:02:52.479
Right. And because of that extra context, the

00:02:52.479 --> 00:02:54.740
very first time it processes an image of a zebra,

00:02:55.219 --> 00:02:57.400
it correctly identifies it. Yeah, it connects

00:02:57.400 --> 00:03:00.039
the visual data of the horse it knows with the

00:03:00.039 --> 00:03:01.979
conceptual data of the stripes to just bridge

00:03:01.979 --> 00:03:04.520
the gap to the unknown class. It's like someone

00:03:04.520 --> 00:03:08.000
describing a mythical creature to you. Say, a

00:03:08.000 --> 00:03:10.719
unicorn is a horse with a horn. Oh, that's a

00:03:10.719 --> 00:03:12.479
good way to look at it. Right. You'd know it

00:03:12.479 --> 00:03:14.240
instantly if you saw it trotting down the street,

00:03:14.340 --> 00:03:15.900
even though they don't exist and you've obviously

00:03:15.900 --> 00:03:17.560
never seen one. You just combine the concepts.
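
That combine-the-concepts step can be sketched in a few lines of code. This is a toy illustration, not any particular system: the vectors, the `closest` helper, and the numbers are all invented for the example.

```python
import numpy as np

# Invented toy concept vectors; a real system learns these from data.
horse = np.array([1.0, 0.2, 0.0])
stripes = np.array([0.0, 0.1, 1.0])
horn = np.array([0.3, 1.0, 0.0])

# "A zebra is a striped horse": compose known concepts into a new one.
zebra_guess = horse + stripes

def closest(query, candidates):
    """Return the candidate concept nearest to the query vector."""
    return min(candidates, key=lambda name: np.linalg.norm(candidates[name] - query))

# First-ever "zebra" observation (an invented vector near horse + stripes).
observation = np.array([1.1, 0.25, 0.9])
print(closest(observation, {"horse": horse, "zebra": zebra_guess, "unicorn": horse + horn}))  # → zebra
```

The unseen class wins purely because its composed vector sits closest to the observation.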

00:03:17.639 --> 00:03:19.400
But exactly. But wait, let me push back on this

00:03:19.400 --> 00:03:22.039
for a second. Sure. For a computer system to

00:03:22.039 --> 00:03:25.479
understand striped and horse and actually combine

00:03:25.479 --> 00:03:28.400
them, it must already have a baseline understanding

00:03:28.400 --> 00:03:30.879
of those individual concepts, right? I mean,

00:03:30.879 --> 00:03:33.240
it can't learn zebra from a complete vacuum.

00:03:33.300 --> 00:03:35.479
It needs those foundational building blocks.

00:03:35.740 --> 00:03:38.099
What's fascinating here is that your deduction

00:03:38.099 --> 00:03:43.439
hits on the absolute core mechanism of ZSL. It

00:03:43.439 --> 00:03:45.659
relies entirely on what the literature calls

00:03:45.659 --> 00:03:48.960
auxiliary information. Auxiliary information,

00:03:49.080 --> 00:03:52.360
okay. The AI is taking observed classes, so the

00:03:52.360 --> 00:03:54.159
foundational building blocks it actually does

00:03:54.159 --> 00:03:57.020
know, and associating them with non-observed

00:03:57.020 --> 00:04:01.199
classes through encoded observable distinguishing

00:04:01.199 --> 00:04:03.810
properties. So it's essentially borrowing knowledge

00:04:03.810 --> 00:04:06.430
from one domain and projecting it onto another.

00:04:06.569 --> 00:04:08.550
That's exactly what it's doing. But to really

00:04:08.550 --> 00:04:10.750
appreciate the architecture here, I think we

00:04:10.750 --> 00:04:13.150
should briefly look at how researchers figured

00:04:13.150 --> 00:04:15.949
out how to teach machines this trick in the first

00:04:15.949 --> 00:04:18.149
place. Yeah, the history is really interesting.

00:04:18.310 --> 00:04:20.930
The source traces this back to 2008, which, you

00:04:20.930 --> 00:04:24.310
know, in AI research is basically ancient history.

00:04:24.350 --> 00:04:26.670
Oh, completely. And it highlights how parallel

00:04:26.670 --> 00:04:30.089
scientific discovery can be. At the 2008

00:04:30.089 --> 00:04:33.360
AAAI conference, two separate papers dropped

00:04:33.360 --> 00:04:36.480
tackling this exact concept, but from completely

00:04:36.480 --> 00:04:38.319
different angles. Right, you had one side working

00:04:38.319 --> 00:04:41.240
in natural language processing, or NLP, and the

00:04:41.240 --> 00:04:43.319
other side working in computer vision. Totally

00:04:43.319 --> 00:04:46.120
separate fields at the time. Yeah. The NLP paper,

00:04:46.480 --> 00:04:48.879
which was authored by Chang, Ratinov, Roth, and

00:04:48.879 --> 00:04:51.660
Srikumar, introduced a method to classify text

00:04:51.660 --> 00:04:53.879
without specific training data, and they called

00:04:53.879 --> 00:04:56.699
it dataless classification. Dataless classification.

00:04:57.040 --> 00:04:59.279
OK. Meanwhile, at that exact same conference,

00:04:59.660 --> 00:05:01.959
Larochelle presented a paper for computer vision

00:05:01.959 --> 00:05:04.500
applying a similar bridging concept to images

00:05:04.500 --> 00:05:07.980
and termed it zero data learning. So both of

00:05:07.980 --> 00:05:10.120
those names literally describe the reality of

00:05:10.120 --> 00:05:12.420
the method, right? Classifying without target

00:05:12.420 --> 00:05:15.040
class data. Yeah, they were very literal. But

00:05:15.040 --> 00:05:17.500
neither of those names stuck. It wasn't until

00:05:17.500 --> 00:05:20.660
a year later, in 2009 at the NIPS conference,

00:05:21.079 --> 00:05:23.759
that Palatucci, Pomerleau, Hinton, and Mitchell

00:05:23.759 --> 00:05:26.639
published a paper that actually coined the term

00:05:26.639 --> 00:05:29.319
zero-shot learning. Which is obviously the one

00:05:29.319 --> 00:05:31.399
we use today. Yeah, clearly that's the one that

00:05:31.399 --> 00:05:33.860
took over the scientific literature. But why

00:05:33.860 --> 00:05:37.120
did the term zero-shot win out over dataless?

00:05:37.300 --> 00:05:39.319
Was it just better marketing by the researchers?

00:05:39.579 --> 00:05:42.990
Well, sort of. Grounding it as zero-shot was

00:05:42.990 --> 00:05:45.350
a deliberate signal to the computer vision community.

00:05:45.509 --> 00:05:47.290
Oh, because it immediately connected the method

00:05:47.290 --> 00:05:50.170
to one-shot learning. Precisely. One-shot learning

00:05:50.170 --> 00:05:52.810
was already a well-established taxonomy where

00:05:52.810 --> 00:05:54.769
classification is learned from just one or two

00:05:54.769 --> 00:05:57.209
examples. Makes sense. By calling it zero-shot,

00:05:57.470 --> 00:06:00.470
Palatucci and his team framed it as the logical

00:06:00.470 --> 00:06:03.250
mathematical extreme of the one -shot concept.

00:06:03.470 --> 00:06:06.389
It resonated so well with the existing framework

00:06:06.389 --> 00:06:08.490
of machine learning that it permanently caught

00:06:08.490 --> 00:06:11.509
on. It just fit perfectly into the existing mental

00:06:11.509 --> 00:06:15.060
model. But the terminology is really just the

00:06:15.060 --> 00:06:17.759
surface. The real breakthrough was standardizing

00:06:17.759 --> 00:06:20.540
how that auxiliary information is actually processed.

00:06:20.980 --> 00:06:22.680
Right. Let's move from the history to the actual

00:06:22.680 --> 00:06:25.860
math, the secret sauce. You mentioned the AI

00:06:25.860 --> 00:06:28.360
needs auxiliary information to bridge the gap.

00:06:29.160 --> 00:06:31.360
According to the source, this prerequisite information

00:06:31.360 --> 00:06:34.459
is basically broken down into three primary mechanisms.

00:06:35.079 --> 00:06:37.120
Let's start with the first one, which is learning

00:06:37.120 --> 00:06:39.600
with attributes. So learning with attributes

00:06:39.600 --> 00:06:43.240
is where classes are accompanied by predefined

00:06:43.240 --> 00:06:46.199
structured descriptions. This is highly compositional.

00:06:46.439 --> 00:06:49.279
OK, so like building blocks. Exactly. For example,

00:06:49.319 --> 00:06:51.819
if you are working in computer vision and want

00:06:51.819 --> 00:06:54.759
the AI to recognize different types of unseen

00:06:54.759 --> 00:06:57.720
birds, you don't just give it raw text. You give

00:06:57.720 --> 00:06:59.879
it a structured array of attributes. So things

00:06:59.879 --> 00:07:03.379
like redhead, long beak, or yellow underbelly.
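
As a rough sketch of that checklist idea, assuming some upstream detector already scores the attributes in an image. Every class name, attribute, and number below is invented for illustration.

```python
import numpy as np

# Hypothetical attribute vocabulary: [red_head, long_beak, yellow_underbelly, striped]
# Unseen classes are described only by their predefined attribute checklists.
class_signatures = {
    "woodpecker":  np.array([1, 1, 0, 0]),
    "goldfinch":   np.array([0, 0, 1, 0]),
    "zebra_finch": np.array([0, 0, 0, 1]),
}

def predict_unseen(attribute_scores):
    """Pick the unseen class whose checklist best matches the detected attributes."""
    return max(class_signatures, key=lambda c: attribute_scores @ class_signatures[c])

# Simulated detector output for a new image: strong red-head and long-beak signals.
detected = np.array([0.9, 0.8, 0.1, 0.05])
print(predict_unseen(detected))  # → woodpecker
```

No woodpecker image was needed; the checklist alone carries the class definition.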

00:07:03.540 --> 00:07:05.300
Here's where it gets really interesting. Because

00:07:05.300 --> 00:07:07.620
the way you're describing this structured array,

00:07:07.759 --> 00:07:09.660
I mean, attributes are basically like giving

00:07:09.660 --> 00:07:13.139
the AI a police sketch checklist. Oh, I like

00:07:13.139 --> 00:07:16.139
that comparison. Great. You tell the sketch artist,

00:07:16.420 --> 00:07:20.300
OK, male, glasses, scar on the left cheek. The

00:07:20.300 --> 00:07:22.779
artist hasn't actually seen the suspect. But

00:07:22.779 --> 00:07:25.000
they can generate an accurate representation

00:07:25.000 --> 00:07:28.060
based on a rigidly structured list of known features.

00:07:28.399 --> 00:07:30.740
That captures the rigid nature of attributes

00:07:30.740 --> 00:07:33.560
perfectly. Thanks. But researchers quickly realized

00:07:33.560 --> 00:07:35.939
that manually creating structured checklists

00:07:35.939 --> 00:07:39.139
for every possible object in the universe is

00:07:39.139 --> 00:07:41.420
totally unscalable. Yeah, you'd be writing checklists

00:07:41.420 --> 00:07:44.699
forever. Exactly. which leads to the second type

00:07:44.699 --> 00:07:47.420
of auxiliary information, and that's learning

00:07:47.420 --> 00:07:49.899
from textual description. Okay. This was the

00:07:49.899 --> 00:07:52.279
key direction pursued in natural language processing.

00:07:52.490 --> 00:07:55.629
Instead of rigid attribute arrays, the class

00:07:55.629 --> 00:07:57.850
labels are treated as having inherent semantic

00:07:57.850 --> 00:08:00.310
meaning, and they're augmented with free text

00:08:00.310 --> 00:08:02.569
natural language descriptions. So instead of

00:08:02.569 --> 00:08:05.009
a rigid checklist, you are literally just giving

00:08:05.009 --> 00:08:08.449
the AI a Wikipedia page to read about the unseen

00:08:08.449 --> 00:08:10.769
class. Pretty much, yeah. But if it's borrowing

00:08:10.769 --> 00:08:12.709
knowledge, it must need a translator, right?

00:08:12.810 --> 00:08:14.750
It can't just read an English Wikipedia page.

00:08:15.170 --> 00:08:17.970
How is the free text actually converted into

00:08:17.970 --> 00:08:20.290
something a computational system can calculate?

00:08:20.509 --> 00:08:23.879
That is the crucial step. The original 2008 paper

00:08:23.879 --> 00:08:26.540
used what's called explicit semantic analysis,

00:08:27.339 --> 00:08:31.019
but the field rapidly evolved toward dense representations.

00:08:31.420 --> 00:08:34.139
Dense representation. Yeah. In a dense representation,

00:08:34.639 --> 00:08:37.320
a neural network processes the free text, say

00:08:37.320 --> 00:08:40.019
a paragraph describing a hoverboard, and encodes

00:08:40.019 --> 00:08:42.539
those words into a dense vector. It basically

00:08:42.539 --> 00:08:45.200
turns a concept into a numerical coordinate in

00:08:45.200 --> 00:08:47.220
a high-dimensional mathematical space. So it

00:08:47.220 --> 00:08:50.059
turns language into geography. Exactly. That

00:08:50.059 --> 00:08:53.080
is exactly it. The goal is to represent the textual

00:08:53.080 --> 00:08:55.700
descriptions in the exact same mathematical space

00:08:55.700 --> 00:08:58.299
as the visual features of the objects being classified.
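
A minimal sketch of that shared-space matching, with made-up 3-D vectors standing in for learned embeddings (real systems use hundreds of dimensions trained jointly over images and text):

```python
import numpy as np

# Toy shared embedding space; all vectors are invented for illustration.
text_embeddings = {
    "horse":   np.array([1.0, 0.0, 0.0]),
    "stripes": np.array([0.0, 1.0, 0.0]),
    "zebra":   np.array([0.7, 0.7, 0.0]),  # "a striped horse": between horse and stripes
}

def cosine(u, v):
    """Cosine similarity between two vectors in the shared space."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A zebra image embeds near horse + stripes even though no zebra image was trained on.
image_vector = np.array([0.68, 0.72, 0.05])
best = max(text_embeddings, key=lambda c: cosine(image_vector, text_embeddings[c]))
print(best)  # → zebra
```

The image is classified by comparing it against text-derived class vectors, never against zebra pixels.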

00:08:58.419 --> 00:09:01.299
Oh, wow. Yeah, so if the AI can mathematically

00:09:01.299 --> 00:09:04.039
align the vector for the word stripes with the

00:09:04.039 --> 00:09:06.580
visual edge detection matrix of a striped pattern,

00:09:06.779 --> 00:09:09.320
it can classify an image based purely on a dictionary

00:09:09.320 --> 00:09:12.730
definition. That is wild. And it is extended

00:09:12.730 --> 00:09:15.750
into cross-lingual tasks, too, where the AI

00:09:15.750 --> 00:09:18.429
categorizes documents across different languages

00:09:18.429 --> 00:09:21.289
without specific language class training data.

00:09:21.830 --> 00:09:24.169
It draws on transfer learning from completely

00:09:24.169 --> 00:09:26.690
different tasks, like textual entailment and

00:09:26.690 --> 00:09:29.429
question answering. In fact, the purest form

00:09:29.429 --> 00:09:32.409
of zero-shot classification is the ability to

00:09:32.409 --> 00:09:35.350
classify a single example without observing any

00:09:35.350 --> 00:09:38.539
annotated data at all. OK, so... Reading a dictionary

00:09:38.539 --> 00:09:40.860
definition works brilliantly if the thing is

00:09:40.860 --> 00:09:43.460
easily described in text, but what if the AI

00:09:43.460 --> 00:09:45.720
encounters something highly abstract that, you

00:09:45.720 --> 00:09:48.399
know, defies simple linguistic description? That's

00:09:48.399 --> 00:09:50.399
a great point. A text vector isn't going to help

00:09:50.399 --> 00:09:52.639
much if the relationship between objects is purely

00:09:52.639 --> 00:09:55.759
structural. It needs a spatial map, which brings

00:09:55.759 --> 00:09:57.940
us to the third mechanism mentioned in the source,

00:09:58.200 --> 00:10:02.250
which is class-class similarity. Right. Class

00:10:02.250 --> 00:10:04.889
similarity relies entirely on that high-dimensional

00:10:04.889 --> 00:10:07.190
mathematical space we just talked about. Imagine

00:10:07.190 --> 00:10:09.769
an embedding space where every possible concept

00:10:09.769 --> 00:10:12.250
is plotted as a point. OK, I'm picturing it.

00:10:12.750 --> 00:10:15.149
In this space, things that are conceptually or

00:10:15.149 --> 00:10:17.470
structurally similar are physically clustered

00:10:17.470 --> 00:10:20.470
closer together. OK, my analogy for this. It's

00:10:20.470 --> 00:10:23.419
like dropping a pin on a digital map. Even if

00:10:23.419 --> 00:10:25.340
there's no labeled town exactly where you drop

00:10:25.340 --> 00:10:28.039
the pin, you know its exact latitude and longitude.

00:10:28.419 --> 00:10:29.899
And because of its mathematical coordinates,

00:10:30.059 --> 00:10:32.379
you know for a fact it is closer to Chicago than

00:10:32.379 --> 00:10:35.200
it is to New York. Yes. So you can infer the

00:10:35.200 --> 00:10:37.919
climate and the time zone purely based on its

00:10:37.919 --> 00:10:40.299
nearest neighbors. That is an excellent way to

00:10:40.299 --> 00:10:43.850
visualize representational similarity. A

00:10:43.850 --> 00:10:47.570
zero-shot classifier takes a new unseen sample and

00:10:47.570 --> 00:10:50.669
extracts its features to plot it as a new data

00:10:50.669 --> 00:10:52.990
point in that continuous space. Just drops the

00:10:52.990 --> 00:10:56.350
pin. It then calculates the mathematical distance

00:10:56.350 --> 00:10:59.850
to all the known embedded classes. The nearest

00:10:59.850 --> 00:11:01.750
embedded class is used as the predicted class.
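
That pin-drop prediction rule is just a nearest-neighbor search. A sketch with invented 2-D coordinates standing in for a learned embedding space:

```python
import numpy as np

# Toy embedding space: coordinates are invented for illustration only.
known_classes = {
    "horse":      np.array([2.0, 1.0]),
    "car":        np.array([9.0, 8.0]),
    "bicycle":    np.array([8.0, 3.0]),
    "pedestrian": np.array([1.0, 7.0]),
}

def nearest_class(sample_point):
    """Drop the pin, then return the closest known class by Euclidean distance."""
    return min(known_classes, key=lambda c: np.linalg.norm(sample_point - known_classes[c]))

# Features from an unseen "hoverboard" image land near other wheeled objects.
pin = np.array([7.5, 3.5])
print(nearest_class(pin))  # → bicycle
```

The identity of the unknown is inferred entirely from its geometric neighbors.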

00:11:02.409 --> 00:11:05.789
The AI relies entirely on the geometric relationships

00:11:05.789 --> 00:11:08.970
between known concepts to infer the identity

00:11:08.970 --> 00:11:11.200
of the unknown. So the math is really elegant.

00:11:11.360 --> 00:11:14.019
You map attributes, you embed text vectors, or

00:11:14.019 --> 00:11:16.399
you plot coordinates in a continuous space, and

00:11:16.399 --> 00:11:19.279
the AI successfully identifies the zebra, or

00:11:19.279 --> 00:11:21.360
the hoverboard. But... You know, the controlled

00:11:21.360 --> 00:11:23.840
environment of a laboratory data set is completely

00:11:23.840 --> 00:11:26.000
different from the chaotic reality of the actual

00:11:26.000 --> 00:11:28.320
world. Oh, entirely. Yeah. And this brings us

00:11:28.320 --> 00:11:31.340
to a massive evolutionary hurdle for the field,

00:11:31.840 --> 00:11:34.279
which is known as generalized zero-shot learning,

00:11:34.580 --> 00:11:37.340
or GZSL. Generalized, OK. Because in the original

00:11:37.340 --> 00:11:40.100
ZSL laboratory setup, there was a massive crutch.

00:11:40.879 --> 00:11:44.179
At test time, the AI is explicitly told it is

00:11:44.179 --> 00:11:47.500
only looking at new, unseen classes. Oh, so it

00:11:47.500 --> 00:11:49.059
doesn't even have to look for the old stuff.

00:11:49.179 --> 00:11:50.980
Right. The system doesn't have to worry about

00:11:50.980 --> 00:11:53.559
known objects. It only has to search its unseen

00:11:53.559 --> 00:11:56.360
database. Right. If I know a test is only going

00:11:56.360 --> 00:11:59.299
to feature trick questions, my brain is already

00:11:59.299 --> 00:12:02.159
primed to look for the trick. Exactly. But out

00:12:02.159 --> 00:12:04.840
in the wild, when an AI is deployed in a

00:12:04.840 --> 00:12:07.279
self-driving car or an automated medical scanner,

00:12:07.899 --> 00:12:10.220
known and unknown things exist simultaneously.

00:12:10.960 --> 00:12:13.340
The hoverboard is right next to a completely

00:12:13.340 --> 00:12:16.440
standard Toyota sedan. Which introduces a severely

00:12:16.440 --> 00:12:19.399
complex problem. At test time, samples from both

00:12:19.399 --> 00:12:21.340
new classes and known classes appear together.

00:12:21.480 --> 00:12:24.980
That sounds incredibly messy. It is. It is incredibly

00:12:24.980 --> 00:12:27.220
challenging for a classifier to mathematically

00:12:27.220 --> 00:12:30.500
estimate if a given sample is a novel concept

00:12:30.500 --> 00:12:33.519
it needs to deduce, or something it was explicitly

00:12:33.519 --> 00:12:35.480
trained on and should just recognize normally.

00:12:36.360 --> 00:12:40.059
But the source outlines two primary computational

00:12:40.059 --> 00:12:43.080
approaches researchers have developed to handle

00:12:43.080 --> 00:12:46.019
this real -world mix. OK, the first approach

00:12:46.019 --> 00:12:49.120
is using a gating module. Yes, a gating module.

00:12:49.480 --> 00:12:51.860
It's basically a separate mechanism placed at

00:12:51.860 --> 00:12:54.100
the very front of the architecture. It acts as

00:12:54.100 --> 00:12:56.840
an anomaly detector. OK. Its sole function is

00:12:56.840 --> 00:12:59.460
to evaluate the incoming data and make a routing

00:12:59.460 --> 00:13:02.919
decision. Does this sample belong to a seen class

00:13:02.919 --> 00:13:05.500
or an unseen class? So it's like a bouncer at

00:13:05.500 --> 00:13:08.200
a club. Exactly. And it can output a hard decision

00:13:08.200 --> 00:13:11.200
like a strict binary yes or no or a soft probabilistic

00:13:11.200 --> 00:13:14.549
decision, essentially saying, I am 82% confident

00:13:14.549 --> 00:13:17.029
this belongs to an unseen class. But let me push

00:13:17.029 --> 00:13:19.009
back on this gating architecture for a second.

00:13:19.009 --> 00:13:21.429
Sure. If the primary neural network is already

00:13:21.429 --> 00:13:24.029
struggling to cleanly separate knowns from unknowns

00:13:24.029 --> 00:13:26.750
in the embedding space, isn't adding a gating

00:13:26.750 --> 00:13:29.190
module just passing the buck to a different algorithm?

00:13:29.470 --> 00:13:32.110
I mean, how does the gatekeeper accurately flag

00:13:32.110 --> 00:13:35.690
a new class if it has literally never been trained

00:13:35.690 --> 00:13:38.029
on what that new class looks like? This raises

00:13:38.029 --> 00:13:40.090
an important question, and it really exposes

00:13:40.090 --> 00:13:42.230
the fundamental limits of anomaly detection.
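
One common heuristic for a gate like this thresholds the main classifier's confidence over seen classes: flat, uncertain scores suggest an absence of familiar patterns. The threshold and logits below are invented for illustration.

```python
import numpy as np

def softmax(z):
    """Convert raw logits into a probability distribution."""
    e = np.exp(z - z.max())
    return e / e.sum()

def gate(seen_class_logits, threshold=0.6):
    """Toy gating module: route to 'seen' if the seen-class classifier is
    confident, otherwise to the zero-shot branch. Also returns the soft score."""
    probs = softmax(np.asarray(seen_class_logits, dtype=float))
    p_seen = float(probs.max())
    route = "seen" if p_seen >= threshold else "unseen"
    return route, p_seen

# Confident logits for a known object vs. flat logits for a never-seen one.
print(gate([8.0, 0.5, 0.2]))   # routes to the seen-class classifier
print(gate([1.1, 1.0, 0.9]))   # routes to the zero-shot branch
```

Note the bottleneck the discussion describes: the gate only ever sees evidence of familiarity, never evidence of the specific new class.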

00:13:42.649 --> 00:13:45.009
You are entirely correct. The gating module is

00:13:45.009 --> 00:13:47.230
inherently a bottleneck. Yeah, it seems flawed.

00:13:47.710 --> 00:13:50.129
It is forced to look for the absence of familiar

00:13:50.129 --> 00:13:52.710
patterns rather than the presence of specific

00:13:52.710 --> 00:13:55.990
new ones, which leads to high error rates. Because

00:13:55.990 --> 00:13:58.909
if the gatekeeper messes up... If it misclassifies

00:13:58.909 --> 00:14:02.570
a novel object as a known object, the main classifier

00:14:02.570 --> 00:14:04.990
will just force it into the wrong category and

00:14:04.990 --> 00:14:07.909
the whole system fails. Wow, okay. Because of

00:14:07.909 --> 00:14:10.750
this inherent flaw, the second approach has become

00:14:10.750 --> 00:14:13.970
a vastly more powerful alternative. The generative

00:14:13.970 --> 00:14:16.450
module. Yes. Instead of trying to act as a bouncer,

00:14:16.950 --> 00:14:19.409
a generative module takes a much more aggressive

00:14:19.409 --> 00:14:22.539
approach. It is an auxiliary network, often

00:14:22.539 --> 00:14:25.500
a generative adversarial network or a variational

00:14:25.500 --> 00:14:28.220
autoencoder, that is trained to actively generate

00:14:28.220 --> 00:14:30.440
feature representations of the unseen classes

00:14:30.440 --> 00:14:33.340
before the testing phase even begins. Wait, it

00:14:33.340 --> 00:14:35.340
hallucinates the mathematical features of things

00:14:35.340 --> 00:14:38.639
it hasn't seen? Precisely. It is insane. It takes

00:14:38.639 --> 00:14:40.679
the auxiliary information we discussed earlier.

00:14:40.759 --> 00:14:43.620
So the semantic text embeddings or the structured

00:14:43.620 --> 00:14:46.519
attributes of the unseen classes combines them

00:14:46.519 --> 00:14:49.919
with random noise. and synthesizes artificial

00:14:49.919 --> 00:14:53.250
feature vectors. It creates synthetic training

00:14:53.250 --> 00:14:56.009
data for the unknown classes. So we basically

00:14:56.009 --> 00:14:58.870
turn a zero-shot problem into a traditional

00:14:58.870 --> 00:15:01.809
supervised learning problem by forcing the AI

00:15:01.809 --> 00:15:04.509
to generate its own training data. Exactly. The

00:15:04.509 --> 00:15:07.370
AI uses the dictionary definition of zebra to

00:15:07.370 --> 00:15:09.750
hallucinate thousands of mathematical variations

00:15:09.750 --> 00:15:12.250
of what a zebra's pixel data should look like.

00:15:12.250 --> 00:15:14.309
Yep. And once it generates all these synthetic

00:15:14.309 --> 00:15:16.490
features, you just train a standard classifier

00:15:16.490 --> 00:15:19.210
on the entire data set. The real data from the

00:15:19.210 --> 00:15:21.210
seen classes and the hallucinated data from

00:15:21.129 --> 00:15:23.809
the unseen classes. It is a brilliant workaround.
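
A sketch of that pipeline, with a noisy attribute generator standing in for a trained GAN or VAE. The attributes, shapes, and nearest-centroid classifier are all invented stand-ins for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Attribute vectors per class (invented). "zebra" is unseen: it has no real features.
attrs = {"horse": np.array([1.0, 0.0]), "zebra": np.array([1.0, 1.0])}

def synthesize(attr, n=200, noise=0.1):
    """Stand-in for a trained generator: attributes + random noise -> features."""
    return attr + noise * rng.standard_normal((n, attr.size))

# Combine real seen-class features (simulated here) with synthetic unseen-class ones.
X = np.vstack([synthesize(attrs["horse"]), synthesize(attrs["zebra"])])
y = np.array([0] * 200 + [1] * 200)

# Train a standard supervised classifier (nearest centroid) on the combined set.
centroids = np.array([X[y == k].mean(axis=0) for k in (0, 1)])

def classify(x):
    """Predict 0 (horse) or 1 (zebra) for a feature vector."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

print(classify(np.array([0.95, 1.05])))  # → 1, classified as the unseen zebra
```

Once the features are hallucinated, the zero-shot problem really has become ordinary supervised learning.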

00:15:24.049 --> 00:15:26.190
By synthetically turning the unseen into the

00:15:26.190 --> 00:15:28.549
seen, it dramatically improves the accuracy

00:15:28.549 --> 00:15:30.690
of generalized zero-shot learning and allows

00:15:30.690 --> 00:15:33.690
the system to smoothly handle mixed real-world

00:15:33.690 --> 00:15:36.690
data streams. So having engineered robust workarounds

00:15:36.690 --> 00:15:40.190
for that messy mix of knowns and unknowns, ZSL

00:15:40.190 --> 00:15:42.710
is breaking out of basic image classification.

00:15:43.470 --> 00:15:45.389
We aren't just identifying single objects in

00:15:45.389 --> 00:15:47.070
the center of a frame anymore, right? Not at

00:15:47.070 --> 00:15:49.320
all. The source highlights that it's heavily

00:15:49.320 --> 00:15:52.799
used in semantic segmentation. And semantic segmentation

00:15:52.799 --> 00:15:55.460
isn't just labeling an image. It's classifying

00:15:55.460 --> 00:15:58.399
every single pixel in an image to draw precise

00:15:58.399 --> 00:16:00.580
boundaries around objects. Which is incredibly

00:16:00.580 --> 00:16:03.419
complex. Doing that for unseen concepts requires

00:16:03.419 --> 00:16:06.580
an unbelievable level of spatial mapping. Right,

00:16:06.700 --> 00:16:08.940
because semantic segmentation requires the AI

00:16:08.940 --> 00:16:11.320
to understand not just what an unseen object

00:16:11.320 --> 00:16:14.019
is, but its geometry, its borders, and how it

00:16:14.019 --> 00:16:17.149
occludes other objects. But even beyond complex

00:16:17.149 --> 00:16:19.590
computer vision, the two applications from the

00:16:19.590 --> 00:16:21.769
references that truly demonstrate the power of

00:16:21.769 --> 00:16:24.950
ZSL are computational biology and abstract reasoning.

00:16:25.149 --> 00:16:28.830
Yes, the 2021 paper from Cell Systems by Wittmann,

00:16:28.950 --> 00:16:32.370
Yue, and Arnold. It's about using ZSL for machine

00:16:32.370 --> 00:16:34.529
learning assisted directed protein evolution.

00:16:34.710 --> 00:16:37.690
Moving from drawing bounding boxes around cars

00:16:37.690 --> 00:16:40.649
to mutating biological proteins sounds like an

00:16:40.649 --> 00:16:43.529
unbelievable leap. I mean, how does zero-shot

00:16:43.529 --> 00:16:46.610
learning even apply to molecular biology? If

00:16:46.610 --> 00:16:49.000
we connect this to the bigger picture. The core

00:16:49.000 --> 00:16:52.279
challenge in directed protein evolution, which

00:16:52.279 --> 00:16:54.539
is how we develop new therapeutics and enzymes,

00:16:55.000 --> 00:16:58.299
is the sheer scale of the sequence space. The

00:16:58.299 --> 00:17:01.100
number of possible amino acid combinations for

00:17:01.100 --> 00:17:04.519
a single protein is astronomically large. It

00:17:04.519 --> 00:17:06.500
represents a fitness landscape with billions

00:17:06.500 --> 00:17:08.759
of peaks and valleys. And it is physically impossible

00:17:08.759 --> 00:17:11.880
to test every one of those billions of mutations

00:17:11.880 --> 00:17:15.180
manually in a lab to gather training data. Exactly.

00:17:15.400 --> 00:17:18.130
Gathering experimental assay data for every variant would

00:17:18.130 --> 00:17:21.349
take millennia. But if you treat amino acid sequences

00:17:21.349 --> 00:17:23.710
like text, you can use zero -shot learning. You

00:17:23.710 --> 00:17:26.089
embed the known biological sequences into a high

00:17:26.089 --> 00:17:28.490
-dimensional mathematical space, mapping out

00:17:28.490 --> 00:17:30.769
how certain structures relate to certain biological

00:17:30.769 --> 00:17:32.710
functions. And then the zero-shot model can

00:17:32.710 --> 00:17:35.650
evaluate entirely novel unobserved protein variants.

00:17:36.069 --> 00:17:38.470
It just plots the unobserved mutations on the

00:17:38.470 --> 00:17:40.970
map. We know where the useful proteins live in

00:17:40.970 --> 00:17:43.410
the mathematical space, so the AI just calculates

00:17:43.410 --> 00:17:46.089
the coordinates of new sequences and predicts

00:17:46.089 --> 00:17:49.640
their success without a single real-world experiment.

00:17:49.839 --> 00:17:53.019
Yes. It finds the peaks in the fitness landscape

00:17:53.019 --> 00:17:56.619
mathematically. It bypasses the need for laboratory

00:17:56.619 --> 00:17:59.500
training data entirely. That is just incredible.
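
As a loose sketch of that idea (not the paper's actual method): embed sequences, then score new variants by their distance to known high-fitness ones. The composition embedding and the sequences below are invented stand-ins for learned protein language-model features.

```python
import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AMINO)}

def embed(seq):
    """Toy embedding: amino-acid frequency vector. Real pipelines would use
    learned protein language-model embeddings; this is an invented stand-in."""
    v = np.zeros(len(AMINO))
    for a in seq:
        v[IDX[a]] += 1
    return v / len(seq)

# Variants with measured high fitness (sequences invented for illustration).
high_fitness = ["MKVLAA", "MKVLGA"]
anchor = np.mean([embed(s) for s in high_fitness], axis=0)

def score(variant):
    """Zero-shot score: distance to the high-fitness region of the map (lower is better)."""
    return float(np.linalg.norm(embed(variant) - anchor))

# A conservative mutation lands closer to the good region than a radical one.
print(score("MKVLSA") < score("WWWWWW"))  # → True
```

The never-tested variant is ranked purely by where it lands on the map of known sequences.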

00:17:59.660 --> 00:18:01.579
And the boundaries are being pushed even further

00:18:01.579 --> 00:18:05.599
into pure logic. The recent 2024 paper from Scientific

00:18:05.599 --> 00:18:08.559
Reports demonstrates that untrained neural networks

00:18:08.559 --> 00:18:11.480
can perform memorization-independent abstract

00:18:11.480 --> 00:18:14.319
reasoning using zero-shot principles. Meaning

00:18:14.319 --> 00:18:16.619
the networks aren't just regurgitating patterns

00:18:16.619 --> 00:18:18.779
they memorize during their initial training phase?

00:18:18.960 --> 00:18:21.509
Exactly. They are synthesizing logic to solve

00:18:21.509 --> 00:18:24.130
entirely unseen abstract scenarios. They are

00:18:24.130 --> 00:18:26.529
mapping the underlying rules of a problem rather

00:18:26.529 --> 00:18:28.809
than just matching surface level features. Whether

00:18:28.809 --> 00:18:32.049
you are using NLP to automatically categorize

00:18:32.049 --> 00:18:34.970
documents into a complex taxonomy you never explicitly

00:18:34.970 --> 00:18:38.230
trained it on or predicting synthetic protein

00:18:38.230 --> 00:18:41.910
mutations to cure a disease, the core value proposition

00:18:41.910 --> 00:18:44.970
of zero -shot learning is supreme efficiency.

00:18:45.470 --> 00:18:48.930
It saves us from the impossible, infinite task

00:18:48.930 --> 00:18:52.450
of manually annotating and labeling the entire

00:18:52.450 --> 00:18:55.029
universe of data. So, what does this all mean?

00:18:55.430 --> 00:18:58.109
For you, the listener, navigating a technological

00:18:58.109 --> 00:19:00.390
landscape increasingly driven by deep learning,

00:19:00.849 --> 00:19:02.670
understanding zero-shot learning is understanding

00:19:02.670 --> 00:19:05.660
the ultimate computational shortcut. We used

00:19:05.660 --> 00:19:08.500
to think AI was strictly bound by its past experiences.

00:19:09.059 --> 00:19:11.299
That a model could only ever be a mirror reflecting

00:19:11.299 --> 00:19:14.119
the specific labeled data we explicitly fed into

00:19:14.119 --> 00:19:16.819
it. But ZSL proves that with the right mathematical

00:19:16.819 --> 00:19:19.359
architecture, semantic vector spaces, and generative

00:19:19.359 --> 00:19:22.359
modules, AI can make massive deductive leaps

00:19:22.359 --> 00:19:25.019
into the unknown. It's about leveraging the structure

00:19:25.019 --> 00:19:26.859
of what you do know to conquer what you don't

00:19:26.859 --> 00:19:28.559
know. And that leaves us with a fascinating,

00:19:28.640 --> 00:19:31.039
almost philosophical paradox to consider. The

00:19:31.039 --> 00:19:33.660
entire architecture of zero-shot learning, as

00:19:33.660 --> 00:19:36.440
we've discussed it today, relies heavily on

00:19:36.440 --> 00:19:39.720
human-provided auxiliary information. We have to supply

00:19:39.720 --> 00:19:41.940
the structured attributes. We have to define

00:19:41.940 --> 00:19:44.220
the semantic space with our language and our

00:19:44.220 --> 00:19:46.799
encyclopedias. Right, the AI is still leaning

00:19:46.799 --> 00:19:49.839
on us for the definitions. Exactly. But as these

00:19:49.839 --> 00:19:51.920
generative modules and abstract reasoning engines

00:19:51.920 --> 00:19:54.779
become more autonomous, what happens to our understanding

00:19:54.779 --> 00:19:57.980
of the world when AI systems begin generating

00:19:57.980 --> 00:20:00.759
their own mathematical attributes and mapping

00:20:00.759 --> 00:20:03.400
class similarities for concepts that humans haven't

00:20:03.400 --> 00:20:06.000
even discovered or invented words for yet? Oh

00:20:06.000 --> 00:20:09.019
wow, a spatial map filled with coordinates for

00:20:09.019 --> 00:20:11.220
ideas we don't even have the language to describe

00:20:11.220 --> 00:20:13.599
yet. Exactly. That is an incredible thought to

00:20:13.599 --> 00:20:16.009
chew on. Well, thank you for joining us on this

00:20:16.009 --> 00:20:19.170
deep dive into the mechanics of the unseen. Keep

00:20:19.170 --> 00:20:21.690
looking for those hidden connections, keep leveraging

00:20:21.690 --> 00:20:23.829
the foundational concepts you know to explore

00:20:23.829 --> 00:20:27.069
what you don't, and most importantly, keep questioning

00:20:27.069 --> 00:20:28.769
the complex world around you.
