WEBVTT

00:00:00.000 --> 00:00:02.640
Welcome to the deep dive if you're listening

00:00:02.640 --> 00:00:06.179
to this, you're probably someone who... Well, you

00:00:06.179 --> 00:00:08.599
love grasping complex topics, right? Do you want

00:00:08.599 --> 00:00:10.640
to understand the world quickly and thoroughly

00:00:10.640 --> 00:00:13.240
but without drowning in all that horrible information

00:00:13.240 --> 00:00:15.699
overload, right? Nobody has time to read every

00:00:15.699 --> 00:00:18.839
single textbook. Exactly. And that's exactly why

00:00:18.839 --> 00:00:21.519
we are pulling from a really comprehensive source

00:00:21.519 --> 00:00:24.100
today. We're looking at excerpts from the Wikipedia

00:00:24.100 --> 00:00:26.839
article on meta learning in computer science,

00:00:26.940 --> 00:00:30.660
which I know sounds heavy but it's actually incredibly

00:00:30.660 --> 00:00:33.539
relevant to how we all learn. It really is. The

00:00:33.539 --> 00:00:35.899
mission for today is to explore how artificial

00:00:35.899 --> 00:00:38.659
intelligence is shifting from simply learning

00:00:38.659 --> 00:00:41.719
facts, you know, memorizing things, to actually

00:00:41.719 --> 00:00:44.020
learning how to learn. Right, which is a massive

00:00:44.020 --> 00:00:46.320
paradigm shift. Okay, let's unpack this. Because

00:00:46.320 --> 00:00:47.979
I want you to imagine for a second that you have

00:00:47.979 --> 00:00:51.340
an AI, right? And this system can completely

00:00:51.340 --> 00:00:53.740
crush the world's greatest grandmasters at chess.

00:00:53.820 --> 00:00:56.619
Oh, easily. Processing millions of moves a second.

00:00:56.960 --> 00:00:59.200
Yeah. It's by all accounts a certified genius.

00:00:59.719 --> 00:01:02.259
But then you take that exact same super smart

00:01:02.259 --> 00:01:05.299
AI and you ask it to play a simple game of checkers,

00:01:05.659 --> 00:01:08.140
what happens? It completely breaks down. It has

00:01:08.140 --> 00:01:10.420
absolutely no idea what to do. It essentially

00:01:10.420 --> 00:01:13.340
has to be rebuilt, entirely retrained from scratch,

00:01:13.700 --> 00:01:17.000
just to understand how to move a piece diagonally.

00:01:17.120 --> 00:01:19.680
Yeah, and that is, I mean, it's the great paradox

00:01:19.680 --> 00:01:22.219
of modern computer science, really. Why is the

00:01:22.219 --> 00:01:25.099
smartest machine on Earth so incredibly rigid?

00:01:26.019 --> 00:01:28.519
Because we have built these monolithic systems

00:01:28.519 --> 00:01:31.939
that are unbelievably good at one specific thing,

00:01:32.180 --> 00:01:34.879
but they're also incredibly brittle;

00:01:35.319 --> 00:01:38.659
they lack the fundamental ability to adapt. We

00:01:38.659 --> 00:01:40.799
spend massive amounts of time feeding machines

00:01:40.799 --> 00:01:43.260
data just hoping they learn a single task, when

00:01:43.260 --> 00:01:45.379
what we really need is a machine that understands

00:01:45.379 --> 00:01:47.519
the underlying mechanics of learning itself.

00:01:47.920 --> 00:01:50.060
Which brings us to the core definition from our

00:01:50.060 --> 00:01:53.140
source today. Meta -learning is defined as a

00:01:53.140 --> 00:01:55.340
subfield of machine learning, where automatic

00:01:55.340 --> 00:01:57.420
learning algorithms are applied to the metadata

00:01:57.420 --> 00:01:59.819
of machine learning experiments. Right, the metadata.

00:02:00.040 --> 00:02:01.799
But let's put it in the computer science dictionary

00:02:01.799 --> 00:02:03.780
for a second, because what does interacting with

00:02:03.780 --> 00:02:06.780
metadata actually look like in practice? Well...

00:02:06.560 --> 00:02:08.680
To understand that, you kind of have to look

00:02:08.680 --> 00:02:11.219
at what a standard machine learning model ignores.

00:02:11.919 --> 00:02:14.740
Usually, when you train an AI, you feed it thousands

00:02:14.740 --> 00:02:17.680
of images of cats, right? It makes guesses, and

00:02:17.680 --> 00:02:19.979
it gets a score based on its accuracy. So just

00:02:19.979 --> 00:02:22.419
trial and error. Exactly. And the engineers,

00:02:22.479 --> 00:02:24.860
they only care about the final model that can

00:02:24.860 --> 00:02:27.460
spot a cat. The entire journey to get there,

00:02:27.460 --> 00:02:30.840
all the mistakes, is just discarded. But meta

00:02:30.840 --> 00:02:33.460
-learning flips that entirely. It cares immensely

00:02:33.460 --> 00:02:35.199
about the journey. It looks at the metadata.
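The "learning from the metadata of experiments" idea can be sketched in a few lines. This is an illustrative sketch only, not code from the article: the meta-features (dataset size, feature count), the algorithm names, and the similarity rule are all invented for the example.

```python
# Illustrative sketch: algorithm selection from experiment metadata.
# Each record stores meta-features of a past dataset (here: number of
# examples, number of features) and which algorithm performed best on it.
experiments = [
    {"meta": (1000, 10), "best_algorithm": "linear_model"},
    {"meta": (50, 200), "best_algorithm": "nearest_neighbors"},
    {"meta": (100000, 30), "best_algorithm": "gradient_boosting"},
]

def suggest_algorithm(meta, history):
    """Pick the algorithm that worked best on the most similar past problem."""
    def distance(a, b):
        # Simple Euclidean distance between meta-feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    closest = min(history, key=lambda rec: distance(rec["meta"], meta))
    return closest["best_algorithm"]

# A new small, high-dimensional dataset looks most like the second record.
print(suggest_algorithm((60, 180), experiments))  # nearest_neighbors
```

Real meta-learning systems use far richer metadata (performance curves, attempted hypotheses, and so on), but the principle is the same: past learning experiments, not just past data, drive the choice for the new problem.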

00:02:35.419 --> 00:02:38.419
Meaning the actual process itself? Yes. It analyzes

00:02:38.419 --> 00:02:40.780
the properties of the problem. It looks at how

00:02:40.780 --> 00:02:43.199
the algorithm's performance fluctuated over time,

00:02:43.259 --> 00:02:45.159
the patterns it tried to use, all of it. And

00:02:45.159 --> 00:02:46.759
what does it do with that? What's fascinating

00:02:46.759 --> 00:02:49.879
here is that it takes all of that context and

00:02:49.879 --> 00:02:53.300
uses it to dynamically alter, select, or even

00:02:53.300 --> 00:02:55.580
combine entirely different learning algorithms

00:02:55.580 --> 00:02:58.500
to solve a new problem. I think the natural reaction

00:02:58.500 --> 00:03:03.620
for someone outside the field is to wonder why

00:03:03.620 --> 00:03:06.180
that extra layer of complexity is even necessary.

00:03:06.740 --> 00:03:08.800
Because if you have a massive neural network

00:03:08.800 --> 00:03:11.560
and just a ton of computing power, shouldn't

00:03:11.560 --> 00:03:13.520
it just naturally be able to learn whatever you

00:03:13.520 --> 00:03:15.610
throw at it? You would think so, right. But it

00:03:15.610 --> 00:03:18.550
hits a wall due to a fundamental limitation.

00:03:18.750 --> 00:03:21.750
It's called inductive bias. Inductive bias. Right.

00:03:22.069 --> 00:03:25.069
Every single standard learning algorithm operates

00:03:25.069 --> 00:03:28.550
on an inductive bias. It's simply a set of built

00:03:28.550 --> 00:03:30.750
-in hard -coded assumptions that it makes about

00:03:30.750 --> 00:03:32.689
the data it's going to process. Like mathematical

00:03:32.689 --> 00:03:35.370
blinders. Exactly. Because it has those blinders

00:03:35.370 --> 00:03:38.129
on, a standard algorithm is only going to perform

00:03:38.129 --> 00:03:41.330
well if its specific assumptions happen to perfectly

00:03:41.330 --> 00:03:43.669
align with the new problem you give it. So it's

00:03:43.669 --> 00:03:46.569
highly specialized, but trapped by its own design.

00:03:47.129 --> 00:03:49.370
Yeah. I mean, think of inductive bias like a

00:03:49.370 --> 00:03:52.509
world -class, ultra -expensive sushi knife. Oh,

00:03:52.629 --> 00:03:54.729
that's a great analogy. Right. If your goal is

00:03:54.729 --> 00:03:57.069
slicing delicate fish, its design assumptions

00:03:57.069 --> 00:03:59.669
are perfect. The blade is exactly what you need.

00:04:00.009 --> 00:04:03.250
But if you take that exact same tool into a forest

00:04:03.250 --> 00:04:06.430
to chop down an oak tree... It's utterly useless.

00:04:06.830 --> 00:04:08.830
The assumptions built into the knife just do

00:04:08.830 --> 00:04:11.169
not match the reality of the wood. Precisely.

00:04:11.449 --> 00:04:14.509
A standard algorithm might dominate text translation,

00:04:14.830 --> 00:04:17.490
but completely fail at image recognition because

00:04:17.490 --> 00:04:19.589
the relationship between the data structure and

00:04:19.589 --> 00:04:22.529
the algorithm is mismatched. So meta -learning

00:04:22.529 --> 00:04:25.370
is sort of like a master craftsman. Okay, how

00:04:25.370 --> 00:04:27.949
so? Instead of just wildly swinging a sushi knife

00:04:27.949 --> 00:04:32.189
at a tree, the AI steps back, analyzes the properties

00:04:32.189 --> 00:04:35.430
of the task, the metadata, and dynamically decides,

00:04:35.670 --> 00:04:38.420
oh wait, this is wood. Right. I need to put away

00:04:38.420 --> 00:04:40.899
the knife, generate a chainsaw, and adjust my

00:04:40.899 --> 00:04:45.019
grip. It shifts its own inductive bias to match

00:04:45.019 --> 00:04:47.079
the reality of the problem in front of it. That's

00:04:47.079 --> 00:04:49.339
exactly it. It essentially rewrites its own rules

00:04:49.339 --> 00:04:51.819
for solving the problem. But practically speaking,

00:04:52.160 --> 00:04:54.699
how did scientists even begin to conceptualize

00:04:54.699 --> 00:04:57.060
code that could rewrite its own rules? Because,

00:04:57.060 --> 00:04:59.439
I mean, writing software that writes its own

00:04:59.439 --> 00:05:01.920
software sounds like a recipe for a total system

00:05:01.920 --> 00:05:04.629
crash. It does, yeah. But they didn't invent

00:05:04.629 --> 00:05:06.850
the concept from scratch. They actually looked

00:05:06.850 --> 00:05:09.430
at the ultimate most robust learning system that

00:05:09.430 --> 00:05:11.829
already exists. Which is? Biological evolution.

00:05:12.410 --> 00:05:15.050
Yeah, early pioneering work in the late 80s and

00:05:15.050 --> 00:05:18.250
early 90s, specifically by researchers like Jürgen

00:05:18.250 --> 00:05:22.259
Schmidhuber and Yoshua Bengio, they used genetic

00:05:22.259 --> 00:05:24.920
evolution as the blueprint. OK, that makes sense.

00:05:25.120 --> 00:05:27.939
In nature, genetic evolution is the ultimate

00:05:27.939 --> 00:05:30.600
meta learner. It learns the actual learning procedure

00:05:30.600 --> 00:05:34.100
itself, encodes that procedure into DNA, and

00:05:34.100 --> 00:05:36.220
then executes it in the brain of an organism.

00:05:36.639 --> 00:05:39.399
It's an open -ended hierarchy of meta evolution.

00:05:39.639 --> 00:05:41.420
Wait, hold on. I have to push back on this a

00:05:41.420 --> 00:05:43.550
little bit. Go for it. I understand the inspiration,

00:05:44.129 --> 00:05:46.829
but biological evolution takes millions of years

00:05:46.829 --> 00:05:50.209
of horrific trial and error. I mean, entire species

00:05:50.209 --> 00:05:52.649
dying off just to find the right bias for survival.

00:05:52.800 --> 00:05:55.560
Right, it's not exactly efficient. Yeah. So if

00:05:55.560 --> 00:05:58.420
the goal of AI is to be fast and flexible, how

00:05:58.420 --> 00:06:00.660
on earth does it dynamically choose its bias

00:06:00.660 --> 00:06:03.079
in real time without running a million -year

00:06:03.079 --> 00:06:05.279
simulation? Like, what levers is it actually

00:06:05.279 --> 00:06:07.959
pulling to speed this up? That is the exact problem

00:06:07.959 --> 00:06:10.480
researchers had to solve. It requires a crucial

00:06:10.480 --> 00:06:12.959
distinction. When a meta -learning system shifts

00:06:12.959 --> 00:06:16.000
its bias, it isn't blindly mutating like early

00:06:16.000 --> 00:06:18.680
life on Earth. It is meticulously adjusting two

00:06:18.680 --> 00:06:21.620
very specific parameters. And we should probably

00:06:21.620 --> 00:06:24.050
clarify for anyone who has taken a basic data

00:06:24.050 --> 00:06:26.709
science class. This is completely different from

00:06:26.709 --> 00:06:29.209
the standard bias variance dilemma you learn

00:06:29.209 --> 00:06:31.509
about in statistics. Right. This is a different

00:06:31.509 --> 00:06:35.110
kind of bias. Exactly. Meta -learning is specifically

00:06:35.110 --> 00:06:38.170
tweaking what we call declarative bias and procedural

00:06:38.170 --> 00:06:40.290
bias. Okay. Let's unpack those because that sounds

00:06:40.290 --> 00:06:44.009
heavy. What is a declarative bias doing? Declarative

00:06:44.009 --> 00:06:46.990
bias restricts the search space. It mathematically

00:06:46.990 --> 00:06:49.670
dictates the boundaries of what the AI is even

00:06:49.670 --> 00:06:51.970
allowed to consider. Give me an example of that.

00:06:52.629 --> 00:06:55.629
So if the AI looks at the metadata of a new problem,

00:06:55.970 --> 00:06:58.310
it might dynamically alter its declarative bias

00:06:58.310 --> 00:07:01.170
to say, based on these initial properties, I'm

00:07:01.170 --> 00:07:03.209
only going to look at linear functions. Oh, I

00:07:03.209 --> 00:07:05.470
see. By actively restricting the representation

00:07:05.470 --> 00:07:07.850
of the problem, it slashes the amount of searching

00:07:07.850 --> 00:07:10.350
the algorithm has to do. Got it. So declarative

00:07:10.350 --> 00:07:13.089
bias is the AI setting up fences around the playground.

00:07:13.350 --> 00:07:15.199
That's a good way to put it. It looks at the

00:07:15.199 --> 00:07:17.459
task, decides half the playground is a complete

00:07:17.459 --> 00:07:19.819
waste of time, and just blocks it off so it doesn't

00:07:19.819 --> 00:07:22.639
wander into the woods. Exactly. So what is procedural

00:07:22.639 --> 00:07:25.699
bias then? Procedural bias dictates the strategy

00:07:25.699 --> 00:07:28.959
within those fences. It imposes constraints on

00:07:28.959 --> 00:07:31.439
the ordering of the hypotheses the AI tests.
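The two kinds of bias can be made concrete with a toy sketch. Everything here is hypothetical and invented for illustration (the hypothesis list, the "linear" flag, the complexity scores): declarative bias restricts which hypotheses may be considered at all, and procedural bias imposes an order on testing the survivors.

```python
# Illustrative sketch of declarative vs. procedural bias.
def fits(hypothesis, data, tolerance=1e-6):
    """A hypothesis 'fits' if it reproduces every (x, y) pair in the data."""
    return all(abs(hypothesis["f"](x) - y) < tolerance for x, y in data)

hypothesis_space = [
    {"name": "f(x) = x^2",    "f": lambda x: x * x, "complexity": 2, "linear": False},
    {"name": "f(x) = 3x + 0", "f": lambda x: 3 * x, "complexity": 2, "linear": True},
    {"name": "f(x) = 3x",     "f": lambda x: 3 * x, "complexity": 1, "linear": True},
]

data = [(1, 3), (2, 6)]

# Declarative bias: fence off the search space -- only linear functions
# are even allowed to be considered.
allowed = [h for h in hypothesis_space if h["linear"]]

# Procedural bias: within the fences, test simpler hypotheses first.
ordered = sorted(allowed, key=lambda h: h["complexity"])

solution = next(h for h in ordered if fits(h, data))
print(solution["name"])  # f(x) = 3x
```

Both biases shrink the effective search: the declarative fence removes candidates outright, and the procedural ordering means the simplest fitting hypothesis is found before any complex one is ever tried.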

00:07:31.560 --> 00:07:34.360
So within that restricted playground. How does

00:07:34.360 --> 00:07:37.240
it systematically try out solutions? Right. A

00:07:37.240 --> 00:07:39.699
procedural bias might be dynamically set. So

00:07:39.699 --> 00:07:42.540
the AI always tests the smallest, simplest mathematical

00:07:42.540 --> 00:07:45.620
hypotheses first before moving on to incredibly

00:07:45.620 --> 00:07:48.819
complex ones. Oh, wow. Yeah. So by intelligently

00:07:48.819 --> 00:07:51.220
shifting these two biases together based on the

00:07:51.220 --> 00:07:54.339
problem's metadata, the AI zeros in on a solution

00:07:54.339 --> 00:07:57.360
exponentially faster than just random trial and

00:07:57.360 --> 00:07:59.600
error. Here's where it gets really interesting

00:07:59.600 --> 00:08:02.399
because going from theory to actual software

00:08:02.399 --> 00:08:04.800
architecture, is where the magic happens. Oh,

00:08:04.800 --> 00:08:07.480
definitely. The source outlines three primary

00:08:07.480 --> 00:08:09.600
ways engineers are currently building these meta

00:08:09.600 --> 00:08:11.879
-learning systems, and they function very differently.

00:08:12.000 --> 00:08:15.259
Let's walk through them. The first is model -based

00:08:15.259 --> 00:08:17.740
meta -learning. Right, and the core goal of the

00:08:17.740 --> 00:08:20.399
model -based approach is extremely rapid adaptation.

00:08:20.800 --> 00:08:24.079
Okay. These systems often use cyclic networks

00:08:24.079 --> 00:08:27.160
and rely on external or internal memory banks.

00:08:27.500 --> 00:08:30.220
The intention is to update the model's parameters,

00:08:30.459 --> 00:08:32.879
its internal settings, with just a handful of

00:08:32.879 --> 00:08:35.419
training steps. Rather than the thousands of

00:08:35.419 --> 00:08:38.379
steps a normal neural network requires. Exactly.

00:08:38.419 --> 00:08:40.580
And a standout example of this from the text

00:08:40.580 --> 00:08:44.639
is memory-augmented neural networks, or MANNs.

00:08:44.759 --> 00:08:47.580
Yes, MANNs are fascinating. From what I understand,

00:08:47.840 --> 00:08:51.610
standard deep learning has to slowly adjust millions

00:08:51.610 --> 00:08:54.929
of internal weights every single time it sees

00:08:54.929 --> 00:08:56.850
a new piece of data. It's a tedious process.

00:08:57.110 --> 00:08:59.580
Very slow. But a MANN essentially has a built

00:08:59.580 --> 00:09:02.600
-in scratch pad. It can rapidly write new information

00:09:02.600 --> 00:09:04.440
to an external memory bank and then read from

00:09:04.440 --> 00:09:06.940
it immediately. It can adapt to an entirely new

00:09:06.940 --> 00:09:08.960
task after seeing only three or four examples

00:09:08.960 --> 00:09:10.899
because it's storing the knowledge differently.

00:09:11.100 --> 00:09:13.840
Exactly. It bypasses the slow, gradual weight

00:09:13.840 --> 00:09:16.740
-updating process of standard networks by leaning

00:09:16.740 --> 00:09:18.899
on that fast access memory. Which is brilliant.
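The scratch-pad idea can be sketched very simply. This is not a real MANN (which learns what to write and read via a neural controller); it is a minimal illustration, with invented data, of how an external memory lets a system "learn" from a handful of examples with no gradient steps at all.

```python
# Illustrative sketch of the memory-augmented idea: instead of slowly
# adjusting weights, write (key, label) pairs to an external memory and
# answer new queries by reading the most similar stored entry.
class ExternalMemory:
    def __init__(self):
        self.keys, self.values = [], []

    def write(self, key, value):
        # Fast "learning": one write operation, no weight updates.
        self.keys.append(key)
        self.values.append(value)

    def read(self, query):
        # Return the value stored under the nearest key.
        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        best = min(range(len(self.keys)),
                   key=lambda i: distance(self.keys[i], query))
        return self.values[best]

memory = ExternalMemory()
# Three examples are enough to adapt to a brand-new task.
memory.write((0.1, 0.2), "class_a")
memory.write((0.9, 0.8), "class_b")
memory.write((0.15, 0.25), "class_a")

print(memory.read((0.12, 0.22)))  # class_a
```

The contrast with standard deep learning is the point: the knowledge lives in the memory bank, not in millions of slowly-updated weights.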

00:09:19.159 --> 00:09:21.879
It is. Which leads us to the second major architectural

00:09:21.879 --> 00:09:24.840
approach, metric -based meta -learning. OK. How

00:09:24.840 --> 00:09:26.940
is this one different? Well, this operates on

00:09:26.940 --> 00:09:29.919
a completely different philosophy. The core idea

00:09:29.919 --> 00:09:32.559
here is related to nearest-neighbors algorithms.

00:09:33.120 --> 00:09:35.360
The system generates a weight using a kernel

00:09:35.360 --> 00:09:39.059
function. Ultimately, it is trying to learn an

00:09:39.059 --> 00:09:41.779
effective metric, a mathematical distance function

00:09:41.779 --> 00:09:44.299
between different objects. Okay, kernel function

00:09:44.299 --> 00:09:46.399
and distance function. Let me try to translate

00:09:46.399 --> 00:09:47.980
that into something we use every day. Please

00:09:47.980 --> 00:09:50.399
do. Think of a highly advanced matchmaking app.

00:09:50.440 --> 00:09:54.909
Okay. A basic, rigid app just looks for exact

00:09:54.909 --> 00:09:57.710
overlaps, right? Like you both selected dogs

00:09:57.710 --> 00:10:00.549
and hiking, therefore you are a match. Right,

00:10:00.889 --> 00:10:02.909
standard pattern matching. But a metric -based

00:10:02.909 --> 00:10:06.159
meta learner does something much deeper. It doesn't

00:10:06.159 --> 00:10:08.659
just figure out who is similar, it dynamically

00:10:08.659 --> 00:10:11.600
learns what metric actually matters in a given

00:10:11.600 --> 00:10:13.980
context. It learns how to mathematically measure

00:10:13.980 --> 00:10:16.860
the distance between two user profiles depending

00:10:16.860 --> 00:10:19.240
on the task. Because the criteria for finding

00:10:19.240 --> 00:10:21.440
a romantic partner is completely different from

00:10:21.440 --> 00:10:23.460
finding a tennis buddy or a business co -founder.

00:10:24.179 --> 00:10:27.159
Exactly. The AI dynamically learns how to measure

00:10:27.159 --> 00:10:29.679
the space between data points based on what it's

00:10:29.679 --> 00:10:31.840
trying to achieve. That is a brilliant way to

00:10:31.840 --> 00:10:34.250
visualize it. Thanks. It isn't learning the data

00:10:34.250 --> 00:10:36.610
itself. It is learning the relationship between

00:10:36.610 --> 00:10:39.649
the inputs. A perfect example of this in practice

00:10:39.649 --> 00:10:42.490
is convolutional Siamese neural networks. Siamese

00:10:42.490 --> 00:10:45.009
neural networks. Yeah. As the name implies, this

00:10:45.009 --> 00:10:48.370
architecture uses two twin networks. They share

00:10:48.370 --> 00:10:51.049
the exact same internal weights and parameters.

00:10:51.149 --> 00:10:54.070
OK. You feed one image into the left twin and

00:10:54.070 --> 00:10:56.669
a different image into the right twin. The networks

00:10:56.669 --> 00:10:59.750
aren't trying to output a label like, this is

00:10:59.750 --> 00:11:02.379
a cat. What are they doing instead? They're jointly

00:11:02.379 --> 00:11:04.899
trained to output the mathematical distance,

00:11:05.080 --> 00:11:06.860
or the difference between the two inputs. Oh,

00:11:06.860 --> 00:11:08.919
wow. So if you show it a new type of machinery

00:11:08.919 --> 00:11:11.559
it has never seen before, it doesn't need to

00:11:11.559 --> 00:11:13.460
know what the machine is called. Nope. It just

00:11:13.460 --> 00:11:15.539
needs to look at a baseline photo, look at the

00:11:15.539 --> 00:11:19.019
new photo, and say these are mathematically 99

00:11:19.019 --> 00:11:22.960
% similar. It learns how to compare things, not

00:11:22.960 --> 00:11:25.279
how to name them. Exactly. Another variation

00:11:25.279 --> 00:11:27.740
mentioned in the text is prototypical networks.

00:11:28.059 --> 00:11:31.200
How do those work? In this setup, the AI creates

00:11:31.200 --> 00:11:33.860
a generalized average prototype representation

00:11:33.860 --> 00:11:36.840
for a class of items. When a new piece of data

00:11:36.840 --> 00:11:39.820
comes in, the network just computes the distance

00:11:39.820 --> 00:11:42.440
to these prototypes. Like comparing a new dog

00:11:42.440 --> 00:11:45.039
to its ultimate concept of a dog. Precisely.
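The prototype idea just described fits in a short sketch. This is illustrative only: real prototypical networks average examples in a learned embedding space, whereas here the "embedding" is just the raw feature vector, and the classes and numbers are made up.

```python
# Illustrative sketch of prototypical classification: average each
# class's few examples into a prototype, then classify a new point by
# its distance to the nearest prototype.
def prototype(examples):
    """Element-wise mean of a class's example vectors."""
    n = len(examples)
    return tuple(sum(v[i] for v in examples) / n for i in range(len(examples[0])))

def classify(query, prototypes):
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda label: distance(prototypes[label], query))

support_set = {
    "dog": [(0.9, 0.1), (0.8, 0.2)],  # only a couple of examples per
    "cat": [(0.1, 0.9), (0.2, 0.8)],  # class: the few-shot setting
}
prototypes = {label: prototype(vs) for label, vs in support_set.items()}

print(classify((0.7, 0.3), prototypes))  # dog
```

A new example never needs a retrained model; it only needs to be measured against each class's "ultimate concept."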

00:11:45.259 --> 00:11:48.220
It is a highly effective, simplified inductive

00:11:48.220 --> 00:11:50.919
bias that yields phenomenal results for what

00:11:50.919 --> 00:11:53.600
we call few -shot learning. Which is when you

00:11:53.600 --> 00:11:55.720
only have a tiny amount of data to work with,

00:11:55.799 --> 00:11:58.159
right? Exactly. Okay, so we have model -based

00:11:58.250 --> 00:12:01.269
using memory scratch pads and metric -based learning

00:12:01.269 --> 00:12:04.350
how to measure the distance between things. That

00:12:04.350 --> 00:12:06.450
brings us to the third approach, which sounds

00:12:06.450 --> 00:12:10.429
the most aggressive: optimization-based

00:16:10.429 --> 00:16:12.629
meta-learning. Yeah, this approach goes straight for

00:12:12.629 --> 00:12:15.690
the engine. Optimization-based algorithms explicitly

00:12:15.690 --> 00:12:18.509
adjust the underlying optimization process itself.

00:12:18.730 --> 00:12:21.210
Okay. A massive milestone here was introduced

00:12:21.210 --> 00:12:25.190
in 2017. It's known as MAML, or model-agnostic meta

00:12:25.190 --> 00:12:27.169
learning. I want to break down that term model

00:12:27.169 --> 00:12:29.389
agnostic because that implies it doesn't actually

00:12:29.389 --> 00:12:31.190
care what kind of neural network you're running

00:12:31.190 --> 00:12:33.070
it on, right? That is exactly what it means.

00:12:33.769 --> 00:12:36.549
MAML is compatible with virtually any model that

00:12:36.549 --> 00:12:38.759
learns through gradient descent. And for anyone

00:12:38.759 --> 00:12:41.360
who isn't deep into the math, gradient descent

00:12:41.360 --> 00:12:43.820
is basically the AI trying to find the lowest

00:12:43.820 --> 00:12:46.600
point of error. Right. Imagine the AI is standing

00:12:46.600 --> 00:12:49.759
on a foggy mountain and its goal is to find the

00:12:49.759 --> 00:12:52.820
bottom of the deepest valley. It takes a step,

00:12:53.200 --> 00:12:56.440
feels the slope, and walks downward. That's gradient

00:12:56.440 --> 00:12:59.100
descent. That's a great visual. But normally,

00:12:59.220 --> 00:13:01.460
finding that valley takes thousands and thousands

00:13:01.460 --> 00:13:04.039
of steps. Right. A standard model trains its

00:13:04.039 --> 00:13:06.059
parameters to reach the bottom of one specific

00:13:06.059 --> 00:13:09.659
valley for one specific task. MAML, however,

00:13:09.940 --> 00:13:12.179
doesn't train for one valley. It trains the parameters

00:13:12.179 --> 00:13:15.259
across a huge sequence of different tasks. So

00:13:15.259 --> 00:13:17.659
what is its actual goal? Its goal is to find

00:13:17.659 --> 00:13:19.840
a starting point on the mountain that is mathematically

00:13:19.840 --> 00:13:22.279
close to all the valleys. Oh, wow. So if you're

00:13:22.279 --> 00:13:24.629
listening to this on your commute, think of MAML

00:13:24.629 --> 00:13:26.690
like your brain when you jump into a rental

00:13:26.690 --> 00:13:30.210
car. You already know how to drive. You don't

00:13:30.210 --> 00:13:32.210
need to relearn the physics of a steering wheel

00:13:32.210 --> 00:13:34.909
or the concept of a brake pedal. Because of your

00:13:34.909 --> 00:13:37.789
prior experience across many cars, your brain

00:13:37.789 --> 00:13:39.990
is pre -optimized. Right, you're not starting

00:13:39.990 --> 00:13:42.529
from scratch. Exactly. You just need two minutes

00:13:42.529 --> 00:13:44.389
to fine -tune your environment to figure out

00:13:44.389 --> 00:13:46.710
where the wipers are and how sensitive the gas

00:13:46.710 --> 00:13:50.490
pedal is. And MAML trains the AI to be the

00:13:50.490 --> 00:13:53.340
driver of the rental car. As the original researchers

00:13:53.340 --> 00:13:55.919
stated, it explicitly trains the model to be

00:13:55.919 --> 00:13:58.500
easy to fine-tune. It is an incredibly powerful

00:13:58.500 --> 00:14:01.759
concept. It essentially primes the network so

00:14:01.759 --> 00:14:04.639
that when an entirely new unseen task arrives,

00:14:05.019 --> 00:14:07.460
it only takes a tiny amount of training data

00:14:07.460 --> 00:14:10.519
and just a few steps of gradient descent to reach

00:14:10.519 --> 00:14:12.659
expert level performance. It really is. It sounds

00:14:12.659 --> 00:14:14.899
like the holy grail of artificial intelligence.
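The "good starting point on the mountain" idea behind MAML can be shown with a toy one-dimensional version. This is a deliberately simplified sketch, not the real algorithm: each task's loss is an invented valley, the learning rates are arbitrary, and the meta-gradient is computed numerically instead of by backpropagating through the inner step.

```python
# Toy illustration of the MAML idea. Each task wants the bottom of its
# own "valley", loss(x) = (x - center)^2. We search for one starting
# point from which a SINGLE inner gradient step does well on every task.
def loss(x, center):
    return (x - center) ** 2

def adapt(x, center, inner_lr=0.4):
    """One inner-loop gradient-descent step on a single task."""
    grad = 2 * (x - center)
    return x - inner_lr * grad

def meta_loss(x, centers):
    """Average post-adaptation loss: how good is x as a starting point?"""
    return sum(loss(adapt(x, c), c) for c in centers) / len(centers)

centers = [-2.0, 0.0, 3.0]  # three different tasks (three valleys)
x = 10.0                    # initial meta-parameter, far from everything
meta_lr, eps = 0.5, 1e-5
for _ in range(200):
    # Numerical meta-gradient of the post-adaptation loss.
    grad = (meta_loss(x + eps, centers) - meta_loss(x - eps, centers)) / (2 * eps)
    x -= meta_lr * grad

# The learned start lands near the tasks' mean, close to every valley.
print(round(x, 2))
```

The outer loop never optimizes for any single valley; it optimizes for how well one quick fine-tuning step works afterward, which is exactly the "trained to be easy to fine-tune" property described above.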

00:14:15.039 --> 00:14:17.419
It does. You build a foundation that is inherently

00:14:17.419 --> 00:14:19.799
prepped to learn anything on the fly. But we

00:14:19.799 --> 00:14:22.399
have to talk about the reality of deploying this

00:14:22.399 --> 00:14:24.519
outside of a laboratory. Yes, we do. Because

00:14:24.519 --> 00:14:27.860
the literature points out a massive glaring flaw

00:14:27.860 --> 00:14:30.279
in how these optimization algorithms have historically

00:14:30.279 --> 00:14:32.720
worked. What happens when the AI gets it wrong?

00:14:32.909 --> 00:14:35.129
Yeah, there is a very dangerous catch when you

00:14:35.129 --> 00:14:37.409
implement this in the real world. Historically,

00:14:37.450 --> 00:14:39.769
when a meta -learning system is juggling a diverse

00:14:39.769 --> 00:14:42.929
set of tasks, it optimizes for the average score

00:14:42.929 --> 00:14:45.590
across all of them. Right. It wants the highest

00:14:45.590 --> 00:14:47.909
mean performance. And this is where statistics

00:14:47.909 --> 00:14:50.330
can lie to you. Oh, absolutely. If you are only

00:14:50.330 --> 00:14:53.230
looking at the average, you are inevitably masking

00:14:53.230 --> 00:14:56.529
some catastrophic individual failures. Yep. Like,

00:14:56.529 --> 00:14:59.610
if I take five exams and I score 100 on four

00:14:59.610 --> 00:15:01.990
of them, but I get a zero on the fifth, my average

00:15:01.990 --> 00:15:05.769
is an 80. Which isn't bad. Right. On paper, to

00:15:05.769 --> 00:15:08.070
an algorithm looking at the mean, I look like

00:15:08.070 --> 00:15:11.990
a solid B student. But if that zero was my driver's

00:15:11.990 --> 00:15:15.029
license exam, I absolutely should not be operating

00:15:15.029 --> 00:15:17.169
a vehicle. Oh no, you should not. If you apply

00:15:17.169 --> 00:15:20.669
this to real -world AI like medical diagnostics

00:15:20.669 --> 00:15:23.250
predicting diseases or autonomous vehicles navigating

00:15:23.250 --> 00:15:27.330
streets, sacrificing the performance of one specific

00:15:27.330 --> 00:15:29.870
task just to keep the overall average high is

00:15:29.870 --> 00:15:32.470
lethal. It's totally unacceptable. It is a critical

00:15:32.470 --> 00:15:35.309
vulnerability and researchers realized that they

00:15:35.309 --> 00:15:38.190
had to address it immediately. This led to a

00:15:38.190 --> 00:15:40.950
very recent development from 2023 called RoML.

00:15:41.120 --> 00:15:43.879
Which stands for? Robust meta -reinforcement

00:15:43.879 --> 00:15:47.340
learning. OK. RoML completely flips the objective.

00:15:47.539 --> 00:15:50.360
Instead of chasing a high average score, it focuses

00:15:50.360 --> 00:15:52.960
entirely on the worst -case scenarios. Really?

00:15:53.120 --> 00:15:55.940
Yeah. It identifies the tasks where the AI is

00:15:55.940 --> 00:15:58.679
performing the poorest and dynamically shifts

00:15:58.679 --> 00:16:01.240
computing resources to improve those specific

00:16:01.240 --> 00:16:04.120
low scores. So it guarantees a baseline of safety.
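The average-versus-worst-case contrast just described is easy to verify with the exam numbers from a moment ago. The per-task scores and candidate configurations below are invented for illustration; the point is only the shape of the two objectives.

```python
# Why the mean can hide a catastrophic failure.
task_scores = {"chess": 100, "checkers": 100, "go": 100,
               "poker": 100, "driving_exam": 0}

mean_score = sum(task_scores.values()) / len(task_scores)
worst_score = min(task_scores.values())

print(mean_score)   # 80.0 -> looks like a solid "B student"
print(worst_score)  # 0    -> reveals the catastrophic failure

# A robust, worst-case meta-objective picks the configuration whose
# WEAKEST task is strongest, rather than the one with the best average.
candidates = {
    "average_optimized": {"a": 100, "b": 100, "c": 0},
    "robust":            {"a": 70,  "b": 75,  "c": 65},
}
best = max(candidates, key=lambda name: min(candidates[name].values()))
print(best)  # robust
```

Note that the robust choice has a worse average (70 vs. about 67 is actually the reverse here: 66.7 average beats nothing), but crucially no task ever drops to zero, which is the safety guarantee being described.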

00:16:04.220 --> 00:16:07.600
Exactly. It refuses to let any single task drop

00:16:07.600 --> 00:16:09.919
below a certain threshold, ensuring the system

00:16:09.919 --> 00:16:12.879
is actually robust, regardless of what environment

00:16:12.879 --> 00:16:15.059
it's dropped into. And the brilliant part is

00:16:15.059 --> 00:16:17.840
that RoML is a meta-algorithm itself. You

00:16:17.840 --> 00:16:20.340
can basically stack it on top of MAML. That's

00:16:20.340 --> 00:16:22.519
right. It forces the underlying system to care

00:16:22.519 --> 00:16:25.659
about the outliers. And this layering of algorithms,

00:16:25.779 --> 00:16:28.460
you know, AI managing AI, is really the defining

00:16:28.460 --> 00:16:30.899
characteristic of this entire field. It's moving

00:16:30.899 --> 00:16:33.500
so fast. We have moved incredibly fast from the

00:16:33.500 --> 00:16:35.779
theoretical evolutionary blueprints of the 80s

00:16:35.779 --> 00:16:39.620
to tangible world -altering software. Back in

00:16:39.620 --> 00:16:42.720
2017, Google Brain launched its AutoML project,

00:16:43.000 --> 00:16:45.240
which was literally described as AI building

00:16:45.240 --> 00:16:48.080
AI. I remember that. That system actually managed

00:16:48.080 --> 00:16:50.500
to design a neural network architecture that

00:16:50.500 --> 00:16:53.860
briefly exceeded the best, most optimized networks

00:16:53.860 --> 00:16:57.059
designed by human experts on standard benchmarks.

00:16:57.539 --> 00:17:00.399
AI successfully designing an AI that beats human

00:17:00.399 --> 00:17:04.460
engineers. That is wild. It is. This raises an

00:17:04.460 --> 00:17:06.960
important question, though. If we project this

00:17:06.960 --> 00:17:09.299
exact trajectory into the future, where does

00:17:09.299 --> 00:17:11.559
it end? Right. What's the end game? Well, the

00:17:11.559 --> 00:17:14.440
mathematics of this field point toward an extreme

00:17:14.440 --> 00:17:16.710
endpoint: a theoretical construct

00:17:16.710 --> 00:17:19.210
known in the literature as the Gödel machine.

00:17:19.509 --> 00:17:21.569
The Gödel machine. It sounds like something out

00:17:21.569 --> 00:17:23.730
of a sci -fi novel. What exactly is it? It is

00:17:23.730 --> 00:17:26.529
a theoretical ultimate meta -learning system.

00:17:26.950 --> 00:17:30.170
It contains a general theorem prover. And what

00:17:30.170 --> 00:17:32.329
makes the Gödel machine unique is that it has

00:17:32.329 --> 00:17:34.789
full access to its own source code. Meaning?

00:17:34.990 --> 00:17:37.809
It can inspect and modify any part of its own

00:17:37.809 --> 00:17:41.049
software at will. It is designed to achieve recursive

00:17:41.049 --> 00:17:43.750
self -improvement. Well, lots of AI systems tweak

00:17:43.750 --> 00:17:46.039
their weights, though. What makes this the ultimate

00:17:46.039 --> 00:17:47.779
endpoint? What makes it the endpoint is that

00:17:47.779 --> 00:17:50.339
its self -improvement is mathematically provably

00:17:50.339 --> 00:17:53.420
optimal. Wait, provably optimal? Yes. It doesn't

00:17:53.420 --> 00:17:55.720
use trial and error. It doesn't guess what might

00:17:55.720 --> 00:17:58.680
make it better. Before it rewrites a single line

00:17:58.680 --> 00:18:02.160
of its own code, it uses its theorem prover to

00:18:02.160 --> 00:18:04.579
mathematically guarantee that the change will

00:18:04.579 --> 00:18:08.059
increase its ability to solve problems. It completely

00:18:08.059 --> 00:18:11.079
masters the meta-learning process, continually

00:18:11.079 --> 00:18:13.519
and optimally accelerating its own intelligence

00:18:13.519 --> 00:18:17.500
over a single lifelong run. It is the absolute

00:18:17.500 --> 00:18:20.339
theoretical ceiling of an algorithm shifting

00:18:20.339 --> 00:18:23.420
its own inductive bias. So what does this all

00:18:23.420 --> 00:18:25.099
mean? Let's bring this all the way back to you,

00:18:25.200 --> 00:18:27.299
the person listening right now. Every single

00:18:27.299 --> 00:18:30.000
day, you are tweaking your own internal procedural

00:18:30.000 --> 00:18:32.420
bias. Yeah, you are. You choose to listen to

00:18:32.420 --> 00:18:34.519
deep dives. You figure out whether you absorb

00:18:34.519 --> 00:18:36.240
information better through audio or reading.

00:18:36.680 --> 00:18:38.400
You aggressively try to filter out the noise

00:18:38.400 --> 00:18:40.480
of the internet so you can learn faster and adapt

00:18:40.480 --> 00:18:42.680
to new challenges at your job. Yeah. You are,

00:18:42.819 --> 00:18:46.099
by definition, a meta learner. Absolutely. And

00:18:46.099 --> 00:18:48.700
computer scientists are painstakingly engineering

00:18:48.700 --> 00:18:51.099
artificial intelligence to do exactly what you

00:18:51.099 --> 00:18:53.779
do naturally. For decades, machine learning was

00:18:53.779 --> 00:18:56.190
essentially about giving an AI a fish: feeding

00:18:56.190 --> 00:18:58.549
it millions of pixels so it could recognize a

00:18:58.549 --> 00:19:01.190
cat or a stop sign. Exactly, but meta-learning

00:19:01.190 --> 00:19:04.150
is the painstaking process of teaching the AI

00:19:04.150 --> 00:19:07.029
how to build the perfect fishing rod for whatever

00:19:07.029 --> 00:19:09.329
ocean it suddenly finds itself in. And it goes

00:19:09.329 --> 00:19:11.630
even further than that. It is teaching the machine

00:19:11.630 --> 00:19:14.160
to understand the physics of the water, to analyze

00:19:14.160 --> 00:19:16.920
the behavior of the fish, and to physically alter

00:19:16.920 --> 00:19:19.460
the rod's design in real time to guarantee a

00:19:19.460 --> 00:19:21.740
catch no matter the conditions. Which leaves

00:19:21.740 --> 00:19:24.539
us with a rather profound and honestly a slightly

00:19:24.539 --> 00:19:27.839
haunting thought to mull over. Yeah. If we follow

00:19:27.839 --> 00:19:30.960
this meta -learning path all the way to its logical

00:19:30.960 --> 00:19:33.819
conclusion, to that theoretical Gödel machine

00:19:33.819 --> 00:19:37.660
we just discussed, an AI that perfectly, flawlessly,

00:19:37.900 --> 00:19:40.380
and provably optimizes its own source code and

00:19:40.380 --> 00:19:42.960
learning procedures across every possible domain,

00:19:43.599 --> 00:19:46.039
does human -led machine learning research eventually

00:19:46.039 --> 00:19:48.480
make itself completely obsolete? It's a heavy

00:19:48.480 --> 00:19:51.559
question. If an AI becomes the ultimate self

00:19:51.559 --> 00:19:54.140
-improving meta -learner, the very last thing

00:19:54.140 --> 00:19:56.440
human engineers may ever need to teach a machine

00:19:56.440 --> 00:19:59.720
is simply the desire to learn. Thank you for

00:19:59.720 --> 00:20:01.579
joining us on this deep dive into the architecture

00:20:01.579 --> 00:20:03.890
of intelligence. Keep questioning the systems

00:20:03.890 --> 00:20:05.509
around you and keep learning how to learn.
