WEBVTT

00:00:00.000 --> 00:00:03.080
Right now, the most advanced artificial intelligence

00:00:03.080 --> 00:00:06.400
in the world... It isn't actually being built

00:00:06.400 --> 00:00:08.859
by human engineers. Well, it's being built by

00:00:08.859 --> 00:00:11.080
other artificial intelligence. Yeah. And in many

00:00:11.080 --> 00:00:12.800
cases, it's actually doing a better job than

00:00:12.800 --> 00:00:15.300
we ever did. Exactly. I mean, think about the

00:00:15.300 --> 00:00:17.739
design of a skyscraper. You have a master architect

00:00:17.739 --> 00:00:21.739
who spends months or maybe years drafting blueprints,

00:00:22.039 --> 00:00:25.559
calculating load-bearing walls, optimizing elevator

00:00:25.559 --> 00:00:28.879
shafts. It's a very meticulous, deeply human

00:00:28.879 --> 00:00:31.420
process. Right. Totally. But imagine if you didn't

00:00:31.420 --> 00:00:33.600
need the architect at all. Imagine if you could

00:00:33.600 --> 00:00:35.869
just tell a computer, hey, I need a building

00:00:35.869 --> 00:00:39.109
that safely holds 5,000 people. And the software

00:00:39.109 --> 00:00:41.969
literally invents the concept of a steel I-beam

00:00:41.969 --> 00:00:44.630
on its own through, I don't know, millions of

00:00:44.630 --> 00:00:46.890
rounds of trial and error. The implications of

00:00:46.890 --> 00:00:50.590
that are staggering. Yeah. Because you are completely

00:00:50.590 --> 00:00:54.329
removing human intuition, and by extension, human

00:00:54.329 --> 00:00:57.409
bias, from the structural design process. Which

00:00:57.409 --> 00:00:59.469
is exactly why we're taking a deep dive today

00:00:59.469 --> 00:01:02.469
into how that exact shift is happening. But not

00:01:02.469 --> 00:01:04.709
with concrete and steel. We're talking about

00:01:04.709 --> 00:01:08.370
neural networks. So welcome to the deep dive.

00:01:09.209 --> 00:01:11.909
Our mission today is to explore a wildly fascinating

00:01:11.909 --> 00:01:14.010
subfield of automated machine learning known

00:01:14.010 --> 00:01:18.400
as neural architecture search, or NAS for short.

00:01:19.019 --> 00:01:21.420
Essentially, we're looking at how AI is learning

00:01:21.420 --> 00:01:23.920
to design its own brain structure. To set the

00:01:23.920 --> 00:01:25.780
stage for you, when we talk about artificial

00:01:25.780 --> 00:01:27.500
neural networks, we're talking about the underlying

00:01:27.500 --> 00:01:29.920
mathematical models that power, well, everything

00:01:29.920 --> 00:01:32.739
from image recognition on your phone to real-time

00:01:32.739 --> 00:01:35.819
language translation. Historically, human

00:01:35.819 --> 00:01:37.879
researchers had to manually connect the layers.

00:01:38.099 --> 00:01:40.180
They had to decide the number of nodes, tune

00:01:40.180 --> 00:01:42.939
the pathways of these networks. Neural architecture

00:01:42.939 --> 00:01:45.140
search, it automates that incredibly tedious

00:01:45.140 --> 00:01:49.200
design process. And we know exactly who is listening

00:01:49.200 --> 00:01:52.180
to this right now. You are the learner. Whether

00:01:52.180 --> 00:01:54.319
you're, I don't know, prepping for a high-stakes

00:01:54.319 --> 00:01:56.420
tech meeting, trying to catch up on the latest

00:01:56.420 --> 00:01:59.540
machine learning paradigms, or you're just insanely

00:01:59.540 --> 00:02:03.420
curious about the cutting edge of AI. We're going

00:02:03.420 --> 00:02:05.879
to break down how these self-designed networks

00:02:05.879 --> 00:02:08.300
function. And why they're outperforming the ones

00:02:08.300 --> 00:02:11.479
that were meticulously handcrafted by human engineers.

00:02:11.699 --> 00:02:13.719
OK, let's unpack this. Because understanding

00:02:13.719 --> 00:02:17.439
how NAS actually functions, it requires us to

00:02:17.439 --> 00:02:20.360
look at its three foundational pillars. If an

00:02:20.360 --> 00:02:23.560
AI is going to build another AI, it needs rules,

00:02:23.800 --> 00:02:25.699
right? Absolutely. And according to the research,

00:02:25.759 --> 00:02:28.120
those rules are broken down into the search space,

00:02:28.500 --> 00:02:30.939
the search strategy, and the performance estimation

00:02:30.939 --> 00:02:33.400
strategy. Yeah. You can't just hand a computer

00:02:33.400 --> 00:02:35.479
a blank slate and say, you know, build a network.

00:02:35.620 --> 00:02:38.800
It would just panic, right? Pretty much. Mathematical

00:02:38.800 --> 00:02:41.000
possibilities are infinite, and the computer

00:02:41.000 --> 00:02:43.919
would just freeze up trying to calculate literally

00:02:43.919 --> 00:02:46.240
everything. So first, you have to define the

00:02:46.240 --> 00:02:48.599
search space. This sets the physical boundaries.

00:02:49.500 --> 00:02:52.139
It defines the specific types of artificial neural

00:02:52.139 --> 00:02:54.919
network structures that can actually be explored

00:02:54.919 --> 00:02:57.919
by the system. And we usually represent this

00:02:57.919 --> 00:03:02.969
space as a directed acyclic graph or a DAG. Let's

00:03:02.969 --> 00:03:05.610
pause on that term for a second. Directed acyclic

00:03:05.610 --> 00:03:07.969
graph. That just means the data flows in one

00:03:07.969 --> 00:03:09.710
specific direction, right, from the input to

00:03:09.710 --> 00:03:12.310
the output. And acyclic means there are no loops.

00:03:12.669 --> 00:03:15.229
Yes, exactly. And that is a crucial constraint.

00:03:15.569 --> 00:03:17.810
Because if there were loops, the data processing

00:03:17.810 --> 00:03:20.530
could theoretically get stuck in an infinite

00:03:20.530 --> 00:03:22.729
cycle. It's just bouncing around forever. Right,

00:03:22.729 --> 00:03:24.490
just passing the same numbers back and forth

00:03:24.490 --> 00:03:27.129
forever. The DAG ensures the network actually

00:03:27.129 --> 00:03:29.680
produces a final answer. So once you have that

00:03:29.680 --> 00:03:31.860
space defined, you move to the search strategy.
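
To make the search space idea concrete, here is a minimal sketch (purely illustrative, not from the source material): each node in a tiny graph picks one operation from a fixed menu and one earlier node as its input, so data can only flow forward and the graph stays acyclic.

```python
import itertools

# Hypothetical menu of operations a node may choose from.
OPERATIONS = ["conv3x3", "conv5x5", "max_pool", "skip"]

def enumerate_architectures(num_nodes=3):
    """Enumerate every architecture in a tiny DAG-shaped search space.

    Node i picks one operation and one predecessor j < i (node 0 is the
    network input), so every edge points forward and no loops can form.
    """
    choices_per_node = []
    for i in range(1, num_nodes + 1):
        choices_per_node.append(list(itertools.product(OPERATIONS, range(i))))
    # The whole search space is the Cartesian product of per-node choices.
    return list(itertools.product(*choices_per_node))

space = enumerate_architectures(num_nodes=3)
print(len(space), "candidate architectures")  # 4 * 8 * 12 = 384
print("one example:", space[0])
```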

00:03:31.979 --> 00:03:33.879
The second pillar. Right. This is the actual

00:03:33.879 --> 00:03:36.360
algorithm the system uses to navigate that massive

00:03:36.360 --> 00:03:39.060
space of possibilities and, well, zero in on

00:03:39.060 --> 00:03:41.539
the best configuration. Got it. And finally,

00:03:41.539 --> 00:03:44.000
you have the performance estimation strategy.

00:03:44.919 --> 00:03:47.979
This is how the system evaluates whether a potential

00:03:47.979 --> 00:03:51.240
design is actually any good without being forced

00:03:51.240 --> 00:03:54.080
to fully build and train the entire network from

00:03:54.080 --> 00:03:56.669
scratch every single time. Which would take centuries

00:03:56.669 --> 00:03:59.430
of computing time, honestly. Yeah, to make this

00:03:59.430 --> 00:04:01.810
tangible, I like to think of neural architecture

00:04:01.810 --> 00:04:06.129
search like... tackling an endless Lego set.

00:04:06.409 --> 00:04:09.629
OK, I like that. So the search space is all the

00:04:09.629 --> 00:04:11.789
possible Lego pieces you have dumped out on the

00:04:11.789 --> 00:04:14.169
floor. You know, the specific bricks, the gears,

00:04:14.270 --> 00:04:16.990
the little windows. The search strategy is the

00:04:16.990 --> 00:04:19.290
experimental process you use to actually build,

00:04:19.529 --> 00:04:21.269
like grabbing a piece, trying to snap it together,

00:04:21.310 --> 00:04:23.430
seeing if it fits. Right, right. And the performance

00:04:23.430 --> 00:04:26.129
estimation strategy. That's your ability to look

00:04:26.129 --> 00:04:28.730
at a half-finished structure and accurately guess

00:04:28.730 --> 00:04:30.829
if the final Lego house will be structurally

00:04:30.829 --> 00:04:33.490
sound without having to actually build the entire

00:04:33.519 --> 00:04:36.540
roof first to see if it collapses. What's fascinating

00:04:36.540 --> 00:04:38.939
here is the philosophical shift this represents

00:04:38.939 --> 00:04:41.759
in computer science. How so? Well, we are basically

00:04:41.759 --> 00:04:45.259
abandoning human intuition. The old way was an

00:04:45.259 --> 00:04:47.439
engineer looking at a problem and saying, I think

00:04:47.439 --> 00:04:50.199
a data layer of this size makes sense here based

00:04:50.199 --> 00:04:53.660
on my experience. Now we're moving to a paradigm

00:04:53.660 --> 00:04:58.620
of pure automated optimization. NAS is deeply

00:04:58.620 --> 00:05:01.459
connected to hyperparameter optimization and

00:05:01.459 --> 00:05:03.240
something called meta-learning. Meta-learning,

00:05:03.300 --> 00:05:04.899
that's a great concept. Yeah, meta-learning

00:05:04.899 --> 00:05:08.660
is, at its core, learning how to learn. So instead

00:05:08.660 --> 00:05:10.779
of optimizing a neural network to recognize a

00:05:10.779 --> 00:05:13.660
picture of a cat, you are optimizing the overarching

00:05:13.660 --> 00:05:16.259
system that builds the network that eventually

00:05:16.259 --> 00:05:18.699
recognizes the cat. It's basically turtles all

00:05:18.699 --> 00:05:21.480
the way down. It really is. So, okay, now we

00:05:21.480 --> 00:05:23.819
have the rules of the game. But how did the system

00:05:23.819 --> 00:05:26.730
actually play the game in the early days? This

00:05:26.730 --> 00:05:28.930
brings us to the pioneers of this field, who

00:05:28.930 --> 00:05:31.870
essentially used, well, brute force to kick down

00:05:31.870 --> 00:05:33.449
the door. Yeah, they didn't have the elegant

00:05:33.449 --> 00:05:36.110
methods we have today. The earliest major successes

00:05:36.110 --> 00:05:38.089
in this field relied heavily on reinforcement

00:05:38.089 --> 00:05:41.810
learning, or RL. You basically treat the design

00:05:41.810 --> 00:05:45.009
of the AI architecture as a video game. You have

00:05:45.009 --> 00:05:47.230
an AI agent, which is usually called a controller,

00:05:47.829 --> 00:05:50.170
making design choices step by step. Okay, so

00:05:50.170 --> 00:05:52.870
it's playing the game. Exactly. It picks a layer,

00:05:53.290 --> 00:05:55.639
chooses a connection, and then it gets a reward

00:05:55.639 --> 00:05:58.180
signal based on how accurate the resulting network

00:05:58.180 --> 00:06:01.339
is. If the network performs well, the controller

00:06:01.339 --> 00:06:04.259
updates its internal math to favor those kinds

00:06:04.259 --> 00:06:06.339
of choices in the future. The source material

00:06:06.339 --> 00:06:09.060
highlights some incredible early work here by

00:06:09.060 --> 00:06:13.000
researchers Barret Zoph and Quoc V. Le. They

00:06:13.000 --> 00:06:15.000
unleashed a reinforcement learning controller

00:06:15.000 --> 00:06:18.079
on a famous image classification data set called

00:06:18.079 --> 00:06:21.300
CIFAR-10. CIFAR-10, yeah, a classic benchmark. But

00:06:21.300 --> 00:06:22.980
we shouldn't just gloss over what that actually

00:06:22.980 --> 00:06:25.120
looked like. The controller literally wrote out

00:06:25.120 --> 00:06:27.199
a string of code describing an architecture,

00:06:27.860 --> 00:06:29.959
built it, tested it, and learned from the result.
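
As a rough sketch of that loop, assuming toy stand-ins for the controller and the reward (this is not the Zoph and Le code, just an illustration of the reinforcement-learning idea): sample a design, score it, and nudge the controller's preferences toward choices that earned higher rewards.

```python
import math
import random

OPS = ["conv3x3", "conv5x5", "max_pool", "skip"]
NUM_LAYERS = 4

def evaluate(arch):
    """Hypothetical stand-in for 'train the candidate and report accuracy'."""
    score = sum(1.0 if op == "conv3x3" else 0.2 for op in arch) / len(arch)
    return score + random.gauss(0, 0.05)   # noisy, like a real training run

# Controller state: one preference score per (layer position, operation).
prefs = [{op: 0.0 for op in OPS} for _ in range(NUM_LAYERS)]

def sample_architecture():
    arch = []
    for layer in prefs:
        weights = [math.exp(layer[op]) for op in OPS]        # softmax-style
        arch.append(random.choices(OPS, weights=weights)[0])
    return arch

baseline = 0.0
for step in range(300):
    arch = sample_architecture()
    reward = evaluate(arch)
    baseline = 0.9 * baseline + 0.1 * reward                 # running average
    for layer, op in zip(prefs, arch):
        layer[op] += 0.1 * (reward - baseline)               # favor what worked

print("controller's favorite design:",
      [max(layer, key=layer.get) for layer in prefs])
```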

00:06:30.040 --> 00:06:32.740
And it did this thousands of times over. And

00:06:32.740 --> 00:06:34.560
the results they achieved, I mean, they sent

00:06:34.560 --> 00:06:37.329
shockwaves through the community. The AI designed

00:06:37.329 --> 00:06:40.110
a network that rivaled the absolute best manually

00:06:40.110 --> 00:06:42.790
designed architectures in existence at the time.

00:06:42.790 --> 00:06:46.509
It hit an error rate of just 3.65 on the dataset.

00:06:47.019 --> 00:06:49.899
To put that in perspective for you, it was slightly

00:06:49.899 --> 00:06:53.899
more accurate and over 1.05 times faster than

00:06:53.899 --> 00:06:56.660
a highly optimized, related model built by a

00:06:56.660 --> 00:06:59.060
whole team of human experts. And not only that,

00:06:59.079 --> 00:07:01.759
but they tested it on language, too. On the Penn

00:07:01.759 --> 00:07:04.120
Treebank dataset, which is this massive collection

00:07:04.120 --> 00:07:06.879
of text used to train language models, their

00:07:06.879 --> 00:07:09.720
RL model created a recurrent cell that outperformed

00:07:09.720 --> 00:07:12.160
the leading human design system by a massive

00:07:12.160 --> 00:07:14.990
margin. Right, the perplexity score. Exactly.

00:07:15.110 --> 00:07:18.149
It achieved a 3.6 improvement in its perplexity

00:07:18.149 --> 00:07:21.310
score, which that's just a metric showing how

00:07:21.310 --> 00:07:23.810
confidently a model predicts the next word in

00:07:23.810 --> 00:07:25.610
a sequence. The bottleneck, however, became obvious

00:07:25.610 --> 00:07:27.990
very quickly. Yeah, I bet. Training an architecture

00:07:27.990 --> 00:07:30.470
from scratch on massive data sets like ImageNet,

00:07:30.829 --> 00:07:33.069
which contains millions of high-resolution photos,

00:07:33.449 --> 00:07:35.149
it just takes an absurd amount of time. Right.

00:07:35.329 --> 00:07:37.629
So the researchers adapted. They developed something

00:07:37.629 --> 00:07:40.120
called NASNet. They realized that instead of

00:07:40.120 --> 00:07:43.560
having the AI design an entire massive network

00:07:43.560 --> 00:07:46.300
all at once, they could just have the AI design

00:07:46.300 --> 00:07:49.519
a highly efficient building block on a small

00:07:49.519 --> 00:07:52.660
data set. And then they just copy and paste that

00:07:52.660 --> 00:07:55.279
block to handle the massive data set. It's basically

00:07:55.279 --> 00:07:57.720
the concept of transfer learning applied to architecture.

00:07:58.040 --> 00:08:00.399
Precisely. They constrain the search space to

00:08:00.399 --> 00:08:03.000
just two types of convolutional cells. First,

00:08:03.180 --> 00:08:05.579
you have normal cells, which process the data

00:08:05.579 --> 00:08:07.660
but keep the dimensions, like the height and

00:08:07.660 --> 00:08:09.540
width of the image map, the exact same. Right.

00:08:09.860 --> 00:08:12.379
And second, you have reduction cells. And reduction

00:08:12.379 --> 00:08:14.939
cells do exactly what the name implies. They

00:08:14.939 --> 00:08:17.560
reduce the height and width of the data map by

00:08:17.560 --> 00:08:20.529
a factor of two. They typically do this by skipping

00:08:20.529 --> 00:08:23.350
over every other pixel of data, which in machine

00:08:23.350 --> 00:08:26.050
learning is called using a stride of two. By

00:08:26.050 --> 00:08:28.629
shrinking the data map, the network forces the

00:08:28.629 --> 00:08:31.730
system to extract only the most important, high-level

00:08:31.730 --> 00:08:34.779
features of the image. The AI's only job

00:08:34.779 --> 00:08:36.899
was to figure out the internal wiring of these

00:08:36.899 --> 00:08:39.320
specific cells and how to stack them together.
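
A minimal sketch of the stacking idea, with hypothetical cell functions that only track feature-map shapes (not the actual NASNet code): normal cells leave the dimensions alone, reduction cells halve them with stride two, and the searched cells are simply repeated to build a deeper network.

```python
def normal_cell(height, width, channels):
    """Searched cell that keeps the spatial dimensions unchanged."""
    return height, width, channels

def reduction_cell(height, width, channels):
    """Searched cell with stride 2: halves height/width, doubles channels."""
    return height // 2, width // 2, channels * 2

def build_network(h, w, c, pattern):
    """Stack copies of the two searched cells, NASNet-style."""
    for cell in pattern:
        h, w, c = cell(h, w, c)
        print(f"{cell.__name__:>14}: {h} x {w} x {c}")
    return h, w, c

# Design the cells once on a small dataset, then just repeat them for a big one.
pattern = [normal_cell, normal_cell, reduction_cell,
           normal_cell, normal_cell, reduction_cell]
build_network(32, 32, 16, pattern)   # e.g. a CIFAR-10 sized input
```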

00:08:39.519 --> 00:08:42.919
And the payoff for that was massive. When that

00:08:42.919 --> 00:08:45.500
AI -designed architecture was applied to the

00:08:45.500 --> 00:08:49.120
giant ImageNet data set, it achieved over 82%

00:08:49.120 --> 00:08:52.659
top-1 accuracy. But the accuracy isn't even

00:08:52.659 --> 00:08:55.360
the most impressive part here. No! No. The AI-designed

00:08:55.360 --> 00:08:58.299
NASNet exceeded the best human-invented

00:08:58.299 --> 00:09:02.580
architectures while using 9 billion fewer FLOPs.

00:09:02.940 --> 00:09:06.200
That is 9 billion fewer floating-point operations

00:09:06.200 --> 00:09:09.759
in total. That is a 28% reduction in computational

00:09:09.759 --> 00:09:12.139
cost for a better result. It found efficiencies

00:09:12.139 --> 00:09:14.840
humans just couldn't see. Yeah, it's wild. And

00:09:14.840 --> 00:09:16.539
reinforcement learning wasn't the only brute

00:09:16.539 --> 00:09:19.019
force method they tried. Researchers also used

00:09:19.019 --> 00:09:21.559
evolutionary algorithms. This is where the process

00:09:21.559 --> 00:09:23.980
gets incredibly Darwinian. Very survival of the

00:09:23.980 --> 00:09:26.399
fittest. Exactly. You start with a pool of candidate

00:09:26.399 --> 00:09:28.860
architectures. Then you literally mutate them,

00:09:29.039 --> 00:09:31.860
like randomly swapping out a large 5x5 convolution

00:09:31.860 --> 00:09:34.220
filter for a smaller 3x3 one. You

00:09:34.220 --> 00:09:36.299
test the mutated networks on the data and you

00:09:36.299 --> 00:09:38.320
kill off the lowest scores, replacing them with

00:09:38.320 --> 00:09:40.179
the fittest new designs from the next generation.
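
Here's a small, purely illustrative sketch of that evolutionary loop, with a made-up fitness function standing in for a real validation run: mutate candidates, score them, and keep only the fittest.

```python
import random

OPS = ["conv3x3", "conv5x5", "max_pool", "skip"]

def fitness(arch):
    """Hypothetical proxy for validation accuracy."""
    return sum(1.0 if op == "conv3x3" else 0.3 for op in arch) + random.gauss(0, 0.1)

def mutate(arch):
    """Randomly swap one operation, e.g. a 5x5 conv for a 3x3 conv."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)
    return child

population = [[random.choice(OPS) for _ in range(6)] for _ in range(20)]
for generation in range(50):
    children = [mutate(random.choice(population)) for _ in range(20)]
    # Survival of the fittest: keep the top 20 of parents plus children.
    population = sorted(population + children, key=fitness, reverse=True)[:20]

print("fittest architecture found:", population[0])
```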

00:09:40.539 --> 00:09:43.779
It perfectly mirrors biological evolution. Over

00:09:43.779 --> 00:09:46.720
many generations, the candidate pool gets refined

00:09:46.720 --> 00:09:49.139
through natural selection. And studies showed

00:09:49.139 --> 00:09:51.980
that on data sets like CIFAR-10 and ImageNet,

00:09:52.440 --> 00:09:54.879
this evolutionary approach performed just as

00:09:54.879 --> 00:09:56.919
well as the reinforcement learning methods. OK,

00:09:56.919 --> 00:09:58.820
I have to step in here on behalf of the listener

00:09:58.820 --> 00:10:01.559
because there is a glaring logistical problem

00:10:01.519 --> 00:10:05.480
with everything we just discussed. Wait, if we

00:10:05.480 --> 00:10:08.340
are randomly mutating architectures or having

00:10:08.340 --> 00:10:10.539
a controller spit out thousands of designs and

00:10:10.539 --> 00:10:12.679
then training every single one of those models

00:10:12.679 --> 00:10:14.860
from scratch to see how they perform, doesn't

00:10:14.860 --> 00:10:17.139
that take an astronomical, almost impractical

00:10:17.139 --> 00:10:19.500
amount of computing power? Oh, absolutely. That

00:10:19.500 --> 00:10:22.159
is the crucial flaw of the early methods. The

00:10:22.159 --> 00:10:24.360
computing power required was staggering. I can

00:10:24.360 --> 00:10:26.580
only imagine. We're talking about thousands of

00:10:26.580 --> 00:10:29.320
GPU days, which means running a high-end graphics

00:10:29.320 --> 00:10:32.039
processing unit at maximum capacity for years

00:10:32.039 --> 00:10:35.000
of continuous time just to find one single architecture.

00:10:35.480 --> 00:10:37.820
Only massive tech companies could afford the

00:10:37.820 --> 00:10:40.200
electricity bill for that. So, if training a

00:10:40.200 --> 00:10:42.419
million networks from scratch is the bottleneck,

00:10:42.779 --> 00:10:44.320
they must have found a way to recycle the work

00:10:44.320 --> 00:10:46.600
they already did. Did they figure out a way to

00:10:46.600 --> 00:10:48.799
stop starting from zero every time? You hit the

00:10:48.799 --> 00:10:51.809
nail on the head. That exact realization led

00:10:51.809 --> 00:10:54.710
to the development of ENAS, or Efficient Neural

00:10:54.710 --> 00:10:58.070
Architecture Search. ENAS solved the compute

00:10:58.070 --> 00:11:00.389
bottleneck through parameter sharing. Instead

00:11:00.389 --> 00:11:02.909
of training thousands of distinct networks independently,

00:11:03.450 --> 00:11:05.809
ENAS uses a controller to search for an optimal

00:11:05.809 --> 00:11:08.669
subgraph within one giant overarching graph.

00:11:09.490 --> 00:11:11.610
Instead of building a thousand separate houses

00:11:11.610 --> 00:11:14.309
to see which one stands up, you build one gigantic

00:11:14.309 --> 00:11:17.590
mansion and the AI just evaluates different suites

00:11:17.590 --> 00:11:19.549
of rooms inside that mansion. That is a great

00:11:19.549 --> 00:11:21.809
way to visualize it. All the different child

00:11:21.809 --> 00:11:24.490
models essentially share parameters, the learned

00:11:24.490 --> 00:11:27.129
mathematical weights. Because the models are

00:11:27.129 --> 00:11:28.789
sharing the heavy lifting that has already been

00:11:28.789 --> 00:11:31.570
computed, ENAS required a thousand-fold fewer

00:11:31.570 --> 00:11:34.750
GPU hours than standard NAS methods, while still

00:11:34.750 --> 00:11:37.230
achieving comparable error rates. A thousand-fold

00:11:37.230 --> 00:11:39.549
reduction in time and energy? I mean, that

00:11:39.549 --> 00:11:42.149
transition completely redefines the field. It

00:11:42.149 --> 00:11:44.490
brings us right into the modern era of NAS, the

00:11:44.490 --> 00:11:47.009
war on inefficiency. And the ultimate weapon

00:11:47.009 --> 00:11:49.690
in that war is the one-shot model. ENAS laid

00:11:49.690 --> 00:11:52.659
the groundwork. Yeah. And one-shot models take

00:11:52.659 --> 00:11:55.019
that weight sharing concept to its absolute logical

00:11:55.019 --> 00:11:57.820
conclusion. Researchers define what is called

00:11:57.820 --> 00:12:01.580
a super network. This is a massive directed acyclic

00:12:01.580 --> 00:12:04.639
graph that contains every possible operation

00:12:04.639 --> 00:12:07.340
and connection allowed in the entire search space.
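
A toy sketch of the weight-sharing idea, where a single number stands in for a whole trained layer (illustrative only): every candidate operation on every edge lives inside one super network, so evaluating a child model is just selecting a subgraph and reusing the shared weights rather than training from scratch.

```python
import random

OPS = ["conv3x3", "conv5x5", "max_pool", "skip"]
NUM_EDGES = 4

# The super network: every edge holds one shared weight per candidate op.
# (Here a single float stands in for a whole trained layer.)
supernet = [{op: random.random() for op in OPS} for _ in range(NUM_EDGES)]

def evaluate_subgraph(choice_per_edge):
    """Score a child model by reusing the shared weights it selects,
    instead of training a brand-new network from scratch."""
    return sum(supernet[edge][op] for edge, op in enumerate(choice_per_edge))

# The search now just compares subgraphs of one big, already-trained graph.
candidates = [[random.choice(OPS) for _ in range(NUM_EDGES)] for _ in range(100)]
best = max(candidates, key=evaluate_subgraph)
print("best subgraph found:", best)
```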

00:12:07.340 --> 00:12:09.919
So, what does this all mean? Let's take that

00:12:09.919 --> 00:12:11.820
Swiss Army knife analogy and stretch it a bit.

00:12:12.259 --> 00:12:14.279
A super network is like buying a giant fully

00:12:14.279 --> 00:12:17.240
loaded Swiss Army knife. It has scissors, pliers,

00:12:17.679 --> 00:12:20.340
a saw, three different blades, a magnifying glass.

00:12:20.539 --> 00:12:22.460
Right. It has everything. Instead of a blacksmith

00:12:22.460 --> 00:12:25.000
forging a brand new, highly specific tool from

00:12:25.000 --> 00:12:27.480
scratch every time you have a new task, the AI

00:12:27.480 --> 00:12:30.240
simply evaluates the giant tool it already has

00:12:30.240 --> 00:12:32.659
and figures out which specific blades to flip

00:12:32.659 --> 00:12:35.860
out for the job. Yes, and recent advancements

00:12:35.860 --> 00:12:38.379
made that process even smoother by combining

00:12:38.379 --> 00:12:40.840
the weight-sharing paradigm with a mathematical

00:12:40.840 --> 00:12:44.120
trick called continuous relaxation. This created

00:12:44.120 --> 00:12:47.600
a subfield called differentiable NAS. The most

00:12:47.600 --> 00:12:50.539
popular algorithm here is known as DARTS. Differentiable

00:12:50.539 --> 00:12:52.659
meaning they can use gradient-based optimization,

00:12:52.899 --> 00:12:54.919
right? Exactly. Let's explain how continuous

00:12:54.919 --> 00:12:57.299
relaxation actually works because it's brilliant.

00:12:58.039 --> 00:13:00.879
Instead of the AI making a hard, discrete choice

00:13:00.879 --> 00:13:03.480
like, you know, I will strictly use layer A and

00:13:03.480 --> 00:13:07.200
entirely ignore layer B, it calculates a probability

00:13:07.200 --> 00:13:09.779
for everything. It can say, I'm 80% sure I

00:13:09.779 --> 00:13:12.039
should use layer A and 20% sure I should use

00:13:12.039 --> 00:13:14.299
layer B. Going back to your Swiss army knife,

00:13:14.659 --> 00:13:16.759
it's like the AI pulling the saw blade out 80%

00:13:16.759 --> 00:13:18.960
of the way and the scissors out 20% of the

00:13:18.960 --> 00:13:21.299
way, testing the cut and letting the underlying

00:13:21.299 --> 00:13:23.259
math tell it which tool to push out further.
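
A minimal sketch of continuous relaxation, with toy operations on a single number standing in for real layers (the names and alpha values are illustrative): instead of choosing one operation, the node computes a softmax-weighted blend of all of them, and those weights are what gradient descent tunes.

```python
import math

# Toy candidate operations (stand-ins for real layers).
ops = {
    "layer_a": lambda x: 2.0 * x,     # e.g. a convolution
    "layer_b": lambda x: x + 1.0,     # e.g. a pooling op
    "skip":    lambda x: x,           # identity / skip connection
}

# Architecture parameters: one continuous score per candidate op.
alpha = {"layer_a": 0.8, "layer_b": -0.4, "skip": 0.1}

def softmax(scores):
    exps = {k: math.exp(v) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

def mixed_op(x):
    """Continuous relaxation: blend every op, weighted by softmax(alpha).
    The weights vary smoothly, so gradients can nudge alpha toward the
    operations that reduce the loss."""
    weights = softmax(alpha)
    return sum(weights[name] * op(x) for name, op in ops.items())

print("mixed output:", mixed_op(3.0))
# After training, the discrete architecture keeps the highest-weighted op:
weights = softmax(alpha)
print("final choice:", max(weights, key=weights.get))
```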

00:13:23.320 --> 00:13:25.500
That's a great image. It allows the system to

00:13:25.500 --> 00:13:27.820
smoothly slide down the mathematical hill toward

00:13:27.820 --> 00:13:30.279
the most optimal answer, rather than randomly

00:13:30.279 --> 00:13:32.639
jumping around between discrete choices. Though

00:13:32.639 --> 00:13:34.919
the source material notes that DARTS ran into

00:13:34.919 --> 00:13:38.059
a massive wall early on called performance collapse.

00:13:38.539 --> 00:13:41.139
Oh yeah, that was a big hurdle. Basically, the

00:13:41.139 --> 00:13:44.360
system would find a cheat code. It would rely

00:13:44.360 --> 00:13:47.299
too heavily on skip connections, where the data

00:13:47.299 --> 00:13:49.580
just bypasses a mathematical layer entirely.

00:13:50.240 --> 00:13:52.460
Skip connections are mathematically cheap and

00:13:52.460 --> 00:13:54.919
fast, so the optimization algorithm just loved

00:13:54.919 --> 00:13:57.440
them. But if your network is just a bunch of

00:13:57.440 --> 00:13:59.759
wires bypassing all the actual processing layers,

00:13:59.919 --> 00:14:02.279
it doesn't learn anything, which results in terrible

00:14:02.279 --> 00:14:04.580
generalization when it's given new data. The

00:14:04.580 --> 00:14:06.799
community had to engineer their way out of that

00:14:06.799 --> 00:14:09.820
trap. They introduced fixes like Hessian norm

00:14:09.820 --> 00:14:12.419
regularizations. Which sounds incredibly dense.

00:14:12.500 --> 00:14:14.519
Let's unpack what a Hessian norm actually does.

00:14:14.600 --> 00:14:17.159
In simple terms, it smooths out the mathematical

00:14:17.159 --> 00:14:20.840
landscape. Without it, the AI might find a steep,

00:14:21.019 --> 00:14:24.399
narrow valley of accuracy, like that skip connection

00:14:24.399 --> 00:14:26.460
cheat code that works perfectly for the training

00:14:26.460 --> 00:14:28.639
data but fails completely in the real world.

00:14:28.759 --> 00:14:31.960
Ah, OK. Hessian norm regularization actively

00:14:31.960 --> 00:14:34.940
punishes the AI for picking those sharp, unstable

00:14:34.940 --> 00:14:37.679
valleys, forcing it to find a wider, smoother

00:14:37.679 --> 00:14:39.960
mathematical solution that will actually hold

00:14:39.960 --> 00:14:42.019
up when it sees data it has never encountered

00:14:42.019 --> 00:14:44.200
before. And here's where it gets really interesting.

00:14:44.840 --> 00:14:47.440
Once those mathematical traps were smoothed out,

00:14:47.840 --> 00:14:50.080
the time savings of these differentiable NAS

00:14:50.080 --> 00:14:53.100
models became staggering. The research points

00:14:53.100 --> 00:14:56.179
to a model called FBNet, which stands for Facebook

00:14:56.179 --> 00:14:59.039
Berkeley Network. FBNet is a perfect example

00:14:59.039 --> 00:15:02.639
of this efficiency. By utilizing a differentiable

00:15:02.639 --> 00:15:05.940
super network search, FBNet discovered architectures

00:15:05.940 --> 00:15:08.799
that completely beat the speed and accuracy trade-offs

00:15:08.799 --> 00:15:12.120
of previous human and AI models. But the

00:15:12.120 --> 00:15:14.440
real headline is that it accomplished this using

00:15:14.440 --> 00:15:16.940
over 400 times less search time than the older

00:15:16.940 --> 00:15:19.080
reinforcement learning methods. Wait, so you

00:15:19.080 --> 00:15:21.379
are telling me it did the exact same highly complex

00:15:21.379 --> 00:15:24.200
design job, but it did it 400 times faster. Exactly.

00:15:24.620 --> 00:15:27.600
And another model... SqueezeNAS produced networks

00:15:27.600 --> 00:15:31.460
that outperformed MobileNet v3 for semantic segmentation.

00:15:31.620 --> 00:15:33.899
Semantic segmentation being the process of categorizing

00:15:33.899 --> 00:15:36.340
an image pixel by pixel, right? Like identifying

00:15:36.340 --> 00:15:38.740
where a road ends and a sidewalk begins. Exactly.

00:15:39.179 --> 00:15:41.200
SqueezeNAS found a better architecture using

00:15:41.200 --> 00:15:43.600
over a hundred times less search time than the

00:15:43.600 --> 00:15:47.139
search used to find MobileNet v3. We really shouldn't

00:15:47.139 --> 00:15:49.779
overlook why models like SqueezeNAS and MobileNet

00:15:49.779 --> 00:15:53.460
are so critical. It brings us to the next massive

00:15:53.460 --> 00:15:56.279
evolution in the field, which is multi-objective

00:15:56.279 --> 00:15:58.639
search. Right. Because in the real world, having

00:15:58.639 --> 00:16:02.759
a network that is 99.9% accurate is totally

00:16:02.759 --> 00:16:05.340
useless if it requires a supercomputer drawing

00:16:05.340 --> 00:16:08.259
massive amounts of power to run. If you want

00:16:08.259 --> 00:16:10.860
to put an AI in a self-driving car, a smart

00:16:10.860 --> 00:16:13.919
doorbell, or a medical device, it has to be efficient.

00:16:14.220 --> 00:16:16.580
If we connect this to the bigger picture, think

00:16:16.580 --> 00:16:18.480
about the smartphone you're using right now to

00:16:18.480 --> 00:16:20.980
listen to this deep dive. Your phone has limited

00:16:20.980 --> 00:16:23.580
battery life, limited thermal capacity so it

00:16:23.580 --> 00:16:27.460
doesn't overheat, and limited memory. Multi-objective

00:16:27.460 --> 00:16:29.980
search algorithms force the AI to optimize for

00:16:29.980 --> 00:16:33.120
those exact physical constraints while it designs

00:16:33.120 --> 00:16:35.470
the network. The sources mentioned an evolutionary

00:16:35.470 --> 00:16:37.830
algorithm called LEMONADE, which first of all,

00:16:38.129 --> 00:16:40.330
great acronym. It's brilliant. LEMONADE actually

00:16:40.330 --> 00:16:43.330
utilizes Lamarckian evolution, which is a biological

00:16:43.330 --> 00:16:45.929
theory where traits acquired by an organism during

00:16:45.929 --> 00:16:48.149
its lifetime can be passed on to its offspring.

00:16:48.649 --> 00:16:50.970
It uses that to efficiently optimize for multiple

00:16:50.970 --> 00:16:53.750
objectives simultaneously. Then you have systems

00:16:53.750 --> 00:16:56.929
like Neural Architect, which uses reinforcement

00:16:56.929 --> 00:17:00.009
learning, but specifically bakes in penalties

00:17:00.009 --> 00:17:03.389
for memory consumption, model size, and inference

00:17:03.389 --> 00:17:06.250
time. Meaning how fast the model actually spits

00:17:06.250 --> 00:17:08.890
out a prediction. Right. So the AI is literally

00:17:08.890 --> 00:17:11.869
being rewarded for designing a network that will

00:17:11.869 --> 00:17:14.890
not drain your phone's battery.
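
A tiny, illustrative example of that kind of reward, with made-up penalty weights: accuracy is paid out, while latency, memory, and model size are fined, so a slightly less accurate but much lighter design can win the search.

```python
# Hypothetical multi-objective reward: the controller is paid for accuracy
# but fined for latency, memory use, and model size (the weights are assumptions).
def multi_objective_reward(accuracy, latency_ms, memory_mb, size_mb,
                           w_latency=0.002, w_memory=0.001, w_size=0.005):
    return (accuracy
            - w_latency * latency_ms
            - w_memory * memory_mb
            - w_size * size_mb)

big_model   = multi_objective_reward(accuracy=0.95, latency_ms=120, memory_mb=900, size_mb=80)
small_model = multi_objective_reward(accuracy=0.93, latency_ms=15, memory_mb=120, size_mb=8)
print(big_model, small_model, small_model > big_model)   # the lighter design wins
```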

00:17:14.890 --> 00:17:18.269
Which is an incredible leap forward for consumer technology. But even

00:17:18.269 --> 00:17:20.609
with all of these brilliant efficiencies, weight

00:17:20.609 --> 00:17:23.730
sharing, continuous relaxation, multi-objective

00:17:23.730 --> 00:17:27.089
goals, evaluating these architectures still requires

00:17:27.089 --> 00:17:28.730
running them on hardware. Yeah, you still have

00:17:28.730 --> 00:17:31.259
to compute it. Which requires electricity. Which

00:17:31.259 --> 00:17:34.339
brings up a very real, very modern problem for

00:17:34.339 --> 00:17:36.680
the AI industry, which is the carbon footprint.

00:17:37.299 --> 00:17:39.380
How do researchers at universities who don't

00:17:39.380 --> 00:17:42.359
have a multi-billion-dollar server farm test

00:17:42.359 --> 00:17:44.960
their new neural architecture search algorithms

00:17:44.960 --> 00:17:47.420
without burning a massive hole in the atmosphere?

00:17:47.660 --> 00:17:49.680
The compute bottleneck was becoming a massive

00:17:49.680 --> 00:17:52.500
barrier to entry, threatening to monopolize AI

00:17:52.500 --> 00:17:55.230
research entirely. The solution the community

00:17:55.230 --> 00:17:57.970
developed came in the form of NAS benchmarks.

00:17:58.589 --> 00:18:01.430
These are essentially massive, pre-calculated,

00:18:01.569 --> 00:18:04.970
queryable data sets. Researchers with heavy computing

00:18:04.970 --> 00:18:08.089
power created fixed search spaces and ran all

00:18:08.089 --> 00:18:10.390
the training pipelines in advance, logging the

00:18:10.390 --> 00:18:12.829
results of thousands of architectures. So now,

00:18:12.950 --> 00:18:15.950
if I have a brilliant new idea for a search strategy...

00:18:15.980 --> 00:18:18.000
I don't have to train a network from scratch

00:18:18.000 --> 00:18:19.819
to see if my algorithm made a good choice. I

00:18:19.819 --> 00:18:21.980
just look up the answer in the answer key. That

00:18:21.980 --> 00:18:25.220
is the exact mechanism. And there are two primary

00:18:25.220 --> 00:18:28.059
types of benchmarks you will encounter. A tabular

00:18:28.059 --> 00:18:31.420
benchmark literally queries the actual historically

00:18:31.420 --> 00:18:34.059
recorded performance of a specific architecture

00:18:34.059 --> 00:18:36.660
that was already physically trained to completion

00:18:36.660 --> 00:18:39.039
by the benchmark creators. And the second type

00:18:39.039 --> 00:18:42.250
is even wilder, a surrogate benchmark. This uses

00:18:42.250 --> 00:18:44.369
another completely separate neural network to

00:18:44.369 --> 00:18:46.609
predict how well the architecture will perform

00:18:46.609 --> 00:18:49.450
based on the data trends. It is intensely meta.
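
In code, the distinction is roughly this (an illustrative sketch with made-up numbers): a tabular benchmark is a lookup into results someone else already paid to compute, while a surrogate benchmark is a separate predictor that estimates results for designs that were never actually trained.

```python
# A toy "tabular benchmark": results were computed once, up front, by a group
# with heavy hardware, so a query is just a dictionary lookup on any CPU.
TABULAR_BENCHMARK = {
    ("conv3x3", "conv3x3", "max_pool"): 0.912,   # recorded test accuracy
    ("conv3x3", "skip", "conv5x5"): 0.887,
    ("skip", "skip", "max_pool"): 0.731,
}

def query_tabular(arch):
    """Return the historically recorded performance of this exact design."""
    return TABULAR_BENCHMARK[arch]

def query_surrogate(arch):
    """A stand-in 'surrogate': a separate model predicts accuracy even for
    designs that were never trained (here, a crude hand-made heuristic)."""
    op_score = {"conv3x3": 0.9, "conv5x5": 0.85, "max_pool": 0.8, "skip": 0.7}
    return sum(op_score[op] for op in arch) / len(arch)

print(query_tabular(("conv3x3", "skip", "conv5x5")))    # looked up, not trained
print(query_surrogate(("conv5x5", "conv5x5", "skip")))  # predicted, not trained
```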

00:18:49.970 --> 00:18:51.990
You have an AI predicting the accuracy of an

00:18:51.990 --> 00:18:54.329
architecture that was designed by another AI.

00:18:54.470 --> 00:18:57.049
Yeah, that's wild. But the practical real -world

00:18:57.049 --> 00:18:59.390
result is that these benchmarks can be queried

00:18:59.390 --> 00:19:02.430
in milliseconds using just a standard off-the-shelf

00:19:02.430 --> 00:19:05.670
desktop CPU. You no longer need a warehouse

00:19:05.670 --> 00:19:08.250
of GPUs to test whether your architecture search

00:19:08.250 --> 00:19:10.960
algorithm actually works. The sources also spotlight

00:19:10.960 --> 00:19:14.539
a few other extremely clever low-resource alternative

00:19:14.539 --> 00:19:17.000
methods for researchers on a budget, like hill

00:19:17.000 --> 00:19:19.799
climbing. They apply something called network

00:19:19.799 --> 00:19:22.539
morphisms, which basically means making surgical

00:19:22.539 --> 00:19:25.200
changes to the network's structure, like making

00:19:25.200 --> 00:19:27.740
a layer wider while perfectly preserving its

00:19:27.740 --> 00:19:30.579
mathematical function. Using this, researchers

00:19:30.579 --> 00:19:33.000
trained a highly competitive network on the CIFAR-10

00:19:33.000 --> 00:19:36.500
dataset with under a 5% error rate in just

00:19:36.500 --> 00:19:40.250
12 hours on a single standard GPU. There is also

00:19:40.250 --> 00:19:42.170
Bayesian optimization, which has always been

00:19:42.170 --> 00:19:44.589
a staple for hyperparameter tuning. Bayesian

00:19:44.589 --> 00:19:46.789
optimization uses an acquisition function to

00:19:46.789 --> 00:19:49.029
constantly manage the tension between exploration,

00:19:49.549 --> 00:19:52.210
which means trying wild, completely untested

00:19:52.210 --> 00:19:55.490
new designs, and exploitation. Exploitation meaning

00:19:55.490 --> 00:19:57.910
refining the minor details of the designs that

00:19:57.910 --> 00:20:00.450
are already proving to be highly accurate. Right.
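
A simplified, illustrative version of that acquisition trade-off, in the spirit of an upper-confidence-bound rule (the numbers are invented): the candidate with the best combination of predicted accuracy (exploitation) and uncertainty bonus (exploration) gets evaluated next.

```python
# Simplified acquisition rule in the spirit of Bayesian optimization:
# exploitation = predicted accuracy, exploration = a bonus for uncertainty.
def acquisition(predicted_accuracy, predicted_uncertainty, beta=1.0):
    return predicted_accuracy + beta * predicted_uncertainty

candidates = {
    "refine the current best design": (0.92, 0.01),   # well understood
    "wild, never-tried architecture": (0.85, 0.15),   # highly uncertain
}
scores = {name: acquisition(acc, unc) for name, (acc, unc) in candidates.items()}
print(max(scores, key=scores.get))   # with beta=1.0 the wild design is tried next
```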

00:20:00.950 --> 00:20:03.500
A framework called BANANAS used this approach

00:20:03.500 --> 00:20:06.180
coupled with a neural predictor to get incredibly

00:20:06.180 --> 00:20:09.140
efficient results. Okay, I have to play devil's

00:20:09.140 --> 00:20:11.099
advocate here because there seems to be a glaring

00:20:11.099 --> 00:20:13.700
flaw with the benchmark solution. If we rely

00:20:13.700 --> 00:20:16.519
entirely on these pre-calculated tabular or

00:20:16.519 --> 00:20:18.880
surrogate benchmarks to save time and energy...

00:20:19.349 --> 00:20:22.430
Aren't we just trapping the AI in a sandbox?

00:20:22.970 --> 00:20:26.089
How can the system discover a truly revolutionary

00:20:26.089 --> 00:20:29.369
out-of-the-box architecture if it is only

00:20:29.369 --> 00:20:31.230
allowed to look up answers in a pre -written

00:20:31.230 --> 00:20:33.789
test? Doesn't that limit the entire premise of

00:20:33.789 --> 00:20:36.529
neural architecture search? This raises an important

00:20:36.529 --> 00:20:38.869
question, and you've just identified the central

00:20:38.869 --> 00:20:41.349
tension in the field right now. You have this

00:20:41.349 --> 00:20:44.410
profound need for wild, unconstrained exploration,

00:20:44.750 --> 00:20:46.829
which is exactly where techniques like Bayesian

00:20:46.829 --> 00:20:49.339
optimization and reinforcement learning truly

00:20:49.339 --> 00:20:51.700
shine, pulling against the very real physical

00:20:51.700 --> 00:20:54.160
constraints of global computing resources, carbon

00:20:54.160 --> 00:20:56.940
emissions, and academic accessibility. It's a

00:20:56.940 --> 00:20:59.640
tough balance. The benchmarks are a necessary

00:20:59.640 --> 00:21:02.700
compromise. They allow the wider scientific community

00:21:02.700 --> 00:21:05.779
to iterate rapidly on the search strategies themselves

00:21:05.779 --> 00:21:08.740
without destroying the environment. But yes,

00:21:09.140 --> 00:21:11.619
the search space in a benchmark is inherently

00:21:11.619 --> 00:21:13.599
restricted by the boundaries of what has already

00:21:13.599 --> 00:21:15.900
been mapped by the creators. It's the ultimate

00:21:15.900 --> 00:21:17.920
trade -off between discovery and sustainability.

00:21:18.319 --> 00:21:20.680
But look at the incredible arc we just traced.

00:21:21.039 --> 00:21:23.420
We went from the brute force pioneers burning

00:21:23.420 --> 00:21:27.140
thousands of GPU days to blindly find the perfect

00:21:27.140 --> 00:21:30.200
network to the elegant efficiency of one -shot

00:21:30.200 --> 00:21:33.000
super networks acting like giant Swiss army knives

00:21:33.000 --> 00:21:36.059
sharing data. And finally to environmentally

00:21:36.059 --> 00:21:38.500
conscious benchmarks that democratize the research

00:21:38.500 --> 00:21:41.059
so anyone with a laptop can participate. It is

00:21:41.059 --> 00:21:43.859
a profound evolution to witness. We are actively

00:21:43.859 --> 00:21:46.440
watching artificial intelligence become exceptionally

00:21:46.440 --> 00:21:49.119
adept at building better and more efficient versions

00:21:49.119 --> 00:21:51.480
of itself. The tools they're using to do it are

00:21:51.480 --> 00:21:53.920
getting sharper, and the barriers to entry for

00:21:53.920 --> 00:21:56.599
humans to guide that process are dropping rapidly.

00:21:56.940 --> 00:21:58.980
Which leaves us with a lingering thought for

00:21:58.980 --> 00:22:02.059
you to chew on as you head into your day. We

00:22:02.059 --> 00:22:04.779
started this deep dive by talking about the master

00:22:04.779 --> 00:22:07.619
architect painstakingly designing a skyscraper.

00:22:08.339 --> 00:22:10.500
If neural architecture search is successfully

00:22:10.500 --> 00:22:12.559
automating the role of the network architect,

00:22:13.440 --> 00:22:15.480
and meta -learning is automating the very process

00:22:15.480 --> 00:22:19.000
of learning itself, if an AI is now evaluating,

00:22:19.299 --> 00:22:22.119
designing, and optimizing its own mathematical

00:22:22.119 --> 00:22:24.740
brain structures using a fraction of the computing

00:22:24.740 --> 00:22:27.720
power it used to take, at what point does the

00:22:27.720 --> 00:22:30.599
human engineer's role shift entirely from the

00:22:30.599 --> 00:22:33.539
creator of the intelligence to simply a bystander

00:22:33.539 --> 00:22:35.339
watching the building construct itself from the

00:22:35.339 --> 00:22:37.460
ground up. We might just be laying the concrete

00:22:37.460 --> 00:22:39.700
foundation and letting the machine figure out

00:22:39.700 --> 00:22:41.440
how to build the rest of the skyline. Thanks

00:22:41.440 --> 00:22:42.680
for joining us on this deep dive.
