WEBVTT

00:00:00.000 --> 00:00:03.359
So if you want to master tennis, you know, you

00:00:03.359 --> 00:00:05.700
just practice tennis. Right, obviously. Yeah,

00:00:05.700 --> 00:00:07.719
you get out there on the court, you work on your

00:00:07.719 --> 00:00:11.740
serve, you drill your backhand, and you figure

00:00:11.740 --> 00:00:14.699
out the footwork. But what if I told you that

00:00:14.699 --> 00:00:18.199
the secret to becoming a world class tennis player

00:00:18.199 --> 00:00:22.460
faster was actually, well, that you had to decide

00:00:22.460 --> 00:00:24.839
to play competitive ping pong at the exact same

00:00:24.839 --> 00:00:27.239
time? I mean, that sounds like a recipe for completely

00:00:27.239 --> 00:00:29.679
scrambling your brain. Right. You would think

00:00:29.679 --> 00:00:32.039
the macro movements of a tennis swing and then

00:00:32.039 --> 00:00:34.979
like the micro flicks of a ping pong paddle would

00:00:34.979 --> 00:00:37.020
just actively interfere with each other. Yeah,

00:00:37.140 --> 00:00:38.859
it does sound like a terrible idea on the surface.

00:00:38.859 --> 00:00:41.320
Totally backwards. But underneath the wildly

00:00:41.320 --> 00:00:43.119
different scales of the tennis court and the

00:00:43.119 --> 00:00:45.079
ping pong table, you are essentially forcing

00:00:45.079 --> 00:00:48.829
your brain to train on the exact same underlying

00:00:48.829 --> 00:00:51.590
mechanics. Like you're building foundational

00:00:51.590 --> 00:00:54.429
spatial awareness, spin prediction, and hand-eye

00:00:54.429 --> 00:00:56.450
coordination. And that counterintuitive

00:00:56.450 --> 00:00:59.090
idea is exactly what we are exploring today.

00:00:59.630 --> 00:01:01.609
So welcome to the Deep Dive. Thanks for having

00:01:01.609 --> 00:01:05.269
me. We are taking a look at a massive, incredibly

00:01:05.269 --> 00:01:07.989
dense set of source material centered around

00:01:07.989 --> 00:01:10.469
a Wikipedia article on this concept called multi-task

00:01:10.469 --> 00:01:14.150
learning, or MTL. Yes, MTL. And our mission

00:01:14.150 --> 00:01:17.469
for you today is to demystify how machine learning

00:01:17.469 --> 00:01:20.290
models actually get significantly smarter by

00:01:20.290 --> 00:01:23.250
juggling multiple tasks at once. Which is fascinating.

00:01:23.469 --> 00:01:25.150
Because for us humans who usually just want to

00:01:25.150 --> 00:01:27.409
focus on one thing at a time, it feels, you know,

00:01:27.430 --> 00:01:29.790
totally backwards. It does. And to establish

00:01:29.790 --> 00:01:33.010
exactly why this matters for you and really the

00:01:33.010 --> 00:01:35.569
technology you use every day, we have to look

00:01:35.569 --> 00:01:38.170
at the context of how artificial intelligence

00:01:38.170 --> 00:01:41.030
traditionally learns. Historically, we train

00:01:41.030 --> 00:01:44.409
AI in a complete vacuum. We give it one single

00:01:44.409 --> 00:01:47.329
objective, a massive pile of data, and it learns

00:01:47.329 --> 00:01:49.170
that one specific thing. Right, like a horse

00:01:49.170 --> 00:01:52.370
with blinders on. Exactly. But multi-task learning,

00:01:52.590 --> 00:01:55.870
or MTL, forces the AI to look at multiple different

00:01:55.870 --> 00:01:59.409
tasks simultaneously. It forces the system to

00:01:59.409 --> 00:02:01.450
exploit the commonalities and the differences

00:02:01.450 --> 00:02:04.209
across all those tasks. So it's not getting distracted?

00:02:04.549 --> 00:02:07.500
Not at all. When you compel a system to solve

00:02:07.500 --> 00:02:10.139
multiple problems at once, you aren't distracting

00:02:10.139 --> 00:02:13.080
it. You are actually improving both its learning

00:02:13.080 --> 00:02:15.520
efficiency and its prediction accuracy across

00:02:15.520 --> 00:02:17.520
the board. Okay, let's unpack this, because if

00:02:17.520 --> 00:02:19.300
we look at the source text, it points out that

00:02:19.300 --> 00:02:21.560
the early versions of this concept were actually

00:02:21.560 --> 00:02:24.759
just called hints. Yeah, hints. Which is a surprisingly

00:02:24.759 --> 00:02:27.400
gentle term for computer science, honestly. It

00:02:27.400 --> 00:02:29.699
really is. But then we get to this foundational

00:02:29.699 --> 00:02:33.319
1997 paper by a researcher named Rich Caruana.

00:02:33.439 --> 00:02:35.389
Right, a huge paper in the field. And

00:02:35.389 --> 00:02:38.550
he theorized that MTL uses the training signals

00:02:38.550 --> 00:02:42.090
of related tasks as, and I'm quoting here, inductive

00:02:42.090 --> 00:02:45.590
bias while using a shared representation. Yes.

00:02:45.750 --> 00:02:48.169
Now even for a deep dive, I mean, that is some

00:02:48.169 --> 00:02:50.430
heavy academic phrasing. It is a bit dense. So

00:02:50.430 --> 00:02:52.449
let's break those terms down because the mechanics

00:02:52.449 --> 00:02:54.409
underneath them are really brilliant. Please

00:02:54.409 --> 00:02:57.669
do. So in machine learning, an inductive bias

00:02:57.669 --> 00:03:01.110
is essentially the set of assumptions a learning

00:03:01.110 --> 00:03:04.580
algorithm uses to predict outcomes for situations

00:03:04.580 --> 00:03:06.500
it has never seen before. OK, so it's like it's

00:03:06.500 --> 00:03:09.020
gut instinct. Kind of, yeah. Inductive means

00:03:09.020 --> 00:03:12.360
it's generalizing from specific examples, and

00:03:12.360 --> 00:03:14.819
bias means it leans towards certain types of

00:03:14.819 --> 00:03:17.199
conclusions. Got it. Think of an AI trying to

00:03:17.199 --> 00:03:20.000
guess the rules of a card game just by watching

00:03:20.000 --> 00:03:23.120
people play. OK. If it only watches one specific

00:03:23.120 --> 00:03:26.889
game, say poker, it might draw the wrong conclusions

00:03:26.889 --> 00:03:29.310
about how all cards work. Like, it might think

00:03:29.310 --> 00:03:31.870
the goal is always to hoard chips. Exactly. But

00:03:31.870 --> 00:03:34.590
if it watches poker, blackjack, and solitaire

00:03:34.590 --> 00:03:37.500
at the same time... The hints it gets from the

00:03:37.500 --> 00:03:39.939
other games guide it toward a much more robust

00:03:39.939 --> 00:03:42.740
understanding. Oh, I see. It learns what a suit

00:03:42.740 --> 00:03:45.699
is or what shuffling does, completely independent

00:03:45.699 --> 00:03:48.180
of the specific game. Let's bring this into the

00:03:48.180 --> 00:03:50.240
real world with the spam filter example from

00:03:50.240 --> 00:03:52.340
the text, because I think this makes Caruana's

00:03:52.340 --> 00:03:54.219
theory instantly click. That's a great example.

00:03:54.360 --> 00:03:56.759
So imagine you are building a spam filter for

00:03:56.759 --> 00:03:59.740
two different users. User A is an English speaker.

00:03:59.860 --> 00:04:02.740
User B is a Russian speaker. Right. Now, if user

00:04:02.740 --> 00:04:05.909
A receives an email entirely in Russian, the

00:04:05.909 --> 00:04:09.009
standard single-task AI filter is going to flag

00:04:09.009 --> 00:04:11.349
that immediately. It sees the Cyrillic alphabet

00:04:11.349 --> 00:04:14.490
and says, obviously, spam. Right, because for

00:04:14.490 --> 00:04:17.069
the English speaker, the system just learns a

00:04:17.069 --> 00:04:19.610
lazy rule. Lazy rule. Yeah, it basically just

00:04:19.610 --> 00:04:23.009
says, Cyrillic equals bad. It finds the easiest

00:04:23.009 --> 00:04:25.629
possible shortcut to solve the specific problem

00:04:25.629 --> 00:04:27.879
it was given. Right. But for the Russian speaker,

00:04:28.060 --> 00:04:30.399
that lazy rule would destroy their inbox. Exactly.

00:04:30.500 --> 00:04:32.500
I mean, a Russian email is perfectly normal for

00:04:32.500 --> 00:04:35.220
them. So if you train two separate AI models

00:04:35.220 --> 00:04:38.199
in a vacuum, they learn totally different rules.

00:04:38.259 --> 00:04:41.509
They do. But let's say both users start receiving

00:04:41.509 --> 00:04:44.769
scam emails about an urgent wire transfer. Oh,

00:04:44.769 --> 00:04:48.569
yes. If the AI is using multitask learning, it

00:04:48.569 --> 00:04:51.750
groups these distinct but related classification

00:04:51.750 --> 00:04:54.730
tasks together into a shared representation.

00:04:55.209 --> 00:04:57.550
It suddenly realizes, wait, whether the text

00:04:57.550 --> 00:05:00.970
is in English or Russian, this specific manipulative

00:05:00.970 --> 00:05:03.529
phrasing about a wire transfer is a universal

00:05:03.529 --> 00:05:05.810
indicator of a scam. And that reveals the core

00:05:05.810 --> 00:05:08.290
mechanism of why this is so powerful. It's what

00:05:08.290 --> 00:05:10.790
we call regularization. Regularization. OK. Yeah.

00:05:10.850 --> 00:05:12.250
In machine learning, you always run the risk

00:05:12.250 --> 00:05:14.509
of your model overfitting. Overfitting. Right.

00:05:14.509 --> 00:05:17.750
Overfitting happens when an AI becomes obsessively

00:05:17.750 --> 00:05:20.910
fixated on the specific random quirks of its

00:05:20.910 --> 00:05:23.290
training data rather than the actual underlying

00:05:23.290 --> 00:05:25.529
patterns. Like assuming all Russian is spam.

00:05:25.629 --> 00:05:29.250
Exactly. By jointly solving each user's spam

00:05:29.250 --> 00:05:32.569
problem, the AI solutions actively regularize

00:05:32.569 --> 00:05:34.910
each other. So they balance each other out. Yes.

00:05:35.329 --> 00:05:38.370
The Russian task acts as a guardrail for the

00:05:38.370 --> 00:05:41.089
English task, effectively telling it, hey, don't

00:05:41.089 --> 00:05:43.149
focus so much on the alphabet, focus on the intent.

00:05:43.170 --> 00:05:45.750
That makes a lot of sense. Regularization mathematically

00:05:45.750 --> 00:05:49.029
forces the system to zoom out and learn the actual

00:05:49.029 --> 00:05:51.810
universal rules of what makes an email spam.
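
The shared-representation idea can be sketched in a few lines of toy Python. This is a hypothetical illustration, not from the source: the scam phrases and user setup are invented, but the structure shows hard parameter sharing, where one feature extractor serves both users' spam heads so neither can fall back on the lazy "Cyrillic equals bad" shortcut.

```python
# Toy sketch of hard parameter sharing (hypothetical example): one shared
# feature extractor feeds both users' spam heads, so the model must key on
# intent (the scam phrasing), not on which alphabet the email uses.

SCAM_PHRASES = {"urgent wire transfer", "срочный банковский перевод"}

def shared_features(email: str) -> dict:
    """Language-agnostic features that both task heads must share."""
    text = email.lower()
    return {
        "has_scam_phrase": any(p in text for p in SCAM_PHRASES),
        "has_cyrillic": any("а" <= ch <= "я" for ch in text),
    }

def spam_head_english_user(features: dict) -> bool:
    # The shared representation steers the head toward intent, not alphabet.
    return features["has_scam_phrase"]

def spam_head_russian_user(features: dict) -> bool:
    return features["has_scam_phrase"]
```

Run against a Russian-language scam and an ordinary Russian email: the English user's head flags the first but not the second, because `has_cyrillic` alone never decides the outcome.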

00:05:52.230 --> 00:05:55.269
So grouping related tasks makes perfect logical

00:05:55.269 --> 00:05:57.670
sense. They share a fundamental goal. They do.

00:05:57.899 --> 00:05:59.519
But here's where it gets really interesting,

00:05:59.800 --> 00:06:01.459
because the source material takes a hard left

00:06:01.459 --> 00:06:03.459
turn here. It does take a leap. It introduces

00:06:03.459 --> 00:06:06.819
a concept called auxiliary learning, and it claims

00:06:06.819 --> 00:06:09.819
that teaching an AI entirely unrelated

00:06:09.819 --> 00:06:12.980
tasks alongside its main goal actually improves

00:06:12.980 --> 00:06:15.639
its performance even more. Yes. Wait, so teaching

00:06:15.639 --> 00:06:18.399
an AI unrelated things actually helps. It really

00:06:18.399 --> 00:06:20.759
does. Isn't that like a professional race car

00:06:20.759 --> 00:06:23.779
driver? training for a championship by taking

00:06:23.779 --> 00:06:26.199
up juggling? I mean, juggling has absolutely

00:06:26.199 --> 00:06:28.420
nothing to do with steering a car. It sounds

00:06:28.420 --> 00:06:32.240
completely paradoxical. Totally. Logically, forcing

00:06:32.240 --> 00:06:36.000
a system to learn unrelated tasks should just

00:06:36.000 --> 00:06:39.199
introduce chaos. But the race car driver analogy

00:06:39.199 --> 00:06:43.040
is actually perfect. Yeah. Juggling doesn't teach

00:06:43.040 --> 00:06:45.600
the driver how to shift gears, but it forces

00:06:45.600 --> 00:06:47.959
their brain to build hyper-efficient spatial

00:06:47.959 --> 00:06:50.819
awareness and peripheral reaction times. Oh,

00:06:50.939 --> 00:06:53.360
wow. And those underlying neurological upgrades

00:06:53.360 --> 00:06:55.639
then make them a better driver. I hadn't thought

00:06:55.639 --> 00:06:57.500
of it like that. The research shows that when

00:06:57.500 --> 00:07:00.959
you joint learn a principal task alongside completely

00:07:00.959 --> 00:07:04.740
unrelated auxiliary tasks, crucially using the

00:07:04.740 --> 00:07:07.480
exact same input data, it provides a massive

00:07:07.480 --> 00:07:09.639
improvement over standard multitask learning.
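
In code, this joint setup usually comes down to a weighted sum of losses over the same input. A minimal hedged sketch follows; the weight of 0.3 and the loss values are invented numbers, not from the source.

```python
# Minimal sketch of an auxiliary-learning objective (illustrative only):
# the principal task and the auxiliary task share one input, and training
# minimizes their weighted combined loss, so the auxiliary signal shapes
# the shared representation.

def joint_loss(principal_loss: float, auxiliary_loss: float,
               aux_weight: float = 0.3) -> float:
    """Single scalar objective driving the shared parameters."""
    return principal_loss + aux_weight * auxiliary_loss
```

With a principal error of 0.5 and an auxiliary error of 1.0, the optimizer sees roughly 0.8, so a shortcut that only helps one task is still penalized through the shared term.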

00:07:09.790 --> 00:07:11.649
But how is that practically happening inside

00:07:11.649 --> 00:07:14.350
the computer? If I'm trying to predict the stock

00:07:14.350 --> 00:07:16.730
market, why would I also ask the AI to count

00:07:16.730 --> 00:07:18.870
the number of vowels in the financial reports?

00:07:19.310 --> 00:07:21.449
How does the juggling help the driving in data?

00:07:21.769 --> 00:07:23.949
What's fascinating here is how the system is

00:07:23.949 --> 00:07:26.509
forced to handle the concept of noise. Noise,

00:07:26.569 --> 00:07:29.689
OK. In any data set, there's the core truth you're

00:07:29.689 --> 00:07:32.850
looking for, the signal. Right. And then there

00:07:32.850 --> 00:07:35.110
are the idiosyncrasies of the data distribution.

00:07:35.149 --> 00:07:37.889
Right. The random useless fluctuations, which

00:07:37.889 --> 00:07:41.410
we call noise. OK. So signal and noise. Exactly.

00:07:41.569 --> 00:07:45.069
Yeah. When you give an AI two totally unrelated

00:07:45.069 --> 00:07:48.529
tasks to solve using the exact same input data,

00:07:49.050 --> 00:07:51.269
you are essentially forcing it to build a much

00:07:51.269 --> 00:07:54.649
sparser, far more informative foundational representation

00:07:54.649 --> 00:07:56.649
of that data. Because it can't rely on those

00:07:56.649 --> 00:07:59.420
lazy shortcuts anymore. Exactly. If it uses a

00:07:59.420 --> 00:08:02.220
cheap trick or memorizes the noise to solve task

00:08:02.220 --> 00:08:05.480
A, that trick will almost certainly fail catastrophically

00:08:05.480 --> 00:08:08.540
on the completely unrelated task B. Right. The

00:08:08.540 --> 00:08:10.899
system literally imposes a mathematical penalty

00:08:10.899 --> 00:08:13.800
on itself for relying on noise. The architecture

00:08:13.800 --> 00:08:15.879
encourages the mathematical representations of

00:08:15.879 --> 00:08:18.100
these different groups of tasks to be orthogonal.

00:08:18.319 --> 00:08:20.500
Orthogonal. Meaning they operate at right angles

00:08:20.500 --> 00:08:22.600
to each other. Yes. But what does a right angle

00:08:22.600 --> 00:08:24.939
mean for data? In mathematical and data terms,

00:08:25.660 --> 00:08:28.920
if two variables are orthogonal, it means changing

00:08:28.920 --> 00:08:31.139
one has absolutely zero effect on the other.

00:08:31.560 --> 00:08:33.159
They are completely independent. So they don't

00:08:33.159 --> 00:08:35.559
touch at all. Right. It's like the volume knob

00:08:35.559 --> 00:08:38.600
and the channel dial on a radio. Okay. Twisting

00:08:38.600 --> 00:08:40.740
the volume doesn't change the station, and changing

00:08:40.740 --> 00:08:42.879
the station doesn't change the volume. I like

00:08:42.879 --> 00:08:46.659
that analogy. By forcing the AI to study the

00:08:46.659 --> 00:08:49.799
stock market prices and count the vowels simultaneously,

00:08:50.440 --> 00:08:53.000
the AI is forced to separate the core signal

00:08:53.279 --> 00:08:56.720
the actual financial data from the specific unrelated

00:08:56.720 --> 00:08:58.700
tasks. Right, because vowels don't drive the

00:08:58.700 --> 00:09:01.320
stock price. Exactly. It realizes the only way

00:09:01.320 --> 00:09:04.299
to solve both is to strip away all the surface

00:09:04.299 --> 00:09:06.960
level noise and extract only the absolute most

00:09:06.960 --> 00:09:09.539
fundamental features of the raw data before it

00:09:09.539 --> 00:09:11.720
splits off to solve the specific problems. That

00:09:11.720 --> 00:09:14.039
is mind-bending. The distraction is actually

00:09:14.039 --> 00:09:16.460
the cure for the noise. It really is. The system

00:09:16.460 --> 00:09:19.500
is forced to become profound rather than superficial.
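
"Orthogonal" has a concrete numeric test: the dot product is zero. A toy check, with vectors invented purely for illustration:

```python
# Toy orthogonality check (illustrative vectors, not real features): two
# directions are orthogonal when their dot product is zero, so moving
# along one leaves the other untouched -- the volume knob versus the
# channel dial from the analogy above.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

market_signal = [1.0, 0.0, 2.0]   # hypothetical "financial signal" axis
vowel_counts  = [0.0, 3.0, 0.0]   # hypothetical "vowel count" axis
```

Here `dot(market_signal, vowel_counts)` comes out to exactly 0.0: changing one representation has no effect on the other, which is what the architecture encourages.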

00:09:19.720 --> 00:09:22.159
And it builds a foundation so solid that it can

00:09:22.159 --> 00:09:24.370
apply that underlying understanding to almost

00:09:24.370 --> 00:09:26.549
anything. Okay, so this makes sense if the world

00:09:26.549 --> 00:09:29.049
stands still, you know. You have a static set

00:09:29.049 --> 00:09:31.330
of tasks, you feed them to the AI and it strips

00:09:31.330 --> 00:09:33.450
away the noise. Right. But I'm thinking about

00:09:33.450 --> 00:09:35.289
something like my Netflix account. All right.

00:09:35.710 --> 00:09:37.929
What I want to watch on a Tuesday morning before

00:09:37.929 --> 00:09:41.289
work is totally different from what I want to

00:09:41.289 --> 00:09:44.409
watch on a Friday night. Oh, absolutely. My preferences

00:09:44.409 --> 00:09:47.899
are a constantly moving target. If my data is

00:09:47.899 --> 00:09:50.399
constantly shifting, doesn't this shared memory

00:09:50.399 --> 00:09:53.759
become outdated immediately? Like, how does MTL

00:09:53.759 --> 00:09:57.879
adapt to time and evolution? That is the exact

00:09:57.879 --> 00:10:00.659
problem researchers had to solve next. And it

00:10:00.659 --> 00:10:02.539
requires us to distinguish between concurrent

00:10:02.539 --> 00:10:04.600
learning, which is the traditional MTL we've

00:10:04.600 --> 00:10:07.059
been discussing, and the sequential transfer

00:10:07.059 --> 00:10:10.000
of knowledge. Sequential transfer? Yeah. In concurrent

00:10:10.000 --> 00:10:12.000
learning, the shared representation is developed

00:10:12.000 --> 00:10:14.580
across all tasks at the exact same time. Right.

00:10:14.820 --> 00:10:16.950
But in transfer learning, you develop a massive

00:10:16.950 --> 00:10:19.850
robust representation first, and then pass it

00:10:19.850 --> 00:10:22.769
down the line to new tasks. The Wikipedia article

00:10:22.769 --> 00:10:25.190
cites GoogLeNet as a prime example of this. Which

00:10:25.190 --> 00:10:28.330
is this massive deep convolutional neural network

00:10:28.330 --> 00:10:30.889
used for image classification. Right. It takes

00:10:30.889 --> 00:10:33.649
an enormous amount of computational power to

00:10:33.649 --> 00:10:36.649
train GoogLeNet to understand what a cat looks

00:10:36.649 --> 00:10:38.250
like from a million different angles. Now I can

00:10:38.250 --> 00:10:40.909
imagine. But in doing so, the initial layers

00:10:40.909 --> 00:10:43.029
of the network aren't actually learning cat.

00:10:43.340 --> 00:10:45.039
Wait, what are they learning, though? They are

00:10:45.039 --> 00:10:47.659
learning how to detect an edge. They are learning

00:10:47.659 --> 00:10:50.779
how to perceive a curve or a shadow or a texture.

00:10:51.039 --> 00:10:54.500
Oh, wow. Once it has developed that incredibly

00:10:54.500 --> 00:10:58.159
robust mathematical representation of basic geometry,

00:10:59.360 --> 00:11:02.179
that knowledge isn't locked away. OK. It can

00:11:02.179 --> 00:11:04.960
be used to initialize an entirely different model

00:11:04.960 --> 00:11:09.009
meant to detect, say, tumors in medical scans.

00:11:09.429 --> 00:11:11.950
Really? From cats to tumors? Yes. You sequentially

00:11:11.950 --> 00:11:14.570
transfer that pre-trained foundation so the

00:11:14.570 --> 00:11:16.350
new algorithm doesn't have to start from scratch.

00:11:16.570 --> 00:11:18.250
It's like sending someone to art school for four

00:11:18.250 --> 00:11:20.289
years to learn everything about light, shadow,

00:11:20.429 --> 00:11:22.789
and anatomy, right? Yes. And then on graduation

00:11:22.789 --> 00:11:25.289
day, handing them a tattoo gun? Yeah. Like, they

00:11:25.289 --> 00:11:28.039
don't know how to tattoo yet? But their foundational

00:11:28.039 --> 00:11:30.480
understanding of how shapes work is so strong

00:11:30.480 --> 00:11:33.440
that they will learn the specific art of tattooing

00:11:33.440 --> 00:11:35.620
infinitely faster than someone off the street.

00:11:36.100 --> 00:11:38.899
That is a brilliant way to visualize it. But

00:11:38.899 --> 00:11:41.500
the researchers pushed it even further to address

00:11:41.500 --> 00:11:44.220
your Netflix problem. The moving target problem.

00:11:44.299 --> 00:11:47.139
Right. What happens if the environment is continuously

00:11:47.139 --> 00:11:50.179
changing while the AI is operating? OK. They

00:11:50.179 --> 00:11:52.759
developed something called GoAL, which stands

00:11:52.759 --> 00:11:56.320
for Group Online Adaptive Learning. GoAL. This

00:11:56.320 --> 00:11:59.000
applies the principles of MTL to non-stationary

00:11:59.000 --> 00:12:02.240
environments. With GoAL, a learner operating

00:12:02.240 --> 00:12:04.700
in a shifting environment, like a financial time

00:12:04.700 --> 00:12:07.600
series or a recommendation algorithm, can actually

00:12:07.600 --> 00:12:09.840
benefit from the previous experience of another

00:12:09.840 --> 00:12:12.340
learner to quickly adapt to its new surroundings.

00:12:12.559 --> 00:12:14.519
And this leads right into a phrase from the text

00:12:14.519 --> 00:12:16.360
that totally stopped me in my tracks. Oh yeah.

00:12:16.480 --> 00:12:18.759
Yeah, under the optimization section, it talks

00:12:18.759 --> 00:12:21.600
about evolutionary multitasking. I mean, are

00:12:21.600 --> 00:12:24.200
we talking about literal AI genetics here? Are

00:12:24.200 --> 00:12:26.480
we breeding algorithms to see which ones survive?

00:12:26.720 --> 00:12:29.860
We actually are. Really? Yeah. Instead of just

00:12:29.860 --> 00:12:33.220
tweaking one algorithm, the researchers map all

00:12:33.220 --> 00:12:36.360
of these distinct optimization tasks into a single

00:12:36.360 --> 00:12:39.639
unified search space. OK. And within that space,

00:12:39.840 --> 00:12:42.320
you have a whole population of candidate solutions.

00:12:42.639 --> 00:12:45.539
It relies on continuous genetic transfer. OK,

00:12:45.559 --> 00:12:48.340
but how does an algorithm transfer genetics?

00:12:48.440 --> 00:12:50.620
I mean, it doesn't have DNA. Through a process

00:12:50.620 --> 00:12:53.740
that mirrors biological crossover. As this evolving

00:12:53.740 --> 00:12:56.100
population of candidate solutions tries to solve

00:12:56.100 --> 00:12:59.279
these different tasks, they harness hidden relationships

00:12:59.279 --> 00:13:02.240
through something called implicit parallelism.

00:13:02.580 --> 00:13:05.039
Implicit parallelism. Implicit parallelism basically

00:13:05.039 --> 00:13:08.509
means that by evaluating one single child, the

00:13:08.509 --> 00:13:11.210
system is secretly testing dozens of underlying

00:13:11.210 --> 00:13:13.870
mathematical traits at the same time without

00:13:13.870 --> 00:13:16.100
doing any extra math. Can you give me an example

00:13:16.100 --> 00:13:18.759
of that? Sure. If I test a car's top speed, I'm

00:13:18.759 --> 00:13:20.899
implicitly testing its engine, its aerodynamics,

00:13:21.080 --> 00:13:23.259
and its tires all at once. So if candidate A

00:13:23.259 --> 00:13:25.299
is really good at predicting stock volatility

00:13:25.299 --> 00:13:27.659
and candidate B is really good at sentiment analysis

00:13:27.659 --> 00:13:29.860
of news articles, they literally cross over their

00:13:29.860 --> 00:13:33.620
traits. Yes. The system physically swaps segments

00:13:33.620 --> 00:13:36.419
of their underlying code. That's wild. It induces

00:13:36.419 --> 00:13:39.399
a leap in performance. Suddenly, their offspring

00:13:39.399 --> 00:13:42.899
is this hyper-aware financial savant that understands

00:13:42.899 --> 00:13:45.379
both the math of the market and the mood of the

00:13:45.379 --> 00:13:48.279
news. And because it's happening across a whole

00:13:48.279 --> 00:13:51.100
population simultaneously, the evolutionary system

00:13:51.100 --> 00:13:53.519
progresses all the distinct tasks at the same

00:13:53.519 --> 00:13:56.220
time. The experience of the entire population

00:13:56.409 --> 00:13:59.090
lifts every individual solver. So we have this

00:13:59.090 --> 00:14:02.370
beautiful system. It groups users. It filters

00:14:02.370 --> 00:14:05.429
spam. It sees through noise by juggling unrelated

00:14:05.429 --> 00:14:08.710
things. And it breeds super algorithms through

00:14:08.710 --> 00:14:11.009
genetic crossover. That sounds perfect. Yeah.
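
The crossover operation described above is easy to show concretely. A toy single-point crossover, with candidate vectors invented for the example:

```python
# Toy single-point crossover (candidate trait vectors are invented): two
# parent solutions swap tail segments, so traits learned while solving one
# task can be inherited by a solver of another -- the "genetic transfer"
# idea from the discussion above.

def crossover(parent_a: list, parent_b: list, point: int) -> tuple:
    """Swap the tails of two equally long candidates at index `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

volatility_expert = [0.9, 0.8, 0.1, 0.2]   # strong early traits
sentiment_expert  = [0.1, 0.2, 0.9, 0.8]   # strong late traits

child, _ = crossover(volatility_expert, sentiment_expert, point=2)
```

The child comes out as [0.9, 0.8, 0.9, 0.8], combining the strong segments of both parents: the "financial savant" offspring from the analogy.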

00:14:11.070 --> 00:14:12.909
But what happens if the tasks we group together

00:14:12.909 --> 00:14:14.889
actually hate each other? What if they have opposing

00:14:14.889 --> 00:14:17.070
goals? We have to look at the system's vulnerability.

00:14:17.190 --> 00:14:19.230
We do. And to understand that vulnerability,

00:14:19.669 --> 00:14:21.789
the source material dives into some very heavy

00:14:21.789 --> 00:14:24.990
mathematics. It brings up something called RKHSvv.

00:14:25.090 --> 00:14:27.870
Right. Yes, the reproducing kernel Hilbert space

00:14:27.870 --> 00:14:30.549
of vector -valued functions along with separable

00:14:30.549 --> 00:14:33.470
kernels, which is quite the mouthful. It is,

00:14:33.470 --> 00:14:36.649
but the mechanism is crucial. And there's a political

00:14:36.649 --> 00:14:38.850
example in the text to explain this. Just to

00:14:38.850 --> 00:14:40.789
be completely clear to you listening, we're not

00:14:40.789 --> 00:14:42.649
taking any political sides here. Right. We're

00:14:42.649 --> 00:14:45.049
purely looking at the math of how data is distributed.

00:14:45.129 --> 00:14:47.309
Exactly. Just looking at the original source

00:14:47.309 --> 00:14:50.629
material impartially. So the example talks about

00:14:50.629 --> 00:14:53.409
predicting a politician's favorability rating

00:14:53.409 --> 00:14:56.519
based on political party. Yes. Think of an RKHSvv

00:14:56.519 --> 00:15:00.240
like a massive multi-dimensional sorting

00:15:00.240 --> 00:15:03.500
facility. If you just have a standard 2D map

00:15:03.500 --> 00:15:06.460
of politicians and voters based on favorability,

00:15:07.200 --> 00:15:09.059
everyone is jumbled together in a muddy mess.

00:15:09.179 --> 00:15:11.899
Just a big blob of data. Right. Taking an average

00:15:11.899 --> 00:15:14.740
tells you nothing. But a Hilbert space uses complex

00:15:14.740 --> 00:15:17.399
matrices to mathematically toss all that data

00:15:17.399 --> 00:15:20.679
into a 3D or 4D room. Oh, interesting. Suddenly,

00:15:20.820 --> 00:15:22.860
the boundaries between how a Republican views

00:15:22.860 --> 00:15:24.940
the politician and how a Democrat views them

00:15:24.940 --> 00:15:27.639
become obvious. Because they're physically separated

00:15:27.639 --> 00:15:30.600
in that 4D space. Exactly. The math allows the

00:15:30.600 --> 00:15:32.659
system to regularize the predictions by grouping

00:15:32.659 --> 00:15:35.240
people by party, controlling the variance with

00:15:35.240 --> 00:15:37.659
respect to a group mean. It acknowledges that

00:15:37.659 --> 00:15:40.100
the underlying rules for predicting a Democrat's

00:15:40.100 --> 00:15:43.220
view are structurally different from a Republican's,

00:15:43.220 --> 00:15:46.240
but maps them within the same societal matrix.
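
In symbols, "controlling the variance with respect to a group mean" is commonly written as a mean-constrained regularizer. This is one standard form from the MTL regularization literature; the notation here is an assumption, not quoted from the source.

```latex
\min_{f_1,\dots,f_T}\;
\sum_{t=1}^{T}\sum_{i=1}^{n}\bigl(y_{ti}-f_t(x_{ti})\bigr)^2
\;+\;\lambda\sum_{t=1}^{T}\Bigl\|\,f_t-\frac{1}{T}\sum_{s=1}^{T}f_s\,\Bigr\|^2
```

Each per-group predictor $f_t$ fits its own data, while the penalty term pulls every group toward the shared mean: the regularization-across-tasks effect described above.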

00:15:46.480 --> 00:15:49.700
So what does this all mean? The system is great

00:15:49.700 --> 00:15:52.279
at mapping things out, but right after establishing

00:15:52.279 --> 00:15:55.360
this 4D sorting room, the text introduces the

00:15:55.360 --> 00:15:58.179
concept of negative transfer. Yes, the dark side.

00:15:58.419 --> 00:16:00.500
From reading the source, it sounds like a tug

00:16:00.500 --> 00:16:03.580
of war. Let's say you have a shared module trying

00:16:03.580 --> 00:16:07.200
to learn two tasks. If those tasks seek conflicting

00:16:07.200 --> 00:16:09.940
features, their mathematical gradients point

00:16:09.940 --> 00:16:12.899
in completely opposing directions. And to clarify

00:16:12.899 --> 00:16:15.360
for you, a gradient is essentially the direction

00:16:15.360 --> 00:16:17.519
the AI needs to go to find the right answer.

00:16:17.580 --> 00:16:20.440
Right. In gradient descent optimization, which

00:16:20.440 --> 00:16:22.559
is the engine driving these deep neural networks,

00:16:23.340 --> 00:16:25.879
the AI is basically blindfolded, taking small

00:16:25.879 --> 00:16:28.399
steps down a mathematical mountain to find the

00:16:28.399 --> 00:16:30.759
lowest possible point of error. The valley. Right,

00:16:30.820 --> 00:16:33.000
the lowest point is the optimal solution. But

00:16:33.000 --> 00:16:36.100
if task A says, the lowest point is to the left,

00:16:36.240 --> 00:16:38.399
and task B says, no, the lowest point is to the

00:16:38.399 --> 00:16:40.919
right, you are literally asking the AI to go

00:16:40.919 --> 00:16:44.539
north and south at the exact same time. The shared

00:16:44.539 --> 00:16:47.019
module just gets paralyzed. That is negative

00:16:47.019 --> 00:16:49.580
transfer in a nutshell. The gradients cancel

00:16:49.580 --> 00:16:52.659
each other out or they violently swing the AI

00:16:52.659 --> 00:16:54.779
back and forth across the mountains. Ripping

00:16:54.779 --> 00:16:57.879
it apart. Yeah. The simultaneous training of

00:16:57.879 --> 00:17:00.539
these supposedly related tasks actually hinders

00:17:00.539 --> 00:17:02.700
the overall performance compared to just letting

00:17:02.700 --> 00:17:05.380
them learn on their own single task models. Okay.
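
Negative transfer is easy to see numerically. In this toy sketch (the gradients are invented numbers), naively averaging two opposing task gradients nearly zeroes the shared update:

```python
# Toy picture of negative transfer (invented gradients): when task A says
# "go left" and task B says "go right", the naive averaged update for the
# shared module mostly cancels -- the paralysis described above.

def averaged_update(grad_a, grad_b):
    return [(a + b) / 2 for a, b in zip(grad_a, grad_b)]

task_a_grad = [ 1.0, 0.5]   # push left along the first axis
task_b_grad = [-1.0, 0.5]   # push right along the first axis

shared_step = averaged_update(task_a_grad, task_b_grad)
```

The shared step comes out as [0.0, 0.5]: the conflicting first component cancels entirely, so neither task makes progress along it.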

00:17:05.900 --> 00:17:07.940
The joint feature representation breaks down

00:17:07.940 --> 00:17:11.220
because it cannot serve two masters with fundamentally

00:17:11.220 --> 00:17:13.700
opposing needs. It's like tying the tennis player

00:17:13.700 --> 00:17:15.720
and the ping pong player together at the waist

00:17:15.720 --> 00:17:17.539
and telling them to play their respective games.

00:17:17.920 --> 00:17:19.339
They're just going to drag each other to the

00:17:19.339 --> 00:17:22.359
floor. Precisely. But the researchers came up

00:17:22.359 --> 00:17:24.740
with a deeply elegant solution to this problem,

00:17:25.599 --> 00:17:27.920
game theoretic optimization. Game theoretic.

00:17:27.980 --> 00:17:30.980
Yeah. They propose viewing the optimization problem

00:17:30.980 --> 00:17:34.400
as a literal game where every single task is

00:17:34.400 --> 00:17:37.390
a self -interested player. And all of these players

00:17:37.390 --> 00:17:40.069
are competing through a reward matrix, trying

00:17:40.069 --> 00:17:43.069
to reach a state that satisfies everyone. I,

00:17:43.069 --> 00:17:45.789
uh, I have to push back on this. Okay. Wait.

00:17:46.190 --> 00:17:49.170
Math doesn't have a conscience. How do you force

00:17:49.170 --> 00:17:51.730
a mathematical equation to compromise if its

00:17:51.730 --> 00:17:54.150
only directive is to find its own lowest error

00:17:54.150 --> 00:17:56.470
rate? I mean, selfish algorithms don't just sit

00:17:56.470 --> 00:17:59.089
around a table and negotiate. If we connect this

00:17:59.089 --> 00:18:01.529
to the bigger picture, we could look at a concept

00:18:01.529 --> 00:18:04.089
from economics called the Nash cooperative bargaining

00:18:04.089 --> 00:18:07.380
solution. Like John Nash. Yes. In traditional gradient

00:18:07.380 --> 00:18:10.380
descent, the problem is that each task just aggressively

00:18:10.380 --> 00:18:13.259
shouts its own loss rate and demands the whole

00:18:13.259 --> 00:18:15.200
system move in its direction. It's a room full

00:18:15.200 --> 00:18:17.380
of people screaming their own demands. Right.

00:18:17.819 --> 00:18:20.380
But the game theoretic approach changes the mathematical

00:18:20.380 --> 00:18:23.500
incentives. It defines a game matrix where the

00:18:23.500 --> 00:18:26.079
reward for each task is no longer just getting

00:18:26.079 --> 00:18:28.730
its own way. Okay, then what is the reward? The

00:18:28.730 --> 00:18:30.849
reward is calculated based on the agreement of

00:18:30.849 --> 00:18:32.930
its own gradient with the common gradient of

00:18:32.930 --> 00:18:35.589
the group. So how does the math actually calculate

00:18:35.589 --> 00:18:38.809
that agreement? It calculates a vector that minimizes

00:18:38.809 --> 00:18:42.109
the maximum loss among all the tasks. Instead

00:18:42.109 --> 00:18:44.490
of just averaging the directions, which might

00:18:44.490 --> 00:18:47.710
send one task completely off a cliff, the algorithm

00:18:47.710 --> 00:18:50.349
finds a specific angle of descent that guarantees

00:18:50.349 --> 00:18:54.880
no single task goes backwards. Wow. You mathematically

00:18:54.880 --> 00:18:58.059
mandate cooperation by setting the update direction

00:18:58.059 --> 00:19:00.440
to be the Nash cooperative bargaining solution.
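[Editor's note: the "no task goes backwards" guarantee described above can be sketched numerically. The snippet below is an illustrative minimum-norm combiner for two task gradients, in the spirit of MGDA-style multi-task optimizers rather than the exact Nash bargaining update from the source; the names and values are invented for illustration. The combined direction has a non-negative dot product with every task gradient, so a small step along it makes neither task worse to first order.]

```python
import numpy as np

def min_norm_direction(g1, g2):
    """Return the minimum-norm point in the convex hull of two task
    gradients. By optimality, its dot product with each gradient is
    non-negative, so descending along it hurts neither task."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:                     # gradients already agree
        return g1.copy()
    # Closed-form minimizer of ||w*g1 + (1-w)*g2||^2 over w in [0, 1]
    w = np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    return w * g1 + (1.0 - w) * g2

# Two tasks pulling in partially conflicting directions:
g_tennis = np.array([1.0, 0.2])
g_pingpong = np.array([-0.5, 1.0])
d = min_norm_direction(g_tennis, g_pingpong)

# Neither task is dragged backwards: both dot products are >= 0.
print(d @ g_tennis >= 0, d @ g_pingpong >= 0)
```

Averaging the two gradients could point the step against one task; the min-norm combination is the simplest construction that rules that out.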

00:19:00.480 --> 00:19:03.200
Okay. It forces the algorithms to realize that

00:19:03.200 --> 00:19:05.440
the only way they can optimize their own specific

00:19:05.440 --> 00:19:08.380
task is if they find a vector that simultaneously

00:19:08.380 --> 00:19:11.259
benefits the whole. That is incredible. It transforms

00:19:11.259 --> 00:19:13.799
a destructive tug of war into a synchronized

00:19:13.799 --> 00:19:16.319
cooperative negotiation. It really does. It stops

00:19:16.319 --> 00:19:18.799
the negative transfer because the system literally

00:19:18.799 --> 00:19:21.099
won't allow a mathematical move that actively

00:19:21.099 --> 00:19:24.000
destroys another task's progress. It finds the

00:19:24.000 --> 00:19:26.720
path of most mutual benefit. It is quite literally

00:19:26.720 --> 00:19:29.279
diplomacy translated into calculus. Well, we

00:19:29.279 --> 00:19:31.160
have covered a massive amount of ground today.

00:19:31.240 --> 00:19:33.450
We certainly have. To wrap this deep dive up,

00:19:33.549 --> 00:19:35.990
I want to bring all these abstract, soaring concepts

00:19:35.990 --> 00:19:38.109
back down to earth for you. Because this isn't

00:19:38.109 --> 00:19:40.150
just theory sitting in a lab somewhere. Not at

00:19:40.150 --> 00:19:44.220
all. The text lists sweeping real -world applications

00:19:44.220 --> 00:19:47.079
for this technology. They are using multi -task

00:19:47.079 --> 00:19:49.559
optimization to accelerate complex engineering

00:19:49.559 --> 00:19:52.599
designs. They are pushing it into cloud computing,

00:19:52.900 --> 00:19:55.779
aiming for these massive on -demand optimization

00:19:55.779 --> 00:19:58.859
services that can cater to multiple different

00:19:58.859 --> 00:20:01.619
customers simultaneously without cross -contamination.

00:20:01.859 --> 00:20:04.059
And it's making major breakthroughs in industrial

00:20:04.059 --> 00:20:06.740
manufacturing and chemistry, too. Oh, chemistry.

00:20:07.000 --> 00:20:09.140
Yeah, they're optimizing chemical reactions by

00:20:09.140 --> 00:20:11.789
looking at the shared variables across multiple

00:20:11.789 --> 00:20:13.930
different experiments at the same time. Okay.

00:20:14.250 --> 00:20:16.450
So it's finding the underlying physics of the

00:20:16.450 --> 00:20:19.410
reactions faster than human trial and error ever

00:20:19.410 --> 00:20:21.589
could. That's amazing. And for the technically

00:20:21.589 --> 00:20:23.769
curious listener out there who wants to see how

00:20:23.769 --> 00:20:26.589
this actually runs, the source even mentions

00:20:26.589 --> 00:20:30.269
a specific MATLAB software package called MALSAR.

00:20:30.410 --> 00:20:32.650
Yes, MALSAR. Which stands for Multitask Learning

00:20:32.650 --> 00:20:35.759
Via Structural Regularization. It implements

00:20:35.759 --> 00:20:38.680
a whole suite of these algorithms, from clustered

00:20:38.680 --> 00:20:41.740
multitask learning to robust low -rank learning.
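[Editor's note: MALSAR itself is a MATLAB package, but the structural-regularization idea it implements can be sketched in a few lines of Python. The toy below is illustrative only, not MALSAR's API; every name and parameter value is invented. Each task fits its own ridge regression while a mean-regularization penalty pulls every task's weights toward the group average, so related tasks share structure instead of being learned in isolation.]

```python
import numpy as np

def joint_ridge(Xs, ys, lam=0.01, mu=1.0, steps=500, lr=0.01):
    """Toy multi-task regression: each task t learns weights W[t],
    and the penalty mu * ||W[t] - mean(W)||^2 ties tasks together."""
    T, d = len(Xs), Xs[0].shape[1]
    W = np.zeros((T, d))
    for _ in range(steps):
        w_bar = W.mean(axis=0)           # current group consensus
        for t in range(T):
            resid = Xs[t] @ W[t] - ys[t]
            grad = (Xs[t].T @ resid / len(ys[t])   # task's own loss
                    + lam * W[t]                   # ridge shrinkage
                    + mu * (W[t] - w_bar))         # pull toward group
            W[t] -= lr * grad
    return W

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])           # shared ground truth
Xs = [rng.normal(size=(50, 2)) for _ in range(2)]
ys = [X @ w_true + 0.1 * rng.normal(size=50) for X in Xs]
W = joint_ridge(Xs, ys)                  # both rows land near w_true
```

Setting mu to zero recovers two independent regressions; raising it forces the tasks to agree, which is the "structural regularization" trade-off the package name refers to.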

00:20:42.059 --> 00:20:43.940
So the tools to play with this are out there

00:20:43.940 --> 00:20:46.039
right now. They are. And I think there is a profound

00:20:46.039 --> 00:20:48.700
takeaway here that extends far beyond the code.

00:20:48.819 --> 00:20:50.980
Oh, definitely. We've seen today that artificial

00:20:50.980 --> 00:20:52.960
intelligence achieves its highest potential,

00:20:53.200 --> 00:20:55.759
its deepest understanding of the world, not by

00:20:55.759 --> 00:20:58.380
hyper -focusing on a single isolated problem

00:20:58.380 --> 00:21:01.339
in a vacuum. Right. It achieves brilliance by

00:21:01.339 --> 00:21:05.039
finding the hidden harmony in conflicting, unrelated,

00:21:05.480 --> 00:21:08.240
and continuously evolving tasks through cooperative

00:21:08.240 --> 00:21:11.799
games. It's beautiful, really. It is. So as you

00:21:11.799 --> 00:21:14.160
go about your week, it's worth asking yourself.

00:21:15.259 --> 00:21:17.920
In an era of constant human information overload,

00:21:18.640 --> 00:21:20.740
are we trying too hard to single task our way

00:21:20.740 --> 00:21:23.420
through life? That's a great question. What auxiliary

00:21:23.420 --> 00:21:25.660
tasks or unrelated hobbies could you be picking

00:21:25.660 --> 00:21:27.940
up right now that might secretly help you screen

00:21:27.940 --> 00:21:28.740
out your own noise?
