WEBVTT

00:00:00.000 --> 00:00:03.160
So imagine you've spent, I don't know, the last

00:00:03.160 --> 00:00:06.320
five years meticulously training a spam filter

00:00:06.320 --> 00:00:10.619
on your personal inbox. You've flagged the newsletters,

00:00:11.179 --> 00:00:13.880
the weird sales pitches, the phishing scams,

00:00:13.880 --> 00:00:17.600
and your AI is just a flawless digital bouncer

00:00:17.600 --> 00:00:21.079
at this point. It knows exactly what you personally

00:00:21.079 --> 00:00:23.679
consider junk. Yeah, it's highly tailored to

00:00:23.679 --> 00:00:26.640
you. Exactly. Now imagine taking that exact same

00:00:26.640 --> 00:00:29.679
perfectly trained AI and just plugging it into

00:00:29.679 --> 00:00:31.859
the inbox of an executive at a completely different

00:00:31.859 --> 00:00:35.560
company in a completely different industry. What

00:00:35.560 --> 00:00:37.920
actually happens? Well, I mean, it fails almost

00:00:37.920 --> 00:00:40.259
immediately. Really? Just right away? Yeah, pretty

00:00:40.259 --> 00:00:42.299
much. It starts letting in obvious corporate

00:00:42.299 --> 00:00:45.259
spam, and it's probably blocking critical client

00:00:45.259 --> 00:00:47.780
emails. The model is effectively blind in this

00:00:47.780 --> 00:00:49.700
new environment. Okay, let's unpack this for

00:00:49.700 --> 00:00:51.539
a second. Because to you, the listener, and to

00:00:51.539 --> 00:00:54.740
me, we think, why can't a smart AI just automatically

00:00:54.740 --> 00:00:56.579
work everywhere? Right, that's the expectation.

00:00:56.840 --> 00:00:59.859
Like, if it knows what spam is, why does changing

00:00:59.859 --> 00:01:02.810
the user matter so much? We expect these algorithms

00:01:02.810 --> 00:01:05.709
to be universally intelligent, but we're doing

00:01:05.709 --> 00:01:08.950
a deep dive today into a Wikipedia article on

00:01:08.950 --> 00:01:12.030
a field called domain adaptation. And it paints

00:01:12.030 --> 00:01:14.230
a very different picture. It really does. It

00:01:14.230 --> 00:01:16.430
basically suggests that machine learning models

00:01:16.430 --> 00:01:19.689
are incredibly literal. Like, they only know

00:01:19.689 --> 00:01:21.870
the exact reality they were raised in. Yeah,

00:01:21.930 --> 00:01:23.450
that's a great way to put it. And to understand

00:01:23.450 --> 00:01:25.989
why they're so literal, we kind of have to look

00:01:25.989 --> 00:01:29.170
at how these models are formally structured at

00:01:29.170 --> 00:01:32.040
a mathematical level. OK, lay it on us. So in

00:01:32.040 --> 00:01:33.620
standard supervised machine learning, you have

00:01:33.620 --> 00:01:35.620
two main components. You have an input space,

00:01:35.700 --> 00:01:38.700
which we represent as X, and a label space represented

00:01:38.700 --> 00:01:41.500
as Y. So just tying that back to our spam filter,

00:01:41.579 --> 00:01:44.459
X is the raw text of the incoming email, and

00:01:44.459 --> 00:01:47.140
Y is simply the label, right, like spam or not

00:01:47.140 --> 00:01:49.920
spam. Exactly. And the entire objective of the

00:01:49.920 --> 00:01:52.379
algorithm is to learn a mathematical model, which

00:01:52.379 --> 00:01:55.480
we call a hypothesis, or H. A hypothesis. Got

00:01:55.480 --> 00:01:58.120
it. Right. And this hypothesis has one job. It

00:01:58.120 --> 00:02:01.049
looks at an example from input space X, and it

00:02:01.049 --> 00:02:04.010
attaches the correct label from space Y. So H

00:02:04.010 --> 00:02:06.890
is basically the internal rulebook the AI develops

00:02:06.890 --> 00:02:10.289
to map the text to the right label. Precisely.
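
The hypothesis-as-rulebook idea can be sketched in a few lines of Python. This is a toy word-count learner with invented training data, not a production filter: train builds the "rulebook" from labeled examples in the input space X, and h maps a new input to a label in Y.

```python
# A minimal sketch of supervised learning as described: an input space X
# (email texts), a label space Y ({"spam", "ham"}), and a learned
# hypothesis h mapping X -> Y. The training data is invented for illustration.
from collections import Counter

def train(samples):
    """Learn per-label word counts from (text, label) pairs drawn from D_S."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in samples:
        counts[label].update(text.lower().split())
    return counts

def h(text, counts):
    """The hypothesis: score each label by how often its training words appear."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)

source_sample = [                       # drawn from the source distribution D_S
    ("win a free prize now", "spam"),
    ("claim your free reward", "spam"),
    ("lunch with mom on sunday", "ham"),
    ("family photos from the trip", "ham"),
]
model = train(source_sample)
print(h("free prize inside", model))    # → spam
```

A model like this is "literal" in exactly the sense discussed: its rulebook is nothing but statistics of the sample it was trained on.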

00:02:10.650 --> 00:02:13.189
The model writes that rulebook by studying a

00:02:13.189 --> 00:02:16.770
training sample, but this points us directly

00:02:16.770 --> 00:02:19.629
to the fatal flaw of standard machine learning.

00:02:20.050 --> 00:02:23.030
Which is what? Well, standard supervised learning

00:02:23.030 --> 00:02:26.370
operates on this massive assumption. It assumes

00:02:26.370 --> 00:02:28.389
that all the examples it learns from are drawn

00:02:28.389 --> 00:02:31.300
from a specific static source distribution. We

00:02:31.300 --> 00:02:34.039
call that D sub S. D sub S. Okay. Yeah. And the

00:02:34.039 --> 00:02:36.719
algorithm essentially hardwires itself to the

00:02:36.719 --> 00:02:39.000
belief that every single piece of data it encounters

00:02:39.000 --> 00:02:41.340
in the future will be drawn perfectly from that

00:02:41.340 --> 00:02:44.039
exact same D sub S distribution. Oh, wow. So

00:02:44.039 --> 00:02:45.500
it just assumes the universe will never change.

00:02:45.580 --> 00:02:48.000
Like if the AI was trained on my personal emails,

00:02:48.280 --> 00:02:50.400
it assumes the entire world communicates exactly

00:02:50.400 --> 00:02:52.719
like my friends and family do. Exactly. It assumes

00:02:52.719 --> 00:02:54.759
the future will be a perfect statistical mirror

00:02:54.759 --> 00:02:58.099
of the past. But, you know, the real world is

00:02:58.099 --> 00:03:00.240
messy and it's constantly shifting. And this

00:03:00.240 --> 00:03:03.000
is where domain adaptation comes into play. Domain

00:03:03.000 --> 00:03:05.960
adaptation introduces a second related distribution,

00:03:06.419 --> 00:03:10.000
the target domain, or D sub T. D sub T target

00:03:10.000 --> 00:03:12.120
domain. Right. So the engineering challenge is

00:03:12.120 --> 00:03:14.139
transferring knowledge from the source domain,

00:03:14.400 --> 00:03:17.860
D sub S, to the target domain, D sub T, while

00:03:17.860 --> 00:03:19.580
committing the least amount of error possible

00:03:19.580 --> 00:03:22.360
along the way. So we're essentially forcing the

00:03:22.360 --> 00:03:25.560
AI to step into a new reality. Yes, exactly.

00:03:25.879 --> 00:03:27.919
And to give this some context, within the broader

00:03:27.919 --> 00:03:30.780
field of AI, researchers Pan and Yang developed

00:03:30.780 --> 00:03:33.199
a taxonomy for transfer learning back in 2010.

00:03:33.680 --> 00:03:35.960
They classified domain adaptation specifically

00:03:35.960 --> 00:03:39.319
as transductive transfer learning. Transductive.

00:03:39.699 --> 00:03:42.319
If the core task like catching spam remains the

00:03:42.319 --> 00:03:45.000
same, how does transductive differ from just,

00:03:45.000 --> 00:03:47.139
you know, standard learning? Well, it means the

00:03:47.139 --> 00:03:49.439
objective hasn't changed, but the marginal distributions

00:03:49.439 --> 00:03:51.580
of the data have. Meaning what, exactly? In other

00:03:51.580 --> 00:03:53.659
words, the overall demographic makeup of the

00:03:53.659 --> 00:03:56.800
data, say the total volume of emails or the ratio

00:03:56.800 --> 00:03:59.139
of newsletters to direct messages that differs

00:03:59.139 --> 00:04:02.199
between the two domains. Oh, I see. Furthermore...

00:04:01.759 --> 00:04:04.319
In transductive transfer learning, we assume

00:04:04.319 --> 00:04:08.020
we have absolutely zero labeled data for the

00:04:08.020 --> 00:04:10.379
new target domain. We only have the raw inputs.

00:04:10.560 --> 00:04:13.460
Just the raw x space. Right. Which makes it distinctly

00:04:13.460 --> 00:04:15.400
different from inductive transfer learning, where

00:04:15.400 --> 00:04:18.459
you actually have labels for the new task, or

00:04:18.459 --> 00:04:20.500
unsupervised transfer learning, where you have

00:04:20.500 --> 00:04:23.360
no labels in... either domain. OK, so standard

00:04:23.360 --> 00:04:26.420
machine learning is basically like studying for

00:04:26.420 --> 00:04:29.079
a test using only past exams, banking on the

00:04:29.079 --> 00:04:30.920
teacher never changing their testing style.

00:04:31.019 --> 00:04:33.540
Yeah, that's a good analogy. But domain adaptation

00:04:33.540 --> 00:04:36.019
is preparing for the teacher to throw a massive

00:04:36.019 --> 00:04:39.759
curveball. But I have to push back on the premise

00:04:39.759 --> 00:04:41.790
here for a second. Sure, go ahead. If the new

00:04:41.790 --> 00:04:44.149
executive's inbox, the target domain, is fundamentally

00:04:44.149 --> 00:04:46.750
different, why try to salvage the old model at

00:04:46.750 --> 00:04:50.129
all? Building a new AI from scratch for the new

00:04:50.129 --> 00:04:52.389
inbox just seems like a cleaner solution, doesn't

00:04:52.389 --> 00:04:55.449
it? You'd think so, but starting from scratch

00:04:55.449 --> 00:04:58.209
is prohibitively expensive. Really? Oh yeah.

00:04:58.629 --> 00:05:00.870
It requires a massive amount of human labor to

00:05:00.870 --> 00:05:03.529
create a new, perfectly labeled data set for

00:05:03.529 --> 00:05:06.870
every single new user or new hospital or new

00:05:06.870 --> 00:05:10.279
city an AI is deployed in. Ah, right, all that

00:05:10.279 --> 00:05:13.300
manual tagging. Exactly. If we can salvage the

00:05:13.300 --> 00:05:15.879
foundational knowledge from the source domain,

00:05:16.500 --> 00:05:19.180
the basic understanding of English syntax, the

00:05:19.180 --> 00:05:21.360
underlying psychological structure of a scam,

00:05:22.000 --> 00:05:24.120
and mathematically adapt it to the target domain,

00:05:24.350 --> 00:05:27.209
we save immense amounts of time and computational

00:05:27.209 --> 00:05:31.170
power. So we're recycling the AI's core intelligence

00:05:31.170 --> 00:05:33.290
rather than just throwing it in the trash. That

00:05:33.290 --> 00:05:34.930
makes sense. Exactly. But since we aren't starting

00:05:34.930 --> 00:05:37.250
from scratch, we have to isolate exactly what

00:05:37.250 --> 00:05:39.089
broke during the transition from the source to

00:05:39.089 --> 00:05:41.149
the target, right? Like, what actually changes

00:05:41.149 --> 00:05:43.889
in the data to confuse the AI so badly? Well,

00:05:43.970 --> 00:05:46.209
the source material outlines three common types

00:05:46.209 --> 00:05:49.189
of distribution shifts. The first, and probably

00:05:49.189 --> 00:05:51.329
the most straightforward, is called covariate

00:05:51.329 --> 00:05:54.949
shift. And in covariate shift, the input distributions

00:05:54.949 --> 00:05:58.589
change, but the fundamental mathematical relationship

00:05:58.589 --> 00:06:01.410
between the inputs and the labels remains perfectly

00:06:01.410 --> 00:06:04.970
unchanged. Applying that to our spam filter,

00:06:05.110 --> 00:06:07.449
the new executive receives totally different

00:06:07.449 --> 00:06:10.129
types of emails than I do. The input distribution,

00:06:10.389 --> 00:06:13.629
like the vocabulary, the senders, the frequency

00:06:13.629 --> 00:06:16.110
of messages, it's completely different. But the

00:06:16.110 --> 00:06:18.629
definition of the label spam hasn't actually

00:06:18.629 --> 00:06:21.850
changed. Like a phishing scam trying to steal

00:06:21.850 --> 00:06:24.470
a password is still a phishing scam, whether

00:06:24.470 --> 00:06:27.959
it's aimed at me or a CEO. Exactly. The frequency

00:06:27.959 --> 00:06:30.779
and style of the features changed, but the underlying

00:06:30.779 --> 00:06:33.279
truth did not. OK, got it. What's the second

00:06:33.279 --> 00:06:35.740
shift? The second type is prior shift, also known

00:06:35.740 --> 00:06:38.779
as label shift. This occurs when the label distribution

00:06:38.779 --> 00:06:41.420
differs between the data sets, but the conditional

00:06:41.420 --> 00:06:43.720
distribution of features stays the same. Label

00:06:43.720 --> 00:06:45.699
shift. Give me an example of that. The Wikipedia

00:06:45.699 --> 00:06:48.040
article actually uses a highly illustrative example

00:06:48.040 --> 00:06:51.600
for this. Imagine an AI trained to classify hair

00:06:51.600 --> 00:06:54.439
color in images, and it's trained exclusively

00:06:54.439 --> 00:06:57.279
on a dataset from Italy. That's our source

00:06:57.279 --> 00:06:59.319
domain. Okay, so we can assume a much higher

00:06:59.319 --> 00:07:02.180
proportion of darker hair colors in that source

00:07:02.180 --> 00:07:04.519
dataset. Correct. Now we take that exact same

00:07:04.519 --> 00:07:07.079
model and deploy it in Norway, our target domain.

00:07:07.160 --> 00:07:09.930
Oh wow, okay. Right. The proportion of blonde

00:07:09.930 --> 00:07:12.370
to black hair differs wildly between these two

00:07:12.370 --> 00:07:15.209
populations. So the prior probability of the

00:07:15.209 --> 00:07:17.449
overall label distribution has completely flipped.

00:07:17.709 --> 00:07:19.449
But the conditional distribution, meaning the

00:07:19.449 --> 00:07:21.910
specific physical traits conditionally tied to

00:07:21.910 --> 00:07:24.610
the label blonde, that stays the same. Exactly.

00:07:24.750 --> 00:07:26.970
Like a blonde person in Italy looks fundamentally

00:07:26.970 --> 00:07:29.769
like a blonde person in Norway. The physical

00:07:29.769 --> 00:07:32.410
features of blondness haven't changed, only how

00:07:32.410 --> 00:07:35.649
often the AI encounters them. Precisely. And

00:07:35.649 --> 00:07:38.269
if the AI deployed in Norway is made aware of

00:07:38.269 --> 00:07:41.350
this shift in population proportions, it can

00:07:41.350 --> 00:07:43.750
exploit that prior knowledge to dramatically

00:07:43.750 --> 00:07:46.129
improve its estimates. It just needs to adjust

00:07:46.129 --> 00:07:48.769
its internal baseline to expect more blonde hair.
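
That baseline adjustment has a simple closed form: keep the source model's conditional p(y|x), rescale it by the ratio of target to source label priors, and renormalize. The priors and posterior below are invented numbers, not real hair-color statistics.

```python
# A hedged sketch of the prior-shift correction just described. All the
# numbers are made up for illustration, not measurements of Italy or Norway.

def adjust_for_prior_shift(posterior_src, prior_src, prior_tgt):
    """Reweight a source-domain posterior p_S(y|x) toward target priors."""
    unnorm = {y: p * prior_tgt[y] / prior_src[y] for y, p in posterior_src.items()}
    z = sum(unnorm.values())
    return {y: p / z for y, p in unnorm.items()}

# Source model, trained in "Italy", slightly favors dark hair on an
# ambiguous image.
posterior = {"dark": 0.6, "blonde": 0.4}
prior_italy = {"dark": 0.8, "blonde": 0.2}     # assumed source label mix
prior_norway = {"dark": 0.3, "blonde": 0.7}    # assumed target label mix

adapted = adjust_for_prior_shift(posterior, prior_italy, prior_norway)
print(max(adapted, key=adapted.get))  # → blonde
```

The conditional knowledge is untouched; only the baseline expectation moves, which is exactly what exploiting the known shift in proportions means here.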

00:07:48.959 --> 00:07:50.980
So it just updates its baseline assumptions,

00:07:51.060 --> 00:07:53.199
that makes sense. But that leads us to the third

00:07:53.199 --> 00:07:56.180
shift, which honestly seems inherently more destructive.

00:07:56.379 --> 00:07:58.680
Yeah, the third is concept shift, or conditional

00:07:58.680 --> 00:08:01.000
shift. And this is where it gets really tricky.

00:08:01.480 --> 00:08:03.420
This is where the relationship between the features

00:08:03.420 --> 00:08:06.000
and the labels completely changes, even if the

00:08:06.000 --> 00:08:08.259
input distribution looks identical. Wait, wait.

00:08:08.420 --> 00:08:11.459
So concept shift means the literal rules of reality

00:08:11.459 --> 00:08:15.560
changed for the AI. The mapping of x to y is

00:08:15.560 --> 00:08:18.819
entirely rewritten. Yes. Think of medical diagnosis.

00:08:19.579 --> 00:08:21.980
You might have the exact same symptoms. Let's

00:08:21.980 --> 00:08:24.240
say a high fever and a specific type of rash.

00:08:24.560 --> 00:08:26.899
Those are your inputs. OK. But depending on the

00:08:26.899 --> 00:08:28.980
population or the geographic region you're in

00:08:28.980 --> 00:08:31.500
your domain, those exact same symptoms might

00:08:31.500 --> 00:08:33.840
indicate entirely different diseases. Oh, wow.
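
A toy illustration of what that means, with invented symptom and disease names: under concept shift the input is identical in both domains, but the x-to-y mapping itself differs.

```python
# Concept shift in miniature: the mapping from inputs to labels is itself
# domain-dependent. The symptom/disease pairs are invented, not medical fact.
concept = {
    "region_a": {("fever", "rash"): "disease_1"},
    "region_b": {("fever", "rash"): "disease_2"},
}

def diagnose(symptoms, domain):
    # Same input space, same features; the "rulebook" differs per domain.
    return concept[domain][symptoms]

x = ("fever", "rash")                  # identical input in both domains
print(diagnose(x, "region_a") == diagnose(x, "region_b"))  # → False
```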

00:08:34.340 --> 00:08:36.840
The same symptom means a different disease. How

00:08:36.840 --> 00:08:39.379
does a model not just completely self-destruct

00:08:39.379 --> 00:08:41.799
when the facts change like that? I mean, if A...

00:08:41.909 --> 00:08:44.710
no longer equals B, everything it learned is

00:08:44.710 --> 00:08:46.450
fundamentally flawed. Well, it is the hardest

00:08:46.450 --> 00:08:50.250
shift to solve. And honestly, often, models do

00:08:50.250 --> 00:08:52.350
fail catastrophically if they aren't aggressively

00:08:52.350 --> 00:08:54.570
corrected. I can imagine. It requires the model

00:08:54.570 --> 00:08:57.889
to actively unlearn a deeply ingrained correlation

00:08:57.889 --> 00:09:00.629
and map those same inputs to an entirely new

00:09:00.629 --> 00:09:03.190
output based on the new environment. If it doesn't,

00:09:03.289 --> 00:09:05.450
it will confidently make the wrong diagnosis

00:09:05.450 --> 00:09:08.980
every single time. Man. Because concept shifts

00:09:08.980 --> 00:09:12.159
fundamentally break the model's logic, engineers

00:09:12.159 --> 00:09:14.559
can't just feed it more data, right? They have

00:09:14.559 --> 00:09:17.460
to change how the model learns. Which I guess

00:09:17.460 --> 00:09:21.059
brings us to the data scenarios. How engineers

00:09:21.059 --> 00:09:23.620
perform this corrective surgery depends entirely

00:09:23.620 --> 00:09:25.460
on what resources they have in the target domain,

00:09:25.559 --> 00:09:27.799
doesn't it? It does. The data scenarios dictate

00:09:27.799 --> 00:09:29.759
the mathematical approach, and they fall into

00:09:29.759 --> 00:09:33.899
three distinct categories. The first is unsupervised

00:09:33.899 --> 00:09:37.370
domain adaptation. Unsupervised, meaning no labels.

00:09:37.649 --> 00:09:40.169
Exactly. This means we have an abundance of data

00:09:40.169 --> 00:09:42.509
from the target domain, but absolutely none of

00:09:42.509 --> 00:09:44.570
it is labeled. So we have a decade's worth of

00:09:44.570 --> 00:09:46.789
the new executives' emails, but nobody has gone

00:09:46.789 --> 00:09:49.809
through and manually categorized them. The AI

00:09:49.809 --> 00:09:51.850
has to look at the unlabeled patterns in the

00:09:51.850 --> 00:09:54.029
target domain and mathematically compare them

00:09:54.029 --> 00:09:55.809
with the labeled patterns it remembers from the

00:09:55.809 --> 00:09:57.789
source domain and just try to bridge the gap

00:09:57.789 --> 00:10:00.070
on its own. That's it. Now the second scenario

00:10:00.070 --> 00:10:02.269
is semi-supervised. This is where the vast majority

00:10:02.269 --> 00:10:05.120
of the target data is unlabeled, but you have a tiny

00:10:05.120 --> 00:10:08.320
handful of labeled examples. Like maybe the executive

00:10:08.320 --> 00:10:11.700
took five minutes to flag a few really obvious

00:10:11.700 --> 00:10:14.899
spam emails. Perfect example. Those few labels

00:10:14.899 --> 00:10:17.779
act as mathematical anchors to help align the

00:10:17.779 --> 00:10:20.399
distributions of the two domains. Got it. And

00:10:20.399 --> 00:10:22.700
the third. Finally, there's supervised domain

00:10:22.700 --> 00:10:24.720
adaptation where all the available data from

00:10:24.720 --> 00:10:27.179
the target domain is already labeled. Okay, but

00:10:27.179 --> 00:10:29.559
for you listening... If you're managing data

00:10:29.559 --> 00:10:32.620
for your own projects, unsupervised is clearly

00:10:32.620 --> 00:10:34.740
the holy grail here, right? Because it means

00:10:34.740 --> 00:10:38.299
you skip the incredibly tedious, expensive human

00:10:38.299 --> 00:10:41.139
labor of relabeling everything. Absolutely. And

00:10:41.139 --> 00:10:43.500
if you have fully supervised data, haven't you

00:10:43.500 --> 00:10:46.460
just bypassed the core problem of domain adaptation

00:10:46.460 --> 00:10:49.080
entirely? Well, you've certainly reduced the

00:10:49.080 --> 00:10:52.259
complexity. As the source notes, in a fully supervised

00:10:52.259 --> 00:10:55.259
scenario, domain adaptation essentially reduces

00:10:55.259 --> 00:10:57.899
to a refinement of the original source predictor.

00:10:58.039 --> 00:11:00.240
Just a touch up. Yeah, it's akin to taking that

00:11:00.240 --> 00:11:02.399
Italy hair color network and fine-tuning it

00:11:02.399 --> 00:11:04.840
with a fully labeled Norway data set. It's an

00:11:04.840 --> 00:11:07.240
incremental update rather than a massive leap

00:11:07.240 --> 00:11:09.860
across an unknown statistical chasm. Okay, so

00:11:09.860 --> 00:11:12.240
we understand the shifts. We know what kind of

00:11:12.240 --> 00:11:14.480
data we might be working with. Now let's get

00:11:14.480 --> 00:11:16.980
into the engine room. Let's do it. What are the

00:11:16.980 --> 00:11:19.620
actual mathematical mechanisms the algorithmic

00:11:19.620 --> 00:11:22.559
master plans use to execute this transfer of

00:11:22.559 --> 00:11:25.320
knowledge? The source material lists four main

00:11:25.320 --> 00:11:27.860
principles and they represent totally different

00:11:27.860 --> 00:11:30.500
philosophies of problem solving. Let's start

00:11:30.500 --> 00:11:33.799
with reweighting algorithms. Okay, so the objective

00:11:33.799 --> 00:11:36.299
of a reweighting algorithm is to take the labeled

00:11:36.299 --> 00:11:39.320
samples from the source domain and manually adjust

00:11:39.320 --> 00:11:42.100
their statistical significance in the loss function.

00:11:42.350 --> 00:11:44.690
So that the source distribution mathematically

00:11:44.690 --> 00:11:47.649
mirrors the target distribution. Exactly. Okay,

00:11:47.730 --> 00:11:49.470
I want to break down how that actually works

00:11:49.470 --> 00:11:52.149
in practice. Are we artificially duplicating

00:11:52.149 --> 00:11:55.070
data, or are we changing how the AI perceives

00:11:55.070 --> 00:11:57.529
the data it already has? We're changing how it

00:11:57.529 --> 00:12:00.309
perceives the data by applying a technique called

00:12:00.309 --> 00:12:02.929
importance sampling. Importance sampling? Right.

00:12:03.289 --> 00:12:05.590
During training, every time an AI makes a mistake,

00:12:05.730 --> 00:12:08.090
it receives a mathematical penalty, which updates

00:12:08.090 --> 00:12:11.169
its rulebook. In reweighting, we multiply the

00:12:11.169 --> 00:12:13.809
penalty for certain types of source data to force

00:12:13.809 --> 00:12:16.570
the AI to care about them more. Oh, I see. Let's

00:12:16.570 --> 00:12:18.970
say you're adapting that spam filter to the executive

00:12:18.970 --> 00:12:21.990
and the unlabeled target data suggests they receive

00:12:21.990 --> 00:12:24.509
a massive amount of invoices. Right, and the

00:12:24.509 --> 00:12:27.210
AI looks back at its source data, my inbox, and

00:12:27.210 --> 00:12:30.330
realizes I only received like one or two fake

00:12:30.330 --> 00:12:33.250
invoices a year. It originally learned to mostly

00:12:33.250 --> 00:12:35.909
ignore them because they were statistically insignificant.

00:12:36.409 --> 00:12:39.440
But under a reweighting algorithm, the engineers

00:12:39.440 --> 00:12:42.620
intervene. They tell the algorithm to heavily

00:12:42.620 --> 00:12:45.379
multiply the penalty for getting those specific

00:12:45.379 --> 00:12:48.299
invoice emails wrong during its retraining phase.

00:12:50.259 --> 00:12:53.059
By doing so, they force the old source data to

00:12:53.059 --> 00:12:55.500
simulate the shape and priorities of the new

00:12:55.500 --> 00:12:58.220
reality. It's like telling the AI, remember those

00:12:58.220 --> 00:13:00.580
rare examples you barely paid attention to back

00:13:00.580 --> 00:13:02.899
home? Pretend those are now the most important

00:13:02.899 --> 00:13:05.090
things in the world. That makes total sense.

00:13:05.350 --> 00:13:07.070
It's very effective. So the second principle

00:13:07.070 --> 00:13:10.049
is iterative algorithms. The text describes this

00:13:10.049 --> 00:13:13.389
as a form of auto-labeling, right? Yes. Often

00:13:13.389 --> 00:13:15.970
referred to as pseudo-labeling. You take the

00:13:15.970 --> 00:13:18.149
model you trained on the source data and you

00:13:18.149 --> 00:13:21.230
let it loose on the unlabeled target data. It

00:13:21.230 --> 00:13:23.269
makes its best predictions and automatically

00:13:23.269 --> 00:13:26.370
applies a label to the target examples it feels

00:13:26.370 --> 00:13:29.450
most confident about. It takes its own test and

00:13:29.450 --> 00:13:32.909
grades it. But wait, isn't that a massive vulnerability?

00:13:33.289 --> 00:13:35.769
How so? Well, if the model is already confused

00:13:35.769 --> 00:13:38.009
by the new domain and it auto-labels something

00:13:38.009 --> 00:13:40.529
incorrectly, it's just going to use that incorrect

00:13:40.529 --> 00:13:42.990
data to train itself further. It would just create

00:13:42.990 --> 00:13:45.049
a feedback loop of its own mistakes. That is

00:13:45.049 --> 00:13:47.470
exactly the danger. That phenomenon is called

00:13:47.470 --> 00:13:50.230
confirmation bias or error amplification, and

00:13:50.230 --> 00:13:52.490
it is the primary risk of iterative approaches.

00:13:52.730 --> 00:13:56.029
So how do they fix that? To mitigate this, engineers

00:13:56.029 --> 00:13:59.769
employ strict confidence thresholds. The AI doesn't

00:13:59.769 --> 00:14:02.809
auto-label everything. It evaluates the mathematical

00:14:02.809 --> 00:14:05.769
certainty of its predictions and might only apply

00:14:05.769 --> 00:14:09.470
pseudo-labels to the top 1% of data it is most

00:14:09.470 --> 00:14:11.600
overwhelmingly confident about. Oh, I get it.

00:14:11.600 --> 00:14:13.919
So it only anchors itself to the things it absolutely

00:14:13.919 --> 00:14:16.299
knows for sure and then retrains a new model

00:14:16.299 --> 00:14:19.159
based on those anchors. Exactly. It repeats that

00:14:19.159 --> 00:14:22.759
process iteratively. Train a model, label the

00:14:22.759 --> 00:14:25.500
highest confidence target examples, combine those

00:14:25.500 --> 00:14:27.779
with the source data, and train a new model.
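
That train, label, retrain loop can be sketched with a deliberately tiny 1-D "model", a threshold classifier; the data points and the confidence margin are invented for illustration.

```python
# A hedged sketch of iterative pseudo-labeling: train on labeled source
# data, pseudo-label only the target points the model is most confident
# about (far from the decision boundary), fold them in, and repeat.

def train_threshold(points):
    """Fit a 1-D threshold: midpoint between the two class means."""
    neg = [x for x, y in points if y == 0]
    pos = [x for x, y in points if y == 1]
    return (sum(neg) / len(neg) + sum(pos) / len(pos)) / 2

def self_train(source, target_unlabeled, rounds=3, margin=2.0):
    labeled = list(source)
    pool = list(target_unlabeled)
    for _ in range(rounds):
        t = train_threshold(labeled)
        confident = [x for x in pool if abs(x - t) >= margin]  # high-confidence only
        labeled += [(x, int(x > t)) for x in confident]        # pseudo-labels
        pool = [x for x in pool if abs(x - t) < margin]        # keep the rest waiting
    return train_threshold(labeled)

source = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]  # labeled source data
target = [3.0, 4.0, 12.0, 13.0]                     # shifted, unlabeled
print(self_train(source, target))  # the threshold migrates toward the target data
```

The margin is the crude stand-in for a confidence threshold: points near the boundary are never auto-labeled in early rounds, which is what keeps the error-amplification loop in check.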

00:14:27.980 --> 00:14:30.480
It slowly bootstraps its way toward understanding

00:14:30.480 --> 00:14:32.740
the target domain, expanding its confidence with

00:14:32.740 --> 00:14:34.960
each generation. Yeah, it builds its own stepping

00:14:34.960 --> 00:14:38.269
stones, essentially. Fascinating. Now, the third

00:14:38.269 --> 00:14:40.809
principle seems to take an entirely different,

00:14:41.049 --> 00:14:44.549
almost combative approach. It's called the search

00:14:44.549 --> 00:14:47.529
of a common representation space. Yes. And this

00:14:47.529 --> 00:14:50.190
is where the engineering becomes highly sophisticated.

00:14:50.529 --> 00:14:53.230
The goal is to mathematically map the source

00:14:53.230 --> 00:14:56.370
domain and the target domain into a shared space

00:14:56.370 --> 00:14:58.870
where their distributions overlap so perfectly

00:14:58.870 --> 00:15:01.129
that they are indistinguishable from one another.

00:15:01.330 --> 00:15:02.850
Wait, you're trying to erase the differences

00:15:02.850 --> 00:15:06.200
between Italy and Norway? How do you force two

00:15:06.200 --> 00:15:08.720
distinct data sets to become indistinguishable?

00:15:08.759 --> 00:15:11.740
By utilizing adversarial machine learning. Adversarial?

00:15:11.759 --> 00:15:13.840
Yeah. You actually set up two neural networks

00:15:13.840 --> 00:15:16.460
and force them to compete in minimax games. Like

00:15:16.460 --> 00:15:18.399
they're playing against each other. Exactly.

00:15:19.179 --> 00:15:22.220
The first network is a feature extractor. Its

00:15:22.220 --> 00:15:24.940
job is to look at the data and pull out the relevant

00:15:24.940 --> 00:15:28.220
information to perform the primary task, like

00:15:28.220 --> 00:15:30.919
classifying hair color. Okay. The second network

00:15:30.919 --> 00:15:33.820
acts as a domain discriminator. Its sole purpose

00:15:33.820 --> 00:15:36.039
is to look at the features extracted by the first

00:15:36.039 --> 00:15:38.960
network and guess. Did this image come from Italy

00:15:38.960 --> 00:15:41.500
or did it come from Norway? This sounds like

00:15:41.500 --> 00:15:44.139
a police sketch artist forced to draw a suspect

00:15:44.139 --> 00:15:47.039
for a detective. That is an excellent way to

00:15:47.039 --> 00:15:48.840
conceptualize it. Like if the sketch artist,

00:15:48.860 --> 00:15:51.000
the feature extractor draws the suspect standing

00:15:51.000 --> 00:15:53.299
in front of the Coliseum or wearing a heavy Norwegian

00:15:53.299 --> 00:15:56.960
winter coat, the detective, the discriminator, immediately

00:15:56.960 --> 00:15:59.740
knows where the image came from. Right. And in

00:15:59.740 --> 00:16:02.539
this adversarial setup, the sketch artist is

00:16:02.539 --> 00:16:05.519
actively mathematically penalized through something

00:16:05.519 --> 00:16:08.100
called a gradient reversal layer. If the detective

00:16:08.100 --> 00:16:10.659
successfully guesses the origin. Whoa. So the

00:16:10.659 --> 00:16:13.580
sketch artist is forced to stop drawing the backgrounds,

00:16:13.720 --> 00:16:15.940
the clothing, the lighting. everything that gives

00:16:15.940 --> 00:16:18.360
away the domain, they're forced to draw only

00:16:18.360 --> 00:16:21.500
the face. They extract only the core universal

00:16:21.500 --> 00:16:24.620
features of hair color that are completely independent

00:16:24.620 --> 00:16:28.440
of geography. Exactly. By heavily penalizing

00:16:28.440 --> 00:16:31.379
the network for retaining domain-specific information,

00:16:32.080 --> 00:16:34.100
you force it to become blind to the environment.
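
Mechanically, the gradient reversal layer itself is very simple. A conceptual sketch, not a real autograd layer, with lambda_ as the usual trade-off knob:

```python
# Conceptual sketch of a gradient reversal layer. Forward, it passes
# features through unchanged; backward, it flips the sign of the
# discriminator's gradient, so the feature extractor is pushed to
# INCREASE the discriminator's error, i.e. to drop domain-revealing
# features. Numbers below are invented for illustration.

def grl_forward(features):
    return features  # identity on the forward pass

def grl_backward(grad_from_discriminator, lambda_=1.0):
    return [-lambda_ * g for g in grad_from_discriminator]  # sign flip

# The discriminator's gradient says "make these features MORE
# domain-identifiable"; after reversal the extractor is told the opposite.
grad = [0.5, -0.2, 0.1]
print(grl_backward(grad))  # → [-0.5, 0.2, -0.1]
```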

00:16:34.590 --> 00:16:37.090
It strips away all the local shortcuts it used

00:16:37.090 --> 00:16:39.850
to rely on, leaving only the fundamental truth

00:16:39.850 --> 00:16:42.570
of the concept. That's brilliant. By destroying

00:16:42.570 --> 00:16:45.350
its ability to differentiate the domains, you

00:16:45.350 --> 00:16:47.889
force it to be universally smart rather than

00:16:47.889 --> 00:16:51.480
locally smart. You force the AI to focus on the

00:16:51.480 --> 00:16:54.059
signal by actively punishing it for seeing the

00:16:54.059 --> 00:16:56.039
noise. I love that. That's very elegant. Which

00:16:56.039 --> 00:16:58.480
brings us to the fourth algorithmic principle,

00:16:58.700 --> 00:17:00.700
and this one relies on a completely different

00:17:00.700 --> 00:17:02.840
branch of mathematics. Right. The hierarchical

00:17:02.840 --> 00:17:05.000
Bayesian model. Right. This approach involves

00:17:05.000 --> 00:17:07.599
constructing a Bayesian factorization model for

00:17:07.599 --> 00:17:10.460
counts represented mathematically as p of n.

00:17:10.599 --> 00:17:13.140
OK, let's translate p of n and factorization

00:17:13.140 --> 00:17:15.480
models into something tangible. How does this

00:17:15.480 --> 00:17:17.839
actually solve the domain problem? Well, let's

00:17:17.839 --> 00:17:21.660
say our data involves counting the frequency

00:17:21.660 --> 00:17:25.220
of specific words in an email, that's our n. The

00:17:25.220 --> 00:17:27.400
model seeks to understand the probability, the

00:17:27.400 --> 00:17:30.940
p, of those word counts. To do this across different

00:17:30.940 --> 00:17:34.079
domains, a hierarchical Bayesian model breaks

00:17:34.079 --> 00:17:37.140
the data down into latent representations. Latent

00:17:37.140 --> 00:17:39.859
representations, meaning? They're hidden underlying

00:17:39.859 --> 00:17:42.579
variables that explain the data we observe. So

00:17:42.579 --> 00:17:44.339
instead of just looking at the surface level

00:17:44.339 --> 00:17:48.279
words, the AI tries to deduce the hidden topics

00:17:48.279 --> 00:17:50.839
generating those words. Correct. But the brilliance

00:17:50.839 --> 00:17:53.220
of the hierarchical approach is how it structures

00:17:53.220 --> 00:17:56.099
those hidden variables. The model mathematically

00:17:56.099 --> 00:17:59.619
derives latent factors that are entirely specific

00:17:59.619 --> 00:18:02.680
to a single domain, but it also derives global

00:18:02.680 --> 00:18:05.430
latent factors that are shared across all domains.

00:18:05.569 --> 00:18:07.730
Oh, I see. It creates a filing system for its

00:18:07.730 --> 00:18:09.670
own knowledge. Basically, yeah. It essentially

00:18:09.670 --> 00:18:12.289
says, I've analyzed the data, and this particular

00:18:12.289 --> 00:18:14.970
rule about language only applies to the CEO's

00:18:14.970 --> 00:18:17.089
inbox. I'll put that in the domain-specific

00:18:17.089 --> 00:18:19.690
folder, but this other rule about how phishing

00:18:19.690 --> 00:18:22.039
links are formatted applies everywhere. I'll

00:18:22.039 --> 00:18:23.839
put that in the global folder. That's a great

00:18:23.839 --> 00:18:26.640
analogy. It mathematically separates the universal

00:18:26.640 --> 00:18:29.900
truths from the local quirks. It builds a hierarchy

00:18:29.900 --> 00:18:32.460
of knowledge, allowing for a specialized adaptation

00:18:32.460 --> 00:18:35.079
while maintaining a shared foundational understanding.

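[Editor's note: as a rough illustration of the "shared plus domain-specific latent factors" idea described above — not any specific published model, and all names here are hypothetical — the factorization can be sketched in Python. Each domain's expected word counts come from global factors shared by every domain plus factors unique to that domain, and p(n) is modeled as a count distribution around those rates.]

```python
import numpy as np

rng = np.random.default_rng(0)

n_words, n_topics = 50, 4          # vocabulary size, hidden topics
domains = ["ceo_inbox", "support_inbox"]

# Global latent factors: topic-to-word weights shared by every domain
# ("how phishing links are formatted applies everywhere").
global_factors = rng.gamma(2.0, 1.0, size=(n_topics, n_words))

# Domain-specific latent factors: the local quirks of each inbox.
local_factors = {d: rng.gamma(2.0, 0.2, size=(n_topics, n_words))
                 for d in domains}

def expected_counts(topic_mix, domain):
    """Mean word counts for one email: the hidden topic mixture drives
    both the shared (global) and the domain-only (local) word patterns."""
    return topic_mix @ (global_factors + local_factors[domain])

topic_mix = rng.dirichlet(np.ones(n_topics))   # one email's hidden topic mixture
# p(n): word counts distributed (here, Poisson) around the expected rates.
counts = rng.poisson(expected_counts(topic_mix, "ceo_inbox"))
print(counts.shape)  # (50,) — one simulated count per vocabulary word
```

The key structural point is that `global_factors` is learned once across all domains while `local_factors` is learned per domain, which is the hierarchy the hosts describe as "folders" of knowledge.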
00:18:35.309 --> 00:18:37.589
What's really remarkable is that this isn't just

00:18:37.589 --> 00:18:39.509
theoretical math sitting in an academic paper

00:18:39.509 --> 00:18:42.910
from 2010. This is active, deployable engineering.

00:18:43.069 --> 00:18:45.369
It really is. The source material emphasizes

00:18:45.369 --> 00:18:48.390
that if you, listening to this, are dealing with

00:18:48.390 --> 00:18:50.569
a covariate shift in your own data right now,

00:18:50.769 --> 00:18:53.369
you don't have to invent a gradient reversal

00:18:53.369 --> 00:18:55.750
layer from scratch. Definitely not. The theory

00:18:55.750 --> 00:18:58.390
has been beautifully translated into accessible

00:18:58.390 --> 00:19:01.410
software packages over the past decades. If you

00:19:01.410 --> 00:19:04.230
work in the Python ecosystem, which dominates

00:19:04.230 --> 00:19:07.190
this space, there are comprehensive tools available.

00:19:07.230 --> 00:19:09.950
Like what? Well, you have SKADA, which stands

00:19:09.950 --> 00:19:12.029
for scikit-adaptation, and that's designed

00:19:12.029 --> 00:19:14.190
to integrate smoothly with standard machine learning

00:19:14.190 --> 00:19:17.650
workflows. There's also ADAPT, the Awesome Domain

00:19:17.650 --> 00:19:20.829
Adaptation Python Toolbox, and TLlib, the Transfer

00:19:20.829 --> 00:19:22.910
Learning Library. Nice. And if you're working

00:19:22.910 --> 00:19:25.470
outside of Python? For engineers operating within

00:19:25.470 --> 00:19:28.009
MATLAB, there is the appropriately named Domain

00:19:28.009 --> 00:19:31.500
Adaptation Toolbox. These packages compile the

00:19:31.500 --> 00:19:33.859
reweighting algorithms, the adversarial networks,

00:19:34.420 --> 00:19:36.680
and the iterative auto-labelers into ready-to-use

00:19:36.680 --> 00:19:39.240
functions. So the theoretical math has really

00:19:39.240 --> 00:19:41.380
just been commoditized into practical tools.

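[Editor's note: a minimal from-scratch sketch of the covariate-shift reweighting these toolboxes package — not the actual API of SKADA, ADAPT, or TLlib. A common trick is to train a domain classifier and use its odds as the density ratio p_target(x)/p_source(x), then reweight the source examples; data here is synthetic.]

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Source and target inputs drawn from shifted distributions (covariate shift).
X_src = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
X_tgt = rng.normal(loc=1.0, scale=1.0, size=(1000, 2))

# Train a domain classifier: 0 = source, 1 = target.
X = np.vstack([X_src, X_tgt])
d = np.concatenate([np.zeros(1000), np.ones(1000)])
domain_clf = LogisticRegression().fit(X, d)

# Density ratio w(x) = p_target(x) / p_source(x) ~ P(d=1|x) / P(d=0|x).
p = domain_clf.predict_proba(X_src)[:, 1]
weights = p / (1.0 - p)

# Source points that look "target-like" get upweighted when training
# the actual task model (toy labels for illustration).
y_src = (X_src[:, 0] + X_src[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_src, y_src, sample_weight=weights)
print(round(weights.mean(), 2))
```

This is the "reweighting the AI's memories" step in miniature: no target labels are needed, only unlabeled target inputs to estimate the shift.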
00:19:41.880 --> 00:19:44.019
Exactly. Well, let's synthesize the journey we've

00:19:44.019 --> 00:19:46.490
taken today. We started with the realization

00:19:46.490 --> 00:19:49.670
that an AI is fundamentally captive to the specific

00:19:49.670 --> 00:19:51.769
environment it was trained in. Right, quite literally.

00:19:51.990 --> 00:19:54.650
And we explored the engineering required to help

00:19:54.650 --> 00:19:57.710
a model step from a safe, familiar source domain

00:19:57.710 --> 00:20:01.829
D sub S into a wild, unpredictable target domain

00:20:01.829 --> 00:20:05.009
D sub T. We navigated the distinct types of statistical

00:20:05.009 --> 00:20:07.670
failures: covariate shifts, where the input patterns

00:20:07.670 --> 00:20:10.390
change, prior shifts where the demographic proportions

00:20:10.390 --> 00:20:12.910
flip, and concept shifts where the foundational

00:20:12.910 --> 00:20:15.359
rules of reality are completely rewritten. And

00:20:15.359 --> 00:20:18.279
we broke down how engineers execute the fixes

00:20:18.279 --> 00:20:21.700
by meticulously reweighting the AI's memories,

00:20:22.200 --> 00:20:24.460
by carefully allowing it to auto-label its future data

00:20:24.460 --> 00:20:27.140
through pseudo-labels, by using adversarial

00:20:27.140 --> 00:20:30.059
networks to strip away environmental noise, and

00:20:30.059 --> 00:20:32.380
by building Bayesian hierarchies to separate

00:20:32.380 --> 00:20:35.900
local quirks from universal truths. It is a remarkable

00:20:35.900 --> 00:20:38.680
mathematical process of unlearning and relearning,

00:20:39.160 --> 00:20:42.039
which, honestly, if we revisit our earlier discussion,

00:20:42.259 --> 00:20:44.680
brings up a rather profound implication regarding

00:20:44.680 --> 00:20:46.940
concept shifts. Oh, right. The scenario where

00:20:46.940 --> 00:20:49.099
the exact same symptom points to an entirely

00:20:49.099 --> 00:20:51.059
different disease depending on the hospital the

00:20:51.059 --> 00:20:54.160
AI is deployed in. Yeah. If an AI model is constantly

00:20:54.160 --> 00:20:57.160
adapting to concept shifts, where the very relationship

00:20:57.160 --> 00:20:59.759
between inputs and outputs fundamentally changes

00:20:59.759 --> 00:21:02.619
to fit a new reality, at what point does the

00:21:02.619 --> 00:21:04.519
adapted model become an entirely new entity?

00:21:04.680 --> 00:21:08.039
Wow. It's the AI equivalent of the ship of Theseus.

00:21:08.400 --> 00:21:10.460
If you replace every plank of wood on a ship

00:21:10.460 --> 00:21:13.200
over time, is it still the same ship? Exactly.

00:21:13.359 --> 00:21:15.859
If the model has to continuously overwrite its

00:21:15.859 --> 00:21:18.519
foundational mapping of X to Y just to survive

00:21:18.519 --> 00:21:21.839
a new domain, are we actually transferring knowledge?

00:21:22.180 --> 00:21:25.259
Or are we slowly overwriting the original AI's

00:21:25.259 --> 00:21:27.579
reality until absolutely nothing of the source

00:21:27.579 --> 00:21:30.180
domain remains? If the same symptom now means

00:21:30.180 --> 00:21:32.539
a different disease, the model's original truth

00:21:32.539 --> 00:21:34.819
is just gone. Something for you to chew on the

00:21:34.819 --> 00:21:36.900
next time you hear about a quote unquote highly

00:21:36.900 --> 00:21:40.019
trained AI being deployed into a brand new environment.

00:21:40.480 --> 00:21:42.319
Whose reality is it actually experiencing?
