WEBVTT

00:00:00.000 --> 00:00:01.540
You know, if you really sit back and just watch

00:00:01.540 --> 00:00:04.480
a toddler figuring out the world, it is honestly

00:00:04.480 --> 00:00:07.019
a master class in data processing. Oh, totally.

00:00:07.360 --> 00:00:09.900
We completely take it for granted. Right. But

00:00:09.900 --> 00:00:11.640
cognitive scientists are constantly pointing

00:00:11.640 --> 00:00:14.060
out that, well, babies don't have a supervisor

00:00:14.060 --> 00:00:17.600
following them around, explicitly labeling every

00:00:17.600 --> 00:00:19.660
single object in their field of vision. Yeah,

00:00:19.780 --> 00:00:21.519
there's no one constantly whispering, this is

00:00:21.519 --> 00:00:24.350
a dog, this is not a dog, this is a cup. Exactly.

00:00:24.530 --> 00:00:27.489
They just look, they interact, they observe structural

00:00:27.489 --> 00:00:30.449
patterns, and somehow, like entirely on their

00:00:30.449 --> 00:00:33.250
own, they piece together the underlying physics

00:00:33.250 --> 00:00:37.469
of reality. They are essentially using the environment

00:00:37.469 --> 00:00:41.189
itself as the curriculum. The raw data they take

00:00:41.189 --> 00:00:43.810
in becomes its own teacher, which is, I mean,

00:00:43.909 --> 00:00:45.789
it's incredibly efficient compared to needing

00:00:45.789 --> 00:00:48.390
a guided tour for every new object. And yet,

00:00:48.549 --> 00:00:50.840
for the longest time, artificial intelligence

00:00:50.840 --> 00:00:53.280
just didn't work like that at all. AI development

00:00:53.280 --> 00:00:56.679
was, you know, severely bottlenecked. Because

00:00:56.679 --> 00:00:59.759
human beings had to painstakingly sit there and

00:00:59.759 --> 00:01:02.359
label every single piece of training data. Right,

00:01:02.399 --> 00:01:04.859
we were talking about millions of images tagged

00:01:04.859 --> 00:01:07.739
by hand just to teach a machine what a stop sign

00:01:07.739 --> 00:01:10.280
looks like under, like, different lighting conditions.

00:01:10.420 --> 00:01:13.310
Yeah, it was a massive chore. But today on this

00:01:13.310 --> 00:01:15.390
deep dive, we are looking at a stack of notes

00:01:15.390 --> 00:01:18.629
and a really foundational Wikipedia article on

00:01:18.629 --> 00:01:21.670
a machine learning paradigm that completely flips

00:01:21.670 --> 00:01:24.590
that script. We're talking about self-supervised

00:01:24.590 --> 00:01:27.810
learning, or SSL. It's a game changer. It really

00:01:27.810 --> 00:01:30.329
is. We're going to explore how machines are finally

00:01:30.329 --> 00:01:32.650
learning to teach themselves by finding hidden

00:01:32.650 --> 00:01:36.409
structures in raw, unlabeled data. Okay, let's

00:01:36.409 --> 00:01:38.930
unpack this. How on earth does a mathematical

00:01:38.930 --> 00:01:41.379
model actually grade its own homework? Well,

00:01:41.379 --> 00:01:44.040
it requires a pretty fundamental shift in how

00:01:44.040 --> 00:01:46.599
we think about the input data itself. Because

00:01:46.599 --> 00:01:49.099
to understand how a machine teaches itself, we

00:01:49.099 --> 00:01:51.159
first need to look at how it manipulates the

00:01:51.159 --> 00:01:53.840
information we feed it. In traditional supervised

00:01:53.840 --> 00:01:55.859
learning, the process relies on two distinct

00:01:55.859 --> 00:01:58.439
things. You have the data, and you have the external

00:01:58.439 --> 00:02:01.939
label provided by a human. But in self-supervised

00:02:01.939 --> 00:02:04.500
learning, the architecture actually forces the

00:02:04.500 --> 00:02:07.780
model to use the raw data to generate its own

00:02:07.780 --> 00:02:10.060
supervisory signals. It's essentially a two-step

00:02:10.060 --> 00:02:12.949
process. So what's step one? First, the model

00:02:12.949 --> 00:02:16.129
is assigned what researchers call a pretext task

00:02:16.129 --> 00:02:19.449
or an auxiliary task. The goal here isn't to

00:02:19.449 --> 00:02:21.550
solve the main problem yet. Sounds like a warm-

00:02:21.550 --> 00:02:25.159
up. Exactly. The goal is for the model to generate

00:02:25.159 --> 00:02:28.379
its own pseudo labels to initialize its internal

00:02:28.379 --> 00:02:31.719
parameters and really just build a baseline understanding

00:02:31.719 --> 00:02:34.240
of the data structure. And only after it has

00:02:34.240 --> 00:02:36.159
constructed that foundational geometry does it

00:02:36.159 --> 00:02:39.159
move on to step two, which is tackling the actual

00:02:39.159 --> 00:02:42.319
downstream task we care about. Got it. A

00:02:42.319 --> 00:02:46.639
good way to visualize this pretext task is to

00:02:46.639 --> 00:02:48.830
think about a jigsaw puzzle. But imagine you

00:02:48.830 --> 00:02:51.050
bought this jigsaw puzzle at a garage sale, and

00:02:51.050 --> 00:02:53.150
it just came in a blank plastic bag. Oh, I hate

00:02:53.150 --> 00:02:55.009
when that happens. Right. You do not have the

00:02:55.009 --> 00:02:57.189
box. You don't have that external label, the

00:02:57.189 --> 00:02:59.310
picture on the cover telling you what the final

00:02:59.310 --> 00:03:01.409
image is actually supposed to be. But you aren't

00:03:01.409 --> 00:03:04.330
completely helpless. Exactly. Even without the

00:03:04.330 --> 00:03:06.310
box, you can still look at the physical shapes

00:03:06.310 --> 00:03:08.629
of the pieces. You see a flat blue edge. You

00:03:08.629 --> 00:03:11.669
see how the color gradients on two pieces match

00:03:11.669 --> 00:03:14.469
perfectly. You see the interlocking tabs. So

00:03:14.469 --> 00:03:17.469
by solving the pretext puzzle of how the physical

00:03:17.469 --> 00:03:20.210
pieces relate to each other, you inherently learn

00:03:20.210 --> 00:03:22.949
the structure of the image. Right. Without anyone

00:03:22.949 --> 00:03:25.110
ever explicitly telling you, hey, it's a picture

00:03:25.110 --> 00:03:27.629
of a mountain. Yes. The relationships within

00:03:27.629 --> 00:03:30.990
the data become the explicit teacher, and developers

00:03:30.990 --> 00:03:34.009
force the AI to learn these internal relationships

00:03:34.009 --> 00:03:36.289
through a technique called data augmentation.

00:03:36.639 --> 00:03:38.979
Break that down for me. So they take the original

00:03:38.979 --> 00:03:42.159
input data and mathematically transform it to

00:03:42.159 --> 00:03:45.080
create pairs of related samples. Going back to

00:03:45.080 --> 00:03:47.740
your puzzle, imagine taking a cluster of pieces

00:03:47.740 --> 00:03:50.240
and introducing digital noise by blurring them.

00:03:50.400 --> 00:03:52.599
Or maybe like cropping a section out entirely.

00:03:52.659 --> 00:03:54.780
Yeah, cropping them, shifting the color balance

00:03:54.780 --> 00:03:57.939
or rotating them 90 degrees. The AI is forced

00:03:57.939 --> 00:04:00.159
to look at the original clean sample and this

00:04:00.159 --> 00:04:03.060
heavily altered augmented sample and figure out

00:04:03.060 --> 00:04:05.849
how they connect. By solving the puzzle of these

00:04:05.849 --> 00:04:08.710
transformed images, the architecture forces the

00:04:08.710 --> 00:04:12.210
AI to ignore the superficial noise, like lighting

00:04:12.210 --> 00:04:16.149
or orientation, and capture the absolute essential

00:04:16.149 --> 00:04:18.670
underlying features of the data. Because if the

00:04:18.670 --> 00:04:21.670
model can still confidently identify that a puzzle

00:04:21.670 --> 00:04:24.350
piece belongs in a specific spot, even after

00:04:24.350 --> 00:04:26.430
you've rotated it, blurred it, changed the color

00:04:26.430 --> 00:04:29.310
tint, it means the model actually understands

00:04:29.310 --> 00:04:32.449
the geometric concept of what the piece is. Right.

00:04:32.509 --> 00:04:34.850
It's no longer just memorizing exactly how that

00:04:34.850 --> 00:04:37.069
piece looked under perfect static conditions.
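
NOTE
Editor's note: a minimal Python sketch of the augmentation pairing described above, using torchvision. The exact transform recipe and the filename are illustrative assumptions, not the source's method.
import torchvision.transforms as T
from PIL import Image
# Two random "views" of the same image form a related pair: same content,
# different superficial noise (crop, color shift, rotation, blur).
augment = T.Compose([
    T.RandomResizedCrop(224),           # crop a section out
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),  # shift the color balance
    T.RandomRotation(degrees=90),       # rotate up to 90 degrees
    T.GaussianBlur(kernel_size=9),      # introduce blur/noise
    T.ToTensor(),
])
img = Image.open("photo.jpg")  # hypothetical input file
view_a, view_b = augment(img), augment(img)  # a positive pair for training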

00:04:37.089 --> 00:04:39.269
Yeah. And once it starts representing concepts

00:04:39.269 --> 00:04:41.990
rather than just memorizing pixels, it unlocks

00:04:41.990 --> 00:04:45.290
a level of robustness that traditional AI just

00:04:45.290 --> 00:04:47.910
really struggles with. It builds a genuine feature

00:04:47.910 --> 00:04:50.490
space. Which sounds brilliant for a static controlled

00:04:50.490 --> 00:04:52.790
data set. But if you're talking about robust

00:04:52.790 --> 00:04:55.069
AI, we have to talk about the real world. And

00:04:55.069 --> 00:04:56.930
well, the real world isn't static. It changes

00:04:56.930 --> 00:04:59.470
constantly. All the time. So what happens when

00:04:59.470 --> 00:05:01.449
the data the machine is looking at suddenly

00:05:01.449 --> 00:05:04.209
starts acting differently? Say we have a self-

00:05:04.209 --> 00:05:07.370
supervised algorithm predicting traffic patterns

00:05:07.370 --> 00:05:11.389
based on historical data and suddenly a new massive

00:05:11.389 --> 00:05:14.810
highway interchange opens up or a major event

00:05:14.810 --> 00:05:18.050
fundamentally shifts human commuting behavior

00:05:18.050 --> 00:05:20.810
overnight. That phenomenon is a major hurdle

00:05:20.810 --> 00:05:23.600
in machine learning. It's known as concept drift.

00:05:24.100 --> 00:05:26.279
It occurs when the statistical properties of

00:05:26.279 --> 00:05:28.959
the data just change over time. So the fundamental

00:05:28.959 --> 00:05:31.120
rules of the environment shift. Exactly. And

00:05:31.120 --> 00:05:33.480
suddenly the model's past learning its entire

00:05:33.480 --> 00:05:36.420
internal geometry is outdated and it starts making

00:05:36.420 --> 00:05:39.639
inaccurate predictions. This is where those pseudo

00:05:39.639 --> 00:05:42.040
labels we mentioned earlier become absolutely

00:05:42.040 --> 00:05:45.379
critical for survival. How so? Well, in a dynamic

00:05:45.379 --> 00:05:47.879
environment, the model will inevitably detect

00:05:47.879 --> 00:05:50.860
that a new incoming stream of data deviates from

00:05:50.860 --> 00:05:53.339
its previous learned baseline. Okay. When it

00:05:53.339 --> 00:05:55.660
sees this, the system generates a classification

00:05:55.660 --> 00:05:58.620
result for that new weird instance based on its

00:05:58.620 --> 00:06:01.300
best current understanding. It then takes that

00:06:01.300 --> 00:06:04.019
predicted class, uses it as a surrogate ground

00:06:04.019 --> 00:06:06.459
truth, a pseudo label, and feeds it back into

00:06:06.459 --> 00:06:08.939
itself. To update or retrain the specific components

00:06:08.939 --> 00:06:10.660
that are starting to age out. Wait, I have to

00:06:10.660 --> 00:06:13.560
jump in and push back on this because structurally

00:06:13.560 --> 00:06:16.279
that sounds like a recipe for a catastrophic

00:06:16.279 --> 00:06:19.220
failure. How do you mean? If the model is suddenly

00:06:19.220 --> 00:06:21.720
facing data it doesn't fully understand, and

00:06:21.720 --> 00:06:24.920
it's generating its own labels based on its own

00:06:24.920 --> 00:06:27.699
potentially flawed guesses, and then training

00:06:27.699 --> 00:06:30.259
itself on those guesses. Ah, I see what you're

00:06:30.259 --> 00:06:32.319
saying. Doesn't that risk a terrible feedback

00:06:32.319 --> 00:06:36.439
loop? Like, it makes a mistake, confidently labels

00:06:36.439 --> 00:06:39.360
that mistake as the absolute truth, learns from

00:06:39.360 --> 00:06:42.319
it, and just aggressively propagates its own

00:06:42.319 --> 00:06:44.819
errors until the whole system completely breaks

00:06:44.819 --> 00:06:47.819
down. What's fascinating here is how the architecture

00:06:47.819 --> 00:06:50.819
is specifically designed to prevent exactly that

00:06:50.819 --> 00:06:53.860
kind of cascading failure. Adaptive learning

00:06:53.860 --> 00:06:56.379
pipelines don't just blindly trust every guess

00:06:56.379 --> 00:06:58.620
they make. So there are guardrails. Incredibly

00:06:58.620 --> 00:07:01.220
strict statistical thresholds, yeah. The model

00:07:01.220 --> 00:07:03.620
calculates a probability distribution for every

00:07:03.620 --> 00:07:05.959
single prediction, evaluating its own certainty.
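
NOTE
Editor's note: a hedged sketch of the confidence-gated pseudo-labeling loop described here. The threshold value, the model interface (class logits out), and the function name are assumptions for illustration.
import torch
import torch.nn.functional as F
CONFIDENCE_THRESHOLD = 0.99  # assumed, matching the "99%" figure mentioned
def maybe_pseudo_label(model, x):
    """Return (sample, pseudo_label) only if the model is confident enough."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)  # probability distribution per class
    confidence, pseudo_label = probs.max(dim=-1)
    if confidence.item() >= CONFIDENCE_THRESHOLD:
        return x, pseudo_label  # surrogate ground truth, safe to retrain on
    return None  # a merely 55%-sure guess is simply discarded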

00:07:06.019 --> 00:07:08.500
It knows how confident it is. Yes, and it only

00:07:08.500 --> 00:07:11.220
chooses to use the pseudo label when the classifier

00:07:11.220 --> 00:07:13.100
produces a sufficiently confident prediction,

00:07:13.120 --> 00:07:16.529
say, like, a 99% certainty score. So if the model

00:07:16.529 --> 00:07:19.029
looks at the new traffic pattern and is only

00:07:19.029 --> 00:07:21.949
55% sure about what's happening, it just discards

00:07:21.949 --> 00:07:24.550
the prediction. Exactly. It refuses to train

00:07:24.550 --> 00:07:27.259
on it. It only incorporates the pseudo-labeled

00:07:27.259 --> 00:07:29.639
instance when it hits that high confidence threshold,

00:07:29.939 --> 00:07:32.199
using it to refresh its understanding of the

00:07:32.199 --> 00:07:34.620
emerging patterns. So it's essentially acting

00:07:34.620 --> 00:07:37.699
like a highly advanced student who encounters

00:07:37.699 --> 00:07:40.639
a new type of calculus problem. Oh, that's a

00:07:40.639 --> 00:07:43.000
good analogy. They know the foundational material

00:07:43.000 --> 00:07:45.259
well enough to realize that this new problem

00:07:45.259 --> 00:07:47.879
is really just a slight variation of something

00:07:47.879 --> 00:07:50.620
they already mastered. They confidently apply

00:07:50.620 --> 00:07:53.740
the rule, verify the internal logic holds up,

00:07:54.139 --> 00:07:56.819
and update their own mental toolkit. Right, completely

00:07:56.819 --> 00:07:58.959
bypassing the need for the teacher to hold their

00:07:58.959 --> 00:08:01.079
hand through it. That captures the mechanism

00:08:01.079 --> 00:08:04.100
perfectly. It allows the system to achieve continuous

00:08:04.100 --> 00:08:07.199
adaptation in dynamic environments without requiring

00:08:07.199 --> 00:08:09.939
a human engineering team to constantly step in,

00:08:10.000 --> 00:08:13.199
pause the system, and manually annotate the new

00:08:13.199 --> 00:08:15.699
drifting data. Okay, so we understand the broad

00:08:15.699 --> 00:08:19.629
strokes. The machine uses the data to train itself

00:08:19.629 --> 00:08:22.730
via pretext tasks, and it guards against its

00:08:22.730 --> 00:08:25.550
own hallucinations by relying on strict confidence

00:08:25.550 --> 00:08:27.750
thresholds. Right. But if we actually open up

00:08:27.750 --> 00:08:30.149
the hood and look at the math, how does it test

00:08:30.149 --> 00:08:33.769
itself? Because the source material breaks self-

00:08:33.769 --> 00:08:35.710
supervised learning down into a few distinct

00:08:35.710 --> 00:08:38.250
flavors, starting with something called auto-

00:08:38.250 --> 00:08:40.440
associative learning. Yeah, auto-associative

00:08:40.440 --> 00:08:42.379
self-supervised learning is essentially the

00:08:42.379 --> 00:08:45.419
foundational architecture. In this setup, a neural

00:08:45.419 --> 00:08:48.299
network is quite literally trained to reproduce

00:08:48.299 --> 00:08:51.279
or reconstruct its own input data. It associates

00:08:51.279 --> 00:08:53.720
the input with itself. Exactly. And historically,

00:08:53.980 --> 00:08:55.980
this is almost always achieved using a structure

00:08:55.980 --> 00:08:58.580
called an autoencoder. The mechanics of autoencoders

00:08:58.580 --> 00:09:01.139
are fascinating because they force the machine

00:09:01.139 --> 00:09:03.700
into a bottleneck. It's a lot like having to

00:09:03.700 --> 00:09:05.740
verbally describe a suspect to a police sketch

00:09:05.740 --> 00:09:08.480
artist over a terrible, static-filled phone

00:09:08.480 --> 00:09:11.080
line. I love that comparison. Right. The original

00:09:11.080 --> 00:09:13.899
high-dimensional data, let's say a massive,

00:09:14.019 --> 00:09:16.399
high-resolution photograph of the suspect, is

00:09:16.399 --> 00:09:18.899
what you start with. But you can't transmit every

00:09:18.899 --> 00:09:21.220
single pixel over that bad phone line. No, you'd

00:09:21.220 --> 00:09:23.639
run out of bandwidth. You have an encoder that

00:09:23.639 --> 00:09:25.759
compresses this massive amount of information

00:09:25.759 --> 00:09:29.279
into a tiny, restricted format, which in machine

00:09:29.279 --> 00:09:31.240
learning is called the lower-dimensional latent

00:09:31.240 --> 00:09:34.240
space. You have to figure out what is absolutely

00:09:34.240 --> 00:09:36.779
essential. You don't describe every individual

00:09:36.779 --> 00:09:39.600
eyelash. Exactly. You focus on the defining scar

00:09:39.600 --> 00:09:42.679
on the cheek and the sharp jawline. You compress

00:09:42.679 --> 00:09:46.080
the data down to its most meaningful, concentrated

00:09:46.080 --> 00:09:48.259
representation. And the magic happens on the

00:09:48.259 --> 00:09:49.940
other end of that phone line. Right, with the

00:09:49.940 --> 00:09:52.220
decoder. The decoder, the sketch artist, has

00:09:52.220 --> 00:09:54.559
to take that tiny, compressed set of instructions

00:09:54.559 --> 00:09:58.539
from the latent space and somehow perfectly redraw

00:09:58.539 --> 00:10:01.120
the original massive high-resolution photograph.

00:10:01.360 --> 00:10:03.679
And the model learns by constantly comparing

00:10:03.679 --> 00:10:07.159
the original photo to the final sketch. It calculates

00:10:07.159 --> 00:10:09.720
a mathematical penalty for every single difference,

00:10:10.039 --> 00:10:12.820
which algorithms refer to as the mean squared

00:10:12.820 --> 00:10:15.480
error. So if the sketch artist drew a round jaw

00:10:15.480 --> 00:10:17.740
instead of a sharp one, the mathematical penalty

00:10:17.740 --> 00:10:21.360
is massive. Exactly. By aggressively trying to

00:10:21.360 --> 00:10:24.600
tweak its internal math to minimize that reconstruction

00:10:24.600 --> 00:10:28.200
error, the autoencoder learns an incredibly efficient

00:10:28.200 --> 00:10:30.539
vocabulary for the core features of the data.
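
NOTE
Editor's note: a minimal PyTorch autoencoder sketch of the encoder-bottleneck-decoder idea just described; the layer sizes are illustrative assumptions.
import torch.nn as nn
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input into a narrow latent space
        # (the "bad phone line" bottleneck).
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Decoder: the "sketch artist" redraws the input from that summary.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))
    def forward(self, x):
        return self.decoder(self.encoder(x))
# Training minimizes the mean squared reconstruction error:
# loss = nn.MSELoss()(model(x), x)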

00:10:30.720 --> 00:10:33.519
Because if the phone line had unlimited bandwidth,

00:10:34.360 --> 00:10:36.580
the network would just lazily transmit the image

00:10:36.580 --> 00:10:38.679
pixel for pixel without actually learning what

00:10:38.679 --> 00:10:41.809
a jawline or a scar even is. Right, the informational

00:10:41.809 --> 00:10:44.429
bottleneck is the entire point. The restriction

00:10:44.429 --> 00:10:47.250
forces comprehension. But while autoencoders

00:10:47.250 --> 00:10:49.909
are elegant, they suffer from a massive practical

00:10:49.909 --> 00:10:53.029
flaw. The computing power, right. Forcing a machine

00:10:53.029 --> 00:10:55.690
to perfectly recreate every single background

00:10:55.690 --> 00:10:58.389
pixel of an image requires immense computational

00:10:58.389 --> 00:11:01.509
power. And frankly, recreating the exact texture

00:11:01.509 --> 00:11:03.409
of a brick wall in the background of an image

00:11:03.409 --> 00:11:06.309
doesn't help the AI understand the core subject.

00:11:06.490 --> 00:11:09.090
Which makes sense. And this inefficiency led

00:11:09.090 --> 00:11:11.409
to the second major architecture, contrastive

00:11:11.409 --> 00:11:14.210
self-supervised learning. Exactly. Because instead

00:11:14.210 --> 00:11:16.710
of forcing the machine to perfectly redraw the

00:11:16.710 --> 00:11:19.549
image, contrastive learning just asks the machine

00:11:19.549 --> 00:11:21.750
to spot the difference between two things. So

00:11:21.750 --> 00:11:24.129
it relies on comparisons rather than reconstruction.

00:11:24.269 --> 00:11:26.769
Right. It operates using positive and negative

00:11:26.769 --> 00:11:30.070
examples. Imagine a basic binary task where the

00:11:30.070 --> 00:11:33.340
AI needs to understand what a dog is. Positive

00:11:33.340 --> 00:11:36.759
examples are augmented variations of a dog image.

00:11:36.960 --> 00:11:38.980
And the negative examples? They're images of

00:11:38.980 --> 00:11:42.539
literally anything else. Cats, cars, trees. The

00:11:42.539 --> 00:11:44.840
loss function, the mathematical rule the system

00:11:44.840 --> 00:11:47.480
uses to correct its internal weights, works by

00:11:47.480 --> 00:11:49.740
minimizing the distance between positive sample

00:11:49.740 --> 00:11:52.039
pairs. Pulling them closer together in the mathematical

00:11:52.039 --> 00:11:54.519
space. Yes. And simultaneously, it maximizes

00:11:54.519 --> 00:11:57.259
the distance between negative sample pairs, aggressively

00:11:57.259 --> 00:11:59.539
pushing them further apart. So it's learning

00:11:59.539 --> 00:12:02.019
by sorting. This vector looks like that vector,

00:12:02.179 --> 00:12:04.299
so group them closely together. This vector looks

00:12:04.299 --> 00:12:06.080
nothing like that one, so repel them. That's

00:12:06.080 --> 00:12:08.679
exactly it. The source material gave some incredible

00:12:08.679 --> 00:12:12.139
real-world examples of this, like CLIP, which

00:12:12.139 --> 00:12:14.700
stands for Contrastive Language-Image Pre-training.

00:12:14.860 --> 00:12:17.679
Oh, CLIP is a great example. It takes an image,

00:12:17.980 --> 00:12:20.500
say a picture of a golden retriever, and a piece

00:12:20.500 --> 00:12:22.779
of text, the words golden retriever, and tries

00:12:22.779 --> 00:12:26.139
to align them. If they match, it considers them

00:12:26.139 --> 00:12:29.419
a positive pair. The math adjusts the internal

00:12:29.419 --> 00:12:32.120
weights to give their encoding vectors a large

00:12:32.120 --> 00:12:35.100
cosine similarity. And stripping away the jargon,

00:12:35.379 --> 00:12:37.460
cosine similarity is just measuring the angle

00:12:37.460 --> 00:12:40.039
between two arrows in high-dimensional math

00:12:40.039 --> 00:12:42.200
space. Right. If the arrow for the image and

00:12:42.200 --> 00:12:44.120
the arrow for the text point in the exact same

00:12:44.120 --> 00:12:47.299
direction, the angle is zero, meaning the similarity

00:12:47.299 --> 00:12:50.679
is incredibly high. There's also InfoNCE, or

00:12:50.679 --> 00:12:53.419
Noise-Contrastive Estimation, which scales this

00:12:53.419 --> 00:12:55.960
up across massive datasets, optimizing models

00:12:55.960 --> 00:12:58.980
by comparing one positive sample against a massive

00:12:58.980 --> 00:13:01.259
batch of negative noise samples. Contrastive

00:13:01.259 --> 00:13:02.860
learning dominated the field because it just

00:13:02.860 --> 00:13:05.679
makes intuitive logical sense. You build a robust

00:13:05.679 --> 00:13:08.460
definition of a dog by studying dogs, but just

00:13:08.460 --> 00:13:11.399
as importantly, by studying what is not a dog.
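
NOTE
Editor's note: a sketch of an InfoNCE-style contrastive loss as just described, pulling the positive pair together and pushing a batch of negatives apart. Shapes and the temperature value are assumptions.
import torch
import torch.nn.functional as F
def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    # Unit-length vectors make a dot product equal cosine similarity.
    anchor = F.normalize(anchor, dim=-1)        # (batch, dim)
    positive = F.normalize(positive, dim=-1)    # (batch, dim)
    negatives = F.normalize(negatives, dim=-1)  # (num_neg, dim)
    pos_sim = (anchor * positive).sum(-1, keepdim=True)  # pull closer
    neg_sim = anchor @ negatives.T                       # push apart
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature
    # The "correct class" is always index 0: the positive sample.
    labels = torch.zeros(anchor.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)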

00:13:11.559 --> 00:13:14.139
But the architecture hit a wall. Processing all

00:13:14.139 --> 00:13:17.080
those negative examples is computationally exhausting.

00:13:17.659 --> 00:13:20.179
To make the comparisons mathematically robust

00:13:20.179 --> 00:13:22.419
and prevent the model from getting confused,

00:13:22.980 --> 00:13:25.860
you need massive batch sizes of negative data

00:13:25.860 --> 00:13:28.919
loaded into memory at the exact same time. It

00:13:28.919 --> 00:13:31.360
requires staggering amounts of hardware. It really

00:13:31.360 --> 00:13:33.840
does. Which brings us to the rebels of the machine

00:13:33.840 --> 00:13:36.659
learning world, because engineers absolutely

00:13:36.659 --> 00:13:40.000
hate hardware limitations. Contrastive learning

00:13:40.000 --> 00:13:43.360
makes logical sense, but researchers asked, what

00:13:43.360 --> 00:13:45.740
happens if we just take away the negative examples

00:13:45.740 --> 00:13:48.840
entirely to save computing power? This breakthrough

00:13:48.840 --> 00:13:51.399
is called non-contrastive self-supervised learning

00:13:51.399 --> 00:13:55.990
or NCSSL. In this approach, using complex methods

00:13:55.990 --> 00:13:58.730
like BYOL, which stands for Bootstrap Your

00:13:58.730 --> 00:14:01.769
Own Latent, or DirectPred, the system uses only

00:14:01.769 --> 00:14:04.330
positive examples. It completely ignores negative

00:14:04.330 --> 00:14:06.210
samples. Completely ignores them. It just takes

00:14:06.210 --> 00:14:08.490
two augmented views of the exact same image and

00:14:08.490 --> 00:14:10.370
tries to maximize the mathematical similarity

00:14:10.370 --> 00:14:12.610
between them. Here's where it gets really interesting.

00:14:13.070 --> 00:14:15.769
Because on paper, that math shouldn't work at

00:14:15.769 --> 00:14:18.610
all. Not even a little bit. If you only show

00:14:18.610 --> 00:14:21.870
the AI positive examples and your only instruction

00:14:21.870 --> 00:14:24.610
is make the math match, wouldn't the easiest

00:14:24.610 --> 00:14:27.429
way for the network to get a perfect score be

00:14:27.429 --> 00:14:30.090
to just classify literally everything in the

00:14:30.090 --> 00:14:32.610
universe as the exact same thing? Yes. If it

00:14:32.610 --> 00:14:35.370
just outputs a flat zero for every single image

00:14:35.370 --> 00:14:37.950
it ever sees, the difference between any two

00:14:37.950 --> 00:14:41.019
images is always zero. It achieves a perfect

00:14:41.019 --> 00:14:43.799
zero loss score without actually learning a single

00:14:43.799 --> 00:14:46.639
feature. It feels like the AI taking a massive

00:14:46.639 --> 00:14:48.919
lazy shortcut. You've just described the exact

00:14:48.919 --> 00:14:51.399
phenomenon researchers call representation collapse.

00:14:51.960 --> 00:14:54.220
It is the most obvious glaring flaw in the theory,

00:14:54.259 --> 00:14:56.500
and it's why many thought non-contrastive learning

00:14:56.500 --> 00:14:59.639
was just a dead end. But it works. Counterintuitively,

00:14:59.720 --> 00:15:02.220
yes, these methods don't collapse into that lazy

00:15:02.220 --> 00:15:04.799
shortcut. They actually converge on a highly

00:15:04.799 --> 00:15:07.620
useful local minimum. And the reason they avoid

00:15:07.620 --> 00:15:10.419
the trivial zero-loss solution is due to a very

00:15:10.419 --> 00:15:13.200
clever asymmetrical architectural trick. How

00:15:13.200 --> 00:15:15.100
do they actually stop the math from collapsing

00:15:15.100 --> 00:15:18.779
in on itself? Effective NCSSL requires an asymmetrical

00:15:18.779 --> 00:15:22.059
design. Instead of one network, you have two

00:15:22.059 --> 00:15:24.620
neural networks looking at the two different

00:15:24.620 --> 00:15:27.549
augmented views of the image. We call them the

00:15:27.549 --> 00:15:29.830
online network and the target network. Okay,

00:15:29.990 --> 00:15:33.429
two networks. The trick is adding an extra predictor

00:15:33.429 --> 00:15:36.549
exclusively on the online side and crucially

00:15:36.549 --> 00:15:39.529
employing what algorithm designers call a stop

00:15:39.529 --> 00:15:41.850
gradient on the target side. Okay, let's break

00:15:41.850 --> 00:15:43.610
down a stop gradient for those of us who don't

00:15:43.610 --> 00:15:46.340
read research papers for fun. Think of it as

00:15:46.340 --> 00:15:49.259
deliberately freezing the target network in place.

00:15:49.799 --> 00:15:51.799
While the first network, the online network,

00:15:51.879 --> 00:15:54.580
is actively updating its internal math and frantically

00:15:54.580 --> 00:15:57.080
trying to guess the correct output, the second

00:15:57.080 --> 00:15:59.500
network, the target network, ignores all the

00:15:59.500 --> 00:16:02.059
immediate errors. So it doesn't use backpropagation

00:16:02.059 --> 00:16:04.679
to constantly tweak its own weights based on

00:16:04.679 --> 00:16:07.379
what the online network is doing. Exactly. Instead,

00:16:07.580 --> 00:16:10.559
it acts as a steady, slowly moving benchmark.

00:16:10.889 --> 00:16:13.129
Because the target network is essentially frozen

00:16:13.129 --> 00:16:15.610
and doesn't instantly agree with the online network's

00:16:15.610 --> 00:16:18.070
errors, the two networks can't simply collude

00:16:18.070 --> 00:16:21.149
and agree to instantly output zero. The asymmetry

00:16:21.149 --> 00:16:24.070
forces the online network to actually learn the

00:16:24.070 --> 00:16:26.929
meaningful structural features required to predict

00:16:26.929 --> 00:16:29.950
the stable target, bypassing the collapse entirely.
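
NOTE
Editor's note: a BYOL-flavored training step sketching the asymmetry just explained. The encoder and predictor modules and the momentum value are assumptions, not the exact published recipe.
import torch
import torch.nn.functional as F
def byol_step(online, predictor, target, view_a, view_b, tau=0.996):
    prediction = predictor(online(view_a))  # online side, with extra predictor
    with torch.no_grad():
        benchmark = target(view_b)  # stop-gradient: no backprop into the target
    # Maximize similarity between the prediction and the frozen benchmark.
    loss = 2 - 2 * F.cosine_similarity(prediction, benchmark, dim=-1).mean()
    loss.backward()  # only the online network (and predictor) learn from errors
    # The target drifts slowly toward the online weights (a moving average),
    # so the two networks can never instantly collude on a constant output.
    with torch.no_grad():
        for p_t, p_o in zip(target.parameters(), online.parameters()):
            p_t.mul_(tau).add_((1 - tau) * p_o)
    return loss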

00:16:30.250 --> 00:16:32.789
That is incredibly elegant. It's like forcing

00:16:32.789 --> 00:16:35.590
someone to hit a moving target, but the target

00:16:35.590 --> 00:16:38.389
explicitly refuses to move closer to make the

00:16:38.389 --> 00:16:40.759
shot easier. You actually have to learn how to

00:16:40.759 --> 00:16:43.440
aim. That's a great way to put it. So if we trace

00:16:43.440 --> 00:16:45.919
this evolution, we've gone from reproducing data

00:16:45.919 --> 00:16:48.740
pixel by pixel with autoencoders to comparing

00:16:48.740 --> 00:16:51.039
positives and negatives to save on background rendering,

00:16:51.039 --> 00:16:54.059
to ditching the negatives entirely with frozen

00:16:54.059 --> 00:16:56.769
target networks. It moves fast. So moving beyond

00:16:56.769 --> 00:16:59.230
just matching pairs, where is this technology

00:16:59.230 --> 00:17:01.429
actually heading? Because the source highlights

00:17:01.429 --> 00:17:04.410
a major theoretical leap forward proposed in

00:17:04.410 --> 00:17:07.509
2022 by Yann LeCun, who is widely considered

00:17:07.509 --> 00:17:10.390
one of the godfathers of modern AI. This brings

00:17:10.390 --> 00:17:13.029
us to the cutting edge: joint-embedding predictive

00:17:13.029 --> 00:17:16.170
architectures, or JEPAs. This is a fascinating

00:17:16.170 --> 00:17:18.390
evolution that tries to mimic human cognition

00:17:18.390 --> 00:17:21.369
even closer. Remember the autoencoders we discussed?

00:17:21.730 --> 00:17:23.970
The sketch artist trying to redraw the original

00:17:23.970 --> 00:17:26.579
photo? They fundamentally worked by trying to

00:17:26.579 --> 00:17:30.400
reconstruct the original input exactly. JEPAs abandon

00:17:30.400 --> 00:17:33.519
reconstruction entirely. Really? Yeah. Unlike

00:17:33.519 --> 00:17:37.019
autoencoders, JEPAs operate exclusively in the

00:17:37.019 --> 00:17:39.960
abstract latent space. They do not care about

00:17:39.960 --> 00:17:42.559
recreating pixel-level noise. Meaning if you

00:17:42.559 --> 00:17:45.420
show it a video, it isn't wasting computing power

00:17:45.420 --> 00:17:48.019
trying to predict the exact shade of green on

00:17:48.019 --> 00:17:50.180
a single blade of grass blowing in the background.

00:17:50.440 --> 00:17:52.839
Exactly. Predicting every individual pixel doesn't

00:17:52.839 --> 00:17:55.259
equal comprehension. It's just statistical brute

00:17:55.259 --> 00:17:58.920
force. JEPAs focus entirely on semantic, conceptual

00:17:58.920 --> 00:18:02.119
structure. So how do they learn? Instead of redrawing

00:18:02.119 --> 00:18:05.059
the image, they learn by predicting masked abstract

00:18:05.059 --> 00:18:07.779
representations from the visible context. You

00:18:07.779 --> 00:18:10.140
show the model a sequence of video frames, and

00:18:10.140 --> 00:18:12.599
you completely mask or hide the next few frames,

00:18:12.680 --> 00:18:14.880
and you ask it to predict the abstract concept

00:18:14.880 --> 00:18:17.700
of what happens next, not the exact pixels. So

00:18:17.700 --> 00:18:20.000
to make that concrete, if I show a JEPA model

00:18:20.000 --> 00:18:22.380
a video of a glass of water being pushed off

00:18:22.380 --> 00:18:24.519
the edge of a table, and I mask the part where

00:18:24.519 --> 00:18:26.880
it hits the ground. The model doesn't try to

00:18:26.880 --> 00:18:29.140
mathematically draw the exact shape of a hundred

00:18:29.140 --> 00:18:31.660
shattered glass shards and splashing water drops.
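
NOTE
Editor's note: a JEPA-flavored sketch of predicting masked abstract representations rather than pixels; every module name and shape here is an illustrative assumption.
import torch
import torch.nn.functional as F
def jepa_loss(context_encoder, target_encoder, predictor, frames, mask):
    visible, hidden = frames[:, ~mask], frames[:, mask]  # split video frames
    context = context_encoder(visible)  # abstract summary of what was seen
    predicted = predictor(context)      # guess the *concept* of what comes next
    with torch.no_grad():
        actual = target_encoder(hidden)  # embedding of the masked frames
    # The loss lives entirely in latent space: no pixel reconstruction at all.
    return F.mse_loss(predicted, actual)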

00:18:32.380 --> 00:18:35.279
It predicts the overarching concept of gravity,

00:18:35.640 --> 00:18:38.240
impact, and breaking. It learns the physics of

00:18:38.240 --> 00:18:40.789
the action rather than the visual texture. If

00:18:40.789 --> 00:18:42.890
we connect this to the bigger picture, this is

00:18:42.890 --> 00:18:45.990
exactly why LeCun introduced JEPAs. His ultimate

00:18:45.990 --> 00:18:48.549
goal is the establishment of a comprehensive

00:18:48.549 --> 00:18:51.809
world model. The world model. Yes. By forcing

00:18:51.809 --> 00:18:54.609
the machine to predict abstract representations

00:18:54.609 --> 00:18:57.589
of missing temporal information, you are literally

00:18:57.589 --> 00:19:00.410
teaching it the physics, the logic, and the common

00:19:00.410 --> 00:19:03.250
sense of the real world. A world model aims to

00:19:03.250 --> 00:19:05.390
enable machines to replicate human intellect

00:19:05.390 --> 00:19:08.210
by giving them a grounded conceptual framework

00:19:08.210 --> 00:19:10.650
of the environment they exist in. That's wild.

00:19:10.750 --> 00:19:13.509
It is seen by many researchers as the necessary

00:19:13.509 --> 00:19:16.589
foundational step toward true autonomous reasoning

00:19:16.589 --> 00:19:19.190
and planning. So what does this all mean? We've

00:19:19.190 --> 00:19:21.089
covered everything from data augmentation to

00:19:21.089 --> 00:19:24.309
frozen benchmark networks to world models predicting

00:19:24.309 --> 00:19:27.170
gravity. But why should you, the listener, care

00:19:27.170 --> 00:19:29.710
about the mathematical intricacies of self-supervised

00:19:29.710 --> 00:19:32.700
learning? Well, because it's everywhere. Exactly.

00:19:33.059 --> 00:19:35.859
You should care. Because this is not theoretical

00:19:35.859 --> 00:19:38.539
garage science waiting for a breakthrough. It

00:19:38.539 --> 00:19:41.700
is actively powering the invisible infrastructure

00:19:41.700 --> 00:19:44.779
of your digital life right now. The heavy hitters

00:19:44.779 --> 00:19:47.240
mentioned in our sources really prove it. Oh,

00:19:47.240 --> 00:19:49.380
definitely. Take natural language processing.

00:19:50.099 --> 00:19:52.920
When you type a highly specific, nuanced query

00:19:52.920 --> 00:19:56.119
into a search engine, you're interacting with

00:19:56.119 --> 00:19:59.559
Google's BERT. BERT didn't learn language by

00:19:59.559 --> 00:20:02.000
humans feeding it dictionary definitions. It

00:20:02.000 --> 00:20:04.660
used self-supervised learning to literally play

00:20:04.660 --> 00:20:07.039
Mad Libs with itself. It's brilliant! It took

00:20:07.039 --> 00:20:09.640
billions of sentences from Wikipedia, blacked

00:20:09.640 --> 00:20:12.779
out 15% of the words, and forced itself to guess

00:20:12.779 --> 00:20:15.220
the missing text based on the surrounding context.
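
NOTE
Editor's note: BERT's fill-in-the-blank pretext task can be poked at directly with the Hugging Face transformers library; a minimal sketch using the standard public checkpoint.
from transformers import pipeline
unmasker = pipeline("fill-mask", model="bert-base-uncased")
# BERT guesses the blacked-out word purely from the surrounding context;
# the top prediction here should be "bank" (the financial one).
print(unmasker("I'm going to the [MASK] to deposit money."))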

00:20:15.369 --> 00:20:17.710
By doing that billions of times, it learned the

00:20:17.710 --> 00:20:19.670
nuanced, structural difference between, like,

00:20:19.710 --> 00:20:21.609
I'm going to the bank to deposit money and I'm

00:20:21.609 --> 00:20:23.529
standing by the river bank. We see the exact

00:20:23.529 --> 00:20:26.069
same architecture in audio processing. When you

00:20:26.069 --> 00:20:28.029
use a speech-to-text feature on your phone,

00:20:28.269 --> 00:20:29.970
you are likely brushing up against algorithms

00:20:29.970 --> 00:20:32.869
based on Facebook's wav2vec, which uses self-

00:20:32.869 --> 00:20:35.710
supervised learning to mask frames of raw audio

00:20:35.710 --> 00:20:38.619
waveforms. Right, and it forces the model to

00:20:38.619 --> 00:20:41.660
predict the missing sound structures. It learned

00:20:41.660 --> 00:20:43.980
to understand the phonetic building blocks of

00:20:43.980 --> 00:20:47.380
human speech without needing humans to painstakingly

00:20:47.380 --> 00:20:50.259
transcribe every single syllable of the training

00:20:50.259 --> 00:20:53.339
audio. And it extends into biology, too. Self-

00:20:53.339 --> 00:20:56.380
GenomeNet uses these exact same principles to

00:20:56.380 --> 00:21:00.160
find hidden structural patterns in raw unlabeled

00:21:00.160 --> 00:21:02.980
genomic sequences, fundamentally reshaping how

00:21:02.980 --> 00:21:05.539
quickly researchers can process the absolute

00:21:05.539 --> 00:21:08.859
deluge of biological data we produce every single

00:21:08.859 --> 00:21:11.549
day. The speed at which this allows machines

00:21:11.549 --> 00:21:14.289
to scale is just breathtaking. It really is.

00:21:14.410 --> 00:21:17.230
It is a profound structural shift in the timeline

00:21:17.230 --> 00:21:19.289
of technology. And I think there is a philosophical

00:21:19.289 --> 00:21:21.690
implication here that goes far beyond the efficiency

00:21:21.690 --> 00:21:23.529
of the code. What do you mean? Well, for decades,

00:21:23.809 --> 00:21:25.410
artificial intelligence was essentially just

00:21:25.410 --> 00:21:28.349
a mirror. It was a rigid reflection of the explicit

00:21:28.349 --> 00:21:30.910
human-made labels we fed into it. It only knew

00:21:30.910 --> 00:21:33.170
what a dog was because a human engineer carved

00:21:33.170 --> 00:21:35.289
out the mathematical boundaries of a dog and

00:21:35.289 --> 00:21:37.650
handed it to the machine. But with architectures

00:21:37.650 --> 00:21:40.619
like JEPAs operating entirely in a semantic

00:21:40.619 --> 00:21:43.460
abstract space to build their own internal world

00:21:43.460 --> 00:21:46.980
models based on physics and logic, we have to

00:21:46.980 --> 00:21:50.059
ask a difficult question. We do. If a machine

00:21:50.059 --> 00:21:52.819
is no longer just memorizing pixels and it's

00:21:52.819 --> 00:21:55.339
no longer just reciting our hand-fed labels,

00:21:55.640 --> 00:21:58.660
if it is instead observing raw data, finding

00:21:58.660 --> 00:22:01.240
hidden relationships, and building its own conceptual

00:22:01.240 --> 00:22:04.180
understanding of reality entirely autonomously,

00:22:04.960 --> 00:22:08.200
where exactly is the line between a complex mathematical

00:22:08.200 --> 00:22:11.140
calculation and actual comprehension? Are they

00:22:11.140 --> 00:22:13.099
just running the numbers incredibly fast or are

00:22:13.099 --> 00:22:15.339
they actually starting to know things? Exactly.

00:22:15.559 --> 00:22:17.420
It brings us right back to that toddler learning

00:22:17.420 --> 00:22:19.859
to see the world. We don't hand babies a dictionary

00:22:19.859 --> 00:22:22.420
and a list of labels. We just let them experience

00:22:22.420 --> 00:22:25.440
the raw data of reality until the puzzle pieces

00:22:25.440 --> 00:22:27.690
snap together on their own. And for the first

00:22:27.690 --> 00:22:30.029
time in history, we've figured out how to let

00:22:30.029 --> 00:22:31.950
machines play with the puzzle without giving

00:22:31.950 --> 00:22:34.049
them the box. That is going to be something to

00:22:34.049 --> 00:22:35.589
think about the next time you ask your phone

00:22:35.589 --> 00:22:37.849
a highly contextual question and it understands

00:22:37.849 --> 00:22:40.809
exactly what you mean. Thanks for joining us

00:22:40.809 --> 00:22:42.890
on this deep dive and we'll see you next time.
