WEBVTT

00:00:00.000 --> 00:00:02.359
If you want a computer to learn what a human

00:00:02.359 --> 00:00:06.660
face looks like, your first instinct is probably

00:00:06.660 --> 00:00:09.039
to just show it a million faces. Right, yeah.

00:00:09.380 --> 00:00:11.539
Just feed it endless data. Yeah, and you tell

00:00:11.539 --> 00:00:15.160
it to mathematically memorize every exact, precise

00:00:15.160 --> 00:00:17.879
pixel. I mean, it makes logical sense. We generally

00:00:17.879 --> 00:00:20.760
expect computers to be these rigid, deterministic

00:00:20.760 --> 00:00:24.929
machines. Exactly. But... Well, back in 2013,

00:00:25.289 --> 00:00:28.250
a pair of researchers realized that demanding

00:00:28.250 --> 00:00:31.429
that kind of absolute rigid precision was actually

00:00:31.429 --> 00:00:34.329
making artificial intelligence incredibly stupid.

00:00:34.670 --> 00:00:36.570
Right. They figured out that to make machines

00:00:36.570 --> 00:00:38.829
truly creative, they had to teach them how to

00:00:38.829 --> 00:00:42.049
guess. Yeah, how to embrace chaos, really. Exactly.

00:00:42.170 --> 00:00:44.990
How to do math with these fuzzy overlapping clouds

00:00:44.990 --> 00:00:47.130
of uncertainty. And for you listening right now,

00:00:47.130 --> 00:00:48.979
whether you're, you know... prepping for a massive

00:00:48.979 --> 00:00:51.039
tech meeting or trying to catch up on the whole

00:00:51.039 --> 00:00:53.240
generative AI boom, or maybe you're just insanely

00:00:53.240 --> 00:00:55.719
curious about the actual mechanics under the

00:00:55.719 --> 00:00:58.119
hood of these models, you want the thorough knowledge.

00:00:58.500 --> 00:01:01.039
You want the deep structural understanding without

00:01:01.039 --> 00:01:05.099
drowning in a sea of dense academic jargon. Right.

00:01:05.640 --> 00:01:08.280
We want to understand the actual physics of the

00:01:08.280 --> 00:01:10.359
engine, not just stare at the steering wheel.

00:01:10.640 --> 00:01:14.200
Because we are talking about a fundamental shift

00:01:14.200 --> 00:01:16.939
in how computers perceive reality. It really

00:01:16.939 --> 00:01:20.420
is a massive shift. So today, our source material

00:01:20.420 --> 00:01:24.280
is a comprehensive Wikipedia article on a foundational

00:01:24.280 --> 00:01:26.719
neural network architecture. It's called the

00:01:26.719 --> 00:01:29.739
Variational Autoencoder, or VAE. The VAE, yeah.

00:01:29.840 --> 00:01:31.859
And our mission for this deep dive is to understand

00:01:31.859 --> 00:01:34.959
exactly how this specific architecture completely

00:01:34.959 --> 00:01:38.060
changed the game in machine learning by injecting

00:01:38.060 --> 00:01:40.640
probability and randomness straight into its

00:01:40.640 --> 00:01:43.000
core. Right into the center of the math. Okay,

00:01:43.019 --> 00:01:45.000
let's unpack this. Where did this radical shift

00:01:45.000 --> 00:01:47.129
actually start? Well, the origin point traces

00:01:47.129 --> 00:01:49.689
back to Diederik P. Kingma and Max Welling.

00:01:49.790 --> 00:01:52.870
OK. So in 2013, they published this framework

00:01:52.870 --> 00:01:56.750
that just brilliantly bridged a gap between two

00:01:56.750 --> 00:01:59.230
historically distinct fields. Two different worlds,

00:01:59.250 --> 00:02:01.510
basically. Right. On one side, you had deep learning,

00:02:01.709 --> 00:02:05.230
which deals with massive layered artificial neural

00:02:05.230 --> 00:02:07.689
networks that are just great at finding patterns.

00:02:07.849 --> 00:02:10.460
Brute force pattern recognition. Exactly. And

00:02:10.460 --> 00:02:12.819
then on the other side, you had variational Bayesian

00:02:12.819 --> 00:02:14.719
methods. Which is more about statistics, right?

00:02:14.819 --> 00:02:17.840
Yeah. It's a field deeply rooted in probabilities,

00:02:18.120 --> 00:02:21.219
statistics, and, crucially, measuring uncertainty.

00:02:21.979 --> 00:02:24.439
Kingma and Welling brought those two worlds together

00:02:24.439 --> 00:02:27.180
into one elegant, highly scalable framework.

00:02:27.500 --> 00:02:30.719
But to really grasp what makes a variational

00:02:30.719 --> 00:02:34.139
autoencoder so special, we first need to visualize

00:02:35.020 --> 00:02:37.120
the baseline model it was upgrading, right? Yeah,

00:02:37.219 --> 00:02:38.960
we have to look at the traditional autoencoder

00:02:38.960 --> 00:02:41.300
first. Right. So how does the traditional one

00:02:41.300 --> 00:02:44.379
work? So a traditional autoencoder consists of

00:02:44.379 --> 00:02:47.259
two main neural networks working in tandem, and

00:02:47.259 --> 00:02:49.180
they're connected by this sort of bottleneck.

00:02:49.219 --> 00:02:52.560
OK. The first network is the encoder. Its job

00:02:52.560 --> 00:02:55.520
is to take complex, high-dimensional input data,

00:02:55.599 --> 00:02:57.840
like, say, a high-resolution image of a human

00:02:57.840 --> 00:03:00.319
face, which might have millions of pixels, and

00:03:00.319 --> 00:03:02.879
compress it down into a much smaller, low-dimensional

00:03:02.879 --> 00:03:05.039
mathematical representation. And we call this

00:03:05.039 --> 00:03:07.379
highly compressed state the latent space. Exactly,

00:03:07.479 --> 00:03:10.819
the latent space. That latent space, for you listening,

00:03:10.819 --> 00:03:13.780
just to make it concrete: imagine a massive,

00:03:14.139 --> 00:03:16.840
blank, two-dimensional graph. Okay, a 2D graph.

00:03:16.979 --> 00:03:19.379
Yeah, and the x-axis represents the width of

00:03:19.379 --> 00:03:22.259
a smile, and the y-axis represents the tilt

00:03:22.259 --> 00:03:24.500
of a head. I like this. So the encoder looks

00:03:24.500 --> 00:03:26.860
at a picture of a smiling person tilting their

00:03:26.860 --> 00:03:30.199
head to the left, and it maps that entire complex

00:03:30.199 --> 00:03:34.159
image to one single microscopic dot on that 2D

00:03:34.159 --> 00:03:36.439
graph. Right. It finds the coordinates, plots

00:03:36.439 --> 00:03:38.379
the dot, and says, here is the essence of that

00:03:38.379 --> 00:03:41.280
image. That is the perfect visual. And once the

00:03:41.280 --> 00:03:44.960
encoder plots that specific dot, the second network

00:03:44.960 --> 00:03:46.800
the decoder, takes over. Right, the second network.

00:03:46.900 --> 00:03:49.159
It looks at the exact coordinates of that tiny

00:03:49.159 --> 00:03:51.699
dot on the graph and attempts to reconstruct

00:03:51.699 --> 00:03:54.319
the original high resolution image from it. It

00:03:54.319 --> 00:03:56.979
basically paints it pixel by pixel to match the

00:03:56.979 --> 00:03:59.479
input. So a traditional autoencoder is essentially

00:03:59.479 --> 00:04:02.840
a rigid, highly specific filing system. Very rigid.
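
NOTE
A minimal Python/numpy sketch of the traditional autoencoder just described. The linear weights, the 784-pixel input, and the 2-dimensional latent space are toy assumptions for illustration, not from the source.
import numpy as np
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(2, 784))    # encoder weights: 784 pixels down to 2 latent dimensions
W_dec = rng.normal(size=(784, 2))    # decoder weights: 2 latent dimensions back to 784 pixels
def encode(x):
    return W_enc @ x                 # compress the image to one exact point (dot) in latent space
def decode(z):
    return W_dec @ z                 # repaint the image from that exact coordinate
x = rng.random(784)                  # stand-in for a flattened 28x28 face image
z = encode(x)                        # e.g. coordinates like (smile width, head tilt)
x_hat = decode(z)                    # pixel-by-pixel reconstruction attempt
print(np.mean((x - x_hat) ** 2))     # reconstruction error of this rigid "filing system"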

00:04:03.039 --> 00:04:06.560
It maps that picture of a face to a single exact

00:04:06.560 --> 00:04:08.800
mathematical point. If you want that face back,

00:04:08.919 --> 00:04:10.879
you go to that exact coordinate, pull the file,

00:04:10.979 --> 00:04:14.650
and rebuild it. Exactly. But a variational autoencoder,

00:04:14.830 --> 00:04:18.250
or VAE, throws out that rigid filing system entirely.

00:04:18.370 --> 00:04:20.629
It just completely scraps it. Yeah. Instead of

00:04:20.629 --> 00:04:23.069
mapping an image to a single microscopic dot

00:04:23.069 --> 00:04:26.529
on our graph, it maps the image to a distribution.

00:04:26.829 --> 00:04:29.769
Right. So instead of a dot, it's like the encoder

00:04:29.769 --> 00:04:33.329
takes a spray paint can and sprays a fuzzy cloud

00:04:33.329 --> 00:04:37.040
onto the graph. Yes, a fuzzy cloud. You are mapping

00:04:37.040 --> 00:04:39.500
the general vibe or a range of possibilities

00:04:39.500 --> 00:04:42.339
instead of a specific coordinate. And mathematically,

00:04:42.699 --> 00:04:45.959
the sources say it maps the input to a multivariate

00:04:45.959 --> 00:04:48.860
Gaussian distribution. Right. And what's fascinating

00:04:48.860 --> 00:04:51.360
here is why mapping to a fuzzy cloud instead

00:04:51.360 --> 00:04:54.040
of a single point is so incredibly crucial for

00:04:54.040 --> 00:04:56.360
making the AI smart. It prevents overfitting.

00:04:56.660 --> 00:04:58.860
Right. Exactly. When a traditional network maps

00:04:58.860 --> 00:05:01.259
to rigid points, it suffers from what data scientists

00:05:01.259 --> 00:05:03.980
call overfitting because it's just memorizing

00:05:03.980 --> 00:05:06.199
exact coordinates. That latent space, our graph,

00:05:06.199 --> 00:05:08.519
ends up with a few isolated dots scattered around,

00:05:08.879 --> 00:05:11.519
separated by massive empty voids. Just huge gaps

00:05:11.519 --> 00:05:14.160
of nothing. Right. And if you ask the traditional

00:05:14.160 --> 00:05:16.620
decoder to build an image from one of those empty

00:05:16.620 --> 00:05:20.319
voids, it panics. It outputs absolute garbage

00:05:20.319 --> 00:05:22.879
because it has no idea what exists between the

00:05:22.879 --> 00:05:25.310
exact points it memorized. It's just a parrot

00:05:25.310 --> 00:05:28.050
repeating a highly specific phrase. It hasn't

00:05:28.050 --> 00:05:29.709
actually learned the language. That's a great

00:05:29.709 --> 00:05:32.790
way to put it. By forcing the encoder to map

00:05:32.790 --> 00:05:36.250
data to a probabilistic distribution, a wide

00:05:36.250 --> 00:05:39.230
fuzzy cloud, those clouds naturally stretch and

00:05:39.230 --> 00:05:41.829
overlap. So there are no more empty voids? Zero

00:05:41.829 --> 00:05:44.629
empty voids. The network can't just memorize

00:05:44.629 --> 00:05:47.350
isolated dots anymore. It is forced to learn

00:05:47.350 --> 00:05:50.290
the underlying continuous structural rules of

00:05:50.290 --> 00:05:52.410
the data. It forces the model to generalize.

00:05:52.629 --> 00:05:54.829
Exactly. Because the latent space is now made

00:05:54.829 --> 00:05:57.529
of continuous overlapping distributions, you

00:05:57.529 --> 00:06:00.089
can pick any random spot inside that entire cloudy

00:06:00.089 --> 00:06:02.629
area, feed those coordinates to the decoder,

00:06:02.730 --> 00:06:05.250
and it will generate a completely new, unique

00:06:05.250 --> 00:06:07.790
face. A face that has literally never existed

00:06:07.790 --> 00:06:10.149
before. Never existed. But it still perfectly

00:06:10.149 --> 00:06:12.490
follows the anatomical rules of a human face.
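
NOTE
A minimal Python/numpy sketch of the generative step just described: pick any coordinate inside the continuous latent cloud and hand it to the decoder. The toy linear decode stands in for a trained decoder network; its weights are an illustrative assumption.
import numpy as np
rng = np.random.default_rng(1)
W_dec = rng.normal(size=(784, 2))    # stand-in decoder weights (illustrative, untrained)
def decode(z):
    return W_dec @ z                 # toy linear decoder for a flattened image
z_new = rng.standard_normal(2)       # any random spot inside the "cloudy area"
new_face = decode(z_new)             # a trained decoder would return a brand-new face here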

00:06:12.930 --> 00:06:15.029
You move from just compressing data to actually

00:06:15.029 --> 00:06:18.310
generating new reality. Wait, OK, hold on. If

00:06:18.310 --> 00:06:20.649
everything in this latent space is just a fuzzy

00:06:20.649 --> 00:06:23.470
cloud of probability, doesn't that become a massive

00:06:23.470 --> 00:06:26.990
liability? Oh, so? Well, if I have a spray-painted

00:06:26.990 --> 00:06:29.269
cloud for cats and a spray-painted cloud for

00:06:29.269 --> 00:06:32.110
dogs, and they're constantly expanding and stretching

00:06:32.110 --> 00:06:34.850
to avoid empty voids, aren't they just going

00:06:34.850 --> 00:06:38.920
to completely bleed into each other? Right. Like,

00:06:39.019 --> 00:06:41.279
how do we keep the filing room from turning into

00:06:41.279 --> 00:06:44.639
a chaotic, mangled mess where every coordinate

00:06:44.639 --> 00:06:47.480
generates some terrifying cat-dog hybrid? That

00:06:47.480 --> 00:06:50.620
is the exact mathematical hurdle the VAE has

00:06:50.620 --> 00:06:53.240
to overcome to be useful. You have to carefully

00:06:53.240 --> 00:06:55.720
organize the fuzzy clouds. Right. So how does

00:06:55.720 --> 00:06:59.189
it do that? To do that, the VAE uses a very specific

00:06:59.189 --> 00:07:02.110
loss function to optimize itself. A loss function

00:07:02.110 --> 00:07:04.370
is simply the mathematical formula a network

00:07:04.370 --> 00:07:07.910
uses to score its own performance and correct

00:07:07.910 --> 00:07:10.009
its mistakes. Its grading rubric, basically.

00:07:10.129 --> 00:07:12.810
Right. And the VAE's loss function is derived

00:07:12.810 --> 00:07:14.889
from something called the free energy expression.

00:07:15.129 --> 00:07:17.670
It requires the network to balance two highly

00:07:17.670 --> 00:07:20.050
competing mathematical forces. It's a tug of

00:07:20.050 --> 00:07:22.009
war. Let's break down the two sides of that rope.

00:07:22.089 --> 00:07:23.949
What is pulling on the first side? On the first

00:07:23.949 --> 00:07:26.410
side, you have the reconstruction error. This

00:07:26.410 --> 00:07:28.629
is the network measuring whether the decoder

00:07:28.629 --> 00:07:31.050
actually outputs something that looks like the

00:07:31.050 --> 00:07:33.490
original input image. So it's checking its work.

00:07:33.629 --> 00:07:36.389
Yeah, it calculates this using standard metrics

00:07:36.389 --> 00:07:39.769
like mean squared error or cross entropy, literally

00:07:39.769 --> 00:07:42.750
going pixel by pixel to see how far off the generated

00:07:42.750 --> 00:07:46.410
image is from reality. The reconstruction error

00:07:46.410 --> 00:07:50.370
wants the decoder to be absolutely perfect. But

00:07:50.370 --> 00:07:52.680
to get a perfect reconstruction, the encoder

00:07:52.680 --> 00:07:55.300
would want to shrink those fuzzy clouds back

00:07:55.300 --> 00:08:00.459
down into highly specific, rigid points. Exactly.

00:08:00.680 --> 00:08:02.500
The reconstruction error is desperately trying

00:08:02.500 --> 00:08:05.319
to pull the distributions into tiny tight shapes

00:08:05.319 --> 00:08:08.399
so it can perfectly redraw the exact quirks of

00:08:08.399 --> 00:08:10.779
the original image without any randomness getting

00:08:10.779 --> 00:08:12.879
in the way. Which of course would lead us right

00:08:12.879 --> 00:08:15.120
back to overfitting and memorization. We'd lose

00:08:15.120 --> 00:08:17.199
the generative magic entirely. The clouds would

00:08:17.199 --> 00:08:19.439
just become dots again. So to prevent the clouds

00:08:19.439 --> 00:08:21.660
from shrinking back into dots, we have the force

00:08:21.660 --> 00:08:24.480
on the other side of the rope: the Kullback-Leibler

00:08:24.480 --> 00:08:27.639
divergence, or KL divergence for short. KL divergence.

00:08:27.860 --> 00:08:29.959
Okay, let's define the mechanics of that. How

00:08:29.959 --> 00:08:32.019
does it pull back against the reconstruction

00:08:32.019 --> 00:08:35.539
error? KL divergence is basically a statistical

00:08:35.539 --> 00:08:38.100
measurement of how one probability distribution

00:08:38.100 --> 00:08:41.159
differs from a baseline reference distribution.

00:08:41.740 --> 00:08:45.559
In the VAE, it acts as a strict structural penalty.

00:08:45.639 --> 00:08:48.080
A penalty, okay. Yeah. It looks at the approximated

00:08:48.080 --> 00:08:50.620
posterior distribution, which is just the fuzzy

00:08:50.620 --> 00:08:53.879
cloud the encoder just generated, denoted mathematically

00:08:53.879 --> 00:08:57.539
as Q, and it measures how far away that cloud

00:08:57.539 --> 00:09:00.299
is from a prior distribution, denoted as P. Yet

00:09:00.299 --> 00:09:03.549
in a standard VAE, this prior P is a standard

00:09:03.549 --> 00:09:05.809
normal distribution, right? You got it. A standard

00:09:05.809 --> 00:09:09.419
normal distribution, meaning a perfect bell

00:09:09.419 --> 00:09:12.700
curve centered exactly at zero on our graph with

00:09:12.700 --> 00:09:15.480
a variance of exactly one. That's it. The KL

00:09:15.480 --> 00:09:18.080
divergence penalty forces every single fuzzy

00:09:18.080 --> 00:09:20.340
cloud the encoder makes to try and mimic that

00:09:20.340 --> 00:09:22.200
perfect bell curve at the center of the graph.
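
NOTE
A minimal Python/numpy sketch of that penalty, assuming the usual diagonal-Gaussian encoder output. This is the standard closed-form KL between N(mu, sigma^2) and the N(0, 1) prior; the function and variable names are illustrative.
import numpy as np
def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) vs N(0, 1) ), summed over latent dimensions:
    # 0.5 * sum( sigma^2 + mu^2 - 1 - log(sigma^2) )
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(kl_to_standard_normal(np.array([0.0, 0.0]), np.array([0.0, 0.0])))   # 0.0: the cloud already is the prior
print(kl_to_standard_normal(np.array([3.0, 0.0]), np.array([-6.0, 0.0])))  # large penalty: off-centre, shrunken cloud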

00:09:22.480 --> 00:09:24.940
It penalizes the network if it tries to push

00:09:24.940 --> 00:09:27.179
a cloud way off into the isolated corners of

00:09:27.179 --> 00:09:29.539
the graph, and it heavily penalizes the network

00:09:29.539 --> 00:09:31.899
if it tries to shrink the cloud into a tiny rigid

00:09:31.899 --> 00:09:34.039
dot. So what does this all mean for our latent

00:09:34.039 --> 00:09:36.830
space graph? The reconstruction error is like

00:09:36.830 --> 00:09:39.529
this obsessive artist trying to shrink the clouds

00:09:39.529 --> 00:09:42.669
and push them to specific isolated corners of

00:09:42.669 --> 00:09:45.470
the canvas to perfectly capture a specific face.

00:09:45.490 --> 00:09:48.350
Yes. But the KL divergence acts like a strict

00:09:48.350 --> 00:09:51.029
organizer, constantly pulling back on the rope,

00:09:51.350 --> 00:09:53.929
forcing all those different clouds to stay relatively

00:09:53.929 --> 00:09:57.169
wide, smoothly shaped, and neatly grouped near

00:09:57.169 --> 00:09:59.029
the center of the graph. That's exactly how it

00:09:59.029 --> 00:10:01.710
works. It forces them to overlap just enough

00:10:01.710 --> 00:10:04.379
so there are no empty voids. but it relies on

00:10:04.379 --> 00:10:06.580
the artist's pulling power to keep them from

00:10:06.580 --> 00:10:09.379
perfectly overlapping into a gray, featureless

00:10:09.379 --> 00:10:12.600
mush. Right. The tension between those two forces

00:10:12.600 --> 00:10:15.659
is what creates a smooth, continuous, and highly

00:10:15.659 --> 00:10:18.379
organized latent space. It's a beautiful balance.

00:10:18.539 --> 00:10:20.940
And by finding a highly efficient Q distribution

00:10:20.940 --> 00:10:23.960
through this tug of war, the VAE performs what

00:10:23.960 --> 00:10:26.860
is called amortized inference. Amortized inference.

00:10:26.899 --> 00:10:29.190
What does that mean in practice? Normally, in

00:10:29.190 --> 00:10:31.289
older statistical models, you'd have to calculate

00:10:31.289 --> 00:10:34.470
the parameters for every single data point individually

00:10:34.470 --> 00:10:37.669
through iterative optimization. It was computationally

00:10:37.669 --> 00:10:41.690
exhausting. Just brutally slow. Yeah. But with

00:10:41.690 --> 00:10:44.190
amortized inference, the neural network learns

00:10:44.190 --> 00:10:46.990
to reuse the same mathematical parameters across

00:10:46.990 --> 00:10:49.950
multiple data points. It essentially learns a

00:10:49.950 --> 00:10:53.169
universal mapping rule, which results in massive

00:10:53.169 --> 00:10:55.889
computational memory savings. It's way more efficient.

00:10:56.309 --> 00:10:59.190
Hugely. And when you combine these two competing

00:10:59.190 --> 00:11:02.429
forces, minimizing the reconstruction error while

00:11:02.429 --> 00:11:05.690
minimizing the KL divergence penalty, you are

00:11:05.690 --> 00:11:07.909
mathematically maximizing what's known as the

00:11:07.909 --> 00:11:10.830
evidence lower bound, or the ELBO. The evidence

00:11:10.830 --> 00:11:13.429
lower bound. It's a beautiful structural concept.
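
NOTE
A minimal Python/numpy sketch of how the two competing forces combine into one training objective, assuming mean-squared-error reconstruction and the diagonal-Gaussian KL term above; the name vae_loss and the exact weighting are illustrative assumptions.
import numpy as np
def vae_loss(x, x_hat, mu, log_var):
    recon = np.mean((x - x_hat) ** 2)                            # the obsessive artist: checks the decoder's work
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)   # the strict organizer: keeps clouds near the prior
    return recon + kl                                            # minimizing this corresponds to maximizing the ELBO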

00:11:13.610 --> 00:11:15.889
You have the fuzzy clouds, and you have the ELBO

00:11:15.889 --> 00:11:19.230
acting as the rigorous mathematical scoring system

00:11:19.230 --> 00:11:21.370
to keep the artist and the organizer in perfect

00:11:21.370 --> 00:11:23.509
tension. But the sources note that when Kingma

00:11:23.509 --> 00:11:25.610
and Welling actually sat down to build

00:11:25.610 --> 00:11:28.809
and train this network in 2013 using standard

00:11:28.809 --> 00:11:31.750
deep learning methods, they hit a massive mathematical

00:11:31.750 --> 00:11:33.990
brick wall. They really did. The theory was flawless,

00:11:33.990 --> 00:11:36.840
but the actual mechanics of artificial neural

00:11:36.840 --> 00:11:39.600
networks broke the system because neural networks

00:11:39.600 --> 00:11:41.960
learn through a process called backpropagation.

00:11:42.440 --> 00:11:44.899
The network makes a guess, the loss function

00:11:44.899 --> 00:11:48.179
calculates the error, or ELBO score, and then

00:11:48.179 --> 00:11:50.399
the network sends that error signal backward

00:11:50.399 --> 00:11:52.980
through its layers. It uses gradients, which

00:11:52.980 --> 00:11:55.899
are mathematical derivatives, to figure out exactly

00:11:55.899 --> 00:11:58.539
how to adjust the weights of its artificial neurons

00:11:58.539 --> 00:12:01.440
to get a better score on the next round. Right.

00:12:01.440 --> 00:12:03.320
You take the derivative of the path to calculate

00:12:03.320 --> 00:12:06.039
the slope of the error, and you update the weights

00:12:06.039 --> 00:12:09.179
using standard stochastic gradient descent. But

00:12:09.179 --> 00:12:11.860
here is the roadblock. To move from the encoder

00:12:11.860 --> 00:12:15.179
to the decoder, the VAE has to sample a random

00:12:15.179 --> 00:12:17.919
point from inside that fuzzy cloud. Right, because

00:12:17.919 --> 00:12:20.340
it's generating a distribution. Exactly. But

00:12:20.340 --> 00:12:22.740
you cannot calculate a mathematical derivative

00:12:22.740 --> 00:12:25.960
through a purely random operation. No. A random

00:12:25.960 --> 00:12:28.840
sample has no slope. It is non-differentiable.

00:12:29.320 --> 00:12:31.340
The moment the error signal traveling backward

00:12:31.340 --> 00:12:34.419
hits that random sampling step, the math just

00:12:34.419 --> 00:12:37.019
shatters. The network goes completely blind and

00:12:37.019 --> 00:12:39.620
cannot update the encoder's weights. Here's where

00:12:39.620 --> 00:12:41.440
it gets really interesting because the whole

00:12:41.440 --> 00:12:44.159
point of the VAE was to introduce that randomness.

00:12:44.240 --> 00:12:46.679
Right. To use a physical analogy, calculating

00:12:46.679 --> 00:12:49.679
backpropagation is like trying to calculate the

00:12:49.679 --> 00:12:52.460
exact continuous reverse trajectory of a bowling

00:12:52.460 --> 00:12:54.730
ball rolling down a lane. Okay, I see where you're

00:12:54.730 --> 00:12:56.929
going. You can use physics to draw a perfect

00:12:56.929 --> 00:12:59.610
line backward from the pins to the bowler's hand,

00:13:00.549 --> 00:13:03.409
but the VAE's random sampling is like the bowling

00:13:03.409 --> 00:13:06.009
ball suddenly teleporting to a completely random

00:13:06.009 --> 00:13:08.549
spot within a three-foot radius halfway down

00:13:08.549 --> 00:13:11.250
the lane. Yes. You can't draw a continuous line

00:13:11.250 --> 00:13:14.470
backward anymore. The math of tracing back the

00:13:14.470 --> 00:13:16.909
derivative, it shatters the moment that random

00:13:16.909 --> 00:13:20.220
teleportation happens inside the path. The teleporting

00:13:20.220 --> 00:13:22.820
bowling ball breaks the gradient entirely. So

00:13:22.820 --> 00:13:26.519
how do you train a probabilistic network if backpropagation

00:13:26.519 --> 00:13:29.100
can't handle probability? Right, what's the fix?

00:13:29.620 --> 00:13:32.659
This is where Kingma and Welling introduced the

00:13:32.659 --> 00:13:35.600
reparameterization trick. The reparameterization

00:13:35.600 --> 00:13:38.399
trick. It is an incredibly elegant workaround

00:13:38.399 --> 00:13:40.919
to save the math. They realized they couldn't

00:13:40.919 --> 00:13:43.399
have the true randomness happening inside the

00:13:43.399 --> 00:13:46.440
network's core calculation path, so they externalized

00:13:46.440 --> 00:13:48.539
the teleportation. They took the randomness out

00:13:48.539 --> 00:13:50.320
of the bowling lane. How does that actually work

00:13:50.320 --> 00:13:52.860
mechanically? Instead of the network rolling

00:13:52.860 --> 00:13:55.220
the dice and generating the random sample itself,

00:13:55.679 --> 00:13:58.720
they changed the encoder so it only outputs two

00:13:58.720 --> 00:14:01.399
completely deterministic, predictable numbers.

00:14:01.480 --> 00:14:04.580
Okay, what are they? The mean, which is the exact

00:14:04.580 --> 00:14:07.919
center of the cloud, denoted as mu, and the

00:14:07.919 --> 00:14:09.980
variance, which is the exact spread of the cloud,

00:14:10.139 --> 00:14:12.879
denoted as sigma. Because these are just calculated

00:14:12.879 --> 00:14:15.740
numbers, they have a derivative. Backpropagation

00:14:15.740 --> 00:14:18.480
can flow right through them. But wait, if it

00:14:18.480 --> 00:14:21.039
only outputs the center and the spread, where

00:14:21.039 --> 00:14:24.080
does the random sampling come from to feed the

00:14:24.080 --> 00:14:27.389
decoder? They inject an external standard random

00:14:27.389 --> 00:14:29.529
number generator. It's represented by the Greek

00:14:29.529 --> 00:14:33.529
letter epsilon, and it simply produces raw standard

00:14:33.529 --> 00:14:36.250
normal noise. Just pure static. Exactly. And

00:14:36.250 --> 00:14:38.549
it sits completely outside the network's learning

00:14:38.549 --> 00:14:40.730
path. The network takes that external static,

00:14:40.830 --> 00:14:43.129
that random noise, and it shapes it using the

00:14:43.129 --> 00:14:45.570
deterministic mean and variance it calculated.
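
NOTE
A minimal Python/numpy sketch of the reparameterization trick, assuming a diagonal Gaussian. Here mu and log_var stand in for the encoder's deterministic outputs, and the noise epsilon is drawn outside the learned path; the numbers are illustrative.
import numpy as np
rng = np.random.default_rng(2)
mu = np.array([0.5, -1.2])           # deterministic encoder output: centre of the cloud
log_var = np.array([-0.3, 0.1])      # deterministic encoder output: (log) spread of the cloud
sigma = np.exp(0.5 * log_var)
eps = rng.standard_normal(2)         # external standard-normal noise ("pure static"), outside the learned path
z = mu + sigma * eps                 # the sample is now a differentiable function of mu and sigma
# full-covariance variant (illustrative): z = mu + np.linalg.cholesky(Sigma) @ eps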

00:14:45.909 --> 00:14:47.830
And the sources mention they use the Cholesky

00:14:47.830 --> 00:14:50.450
decomposition to apply this scaling. Yeah, think

00:14:50.450 --> 00:14:52.590
of the Cholesky decomposition like a mathematical

00:14:52.590 --> 00:14:56.230
mixing board. The network takes the raw, uncorrelated

00:14:56.230 --> 00:14:58.850
static from the external noise generator, runs

00:14:58.850 --> 00:15:01.330
it through the mixing board, and shapes that

00:15:01.330 --> 00:15:04.210
static so it perfectly matches the specific correlation

00:15:04.210 --> 00:15:06.909
and spread, the variance or sigma, that the encoder

00:15:06.909 --> 00:15:08.809
asked for. Right. Then it shifts it to the right

00:15:08.809 --> 00:15:11.789
center point, the mu. The brilliant part is that

00:15:11.789 --> 00:15:15.809
the random variable, epsilon, is now just an

00:15:15.809 --> 00:15:19.779
external input acting as raw material. The path that the

00:15:19.779 --> 00:15:21.679
error signal needs to travel backward through

00:15:21.679 --> 00:15:25.139
is entirely deterministic. You bypass the teleportation.

00:15:25.139 --> 00:15:28.240
Completely. The backpropagation can flow smoothly

00:15:28.240 --> 00:15:31.179
from the decoder right past the external noise

00:15:31.179 --> 00:15:33.799
injection straight through the deterministic

00:15:33.799 --> 00:15:36.240
mean and variance and into the encoder. Yes.

00:15:36.419 --> 00:15:39.059
It completely saves the entire architecture, allowing

00:15:39.059 --> 00:15:41.419
these probabilistic models to be trained with

00:15:41.419 --> 00:15:43.259
the brute force efficiency of deep learning.

00:15:43.419 --> 00:15:46.559
It unlocked everything. And because the reparameterization

00:15:46.559 --> 00:15:49.059
trick so elegantly solved the training problem,

00:15:49.620 --> 00:15:51.799
researchers immediately realized the potential.

00:15:51.940 --> 00:15:54.220
They didn't just stop at that 2013 vanilla model.

00:15:54.460 --> 00:15:57.330
Oh, not at all. The floodgates opened and they

00:15:57.330 --> 00:15:59.649
started tweaking the dials of the architecture

00:15:59.649 --> 00:16:02.129
for new domains almost immediately. Yeah, the

00:16:02.129 --> 00:16:04.289
sources note a massive evolution in the tech.

00:16:05.149 --> 00:16:08.669
VAEs started out strictly as tools for unsupervised

00:16:08.669 --> 00:16:11.909
learning, you know, just throwing mountains of

00:16:11.909 --> 00:16:14.049
raw data at a wall and seeing what underlying

00:16:14.049 --> 00:16:15.990
structure the machine figures out on its own.

00:16:15.990 --> 00:16:18.629
Right. But their effectiveness quickly expanded

00:16:18.629 --> 00:16:21.809
into semi-supervised and fully supervised learning.

00:16:22.200 --> 00:16:25.320
The architecture proved highly adaptable, especially

00:16:25.320 --> 00:16:27.360
when researchers started manipulating that tug

00:16:27.360 --> 00:16:29.860
of war we talked about earlier. A prime example

00:16:29.860 --> 00:16:32.840
from the source material is the beta-VAE. The

00:16:32.840 --> 00:16:35.980
beta-VAE. Mechanically, what does the beta dial

00:16:35.980 --> 00:16:38.799
actually do? It adds a highly weighted multiplier

00:16:38.799 --> 00:16:41.460
to the Kullback-Leibler divergence term. Okay,

00:16:41.519 --> 00:16:43.240
so it cranks up the pressure from that strict

00:16:43.240 --> 00:16:46.340
organizer. Exactly. When you make the KL divergence

00:16:46.340 --> 00:16:49.120
penalty significantly stronger than the reconstruction

00:16:49.120 --> 00:16:52.360
error, you force what's called manifold disentanglement.
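
NOTE
A minimal Python/numpy sketch of the beta dial, assuming the same toy loss terms as the earlier sketch; the default beta value is illustrative. Beta greater than 1 strengthens the organizer's pull, which is what encourages disentangled latent axes.
import numpy as np
def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    recon = np.mean((x - x_hat) ** 2)
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon + beta * kl         # the only change: a weighted multiplier on the KL term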

00:16:52.639 --> 00:16:55.000
Manifold disentanglement. Let's translate the

00:16:55.000 --> 00:16:56.879
math into reality. What does that actually look

00:16:56.879 --> 00:16:59.659
like when the AI is processing an image? It means

00:16:59.659 --> 00:17:02.480
the AI is under so much pressure to be mathematically

00:17:02.480 --> 00:17:05.359
efficient that it automatically discovers and

00:17:05.359 --> 00:17:08.180
separates the most fundamental independent visual

00:17:08.180 --> 00:17:11.839
concepts without any human teaching it. Really?

00:17:12.029 --> 00:17:14.950
Like what? For instance, if you feed a standard

00:17:14.950 --> 00:17:19.170
VAE pictures of 3D chairs, it might learn a messy,

00:17:19.430 --> 00:17:22.569
entangled soup of chairness where changing one

00:17:22.569 --> 00:17:24.930
variable alters the color, the size, and the

00:17:24.930 --> 00:17:27.089
shape all at once. All jumbled together. Right.

00:17:27.279 --> 00:17:30.079
But a beta VAE, because of that extreme restriction

00:17:30.079 --> 00:17:32.740
on its latent space, will naturally separate

00:17:32.740 --> 00:17:34.779
the features out to survive the penalty. Oh,

00:17:34.779 --> 00:17:37.519
wow. It will dedicate one specific axis on our

00:17:37.519 --> 00:17:40.579
graph solely to the width of the chair. It dedicates

00:17:40.579 --> 00:17:43.259
another axis solely to the leg style, another

00:17:43.259 --> 00:17:46.480
axis solely to the rotation angle. It disentangles

00:17:46.480 --> 00:17:48.779
the underlying mathematical factors of reality.

00:17:48.920 --> 00:17:51.599
That is wild. It basically creates highly specific

00:17:51.599 --> 00:17:54.460
isolated sliders for the building blocks of the

00:17:54.460 --> 00:17:57.549
physical world. It really does. The sources also mention the conditional

00:17:57.549 --> 00:18:00.950
VAE, or CVAE, which takes almost the opposite

00:18:00.950 --> 00:18:03.059
approach to unsupervised learning, right? Yeah,

00:18:03.200 --> 00:18:07.000
the CVAE injects actual label information directly

00:18:07.000 --> 00:18:10.059
into the latent space. So instead of aggressively

00:18:10.059 --> 00:18:12.319
penalizing the model and hoping it organically

00:18:12.319 --> 00:18:15.160
figures out what a chair is, you feed it the

00:18:15.160 --> 00:18:17.940
label chair alongside the image. Oh, so you just

00:18:17.940 --> 00:18:20.940
tell it. Exactly. This places a deterministic

00:18:20.940 --> 00:18:23.380
constraint on the representation. You're giving

00:18:23.380 --> 00:18:26.380
the latent space explicit rules, allowing you

00:18:26.380 --> 00:18:28.700
to manually pull those sliders and say to the

00:18:28.700 --> 00:18:31.440
decoder, generate a chair, but conditionally

00:18:31.440 --> 00:18:34.049
constrain it to look exactly like this specific

00:18:34.049 --> 00:18:36.430
style. And there are hybrid models too, right?

00:18:36.849 --> 00:18:39.829
Mixing the probabilistic latent spaces of VAEs

00:18:39.829 --> 00:18:42.710
with GANs, generative adversarial networks, to

00:18:42.710 --> 00:18:45.230
improve the crispness and pixel quality of the

00:18:45.230 --> 00:18:47.410
generated outputs. Yeah, GANs are great at high

00:18:47.410 --> 00:18:49.710
fidelity generation. But looking at the evolution

00:18:49.710 --> 00:18:51.589
of these variants, there's an interesting tension

00:18:51.589 --> 00:18:55.319
here. If you take a beta VAE and you crank up

00:18:55.319 --> 00:18:57.319
the strictness of the KL divergence to force

00:18:57.319 --> 00:19:00.339
the AI to neatly disentangle its factors, doesn't

00:19:00.339 --> 00:19:02.599
that massive restriction actually limit what

00:19:02.599 --> 00:19:05.269
the AI can imagine? That is a great point. If

00:19:05.269 --> 00:19:07.589
we connect this to the bigger picture, that tension

00:19:07.589 --> 00:19:10.089
between structural predictability and organic

00:19:10.089 --> 00:19:13.130
creativity is exactly what AI researchers are

00:19:13.130 --> 00:19:15.269
still wrestling with. Are they finding workarounds

00:19:15.269 --> 00:19:17.609
for that, too? They are. In fact, many recent

00:19:17.609 --> 00:19:19.829
approaches detailed in the sources have decided

00:19:19.829 --> 00:19:22.650
that the Kullback-Leibler divergence is actually

00:19:22.650 --> 00:19:25.970
fundamentally too restrictive for certain highly

00:19:25.970 --> 00:19:29.210
complex creative tasks. Wait, so they fired this

00:19:29.210 --> 00:19:31.930
strict organizer? They replaced it with a completely

00:19:31.930 --> 00:19:34.500
different type of management. The source material

00:19:34.500 --> 00:19:38.400
highlights several statistical distance VAE variants.

00:19:38.740 --> 00:19:41.440
Statistical distance? Yeah. These models completely

00:19:41.440 --> 00:19:44.000
remove the KL divergence and replace it with

00:19:44.000 --> 00:19:46.279
alternative mathematical ways to measure the

00:19:46.279 --> 00:19:49.339
distance between distributions. They use complex

00:19:49.339 --> 00:19:51.799
geometric formulas like the sliced Wasserstein

00:19:51.799 --> 00:19:54.660
distance, the energy distance, or the maximum

00:19:54.660 --> 00:19:57.960
mean discrepancy known as MMD. Let's break down

00:19:57.960 --> 00:20:00.720
the mechanics of those. How does a sliced Wasserstein

00:20:00.720 --> 00:20:03.440
distance operate differently than KL divergence

00:20:03.440 --> 00:20:05.839
on our latent space graph? Well, think about what

00:20:05.839 --> 00:20:09.160
KL divergence demands. It forces every single

00:20:09.160 --> 00:20:12.500
fuzzy cloud to try and look exactly like a perfect

00:20:12.500 --> 00:20:15.059
standard bell curve centered at zero. Right.

00:20:15.559 --> 00:20:18.539
It's an aggressive uniform mold. Exactly. But

00:20:18.539 --> 00:20:21.980
real-world data is often messy. The true underlying

00:20:21.980 --> 00:20:24.509
shape of the data might look like a donut, or

00:20:24.509 --> 00:20:27.589
maybe two entirely separate asymmetrical clusters.

00:20:27.970 --> 00:20:30.690
Reality isn't just one big bell curve. No, it's

00:20:30.690 --> 00:20:33.029
not. By using something like the Wasserstein

00:20:33.029 --> 00:20:34.750
distance, which is actually often called the

00:20:34.750 --> 00:20:37.430
Earth Mover's distance, because it literally measures

00:20:37.430 --> 00:20:40.190
the mathematical cost of moving piles of dirt

00:20:40.190 --> 00:20:42.009
from one distribution shape to match another,

00:20:42.750 --> 00:20:45.509
the network stops forcing every individual cloud

00:20:45.509 --> 00:20:47.930
into a rigid bell shape. Oh, so it allows the

00:20:47.930 --> 00:20:50.490
latent space to take on much more complex organic

00:20:50.490 --> 00:20:53.250
and weird shapes. Exactly. It ensures that the

00:20:53.250 --> 00:20:55.480
overall collection of all the points the encoder

00:20:55.480 --> 00:20:57.359
produces, which is called the push-forward measure,

00:20:57.900 --> 00:21:00.680
closely matches the complex empirical distribution

00:21:00.680 --> 00:21:03.339
of the real target data. So it ensures the whole

00:21:03.339 --> 00:21:05.839
filing room reflects the messy reality of the

00:21:05.839 --> 00:21:08.460
world, rather than forcing every individual file

00:21:08.460 --> 00:21:11.569
to look identical. Right. Which in turn offers

00:21:11.569 --> 00:21:14.230
completely different, highly nuanced flavors

00:21:14.230 --> 00:21:17.569
of generative creativity and accuracy. Which

00:21:17.569 --> 00:21:19.730
is absolutely mind-blowing when you take a step

00:21:19.730 --> 00:21:21.670
back and look at what we've unpacked here today.
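
NOTE
A minimal Python/numpy sketch of the sliced Wasserstein idea mentioned above: project both point clouds onto random 1-D directions, where the Wasserstein-1 distance reduces to comparing sorted samples. The number of slices, the equal-sample-count assumption, and all names are illustrative.
import numpy as np
def sliced_wasserstein(x, y, n_slices=50, seed=0):
    # x, y: (n, dim) point clouds with the same number of samples, e.g. the
    # encoder's push-forward points vs. samples drawn from the target distribution
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_slices):
        theta = rng.standard_normal(x.shape[1])
        theta /= np.linalg.norm(theta)            # random unit direction (one "slice")
        px, py = np.sort(x @ theta), np.sort(y @ theta)
        total += np.mean(np.abs(px - py))         # 1-D Wasserstein-1 distance of the projections
    return total / n_slices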

00:21:21.789 --> 00:21:24.710
For you listening, why does all this dense math

00:21:24.710 --> 00:21:28.990
about ELBOs, reparameterization tricks, and Wasserstein

00:21:28.990 --> 00:21:31.589
distances actually matter to your understanding

00:21:31.589 --> 00:21:34.089
of technology? It matters because variational

00:21:34.089 --> 00:21:36.809
autoencoders are a foundational pillar of the

00:21:36.809 --> 00:21:39.920
modern generative AI landscape. Right. Understanding

00:21:39.920 --> 00:21:42.220
their mechanics means understanding a profound

00:21:42.220 --> 00:21:44.859
shift in computer science. The most powerful

00:21:44.859 --> 00:21:47.579
algorithms in the world today don't just mechanically

00:21:47.579 --> 00:21:50.279
memorize the reality we feed them. No, they don't

00:21:50.279 --> 00:21:53.259
just record pixels. They compress reality into

00:21:53.259 --> 00:21:55.299
underlying mathematical concepts. They learn

00:21:55.299 --> 00:21:57.440
the physics, the lighting, the geometry. They

00:21:57.440 --> 00:21:59.839
map out the statistical fuzzy probabilities of

00:21:59.839 --> 00:22:02.259
what reality could be. They learn the vibe of

00:22:02.259 --> 00:22:04.339
the universe and from that managed uncertainty,

00:22:04.799 --> 00:22:06.759
they generate something entirely new that has

00:22:06.759 --> 00:22:09.440
never existed before. It's incredible. And that

00:22:09.440 --> 00:22:12.180
leaves us with a fascinating implication to mull

00:22:12.180 --> 00:22:15.160
over as this technology scales. We talked about

00:22:15.160 --> 00:22:18.519
how these models can achieve manifold disentanglement,

00:22:19.119 --> 00:22:21.480
isolating the exact independent mathematical

00:22:21.480 --> 00:22:24.200
sliders for the rotation of a chair or the lighting

00:22:24.200 --> 00:22:26.519
of a room. Right, the physical building blocks.

00:22:26.700 --> 00:22:29.660
But as these probabilistic latent spaces grow

00:22:29.660 --> 00:22:32.400
exponentially larger and process entirely different

00:22:32.400 --> 00:22:35.599
types of data, what happens when they start disentangling

00:22:35.599 --> 00:22:38.799
deeply abstract concepts? Oh, wow. If a network

00:22:38.799 --> 00:22:41.319
can isolate the mathematical variable for width,

00:22:41.740 --> 00:22:44.859
could a future vastly scaled architecture isolate

00:22:44.859 --> 00:22:47.500
the exact mathematical variable for sarcasm in

00:22:47.500 --> 00:22:49.980
a block of text? That's crazy to think about.

00:22:50.039 --> 00:22:52.680
Could it find the precise slider for melancholy

00:22:52.680 --> 00:22:55.359
in a piece of music? Mathematically untangling

00:22:55.359 --> 00:22:57.440
human emotion from the data we leave behind?

00:22:57.720 --> 00:23:00.240
A mathematical coordinate for a feeling. That

00:23:00.240 --> 00:23:01.700
is something for you to think about next time

00:23:01.700 --> 00:23:03.539
you see a machine create something beautiful

00:23:03.539 --> 00:23:06.420
or heartbreaking out of thin air. We'll catch

00:23:06.420 --> 00:23:07.480
you on the next deep dive.
