WEBVTT

00:00:00.000 --> 00:00:03.740
Welcome to today's deep dive. So, um, you've

00:00:03.740 --> 00:00:07.139
almost certainly seen those insanely realistic

00:00:07.139 --> 00:00:09.539
AI images popping up lately, right? Yeah, like

00:00:09.539 --> 00:00:11.720
the hyper-realistic deepfakes or, you know,

00:00:12.220 --> 00:00:14.900
those totally surreal digital paintings. Exactly.

00:00:15.000 --> 00:00:17.179
Like, uh, I don't know, a hyper detailed photograph

00:00:17.179 --> 00:00:19.399
of a cat wearing a spacesuit on Mars. Right.

00:00:19.839 --> 00:00:23.280
Yeah. Which is always fun. It is. But, um, have

00:00:23.280 --> 00:00:26.519
you ever actually stopped to wonder what is literally

00:00:26.519 --> 00:00:29.079
happening under the hood to create those? Like,

00:00:29.079 --> 00:00:31.519
how does a computer actually do that? Yeah, it

00:00:31.519 --> 00:00:33.640
feels like a magic trick when you just look at

00:00:33.640 --> 00:00:36.560
the end result. It totally does. So today, our

00:00:36.560 --> 00:00:39.119
mission is to sort of strip away all that intimidating

00:00:39.119 --> 00:00:41.759
math from the source material. We're looking

00:00:41.759 --> 00:00:44.479
at a pretty dense breakdown of diffusion models,

00:00:44.479 --> 00:00:46.759
and we want to extract the actual mechanics of

00:00:46.759 --> 00:00:48.619
how this works. Right, because you want to walk

00:00:48.619 --> 00:00:51.759
away actually understanding how we turn pure,

00:00:52.000 --> 00:00:54.539
meaningless static into, you know, a masterpiece.

00:00:55.140 --> 00:00:57.399
Yeah. And to set the stage here, we need to look

00:00:57.399 --> 00:00:59.700
at the big picture. What is the core goal of

00:00:59.700 --> 00:01:02.340
a diffusion model? Well, broadly speaking, the

00:01:02.340 --> 00:01:05.480
goal is to learn a process that can generate

00:01:05.480 --> 00:01:08.200
brand new data that looks like it belongs to

00:01:08.200 --> 00:01:11.599
an original data set. So like a data set of human

00:01:11.599 --> 00:01:14.180
faces or cat photos. OK, so it wants to make

00:01:14.180 --> 00:01:16.879
a new face that fits in with the old faces. Exactly.

00:01:17.280 --> 00:01:20.060
But the secret to how it does this, and this

00:01:20.060 --> 00:01:22.680
is kind of the ironic part, relies entirely on

00:01:22.680 --> 00:01:25.670
the concept of destruction. So we're building

00:01:25.670 --> 00:01:27.790
by destroying. Yeah, I mean, it is fundamentally

00:01:27.790 --> 00:01:30.069
a problem of thermodynamics applied to data.

00:01:30.230 --> 00:01:32.310
Which sounds intense. Let's break that down.

00:01:32.629 --> 00:01:35.230
The foundation here is a two-part mechanism,

00:01:35.390 --> 00:01:37.530
right? You've got the forward diffusion process

00:01:37.530 --> 00:01:39.629
and then the reverse sampling process. Right,

00:01:39.829 --> 00:01:41.150
those are the two halves of the whole system.

00:01:41.390 --> 00:01:43.489
For the forward process, I was thinking of this

00:01:43.489 --> 00:01:46.450
analogy. Imagine you take a pristine photograph

00:01:46.450 --> 00:01:49.290
and you just slowly start adding TV static to

00:01:49.290 --> 00:01:52.170
it. Frame by frame. Yeah, frame by frame. Until

00:01:52.170 --> 00:01:54.390
eventually it's just pure meaningless noise,

00:01:54.629 --> 00:01:57.510
just a screen of gray fuzz. That's forward diffusion,

00:01:57.670 --> 00:02:00.340
right? That is exactly it. You start with your

00:02:00.340 --> 00:02:04.219
clean image, which we call X0, and over a sequence

00:02:04.219 --> 00:02:07.400
of discrete time steps, you add a tiny scaled

00:02:07.400 --> 00:02:10.120
amount of Gaussian noise. And it's a Markov chain,

00:02:10.159 --> 00:02:12.840
right. So step five only depends on step four.
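
NOTE
A minimal Python sketch of the forward (noising) step just described. This is my own
illustration, not code from the source; the function name, beta schedule, and tensor
shapes are assumptions.
import torch
def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    # Markov step: x_t depends only on x_{t-1}; add a tiny, scaled dose of Gaussian noise.
    noise = torch.randn_like(x_prev)
    return (1 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise
# Repeating this for t = 1..T with small beta_t gradually turns a clean x0 into pure static.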

00:02:12.919 --> 00:02:14.919
Right. You aren't looking at the whole history

00:02:14.919 --> 00:02:17.319
of the image. You're just iteratively degrading

00:02:17.319 --> 00:02:19.699
it until you hit total static. OK. But why does

00:02:19.699 --> 00:02:21.960
that matter? Why are we actively ruining good

00:02:21.960 --> 00:02:24.490
pictures? Well, the source text brings up this

00:02:24.490 --> 00:02:26.830
connection to non-equilibrium thermodynamics,

00:02:27.530 --> 00:02:30.669
which was introduced to this field back in 2015.

00:02:30.870 --> 00:02:34.330
OK, 2015. Yeah. Imagine your entire data set

00:02:34.330 --> 00:02:38.490
of millions of images is like a tightly clustered

00:02:38.490 --> 00:02:41.110
cloud of particles. Like floating in a high-

00:02:41.110 --> 00:02:44.129
dimensional space. Exactly. That cloud represents

00:02:44.129 --> 00:02:47.009
highly ordered information. By adding that Gaussian

00:02:47.009 --> 00:02:49.629
noise over time, you are forcing those particles

00:02:49.629 --> 00:02:52.009
to diffuse out. Oh, I see. Like gas filling a

00:02:52.009 --> 00:02:54.330
room. Right. You're flattening that complex,

00:02:54.330 --> 00:02:57.310
jagged data distribution into a smooth, standard,

00:02:57.389 --> 00:03:00.289
normal distribution. It's conceptually identical

00:03:00.289 --> 00:03:02.550
to the Maxwell-Boltzmann distribution of particles

00:03:02.550 --> 00:03:05.169
settling into a potential well. Okay, I get the

00:03:05.169 --> 00:03:07.650
physics analogy. But I mean, destroying an image

00:03:07.650 --> 00:03:11.009
is easy. Anybody can mess up a picture. How on

00:03:11.009 --> 00:03:13.449
earth does the model undestroy it? Yeah, that's

00:03:13.449 --> 00:03:15.349
the big question. And it's where the neural network

00:03:15.349 --> 00:03:17.469
actually comes in. Because if it's just trained

00:03:17.469 --> 00:03:19.949
to reverse the noise of a specific data set,

00:03:20.009 --> 00:03:22.169
isn't it just like memorizing the training data?

00:03:22.430 --> 00:03:24.750
Like, how is it creating a novel image and not

00:03:24.750 --> 00:03:27.169
just regurgitating a specific cat it already

00:03:27.169 --> 00:03:30.460
saw? That's a super common misconception. But

00:03:30.460 --> 00:03:33.099
the brilliant twist here is what the model is

00:03:33.099 --> 00:03:36.340
actually learning to approximate. It's not memorizing

00:03:36.340 --> 00:03:39.680
specific paths back to specific images. OK, what

00:03:39.680 --> 00:03:42.060
is it doing then? By forcing the neural network

00:03:42.060 --> 00:03:44.379
to watch this destruction happen step by step,

00:03:44.780 --> 00:03:46.719
the model is trained to reverse the mathematical

00:03:46.719 --> 00:03:50.379
process. It learns the continuous vector field

00:03:50.379 --> 00:03:53.169
of the probability distribution. So it's charting

00:03:53.169 --> 00:03:55.490
a new course every time. Yeah, exactly. When

00:03:55.490 --> 00:03:57.909
you start with pure noise to generate a new image,

00:03:58.310 --> 00:04:00.250
you're dropping a pin in a totally random spot

00:04:00.250 --> 00:04:03.750
in that static. As the model denoises, it charts

00:04:03.750 --> 00:04:05.710
a brand new trajectory through the data space,

00:04:06.189 --> 00:04:08.810
steering toward areas of high data density. Oh,

00:04:08.810 --> 00:04:11.389
wow. So it's synthesizing features. It learned

00:04:11.389 --> 00:04:14.310
like textures, edges, colors into a configuration

00:04:14.310 --> 00:04:16.910
that literally didn't exist before. Right. It

00:04:16.910 --> 00:04:19.069
obeys the statistical rules of the data set,

00:04:19.430 --> 00:04:21.730
but it builds something entirely new. OK. So

00:04:21.730 --> 00:04:23.769
it's learning the rules of the terrain, not just

00:04:23.769 --> 00:04:26.250
memorizing the map. That makes sense. But how

00:04:26.250 --> 00:04:28.889
does the computer actually execute that reverse

00:04:28.889 --> 00:04:31.889
trajectory? Well, that brings us to the DDPM.

00:04:32.129 --> 00:04:34.689
The denoising diffusion probabilistic model.

00:04:35.000 --> 00:04:37.639
From around 2020, right? Yeah, that was a massive

00:04:37.639 --> 00:04:40.339
conceptual shift. Because initially, you might

00:04:40.339 --> 00:04:42.240
think the neural network is looking at the static

00:04:42.240 --> 00:04:45.480
and trying to guess the final pristine picture.

00:04:45.660 --> 00:04:47.439
Right, trying to see the cat and the noise. But

00:04:47.439 --> 00:04:50.079
it's not. The AI isn't predicting the original

00:04:50.079 --> 00:04:52.939
image. It's only predicting the noise. Wait,

00:04:52.959 --> 00:04:54.680
instead of trying to predict what the final picture

00:04:54.680 --> 00:04:56.759
should look like, the AI is just predicting the

00:04:56.759 --> 00:05:00.060
noise itself. Yes. The exact noise that was added

00:05:00.060 --> 00:05:02.319
at that specific step. It's almost like, OK,

00:05:02.459 --> 00:05:04.560
imagine a sculptor looking at a block of marble.

00:05:04.839 --> 00:05:06.819
And instead of trying to see the statue inside,

00:05:06.939 --> 00:05:09.879
they just expertly identify the dust and sweep

00:05:09.879 --> 00:05:12.060
it away, like one layer at a time. That is a

00:05:12.060 --> 00:05:14.839
perfect analogy. The neural network, which is

00:05:14.839 --> 00:05:17.279
often called the backbone, is looking at step

00:05:17.279 --> 00:05:20.699
T and predicting epsilon, which is the specific

00:05:20.699 --> 00:05:23.740
Gaussian noise added at that exact microsecond.
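
NOTE
A hedged sketch of the training step implied here: pick a random step t, noise the clean
image, and train the backbone to predict exactly the noise that was added. The closed-form
jump to step t via alpha_bar (the running product of 1 - beta) is a standard identity; the
model signature model(x_t, t) is an assumption.
import torch
import torch.nn.functional as F
def ddpm_training_loss(model, x0, alpha_bar):
    t = torch.randint(0, len(alpha_bar), (x0.shape[0],))
    eps = torch.randn_like(x0)                      # the exact noise we are about to add
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps      # noisy image at step t
    eps_pred = model(x_t, t)                        # backbone predicts the noise, not the image
    return F.mse_loss(eps_pred, eps)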

00:05:23.939 --> 00:05:26.240
So you just subtract that predicted noise and

00:05:26.240 --> 00:05:28.740
you inch one step closer to the clean image.
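
NOTE
A hedged sketch of one DDPM reverse step: predict the noise, subtract the right fraction of
it, and (except at the very last step) add a little fresh noise back. My own illustration;
the sigma choice (square root of beta_t) and the model signature are assumptions.
import torch
def ddpm_reverse_step(model, x_t, t, beta_t, alpha_bar_t):
    eps_pred = model(x_t, t)                                      # predicted noise at step t
    mean = (x_t - beta_t / (1 - alpha_bar_t) ** 0.5 * eps_pred) / (1 - beta_t) ** 0.5
    if t == 0:
        return mean                                               # final step: no fresh noise
    return mean + beta_t ** 0.5 * torch.randn_like(x_t)           # inch one step closer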

00:05:28.939 --> 00:05:31.860
Exactly. You predict the noise, subtract a fraction

00:05:31.860 --> 00:05:35.290
of it, and step backward. And this bridges perfectly

00:05:35.290 --> 00:05:37.910
into another mind-bending concept from the source

00:05:37.910 --> 00:05:41.170
material, score-based generative models. Oh,

00:05:41.310 --> 00:05:43.430
yeah, the noise conditional score networks. For

00:05:43.430 --> 00:05:46.009
a while, the research community treated DDPMs

00:05:46.009 --> 00:05:48.470
and score-based models as like two completely

00:05:48.470 --> 00:05:50.769
different tracks, right? Yeah, they did. Because

00:05:50.769 --> 00:05:53.149
DDPMs think about removing discrete steps of

00:05:53.149 --> 00:05:55.689
noise, while score-based models use continuous

00:05:55.689 --> 00:05:57.769
time and something called Langevin dynamics.

00:05:58.009 --> 00:06:00.350
Langevin dynamics. OK, how does that work? They

00:06:00.350 --> 00:06:02.720
want to find the score function. Basically, they

00:06:02.720 --> 00:06:04.980
look at a noisy pixel and ask, in which direction

00:06:04.980 --> 00:06:06.779
should I move this pixel to make it look more

00:06:06.779 --> 00:06:09.399
like a real image? OK, so imagine the data space

00:06:09.399 --> 00:06:11.579
is like a topographical map. The valleys are

00:06:11.579 --> 00:06:14.519
just pure noise, and the mountain peaks are valid,

00:06:14.779 --> 00:06:18.040
coherent images. So the score function is basically

00:06:18.040 --> 00:06:21.579
an arrow pointing directly uphill. Exactly. It's

00:06:21.579 --> 00:06:25.220
the steepest path toward a coherent image. Langevin

00:06:25.220 --> 00:06:28.240
dynamics pulls the random noise up that gradient

00:06:28.240 --> 00:06:31.100
toward the highly structured data. And the crazy

00:06:31.100 --> 00:06:33.420
part, the real mathematical breakthrough here,

00:06:33.839 --> 00:06:36.100
was proving that predicting the noise in DDPM

00:06:36.100 --> 00:06:38.560
and estimating that uphill gradient in the score-

00:06:38.560 --> 00:06:40.920
based model, they're mathematically equivalent.

00:06:41.100 --> 00:06:44.000
Yes. They are the exact same thing from different

00:06:44.000 --> 00:06:47.139
angles. Predicting the noise to subtract, or

00:06:47.139 --> 00:06:49.540
calculating the vector to nudge the pixels uphill,

00:06:50.040 --> 00:06:52.569
it's two sides of the same coin. That is wild.
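
NOTE
A hedged sketch of the score-based view: one Langevin-style update nudges a noisy sample
"uphill" along the score, plus a small random kick. Illustrative only; score_fn and
step_size are placeholders, not names from the source.
import torch
def langevin_step(score_fn, x, step_size):
    grad = score_fn(x)                                   # arrow pointing toward denser data
    return x + step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(x)
# The equivalence discussed above: for Gaussian-noised data the score is just the rescaled
# negative of the predicted noise, score(x_t) = -eps_pred / sigma_t.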

00:06:52.850 --> 00:06:54.850
OK, so mathematically, we can reverse entropy.

00:06:55.009 --> 00:06:57.470
We can turn static into art. We can. But practically,

00:06:57.930 --> 00:07:00.610
the original DDPM takes around 1,000 steps to

00:07:00.610 --> 00:07:03.649
turn noise into an image. Doing that pixel by

00:07:03.649 --> 00:07:06.170
pixel for a high-res image sounds like it would

00:07:06.170 --> 00:07:08.970
melt my laptop. Oh, absolutely. It was computationally

00:07:08.970 --> 00:07:10.529
devastating. So how do we make this practical

00:07:10.529 --> 00:07:12.730
for everyday use? How do I generate an image

00:07:12.730 --> 00:07:15.189
in five seconds on my phone? Well, you have to

00:07:15.189 --> 00:07:17.930
bypass the strict rules of that original Markov

00:07:17.930 --> 00:07:21.459
chain. And the solution to that was DDIM, the

00:07:21.459 --> 00:07:24.399
denoising diffusion implicit model. Okay, DDIM.

00:07:24.500 --> 00:07:26.660
How does that speed things up? In the original

00:07:26.660 --> 00:07:29.860
process, it's stochastic, meaning it relies on

00:07:29.860 --> 00:07:32.399
a random walk. You can't skip steps without breaking

00:07:32.399 --> 00:07:35.939
the math. But DDIM sacrifices a tiny bit of

00:07:35.939 --> 00:07:39.319
that randomness to make the reverse process deterministic.

00:07:39.540 --> 00:07:42.360
Oh, so if it's deterministic, the path is fixed.

00:07:42.680 --> 00:07:44.680
Exactly. And because it's a fixed trajectory,

00:07:44.879 --> 00:07:47.100
you don't have to evaluate every single intermediate

00:07:47.100 --> 00:07:49.720
step. You can take massive mathematical leaps.
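
NOTE
A hedged sketch of the deterministic DDIM update (the eta = 0 case), which is what makes
big jumps across the schedule legal. Illustrative only; model(x_t, t) and the alpha_bar
lookup table are assumptions.
import torch
def ddim_step(model, x_t, t, t_prev, alpha_bar):
    eps_pred = model(x_t, t)
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    x0_pred = (x_t - (1 - a_t) ** 0.5 * eps_pred) / a_t ** 0.5        # implied clean image
    return a_prev ** 0.5 * x0_pred + (1 - a_prev) ** 0.5 * eps_pred   # no random term
# Sampling can then walk a short sub-schedule, e.g. roughly [999, 949, ..., 0], about 20 steps.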

00:07:49.839 --> 00:07:52.680
So you can skip from step 1,000 down to 900.

00:07:53.000 --> 00:07:55.120
Right, and generate a totally coherent image

00:07:55.120 --> 00:07:57.639
in just 20 steps. And it looks just as good as

00:07:57.639 --> 00:07:59.800
one that took 1,000. Wow. OK, so we skipped

00:07:59.800 --> 00:08:03.420
steps. But even with 20 steps, a 4K canvas has

00:08:03.420 --> 00:08:05.920
millions of pixels. Doing heavy math on all of

00:08:05.920 --> 00:08:08.139
them is still a huge bottleneck. Which is where

00:08:08.139 --> 00:08:10.279
latent diffusion comes in. This is the trick

00:08:10.279 --> 00:08:12.000
that completely changed the industry. Right.

00:08:12.060 --> 00:08:14.959
This is what Stable Diffusion uses. Yeah. Operating

00:08:14.959 --> 00:08:17.709
in pixel space is computationally wasteful. A

00:08:17.709 --> 00:08:20.050
big blue sky doesn't need millions of parameters

00:08:20.050 --> 00:08:22.189
to be understood as the blue sky. So instead

00:08:22.189 --> 00:08:25.589
of doing the math on a massive image, you shrink

00:08:25.589 --> 00:08:28.410
the canvas. You use an encoder to squish the

00:08:28.410 --> 00:08:31.629
image into a tiny lower dimensional space. The

00:08:31.629 --> 00:08:34.830
latent space, exactly. It captures the semantic

00:08:34.830 --> 00:08:37.269
essence of the image without keeping track of

00:08:37.269 --> 00:08:39.240
every single pixel. You do all the diffusion

00:08:39.240 --> 00:08:42.399
adding noise, predicting it in this tiny latent

00:08:42.399 --> 00:08:44.639
space. And then when you're done, you just decode

00:08:44.639 --> 00:08:47.019
it back to full size. Exactly. You're doing the

00:08:47.019 --> 00:08:48.960
heavy lifting on a tiny blueprint instead of

00:08:48.960 --> 00:08:51.240
the full construction site. That is so smart.
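
NOTE
A hedged sketch of the latent-diffusion pipeline shape: all the denoising happens on a small
latent tensor, and only one decode pass touches full resolution. The decoder, denoise_loop,
and the 4x64x64 latent shape are placeholders, not details from the source.
import torch
def generate_latent_diffusion(decoder, denoise_loop, latent_shape=(1, 4, 64, 64)):
    z = torch.randn(latent_shape)      # start from noise in the compact latent space
    z = denoise_loop(z)                # the heavy iterative math runs on the tiny "blueprint"
    return decoder(z)                  # decode once back up to the full-size image
# Training-side, an encoder first squishes each training image into this same latent space.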

00:08:51.379 --> 00:08:53.879
OK, so we've made it fast. We shrunk it down.

00:08:54.279 --> 00:08:56.700
But here is my biggest question. If I type in

00:08:56.700 --> 00:08:59.799
a prompt for a black cat with red eyes, how do

00:08:59.799 --> 00:09:02.279
I steer this noise remover? Ah, the steering

00:09:02.279 --> 00:09:04.700
wheel. Yeah, because if it's just removing static

00:09:04.700 --> 00:09:07.559
to find any valid image, how does it know I want

00:09:07.500 --> 00:09:10.379
a black cat and not, like, a picture of a toaster.

00:09:10.440 --> 00:09:12.220
That relies on something called classifier-free

00:09:12.220 --> 00:09:15.179
guidance, or CFG. CFG. I've seen that slider

00:09:15.179 --> 00:09:18.899
in AI art programs. Right. So to understand CFG,

00:09:19.539 --> 00:09:22.240
think about how the model is trained. It's basically

00:09:22.240 --> 00:09:25.139
a noisy channel concept. The model takes a text

00:09:25.139 --> 00:09:27.480
prompt, translates it into a mathematical vector,

00:09:27.919 --> 00:09:30.759
and feeds it in alongside the noise. OK. So it

00:09:30.759 --> 00:09:33.639
learns to associate the word cat with the visual

00:09:33.639 --> 00:09:37.490
pixels of a cat. Yes. But during training, we

00:09:37.490 --> 00:09:40.370
randomly drop the text prompt about 10% of the

00:09:40.370 --> 00:09:42.789
time. Wait, really? Why would you hide the text

00:09:42.789 --> 00:09:45.070
from it? Because you need the model to learn

00:09:45.070 --> 00:09:47.789
two different ways of denoising. It learns conditional

00:09:47.789 --> 00:09:50.610
denoising using the text prompt and unconditional

00:09:50.610 --> 00:09:52.929
denoising, where it just tries to find any image

00:09:52.929 --> 00:09:55.490
without a prompt. Oh, I see. Yeah, so imagine

00:09:55.490 --> 00:09:58.710
the AI has two compasses. The unconditional compass

00:09:58.710 --> 00:10:01.110
just points vaguely toward a coherent image,

00:10:01.610 --> 00:10:03.590
but the conditional compass points specifically

00:10:03.590 --> 00:10:06.889
toward a black cat with red eyes. So by comparing

00:10:06.889 --> 00:10:09.289
the two, you can find the exact difference between

00:10:09.289 --> 00:10:12.029
them. Exactly. You mathematically subtract the

00:10:12.029 --> 00:10:13.509
unconditional prediction from the conditional

00:10:13.509 --> 00:10:16.129
one. That extracts a powerful guidance vector.
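
NOTE
A hedged sketch of classifier-free guidance at one denoising step: run the backbone twice,
subtract the unconditional prediction from the conditional one, and scale that difference.
The signature and the guidance_scale default are assumptions.
import torch
def cfg_noise(model, x_t, t, prompt_emb, null_emb, guidance_scale=7.5):
    eps_cond = model(x_t, t, prompt_emb)     # compass pointing at "black cat with red eyes"
    eps_uncond = model(x_t, t, null_emb)     # compass pointing at "any coherent image"
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
# During training the prompt embedding is swapped for null_emb roughly 10% of the time,
# which teaches the model both behaviours. Negative prompting reuses the same subtraction
# with the unwanted text on the unconditional side, so the model steers away from it.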

00:10:16.190 --> 00:10:18.690
And then you can just multiply that vector to

00:10:18.690 --> 00:10:21.679
push it harder. Right, you scale it up, you aggressively

00:10:21.679 --> 00:10:24.820
nudge the denoising process toward your specific

00:10:24.820 --> 00:10:27.019
prompt. And you can even use negative prompting,

00:10:27.139 --> 00:10:29.879
right, like tell it to find the vector for blurry

00:10:29.879 --> 00:10:32.519
and steer in the exact opposite direction. Yes,

00:10:32.620 --> 00:10:34.500
exactly. It gives you incredible control over

00:10:34.500 --> 00:10:36.220
the generation. OK, so we know how to reverse

00:10:36.220 --> 00:10:39.059
noise, skip steps, shrink the canvas and steer

00:10:39.059 --> 00:10:42.460
it with text. But what is the actual engine doing

00:10:42.460 --> 00:10:45.490
the heavy lifting? the neural network architecture

00:10:45.490 --> 00:10:47.889
itself. Historically, the backbone was almost

00:10:47.889 --> 00:10:51.370
always a UNet. UNets. They're super good at processing

00:10:51.370 --> 00:10:54.669
images, right? They are. They downsample an

00:10:54.669 --> 00:10:57.730
image to learn the broad structure and then upsample

00:10:57.730 --> 00:11:00.429
it to retain fine details. They have a strong

00:11:00.429 --> 00:11:03.470
bias for spatial relationships. They assume neighboring

00:11:03.470 --> 00:11:05.629
pixels matter to each other. Which makes sense

00:11:05.629 --> 00:11:08.309
for pictures. But recently we're seeing a massive

00:11:08.309 --> 00:11:11.769
shift away from UNets toward transformers, specifically

00:11:11.769 --> 00:11:14.230
DiTs, diffusion transformers. Yeah, the paradigm

00:11:14.230 --> 00:11:16.549
is definitely shifting. Which is crazy to me

00:11:16.549 --> 00:11:19.009
because transformers are what run large language

00:11:19.009 --> 00:11:22.210
models like ChatGPT. They process sequences of

00:11:22.210 --> 00:11:24.950
words. Why are they suddenly taking over image

00:11:24.950 --> 00:11:28.269
generation? Because transformers scale so beautifully.

00:11:28.570 --> 00:11:30.970
They don't have that built-in assumption about

00:11:30.970 --> 00:11:33.750
local pixels. They use an attention mechanism

00:11:33.750 --> 00:11:36.649
that lets every part of the input look at every

00:11:36.649 --> 00:11:39.210
other part, no matter how far away it is. So

00:11:39.210 --> 00:11:41.470
you just chop the image up into little patches,

00:11:41.909 --> 00:11:44.129
treat them like words in a sentence, and feed

00:11:44.129 --> 00:11:46.710
them into the transformer. Exactly. And as you

00:11:46.710 --> 00:11:48.870
throw more compute at a transformer, it just keeps

00:11:48.870 --> 00:11:51.850
getting better. UNets tend to hit a performance

00:11:51.850 --> 00:11:54.399
ceiling. That makes total sense. And we're seeing

00:11:54.399 --> 00:11:56.840
this play out in the real world right now. Let's

00:11:56.840 --> 00:11:58.899
look at some examples from the source text. Sure.

00:11:59.220 --> 00:12:03.019
Look at OpenAI. DALL-E 2 uses a really interesting

00:12:03.019 --> 00:12:06.399
unCLIP method converting text encodings directly

00:12:06.399 --> 00:12:09.679
into image encodings. But their new video generator,

00:12:09.879 --> 00:12:13.019
Sora, that uses a diffusion transformer. Right,

00:12:13.100 --> 00:12:16.240
the DiT. Because video is just a sequence of

00:12:16.240 --> 00:12:18.929
patches over time. Exactly. And Stability AI,

00:12:19.110 --> 00:12:21.409
who popularized latent diffusion with a UNet,

00:12:21.870 --> 00:12:24.129
they actually moved to a transformer model for

00:12:24.129 --> 00:12:26.870
Stable Diffusion 3. Wow, everyone is pivoting.

00:12:27.190 --> 00:12:28.970
But there are other approaches too, right? Like

00:12:28.970 --> 00:12:31.009
Google's Imogen. They don't use the latent space

00:12:31.009 --> 00:12:33.850
trick at all. No. Google uses cascading diffusion

00:12:33.850 --> 00:12:36.110
models. Basically, they start with one model

00:12:36.110 --> 00:12:39.570
generating a tiny 64 by 64 pixel image. Just

00:12:39.570 --> 00:12:41.389
getting the broad strokes right. Right. Then

00:12:41.389 --> 00:12:43.389
a totally separate diffusion model takes that

00:12:43.389 --> 00:12:46.350
and upscales it to 256. And then another model

00:12:46.350 --> 00:12:49.009
upscales that to 1,024. Like an assembly line.

00:12:49.149 --> 00:12:51.870
Yeah. And then you have meta with transfusion,

00:12:52.230 --> 00:12:54.789
which actually combines autoregressive text generation

00:12:54.789 --> 00:12:58.230
with denoising diffusion in a single model. Or

00:12:58.230 --> 00:13:01.149
their Movie Gen, which uses flow matching. It's

00:13:01.149 --> 00:13:04.039
incredible how adaptable this math is. The text

00:13:04.039 --> 00:13:06.100
points out these models aren't just for images.

00:13:06.399 --> 00:13:08.840
They're completely modality agnostic. Oh, totally.

00:13:08.940 --> 00:13:11.120
You can use it for audio generation. Yeah. Or

00:13:11.120 --> 00:13:13.580
human motion synthesis. Wait, motion synthesis?

00:13:13.759 --> 00:13:15.799
Yeah, you take a noisy chaotic trajectory of

00:13:15.799 --> 00:13:19.440
a 3D skeleton, and the model denoises the temporal

00:13:19.440 --> 00:13:21.559
sequence until it collapses into a perfectly

00:13:21.559 --> 00:13:24.639
smooth, realistic walking animation. That is

00:13:24.639 --> 00:13:26.539
mind-blowing. It really works on anything. It

00:13:26.539 --> 00:13:28.639
does. It can even be used for natural language

00:13:28.639 --> 00:13:31.120
text generation. Okay, so just to recap the journey

00:13:31.120 --> 00:13:33.759
we've been on, we started with this crazy idea

00:13:33.759 --> 00:13:37.080
of thermodynamics and entropy, destroying a sandcastle

00:13:37.080 --> 00:13:39.659
basically. Right, the forward process. Then we

00:13:39.659 --> 00:13:41.580
learned how neural networks predict the noise

00:13:41.580 --> 00:13:44.740
to reverse time, like sweeping dust off a marble

00:13:44.740 --> 00:13:47.440
statue. We sped it up by skipping steps with

00:13:47.440 --> 00:13:50.039
DDIM, shrank the canvas with latent diffusion,

00:13:50.500 --> 00:13:53.480
and learned to steer it using CFG text guidance.

00:13:53.960 --> 00:13:55.919
And swapped out the engine for transformers.

00:13:56.399 --> 00:13:58.620
Exactly. So the next time you generate an AI

00:13:58.620 --> 00:14:00.700
image, you're not just running a filter. You're

00:14:00.700 --> 00:14:03.559
literally watching a simulated universe of random

00:14:03.559 --> 00:14:06.759
particles collapsing into a state of perfect

00:14:06.759 --> 00:14:09.600
guided order. It's a localized reversal of entropy.

00:14:10.200 --> 00:14:14.299
But before we wrap up, there is one final totally

00:14:14.299 --> 00:14:16.220
mind -bending concept at the end of the source

00:14:16.220 --> 00:14:19.029
material that we have to talk about. Oh. Lay

00:14:19.029 --> 00:14:21.070
it on me. We talked a lot about how the path

00:14:21.070 --> 00:14:24.350
from pure noise to a final image is a squiggly

00:14:24.350 --> 00:14:27.049
random walk, right? That Itô process. Yeah, that's

00:14:27.049 --> 00:14:29.029
why it takes so many steps. Well, researchers

00:14:29.029 --> 00:14:30.929
are now looking at something called flow-based

00:14:30.929 --> 00:14:34.289
diffusion models and rectified flow. Rectified

00:14:34.289 --> 00:14:36.190
flow? What are they rectifying? The trajectory

00:14:36.190 --> 00:14:39.549
itself. Rectified flow attempts to learn a process

00:14:39.549 --> 00:14:41.690
that makes the path between the noise and the

00:14:41.690 --> 00:14:43.970
final image a perfectly straight line. Wait,

00:14:44.330 --> 00:14:47.460
a perfectly straight line? Yes. It mathematically

00:14:47.460 --> 00:14:50.179
removes the curvature from the probability space.
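
NOTE
A hedged sketch of the rectified-flow idea: train a velocity model on straight-line paths
between noise and data, then sample with as little as a single Euler step. Illustrative
only; velocity_model(x, t) and the tensor shapes are assumptions.
import torch
import torch.nn.functional as F
def rectified_flow_loss(velocity_model, x_data):
    x_noise = torch.randn_like(x_data)
    t = torch.rand(x_data.shape[0]).view(-1, 1, 1, 1)
    x_t = (1 - t) * x_noise + t * x_data        # a point on the straight line from noise to data
    target = x_data - x_noise                   # constant velocity along that line
    return F.mse_loss(velocity_model(x_t, t), target)
def one_step_sample(velocity_model, shape):
    x = torch.randn(shape)                      # pure static
    t0 = torch.zeros(shape[0]).view(-1, 1, 1, 1)
    return x + velocity_model(x, t0)            # one Euler step along the straight path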

00:14:50.460 --> 00:14:53.299
But hold on. If the mathematical path from static

00:14:53.299 --> 00:14:56.059
to the image is perfectly straight, you don't

00:14:56.059 --> 00:14:58.299
need to take 20 gentle steps to curve your way

00:14:58.299 --> 00:15:00.759
there. Exactly. You could calculate the exact

00:15:00.759 --> 00:15:04.159
solution in one single step. Wait, really? One

00:15:04.159 --> 00:15:06.899
single step? Just one mathematical calculation.

00:15:07.340 --> 00:15:11.179
From pure random noise to a flawless high-resolution

00:15:11.179 --> 00:15:15.500
image instantly. Instantly. Wow. And if we completely

00:15:15.500 --> 00:15:17.799
eliminate that random walk and replace it with

00:15:17.799 --> 00:15:20.940
a single geometric calculation, are we even doing

00:15:20.940 --> 00:15:23.460
diffusion anymore? Or have we just discovered

00:15:23.460 --> 00:15:26.659
a totally new way to mathematically conjure reality

00:15:26.659 --> 00:15:28.799
out of the void? That gives me chills. I mean,

00:15:28.820 --> 00:15:31.480
if the compute cost plummets to near zero and

00:15:31.480 --> 00:15:33.799
the generation is instantaneous, you wouldn't

00:15:33.799 --> 00:15:36.139
just be generating static images, you'd be generating

00:15:36.139 --> 00:15:38.200
real-time interactive realities on the fly.

00:15:38.480 --> 00:15:41.200
Exactly. Real-time rendering of reality. You

00:15:41.200 --> 00:15:43.139
wouldn't need a graphics card to render a video

00:15:43.139 --> 00:15:45.580
game level. The AI would just diffuse the world

00:15:45.580 --> 00:15:48.019
into existence at 60 frames per second based

00:15:48.019 --> 00:15:50.399
purely on where you look. It makes you wonder

00:15:50.399 --> 00:15:53.360
if pre -recorded media will even exist in a decade.

00:15:54.000 --> 00:15:56.019
Something for you to chew on next time you fire

00:15:56.019 --> 00:15:58.799
up an AI generator. Thanks for joining us and

00:15:58.799 --> 00:16:00.259
we'll catch you on the next deep dive.
