WEBVTT

00:00:00.000 --> 00:00:02.240
You've got five different browser tabs open right

00:00:02.240 --> 00:00:05.139
now. You're exporting footage from one AI tool

00:00:05.139 --> 00:00:07.320
to another. You just want to stitch a single

00:00:07.320 --> 00:00:09.919
video together. Yeah, it's a massive glitchy

00:00:09.919 --> 00:00:12.300
headache for modern creators. The final export

00:00:12.300 --> 00:00:15.179
usually feels completely broken and disjointed.

00:00:15.300 --> 00:00:17.600
Oh, completely broken. Today, we're exploring

00:00:17.600 --> 00:00:20.219
a totally new structural framework. Welcome to

00:00:20.219 --> 00:00:22.019
the deep dive, everyone. We're looking at the

00:00:22.019 --> 00:00:25.059
LTX desktop video application. And we're diving

00:00:25.059 --> 00:00:28.579
deep into the LTX 2.3 model. Right. This is

00:00:28.579 --> 00:00:30.339
going to be a really fun one. We'll cover how

00:00:30.339 --> 00:00:33.460
it merges generation and editing natively. We'll

00:00:33.460 --> 00:00:35.380
see how developers smashed arbitrary hardware

00:00:35.380 --> 00:00:38.399
limits completely. And we'll explore some wild

00:00:38.399 --> 00:00:41.960
timeline native AI generation features. Finally,

00:00:41.979 --> 00:00:44.060
we'll ask what this actually means for human

00:00:44.060 --> 00:00:46.810
editors. This completely changes how you think

00:00:46.810 --> 00:00:48.710
about creating videos. You never actually have

00:00:48.710 --> 00:00:50.469
to leave the video editor. You aren't wasting

00:00:50.469 --> 00:00:52.950
hours managing random exported files anymore.

00:00:53.229 --> 00:00:55.810
So before the new workflow, we need some context.

00:00:56.070 --> 00:00:58.350
We have to understand the engine powering this

00:00:58.350 --> 00:01:00.990
system. Traditional editors usually just bolt

00:01:00.990 --> 00:01:04.290
AI onto old architecture. But LTX is built entirely

00:01:04.290 --> 00:01:07.129
around a multimodal model. Which is a massive

00:01:07.129 --> 00:01:09.769
fundamental shift in software design. Let's define

00:01:09.769 --> 00:01:12.459
what we mean by a multimodal model. It's an AI

00:01:12.459 --> 00:01:15.599
generating video, audio, and text simultaneously.
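
To make that definition concrete, here is a toy Python sketch of what "one model, three streams" means. The class, function, frame rate, and sample rate here are all invented for illustration; this is not the LTX API:

```python
from dataclasses import dataclass

FPS = 25              # assumed frame rate for the sketch
SAMPLE_RATE = 16000   # assumed audio sample rate for the sketch

@dataclass
class MultimodalClip:
    """One generation result: video, audio, and text produced together."""
    frames: list   # one entry per video frame
    audio: list    # raw audio samples
    caption: str   # text emitted alongside the clip

def generate(prompt: str, seconds: float) -> MultimodalClip:
    # Stand-in for a real multimodal model: emit placeholder streams
    # whose lengths are consistent with the requested duration.
    n_frames = int(seconds * FPS)
    n_samples = int(seconds * SAMPLE_RATE)
    return MultimodalClip(
        frames=[None] * n_frames,
        audio=[0.0] * n_samples,
        caption=f"clip for: {prompt}",
    )

clip = generate("a dog running on a beach", seconds=2.0)
# The key property: all three modalities cover the same time span.
assert len(clip.frames) / FPS == len(clip.audio) / SAMPLE_RATE == 2.0
```

The point of the sketch is the invariant at the end: a multimodal model keeps the streams time-aligned by construction instead of syncing them in post.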

00:01:16.120 --> 00:01:18.239
That's a crucial definition for everything we

00:01:18.239 --> 00:01:21.719
discussed today. So LTX 2.3 brings four major

00:01:21.719 --> 00:01:24.909
systemic upgrades. First, they rebuilt the visual

00:01:24.909 --> 00:01:27.590
autoencoder from scratch. Yeah, and that autoencoder

00:01:27.590 --> 00:01:30.489
controls the generated texture sharpness. Older

00:01:30.489 --> 00:01:32.709
models compressed things way too aggressively

00:01:32.709 --> 00:01:35.670
during generation. You lost crucial edge details

00:01:35.670 --> 00:01:38.209
during the decompression phase. The result was

00:01:38.209 --> 00:01:40.670
a notoriously muddy or blurry texture. Right.

00:01:40.750 --> 00:01:43.930
But this rebuilt version handles raw pixel data

00:01:43.930 --> 00:01:47.049
beautifully. You get incredibly clean, sharp

00:01:47.049 --> 00:01:49.810
object edges everywhere now. They also completely

00:01:49.810 --> 00:01:53.079
retrained the model's motion data. I'll admit,

00:01:53.180 --> 00:01:56.140
I still wrestle with prompt drift myself. It's

00:01:56.140 --> 00:01:57.840
so frustrating when models just freeze halfway

00:01:57.840 --> 00:02:00.019
through. That was genuinely the biggest early

00:02:00.019 --> 00:02:02.079
user complaint. The AI would just forget the

00:02:02.079 --> 00:02:03.760
physics of a scene. Yeah, it ruins the shot.

00:02:03.959 --> 00:02:06.760
But LTX 2.3 keeps the generated movement completely

00:02:06.760 --> 00:02:09.180
natural. It understands how objects maintain

00:02:09.180 --> 00:02:11.840
momentum over time perfectly. The third upgrade

00:02:11.840 --> 00:02:15.080
focuses entirely on native vertical video. It

00:02:15.080 --> 00:02:18.530
supports 1080 by 1920 perfectly out of the box,

00:02:18.530 --> 00:02:20.949
which is an absolute game changer for short-form

00:02:20.949 --> 00:02:23.469
video creators. And the fourth major upgrade brings

00:02:23.469 --> 00:02:27.189
a HiFi-GAN vocoder. Right. That new vocoder creates

00:02:27.189 --> 00:02:31.009
incredibly clean audio sync. It reconstructs

00:02:31.009 --> 00:02:33.330
audio waveforms to match micro movements perfectly.

00:02:33.629 --> 00:02:36.349
It successfully removes awkward silences and

00:02:36.349 --> 00:02:39.430
digital noise artifacts. Why does training natively

00:02:39.430 --> 00:02:41.770
on vertical video actually matter? Why not just

00:02:41.770 --> 00:02:44.509
crop standard horizontal landscape footage? Well,

00:02:44.590 --> 00:02:46.770
cropping horizontal video almost always ruins

00:02:46.770 --> 00:02:49.289
your composition. Native training helps the AI

00:02:49.289 --> 00:02:52.009
frame vertical subjects properly. The generated

00:02:52.009 --> 00:02:54.449
subjects fit the tall aspect ratio naturally.
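
A quick back-of-the-envelope shows why cropping hurts: center-cropping 1920x1080 landscape footage to a 9:16 vertical frame keeps only about a third of the horizontal pixels. A small Python check (the function is just illustrative arithmetic, not anything from LTX):

```python
def vertical_crop_width(src_h: int, aspect_w: int = 9, aspect_h: int = 16) -> int:
    """Columns kept when center-cropping landscape footage to a vertical
    aspect ratio while holding the full source height."""
    return src_h * aspect_w // aspect_h

kept = vertical_crop_width(1080)   # 9:16 crop of a 1920x1080 frame
assert kept == 607                 # only 607 of 1920 columns survive
assert kept / 1920 < 1 / 3         # roughly two thirds of the shot is thrown away
```

That lost two thirds is exactly where a landscape composition tends to put its subjects, which is why a natively vertical model frames better than a crop.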

00:02:54.759 --> 00:02:57.000
So native training prevents awkward cropping

00:02:57.000 --> 00:02:59.479
and frames perfectly. Exactly. Now, how do we

00:02:59.479 --> 00:03:02.219
get this running locally? LTX Desktop is a fully

00:03:02.219 --> 00:03:04.620
local open source editor. There are absolutely

00:03:04.620 --> 00:03:07.680
no subscriptions or per generation costs. You

00:03:07.680 --> 00:03:09.620
get complete and total privacy for your projects.

00:03:10.110 --> 00:03:13.009
That privacy aspect is absolutely huge for studio

00:03:13.009 --> 00:03:15.530
workflows. You literally download the installer

00:03:15.530 --> 00:03:17.969
straight to your machine. The installation process

00:03:17.969 --> 00:03:22.129
is smooth, but the file is massive. You're going

00:03:22.129 --> 00:03:24.969
to need about 70 to 150 gigabytes. Yeah, it has

00:03:24.969 --> 00:03:27.530
to download all the required models. And Windows

00:03:27.530 --> 00:03:29.889
users must remember to run as administrator.

00:03:30.250 --> 00:03:32.710
That simple step prevents the software freezing

00:03:32.710 --> 00:03:35.349
during setup. During that setup, you face a very

00:03:35.349 --> 00:03:38.750
interesting choice. You can use the LTX API for

00:03:38.750 --> 00:03:41.729
text encoding. Or you can download a massive

00:03:41.729 --> 00:03:44.409
local encoder instead. The text encoder basically

00:03:44.409 --> 00:03:47.229
translates your written text prompts. The API

00:03:47.229 --> 00:03:49.990
option is completely free for anyone. It saves

00:03:49.990 --> 00:03:52.669
you about 25 gigabytes of storage space. But

00:03:52.669 --> 00:03:55.050
the local encoder guarantees a fully offline

00:03:55.050 --> 00:03:57.560
workflow. Wait, let me push back on that specific

00:03:57.560 --> 00:03:59.860
choice. So it's totally private and secure on

00:03:59.860 --> 00:04:02.259
my machine, unless I really want to save local

00:04:02.259 --> 00:04:04.340
hard drive space. In that case, my text prompts

00:04:04.340 --> 00:04:06.599
leave my computer. Yeah, that's the exact technical

00:04:06.599 --> 00:04:08.919
trade-off you make here. The API sends just

00:04:08.919 --> 00:04:12.039
your text prompts to external servers. The actual

00:04:12.039 --> 00:04:14.199
video generation still happens entirely on your

00:04:14.199 --> 00:04:17.920
computer. But for total isolation, you must download

00:04:17.920 --> 00:04:20.759
that local encoder. Is there any difference in

00:04:20.759 --> 00:04:23.579
video quality between them? No, the final generated

00:04:23.579 --> 00:04:25.939
video quality remains exactly the same. It's

00:04:25.939 --> 00:04:28.319
purely a difference in local data routing and

00:04:28.319 --> 00:04:31.779
storage. So the API choice only impacts local

00:04:31.779 --> 00:04:36.060
storage routing. But there is a massive hardware

00:04:36.060 --> 00:04:38.579
roadblock sitting ahead. Right. The official

00:04:38.579 --> 00:04:41.959
requirement is 32 gigabytes of VRAM. Let's clarify

00:04:41.959 --> 00:04:45.060
that term really quickly for the listener. VRAM

00:04:45.060 --> 00:04:47.379
is computer memory strictly dedicated to processing

00:04:47.379 --> 00:04:49.759
graphics. That intense requirement basically

00:04:49.759 --> 00:04:52.480
demands an outrageously expensive card. You'd

00:04:52.480 --> 00:04:55.339
need something like an RTX 5090 to run it. Most

00:04:55.339 --> 00:04:57.379
normal people simply do not have that enterprise

00:04:57.379 --> 00:04:59.040
hardware. This is where the story gets really

00:04:59.040 --> 00:05:01.740
fascinating. The open source community completely

00:05:01.740 --> 00:05:04.439
revolted against this hardware limit. They literally

00:05:04.439 --> 00:05:07.300
used AI coding tools to remove it. They built

00:05:07.300 --> 00:05:10.209
alternate forks like 1GP almost instantly. They

00:05:10.209 --> 00:05:12.589
actually got it running on 12 gigabyte consumer

00:05:12.589 --> 00:05:15.730
cards. People are using standard 30 series gaming

00:05:15.730 --> 00:05:18.209
graphics cards now. And that happened within

00:05:18.209 --> 00:05:22.470
a single week of release. Whoa. Imagine scaling

00:05:22.470 --> 00:05:26.470
an enterprise-level hardware wall down to a consumer

00:05:26.470 --> 00:05:30.610
GPU in just seven days. It completely

00:05:30.610 --> 00:05:32.990
changes how we view software development timelines.

00:05:33.529 --> 00:05:35.810
It absolutely democratizes access to professional

00:05:35.810 --> 00:05:39.529
generative video tools. Mac optimization is also

00:05:39.529 --> 00:05:41.850
actively in progress right now. Apple Silicon

00:05:41.850 --> 00:05:44.170
users currently have to use the API connection,

00:05:44.430 --> 00:05:47.069
but native local support is coming very soon.

00:05:47.269 --> 00:05:50.970
Does bypassing the VRAM gate slow down render

00:05:50.970 --> 00:05:53.949
times? Yes. Generating these clips will definitely

00:05:53.949 --> 00:05:56.490
take much longer, but it completely democratizes

00:05:56.490 --> 00:05:58.889
the software for everyday users. You trade rendering

00:05:58.889 --> 00:06:01.089
speed for actual software accessibility. We're

00:06:01.089 --> 00:06:03.649
basically trading rendering speed for total democratization.
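
The accessibility trade-off comes down to memory arithmetic: weights plus working overhead either fit in VRAM or they don't, and the community forks win by shrinking the weights. A rough, purely illustrative Python sketch of that fit check (the parameter count, byte sizes, and overheads are invented, not LTX's real footprint):

```python
def fits_in_vram(params_billions: float, bytes_per_param: float,
                 overhead_gb: float, vram_gb: float) -> bool:
    """Rough fit check: model weights plus working memory vs. available VRAM.
    Every number used below is illustrative, not LTX's actual footprint."""
    weights_gb = params_billions * 1e9 * bytes_per_param / 1024**3
    return weights_gb + overhead_gb <= vram_gb

# A hypothetical 13B-parameter model:
assert not fits_in_vram(13, 2.0, 4.0, 12)  # fp16 weights overflow a 12 GB card
assert fits_in_vram(13, 2.0, 4.0, 32)      # ...but fit the official 32 GB tier
assert fits_in_vram(13, 0.5, 4.0, 12)      # 4-bit quantization squeezes onto 12 GB
```

Quantized weights mean more dequantization work per step, which is the rendering-speed cost the hosts describe.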

00:06:03.829 --> 00:06:06.329
Yeah. So we bypassed the massive hardware limits

00:06:06.329 --> 00:06:09.029
successfully. Now we're inside the actual video

00:06:09.029 --> 00:06:11.410
editor timeline interface. Let's look at where

00:06:11.410 --> 00:06:13.730
the paradigm shift actually happens. You usually

00:06:13.730 --> 00:06:16.689
start in the gen space for quick renders. You

00:06:16.689 --> 00:06:19.769
render your clips at lower resolutions like 540p.

00:06:20.089 --> 00:06:23.709
Then you just use the built-in 2x video upscaler.
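
For intuition, a 2x upscaler maps each source pixel onto a 2x2 block, so a 960x540 render becomes 1920x1080. A toy nearest-neighbour version in Python; the editor's real upscaler is a learned model, not nearest-neighbour, so this only illustrates the geometry:

```python
def upscale_2x(frame):
    """Nearest-neighbour 2x upscale of a frame stored as rows of pixel values.
    Each source pixel becomes a 2x2 block in the output."""
    out = []
    for row in frame:
        wide = [p for p in row for _ in (0, 1)]  # duplicate every column
        out.append(wide)
        out.append(list(wide))                   # duplicate every row
    return out

frame = [[1, 2],
         [3, 4]]
up = upscale_2x(frame)
assert len(up) == 4 and len(up[0]) == 4   # 2x2 frame -> 4x4 frame
assert up[0] == [1, 1, 2, 2]
```

A learned upscaler replaces the block duplication with predicted detail, which is why the 540p-then-upscale workflow can still look sharp.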

00:06:23.810 --> 00:06:26.029
You also get all the standard video editing tools.

00:06:26.149 --> 00:06:28.750
You get color correction and auto letterbox formatting

00:06:28.750 --> 00:06:33.189
natively. But the new AI features are truly revolutionary

00:06:33.189 --> 00:06:35.879
here. The first wild feature is the Regenerate

00:06:35.879 --> 00:06:38.379
Shot tool. You just right-click a clip directly

00:06:38.379 --> 00:06:40.620
on your active timeline. It re -rolls the generation

00:06:40.620 --> 00:06:42.879
without leaving the active editor. You also get

00:06:42.879 --> 00:06:45.319
native image-to-video capabilities integrated.

00:06:45.439 --> 00:06:48.199
You literally just drag a static image onto the

00:06:48.199 --> 00:06:50.899
timeline. You add a prompt to create fluid, natural

00:06:50.899 --> 00:06:53.899
motion. You can even mix in external video footage

00:06:53.899 --> 00:06:56.220
files. You can easily bring in clips from Kling

00:06:56.220 --> 00:06:58.240
or Runway. They all live together seamlessly

00:06:58.240 --> 00:07:00.480
on this unified timeline. The third feature is

00:07:00.480 --> 00:07:02.449
called the Bridge Shots tool. It's currently

00:07:02.449 --> 00:07:05.329
powered by the Gemini AI system natively. It

00:07:05.329 --> 00:07:07.810
analyzes the last frame of your first clip. Then

00:07:07.810 --> 00:07:09.750
it analyzes the first frame of your next clip.

00:07:09.850 --> 00:07:11.949
It automatically generates the missing transition

00:07:11.949 --> 00:07:14.850
footage seamlessly between them. It literally

00:07:14.850 --> 00:07:17.629
fills the empty gap with completely new video.
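
The bridging logic described above can be sketched in a few lines. This toy Python version linearly interpolates between the two boundary frames instead of generating new video, so it is only a stand-in for the real Gemini-powered tool, but the structure is the same: look at the last frame of clip A, the first frame of clip B, and synthesize what goes between:

```python
def bridge_shots(clip_a, clip_b, n_new=3):
    """Fill the gap between two clips. The real tool generates new video from
    the boundary frames; this toy version just linearly interpolates between
    the last frame of clip_a and the first frame of clip_b, treating each
    frame as a flat list of pixel values."""
    last, first = clip_a[-1], clip_b[0]
    bridge = []
    for i in range(1, n_new + 1):
        t = i / (n_new + 1)  # blend weight moves from clip A toward clip B
        bridge.append([(1 - t) * a + t * b for a, b in zip(last, first)])
    return clip_a + bridge + clip_b

a = [[0.0, 0.0]]   # one dark frame
b = [[1.0, 1.0]]   # one bright frame
merged = bridge_shots(a, b, n_new=3)
assert len(merged) == 5
assert merged[1] == [0.25, 0.25] and merged[3] == [0.75, 0.75]
```

Pixel-space blending like this is also a decent mental model for why the feature struggles when the two boundary frames disagree badly, as with a hard lighting shift.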

00:07:17.970 --> 00:07:20.250
Right now, the version one frame matching is

00:07:20.250 --> 00:07:23.269
admittedly quite buggy. Finally, we have the

00:07:23.269 --> 00:07:25.990
native retake inpainting feature here. Let's

00:07:25.990 --> 00:07:28.269
quickly define inpainting so we're all on track.

00:07:28.509 --> 00:07:31.730
It means erasing a mistake so the AI redraws that

00:07:31.730 --> 00:07:35.290
spot. You regenerate just a tiny, isolated portion

00:07:35.290 --> 00:07:37.829
of the clip. The rest of your original video

00:07:37.829 --> 00:07:41.389
remains perfectly intact. There is a small, annoying

00:07:41.389 --> 00:07:44.569
UI scroll bug right now. It's like stacking Lego

00:07:44.569 --> 00:07:47.189
blocks of data on a timeline. You never leave

00:07:47.189 --> 00:07:49.930
the room to manufacture new bricks. Does Bridge Shots

00:07:49.930 --> 00:07:52.449
understand complex lighting changes between clips?

00:07:52.959 --> 00:07:55.040
The current frame matching struggles hard with

00:07:55.040 --> 00:07:57.639
complex lighting shifts natively. You definitely

00:07:57.639 --> 00:07:59.759
have to guide it with very specific prompts.
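
Coming back to the retake feature for a moment: inpainting reduces to masked regeneration, where only the pixels under the mask get redrawn and everything else passes through untouched. A toy Python sketch (the `regenerate` callback stands in for the model, which the sketch does not attempt to reproduce):

```python
def retake(frame, mask, regenerate):
    """Inpainting-style retake: re-run generation only where mask is True,
    leaving every other pixel of the original frame untouched."""
    return [
        [regenerate(x, y) if mask[y][x] else frame[y][x]
         for x in range(len(frame[0]))]
        for y in range(len(frame))
    ]

frame = [[1, 1, 1],
         [1, 9, 1],   # 9 is the "mistake" to erase
         [1, 1, 1]]
mask = [[False, False, False],
        [False, True,  False],
        [False, False, False]]
fixed = retake(frame, mask, regenerate=lambda x, y: 1)
assert fixed == [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```

The mask is what keeps the rest of the clip perfectly intact: the model never gets a chance to touch pixels outside it.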

00:08:00.019 --> 00:08:01.800
So current frame matching struggles with lighting

00:08:01.800 --> 00:08:04.660
without explicit guidance. Exactly. You know,

00:08:04.660 --> 00:08:06.579
it needs specific prompts. We're going to take

00:08:06.579 --> 00:08:09.139
a brief pause here. This deep dive is supported

00:08:09.139 --> 00:08:12.959
by the AI Mastery AZ course. Are you ready to level

00:08:12.959 --> 00:08:16.019
up your AI skills? Join the AI Mastery community

00:08:16.019 --> 00:08:18.949
to unlock exclusive tutorials. You can master

00:08:18.949 --> 00:08:22.730
tools like LTX Desktop and more. Learn from experts

00:08:22.730 --> 00:08:24.430
and connect with thousands of professionals.

00:08:24.709 --> 00:08:28.389
Start your AI journey today with AI Mastery AZ.

00:08:28.769 --> 00:08:31.930
All right, we are back. So an AI can bridge shots

00:08:31.930 --> 00:08:34.990
natively on a timeline. It can erase visual mistakes

00:08:34.990 --> 00:08:37.690
with a single mouse click. What actually happens

00:08:37.690 --> 00:08:40.169
to the human behind the computer keyboard? This

00:08:40.169 --> 00:08:42.289
creates serious anxiety in many creator communities

00:08:42.289 --> 00:08:45.370
today. People naturally fear that AI will completely

00:08:45.370 --> 00:08:47.850
automate their jobs, but we really need to look

00:08:47.850 --> 00:08:51.009
at structural limitations here. Models like Kling

00:08:51.009 --> 00:08:53.830
and Veo generate incredibly short clips. They

00:08:53.830 --> 00:08:56.590
usually max out at 5 to 15 seconds total. Right,

00:08:56.629 --> 00:08:58.830
and they completely fail at multi-shot storytelling

00:08:58.830 --> 00:09:01.309
natively. AI lacks an inherent understanding

00:09:01.309 --> 00:09:03.769
of natural visual rhythm. It doesn't understand

00:09:03.769 --> 00:09:05.970
emotional pacing or deeper narrative structure.

00:09:06.309 --> 00:09:08.929
It generates visually impressive... but completely

00:09:08.929 --> 00:09:12.929
isolated standalone clips. Exactly. Imagine chaining

00:09:12.929 --> 00:09:14.750
these short clips together automatically without

00:09:14.750 --> 00:09:17.509
humans. The system simply doesn't understand

00:09:17.509 --> 00:09:20.929
the rhythm of cinematic cuts. It misses the underlying

00:09:20.929 --> 00:09:24.269
emotional flow of the broader scene. The human

00:09:24.269 --> 00:09:26.570
editor still completely controls the sequence

00:09:26.570 --> 00:09:29.990
vision. The AI merely refines and generates the

00:09:29.990 --> 00:09:33.389
raw video material. Even if an AI could perfectly

00:09:33.389 --> 00:09:37.539
chain scenes together seamlessly, it wouldn't

00:09:37.539 --> 00:09:39.500
understand the emotional heartbeat of a complex

00:09:39.500 --> 00:09:42.460
scene. It just doesn't know when a quiet moment

00:09:42.460 --> 00:09:44.830
should linger. Right. And that lingering moment

00:09:44.830 --> 00:09:47.889
requires pure human empathy. Editing is fundamentally

00:09:47.889 --> 00:09:50.549
about feeling the specific emotional weight.

00:09:50.710 --> 00:09:53.129
The software just provides much better tools

00:09:53.129 --> 00:09:55.289
for human editors. Does this mean the editor

00:09:55.289 --> 00:09:57.970
shifts from technician to director? Yes. Editors

00:09:57.970 --> 00:10:00.909
will spend less time managing tedious rendering

00:10:00.909 --> 00:10:03.830
files. They'll basically become high-level curators

00:10:03.830 --> 00:10:06.649
of these generated visual moments. The job becomes

00:10:06.649 --> 00:10:08.690
much more about high-level creative direction.

00:10:09.029 --> 00:10:11.169
Human editors are becoming creative curators

00:10:11.169 --> 00:10:13.820
of generated moments. Let's step back and synthesize the

00:10:13.820 --> 00:10:17.919
main takeaway today. LTX Desktop version 1 definitely

00:10:17.919 --> 00:10:20.799
has its annoying software bugs. The interface

00:10:20.799 --> 00:10:23.480
scroll glitches and hardware gates are quite

00:10:23.480 --> 00:10:26.559
frustrating. But the core structural concept

00:10:26.559 --> 00:10:29.480
of timeline continuity is revolutionary. They

00:10:29.480 --> 00:10:31.879
introduced an amazing underlying concept called

00:10:31.879 --> 00:10:35.259
thinking tokens natively. These tokens actively

00:10:35.259 --> 00:10:37.840
look at the entire sequence you've built. They

00:10:37.840 --> 00:10:40.240
maintain character and lighting consistency across

00:10:40.240 --> 00:10:43.220
multiple different cuts. This fundamentally changes

00:10:43.220 --> 00:10:46.039
how we approach multi-shot storytelling entirely.

00:10:46.419 --> 00:10:48.139
You should definitely experiment with this powerful

00:10:48.139 --> 00:10:51.019
software yourself. It's completely free and totally

00:10:51.019 --> 00:10:53.580
open source to download today. You literally

00:10:53.580 --> 00:10:56.000
have absolutely nothing to lose by testing it.

00:10:56.139 --> 00:10:58.639
If an open source community bypasses massive

00:10:58.639 --> 00:11:01.480
hardware gates weekly, will massive corporations

00:11:01.480 --> 00:11:04.000
actually dictate the future of creative software?

00:11:04.139 --> 00:11:06.519
Or will anonymous developers tinkering in free

00:11:06.519 --> 00:11:09.919
time lead us? You jump between five different

00:11:09.919 --> 00:11:12.639
browser tabs today to edit. Tomorrow, you might

00:11:12.639 --> 00:11:14.980
direct everything from a single unified timeline.

00:11:15.259 --> 00:11:18.039
Keep exploring, keep creating, and keep questioning

00:11:18.039 --> 00:11:18.940
the tools you use.
