WEBVTT

00:00:00.270 --> 00:00:04.089
The headlines, they promise Hollywood-level AI

00:00:04.089 --> 00:00:07.230
movies. You know, just from a few words, you

00:00:07.230 --> 00:00:09.109
see these incredible clips online and you think,

00:00:09.230 --> 00:00:12.630
wow, the era of limitless creation is here. But

00:00:12.630 --> 00:00:16.390
the reality is, making a full story, where the

00:00:16.390 --> 00:00:19.370
main character doesn't change their face or their

00:00:19.370 --> 00:00:21.469
voice every eight seconds, that's actually the

00:00:21.469 --> 00:00:23.010
number one technical challenge right now. Yeah,

00:00:23.030 --> 00:00:25.649
it really is. So today, we're diving into the

00:00:25.649 --> 00:00:28.339
systematic process that actually, well, solves

00:00:28.339 --> 00:00:30.640
that consistency problem. Welcome to the deep

00:00:30.640 --> 00:00:33.460
dive. We've got a, I think, really necessary,

00:00:33.679 --> 00:00:36.560
repeatable four-step method. It basically turns

00:00:36.560 --> 00:00:40.119
those chaotic one-off AI video clips into something

00:00:40.119 --> 00:00:42.759
continuous, a coherent story. Right. And this

00:00:42.759 --> 00:00:45.359
is pretty critical for anyone hoping to graduate

00:00:45.359 --> 00:00:48.640
from just generating cool, isolated shots to

00:00:48.640 --> 00:00:50.799
actually crafting a real narrative. That's the

00:00:50.799 --> 00:00:52.659
mission for today. We'll show you exactly how

00:00:52.659 --> 00:00:54.619
to set up your character's, let's call it, DNA,

00:00:55.079 --> 00:00:58.380
using just a single static image, then how to

00:00:58.380 --> 00:01:00.619
lock that visual identity into every scene afterwards.

00:01:01.020 --> 00:01:03.719
And crucially, that final step: fixing the audio

00:01:03.719 --> 00:01:06.099
inconsistencies, because those just instantly

00:01:06.099 --> 00:01:08.200
ruin the immersion, don't they? They absolutely

00:01:08.200 --> 00:01:10.799
do. So this is your operational blueprint, really,

00:01:10.920 --> 00:01:14.480
your shortcut to getting reliable AI storytelling.

00:01:14.760 --> 00:01:16.079
We're going to cut through the hype and give

00:01:16.079 --> 00:01:18.900
you the actual workflow reality. OK, let's unpack

00:01:18.900 --> 00:01:21.040
this core issue then. Yeah. We kind of assume

00:01:21.040 --> 00:01:24.280
AI remembers things, right? If you use an LLM

00:01:24.280 --> 00:01:27.439
for a story, maybe a specialized ChatGPT, the

00:01:27.439 --> 00:01:30.340
characters stick. The model holds onto that context.

00:01:30.879 --> 00:01:33.280
But why is it that when we shift to AI video,

00:01:33.480 --> 00:01:37.439
that memory just seems to vanish completely?

00:01:37.620 --> 00:01:40.040
Yeah, that's the blank slate problem. And it's

00:01:40.040 --> 00:01:42.040
fascinating, really, because the underlying tech

00:01:42.040 --> 00:01:44.120
is just fundamentally different. LLMs, they work

00:01:44.120 --> 00:01:46.579
with tokens, text, they have a context window,

00:01:46.959 --> 00:01:49.510
a kind of working memory. OK. But when you generate

00:01:49.510 --> 00:01:51.829
video, usually with a diffusion model, you're

00:01:51.829 --> 00:01:53.709
often starting from pure noise. You're generating

00:01:53.709 --> 00:01:56.269
pixels, not words. Starting from noise. Right,

00:01:56.290 --> 00:01:58.370
exactly. So every single time you ask the tool,

00:01:58.510 --> 00:02:01.069
make me a clip, it genuinely starts fresh. Even

00:02:01.069 --> 00:02:03.930
if you feed it the exact same words, the random

00:02:03.930 --> 00:02:06.109
noise seed is different. The output's slightly

00:02:06.109 --> 00:02:08.479
different. And poof, continuity is gone. Gone

00:02:08.479 --> 00:02:11.180
immediately. The model just doesn't have that

00:02:11.180 --> 00:02:13.860
built-in persistent memory for a character's

00:02:13.860 --> 00:02:16.280
visual look across generations. So it's not that

00:02:16.280 --> 00:02:18.919
the system is actively trying to forget our character.

00:02:19.639 --> 00:02:23.419
It just, well, it lacks the architecture to persistently

00:02:23.419 --> 00:02:25.939
remember visual details when starting a new task.

00:02:26.199 --> 00:02:28.120
Precisely. You know, you'll get that perfect

00:02:28.120 --> 00:02:31.060
eight-second clip. Your brave knight in shining

00:02:31.060 --> 00:02:33.300
armor looks great. Then you generate the next

00:02:33.300 --> 00:02:35.599
scene, maybe the knight walking into a castle.

00:02:36.020 --> 00:02:38.699
And suddenly the armor shifts from silver to,

00:02:38.699 --> 00:02:40.979
I don't know, bronze. His face looks five years

00:02:40.979 --> 00:02:44.870
older maybe. And the voice? Totally different.

00:02:44.870 --> 00:02:47.449
It completely breaks the suspension of disbelief.

00:02:47.449 --> 00:02:49.569
As a viewer, you immediately feel like, okay, this

00:02:49.569 --> 00:02:51.770
is kind of amateur. It does. And you know, I still

00:02:51.770 --> 00:02:53.909
wrestle with prompt drift myself sometimes if

00:02:53.909 --> 00:02:55.969
I don't use a really strict external reference.

00:02:55.969 --> 00:02:58.409
It's a very, very common frustration, even for

00:02:58.409 --> 00:03:01.409
people doing this a lot. So if the tools themselves

00:03:01.409 --> 00:03:04.669
lack that core internal memory, what's the fundamental

00:03:04.669 --> 00:03:07.289
idea we need to use instead? How do we externalize

00:03:07.289 --> 00:03:09.889
that character identity? Well, we have to externalize

00:03:09.889 --> 00:03:13.039
the memory. And we do that using a single, consistent

00:03:13.039 --> 00:03:16.979
visual reference image. OK, step one. It sounds

00:03:16.979 --> 00:03:19.020
a bit counterintuitive for making video, doesn't

00:03:19.020 --> 00:03:22.280
it? We start by creating a single still picture.

00:03:23.090 --> 00:03:26.750
Why is this character image the DNA for the whole

00:03:26.750 --> 00:03:29.569
project? That picture becomes the immutable reference

00:03:29.569 --> 00:03:32.330
point. It's what you feed back to the AI again

00:03:32.330 --> 00:03:35.069
and again to basically force consistency. Okay.

00:03:35.229 --> 00:03:38.449
The key here is extreme, almost surgical detail

00:03:38.449 --> 00:03:41.169
in that first prompt. You need to define every

00:03:41.169 --> 00:03:43.150
little pixel of that character you want. So you

00:03:43.150 --> 00:03:44.969
can't just say a robot. You've got to blueprint

00:03:44.969 --> 00:03:47.189
the character meticulously. Absolutely. Think

00:03:47.189 --> 00:03:49.110
like an engineer designing it. Like the example

00:03:49.110 --> 00:03:52.449
prompt: a friendly, futuristic robot librarian,

00:03:52.939 --> 00:03:55.879
smooth, white metallic body, glowing blue lines,

00:03:56.300 --> 00:03:58.520
simple dome head, large digital visor showing

00:03:58.520 --> 00:04:01.500
friendly animated eyes, wearing a smart gray

00:04:01.500 --> 00:04:04.479
vest. That level of detail. That's the specificity

00:04:04.479 --> 00:04:05.919
you need. And here's an important little technique

00:04:05.919 --> 00:04:08.360
at the start. Turn off any style consistency

00:04:08.360 --> 00:04:10.259
settings when you're creating this very first

00:04:10.259 --> 00:04:12.900
image. Ah, okay. Why's that? You want the AI

00:04:12.900 --> 00:04:15.840
to give you maximum creativity initially. Let

00:04:15.840 --> 00:04:18.740
it explore a bit. Then you review the options

00:04:18.740 --> 00:04:21.519
and pick the single best, usually the full-frontal

00:04:21.519 --> 00:04:25.540
view image that nails your vision. OK, that makes

00:04:25.540 --> 00:04:28.120
sense. But let's say I love the white robot librarian

00:04:28.120 --> 00:04:30.800
I got. But maybe halfway through making my video,

00:04:30.939 --> 00:04:34.399
I think, hmm, those blue lines, maybe they should

00:04:34.399 --> 00:04:37.560
be a warm orange instead. If I just change the

00:04:37.560 --> 00:04:40.259
prompt text, won't the whole robot image drift?

00:04:40.860 --> 00:04:42.819
Well, that depends on the settings you use next.

00:04:42.939 --> 00:04:45.279
If you just edit the text prompt alone, yes,

00:04:45.519 --> 00:04:47.379
you absolutely risk drift. Things will change

00:04:47.379 --> 00:04:50.000
subtly, all over. Right. This is where the tools

00:04:50.000 --> 00:04:52.120
offer features usually called something like

00:04:52.120 --> 00:04:54.259
precise reference or maybe structure reference.

00:04:54.259 --> 00:04:55.939
You have to use one of those. OK, walk us through

00:04:55.939 --> 00:04:58.100
the difference there. Precise versus structure

00:04:58.100 --> 00:05:00.500
reference. Sure. So if you choose precise reference,

00:05:00.579 --> 00:05:02.959
you're telling the AI to lock pretty much everything.

00:05:03.040 --> 00:05:06.540
The texture, the color, the fine details. Then

00:05:06.540 --> 00:05:09.420
if you prompt for orange lines, the AI really

00:05:09.420 --> 00:05:12.500
tries hard to change only the lines while keeping

00:05:12.500 --> 00:05:14.759
everything else identical. Now, if you choose

00:05:14.759 --> 00:05:17.019
structure reference, you're locking more the

00:05:17.019 --> 00:05:20.360
skeleton, or the pose, the overall shape. So

00:05:20.360 --> 00:05:22.439
with structure lock, you could maybe change the

00:05:22.439 --> 00:05:25.040
material from metal to wood, but the robot would

00:05:25.040 --> 00:05:28.300
keep its exact shape and stance. I see. For character

00:05:28.300 --> 00:05:30.459
consistency, like keeping the robot looking like

00:05:30.459 --> 00:05:32.759
the same robot, we generally stick with the precise

00:05:32.759 --> 00:05:36.019
option. So that precise lock lets you make small,

00:05:36.399 --> 00:05:39.120
controlled changes, like the line color, without

00:05:39.120 --> 00:05:41.480
the whole character morphing into something else.

00:05:41.540 --> 00:05:44.699
That's the nuance, exactly. And that brings us

00:05:44.699 --> 00:05:48.040
neatly to step two: the starting frames. We take

00:05:48.040 --> 00:05:50.439
that perfect reference image we made. The DNA

00:05:50.439 --> 00:05:53.300
image. The DNA image, right. We upload it and

00:05:53.300 --> 00:05:55.180
critically, we make sure that precise reference

00:05:55.180 --> 00:05:57.980
feature stays on. Now we're setting the stage

00:05:57.980 --> 00:06:00.720
for each scene. So we're creating the static

00:06:00.720 --> 00:06:03.100
starting point for each video clip. We place

00:06:03.100 --> 00:06:05.100
our locked character into a new background, a

00:06:05.100 --> 00:06:07.139
new environment. Let's use the contrast examples.

00:06:07.420 --> 00:06:09.939
Scene one. The robot is pointing to a book for

00:06:09.939 --> 00:06:13.839
a young student. And then scene two. Same robot,

00:06:14.199 --> 00:06:16.040
but now it's leaning forward, maybe listening

00:06:16.040 --> 00:06:19.240
carefully to an elderly man sitting in a comfy

00:06:19.240 --> 00:06:22.740
armchair. And the key is the robot's face, its

00:06:22.740 --> 00:06:26.500
body. Those color lines, they look visually identical

00:06:26.500 --> 00:06:29.160
across both of those static frames. It's only

00:06:29.160 --> 00:06:31.259
the background and the other people that change.

00:06:31.560 --> 00:06:34.360
The reference image is really doing all the heavy

00:06:34.360 --> 00:06:36.879
lifting for consistency there. So when moving

00:06:36.879 --> 00:06:39.259
from step one, creating that reference, to step

00:06:39.259 --> 00:06:42.100
two, making these starting frames, what's the

00:06:42.100 --> 00:06:43.899
immediate problem if someone forgets to turn

00:06:43.899 --> 00:06:46.279
that precise reference or locking feature on?

00:06:46.459 --> 00:06:48.839
Well, the character will immediately start to

00:06:48.839 --> 00:06:51.139
drift in the new scene. It might pick up lighting

00:06:51.139 --> 00:06:52.879
cues or color tones from the new background,

00:06:53.040 --> 00:06:54.699
and bam, you're right back to square one with

00:06:54.699 --> 00:06:57.240
inconsistency. Okay. Step three is where things

00:06:57.240 --> 00:06:59.660
actually start moving. We take that static frame

00:06:59.660 --> 00:07:01.420
from step two, the one with the locked character

00:07:01.420 --> 00:07:03.779
in the scene, and we use an image-to-video

00:07:03.779 --> 00:07:05.899
tool to bring it to life. Yeah, this is where

00:07:05.899 --> 00:07:08.740
the magic happens. But it's fragile magic, you

00:07:08.740 --> 00:07:11.459
know? We're telling the AI what should move,

00:07:11.540 --> 00:07:14.019
and just as importantly, how it should move.

00:07:14.019 --> 00:07:16.550
Right. This requires really precise... instructions

00:07:16.550 --> 00:07:19.310
about the motion, the choreography. It's not

00:07:19.310 --> 00:07:21.470
just describing the background anymore. So we

00:07:21.470 --> 00:07:24.350
need another super detailed prompt, but this

00:07:24.350 --> 00:07:27.389
time focusing on the action. Just saying the

00:07:27.389 --> 00:07:29.629
robot points its finger up at the book. Yeah.

00:07:29.949 --> 00:07:31.990
That's not gonna cut it anymore, is it? Absolutely

00:07:31.990 --> 00:07:34.389
not. We have to be directive, tell it the pace,

00:07:34.550 --> 00:07:36.769
the scope of the movement. So for that first

00:07:36.769 --> 00:07:39.610
clip, the robot and the student, the full prompt

00:07:39.610 --> 00:07:42.310
needs to be something like this. Okay. The robot

00:07:42.310 --> 00:07:45.110
librarian points its finger up at the book. Its

00:07:45.110 --> 00:07:47.350
blue visor blinks slowly. The young girl looks

00:07:47.350 --> 00:07:50.490
up. Then maybe add camera direction. The camera

00:07:50.490 --> 00:07:53.629
slowly pushes in towards the girl's face. Both

00:07:53.629 --> 00:07:56.029
characters are relatively still. The movement

00:07:56.029 --> 00:07:58.730
is very gentle. Make the scene eight seconds

00:07:58.730 --> 00:08:02.170
long. Whoa. Okay, imagine having a system that

00:08:02.170 --> 00:08:05.350
can reliably take those super precise, gentle

00:08:05.350 --> 00:08:08.069
motion instructions, turn them into a cohesive

00:08:08.069 --> 00:08:10.879
eight-second clip, and then do that consistently

00:08:10.879 --> 00:08:13.899
across, say, 100 scenes. That's the real power

00:08:13.899 --> 00:08:15.839
we're trying to unlock here. It really is. It's

00:08:15.839 --> 00:08:18.180
all about control. And an advanced tip here,

00:08:18.399 --> 00:08:20.420
technically, sometimes you need to use what's

00:08:20.420 --> 00:08:23.000
called prompt weighting syntax, things like using

00:08:23.000 --> 00:08:24.879
parentheses or special keywords. To tell the

00:08:24.879 --> 00:08:27.079
model what's most important. Exactly. To tell

00:08:27.079 --> 00:08:30.220
it, OK, focus maybe 90% of your effort on the

00:08:30.220 --> 00:08:33.419
slow camera push and only 10% on the robot's

00:08:33.419 --> 00:08:35.750
little secondary action, like the blinking. That

00:08:35.750 --> 00:08:37.769
helps systematize the prompt engineering, doesn't

00:08:37.769 --> 00:08:40.149
it? Makes scaling it up feel more achievable,

00:08:40.450 --> 00:08:42.929
which leads to the idea of maybe using a prompt

00:08:42.929 --> 00:08:46.129
helper, like creating a custom AI assistant,

00:08:46.490 --> 00:08:48.669
maybe a specialized Gemini or something, whose

00:08:48.669 --> 00:08:51.129
only job is to translate a simple idea like,

00:08:51.169 --> 00:08:54.549
robot shows the girl a book, into that really

00:08:54.549 --> 00:08:57.370
technical, weighted, super detailed video prompt.

00:08:57.549 --> 00:08:59.549
That's absolutely the way forward for efficiency.

00:08:59.750 --> 00:09:01.850
You systematize the inputs to try and stabilize

00:09:01.850 --> 00:09:04.820
the outputs. So if the goal here is both consistency

00:09:04.820 --> 00:09:07.419
and quality in these video clips, what's the

00:09:07.419 --> 00:09:10.000
one cardinal rule a creator should never break

00:09:10.000 --> 00:09:12.440
when actually generating the video in step three?

00:09:12.600 --> 00:09:15.600
Oh, simple. Never generate only one video output.

00:09:16.059 --> 00:09:18.820
You absolutely must create multiples, I'd say

00:09:18.820 --> 00:09:20.519
at least three to five versions of each clip, so

00:09:20.559 --> 00:09:23.220
you can pick the best one. You need quality control,

00:09:23.220 --> 00:09:25.799
looking for glitches, weird flickers, distortions.

00:09:25.799 --> 00:09:28.740
Right, always generate options. Okay, so we've

00:09:28.740 --> 00:09:30.360
worked through visual consistency with the first

00:09:30.360 --> 00:09:33.539
three steps: reference image, starting frame, motion

00:09:33.539 --> 00:09:37.179
generation. But the second huge technical trap

00:09:37.179 --> 00:09:41.059
is audio. Mm-hmm, big one. Our robot now looks

00:09:41.059 --> 00:09:44.000
identical in every scene, which is great, but

00:09:44.000 --> 00:09:46.539
when it speaks, it might sound like a completely

00:09:46.539 --> 00:09:49.279
different actor in every single clip. How do

00:09:49.279 --> 00:09:51.799
we fix this immersion breaking problem? Right.

00:09:51.899 --> 00:09:54.159
We have to isolate the audio and treat it completely

00:09:54.159 --> 00:09:56.299
separately. That's step four. You pretty much

00:09:56.299 --> 00:09:59.519
have to use an external voice cloning or a text-to-speech

00:09:59.519 --> 00:10:01.480
service, something like ElevenLabs,

00:10:01.519 --> 00:10:04.299
for example. OK. The goal is to create a single,

00:10:04.620 --> 00:10:06.639
consistent voice for your character. So we pick

00:10:06.639 --> 00:10:09.100
one definitive voice, let's call it Rachel, for

00:10:09.100 --> 00:10:11.399
our robot. And that specific voice profile will

00:10:11.399 --> 00:10:13.659
be used for our robot across every single scene

00:10:13.659 --> 00:10:15.539
it speaks in. It doesn't matter what prompt we

00:10:15.539 --> 00:10:17.679
use for the video itself. Correct. So the workflow

00:10:17.679 --> 00:10:19.500
usually starts with scripting the dialogue first.

00:10:19.600 --> 00:10:21.940
Then you take your final video clip, upload it

00:10:21.940 --> 00:10:23.860
to the voice tool, choose your specific voice

00:10:23.860 --> 00:10:26.559
profile, our Rachel, and generate that new clean

00:10:26.559 --> 00:10:30.039
consistent audio track just for the dialogue. And

00:10:30.039 --> 00:10:34.320
this process, it has to be repeated meticulously

00:10:34.320 --> 00:10:37.159
for every single scene where the robot speaks,

00:10:37.360 --> 00:10:39.500
right? Using the exact same voice model, the

00:10:39.500 --> 00:10:42.559
same settings every time. Exactly. It's repetitive,

00:10:42.559 --> 00:10:44.740
yeah, but that's the only way to guarantee audio

00:10:44.740 --> 00:10:47.080
consistency for that character. Now, you mentioned

00:10:47.080 --> 00:10:49.580
this is the difficult part, the final edit. So

00:10:49.580 --> 00:10:52.299
we have our video clips with consistent visuals,

00:10:52.919 --> 00:10:56.559
and we have our new consistent Rachel audio tracks

00:10:56.559 --> 00:10:59.700
for the robot's lines. But the original video

00:10:59.700 --> 00:11:01.570
clips, they have audio too, right? Well, they

00:11:01.570 --> 00:11:04.590
do, and it's usually unusable because it's inconsistent

00:11:04.590 --> 00:11:07.509
or just noise. So in your video editor, the first

00:11:07.509 --> 00:11:09.629
thing you do is mute the original audio track

00:11:09.629 --> 00:11:11.389
from the video clip completely. Just silence

00:11:11.389 --> 00:11:14.559
it. Okay, easy enough, but wait. If our voice

00:11:14.559 --> 00:11:17.919
tool, like ElevenLabs, processes the dialogue, doesn't

00:11:17.919 --> 00:11:19.840
it usually process all the dialogue in the clip?

00:11:19.940 --> 00:11:21.940
Yeah. What about the voice of the young girl

00:11:21.940 --> 00:11:24.139
or the elderly man? Wouldn't they also sound

00:11:24.139 --> 00:11:26.919
like Rachel? Ah, yes. That's the flaw you typically

00:11:26.919 --> 00:11:29.419
have to fix manually. Often, the voice tool will

00:11:29.419 --> 00:11:31.299
change all the voices in the segment to your

00:11:31.299 --> 00:11:33.679
chosen one, the robot voice in this case. So

00:11:33.679 --> 00:11:35.559
everything sounds like the robot. Potentially,

00:11:35.799 --> 00:11:38.320
yes. So you have to do some careful audio surgery

00:11:38.320 --> 00:11:41.019
in your editor. You need to identify precisely

00:11:41.019 --> 00:11:43.899
where the robot stops speaking and say, the girl

00:11:43.899 --> 00:11:46.940
starts. You keep the robot's original track muted.

00:11:47.460 --> 00:11:50.279
You place the new clean Rachel audio track for

00:11:50.279 --> 00:11:53.960
the robot's lines. But then you have to... painstakingly

00:11:53.960 --> 00:11:55.700
cut out the parts of that new Rachel track where

00:11:55.700 --> 00:11:57.559
the girl was supposed to be speaking. And let

00:11:57.559 --> 00:11:59.980
her original audio come through. Or replace it,

00:12:00.000 --> 00:12:01.940
too. Exactly. You either let her original generated

00:12:01.940 --> 00:12:04.720
voice come through from the muted track by unmuting

00:12:04.720 --> 00:12:07.759
just those sections, or ideally, you generate

00:12:07.759 --> 00:12:09.600
another consistent voice for the girl using the

00:12:09.600 --> 00:12:12.759
same process and layer that in. Wow, OK. That

00:12:12.759 --> 00:12:15.299
is intricate. That's like frame accurate audio

00:12:15.299 --> 00:12:18.090
editing. Step four isn't just clicking generate

00:12:18.090 --> 00:12:21.450
audio. It's manually solving a multi-voice synchronization

00:12:21.450 --> 00:12:24.070
and replacement puzzle in a final video editor.

00:12:24.269 --> 00:12:26.350
It really is. It's the necessary final polish.

00:12:26.830 --> 00:12:29.070
And then adding some subtle background sound,

00:12:29.330 --> 00:12:32.250
like a constant library hum or room tone throughout

00:12:32.250 --> 00:12:35.990
the entire scene that helps mask any tiny imperfections

00:12:35.990 --> 00:12:38.429
from the cuts and really completes the illusion

00:12:38.429 --> 00:12:42.350
of quality. So if a creator just skips this whole

00:12:42.350 --> 00:12:47.970
audio workflow, ignores step four, what's the

00:12:47.970 --> 00:12:50.250
immediate flaw that every single viewer is going

00:12:50.250 --> 00:12:52.230
to notice, even if the visuals are absolutely

00:12:52.230 --> 00:12:54.789
perfect? Oh, it's instant. Inconsistent character

00:12:54.789 --> 00:12:57.210
voices just make the whole thing feel amateurish

00:12:57.210 --> 00:12:59.809
or kind of cheap, cobbled together. It completely

00:12:59.809 --> 00:13:02.370
shatters the believability. OK, let's think about

00:13:02.370 --> 00:13:05.220
scaling this up. What if we have multiple recurring

00:13:05.220 --> 00:13:08.639
characters, like our robot librarian and say

00:13:08.639 --> 00:13:11.000
a little floating drone sidekick that follows

00:13:11.000 --> 00:13:13.779
it around? How does this four -step method handle

00:13:13.779 --> 00:13:16.440
that added complexity? Well, the process is fundamentally

00:13:16.440 --> 00:13:18.840
the same, but you basically duplicate the complexity

00:13:18.840 --> 00:13:21.440
and maybe even square it. You generate a separate

00:13:21.440 --> 00:13:23.639
reference image for each character. Robot A gets

00:13:23.639 --> 00:13:26.820
an image. Drone B gets its own image. Then in

00:13:26.820 --> 00:13:28.740
step two, setting up the scene frames, you'd

00:13:28.740 --> 00:13:31.259
ideally upload both reference images. But now

00:13:31.259 --> 00:13:34.220
you run into the challenge of regional prompting.

00:13:34.340 --> 00:13:36.840
Regional prompting. You mean telling the AI which

00:13:36.840 --> 00:13:38.840
reference image applies to which part of the

00:13:38.840 --> 00:13:41.120
scene, like this blob is the robot, that blob

00:13:41.120 --> 00:13:44.019
is the drone? Exactly that. Because without more

00:13:44.019 --> 00:13:46.480
advanced tools, maybe like masking, if

00:13:46.480 --> 00:13:48.620
you just upload two reference images in a scene

00:13:48.620 --> 00:13:51.940
prompt, they often conflict or kind of blend

00:13:51.940 --> 00:13:54.620
together into mush. You need to use advanced

00:13:54.620 --> 00:13:56.820
prompt syntax or specific features if the tool

00:13:56.820 --> 00:14:01.139
supports it to say, okay, apply robot A's reference

00:14:01.139 --> 00:14:04.639
to the character in the center. Apply Drone B's

00:14:04.639 --> 00:14:06.500
reference only to the object in the top right

00:14:06.500 --> 00:14:08.679
of the frame. That sounds like a significant

00:14:08.679 --> 00:14:11.399
technical hurdle that wasn't obvious in this

00:14:11.399 --> 00:14:13.639
simple one -character workflow. It definitely

00:14:13.639 --> 00:14:15.679
is. It adds a layer of complexity. And then,

00:14:15.679 --> 00:14:18.279
of course, in step four, the audio, you need

00:14:18.279 --> 00:14:21.399
to select two distinct, consistent voices. Maybe

00:14:21.399 --> 00:14:24.019
Nova for the robot and a higher-pitched, whirring

00:14:24.019 --> 00:14:26.809
Sparky for the drone. And then in your final

00:14:26.809 --> 00:14:28.870
edit, you're managing potentially three separate

00:14:28.870 --> 00:14:31.409
audio tracks that need careful cutting and syncing.

00:14:31.549 --> 00:14:34.029
The robot track, the drone track, and any human

00:14:34.029 --> 00:14:36.570
character tracks. Okay, that paints a clearer

00:14:36.570 --> 00:14:39.850
picture of the real work involved. Now we hear

00:14:39.850 --> 00:14:42.230
constant news about next -generation tools. You

00:14:42.230 --> 00:14:44.570
know, models like Sora, they talk about built-in

00:14:44.570 --> 00:14:47.909
features like cameo or continuity. Does this

00:14:47.909 --> 00:14:50.690
whole four-step multi-tool method we've outlined,

00:14:51.190 --> 00:14:53.570
does it become obsolete once those tools launch

00:14:53.570 --> 00:14:56.200
widely? You know, I really don't think so. Not

00:14:56.200 --> 00:14:58.629
completely anyway. That cameo feature, from what

00:14:58.629 --> 00:15:01.090
we understand, seems focused on tracking real

00:15:01.090 --> 00:15:03.909
people across clips to keep them visually consistent.

00:15:04.350 --> 00:15:06.570
That doesn't directly help you create and maintain

00:15:06.570 --> 00:15:09.350
a fictional character, like our specific robot

00:15:09.350 --> 00:15:11.850
librarian design. OK, that makes sense. And what

00:15:11.850 --> 00:15:14.149
about the promised recut or continuity features?

00:15:14.350 --> 00:15:16.549
Those sound like they'll help smooth out transitions

00:15:16.549 --> 00:15:19.549
between generated scenes, maybe ensure physics

00:15:19.549 --> 00:15:21.750
remain stable, things like that. It's more about

00:15:21.750 --> 00:15:24.429
the quality control of the video generation itself.

00:15:24.610 --> 00:15:27.019
It still doesn't seem to solve the fundamental

00:15:27.019 --> 00:15:29.080
problem of creating the consistent character

00:15:29.080 --> 00:15:32.679
look in the first place, nor does it solve the

00:15:32.679 --> 00:15:35.500
absolutely critical issue of multi-character

00:15:35.500 --> 00:15:38.500
audio consistency. So I think this four-step

00:15:38.500 --> 00:15:41.399
method or something very like it remains the

00:15:41.399 --> 00:15:43.340
essential backbone of the production strategy

00:15:43.340 --> 00:15:45.840
for the foreseeable future. So the realistic

00:15:45.840 --> 00:15:49.399
toolkit picture emerges. To make even that short,

00:15:49.399 --> 00:15:53.870
consistent video of the robot librarian, we potentially

00:15:53.870 --> 00:15:56.570
ended up using, what, up to six different specialized

00:15:56.570 --> 00:15:58.129
tools? Yeah, let's count them. You've got your

00:15:58.129 --> 00:16:00.269
image creation tool for step one, your scene

00:16:00.269 --> 00:16:02.690
framing tool or feature for step two, using the

00:16:02.690 --> 00:16:04.909
reference, maybe an optional AI prompt helper,

00:16:05.710 --> 00:16:07.970
then the core image-to-video tool for step three,

00:16:08.370 --> 00:16:10.549
the external voice tool for step four's audio

00:16:10.549 --> 00:16:13.549
generation. And finally, a complex video editor

00:16:13.549 --> 00:16:16.049
for step four's audio fixing and final assembly.

00:16:16.289 --> 00:16:18.049
It's really a pipeline, isn't it? Not a single

00:16:18.049 --> 00:16:20.110
magic app. It's absolutely a workflow, a pipeline.

00:16:20.429 --> 00:16:22.690
So looking at this complex multi-tool approach,

00:16:23.389 --> 00:16:25.169
what do you think is the biggest trap related

00:16:25.169 --> 00:16:27.450
to just strategic planning that creators need

00:16:27.450 --> 00:16:30.289
to consciously avoid? Hmm. I'd say rushing those

00:16:30.289 --> 00:16:32.850
crucial early steps. If you rush the character

00:16:32.850 --> 00:16:35.090
design in step one, if you settle for a mediocre

00:16:35.090 --> 00:16:38.350
reference image, all the complexity and effort

00:16:38.350 --> 00:16:41.090
you put into the next five tools, it just ends

00:16:41.090 --> 00:16:44.750
up amplifying that initial flaw. Bad character

00:16:44.750 --> 00:16:47.330
design inevitably leads to a bad final video,

00:16:47.629 --> 00:16:50.409
no matter how fancy the tools are later. That's

00:16:50.409 --> 00:16:52.889
spot on. So the core takeaway for you listening

00:16:52.889 --> 00:16:56.159
is pretty clear. Consistent AI video isn't achieved

00:16:56.159 --> 00:16:59.200
by some single magic button right now. It's about

00:16:59.200 --> 00:17:01.919
intelligently combining several specialized tools

00:17:01.919 --> 00:17:04.700
into this reliable four-step process. Right.

00:17:04.960 --> 00:17:07.200
You start with that solid reference image, you

00:17:07.200 --> 00:17:09.220
move to careful starting frames for each scene,

00:17:09.839 --> 00:17:11.920
you ensure quality movement generation with detailed

00:17:11.920 --> 00:17:14.220
prompts, and then critically you tackle the audio

00:17:14.220 --> 00:17:16.799
correction manually and meticulously. The consistency

00:17:16.799 --> 00:17:19.680
of your characters, visually and audibly, is

00:17:19.680 --> 00:17:22.259
now entirely in your hands, really, through disciplined

00:17:22.259 --> 00:17:24.500
execution of this kind of process. And this applies

00:17:24.500 --> 00:17:26.500
everywhere, doesn't it? Consistent marketing

00:17:26.500 --> 00:17:29.259
mascots, characters in explainer videos, virtual

00:17:29.259 --> 00:17:31.619
presenters for education. Any project needing

00:17:31.619 --> 00:17:34.619
consistency. Planning that reference image properly

00:17:34.619 --> 00:17:37.319
up front, that's the key to unlocking success

00:17:37.319 --> 00:17:39.299
for all those projects. But what's the best way

00:17:39.299 --> 00:17:42.559
to actually learn this? Honestly, just start

00:17:42.559 --> 00:17:45.309
small. Grab the tools you have access to, pick

00:17:45.309 --> 00:17:48.130
a really simple character idea, and just dedicate

00:17:48.130 --> 00:17:50.269
some time to practicing each of these four steps.

00:17:50.369 --> 00:17:52.769
Get a feel for it today. OK, so we've kind of

00:17:52.769 --> 00:17:54.849
solved the technical challenge of consistency,

00:17:55.049 --> 00:17:57.470
which turns out to require combining maybe half

00:17:57.470 --> 00:18:01.589
a dozen different tools intelligently. But if

00:18:01.589 --> 00:18:04.170
AI creativity right now means we have to constantly

00:18:04.170 --> 00:18:06.390
patchwork these specialized services together,

00:18:07.079 --> 00:18:09.740
will this fractured workflow ever really converge

00:18:09.740 --> 00:18:13.380
into a single cohesive interface? Or is managing

00:18:13.380 --> 00:18:16.279
this complex multi-tool pipeline actually the

00:18:16.279 --> 00:18:18.980
future of deep AI creativity? That's something

00:18:18.980 --> 00:18:20.900
for you to mull over. Thanks for diving deep

00:18:20.900 --> 00:18:22.299
with us today. We'll catch you next time.
