WEBVTT

00:00:00.000 --> 00:00:02.980
You've got this great idea for an AI video. You

00:00:02.980 --> 00:00:06.540
type it in. You hit generate. And, well, it's

00:00:06.540 --> 00:00:09.119
a mess. Yeah, a total mess. Your main character's

00:00:09.119 --> 00:00:11.980
face just keeps morphing. It turns into someone

00:00:11.980 --> 00:00:14.560
else entirely in the background. Right. The visual

00:00:14.560 --> 00:00:17.760
style feels incredibly chaotic. What was supposed

00:00:17.760 --> 00:00:20.140
to be cinematic looks like a complete and utter

00:00:20.140 --> 00:00:22.480
accident. We have all been there. It's incredibly

00:00:22.480 --> 00:00:24.320
frustrating. You kind of feel like you're playing

00:00:24.320 --> 00:00:26.539
a slot machine. You just pull the lever and hope.

00:00:27.890 --> 00:00:31.269
Welcome to The Deep Dive. Today we are breaking

00:00:31.269 --> 00:00:34.329
down a step-by-step guide. We're exploring

00:00:34.329 --> 00:00:38.090
exactly how to build AI video from scratch. We're

00:00:38.090 --> 00:00:40.810
looking at a specific three-tool sequence. We're

00:00:40.810 --> 00:00:44.539
talking Claude AI, ChatGPT, and Google Flow. That's

00:00:44.539 --> 00:00:47.079
the stack. This sequence guarantees consistent,

00:00:47.420 --> 00:00:49.799
highly cinematic results. It's really about moving

00:00:49.799 --> 00:00:52.939
from total randomness to actual directorial control.

00:00:53.219 --> 00:00:55.479
Right. It really is a mindset shift. We need

00:00:55.479 --> 00:00:57.820
to understand why those initial generations turn

00:00:57.820 --> 00:01:01.000
into chaotic messes. It's honestly not about

00:01:01.000 --> 00:01:04.560
writing a better, more complex prompt. It is

00:01:04.560 --> 00:01:06.700
entirely about the sequence of events. The sequence.

00:01:06.859 --> 00:01:09.140
Let's unpack that. The fundamental rule is this.

00:01:09.599 --> 00:01:11.840
You must build your visuals as static images

00:01:11.840 --> 00:01:14.969
first. You have to do that before animating anything

00:01:14.969 --> 00:01:18.189
at all. Generating static images is much faster.

00:01:18.409 --> 00:01:20.689
It's also significantly cheaper than generating

00:01:20.689 --> 00:01:23.010
video. You know, I still wrestle with prompt

00:01:23.010 --> 00:01:25.629
drift myself. Oh, yeah. Watching my character

00:01:25.629 --> 00:01:28.469
morph into a totally different person halfway

00:01:28.469 --> 00:01:31.329
through a scene, it is a really helpless feeling.

00:01:31.530 --> 00:01:33.469
Absolutely. Everyone experiences that prompt

00:01:33.469 --> 00:01:35.590
drift initially. It's like stacking Lego blocks

00:01:35.590 --> 00:01:38.629
of data. You need a solid, unmoving base before

00:01:38.629 --> 00:01:40.569
you add the moving pieces. Exactly. If you build

00:01:40.569 --> 00:01:42.670
on a shaky foundation, the whole scene just collapses.

00:01:42.829 --> 00:01:46.469
Yeah. But why is that? Why is fixing a mistake

00:01:46.469 --> 00:01:49.409
in a video so much harder than in a static image?

00:01:50.030 --> 00:01:52.930
Because video multiplies variables across time.

00:01:53.189 --> 00:01:55.769
Static images isolate those variables. Think

00:01:55.769 --> 00:01:58.390
about what a video actually is. Right. A standard

00:01:58.390 --> 00:02:01.349
film runs at 24 frames per second. If you generate

00:02:01.349 --> 00:02:04.530
a 10-second clip, the AI isn't just making one

00:02:04.530 --> 00:02:08.610
picture. It's making 240 pictures. Exactly. 240

00:02:08.610 --> 00:02:11.930
distinct images. And here is the truly difficult

00:02:11.930 --> 00:02:14.870
part for the AI. It has to remember what happened

00:02:14.870 --> 00:02:18.129
in frame one. It has to apply that exact memory

00:02:18.129 --> 00:02:21.150
to frame two. Then it has to guess what naturally

00:02:21.150 --> 00:02:24.349
happens in frame three. That is an immense computational

00:02:24.349 --> 00:02:27.250
burden. The system has to maintain the physics

00:02:27.250 --> 00:02:30.150
of a moving scene. It is massive. Eventually

00:02:30.150 --> 00:02:32.530
the machine drops the ball. The math just gets

00:02:32.530 --> 00:02:35.110
too heavy. It forgets the color of the character's

00:02:35.110 --> 00:02:37.310
jacket. It forgets the exact lighting angle on

00:02:37.310 --> 00:02:39.969
the coffee cup. Wow. But an image generator only

00:02:39.969 --> 00:02:43.110
has to solve one single frame. It concentrates

00:02:43.110 --> 00:02:45.409
all its processing power on that exact moment.

00:02:45.490 --> 00:02:47.669
You get a highly detailed, perfectly accurate

00:02:47.669 --> 00:02:50.650
picture. So video multiplies the chaos across

00:02:50.650 --> 00:02:54.129
time. Images isolate the problem entirely. Exactly.

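NOTE
Editor's sketch, not part of the episode: the frame arithmetic discussed above, assuming a constant frame rate of 24 fps. The name frames_in_clip is made up for illustration.
```python
# Illustrative only: how many frames a video model must keep consistent.
def frames_in_clip(seconds: float, fps: int = 24) -> int:
    """Return the number of frames in a clip of the given length."""
    return round(seconds * fps)
print(frames_in_clip(10))  # the 10-second, 24 fps example from the episode
```
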
00:02:54.169 --> 00:02:56.490
It pins the butterfly to the board, so to speak.

00:02:56.750 --> 00:02:59.289
You lock in the exact visual style. You fix the

00:02:59.289 --> 00:03:01.509
character's faces. Right. You perfect the lighting

00:03:01.509 --> 00:03:03.770
before a single frame actually moves. Which means

00:03:03.770 --> 00:03:06.330
we need a rock solid textual blueprint to start.

00:03:06.409 --> 00:03:08.389
Yeah. And that brings us to the first tool in

00:03:08.389 --> 00:03:11.150
this sequence. We're talking about Claude AI.

00:03:11.400 --> 00:03:14.240
Right. Before you even look at an image generator,

00:03:14.379 --> 00:03:16.979
before you open a video tool, you need a very

00:03:16.979 --> 00:03:20.400
clear, structured plan. Claude AI handles this

00:03:20.400 --> 00:03:23.080
entire blueprint stage. The guide we are looking

00:03:23.080 --> 00:03:25.360
at uses a very specific example. It starts with

00:03:25.360 --> 00:03:28.520
a remarkably simple story. A father and daughter

00:03:28.520 --> 00:03:32.020
escape a sudden volcano eruption. They're driving

00:03:32.020 --> 00:03:34.879
around desperately searching for gas to survive.

00:03:35.080 --> 00:03:37.219
It's a great example. It has inherent tension,

00:03:37.680 --> 00:03:40.319
but visually it's very contained. But let me

00:03:40.319 --> 00:03:42.939
push back on that a bit. Doesn't giving the AI

00:03:42.939 --> 00:03:46.580
a highly complex story make for a richer video?

00:03:47.080 --> 00:03:49.919
Like, if I want a sprawling sci-fi epic, shouldn't

00:03:49.919 --> 00:03:52.479
my premise be sprawling, too? It sounds counterintuitive,

00:03:52.520 --> 00:03:55.000
but no. A simple premise is actually much better

00:03:55.000 --> 00:03:58.219
here. Think about how the AI processes text tokens.

00:03:58.259 --> 00:04:01.280
OK. If you give the AI a complex novel with 10

00:04:01.280 --> 00:04:04.060
subplots, it won't know what visual elements

00:04:04.060 --> 00:04:06.219
to prioritize. Its attention gets scattered.

00:04:06.400 --> 00:04:08.560
It loses focus on the actual scene. Precisely.

00:04:08.680 --> 00:04:11.099
Two characters and one clear goal is perfect.

00:04:11.340 --> 00:04:14.199
It allows the AI to extract very clear visual

00:04:14.199 --> 00:04:16.660
directions. It knows exactly who to light and

00:04:16.660 --> 00:04:19.439
what they're doing. Keep the core idea incredibly

00:04:19.439 --> 00:04:23.459
simple so the AI can extract clear visual directions.

00:04:24.079 --> 00:04:27.240
That's the secret. So you take that simple

00:04:27.240 --> 00:04:30.720
story to Claude, then you create an AI video

00:04:30.720 --> 00:04:32.839
prompt skill. Let's define that. For someone

00:04:32.839 --> 00:04:35.540
new to this, what is an AI video prompt skill?

00:04:35.920 --> 00:04:39.389
Simply put, a saved instruction file for repeated

00:04:39.389 --> 00:04:41.670
AI tasks. So you aren't just typing into a blank

00:04:41.670 --> 00:04:43.689
chat box every time? Exactly. You are setting

00:04:43.689 --> 00:04:46.569
up a permanent framework. You tell Claude how

00:04:46.569 --> 00:04:49.230
to behave as a master video director. Nice. You

00:04:49.230 --> 00:04:51.269
load this skill automatically whenever you want

00:04:51.269 --> 00:04:53.389
to make a video. It saves you from constantly

00:04:53.389 --> 00:04:57.050
explaining the rules to the AI. What are we actually

00:04:57.050 --> 00:04:59.310
instructing Claude to do with this skill? What

00:04:59.310 --> 00:05:01.870
are the outputs we need? This skill demands three

00:05:01.870 --> 00:05:03.910
very distinct things from your simple story.

00:05:04.370 --> 00:05:06.870
It forces Claude to break the story down systematically.

00:05:07.670 --> 00:05:10.310
First, it generates a design sheet prompt. A

00:05:10.310 --> 00:05:12.149
design sheet? Kind of like concept art for a

00:05:12.149 --> 00:05:14.680
movie. Right. Exactly like concept art. This

00:05:14.680 --> 00:05:17.500
prompt outlines the characters, the core environment,

00:05:17.680 --> 00:05:20.399
and the color palette. It establishes the visual

00:05:20.399 --> 00:05:23.920
rules. Second, Claude generates a storyboard

00:05:23.920 --> 00:05:27.079
prompt. This breaks the story down into specific

00:05:27.079 --> 00:05:30.079
camera angles for each shot. Close-ups, wide

00:05:30.079 --> 00:05:32.540
shots, panning descriptions. Yes, it acts as

00:05:32.540 --> 00:05:35.779
the cinematographer. Third, it generates individual

00:05:35.779 --> 00:05:38.860
video prompts for every single scene. These are

00:05:38.860 --> 00:05:41.540
highly detailed instructions for the final animation

00:05:41.540 --> 00:05:44.600
phase. So Claude gives us this perfect structured

00:05:44.600 --> 00:05:47.560
text. We have the design sheet prompt, the storyboard

00:05:47.560 --> 00:05:50.279
prompt, and the video prompt. Yeah. But text

00:05:50.279 --> 00:05:52.300
isn't a movie. We need to actually visualize

00:05:52.300 --> 00:05:54.180
these blueprints. This is where we transition

00:05:54.180 --> 00:05:56.629
to the second stage. This is where ChatGPT comes

00:05:56.629 --> 00:05:59.089
in to lock in the visual style. We're moving

00:05:59.089 --> 00:06:02.089
from planning to actual image generation. You

00:06:02.089 --> 00:06:05.290
open ChatGPT. Specifically, you want to use the

00:06:05.290 --> 00:06:08.029
DALL-E 3 image generation feature inside it. Your

00:06:08.029 --> 00:06:10.430
first move is creating that master design sheet.

00:06:10.610 --> 00:06:12.629
We paste the design sheet prompt that Claude

00:06:12.629 --> 00:06:15.490
just wrote for us. You do. But the guide emphasizes

00:06:15.490 --> 00:06:18.290
a crucial addition here. You must manually add

00:06:18.290 --> 00:06:21.529
a specific phrase to the end. OK. You type, high

00:06:21.529 --> 00:06:24.569
detail, wide format, every element clearly visible.

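NOTE
Editor's sketch, not part of the episode: one way to append the clarity phrase automatically. The suffix text is quoted from the hosts; the helper name with_suffix and the sample prompt are hypothetical.
```python
# Suffix text is quoted from the episode; with_suffix is a hypothetical helper.
DESIGN_SHEET_SUFFIX = "High detail, wide format, every element clearly visible."
def with_suffix(prompt: str, suffix: str = DESIGN_SHEET_SUFFIX) -> str:
    """Append the clarity phrase to a prompt, skipping it if already present."""
    base = prompt.strip()
    if base.endswith(suffix):
        return base
    return f"{base} {suffix}"
print(with_suffix("Design sheet: father, daughter, car, ash-covered road."))
```
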
00:06:24.990 --> 00:06:27.449
Why those exact words? It sounds like an arbitrary

00:06:27.449 --> 00:06:29.870
magic spell. Why not just say, make it look good?

00:06:30.110 --> 00:06:32.629
Because diffusion models can be lazy. If you

00:06:32.629 --> 00:06:36.269
just say, make it look good, the AI focuses on

00:06:36.269 --> 00:06:38.790
aesthetics. It might use heavy shadows to look

00:06:38.790 --> 00:06:41.629
cinematic. It might blur the background for an

00:06:41.629 --> 00:06:43.990
artistic depth of field. Which hides the actual

00:06:43.990 --> 00:06:46.850
details we need. Precisely. We are building a

00:06:46.850 --> 00:06:49.569
reference document, not a final piece of art.

00:06:50.050 --> 00:06:53.449
Those specific words force the AI to prioritize

00:06:53.449 --> 00:06:56.449
spatial clarity. Makes sense. We need flat, even

00:06:56.449 --> 00:06:58.730
lighting. We need to see the character's exact

00:06:58.730 --> 00:07:01.230
face. We need to clearly see the buckle on their

00:07:01.230 --> 00:07:03.790
backpack. This image is the anchor for everything

00:07:03.790 --> 00:07:06.000
else. So you get this design sheet back, you

00:07:06.000 --> 00:07:08.920
look it over, but what happens if ChatGPT gets

00:07:08.920 --> 00:07:11.860
a small detail wrong? Say the daughter's backpack

00:07:11.860 --> 00:07:14.939
is the wrong color, or a photograph on the table

00:07:14.939 --> 00:07:18.139
is facing the wrong way. The instinct is to just

00:07:18.139 --> 00:07:20.160
rewrite the whole prompt and try again. That

00:07:20.160 --> 00:07:22.519
is the biggest mistake people make. Do not rewrite

00:07:22.519 --> 00:07:24.439
the whole prompt. Why not? Because rewriting

00:07:24.439 --> 00:07:27.360
changes the underlying seed noise. The AI will

00:07:27.360 --> 00:07:29.259
generate a completely different image from scratch.

00:07:29.500 --> 00:07:32.000
Oh, wow. The lighting will change. The faces

00:07:32.000 --> 00:07:34.120
will change. You lose all the good stuff just

00:07:34.120 --> 00:07:37.720
to fix one tiny detail. So how do you fix it

00:07:37.720 --> 00:07:41.519
without destroying the image? You add one specific

00:07:41.519 --> 00:07:44.480
localized correction sentence. If the photograph

00:07:44.480 --> 00:07:47.139
is wrong, you literally just type only the back

00:07:47.139 --> 00:07:49.279
of the photograph is visible. You just reply to

00:07:49.279 --> 00:07:52.360
the image with that one sentence. Yes. The AI

00:07:52.360 --> 00:07:54.839
understands localized constraints much better.

00:07:55.060 --> 00:07:57.439
It keeps the original context window intact.

00:07:57.800 --> 00:08:00.759
It just surgically alters that one specific element

00:08:00.759 --> 00:08:03.500
you mentioned. Let me play devil's advocate here.

00:08:03.579 --> 00:08:06.279
Why not just skip this design sheet entirely?

00:08:06.420 --> 00:08:09.160
Okay. If we have the text prompts, why not generate

00:08:09.069 --> 00:08:11.459
the storyboard right away. Because without that

00:08:11.459 --> 00:08:14.560
visual anchor, the AI is just guessing. It will

00:08:14.560 --> 00:08:17.639
hallucinate a new visual style for every single

00:08:17.639 --> 00:08:19.899
panel. The dreaded prompt drift again. Exactly.

00:08:20.060 --> 00:08:21.740
In panel one, your character is wearing a clean

00:08:21.740 --> 00:08:24.720
blue jacket. In panel two, it's suddenly a dirty

00:08:24.720 --> 00:08:27.519
denim vest. Right. The lighting shifts from overcast

00:08:27.519 --> 00:08:30.399
to bright sunlight. It looks amateurish. The

00:08:30.399 --> 00:08:33.059
design sheet is the exact visual anchor keeping

00:08:33.059 --> 00:08:36.120
the AI from hallucinating. It absolutely is.

00:08:36.360 --> 00:08:38.259
So once that design sheet is perfectly locked

00:08:38.259 --> 00:08:40.220
in, once the character is looking exactly right.

00:08:40.379 --> 00:08:43.460
Then, and only then, do you generate the storyboard.

00:08:43.600 --> 00:08:45.740
You take the storyboard prompt Claude wrote earlier,

00:08:46.159 --> 00:08:48.659
you paste it into ChatGPT. But here is the critical

00:08:48.659 --> 00:08:52.620
step. You must attach that perfect design sheet

00:08:52.620 --> 00:08:55.700
image to the prompt. You are literally feeding

00:08:55.700 --> 00:08:59.340
the image back into the AI. You tell ChatGPT

00:08:59.340 --> 00:09:02.159
to match those exact characters and colors. You're

00:09:02.159 --> 00:09:04.399
giving it visual reference, not just text. Yes.

00:09:04.480 --> 00:09:07.409
And you add one more specific phrase here. You

00:09:07.409 --> 00:09:10.830
type, high detail, each panel clearly separated

00:09:10.830 --> 00:09:13.169
and readable. Because a storyboard is a grid

00:09:13.169 --> 00:09:15.029
of multiple images. Right. It's usually a two

00:09:15.029 --> 00:09:18.830
by six grid, 12 panels total. If the AI blends

00:09:18.830 --> 00:09:20.950
the borders of those panels together, it becomes

00:09:20.950 --> 00:09:23.590
a mess. You need clean, distinct shots because

00:09:23.590 --> 00:09:25.529
you will be isolating them soon. We're going

00:09:25.529 --> 00:09:27.690
to take a quick pause right here. Don't go

00:09:27.690 --> 00:09:31.889
anywhere. And we are back. So far, we have planned

00:09:31.889 --> 00:09:34.830
our text in Claude. We have locked in our visual

00:09:34.830 --> 00:09:37.730
style with a design sheet in ChatGPT. And we

00:09:37.730 --> 00:09:41.230
just generated a perfectly consistent 12-panel

00:09:41.230 --> 00:09:44.009
storyboard. Everything matches. You have a comic

00:09:44.009 --> 00:09:46.610
book version of your film. It is static, but

00:09:46.610 --> 00:09:49.169
it is visually perfect. Now comes the magic.

00:09:49.789 --> 00:09:52.970
We finally breathe life into it. We are moving

00:09:52.970 --> 00:09:56.879
to the third tool. Google Flow. This is the animation

00:09:56.879 --> 00:09:59.879
phase. Google Flow uses an underlying technology

00:09:59.879 --> 00:10:02.860
called Veo 3. Let's define Veo 3 for the listener.

00:10:03.200 --> 00:10:06.080
An AI engine that turns static images into moving

00:10:06.080 --> 00:10:08.159
clips. Simple enough. How do we actually use

00:10:08.159 --> 00:10:11.000
it? You take that full 12-panel storyboard image.

00:10:11.259 --> 00:10:13.620
You upload that single image file directly into

00:10:13.620 --> 00:10:16.860
Google Flow. OK. Then you paste those 12 specific

00:10:16.860 --> 00:10:19.500
video prompts, the ones Claude wrote for us back

00:10:19.500 --> 00:10:21.980
in Stage 1. So we are giving Flow the static

00:10:21.980 --> 00:10:24.480
visual grid plus the text instructions of how

00:10:24.480 --> 00:10:26.759
things should move. Exactly. And you add a master

00:10:26.759 --> 00:10:28.940
instruction to the top of the prompt. You tell

00:10:28.940 --> 00:10:31.600
Flow, generate a scene using shots in the uploaded

00:10:31.600 --> 00:10:33.500
film storyboard sequence. The guide mentions

00:10:33.500 --> 00:10:36.000
adding one more strict constraint here. You have

00:10:36.000 --> 00:10:39.159
to explicitly tell the AI, no subtitles, and

00:10:39.159 --> 00:10:41.419
no music. Why do we need to specify that? We

00:10:41.419 --> 00:10:43.799
obviously want sound and text in our final video

00:10:43.799 --> 00:10:47.039
eventually, right? We do. But AI video generators

00:10:47.039 --> 00:10:49.940
have a really bad habit. If they try to generate

00:10:49.940 --> 00:10:52.580
audio or text alongside the video, it bakes it

00:10:52.580 --> 00:10:54.700
directly into the file. It's permanently attached

00:10:54.700 --> 00:10:57.480
to the visuals. Yes. AI -generated text often

00:10:57.480 --> 00:11:00.679
looks like alien gibberish. It hallucinates weird

00:11:00.679 --> 00:11:03.679
letters. AI-generated music might have tempo

00:11:03.679 --> 00:11:06.200
changes you hate. That makes sense. If that audio

00:11:06.200 --> 00:11:08.620
is baked into your raw video file, you cannot

00:11:08.620 --> 00:11:11.500
separate it later. You ruin your non-linear

00:11:11.500 --> 00:11:14.019
editing options. Blank audio and clean frames

00:11:14.019 --> 00:11:16.220
give you total control during the final edit.

00:11:16.419 --> 00:11:18.919
Exactly. Keep the raw footage as clean

00:11:18.919 --> 00:11:21.600
as possible. So you hit generate and Flow goes

00:11:21.600 --> 00:11:24.440
to work. It processes the grid. It does. And

00:11:24.440 --> 00:11:28.929
whoo! Imagine turning a flat 12-panel grid into

00:11:28.929 --> 00:11:31.990
15 seconds of cinematic motion instantly. Yeah.

00:11:32.309 --> 00:11:34.610
It analyzes the spatial relationships in the

00:11:34.610 --> 00:11:37.250
static image. It calculates the temporal consistency

00:11:37.250 --> 00:11:39.809
required to make it move. It just blows my mind

00:11:39.809 --> 00:11:42.470
every time I see it work. It's pretty wild to

00:11:42.470 --> 00:11:46.110
think about. You are taking a flat static comic

00:11:46.110 --> 00:11:48.950
strip and getting a living, breathing movie out

00:11:48.950 --> 00:11:51.190
of it. Absolutely. It feels like something straight

00:11:51.190 --> 00:11:54.419
out of a sci-fi novel. But wait. We have 12

00:11:54.419 --> 00:11:58.019
distinct panels, and Flow generates a 15-second

00:11:58.019 --> 00:12:01.059
clip. That math is pretty tight. It is very tight,

00:12:01.240 --> 00:12:03.940
averaging just over one second per shot. Some

00:12:03.940 --> 00:12:06.120
of those shots must flash by incredibly quickly.

00:12:06.240 --> 00:12:08.419
They absolutely do. Some shots will feel rushed.

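NOTE
Editor's sketch, not part of the episode: the pacing arithmetic discussed above, assuming every panel gets an equal share of the clip. The name seconds_per_shot is made up for illustration.
```python
# Illustrative only: average screen time per storyboard panel.
def seconds_per_shot(clip_seconds: float, panels: int) -> float:
    """Return the average number of seconds each panel gets in the clip."""
    if panels <= 0:
        raise ValueError("panels must be positive")
    return clip_seconds / panels
print(seconds_per_shot(15, 12))  # 12 panels in a 15-second clip
```
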
00:12:08.700 --> 00:12:10.519
Occasionally, the AI might even skip a panel

00:12:10.519 --> 00:12:13.019
entirely if the transition is too complex. Oh,

00:12:13.019 --> 00:12:15.759
really? Yeah, but that is OK. This initial generation

00:12:15.759 --> 00:12:18.419
is basically an animatic. It is a rough cut.

00:12:18.500 --> 00:12:21.639
It shows you the overall pacing and flow. So

00:12:21.639 --> 00:12:23.779
you watch this rough 15 second clip, what are

00:12:23.779 --> 00:12:26.419
we looking for? You are checking for temporal

00:12:26.419 --> 00:12:29.539
consistency. Do the warm orange tones of the

00:12:29.539 --> 00:12:32.179
volcano ash stay consistent into the final gas

00:12:32.179 --> 00:12:34.820
station scene? Does the camera pan smoothly?

00:12:35.840 --> 00:12:39.039
If it feels too fast, or if a transition is jarring,

00:12:39.600 --> 00:12:41.620
how do we fix it? We aren't in a traditional

00:12:41.620 --> 00:12:44.080
editing timeline here. Flow has a feature called

00:12:44.080 --> 00:12:46.840
the describe your edits box. It uses natural

00:12:46.840 --> 00:12:49.230
language processing to adjust the timeline. You

00:12:49.230 --> 00:12:52.110
literally just type smooth the transition between

00:12:52.110 --> 00:12:55.950
each shot, or you type maintain consistent pacing

00:12:55.950 --> 00:12:58.309
throughout the video. You just ask it nicely

00:12:58.309 --> 00:13:01.669
to fix the edit. Exactly. The engine recalculates

00:13:01.669 --> 00:13:04.590
the latent space between the frames. It generates

00:13:04.590 --> 00:13:07.289
new transitional frames to smooth out the motion.

00:13:07.389 --> 00:13:10.669
Wow. It adjusts the pacing without changing the

00:13:10.669 --> 00:13:13.490
core visual assets you established. That is incredible

00:13:13.490 --> 00:13:16.559
control. You are directing the edit with text.

00:13:16.720 --> 00:13:18.580
And because you did the hard work in stage one

00:13:18.580 --> 00:13:21.039
and two. Because you built that perfect design

00:13:21.039 --> 00:13:23.679
sheet. Right. The final motion clip maintains

00:13:23.679 --> 00:13:25.980
its integrity. The characters look right. The

00:13:25.980 --> 00:13:28.759
lighting matches. It feels intentional. The Claude

00:13:28.759 --> 00:13:31.840
to ChatGPT to Flow pipeline is undeniably powerful.

00:13:32.600 --> 00:13:35.059
But let's look at the bigger picture. Not everyone

00:13:35.059 --> 00:13:37.179
uses those specific tools. Maybe they don't have

00:13:37.179 --> 00:13:39.600
access. Maybe the subscription costs for three

00:13:39.600 --> 00:13:41.960
different premium AI services are just too high.

00:13:42.080 --> 00:13:45.100
That is a very valid concern. Generating images

00:13:45.100 --> 00:13:48.419
and video at scale gets expensive quickly. The

00:13:48.419 --> 00:13:51.200
beauty of this specific framework is its modularity.

00:13:51.840 --> 00:13:55.259
The tools are completely swappable. Let's dig

00:13:55.259 --> 00:13:58.139
into the tool belt. What are the alternatives

00:13:58.139 --> 00:14:01.399
for stage one, the prompt writing phase? If you

00:14:01.399 --> 00:14:04.529
don't want to use Claude, Gemini is a fantastic

00:14:04.529 --> 00:14:08.070
alternative. Google's Gemini has a very capable

00:14:08.070 --> 00:14:11.730
free tier. It handles large context windows beautifully,

00:14:11.809 --> 00:14:13.850
which is great for building those skills. What

00:14:13.850 --> 00:14:16.110
about Grok? Grok is another great option for

00:14:16.110 --> 00:14:18.889
the text phase. It is incredibly fast. It also

00:14:18.889 --> 00:14:21.549
has fewer safety restrictions, which can be helpful

00:14:21.549 --> 00:14:23.850
if your story involves action or conflict that

00:14:23.850 --> 00:14:26.710
other AIs might mistakenly flag as inappropriate.

00:14:27.309 --> 00:14:30.049
What about stage two, generating the static images?

00:14:30.549 --> 00:14:33.509
We need tools that excel at building consistent

00:14:33.509 --> 00:14:36.769
design sheets. Ideogram is a phenomenal alternative

00:14:36.769 --> 00:14:40.129
to ChatGPT and DALL-E 3. Why Ideogram specifically?

00:14:40.330 --> 00:14:42.570
Two reasons. First, it is currently the best

00:14:42.570 --> 00:14:45.009
in the market at rendering visible text accurately.

00:14:45.210 --> 00:14:47.309
Oh, nice. If your scene requires a neon sign

00:14:47.309 --> 00:14:50.370
or a newspaper headline, Ideogram nails it. Second,

00:14:50.429 --> 00:14:52.590
it has unique built-in tools for maintaining

00:14:52.590 --> 00:14:54.629
strict character consistency across different

00:14:54.629 --> 00:14:57.149
prompts. Any other image alternatives? Leonardo

00:14:57.149 --> 00:14:59.950
AI. It has an amazing free tier. It gives you

00:14:59.950 --> 00:15:02.200
an incredible amount of granular control over

00:15:02.200 --> 00:15:05.519
the artistic style. If you want a highly stylized,

00:15:05.620 --> 00:15:08.320
painted look rather than photorealism, Leonardo

00:15:08.320 --> 00:15:11.080
is brilliant. Now for the heavy hitter. Stage

00:15:11.080 --> 00:15:15.039
three, the video generation. This is almost always

00:15:15.039 --> 00:15:18.100
the most computationally expensive step. What

00:15:18.100 --> 00:15:21.659
can we use instead of Google Flow and Veo 3? Kling

00:15:21.659 --> 00:15:24.559
3.0 is a massive contender right now. I have

00:15:24.559 --> 00:15:26.740
seen a lot of clips from Kling online. It looks

00:15:26.740 --> 00:15:30.379
very cinematic. It is. Kling 3.0 handles physics

00:15:30.379 --> 00:15:33.600
simulations remarkably well. Water splashing,

00:15:34.000 --> 00:15:36.559
smoke billowing. It's often cheaper than enterprise

00:15:36.559 --> 00:15:39.559
solutions, and the quality is stunning. What

00:15:39.559 --> 00:15:42.190
else is out there for video? Pix4C1 is another

00:15:42.190 --> 00:15:44.350
strong option. It is specifically optimized for

00:15:44.350 --> 00:15:46.250
taking storyboard reference panels and translating

00:15:46.250 --> 00:15:48.590
them into consistent motion. Good to know. And

00:15:48.590 --> 00:15:52.190
if you are on a strict $0 budget, Hailuo AI is

00:15:52.190 --> 00:15:54.330
perfect for completely free experimentation.

00:15:54.750 --> 00:15:56.529
So you have all these clips, but we still need

00:15:56.529 --> 00:15:58.470
to put it all together. We need to add the music

00:15:58.470 --> 00:16:01.529
and subtitles we avoided earlier. For final editing,

00:16:02.009 --> 00:16:04.649
CapCut is the dominant choice for creators. Right.

00:16:04.779 --> 00:16:07.440
It's free, incredibly intuitive, and there is

00:16:07.440 --> 00:16:09.539
no watermark if you use the desktop version.

00:16:10.220 --> 00:16:12.419
If you want Hollywood level control, DaVinci

00:16:12.419 --> 00:16:15.279
Resolve offers professional color grading. It

00:16:15.279 --> 00:16:18.240
is a massive complex program, but the base version

00:16:18.240 --> 00:16:21.159
is entirely free. And obviously Adobe Premiere

00:16:21.159 --> 00:16:23.620
if you are already in that ecosystem. But let

00:16:23.620 --> 00:16:26.360
me ask you this. If we swap out the software,

00:16:26.919 --> 00:16:29.200
if we use Gemini instead of Claude and Kling

00:16:29.200 --> 00:16:32.340
instead of Flow, does the quality of this specific

00:16:32.340 --> 00:16:35.169
sequence degrade? Does the magic disappear? Not

00:16:35.169 --> 00:16:37.250
at all. The discipline of the sequence is what

00:16:37.250 --> 00:16:39.509
creates the quality. It is absolutely not about

00:16:39.509 --> 00:16:42.149
the specific brand of AI. It's the architecture

00:16:42.149 --> 00:16:45.129
of the workflow. Precisely. The AI models will

00:16:45.129 --> 00:16:47.149
change. A new tool will launch next week that

00:16:47.149 --> 00:16:50.309
makes Veo 3 look old. But the logic of the pipeline

00:16:50.309 --> 00:16:53.110
remains. Right. You must build the textual blueprint

00:16:53.110 --> 00:16:55.870
first. You must lock in the static image second.

00:16:56.289 --> 00:16:58.870
You only animate as the absolute final step.

00:16:59.110 --> 00:17:01.450
The sequence is universal, even if the exact

00:17:01.450 --> 00:17:04.089
software you decide to use changes.

00:17:04.089 --> 00:17:06.490
That is the big takeaway here. Building

00:17:06.490 --> 00:17:10.069
an AI video from scratch feels like magic. When

00:17:10.069 --> 00:17:12.869
you watch a finished high-quality AI film, it

00:17:12.869 --> 00:17:15.369
looks like a miracle of prompting, but it isn't

00:17:15.369 --> 00:17:17.849
magical prompting at all. No, it's just a structured

00:17:17.849 --> 00:17:20.990
process. It is about having a clear, methodical

00:17:20.990 --> 00:17:23.990
sequence. You start with a simple story. You

00:17:23.990 --> 00:17:27.869
let the text tool structure your prompts. You

00:17:27.869 --> 00:17:31.769
build your static visuals to anchor the AI. Only

00:17:31.769 --> 00:17:34.509
then do you hit the animate button. Each tool

00:17:34.509 --> 00:17:37.930
handles one specific job. You isolate the variables.

00:17:38.289 --> 00:17:40.950
You fix problems while they are still flat images.

00:17:41.170 --> 00:17:43.349
Because fixing a moving target is nearly impossible.

00:17:43.650 --> 00:17:45.829
Exactly. You become a director, carefully setting

00:17:45.829 --> 00:17:48.390
up the shot. You stop being a gambler just pulling

00:17:48.390 --> 00:17:50.450
the slot machine lever. It changes everything.

00:17:50.509 --> 00:17:52.710
Yeah. I want to leave everyone with a final thought

00:17:52.710 --> 00:17:54.910
to chew on today. We have been talking about

00:17:54.910 --> 00:17:58.400
this specific AI video workflow. But think about

00:17:58.400 --> 00:18:00.880
your own workflows outside of video creation.

00:18:00.980 --> 00:18:02.779
Oh, it applies to almost everything. It really

00:18:02.779 --> 00:18:05.079
does. Think about analytical business projects

00:18:05.079 --> 00:18:07.819
or writing code or even designing a presentation.

00:18:08.539 --> 00:18:10.880
This methodology, building the design sheet first,

00:18:10.880 --> 00:18:14.039
is universal. How often do we rush to the final

00:18:14.039 --> 00:18:16.940
moving product? We jump straight into the execution

00:18:16.940 --> 00:18:20.569
phase blindly because it feels productive. We

00:18:20.569 --> 00:18:23.250
try to animate our ideas before we have actually

00:18:23.250 --> 00:18:25.130
designed them. We skip the blueprint because

00:18:25.130 --> 00:18:27.049
we want to see the house. Exactly. And then we

00:18:27.049 --> 00:18:29.849
spend twice as long fixing structural errors

00:18:29.849 --> 00:18:32.490
that never should have happened. Right. Next

00:18:32.490 --> 00:18:35.509
time you start a complex new project, stop and

00:18:35.509 --> 00:18:38.230
ask yourself, did I build my static design sheet

00:18:38.230 --> 00:18:41.309
first or am I just hoping the sequence works

00:18:41.309 --> 00:18:44.150
itself out? Thank you for joining The Deep Dive.

00:18:44.309 --> 00:18:45.329
We will see you next time.
