WEBVTT

00:00:00.000 --> 00:00:02.560
If you think back just a few years ago, getting

00:00:02.560 --> 00:00:06.280
a professional cinematic video ad, that was a

00:00:06.280 --> 00:00:09.460
huge undertaking. Oh, yeah. Weeks, maybe months

00:00:09.460 --> 00:00:11.419
of production. You're talking about hiring the

00:00:11.419 --> 00:00:14.539
full crew, finding locations, and spending, what,

00:00:14.660 --> 00:00:16.280
tens of thousands of dollars before you even

00:00:16.280 --> 00:00:18.120
see a single frame. That was the cost of entry,

00:00:18.260 --> 00:00:20.929
right? A huge barrier. But the reality today,

00:00:21.149 --> 00:00:23.469
I mean, it's completely changed the game. It

00:00:23.469 --> 00:00:25.929
has. We have these incredible generative tools

00:00:25.929 --> 00:00:29.410
now, you know, Nano Banana Pro for images and

00:00:29.410 --> 00:00:33.329
Veo 3.1 for adding motion. And these tools allow

00:00:33.329 --> 00:00:36.429
you, the creator or the small team, to generate

00:00:36.429 --> 00:00:40.009
really high-end, story-driven ads from, well,

00:00:40.049 --> 00:00:41.899
just a text prompt. And it happens in hours,

00:00:42.060 --> 00:00:44.460
not weeks. Exactly. It's a fundamental shift.

00:00:44.579 --> 00:00:46.740
It is. And the source material we're looking

00:00:46.740 --> 00:00:49.600
at today gives us the perfect shortcut. It lays

00:00:49.600 --> 00:00:53.219
out a precise framework for key steps to start

00:00:53.219 --> 00:00:56.259
filming and start generating. Welcome to the

00:00:56.259 --> 00:00:58.500
deep dive. Our mission today is to unpack that

00:00:58.500 --> 00:01:01.520
guide on cinematic AI video ad creation. We're

00:01:01.520 --> 00:01:03.859
going to distill it down to that essential, repeatable

00:01:03.859 --> 00:01:06.359
four -step process. And it really is a process,

00:01:06.400 --> 00:01:09.200
like a workflow. And the biggest insight right

00:01:09.200 --> 00:01:11.400
up front is that you have to respect all four

00:01:11.400 --> 00:01:14.319
steps. If you get excited and just jump straight

00:01:14.319 --> 00:01:17.519
to generating the video. Or you rush the planning.

00:01:17.719 --> 00:01:20.000
The output is just going to be a mess. It's going

00:01:20.000 --> 00:01:22.859
to be disjointed, unusable, a total waste of

00:01:22.859 --> 00:01:24.239
time. So we're going to walk through this in

00:01:24.239 --> 00:01:27.140
order. Step one is storyboarding. The creative

00:01:27.140 --> 00:01:30.400
foundation. Then step two, image creation, which

00:01:30.400 --> 00:01:33.640
is our visual anchor. Step three is video generation.

00:01:33.980 --> 00:01:37.060
Where the magic happens. And finally, step four,

00:01:37.180 --> 00:01:39.560
final production, stitching it all together.

00:01:39.659 --> 00:01:41.739
Let's dive into that first step, storyboarding.

00:01:41.939 --> 00:01:45.280
This is, it's not technical at all. Right. It's

00:01:45.280 --> 00:01:48.299
all about defining a super clear creative vision

00:01:48.299 --> 00:01:50.620
and then breaking it down into separate scenes

00:01:50.620 --> 00:01:53.159
before you even touch a generative tool. And

00:01:53.159 --> 00:01:55.379
this is a great place to use AI assistance. You

00:01:55.379 --> 00:01:59.760
know, ChatGPT, Claude, Gemini, not as the creator,

00:01:59.780 --> 00:02:02.079
but as an accelerator. Totally. You use them

00:02:02.079 --> 00:02:04.299
conversationally, you feed them your concept,

00:02:04.319 --> 00:02:06.260
and they help flesh it out, break it down into

00:02:06.260 --> 00:02:08.439
maybe seven or ten scenes. It helps you get past

00:02:08.439 --> 00:02:10.990
that blank page problem. Exactly. You can have

00:02:10.990 --> 00:02:14.270
a high-level idea, like, "I want a high-energy running

00:02:14.270 --> 00:02:16.770
ad," and it'll suggest specific shots, specific

00:02:16.770 --> 00:02:19.229
angles. But you're the curator. You're always

00:02:19.229 --> 00:02:21.569
the one in charge. Always. The source had a great

00:02:21.569 --> 00:02:24.889
example concept, a Nike running ad, but with

00:02:24.889 --> 00:02:28.750
this futuristic kind of Blade Runner city vibe.

00:02:28.969 --> 00:02:32.590
Yeah, that aesthetic is key. Sleek, tech-forward.

00:02:33.389 --> 00:02:36.449
It fits the brand so perfectly. It does. And

00:02:36.449 --> 00:02:39.439
to get that look, you need... A specific sequence

00:02:39.439 --> 00:02:41.919
of shots. Right. So you'd outline, let's say,

00:02:41.939 --> 00:02:45.500
seven scenes. You start wide, distant shot of

00:02:45.500 --> 00:02:48.259
a neon skyline. Then you move to an empty, illuminated

00:02:48.259 --> 00:02:51.219
street. Then the really important one. A close-up

00:02:51.219 --> 00:02:53.560
of the shoes hitting wet pavement. Followed

00:02:53.560 --> 00:02:55.580
by the runner accelerating. Maybe a side profile.

00:02:55.879 --> 00:02:57.699
Yep. Then an over-the-shoulder shot. You see

00:02:57.699 --> 00:03:00.199
the city lights reflecting in the puddles. Then

00:03:00.199 --> 00:03:02.500
a clean product shot of the shoe itself. And

00:03:02.500 --> 00:03:04.080
you end with the runner sprinting right at the

00:03:04.080 --> 00:03:05.900
camera. And that breakdown dictates everything.

00:03:06.039 --> 00:03:08.240
And this is the important part. You're not just

00:03:08.240 --> 00:03:10.479
taking the AI suggestions and running with them.

00:03:10.659 --> 00:03:13.539
No. The human element of taste is critical. You

00:03:13.539 --> 00:03:15.479
have to modify the output. If the AI suggests

00:03:15.479 --> 00:03:17.819
something that doesn't fit your brand, you push

00:03:17.819 --> 00:03:20.860
back. You have to know when to say no. The AI

00:03:20.860 --> 00:03:24.080
can generate ideas, but it can't feel the brand

00:03:24.080 --> 00:03:26.520
alignment, you know, that emotional tone. That's

00:03:26.520 --> 00:03:28.800
all human judgment. It's the difference between

00:03:28.800 --> 00:03:31.159
generating something and generating the right

00:03:31.159 --> 00:03:33.500
thing. So if the AI can do all that, you know,

00:03:33.520 --> 00:03:36.819
the scene breakdowns, the suggestions. What's

00:03:36.819 --> 00:03:38.860
the one thing that a human has to bring to the

00:03:38.860 --> 00:03:42.219
table? It's taste. Human taste guides the final

00:03:42.219 --> 00:03:45.879
concept. AI only assists idea generation. That's

00:03:45.879 --> 00:03:47.819
it. So once that storyboard is locked, we move

00:03:47.819 --> 00:03:51.800
to step two, image creation. And this is where

00:03:51.800 --> 00:03:53.620
things get tricky, but also really interesting.

00:03:53.800 --> 00:03:55.620
Because the quality of your starting frames,

00:03:55.719 --> 00:03:57.840
it determines everything. It's the anchor for

00:03:57.840 --> 00:03:59.900
the entire project. And we're using a tool like

00:03:59.900 --> 00:04:02.180
Nano Banana Pro for this, right? Yeah. Through

00:04:02.180 --> 00:04:05.080
Gemini or AI Studio. Yep. You need that level

00:04:05.080 --> 00:04:07.300
of control over the visuals, things like aspect

00:04:07.300 --> 00:04:09.919
ratio, the lighting. Those details are what give

00:04:09.919 --> 00:04:12.800
you a stable, high-quality video later on. It's

00:04:12.800 --> 00:04:15.020
the difference maker. And here's a great tip

00:04:15.020 --> 00:04:17.660
from the source. Don't just generate one image

00:04:17.660 --> 00:04:20.029
per scene description. No, you need options.

00:04:20.509 --> 00:04:22.870
Generate four or five variations for each scene.

00:04:23.069 --> 00:04:26.170
Why so many? Because you need to see how the

00:04:26.170 --> 00:04:29.110
AI interprets the prompt in different ways. It

00:04:29.110 --> 00:04:31.509
lets you choose the single best frame that really

00:04:31.509 --> 00:04:33.949
nails that Blade Runner aesthetic we talked about.

00:04:34.170 --> 00:04:35.970
Okay, but here's the part where I think a lot

00:04:35.970 --> 00:04:38.050
of people fail. Yeah. The part that can ruin

00:04:38.050 --> 00:04:40.759
the whole video. Yeah. Maintaining visual consistency.

00:04:41.019 --> 00:04:43.300
This is the critical step. You need all seven

00:04:43.300 --> 00:04:45.199
of those scenes to look like they were shot by

00:04:45.199 --> 00:04:47.879
the same person in the same place at the same

00:04:47.879 --> 00:04:50.120
time. Absolutely. So for your very first scene,

00:04:50.220 --> 00:04:53.680
that distant skyline, you generate images until

00:04:53.680 --> 00:04:56.480
one is perfect. Just perfect. And that single

00:04:56.480 --> 00:04:59.639
image becomes, what, the style anchor. Exactly.

00:04:59.819 --> 00:05:02.279
Your style anchor. But wait, if I use the exact

00:05:02.279 --> 00:05:05.160
same text prompt for scene two, why do I still

00:05:05.160 --> 00:05:07.439
need to reference that first image? Won't the

00:05:07.439 --> 00:05:10.439
prompt keep it consistent? Great question, and

00:05:10.439 --> 00:05:14.680
the answer is, surprisingly, no. Generative models,

00:05:15.040 --> 00:05:16.920
even really good ones, they have what's called

00:05:16.920 --> 00:05:20.459
prompt drift. You can type the exact same prompt

00:05:20.459 --> 00:05:22.779
in seven times and you'll get seven slightly

00:05:22.779 --> 00:05:25.220
different lighting setups, different color palettes.

00:05:25.279 --> 00:05:27.519
They just, they drift away from the original

00:05:27.519 --> 00:05:29.959
intent. So the text prompt alone isn't stable

00:05:29.959 --> 00:05:32.560
enough for a whole sequence. Precisely. The image

00:05:32.560 --> 00:05:35.639
is stable. So the rule is for every single scene

00:05:35.639 --> 00:05:38.300
after the first one, scene two, three, four,

00:05:38.420 --> 00:05:40.980
all of them. You have to add that first anchor

00:05:40.980 --> 00:05:43.199
image as a reference in your prompt. And you

00:05:43.199 --> 00:05:46.399
tell the AI explicitly, use the same style, lighting,

00:05:46.500 --> 00:05:48.899
and aesthetic from image one. You're forcing

00:05:48.899 --> 00:05:50.720
it. You're locking it down. And this isn't just

00:05:50.720 --> 00:05:53.379
for the mood. It's for objects, too. Like the shoe.

00:05:53.379 --> 00:05:56.319
Exactly. If you have the shoe in scene one, you

00:05:56.319 --> 00:05:58.620
reference that image for every other shoe shot.

00:05:58.620 --> 00:06:00.600
Otherwise it might change color, or even model,

00:06:00.600 --> 00:06:03.079
mid-run. I'll admit, I still wrestle with prompt

00:06:03.079 --> 00:06:05.839
drift myself. It happens so fast. Using that anchor

00:06:05.839 --> 00:06:08.120
image, yeah, that really feels like the only way

00:06:08.120 --> 00:06:10.360
to get a professional, cohesive look. It's the

00:06:10.360 --> 00:06:13.600
secret. So if we skip referencing that first scene

00:06:13.600 --> 00:06:16.300
image for every shot, what's the guaranteed result?

00:06:16.600 --> 00:06:18.800
The scenes become visually disjointed, ruining

00:06:18.800 --> 00:06:21.720
the professional look of the final video. Guaranteed.

00:06:21.779 --> 00:06:24.740
Okay. So now we have our beautiful, consistent,

00:06:24.819 --> 00:06:29.220
static images. Seven approved frames. We move

00:06:29.220 --> 00:06:34.860
to step three. Video creation. Time to add some

00:06:34.860 --> 00:06:37.079
motion. And we actually go back to the AI assistant

00:06:37.079 --> 00:06:40.250
first. You take your seven final images and you

00:06:40.250 --> 00:06:42.649
ask it to write motion-specific prompts for

00:06:42.649 --> 00:06:45.009
each one. So the AI looks at the static picture

00:06:45.009 --> 00:06:46.930
and suggests the camera movement. Yeah, like

00:06:46.930 --> 00:06:49.850
a quick dolly in or a slow pan left. It gives

00:06:49.850 --> 00:06:52.569
you ideas for the motion. But here, unlike with

00:06:52.569 --> 00:06:55.490
the images, the source says to only ask for one

00:06:55.490 --> 00:06:58.050
video prompt per scene. Right. It's all about

00:06:58.050 --> 00:07:01.350
cost and time. Generating video is just way more

00:07:01.350 --> 00:07:03.829
resource intensive than images. So it would be

00:07:03.829 --> 00:07:06.470
too expensive and slow to generate 50 video clips

00:07:06.470 --> 00:07:09.470
just to find seven good ones. Prohibitively so.

00:07:09.959 --> 00:07:11.620
You have to be much more targeted with your video

00:07:11.620 --> 00:07:14.259
prompts. And for the tool, we're using Veo 3.1

00:07:14.259 --> 00:07:16.759
Fast. Why the fast version? It's just so much

00:07:16.759 --> 00:07:19.040
cheaper and quicker. And honestly, for this kind

00:07:19.040 --> 00:07:20.879
of work, the quality difference isn't big enough

00:07:20.879 --> 00:07:23.139
to justify the extra cost and wait time of the

00:07:23.139 --> 00:07:25.939
slower model. You need to iterate. And the most

00:07:25.939 --> 00:07:28.019
important technical feature of whatever tool

00:07:28.019 --> 00:07:30.759
you use, it has to let you use your own image

00:07:30.759 --> 00:07:33.560
as the starting frame. It has to. Otherwise,

00:07:33.699 --> 00:07:35.860
all that consistency work you did in step two

00:07:35.860 --> 00:07:37.939
is just gone. It's the link that holds the whole

00:07:37.939 --> 00:07:40.319
pipeline together. Now, the reality of the output

00:07:40.319 --> 00:07:43.959
is still, well, it's not perfect. You're still

00:07:43.959 --> 00:07:45.800
going to have to regenerate some scenes. Oh,

00:07:45.819 --> 00:07:48.319
for sure. If you have 10 scenes, you're absolutely

00:07:48.319 --> 00:07:50.259
going to redo at least three or four of them,

00:07:50.300 --> 00:07:52.480
maybe two, three, four times each. That's just

00:07:52.480 --> 00:07:55.720
where the tech is right now. Whoa. But just imagine

00:07:55.720 --> 00:07:59.240
scaling this, scaling this structured system

00:07:59.240 --> 00:08:02.000
across an entire campaign. You could produce

00:08:02.000 --> 00:08:05.259
hundreds of high-end, localized ads in just a

00:08:05.259 --> 00:08:08.129
few days. It's a profound shift in production

00:08:08.129 --> 00:08:10.870
capacity. It really is. But there's a key piece

00:08:10.870 --> 00:08:13.430
of expectation management here. The source notes

00:08:13.430 --> 00:08:15.370
that if the AI gives you an eight-second video

00:08:15.370 --> 00:08:17.269
clip. You're not going to use all eight seconds.

00:08:17.410 --> 00:08:19.209
Right. You might only get one to three seconds

00:08:19.209 --> 00:08:23.029
of truly usable footage. It's because of model

00:08:23.029 --> 00:08:25.589
coherence over time. The longer the AI has to

00:08:25.589 --> 00:08:28.589
generate, the more likely it is to, well, to

00:08:28.589 --> 00:08:30.939
get weird. Things start melting. People grow extra

00:08:30.939 --> 00:08:34.659
fingers. Yeah, exactly. The first few seconds

00:08:34.659 --> 00:08:36.580
are stable because they're close to that perfect

00:08:36.580 --> 00:08:38.740
starting image. But as it gets further away,

00:08:38.919 --> 00:08:41.139
it kind of loses its grip on reality. So you

00:08:41.139 --> 00:08:43.259
have to treat it like raw footage. You know you're

00:08:43.259 --> 00:08:44.940
going to have to trim the beginning and the end.

00:08:45.039 --> 00:08:47.519
You have to. You judge the quality by asking,

00:08:47.600 --> 00:08:51.399
can I get one to three perfect seconds out of

00:08:51.399 --> 00:08:54.519
this? Not, is the whole eight-second clip a masterpiece?

00:08:54.940 --> 00:08:57.519
That one mindset shift probably saves a lot of

00:08:57.519 --> 00:09:00.360
frustration. A ton. So the source says an eight-second

00:09:00.360 --> 00:09:02.379
generation usually gives you only one

00:09:02.379 --> 00:09:05.820
to three seconds of good footage. What does that

00:09:05.820 --> 00:09:08.620
tell us about judging the quality? Quality comes

00:09:08.620 --> 00:09:11.799
from aggressive trimming, not expecting eight

00:09:11.799 --> 00:09:14.360
perfect seconds of output. Okay, now for the

00:09:14.360 --> 00:09:17.690
home stretch. Step four. Production. We've got

00:09:17.690 --> 00:09:20.409
our short, beautiful, consistent clips. It's

00:09:20.409 --> 00:09:22.610
time to stitch them all together. Yep. Using

00:09:22.610 --> 00:09:25.629
an editor like CapCut, Premiere, or DaVinci Resolve.

00:09:25.769 --> 00:09:27.610
And the key concept here, the thing that really

00:09:27.610 --> 00:09:30.230
elevates the final product, is audio-driven

00:09:30.230 --> 00:09:32.570
editing. The music is not an afterthought. It

00:09:32.570 --> 00:09:34.470
dictates everything. The whole rhythm of the

00:09:34.470 --> 00:09:36.649
piece. So you pick your music first. Something

00:09:36.649 --> 00:09:39.679
with a clear beat. A clear beat, yeah. You import

00:09:39.679 --> 00:09:42.080
your clips, you import the audio, and then you

00:09:42.080 --> 00:09:44.100
trim your little one to three second clips so

00:09:44.100 --> 00:09:46.539
that the cut, the end of the clip, lands exactly

00:09:46.539 --> 00:09:49.240
on a beat. That's what gives it that quick, rhythmic,

00:09:49.240 --> 00:09:51.960
professional feel. The visual cuts are synced

00:09:51.960 --> 00:09:54.779
to the music. It's all about timing, not fancy

00:09:54.779 --> 00:09:57.659
effects, just timing. Once that's locked, you

00:09:57.659 --> 00:10:00.460
do some simple enhancements. Like text overlays,

00:10:00.460 --> 00:10:02.980
maybe some simple fades. Exactly. Keep it simple.

00:10:03.080 --> 00:10:05.559
But the last critical step is color grading.

00:10:06.110 --> 00:10:08.929
Why is that so important if we already use the

00:10:08.929 --> 00:10:12.169
consistency anchor? The anchor handles the style,

00:10:12.269 --> 00:10:15.110
but not the final calibration. You might still

00:10:15.110 --> 00:10:17.590
have tiny differences in exposure between clips.

00:10:17.789 --> 00:10:20.830
So you apply one final unified color correction

00:10:20.830 --> 00:10:23.230
across everything. It's the final glue that makes

00:10:23.230 --> 00:10:25.009
it feel like it was all shot on the same day

00:10:25.009 --> 00:10:27.850
with the same camera. Then you export in 4K.

00:10:27.990 --> 00:10:30.110
This whole workflow, it really brings up the

00:10:30.110 --> 00:10:32.870
question of automation. We're using AI for so

00:10:32.870 --> 00:10:35.309
much of it already. What parts can we automate

00:10:35.309 --> 00:10:37.950
and what parts must stay human? You can automate

00:10:37.950 --> 00:10:40.830
the repetitive stuff, you know, generating prompt

00:10:40.830 --> 00:10:43.110
variations, the pipeline that takes a description

00:10:43.110 --> 00:10:46.029
and spits out five images, file naming. But the

00:10:46.029 --> 00:10:48.169
source is really firm on what you should not

00:10:48.169 --> 00:10:50.090
automate. And that is the creative selection

00:10:50.090 --> 00:10:52.509
process. This is the golden rule. You cannot

00:10:52.509 --> 00:10:54.789
have an AI pick the best starting frame or the

00:10:54.789 --> 00:10:57.440
best video clip. You can't. Automation should

00:10:57.440 --> 00:11:00.860
handle repetition, not taste. And AI will miss

00:11:00.860 --> 00:11:03.620
the nuance. It will miss the feeling. It will

00:11:03.620 --> 00:11:06.120
pick something that's technically correct, but

00:11:06.120 --> 00:11:08.639
creatively dead. Because you have to visually

00:11:08.639 --> 00:11:11.759
inspect those five options to see which one really

00:11:11.759 --> 00:11:14.360
captures the vision. That curation, looking at

00:11:14.360 --> 00:11:16.419
five images and saying, this one feels right.

00:11:16.519 --> 00:11:19.200
That is the highest value human skill in this

00:11:19.200 --> 00:11:21.919
whole process now. So if we can technically automate

00:11:21.919 --> 00:11:25.139
the entire process end to end, why should we

00:11:25.139 --> 00:11:28.139
intentionally leave that creative selection part

00:11:28.139 --> 00:11:31.220
to a human? Over-automation sacrifices taste.

00:11:31.460 --> 00:11:34.039
The final output needs human creative selection

00:11:34.039 --> 00:11:37.019
to be engaging. It all comes back to taste. Always.

00:11:37.179 --> 00:11:40.759
So to recap the big idea here for you, the listener,

00:11:41.000 --> 00:11:44.360
the era of AI marketing is here. It's happening.

00:11:44.480 --> 00:11:46.879
We're collapsing weeks into hours with tools

00:11:46.879 --> 00:11:50.379
like Nano Banana Pro and Veo 3.1. But the real

00:11:50.379 --> 00:11:53.220
power isn't just in the tools. It's in the structured

00:11:53.220 --> 00:11:56.090
workflow: using images to lock in consistency,

00:11:56.350 --> 00:11:59.389
and using audio to dictate the pace. And the

00:11:59.389 --> 00:12:01.250
critical mistake to avoid, as we've said, is

00:12:01.250 --> 00:12:03.230
trying to automate your own creative judgment.

00:12:03.289 --> 00:12:05.909
The AI can't do that. That's your job. Which

00:12:05.909 --> 00:12:08.429
really brings us to a final provocative thought.

00:12:08.690 --> 00:12:11.970
The winners in this new world. They're not going

00:12:11.970 --> 00:12:13.649
to be the best coders or even the people who

00:12:13.649 --> 00:12:16.029
write the longest, most technical prompts. They're

00:12:16.029 --> 00:12:17.590
going to be the ones with a clear creative vision

00:12:17.590 --> 00:12:21.169
and impeccable taste who use AI simply as a tool

00:12:21.169 --> 00:12:23.730
for execution. The tools are ready. The workflow

00:12:23.730 --> 00:12:26.090
is proven. The only question left for you is

00:12:26.090 --> 00:12:28.470
what original creative vision are you bringing

00:12:28.470 --> 00:12:29.009
to the table?
