WEBVTT

00:00:00.000 --> 00:00:02.640
Have you ever had that feeling where you're using

00:00:02.640 --> 00:00:05.139
one of these, you know, incredibly powerful AI

00:00:05.139 --> 00:00:07.780
image tools, and it feels less like you're designing

00:00:07.780 --> 00:00:10.259
something and more like you're just playing a

00:00:10.259 --> 00:00:12.480
slot machine? Oh, absolutely. You type in this

00:00:12.480 --> 00:00:14.960
perfect prompt, you hit generate, and what comes

00:00:14.960 --> 00:00:17.859
back is just completely random. It's pure chance,

00:00:18.100 --> 00:00:20.739
and the frustration, it really peaks when you

00:00:20.739 --> 00:00:22.839
finally get that one great character. You know,

00:00:22.960 --> 00:00:25.960
the lighting's perfect, the mood is there, and

00:00:25.960 --> 00:00:28.160
then you just ask it to... Say, move the camera.

00:00:28.320 --> 00:00:30.219
And it all falls apart. It all falls apart. Suddenly,

00:00:30.280 --> 00:00:31.519
your character looks like a totally different

00:00:31.519 --> 00:00:33.679
person. That's that AI slop everyone talks about,

00:00:33.840 --> 00:00:36.359
that inconsistency that just kills a project.

00:00:37.799 --> 00:00:40.479
Welcome back to the Deep Dive. Today, we're on

00:00:40.479 --> 00:00:42.899
a mission, really. We want to guide you through

00:00:42.899 --> 00:00:46.020
a very specific workflow to take back directorial

00:00:46.020 --> 00:00:49.179
control over AI generation. We are moving beyond

00:00:49.179 --> 00:00:52.329
luck. The goal is to turn one single perfect

00:00:52.329 --> 00:00:55.630
image into an entire cinematic sequence. That's

00:00:55.630 --> 00:00:57.469
it, exactly. We're moving you out of the passenger

00:00:57.469 --> 00:00:59.810
seat and into the director's chair. We're going

00:00:59.810 --> 00:01:02.630
to break down this workflow using Nano Banana

00:01:02.630 --> 00:01:05.629
Pro for the images, it's powered by Gemini 3,

00:01:05.629 --> 00:01:08.750
and then Veo 3.1 Fast for the video part. So we'll

00:01:08.750 --> 00:01:11.010
cover getting that foundation right, mastering

00:01:11.010 --> 00:01:13.530
the language of cinematography, and then the

00:01:13.530 --> 00:01:16.260
secret to getting controlled, predictable animation.

00:01:16.500 --> 00:01:18.200
Let's just dive right into the core problem.

00:01:18.379 --> 00:01:21.000
Okay, let's unpack this. Why are most of these

00:01:21.000 --> 00:01:25.540
AI tools so stubborn? The sources kept using

00:01:25.540 --> 00:01:28.659
the word inconsistency. Inconsistency is the

00:01:28.659 --> 00:01:31.500
root of all evil here. So when you create a character

00:01:31.500 --> 00:01:34.579
with some unique detail, right, like a robotic

00:01:34.579 --> 00:01:37.420
eye or a specific scar. And then you ask for

00:01:37.420 --> 00:01:40.700
a wide shot. Most models just forget those details.

00:01:41.459 --> 00:01:43.620
They resample the whole scene. It treats your

00:01:43.620 --> 00:01:45.980
new prompt like a brand new request. So it's

00:01:45.980 --> 00:01:47.760
not even that it's forgetting. It's that it's

00:01:47.760 --> 00:01:49.879
literally starting a whole new drawing from scratch.

00:01:50.040 --> 00:01:52.019
It's ignoring the face, the clothes, everything.

00:01:52.420 --> 00:01:54.099
Exactly. I mean, you watch a movie, the main

00:01:54.099 --> 00:01:56.060
actor doesn't suddenly have a different nose

00:01:56.060 --> 00:01:58.819
in the middle of a scene. We need that same level

00:01:58.819 --> 00:02:01.400
of consistency. And that's where this reference

00:02:01.400 --> 00:02:03.719
workflow comes in. Wait, so is this reference

00:02:03.719 --> 00:02:06.760
workflow just a fancy term for image to image?

00:02:06.900 --> 00:02:08.439
How is this different? How do we know it's not

00:02:08.439 --> 00:02:09.960
just going to copy the lighting and forget the

00:02:09.960 --> 00:02:11.900
character? That's the key question. A lot of

00:02:11.900 --> 00:02:14.319
those older systems, they just use the reference

00:02:14.319 --> 00:02:17.819
as a style guide. But Nano Banana Pro, using

00:02:17.819 --> 00:02:20.439
Gemini 3's architecture, is built to hold onto

00:02:20.439 --> 00:02:23.539
those key features. OK. So when you upload that

00:02:23.539 --> 00:02:26.240
reference image, your master blueprint, and you

00:02:26.240 --> 00:02:29.120
ask for a change, the model is forced to anchor

00:02:29.120 --> 00:02:32.199
the character's geometry. That robotic eye, for

00:02:32.199 --> 00:02:34.219
example, is locked in. It doesn't get washed

00:02:34.219 --> 00:02:36.340
away. So this reference workflow, it's basically

00:02:36.340 --> 00:02:38.599
the blueprint that forces consistency across

00:02:38.599 --> 00:02:40.939
all the other shots you generate. It's the only

00:02:40.939 --> 00:02:43.000
way to lock down that character's identity. And

00:02:43.000 --> 00:02:45.639
the whole thing hinges on that single foundation.

00:02:45.740 --> 00:02:48.419
Yeah. Which brings us to the most important step,

00:02:49.159 --> 00:02:52.319
creating the foundation image. If you rush this

00:02:52.319 --> 00:02:54.599
part, the whole thing just falls apart later.

00:02:54.879 --> 00:02:56.979
Right. This is the image that defines everything.

00:02:57.229 --> 00:02:59.849
The character, the lighting, the mood. All of

00:02:59.849 --> 00:03:01.729
it, the textures, everything. And the example

00:03:01.729 --> 00:03:03.229
in the sources, they didn't just use a simple

00:03:03.229 --> 00:03:06.629
portrait. They chose this really complex cyberpunk

00:03:06.629 --> 00:03:10.409
street food chef in a rainy Neo-Tokyo, which

00:03:10.409 --> 00:03:13.770
seems designed to be difficult. It's intentionally

00:03:13.770 --> 00:03:16.409
complex. That's where the real learning is. You

00:03:16.409 --> 00:03:18.729
have to deal with steam and puddles and neon

00:03:18.729 --> 00:03:21.729
lights. You can't just type cyberpunk chef and

00:03:21.729 --> 00:03:24.210
hope for the best. You need precision. Surgical

00:03:24.210 --> 00:03:26.539
precision on the subject, the environment, and

00:03:26.539 --> 00:03:29.020
the camera. The sample prompt they gave is basically

00:03:29.020 --> 00:03:31.840
a technical checklist. It starts with the subject.

00:03:32.520 --> 00:03:35.569
Old Japanese chef, gray beard, and then the key

00:03:35.569 --> 00:03:38.830
detail. Robotic cybernetic left eye with a red

00:03:38.830 --> 00:03:41.610
lens, stained apron. That red lens, that's the

00:03:41.610 --> 00:03:43.689
anchor. That's the non-negotiable detail that

00:03:43.689 --> 00:03:45.430
AI has to hold on to. And then the environment

00:03:45.430 --> 00:03:49.310
gets just as specific. Ramen stall, heavy steam,

00:03:49.770 --> 00:03:52.909
rainy Neo-Tokyo, pink and blue neon signs blurred

00:03:52.909 --> 00:03:56.229
in the distance. If you don't lock that in, the

00:03:56.229 --> 00:03:57.969
lighting will be all over the place in your other

00:03:57.969 --> 00:04:01.189
shots. And finally, the camera. Medium shot,

00:04:01.409 --> 00:04:04.289
eye level, cinematic lighting. Photorealistic.

00:04:04.729 --> 00:04:07.550
Once you get that perfect image, that one file

00:04:07.550 --> 00:04:09.430
will become your North Star. So that's the one

00:04:09.430 --> 00:04:11.270
you keep coming back to. It's the one you upload

00:04:11.270 --> 00:04:13.650
to the reference section every single time. I

00:04:13.650 --> 00:04:17.449
still wrestle with prompt drift myself. I once

00:04:17.449 --> 00:04:20.170
spent three hours trying to get consistent shots

00:04:20.170 --> 00:04:22.589
because I was too lazy to perfect my first prompt.

00:04:22.670 --> 00:04:25.389
An entire afternoon wasted. Exactly. I learned

00:04:25.389 --> 00:04:28.050
my lesson. You have to invest the time up front

00:04:28.279 --> 00:04:31.779
in that foundation image. So extreme specificity

00:04:31.779 --> 00:04:34.360
on subject, environment, and camera angle is

00:04:34.360 --> 00:04:36.600
what prevents all those inconsistencies down

00:04:36.600 --> 00:04:38.680
the line. OK, this is where it gets really interesting.

00:04:39.240 --> 00:04:41.420
We're shifting from being a prompt engineer to

00:04:41.420 --> 00:04:44.000
being a director. We have our chef. He's consistent.

00:04:44.279 --> 00:04:46.480
Now we need to build a shot library around him.

00:04:46.579 --> 00:04:47.920
And to do that, we have to learn to speak the

00:04:47.920 --> 00:04:49.800
language of cameras, because these tools, like

00:04:49.800 --> 00:04:52.879
Nano Banana Pro, they understand cinematic terms

00:04:52.879 --> 00:04:54.959
better than just vague descriptions. We need

00:04:54.959 --> 00:04:57.339
to start big. The extreme wide shot. This isn't

00:04:57.339 --> 00:04:59.810
just about showing the scenery, is it? No. It's

00:04:59.810 --> 00:05:01.990
an establishing shot. It tells a story. It shows

00:05:01.990 --> 00:05:05.750
the chef as this small human detail against these

00:05:05.750 --> 00:05:08.470
massive dark skyscrapers. It sets the stakes.

00:05:08.850 --> 00:05:10.490
Then you bring in the drama with a low angle

00:05:10.490 --> 00:05:12.870
shot. Right. You want the chef to look powerful,

00:05:12.970 --> 00:05:15.310
like a master of his craft. You look up at him.

00:05:15.470 --> 00:05:17.550
You prompt for looking up from the counter level.

00:05:17.910 --> 00:05:20.569
And the sources said the AI kept that red cybernetic

00:05:20.569 --> 00:05:24.339
eye perfectly, even from below. The anchor held.

00:05:24.720 --> 00:05:26.560
And you can flip that for a high angle. That's

00:05:26.560 --> 00:05:28.879
more for vulnerability or for showing information,

00:05:29.040 --> 00:05:30.759
right? Looking down at his hands as he prepares

00:05:30.759 --> 00:05:33.579
the food. And my personal favorite for creating

00:05:33.579 --> 00:05:35.959
tension is the Dutch angle. The tilted shot.

00:05:36.199 --> 00:05:38.139
Yeah, the canted shot. You tell the camera to

00:05:38.139 --> 00:05:40.980
tilt 30 degrees and it immediately puts the viewer

00:05:40.980 --> 00:05:43.199
on edge. It signals that something is unstable.

00:05:43.500 --> 00:05:46.620
You can even add heavy rain falling diagonally

00:05:46.620 --> 00:05:49.459
to really push that dynamic energy. The AI knows

00:05:49.459 --> 00:05:52.180
exactly what Dutch angle means. Whoa. I mean,

00:05:52.259 --> 00:05:54.779
just imagine scaling this. If you can generate

00:05:54.779 --> 00:05:57.800
10 or 15 perfectly consistent shots like this,

00:05:58.079 --> 00:06:00.699
all ready to be dropped into an editor, that's

00:06:00.699 --> 00:06:03.139
real cinematic production power. It is. Then

00:06:03.139 --> 00:06:05.199
you need connection. So you use the

00:06:05.199 --> 00:06:07.860
over-the-shoulder shot. You prompt for a blurry customer

00:06:07.860 --> 00:06:10.360
in the foreground, and it pulls the viewer right

00:06:10.360 --> 00:06:12.819
into that interaction as the chef hands over

00:06:12.819 --> 00:06:16.360
the ramen. And finally, the really intimate detail

00:06:16.360 --> 00:06:19.680
shot, the macro shot. Yeah, the extreme close-up.

00:06:19.680 --> 00:06:21.560
For the cybernetic eye, you don't just say

00:06:21.560 --> 00:06:24.420
macro. You have to get technical. Increase the

00:06:24.420 --> 00:06:26.920
depth of field. Prompt: emphasize the texture

00:06:26.920 --> 00:06:29.939
of the lens, the reflection of the neon. So the

00:06:29.939 --> 00:06:32.000
big takeaway is you're building this library,

00:06:32.379 --> 00:06:35.399
this shot library of 10, maybe 15 consistent

00:06:35.399 --> 00:06:38.740
angles all anchored to that one North Star image.

00:06:39.079 --> 00:06:41.600
Then you're ready for motion. Exactly. And if

00:06:41.600 --> 00:06:44.000
you really want to convey that emotional intensity

00:06:44.000 --> 00:06:47.180
or instability, the Dutch angle is the classic

00:06:47.180 --> 00:06:49.639
cinematic tool for the job. Now we make the move.

00:06:49.819 --> 00:06:53.439
We switch over to Veo 3.1 Fast because it's quick

00:06:53.439 --> 00:06:55.790
and it, well, it understands physics pretty well.

00:06:56.009 --> 00:06:57.730
So there are two methods here. The simple one

00:06:57.730 --> 00:07:00.550
is method A, the start frame. Right. You give

00:07:00.550 --> 00:07:02.529
Veo one of your images and a really simple prompt

00:07:02.529 --> 00:07:04.949
like steam rises or rain falls. It's good for

00:07:04.949 --> 00:07:06.949
adding a bit of atmosphere, just pushing the

00:07:06.949 --> 00:07:09.649
sled a little. But method B is the real secret

00:07:09.649 --> 00:07:12.370
sauce for anything complex. We call it the start

00:07:12.370 --> 00:07:15.389
and end frame or laying the train track. OK,

00:07:15.550 --> 00:07:18.410
explain that. Well, if you just tell the AI chef

00:07:18.410 --> 00:07:21.720
walks away, the model has to guess. It starts

00:07:21.720 --> 00:07:24.360
to improvise the physics, the trajectory, and

00:07:24.360 --> 00:07:26.120
that's when you get that weird, unpredictable

00:07:26.120 --> 00:07:29.139
morphing. The AI slop. The slop comes back. But

00:07:29.139 --> 00:07:31.120
if you give it a perfect start frame, our chef

00:07:31.120 --> 00:07:34.019
standing, and a perfect end frame, our chef walking

00:07:34.019 --> 00:07:37.160
into the distance, both already generated consistently

00:07:37.160 --> 00:07:39.879
in Nano Banana Pro, then the AI has no choice. It

00:07:39.879 --> 00:07:41.680
has no choice but to connect the dots. It's like

00:07:41.680 --> 00:07:44.100
a train on a fixed track. It guarantees you have

00:07:44.100 --> 00:07:46.519
control over big camera moves, complex actions,

00:07:46.819 --> 00:07:48.680
anything like that. So the two-frame method

00:07:48.680 --> 00:07:51.319
is so vital because it forces the AI to connect

00:07:51.319 --> 00:07:54.139
two consistent states, which stops it from just

00:07:54.139 --> 00:07:57.279
making things up and morphing. Let's talk about

00:07:57.279 --> 00:08:00.600
a hack here for efficiency. The sources mentioned

00:08:00.600 --> 00:08:03.839
using an LLM, like Claude or GPT, to actually

00:08:03.839 --> 00:08:06.540
write your prompts for you. This is a huge time

00:08:06.540 --> 00:08:09.279
saver. Because trying to describe a subtle camera

00:08:09.279 --> 00:08:11.540
move in just the right words can be really hard

00:08:11.540 --> 00:08:13.939
for a person. So you upload your two frames,

00:08:14.240 --> 00:08:17.220
say, the macro shot of the eye and a medium shot

00:08:17.220 --> 00:08:20.319
of the face. And you just ask the LLM to write

00:08:20.319 --> 00:08:23.199
a concise prompt that explains the camera move

00:08:23.199 --> 00:08:25.240
needed to get from A to B. And you just copy

00:08:25.240 --> 00:08:27.279
and paste that directly into Veo. It takes all

00:08:27.279 --> 00:08:29.579
the guesswork out of it. And it lets you do really

00:08:29.579 --> 00:08:32.919
professional tricks like a rack focus. You generate

00:08:32.919 --> 00:08:35.220
image A, where the chef is sharp and the background

00:08:35.220 --> 00:08:37.799
is blurry, and image B, where the chef is blurry

00:08:37.799 --> 00:08:39.580
and the neon signs are sharp. And the prompt

00:08:39.580 --> 00:08:43.019
is just... The prompt is rack focus from the

00:08:43.019 --> 00:08:45.240
foreground character to the background neon sign.

00:08:45.839 --> 00:08:48.039
And it works perfectly because both of those

00:08:48.039 --> 00:08:50.419
states, A and B, came from the same foundation

00:08:50.419 --> 00:08:53.320
image. Even with this whole system though, there

00:08:53.320 --> 00:08:55.600
have to be common mistakes. What are the big

00:08:55.600 --> 00:08:57.720
pitfalls people need to watch out for? Number

00:08:57.720 --> 00:09:00.480
one is forgetting the foundation. You get excited,

00:09:00.820 --> 00:09:02.440
you try to generate a new shot, but you forget

00:09:02.440 --> 00:09:04.480
to put your North Star image in the reference

00:09:04.480 --> 00:09:06.820
slot. And the consistency breaks. Instantly.

00:09:07.559 --> 00:09:10.320
The fix is just discipline. Always, always use

00:09:10.320 --> 00:09:12.639
the foundation image as your anchor. Okay, what's

00:09:12.639 --> 00:09:15.039
mistake number two? Being lazy with prompts,

00:09:15.200 --> 00:09:17.399
just typing something like different angle or

00:09:17.399 --> 00:09:20.259
move camera a bit, it just confuses the model.

00:09:20.379 --> 00:09:22.220
You have to be specific. You have to use the

00:09:22.220 --> 00:09:25.000
professional language. Low angle, bokeh, depth

00:09:25.000 --> 00:09:27.559
of field. You have to talk to the tool like a

00:09:27.559 --> 00:09:29.879
director talking to a cinematographer. Mistake

00:09:29.879 --> 00:09:33.240
three, the morphing hands. Hands are still a

00:09:33.240 --> 00:09:35.340
huge challenge for AI, aren't they? They are.

00:09:35.659 --> 00:09:37.919
If your chef is doing anything complex, like

00:09:37.919 --> 00:09:40.539
chopping or stirring, you absolutely have to

00:09:40.539 --> 00:09:43.100
use the start and end frame method for that specific

00:09:43.100 --> 00:09:46.100
action. It's the only way to minimize that weird

00:09:46.100 --> 00:09:48.059
digital distortion. And the last one, which you

00:09:48.059 --> 00:09:50.320
said is the most common reason a video feels

00:09:50.320 --> 00:09:53.940
amateur. Ignoring lighting consistency. If your

00:09:53.940 --> 00:09:56.720
wide shot is at night, your close-up can't suddenly

00:09:56.720 --> 00:09:58.980
look like it's daytime. It breaks the whole illusion.

00:09:59.220 --> 00:10:01.580
And the fix is just repetition. Just repetition.

00:10:01.700 --> 00:10:04.240
You have to repeat the exact atmospheric keywords

00:10:04.240 --> 00:10:07.860
in every single prompt. For our chef, that was

00:10:07.860 --> 00:10:10.799
neon lights, rainy night, blue and pink hues

00:10:10.799 --> 00:10:14.179
every single time, no matter the camera angle.

00:10:14.299 --> 00:10:16.399
That lack of repetition is what makes so much

00:10:16.399 --> 00:10:19.340
AI video feel so disjointed. It's the biggest

00:10:19.340 --> 00:10:21.860
giveaway of amateur work. You have to lock that

00:10:21.860 --> 00:10:23.919
atmosphere down. So when you put it all together,

00:10:23.980 --> 00:10:25.600
what does this actually mean? The sources seem

00:10:25.600 --> 00:10:29.009
to be saying that the era of AI video looking

00:10:29.009 --> 00:10:32.950
like a, you know, weird dream is coming to an

00:10:32.950 --> 00:10:36.309
end. It is. We are moving into an era of AI directing.

00:10:37.450 --> 00:10:39.769
Control is finally starting to outweigh chance.

00:10:40.190 --> 00:10:41.809
The recap is pretty straightforward, then. It

00:10:41.809 --> 00:10:45.750
is. One perfect foundation image to lock in your

00:10:45.750 --> 00:10:49.259
character. Use that to build a solid shot library

00:10:49.259 --> 00:10:52.519
of different cinematic angles and then animate

00:10:52.519 --> 00:10:54.940
it all with the start and end frame method for

00:10:54.940 --> 00:10:57.139
total control over the motion. It's a much more

00:10:57.139 --> 00:10:59.139
deliberate process than just hitting generate

00:10:59.139 --> 00:11:01.460
over and over, but the payoff is moving from

00:11:01.460 --> 00:11:04.480
random clips to actual controlled storytelling.

00:11:04.639 --> 00:11:07.139
Your story told exactly how you see it in your

00:11:07.139 --> 00:11:09.320
head. With that cinematic consistency completely

00:11:09.320 --> 00:11:11.639
locked in. So the next step for you, the listener,

00:11:11.779 --> 00:11:13.639
is to actually try this. Just do the first part.

00:11:13.980 --> 00:11:17.139
Open up the tool and spend some real time making

00:11:17.139 --> 00:11:20.679
one perfect character, your North Star. And then

00:11:20.679 --> 00:11:23.820
just use that one file to generate three consistent

00:11:23.820 --> 00:11:28.000
angles, a wide, a medium, and a close-up. And

00:11:28.000 --> 00:11:30.620
just feel that shift in control. It's a powerful

00:11:30.620 --> 00:11:32.360
feeling. And here's a final thought to leave

00:11:32.360 --> 00:11:34.779
you with. Now that you understand the cinematic

00:11:34.779 --> 00:11:37.519
language, that a low angle means power, a Dutch

00:11:37.519 --> 00:11:40.419
angle means tension, try applying that to the

00:11:40.419 --> 00:11:42.879
real world. Think about it when you're just taking

00:11:42.879 --> 00:11:45.139
a photo with your phone. It really does change

00:11:45.139 --> 00:11:47.480
how you see everything around you. Until next

00:11:47.480 --> 00:11:49.639
time. Thank you for joining us on this deep dive.
