WEBVTT

00:00:00.000 --> 00:00:03.200
Imagine directing an entire cinematic movie where

00:00:03.200 --> 00:00:07.099
your only camera is a keyboard. No set.

00:00:07.320 --> 00:00:11.279
No crew. Just you. Right. Just you. And a machine

00:00:11.279 --> 00:00:14.460
learning to dream on command. Welcome to this

00:00:14.460 --> 00:00:16.379
deep dive. I am so glad you are here with us

00:00:16.379 --> 00:00:17.859
today. Yeah, thanks for having me. I'm excited

00:00:17.859 --> 00:00:19.899
for this one. We are going to take our time with

00:00:19.899 --> 00:00:22.500
this topic. We're slowing things down just a

00:00:22.500 --> 00:00:26.480
bit to break apart a highly structured, repeatable

00:00:26.480 --> 00:00:30.089
workflow for AI video creation. Which is something

00:00:30.089 --> 00:00:32.570
a lot of people desperately need right now. Absolutely.

00:00:32.780 --> 00:00:35.600
We're exploring how to combine two powerful systems,

00:00:36.020 --> 00:00:38.979
Google Flow and Claude AI. The goal here is to

00:00:38.979 --> 00:00:42.100
bypass that chaotic learning curve of AI video

00:00:42.100 --> 00:00:44.780
and actually build consistent, controllable scenes.

00:00:44.960 --> 00:00:46.619
I mean, it's a huge shift. We aren't just looking

00:00:46.619 --> 00:00:48.820
at software tools today. We are looking at a

00:00:48.820 --> 00:00:51.479
complete architectural process. First, we build

00:00:51.479 --> 00:00:54.079
the writer's room. Then we construct the visual

00:00:54.079 --> 00:00:56.439
blueprint. And finally, we direct the animation

00:00:56.439 --> 00:00:58.619
piece by piece. To really understand this workflow,

00:00:58.880 --> 00:01:01.179
we first have to understand why beginners usually

00:01:01.179 --> 00:01:04.260
fail, because it comes down to choosing the wrong

00:01:04.260 --> 00:01:07.019
type of tool from the absolute start. Yeah, that

00:01:07.019 --> 00:01:09.959
is the great AI video divide. There's basically

00:01:09.959 --> 00:01:13.400
this steep learning curve that stems from fragmented

00:01:13.400 --> 00:01:15.719
tools. Right, you have individual models. Exactly,

00:01:16.019 --> 00:01:19.040
models like Kling or Runway. They give you incredibly

00:01:19.040 --> 00:01:22.400
deep control, but they require juggling multiple

00:01:22.400 --> 00:01:26.640
platforms and multiple subscriptions. It gets

00:01:26.640 --> 00:01:29.599
overwhelming. It does. But then, there are

00:01:29.599 --> 00:01:32.239
all-in-one workspaces. Google Flow solves that

00:01:32.239 --> 00:01:34.879
juggling act by combining three specific models

00:01:34.879 --> 00:01:37.019
into one place. So what are those three? Well,

00:01:37.120 --> 00:01:39.540
you've got Nano Banana Pro for your images, you've

00:01:39.540 --> 00:01:42.780
got Veo 3.1 for your video generation, and then

00:01:42.780 --> 00:01:45.640
Gemini Omni. Which brings in multimodal editing.

00:01:45.760 --> 00:01:47.879
Yes. Which, just to define that simply, means

00:01:47.879 --> 00:01:51.379
combining text, images, and video to guide the

00:01:51.379 --> 00:01:53.920
AI. Right, exactly. But the really interesting

00:01:53.920 --> 00:01:56.319
part is Claude AI's role in all this. Yeah, because

00:01:56.319 --> 00:01:57.900
Claude doesn't actually generate any art, right?

00:01:57.900 --> 00:02:00.700
No, not a single pixel. It sits alongside Google

00:02:00.700 --> 00:02:03.280
Flow, purely as a scene planner and a prompt

00:02:03.280 --> 00:02:05.519
writer. I kind of like to compare this setup

00:02:05.519 --> 00:02:08.159
to a traditional movie set. Oh, yeah. Yeah, so

00:02:08.159 --> 00:02:09.819
Google Flow is your director of photography,

00:02:10.300 --> 00:02:12.780
and Claude AI is your head writer. That's a perfect

00:02:12.780 --> 00:02:15.419
way to look at it. But let me ask you this. If

00:02:15.419 --> 00:02:19.099
Google Flow is a true all-in-one tool, why

00:02:19.099 --> 00:02:22.060
rely on an external text model like Claude at

00:02:22.060 --> 00:02:25.840
all? Well, Google Flow is optimized for manipulating

00:02:25.840 --> 00:02:28.860
visual data. Claude is just vastly superior at

00:02:28.860 --> 00:02:31.639
maintaining narrative logic over long context

00:02:31.639 --> 00:02:34.560
windows. So Claude handles the logic, freeing

00:02:34.560 --> 00:02:37.439
Google Flow to focus purely on the visuals. Precisely.

00:02:37.560 --> 00:02:39.699
You let the writer write and the camera film.

00:02:39.939 --> 00:02:42.159
Makes sense. So now that we have our writer and

00:02:42.159 --> 00:02:44.500
our DP, we have to actually teach the writer

00:02:44.500 --> 00:02:46.520
how to talk to the DP. Yeah, and this is where

00:02:46.520 --> 00:02:49.159
people get stuck. I mean, I still wrestle with

00:02:49.159 --> 00:02:51.180
prompt drift myself where my text instructions

00:02:51.180 --> 00:02:53.539
just get messy over time. Oh, absolutely. Writing

00:02:53.539 --> 00:02:55.560
prompts from scratch every time just leads to

00:02:55.560 --> 00:02:58.879
inconsistent garbage. The AI loses the thread.

00:02:59.300 --> 00:03:02.000
Right. So the solution is creating a custom skill

00:03:02.000 --> 00:03:05.210
inside Claude AI. You literally call it the AI

00:03:05.210 --> 00:03:07.169
video prompt writer. How does that actually work

00:03:07.169 --> 00:03:09.610
in the interface? You just go to Customize, then

00:03:09.610 --> 00:03:12.129
Skills, then Create with Claude. And you paste

00:03:12.129 --> 00:03:14.629
in exact instructions to always generate three

00:03:14.629 --> 00:03:17.509
specific types of prompts. Three types. OK, what's

00:03:17.509 --> 00:03:19.210
the first one? Number one is the design sheet

00:03:19.210 --> 00:03:22.129
that covers your characters, your props, the

00:03:22.129 --> 00:03:24.949
overall style. Number two is the storyboard.

00:03:25.030 --> 00:03:27.650
That's your panel by panel camera angles. And

00:03:27.650 --> 00:03:29.669
number three is the scene prompt, which are the

00:03:29.669 --> 00:03:32.919
direct instructions for Veo 3.1. That's so structured.

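NOTE
Editor's sketch, not from the episode: one way to picture the three prompt types
described above as a single bundle. The class name PromptBundle, the helper
missing_prompts, and the sample strings are hypothetical; nothing here is a Claude
or Google Flow API. It only checks that all three prompts exist before moving on.
```python
from dataclasses import dataclass
@dataclass(frozen=True)
class PromptBundle:
    """Hypothetical container for the three prompt types named in the episode."""
    design_sheet: str  # characters, props, overall style
    storyboard: str    # panel-by-panel camera angles
    scene_prompt: str  # direct instructions for the video model
def missing_prompts(bundle: PromptBundle) -> list[str]:
    """Return the names of any prompt types that are empty or whitespace-only."""
    return [name for name, text in vars(bundle).items() if not text.strip()]
bundle = PromptBundle(
    design_sheet="Woman, dog, zombie; clothing, props, color palette.",
    storyboard="12 panels: convenience store, zombie attack, escape.",
    scene_prompt="",
)
print(missing_prompts(bundle))
```
With the sample values above, only the scene prompt is reported as missing,
which would be the cue to go back to Claude before opening Google Flow.
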
00:03:33.340 --> 00:03:35.219
The source actually uses this great example.

00:03:35.699 --> 00:03:38.960
A woman and her dog escaping Manhattan during

00:03:38.960 --> 00:03:41.539
a zombie outbreak. Yeah, a classic setup. You

00:03:41.539 --> 00:03:43.659
literally just type that one simple sentence

00:03:43.659 --> 00:03:46.219
and Claude generates all the structured technical

00:03:46.219 --> 00:03:48.500
prompts for you. It really does. It removes all

00:03:48.500 --> 00:03:51.080
the friction. But wait, why do we need three

00:03:51.080 --> 00:03:54.280
separate highly specific prompts instead of just

00:03:54.280 --> 00:03:56.400
one master prompt describing the whole video?

00:03:56.509 --> 00:03:59.629
Because if you feed a model a massive block of

00:03:59.629 --> 00:04:03.030
text, it simply drops details. It hallucinates.

00:04:03.569 --> 00:04:06.090
It can't balance all those variables at once.

00:04:06.289 --> 00:04:08.009
Breaking it down prevents the AI from getting

00:04:08.009 --> 00:04:10.150
overwhelmed and mixing up complex instructions.

00:04:10.449 --> 00:04:12.770
Exactly. You have to segment the cognitive load.

00:04:13.229 --> 00:04:15.909
OK. So with Claude generating these text blueprints,

00:04:16.230 --> 00:04:18.430
we must translate those into a visual foundation

00:04:18.430 --> 00:04:21.310
before generating any video. Right. And this

00:04:21.310 --> 00:04:24.509
is where we move into Google Flow, specifically

00:04:24.509 --> 00:04:27.810
using Nano Banana Pro for images. So we are building

00:04:27.810 --> 00:04:30.910
the design sheet next. Yes. You must definitively

00:04:30.910 --> 00:04:33.310
establish the world. What does the main character

00:04:33.310 --> 00:04:37.149
look like? The dog. The zombie. What are the

00:04:37.149 --> 00:04:39.430
clothing and props like? You're defining the

00:04:39.430 --> 00:04:42.329
color palette too. Everything. And here's a crucial

00:04:42.329 --> 00:04:46.540
tip. Start with low resolution. Oh, to save generation

00:04:46.540 --> 00:04:48.680
credits. Yeah, it saves credits and it generates

00:04:48.680 --> 00:04:51.860
way faster. You use that low -res draft to check

00:04:51.860 --> 00:04:54.720
for mistakes. If an element is missing, you don't

00:04:54.720 --> 00:04:56.920
fix it in Google Flow. You go back to Claude.

00:04:57.079 --> 00:04:59.100
Right. You go back to Claude for text revision.

00:04:59.579 --> 00:05:01.860
And once the image is perfect, then you render

00:05:01.860 --> 00:05:05.519
it in full resolution. That is smart. So then

00:05:05.519 --> 00:05:08.209
we move to the storyboard. We use Claude to generate

00:05:08.209 --> 00:05:11.129
a 12-panel storyboard prompt. Like the convenience

00:05:11.129 --> 00:05:13.550
store, the zombie attack, the escape. Right.

00:05:13.709 --> 00:05:16.610
And we generate this in Google Flow using the

00:05:16.610 --> 00:05:18.910
design sheet as a visual reference to lock in

00:05:18.910 --> 00:05:20.850
that consistency. You have to attach it. It's

00:05:20.850 --> 00:05:23.230
mandatory. So what actually happens under the

00:05:23.230 --> 00:05:25.810
hood if you get impatient and skip the design

00:05:25.810 --> 00:05:28.170
sheet step? The model just invents a completely

00:05:28.170 --> 00:05:30.970
new reality for every panel. The woman and the

00:05:30.970 --> 00:05:33.149
dog will look different in every single shot.

00:05:33.509 --> 00:05:35.930
Without it, the AI literally forgets what your

00:05:35.930 --> 00:05:37.569
characters look like between every shot. Yeah,

00:05:37.569 --> 00:05:40.389
it has zero object permanence without that visual

00:05:40.389 --> 00:05:43.529
anchor. Okay. So we have our storyboard.

00:05:43.709 --> 00:05:46.329
We have our character DNA. Now we finally step

00:05:46.329 --> 00:05:48.670
onto the stage to make things move. The fun part.

00:05:48.970 --> 00:05:52.230
Switching over to Veo 3.1 for video generation

00:05:52.230 --> 00:05:55.050
in Google Flow. But before writing the scene

00:05:55.050 --> 00:05:58.310
prompt, the guide says you must upload two references.

00:05:58.430 --> 00:06:01.750
Yes. First, the design sheet for your world and

00:06:01.750 --> 00:06:04.750
character consistency, and second, the specific

00:06:04.750 --> 00:06:07.569
storyboard panel image. Which locks in your framing

00:06:07.569 --> 00:06:10.730
and composition. Exactly. Then, and only then,

00:06:10.790 --> 00:06:13.490
do you feed Veo 3.1 the scene prompt from Claude.

00:06:13.709 --> 00:06:15.790
So for the first four panels, it's walking through

00:06:15.790 --> 00:06:18.230
the dark store, looking nervous, and then a zombie

00:06:18.230 --> 00:06:21.709
jumps out. Right. Whoa. I just have to pause

00:06:21.709 --> 00:06:24.930
and think about this. Imagine scaling to a billion

00:06:24.930 --> 00:06:27.740
queries across the globe. But here we're just

00:06:27.740 --> 00:06:30.240
intimately tweaking one single perfect frame

00:06:30.240 --> 00:06:34.439
of a zombie attack. It's wild. It really is a

00:06:34.439 --> 00:06:37.339
staggering amount of compute power, just focused

00:06:37.339 --> 00:06:40.399
on the shadow of a zombie in an aisle. But it

00:06:40.399 --> 00:06:42.060
doesn't always come out perfectly on the first

00:06:42.060 --> 00:06:44.759
try. No, definitely not. So we have to use this

00:06:44.759 --> 00:06:47.600
review and improve methodology. You do not throw

00:06:47.600 --> 00:06:50.480
away a whole clip if one cut looks wrong. Never.

00:06:50.720 --> 00:06:54.160
A weak first output is totally normal. You generate

00:06:54.160 --> 00:06:56.620
a second version with specific instructions.

00:06:56.720 --> 00:06:59.259
Like telling it to do a direct cut to a

00:06:59.259 --> 00:07:01.720
close-up. Exactly. And then you combine the strongest

00:07:01.720 --> 00:07:04.459
parts of both outputs in post. But why can't

00:07:04.459 --> 00:07:08.779
Veo 3.1 just recognize a bad cut and fix it automatically

00:07:08.779 --> 00:07:11.160
in a single generation? Because it doesn't understand

00:07:11.160 --> 00:07:14.680
human anatomy or cinematic timing. It just predicts

00:07:14.680 --> 00:07:18.560
pixel patterns based on data. The AI lacks human

00:07:18.560 --> 00:07:21.540
taste. It needs us to stitch the best parts together.

00:07:21.600 --> 00:07:23.000
Right, you have to be the editor. So generating

00:07:23.000 --> 00:07:26.180
one good scene is great. But a 12-panel story

00:07:26.180 --> 00:07:28.279
will fall apart if you try to render it all at

00:07:28.279 --> 00:07:30.699
once. We have to control the AI's pacing. Yeah,

00:07:30.740 --> 00:07:32.860
you cannot do all 12 panels at once. This introduces

00:07:32.860 --> 00:07:35.360
the rule of chunking. Which means animating only

00:07:35.360 --> 00:07:39.180
four panels at a time. Right. Trying all 12 overwhelms

00:07:39.180 --> 00:07:43.060
Veo 3.1. It causes random hallucinatory transitions.

00:07:43.279 --> 00:07:45.720
It's like stacking Lego blocks of data. You do

00:07:45.720 --> 00:07:47.519
it piece by piece so the whole thing doesn't

00:07:47.519 --> 00:07:50.089
topple. I love that analogy. So for each chunk,

00:07:50.509 --> 00:07:53.230
you need three references uploaded. Yes. Number

00:07:53.230 --> 00:07:56.250
one, the cropped storyboard row. Number two,

00:07:56.490 --> 00:07:59.910
the design sheet. And number three, the final

00:07:59.910 --> 00:08:02.490
frame of the previous clip. That last one seems

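NOTE
Editor's sketch, not from the episode: the four-panel chunking rule and the three
references per chunk, written out as a planning helper. The function plan_chunks
and every file name are hypothetical placeholders; this does not call Veo or
Google Flow. The first chunk has no previous clip, so it lists two references.
```python
from typing import Optional
def plan_chunks(panel_count: int = 12, chunk_size: int = 4) -> list[dict]:
    """Split storyboard panels into chunks and list the references for each chunk."""
    plans = []
    previous_clip: Optional[str] = None
    for start in range(1, panel_count + 1, chunk_size):
        end = min(start + chunk_size - 1, panel_count)
        clip_name = f"clip_{start:02d}_{end:02d}"
        # Reference 1: cropped storyboard row. Reference 2: design sheet.
        references = [f"storyboard_row_{start:02d}_{end:02d}.png", "design_sheet.png"]
        if previous_clip is not None:
            # Reference 3: final frame of the previous clip, for continuity.
            references.append(f"{previous_clip}_final_frame.png")
        plans.append({"clip": clip_name, "panels": list(range(start, end + 1)), "references": references})
        previous_clip = clip_name
    return plans
for plan in plan_chunks():
    print(plan["clip"], plan["panels"], plan["references"])
```
For a 12-panel storyboard this yields three clips of four panels each, and every
clip after the first carries the previous clip's final frame as its third reference.
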
00:08:02.490 --> 00:08:04.649
like a real secret weapon. Oh, it absolutely

00:08:04.649 --> 00:08:08.269
is. What about fixing weak cuts? Like, if a character

00:08:08.269 --> 00:08:10.970
shifts positions suddenly mid-scene. If that

00:08:10.970 --> 00:08:13.230
happens, instruct Veo to hold on a close-up

00:08:13.230 --> 00:08:15.949
before cutting to the next action. It hides the

00:08:15.949 --> 00:08:18.560
error. And again, you combine the best elements.

00:08:18.879 --> 00:08:20.259
I want to go back to that third reference for

00:08:20.259 --> 00:08:22.480
a second. Why is feeding the final frame of the

00:08:22.480 --> 00:08:25.120
previous clip back into the machine so critically

00:08:25.120 --> 00:08:27.040
important? Because it creates a rigid anchor

00:08:27.040 --> 00:08:30.060
in time. It prevents the model from subtly changing

00:08:30.060 --> 00:08:32.500
the lighting or the camera distance. It forces

00:08:32.500 --> 00:08:34.980
the new scene to mathematically lock into the

00:08:34.980 --> 00:08:37.379
last clip's ending. Perfectly said. Down to the

00:08:37.379 --> 00:08:40.799
exact pixel. Even with chunking, the entire project

00:08:40.799 --> 00:08:44.220
can unravel if you forget the overarching philosophy

00:08:44.220 --> 00:08:48.139
of AI video. Visual consistency over everything

00:08:48.139 --> 00:08:50.500
else. Without a doubt. The most common reason

00:08:50.500 --> 00:08:53.080
videos fall apart isn't bad text prompts. It's

00:08:53.080 --> 00:08:56.120
losing visual consistency. The characters change.

00:08:56.519 --> 00:08:58.399
The environments shift. Yeah, it just looks amateur.

00:08:58.820 --> 00:09:01.120
So rule number one is that the design sheet is

00:09:01.120 --> 00:09:03.840
the visual anchor. Keep it attached as a reference

00:09:03.840 --> 00:09:06.639
for every single generation. Let's talk about

00:09:06.639 --> 00:09:08.899
some mistakes that just burn time and credits.

00:09:09.320 --> 00:09:13.159
Number one, generating from text only. That forces

00:09:13.159 --> 00:09:16.779
the AI to guess entirely on its own. Number two,

00:09:17.039 --> 00:09:19.240
animating more than four panels at once. We covered

00:09:19.240 --> 00:09:22.340
that. It destroys the pacing. Number three, skipping

00:09:22.340 --> 00:09:24.799
the revision process or just expecting a perfect

00:09:24.799 --> 00:09:26.779
first output. You have to think like an editor.

00:09:26.840 --> 00:09:30.460
It's an iterative process. And number four, moving

00:09:30.460 --> 00:09:32.740
forward with an incomplete design sheet. Right,

00:09:32.740 --> 00:09:35.039
because any error there propagates into every

00:09:35.039 --> 00:09:36.820
single scene that follows. I do want to push

00:09:36.820 --> 00:09:39.600
back gently on that first mistake, though. Since

00:09:39.600 --> 00:09:43.059
text is how we naturally talk to AI, why is generating

00:09:43.059 --> 00:09:45.679
from text only considered such a massive mistake?

00:09:46.240 --> 00:09:48.379
Because human language is just too imprecise.

00:09:48.500 --> 00:09:51.399
If you say dark store, that means a billion different

00:09:51.399 --> 00:09:53.720
pixel variations to the model. Words are too

00:09:53.720 --> 00:09:56.419
ambiguous for video. Visual references provide

00:09:56.419 --> 00:09:58.940
the only undeniable truth. Exactly. You have

00:09:58.940 --> 00:10:02.220
to show it, not just tell it. So to recap this

00:10:02.220 --> 00:10:05.279
whole structured journey, AI video isn't luck.

00:10:05.639 --> 00:10:08.639
It's a process. It really is. You build the design

00:10:08.639 --> 00:10:11.899
sheet, you map the storyboard, you use Claude

00:10:11.899 --> 00:10:15.360
for precise prompts, you generate in small

00:10:15.360 --> 00:10:19.039
four-panel chunks with Veo, and always, always keep

00:10:19.039 --> 00:10:21.340
your visual references attached. If you're going

00:10:21.340 --> 00:10:23.139
to try this for the first time, my advice is

00:10:23.139 --> 00:10:25.539
to keep it incredibly simple. One character,

00:10:25.620 --> 00:10:28.559
one location, one action scene. Master the workflow

00:10:28.559 --> 00:10:31.100
before building a complex epic. Walk before you

00:10:31.100 --> 00:10:33.259
run, for sure. It's just wild to think about

00:10:33.259 --> 00:10:36.299
where this is heading. If these dual AI

00:10:36.299 --> 00:10:39.379
tools can synthesize a cohesive, terrifying zombie

00:10:39.379 --> 00:10:41.779
escape from just a few structured constraints,

00:10:42.639 --> 00:10:45.179
what happens when this workflow gets fully

00:10:45.179 --> 00:10:47.000
automated? Yeah, that's the big question. If

00:10:47.000 --> 00:10:49.700
the AI eventually learns to manage its own visual

00:10:49.700 --> 00:10:52.759
consistency between shots, what exactly becomes

00:10:52.759 --> 00:10:54.740
the human director's role in the filmmaking of

00:10:54.740 --> 00:10:57.200
the future? Yeah, it's something to think

00:10:57.200 --> 00:10:58.659
about. Thank you so much for joining us on this

00:10:58.659 --> 00:11:00.879
deep dive. Take care.