WEBVTT

00:00:00.000 --> 00:00:02.060
Welcome to the deep dive. We are really glad

00:00:02.060 --> 00:00:03.720
you're joining us today. Yeah, thanks for having

00:00:03.720 --> 00:00:05.660
me. Excited to get into this one, because look,

00:00:05.660 --> 00:00:08.140
I know the mindset you are bringing to this conversation.

00:00:08.140 --> 00:00:10.820
You're a learner. You want to actually grasp

00:00:10.820 --> 00:00:14.199
how these complex systems work without drowning

00:00:14.199 --> 00:00:16.420
in the overwhelm, right? Nobody wants to read

00:00:16.420 --> 00:00:18.719
a thousand page technical manual just to make

00:00:18.719 --> 00:00:22.769
a video. Exactly. And that is the mission today.

00:00:23.309 --> 00:00:26.469
We are diving into an incredibly insightful source

00:00:26.469 --> 00:00:29.350
text called The Director's Roadmap to Professional

00:00:29.350 --> 00:00:32.189
AI Filmmaking. It's a great piece. It really

00:00:32.189 --> 00:00:35.850
is. We are going to shortcut that massive, frustrating

00:00:35.850 --> 00:00:39.469
learning curve of AI video generation. The goal

00:00:39.469 --> 00:00:41.530
is to take you from being a beginner who just

00:00:41.530 --> 00:00:44.609
types a prompt and prays for luck to a true digital

00:00:44.609 --> 00:00:46.789
director running a professional pipeline. And

00:00:46.789 --> 00:00:48.429
getting to that professional level, it really

00:00:48.429 --> 00:00:52.960
requires dismantling a huge trap almost everyone

00:00:52.960 --> 00:00:55.259
falls into right away. The empathy trap. Yeah,

00:00:55.340 --> 00:00:57.460
exactly. People just intuitively assume that

00:00:57.460 --> 00:01:00.320
AI understands human feelings, like a human collaborator

00:01:00.320 --> 00:01:03.179
would. Right. So you type in this deeply passionate

00:01:03.179 --> 00:01:05.799
emotional paragraph about the exact vibe you

00:01:05.799 --> 00:01:09.150
want, and you hit Enter. And the AI spits out

00:01:09.150 --> 00:01:12.930
a character with a terrifying twisted face or

00:01:12.930 --> 00:01:15.510
some background that defies the laws of physics

00:01:15.510 --> 00:01:17.510
entirely. Because it doesn't care about your

00:01:17.510 --> 00:01:19.730
passion. It's an algorithm. It follows strict

00:01:19.730 --> 00:01:22.709
mathematical logic. But when users don't grasp

00:01:22.709 --> 00:01:25.709
that logic, they get stuck. They operate under

00:01:25.709 --> 00:01:27.989
this false assumption that if they just add

00:01:27.989 --> 00:01:30.230
more adjectives, the computer will finally get

00:01:30.230 --> 00:01:32.409
it. But it doesn't. You might get one lucky clip,

00:01:32.409 --> 00:01:34.930
but the next scene is a total mess. The narrative

00:01:34.930 --> 00:01:37.250
continuity completely collapses. Yeah, you just

00:01:37.250 --> 00:01:39.209
pull the slot machine lever over and over.

00:01:39.310 --> 00:01:41.829
OK, let's unpack this. Because the roadmap lays

00:01:41.829 --> 00:01:45.189
out the solution across five very clear levels.

00:01:45.650 --> 00:01:47.930
And level one is what the source calls the idea

00:01:47.930 --> 00:01:50.810
prompt. Which is where most people start, and

00:01:50.810 --> 00:01:52.650
unfortunately, where most people stay. Yeah.

00:01:52.950 --> 00:01:54.879
Operating at level one is basically treating

00:01:54.879 --> 00:01:58.140
the AI like a chaotic freelance artist. Sometimes

00:01:58.140 --> 00:02:00.719
it's stunning, usually it's useless for a real

00:02:00.719 --> 00:02:03.340
movie. And the biggest beginner mistake here

00:02:03.340 --> 00:02:06.180
is the length myth. Oh, the giant paragraphs.

00:02:06.659 --> 00:02:09.120
Right. People think detail equals precision.

00:02:10.340 --> 00:02:12.479
But modern tools, they don't need a novel. They

00:02:12.479 --> 00:02:15.259
need clarity. The source gives a perfect example

00:02:15.259 --> 00:02:17.800
of this using Runway. The prompt is just so simple.

00:02:17.979 --> 00:02:20.439
It's literally a fat cat wearing a space suit

00:02:20.669 --> 00:02:22.909
is sitting and fishing on a planet with a pink

00:02:22.909 --> 00:02:25.810
sky surrounded by floating rocks and shining

00:02:25.810 --> 00:02:29.169
stars in a 3D Pixar movie style. What's fascinating

00:02:29.169 --> 00:02:31.449
here is that the video quality doesn't come from

00:02:31.449 --> 00:02:34.449
those specific words. The visual fidelity comes

00:02:34.449 --> 00:02:37.409
from the AI model itself. The words are just

00:02:37.409 --> 00:02:39.310
guardrails. Just keeping it from hallucinating.

00:02:39.469 --> 00:02:41.610
Exactly. Let's break down why that prompt works.

00:02:42.409 --> 00:02:45.469
First, you have the subject, fat cat in a spacesuit,

00:02:45.810 --> 00:02:48.169
that is a massively recognizable archetype in

00:02:48.169 --> 00:02:50.580
the training data. So the AI grabs that easily,

00:02:50.620 --> 00:02:52.960
but what about the movement? That's the genius

00:02:52.960 --> 00:02:55.460
of the action they chose. Sitting and fishing.

00:02:55.599 --> 00:02:58.060
It's calm. It's highly contained. If you ask

00:02:58.060 --> 00:03:00.759
for backflips and laser fire, the AI's predictive

00:03:00.759 --> 00:03:03.060
processing glitches out. You get blurred pixels.

00:03:03.419 --> 00:03:05.659
By keeping the movement calm, the render stays

00:03:05.659 --> 00:03:08.680
pristine. And then adding 3D Pixar style and

00:03:08.680 --> 00:03:10.900
pink sky, that maps the lighting immediately without

00:03:10.900 --> 00:03:13.300
needing a technical essay on light sources. Right.

00:03:13.360 --> 00:03:16.039
It's incredibly efficient. The roadmap also brings

00:03:16.039 --> 00:03:19.830
up Luma AI for level one. The prompt there was

00:03:19.830 --> 00:03:22.650
a close-up shot of a robot hand carefully holding

00:03:22.650 --> 00:03:25.729
a rose made of glass. The sun reflects through

00:03:25.729 --> 00:03:27.870
the petals and creates beautiful light on the

00:03:27.870 --> 00:03:30.780
dusty ground. Now that is a prompt designed to

00:03:30.780 --> 00:03:33.060
exploit what the AI does best. It completely

00:03:33.060 --> 00:03:35.240
ignores the boring buzzwords people usually use.

00:03:35.360 --> 00:03:38.419
No ultra sharp. No 8K resolution. It focuses on

00:03:38.419 --> 00:03:41.039
materials. Yes. The two hardest things in AI

00:03:41.039 --> 00:03:44.240
filmmaking are materials and lighting. Glass.

00:03:44.479 --> 00:03:48.039
Metal. Dust. It forces the AI to dedicate its

00:03:48.039 --> 00:03:51.219
power to simulating real-world light refraction.

00:03:51.500 --> 00:03:54.419
But we do have to call out the major limitation

00:03:54.419 --> 00:03:56.659
of level one. The lack of control. Total lack

00:03:56.659 --> 00:03:59.180
of control. If you hit generate on that fat cat

00:03:59.180 --> 00:04:01.979
prompt 10 times, you get 10 totally different

00:04:01.979 --> 00:04:04.620
cats. Which is fine for a single TikTok video,

00:04:04.840 --> 00:04:07.360
but useless for a consistent long-form movie

00:04:07.360 --> 00:04:09.139
where we need to follow one character. Which

00:04:09.139 --> 00:04:11.340
brings us to level two, structured prompting.

00:04:11.460 --> 00:04:13.400
This is where you stop talking to the AI like

00:04:13.400 --> 00:04:15.000
it's a search engine. You start speaking the

00:04:15.000 --> 00:04:17.040
computer's native language. Serious filmmakers

00:04:17.040 --> 00:04:19.980
use a template, the cinematic formula. Subject,

00:04:20.100 --> 00:04:23.370
action, scene, camera, style. Exactly in that

00:04:23.370 --> 00:04:26.329
order. The source uses Kling AI to demonstrate

00:04:26.329 --> 00:04:29.389
this. Subject, an old detective wearing a long

00:04:29.389 --> 00:04:32.670
coat. Action, standing and smoking. Environment,

00:04:32.970 --> 00:04:35.990
1950s London in the rain. Camera, medium shot

00:04:35.990 --> 00:04:38.569
moving to the face. Style, black and white noir.

00:04:38.839 --> 00:04:41.240
And the beauty of this formula is troubleshooting.

00:04:41.420 --> 00:04:43.660
Yes. If the clip fails, you don't throw away

00:04:43.660 --> 00:04:45.860
the whole idea. You just isolate the variable.

00:04:46.259 --> 00:04:49.139
Oh, the lighting is weird. Fix the style variable.

00:04:49.259 --> 00:04:51.319
Keep the camera and subject exactly the same.

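NOTE
Editor's aside: the five-part formula the hosts just walked through can be sketched as a tiny template. This is a minimal illustration only, not any video tool's real API; the ShotPrompt class, its field names, and the render helper are assumptions invented for the sketch.

```python
# Sketch of the five-part cinematic formula: subject, action, environment,
# camera, style -- always in that order. Illustrative only.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ShotPrompt:
    subject: str
    action: str
    environment: str
    camera: str
    style: str

    def render(self) -> str:
        # Join the five fields in the fixed order the formula prescribes.
        return ", ".join([self.subject, self.action, self.environment,
                          self.camera, self.style])

detective = ShotPrompt(
    subject="an old detective wearing a long coat",
    action="standing and smoking",
    environment="1950s London in the rain",
    camera="medium shot moving to the face",
    style="black and white noir",
)

# Troubleshooting move: the lighting is off, so swap only the style
# variable and keep every other field exactly the same.
fixed = replace(detective, style="high-contrast black and white noir")

print(detective.render())
print(fixed.render())
```

The frozen dataclass plus replace() mirrors the isolate-the-variable idea: one field changes, the rest stays byte-identical.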
00:04:51.839 --> 00:04:54.079
And to take that structural control even further,

00:04:54.279 --> 00:04:56.720
the author recommends using JSON formatting.

00:04:56.839 --> 00:04:59.899
Using code, basically. Well, using ChatGPT to

00:04:59.899 --> 00:05:02.779
format your ideas into JSON. It gives your project

00:05:02.779 --> 00:05:05.439
a consistent DNA. Right, the example was the

00:05:05.439 --> 00:05:07.720
old warrior with broken armor kneeling in sunflowers.

00:05:07.779 --> 00:05:10.540
By formatting it as JSON, the AI parses the exact

00:05:10.540 --> 00:05:13.920
same structural data every single time. And standardizing

00:05:13.920 --> 00:05:16.319
that data unlocks the multi-shot technique.

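NOTE
Editor's aside: a minimal sketch of what "JSON as project DNA" could look like, using the warrior example and the multi-shot idea from the conversation. The schema here (character_dna, shots, the key names) is invented for illustration; the source does not specify an exact format.

```python
# Serialize the same structured fields every time so the model parses
# identical "DNA" across generations. Schema is illustrative only.
import json

project = {
    "character_dna": {
        "subject": "an old warrior with broken armor",
        "action": "kneeling",
        "environment": "a field of sunflowers",
    },
    "shots": [
        "a girl opens a door",
        "she turns on a light and sees a gift",
        "close-up of happy tears",
    ],
}

# sort_keys makes the serialized form identical run after run, which is
# the whole point of giving the project a consistent DNA.
prompt_json = json.dumps(project, sort_keys=True, indent=2)
print(prompt_json)
```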
00:05:16.680 --> 00:05:19.060
This is huge. I love this part. You don't have

00:05:19.060 --> 00:05:21.000
to stitch together five-second clips anymore.

00:05:21.439 --> 00:05:24.600
No, you can generate a whole sequence. Like the

00:05:24.600 --> 00:05:28.029
example, a girl... opens a door, turns on a light

00:05:28.029 --> 00:05:30.550
to see a gift, and then a close-up of happy

00:05:30.550 --> 00:05:33.329
tears. All in one go. The model holds the whole

00:05:33.329 --> 00:05:36.110
sequence in its context window. It bakes the

00:05:36.110 --> 00:05:38.550
temporal consistency in from the start. Here's

00:05:38.550 --> 00:05:40.569
where it gets really interesting, because text

00:05:40.569 --> 00:05:44.129
prompts still hit a ceiling. Even with JSON,

00:05:44.569 --> 00:05:47.649
forcing the AI to keep a character's face perfectly

00:05:47.649 --> 00:05:50.149
identical across different scenes using only

00:05:50.149 --> 00:05:53.040
text. It's a nightmare. It is. Which is why Level

00:05:53.040 --> 00:05:55.759
3 is reference control. We stop relying on words.

00:05:55.959 --> 00:05:58.819
We force the AI's hand, using images and videos

00:05:58.819 --> 00:06:01.420
as maps. You basically become a casting director.

00:06:01.879 --> 00:06:04.360
Let's look at image-to-video using Pika. You don't

00:06:04.360 --> 00:06:06.060
roll the dice on a text prompt for the face.

00:06:06.420 --> 00:06:08.579
You use Midjourney to generate your exact actor.

00:06:08.839 --> 00:06:10.819
You lock the look. Lock the look. Bring that

00:06:10.819 --> 00:06:13.300
static image into Pika and tell the AI to keep the

00:06:13.300 --> 00:06:15.100
clothes, keep the hair, just make them smile

00:06:15.100 --> 00:06:17.279
and nod. But what if you need complex movement,

00:06:17.379 --> 00:06:19.350
like... Trying to type out the exact physics

00:06:19.350 --> 00:06:21.490
of casting a fishing rod in English is impossible.

00:06:21.870 --> 00:06:24.490
The AI will totally mess up the physics of the

00:06:24.490 --> 00:06:26.790
human arm. That's why video-to-video in Runway

00:06:26.790 --> 00:06:29.529
is brilliant. You literally act it out. Just

00:06:29.529 --> 00:06:31.110
record yourself on your phone in your living

00:06:31.110 --> 00:06:33.970
room. Mime the fishing motion, upload it, and

00:06:33.970 --> 00:06:36.709
Runway maps your natural human kinematics right

00:06:36.709 --> 00:06:40.069
onto the space cat. No complex English required.

00:06:40.449 --> 00:06:42.910
And you can steal million-dollar Hollywood camera

00:06:42.910 --> 00:06:45.850
moves the same way, right? Camera sync. The source

00:06:45.850 --> 00:06:48.829
talks about Seedance for this. They applied a professional

00:06:48.829 --> 00:06:51.350
tracking shot to a white-haired anime girl.

00:06:51.629 --> 00:06:54.949
And the big update here is Seedance 2.0, which

00:06:54.949 --> 00:06:58.649
dropped late February 2026. It handles multiple

00:06:58.649 --> 00:07:01.410
references at once seamlessly. So it locks the

00:07:01.410 --> 00:07:03.269
face from one image and grabs the camera move

00:07:03.269 --> 00:07:05.730
from a video all at the same time. Exactly. But

00:07:05.730 --> 00:07:07.990
man, doing that manually for a hundred shots

00:07:07.990 --> 00:07:09.810
would cause instant burnout. Which is why

00:07:09.810 --> 00:07:13.769
Level 4 is all about custom GPTs, custom assistants.

00:07:14.209 --> 00:07:16.370
You turn ChatGPT into your script assistant,

00:07:16.730 --> 00:07:18.689
you give it your basic idea, and it hands you

00:07:18.689 --> 00:07:21.430
back three prompt choices optimized for Kling

00:07:21.430 --> 00:07:24.910
AI. Wide shot, close-up, tracking shot, all with

00:07:24.910 --> 00:07:27.430
professional lighting cues built in. You step

00:07:27.430 --> 00:07:31.250
out of the weeds, you become the boss, just choosing

00:07:31.250 --> 00:07:33.389
the best option. Which gives you the energy for

00:07:33.389 --> 00:07:36.370
level five. The full pipeline. If we connect

00:07:36.370 --> 00:07:38.670
this to the bigger picture, a professional movie

00:07:38.670 --> 00:07:41.670
isn't just one cool clip. It's a massive start

00:07:41.670 --> 00:07:43.910
-to-finish process. Starting with storyboarding.

00:07:44.470 --> 00:07:46.790
Before you spend a single credit, use Midjourney

00:07:46.790 --> 00:07:50.389
to make a 3x3 grid. A 9-frame comic book page.

00:07:50.569 --> 00:07:52.930
Like the spaceship crashing on a strange planet.

00:07:53.350 --> 00:07:55.670
It ensures your story flows and saves a ton of

00:07:55.670 --> 00:07:59.209
money. Then you need voices. Silent AI characters

00:07:59.209 --> 00:08:02.860
have that uncanny stiffness. The source uses

00:08:02.860 --> 00:08:06.279
ElevenLabs, but with a highly specific trick. The

00:08:06.279 --> 00:08:08.120
square brackets. I thought this was brilliant.

00:08:08.279 --> 00:08:10.579
It changes everything. You don't just type the

00:08:10.579 --> 00:08:13.180
dialogue. You prompt it like bracket, exhausted,

00:08:13.540 --> 00:08:15.420
whispering, end bracket. I checked everything.

00:08:15.540 --> 00:08:18.019
It gives the character soul. Natural breathing.

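NOTE
Editor's aside: the square-bracket delivery cue can be illustrated with a tiny helper. The tag_line function is hypothetical; it only builds the text string you would paste into a text-to-speech tool and does not call any real ElevenLabs API.

```python
# Build a dialogue line prefixed with bracketed performance cues,
# e.g. [exhausted, whispering], as described in the conversation.
def tag_line(dialogue: str, *cues: str) -> str:
    """Prefix dialogue with bracketed delivery cues for a TTS prompt."""
    return f"[{', '.join(cues)}] {dialogue}"

line = tag_line("I checked everything.", "exhausted", "whispering")
print(line)  # [exhausted, whispering] I checked everything.
```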
00:08:18.199 --> 00:08:20.160
And finally, you have to sync that voice to the

00:08:20.160 --> 00:08:23.160
face using Creatify Aurora. But the source has

00:08:23.160 --> 00:08:25.139
a crucial warning here. Only do it when they

00:08:25.139 --> 00:08:28.180
are standing still. Right. Lip-sync models hate

00:08:28.180 --> 00:08:30.649
fast movement. If your character is running or

00:08:30.649 --> 00:08:33.210
jumping, the lip-sync will turn into a glitchy,

00:08:33.230 --> 00:08:36.049
distorted mess. Keep the dialogue to the slow,

00:08:36.389 --> 00:08:38.769
stationary close-ups. Always. So evaluating

00:08:38.769 --> 00:08:41.789
this massive 2026 arsenal we just went through.

00:08:42.049 --> 00:08:44.309
This raises an important question. Where do you

00:08:44.309 --> 00:08:46.610
actually spend your subscription money? Let's

00:08:46.610 --> 00:08:49.090
do a rapid-fire review of the tools based on

00:08:49.090 --> 00:08:53.370
the roadmap. Kling AI: amazing physics and long

00:08:53.370 --> 00:08:56.909
videos, but the faces can get weird. It's best

00:08:56.909 --> 00:08:59.700
for action sequences. Runway: best motion control

00:08:59.700 --> 00:09:02.500
by far, but very expensive. It's really a VFX tool.

00:09:02.500 --> 00:09:05.360
Luma Dream Machine: unbelievable lighting and glass

00:09:05.360 --> 00:09:09.269
textures, but glitchy movement. Use it for artistic

00:09:09.269 --> 00:09:12.269
slow shots. Pika: fun physics, highly reliable

00:09:12.269 --> 00:09:15.230
but less cinematic. Great for viral clips and animation.

00:09:15.549 --> 00:09:18.250
And Seedance: incredible facial consistency, steep

00:09:18.250 --> 00:09:20.549
learning curve, but it's the king of anime and

00:09:20.549 --> 00:09:22.330
character films right now. But tools are just

00:09:22.330 --> 00:09:25.029
tools. The roadmap ends with three golden rules.

00:09:25.509 --> 00:09:27.990
Rule one, don't let the AI do everything. It

00:09:27.990 --> 00:09:30.409
has no idea what pacing is. Use it for raw materials,

00:09:30.470 --> 00:09:32.210
but you have to use your human brain to edit

00:09:32.210 --> 00:09:35.190
it together. Rule two, avoid the AI smell. That

00:09:35.190 --> 00:09:38.500
plastic, perfectly smooth look. It screams fake.

00:09:38.840 --> 00:09:41.860
So you add dust, film grain, handheld camera

00:09:41.860 --> 00:09:43.779
shake to your prompts. You have to dirty it up

00:09:43.779 --> 00:09:46.159
to make it real. And rule three, check copyrights.

00:09:46.519 --> 00:09:49.539
The 2026 laws are no joke. Yeah. Make sure you

00:09:49.539 --> 00:09:51.679
actually have commercial rights to your generations.

00:09:52.139 --> 00:09:54.759
So what does this all mean? For you listening

00:09:54.759 --> 00:09:57.320
to this right now, the journey from level one

00:09:57.320 --> 00:10:00.500
to level five is an identity shift. You are moving

00:10:00.500 --> 00:10:03.679
from someone playing with a neat toy to an executive

00:10:03.679 --> 00:10:06.340
director running a creative pipeline. Master

00:10:06.340 --> 00:10:09.220
one level at a time. The tools will update every

00:10:09.220 --> 00:10:11.960
week, but the core structural mindset stays the

00:10:11.960 --> 00:10:15.799
same. Build your own unique visual DNA. The final

00:10:15.799 --> 00:10:17.940
analogy in the source nails it. The computer

00:10:17.940 --> 00:10:20.480
provides the fire, but you must direct the heat.

00:10:20.779 --> 00:10:22.940
Without you, it just burns in random directions.

00:10:23.279 --> 00:10:25.379
You know, thinking about that deliberate degradation,

00:10:25.919 --> 00:10:27.700
adding the dust and the mistakes to make it look

00:10:27.700 --> 00:10:30.000
real, it leaves me with a pretty massive thought.

00:10:30.159 --> 00:10:32.679
What's that? If the technical barriers of Hollywood

00:10:32.679 --> 00:10:36.750
are gone, if literally anyone can generate perfect,

00:10:37.070 --> 00:10:40.110
flawless, 4K cinematic lighting from their bedroom,

00:10:40.690 --> 00:10:42.830
will perfection just become completely boring?

00:10:43.289 --> 00:10:45.669
In a world where AI effortlessly creates a flawless

00:10:45.669 --> 00:10:49.330
image, will human mistakes, our messy, unpredictable,

00:10:49.350 --> 00:10:51.649
lived -in imperfections, will that actually become

00:10:51.649 --> 00:10:54.769
the new premium currency of storytelling? That

00:10:54.769 --> 00:10:57.490
is something to chew on. Perfect is cheap. Messy

00:10:57.490 --> 00:10:59.710
is human. Thank you so much for joining us on

00:10:59.710 --> 00:11:02.210
this deep dive. Keep directing that heat, and

00:11:02.210 --> 00:11:03.149
we will see you next time.
