WEBVTT

00:00:00.000 --> 00:00:02.240
Think about the death of the Hollywood budget.

00:00:02.580 --> 00:00:06.759
It is a wild reality to consider. Picture this.

00:00:06.919 --> 00:00:09.919
Okay. You are waiting for the bus. You are holding

00:00:09.919 --> 00:00:12.519
absolutely nothing but your phone. Right. And

00:00:12.519 --> 00:00:14.980
you are producing a 30-minute cinematic movie.

00:00:15.160 --> 00:00:17.800
Which sounds completely impossible. Right. It

00:00:17.800 --> 00:00:20.679
is in full 4K. The dialogue is perfectly lip-

00:00:20.679 --> 00:00:23.839
synced. Yeah. And your total budget is $0. It

00:00:23.839 --> 00:00:26.179
really does sound impossible. But it is happening

00:00:26.179 --> 00:00:27.800
right now. Yeah. I mean, people are actually

00:00:27.800 --> 00:00:30.960
bypassing the traditional studio system entirely.

00:00:32.079 --> 00:00:34.939
Welcome to today's Deep Dive. Thanks for having

00:00:34.939 --> 00:00:37.780
me. We are looking at something truly fascinating

00:00:37.780 --> 00:00:40.850
today. We really are. It is a mobile guide from

00:00:40.850 --> 00:00:44.750
March 2026. It's by Maxanne. Right. And it fundamentally

00:00:44.750 --> 00:00:47.250
changes everything about visual storytelling.

00:00:47.530 --> 00:00:49.789
It does. And what is really brilliant about it

00:00:49.789 --> 00:00:52.829
is the access. It completely removes the financial

00:00:52.829 --> 00:00:55.329
barrier to entry. Because historically, animation

00:00:55.329 --> 00:00:58.030
costs thousands of dollars per minute. Exactly.

00:00:58.030 --> 00:01:02.130
But now, the cost is just your time. Today, we

00:01:02.130 --> 00:01:05.420
are exploring a specific three-part mobile

00:01:05.420 --> 00:01:08.400
pipeline. Yeah. The core stack. Right. First,

00:01:08.680 --> 00:01:11.760
we are going to architect the story in ChatGPT.

00:01:12.060 --> 00:01:15.219
Then we bring those static ideas to life with

00:01:15.219 --> 00:01:18.579
Grok AI. And then the final step. Right. Finally,

00:01:18.700 --> 00:01:21.280
we polish all the rough edges in CapCut. It

00:01:21.280 --> 00:01:23.980
is just three free tools on your phone. Yeah.

00:01:24.079 --> 00:01:27.420
But it represents one massive paradigm shift

00:01:27.420 --> 00:01:30.519
for creators. So let's unpack this step by step.

00:01:30.599 --> 00:01:32.340
Let's do it. We have to start with phase one.

00:01:33.439 --> 00:01:35.459
Architecting the story. Right. Because you simply

00:01:35.459 --> 00:01:38.180
can't shoot a movie without a rock-solid script.

00:01:38.180 --> 00:01:40.840
It is the essential first step. Absolutely. We

00:01:40.840 --> 00:01:43.239
move from the grand promise of a mobile studio

00:01:43.239 --> 00:01:45.659
directly into the mind of the architect. That

00:01:45.659 --> 00:01:48.239
is the perfect way to frame it. Before you even

00:01:48.239 --> 00:01:51.400
open a single video generation tool, you need

00:01:51.400 --> 00:01:53.900
that underlying architecture. Because if you skip

00:01:53.900 --> 00:01:56.379
this, you just get a folder of random clips. Exactly.

00:01:56.829 --> 00:01:58.769
They look totally disconnected. The narrative

00:01:58.769 --> 00:02:01.129
just goes nowhere. The script gives the entire

00:02:01.129 --> 00:02:05.269
AI workflow its structural integrity. Yes, exactly.

00:02:05.530 --> 00:02:08.830
And in this pipeline, ChatGPT acts as your master

00:02:08.830 --> 00:02:11.530
showrunner. It wears a few different hats. It

00:02:11.530 --> 00:02:15.310
really does. It serves four very distinct, vital

00:02:15.310 --> 00:02:18.509
roles. Okay. First, generating the overarching

00:02:18.509 --> 00:02:21.110
story. Right. Second, expanding the emotional

00:02:21.110 --> 00:02:24.560
details. Third. Writing the visual image prompts.

00:02:24.740 --> 00:02:27.479
And the fourth. Fourth is crafting the dialogue

00:02:27.479 --> 00:02:29.800
and the motion prompts. Let's pause on that very

00:02:29.800 --> 00:02:34.259
first step. Generating the story. Sure. The guide

00:02:34.259 --> 00:02:37.240
stresses asking for a multi-part story right

00:02:37.240 --> 00:02:40.020
up front. Yeah. That is a huge point. Why ask

00:02:40.020 --> 00:02:42.620
for multiple parts immediately? Because it establishes

00:02:42.620 --> 00:02:45.460
a massive, cohesive world right from the jump.

00:02:45.560 --> 00:02:47.680
Oh, I see. It builds a full content calendar

00:02:47.680 --> 00:02:50.319
from just a single chat session. That makes sense.

00:02:50.500 --> 00:02:53.780
You feed ChatGPT a clear concept up front. You

00:02:53.780 --> 00:02:56.120
tell it you want five distinct chapters. Right.

00:02:56.219 --> 00:02:58.699
And suddenly every single part is ready to become

00:02:58.699 --> 00:03:01.599
its own standalone video episode. That is serious

00:03:01.599 --> 00:03:03.960
efficiency. It really is. It keeps you from staring

00:03:03.960 --> 00:03:06.180
at a blank screen tomorrow. Right. It makes the

00:03:06.180 --> 00:03:09.599
heavy lifting worth it. But step two is where

00:03:09.599 --> 00:03:12.379
the actual cinematic magic happens. Expanding

00:03:12.379 --> 00:03:15.139
the emotion. Exactly. You have to force the AI

00:03:15.139 --> 00:03:18.219
to expand that core story. So you are asking

00:03:18.219 --> 00:03:21.639
ChatGPT to add more scenes? Yes. But more importantly,

00:03:21.699 --> 00:03:24.159
you are asking for deeper emotional moments.

00:03:24.219 --> 00:03:26.699
This is where beginners always fail. Really?

00:03:26.740 --> 00:03:30.520
Yeah. A thin, simple script leads directly to

00:03:30.520 --> 00:03:33.120
a thin, boring video. That makes total sense.

00:03:33.400 --> 00:03:35.259
When you spend a few extra prompts demanding

00:03:35.259 --> 00:03:38.340
emotional depth, you give the AI stylistic direction.

00:03:38.620 --> 00:03:41.659
It makes the final output feel moody and cinematic

00:03:41.659 --> 00:03:44.379
instead of flat and robotic. Right. The emotion

00:03:44.379 --> 00:03:47.580
dictates the pixels. Exactly. Then we move to

00:03:47.580 --> 00:03:50.639
step three, generating the image prompts. This

00:03:50.639 --> 00:03:53.120
is a crucial safety net for beginners. How so?

00:03:53.319 --> 00:03:56.780
You ask ChatGPT to basically act as your director

00:03:56.780 --> 00:03:59.300
of photography. Oh, wow. Yeah. It has to write

00:03:59.300 --> 00:04:01.659
a highly detailed visual description for every

00:04:01.659 --> 00:04:03.740
single scene. Like lighting and camera angles?

00:04:04.000 --> 00:04:07.050
Lighting, camera angles, weather. Everything.

00:04:07.150 --> 00:04:10.090
And then we reach step four. The big one. Yeah,

00:04:10.169 --> 00:04:13.069
this seems like the most critical piece. Generating

00:04:13.069 --> 00:04:15.490
the actual dialogue and motion prompts. This

00:04:15.490 --> 00:04:18.189
requires a very specific dual output from the

00:04:18.189 --> 00:04:21.009
AI for every scene. Dual output. Right. You need

00:04:21.009 --> 00:04:23.800
the motion prompt first. Okay. This explicitly

00:04:23.800 --> 00:04:26.379
explains how the camera and the environment should

00:04:26.379 --> 00:04:28.379
move. Give me an example of that. Something like

00:04:28.379 --> 00:04:31.800
camera slowly pushes in. While heavy dust particles

00:04:31.800 --> 00:04:34.740
drift across the darkened frame. You've got it.

00:04:34.819 --> 00:04:38.100
It adds that necessary atmosphere. Right. And

00:04:38.100 --> 00:04:41.240
the second required output is the dialogue script.

00:04:41.540 --> 00:04:44.000
What they actually say. Exactly. This is what

00:04:44.000 --> 00:04:46.740
each character actually says out loud. You keep

00:04:46.740 --> 00:04:49.300
this dual output format open on your screen.

00:04:49.980 --> 00:04:52.639
It becomes your working master script. It is

00:04:52.639 --> 00:04:55.220
like giving an architect not just the blueprints,

00:04:55.240 --> 00:04:57.879
but the exact brushstrokes for the painters.

00:04:58.100 --> 00:05:00.759
That is a great analogy. Though I will

00:05:00.759 --> 00:05:02.540
admit something vulnerable here. What is that?

00:05:02.699 --> 00:05:05.279
I still wrestle with prompt drift myself when

00:05:05.279 --> 00:05:07.319
writing stories. Oh, it happens to absolutely

00:05:07.319 --> 00:05:10.040
everyone who uses these tools. It really does.

00:05:10.160 --> 00:05:12.139
For those who haven't experienced it, prompt

00:05:12.139 --> 00:05:16.310
drift is when the AI slowly forgets your initial

00:05:16.310 --> 00:05:19.310
instructions over time. The context window just

00:05:19.310 --> 00:05:21.730
gets overloaded. Yeah, it gets incredibly frustrating.

00:05:21.990 --> 00:05:24.670
It is a nightmare. So I have a probing question

00:05:24.670 --> 00:05:28.709
about that. Shoot. How do we prevent the AI from

00:05:28.709 --> 00:05:30.829
completely forgetting what the main character

00:05:30.829 --> 00:05:33.310
looks like between scene one and scene two? It

00:05:33.310 --> 00:05:35.889
all comes down to forcing visual anchors. Okay.

00:05:35.949 --> 00:05:39.810
Asking ChatGPT for strong, consistent visual

00:05:39.810 --> 00:05:42.709
character prompts for every single scene is absolutely

00:05:42.709 --> 00:05:45.490
mandatory. So you remind it every time? Every

00:05:45.490 --> 00:05:48.579
single time. You have to bake their physical

00:05:48.579 --> 00:05:51.439
description into every prompt. If you don't, you

00:05:51.439 --> 00:05:53.540
end up with changing faces or totally different

00:05:53.540 --> 00:05:56.139
clothes. Which ruins the video. Completely. That

00:05:56.139 --> 00:05:59.399
breaks viewer immersion immediately. Right. Strong

00:05:59.399 --> 00:06:01.560
visual anchors keep the AI from changing their

00:06:01.560 --> 00:06:04.100
faces. Exactly. You can plug those precise descriptions

00:06:04.100 --> 00:06:08.079
into any image generator later. Or just use GPT's

00:06:08.079 --> 00:06:10.360
built-in image generator if you are on the free

00:06:10.360 --> 00:06:13.600
plan. So we have essentially built an incredibly

00:06:13.600 --> 00:06:16.560
detailed, character-locked script. We have the

00:06:16.560 --> 00:06:18.500
blueprint. But a text file doesn't entertain

00:06:18.500 --> 00:06:21.240
anyone. No, it doesn't. The hurdle now is visual

00:06:21.240 --> 00:06:24.019
execution. Right. Historically, that meant paying

00:06:24.019 --> 00:06:26.740
thousands of dollars for a rendering farm. Yeah,

00:06:26.779 --> 00:06:29.100
a rendering farm being massive computer networks

00:06:29.100 --> 00:06:31.920
used to process complex video. Right. We don't

00:06:31.920 --> 00:06:34.000
have access to servers like that. We just have

00:06:34.000 --> 00:06:37.379
a phone. And this is where the massive March

00:06:37.379 --> 00:06:41.920
2026 update to the Grok app completely changes

00:06:41.920 --> 00:06:44.800
the game. Tell me about that. Specifically, the

00:06:44.800 --> 00:06:47.560
animate photo feature. This is the core trick

00:06:47.560 --> 00:06:50.540
of the entire pipeline. Okay. It takes your static

00:06:50.540 --> 00:06:53.639
mid -journey or GPT image and turns it into a

00:06:53.639 --> 00:06:57.000
fluid six -second video. And it has fully integrated

00:06:57.000 --> 00:07:00.389
lip sync built right in. Yeah. It is wild. You

00:07:00.389 --> 00:07:02.970
access it by simply opening the Grok app on your

00:07:02.970 --> 00:07:05.550
phone. You tap the Imagine tab at the bottom.

00:07:05.649 --> 00:07:08.209
Right. Then you tap the little photo icon and

00:07:08.209 --> 00:07:10.430
choose animate photo. And just a quick note for

00:07:10.430 --> 00:07:12.129
you listening, if you don't see that button,

00:07:12.230 --> 00:07:14.670
go update or reinstall the app. You need that

00:07:14.670 --> 00:07:18.550
specific March 2026 version. Right. Once you

00:07:18.550 --> 00:07:21.129
are in the interface, you import your static

00:07:21.129 --> 00:07:24.009
scene image. Okay. Then you write your prompt.

00:07:24.329 --> 00:07:27.110
Using the master script. Exactly. You're combining

00:07:27.110 --> 00:07:30.850
that motion prompt ChatGPT gave you earlier with

00:07:30.850 --> 00:07:33.189
your character's dialogue. You just type the

00:07:33.189 --> 00:07:36.910
exact spoken dialogue in quotation marks. So

00:07:36.910 --> 00:07:39.129
the AI reads the text and maps it to the mouth.

00:07:39.329 --> 00:07:41.410
Yeah, it does. But there are some pretty severe

00:07:41.410 --> 00:07:43.889
constraints here, right? Huge constraints. You

00:07:43.889 --> 00:07:46.850
must limit the generation to two dialogue exchanges

00:07:46.850 --> 00:07:49.990
per prompt. Two exchanges. That is the absolute

00:07:49.990 --> 00:07:52.629
maximum. What happens if you do more? If you

00:07:52.629 --> 00:07:55.329
feed it a massive block of text, the processing

00:07:55.329 --> 00:07:58.670
load gets too heavy. Okay. The AI loses coherence

00:07:58.670 --> 00:08:01.449
and characters just skip lines completely. So

00:08:01.449 --> 00:08:04.629
how do we actually get past six seconds? Six seconds

00:08:04.629 --> 00:08:07.389
is a neat trick, but it is not a television episode.

00:08:07.389 --> 00:08:11.170
That is the ultimate secret sauce of 2026, which

00:08:11.170 --> 00:08:13.569
is the Extend feature. Okay. After you generate

00:08:13.569 --> 00:08:16.930
that first flawless six-second clip, you tap the

00:08:16.930 --> 00:08:19.870
three dots in the corner. You hit Extend. Then

00:08:19.870 --> 00:08:23.920
you paste the next small chunk of your dialogue.

00:08:24.160 --> 00:08:26.379
And you just generate it again. Right. The clip

00:08:26.379 --> 00:08:29.439
seamlessly grows from six seconds to 12 seconds.

00:08:29.620 --> 00:08:32.139
Wow. It takes the last frame of the first clip

00:08:32.139 --> 00:08:34.460
and uses it as the starting point for the next.

00:08:34.539 --> 00:08:36.299
So you just keep doing that. You repeat that

00:08:36.299 --> 00:08:39.159
process. It goes to 18 seconds. Yeah. Then 24.
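
NOTE
Editor's sketch, not taken from the guide: the clip growth described in this passage, written out as arithmetic.
It assumes every generation or Extend pass adds a fixed six seconds, as the hosts describe.
The names CLIP_SECONDS and clip_length are hypothetical and exist only for illustration.
```python
CLIP_SECONDS = 6  # length of one generated block, per the episode
def clip_length(extensions: int) -> int:
    """Total seconds after the first generation plus `extensions` Extend passes."""
    if extensions < 0:
        raise ValueError("extensions must be non-negative")
    return CLIP_SECONDS * (1 + extensions)
# Zero to four Extend passes reproduce the 6, 12, 18, 24, 30 progression.
lengths = [clip_length(n) for n in range(5)]
print(lengths)
```
Real clips may come out shorter once dead spots are trimmed in editing.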

00:08:39.659 --> 00:08:43.360
Then 30 seconds. You stitch them together block

00:08:43.360 --> 00:08:46.059
by block. It is like stacking Lego blocks of

00:08:46.059 --> 00:08:51.120
data. Exactly. Whoa. Imagine

00:08:51.120 --> 00:08:53.779
scaling a single photo into a whole half-hour

00:08:53.779 --> 00:08:56.840
animated film. It is a massive, unprecedented

00:08:56.840 --> 00:08:59.809
shift in solo production. Yeah. But there is

00:08:59.809 --> 00:09:02.330
a highly critical insight you must follow to

00:09:02.330 --> 00:09:04.870
make it work. What is that? It is the defining

00:09:04.870 --> 00:09:08.669
differentiator for creators in 2026. The guide

00:09:08.669 --> 00:09:11.070
calls it character parenting. Character parenting.

00:09:11.389 --> 00:09:13.269
Yeah. What does that actually mean in practice?

00:09:13.629 --> 00:09:16.389
In your Grok generation prompts, you always put

00:09:16.389 --> 00:09:18.730
the character's gender and rough age in parentheses

00:09:18.730 --> 00:09:21.929
right next to their name. For example, you literally

00:09:21.929 --> 00:09:27.049
type Amina, open parenthesis, woman, comma. Why

00:09:27.049 --> 00:09:31.149
is that tiny detail so profoundly important?

00:09:31.549 --> 00:09:33.789
Because we have to understand how these AI models

00:09:33.789 --> 00:09:36.789
are trained. Right. The latent space is heavily

00:09:36.789 --> 00:09:40.419
biased toward Western media. Okay. Latent space

00:09:40.419 --> 00:09:43.519
bias being AI models favoring certain types of

00:09:43.519 --> 00:09:46.240
training data. Exactly. If you just type a non

00:09:46.240 --> 00:09:48.740
-Western name like Amina, the AI searches its

00:09:48.740 --> 00:09:51.220
database and gets confused. Oh, I see. It leads

00:09:51.220 --> 00:09:54.860
to default generic voices. Or worse, it creates

00:09:54.860 --> 00:09:57.519
entirely wrong facial expressions that erase

00:09:57.519 --> 00:10:00.059
ethnic features. It basically loses its anchor

00:10:00.059 --> 00:10:01.980
and starts flipping voices during the extension

00:10:01.980 --> 00:10:04.919
process. Exactly. Adding that simple demographic

00:10:04.919 --> 00:10:06.980
label acts as an anchor weight in the neural

00:10:06.980 --> 00:10:09.720
network. It stops the AI from drifting or

00:10:09.720 --> 00:10:11.879
messing up the cultural nuances. It saves time.

00:10:12.019 --> 00:10:14.460
Saves you hours of frustrating regeneration time.
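
NOTE
Editor's sketch, not taken from the guide: one possible way to assemble the text you paste into the app by hand.
It combines a motion prompt, a name with gender and rough age in parentheses, and quoted dialogue.
It also splits the dialogue so that no prompt carries more than two spoken lines.
Assumptions: one spoken line counts as one exchange, the age wording "30s" is a made-up placeholder, and the "says:" phrasing is invented.
Character, build_prompts and the sample lines are hypothetical. Nothing here calls a Grok API.
```python
from dataclasses import dataclass
MAX_LINES_PER_PROMPT = 2  # "two dialogue exchanges per prompt", per the episode
@dataclass
class Character:
    name: str
    gender: str
    age: str
    def label(self) -> str:
        # Name followed by gender and rough age in parentheses.
        return f"{self.name} ({self.gender}, {self.age})"
def build_prompts(motion: str, lines: list[tuple[Character, str]]) -> list[str]:
    """Return one prompt per chunk of at most MAX_LINES_PER_PROMPT spoken lines."""
    prompts = []
    for start in range(0, len(lines), MAX_LINES_PER_PROMPT):
        chunk = lines[start:start + MAX_LINES_PER_PROMPT]
        spoken = " ".join(f'{c.label()} says: "{text}"' for c, text in chunk)
        prompts.append(f"{motion} {spoken}")
    return prompts
amina = Character("Amina", "woman", "30s")
script = [(amina, "We leave at dawn."), (amina, "Pack light."), (amina, "Trust me.")]
prompts = build_prompts("Camera slowly pushes in.", script)
for p in prompts:
    print(p)
```
The first prompt would go with the initial generation and each later one with an Extend pass.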

00:10:14.620 --> 00:10:17.100
Let me push back here a bit. Sure. The tech sounds

00:10:17.100 --> 00:10:19.659
amazing, but if I am stitching together these

00:10:19.659 --> 00:10:22.480
small six-second clips, isn't the audio going

00:10:22.480 --> 00:10:24.539
to sound incredibly disjointed at the seams?

00:10:24.799 --> 00:10:27.460
That is a fair concern. Plus, what happens when

00:10:27.460 --> 00:10:30.580
the clip inevitably goes off script or hallucinates

00:10:30.580 --> 00:10:33.399
and the mouth just stops syncing? Those are the

00:10:33.399 --> 00:10:35.899
exact right questions. Hallucinations being

00:10:36.120 --> 00:10:39.080
when the AI invents random, incorrect details

00:10:39.080 --> 00:10:41.720
that break the scene. Right. It definitely happens.

00:10:41.899 --> 00:10:44.639
Yeah. The audio usually smooths out because Grok

00:10:44.639 --> 00:10:47.440
overlaps the audio frame slightly. That's clever.

00:10:47.679 --> 00:10:51.320
But when the visual completely breaks, you shouldn't

00:10:51.320 --> 00:10:53.960
just spam the regenerate button. Why not? That

00:10:53.960 --> 00:10:57.019
wastes your daily limits. First, you need to

00:10:57.019 --> 00:10:59.330
check the fundamental basics. Like what? Check

00:10:59.330 --> 00:11:01.129
that your character labels are actually present

00:11:01.129 --> 00:11:03.330
in the prompt. Make sure you didn't accidentally

00:11:03.330 --> 00:11:06.149
delete the parentheses. Right. That is usually

00:11:06.149 --> 00:11:08.429
the culprit. Okay. And second, make sure you

00:11:08.429 --> 00:11:10.429
haven't overloaded the context window. Check

00:11:10.429 --> 00:11:13.210
the dialogue length. Check if your dialogue snippet

00:11:13.210 --> 00:11:17.289
is too long. Over 90% of bad, hallucinated clips

00:11:17.289 --> 00:11:20.789
come from those two specific user errors. Got

00:11:20.789 --> 00:11:23.730
it. Fix labels. Cut dialogue short, then try

00:11:23.730 --> 00:11:26.250
generating it again. That is the exact troubleshooting

00:11:26.250 --> 00:11:28.629
loop. Okay. Once the clip generates cleanly,

00:11:28.629 --> 00:11:30.629
you save it to your camera roll. Then you move

00:11:30.629 --> 00:11:32.830
to the very next scene in your script. And just

00:11:32.830 --> 00:11:35.230
repeat. You keep systematically repeating this

00:11:35.230 --> 00:11:37.750
until the full story is done. And the guide mentions

00:11:37.750 --> 00:11:40.570
a massive time-saving trick here. Yeah, it is

00:11:40.570 --> 00:11:42.470
brilliant. What is it? If two back-to-back

00:11:42.470 --> 00:11:45.049
scenes use the exact same setting and the same

00:11:45.049 --> 00:11:47.870
characters, just reuse the same starting image.

00:11:48.090 --> 00:11:51.330
Oh, right. Don't prompt a brand new one. It saves

00:11:51.330 --> 00:11:53.990
rendering time and it keeps your visual style

00:11:53.990 --> 00:11:56.690
highly consistent across the episode. OK, let's

00:11:56.690 --> 00:11:59.090
take a breath. Yeah. We now have a camera roll

00:11:59.090 --> 00:12:03.230
completely full of extended 30-second video clips.

00:12:03.590 --> 00:12:07.529
But a crowded folder of clips isn't a movie.

00:12:08.129 --> 00:12:11.850
We need to cut the robotic fat. And that brings

00:12:11.850 --> 00:12:14.429
us to the final crucial step. We are going to

00:12:14.429 --> 00:12:16.149
take a quick break. Right. Stick around. And

00:12:16.149 --> 00:12:19.289
we are back. So before the break, we generated

00:12:19.289 --> 00:12:22.850
all our raw footage. We did. Now we are moving

00:12:22.850 --> 00:12:25.870
all those assets into CapCut. It is a totally

00:12:25.870 --> 00:12:28.690
free mobile video editor. Right. You start a

00:12:28.690 --> 00:12:31.070
brand new project on your phone. Okay. You import

00:12:31.070 --> 00:12:33.889
all your saved clips from the camera roll. Because

00:12:33.889 --> 00:12:36.909
you generated them in order, they usually drop

00:12:36.909 --> 00:12:39.210
into the timeline sequentially. I want to clarify

00:12:39.210 --> 00:12:41.409
something for you listening. Yeah. The focus

00:12:41.409 --> 00:12:44.860
in this stage isn't fancy transitions. We aren't

00:12:44.860 --> 00:12:48.440
doing crazy MTV-style edits here. No, the entire

00:12:48.440 --> 00:12:51.799
philosophy of this edit is purely about hiding

00:12:51.799 --> 00:12:54.960
the AI's technical flaws. Okay. You have to scrub

00:12:54.960 --> 00:12:58.100
through each clip very carefully. You are hunting

00:12:58.100 --> 00:13:00.879
for two specific immersion-breaking issues.

00:13:01.220 --> 00:13:03.139
What exactly are we hunting for on that timeline?

00:13:03.480 --> 00:13:05.779
First, dead spots where the character just

00:13:05.779 --> 00:13:08.860
freezes mid-scene. This usually happens right

00:13:08.860 --> 00:13:10.639
around the mouth when the line of dialogue finishes

00:13:10.639 --> 00:13:13.879
early. Second, you are looking for moments where

00:13:13.879 --> 00:13:16.480
the video keeps running awkwardly after the intended

00:13:16.480 --> 00:13:19.379
emotion ends. So you have to meticulously pinch

00:13:19.379 --> 00:13:22.000
and zoom on that timeline to trim those clips.

00:13:22.259 --> 00:13:25.159
Exactly. You use CapCut's basic split and trim

00:13:25.159 --> 00:13:28.700
tools. You ruthlessly cut out those empty robotic

00:13:28.700 --> 00:13:31.879
moments. You add a simple text title at the beginning

00:13:31.879 --> 00:13:34.620
of the project. And once the flow feels naturally

00:13:34.620 --> 00:13:38.259
human, you export the whole thing in 1080p. Let's

00:13:38.259 --> 00:13:40.740
do a harsh reality check on the actual time and

00:13:40.740 --> 00:13:43.320
output here. Okay. How long does this entirely

00:13:43.320 --> 00:13:46.240
mobile process actually take? Let's break down

00:13:46.240 --> 00:13:49.039
the basic math. Sure. If you have a script with

00:13:49.039 --> 00:13:52.759
10 scenes and you do 6 to 10 extensions for each

00:13:52.759 --> 00:13:54.580
of those scenes. That is a lot of extension.

00:13:54.840 --> 00:13:59.120
It is. But you end up with a solid 25 to 35

00:13:59.120 --> 00:14:02.559
minute video. That is literally a full television

00:14:02.559 --> 00:14:06.759
episode length. It is. And the total actual production

00:14:06.759 --> 00:14:09.159
time for that. Yeah. It is sitting right around

00:14:09.159 --> 00:14:12.179
two to three hours. Wow. That is from writing

00:14:12.179 --> 00:14:15.299
the initial ChatGPT prompt all the way to the

00:14:15.299 --> 00:14:18.019
final exported video. I have to say two to three

00:14:18.019 --> 00:14:21.000
hours sounds incredibly fast for 30 minutes of

00:14:21.000 --> 00:14:24.620
pristine animation. It does. But I want to point

00:14:24.620 --> 00:14:27.299
out to you listening. Yeah. This requires serious,

00:14:27.340 --> 00:14:30.279
hyper-focused work. It isn't just a push a button

00:14:30.279 --> 00:14:32.360
and take a nap kind of magic trick. No, it is

00:14:32.360 --> 00:14:34.860
not. If you are thinking trimming microseconds

00:14:34.860 --> 00:14:37.740
of dead air on a tiny phone screen sounds like

00:14:37.740 --> 00:14:39.960
a nightmare, you are not entirely wrong. Right.

00:14:40.039 --> 00:14:43.419
It is meticulous. It is highly meticulous. You

00:14:43.419 --> 00:14:46.919
are actively directing the AI. You are extending

00:14:46.919 --> 00:14:50.059
clips block by block. You are constantly troubleshooting

00:14:50.059 --> 00:14:52.759
broken prompts. You are trimming single frames.

00:14:52.879 --> 00:14:55.759
It is real demanding creative work. It is just

00:14:55.759 --> 00:14:57.639
a totally different kind of work than traditional

00:14:57.639 --> 00:15:00.960
animating. Exactly. But I have to ask, are those

00:15:00.960 --> 00:15:04.200
tiny dead spots or frozen mouths really that

00:15:04.200 --> 00:15:07.139
big of a deal for the final viewer? Yes. Can't

00:15:07.139 --> 00:15:09.379
we just leave them in to save an hour of editing?

00:15:09.759 --> 00:15:11.740
They are a massive deal. We are dealing with

00:15:11.740 --> 00:15:14.179
the uncanny valley here. Right. Leaving even

00:15:14.179 --> 00:15:18.240
a single second of a frozen, lifeless AI face

00:15:18.940 --> 00:15:20.860
instantly shatters the viewer's psychological

00:15:20.860 --> 00:15:23.840
immersion. It just feels wrong. It deeply unsettles

00:15:23.840 --> 00:15:26.059
the human brain. It will completely ruin your

00:15:26.059 --> 00:15:28.759
audience retention. People will click away immediately.

00:15:29.139 --> 00:15:31.399
Yeah, dead air shatters the illusion. Trimming

00:15:31.399 --> 00:15:34.360
hides the robotic awkwardness completely. That

00:15:34.360 --> 00:15:37.980
meticulousness is the exact barrier that separates

00:15:37.980 --> 00:15:40.860
professional-looking storytelling from the vast

00:15:40.860 --> 00:15:46.740
sea of lazy, low-effort AI spam online. Trimming

00:15:46.740 --> 00:15:49.700
isn't optional. It is essential. So looking at

00:15:49.700 --> 00:15:52.980
the big picture, who is this mobile pipeline

00:15:52.980 --> 00:15:56.720
actually built for? It is heavily optimized for

00:15:56.720 --> 00:16:00.620
a few specific types of creators. OK. It is absolutely

00:16:00.620 --> 00:16:03.740
perfect for faceless YouTube creators. Right.

00:16:03.820 --> 00:16:06.320
People who want to publish long-form stories

00:16:06.320 --> 00:16:09.480
consistently without ever putting their real

00:16:09.480 --> 00:16:12.620
face on camera. The guide also explicitly mentions

00:16:12.620 --> 00:16:16.000
cultural storytellers. Yes. This is profound.

00:16:16.559 --> 00:16:20.639
If you want to produce localized drama, rich

00:16:20.639 --> 00:16:23.320
historical folklore, or deeply serialized fiction,

00:16:23.580 --> 00:16:25.940
this workflow is exactly what you've been waiting

00:16:25.940 --> 00:16:28.860
for. And it is built entirely for beginners with

00:16:28.860 --> 00:16:31.059
absolutely zero budget. Because every single

00:16:31.059 --> 00:16:33.679
tool in this impressive stack is entirely free.

00:16:33.840 --> 00:16:36.740
Right. And remember that multi-part story format

00:16:36.740 --> 00:16:39.419
we forced ChatGPT to write in step one? Yeah,

00:16:39.419 --> 00:16:41.320
the calendar. That builds natural subscriber

00:16:41.320 --> 00:16:43.539
retention. Each episode ends with a scripted

00:16:43.539 --> 00:16:45.519
cliffhanger. Oh, that's smart. The next part

00:16:45.519 --> 00:16:47.139
is already written on your phone. The audience

00:16:47.139 --> 00:16:50.019
has a compelling reason to return tomorrow. Let's

00:16:50.019 --> 00:16:52.259
synthesize the core takeaway from all of this.

00:16:52.399 --> 00:16:56.220
The historical financial barrier

00:16:56.220 --> 00:16:59.399
to cinematic storytelling has effectively vanished.

00:16:59.679 --> 00:17:01.940
It really has evaporated overnight. By intentionally

00:17:01.940 --> 00:17:05.740
stacking ChatGPT, Grok's new update, and CapCut

00:17:05.740 --> 00:17:09.160
together, an entire world-class animation studio

00:17:09.160 --> 00:17:12.849
now fits quietly in your pocket. It is amazing.

00:17:13.130 --> 00:17:15.710
For zero dollars, you just build the world scene

00:17:15.710 --> 00:17:18.650
by scene, extension by extension. It is a profound

00:17:18.650 --> 00:17:21.309
tectonic shift in media. Yeah. The technical

00:17:21.309 --> 00:17:23.750
tools are no longer the obstacle keeping you

00:17:23.750 --> 00:17:26.349
from creating. The only thing you actually need

00:17:26.349 --> 00:17:29.029
to bring to the table now is a story that is

00:17:29.029 --> 00:17:31.230
actually worth telling. That is a really beautiful

00:17:31.230 --> 00:17:34.130
way to frame it. It makes you wonder. About what?

00:17:34.230 --> 00:17:36.690
If the financial cost of rendering breathtaking

00:17:36.690 --> 00:17:39.990
animation is now completely zero, will the value

00:17:39.990 --> 00:17:43.289
of deeply human, unique cultural folklore actually

00:17:43.289 --> 00:17:46.450
skyrocket? I believe so. Right. When the execution

00:17:46.450 --> 00:17:49.650
becomes free and effortless, genuine human originality

00:17:49.650 --> 00:17:53.430
becomes absolutely priceless. Exactly. So what

00:17:53.430 --> 00:17:56.130
deeply personal story have you been holding back

00:17:56.130 --> 00:17:58.069
because you thought you couldn't afford to tell

00:17:58.069 --> 00:18:01.569
it? That is the real question.

00:18:01.789 --> 00:18:03.569
The next time you are just waiting for the bus,

00:18:03.730 --> 00:18:05.809
remember. You have a massive Hollywood studio

00:18:05.809 --> 00:18:07.950
sitting right there in your pocket. Thank you

00:18:07.950 --> 00:18:10.809
for exploring this wild new frontier with us

00:18:10.809 --> 00:18:12.890
today. Thanks for having me. Thanks for joining

00:18:12.890 --> 00:18:14.849
the Deep Dive. We will see you next time.
