WEBVTT

00:00:00.000 --> 00:00:03.020
You know that specific kind of exhaustion? The

00:00:03.020 --> 00:00:05.480
one that comes from editing video. Oh, the Sunday

00:00:05.480 --> 00:00:07.740
night dread. I know it well. Exactly. The Sunday

00:00:07.740 --> 00:00:09.960
night dread. You spend an entire weekend, you're

00:00:09.960 --> 00:00:12.939
fighting with clips trying to match audio. And

00:00:12.939 --> 00:00:14.640
you end up with what, like a minute and a half

00:00:14.640 --> 00:00:17.519
of usable footage? If you're lucky, it's just,

00:00:17.620 --> 00:00:20.739
it's heavy. It's the friction. It is. It's the

00:00:20.739 --> 00:00:22.620
friction of the tools getting in the way of the

00:00:22.620 --> 00:00:25.039
actual idea. And that friction is, you know,

00:00:25.039 --> 00:00:27.219
what usually kills the idea before it even gets

00:00:27.219 --> 00:00:29.760
a chance to breathe. But I've been reading about

00:00:29.760 --> 00:00:31.940
this concept. It's called the speed gap. Right.

00:00:32.060 --> 00:00:35.460
It's this massive divide between that old manual

00:00:35.460 --> 00:00:39.219
way. The agonizing weekend edit and this new

00:00:39.219 --> 00:00:41.880
standard that's, well, it's quietly taking over.

00:00:42.000 --> 00:00:43.700
And we're not talking about just making things

00:00:43.700 --> 00:00:46.359
a little bit faster. No, not 10% faster. We're

00:00:46.359 --> 00:00:49.200
talking about automating 50 consistent scenes

00:00:49.200 --> 00:00:52.439
in just minutes. It's a total collapse of the

00:00:52.439 --> 00:00:55.299
production timeline. And the craziest part about

00:00:55.299 --> 00:00:57.619
the speed gap. Go on. The bridge to cross it

00:00:57.619 --> 00:01:01.929
costs exactly $0. That's the hook. And that's

00:01:01.929 --> 00:01:04.250
what we're doing today on The Deep Dive. We are

00:01:04.250 --> 00:01:06.629
unpacking a roadmap for what's being called the

00:01:06.629 --> 00:01:12.829
2026 Free Stack. It's a guide on building a professional

00:01:12.829 --> 00:01:16.209
AI video production system, an actual assembly

00:01:16.209 --> 00:01:19.609
line from just a text prompt to a finished movie.

00:01:19.950 --> 00:01:21.930
And we should be really clear here. This is not

00:01:21.930 --> 00:01:24.469
about just typing make me a video into a chat

00:01:24.469 --> 00:01:26.689
bot. No, that doesn't work. It never works. This

00:01:26.689 --> 00:01:29.269
is about chaining together five very specific

00:01:29.269 --> 00:01:33.250
free tools. We're talking ChatGPT, a Chrome extension

00:01:33.250 --> 00:01:37.090
called AutoWhisk, Google AI Studio, Grok, and

00:01:37.090 --> 00:01:39.150
CapCut. It sounds like a Frankenstein's monster.

00:01:39.450 --> 00:01:41.390
It is a bit of a Frankenstein. But when you see

00:01:41.390 --> 00:01:43.230
how they all lock together, you realize, you

00:01:43.230 --> 00:01:45.030
know, this is the difference between an amateur

00:01:45.030 --> 00:01:47.569
messing with tech and a producer building an

00:01:47.569 --> 00:01:49.790
actual workflow. Okay. So let's walk through

00:01:49.790 --> 00:01:51.609
this factory floor. It's a seven-step process

00:01:51.609 --> 00:01:53.829
and it starts where every movie starts. The script.

00:01:53.969 --> 00:01:56.450
The script. And usually for me, when I try to

00:01:56.450 --> 00:02:00.209
get AI to write a script, it just feels so generic.

00:02:00.769 --> 00:02:03.469
Hollow. Like plastic. Yeah. It gives you that

00:02:03.469 --> 00:02:07.590
bland corporate AI voice. The guide we're looking

00:02:07.590 --> 00:02:10.550
at attacks that blank page problem in a different

00:02:10.550 --> 00:02:12.849
way. It doesn't just ask for a story. What does

00:02:12.849 --> 00:02:16.409
it do? It uses this rigorous persona-switching

00:02:16.409 --> 00:02:19.969
strategy. It treats the AI like a specialized

00:02:19.969 --> 00:02:23.150
employee you can hire and fire. Okay, explain

00:02:23.150 --> 00:02:25.669
that. Because most people just dump a paragraph

00:02:25.669 --> 00:02:28.330
into ChatGPT. You got to hope for the best. And

00:02:28.330 --> 00:02:30.590
you get garbage if you do that. So step one.

00:02:31.240 --> 00:02:34.340
You don't ask for a video script. You tell ChatGPT

00:02:34.340 --> 00:02:37.039
you are a professional children's author. You

00:02:37.039 --> 00:02:40.419
give it rules. Human characters, one animal,

00:02:40.659 --> 00:02:43.439
inspirational, no dialogue. You get the story

00:02:43.439 --> 00:02:45.340
first. So you lock in the narrative structure

00:02:45.340 --> 00:02:47.099
before you even think about the visuals. That

00:02:47.099 --> 00:02:49.020
makes a lot of sense. Precisely. But here's the

00:02:49.020 --> 00:02:50.599
technical pivot. And this is where most people

00:02:50.599 --> 00:02:52.800
fail. You don't take that story and just paste

00:02:52.800 --> 00:02:54.719
it into an image generator. You have to clean

00:02:54.719 --> 00:02:57.020
the data. You have to. You go back to ChatGPT

00:02:57.020 --> 00:02:59.539
and you say, OK, now switch hats. You are an

00:02:59.539 --> 00:03:01.740
experienced animation director. I like that.

00:03:01.819 --> 00:03:03.740
You're firing the author and hiring a director

00:03:03.740 --> 00:03:06.650
for the next task. You are. And you tell it to

00:03:06.650 --> 00:03:09.229
break that story down into 20 storyboard scenes,

00:03:09.449 --> 00:03:12.729
but the prompt engineering here is very, very

00:03:12.729 --> 00:03:16.750
strict. How so? The output has to separate the

00:03:16.750 --> 00:03:18.909
narration, what the audience is going to hear,

00:03:19.069 --> 00:03:22.569
from the image prompt, what the AI needs to see.

00:03:22.650 --> 00:03:25.629
Why is that separation so critical for the system?

00:03:25.729 --> 00:03:28.469
Can't you just describe the scene? Well, if you

00:03:28.469 --> 00:03:30.729
mix them, the image generator gets confused.

00:03:31.520 --> 00:03:33.860
It sees the emotional language of the story and

00:03:33.860 --> 00:03:35.900
doesn't know what to do. You need the image prompt

00:03:35.900 --> 00:03:39.500
to be cold, descriptive data like Disney Pixar

00:03:39.500 --> 00:03:42.860
style, wide shot, warm lighting. I see. Completely

00:03:42.860 --> 00:03:45.139
distinct from the narration. Totally. And then

00:03:45.139 --> 00:03:46.939
there's a third little step in the scripting

00:03:46.939 --> 00:03:49.860
phase, the extraction. Right. This is pure data

00:03:49.860 --> 00:03:52.199
prep. You tell the AI to strip away everything

00:03:52.199 --> 00:03:55.080
else, scene numbers, headers, all of it, and just

00:03:55.080 --> 00:03:57.599
give you the raw visual descriptions, each one

00:03:57.599 --> 00:04:00.150
separated by a blank line. It feels less like

00:04:00.150 --> 00:04:02.110
writing at that point and more like, I don't

00:04:02.110 --> 00:04:03.949
know, coding. You're just preparing the raw material.

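NOTE Editor's sketch, not from the guide: one way to picture the narration/image-prompt split and the extraction step. The Scene class, its field names, both helper functions, and the two sample scenes are hypothetical illustrations; the guide does this with ChatGPT prompts, not code.
```python
from dataclasses import dataclass
@dataclass
class Scene:
    narration: str      # what the audience hears
    image_prompt: str   # cold, descriptive data for the image generator
def narration_script(scenes):
    """Join the narration lines into one text for a single voiceover pass."""
    return " ".join(s.narration.strip() for s in scenes)
def prompt_batch(scenes):
    """Return only the image prompts, separated by a blank line, with no numbers or headers."""
    return "\n\n".join(s.image_prompt.strip() for s in scenes)
# Hypothetical sample data in the style the discussion describes.
scenes = [
    Scene("Mila finds a lost dog.", "Disney Pixar style, wide shot, warm lighting, girl kneeling beside a brown dog"),
    Scene("They walk home together.", "Disney Pixar style, wide shot, warm lighting, girl and brown dog on a village path"),
]
```
Keeping the two fields apart means the voiceover text and the image batch can each be produced without editing the other.
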
00:04:04.229 --> 00:04:06.810
That's exactly what it is. So why is that formatting,

00:04:06.990 --> 00:04:08.710
that separation of the prompts, so critical?

00:04:09.009 --> 00:04:12.009
Clean data allows the AutoWhisk tool to read

00:04:12.009 --> 00:04:14.629
distinct instructions without any manual tagging.

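NOTE Editor's sketch, assuming the prompts arrive as one pasted block with a blank line between entries. This is not AutoWhisk's actual code (the source does not show it); split_prompts is a hypothetical helper that illustrates why clean formatting lets a tool read each prompt without manual tagging.
```python
import re
def split_prompts(pasted):
    """Split a pasted block into prompts wherever one or more blank lines occur."""
    chunks = re.split(r"\n\s*\n", pasted.strip())
    # Collapse line wraps inside a prompt so each prompt becomes a single line.
    return [" ".join(c.split()) for c in chunks if c.strip()]
# Hypothetical two-prompt batch; the second prompt is wrapped across two lines.
batch = "wide shot, warm lighting, a girl at a window\n\nclose-up, soft light,\na brown dog asleep\n"
prompts = split_prompts(batch)
```
If headers or scene numbers were left in the block, they would come through as extra entries, which is the failure the extraction step is meant to prevent.
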
00:04:14.729 --> 00:04:16.870
Okay, which brings us to the engine room. Step

00:04:16.870 --> 00:04:20.850
two, bulk velocity. This is where we stop making

00:04:20.850 --> 00:04:24.029
images one by one. Yeah, this is where AutoWhisk

00:04:24.029 --> 00:04:26.389
comes in. Tell me about this tool. This is what

00:04:26.389 --> 00:04:28.149
really creates that speed gap we were talking

00:04:28.149 --> 00:04:31.050
about. AutoWhisk is a Chrome extension, and it

00:04:31.050 --> 00:04:33.209
just sits right on top of Google Whisk. And Google

00:04:33.209 --> 00:04:35.910
Whisk is the actual image generation engine? That's

00:04:35.910 --> 00:04:38.550
right. And it's free, currently. The quality is

00:04:38.550 --> 00:04:40.750
surprisingly high, too, if you use the settings

00:04:40.750 --> 00:04:44.329
from the guide: English, version 7.6.0, aspect

00:04:44.329 --> 00:04:47.189
ratio 16:9. Pretty standard stuff. Very standard.

00:04:47.189 --> 00:04:50.389
But the extension is the absolute game changer.

00:04:50.389 --> 00:04:52.920
Because it automates all the clicking? It automates

00:04:52.920 --> 00:04:56.019
the entire batch. You take those 20 clean prompts

00:04:56.019 --> 00:04:58.339
you extracted, you paste them all into the extension

00:04:58.339 --> 00:05:00.699
at once, and you just hit start. And it does the

00:05:00.699 --> 00:05:03.759
rest. The extension sees the line breaks, it feeds

00:05:03.759 --> 00:05:06.459
them into the engine one by one, generates the

00:05:06.459 --> 00:05:08.240
image, and then downloads it straight to your

00:05:08.240 --> 00:05:10.319
hard drive. I actually laughed when I read the

00:05:10.319 --> 00:05:12.160
warning in the source material for this step.

00:05:12.160 --> 00:05:16.180
It says, very strictly, don't touch your mouse.

00:05:16.379 --> 00:05:19.279
It's serious. The extension is literally simulating

00:05:19.279 --> 00:05:21.680
you clicking and typing. It's kind of hijacking

00:05:21.680 --> 00:05:24.639
your cursor. So if you tab away to check an email

00:05:24.639 --> 00:05:26.480
or something. You break the loop. You have to

00:05:26.480 --> 00:05:28.480
just sit there and watch the little file count

00:05:28.480 --> 00:05:30.899
go up in your download folder. It's kind of mesmerizing.

00:05:31.180 --> 00:05:33.560
It's a funny image. Surrendering control of your

00:05:33.560 --> 00:05:36.199
computer to gain all this speed. But this is

00:05:36.199 --> 00:05:39.220
where the guide introduces a huge problem. A

00:05:39.220 --> 00:05:41.300
critical problem. You have the speed. You've got

00:05:41.300 --> 00:05:45.769
20 beautiful images. But they don't look like

00:05:45.769 --> 00:05:47.709
they belong in the same movie. The consistency

00:05:47.709 --> 00:05:51.889
problem. This is the bane of AI video. Right.

00:05:51.930 --> 00:05:54.350
In scene one, your main character is a boy in

00:05:54.350 --> 00:05:57.089
a blue hoodie. Scene two, the AI decides he's

00:05:57.089 --> 00:05:59.490
wearing a red jacket. Scene three, he's suddenly

00:05:59.490 --> 00:06:02.189
Asian. Scene four, he's a cartoon. It's just

00:06:02.189 --> 00:06:04.930
chaos. It's the hallmark of what people call

00:06:04.930 --> 00:06:07.509
AI slop. It looks like a fever dream. It does.

00:06:07.689 --> 00:06:11.269
The AI has no object permanence. It has no memory

00:06:11.269 --> 00:06:13.629
of who the character was five seconds ago. So

00:06:13.629 --> 00:06:17.029
speed without control creates chaos. What is

00:06:17.029 --> 00:06:19.329
the missing variable here? A reference image.

00:06:19.529 --> 00:06:22.389
Without it, the AI hallucinates a new protagonist

00:06:22.389 --> 00:06:26.290
every single time. And this brings us to what

00:06:26.290 --> 00:06:29.310
the guide calls the secret sauce. It's step three.

00:06:29.750 --> 00:06:32.610
And honestly, this feels like the most vital

00:06:32.610 --> 00:06:35.149
part of the entire workflow. It really is. This

00:06:35.149 --> 00:06:38.689
is the barrier between amateur slop and professional

00:06:38.689 --> 00:06:41.529
storytelling. So before you run that bulk batch,

00:06:41.750 --> 00:06:44.029
you have to create a reference anchor. You do.

00:06:44.149 --> 00:06:46.529
And the fix is actually pretty clever. You go

00:06:46.529 --> 00:06:49.389
back to ChatGPT again. Okay. And you ask it to

00:06:49.389 --> 00:06:51.470
generate a character prompt, but specifically

00:06:51.470 --> 00:06:53.689
on a white background. White background. Why?

00:06:53.870 --> 00:06:56.610
It isolates the features. It tells the AI, focus

00:06:56.610 --> 00:06:59.209
only on the face, the clothes, the identity of

00:06:59.209 --> 00:07:01.470
this character. You generate just one good image

00:07:01.470 --> 00:07:04.649
of, say, Mila or the brown dog. So you're creating

00:07:04.649 --> 00:07:06.930
your cast's headshots, essentially. That's a

00:07:06.930 --> 00:07:08.649
perfect analogy. You download this. Now you

00:07:08.649 --> 00:07:10.829
go back to the AutoWhisk extension, but this

00:07:10.829 --> 00:07:13.430
time, before you paste in your 20 scene prompts...

00:07:13.430 --> 00:07:16.490
Let me guess, there's a button. There's a reference

00:07:16.490 --> 00:07:19.089
image option. You click it, you upload that file

00:07:19.089 --> 00:07:21.250
of Mila. You're telling the system, this is Mila.

00:07:21.250 --> 00:07:24.370
Then you run the bulk generation. So you are anchoring

00:07:24.370 --> 00:07:27.410
the AI's imagination. You're saying, paint whatever

00:07:27.410 --> 00:07:29.730
scene you want, but the person in the middle has

00:07:29.730 --> 00:07:32.870
to look like this. Precisely. The AI forces every

00:07:32.870 --> 00:07:36.110
new scene to match that uploaded face and visual

00:07:36.110 --> 00:07:39.269
identity. It connects the dots for you. It feels

00:07:39.269 --> 00:07:41.910
like anchoring the AI's imagination. How much

00:07:41.910 --> 00:07:45.029
time does this step add? It adds about 15 minutes,

00:07:45.089 --> 00:07:46.730
but it's the difference between amateur slop

00:07:46.730 --> 00:07:49.410
and professional storytelling. 15 minutes to

00:07:49.410 --> 00:07:51.709
save the soul of the story. I think that's a

00:07:51.709 --> 00:07:54.329
trade-off most of us would take. Okay, so visuals

00:07:54.329 --> 00:07:58.720
are locked, but video is 50% audio. And there

00:07:58.720 --> 00:08:02.240
is nothing worse than that robotic, glitchy AI

00:08:02.240 --> 00:08:04.339
voice. Or the one that sounds like a GPS trying

00:08:04.339 --> 00:08:06.879
to read a bedtime story. It's awful. The guide

00:08:06.879 --> 00:08:10.060
pivots here to Google AI Studio, specifically

00:08:10.060 --> 00:08:12.600
using their Gemini model. Okay, why this tool?

00:08:12.639 --> 00:08:13.819
I mean, there are a million voice generators

00:08:13.819 --> 00:08:17.610
out there. A few reasons. First, it's free. No

00:08:17.610 --> 00:08:19.310
credit limits, which is huge when you're just

00:08:19.310 --> 00:08:22.269
trying things out. But technically, the big advantage

00:08:22.269 --> 00:08:25.149
is it supports long-form text. So you can paste

00:08:25.149 --> 00:08:27.170
the whole story in at once. The entire thing.

00:08:27.310 --> 00:08:30.439
And the quality is surprisingly high. The guide

00:08:30.439 --> 00:08:34.460
recommends the Gemini 2.5 Flash Preview TTS

00:08:34.460 --> 00:08:37.899
model and a voice called Enceladus. Enceladus.

00:08:37.980 --> 00:08:40.720
Yeah, it's described as warm and friendly, but

00:08:40.720 --> 00:08:43.220
the real trick is the style instruction. You

00:08:43.220 --> 00:08:45.720
actually type into the prompt, read the story

00:08:45.720 --> 00:08:48.600
aloud using a warm, gentle, and engaging storytelling

00:08:48.600 --> 00:08:50.940
voice appropriate for children. You're directing

00:08:50.940 --> 00:08:53.580
the actor, not just the software. Exactly, and

00:08:53.580 --> 00:08:55.799
because you generate it all in one go, you get

00:08:55.799 --> 00:08:58.399
a single audio track. Why is generating the full

00:08:58.399 --> 00:09:00.860
narrative at once better than scene-by-scene

00:09:00.860 --> 00:09:03.480
audio? It maintains natural pacing and emotional

00:09:03.480 --> 00:09:06.639
continuity, avoiding that disjointed, choppy

00:09:06.639 --> 00:09:09.000
AI sound. Okay, let's just unpack where we are.

00:09:09.059 --> 00:09:11.179
We have consistent images, we have a warm, flowing

00:09:11.179 --> 00:09:14.899
voiceover, but we still have basically a slideshow.

00:09:14.980 --> 00:09:17.860
Right. It's a series of still images, and static

00:09:17.860 --> 00:09:21.279
images are boring. To compete in 2026, you need

00:09:21.279 --> 00:09:24.120
motion. For sure. This is where we bring in Grok,

00:09:24.220 --> 00:09:27.220
specifically their Imagine feature, to add that

00:09:27.220 --> 00:09:30.019
life. This is step six in the guide, breathing

00:09:30.019 --> 00:09:33.009
life. Yeah. But it's not just hitting a button

00:09:33.009 --> 00:09:36.009
that says animate, is it? No. And that is a really

00:09:36.009 --> 00:09:37.769
important distinction. If you just let the AI

00:09:37.769 --> 00:09:40.889
guess, you get weird warping or these random

00:09:40.889 --> 00:09:44.009
nauseating zooms. So you need control. You need

00:09:44.009 --> 00:09:46.470
control. The system here relies on a control

00:09:46.470 --> 00:09:48.929
mechanism. The source mentions having a text

00:09:48.929 --> 00:09:52.090
file with about 38 specific cinematic camera

00:09:52.090 --> 00:09:55.070
techniques. Like pan left, dolly zoom, that kind

00:09:55.070 --> 00:09:57.629
of thing. Exactly. So you go back to ChatGPT

00:09:57.629 --> 00:10:00.169
briefly. You ask it to look at your script and

00:10:00.169 --> 00:10:02.649
assign a camera movement to each scene based

00:10:02.649 --> 00:10:05.750
on the emotion. So a sad scene gets a slow zoom,

00:10:05.950 --> 00:10:08.330
an action scene gets a quick pan. You got it.

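NOTE Editor's sketch of the emotion-to-movement idea. The guide has ChatGPT choose from a text file of about 38 camera techniques; the small table below is a hypothetical stand-in for that file. Only the pairing of sad scenes with a slow zoom and action scenes with a quick pan comes from the discussion; the other entries, the default, and the function name are assumptions.
```python
# Hypothetical lookup table; entries beyond "sad" and "action" are illustrative.
CAMERA_MOVES = {
    "sad": "slow zoom in on the character's face",
    "action": "quick pan left",
    "calm": "slow dolly forward",
}
DEFAULT_MOVE = "static shot with subtle motion"
def movement_prompt(emotion):
    """Return a camera-movement prompt for a scene's dominant emotion."""
    return CAMERA_MOVES.get(emotion.lower().strip(), DEFAULT_MOVE)
scene_emotions = ["calm", "Sad", "action", "joyful"]
moves = [movement_prompt(e) for e in scene_emotions]
```
An explicit default keeps an unrecognized emotion from leaving a scene with no movement instruction at all.
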
00:10:08.549 --> 00:10:10.549
Then you take those instructions over to Grok,

00:10:10.629 --> 00:10:13.269
but, and this is a big but, the guide points

00:10:13.269 --> 00:10:15.389
out a very specific setting you have to change

00:10:15.389 --> 00:10:17.649
first. What is it? You have to go to your profile,

00:10:17.850 --> 00:10:20.210
then settings, then behavior, and you have to

00:10:20.210 --> 00:10:24.230
turn off enable automatic video generation. Wait,

00:10:24.250 --> 00:10:25.809
why would you turn off the automation? Isn't

00:10:25.809 --> 00:10:28.879
that the whole point? That's the pro move. By

00:10:28.879 --> 00:10:31.279
turning off the automation, you regain the ability

00:10:31.279 --> 00:10:34.480
to paste in your specific command. You drag in

00:10:34.480 --> 00:10:37.320
your image, you paste the movement prompt, slow

00:10:37.320 --> 00:10:40.059
zoom in on the character's face, and then you

00:10:40.059 --> 00:10:43.100
generate. We are injecting human intent back

00:10:43.100 --> 00:10:45.200
into the machine here. What is the result? It

00:10:45.200 --> 00:10:47.500
stops looking like a slideshow and starts looking

00:10:47.500 --> 00:10:50.440
like a movie with intentional direction. So we

00:10:50.440 --> 00:10:53.139
have all the pieces, the animated clips, the

00:10:53.139 --> 00:10:56.559
audio. Now comes the assembly line. Steps 5 and

00:10:56.559 --> 00:10:58.840
7 in the guide kind of merge here in the editing

00:10:58.840 --> 00:11:00.779
workflow. Yeah, this is where it all comes together

00:11:00.779 --> 00:11:03.940
in CapCut. And the workflow is, again, designed

00:11:03.940 --> 00:11:06.860
for speed. It uses a two-pass system. A

00:11:06.860 --> 00:11:08.799
two-pass system. How does that work? First pass,

00:11:08.980 --> 00:11:10.480
you don't even wait for the videos to finish

00:11:10.480 --> 00:11:12.480
generating. You just drag your audio track and

00:11:12.480 --> 00:11:14.700
all your still images into the timeline. Okay.

00:11:14.779 --> 00:11:17.659
You sync them up. You listen to the narration,

00:11:18.000 --> 00:11:19.960
find the end of a sentence, and you snap the

00:11:19.960 --> 00:11:21.960
next image to it. You build the whole rhythm

00:11:21.960 --> 00:11:23.879
of the video with just the stills. That makes

00:11:23.879 --> 00:11:26.279
sense. It's much faster to edit photos than video

00:11:26.279 --> 00:11:30.059
files. So much faster. Then, once your Grok videos

00:11:30.059 --> 00:11:33.059
are ready, you do the swap. The swap. You just

00:11:33.059 --> 00:11:35.039
right-click the still image in your timeline.

00:11:35.179 --> 00:11:38.759
You choose Replace Clip, and you select the animated

00:11:38.759 --> 00:11:41.200
video file. It keeps the timing. It keeps the

00:11:41.200 --> 00:11:43.940
transitions. But it just upgrades the visual

00:11:43.940 --> 00:11:47.220
from a photo to a movie. You know, I have to

00:11:47.220 --> 00:11:49.960
admit something here. Yeah. Reading through this

00:11:49.960 --> 00:11:52.860
whole process, I thought, this is amazing. But

00:11:52.860 --> 00:11:55.679
I also know myself. I know that if I were doing

00:11:55.679 --> 00:11:58.220
this, I'd get lazy. I'd see a generated clip

00:11:58.220 --> 00:12:02.840
that was just okay. Maybe the character's eye

00:12:02.840 --> 00:12:04.899
is a little wonky and I'd be tempted to just

00:12:04.899 --> 00:12:08.399
leave it. I still wrestle with that prompt drift

00:12:08.399 --> 00:12:10.620
myself. You and everyone else. And the source

00:12:10.620 --> 00:12:12.299
actually calls this out in the quality control

00:12:12.299 --> 00:12:14.679
section. It's like the vulnerable admission of

00:12:14.679 --> 00:12:16.799
the whole system. Right. It says even with all

00:12:16.799 --> 00:12:18.980
this automation, if you skip the manual review,

00:12:19.159 --> 00:12:21.500
you break the spell. It says specifically do

00:12:21.500 --> 00:12:24.539
not trust the system blindly. Exactly. If an

00:12:24.539 --> 00:12:26.899
image is off, you have to go back and regenerate

00:12:26.899 --> 00:12:29.120
just that one. That's the discipline, isn't it?

00:12:29.159 --> 00:12:31.799
The tools remove the manual labor, but they can't

00:12:31.799 --> 00:12:34.049
remove the need for good taste. They can't. You

00:12:34.049 --> 00:12:36.149
have to be the final curator. Because if you

00:12:36.149 --> 00:12:38.909
let a glitchy face slide, the audience immediately

00:12:38.909 --> 00:12:41.750
clocks it. They know it's low effort junk. So

00:12:41.750 --> 00:12:44.250
the human role shifts from maker to editor. What

00:12:44.250 --> 00:12:46.490
happens if you skip the review? You lose trust.

00:12:46.929 --> 00:12:49.570
Small glitches compound and the audience immediately

00:12:49.570 --> 00:12:52.110
senses it's low effort junk. We are going to

00:12:52.110 --> 00:12:54.009
take a quick break, but when we come back, we

00:12:54.009 --> 00:12:56.490
are going to look at the big picture. What this

00:12:56.490 --> 00:13:00.049
whole speed gap really means for the future of

00:13:00.049 --> 00:13:03.289
creativity.

00:13:03.629 --> 00:13:06.389
Okay, let's recap this stack because it is a

00:13:06.389 --> 00:13:08.950
lot of moving parts, but they do fit together

00:13:08.950 --> 00:13:11.529
so beautifully. They really do. Think of it like

00:13:11.529 --> 00:13:14.990
a relay race. First, ChatGPT handles the structure.

00:13:15.090 --> 00:13:17.710
It gives you the story and those clean, separated

00:13:17.710 --> 00:13:20.049
prompts. Right, the raw material. Then the baton

00:13:20.049 --> 00:13:22.269
goes to AutoWhisk and Google Whisk. They handle

00:13:22.269 --> 00:13:24.389
the bulk visuals, and you use those reference

00:13:24.389 --> 00:13:26.610
images to keep the characters consistent. Got

00:13:26.610 --> 00:13:30.250
it. Then audio. Third is Google AI Studio, which

00:13:30.250 --> 00:13:33.850
gives us that warm, single-track voiceover. Fourth,

00:13:34.090 --> 00:13:37.269
Grok takes those still images and adds that controlled

00:13:37.269 --> 00:13:40.549
cinematic motion. And then finally, CapCut. And

00:13:40.549 --> 00:13:43.129
finally, CapCut is where you assemble it all

00:13:43.129 --> 00:13:46.330
and do that quick swap workflow. It's an impressive

00:13:46.330 --> 00:13:49.710
system. Yeah. But I want to zoom out to the big

00:13:49.710 --> 00:13:52.230
idea here. The guide starts by talking about

00:13:52.230 --> 00:13:54.919
the speed gap. It's a powerful concept. It is.

00:13:54.980 --> 00:13:58.080
The idea is that the barrier to entry for making

00:13:58.080 --> 00:14:00.480
video has basically dropped to zero dollars.

00:14:00.980 --> 00:14:03.679
Anyone can get these tools. Right. But the barrier

00:14:03.679 --> 00:14:06.500
to quality has shifted. It's no longer about

00:14:06.500 --> 00:14:08.980
who has the most expensive camera or software.

00:14:09.340 --> 00:14:11.860
It's about who has the best process, the best

00:14:11.860 --> 00:14:13.899
system. Exactly. The winners aren't going to

00:14:13.899 --> 00:14:16.120
be the ones who just use AI. Everybody's going

00:14:16.120 --> 00:14:18.080
to use AI. The winners are the ones who build

00:14:18.080 --> 00:14:20.879
a system like this that allows them to fail faster

00:14:20.879 --> 00:14:23.549
and succeed more often. What do you mean by that?

00:14:23.669 --> 00:14:25.669
Well, if you can make a pretty good video in

00:14:25.669 --> 00:14:28.350
15 minutes, you can afford to make five bad ones

00:14:28.350 --> 00:14:31.269
to find that one great one. That is a luxury

00:14:31.269 --> 00:14:33.690
you just don't have when one video takes you

00:14:33.690 --> 00:14:36.690
an entire week. It changes the economics of creativity

00:14:36.690 --> 00:14:39.690
itself. You're not so precious about the output

00:14:39.690 --> 00:14:41.809
anymore. No, you're focused on the pipeline,

00:14:41.990 --> 00:14:43.809
and that pipeline is what lets you actually tell

00:14:43.809 --> 00:14:46.309
stories instead of just, you know, managing files

00:14:46.309 --> 00:14:48.990
all day. So here is the challenge for you listening.

00:14:50.000 --> 00:14:52.000
Don't try to build the whole Hollywood studio

00:14:52.000 --> 00:14:54.679
today. No, start small. Just install the AutoWhisk

00:14:54.679 --> 00:14:57.759
extension. That's it. Generate one story script

00:14:57.759 --> 00:15:00.759
using that specific three-prompt structure we

00:15:00.759 --> 00:15:03.100
talked about. Don't overthink it. Just start

00:15:03.100 --> 00:15:05.919
the system. Watch the files download. See that

00:15:05.919 --> 00:15:08.480
magic moment for yourself. And here's a thought

00:15:08.480 --> 00:15:12.659
to leave you with. If you can produce 10

00:15:12.659 --> 00:15:14.799
professional-looking videos in the time it used to take to

00:15:14.799 --> 00:15:18.440
make just one, what happens to the value of the

00:15:18.440 --> 00:15:22.100
video itself? Does scarcity even matter anymore

00:15:22.100 --> 00:15:25.019
or is it now finally all about the story? That

00:15:25.019 --> 00:15:27.019
is the question. Thanks for listening to The

00:15:27.019 --> 00:15:28.419
Deep Dive. We'll see you in the next one.
