WEBVTT

00:00:00.000 --> 00:00:01.800
I want you to picture a video clip for a second.

00:00:02.120 --> 00:00:05.839
The visuals are absolutely stunning. Like

00:00:05.839 --> 00:00:08.740
full cinematic lighting. Yeah, exactly. Sort

00:00:08.740 --> 00:00:11.640
of a gritty Blade Runner aesthetic. You can see

00:00:11.640 --> 00:00:14.359
the pores on the character's skin. You see the

00:00:14.359 --> 00:00:16.660
sweat on their brow. You are totally immersed.

00:00:16.820 --> 00:00:19.480
You are ready for the story. And then... The

00:00:19.480 --> 00:00:21.280
character opens his mouth. And he sounds like

00:00:21.280 --> 00:00:23.940
a corporate training video from 1988. Instantly,

00:00:23.940 --> 00:00:26.679
the immersion just evaporates. It is the uncanny

00:00:26.679 --> 00:00:29.460
valley of sound. You have this visual masterpiece,

00:00:29.679 --> 00:00:32.619
but the audio feels like it's being read by a

00:00:32.619 --> 00:00:35.759
GPS navigator. A slightly bored one. It is the

00:00:35.759 --> 00:00:38.880
single biggest hurdle in AI filmmaking right

00:00:38.880 --> 00:00:41.600
now. Or I should say it was the biggest hurdle.

00:00:41.740 --> 00:00:44.950
Which brings us to today. Welcome to the Deep

00:00:44.950 --> 00:00:47.009
Dive. I am really glad you're joining us for

00:00:47.009 --> 00:00:48.689
this one. It's going to be a fascinating conversation.

00:00:49.109 --> 00:00:51.770
We are breaking down a specific workflow today,

00:00:51.890 --> 00:00:57.130
a guide from early 2026 by Max Ann called Mastering

00:00:57.130 --> 00:00:59.890
AI Dialogue: The Emotional Lip Sync Guide. Right.

00:01:00.479 --> 00:01:03.100
And the mission here is to finally solve that

00:01:03.100 --> 00:01:06.019
fake dialogue problem. We are not just talking

00:01:06.019 --> 00:01:08.180
about standard text-to-speech anymore. No,

00:01:08.280 --> 00:01:10.439
we are talking about performance layering. Performance

00:01:10.439 --> 00:01:14.159
layering. The guide covers a six-phase workflow,

00:01:14.519 --> 00:01:18.269
designing the voice, tagging emotions, generating

00:01:18.269 --> 00:01:21.909
storyboard visuals, lip syncing, and then soundtrack and assembly. It is a comprehensive

00:01:21.909 --> 00:01:24.590
system. I have to admit, I still wrestle with

00:01:24.590 --> 00:01:27.450
prompt drift myself. Yeah. I will be tweaking

00:01:27.450 --> 00:01:30.390
a character and the AI just wanders off. Getting

00:01:30.390 --> 00:01:33.010
a consistent emotional result is a real struggle

00:01:33.010 --> 00:01:35.810
for me. So this guide feels incredibly relevant.

00:01:36.049 --> 00:01:39.629
It completely reframes the process. Yeah. For

00:01:39.629 --> 00:01:42.329
the last few years, we have been treating AI

00:01:42.329 --> 00:01:45.269
video like a microwave dinner. How do you mean?

00:01:45.430 --> 00:01:47.659
You press one button. You say, make me a movie.

00:01:47.659 --> 00:01:50.000
Yeah. And you just hope the whole meal comes out

00:01:50.000 --> 00:01:51.680
cooked evenly. And it usually comes out with the

00:01:51.680 --> 00:01:54.140
edges burnt and the middle frozen solid. Precisely.

00:01:54.140 --> 00:01:57.239
This guide argues that you have to cook the components

00:01:57.239 --> 00:01:59.640
separately. You design the voice in one place,

00:01:59.640 --> 00:02:02.540
you tag emotions in another. Yeah. Generate visuals

00:02:02.540 --> 00:02:04.680
separately, and then stitch it all together. Exactly.

00:02:04.680 --> 00:02:07.620
It's the only way to get a truly human feel. Let's

00:02:07.620 --> 00:02:11.199
jump into phase one, voice design. The source makes

00:02:11.199 --> 00:02:14.039
a really strong point right away. Standard AI

00:02:14.039 --> 00:02:17.060
voices are actually designed to fail at acting.

00:02:17.199 --> 00:02:20.099
They are. Think about what a standard text-to-

00:02:20.099 --> 00:02:22.719
speech model, which is just AI reading words

00:02:22.719 --> 00:02:25.979
aloud, is built for. Right. Historically, it

00:02:25.979 --> 00:02:29.000
is built for clarity. To read an audiobook or

00:02:29.000 --> 00:02:32.699
a news article, it prioritizes clear enunciation.

00:02:33.020 --> 00:02:36.020
A steady pace. Yeah. But acting isn't about clarity.

00:02:36.439 --> 00:02:39.639
Acting is about subtext. It is messy. The guide

00:02:39.639 --> 00:02:42.139
uses a great example, a line from a desert survival

00:02:42.139 --> 00:02:44.560
scene. Right, the Heine scene. The line is, there's

00:02:44.560 --> 00:02:47.000
nothing out there, Heine. No road, no shelter,

00:02:47.099 --> 00:02:49.580
nothing. Now, if you feed that into a default

00:02:49.580 --> 00:02:52.939
AI voice, it reads it perfectly. Crisp and clean.

00:02:53.159 --> 00:02:55.379
There's nothing out there, Heine. Exactly. And

00:02:55.379 --> 00:02:57.719
it's completely wrong. Because if that character

00:02:57.719 --> 00:02:59.699
has been walking in the scorching sun for three

00:02:59.699 --> 00:03:02.240
days without water, they shouldn't sound crisp.

00:03:02.400 --> 00:03:03.840
They should sound exhausted. They should sound

00:03:03.840 --> 00:03:06.840
hollow, defeated. So the fix is to separate the

00:03:06.840 --> 00:03:09.639
audio workflow entirely. Do not use all-in-one

00:03:09.639 --> 00:03:11.680
generators. The guide specifically recommends

00:03:11.680 --> 00:03:14.340
Eleven Labs' Voice Design for this. But the trick

00:03:14.340 --> 00:03:16.780
is how you prompt the voice. Most people just

00:03:16.780 --> 00:03:20.719
list demographics. Male, 40s, American accent.

00:03:20.919 --> 00:03:22.939
And that gives you a generic 40-year-old American.

00:03:23.460 --> 00:03:27.099
A stock photo of a voice. Max Ann says you must

00:03:27.099 --> 00:03:29.379
ignore those default templates. You have to describe

00:03:29.379 --> 00:03:31.659
the situation. Right. Not the acoustic sound,

00:03:31.780 --> 00:03:33.939
but the biological state. So it's less about

00:03:33.939 --> 00:03:36.139
describing the sound of the voice and more about

00:03:36.139 --> 00:03:38.439
describing the suffering of the character. Exactly.

00:03:38.439 --> 00:03:42.139
You prompt for fatigue, tension, the strain of uncertainty.

00:03:42.680 --> 00:03:46.199
Context creates the timbre. Perfectly said. If

00:03:46.199 --> 00:03:48.860
you tell the AI the character is confident, but

00:03:48.860 --> 00:03:51.139
they are stranded in a desert, it won't work.
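
NOTE
A minimal Python sketch of the phase-one contrast. The prompt contrast is the
guide's point; the exact wording, the endpoint, and the JSON field names here
are illustrative assumptions, so check the Eleven Labs voice design docs.
  # What most people write: demographics. A "stock photo of a voice".
  generic = "Male, 40s, American accent."
  # What the guide recommends: the situation and the biological state.
  situational = (
      "A man in his 40s, three days walking in desert heat without water. "
      "Parched, hollow, exhausted; the strain of uncertainty. Defeated, "
      "not confident."
  )
  import requests  # hypothetical call; endpoint and fields are assumptions
  resp = requests.post(
      "https://api.elevenlabs.io/v1/text-to-voice/create-previews",
      headers={"xi-api-key": "YOUR_KEY"},
      json={"voice_description": situational,
            "text": "There's nothing out there, Heine. No road, no shelter, nothing."},
  )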

00:03:51.879 --> 00:03:54.659
The biological state prompts the AI to find the

00:03:54.659 --> 00:03:57.120
texture of suffering, the microtremors in the

00:03:57.120 --> 00:03:59.180
vocal cords. That makes a lot of sense. So that

00:03:59.180 --> 00:04:01.840
is phase one. We have our raw voice. Now phase

00:04:01.840 --> 00:04:04.960
two is emotion tagging. This is where we control

00:04:04.960 --> 00:04:08.259
the actual performance using the Eleven Labs Eleven v3

00:04:08.259 --> 00:04:11.419
alpha model. This model is a massive leap because

00:04:11.419 --> 00:04:13.979
it allows for audio tags. Explain how those work

00:04:13.979 --> 00:04:16.259
in this context. Think of it like a stage director.

00:04:16.699 --> 00:04:19.339
You treat these tags like acting notes in a script.

00:04:19.819 --> 00:04:23.720
They are in brackets, like [sighs], [gulps], [whispering],

00:04:23.980 --> 00:04:26.600
[desperate]. And the AI doesn't read the word sighs

00:04:26.600 --> 00:04:30.220
out loud. No, it performs the sigh. It directs

00:04:30.220 --> 00:04:32.560
the AI's delivery of the next words. The guide

00:04:32.560 --> 00:04:35.240
talks about the arc of a line. It is not just

00:04:35.240 --> 00:04:37.839
one tag for a whole sentence. Because humans

00:04:37.839 --> 00:04:40.199
don't feel one emotion for 10 straight seconds,

00:04:40.399 --> 00:04:43.139
our emotions shift dynamically. So you could

00:04:43.139 --> 00:04:45.740
have a male character start a line tagged as

00:04:45.740 --> 00:04:49.040
loud, but end the line tagged as quiet. Or a female

00:04:49.040 --> 00:04:51.899
character shifting from quietly frustrated to

00:04:51.899 --> 00:04:55.339
total resignation. That trailing off into silence.
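
NOTE
As a concrete illustration, an emotionally arced line written as Eleven v3
input. This is a sketch: the tag wording is our guess at the idea described
here, and the SDK call shape is an assumption, so verify against the docs.
  from elevenlabs.client import ElevenLabs  # pip install elevenlabs
  client = ElevenLabs(api_key="YOUR_KEY")
  # One line, shifting emotion: tags direct delivery and are not read aloud.
  line = ("[quietly frustrated] There's nothing out there, Heine. "
          "[sighs] [resigned] No road... no shelter... nothing.")
  audio = client.text_to_speech.convert(
      voice_id="YOUR_VOICE_ID", model_id="eleven_v3", text=line)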

00:04:55.519 --> 00:04:58.959
That is what makes it feel alive. So it

00:04:58.959 --> 00:05:01.259
sounds like we're moving from prompting to actual

00:05:01.259 --> 00:05:03.439
directing. Does this require a lot of trial and

00:05:03.439 --> 00:05:05.519
error? Yes. It is a numbers game. You generate

00:05:05.519 --> 00:05:07.839
batches. You generate batches to find the human

00:05:07.839 --> 00:05:09.920
take. Exactly. You listen for smooth emotional

00:05:09.920 --> 00:05:13.079
shifts. Sometimes the AI glitches and you get

00:05:13.079 --> 00:05:15.060
a microphone change. A drop in audio quality.

00:05:15.259 --> 00:05:17.399
Right. It breaks the realism entirely. So you

00:05:17.399 --> 00:05:19.660
discard those and keep the smooth ones. All right.
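
NOTE
The numbers game, sketched in Python. This reuses the hypothetical client and
tagged line from the previous note; the file handling is illustrative.
  for take in range(1, 9):
      audio = client.text_to_speech.convert(
          voice_id="YOUR_VOICE_ID", model_id="eleven_v3", text=line)
      with open(f"take_{take:02d}.mp3", "wb") as f:
          for chunk in audio:  # the SDK streams the audio as byte chunks
              f.write(chunk)
  # Audition every take: keep the smooth emotional shifts, discard any with
  # the "microphone change" glitch, a sudden drop in audio quality mid-line.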

00:05:19.759 --> 00:05:23.519
Moving to phase three. Visual consistency and

00:05:23.519 --> 00:05:26.699
the three by three method. This is why we bring

00:05:26.699 --> 00:05:29.660
in Nano Banana Pro to generate characters. And

00:05:29.660 --> 00:05:32.180
this solves the classic nightmare of AI video.

00:05:32.600 --> 00:05:36.209
Keeping the face the same across different shots.

00:05:36.209 --> 00:05:38.769
It is so frustrating. You get a great close-up,

00:05:38.769 --> 00:05:40.949
and then the wide shot looks like a completely

00:05:40.949 --> 00:05:43.670
different person. The solution here is the 3x3

00:05:43.670 --> 00:05:46.589
storyboard grid method. How does that work? You

00:05:46.589 --> 00:05:49.209
use one prompt to generate nine separate shots

00:05:49.209 --> 00:05:52.829
all contained in a single image file. A grid, like

00:05:52.829 --> 00:05:55.310
a contact sheet. Right. Because diffusion models,

00:05:55.310 --> 00:05:58.310
the AI that generates images, start with a random

00:05:58.310 --> 00:06:01.459
mathematical seed. By doing a grid, you force

00:06:01.459 --> 00:06:04.180
the AI to use the exact same seed for all nine

00:06:04.180 --> 00:06:06.600
panels simultaneously. It locks in the lighting

00:06:06.600 --> 00:06:09.639
and the facial structure. But it introduces a

00:06:09.639 --> 00:06:12.379
new issue, the blurry face problem. Because in

00:06:12.379 --> 00:06:14.980
the wide shots on that grid, the face is tiny.
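
NOTE
For illustration, one way such a single grid prompt might read. This wording
is ours, not the guide's:
  grid_prompt = (
      "One image, a 3x3 storyboard grid, nine panels, the same character in "
      "every panel: a gaunt man in his 40s, sunburnt, cracked lips, gritty "
      "cinematic lighting. Panel 1 extreme wide desert shot, 2 wide walking "
      "shot, 3 medium shot, 4 close-up of exhausted face, 5 extreme close-up "
      "of eyes, 6 over-the-shoulder shot, 7 low-angle shot, 8 profile shot, "
      "9 two-shot with a second character."
  )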

00:06:15.300 --> 00:06:17.519
And the AI doesn't allocate enough detail to

00:06:17.519 --> 00:06:20.180
small subjects. It becomes a smudge. The guide

00:06:20.180 --> 00:06:22.240
has an upscaling trick for this, right? It does.

00:06:22.300 --> 00:06:24.680
You save the blurry wide shot. Then you save

00:06:24.680 --> 00:06:27.100
a sharp close-up from that same grid. You upload

00:06:27.100 --> 00:06:30.860
both into Nano Banana Pro. And you use an in-

00:06:30.860 --> 00:06:33.459
painting prompt to repaint the facial details

00:06:33.459 --> 00:06:35.959
on the wide shot using the close-up as a reference.
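
NOTE
And the repair step as an illustrative prompt, again our wording rather than
the guide's. Both images are uploaded alongside it:
  repair_prompt = (
      "Image 1 is a wide shot where the character's face is blurry. Image 2 "
      "is a sharp close-up of the same character from the same grid. Repaint "
      "the face in image 1 using image 2 as the identity reference, keeping "
      "the pose, framing, and lighting of image 1 unchanged."
  )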

00:06:36.220 --> 00:06:39.040
That upscaling trick feels like a lot of extra

00:06:39.040 --> 00:06:41.620
work. Is it strictly necessary? It is if you

00:06:41.620 --> 00:06:44.000
want lip sync to work. Animation tools need a

00:06:44.000 --> 00:06:47.240
sharp mouth to track. Exactly. If you feed the

00:06:47.240 --> 00:06:50.139
software a blurry face, the mouth tracking slides

00:06:50.139 --> 00:06:52.720
all over the place. It ruins the illusion immediately.

00:06:53.139 --> 00:06:56.579
Let's take a brief pause here. We will be right

00:06:56.579 --> 00:06:58.459
back to talk about making that face actually

00:06:58.459 --> 00:07:02.220
move. All

00:07:02.220 --> 00:07:03.879
right, we are back. We have our tagged emotional

00:07:03.879 --> 00:07:05.959
audio. We have our sharp, consistent visuals.

00:07:06.540 --> 00:07:09.439
Now, phase four, lip sync and motion prompts.

00:07:09.579 --> 00:07:11.339
This is where we animate the face. Right. The

00:07:11.339 --> 00:07:13.540
guide compares a couple of tools here. OmniHuman

00:07:13.540 --> 00:07:16.699
1.5 and Creatify Aurora. OmniHuman is better

00:07:16.699 --> 00:07:19.480
for big, dramatic movements, right? Yes. Flailing

00:07:19.480 --> 00:07:22.839
arms, big speeches. But for this nuanced emotional

00:07:22.839 --> 00:07:26.000
dialogue, the guide highly recommends Creatify

00:07:26.000 --> 00:07:28.949
Aurora. Why is that? It preserves skin texture

00:07:28.949 --> 00:07:31.350
much better during subtle movements, and it handles

00:07:31.350 --> 00:07:34.389
clips up to 60 seconds long seamlessly. But the

00:07:34.389 --> 00:07:36.430
really crucial part of this phase isn't just

00:07:36.430 --> 00:07:40.230
the tool. It is the motion prompt. This is a

00:07:40.230 --> 00:07:42.930
massive paradigm shift. We are so used to giving

00:07:42.930 --> 00:07:47.529
mechanical instructions, like nod twice or look

00:07:47.529 --> 00:07:50.490
left. Do not give mechanical instructions. That

00:07:50.490 --> 00:07:53.149
is how you get robotic bobblehead movements.

00:07:53.250 --> 00:07:55.610
You have to describe the internal state instead.

00:07:55.750 --> 00:07:58.120
Yes. Emotional prompting. You write something

00:07:58.120 --> 00:08:00.509
like, trying to hold themselves together. That

00:08:00.509 --> 00:08:02.949
is fascinating. Prompting the emotion of the

00:08:02.949 --> 00:08:04.990
movement rather than the movement itself. Right.

00:08:05.050 --> 00:08:08.189
The AI has analyzed millions of human videos.

00:08:08.509 --> 00:08:11.250
It knows how the jaw clenches when someone holds

00:08:11.250 --> 00:08:14.350
back tears. It prevents robotic nods and gives

00:08:14.350 --> 00:08:16.910
believable body language. It translates the emotional

00:08:16.910 --> 00:08:19.829
cue into natural physical behavior far better

00:08:19.829 --> 00:08:21.589
than we ever could manually.
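
NOTE
The two prompting styles side by side, with hypothetical wording:
  # Mechanical: produces the robotic bobblehead movement described above.
  mechanical = "Nod twice, then look left."
  # Emotional: describe the internal state and let the model supply the
  # body language it has learned from human video.
  emotional = (
      "Trying to hold themselves together: jaw tight, small swallows, eyes "
      "wet but not spilling over, shallow breaths between phrases."
  )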

00:08:22.009 --> 00:08:24.170
Let that sink in for a second. We're trusting

00:08:24.170 --> 00:08:26.910
the machine's latent understanding of human psychology

00:08:26.910 --> 00:08:29.910
to drive the performance. It is profound. It

00:08:29.910 --> 00:08:32.669
is. All right, phase five and six, soundtrack

00:08:32.669 --> 00:08:34.789
and assembly. This is where we bring it all together.

00:08:35.090 --> 00:08:38.070
For music, the guide suggests Eleven Labs Music

00:08:38.070 --> 00:08:40.669
Creation. The key here is prompting for atmosphere,

00:08:40.950 --> 00:08:43.330
not just instruments. Right. You don't just say

00:08:43.330 --> 00:08:46.789
acoustic guitar. You say desert survival. And

00:08:46.789 --> 00:08:50.750
critically, you have to match the tempo to the

00:08:50.750 --> 00:08:53.909
dialogue pacing. If it is a slow, painful conversation,

00:08:54.289 --> 00:08:57.649
the music needs a slow tempo. Exactly. And when

00:08:57.649 --> 00:09:00.289
you bring it all into your editor, DaVinci Resolve

00:09:00.289 --> 00:09:03.889
or Premiere, there is a strict mixing rule. Keep

00:09:03.889 --> 00:09:07.730
the music at 25 to 35% of the dialogue volume.
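
NOTE
That ratio is easy to apply in code. A sketch using pydub (file names are
illustrative); 25 to 35% of the dialogue's amplitude is roughly 9 to 12 dB
below it, since 20*log10(0.30) is about -10.5.
  import math
  from pydub import AudioSegment  # pip install pydub, requires ffmpeg
  dialogue = AudioSegment.from_file("dialogue_mix.wav")
  music = AudioSegment.from_file("desert_theme.wav")
  ratio = 0.30                   # music at ~30% of dialogue volume
  gain = 20 * math.log10(ratio)  # about -10.5 dB
  # Level the music against the dialogue, then drop it by the target ratio.
  music = music.apply_gain((dialogue.dBFS + gain) - music.dBFS)
  final = dialogue.overlay(music)
  final.export("scene_final.wav", format="wav")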

00:09:08.070 --> 00:09:10.909
The voice is the star. Don't let the music fight

00:09:10.909 --> 00:09:13.070
it. But the actual editing technique is what

00:09:13.070 --> 00:09:15.370
caught my eye. Cutting on the rhythm of the dialogue.

00:09:15.649 --> 00:09:18.370
Letting lines breathe. This is Film School 101,

00:09:18.690 --> 00:09:21.570
but AI creators often miss it. You cannot just

00:09:21.570 --> 00:09:23.610
leave the camera on the person speaking the entire

00:09:23.610 --> 00:09:26.049
time. You have to insert reaction shots. While

00:09:26.049 --> 00:09:27.889
the main character speaks, cut to the listener.

00:09:28.110 --> 00:09:30.870
Show them absorbing the words. It seems the editing

00:09:30.870 --> 00:09:32.750
is where the story actually happens, doesn't

00:09:32.750 --> 00:09:35.070
it? Absolutely. The reaction shots are what sell

00:09:35.070 --> 00:09:37.029
the relationship between the characters. Reaction

00:09:37.029 --> 00:09:39.549
shots sell the relationship. It proves they exist

00:09:39.549 --> 00:09:42.529
in the same space. And, practically speaking,

00:09:42.809 --> 00:09:45.230
it is a great way to hide any minor lip-sync

00:09:45.230 --> 00:09:48.389
glitches. Oh. That is clever. Yeah. If the mouth

00:09:48.389 --> 00:09:50.590
looks a little rubbery on a specific word, just

00:09:50.590 --> 00:09:52.970
cut to the listener's face for that second. The

00:09:52.970 --> 00:09:55.230
audio keeps playing, the emotion lands, and you

00:09:55.230 --> 00:09:58.309
hide the artifact. That is traditional filmmaking

00:09:58.309 --> 00:10:01.289
saving cutting-edge tech. I love that. It works

00:10:01.289 --> 00:10:03.590
perfectly. So we have walked through the whole

00:10:03.590 --> 00:10:06.730
process. When you step back and look at this

00:10:06.730 --> 00:10:10.929
entire workflow, what is the big idea here? The

00:10:10.929 --> 00:10:13.610
big idea is that we have officially moved past

00:10:13.610 --> 00:10:16.529
the era where AI video was just a novelty. It

00:10:16.529 --> 00:10:19.190
used to just be a magic trick. Exactly. But looking

00:10:19.190 --> 00:10:22.750
at this, imagine scaling this. We aren't just

00:10:22.750 --> 00:10:25.309
making cute little clips anymore. A single person

00:10:25.309 --> 00:10:28.450
sitting at a desk can now orchestrate a full

00:10:28.450 --> 00:10:31.610
emotional scene with the nuance of a Hollywood

00:10:31.610 --> 00:10:33.840
film studio. The ceiling has been totally removed.

00:10:34.200 --> 00:10:36.980
It has. By treating the voice, the visual, and

00:10:36.980 --> 00:10:39.100
the movement as separate modular performances,

00:10:39.480 --> 00:10:42.200
the uncanny valley just disappears. It is like

00:10:42.200 --> 00:10:44.679
stacking Lego blocks of data. That is a perfect

00:10:44.679 --> 00:10:47.460
analogy. A voice block, a visual block, a motion

00:10:47.460 --> 00:10:49.360
block. You snap them together to build something

00:10:49.360 --> 00:10:51.899
that feels completely organic. It blurs the line

00:10:51.899 --> 00:10:54.200
between generating something and actually directing

00:10:54.200 --> 00:10:56.820
something. You are a director now. So for the

00:10:56.820 --> 00:10:58.659
person listening right now who might be wanting

00:10:58.659 --> 00:11:01.700
to test these waters, what is the one thing they

00:11:01.700 --> 00:11:05.100
should do today? Start small. Don't try to make

00:11:05.100 --> 00:11:09.059
a whole movie today. Try just one step. Go into

00:11:09.059 --> 00:11:13.019
a tool, create one custom voice that isn't a

00:11:13.019 --> 00:11:15.940
default template. Prompt for a biological state.

00:11:16.220 --> 00:11:19.379
Exactly. Write one emotionally tagged line and

00:11:19.379 --> 00:11:21.580
just hear the difference. Once you hear that

00:11:21.580 --> 00:11:23.990
genuine emotion, you'll see the potential. I think

00:11:23.990 --> 00:11:26.549
that is great advice. I want to leave you

00:11:26.549 --> 00:11:29.809
with a final thought to mull over. We have spent

00:11:29.809 --> 00:11:31.809
this whole time talking about how we can direct

00:11:31.809 --> 00:11:35.470
the AI to perfectly mimic human emotion. But

00:11:35.470 --> 00:11:37.610
what happens when the AI starts directing us,

00:11:37.750 --> 00:11:40.049
subtly shaping our emotional responses through

00:11:40.049 --> 00:11:42.309
these perfectly engineered synthetic performances?

00:11:42.769 --> 00:11:45.110
When the machine knows exactly which microexpression

00:11:45.110 --> 00:11:47.490
will make you cry, who is really pulling the

00:11:47.490 --> 00:11:49.429
strings? That is a chilling thought to end on.

00:11:49.549 --> 00:11:51.679
Something to think about. Thanks for joining

00:11:51.679 --> 00:11:53.779
us on this deep dive. It was a pleasure. We will

00:11:53.779 --> 00:11:54.399
see you next time.
