WEBVTT

00:00:00.000 --> 00:00:02.279
Close your eyes for a second. I want you to imagine

00:00:02.279 --> 00:00:04.160
you're sitting in a director's chair. You're

00:00:04.160 --> 00:00:06.900
on set. The lighting is moody. The tension is

00:00:06.900 --> 00:00:11.419
palpable. You yell, "Action!" But here's the twist.

00:00:11.839 --> 00:00:15.539
There's no camera crew, no lights, no actors.

00:00:15.759 --> 00:00:18.160
You just type a single sentence into a terminal.

00:00:18.399 --> 00:00:20.460
And for the first time, and this is the part

00:00:20.460 --> 00:00:22.600
that actually matters, the actors remember who

00:00:22.600 --> 00:00:24.719
they are from the first shot to the last. They

00:00:24.719 --> 00:00:26.679
don't morph into different people. They don't

00:00:26.679 --> 00:00:29.649
suddenly grow a third arm. We are looking at

00:00:29.649 --> 00:00:33.890
the death of the cool demo clip and maybe the

00:00:33.890 --> 00:00:37.170
birth of actual AI filmmaking. Welcome back to

00:00:37.170 --> 00:00:39.850
The Deep Dive. It is February 2026. If you've

00:00:39.850 --> 00:00:41.210
been blinking, you might have missed that the

00:00:41.210 --> 00:00:45.729
AI video wars have gone, well, nuclear. And this

00:00:45.729 --> 00:00:47.570
is what we're digging into today. We aren't just

00:00:47.570 --> 00:00:49.390
looking at text-to-video resolution anymore.

00:00:49.530 --> 00:00:51.429
We're looking at something, honestly, much more

00:00:51.429 --> 00:00:55.159
important: workflow. That's the key word. Workflow.

00:00:55.159 --> 00:00:57.299
For so long, we were just in that novelty phase.

00:00:57.420 --> 00:00:59.600
Look, a panda riding a skateboard. It's cute,

00:00:59.740 --> 00:01:02.640
technically impressive, but useless for telling

00:01:02.640 --> 00:01:05.200
a story. Now we're finally seeing tools that

00:01:05.200 --> 00:01:07.560
are trying to be actual production suites. So

00:01:07.560 --> 00:01:10.319
today, we're doing a forensic breakdown of a

00:01:10.319 --> 00:01:12.879
major new player that claims to have taken the

00:01:12.879 --> 00:01:16.299
crown. We're talking about Kling 3.0, and we're

00:01:16.299 --> 00:01:18.079
going to see how it stacks up against the heavyweights.

00:01:18.719 --> 00:01:24.000
OpenAI's Sora 2 and, uh... Google's Veo 3.1.

00:01:24.099 --> 00:01:26.599
And we have a great set of data for this. We're

00:01:26.599 --> 00:01:28.439
looking at a really comprehensive comparison

00:01:28.439 --> 00:01:32.099
by Max Ann over at AI Fire. He didn't just read

00:01:32.099 --> 00:01:34.420
the spec sheets. He actually burned through a

00:01:34.420 --> 00:01:36.599
ton of credits running these head-to-head stress

00:01:36.599 --> 00:01:38.859
tests. Which is great because it saves us the

00:01:38.859 --> 00:01:42.260
money. So here's the roadmap. First, we're going

00:01:42.260 --> 00:01:44.219
to unpack the features that, you know, theoretically

00:01:44.219 --> 00:01:46.980
shift Kling from a toy to a tool, specifically

00:01:46.980 --> 00:01:49.980
Multishot and Omni. Then we'll look at the economics.

00:01:50.079 --> 00:01:52.420
What does this thing actually cost to run? Then

00:01:52.420 --> 00:01:55.079
we enter the arena. We've got five brutal head-to-head

00:01:55.079 --> 00:01:57.439
tests. The fun part. Yeah, dialogue, emotion,

00:01:57.640 --> 00:01:59.379
action. And finally, we'll try to figure out

00:01:59.379 --> 00:02:00.859
who this is actually for. Sounds like a plan.

00:02:00.920 --> 00:02:03.319
Let's start with the core thesis here. Kling

00:02:03.319 --> 00:02:06.859
3.0 seems to be arguing that we don't need higher

00:02:06.859 --> 00:02:08.919
resolution. You know, we have enough pixels.

00:02:09.280 --> 00:02:11.300
They're saying we need to fix broken workflows.

00:02:11.719 --> 00:02:13.419
What does that actually mean in practice? Because

00:02:13.419 --> 00:02:16.780
workflow sounds a little corporate. It does,

00:02:16.860 --> 00:02:19.099
doesn't it? But think about the pain of making

00:02:19.099 --> 00:02:23.979
AI video back in, say, 2025. You generate a

00:02:23.979 --> 00:02:26.360
three-second clip, then another one. Then you take

00:02:26.360 --> 00:02:28.039
them into Premiere and try to stitch them together.

00:02:28.240 --> 00:02:30.439
But the problem was always consistency. Right.

00:02:30.539 --> 00:02:32.620
In clip one, the character has a blue shirt.

00:02:32.919 --> 00:02:35.659
In clip two, it's navy. Exactly. Or the lighting

00:02:35.659 --> 00:02:38.639
has shifted from sunset to, like, high noon for

00:02:38.639 --> 00:02:41.099
no reason. You were fighting the tool. So Kling

00:02:41.099 --> 00:02:43.840
3.0 introduces multi-shot generation. Yeah.

00:02:43.900 --> 00:02:45.560
And it's exactly what it sounds like, but the

00:02:45.560 --> 00:02:47.979
implication is huge. Instead of one isolated

00:02:47.979 --> 00:02:49.919
clip, you're basically directing a mini movie

00:02:49.919 --> 00:02:52.340
in a single prompt. Okay. So you can define shot

00:02:52.340 --> 00:02:54.240
one as a close-up, shot two as an

00:02:54.240 --> 00:02:56.460
over-the-shoulder, that kind of thing. That's it. And

00:02:56.460 --> 00:02:59.360
the AI generates them all in sequence, as one

00:02:59.360 --> 00:03:02.020
cohesive video file. It handles the cuts internally.

00:03:02.240 --> 00:03:04.680
So it handles continuity, the lighting. All of

00:03:04.680 --> 00:03:07.400
it. The flow of movement. You aren't stitching,

00:03:07.580 --> 00:03:10.180
you're sequencing. And crucially, they've bumped

00:03:10.180 --> 00:03:12.840
the generation time up. It used to be standard

00:03:12.840 --> 00:03:16.159
to get, what, 10 seconds of video. Kling 3.0

00:03:16.159 --> 00:03:18.340
pushes that to 15. I have to play devil's advocate

00:03:18.340 --> 00:03:20.340
here. Five seconds. That doesn't sound like a

00:03:20.340 --> 00:03:22.479
revolutionary leap. Why does that matter so much?

00:03:22.639 --> 00:03:25.000
Oh, in filmmaking terms, five seconds is an eternity.

00:03:25.939 --> 00:03:28.599
It's the difference between like a TikTok edit

00:03:28.599 --> 00:03:31.180
and a scene that actually has room to breathe.

00:03:31.539 --> 00:03:34.020
In a 10-second clip, everything feels rushed.

00:03:34.669 --> 00:03:38.050
You know, start, action, and cut. It creates

00:03:38.050 --> 00:03:41.270
that sort of frantic AI-generated anxiety. With

00:03:41.270 --> 00:03:44.289
15 seconds, a character can hesitate. They can

00:03:44.289 --> 00:03:46.789
look around. They can react. It just reduces

00:03:46.789 --> 00:03:49.969
that subconscious "this feels fake" feeling. Okay,

00:03:50.030 --> 00:03:51.990
but there has to be a tradeoff. Computers don't

00:03:51.990 --> 00:03:53.909
just give you more processing power for free.

00:03:53.990 --> 00:03:55.969
What breaks when you go longer? You're right.

00:03:55.990 --> 00:03:57.830
There is a tax. The longer the video, the higher

00:03:57.830 --> 00:04:00.090
the risk of visual drift. Define that for us.

00:04:00.400 --> 00:04:04.759
Visual drift is when the AI starts to forget

00:04:04.759 --> 00:04:07.800
the details of the scene the further it gets

00:04:07.800 --> 00:04:10.780
from the start frame. By second 14, the AI might

00:04:10.780 --> 00:04:12.680
forget there was a lamp in the corner or the

00:04:12.680 --> 00:04:14.699
wallpaper pattern starts to shift. So you might

00:04:14.699 --> 00:04:16.579
have to run it a few more times to get a clean

00:04:16.579 --> 00:04:19.120
take. Exactly. You trade some reliability for

00:04:19.120 --> 00:04:21.519
better pacing. That brings us to the other big

00:04:21.519 --> 00:04:24.160
feature, which, frankly, this sounds like the

00:04:24.160 --> 00:04:26.319
holy grail for anyone trying to tell a story

00:04:26.319 --> 00:04:30.139
with recurring characters. Omni. Kling 3.0 Omni

00:04:30.139 --> 00:04:33.100
is basically multi-element control. And yeah,

00:04:33.180 --> 00:04:35.040
this addresses the biggest complaint creators

00:04:35.040 --> 00:04:38.360
have had forever. Identity consistency. In the

00:04:38.360 --> 00:04:39.939
past, you'd write a whole paragraph describing

00:04:39.939 --> 00:04:42.660
a guy in a red hat and just hope the AI drew

00:04:42.660 --> 00:04:44.839
the same guy every time. And it never did. It

00:04:44.839 --> 00:04:46.860
would give you generic guy in red hat A and then

00:04:46.860 --> 00:04:49.620
generic guy in red hat B. Exactly. With Omni,

00:04:49.720 --> 00:04:51.639
you can upload up to seven visual references.

00:04:51.939 --> 00:04:55.680
Seven. Seven distinct elements. Seven. So you

00:04:55.680 --> 00:04:58.120
can upload a photo of a specific actor and tag

00:04:58.120 --> 00:05:01.550
it "@Japanese boy." Upload another. Tag it

00:05:01.550 --> 00:05:04.290
"@Japanese girl." Upload a photo of a specific park

00:05:04.290 --> 00:05:07.610
bench. Tag it "@bench." Then in your prompt,

00:05:07.670 --> 00:05:10.649
you just say, "@Japanese boy sits on @bench

00:05:10.649 --> 00:05:13.209
next to @Japanese girl." And it actually works.

00:05:13.250 --> 00:05:15.870
It doesn't do that weird face swapping thing

00:05:15.870 --> 00:05:18.730
or blend them into a blob. According to the stress

00:05:18.730 --> 00:05:22.269
tests, it's surprisingly robust. Max, our source,

00:05:22.389 --> 00:05:24.939
described this scene. The two characters, the

00:05:24.939 --> 00:05:27.300
Japanese boy and girl, are sitting on that specific

00:05:27.300 --> 00:05:29.680
bench. They're having a conversation, and they

00:05:29.680 --> 00:05:32.480
even share a pair of headphones. And the AI kept

00:05:32.480 --> 00:05:35.819
their faces consistent the entire time. No morphing,

00:05:35.879 --> 00:05:37.699
no glitches. That's that workflow part you're

00:05:37.699 --> 00:05:39.600
talking about. You aren't just rolling the dice

00:05:39.600 --> 00:05:41.660
on who shows up to set. Precisely. And they've

00:05:41.660 --> 00:05:43.939
updated the audio, too. The lip sync is reportedly

00:05:43.939 --> 00:05:46.779
much better than version 2.6. Better, or...

00:05:47.400 --> 00:05:50.459
Good is the operative word. We're not at, you

00:05:50.459 --> 00:05:53.160
know, human-level dubbing yet. But the characters

00:05:53.160 --> 00:05:55.579
can finally speak without their jaws unhinging

00:05:55.579 --> 00:05:58.300
or the timing being totally off. It doesn't break

00:05:58.300 --> 00:06:00.220
the immersion immediately, which is a big step.

00:06:00.420 --> 00:06:02.519
I want to push on this a bit. If I have to tag

00:06:02.519 --> 00:06:05.199
seven items, upload reference images, describe

00:06:05.199 --> 00:06:08.019
camera angles for three different shots, all

00:06:08.019 --> 00:06:10.660
in one prompt, does increasing the technical

00:06:10.660 --> 00:06:14.279
complexity actually stifle creativity? Or are

00:06:14.279 --> 00:06:17.050
we just becoming like... data entry clerks for

00:06:17.050 --> 00:06:20.250
a GPU. It's a valid fear, but I'd argue it actually

00:06:20.250 --> 00:06:22.050
liberates creativity. I mean, think about it.

00:06:22.209 --> 00:06:24.689
Randomness isn't creativity. Randomness is chaos.

00:06:24.970 --> 00:06:27.730
If you ask for a scary monster and get some random

00:06:27.730 --> 00:06:29.810
result, you aren't an artist, you're a slot machine

00:06:29.810 --> 00:06:33.170
player. Control is creativity. By removing the

00:06:33.170 --> 00:06:35.509
randomness, you're finally free to direct the

00:06:35.509 --> 00:06:37.850
exact scene you see in your head. That's a compelling

00:06:37.850 --> 00:06:40.689
distinction. Control is creativity. Okay, let's

00:06:40.689 --> 00:06:43.430
talk brass tacks. The economics. Usually when

00:06:43.430 --> 00:06:45.810
a new version drops, the price doubles. Is

00:06:45.810 --> 00:06:48.790
Kling expensive? Surprisingly, no. The math holds

00:06:48.790 --> 00:06:51.550
up. The per-second cost is roughly the same as

00:06:51.550 --> 00:06:53.829
Kling 2.6. It's about two credits per second.

00:06:53.970 --> 00:06:56.329
So you're paying more per generation, but only

00:06:56.329 --> 00:06:58.790
because you're getting more video. Exactly. A

00:06:58.790 --> 00:07:02.410
15-second clip costs about 30 credits. A

00:07:02.410 --> 00:07:05.800
10-second clip costs 20. You aren't paying a premium

00:07:05.800 --> 00:07:08.319
for the new features like Multishot or Omni.

00:07:08.639 --> 00:07:10.319
You're just paying for the length. That feels

00:07:10.319 --> 00:07:12.959
fair, assuming the yield rate is good. If I have

00:07:12.959 --> 00:07:15.699
to generate 10 clips to get one usable one, that

00:07:15.699 --> 00:07:18.459
30 credits becomes 300 real fast. And that is

00:07:18.459 --> 00:07:20.279
the hidden cost. We should be clear about that.

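NOTE
Editor's sketch, not part of the spoken episode: the credit arithmetic from this exchange, assuming the rate of roughly 2 credits per second quoted here applies to every clip length. The function names are hypothetical and do not correspond to any Kling API.
```python
# Assumption: flat rate of about 2 credits per second, as quoted in the episode.
CREDITS_PER_SECOND = 2
def clip_cost(seconds: int) -> int:
    """Credits for a single generation of the given length."""
    return seconds * CREDITS_PER_SECOND
def cost_per_usable_clip(seconds: int, attempts_per_keeper: int) -> int:
    """Credits spent to get one usable clip when only one in N attempts is kept."""
    return clip_cost(seconds) * attempts_per_keeper
print(clip_cost(15))                 # 30
print(clip_cost(10))                 # 20
print(cost_per_usable_clip(15, 10))  # 300
```
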
00:07:20.339 --> 00:07:22.920
This isn't a magic button. You will burn credits

00:07:22.920 --> 00:07:25.120
on bad generations. Walk me through the actual

00:07:25.120 --> 00:07:27.079
experience. I'm sitting at the computer. How

00:07:27.079 --> 00:07:29.899
do I use this Multishot thing? Is it code? No,

00:07:29.959 --> 00:07:32.439
it's just a UI toggle. You switch on Multishot

00:07:32.699 --> 00:07:35.279
in the interface, and a panel appears with slots:

00:07:35.279 --> 00:07:38.740
shot one, shot two, shot three. You treat each slot

00:07:38.740 --> 00:07:41.060
like a separate camera angle, and you can set

00:07:41.060 --> 00:07:43.000
the duration for each one. And you write a separate

00:07:43.000 --> 00:07:45.740
prompt for each slot? You do. And this leads to

00:07:45.740 --> 00:07:47.600
a really interesting pro tip from the source

00:07:47.600 --> 00:07:50.220
material that I think is worth dwelling on. I'll

00:07:50.220 --> 00:07:52.259
admit I still wrestle with prompt drift myself.

00:07:52.259 --> 00:07:55.860
I tend to write these very robotic, stiff instructions:

00:07:55.860 --> 00:08:00.009
"Subject stands at coordinates X, Y. Lighting is

00:08:00.009 --> 00:08:02.329
volumetric." We all do. We try to speak computer.

00:08:02.569 --> 00:08:05.310
Right. But the tip here is to stop doing that.

00:08:05.529 --> 00:08:08.970
Use voice to text. Instead of typing, just talk

00:08:08.970 --> 00:08:10.750
to the AI like you're talking to a cameraman.

00:08:11.149 --> 00:08:12.769
Yeah. Okay, give me a close-up on him. He looks

00:08:12.769 --> 00:08:16.269
nervous. Now he opens the door. The source found

00:08:16.269 --> 00:08:18.829
this natural language actually yields better

00:08:18.829 --> 00:08:21.569
results because it captures the vibe of the scene,

00:08:21.689 --> 00:08:24.490
not just the technical specs. That is fascinating.

00:08:24.750 --> 00:08:27.189
It suggests the model understands semantic intent

00:08:27.189 --> 00:08:30.610
better than keyword stuffing. Exactly. So probing

00:08:30.610 --> 00:08:33.009
question here. If the cost is effectively the

00:08:33.009 --> 00:08:35.730
same and the features are better, why wouldn't

00:08:35.730 --> 00:08:37.450
everyone switch immediately? Is there a catch?

00:08:37.710 --> 00:08:39.889
The catch is the learning curve. This is a more

00:08:39.889 --> 00:08:41.590
complex instrument. You aren't just typing cat

00:08:41.590 --> 00:08:44.450
on bike anymore. You are managing shots, references,

00:08:44.750 --> 00:08:47.470
timing. The tax is patience. It requires you

00:08:47.470 --> 00:08:49.750
to actually be a director, not just a prompter.

00:08:49.850 --> 00:08:51.649
Okay, let's move to the main event, the arena.

00:08:52.169 --> 00:08:53.929
No more specs. We're going to look at the brutal

00:08:53.929 --> 00:08:57.110
head-to-head tests that Max ran. Kling 3.0

00:08:57.110 --> 00:09:00.669
versus Sora 2 versus Veo 3.1. Five tests. Test

00:09:00.669 --> 00:09:04.169
number one. Dialogue. The cafe scene. The setup

00:09:04.169 --> 00:09:06.429
is an English man and a French woman at a cafe.

00:09:07.159 --> 00:09:10.379
He speaks English, she replies in French. A classic

00:09:10.379 --> 00:09:13.919
test for video. Who took it? Kling 3.0 took

00:09:13.919 --> 00:09:16.059
the gold here. The report says it had the most

00:09:16.059 --> 00:09:18.600
natural emotion and the lip sync was solid. It

00:09:18.600 --> 00:09:20.659
felt like a real conversation. And the others?

00:09:20.899 --> 00:09:23.419
Veo 3.1 really struggled. It had that distinct

00:09:23.419 --> 00:09:26.299
AI sound. You know, it sounds a bit metallic,

00:09:26.320 --> 00:09:28.799
a bit hollow. The uncanny valley voice. Right.

00:09:28.879 --> 00:09:31.490
And Sora? Sora wasn't even tested in this specific

00:09:31.490 --> 00:09:33.929
bracket. There were some restrictions on generating

00:09:33.929 --> 00:09:36.429
human faces speaking in that context, which is

00:09:36.429 --> 00:09:38.789
a whole other issue about safety rails versus

00:09:38.789 --> 00:09:41.730
usability. Interesting. Okay, Kling wins on talk.

00:09:41.929 --> 00:09:45.029
Test number two, emotional realism, the angry

00:09:45.029 --> 00:09:47.570
close-up. The prompt asks for a handheld shot

00:09:47.570 --> 00:09:50.929
of a person at home holding back rage, jaw clenched,

00:09:50.970 --> 00:09:53.950
nostrils flaring. This was a tie. Kling 3.0

00:09:53.950 --> 00:09:56.580
and Sora 2 both nailed it. They captured that

00:09:56.580 --> 00:09:59.120
subtle tension, the darting eyes, the heavy breathing.

00:09:59.279 --> 00:10:02.379
It wasn't cartoon anger. It was a simmer. And

00:10:02.379 --> 00:10:05.519
Veo? Veo went full soap opera. The source called

00:10:05.519 --> 00:10:08.600
it overacting. Unnatural expressions, contorted

00:10:08.600 --> 00:10:11.000
faces. It didn't get subtlety. It thought anger

00:10:11.000 --> 00:10:15.159
meant "make a scary face." So Veo is the bad actor

00:10:15.159 --> 00:10:18.500
in the group. Got it. Test three, complex movement.

00:10:18.700 --> 00:10:22.269
This one is tough. The courier run. A cinematic

00:10:22.269 --> 00:10:24.529
tracking shot of a courier sprinting through

00:10:24.529 --> 00:10:26.889
a crowd, locking eyes with another runner, and

00:10:26.889 --> 00:10:29.509
throwing an envelope. Whoa. I mean, just imagine

00:10:29.509 --> 00:10:31.710
the math required here. He had to keep a courier

00:10:31.710 --> 00:10:33.830
in focus, keeping consistent, while generating

00:10:33.830 --> 00:10:36.190
random pedestrians and cyclists crossing the

00:10:36.190 --> 00:10:38.889
frame, all while maintaining lighting and momentum.

00:10:39.149 --> 00:10:41.649
That is a computational nightmare. It sounds

00:10:41.649 --> 00:10:44.710
impossible for a probabilistic model. And yet,

00:10:44.789 --> 00:10:48.029
Kling 3.0 did it. It followed the prompt. It

00:10:48.029 --> 00:10:50.370
kept the courier in focus. It handled the occlusion,

00:10:50.370 --> 00:10:52.350
people walking in front of the camera, without

00:10:52.350 --> 00:10:53.990
glitching the main character out of existence.

00:10:54.370 --> 00:10:58.110
What about the big bad wolf, Sora 2? Sora 2 started

00:10:58.110 --> 00:11:01.289
strong, but it stumbled at the finish line. It

00:11:01.289 --> 00:11:03.649
struggled with the ending. The character acting

00:11:03.649 --> 00:11:06.129
got weird when the interaction happened. And

00:11:06.129 --> 00:11:08.950
Veo? Veo failed completely. It basically ignored

00:11:08.950 --> 00:11:10.809
the prompt. It just had random people pointing

00:11:10.809 --> 00:11:13.429
at things. It lost the plot. It sounds like Veo's

00:11:13.429 --> 00:11:16.110
having a really rough day in these tests. Okay,

00:11:16.169 --> 00:11:19.009
test four. This is where it gets funny but also

00:11:19.009 --> 00:11:23.250
revealing. Start and end frames. The prompt was

00:11:23.250 --> 00:11:25.490
to transition from a quiet courthouse at dawn

00:11:25.490 --> 00:11:28.590
to a wrecked hallway inside. This is a classic

00:11:28.590 --> 00:11:30.509
fill -in -the -blank test. Here's the start.

00:11:30.610 --> 00:11:32.370
Here's the end. You figure out how we got there.

00:11:32.470 --> 00:11:36.490
Kling 3.0? Hallucinated. Badly. It completely

00:11:36.490 --> 00:11:39.210
misunderstood the logic of the scene. It turned

00:11:39.210 --> 00:11:41.669
the police officers into robbers. It just swapped

00:11:41.669 --> 00:11:43.889
their roles mid -video. It didn't get the narrative

00:11:43.889 --> 00:11:46.909
implication. It just morphed pixel blobs. And Veo?

00:11:47.090 --> 00:11:49.509
Broken physics. It smashed the entrance in a

00:11:49.509 --> 00:11:51.629
way that didn't make physical sense. Right. Walls

00:11:51.629 --> 00:11:53.269
just kind of dissolved instead of breaking. So

00:11:53.269 --> 00:11:56.909
who won? Sora 2. This is where OpenAI flexed

00:11:56.909 --> 00:11:59.509
its muscle. It produced what the source called

00:11:59.509 --> 00:12:02.669
an absolute cinema cutscene. It understood the

00:12:02.669 --> 00:12:04.309
physics of destruction, the lighting change,

00:12:04.490 --> 00:12:07.950
the transition perfectly. So Kling wins on logic

00:12:07.950 --> 00:12:11.190
and consistency, but Sora wins on the raw physics

00:12:11.190 --> 00:12:14.610
engine. Exactly. When things need to break realistically,

00:12:14.990 --> 00:12:17.990
Sora understands the world better. Final test.

00:12:18.230 --> 00:12:21.669
Test number five. Multi-shot consistency. The

00:12:21.669 --> 00:12:24.830
rooftop. A romantic dialogue on a European terrace

00:12:24.830 --> 00:12:27.269
about trees turning yellow. This plays right

00:12:27.269 --> 00:12:29.470
to Kling's strength. It won on consistency. The

00:12:29.470 --> 00:12:31.289
characters look exactly the same in every shot.

00:12:31.490 --> 00:12:33.509
Sora 2 was called a worthy enemy, but it was

00:12:33.509 --> 00:12:35.830
hampered by some policy restrictions on video

00:12:35.830 --> 00:12:37.710
quality. It just looked a bit more compressed,

00:12:37.889 --> 00:12:39.690
a bit fuzzier. So looking at these tests, it

00:12:39.690 --> 00:12:41.850
raises a big philosophical question for me. We

00:12:41.850 --> 00:12:43.889
see Kling winning on human elements, how people

00:12:43.889 --> 00:12:46.149
act, how scenes flow. We see Sora winning on

00:12:46.149 --> 00:12:49.509
physics, how glass breaks, how light moves, which

00:12:49.509 --> 00:12:52.629
is harder for an AI to learn. I'd argue logic

00:12:52.629 --> 00:12:55.500
is the harder hurdle. Physics is math. You can

00:12:55.500 --> 00:12:57.899
simulate gravity. You can simulate light bounces.

00:12:58.000 --> 00:13:00.500
We have game engines that do that. But logic,

00:13:00.659 --> 00:13:03.080
logic is human, understanding why a cop shouldn't

00:13:03.080 --> 00:13:05.279
turn into a robber or why a glance is romantic

00:13:05.279 --> 00:13:08.659
versus angry. That requires a model of the world

00:13:08.659 --> 00:13:12.039
that goes beyond just pixels. It requires understanding

00:13:12.039 --> 00:13:15.340
intent. That's a profound thought. Physics is

00:13:15.340 --> 00:13:18.139
math. Logic is cultural. Right. We are back.

00:13:18.379 --> 00:13:20.259
We've looked at the features, the costs, and

00:13:20.259 --> 00:13:23.590
the stress tests. Now we need a verdict. I'm

00:13:23.590 --> 00:13:25.269
a listener. I have a subscription budget for

00:13:25.269 --> 00:13:26.850
one of these. Which one do I actually sign up

00:13:26.850 --> 00:13:29.350
for? It really comes down to your persona. Who

00:13:29.350 --> 00:13:31.490
are you in this ecosystem? Let's break it down.

00:13:31.730 --> 00:13:33.970
Okay. If you are a filmmaker or a marketer, someone

00:13:33.970 --> 00:13:36.230
who needs control, narrative flow, and wants

00:13:36.230 --> 00:13:38.789
to tell a story with Cutscling, 3 .0 is your

00:13:38.789 --> 00:13:41.330
tool. It allows for B -roll and character scenes

00:13:41.330 --> 00:13:43.309
that are actually usable because you can control

00:13:43.309 --> 00:13:46.250
the sequence. The multi -shot feature kills that

00:13:46.250 --> 00:13:48.610
need for external editing of short clips. You

00:13:48.610 --> 00:13:51.429
can build a scene. Okay. What about the perfectionist?

00:13:51.799 --> 00:13:54.240
The visual effects artist? That's Sora 2.

00:13:55.139 --> 00:13:57.360
If you need a specific physical interaction,

00:13:57.620 --> 00:14:00.539
like a glass shattering realistically, or a car

00:14:00.539 --> 00:14:04.139
crash that obeys the laws of physics, Sora's

00:14:04.139 --> 00:14:06.659
the choice. Assuming you have the budget or the

00:14:06.659 --> 00:14:09.330
access, of course. It's the visual effects powerhouse.

00:14:09.350 --> 00:14:11.210
It just understands the physical world better.

00:14:11.309 --> 00:14:14.450
And Veo 3.1? It took a beating in the tests.

00:14:14.649 --> 00:14:17.850
Is it just a write-off? Not necessarily. It

00:14:17.850 --> 00:14:20.870
has a niche. It's for the ecosystem user. If

00:14:20.870 --> 00:14:22.909
you're already locked into Google and Vertex

00:14:22.909 --> 00:14:25.750
AI and you need specific color grading standards

00:14:25.750 --> 00:14:28.929
like cinema-standard 24 fps, Veo fits into that

00:14:28.929 --> 00:14:30.649
pipeline. It plays nice with the other Google

00:14:30.649 --> 00:14:33.950
tools. But for pure creativity, yeah, it's lagging

00:14:33.950 --> 00:14:36.289
behind right now. So are we approaching a point

00:14:36.289 --> 00:14:38.750
where the best model doesn't exist? Are we entering

00:14:38.750 --> 00:14:40.769
a world where we use different models for different

00:14:40.769 --> 00:14:43.610
shots in the same project? 100%. I think we are

00:14:43.610 --> 00:14:45.850
already there. You might use Sora for the explosion

00:14:45.850 --> 00:14:47.710
because it handles particles better. And then

00:14:47.710 --> 00:14:49.990
you switch to Kling for the reaction shot of

00:14:49.990 --> 00:14:52.389
the hero because it handles emotion better. So

00:14:52.389 --> 00:14:54.669
the director of the future is really just a curator

00:14:54.669 --> 00:14:58.029
of models. Precisely. You're casting algorithms,

00:14:58.370 --> 00:15:01.309
not just actors. You pick the model that knows

00:15:01.309 --> 00:15:03.470
how to perform the scene you need. Okay, let's

00:15:03.470 --> 00:15:05.549
wrap this up with the big picture. We have moved

00:15:05.549 --> 00:15:07.889
from the cool clip phase to the director phase.

00:15:08.190 --> 00:15:11.149
That's the headline. Kling 3.0's multi-shot

00:15:11.149 --> 00:15:13.870
feature is the signal. It shows the industry

00:15:13.870 --> 00:15:16.929
is finally listening to creators who said, stop

00:15:16.929 --> 00:15:19.690
giving us higher resolution pandas and give us

00:15:19.690 --> 00:15:22.289
tools to actually edit. And the omni control,

00:15:22.409 --> 00:15:24.409
that ability to have recurring characters, that

00:15:24.409 --> 00:15:27.029
feels like the barrier to entry for storytelling

00:15:27.029 --> 00:15:30.149
is finally crumbling. It is. While Sora 2 still

00:15:30.149 --> 00:15:32.809
holds the crown for physical realism, Kling 3.0

00:15:32.809 --> 00:15:36.039
wins on workflow and narrative. And in the

00:15:36.039 --> 00:15:38.659
long run, for professionals, workflow always

00:15:38.659 --> 00:15:40.620
wins. Before we go, we have a little homework

00:15:40.620 --> 00:15:43.259
for you. Yes. Try that voice-to-text prompting

00:15:43.259 --> 00:15:45.399
method we mentioned. Open up Kling or whatever

00:15:45.399 --> 00:15:47.500
tool you use, close your eyes, and just act out

00:15:47.500 --> 00:15:49.899
the scene to the AI. Don't worry about keywords.

00:15:50.019 --> 00:15:52.360
Just describe the movie. See if the result feels

00:15:52.360 --> 00:15:54.580
more human. I bet it will. And I'll leave you

00:15:54.580 --> 00:15:56.700
with this final thought. If AI can now handle

00:15:56.700 --> 00:16:00.440
continuity, emotion, camera angles, if it can

00:16:00.440 --> 00:16:02.879
remember who the actor is from shot to shot,

00:16:03.500 --> 00:16:06.379
how long is it until the prompt "make me a sequel

00:16:06.379 --> 00:16:09.460
to my favorite movie" actually works? Not just

00:16:09.460 --> 00:16:11.340
a trailer, but the movie. It's closer than we

00:16:11.340 --> 00:16:13.080
think. See you in the deep end. Take care.
