WEBVTT

00:00:00.000 --> 00:00:03.040
You're staring at a flat 10-second phone video

00:00:03.040 --> 00:00:06.179
of a beach. Maybe you shot it on vacation and

00:00:06.179 --> 00:00:08.339
you've tried every single editing app available.

00:00:08.599 --> 00:00:10.679
Right. And it still looks completely amateur.

00:00:10.839 --> 00:00:14.500
Exactly. It just feels flat. But what

00:00:14.500 --> 00:00:17.820
if one simple text prompt could change all of

00:00:17.820 --> 00:00:20.660
that? Yeah. What if you could turn that exact

00:00:20.660 --> 00:00:25.300
clip into a sweeping cinematic drone shot and

00:00:25.300 --> 00:00:28.140
the whole process took under 60 seconds? Welcome

00:00:28.140 --> 00:00:30.730
to the deep dive. I have been spending a lot

00:00:30.730 --> 00:00:33.490
of time recently thinking about how video creation

00:00:33.490 --> 00:00:36.149
is evolving. It is moving incredibly fast right

00:00:36.149 --> 00:00:38.750
now. It really is. And our mission today is very

00:00:38.750 --> 00:00:41.549
clear. We are unpacking the true capabilities

00:00:41.549 --> 00:00:45.229
of Gemini Omni. Yes. Because I think there is

00:00:45.229 --> 00:00:47.070
a huge misconception out there about what this

00:00:47.070 --> 00:00:49.369
tool actually does. Oh, absolutely. There really

00:00:49.369 --> 00:00:51.289
is. Most people, you know, they just use Omni

00:00:51.289 --> 00:00:54.070
to make AI avatars. Right. They create a digital

00:00:54.070 --> 00:00:56.350
clone of a person that just speaks directly to

00:00:56.350 --> 00:00:59.549
the camera. And I mean, that is fine. Yeah. It

00:00:59.549 --> 00:01:02.450
is a neat trick. It is. But by stopping there,

00:01:02.570 --> 00:01:05.189
they are missing the entire bigger picture. They

00:01:05.189 --> 00:01:08.010
are completely missing five hidden workflows

00:01:08.010 --> 00:01:11.390
that actually revolutionize real video post-production.

00:01:11.829 --> 00:01:14.349
Right. And that is exactly what we are exploring

00:01:14.349 --> 00:01:17.409
today. We are going to look at how you can take

00:01:17.409 --> 00:01:19.870
the mundane footage already sitting on your phone

00:01:19.870 --> 00:01:23.500
and entirely manipulate its reality. Yeah. We're

00:01:23.500 --> 00:01:26.000
talking about adding impossible camera movements

00:01:26.000 --> 00:01:29.140
to static shots. Which is mind-blowing. We will

00:01:29.140 --> 00:01:31.180
look at translating your natural speech into

00:01:31.180 --> 00:01:33.719
entirely new languages without dubbing, generating

00:01:33.719 --> 00:01:36.459
full explainer videos from nothing but a single

00:01:36.459 --> 00:01:38.939
thought. Right. And finally locking three-dimensional

00:01:38.939 --> 00:01:42.900
text right into a moving physical scene. It genuinely

00:01:42.900 --> 00:01:45.340
is a complete paradigm shift for creators. I

00:01:45.340 --> 00:01:47.659
mean, it takes the supercomputer in your pocket

00:01:47.659 --> 00:01:50.799
and turns it into a high-end iterative post-

00:01:50.799 --> 00:01:53.810
production studio. Before we can perform any

00:01:53.810 --> 00:01:56.349
of this video magic, we need to talk about this

00:01:56.349 --> 00:01:58.709
setup. Yeah, the workspace matters. It really

00:01:58.709 --> 00:02:01.969
does. We have to understand where we are actually

00:02:01.969 --> 00:02:04.430
doing the work. Yeah. If you just open the standard

00:02:04.430 --> 00:02:06.469
mobile app, you're going to hit a wall. That

00:02:06.469 --> 00:02:09.330
is such a crucial distinction to make right away.

00:02:10.289 --> 00:02:12.590
Gemini Omni essentially lives inside two very

00:02:12.590 --> 00:02:15.370
separate environments. The Gemini app is built

00:02:15.370 --> 00:02:18.370
for quick, single-step edits. You type a prompt

00:02:18.370 --> 00:02:21.509
and you get a fast result. But the catch is that

00:02:21.509 --> 00:02:23.849
it wipes your slate completely clean every single

00:02:23.849 --> 00:02:26.330
time. It is essentially just a temporary scratch

00:02:26.330 --> 00:02:29.069
pad. It is ephemeral. Exactly. It does not save

00:02:29.069 --> 00:02:32.569
your history, which means you cannot build a

00:02:32.569 --> 00:02:35.069
sequence of complex edits over time. No, you

00:02:35.069 --> 00:02:37.830
cannot. And that brings us to Google Flow. Flow

00:02:37.830 --> 00:02:41.110
is the workspace where the serious work actually

00:02:41.110 --> 00:02:43.590
happens. Yes. Google Flow organizes all your

00:02:43.590 --> 00:02:46.030
video generations into specific projects. It

00:02:46.030 --> 00:02:48.750
actually saves your history. So you can select

00:02:48.750 --> 00:02:51.509
a generated output and use it as the new base

00:02:51.509 --> 00:02:54.669
for your next prompt. Every single edit seamlessly

00:02:54.669 --> 00:02:57.090
layers on top of the last one. It is like stacking

00:02:57.090 --> 00:03:00.280
Lego blocks of data. Oh, I love that. You were

00:03:00.280 --> 00:03:03.199
building your final video piece by piece, rather

00:03:03.199 --> 00:03:05.460
than hoping the machine spits out a masterpiece

00:03:05.460 --> 00:03:08.639
on the very first try. Right. Let us look at

00:03:08.639 --> 00:03:11.400
a specific example. Say you upload that 10-second

00:03:11.400 --> 00:03:15.360
beach clip into Google Flow. You type, "add a

00:03:15.360 --> 00:03:17.979
large crowd on the beach behind me." The system

00:03:17.979 --> 00:03:20.460
processes that instruction beautifully. It really

00:03:20.460 --> 00:03:22.759
does. It keeps your original shot completely

00:03:22.759 --> 00:03:27.159
intact. Your face, your posture, your position

00:03:27.159 --> 00:03:29.800
in the frame. Those do not change at all. But

00:03:29.800 --> 00:03:33.840
the AI selectively alters the background environment

00:03:33.840 --> 00:03:36.580
around you. It inserts people naturally into

00:03:36.580 --> 00:03:38.560
the depth of the scene. I have to admit, I still

00:03:38.560 --> 00:03:40.699
wrestle with prompt drift myself. Oh, really?

00:03:40.860 --> 00:03:43.759
Yeah. I am constantly fighting the urge to cram

00:03:43.759 --> 00:03:45.879
too many instructions into the very first prompt.

00:03:46.199 --> 00:03:48.039
Yes. You just want to say, "add a crowd and make

00:03:48.039 --> 00:03:50.229
it sunset and make me wear sunglasses," all at

00:03:50.229 --> 00:03:52.629
once. Yeah, and that is honestly the biggest

00:03:52.629 --> 00:03:55.490
mistake new users make. You must keep the first

00:03:55.490 --> 00:03:58.469
prompt focused on one single isolated change.

00:03:58.930 --> 00:04:01.430
Asking for multiple massive edits at once confuses

00:04:01.430 --> 00:04:03.830
the underlying model. Right. It makes it incredibly

00:04:03.830 --> 00:04:06.129
hard to identify what went wrong if the output

00:04:06.129 --> 00:04:08.490
looks weird. Why not just use the Gemini app

00:04:08.490 --> 00:04:10.539
since it is a faster entry point? Because if

00:04:10.539 --> 00:04:13.219
you use the app, you lose that layered progress

00:04:13.219 --> 00:04:15.759
entirely. You cannot iterate on a successful

00:04:15.759 --> 00:04:17.819
first step. You always have to start from zero

00:04:17.819 --> 00:04:21.079
again. Flow lets you keep your wins. Right. Flow

00:04:21.079 --> 00:04:25.040
saves progress. The app wipes your slate clean.

00:04:25.899 --> 00:04:28.600
So once you successfully get that crowd

00:04:28.600 --> 00:04:30.379
generated in the background, you move forward.

00:04:30.500 --> 00:04:33.720
Yeah. You select that newly generated video inside

00:04:33.720 --> 00:04:36.540
Flow as your new source file. Then you layer

00:04:36.540 --> 00:04:38.680
in the next specific change. Maybe you ask it

00:04:38.680 --> 00:04:41.379
to change the time of day to golden hour. It

00:04:41.379 --> 00:04:43.339
edits the already generated version. Exactly.

00:04:43.500 --> 00:04:45.519
It does not start over from your original phone

00:04:45.519 --> 00:04:48.199
footage. This is exactly what makes iterations

00:04:48.199 --> 00:04:53.069
so powerful here. Yes. But of course, it is not

00:04:53.069 --> 00:04:55.430
always perfect. Oh no, definitely not. Sometimes

00:04:55.430 --> 00:04:57.329
the system just gets it fundamentally wrong.

00:04:57.689 --> 00:05:00.189
It might alter an object at the wrong moment

00:05:00.189 --> 00:05:02.769
or glitch out a part of your face that you wanted

00:05:02.769 --> 00:05:05.430
kept completely intact. When that happens, do

00:05:05.430 --> 00:05:08.230
not try to fix the broken video. Right. I see

00:05:08.230 --> 00:05:11.550
so many people prompting the AI to fix the weird

00:05:11.550 --> 00:05:14.310
hand. Iterating on a bad generation usually just

00:05:14.310 --> 00:05:16.709
compounds the errors. Yeah, it gets messy fast.

00:05:16.790 --> 00:05:18.730
You just have to go back to the previous clean

00:05:18.730 --> 00:05:21.430
clip, write a clearer prompt, and try again.

00:05:21.519 --> 00:05:23.680
A clearer prompt on a clean source always wins.

00:05:24.720 --> 00:05:27.800
So Flow lets you fix the environment without

00:05:27.800 --> 00:05:29.680
breaking the subject. Right. Right. But what

00:05:29.680 --> 00:05:32.000
if the environment is actually fine and the problem

00:05:32.000 --> 00:05:34.899
is how you shot it? Oh, this is good. You are

00:05:34.899 --> 00:05:39.560
stuck with a static, boring, eye-level tripod shot.

00:05:39.680 --> 00:05:42.699
Yeah. Let us pivot and talk about altering the

00:05:42.699 --> 00:05:45.699
actual camera itself. Yes. Because here is where

00:05:45.699 --> 00:05:48.199
it gets really interesting. Gemini Omni has this

00:05:48.199 --> 00:05:51.980
amazing ability to completely reinterpret how

00:05:51.980 --> 00:05:55.279
a shot was filmed. It changes the physical reality

00:05:55.279 --> 00:05:57.920
of the camera after the fact. Just think about

00:05:57.920 --> 00:05:59.879
the physics of that for a second. Yeah. You have

00:05:59.879 --> 00:06:03.519
a flat static clip recorded from eye level. You

00:06:03.519 --> 00:06:06.000
prompt Omni to zoom out and turn it into an aerial

00:06:06.000 --> 00:06:08.850
drone shot. Right. And to do that, it has to

00:06:08.850 --> 00:06:10.889
essentially rebuild the entire physical scene

00:06:10.889 --> 00:06:12.870
from a perspective that never existed in the

00:06:12.870 --> 00:06:15.589
real world. The first couple of seconds of the

00:06:15.589 --> 00:06:19.050
generated clip might look a little unstable as

00:06:19.050 --> 00:06:20.850
the system tries to establish that brand new

00:06:20.850 --> 00:06:24.350
perspective. But then... the movement smooths

00:06:24.350 --> 00:06:27.569
out into this beautiful, sweeping cinematic shot.

00:06:27.850 --> 00:06:29.870
It opens up massive possibilities. I mean, if

00:06:29.870 --> 00:06:32.350
you are an indie filmmaker or a solo creator,

00:06:32.930 --> 00:06:35.149
you do not need an expensive drone or a heavy

00:06:35.149 --> 00:06:37.850
gimbal system anymore. Right. But the system

00:06:37.850 --> 00:06:40.209
can also be a little too eager to help sometimes.

00:06:40.370 --> 00:06:42.910
It might hallucinate contextual props into your

00:06:42.910 --> 00:06:45.350
scene to justify the new camera angle. Yeah,

00:06:45.430 --> 00:06:47.889
let us clarify that term quickly. Hallucinate.

00:06:48.110 --> 00:06:51.670
in AI terms, just means creating fake details

00:06:51.670 --> 00:06:54.050
to make a scene logically consistent. Exactly.

00:06:54.170 --> 00:06:56.810
For example, if you ask for a sweeping drone

00:06:56.810 --> 00:07:00.050
shot of yourself on the beach, the AI might actually

00:07:00.050 --> 00:07:02.329
generate a plastic drone controller in your empty

00:07:02.329 --> 00:07:04.649
hands. Because it thinks, well, if there is a

00:07:04.649 --> 00:07:06.149
drone flying around them, they must be the one

00:07:06.149 --> 00:07:08.170
flying it. Right. It is a fascinating glimpse

00:07:08.170 --> 00:07:10.790
into how the machine understands human context,

00:07:11.129 --> 00:07:13.410
not just pixels. And you can control that virtual

00:07:13.410 --> 00:07:16.009
camera with extreme precision, right? You do

00:07:16.009 --> 00:07:18.560
not just have to type "pan left" and hope for

00:07:18.560 --> 00:07:20.639
the best. Exactly. You can use what is called

00:07:20.639 --> 00:07:23.480
the arrow technique to dictate exact flight paths.

00:07:23.879 --> 00:07:26.240
You take a still frame of your video, just a

00:07:26.240 --> 00:07:29.509
simple screenshot, and you literally draw arrows

00:07:29.509 --> 00:07:32.310
on that image to show the exact curved path you

00:07:32.310 --> 00:07:35.069
want the camera to take. Oh, wow. Then you upload

00:07:35.069 --> 00:07:37.430
that marked-up image alongside your original

00:07:37.430 --> 00:07:41.209
clip. Whoa. Imagine the AI understanding a flat

00:07:41.209 --> 00:07:45.350
2D image so well it can reconstruct a full 3D

00:07:45.350 --> 00:07:47.850
drone flight path through it. It is wild. It

00:07:47.850 --> 00:07:50.750
builds a virtual 3D dome over your scene, maps

00:07:50.750 --> 00:07:53.790
the flat image onto it, and flies a digital camera

00:07:53.790 --> 00:07:56.730
along your drawn line. How does the system know

00:07:56.730 --> 00:07:58.790
exactly where you want this virtual drone

00:07:58.790 --> 00:08:01.529
to fly? You prompt it to trace the path shown

00:08:01.529 --> 00:08:04.050
in the reference image. You tell it to maintain

00:08:04.050 --> 00:08:06.829
a forward-facing perspective, and you explicitly

00:08:06.829 --> 00:08:09.529
tell it to remove the drawn arrows from the final

00:08:09.529 --> 00:08:11.509
output. So it literally just follows the arrows

00:08:11.509 --> 00:08:13.930
you drew on the image. It gives the system every

00:08:13.930 --> 00:08:16.790
single geometric constraint it needs to succeed.

00:08:16.870 --> 00:08:18.930
Right. The more constrained your instruction,

00:08:19.189 --> 00:08:21.189
the better and more consistent the final result

00:08:21.189 --> 00:08:23.589
will be across multiple generations. OK, so we

00:08:23.589 --> 00:08:25.329
have manipulated the background elements. Mm-hmm.

00:08:25.329 --> 00:08:28.139
Entirely rebuilt the camera movements.

00:08:29.019 --> 00:08:33.460
Now, let us talk about manipulating the actual

00:08:33.460 --> 00:08:36.740
subject speaking in the video. Reaching a global

00:08:36.740 --> 00:08:39.740
audience. Yes. This is a huge pain point for

00:08:39.740 --> 00:08:42.820
creators right now. Rerecording the same video

00:08:42.820 --> 00:08:46.080
in multiple languages takes massive effort. Oh,

00:08:46.159 --> 00:08:48.220
absolutely. Hiring voice actors, dubbing the

00:08:48.220 --> 00:08:50.679
audio. I mean, it is easily the most time-consuming

00:08:50.679 --> 00:08:54.019
part of a global content workflow. Gemini Omni

00:08:54.019 --> 00:08:56.720
handles this problem through its dedicated avatar

00:08:56.720 --> 00:08:59.399
feature. You can deliver your exact message in

00:08:59.399 --> 00:09:01.580
a completely different language, and you never

00:09:01.580 --> 00:09:03.519
have to record a second take yourself. What you

00:09:03.519 --> 00:09:05.779
do is set up a hyper-realistic version of yourself

00:09:05.779 --> 00:09:08.440
in the app first. You give it some baseline footage

00:09:08.440 --> 00:09:11.779
so it learns your face, then you bring that custom

00:09:11.779 --> 00:09:14.720
avatar right into Google Flow, and you simply

00:09:14.720 --> 00:09:17.419
type out the message in your new target language.

00:09:17.820 --> 00:09:20.679
The system has been thoroughly tested on several

00:09:20.679 --> 00:09:23.940
common languages. French, Spanish, Portuguese,

00:09:24.059 --> 00:09:27.500
and German all produce incredibly reliable, natural-

00:09:27.500 --> 00:09:29.679
sounding results. And researchers have even tested

00:09:29.679 --> 00:09:32.240
it on much less conventional options just to

00:09:32.240 --> 00:09:34.190
push the boundaries of the model. People have

00:09:34.190 --> 00:09:36.710
generated outputs in Latin and even American

00:09:36.710 --> 00:09:38.730
Sign Language. Though I imagine those outputs

00:09:38.730 --> 00:09:41.169
are much harder to verify without a native speaker.

00:09:41.309 --> 00:09:43.649
Definitely. Still, the underlying capability

00:09:43.649 --> 00:09:46.049
is simply staggering to think about. You can

00:09:46.049 --> 00:09:48.950
run one single marketing message through five

00:09:48.950 --> 00:09:51.529
different languages, back to back. And because

00:09:51.529 --> 00:09:54.409
you are in Flow, each language runs as a completely

00:09:54.409 --> 00:09:56.470
separate generation. You never have to touch

00:09:56.470 --> 00:09:59.149
your original audio or video recording again.

00:09:59.330 --> 00:10:01.830
Does it just slap a dubbed audio track over the

00:10:01.830 --> 00:10:04.950
original video? Not at all. It actively rebuilds

00:10:04.950 --> 00:10:08.029
the visual data of your lower face. It perfectly

00:10:08.029 --> 00:10:10.669
matches the new syllables to your mouth movements

00:10:10.669 --> 00:10:14.110
using pixel-level reconstruction. Ah, it actually

00:10:14.110 --> 00:10:16.370
alters your facial expressions and natural lip

00:10:16.370 --> 00:10:19.009
sync. It genuinely creates

00:10:19.009 --> 00:10:21.350
a seamless illusion for the international viewer.

00:10:21.909 --> 00:10:25.169
You avoid that uncanny valley effect of old dubbed

00:10:25.169 --> 00:10:27.210
movies. Right, where the mouth movements are

00:10:27.210 --> 00:10:29.970
completely wrong. Exactly. The translated message

00:10:29.970 --> 00:10:34.190
feels entirely authentic and na... So avatars

00:10:34.190 --> 00:10:35.850
are great for translating what you've already

00:10:35.850 --> 00:10:38.669
said. But what if you do not even have a video

00:10:38.669 --> 00:10:42.750
yet? What if you need the AI to build an educational

00:10:42.750 --> 00:10:46.409
video entirely from scratch? Most explainer workflows

00:10:46.409 --> 00:10:49.309
require a massive amount of heavy lifting up

00:10:49.309 --> 00:10:51.750
front. Yes, they do. You need a written script,

00:10:52.009 --> 00:10:53.950
you need to record a clean voiceover, and you

00:10:53.950 --> 00:10:57.230
need to source a huge stack of relevant B-roll

00:10:57.230 --> 00:10:59.950
footage. It is a lot of work. It is. But Gemini

00:10:59.950 --> 00:11:03.480
Omni lets you skip all of that manual preparation

00:11:03.480 --> 00:11:05.799
entirely. You do not feed it a script at all.

00:11:05.860 --> 00:11:08.700
You just give it a single focused topic to explain.

00:11:08.960 --> 00:11:11.179
Let us say you ask it to explain how rockets

00:11:11.179 --> 00:11:13.879
work. All right. You tell it to include an avatar

00:11:13.879 --> 00:11:16.299
presenter in the corner of the screen. That is

00:11:16.299 --> 00:11:18.879
literally all the instruction the system needs

00:11:18.879 --> 00:11:21.299
to begin generating. It builds a beautifully

00:11:21.299 --> 00:11:23.980
structured video explanation completely on its

00:11:23.980 --> 00:11:26.899
own. Yeah. It draws from its deep training data

00:11:27.289 --> 00:11:30.470
on scientific subjects. It automatically creates

00:11:30.470 --> 00:11:33.029
scenes showing the action and reaction of the

00:11:33.029 --> 00:11:36.070
launch sequence. Right. It generates clear, accurate

00:11:36.070 --> 00:11:38.210
animations of fuel combustion and high pressure

00:11:38.210 --> 00:11:41.429
gas. It shows how the resulting thrust pushes

00:11:41.429 --> 00:11:45.470
the heavy rocket upward. The final output genuinely

00:11:45.470 --> 00:11:48.210
feels like a finished, polished piece of media.

00:11:48.490 --> 00:11:51.509
It does not feel like a disjointed draft. And

00:11:51.509 --> 00:11:54.629
it does all this from one incredibly short prompt.

00:11:54.870 --> 00:11:57.350
A great habit here is to keep that very first

00:11:57.350 --> 00:12:00.570
prompt broad. Right. Let the system build the

00:12:00.570 --> 00:12:03.250
initial structural foundation for you. Then,

00:12:03.389 --> 00:12:05.490
because you are in Flow, you review the output

00:12:05.490 --> 00:12:08.549
and refine it. You only add more depth to specific

00:12:08.549 --> 00:12:11.009
scenes where it is actually needed. Exactly.

00:12:11.190 --> 00:12:13.190
Do I need to feed it a detailed script first

00:12:13.190 --> 00:12:16.000
for the explainer? No. You just provide the core

00:12:16.000 --> 00:12:18.840
topic and your preferred visual style. The system

00:12:18.840 --> 00:12:21.059
automatically handles the narrative pacing and

00:12:21.059 --> 00:12:23.539
the scene transitions for you. No. It structures

00:12:23.539 --> 00:12:25.879
the whole visual breakdown from one short prompt.

00:12:26.220 --> 00:12:29.009
Now, speaking of building scenes, there

00:12:29.009 --> 00:12:31.389
is an incredible bonus hack we need to discuss

00:12:31.389 --> 00:12:33.750
here. Oh, yes. It involves altering location

00:12:33.750 --> 00:12:36.149
data and moving footage. Let us set the scene

00:12:36.149 --> 00:12:38.970
for this. OK. Say you have video filmed from

00:12:38.970 --> 00:12:41.710
inside a moving car. You were just driving down

00:12:41.710 --> 00:12:44.850
a very boring suburban street. But you want to

00:12:44.850 --> 00:12:47.169
completely change the location outside your window

00:12:47.169 --> 00:12:50.029
to make it look like Tokyo at night. In traditional

00:12:50.029 --> 00:12:52.909
editing, this requires incredibly tedious rotoscoping.

00:12:53.029 --> 00:12:55.629
You would have to manually mask out the windows

00:12:55.629 --> 00:12:58.789
frame by frame. But in Omni, you just take a

00:12:58.789 --> 00:13:01.470
screenshot from Google Maps of the new city.

00:13:01.929 --> 00:13:04.990
You upload that map image right alongside your

00:13:04.990 --> 00:13:07.470
original driving clip. You prompt the system

00:13:07.470 --> 00:13:09.909
to change the environment outside the windshield

00:13:09.909 --> 00:13:12.570
using the map as a reference. Yeah. But you tell

00:13:12.570 --> 00:13:15.110
it to keep the car interior exactly the same.

00:13:15.169 --> 00:13:17.850
The model is doing something called depth segmentation.

00:13:18.009 --> 00:13:19.710
Which means separating the foreground from the

00:13:19.710 --> 00:13:22.759
background in the image. Exactly. It is not just

00:13:22.759 --> 00:13:26.159
looking at a flat image. It literally draws an

00:13:26.159 --> 00:13:29.700
invisible 3D boundary between the foreground,

00:13:30.100 --> 00:13:32.940
your steering wheel and dashboard, and the background

00:13:32.940 --> 00:13:36.120
outside the glass. So it replaces the outside

00:13:36.120 --> 00:13:38.500
layer with the new city while protecting the

00:13:38.500 --> 00:13:41.019
inside layer perfectly. Yes. It even keeps the

00:13:41.019 --> 00:13:43.440
original window stickers and dashboard reflections

00:13:43.440 --> 00:13:46.919
completely intact. It is a stunning display of

00:13:46.919 --> 00:13:49.700
spatial awareness by the model. It treats the

00:13:49.700 --> 00:13:52.440
car window like a digital green screen, projecting

00:13:52.440 --> 00:13:54.799
the new data exclusively into that back layer.

00:13:55.299 --> 00:13:58.259
OK. Let's take a quick moment here.

00:13:58.399 --> 00:14:02.139
We are back.

00:14:02.240 --> 00:14:04.539
We have completely altered backgrounds, cameras,

00:14:04.740 --> 00:14:06.860
languages, and locations today. We have covered

00:14:06.860 --> 00:14:09.580
a lot. We have. But to add the final layer of

00:14:09.580 --> 00:14:12.620
polish to an explainer or a product demo, you

00:14:12.620 --> 00:14:16.399
usually need text on the screen. But we are not

00:14:16.399 --> 00:14:19.299
talking about flat, boring text overlays here.

00:14:19.600 --> 00:14:22.799
No. Standard video editors just slap text on

00:14:22.799 --> 00:14:25.340
top of the footage. It lacks parallax. Meaning

00:14:25.340 --> 00:14:27.120
objects moving at different speeds depending

00:14:27.120 --> 00:14:29.879
on their distance? Right. It does not move naturally

00:14:29.879 --> 00:14:32.440
with the underlying physical scene, and it certainly

00:14:32.440 --> 00:14:34.899
does not attach itself to real objects in the

00:14:34.899 --> 00:14:37.320
frame. It constantly breaks the illusion of reality

00:14:37.320 --> 00:14:39.720
for the viewer. Gemini Omni changes this entirely.

00:14:39.879 --> 00:14:42.559
It renders text directly into the three-dimensional

00:14:42.559 --> 00:14:45.139
space of your video. It is basically treating

00:14:45.139 --> 00:14:47.679
the physical object like it has digital sticky

00:14:47.679 --> 00:14:49.740
notes glued to it. When your physical camera

00:14:49.740 --> 00:14:52.480
moves, the text stays locked in place. Now let's

00:14:52.480 --> 00:14:55.100
look at a close-up video of a blooming orchid.

00:14:55.759 --> 00:14:58.600
You prompt the system to label the different

00:14:58.600 --> 00:15:01.440
parts of the flower. You ask it to use an AI-

00:15:01.440 --> 00:15:04.620
style text aesthetic for the labels. You instruct

00:15:04.620 --> 00:15:07.299
it to keep each label securely attached to its

00:15:07.299 --> 00:15:09.899
corresponding petal. The AI actually understands

00:15:09.899 --> 00:15:12.059
the spherical geometry of the petal, not just

00:15:12.059 --> 00:15:15.279
the pixels on the screen. It places a distinct

00:15:15.279 --> 00:15:18.340
text label onto each individual element and locks

00:15:18.340 --> 00:15:21.220
them firmly into the 3D space. As your camera

00:15:21.220 --> 00:15:25.039
slowly pans around the flower, the text tracks

00:15:25.039 --> 00:15:28.000
perfectly. Yeah. The labels move with the object

00:15:28.000 --> 00:15:30.279
naturally, rather than drifting loosely around

00:15:30.279 --> 00:15:32.779
the frame. This specific feature is remarkably

00:15:32.779 --> 00:15:35.340
effective for educational content or dynamic

00:15:35.340 --> 00:15:37.759
product demonstrations online. Right. You can

00:15:37.759 --> 00:15:40.159
call out specific features directly on a physical

00:15:40.159 --> 00:15:42.139
product while you just handle the item normally

00:15:42.139 --> 00:15:44.820
on camera. It adds an interactive, high-production

00:15:44.820 --> 00:15:47.600
feel to simple phone footage, and it requires

00:15:47.600 --> 00:15:50.659
absolutely zero manual post-production tracking

00:15:50.659 --> 00:15:53.440
work from you. But, you know, there are rules

00:15:53.440 --> 00:15:56.210
to this. What happens to those 3D text labels

00:15:56.210 --> 00:15:58.909
if the camera shakes? The system needs clear

00:15:58.909 --> 00:16:01.190
visual anchors to track the physical objects.

00:16:01.590 --> 00:16:04.289
If the clip is shaky or poorly lit, it loses

00:16:04.289 --> 00:16:06.590
those anchors quickly, and the text labels will

00:16:06.590 --> 00:16:08.450
immediately start sliding off their targets.

00:16:08.690 --> 00:16:11.750
Got it. Shaky clips cause the 3D labels to drift

00:16:11.750 --> 00:16:15.850
and misalign. Exactly. You must film the

00:16:15.850 --> 00:16:18.330
object steadily and ensure good lighting. That

00:16:18.330 --> 00:16:20.970
is a core best practice. There are a few other

00:16:20.970 --> 00:16:23.669
critical best practices we must cover if you

00:16:23.669 --> 00:16:26.610
want Omni to work consistently for you. First,

00:16:27.029 --> 00:16:29.230
keep your source video clip strictly under 10

00:16:29.230 --> 00:16:31.990
seconds long. A 10-second clip with one clear

00:16:31.990 --> 00:16:35.309
subject is the ideal canvas. The system thrives

00:16:35.309 --> 00:16:37.029
when the visual information it has to process

00:16:37.029 --> 00:16:39.690
is limited and highly focused. If there's too

00:16:39.690 --> 00:16:42.269
much visual competition in the frame, it struggles

00:16:42.269 --> 00:16:44.870
to isolate the correct element. Yeah. Second,

00:16:45.149 --> 00:16:47.490
you must use specific time markers in your written

00:16:47.490 --> 00:16:49.429
prompts. Right. If a transformation needs to

00:16:49.429 --> 00:16:52.049
happen mid-clip, tell it exactly when. Say,

00:16:52.289 --> 00:16:53.909
"change the background at the three-second mark."

00:16:54.250 --> 00:16:56.409
Without that clear reference, the system just

00:16:56.409 --> 00:16:58.909
guesses the timing. And its guess is usually

00:16:58.909 --> 00:17:01.690
completely wrong for your specific edit. It usually

00:17:01.690 --> 00:17:04.369
is. A time marker removes the guesswork entirely

00:17:04.369 --> 00:17:06.849
and gives the system a fixed point to work toward.

00:17:07.329 --> 00:17:09.809
And finally, we have to repeat the golden rule

00:17:09.809 --> 00:17:12.509
of this entire workflow. Yes. If a generation

00:17:12.509 --> 00:17:15.349
is fundamentally broken, do not try to fix it.

00:17:15.609 --> 00:17:17.630
Go straight back to the original clip and restart

00:17:17.630 --> 00:17:19.829
your process. Rewrite your prompt to be much

00:17:19.829 --> 00:17:22.450
more specific. Yeah. Iterating on a broken video

00:17:22.450 --> 00:17:25.890
just compounds the initial errors. So if we connect

00:17:25.890 --> 00:17:28.009
this to the bigger picture, what does this all

00:17:28.009 --> 00:17:31.910
mean for you as a creator? Let us synthesize

00:17:31.910 --> 00:17:34.910
the entire deep dive right here. OK. The critical

00:17:34.910 --> 00:17:37.930
insight today is that the gap between a usable

00:17:37.930 --> 00:17:40.710
video and a great one is not about writing a

00:17:40.710 --> 00:17:43.509
longer, more complex prompt. The secret is always

00:17:43.509 --> 00:17:46.089
just one more round of iteration in Flow. Right.

00:17:46.289 --> 00:17:48.690
And that is exactly why you cannot just rely

00:17:48.690 --> 00:17:51.970
on the mobile app. You have to embrace the layered,

00:17:52.369 --> 00:17:54.990
non-linear approach of the workspace. Omni

00:17:54.990 --> 00:17:56.809
is so much more than a simple avatar generator.

00:17:57.130 --> 00:17:59.630
It is a complete post-production studio sitting

00:17:59.630 --> 00:18:02.589
in your pocket. It unlocks the hidden potential

00:18:02.589 --> 00:18:05.250
of the mundane footage you already have. You

00:18:05.250 --> 00:18:07.170
just have to change how you approach the editing

00:18:07.170 --> 00:18:10.170
process. Start small, pick just one specific

00:18:10.170 --> 00:18:12.789
use case we discussed today, run it through Google

00:18:12.789 --> 00:18:15.509
Flow, stack a few edits, and carefully observe

00:18:15.509 --> 00:18:18.569
what comes back. The learning curve is surprisingly

00:18:18.569 --> 00:18:21.490
short. The results will honestly tell you more

00:18:21.490 --> 00:18:24.950
than any tutorial ever could. But before we go,

00:18:25.309 --> 00:18:27.430
exploring all of this does leave us with a rather

00:18:27.430 --> 00:18:30.390
profound final thought to mull over. It really

00:18:30.390 --> 00:18:33.230
forces us to question the nature of digital video

00:18:33.230 --> 00:18:36.089
itself moving forward. We started today by looking

00:18:36.089 --> 00:18:39.250
at a flat amateur phone video of a beach. A clip

00:18:39.250 --> 00:18:42.509
that felt undeniably real, just poorly shot.

00:18:42.670 --> 00:18:45.509
Right. But if AI can now retroactively add a

00:18:45.509 --> 00:18:48.210
flawless drone perspective, if it can completely

00:18:48.210 --> 00:18:50.069
alter the weather outside a moving car window

00:18:50.069 --> 00:18:52.710
using nothing but a map screenshot, how long

00:18:52.710 --> 00:18:55.029
until we stop trusting the reality of casual,

00:18:55.269 --> 00:18:58.150
everyday phone footage altogether? It is a fascinating

00:18:58.150 --> 00:19:00.630
and slightly terrifying question for the future

00:19:00.630 --> 00:19:03.589
of digital media. And terrifying is exactly the

00:19:03.589 --> 00:19:05.269
word for it. Thank you for joining us on this

00:19:05.269 --> 00:19:08.710
deep dive. We will see you next time.