WEBVTT

00:00:00.000 --> 00:00:01.740
So imagine this. You're sitting in the back of

00:00:01.740 --> 00:00:05.559
a robotaxi. It's a Tuesday, maybe 2 p.m. You're

00:00:05.559 --> 00:00:07.900
just, you know, answering emails, watching the

00:00:07.900 --> 00:00:09.960
suburbs roll by. It's totally mundane. Right.

00:00:10.060 --> 00:00:12.960
And then suddenly the sky turns this bruised

00:00:12.960 --> 00:00:17.059
purple. A tornado touches down like three blocks

00:00:17.059 --> 00:00:19.719
ahead. Oh, wow. Or maybe you turn a corner in

00:00:19.719 --> 00:00:22.160
Phoenix and there's a Texas Longhorn steer just

00:00:22.160 --> 00:00:26.019
standing dead center in the lane. Or the neighborhood

00:00:26.019 --> 00:00:29.859
is quite literally on fire. That is a terrifying

00:00:29.859 --> 00:00:32.439
Tuesday. It is. But here's the question that

00:00:32.439 --> 00:00:35.039
I think keeps engineers up at night. How do you

00:00:35.039 --> 00:00:37.299
train a computer for that? Yeah. You can't exactly

00:00:37.299 --> 00:00:39.640
drive a million miles just waiting for a steer

00:00:39.640 --> 00:00:41.539
to wander onto the highway. You don't wait for

00:00:41.539 --> 00:00:43.840
the disaster. Hallucinate it. Welcome to the

00:00:43.840 --> 00:00:46.520
era where AI dreams up nightmares just to keep

00:00:46.520 --> 00:00:49.920
us safe. It's poetic, sure. But, you know. Practically,

00:00:49.920 --> 00:00:51.780
it's just data efficiency. You can't wait 100

00:00:51.780 --> 00:00:54.280
years for a steer to cross the road. You have

00:00:54.280 --> 00:00:56.899
to force the error. Force the error. I like it.

00:00:57.179 --> 00:00:59.780
Welcome to the Deep Dive. It is Monday, February

00:00:59.780 --> 00:01:03.560
9th, 2026. We're sitting right at this intersection

00:01:03.560 --> 00:01:06.299
of simulation and physical reality today. I've

00:01:06.299 --> 00:01:08.420
got a stack of notes here that paint a pretty

00:01:08.420 --> 00:01:11.379
wild picture of where we are. And looking at

00:01:11.379 --> 00:01:13.379
the sources we have today, it feels like we've

00:01:13.379 --> 00:01:15.519
hit a tipping point. We're seeing what people

00:01:15.519 --> 00:01:19.140
are calling the super cycle of 2026. Yeah. Really

00:01:19.140 --> 00:01:21.359
taking shape. And numbers are staggering. We'll

00:01:21.359 --> 00:01:23.900
get to that 800 million users, which is just

00:01:23.900 --> 00:01:26.920
it's mind blowing. But I want to stick with that

00:01:26.920 --> 00:01:29.579
image of the Robotaxi and the Texas Longhorn

00:01:29.579 --> 00:01:31.879
for a second, because this comes directly from

00:01:31.879 --> 00:01:34.689
Waymo. This is the new Waymo world model. And

00:01:34.689 --> 00:01:36.750
the engine underneath it is what really matters.

00:01:36.930 --> 00:01:39.609
They're using Google's Genie 3. Genie 3. Okay,

00:01:39.650 --> 00:01:41.329
let's unpack this. Because usually when we talk

00:01:41.329 --> 00:01:43.750
about simulations, my mind goes to, like, video

00:01:43.750 --> 00:01:46.790
games. Grand Theft Auto, but for robots. Is that

00:01:46.790 --> 00:01:48.730
what this is? No, and that's a crucial distinction.

00:01:48.950 --> 00:01:51.629
In a video game engine, like Unreal or Unity,

00:01:51.870 --> 00:01:54.650
a human programmer has defined all the physics.

00:01:54.849 --> 00:01:57.090
Right. They wrote the code that says, if car

00:01:57.090 --> 00:02:00.489
hits wall, then crumple metal. It's rigid. It's

00:02:00.489 --> 00:02:03.129
just a render. And Genie 3. Genie 3 is generative.

00:02:03.599 --> 00:02:05.620
It's not rendering the world based on a set of

00:02:05.620 --> 00:02:08.620
rules. It's dreaming it up based on memory. It

00:02:08.620 --> 00:02:10.740
has watched millions of hours of driving footage.

00:02:11.060 --> 00:02:14.259
So when it creates a scenario, it's predicting

00:02:14.259 --> 00:02:16.960
the next pixel, the same way ChatGPT predicts

00:02:16.960 --> 00:02:19.319
the next word. So it's hallucinating the physics.
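[Editor's note: the next-word/next-pixel parallel described here is the same autoregressive loop. A minimal sketch, assuming a hypothetical `model` callable that returns a probability distribution over the next token, whether that token is a word for a chatbot or a pixel patch for a world model:]

```python
import random

def generate(model, prompt_tokens, n_steps):
    """Autoregressive generation: text models and world models share this loop.
    `model` is a hypothetical callable returning {token: probability}
    for the next token, conditioned on everything generated so far."""
    sequence = list(prompt_tokens)
    for _ in range(n_steps):
        dist = model(sequence)                        # P(next | all previous)
        tokens, probs = zip(*dist.items())
        sequence.append(random.choices(tokens, weights=probs)[0])
    return sequence

# Toy "model": a uniform choice over two continuations.
toy = lambda seq: {"a": 0.5, "b": 0.5}
out = generate(toy, ["start"], 3)
```

[Whether the tokens are words or sensor frames changes nothing about the loop; only the vocabulary differs.]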

00:02:19.719 --> 00:02:22.639
In a way, yes. But here's the kicker. It's not

00:02:22.639 --> 00:02:25.120
just generating video. It's generating the sensor

00:02:25.120 --> 00:02:27.680
data. What do you mean? It simulates the LiDAR

00:02:27.680 --> 00:02:29.659
returns, the radar waves, the camera inputs,

00:02:29.879 --> 00:02:33.400
all of it. Wait, so... Does the car's computer

00:02:33.400 --> 00:02:36.599
even know it's in a simulation? It has no idea.

00:02:36.879 --> 00:02:39.060
To the perception stack, the brain of the car,

00:02:39.159 --> 00:02:42.139
a simulated photon and a real photon are mathematically

00:02:42.139 --> 00:02:45.379
identical. That is slightly unsettling. It solves

00:02:45.379 --> 00:02:47.439
the biggest problem in autonomous driving, the

00:02:47.439 --> 00:02:50.479
long tail. You can drive a billion miles in sunny

00:02:50.479 --> 00:02:52.879
California and never see a snowstorm or a flood.

00:02:53.000 --> 00:02:55.199
Yeah, you just won't encounter it. But with Genie

00:02:55.199 --> 00:02:58.039
3, engineers can just type in a command. They

00:02:58.039 --> 00:03:00.620
use natural language to say add a flood or make

00:03:00.620 --> 00:03:03.500
it nighttime or insert a literal elephant. A

00:03:03.500 --> 00:03:05.259
literal elephant. That is actually in the notes.

00:03:05.400 --> 00:03:07.800
It is. And it matters because of this safe danger

00:03:07.800 --> 00:03:11.620
concept. You can test what if decisions. What

00:03:11.620 --> 00:03:14.199
if the car swerves left? What if it brakes hard?
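[Editor's note: statistically, "what if" testing like this is Monte Carlo estimation: replay one decision across thousands of randomized variations of a scenario and count failures. A minimal sketch with hypothetical `policy` and `simulate` stand-ins, not any real simulator API:]

```python
import random

def crash_probability(policy, simulate, n_runs=10_000, seed=0):
    """Estimate the failure rate of one driving decision by replaying the
    same scenario under randomized variations -- entirely in simulation.
    `policy` and `simulate` are illustrative stand-ins."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_runs):
        variation = rng.random()          # e.g. perturbed actor timing
        if simulate(policy, variation):   # True means a collision occurred
            failures += 1
    return failures / n_runs

# Toy stand-ins: "swerve left" fails in about 2% of perturbed scenarios.
swerve = "swerve_left"
sim = lambda policy, v: v < 0.02
rate = crash_probability(swerve, sim)
```

[The point is the denominator: 10,000 trials cost cloud compute, not passengers.]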

00:03:14.340 --> 00:03:16.979
You can run that scenario 10,000 times in the

00:03:16.979 --> 00:03:19.840
cloud without ever risking a passenger or, you

00:03:19.840 --> 00:03:21.419
know, a pedestrian. It's interesting because

00:03:21.419 --> 00:03:24.340
they call it language control. So an engineer

00:03:24.340 --> 00:03:27.300
is basically being a god of this little virtual

00:03:27.300 --> 00:03:29.939
world, just speaking disasters into existence.

00:03:30.240 --> 00:03:32.199
Let there be a tornado. Exactly. And controlling

00:03:32.199 --> 00:03:34.039
the scene layout, the traffic flow, everything.
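[Editor's note: a toy sketch of what "language control" amounts to: free-text commands mutate a structured scene description that the generator then renders. The `apply_command` function and its command grammar are purely illustrative assumptions, not Waymo's or Genie 3's actual interface:]

```python
# Free-text commands edit a scene dict; a simulator would render the result.
def apply_command(scene: dict, command: str) -> dict:
    cmd = command.lower()
    if cmd.startswith("add "):
        scene.setdefault("objects", []).append(cmd[4:])   # e.g. "a flood"
    elif cmd.startswith("make it "):
        scene["time_of_day"] = cmd[len("make it "):]      # e.g. "nighttime"
    elif cmd.startswith("insert "):
        scene.setdefault("objects", []).append(cmd[7:])
    return scene

scene = {"location": "suburban intersection"}
for c in ["add a flood", "make it nighttime", "insert a literal elephant"]:
    scene = apply_command(scene, c)
# scene now lists the flood and the elephant, at night
```

[In the real system the "renderer" is the generative model itself, so the command vocabulary is open-ended rather than a fixed grammar like this one.]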

00:03:34.500 --> 00:03:37.340
Waymo's bet here is really specific. If you can

00:03:37.340 --> 00:03:40.080
simulate everything, you can make real-world

00:03:40.080 --> 00:03:43.199
failure impossible, or at least, you know, statistically

00:03:43.199 --> 00:03:45.840
negligible. It creates this real moment of wonder

00:03:45.840 --> 00:03:48.400
for me. Just thinking about the computational

00:03:48.400 --> 00:03:51.360
power required to simulate how a tornado affects

00:03:51.360 --> 00:03:53.639
a LiDAR sensor, that's not just pixels, that's

00:03:53.639 --> 00:03:55.800
light physics being predicted by a neural net.

00:03:55.960 --> 00:03:58.479
It's massive. And they're even building a Gemini

00:03:58.479 --> 00:04:00.599
-based voice assistant for inside the car, so

00:04:00.599 --> 00:04:03.340
the interaction inside is evolving too. So here's

00:04:03.340 --> 00:04:05.340
where it gets really interesting for me. If the

00:04:05.340 --> 00:04:08.439
simulation is indistinguishable from reality

00:04:08.439 --> 00:04:12.300
for the car, does the real world even matter

00:04:12.300 --> 00:04:16.259
for training anymore? Not really. If the data

00:04:16.259 --> 00:04:18.839
is perfect, the training is valid. The car doesn't

00:04:18.839 --> 00:04:20.879
care if the photon came from the sun or from

00:04:20.879 --> 00:04:22.839
a server. Okay, so that's simulating the world.

00:04:22.879 --> 00:04:25.000
We're building the matrix for cars. But we also

00:04:25.000 --> 00:04:27.620
have this breakthrough from Harvard and Stanford

00:04:27.620 --> 00:04:30.000
about moving through the world. And this one

00:04:30.000 --> 00:04:32.420
seems to bridge the gap between chatbots and

00:04:32.420 --> 00:04:35.959
robots. This is the OAT system. OAT. The name

00:04:35.959 --> 00:04:37.540
isn't as important as the mechanism. This is

00:04:37.540 --> 00:04:40.160
about predicting robot actions like they're text.

00:04:40.339 --> 00:04:42.560
Right, because we know transformers, the tech

00:04:42.560 --> 00:04:45.000
behind GPT and Claude, are really good at predicting

00:04:45.000 --> 00:04:47.339
the next word in the sentence. The cat sat on

00:04:47.339 --> 00:04:50.000
the... and the AI knows "mat." Exactly. But robots

00:04:50.000 --> 00:04:52.639
don't move in words. They move in continuous

00:04:52.639 --> 00:04:56.259
kind of messy physical arcs. A robot arm reaching

00:04:56.259 --> 00:04:59.240
for a cup isn't a discrete word. It's a flow

00:04:59.240 --> 00:05:02.759
of analog data. Voltage, torque, velocity. All

00:05:02.759 --> 00:05:05.240
of that. So how do you turn a backflip into a

00:05:05.240 --> 00:05:07.730
sentence? I have no idea. You have to digitize

00:05:07.730 --> 00:05:10.089
it. Think of it like music. Sound is a continuous

00:05:10.089 --> 00:05:13.089
wave, right? But to put it on a CD or an MP3,

00:05:13.269 --> 00:05:15.250
you have to chop it up into digital bits. You

00:05:15.250 --> 00:05:17.870
take samples. Okay, I follow that. OAT does that

00:05:17.870 --> 00:05:20.829
for movement. The researchers built an encoder

00:05:20.829 --> 00:05:23.209
that takes that continuous motion, the robot

00:05:23.209 --> 00:05:27.000
arm swinging, and splits it into chunks. Then

00:05:27.000 --> 00:05:29.839
it uses a process called finite scalar quantization.

00:05:30.060 --> 00:05:32.439
That is a mouthful. It is. But just think of

00:05:32.439 --> 00:05:35.300
it as creating a vocabulary. It forces the infinite

00:05:35.300 --> 00:05:38.139
complexity of a robot arm's arc into a fixed

00:05:38.139 --> 00:05:41.939
menu of specific movement words or tokens. So

00:05:41.939 --> 00:05:43.860
instead of a smooth wave, it becomes a series

00:05:43.860 --> 00:05:46.759
of steps, like token A, then token B, then token

00:05:46.759 --> 00:05:49.459
C. Exactly. And because these tokens flow left

00:05:49.459 --> 00:05:51.720
to right, just like a sentence in English, standard

00:05:51.720 --> 00:05:54.000
large language models can process them. So you

00:05:54.000 --> 00:05:56.860
could feed a movement sequence into GPT-5 or

00:05:56.860 --> 00:06:00.639
Claude and it can autocomplete the movement. Yes.
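[Editor's note: the finite scalar quantization step described above can be sketched in a few lines: snap each dimension of a continuous action vector to a small fixed grid, then pack the grid cell into a single integer token an LLM can treat like a word. The dimensions and level counts below are illustrative assumptions, not the values from the OAT work:]

```python
# Finite scalar quantization (FSQ) sketch: continuous motion -> discrete tokens.
def fsq_encode(action, levels=(8, 8, 8)):
    """action: values in [-1, 1] per dimension (e.g. dx, dy, gripper)."""
    token = 0
    for x, L in zip(action, levels):
        q = round((x + 1) / 2 * (L - 1))      # snap to one of L levels
        token = token * L + q                 # pack as a mixed-radix integer
    return token

def fsq_decode(token, levels=(8, 8, 8)):
    vals = []
    for L in reversed(levels):
        token, q = divmod(token, L)
        vals.append(q / (L - 1) * 2 - 1)      # map level back to [-1, 1]
    return list(reversed(vals))

tok = fsq_encode([0.5, -0.5, 0.0])
approx = fsq_decode(tok)
# round-trip error is bounded by half a grid step per dimension
assert all(abs(a - b) <= 1 / 7 + 1e-9 for a, b in zip([0.5, -0.5, 0.0], approx))
```

[Because every token is an integer from a fixed vocabulary, a sequence of them reads left to right exactly like a sentence, which is what lets a standard transformer "autocomplete" motion.]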

00:06:01.000 --> 00:06:03.740
The breakthrough is that it turns physical dexterity

00:06:03.740 --> 00:06:06.579
into a language problem. The first few tokens

00:06:06.579 --> 00:06:08.899
might describe the general direction, move arm

00:06:08.899 --> 00:06:11.819
up, and the later tokens fill in the fine motor

00:06:11.819 --> 00:06:15.300
details, rotate wrist 10 degrees. That feels

00:06:15.300 --> 00:06:18.040
huge. Yeah. It means we aren't reinventing the

00:06:18.040 --> 00:06:20.360
wheel for robotics. We're just piggybacking on

00:06:20.360 --> 00:06:22.259
the massive intelligence we already built for

00:06:22.259 --> 00:06:25.160
chatbots. And it is beating previous methods

00:06:25.160 --> 00:06:28.079
across 20 different tasks. It's vastly more efficient.

00:06:28.279 --> 00:06:31.660
The implication is wild. If a robot's brain is

00:06:31.660 --> 00:06:34.879
just a large language model, then the robot effectively

00:06:34.879 --> 00:06:38.319
knows everything the Internet knows. So does

00:06:38.319 --> 00:06:40.220
this mean we can eventually just talk a robot

00:06:40.220 --> 00:06:43.480
into learning a backflip? Essentially, yes. By

00:06:43.480 --> 00:06:45.540
treating the backflip as a sentence of movement

00:06:45.540 --> 00:06:47.699
tokens, you're just prompting it to complete

00:06:47.699 --> 00:06:49.980
the thought, but physically. Complete the thought

00:06:49.980 --> 00:06:52.240
physically. I like that. It connects perfectly

00:06:52.240 --> 00:06:53.939
to the sheer scale of what is happening right

00:06:53.939 --> 00:06:57.420
now. We mentioned the date February 2026, and

00:06:57.420 --> 00:06:59.759
the industry is calling this the super cycle.

00:06:59.939 --> 00:07:01.920
The numbers coming out of OpenAI this week are

00:07:01.920 --> 00:07:06.050
frankly absurd. 800 million weekly users. 800

00:07:06.050 --> 00:07:08.949
million. That is nearly the population of the

00:07:08.949 --> 00:07:11.769
entire Western Hemisphere interacting

00:07:11.769 --> 00:07:13.930
with these models every single week. I remember

00:07:13.930 --> 00:07:15.930
when getting to 100 million was the fastest in

00:07:15.930 --> 00:07:18.329
history. Now they're reporting growth is back

00:07:18.329 --> 00:07:21.430
to over 10% monthly. And it's not just chat.

00:07:21.990 --> 00:07:26.389
Codex usage, the coding AI, surged 50% just

00:07:26.389 --> 00:07:29.589
after the GPT-5.3 launch. This tells me that

00:07:29.589 --> 00:07:31.569
this isn't a novelty anymore. This is infrastructure.

00:07:31.870 --> 00:07:34.730
Speaking of infrastructure, A16Z, that's Andreessen

00:07:34.730 --> 00:07:39.089
Horowitz, just dropped $1.7 billion into

00:07:39.089 --> 00:07:41.589
AI infrastructure. Wow. And that's out of a $15

00:07:41.589 --> 00:07:44.209
billion fund. They are literally rebuilding the

00:07:44.209 --> 00:07:46.050
platforms from the ground up. They're seeing

00:07:46.050 --> 00:07:49.519
2026 as the year the experimental phase ends

00:07:49.519 --> 00:07:52.180
and the utility phase begins. This is about chips,

00:07:52.300 --> 00:07:54.759
data centers, you know, the pipes that run the

00:07:54.759 --> 00:07:57.000
Internet. But with utility comes commercialization.

00:07:57.139 --> 00:07:59.839
And I have to admit, I had a bit of a sigh moment

00:07:59.839 --> 00:08:02.100
reading the news this morning. The ads. The ads.

00:08:02.120 --> 00:08:05.000
Ads are officially live in ChatGPT. Adobe is

00:08:05.000 --> 00:08:06.980
one of the launch partners. They're testing ads

00:08:06.980 --> 00:08:09.439
for Photoshop and Acrobat right there in the

00:08:09.439 --> 00:08:11.870
chat interface. It was inevitable, right? I know,

00:08:11.949 --> 00:08:13.850
I know. But I have to admit, I just hated seeing

00:08:13.850 --> 00:08:17.189
it. The pristine white box. It felt like a sanctuary,

00:08:17.470 --> 00:08:21.689
you know? Just pure intelligence. No noise. Now,

00:08:21.810 --> 00:08:23.769
if I ask for a summary or a photo editing tip,

00:08:24.009 --> 00:08:27.269
I might get a nudge to buy Firefly. It feels

00:08:27.269 --> 00:08:29.170
like the magic just got a corporate logo slapped

00:08:29.170 --> 00:08:32.169
on it. I get the nostalgia for the research preview

00:08:32.169 --> 00:08:35.070
era. Yeah. But look at the scale. Right. You

00:08:35.070 --> 00:08:39.190
can't serve 800 million users on compute-heavy

00:08:39.190 --> 00:08:42.049
models for free forever. It's the tradeoff.

00:08:42.210 --> 00:08:44.149
We're moving from the wild west of discovery

00:08:44.149 --> 00:08:47.210
to the utility grade of electricity. The electric

00:08:47.210 --> 00:08:49.070
company sends you a bill. That's a fair point.

00:08:49.230 --> 00:08:51.289
It is now utility grade infrastructure. It's

00:08:51.289 --> 00:08:53.029
like electricity or the telephone network. It's

00:08:53.029 --> 00:08:55.809
just there. It's just there. Right. And because

00:08:55.809 --> 00:08:58.730
it's just there, the way we use it is changing.

00:08:58.870 --> 00:09:01.070
So we have the 800 million users and we have

00:09:01.070 --> 00:09:02.970
the massive silicon build out. The question is,

00:09:02.990 --> 00:09:05.409
what are they actually doing with it? Because

00:09:05.409 --> 00:09:08.049
this week the tools shifted. We move from look

00:09:08.049 --> 00:09:10.120
at this funny video to this is how you

00:09:10.120 --> 00:09:13.000
run a Fortune 500 company. The AI-powered tool

00:09:13.000 --> 00:09:16.299
sector is really maturing. But the video generation

00:09:16.299 --> 00:09:19.840
space is what catches your eye first. ByteDance

00:09:19.840 --> 00:09:23.039
released Seedance 2.0. The demos for that are

00:09:23.039 --> 00:09:26.379
wild. I saw one with 2.6 million views. It's

00:09:26.379 --> 00:09:29.080
a massive leap. And then you have Kling 3.0.

00:09:29.139 --> 00:09:31.840
Kling 3.0. This is the one claiming scene-level

00:09:31.840 --> 00:09:34.759
control. What does that actually mean? It means

00:09:34.759 --> 00:09:37.860
we're moving past the slot machine era of AI

00:09:37.860 --> 00:09:41.490
video. Type a prompt, pull the lever, and just

00:09:41.490 --> 00:09:43.590
hope the video isn't a nightmare. Right. Scene

00:09:43.590 --> 00:09:46.450
level means multi-shot control, longer takes.

00:09:46.450 --> 00:09:49.049
Yeah, it understands continuity. You are directing,

00:09:49.049 --> 00:09:51.389
not just prompting. But the biggest shift for

00:09:51.389 --> 00:09:53.529
me, and this ties back to that utility idea, is

00:09:53.529 --> 00:09:56.629
OpenAI Frontier. Ah, this is the enterprise play,

00:09:56.629 --> 00:09:59.210
and this is where the money is. Explain the difference

00:09:59.210 --> 00:10:01.840
here. Because we have team accounts, we have enterprise

00:10:01.840 --> 00:10:04.679
accounts, what is Frontier? So Frontier isn't

00:10:04.679 --> 00:10:06.639
just about giving your employees access to a

00:10:06.639 --> 00:10:10.059
chatbot. It's a platform for managing AI agents.

00:10:10.320 --> 00:10:13.480
Right, not chatbots that answer questions, but

00:10:13.480 --> 00:10:16.519
agents that perform workflows. Frontier provides

00:10:16.519 --> 00:10:19.360
shared context, onboarding for these digital

00:10:19.360 --> 00:10:23.039
workers, and oversight. It treats the AI as a

00:10:23.039 --> 00:10:26.460
workforce, not a software tool. See, that's the

00:10:26.460 --> 00:10:28.720
shift. We're moving from generating content,

00:10:28.960 --> 00:10:32.100
write me a poem, make me a video, to orchestrating

00:10:32.100 --> 00:10:34.600
labor. Exactly. The human role is shifting. You

00:10:34.600 --> 00:10:36.559
aren't playing the violin anymore. You're waving

00:10:36.559 --> 00:10:39.779
the baton. You are the conductor. The conductor.

00:10:39.940 --> 00:10:43.200
That is both exciting and terrifying. It feels

00:10:43.200 --> 00:10:45.200
like the friction is disappearing, whether it's

00:10:45.200 --> 00:10:47.639
creating a video scene or automating a business

00:10:47.639 --> 00:10:50.279
process. The barrier to entry is just melting

00:10:50.279 --> 00:10:53.190
away. And that brings its own chaos. But it's

00:10:53.190 --> 00:10:55.830
the reality of 2026. I feel like we need to take

00:10:55.830 --> 00:10:57.990
a breath and look at the big picture here. We've

00:10:57.990 --> 00:11:00.169
covered a lot of ground. We have the God mode

00:11:00.169 --> 00:11:02.789
simulation at Waymo. We have the robot language

00:11:02.789 --> 00:11:05.750
at Harvard. And we have the super cycle of adoption.

00:11:06.049 --> 00:11:08.409
That's a lot. If we zoom out, what's the thread

00:11:08.409 --> 00:11:11.070
connecting these for you? I think it's the synthetic

00:11:11.070 --> 00:11:13.830
tipping point. Look at the three pillars. First,

00:11:13.990 --> 00:11:16.690
simulation. We're replacing real -world testing

00:11:16.690 --> 00:11:19.809
with synthetic experiences. That's Waymo. Okay.

00:11:20.090 --> 00:11:23.000
Second. Translation, we're converting physical

00:11:23.000 --> 00:11:25.360
movement into synthetic language tokens, which

00:11:25.360 --> 00:11:29.740
is OAT. And third, scale. We have 800 million

00:11:29.740 --> 00:11:32.340
people living and working inside this synthetic

00:11:32.340 --> 00:11:34.360
infrastructure. It really does feel like we're

00:11:34.360 --> 00:11:36.059
waking up in a different world every Monday.

00:11:36.179 --> 00:11:38.460
The lines between the physical and the digital

00:11:38.460 --> 00:11:41.480
are just gone. They're blurring, certainly. I

00:11:41.480 --> 00:11:42.799
want to leave everyone with a thought to mull

00:11:42.799 --> 00:11:45.840
over. We talked about Waymo simulating tornadoes.

00:11:45.840 --> 00:11:48.820
We talked about OAT turning backflips into words.

00:11:49.000 --> 00:11:52.139
Right. If an AI can simulate a tornado perfectly

00:11:52.139 --> 00:11:54.240
to the point where the sensors can't tell the

00:11:54.240 --> 00:11:57.320
difference and another AI can predict robot movements

00:11:57.320 --> 00:12:01.330
like words. How long until the simulation is

00:12:01.330 --> 00:12:03.549
the training ground for all physical labor? Like,

00:12:03.549 --> 00:12:05.509
do we ever need to practice anything in the real

00:12:05.509 --> 00:12:07.470
world again? That's the billion-dollar question.

00:12:07.730 --> 00:12:10.049
If the simulation is perfect, the real world

00:12:10.049 --> 00:12:12.409
is just the final exam. Something to think about

00:12:12.409 --> 00:12:14.509
while you're looking out the window of your

00:12:14.509 --> 00:12:17.190
robotaxi. Thanks for diving in with us. Keep exploring.

00:12:17.309 --> 00:12:18.669
We'll catch you in the next deep dive.
