WEBVTT

00:00:00.000 --> 00:00:03.700
Imagine this. There's a robot, Neo, and it has

00:00:03.700 --> 00:00:07.280
one simple task. Pull a tissue from a box. Okay.

00:00:07.440 --> 00:00:11.019
But before it moves, it just stops. And inside

00:00:11.019 --> 00:00:14.119
its own process, it generates eight short five-

00:00:14.119 --> 00:00:16.620
second video clips. Some of those videos show

00:00:16.620 --> 00:00:19.719
it working perfectly. In others, it fails. The

00:00:19.719 --> 00:00:22.719
tissue rips. Maybe the box falls over. It sees

00:00:22.719 --> 00:00:24.699
it all first. It's incredible. It's not just

00:00:24.699 --> 00:00:27.219
running a program. That robot is using what you

00:00:27.219 --> 00:00:30.140
could call synthetic imagination. Synthetic imagination.

00:00:30.199 --> 00:00:33.119
Yeah, it's mapping out possible futures to guide

00:00:33.119 --> 00:00:35.880
what it does in the real world, which is messy

00:00:35.880 --> 00:00:38.899
and unpredictable. It's a huge shift. Welcome

00:00:38.899 --> 00:00:42.000
back to the Deep Dive. That idea, that hook,

00:00:42.140 --> 00:00:44.380
it really captures the two big themes we're looking

00:00:44.380 --> 00:00:46.500
at today. It really does. Okay, so let's unpack

00:00:46.500 --> 00:00:48.719
this. In this Deep Dive, we're focusing on two

00:00:48.719 --> 00:00:52.060
major things: how physical AI is learning to

00:00:52.060 --> 00:00:55.740
imagine, and then the surprising realities of

00:00:55.740 --> 00:00:57.979
how AI is being adopted around the world. That's

00:00:57.979 --> 00:01:00.039
the mission. We've synthesized the key insights

00:01:00.039 --> 00:01:01.979
from the sources you shared with us. We'll start

00:01:01.979 --> 00:01:04.400
with that breakthrough from 1X Robotics. Then

00:01:04.400 --> 00:01:06.280
we're going to do a rapid-fire look at the current

00:01:06.280 --> 00:01:10.359
market. Some big deals, some friction, and a few

00:01:10.359 --> 00:01:12.819
clever research tricks. And then the global report.

00:01:13.019 --> 00:01:14.939
And then, yeah, global report that really challenges

00:01:14.939 --> 00:01:17.400
how we think about where the U.S. actually stands

00:01:17.400 --> 00:01:19.480
in all of this. There's a lot of crucial ground

00:01:19.480 --> 00:01:21.719
to cover. Let's jump right in with how these

00:01:21.719 --> 00:01:24.400
robots are starting to teach themselves. So the

00:01:24.400 --> 00:01:27.280
1X world model for these Neo robots,

00:01:27.280 --> 00:01:30.280
it's a really critical inflection point. This

00:01:30.280 --> 00:01:33.359
is the step that allows them to learn new physical

00:01:33.359 --> 00:01:36.140
tasks without a person having to code every

00:01:36.140 --> 00:01:39.219
single movement. Exactly, without that intense,

00:01:39.219 --> 00:01:42.420
you know, bespoke human coding for every single

00:01:42.420 --> 00:01:45.400
action. It's a move away from the old way of doing

00:01:45.400 --> 00:01:47.819
things. And the how is what's so important here.

00:01:47.819 --> 00:01:49.780
It's not the traditional approach where you calculate

00:01:49.780 --> 00:01:52.379
exact joint angles and all that. No, it's much

00:01:52.379 --> 00:01:55.200
more visual, much more predictive. The system takes

00:01:55.200 --> 00:01:57.680
a simple text prompt, like your pull-a-tissue

00:01:57.680 --> 00:02:02.219
example, and it takes the robot's current camera

00:02:02.219 --> 00:02:04.959
view, its context. It feeds both of those into

00:02:04.959 --> 00:02:08.210
this world model, which basically acts like an

00:02:08.210 --> 00:02:11.330
internal simulator. And that's what generates

00:02:11.330 --> 00:02:14.229
those little five-second imagined videos of

00:02:14.229 --> 00:02:16.330
what might happen? Yep. They use a concept for

00:02:16.330 --> 00:02:18.590
this, right? Yeah. Something like video diffusion

00:02:18.590 --> 00:02:21.199
for motion planning. That's the technical term,

00:02:21.259 --> 00:02:23.159
and we should probably define that. It just means

00:02:23.159 --> 00:02:26.080
the robot plans its moves by trying to generate

00:02:26.080 --> 00:02:29.520
a successful visual outcome first. It's all based

00:02:29.520 --> 00:02:32.520
on pixels, not just lines of code. Which is so

00:02:32.520 --> 00:02:34.300
much better for dealing with new situations.

00:02:34.680 --> 00:02:37.360
Fundamentally better. Because in the old way,

00:02:37.520 --> 00:02:39.780
if the box was tilted or the light was a bit

00:02:39.780 --> 00:02:42.039
dim, the whole thing could fail if you didn't

00:02:42.039 --> 00:02:45.039
code for it. Right. With this, the model just

00:02:45.039 --> 00:02:47.259
imagines what success looks like, the successful

00:02:47.259 --> 00:02:50.090
pixels, and works backward from there. So it

00:02:50.090 --> 00:02:52.229
runs, what, eight different imagined futures?

00:02:52.449 --> 00:02:54.710
Yeah. And then it picks the best one. Correct.

00:02:55.009 --> 00:02:57.370
An internal critic looks at those eight rollouts,

00:02:57.449 --> 00:02:59.550
and it selects the one that seems most likely

00:02:59.550 --> 00:03:02.550
to succeed. Only then does a second model translate

00:03:02.550 --> 00:03:06.090
that chosen video into the actual joint commands.

00:03:06.449 --> 00:03:08.849
And the results really back this up. The sources

00:03:08.849 --> 00:03:11.949
were clear that for that tissue-pulling task,

00:03:12.310 --> 00:03:15.840
the success rate jumped from 30% to 45%. Just

00:03:15.840 --> 00:03:18.080
by sampling. Just by sampling and choosing the

00:03:18.080 --> 00:03:20.599
best imagined future before it even moves. Yeah.

00:03:20.659 --> 00:03:22.819
That's a huge improvement in reliability. It

00:03:22.819 --> 00:03:25.280
is. And what's really exciting is that it works

00:03:25.280 --> 00:03:27.379
for tasks it's seen before, what they call in-

00:03:27.379 --> 00:03:30.319
distribution, but also for totally new tasks.

00:03:30.580 --> 00:03:32.819
It can generalize. OK, but here's the reality

00:03:32.819 --> 00:03:36.479
check. Speed. Ah, yes. The latency. The sources

00:03:36.479 --> 00:03:39.439
highlight a big limitation. It takes 11 seconds

00:03:39.439 --> 00:03:41.659
for that whole imagination and planning phase.

00:03:41.840 --> 00:03:44.400
Plus one second for the action itself. So 12

00:03:44.400 --> 00:03:46.800
seconds total. Which is a long time. 11 seconds

00:03:46.800 --> 00:03:48.879
is definitely the defining challenge for this

00:03:48.879 --> 00:03:51.860
to be practical, you know, in real time. It shows

00:03:51.860 --> 00:03:54.319
the AI race has moved beyond just... virtual

00:03:54.319 --> 00:03:57.520
agents like ChatGPT. Now it's about physical

00:03:57.520 --> 00:04:00.979
agents, and 1X is shipping soon. This tech is

00:04:00.979 --> 00:04:03.719
the key, but that latency has to come down. So

00:04:03.719 --> 00:04:06.219
that raises the question: how does this 11-second

00:04:06.219 --> 00:04:09.300
planning latency affect its real world usefulness

00:04:09.300 --> 00:04:11.979
right now? It's too slow for instantaneous work,

00:04:12.060 --> 00:04:14.879
but it definitively proves the concept. Using

00:04:14.879 --> 00:04:17.060
visual imagination for physical planning works.

00:04:17.279 --> 00:04:20.319
It's a proof of concept. Okay, so while 1x is

00:04:20.319 --> 00:04:23.379
figuring that out, let's pivot to the... absolute

00:04:23.379 --> 00:04:25.839
chaos of the immediate market. This is where

00:04:25.839 --> 00:04:28.339
the pace is just relentless. It's a flurry of

00:04:28.339 --> 00:04:30.860
news for sure. Let's start with a kind of wild

00:04:30.860 --> 00:04:34.860
story. The Shopify CEO went viral for using Claude

00:04:34.860 --> 00:04:37.720
to build a custom tool to analyze his own X-ray.

00:04:38.009 --> 00:04:40.810
Yeah, I saw that. The ability to use an LLM,

00:04:40.970 --> 00:04:43.370
not just for text, but to code a specific tool

00:04:43.370 --> 00:04:46.230
for yourself in an afternoon. Right. That speed

00:04:46.230 --> 00:04:49.069
is remarkable. It really blurs the line between

00:04:49.069 --> 00:04:52.310
a developer tool and just a consumer app. I mean,

00:04:52.370 --> 00:04:54.689
it still feels kind of wild that people are doing

00:04:54.689 --> 00:04:56.610
their own medical reads, even if they're technically

00:04:56.610 --> 00:04:59.329
capable. And moving over to the media side, Google

00:04:59.329 --> 00:05:02.949
just upgraded Veo 3.1, their video model. The

00:05:02.949 --> 00:05:05.589
quality enhancements are huge. We're talking

00:05:05.980 --> 00:05:08.860
full support for vertical, you know, 9:16 videos,

00:05:09.139 --> 00:05:13.000
4K upscaling, and the clips just feel more dynamic,

00:05:13.139 --> 00:05:15.279
more natural, even with really short prompts.

00:05:15.500 --> 00:05:17.639
The gap between just typing something and getting

00:05:17.639 --> 00:05:19.740
a high -quality video back is shrinking almost

00:05:19.740 --> 00:05:21.199
every week. It's going to have a huge impact.

00:05:21.519 --> 00:05:23.519
And speaking of unexpected things, this was a

00:05:23.519 --> 00:05:26.480
real aha moment from the sources. Google researchers

00:05:26.480 --> 00:05:30.180
found what the source called a dumb trick. A

00:05:30.180 --> 00:05:33.040
dumb trick, as in simple but incredibly effective.

00:05:33.439 --> 00:05:37.480
Exactly. And this trick boosted accuracy by

00:05:37.480 --> 00:05:40.959
76% on certain tasks. And it worked across the

00:05:40.959 --> 00:05:45.339
board. Gemini, GPT, Claude, DeepSeek, all of

00:05:45.339 --> 00:05:48.060
them. So what was the trick? It wasn't some complex

00:05:48.060 --> 00:05:52.180
new architecture. It was just specific phrasing

00:05:52.180 --> 00:05:54.879
in the prompt itself. To stop the model from

00:05:54.879 --> 00:05:57.250
forgetting things. Precisely. They found that

00:05:57.250 --> 00:05:59.829
phrasing it like, "let's generate four detailed

00:05:59.829 --> 00:06:02.370
options first and then evaluate them one by one

00:06:02.370 --> 00:06:05.009
before you give me a final answer," made a massive

00:06:05.009 --> 00:06:06.930
difference. You're forcing it to show its work.

00:06:07.089 --> 00:06:09.410
It's like metacognition for an LLM. It is. And

00:06:09.410 --> 00:06:11.170
that just shows why prompt engineering is still

00:06:11.170 --> 00:06:13.430
so hard. You know, I still wrestle with prompt

00:06:13.430 --> 00:06:15.430
drift myself. What do you mean by that? It's

00:06:15.430 --> 00:06:17.290
when you make a tiny change to the wording and

00:06:17.290 --> 00:06:19.490
suddenly the quality of the output just plummets.

00:06:19.790 --> 00:06:22.370
So to find that one simple instruction can give

00:06:22.370 --> 00:06:25.569
you such a huge cross-model return, it's...

00:06:25.740 --> 00:06:28.300
It's humbling. Right. Now, shifting to market

00:06:28.300 --> 00:06:31.180
power, the huge deal between Apple and Google.

00:06:31.339 --> 00:06:33.720
The Gemini deal. Yeah. The reporting says it's

00:06:33.720 --> 00:06:35.899
around a billion dollars a year for Gemini to

00:06:35.899 --> 00:06:39.019
power Siri and other Apple AI features. That

00:06:39.019 --> 00:06:42.259
is just massive. We're seeing this profound centralization

00:06:42.259 --> 00:06:45.240
of capability. You have two of the biggest tech

00:06:45.240 --> 00:06:47.600
companies in the world combining their AI reach.

00:06:47.939 --> 00:06:50.560
Which, of course, drew immediate criticism. Elon

00:06:50.560 --> 00:06:52.939
Musk was very vocal, saying it represents an

00:06:52.939 --> 00:06:55.670
unreasonable concentration of power. Well, and

00:06:55.670 --> 00:06:57.589
that concentration of power is creating friction

00:06:57.589 --> 00:06:59.370
on the ground. Literally, we're seeing physical

00:06:59.370 --> 00:07:02.050
protests against AI infrastructure. Microsoft.

00:07:02.110 --> 00:07:05.230
Yeah, Microsoft unveiled these new community-

00:07:05.230 --> 00:07:08.069
first data center plans. They're promising things

00:07:08.069 --> 00:07:11.209
like no local electricity bill hikes, more jobs,

00:07:11.310 --> 00:07:13.290
all of that. But the sources say that locals

00:07:13.290 --> 00:07:16.370
in 24 different states are still actively protesting

00:07:16.370 --> 00:07:19.009
the rollouts. The energy and land use are just

00:07:19.009 --> 00:07:21.670
too visible. And the promise of jobs tomorrow

00:07:21.670 --> 00:07:24.029
doesn't always outweigh the noise and environmental

00:07:24.029 --> 00:07:27.439
impact of a huge data center today. Yeah, that's

00:07:27.439 --> 00:07:29.620
a key conflict. And finally, on hardware, while

00:07:29.620 --> 00:07:32.019
NVIDIA is king, challengers are getting serious

00:07:32.019 --> 00:07:35.339
funding. A startup called Etched just raised

00:07:35.339 --> 00:07:37.899
half a billion dollars. 500 million. To take

00:07:37.899 --> 00:07:40.600
on NVIDIA directly. They're betting on specialized

00:07:40.600 --> 00:07:43.180
chips optimized just for large language models.

00:07:43.420 --> 00:07:46.120
Which tells you the market sees the GPU bottleneck

00:07:46.120 --> 00:07:48.759
as a real vulnerability. That specialized hardware

00:07:48.759 --> 00:07:50.839
might not beat NVIDIA on everything, but for

00:07:50.839 --> 00:07:53.819
certain tasks, it could be way more efficient.

00:07:54.139 --> 00:07:57.699
So does that massive Apple and Gemini deal pretty

00:07:57.699 --> 00:08:00.100
much confirm the market is moving toward... a

00:08:00.100 --> 00:08:02.100
centralized, maybe two- or three-player landscape?

00:08:02.339 --> 00:08:05.079
Yes. The sheer value of that deal suggests strong

00:08:05.079 --> 00:08:07.699
centralization, despite understandable criticism

00:08:07.699 --> 00:08:10.680
about concentrated power. [Mid-roll sponsor read

00:08:10.680 --> 00:08:13.560
insert here.] OK, we are back. And now we're going

00:08:13.560 --> 00:08:15.680
to transition from that fast-paced, high-value

00:08:15.680 --> 00:08:18.759
U.S. market to the broader world stage. We're

00:08:18.759 --> 00:08:21.220
focusing on this Microsoft AI Economy Institute

00:08:21.220 --> 00:08:25.860
report. And the data on global adoption is genuinely

00:08:25.860 --> 00:08:28.399
surprising. It really is. This is where we see

00:08:28.399 --> 00:08:30.180
the difference between who's developing the models

00:08:30.180 --> 00:08:32.720
versus who is actually adopting them. The sources

00:08:32.720 --> 00:08:35.539
show that despite all the innovation here, the

00:08:35.539 --> 00:08:38.799
U.S. ranks only 24th in actual AI adoption.

00:08:39.039 --> 00:08:41.580
24th. 24th. The adoption rate here is reported

00:08:41.580 --> 00:08:45.779
at 24%. Now, compare that to, say, the UAE. They're

00:08:45.779 --> 00:08:48.820
at 64%. That is an enormous gap. It suggests

00:08:48.820 --> 00:08:51.399
that, you know, a national strategy and top-down

00:08:51.399 --> 00:08:54.480
prioritization really matter, especially in smaller

00:08:54.480 --> 00:08:56.960
economies. And globally, the average is just

00:08:56.960 --> 00:09:00.120
over 16%. But developed countries are adopting

00:09:00.120 --> 00:09:02.139
at nearly double the rate of developing ones.

00:09:02.379 --> 00:09:04.580
But the report highlights an even more critical

00:09:04.580 --> 00:09:07.980
story, a counter-narrative to the big tech dominance.

00:09:08.399 --> 00:09:11.600
Because while everyone is focused on GPT, Gemini,

00:09:11.600 --> 00:09:13.960
and Claude, the model that's really succeeding

00:09:13.960 --> 00:09:17.320
in underserved markets is DeepSeek. Right, especially

00:09:17.320 --> 00:09:20.159
across Africa and Southeast Asia. Why them? They

00:09:20.159 --> 00:09:22.500
are running the classic Android playbook. It's

00:09:22.500 --> 00:09:25.179
a strategy focused on being everywhere. Be free,

00:09:25.259 --> 00:09:28.159
be open source. Be free, be open source, be low

00:09:28.159 --> 00:09:30.799
cost, and be good enough. That's the formula

00:09:30.799 --> 00:09:32.919
for high volume adoption. And it is working.

00:09:33.539 --> 00:09:36.019
DeepSeek is seeing two to four times higher usage

00:09:36.019 --> 00:09:38.480
than the big proprietary models in some African

00:09:38.480 --> 00:09:41.149
markets. And it's not just about the model. It's

00:09:41.149 --> 00:09:43.909
an infrastructure play. They're using key partnerships,

00:09:44.169 --> 00:09:47.330
especially with Huawei, to roll out the necessary

00:09:47.330 --> 00:09:50.830
hardware and cloud support. So open source democratizes

00:09:50.830 --> 00:09:53.490
access. Exactly. The world doesn't always need

00:09:53.490 --> 00:09:56.590
the most expensive, leading-edge model if

00:09:56.590 --> 00:09:59.429
90% of people can't afford it. It needs tools that

00:09:59.429 --> 00:10:02.509
are functional and accessible right now. And

00:10:02.509 --> 00:10:04.590
this confirms a major point from the sources.

00:10:05.049 --> 00:10:08.490
AI adoption in 2026 isn't just a tech stat anymore.

00:10:08.970 --> 00:10:11.490
It's a development stat. It is. If you ignore

00:10:11.490 --> 00:10:14.029
the next billion users in emerging economies,

00:10:14.250 --> 00:10:16.370
someone else is going to capture that market.

00:10:16.570 --> 00:10:19.629
Whoa. I mean, imagine scaling that kind of open

00:10:19.629 --> 00:10:22.090
source service to a billion queries. That's a

00:10:22.090 --> 00:10:24.149
colossal opportunity, and it's being realized

00:10:24.149 --> 00:10:26.570
through open access. It's a totally different

00:10:26.570 --> 00:10:28.490
way of thinking about the market. So this raises

00:10:28.490 --> 00:10:30.649
one last important question for this segment.

00:10:31.100 --> 00:10:34.120
What's the long term implication of these open

00:10:34.120 --> 00:10:36.559
source models leading adoption in emerging markets?

00:10:36.840 --> 00:10:39.679
Open source accessibility is closing the adoption

00:10:39.679 --> 00:10:43.220
gap, proving that AI tools must work for everyone

00:10:43.220 --> 00:10:47.600
to achieve true global reach. So to synthesize

00:10:47.600 --> 00:10:50.259
the big ideas for you, the learner, let's connect

00:10:50.259 --> 00:10:52.360
these two threads. On one hand, you have robotics,

00:10:52.559 --> 00:10:54.740
which is making this fundamental shift. Right.

00:10:55.149 --> 00:10:58.110
Physical agents moving from rigid code to what

00:10:58.110 --> 00:11:00.129
we're calling synthetic imagination. And that

00:11:00.129 --> 00:11:02.909
visual planning sets the stage for robots that

00:11:02.909 --> 00:11:05.049
can actually teach themselves. It's a move from

00:11:05.049 --> 00:11:08.929
pure reaction to real internal deliberation,

00:11:08.990 --> 00:11:11.549
even if it's a slow 11-second deliberation right

00:11:11.549 --> 00:11:14.669
now. And at the exact same time, the global landscape

00:11:14.669 --> 00:11:16.970
is full of friction. You have protests against

00:11:16.970 --> 00:11:19.070
infrastructure. You have this intense market

00:11:19.070 --> 00:11:21.389
centralization with the Apple and Gemini deal.

00:11:21.590 --> 00:11:24.309
And also incredible opportunity. Right. And that

00:11:24.429 --> 00:11:27.230
opportunity is being seized by accessible open

00:11:27.230 --> 00:11:29.850
source models like DeepSeek. They're the ones

00:11:29.850 --> 00:11:31.929
driving true adoption in emerging economies.

00:11:32.129 --> 00:11:34.169
It's a fascinating race, absolute performance

00:11:34.169 --> 00:11:36.669
at the very top versus sheer accessibility for

00:11:36.669 --> 00:11:39.509
everyone else. And that 1X Neo technology, that

00:11:39.509 --> 00:11:41.850
really feels like the fundamental marker of progress

00:11:41.850 --> 00:11:45.259
here. It is. The shift from an AI predicting

00:11:45.259 --> 00:11:48.279
what should happen based on code to an AI that

00:11:48.279 --> 00:11:50.879
actively imagines and then selects the best path

00:11:50.879 --> 00:11:53.980
from many possible futures, that changes everything

00:11:53.980 --> 00:11:56.639
about how a robot can learn. They're modeling

00:11:56.639 --> 00:11:58.919
reality before they touch it. It's the difference

00:11:58.919 --> 00:12:01.419
between simulating a single line and generating

00:12:01.419 --> 00:12:03.899
a whole field of possibilities. That's intelligence.

00:12:04.159 --> 00:12:06.399
That really is the next frontier. So we have

00:12:06.399 --> 00:12:07.960
to leave you with a final thought to mull over.

00:12:08.100 --> 00:12:10.860
What new autonomous capabilities will unlock

00:12:10.860 --> 00:12:14.480
when that AI imagination latency, that critical

00:12:14.480 --> 00:12:17.679
11-second planning time, drops down to just one

00:12:17.679 --> 00:12:20.379
second? When a robot's imagination becomes instantaneous,

00:12:20.860 --> 00:12:23.100
physical AI changes everything. Until the next

00:12:23.100 --> 00:12:25.759
Deep Dive, keep learning. [Outro music]
