WEBVTT

00:00:00.000 --> 00:00:03.060
The scale of modern AI is almost impossible

00:00:03.060 --> 00:00:05.740
to really grasp. I mean, right now, the cost

00:00:05.740 --> 00:00:07.580
of building the current AI infrastructure

00:00:07.580 --> 00:00:10.759
is already more than the Manhattan Project and

00:00:10.759 --> 00:00:13.640
the Apollo space program combined. That is a

00:00:13.640 --> 00:00:16.739
staggering comparison. It truly is. We're talking

00:00:16.739 --> 00:00:19.019
about physical structures that need the same

00:00:19.019 --> 00:00:23.179
amount of land as, what, 450 soccer fields? And

00:00:23.179 --> 00:00:25.320
they require enough electricity to light up a

00:00:25.320 --> 00:00:27.699
million homes. This isn't just, you know, scaling

00:00:27.699 --> 00:00:30.079
up tech. This is a whole new standard for global

00:00:30.079 --> 00:00:32.869
infrastructure. Welcome to the Deep Dive. We've

00:00:32.869 --> 00:00:34.549
gone through the latest research to bring you

00:00:34.549 --> 00:00:37.250
the most crucial insights into this, this massive

00:00:37.250 --> 00:00:39.490
shift. Our mission today is pretty clear. We're

00:00:39.490 --> 00:00:42.509
exploring two huge frontiers in AI that are happening

00:00:42.509 --> 00:00:44.670
at the same time. First, there's the conceptual

00:00:44.670 --> 00:00:47.509
one, why LLMs have sort of hit a wall, and what

00:00:47.509 --> 00:00:49.570
this idea of spatial intelligence actually means

00:00:49.570 --> 00:00:52.090
for what's next. And second, we're going to dive

00:00:52.090 --> 00:00:55.429
into the physical reality of it all, the just

00:00:55.429 --> 00:00:58.890
eye-watering cost and the unbelievable scale

00:00:58.890 --> 00:01:01.969
of these new ultra-mega data centers that have

00:01:01.969 --> 00:01:05.209
to run these models. That includes the, frankly,

00:01:05.310 --> 00:01:09.590
astonishing $32 billion Stargate project. So

00:01:09.590 --> 00:01:11.349
let's unpack this. We have to start with Fei-Fei

00:01:11.349 --> 00:01:13.359
Li. For anyone who doesn't know, she's the

00:01:13.359 --> 00:01:16.040
Stanford professor behind ImageNet, which is,

00:01:16.060 --> 00:01:17.700
I mean, it's essentially the foundation for the

00:01:17.700 --> 00:01:19.379
whole deep learning revolution we're living through.

00:01:19.519 --> 00:01:21.379
Right. And she just put out what is basically

00:01:21.379 --> 00:01:23.939
a manifesto. It's a really strong declaration

00:01:23.939 --> 00:01:27.260
that LLMs, large language models, have pretty

00:01:27.260 --> 00:01:29.200
much reached their ceiling. She argues that the

00:01:29.200 --> 00:01:31.280
future of AI isn't just about better language.

00:01:31.459 --> 00:01:33.980
It requires something she calls spatial intelligence.

00:01:34.340 --> 00:01:36.620
So what does that actually mean for an AI, I

00:01:36.620 --> 00:01:38.700
mean? What is spatial intelligence when you take

00:01:38.700 --> 00:01:41.170
it out of the human context? It's the kind of

00:01:41.170 --> 00:01:43.290
thing we do every single day without even thinking

00:01:43.290 --> 00:01:45.269
about it. Yeah. It's the ability to understand

00:01:45.269 --> 00:01:48.450
and navigate and interact with three-dimensional

00:01:48.450 --> 00:01:51.189
space. Yeah. Like think about catching a set

00:01:51.189 --> 00:01:54.010
of keys someone tosses you in a dark room. Or

00:01:54.010 --> 00:01:56.810
a firefighter who has to instantly read a chaotic

00:01:56.810 --> 00:01:59.510
smoke-filled space to find the safest way out.

00:01:59.819 --> 00:02:02.120
It's about applying the basic rules of physics

00:02:02.120 --> 00:02:03.959
and prediction to the world you can actually

00:02:03.959 --> 00:02:06.739
touch. Exactly. Yeah. Li's big critique is

00:02:06.739 --> 00:02:09.419
that LLMs, for all their amazing chat abilities,

00:02:09.680 --> 00:02:12.960
are completely blind to reality. They don't know

00:02:12.960 --> 00:02:15.919
that if you drop something, it falls down. Always.

00:02:16.060 --> 00:02:18.639
They have zero concept of gravity or mass unless

00:02:18.639 --> 00:02:20.960
we spell it out for them in text. So to fix that,

00:02:21.020 --> 00:02:22.860
she's proposing that AI needs what she calls

00:02:22.860 --> 00:02:25.539
world models. And she says these models have

00:02:25.539 --> 00:02:27.780
to have three core capabilities to bridge that

00:02:27.780 --> 00:02:30.240
gap between just language and actual physics.

00:02:30.500 --> 00:02:32.219
Okay, so the first one is that they have to be

00:02:32.219 --> 00:02:34.199
generative. The models have to be able to create

00:02:34.199 --> 00:02:37.199
these really complex 3D environments that, and

00:02:37.199 --> 00:02:39.919
this is key, strictly obey the rules of real

00:02:39.919 --> 00:02:43.159
world physics. No weird floating teacups. Second,

00:02:43.360 --> 00:02:46.000
they have to be completely multimodal. So they

00:02:46.000 --> 00:02:48.979
need to process everything at once. Text, images,

00:02:49.159 --> 00:02:52.379
video, depth maps, even data from real world

00:02:52.379 --> 00:02:54.900
sensors. It's a full sensory understanding. Right.

00:02:55.060 --> 00:02:57.699
And finally, they have to be interactive. They

00:02:57.699 --> 00:03:00.659
need to predict with really high accuracy what's

00:03:00.659 --> 00:03:02.620
going to happen when a user does something inside

00:03:02.620 --> 00:03:05.400
that simulation. That's how you get real cause

00:03:05.400 --> 00:03:07.479
and effect learning. And this isn't just a theory,
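
NOTE
The three capabilities just described can be sketched as a toy interface. This is a minimal illustration only; the class and method names are hypothetical, not World Labs' actual design, and the "physics" is reduced to constant gravity.
```python
# Toy world-model interface: generative, multimodal, interactive.
# All names here are illustrative, not any real world-model API.
from dataclasses import dataclass, field
@dataclass
class WorldModel:
    gravity: float = -9.8  # stand-in for "obeys real-world physics"
    objects: dict = field(default_factory=dict)
    def generate(self, prompt: str) -> dict:
        # Generative: create a scene object from a text prompt.
        self.objects[prompt] = {"pos": (0.0, 0.0, 1.0), "vel": (0.0, 0.0, 0.0)}
        return self.objects[prompt]
    def observe(self, text=None, image=None, depth_map=None) -> list:
        # Multimodal: fuse whichever input channels are present.
        return [m for m in (text, image, depth_map) if m is not None]
    def step(self, name: str, dt: float = 0.1) -> tuple:
        # Interactive: predict the next state after a time step.
        obj = self.objects[name]
        x, y, z = obj["pos"]
        vx, vy, vz = obj["vel"]
        vz += self.gravity * dt  # dropped things fall, always
        obj["pos"] = (x + vx * dt, y + vy * dt, max(0.0, z + vz * dt))
        obj["vel"] = (vx, vy, vz)
        return obj["pos"]
```
Generating a scene object and stepping it forward sends it toward the floor, which is exactly the kind of cause-and-effect learning a text-only model never gets.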

00:03:07.580 --> 00:03:10.379
right? Her team at World Labs is already shipping

00:03:10.379 --> 00:03:12.979
a tool called Marble, which takes a simple text

00:03:12.979 --> 00:03:15.419
prompt and turns it into a 3D scene you can actually

00:03:15.419 --> 00:03:17.580
walk around in. Yeah, and what's so fascinating

00:03:17.580 --> 00:03:19.699
is how they're doing it. They're probably using

00:03:19.699 --> 00:03:22.699
things like NeRFs, neural radiance

00:03:22.699 --> 00:03:24.400
fields, which are these models that can build

00:03:24.400 --> 00:03:27.479
a 3D scene from 2D images. It's moving from AI

00:03:27.479 --> 00:03:30.449
describing the world to AI building it. I still

00:03:30.449 --> 00:03:32.530
kind of wrestle with trying to figure out how

00:03:32.530 --> 00:03:35.449
to bridge what I type into an LLM and what the

00:03:35.449 --> 00:03:37.569
real world does. It just seems like such a massive

00:03:37.569 --> 00:03:39.610
leap. That's a totally fair struggle. It really

00:03:39.610 --> 00:03:42.129
gets to the heart of the challenge. But the need

00:03:42.129 --> 00:03:44.669
for agents that can operate in the real world,

00:03:44.810 --> 00:03:47.810
you know, robots, self-driving cars, it means

00:03:47.810 --> 00:03:50.370
the industry is being forced to push for these

00:03:50.370 --> 00:03:52.949
hybrid spatial models right now. Since language

00:03:52.949 --> 00:03:56.930
is so, so foundational to how we think, how quickly

00:03:56.930 --> 00:03:59.889
is this shift from pure language to these spatial

00:03:59.889 --> 00:04:03.189
models actually going to impact the AI tools

00:04:03.189 --> 00:04:05.409
that we're all using every day? The need for

00:04:05.409 --> 00:04:07.870
real-world interaction is already pushing the

00:04:07.870 --> 00:04:10.819
immediate deployment of these new... hybrid models.

00:04:11.020 --> 00:04:13.379
So moving from the theoretical, let's look at

00:04:13.379 --> 00:04:15.259
what's happening on the ground right now. Even

00:04:15.259 --> 00:04:17.180
as we're looking towards spatial AI, we're still

00:04:17.180 --> 00:04:19.240
grappling with some really basic challenges and

00:04:19.240 --> 00:04:22.240
seeing some fascinating new applications. Yeah,

00:04:22.279 --> 00:04:24.079
on the challenge side, there's still that fundamental

00:04:24.079 --> 00:04:27.160
problem of why AI struggles so much to tell the

00:04:27.160 --> 00:04:28.959
difference between a fact and a subjective belief.

00:04:29.199 --> 00:04:31.740
One report called it the missing piece, which

00:04:31.740 --> 00:04:34.360
is really about causal modeling. The AI often

00:04:34.360 --> 00:04:36.899
gets the what, but it totally misses the why

00:04:36.899 --> 00:04:39.680
behind the data. And at the same time, agents

00:04:39.680 --> 00:04:42.680
are getting so much smarter. I love this example

00:04:42.680 --> 00:04:45.639
of the Minecraft AI agent named Steve. Oh, it's

00:04:45.639 --> 00:04:47.839
brilliant. You can give Steve just one high-level

00:04:47.839 --> 00:04:50.680
command. Something like, mine some iron or build

00:04:50.680 --> 00:04:52.920
me a castle. And Steve doesn't just do it. It

00:04:52.920 --> 00:04:55.079
actually spawns a bunch of other little agents

00:04:55.079 --> 00:04:58.040
that coordinate and work together like a real

00:04:58.040 --> 00:05:00.720
team to get it done. OK, let's talk about friction,

00:05:00.879 --> 00:05:03.420
because that is definitely heating up, especially

00:05:03.420 --> 00:05:06.019
around data. We saw a huge example of this with

00:05:06.019 --> 00:05:09.000
Wikipedia recently. Right. Wikipedia, which is,

00:05:09.019 --> 00:05:12.139
let's be honest, the bedrock for so much AI training

00:05:12.139 --> 00:05:15.220
data, is basically telling AI companies to please

00:05:15.220 --> 00:05:18.519
stop scraping its entire site. They want everyone

00:05:18.519 --> 00:05:21.600
to use their paid API instead. Why is that such

00:05:21.600 --> 00:05:23.779
a big deal? Well, it's about fairness. Yeah.

00:05:23.800 --> 00:05:26.439
You know, giving credit for all that human labor

00:05:26.439 --> 00:05:28.639
and making sure the data quality stays high.

00:05:28.819 --> 00:05:31.579
They even made a pointed reference to Grokopedia,

00:05:31.860 --> 00:05:34.759
which was a pretty clear jab at Grok and xAI

00:05:34.759 --> 00:05:37.899
for allegedly relying so heavily on scraped content.

00:05:38.199 --> 00:05:40.240
So it sounds like quality training data, the

00:05:40.240 --> 00:05:42.199
fuel for all of this, is going to be heavily

00:05:42.199 --> 00:05:44.740
monetized from now on. And that ties directly

00:05:44.740 --> 00:05:48.040
into the money, into financing. Crusoe, which

00:05:48.040 --> 00:05:50.899
is an AI energy and infrastructure company, just

00:05:50.899 --> 00:05:54.790
secured a huge investment: $1.38 billion.

00:05:54.790 --> 00:05:57.250
It puts their valuation at $10 billion.

00:05:57.250 --> 00:05:59.870
That is a massive vote of confidence. What makes

00:05:59.870 --> 00:06:02.689
them so special? Crusoe's all about tackling the

00:06:02.689 --> 00:06:05.449
energy problem. They capture wasted energy, like

00:06:05.449 --> 00:06:07.529
natural gas that would just be flared off at

00:06:07.529 --> 00:06:09.689
a site, and they use it to power computing centers

00:06:09.689 --> 00:06:12.170
right there. The fact that NVIDIA is a major

00:06:12.170 --> 00:06:14.649
investor just shows how critical that link between

00:06:14.649 --> 00:06:17.069
energy and compute has become. We're also seeing

00:06:17.069 --> 00:06:20.370
friction in creative ethics. The showrunner for

00:06:20.370 --> 00:06:23.209
Amazon's House of David called using over 350

00:06:23.209 --> 00:06:27.689
AI-generated shots "magical filmmaking." But a

00:06:27.689 --> 00:06:29.509
lot of critics just immediately called it cheap,

00:06:29.629 --> 00:06:32.490
a way to replace human artists. And that tension

00:06:32.490 --> 00:06:34.420
is not going away. It's a real philosophical

00:06:34.420 --> 00:06:37.480
split between efficiency and human artistry.

00:06:37.519 --> 00:06:39.519
Let's end this section on a positive note though.

00:06:39.939 --> 00:06:42.420
Privacy. Google launched something called Private

00:06:42.420 --> 00:06:45.160
AI Compute. It lets you use the full cloud power

00:06:45.160 --> 00:06:47.779
of Gemini, but it ensures that no one, not even

00:06:47.779 --> 00:06:50.120
Google, can see the data you're processing. Yeah,

00:06:50.160 --> 00:06:52.920
they do it using secure enclaves, which are like

00:06:52.920 --> 00:06:55.360
these hardware-level black boxes. It allows

00:06:55.360 --> 00:06:57.980
for really sensitive data to be computed privately,

00:06:58.240 --> 00:07:00.319
which is a huge deal for a lot of applications.

00:07:00.779 --> 00:07:03.759
So what does Wikipedia demanding payment mean

00:07:03.759 --> 00:07:06.259
for the future long-term availability of truly

00:07:06.259 --> 00:07:08.839
open training data? It suggests quality training

00:07:08.839 --> 00:07:11.779
data, the fuel of AI, will be heavily monetized

00:07:11.779 --> 00:07:14.300
moving forward. Okay, let's shift now to the

00:07:14.300 --> 00:07:16.810
physical frontier, the sheer infrastructure you

00:07:16.810 --> 00:07:20.029
need for all this. An Epoch AI report said that

00:07:20.029 --> 00:07:22.569
spending on these specialized AI data centers

00:07:22.569 --> 00:07:26.529
is on track to pass $300 billion by the end of

00:07:26.529 --> 00:07:29.170
2025. You have to put that number in perspective.

00:07:29.470 --> 00:07:32.589
$300 billion is almost 1% of the entire U.S.

00:07:32.589 --> 00:07:35.470
GDP. It's more than the Apollo program and the

00:07:35.470 --> 00:07:39.029
Manhattan Project combined. This isn't just an

00:07:39.029 --> 00:07:41.589
investment. It's a state -level commitment to

00:07:41.589 --> 00:07:43.879
a single piece of technology. And the headline

00:07:43.879 --> 00:07:45.819
example, the one everyone's talking about, is

00:07:45.819 --> 00:07:49.000
OpenAI's proposed Stargate Abilene project. The

00:07:49.000 --> 00:07:51.500
numbers are just... they're hard to believe. We're

00:07:51.500 --> 00:07:55.000
talking a $32 billion price tag. It needs those

00:07:55.000 --> 00:07:58.660
450 soccer fields of land. And the critical part,

00:07:58.800 --> 00:08:02.139
it will draw one gigawatt of power. That is enough

00:08:02.139 --> 00:08:04.600
electricity for about a million homes, all for

00:08:04.600 --> 00:08:07.480
one site. And it will have 250 times the compute
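
NOTE
The million-homes figure can be sanity-checked with one division, assuming an average household draw of about 1 kW; that per-home number is a ballpark assumption, not a measured value.
```python
# One gigawatt divided by an assumed ~1 kW average load per home.
GIGAWATT_W = 1_000_000_000
AVG_HOME_DRAW_W = 1_000  # assumed average household load in watts
homes = GIGAWATT_W / AVG_HOME_DRAW_W  # about one million homes
```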

00:08:07.480 --> 00:08:10.620
capacity of GPT-4. But here's the thing that

00:08:10.620 --> 00:08:12.800
really changes the paradigm, the engineering

00:08:12.800 --> 00:08:15.459
implication of this. It's the paradox of latency.

00:08:15.699 --> 00:08:17.360
It used to be that you had to build data centers

00:08:17.360 --> 00:08:20.839
near users to be fast. Now, latency just doesn't

00:08:20.839 --> 00:08:22.899
matter as much. The reason is that the time it

00:08:22.899 --> 00:08:24.519
takes the model to actually think and generate

00:08:24.519 --> 00:08:27.459
an answer, the inference time, is about 100 times

00:08:27.459 --> 00:08:29.480
longer than it takes to send data all the way

00:08:29.480 --> 00:08:31.670
around the globe. Whoa. So imagine having so

00:08:31.670 --> 00:08:33.149
much computing power that you could literally

00:08:33.149 --> 00:08:35.190
bounce data off the moon and you would still

00:08:35.190 --> 00:08:37.929
be bottlenecked by the model itself, not by how

00:08:37.929 --> 00:08:40.129
fast you could send the signal. That changes

00:08:40.129 --> 00:08:42.549
everything. It really does. The era of needing

00:08:42.549 --> 00:08:44.950
to be geographically close for speed is just

00:08:44.950 --> 00:08:47.230
over. Yeah. The entire physical constraint has

00:08:47.230 --> 00:08:49.690
shifted. Now, data centers get built wherever

00:08:49.690 --> 00:08:52.610
power is cheapest and most available, not where

00:08:52.610 --> 00:08:55.139
the users are. Which brings us right back to

00:08:55.139 --> 00:08:58.080
that power challenge. Only a few countries can

00:08:58.080 --> 00:09:00.519
realistically handle multiple sites that each

00:09:00.519 --> 00:09:03.639
need over a gigawatt of power constantly. The

00:09:03.639 --> 00:09:05.860
solutions seem to be starting with natural gas,

00:09:06.100 --> 00:09:09.100
then layering on solar and wind through big grid

00:09:09.100 --> 00:09:11.379
interconnects. The race isn't for faster chips

00:09:11.379 --> 00:09:14.320
anymore. It's a race for massive, reliable energy

00:09:14.320 --> 00:09:16.860
sources. So if the competition has completely

00:09:16.860 --> 00:09:20.539
shifted from speed to just sheer scale and access

00:09:20.539 --> 00:09:23.190
to power, does that fundamentally change who's

00:09:23.190 --> 00:09:26.370
going to lead the next phase of AI? Absolutely.

00:09:26.570 --> 00:09:29.210
The future is being shaped by the very few entities

00:09:29.210 --> 00:09:32.029
that can build, finance, and secure these colossal

00:09:32.029 --> 00:09:34.330
power sources faster than anyone else on the

00:09:34.330 --> 00:09:36.649
planet. So if we synthesize everything from our

00:09:36.649 --> 00:09:38.509
deep dive today, we really covered three major

00:09:38.509 --> 00:09:41.450
themes. First, we're redefining the core objective

00:09:41.450 --> 00:09:44.029
of AI. We're moving away from just pure language

00:09:44.029 --> 00:09:46.370
and towards these complex spatial world models.

00:09:46.549 --> 00:09:49.250
Second, at the exact same time, we're battling

00:09:49.250 --> 00:09:51.789
this intense data and ethical friction in the

00:09:51.789 --> 00:09:53.490
applications we have now. You know, everything

00:09:53.490 --> 00:09:56.490
from creative tension in Hollywood to who owns

00:09:56.490 --> 00:09:59.149
and gets to use Wikipedia's data. And finally,

00:09:59.250 --> 00:10:02.269
all of this innovation, both the big ideas and

00:10:02.269 --> 00:10:04.470
the stuff happening today, is being built on

00:10:04.470 --> 00:10:07.769
an infrastructure of just completely unprecedented

00:10:07.769 --> 00:10:11.710
scale. It's far bigger than any historical megaproject

00:10:11.710 --> 00:10:14.610
in its cost and its power demand. So you now

00:10:14.610 --> 00:10:16.490
have a pretty clear structure for understanding

00:10:16.490 --> 00:10:19.029
how AI is going to evolve over the next decade.

00:10:19.190 --> 00:10:21.409
You can see the conceptual limits, the immediate

00:10:21.409 --> 00:10:24.289
frictions, and the staggering physical cost of

00:10:24.289 --> 00:10:26.090
making it all happen. And here's something to

00:10:26.090 --> 00:10:28.210
think about. If one gigawatt of power becomes

00:10:28.210 --> 00:10:31.309
the minimum price to play in advanced AI, what

00:10:31.309 --> 00:10:33.409
happens to the innovative coder in a garage?

00:10:33.970 --> 00:10:36.570
What happens to that decentralized future we

00:10:36.570 --> 00:10:39.759
all used to imagine? Something to mull over. Thank

00:10:39.759 --> 00:10:41.659
you for joining us for this deep dive into the

00:10:41.659 --> 00:10:43.960
dual frontiers of artificial intelligence. We

00:10:43.960 --> 00:10:45.779
really encourage you to keep exploring these

00:10:45.779 --> 00:10:46.100
topics.
