WEBVTT

00:00:00.000 --> 00:00:04.339
How can the most powerful AI today, you know,

00:00:04.339 --> 00:00:07.339
GPT-4, these huge models, how can they also

00:00:07.339 --> 00:00:09.599
be called a philosophical dead end? Yeah, it's

00:00:09.599 --> 00:00:11.580
a real head scratcher, isn't it? You've got Richard

00:00:11.580 --> 00:00:14.140
Sutton, who basically invented modern reinforcement

00:00:14.140 --> 00:00:16.920
learning, saying the tech that's winning right now.

00:00:17.039 --> 00:00:20.339
Yeah. It's not the path to true AGI. Right. And

00:00:20.339 --> 00:00:22.539
that's what this deep dive is all about. We're

00:00:22.539 --> 00:00:24.780
going to look past the benchmarks, look at the

00:00:24.780 --> 00:00:26.600
actual architecture underneath. We've pulled

00:00:26.600 --> 00:00:29.300
sources on this critique, but also on the...

00:00:29.390 --> 00:00:32.630
well, the crazy cost of scaling this stuff. And

00:00:32.630 --> 00:00:34.909
our goal really is to figure out where the smart

00:00:34.909 --> 00:00:38.789
money, the smart thinking is going for AGI. Is

00:00:38.789 --> 00:00:41.350
it just making current models bigger or is it

00:00:41.350 --> 00:00:42.890
something fundamentally different, like a new

00:00:42.890 --> 00:00:45.490
way to learn? Okay, so we've got three main areas

00:00:45.490 --> 00:00:47.929
for you today. First, why Sutton thinks LLMs

00:00:47.929 --> 00:00:50.829
have this fatal flaw. Second, the absolutely

00:00:50.829 --> 00:00:53.310
staggering scale. And yes, the cost of this AI

00:00:53.310 --> 00:00:55.850
arms race. And third, these new video models

00:00:55.850 --> 00:00:59.380
that seem to actually kind of... think over time.

00:00:59.560 --> 00:01:01.259
Sounds good. Let's dive in. Let's start with

00:01:01.259 --> 00:01:03.679
Sutton, the Turing Award winner. He's not saying

00:01:03.679 --> 00:01:06.400
LLMs are useless, right? No, not at all. They're

00:01:06.400 --> 00:01:09.579
amazing prediction machines, obviously. But Sutton

00:01:09.579 --> 00:01:11.780
comes from reinforcement learning, which is all

00:01:11.780 --> 00:01:14.079
about agents acting in the world and learning

00:01:14.079 --> 00:01:17.379
from feedback. He says LLMs lack that core loop.

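NOTE
Editor's aside: a minimal sketch of the action-feedback loop Sutton is
describing, as runnable Python. The toy environment, the action names,
and the tabular update rule are illustrative assumptions, not any real
library's API.
import random
class ToyEnvironment:
    """A toy world where one of two actions pays off."""
    def step(self, action):
        return 1.0 if action == "safe" else -1.0  # the consequence
env = ToyEnvironment()
values = {"safe": 0.0, "risky": 0.0}  # the agent's learned value estimates
alpha = 0.1  # learning rate
for t in range(1000):
    # act (exploring 10% of the time), observe, learn from the consequence
    if random.random() < 0.1:
        action = random.choice(list(values))
    else:
        action = max(values, key=values.get)
    reward = env.step(action)
    values[action] += alpha * (reward - values[action])
print(values)  # learned by acting, not by imitating a text corpus
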
00:01:18.090 --> 00:01:20.989
So they're, well, passive. Passive meaning they don't have

00:01:20.989 --> 00:01:23.230
goals. They can't be surprised. Is that the idea?

00:01:23.450 --> 00:01:26.069
Exactly. And crucially, they don't learn from

00:01:26.069 --> 00:01:28.469
consequences. They're just incredibly good mimics

00:01:28.469 --> 00:01:31.530
of human text. They don't truly understand the

00:01:31.530 --> 00:01:33.769
real world impact of the words they string together.

00:01:34.189 --> 00:01:37.129
OK, I get the difference. But isn't being passive,

00:01:37.349 --> 00:01:40.730
goalless, actually safer? I mean, the whole market

00:01:40.730 --> 00:01:42.629
seems built on making these things predictable,

00:01:42.890 --> 00:01:45.170
controllable. Why build something goal-driven

00:01:45.170 --> 00:01:47.530
if it's riskier? That's the big tradeoff. Right.

00:01:47.629 --> 00:01:50.109
Their safety comes from that passivity. But Sutton

00:01:50.109 --> 00:01:52.189
argues that very passivity limits their intelligence

00:01:52.189 --> 00:01:54.950
potential. Just scaling up imitation even to

00:01:54.950 --> 00:01:58.390
GPT-7 or 8 won't get us to AGI. So he has an

00:01:58.390 --> 00:02:01.030
alternative. Yeah. He proposes this new architecture

00:02:01.030 --> 00:02:04.769
called OaK. The key thing is it learns on the

00:02:04.769 --> 00:02:07.469
fly. It doesn't need that massive, hugely expensive

00:02:07.469 --> 00:02:10.870
pre-training phase that LLMs rely on. Okay,

00:02:10.930 --> 00:02:13.870
can we give folks an analogy? So if an LLM is

00:02:13.870 --> 00:02:17.150
like a giant static textbook that predicts the

00:02:17.150 --> 00:02:21.210
next word, what's OaK? Is it like an agent that

00:02:21.210 --> 00:02:23.870
can actually, you know, burn its hand on a stove

00:02:23.870 --> 00:02:26.050
and learn don't touch? That's a pretty good way

00:02:26.050 --> 00:02:27.770
to put it, yeah. It's about learning through

00:02:27.770 --> 00:02:29.870
doing, through direct consequence. It's not just

00:02:29.870 --> 00:02:33.270
copying patterns from a giant pile of text. AGI,

00:02:33.449 --> 00:02:36.449
in this view, needs these smarter loops: action,

00:02:36.629 --> 00:02:40.860
feedback, memory, even motivation. So if LLMs

00:02:40.860 --> 00:02:43.639
are flawed because they just imitate, what exactly

00:02:43.639 --> 00:02:46.319
is the mechanism Sutton proposes for real consequence

00:02:46.319 --> 00:02:48.740
driven learning? It's about learning via action

00:02:48.740 --> 00:02:51.819
and direct consequence, not massive pre-trained

00:02:51.819 --> 00:02:54.759
imitation. Gotcha. OK, so if real intelligence

00:02:54.759 --> 00:02:57.199
needs this whole new architecture, then this

00:02:57.199 --> 00:03:00.479
race to just build bigger and bigger LLMs. It's

00:03:00.479 --> 00:03:02.340
a massive bet, isn't it? Strategically speaking.

00:03:02.539 --> 00:03:04.719
It really is. And yet the scaling is happening

00:03:04.719 --> 00:03:06.879
at a rate that's hard to comprehend. Yeah. Let's

00:03:06.879 --> 00:03:08.680
talk about those numbers. The sources we saw

00:03:08.680 --> 00:03:11.159
on OpenAI's compute plans, just wild. Totally

00:03:11.159 --> 00:03:13.819
wild. They apparently 9x'd their compute power

00:03:13.819 --> 00:03:18.240
in 2025 alone. Okay. Huge jump. But then by 2033,

00:03:18.419 --> 00:03:21.199
the projection is 125 times bigger than that.

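NOTE
Editor's aside: the arithmetic behind those two figures, with today's
compute normalized to 1. Reading "125 times bigger than that" as 125x
the post-2025 level is an interpretation, not a sourced number.
compute_2024 = 1.0                 # normalize today's compute to 1
compute_2025 = compute_2024 * 9    # "9x'd their compute in 2025 alone"
compute_2033 = compute_2025 * 125  # "125 times bigger than that"
print(compute_2033)                # 1125.0, about 1,125x today's level
implied_growth = 125 ** (1 / 8)    # 2025 to 2033 is 8 years
print(round(implied_growth, 2))    # ~1.83, i.e. roughly 83% per year
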
00:03:21.300 --> 00:03:24.490
Whoa. Okay, wait. That number,

00:03:24.490 --> 00:03:28.310
125x, it kind of breaks my brain. To put that in

00:03:28.310 --> 00:03:30.550
perspective for you listening, that could mean

00:03:30.550 --> 00:03:33.569
needing more electricity than like the entire

00:03:33.569 --> 00:03:36.830
country of India uses today, for 1.4 billion

00:03:36.830 --> 00:03:39.879
people. It just sounds physically almost impossible.

00:03:40.080 --> 00:03:42.080
It does. It's like stacking Lego blocks of data

00:03:42.080 --> 00:03:44.259
centers reaching into the clouds. Right. And

00:03:44.259 --> 00:03:46.620
it explains why we're seeing these massive investments

00:03:46.620 --> 00:03:49.460
like Nscale raising that record $1.1 billion

00:03:49.460 --> 00:03:52.340
Series B. Yeah, biggest in Europe. Yeah. And

00:03:52.340 --> 00:03:54.539
that money is specifically earmarked for building

00:03:54.539 --> 00:03:57.699
these AI factories. We're talking facilities

00:03:57.699 --> 00:04:01.699
with like 100,000 NVIDIA GPUs each. Backed by

00:04:01.699 --> 00:04:04.340
huge names, Nokia, Dell, NVIDIA itself. Yeah.

00:04:04.419 --> 00:04:05.740
They're building the infrastructure for that

00:04:05.740 --> 00:04:08.080
125X future. But then there's the flip side,

00:04:08.180 --> 00:04:10.259
the operational cost. We saw that post from the

00:04:10.259 --> 00:04:13.879
developer, right? Built over 30 AI agents. And

00:04:13.879 --> 00:04:16.000
tracked the cost. And called it the brutal cost

00:04:16.000 --> 00:04:18.779
truth. Basically, running complex agents in the

00:04:18.779 --> 00:04:21.019
real world gets really expensive, really fast.

00:04:21.160 --> 00:04:23.579
It's not just the upfront training cost. Absolutely.

00:04:23.600 --> 00:04:26.660
That massive scale translates directly to higher

00:04:26.660 --> 00:04:28.980
running costs for everyone using these models.

00:04:29.120 --> 00:04:32.050
And meanwhile. You see OpenAI launching its biggest

00:04:32.050 --> 00:04:35.870
ad push ever for ChatGPT. Streaming, billboards,

00:04:36.149 --> 00:04:39.810
influencers. They're trying to lock in that market

00:04:39.810 --> 00:04:42.370
share now. Given this exponential compute growth,

00:04:42.529 --> 00:04:44.990
is the cost of running real-world agents actually

00:04:44.990 --> 00:04:47.730
sustainable for, say, the average developer or

00:04:47.730 --> 00:04:50.209
a small company? Initial data suggests complexity

00:04:50.209 --> 00:04:52.550
dramatically increases operational expenditures.

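NOTE
Editor's aside: a sketch of why multi-step agents get expensive fast.
The per-token prices and token counts below are hypothetical
placeholders, not any provider's actual price list.
PRICE_IN = 5.00 / 1_000_000    # assumed $ per input token
PRICE_OUT = 15.00 / 1_000_000  # assumed $ per output token
def agent_run_cost(steps, ctx_tokens, out_tokens):
    """Each step re-sends the growing context, so cost grows superlinearly."""
    total = 0.0
    for _ in range(steps):
        total += ctx_tokens * PRICE_IN + out_tokens * PRICE_OUT
        ctx_tokens += out_tokens  # prior output folds into the next context
    return total
print(f"3-step task:  ${agent_run_cost(3, 4_000, 800):.2f}")
print(f"30-step task: ${agent_run_cost(30, 4_000, 800):.2f}")
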
00:04:52.970 --> 00:04:56.430
Yeah, that's a tough reality check. Okay, let's

00:04:56.430 --> 00:04:58.910
pivot a bit, away from the cost side, towards

00:04:58.910 --> 00:05:01.329
a big technical leap. The official paper just

00:05:01.329 --> 00:05:03.709
dropped on Veo 3. That's Google DeepMind's new

00:05:03.709 --> 00:05:06.089
video model. And people are calling this Google's

00:05:06.089 --> 00:05:09.209
GPT-3 moment for vision. Which is a big claim.

00:05:09.329 --> 00:05:11.509
Why? What's the big deal? Well, it connects back

00:05:11.509 --> 00:05:13.370
to Sutton's point, actually. It looks like we're

00:05:13.370 --> 00:05:15.550
seeing a shift from just imitation to something

00:05:15.550 --> 00:05:17.629
more like reasoning, but in the visual domain.

00:05:18.189 --> 00:05:20.290
Veo 3 seems to be able to reason across a video

00:05:20.290 --> 00:05:23.290
scene over time, not just generate pretty frames

00:05:23.290 --> 00:05:25.269
one after another. Okay. And the key concept

00:05:25.269 --> 00:05:27.889
here is chain of frames. Sounds like chain of

00:05:27.889 --> 00:05:30.480
thought for LLMs. Exactly. It's the visual equivalent.

00:05:30.720 --> 00:05:33.459
Chain of thought helps LLMs break down text problems

00:05:33.459 --> 00:05:35.779
step by step. Chain of frames lets the video

00:05:35.779 --> 00:05:38.339
model think across time. It can anticipate what

00:05:38.339 --> 00:05:40.500
happens next, understand physics in a basic way.

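NOTE
Editor's aside: the chain-of-frames idea as a minimal Python sketch.
predict_next_frame is a stand-in for a learned video model; this is
not Veo 3's actual API, just the shape of the idea.
def predict_next_frame(frames, prompt):
    # placeholder "model": a real generator would render pixels here
    return f"frame_{len(frames)}({prompt})"
def chain_of_frames(prompt, n_frames):
    frames = []
    for _ in range(n_frames):
        # each frame is conditioned on all prior frames, so the rollout
        # itself encodes how the scene evolves over time: the visual
        # analogue of step-by-step chain of thought
        frames.append(predict_next_frame(frames, prompt))
    return frames
print(chain_of_frames("jenga tower starts to tip", 4))
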
00:05:40.600 --> 00:05:43.680
It's not just processing static images. That's

00:05:43.680 --> 00:05:49.019
fascinating. But is it really reasoning? Or is

00:05:49.019 --> 00:05:51.680
it just a super sophisticated mimic? If it's

00:05:51.680 --> 00:05:54.160
seen millions of videos of Jenga blocks falling,

00:05:54.279 --> 00:05:57.089
is it... understanding physics or just generating

00:05:57.089 --> 00:05:59.149
the most likely visual sequence based on that

00:05:59.149 --> 00:06:01.829
data, how do we know it's not just a fancy deep

00:06:01.829 --> 00:06:04.129
fake? That's the million-dollar question, always.

00:06:04.550 --> 00:06:07.129
But the evidence suggests it's more than mimicry

00:06:07.129 --> 00:06:09.209
because of its zero-shot performance on diverse

00:06:09.209 --> 00:06:11.569
tasks it wasn't explicitly trained for. Like

00:06:11.569 --> 00:06:13.709
what? Well, it can solve complex mazes visually.

00:06:14.170 --> 00:06:16.110
It can simulate physics pretty accurately, predicting

00:06:16.110 --> 00:06:18.290
how those Jenga blocks will fall. It can even

00:06:18.290 --> 00:06:21.009
restore blurry images or animate a scene from

00:06:21.009 --> 00:06:23.370
just a rough hand-drawn sketch. Things that require

00:06:23.370 --> 00:06:25.750
some kind of internal model of the world. Right.

00:06:25.850 --> 00:06:27.649
They use that four-level framework to measure

00:06:27.649 --> 00:06:31.790
it. Perception, modeling, manipulation, and then

00:06:31.790 --> 00:06:34.790
reasoning at the top. Simulating physics definitely

00:06:34.790 --> 00:06:37.230
feels like it's up in the modeling or even reasoning

00:06:37.230 --> 00:06:40.120
category. Exactly. And the implication is huge.

00:06:40.279 --> 00:06:43.180
One good prompt into a model like this might

00:06:43.180 --> 00:06:46.500
eventually replace dozens of specialized computer

00:06:46.500 --> 00:06:49.319
vision tools engineers use today. It could simplify

00:06:49.319 --> 00:06:52.019
workflows dramatically. So how drastically does

00:06:52.019 --> 00:06:54.339
this new chain of frames approach change the

00:06:54.339 --> 00:06:56.879
job of computer vision engineers? One prompt

00:06:56.879 --> 00:06:59.660
can replace dozens of specialized vision tools,

00:06:59.800 --> 00:07:02.339
simplifying workflows dramatically. OK, let's

00:07:02.339 --> 00:07:05.319
shift to AI out in the wild because this power.

00:07:06.189 --> 00:07:08.990
It has immediate, messy consequences. We saw

00:07:08.990 --> 00:07:11.589
the political example,

00:07:11.670 --> 00:07:13.910
the instance with Trump posting an AI-generated

00:07:13.910 --> 00:07:16.250
clip. Yeah, the one claiming med beds could cure

00:07:16.250 --> 00:07:19.069
anything, reportedly from Fox News, but AI-generated.

00:07:19.129 --> 00:07:21.709
He later deleted it. But it shows how fast this

00:07:21.709 --> 00:07:24.310
stuff can spread and how convincing it can look,

00:07:24.389 --> 00:07:26.670
even if it's totally fabricated. It's a huge

00:07:26.670 --> 00:07:28.769
challenge for, you know, figuring out what's

00:07:28.769 --> 00:07:31.990
real online. The speed is incredible. Yeah. I

00:07:31.990 --> 00:07:33.730
mean, I still wrestle with prompt drift myself

00:07:33.730 --> 00:07:36.610
sometimes, just trying to get an AI to do what

00:07:36.610 --> 00:07:41.009
I want consistently. So

00:07:41.009 --> 00:07:44.110
seeing these convincing deep fakes pop up, it

00:07:44.110 --> 00:07:45.910
is worrying. It really highlights the complexity

00:07:45.910 --> 00:07:48.290
we're dealing with. And the battles aren't

00:07:48.290 --> 00:07:50.009
just political. They're corporate, too. Oh, yeah.

00:07:50.170 --> 00:07:54.910
Elon Musk's xAI is now suing OpenAI. The claim?

00:07:55.230 --> 00:07:58.180
Yeah. Stealing trade secrets. It's getting litigious.

00:07:58.360 --> 00:08:00.920
And Meta is reportedly talking to Google about

00:08:00.920 --> 00:08:03.379
potentially using their Gemini model. Seems like

00:08:03.379 --> 00:08:05.379
it. The big players are definitely maneuvering,

00:08:05.379 --> 00:08:07.660
making alliances, getting ready for the next

00:08:07.660 --> 00:08:09.879
phase. At the same time, platforms are just drowning

00:08:09.879 --> 00:08:13.759
in AI content. Spotify deleting 75 million AI

00:08:13.759 --> 00:08:16.339
generated tracks. 75 million. Just using their

00:08:16.339 --> 00:08:19.379
spam filters. It shows the sheer scale of automated

00:08:19.379 --> 00:08:21.079
content generation they're fighting. It's like

00:08:21.079 --> 00:08:23.579
a tidal wave. Wow. But it's not all problematic.

00:08:23.819 --> 00:08:26.000
There are useful tools emerging too, right? Like

00:08:26.000 --> 00:08:28.360
that Kimi assistant from Moonshot AI. Right,

00:08:28.439 --> 00:08:31.089
backed by Alibaba. It apparently has an agent

00:08:31.089 --> 00:08:33.169
mode now. You give it a simple prompt and it

00:08:33.169 --> 00:08:35.070
can create complex things like a multi-page

00:08:35.070 --> 00:08:37.750
website draft or editable presentation slides.

00:08:38.129 --> 00:08:40.809
Stuff that takes real work. With the volume of

00:08:40.809 --> 00:08:43.490
AI-generated content exploding, both useful

00:08:43.490 --> 00:08:46.470
and not, can platforms realistically keep pace

00:08:46.470 --> 00:08:48.350
with the necessary filtering and moderation?

00:08:48.850 --> 00:08:52.269
The 75 million deleted tracks suggest filtering

00:08:52.269 --> 00:08:56.259
is already a massive ongoing battle. Okay. We've

00:08:56.259 --> 00:08:58.679
definitely covered a lot today. Philosophy, physics

00:08:58.679 --> 00:09:02.399
simulation, billion-dollar funding rounds, fake

00:09:02.399 --> 00:09:04.240
news. Let's just take a quick pause. When we

00:09:04.240 --> 00:09:06.740
come back, let's boil it down. What's the single

00:09:06.740 --> 00:09:09.320
biggest idea, the main takeaway about this architectural

00:09:09.320 --> 00:09:11.220
shift for you, the listener? [Midroll sponsor

00:09:11.220 --> 00:09:13.549
read.] All right. So if we boil down everything

00:09:13.549 --> 00:09:16.450
we discussed, the core tension really is between

00:09:16.450 --> 00:09:18.629
two fundamentally different approaches to AI.

00:09:18.889 --> 00:09:21.129
You've got the passive imitation that powers

00:09:21.129 --> 00:09:23.690
today's big LLMs. And then you have this idea

00:09:23.690 --> 00:09:26.190
of active consequence-driven learning. That's

00:09:26.190 --> 00:09:28.429
the goal behind stuff like OaK. And it seems

00:09:28.429 --> 00:09:30.929
to be what's making models like Veo 3 capable of

00:09:30.929 --> 00:09:34.649
visual reasoning. Passive imitation versus active

00:09:34.649 --> 00:09:37.889
learning. LLMs are incredible statistical parrots,

00:09:37.889 --> 00:09:40.340
basically. But if Sutton's right, getting

00:09:40.340 --> 00:09:42.720
to true AGI means moving beyond just predicting

00:09:42.720 --> 00:09:45.740
the next word or pixel. It means building systems

00:09:45.740 --> 00:09:47.879
that can actually model consequences, understand

00:09:47.879 --> 00:09:50.419
cause and effect, maybe even have intrinsic motivation.

00:09:50.860 --> 00:09:53.480
So the key thing for you to take away today is

00:09:53.480 --> 00:09:56.940
probably this. We're not just seeing AI get bigger.

00:09:57.279 --> 00:09:59.519
We might be seeing the very architecture of AI

00:09:59.519 --> 00:10:01.659
begin to shift, that chain of frames concept

00:10:01.659 --> 00:10:04.139
in Veo 3. It's a sign we're moving beyond just

00:10:04.139 --> 00:10:06.419
mimicking language towards models that can actually

00:10:06.419 --> 00:10:08.559
reason visually, simulate outcomes, understand

00:10:08.559 --> 00:10:11.700
things across time. Right. That shift from mimicry

00:10:11.700 --> 00:10:14.179
to modeling consequences, that feels like the

00:10:14.179 --> 00:10:16.639
really big story here. So maybe a final thought

00:10:16.639 --> 00:10:18.899
for you to chew on. If Veo 3 can simulate simple

00:10:18.899 --> 00:10:21.820
physics today, like Jenga blocks falling. What

00:10:21.820 --> 00:10:23.580
does it mean when we start handing over critical

00:10:23.580 --> 00:10:26.320
real -world decisions to models that can simulate

00:10:26.320 --> 00:10:28.220
the future consequences of their own potential

00:10:28.220 --> 00:10:31.019
actions? That feels significant. Definitely

00:10:31.019 --> 00:10:32.940
something to think about. That distinction between

00:10:32.940 --> 00:10:36.460
passive safety, which we have now, and truly

00:10:36.460 --> 00:10:38.919
goal-driven intelligence. Yeah. That's the next

00:10:38.919 --> 00:10:40.799
frontier, potentially the next big challenge.

00:10:41.080 --> 00:10:43.200
Well, thanks for joining us on this deep dive

00:10:43.200 --> 00:10:45.820
into AI, architecture, scale, and everything

00:10:45.820 --> 00:10:47.980
in between. We hope it gave you some things to

00:10:47.980 --> 00:10:50.240
think about. We'll catch you next time.

00:10:50.240 --> 00:10:50.580
[Outro music.]
