WEBVTT

00:00:00.000 --> 00:00:02.980
We are witnessing the death of a prophecy.

00:00:03.120 --> 00:00:06.459
The idea of one AI tool to rule them all is dying.

00:00:06.960 --> 00:00:09.599
Meta wants to delete the computer operating system

00:00:09.599 --> 00:00:13.019
entirely. Welcome to the deep dive. Thanks. It

00:00:13.019 --> 00:00:15.699
is genuinely great to be here today. Today's

00:00:15.699 --> 00:00:18.399
stack of sources reveals a really massive shift.

00:00:18.559 --> 00:00:21.239
We are moving far away from single AI tools.

00:00:21.460 --> 00:00:23.940
We're heading into complex, customized tool chains.

00:00:24.059 --> 00:00:26.949
We're scaling to billions of autonomous agents,

00:00:27.190 --> 00:00:30.230
and we are completely rethinking how AI interacts

00:00:30.230 --> 00:00:32.210
with our computers. What's fascinating here is

00:00:32.210 --> 00:00:34.270
how fast this is moving. We are jumping from

00:00:34.270 --> 00:00:36.869
today's coding trenches straight

00:00:36.869 --> 00:00:38.530
into the future of neural operating systems.

00:00:38.670 --> 00:00:40.289
Yeah, let's start down in those coding trenches.

00:00:40.750 --> 00:00:42.969
Developers are feeling this massive shift right

00:00:42.969 --> 00:00:46.070
now. The dream of a monolithic coding AI is practically

00:00:46.070 --> 00:00:48.369
dead. It's totally fragmenting. I mean, the AI

00:00:48.369 --> 00:00:51.189
coding market is now a three-layer stack. Devs

00:00:51.189 --> 00:00:52.990
are actively composing their own tool chains.

00:00:53.189 --> 00:00:55.509
They're basically combining Cursor, Claude, and

00:00:55.509 --> 00:00:57.929
Codex together. Right. It's like stacking Lego

00:00:57.929 --> 00:01:00.250
blocks of data. You don't want a pre-built house

00:01:00.250 --> 00:01:03.780
anymore. You want specialized blocks to build

00:01:03.780 --> 00:01:07.560
your own custom toolchains. So you want Claude

00:01:07.560 --> 00:01:10.200
as your head chef. He handles the complex architectural

00:01:10.200 --> 00:01:14.040
recipes. And you want Codex just chopping vegetables

00:01:14.040 --> 00:01:16.859
in the background. Yeah, and it fundamentally

00:01:16.859 --> 00:01:19.840
changes the whole workflow. Just look at Cursor

00:01:19.840 --> 00:01:22.819
3, which they're calling Glass. It's actively

00:01:22.819 --> 00:01:25.060
distancing itself from its VS Code roots. It

00:01:25.060 --> 00:01:27.920
focuses on its agents window and agent tabs.

00:01:28.510 --> 00:01:30.489
They call that the manager surface, right? Exactly.

00:01:30.629 --> 00:01:32.170
They officially call it the manager surface.

00:01:32.430 --> 00:01:34.590
A manager surface implies active delegation.

00:01:35.010 --> 00:01:36.890
You know, you're no longer just typing code yourself.

00:01:37.069 --> 00:01:39.329
You're managing digital workers who write it

00:01:39.329 --> 00:01:41.090
for you. Precisely. You're orchestrating the

00:01:41.090 --> 00:01:43.790
build process. And then you have Claude Code entering

00:01:43.790 --> 00:01:47.349
the mix. It holds 46% of the most loved market

00:01:47.349 --> 00:01:50.390
share. It uses an MCP-based plug-in system

00:01:50.390 --> 00:01:52.730
to run things. Let's define that really quickly

00:01:52.730 --> 00:01:55.329
for you listening. An MCP-based plugin is a

00:01:55.329 --> 00:01:57.950
universal plug letting AI talk directly to local

00:01:57.950 --> 00:02:00.969
files. Spot on. That system acts as the essential

00:02:00.969 --> 00:02:04.329
glue. It's the primary execution layer for complex

00:02:04.329 --> 00:02:06.590
architectural refactoring. It basically does

00:02:06.590 --> 00:02:09.020
the heavy intellectual lifting. But looking at

00:02:09.020 --> 00:02:11.719
OpenAI's placement in this stack, it feels like

00:02:11.719 --> 00:02:13.759
a downgrade. Are they really just the review

00:02:13.759 --> 00:02:16.219
layer now? Yeah, they repositioned Codex to fit

00:02:16.219 --> 00:02:19.139
exactly that niche. They pragmatically embedded

00:02:19.139 --> 00:02:21.840
it directly into Claude Code. Using a plugin,

00:02:22.000 --> 00:02:23.939
right? They did, yeah. It's called the Codex

00:02:23.939 --> 00:02:27.000
Plugin CC. They are actively fighting what developers

00:02:27.000 --> 00:02:30.909
call agent fatigue. You use Claude for the big

00:02:30.909 --> 00:02:32.990
picture architectural stuff, but then you swap

00:02:32.990 --> 00:02:35.750
models for the smaller tasks. You hand low-complexity

00:02:35.750 --> 00:02:40.090
tasks to Codex or Kimi K2.5. Just to clarify,

00:02:40.310 --> 00:02:42.590
Kimi K2.5 for everyone, it's a lightweight model

00:02:42.590 --> 00:02:45.009
for simple background tasks. You do that to manage

00:02:45.009 --> 00:02:47.250
your token burn. Right, because token burn is

00:02:47.250 --> 00:02:49.930
a huge issue. Wasting expensive compute on simple

00:02:49.930 --> 00:02:52.210
tasks will bankrupt you. It's a highly pragmatic

00:02:52.210 --> 00:02:54.509
token economy. I still wrestle with prompt drift

00:02:54.509 --> 00:02:57.460
myself. Oh, we all do. It's exhausting. Yeah,

00:02:57.560 --> 00:02:59.919
the friction of managing these different tools

00:02:59.919 --> 00:03:02.539
is incredibly real. Keeping context perfectly

00:03:02.539 --> 00:03:05.680
aligned across platforms is tough. So let me

00:03:05.680 --> 00:03:09.360
ask you this. Is OpenAI's pragmatic plug-in

00:03:09.360 --> 00:03:12.639
strategy a sign of surrender? Or is it pure genius?

00:03:13.000 --> 00:03:15.599
I think it's pure genius. They willingly surrendered

00:03:15.599 --> 00:03:18.039
the terminal to win the workflow. They embedded

00:03:18.039 --> 00:03:20.000
themselves exactly where the work actually happens.

00:03:20.099 --> 00:03:21.939
So they abandoned the interface to capture the

00:03:21.939 --> 00:03:23.860
actual process. That's the perfect way to phrase

00:03:23.860 --> 00:03:26.340
it. But to run these multilayered tool chains,

00:03:26.500 --> 00:03:29.439
you need serious infrastructure. You need massive

00:03:29.439 --> 00:03:32.000
computing power to handle that load. That brings

00:03:32.000 --> 00:03:35.000
scale, but also very serious risks. Risks are

00:03:35.000 --> 00:03:37.199
getting incredibly massive. And the infrastructure

00:03:37.199 --> 00:03:39.900
is scaling up right now. Cloudflare just announced

00:03:39.900 --> 00:03:42.479
Agents Week. They're rolling out edge-based

00:03:42.479 --> 00:03:44.860
infrastructure designed to run billions of agents

00:03:44.860 --> 00:03:47.819
simultaneously. Billions of active digital workers.

00:03:48.020 --> 00:03:51.419
Imagine scaling to a billion queries. The physical

00:03:51.419 --> 00:03:53.900
compute power required is staggering to think

00:03:53.900 --> 00:03:56.439
about. And consumer adoption is totally matching

00:03:56.439 --> 00:03:58.879
that scale. Claude Cowork just moved to full release

00:03:58.879 --> 00:04:02.879
for paid plans. Pro, Max, Team, Enterprise tiers.

00:04:03.319 --> 00:04:06.620
The demand is off the charts. At the HumanX

00:04:06.620 --> 00:04:09.719
AI conference last week, one chatbot came up

00:04:09.719 --> 00:04:13.580
in every panel. Industry leaders dubbed Claude Code

00:04:13.580 --> 00:04:16.199
the absolute must-have tool. But this environment

00:04:16.199 --> 00:04:19.120
is getting genuinely risky. The sheer scale is

00:04:19.120 --> 00:04:21.759
unprecedented. Sam Altman recently broke his

00:04:21.759 --> 00:04:23.879
silence in a New Yorker profile. Yeah, right

00:04:23.879 --> 00:04:25.920
after that frightening attack on his home. Right.

00:04:26.120 --> 00:04:28.500
He admitted to some past leadership mistakes,

00:04:28.800 --> 00:04:32.240
but he also issued a really stark warning. He

00:04:32.240 --> 00:04:34.959
warned that the AGI race is pushing into highly

00:04:34.959 --> 00:04:37.639
risky behaviors. The chaos on the ground is real.

00:04:38.379 --> 00:04:40.899
TechCrunch is even updating their official glossary

00:04:40.899 --> 00:04:43.240
of terms just to help normal people navigate

00:04:43.240 --> 00:04:45.459
the madness. Which you desperately need in this

00:04:45.459 --> 00:04:48.100
space. Take the term hallucination, for example.

00:04:48.459 --> 00:04:51.240
Hallucination. When an AI confidently makes up

00:04:51.240 --> 00:04:53.759
fake information. Exactly. We need simple definitions

00:04:53.759 --> 00:04:56.300
for these complex problems. The vocabulary has

00:04:56.300 --> 00:04:58.399
to keep up. But let me push back here for a second.

00:04:58.480 --> 00:05:00.620
We are scaling up to billions of fully autonomous

00:05:00.620 --> 00:05:03.699
agents. Even Altman is explicitly warning about

00:05:03.699 --> 00:05:06.660
risky behavior. Are we basically building the

00:05:06.660 --> 00:05:08.899
massive engine without installing any brakes?

00:05:09.199 --> 00:05:12.360
It's a super valid concern right now. The infrastructure

00:05:12.360 --> 00:05:15.319
expansion is totally outpacing the safety guardrails.

00:05:16.110 --> 00:05:18.170
We're deploying agents before we understand their

00:05:18.170 --> 00:05:20.750
emergent behaviors. So how do we actually resolve

00:05:20.750 --> 00:05:24.370
this tension between massive digital scale and

00:05:24.370 --> 00:05:27.370
fundamental human safety? Well, we have to build

00:05:27.370 --> 00:05:29.810
robust guardrails at the infrastructure layer

00:05:29.810 --> 00:05:32.750
itself. We just can't rely on the models to police

00:05:32.750 --> 00:05:35.189
themselves anymore. We have to hardwire those

00:05:35.189 --> 00:05:37.170
safety brakes directly into the infrastructure.

00:05:37.430 --> 00:05:39.930
That is the only viable path forward. Okay, let's

00:05:39.930 --> 00:05:43.350
unpack this. We spent the entire last decade

00:05:43.350 --> 00:05:45.769
putting glowing screens on everything. We moved

00:05:45.769 --> 00:05:48.529
all our private files to the distant cloud. And

00:05:48.529 --> 00:05:51.490
now Apple wants to remove screens entirely. LM

00:05:51.490 --> 00:05:53.470
Studio wants to take everything off the cloud.

00:05:53.610 --> 00:05:55.870
It's a massive philosophical pivot. It really

00:05:55.870 --> 00:05:57.750
is. It's a direct reaction to the friction of

00:05:57.750 --> 00:06:00.550
modern computing. People are tired of the costs

00:06:00.550 --> 00:06:03.470
and risks of cloud AI. We're seeing a massive

00:06:03.470 --> 00:06:06.250
pivot toward offline, highly localized tools.

00:06:06.509 --> 00:06:09.230
People want specialized personal AI on their

00:06:09.230 --> 00:06:11.569
own hardware. And LM Studio just made a big strategic

00:06:11.569 --> 00:06:13.949
move there. They certainly did. They acquired

00:06:13.949 --> 00:06:18.019
a fascinating company called Locally AI. Adrien

00:06:18.019 --> 00:06:21.100
Grondin is joining to lead native AI experiences.

00:06:21.579 --> 00:06:24.480
They're bringing open source models natively

00:06:24.480 --> 00:06:27.860
to your personal devices. iPhone, iPad, Mac integration.

00:06:28.139 --> 00:06:30.819
With no cloud server connection required at all.

00:06:30.899 --> 00:06:33.480
Zero cloud connectivity. No sign-up friction.

00:06:33.680 --> 00:06:36.519
No monthly fee. It's just localized processing

00:06:36.519 --> 00:06:39.170
utilizing your own silicon. Apple is moving in

00:06:39.170 --> 00:06:41.889
a remarkably similar direction too. They're testing

00:06:41.889 --> 00:06:45.529
four distinct smart glasses styles for 2027.

00:06:45.970 --> 00:06:49.149
Crucially, these glasses feature absolutely no

00:06:49.149 --> 00:06:52.750
displays. None. Zero visual displays. It's a

00:06:52.750 --> 00:06:55.269
huge departure from their previous AR headsets.

00:06:55.350 --> 00:06:57.709
It's just a camera, spatial music, phone calls,

00:06:57.709 --> 00:06:59.810
and Siri. It's a much simpler approach to ambient

00:06:59.810 --> 00:07:02.279
computing. The tools are becoming hyper-specialized

00:07:02.279 --> 00:07:04.339
and local. Let's look at a few examples from

00:07:04.339 --> 00:07:06.399
our sources. There's a fascinating new tool called

00:07:06.399 --> 00:07:08.639
Ray. Oh, Ray is incredibly interesting. It acts

00:07:08.639 --> 00:07:11.180
as a terminal -based CFO. It reads your real

00:07:11.180 --> 00:07:13.139
computer transactions locally. It tells you what

00:07:13.139 --> 00:07:15.459
to do, helps plan budgets, and it runs entirely

00:07:15.459 --> 00:07:17.980
on your own private machine. Then there's R0Y,

00:07:18.259 --> 00:07:20.920
spelled with a zero, a natural language financial

00:07:20.920 --> 00:07:24.339
studio. It builds full investing dashboards in

00:07:24.339 --> 00:07:27.060
seconds, tailored to you. And Google's Gemini

00:07:27.060 --> 00:07:29.360
just introduced interactive visual simulations.

00:07:29.600 --> 00:07:32.319
You don't just read text. You ask Gemini to show

00:07:32.319 --> 00:07:34.759
me or help me visualize. You can actually play

00:07:34.759 --> 00:07:37.019
with abstract concepts locally. We also have

00:07:37.019 --> 00:07:40.259
the ElevenLabs music marketplace. Creators generate

00:07:40.259 --> 00:07:42.839
tracks and publish them seamlessly. They earn

00:07:42.839 --> 00:07:45.560
real money on downloads. ElevenLabs has already

00:07:45.560 --> 00:07:48.259
paid out $11 million to voice creators. It's

00:07:48.259 --> 00:07:51.420
birthing a totally new creator economy. The localized

00:07:51.420 --> 00:07:54.379
empowerment gives individuals studio-level production

00:07:54.379 --> 00:07:57.300
capabilities. So looking at

00:07:57.300 --> 00:08:00.300
all this screenless, highly specific tech, is

00:08:00.300 --> 00:08:03.699
the true future of AI actually completely invisible

00:08:03.699 --> 00:08:06.240
and entirely offline? History shows the best

00:08:06.240 --> 00:08:09.060
technology always fades away. It disappears completely

00:08:09.060 --> 00:08:11.259
into the ambient background of our daily lives.

00:08:11.420 --> 00:08:13.459
So the best technology simply becomes an invisible

00:08:13.459 --> 00:08:15.339
part of our environment. It just becomes the

00:08:15.339 --> 00:08:18.560
silent air we breathe. We talked about powerful

00:08:18.560 --> 00:08:21.379
local tools. We discussed the shift toward invisible,

00:08:21.579 --> 00:08:24.230
display-free hardware. Here's where it gets

00:08:24.230 --> 00:08:28.910
really interesting. Meta is trying to leapfrog

00:08:28.910 --> 00:08:31.889
this entire localized paradigm. They want the

00:08:31.889 --> 00:08:34.490
AI to literally become the operating system.

00:08:34.750 --> 00:08:36.970
Yeah, they're trying to fundamentally delete

00:08:36.970 --> 00:08:40.350
the traditional OS entirely. Meta is moving away

00:08:40.350 --> 00:08:43.190
from models that simply use computers. They're

00:08:43.190 --> 00:08:45.549
developing models that actually act as the computers

00:08:45.549 --> 00:08:47.710
themselves. It's a monumental shift in architecture.

00:08:48.029 --> 00:08:51.049
Their new neural computer prototype executes

00:08:51.049 --> 00:08:53.470
desktop tasks directly via learned behavior.

00:08:53.909 --> 00:08:57.269
There are no APIs involved, no complex orchestration

00:08:57.269 --> 00:08:59.509
layers to translate commands. This is radically

00:08:59.509 --> 00:09:01.610
different from traditional AI tool use, isn't

00:09:01.610 --> 00:09:03.750
it? Completely different. When ChatGPT calls

00:09:03.750 --> 00:09:05.970
a Python interpreter, that's basic tool use.

00:09:06.149 --> 00:09:08.970
It uses a programmatic bridge. Meta's prototype

00:09:08.970 --> 00:09:12.570
is a unified, end-to-end, learned system. It's

00:09:12.570 --> 00:09:14.350
like teaching someone to drive by showing them

00:09:14.350 --> 00:09:16.370
thousands of photos of a dashboard. You never

00:09:16.370 --> 00:09:18.549
actually explain what a steering wheel mechanically

00:09:18.549 --> 00:09:20.610
does. You just show the visual results of turning

00:09:20.610 --> 00:09:23.279
it. That's a perfect visual metaphor. It was

00:09:23.279 --> 00:09:25.860
trained on thousands of hours of raw screen recordings.

00:09:26.200 --> 00:09:29.379
It passively watched human cursor movements and

00:09:29.379 --> 00:09:32.559
watched complex terminal sessions unfold pixel

00:09:32.559 --> 00:09:34.779
by pixel. So the logic isn't programmed with

00:09:34.779 --> 00:09:37.940
standard code. It's encoded directly into the

00:09:37.940 --> 00:09:40.139
model's mathematical weights. Yeah. And that

00:09:40.139 --> 00:09:42.340
changes how it actually operates. It predicts

00:09:42.340 --> 00:09:45.759
the next screen state exactly like GPT predicts

00:09:45.759 --> 00:09:48.639
the next word. It treats human UI interaction

00:09:48.639 --> 00:09:52.470
as raw visual data. So if it needs to move a file,

00:09:52.570 --> 00:09:54.990
what does it actually do? It doesn't call a computer

00:09:54.990 --> 00:09:58.399
command or an API. It literally thinks through the specific

00:09:58.399 --> 00:10:01.299
cursor movements required. It generates the exact

00:10:01.299 --> 00:10:04.179
keyboard inputs needed to drag and drop that

00:10:04.179 --> 00:10:06.279
file. It essentially hallucinates the physical

00:10:06.279 --> 00:10:08.779
mouse moving across the screen. Mechanically, yeah.

00:10:08.779 --> 00:10:11.700
It predicts the pixels shifting. But there are

00:10:11.700 --> 00:10:13.399
some glaring limitations right now. It's just

00:10:13.399 --> 00:10:15.740
an early prototype, not a finished product. What

00:10:15.740 --> 00:10:18.080
happens when it tries a complex multi-step task?

00:10:18.080 --> 00:10:20.779
It really struggles with sustained context. It

00:10:20.779 --> 00:10:23.039
might remember to open a system folder, but three

00:10:23.039 --> 00:10:25.200
steps later it completely forgets why it was

00:10:25.200 --> 00:10:28.250
there. It loses the thread. It lacks basic digital

00:10:28.250 --> 00:10:30.490
object permanence. That's exactly the problem.

00:10:31.029 --> 00:10:33.909
It hasn't learned the concept of a button. It

00:10:33.909 --> 00:10:36.309
has only learned the visual pattern of a button.

00:10:36.450 --> 00:10:38.830
It recognizes the pattern of an X icon in the

00:10:38.830 --> 00:10:41.429
corner. Right. It sees the X and knows to click

00:10:41.429 --> 00:10:44.330
it. But until the model learns that an X icon

00:10:44.330 --> 00:10:47.669
always means close, it remains limited. It has

00:10:47.669 --> 00:10:50.289
to understand that concept regardless of the

00:10:50.289 --> 00:10:53.559
application. Until then it's basically a highly

00:10:53.559 --> 00:10:56.559
sophisticated macro recorder. Okay, let's dig

00:10:56.559 --> 00:10:59.500
into that limitation. Is the barrier between

00:10:59.500 --> 00:11:01.860
recognizing visual patterns and understanding

00:11:01.860 --> 00:11:04.639
underlying concepts the hardest problem in machine

00:11:04.639 --> 00:11:07.860
learning today? I firmly believe it is. Bridging

00:11:07.860 --> 00:11:10.460
that massive cognitive gap is basically the definition

00:11:10.460 --> 00:11:13.179
of true artificial general intelligence. It's

00:11:13.179 --> 00:11:14.500
the difference between mimicking intelligence

00:11:14.500 --> 00:11:17.779
and possessing actual comprehension. So scaling

00:11:17.779 --> 00:11:20.320
up visual patterns might not organically lead

00:11:20.320 --> 00:11:22.820
to true conceptual understanding. That remains

00:11:22.820 --> 00:11:25.340
the absolute biggest unanswered question in AI

00:11:25.340 --> 00:11:27.659
research. Scaling visual patterns just doesn't

00:11:27.659 --> 00:11:29.960
guarantee true cognitive reasoning. You're finding

00:11:29.960 --> 00:11:31.879
that out the hard way, yeah. Let's zoom out for

00:11:31.879 --> 00:11:33.899
a second. We've covered a massive amount of technical

00:11:33.899 --> 00:11:35.860
ground today. If we connect this to the bigger

00:11:35.860 --> 00:11:38.610
picture, we're witnessing a total restructuring

00:11:38.610 --> 00:11:41.549
of how humanity computes. Devs are hacking together

00:11:41.549 --> 00:11:44.250
specialized tool chains today. We're scaling

00:11:44.250 --> 00:11:46.350
infrastructure to support billions of agents

00:11:46.350 --> 00:11:49.129
tomorrow. We're bringing open source models locally

00:11:49.129 --> 00:11:52.590
to our pockets. And Meta is aggressively trying

00:11:52.590 --> 00:11:55.950
to bypass all of it by turning the AI into

00:11:55.950 --> 00:11:58.590
the computer itself. It is a massive, unprecedented

00:11:58.590 --> 00:12:01.149
transition. Thank you for joining us on this deep

00:12:01.149 --> 00:12:03.350
dive. Take a moment today to look at your own

00:12:03.350 --> 00:12:05.710
digital workflow. Try to find where you're getting

00:12:05.710 --> 00:12:07.629
your own agent fatigue and see where you can

00:12:07.629 --> 00:12:10.090
specialize your daily tools. The landscape is

00:12:10.090 --> 00:12:12.269
shifting incredibly fast beneath our feet. It

00:12:12.269 --> 00:12:14.490
really is. And it leaves me with one final fascinating

00:12:14.490 --> 00:12:17.990
question to consider. If Meta's neural computer

00:12:17.990 --> 00:12:20.149
eventually succeeds in operating our graphical

00:12:20.149 --> 00:12:22.730
interfaces, will the software of the future be

00:12:22.730 --> 00:12:26.090
designed to look good to human eyes? Or will it

00:12:26.090 --> 00:12:28.669
be optimized entirely for AI vision models to

00:12:28.669 --> 00:12:30.809
easily read?
