WEBVTT

00:00:00.000 --> 00:00:01.679
I want you to close your eyes for a second. Just

00:00:01.679 --> 00:00:03.680
picture this. You are sitting in the driver's

00:00:03.680 --> 00:00:05.740
seat of a car. Okay. You are right in the middle

00:00:05.740 --> 00:00:09.359
of San Francisco. It is completely chaotic. Oh,

00:00:09.359 --> 00:00:12.779
yeah. You have got the trolleys. You have pedestrians

00:00:12.779 --> 00:00:17.379
dodging traffic. That specific heavy fog is rolling

00:00:17.379 --> 00:00:20.179
in off the bay. Now look at the driver next to

00:00:20.179 --> 00:00:24.019
you. But here is the twist. The driver is not

00:00:24.019 --> 00:00:27.649
a person. Right. And it is not a computer program

00:00:27.649 --> 00:00:30.829
that was fed a million lines of code about traffic

00:00:30.829 --> 00:00:34.409
laws. It is not reading a manual. Exactly. It

00:00:34.409 --> 00:00:36.189
has never been told what a stop sign actually

00:00:36.189 --> 00:00:38.750
means. Yeah. It was just shown a video of a person

00:00:38.750 --> 00:00:40.909
driving. Just watching pixels. Just watching

00:00:40.909 --> 00:00:43.030
pixels change on a screen. And it figures out

00:00:43.030 --> 00:00:45.590
how to steer, how to brake, how to navigate the

00:00:45.590 --> 00:00:48.729
real world. It is basically driving by watching

00:00:48.729 --> 00:00:51.109
a YouTube tutorial. It sounds like science fiction.

00:00:51.329 --> 00:00:53.909
It really does. Like some chaotic future timeline.

00:00:54.570 --> 00:00:56.789
But this is happening right now. Yeah. We are

00:00:56.789 --> 00:00:58.990
moving from the era of telling computers what

00:00:58.990 --> 00:01:01.270
to do, and we are entering the era of showing

00:01:01.270 --> 00:01:05.769
them who to be. It is terrifying and fascinating.

00:01:06.049 --> 00:01:08.310
Welcome back to the Deep Dive. Today, we are

00:01:08.310 --> 00:01:10.310
doing something a bit different. We have a lot

00:01:10.310 --> 00:01:12.890
of ground to cover. We do. We are pulling apart

00:01:12.890 --> 00:01:15.469
the entire ecosystem that makes that self-driving

00:01:15.469 --> 00:01:18.650
car scenario possible. Right. We have this massive

00:01:18.650 --> 00:01:21.750
stack of notes from the latest AI fire dispatch.

00:01:22.170 --> 00:01:24.989
We are going to trace the money. The chips and

00:01:24.989 --> 00:01:28.430
the code. Because you cannot understand the software

00:01:28.430 --> 00:01:31.069
without the hardware. It is all connected. Lay

00:01:31.069 --> 00:01:34.010
out the roadmap for us. We have three main pillars

00:01:34.010 --> 00:01:37.049
today. First, we start at the macro level. The

00:01:37.049 --> 00:01:40.230
engine room. NVIDIA. Exactly. NVIDIA's monster

00:01:40.230 --> 00:01:42.670
quarter. We are going to explain this concept

00:01:42.670 --> 00:01:45.469
that compute is revenue. The fundamental economic

00:01:45.469 --> 00:01:48.409
law right now. Right. Second, we look at the

00:01:48.409 --> 00:01:51.480
challengers. The hardware wars, new startups

00:01:51.480 --> 00:01:54.379
taking a swing at the king, and how pricing models

00:01:54.379 --> 00:01:56.840
are shifting for you. And third. We land on that

00:01:56.840 --> 00:01:58.840
breakthrough you mentioned. Standard Intelligence

00:01:58.840 --> 00:02:01.900
and their FDM1 model. The video learning brain.

00:02:02.040 --> 00:02:04.159
That is the one. The implications are going to

00:02:04.159 --> 00:02:05.840
stick with you. All right, let's get into the

00:02:05.840 --> 00:02:08.639
engine room. NVIDIA. Every time we talk about

00:02:08.639 --> 00:02:11.000
them, the numbers are just big. Massive. But

00:02:11.000 --> 00:02:12.780
looking at this recent report, big is the wrong

00:02:12.780 --> 00:02:15.300
word. It feels historical. It is relentless.

00:02:15.400 --> 00:02:17.379
Let's look at the raw data so we are all on the

00:02:17.379 --> 00:02:22.599
same page. NVIDIA posted $68 billion in quarterly

00:02:22.599 --> 00:02:27.060
revenue. $68 billion. In three months. Up 73%

00:02:27.060 --> 00:02:30.080
year over year. That is absurd. Right. For

00:02:30.080 --> 00:02:34.139
a company valued at over $4.7 trillion. Usually

00:02:34.139 --> 00:02:36.580
growth slows down at that size. You expect that

00:02:36.580 --> 00:02:40.139
from a startup, not a titan. Exactly. And peeling

00:02:40.139 --> 00:02:43.860
it back gets crazier. $62 billion came from data

00:02:43.860 --> 00:02:47.879
centers. $51 billion from AI compute GPUs alone.

00:02:48.159 --> 00:02:51.379
Let's pause on that. Wait. $51 billion spent

00:02:51.379 --> 00:02:54.400
on silicon chips in 90 days. Who is buying this

00:02:54.400 --> 00:02:57.319
and why? This is that concept circulating right

00:02:57.319 --> 00:02:59.699
now. Compute is revenue. Unpack that for us.

00:02:59.759 --> 00:03:01.479
Think about it like this. Every single time you

00:03:01.479 --> 00:03:03.639
interact with an AI, you are generating tokens.

00:03:03.699 --> 00:03:06.840
So a token is just a basic chunk of text or data.

00:03:07.020 --> 00:03:10.039
Perfect definition. A word is a token. A pixel

00:03:10.039 --> 00:03:13.150
is a token. Here is the physics of it. Every

00:03:13.150 --> 00:03:16.310
token requires a physical GPU chip to do a math

00:03:16.310 --> 00:03:18.590
operation. Which is called inference. Right.

00:03:18.689 --> 00:03:21.530
Inference is the physical math calculation needed

00:03:21.530 --> 00:03:24.490
to produce one token. So unlike downloading a

00:03:24.490 --> 00:03:27.069
static file from the old internet. Exactly. With

00:03:27.069 --> 00:03:30.409
AI, every word generated costs electricity and

00:03:30.409 --> 00:03:33.650
processor time. It is a direct linear relationship.

00:03:34.050 --> 00:03:36.569
More users means more tokens. More tokens means

00:03:36.569 --> 00:03:40.099
you physically must have more GPUs. If you lack

00:03:40.099 --> 00:03:43.819
GPUs, the service literally stops. So more AI

00:03:43.819 --> 00:03:46.780
usage equals more data centers. Right. Which

00:03:46.780 --> 00:03:49.240
equals more NVIDIA revenue. It is a perpetual

00:03:49.240 --> 00:03:51.360
motion machine of money. It is an intelligence

00:03:51.360 --> 00:03:54.280
tax. NVIDIA collects a tax on every thought the

00:03:54.280 --> 00:03:57.360
AI has. And the demand is exponential. Completely

00:03:57.360 --> 00:03:58.979
exponential. Everyone wants the new Blackwell

00:03:58.979 --> 00:04:01.280
chips, the Ferraris of the industry. But they're

00:04:01.280 --> 00:04:03.120
sold out. Totally sold out. So companies are

00:04:03.120 --> 00:04:05.460
buying everything. Even six-year-old GPUs,

00:04:05.460 --> 00:04:08.259
the equivalent of a 2018 Honda Civic. They are

00:04:08.259 --> 00:04:10.610
fully... booked. Fully booked in cloud environments.

00:04:10.789 --> 00:04:13.250
Prices are rising everywhere. Supply cannot touch

00:04:13.250 --> 00:04:16.310
demand. It's the classic gold rush analogy. Selling

00:04:16.310 --> 00:04:19.569
the shovels. But NVIDIA sells the shovels, the

00:04:19.569 --> 00:04:22.170
pickaxes, and they own the mountain. There is

00:04:22.170 --> 00:04:25.110
a complication though. China. Yeah, this is a

00:04:25.110 --> 00:04:28.490
crucial nuance. Despite partial export approvals,

00:04:28.569 --> 00:04:31.389
NVIDIA reported essentially zero revenue from

00:04:31.389 --> 00:04:34.560
China. Zero. That seems statistically impossible.

00:04:34.939 --> 00:04:37.259
It is a rounding error. And executives acknowledge

00:04:37.259 --> 00:04:40.779
why: it is not just U.S. regulations. It is the

00:04:40.779 --> 00:04:43.420
local market adapting. Right. Rising Chinese

00:04:43.420 --> 00:04:46.480
competitors are stepping in. Backed by massive

00:04:46.480 --> 00:04:49.079
IPO funding, they have to build their own infrastructure.

00:04:49.399 --> 00:04:52.720
So can NVIDIA sustain a $4.7 trillion valuation

00:04:52.720 --> 00:04:55.259
if they're locked out of the second largest economy?

00:04:55.740 --> 00:04:58.540
As long as token demand exists, Western infrastructure

00:04:58.540 --> 00:05:01.660
remains absolute gold. Demand outweighs the geographic

00:05:01.660 --> 00:05:04.620
lockout. Got it. For now, yes. Let's shift gears

00:05:04.620 --> 00:05:07.800
then. Kings do not stay kings without a fight.

00:05:07.959 --> 00:05:10.500
The market hates a monopoly. It makes customers

00:05:10.500 --> 00:05:12.620
very nervous. Tell me about the challengers.

00:05:12.740 --> 00:05:15.480
The newsletter mentioned MATX. MATX is an

00:05:15.480 --> 00:05:17.660
AI chip startup. Most people haven't heard of

00:05:17.660 --> 00:05:19.660
them yet, but they just raised $500 million.

00:05:20.160 --> 00:05:22.279
Half a billion dollars to fight NVIDIA. What

00:05:22.279 --> 00:05:25.139
is their specific angle? NVIDIA builds general-purpose

00:04:25.139 --> 00:04:28.300
GPUs. Good at graphics, good at crypto,

00:05:28.459 --> 00:05:31.980
good at AI. MATX is specializing. Stripping it

00:05:31.980 --> 00:05:34.500
down. Building chips designed only for large

00:05:34.500 --> 00:05:37.560
language models. Hyper-efficient. So less flexible

00:05:37.560 --> 00:05:41.100
but faster for this one specific task. Exactly.

00:05:41.300 --> 00:05:43.720
It signals investors want alternative hardware

00:05:43.720 --> 00:05:46.879
desperately. And then there is DeepSeek. We saw

00:05:46.879 --> 00:05:49.139
those efficiency headlines recently. Right. The

00:05:49.139 --> 00:05:51.939
rumor is DeepSeek is freezing NVIDIA out of their

00:05:51.939 --> 00:05:54.639
next model entirely. Which aligns with that zero

00:05:54.639 --> 00:05:57.899
revenue figure. It does. But it is messy. There

00:05:57.899 --> 00:06:00.139
are strong rumors they trained previous models

00:06:00.139 --> 00:06:02.920
on smuggled NVIDIA Blackwell chips. Through gray

00:06:02.920 --> 00:06:05.699
markets. Yeah. But for the next iteration, Huawei

00:06:05.699 --> 00:06:09.019
reportedly got early access. So it is a mix of

00:06:09.019 --> 00:06:11.839
geopolitics and pure strategy. Turbulence under

00:06:11.839 --> 00:06:13.920
the surface. It is not just the hardware shifting,

00:06:14.019 --> 00:06:16.079
though. The pricing models for accessing these

00:06:16.079 --> 00:06:19.139
tools are changing. Big time. OpenAI is testing

00:06:19.139 --> 00:06:24.259
a ProLite tier for $100 a month. Yeah. One hundred

00:06:24.259 --> 00:06:26.399
dollars. I have to be honest here. I saw that

00:06:26.399 --> 00:06:29.339
number and I felt some friction. I still wrestle

00:06:29.339 --> 00:06:31.920
with justifying the $20 tier myself. It is a

00:06:31.920 --> 00:06:35.300
psychological hurdle. It really is. Jumping to

00:06:35.300 --> 00:06:37.540
$200 for the enterprise tier makes sense for

00:06:37.540 --> 00:06:40.879
companies, but $100 for an individual. You are

00:06:40.879 --> 00:06:43.379
not alone in that feeling. So what does this

00:06:43.379 --> 00:06:46.339
$100 tier signal about the future of AI users?

00:06:46.800 --> 00:06:49.519
It bridges the gap between casual chatters and

00:06:49.519 --> 00:06:52.420
always-on power users. Moving from casual tool

00:06:52.420 --> 00:06:55.639
to digital co-worker. Exactly. Think about an

00:06:55.639 --> 00:06:58.379
agentic workflow. Define agentic for us quickly.

00:06:58.639 --> 00:07:01.600
It means the AI acts autonomously to complete

00:07:01.600 --> 00:07:04.600
multi -step goals. Okay. You tell it to research

00:07:04.600 --> 00:07:07.639
five companies, write a report, and draft emails.

00:07:08.410 --> 00:07:11.470
It thinks for 30 minutes? That burns massive

00:07:11.470 --> 00:07:14.610
compute. So the $20 plan has to limit that. Right.

00:07:14.670 --> 00:07:17.670
The $100 tier is for someone inextricably linked

00:07:17.670 --> 00:07:20.189
to the AI all day. Speaking of being linked,

00:07:20.370 --> 00:07:22.889
Anthropic just launched remote control for Claude

00:07:22.889 --> 00:07:26.410
Code. This is wild. You can run your Mac terminal

00:07:26.410 --> 00:07:28.610
from your phone browser. Sitting in a coffee

00:07:28.610 --> 00:07:31.850
shop, controlling your home desktop via AI. It

00:07:31.850 --> 00:07:33.990
blurs the line of where compute happens. The

00:07:33.990 --> 00:07:36.329
computer is just a cloud surrounding us. Which

00:07:36.329 --> 00:07:38.810
leads us perfectly to the third pillar today,

00:07:39.029 --> 00:07:41.370
the software breakthrough. Standard Intelligence,

00:07:41.709 --> 00:07:46.310
FDM1. This is mind-bending. For years, we talked

00:07:46.310 --> 00:07:49.170
about large language models, text in, text out.

00:07:49.310 --> 00:07:52.430
Predicting the next word. Right. But FDM1 is

00:07:52.430 --> 00:07:54.949
a computer action model. It does not learn from

00:07:54.949 --> 00:07:57.970
text prompts. It learns by watching video. Think

00:07:57.970 --> 00:08:01.100
about a toddler. You don't hand a toddler a manual

00:08:01.100 --> 00:08:03.120
on how to open a door. They just watch you turn

00:08:03.120 --> 00:08:05.439
the knob. They build a world model visually.

00:08:05.920 --> 00:08:08.600
FDM1 does this with software. It watches frames

00:08:08.600 --> 00:08:10.959
and reverse engineers the actions. Exactly. No

00:08:10.959 --> 00:08:12.980
manual labeling. It sees the cursor move, sees

00:08:12.980 --> 00:08:15.100
the click, sees the result. It is like stacking

00:08:15.100 --> 00:08:17.759
Lego blocks of visual data. And the demos are

00:08:17.759 --> 00:08:20.899
practical. It built mechanical gears inside Blender.

00:08:21.019 --> 00:08:23.240
Blender has a notoriously complex interface.

00:08:23.579 --> 00:08:26.279
A nightmare of menus. Writing code rules for

00:08:26.279 --> 00:08:28.860
that would take decades. But watching a video

00:08:28.860 --> 00:08:31.319
solves it. It learns the spatial relationships

00:08:31.319 --> 00:08:34.000
automatically. It found software bugs by following

00:08:34.000 --> 00:08:36.559
long desktop sessions. And the driving example

00:08:36.559 --> 00:08:39.620
from the opening. Whoa, imagine scaling that visual

00:08:39.620 --> 00:08:43.659
learning to a billion everyday tasks. It used

00:08:43.659 --> 00:08:46.480
keyboard inputs based on live feeds to drive

00:08:46.480 --> 00:08:50.620
a real car. Arrow keys to steer. Spacebar to

00:08:50.620 --> 00:08:53.399
brake. Mapping visual input directly to action.

00:08:53.679 --> 00:08:56.120
It doesn't need to understand the car's underlying

00:08:56.120 --> 00:08:59.480
code. Just the behavior of driving. It creates

00:08:59.480 --> 00:09:01.940
a real moment of wonder. If it learns driving

00:09:01.940 --> 00:09:05.879
by watching, can it learn plumbing? Surgery?

00:09:06.279 --> 00:09:08.779
What about the compute cost, though? Training

00:09:08.779 --> 00:09:10.919
on video sounds expensive. This is the surprise.

00:09:11.200 --> 00:09:13.899
Some tasks required less than one hour of training

00:09:13.899 --> 00:09:16.039
footage. Less than an hour. Less than an hour

00:09:16.039 --> 00:09:18.340
of video. Highly efficient. That changes the

00:09:18.340 --> 00:09:20.299
economics entirely. You don't need to scrape

00:09:20.299 --> 00:09:22.179
the whole internet. You just record your expense

00:09:22.179 --> 00:09:24.340
report process for an hour and it learns it.

00:09:24.460 --> 00:09:26.980
If AI learns by just watching an hour of video,

00:09:27.200 --> 00:09:29.659
what does this mean for prompt engineering? We

00:09:29.659 --> 00:09:31.960
are shifting from telling AI what to do to simply

00:09:31.960 --> 00:09:34.919
showing it. Showing, not telling. That removes

00:09:34.919 --> 00:09:37.080
the abstraction of language entirely. You don't

00:09:37.080 --> 00:09:39.220
need to be a good explainer, just a good doer.

00:09:39.240 --> 00:09:42.059
It is a cohesive narrative arc today. Let's recap

00:09:42.059 --> 00:09:46.090
the big idea. We started with NVIDIA. The massive

00:09:46.090 --> 00:09:49.389
physical infrastructure required. The $68 billion

00:09:49.389 --> 00:09:53.289
engine room. Then the challengers. MATX's specialized

00:09:53.289 --> 00:09:56.149
hardware. OpenAI shifting pricing for power

00:09:56.149 --> 00:09:58.850
users. And finally, software becoming more human.

00:09:59.149 --> 00:10:01.409
Standard Intelligence proving the future of learning

00:10:01.409 --> 00:10:04.529
is visual observation, not text. Which leaves

00:10:04.529 --> 00:10:06.549
me with a somewhat provocative final thought.

00:10:07.009 --> 00:10:10.629
We always worry about AI taking jobs or

00:10:10.629 --> 00:10:13.960
scraping our data. Right. But if AI learns best

00:10:13.960 --> 00:10:16.899
by watching us work, then our daily workflows

00:10:16.899 --> 00:10:19.960
are not just work anymore. No, they're valuable

00:10:19.960 --> 00:10:22.220
assets. Every time you navigate a spreadsheet,

00:10:22.440 --> 00:10:25.220
edit a video, or drive to the store, you are

00:10:25.220 --> 00:10:27.659
generating high-value training data. You are

00:10:27.659 --> 00:10:29.860
actively performing the curriculum for the next

00:10:29.860 --> 00:10:31.679
generation of models. That is a heavy thought

00:10:31.679 --> 00:10:33.960
for your next Zoom meeting. Sit up straight.

00:10:34.100 --> 00:10:36.399
The AI might be watching to see how it is done.

00:10:36.600 --> 00:10:39.059
A slightly dystopian, but very real possibility.

00:10:39.419 --> 00:10:41.460
You are not just an employee. You are a training

00:10:41.460 --> 00:10:45.039
set. We will take a quick break here. That wraps

00:10:45.039 --> 00:10:47.879
up today's Deep Dive. We will track Standard

00:10:47.879 --> 00:10:50.639
Intelligence closely to see where this visual

00:10:50.639 --> 00:10:53.240
learning goes next. Thanks for listening. Catch

00:10:53.240 --> 00:10:53.679
you next time.
