WEBVTT

00:00:00.000 --> 00:00:02.399
Welcome to this deep dive. Tech Twitter is completely

00:00:02.399 --> 00:00:05.960
panicking right now. People claim Anthropic burns

00:00:05.960 --> 00:00:10.060
$5,000 a month just to run a single user's coding

00:00:10.060 --> 00:00:12.839
AI. That tension is our narrative through line

00:00:12.839 --> 00:00:15.439
today. We are unpacking the hidden mechanics

00:00:15.439 --> 00:00:19.260
of AI scaling. We will start with the real cost

00:00:19.260 --> 00:00:22.440
of AI, then look at the chaos of deploying it.

00:00:22.579 --> 00:00:25.460
From Google to Amazon. Right. Then we explore

00:00:25.460 --> 00:00:29.100
tools empowering the edge. Finally, we end with

00:00:29.100 --> 00:00:31.719
a massive architectural breakthrough, a shift

00:00:31.719 --> 00:00:34.420
in how AI speaks to you. We really have to start

00:00:34.420 --> 00:00:37.299
with the math. Let us unpack this $5,000 myth.

00:00:37.579 --> 00:00:39.399
Where did this even come from? It traces back

00:00:39.399 --> 00:00:41.960
to a viral Forbes article. It was about a popular

00:00:41.960 --> 00:00:45.159
coding tool called Cursor. They looked at Anthropic's

00:00:45.159 --> 00:00:48.020
$200 Claude plan. Yeah. And they guessed it burns

00:00:48.020 --> 00:00:50.579
$5,000 in compute. That sounds absolutely terrifying.

00:00:50.880 --> 00:00:52.880
For any sustainable business, that is a death

00:00:52.880 --> 00:00:55.439
sentence. It sounds catastrophic. But you have

00:00:55.439 --> 00:00:57.729
to look at the API pricing. Specifically for

00:00:57.729 --> 00:01:02.509
the Claude Opus 4.6 model, it costs $5 per million

00:01:02.509 --> 00:01:07.170
input tokens and $25 per million output tokens.

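NOTE
A quick back-of-envelope check of those prices. The per-token rates are the ones quoted above; the monthly token volumes are illustrative assumptions about an extreme agentic-coding user, not measured figures.

```python
# Retail prices quoted above: $5 / $25 per million input / output tokens.
INPUT_PRICE_PER_TOKEN = 5.00 / 1_000_000
OUTPUT_PRICE_PER_TOKEN = 25.00 / 1_000_000

def monthly_cost(input_tokens: int, output_tokens: int) -> float:
    """Retail API cost in dollars for one month of usage."""
    return (input_tokens * INPUT_PRICE_PER_TOKEN
            + output_tokens * OUTPUT_PRICE_PER_TOKEN)

# Assumed extreme user: 700M input tokens (whole codebases re-sent as
# context on every call) plus 60M generated tokens in a month.
print(f"${monthly_cost(700_000_000, 60_000_000):,.0f}")  # $5,000
```

At those assumed volumes the retail bill does reach $5,000, which is plausibly where the headline number comes from.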
00:01:07.329 --> 00:01:10.109
Just to clarify for everyone, tokens are the

00:01:10.109 --> 00:01:13.049
basic building blocks of text for AI. Exactly.

00:01:13.049 --> 00:01:17.409
So if an extreme power user goes crazy, the API

00:01:17.409 --> 00:01:20.849
usage could theoretically hit $5,000. But that

00:01:20.849 --> 00:01:23.090
is retail pricing. Right. And that is the crucial

00:01:23.090 --> 00:01:26.629
distinction everyone misses. API pricing is absolutely

00:01:26.629 --> 00:01:29.310
not raw compute cost. It is like looking at a

00:01:29.310 --> 00:01:32.250
restaurant's menu prices. And assuming that's

00:01:32.250 --> 00:01:34.650
what the ingredients cost the chef, you have

00:01:34.650 --> 00:01:37.170
to factor in the massive markup. You really do.

00:01:37.489 --> 00:01:40.390
We can look at open platforms instead, like OpenRouter

00:01:40.390 --> 00:01:42.230
for a much better baseline. Yeah, they host open

00:01:42.230 --> 00:01:44.650
source models. Right, like Qwen 3.5. Yeah.

00:01:44.709 --> 00:01:48.230
The massive 397 billion parameter version. Yeah.

00:01:48.250 --> 00:01:51.409
Or Kimi K2.5. How do their costs compare? They

00:01:51.409 --> 00:01:53.349
are roughly 10 times cheaper than Anthropic.

00:01:53.469 --> 00:01:55.769
Right. Raw compute is maybe 10% of the sticker

00:01:55.769 --> 00:01:58.409
price. Wow. So the true cost is much lower. It's

00:01:58.409 --> 00:02:00.750
closer to $500 a month. At an absolute maximum,

00:02:00.890 --> 00:02:03.329
yes. Yeah. And those power users are incredibly

00:02:03.329 --> 00:02:07.390
rare. Fewer than 5% ever hit those limits. Most

00:02:07.390 --> 00:02:11.419
pay between $20 and $200 monthly. It easily makes

00:02:11.419 --> 00:02:14.360
the system break even or even highly profitable.

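NOTE
To see why the subscription math can still work, here is a toy profitability model built only from the figures mentioned: raw compute at roughly 10% of retail price, fewer than 5% of users hitting the limits, and plans between $20 and $200. All subscriber counts and per-user usage numbers are invented for illustration.

```python
COMPUTE_SHARE = 0.10  # raw compute ~10% of the retail sticker price

def monthly_profit(subscribers: int, power_user_share: float) -> float:
    """Toy monthly profit model; all per-user figures are assumptions."""
    power = int(subscribers * power_user_share)
    typical = subscribers - power
    # Assumed typical user: $100/month plan, ~$150 retail-equivalent usage.
    # Assumed power user: $200/month plan, ~$5,000 retail-equivalent usage.
    revenue = typical * 100 + power * 200
    compute_cost = (typical * 150 + power * 5_000) * COMPUTE_SHARE
    return revenue - compute_cost

print(monthly_profit(1_000, 0.05))  # comfortably positive
```

Even with the rare heavy users subsidized, the sketch stays profitable because compute is a small fraction of retail price.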
00:02:14.520 --> 00:02:16.800
Could those extreme power users still bankrupt

00:02:16.800 --> 00:02:19.879
a smaller startup before they scale? Maybe very

00:02:19.879 --> 00:02:22.340
early on. But smart caching usually solves that

00:02:22.340 --> 00:02:24.879
problem immediately. So raw compute is just a

00:02:24.879 --> 00:02:27.819
fraction of the sticker price. Exactly. Which

00:02:27.819 --> 00:02:30.340
brings us to the friction of reality. Because

00:02:30.340 --> 00:02:33.740
inference is viable, we are seeing rapid integrations.

00:02:33.860 --> 00:02:37.000
So compute is not the bottleneck. Why are systems

00:02:37.000 --> 00:02:39.419
breaking so spectacularly in the wild? It really

00:02:39.419 --> 00:02:41.979
comes down to deployment desperation. OpenAI

00:02:41.979 --> 00:02:45.020
plans to integrate Sora directly into ChatGPT.

00:02:45.099 --> 00:02:48.000
The video generation tool. Yeah. And Google put

00:02:48.000 --> 00:02:51.180
Gemini inside workspace apps. Right. Google Docs

00:02:51.180 --> 00:02:54.389
writes for you. Sheets uses live web data. Slides

00:02:54.389 --> 00:02:56.930
makes full decks. They also launched Gemini Embedding

00:02:56.930 --> 00:02:59.509
2 in public preview. It is highly multimodal.

00:02:59.530 --> 00:03:01.810
Meaning understanding text, images, and audio

00:03:01.810 --> 00:03:04.229
all at once. Exactly. We're also seeing this

00:03:04.229 --> 00:03:07.110
in regulated fields. There is an AI legal startup

00:03:07.110 --> 00:03:10.069
called Legora. They just raised $550 million.

00:03:10.370 --> 00:03:15.569
They hit a $5 billion valuation. That is massive.

00:03:15.949 --> 00:03:18.569
And they're already used by 800 law firms. It

00:03:18.569 --> 00:03:20.830
is a cloud-powered system. What about the physical

00:03:20.830 --> 00:03:23.889
hardware side of this? Meta just unveiled four

00:03:23.889 --> 00:03:27.050
in-house AI chips. They're rolling out updates

00:03:27.050 --> 00:03:29.669
every six months. Wait, I have to push back on

00:03:29.669 --> 00:03:32.210
that timeline. Hardware is notoriously hard to

00:03:32.210 --> 00:03:35.330
pivot. Why attempt a six-month cycle? To reduce

00:03:35.330 --> 00:03:38.250
their heavy reliance on NVIDIA GPUs. Yeah. They

00:03:38.250 --> 00:03:41.210
are optimizing for pure speed over perfect efficiency.

00:03:41.840 --> 00:03:44.419
But fast deployment means things inevitably break.

00:03:44.539 --> 00:03:46.759
Oh, absolutely. Amazon triggered multiple incidents

00:03:46.759 --> 00:03:49.000
recently. They're using autonomous AI coding

00:03:49.000 --> 00:03:51.419
tools. Yeah, I read about that. One AI actually

00:03:51.419 --> 00:03:54.039
deleted a live production environment. I still

00:03:54.039 --> 00:03:56.939
wrestle with prompt drift myself. So an AI deleting

00:03:56.939 --> 00:04:00.000
an environment is terrifying. It perfectly highlights

00:04:00.000 --> 00:04:02.699
the danger of unmonitored autonomy. And it is

00:04:02.699 --> 00:04:05.199
not just broken code causing chaos. Right. The

00:04:05.199 --> 00:04:08.060
legal friction. A U.S. court just ordered Perplexity

00:04:08.060 --> 00:04:10.750
to destroy data. Their Comet browser accessed

00:04:10.750 --> 00:04:13.310
Amazon data without permission. Fast deployment

00:04:13.310 --> 00:04:16.750
means breaking things, both code and laws. That

00:04:16.750 --> 00:04:18.930
is the grim reality of the current landscape.

00:04:19.209 --> 00:04:22.170
But away from the tech giants, things are different.

00:04:23.009 --> 00:04:26.089
Specialized tools are quietly changing how individual

00:04:26.089 --> 00:04:28.810
developers work. Empowering the edge. Exactly.

00:04:29.149 --> 00:04:31.829
Have you seen Innsforge yet? I have not. It deploys

00:04:31.829 --> 00:04:34.139
full stack apps just by saying the word. You

00:04:34.139 --> 00:04:36.459
can deploy to their cloud or your domain. Wow.

00:04:36.680 --> 00:04:39.639
No manual configuration at all. None. And then

00:04:39.639 --> 00:04:42.420
there's a tool called Cardboard. It is an agentic

00:04:42.420 --> 00:04:44.740
video editor. How does that work exactly? It

00:04:44.740 --> 00:04:47.100
moves raw footage to a final cut. It actually

00:04:47.100 --> 00:04:49.480
understands the semantic contents of your clips.

00:04:49.920 --> 00:04:53.360
Then you have personal agents like Teract. It

00:04:53.360 --> 00:04:56.319
is an AI reputation coach. For social media.

00:04:56.579 --> 00:04:59.160
Yeah, for LinkedIn, X, and Reddit. It learns

00:04:59.160 --> 00:05:01.879
your unique voice over time. The UI shifts are

00:05:01.879 --> 00:05:03.279
the most interesting to me. I was looking at

00:05:03.279 --> 00:05:05.819
Open UI recently. Oh, that was fascinating. It

00:05:05.819 --> 00:05:08.160
makes AI apps respond with interactive components,

00:05:08.600 --> 00:05:11.579
cards, dynamic tables, and forms instead of just

00:05:11.579 --> 00:05:13.959
static text. Right. It completely changes the

00:05:13.959 --> 00:05:17.120
experience. It is like stacking Lego blocks of

00:05:17.120 --> 00:05:19.639
data instead of reading a wall of text. It makes

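NOTE
As a concrete illustration of the "Lego blocks" idea: instead of one opaque text blob, the model returns typed components the client can lay out and wire up. The component schema below is invented for this sketch and is not OpenUI's actual API.

```python
import json

# Old style: one wall of text the UI can only display verbatim.
plain = {"type": "text", "content": "Revenue grew 12% in Q3; export available."}

# Structured style: typed, composable blocks a UI can render interactively.
structured = {
    "type": "card",
    "title": "Q3 revenue",
    "children": [
        {"type": "table",
         "columns": ["Quarter", "Growth"],
         "rows": [["Q3", "12%"]]},
        {"type": "button", "label": "Export CSV", "action": "export_csv"},
    ],
}
print(json.dumps(structured, indent=2))
```

The client decides how each block looks, so the same response can become a card on desktop and a compact list on mobile.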
00:05:19.639 --> 00:05:22.120
the AI feel like a true software partner. Are

00:05:22.120 --> 00:05:24.199
tools like Cardboard and Innsforge replacing

00:05:24.199 --> 00:05:27.000
human taste? Or just the tedious manual labor.

00:05:27.139 --> 00:05:29.800
Mostly just the tedious manual labor. You still

00:05:29.800 --> 00:05:32.040
desperately need human taste to curate things.

00:05:32.240 --> 00:05:35.240
We're moving from text chats to instant software

00:05:35.240 --> 00:05:37.699
creation. It is a massive structural shift in

00:05:37.699 --> 00:05:40.319
how we work. And speaking of shifting how we

00:05:40.319 --> 00:05:44.079
work, [sponsor break]. We are back. We covered the real

00:05:44.079 --> 00:05:46.639
costs and the deployment chaos. And those specialized

00:05:46.639 --> 00:05:49.420
edge tools. Right. But to make all these tools

00:05:49.420 --> 00:05:52.660
truly seamless, especially voice agents, we need

00:05:52.660 --> 00:05:55.959
to fix the awkward lag in AI speech. It is a

00:05:55.959 --> 00:05:59.139
very noticeable, very weird problem. Current

00:05:59.139 --> 00:06:02.220
AI speech skips words constantly. Or it is just

00:06:02.220 --> 00:06:04.899
far too slow. Because it bolts two entirely different

00:06:04.899 --> 00:06:07.079
models together. One writes the text. The next

00:06:07.079 --> 00:06:09.399
generates the audio. So what is the actual breakthrough

00:06:09.399 --> 00:06:12.240
here? Hume AI just open-sourced a model called

00:06:12.240 --> 00:06:14.620
TETA. It generates text tokens and acoustic

00:06:14.620 --> 00:06:17.519
features together. In one unified stream? Yes.

00:06:17.720 --> 00:06:20.560
They tested it on over a thousand complex samples.

00:06:20.740 --> 00:06:23.379
It had absolutely zero content errors. That is

00:06:23.379 --> 00:06:25.600
practically unheard of. And it runs at a real

00:06:25.600 --> 00:06:28.939
-time factor of 0.09. Which measures how fast

00:06:28.939 --> 00:06:31.680
AI generates audio compared to real time. Right.

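NOTE
The real-time-factor arithmetic is simple: RTF is generation time divided by audio duration, so lower is faster. The 0.09 figure is the one quoted above; the "typical" comparison value is just 0.09 scaled by the roughly-five-times claim, not a measured number.

```python
def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to synthesize a clip: RTF * audio duration."""
    return audio_seconds * rtf

# 700 seconds of speech at RTF 0.09 vs. an assumed ~5x-slower model:
print(round(generation_seconds(700, 0.09), 2))      # 63.0
print(round(generation_seconds(700, 0.09 * 5), 2))  # 315.0
```

So a 700-second clip renders in about a minute instead of five, which is the gap the hosts are pointing at.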
00:06:31.740 --> 00:06:34.259
It is roughly five times faster than typical

00:06:34.259 --> 00:06:36.959
models. And what about the token capacity? It

00:06:36.959 --> 00:06:40.060
handles 2,048 tokens smoothly. That represents

00:06:40.060 --> 00:06:43.500
about 700 seconds of continuous speech. Typical

00:06:43.500 --> 00:06:47.339
systems top out at 70 seconds. Whoa. [Two-second

00:06:47.339 --> 00:06:50.860
silence.] Imagine scaling to 700 seconds of perfect

00:06:50.860 --> 00:06:54.259
speech in one go. That changes everything. It

00:06:54.259 --> 00:06:56.360
really does. And it outputs a perfect transcript

00:06:56.360 --> 00:06:59.399
simultaneously with zero extra latency. Where

00:06:59.399 --> 00:07:01.560
can people actually find this? It is available

00:07:01.560 --> 00:07:04.339
right now on Hugging Face and GitHub. What happens

00:07:04.339 --> 00:07:07.060
to human connection when AI can speak flawlessly

00:07:07.060 --> 00:07:10.439
without that robotic hesitation? That is the

00:07:10.439 --> 00:07:13.379
scary part. We rely on that hesitation to recognize

00:07:13.379 --> 00:07:16.319
machines. Trust will become a massive societal

00:07:16.319 --> 00:07:19.110
issue. Generating text and sound together eliminates

00:07:19.110 --> 00:07:22.629
the awkward lag. Exactly. So if we synthesize

00:07:22.629 --> 00:07:25.790
this entire journey, AI compute is significantly

00:07:25.790 --> 00:07:28.629
cheaper than the hype claims, which explains

00:07:28.629 --> 00:07:31.529
the massive flood of wild integrations. But the

00:07:31.529 --> 00:07:35.360
real frontier is seamless, multimodal interaction,

00:07:35.639 --> 00:07:37.899
like Hume's unified speech model. It leaves you

00:07:37.899 --> 00:07:40.120
with a deeply provocative thought. Compute is

00:07:40.120 --> 00:07:42.579
actually cheap, and open source models like Hume's

00:07:42.579 --> 00:07:44.939
TETA are matching closed systems. Beating them

00:07:44.939 --> 00:07:47.660
in latency, even. Will the future of AI be controlled

00:07:47.660 --> 00:07:50.980
by massive tech monopolies? Or will it live locally

00:07:50.980 --> 00:07:53.379
on our own devices, completely free from the

00:07:53.379 --> 00:07:56.100
cloud? [Beat.] Keep staying curious about these

00:07:56.100 --> 00:07:58.160
systems. Thanks for joining this deep dive. [Outro

00:07:58.160 --> 00:07:58.759
music.]
