WEBVTT

00:00:00.000 --> 00:00:03.339
You know, we usually assume more power is always

00:00:03.339 --> 00:00:06.139
better. It just like makes intuitive sense. Yeah.

00:00:06.219 --> 00:00:08.759
You buy the fastest processor, you get the smartest

00:00:08.759 --> 00:00:11.740
AI, you just throw that massive brain at every

00:00:11.740 --> 00:00:13.740
single problem you encounter. Right. It's our

00:00:13.740 --> 00:00:15.800
default setting. But what if that's entirely

00:00:15.800 --> 00:00:18.559
wrong? What if using our absolute smartest tools,

00:00:18.800 --> 00:00:23.280
you know, less, is actually the secret to better

00:00:23.280 --> 00:00:25.420
results. It sounds completely counterintuitive

00:00:25.420 --> 00:00:28.019
at first. I mean, we're totally conditioned to

00:00:28.019 --> 00:00:30.500
want the absolute best all the time. Exactly.

00:00:31.019 --> 00:00:33.579
Welcome to the Deep Dive. Today, we're exploring

00:00:33.579 --> 00:00:36.100
Anthropic's advisor strategy. Glad to be here.

00:00:36.200 --> 00:00:38.299
We're going to deconstruct why defaulting to

00:00:38.299 --> 00:00:41.880
the smartest AI model for every single step is

00:00:41.880 --> 00:00:43.979
a mistake. And we're going to look at the exact

00:00:43.979 --> 00:00:47.060
cost math, too. The token economics here are

00:00:47.060 --> 00:00:49.159
honestly fascinating. Yeah, they really are.

00:00:49.280 --> 00:00:52.340
We'll also untangle the confusion between the

00:00:52.340 --> 00:00:55.439
Claude Messages API and Claude Code. Right, because

00:00:55.439 --> 00:00:57.460
there's a huge shift happening right now. Oh,

00:00:57.460 --> 00:00:59.560
massive. Right. We're moving from picking the

00:00:59.560 --> 00:01:02.060
smartest model to designing the smartest tasks.

00:01:02.320 --> 00:01:04.260
It completely changes how you build systems.

00:01:04.579 --> 00:01:08.760
So let's unpack the core philosophy here. Anthropic

00:01:08.760 --> 00:01:12.640
treats intelligence as a managed resource. And

00:01:12.640 --> 00:01:15.219
I have to admit something right here. I still

00:01:15.219 --> 00:01:17.879
wrestle with burning budget on simple tasks myself.

00:01:18.239 --> 00:01:19.959
Oh, you're definitely not alone in that trap.

00:01:20.140 --> 00:01:22.700
It's just so incredibly tempting to pick the

00:01:22.700 --> 00:01:25.159
strongest model in the dropdown and just hit

00:01:25.159 --> 00:01:27.799
go. Yeah. Most people build their AI workflows

00:01:27.799 --> 00:01:29.900
exactly that way. They grab the most capable

00:01:30.409 --> 00:01:33.010
model available, then they just wire it into

00:01:33.010 --> 00:01:35.090
every single node of their system. They essentially

00:01:35.090 --> 00:01:38.349
treat maximum intelligence as like an insurance

00:01:38.349 --> 00:01:40.890
policy against failure. But using an insurance

00:01:40.890 --> 00:01:43.569
policy for everyday driving is incredibly inefficient.

00:01:43.930 --> 00:01:46.069
Exactly. It wastes an enormous amount of money.

00:01:46.129 --> 00:01:48.890
It burns through your API rate limits instantly.

00:01:49.109 --> 00:01:51.450
And fundamentally, well, it solves the wrong

00:01:51.450 --> 00:01:54.810
problem. It ignores how workflows actually function.

00:01:55.010 --> 00:01:56.849
Break down what you mean by solving the wrong

00:01:56.849 --> 00:01:59.659
problem. Think about a compound agentic workflow.

00:02:00.400 --> 00:02:03.439
Most tasks aren't just one monolithic thought.

00:02:03.659 --> 00:02:05.900
Right, they have pieces. Yeah, they're a sequence

00:02:05.900 --> 00:02:09.280
of mixed steps. Maybe one step requires intense,

00:02:09.500 --> 00:02:12.819
multi-step logical reasoning. But the next four

00:02:12.819 --> 00:02:15.479
steps, they might just involve basic tool use

00:02:15.479 --> 00:02:18.780
or, you know, fetching a specific document from

00:02:18.780 --> 00:02:21.319
a database. Or just writing a three-sentence

00:02:21.319 --> 00:02:24.030
summary of a text file. Exactly. So the advisor

00:02:24.030 --> 00:02:26.569
strategy is basically a structural fix for that

00:02:26.569 --> 00:02:28.490
inefficiency. Let me define that jargon for you

00:02:28.490 --> 00:02:32.009
real quick. A cheaper AI does basic work, calling

00:02:32.009 --> 00:02:35.370
a smarter AI for the hard parts. That is the perfect

00:02:35.370 --> 00:02:37.710
distillation of it. Yeah. Inside the system,

00:02:37.789 --> 00:02:41.169
you pair two models together. Okay. A fast, highly

00:02:41.169 --> 00:02:43.590
efficient model acts as the frontline worker.

00:02:43.810 --> 00:02:45.969
It handles all the routing, the searching, the

00:02:45.969 --> 00:02:49.009
basic parsing. Right. A much stronger, more expensive

00:02:49.009 --> 00:02:51.819
model sits in the background as an advisor. The

00:02:51.819 --> 00:02:54.219
frontline model only calls the advisor when it

00:02:54.219 --> 00:02:57.020
hits a wall. It makes me think of a high-end

00:02:57.020 --> 00:03:00.219
software engineering team. Imagine hiring a top-tier,

00:03:00.219 --> 00:03:02.840
world-renowned senior systems architect.

00:03:03.280 --> 00:03:06.319
You pay them an absolute fortune, but then you

00:03:06.319 --> 00:03:08.919
force them to sit there and fix basic markdown

00:03:08.919 --> 00:03:11.379
typos in your documentation all day? Right, and

00:03:11.379 --> 00:03:13.620
they'd probably quit out of sheer boredom. Exactly.

00:03:13.719 --> 00:03:16.080
You'd let a junior developer handle the typos.

00:03:16.159 --> 00:03:18.800
You only tap the senior architect when the server

00:03:18.800 --> 00:03:21.180
suddenly crashes. Or when the database architecture

00:03:21.180 --> 00:03:23.740
needs a fundamental redesign. Yeah, exactly.

00:03:24.099 --> 00:03:26.599
The analogy tracks perfectly. AI models don't

00:03:26.599 --> 00:03:29.379
have egos, obviously, but the economic principle

00:03:29.379 --> 00:03:32.960
is identical. Right. You simply don't need top-level

00:03:32.960 --> 00:03:35.569
intelligence from start to finish. If

00:03:35.569 --> 00:03:38.389
90% of a task stays on the cheaper default path,

00:03:38.610 --> 00:03:41.229
your whole system runs lighter. It runs drastically

00:03:41.229 --> 00:03:43.849
faster. And obviously it costs a fraction of

00:03:43.849 --> 00:03:45.870
the price. Which brings up a crucial operational

00:03:45.870 --> 00:03:49.289
question for you. How often do these mixed workflows

00:03:49.289 --> 00:03:52.409
actually need to escalate to the smarter model?

00:03:52.750 --> 00:03:54.729
Honestly, it's far less often than you'd intuitively

00:03:54.729 --> 00:03:57.430
expect. A massive chunk of daily computational

00:03:57.430 --> 00:04:00.330
work is just moving data around, which rarely

00:04:00.330 --> 00:04:03.669
requires deep reasoning. So basic tasks rarely

00:04:03.669 --> 00:04:06.129
escalate, keeping the whole system incredibly

00:04:06.129 --> 00:04:09.389
efficient. Yeah. Once you genuinely internalize

00:04:09.389 --> 00:04:12.289
intelligence as a managed resource, you stop

00:04:12.289 --> 00:04:14.430
asking which single AI should do everything.

00:04:14.650 --> 00:04:16.930
You start asking how to divide the labor. Yeah.

00:04:17.110 --> 00:04:19.550
Philosophically, putting a junior developer on

00:04:19.550 --> 00:04:22.850
typo duty makes perfect sense. But AI isn't human.

00:04:23.050 --> 00:04:25.910
Right. Does this theory actually survive contact

00:04:25.910 --> 00:04:29.029
with real world math? What happens when Anthropic

00:04:29.029 --> 00:04:32.449
forces these models to team up in a lab? That's

00:04:32.449 --> 00:04:34.850
where the benchmark data becomes incredibly relevant.

00:04:35.029 --> 00:04:37.509
Yeah. Anthropic ran this through SWE-bench.

00:04:37.879 --> 00:04:40.079
And that's a notoriously difficult software engineering

00:04:40.079 --> 00:04:42.920
benchmark, right? Yeah, it is. Let's clarify

00:04:42.920 --> 00:04:45.819
what SWE-bench actually tests. It's not just

00:04:45.819 --> 00:04:48.819
answering trivia. No, it tests complex, agentic

00:04:48.819 --> 00:04:52.339
tasks. The AI is given a real-world GitHub issue

00:04:52.339 --> 00:04:55.500
and a massive code base. Okay. It has to navigate

00:04:55.500 --> 00:04:58.689
the files, find the bug, write the patch, and

00:04:58.689 --> 00:05:00.990
ensure it passes tests. So it's heavily reliant

00:05:00.990 --> 00:05:03.550
on multi-step execution. Highly reliant. For

00:05:03.550 --> 00:05:06.129
this test, they used Claude 3.5 Sonnet as the

00:05:06.129 --> 00:05:08.850
frontline model, but they gave it Claude 3 Opus

00:05:08.850 --> 00:05:11.769
as an advisor. Opus is their most capable, deepest

00:05:11.769 --> 00:05:13.810
reasoning model. What were the actual results

00:05:13.810 --> 00:05:16.310
of pairing them up? The Sonnet and Opus combination

00:05:16.310 --> 00:05:18.930
improved the overall score by 2.7 percentage

00:05:18.930 --> 00:05:21.170
points compared to Sonnet working entirely alone.

00:05:21.449 --> 00:05:24.649
Okay, so the output quality went up. But what

00:05:24.649 --> 00:05:27.230
happened to the actual cost of running that test?

00:05:27.769 --> 00:05:30.509
That's the fascinating part. Adding the smarter,

00:05:30.589 --> 00:05:33.509
more expensive model actually cut the cost per

00:05:33.509 --> 00:05:37.560
task by almost 12%. Wait, really? Adding an expensive

00:05:37.560 --> 00:05:41.060
model lowered the total bill. Yes, because Sonnet

00:05:41.060 --> 00:05:43.540
was doing all the tedious, time-consuming file

00:05:43.540 --> 00:05:47.399
searching. It only invoked Opus for the final

00:05:47.399 --> 00:05:50.339
critical code generation. Wow. And they ran another

00:05:50.339 --> 00:05:53.240
test that's even more dramatic. They used BrowseComp,

00:05:53.379 --> 00:05:55.720
a web browsing benchmark. And they used Haiku

00:05:55.720 --> 00:05:57.620
for that one, right? Their absolute fastest,

00:05:57.939 --> 00:06:00.839
cheapest model. Exactly. Haiku alone scored a

00:06:00.839 --> 00:06:04.459
19.7% success rate. Okay. But when they gave

00:06:04.459 --> 00:06:07.529
Haiku access to Opus as an advisor, the score

00:06:07.529 --> 00:06:11.189
shot up to 41.2%. It more than doubled the performance.

00:06:11.370 --> 00:06:13.629
It doubled the performance. And crucially, it

00:06:13.629 --> 00:06:16.449
still costs significantly less than if they had

00:06:16.449 --> 00:06:18.810
just forced Opus to do the entire browsing task

00:06:18.810 --> 00:06:21.410
from scratch. I want to dive into the specific

00:06:21.410 --> 00:06:23.949
token economics here because this explains the

00:06:23.949 --> 00:06:26.009
why behind those savings. Yeah, let's do it.

00:06:26.089 --> 00:06:29.790
Let's look at the actual pricing. Opus is $5

00:06:29.790 --> 00:06:34.199
per million input tokens. But it's $25 per million

00:06:34.199 --> 00:06:37.180
output tokens. Right. And Sonnet is $3 for input,

00:06:37.360 --> 00:06:40.980
$15 for output. Haiku is just $1 for input and

00:06:40.980 --> 00:06:44.120
$5 for output. The pattern is glaring. Output

00:06:44.120 --> 00:06:46.399
tokens cost five times as much as input tokens

00:06:46.399 --> 00:06:49.040
across the board. Yeah, they do. Why is generating

00:06:49.040 --> 00:06:51.319
text so much more expensive than reading it?

00:06:51.439 --> 00:06:53.899
It comes down to how the underlying transformer

00:06:53.899 --> 00:06:57.839
architecture actually processes data. Okay. When

00:06:57.839 --> 00:07:02.959
you feed a model a massive document, the

00:07:02.959 --> 00:07:06.199
GPUs can process those tokens in parallel. Right.

00:07:06.339 --> 00:07:08.480
They crunch the whole block of text simultaneously.

00:07:08.480 --> 00:07:10.860
It's highly efficient. But generation is different.

00:07:11.060 --> 00:07:13.120
Generation is autoregressive. It happens sequentially.

00:07:13.279 --> 00:07:15.680
The model has to calculate the probability of

00:07:15.680 --> 00:07:18.300
the first word. Then it has to read that first

00:07:18.300 --> 00:07:21.019
word to calculate the second word. It's a continuous

00:07:21.019 --> 00:07:24.279
computationally heavy loop. So if you force an

00:07:24.279 --> 00:07:26.959
expensive model like Opus to output long strings

00:07:26.959 --> 00:07:29.379
of basic text, you're just setting money on fire.
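
NOTE
Editor's sketch, not part of the spoken episode: a rough worked example of the
token arithmetic discussed above. The per-million-token prices are the ones
quoted in the episode and are not verified here; the token counts and the
function name cost_usd are hypothetical, chosen only for illustration.
```python
# Prices in USD per million tokens, as quoted in the episode (unverified).
PRICES = {
    "opus": {"input": 5.0, "output": 25.0},
    "sonnet": {"input": 3.0, "output": 15.0},
    "haiku": {"input": 1.0, "output": 5.0},
}
#
def cost_usd(model, input_tokens, output_tokens):
    """Return the cost of one call in USD for the given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
#
# Hypothetical task: 200k tokens read and 20k tokens written, all on Opus.
all_opus = cost_usd("opus", 200_000, 20_000)  # 1.50 USD
# Hypothetical split: Haiku does the bulk, Opus advises once on a small slice.
split = cost_usd("haiku", 200_000, 18_000) + cost_usd("opus", 10_000, 2_000)  # about 0.39 USD
```
Under these assumed counts the split run costs roughly a quarter of the
all-Opus run; real savings depend on how often the advisor is called.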

00:07:29.879 --> 00:07:32.819
Precisely. You're paying a premium compute penalty

00:07:32.819 --> 00:07:35.800
for incredibly trivial generation. I need to

00:07:35.800 --> 00:07:38.360
push back on this a bit, though. Sure. Benchmarks

00:07:38.360 --> 00:07:41.300
are tightly controlled laboratory tests. The

00:07:41.300 --> 00:07:43.680
prompts are clean. The environments are static.

00:07:44.100 --> 00:07:47.160
True. How does this actually hold up in messy

00:07:47.160 --> 00:07:49.920
production environments with real unpredictable

00:07:49.920 --> 00:07:53.199
users? That's a very fair critique. Production

00:07:53.199 --> 00:07:56.560
environments are inherently chaotic. Yeah. Users

00:07:56.560 --> 00:07:58.980
will submit bizarre edge cases that benchmarks

00:07:58.980 --> 00:08:02.000
simply don't account for. The savings aren't

00:08:02.000 --> 00:08:04.759
an automatic guarantee. So does saving money

00:08:04.759 --> 00:08:07.860
this way mean we're sacrificing overall output

00:08:07.860 --> 00:08:10.439
quality? Not if your escalation logic is tight.

00:08:10.620 --> 00:08:13.439
You aren't forcing a weak model to guess. You're

00:08:13.439 --> 00:08:15.680
just stopping an expensive model from overworking.

00:08:15.800 --> 00:08:18.019
No, you just stop paying a premium for work that

00:08:18.019 --> 00:08:21.120
requires zero deep reasoning. Exactly. You protect

00:08:21.120 --> 00:08:23.259
your budget for the moments that actually require

00:08:23.259 --> 00:08:26.399
maximum intelligence. The economic value is clearly

00:08:26.399 --> 00:08:29.040
proven, but we really need to clarify where the

00:08:29.040 --> 00:08:31.759
strategy actually lives. Yeah, we do. Anthropic's

00:08:31.759 --> 00:08:34.080
toolset is expanding rapidly, and it's causing

00:08:34.080 --> 00:08:36.960
a lot of friction and confusion for users. The

00:08:36.960 --> 00:08:38.960
naming conventions definitely don't help. People

00:08:38.960 --> 00:08:41.840
conflate these distinct tools constantly. Let's

00:08:41.840 --> 00:08:44.460
define the core infrastructure first. What is

00:08:44.460 --> 00:08:47.720
the Claude Messages API? The raw building blocks

00:08:47.720 --> 00:08:51.460
developers use to create custom AI apps. That's

00:08:51.460 --> 00:08:54.409
the perfect way to frame it. The advisor strategy

00:08:54.409 --> 00:08:57.350
is a design pattern that lives inside the Claude

00:08:57.350 --> 00:09:00.350
Messages API. Okay. It is strictly for developers.

00:09:00.590 --> 00:09:03.250
It's for engineering teams writing their own

00:09:03.250 --> 00:09:07.049
code to dictate exactly how an AI system behaves.

00:09:07.549 --> 00:09:09.990
But then you have Claude Code. People hear about

00:09:09.990 --> 00:09:12.330
this advisor strategy and assume it's just baked

00:09:12.330 --> 00:09:15.169
into everything Anthropic releases. That's the

00:09:15.169 --> 00:09:17.490
biggest misconception right now. Claude Code is

00:09:17.490 --> 00:09:19.950
a completely different beast. Right. It's a finished

00:09:19.950 --> 00:09:22.980
product. It's a standalone terminal coding assistant

00:09:22.980 --> 00:09:25.879
that you install and use. It does not automatically

00:09:25.879 --> 00:09:29.120
use this intricate advisor routing setup under

00:09:29.120 --> 00:09:31.419
the hood. Wait, so this intelligent routing isn't

00:09:31.419 --> 00:09:33.259
just a toggle switch in the settings? Not at

00:09:33.259 --> 00:09:35.200
all. That routing is a highly specific capability.

00:09:35.480 --> 00:09:38.620
It has to be manually built and tuned by developers

00:09:38.620 --> 00:09:41.659
using the Messages API. You have to write the

00:09:41.659 --> 00:09:44.240
code that tells the models how to talk to each

00:09:44.240 --> 00:09:47.240
other. Let's try an analogy. Think of the Messages

00:09:47.240 --> 00:09:50.919
API like buying raw engine components. You're

00:09:50.919 --> 00:09:53.740
buying the pistons, the crankshaft, and the spark

00:09:53.740 --> 00:09:56.120
plugs to build your own custom race car from

00:09:56.120 --> 00:09:59.639
scratch. That's the API. But Claude Code is like

00:09:59.639 --> 00:10:02.059
walking onto a dealership lot and buying a finished

00:10:02.059 --> 00:10:04.960
sedan. You just turn the key and drive. You can't

00:10:04.960 --> 00:10:06.500
suddenly swap the engine out while you're on

00:10:06.500 --> 00:10:08.639
the highway. That perfectly captures the dynamic.

00:10:08.799 --> 00:10:12.720
The Messages API gives you immense granular control

00:10:12.720 --> 00:10:16.539
over model interaction. Claude Code prioritizes

00:10:16.940 --> 00:10:20.700
instant frictionless usability for a single user.

00:10:20.940 --> 00:10:23.080
There are other layers to this too, right? Like

00:10:23.080 --> 00:10:26.000
the agent SDK. Yeah, the agent SDK and managed

00:10:26.000 --> 00:10:28.440
agents are another layer of abstraction. They're

00:10:28.440 --> 00:10:30.659
designed to help teams package agentic behavior

00:10:30.659 --> 00:10:33.379
into their software more easily. Gotcha. But

00:10:33.379 --> 00:10:35.419
again, if you want to orchestrate this highly

00:10:35.419 --> 00:10:38.740
specific multi-model advisor routing, you are

00:10:38.740 --> 00:10:41.059
working directly in the API. Let me make sure

00:10:41.059 --> 00:10:44.080
this is crystal clear for you. If I am just using

00:10:44.080 --> 00:10:47.350
Claude Code, do I have this routing built in automatically?

00:10:47.690 --> 00:10:49.669
You don't. You're simply relying on whichever

00:10:49.669 --> 00:10:52.889
single model you selected for that specific coding

00:10:52.889 --> 00:10:56.090
session. No, this exact automated routing strategy

00:10:56.090 --> 00:10:59.389
requires building it yourself in the API. That's

00:10:59.389 --> 00:11:01.889
right. You have to intentionally design the handoff

00:11:01.889 --> 00:11:04.610
logic yourself. And building that custom logic

00:11:04.610 --> 00:11:07.950
takes incredible focus and reliable infrastructure.

00:11:08.830 --> 00:11:10.549
We're going to take a quick pause here for our

00:11:10.549 --> 00:11:12.590
sponsor. We'll be right back to talk about how

00:11:12.590 --> 00:11:16.049
you actually orchestrate this. We are back. So

00:11:16.049 --> 00:11:18.690
assuming we have our infrastructure sorted, let's

00:11:18.690 --> 00:11:21.190
explore the actual skill required to make this

00:11:21.190 --> 00:11:24.809
API routing work. It all comes down to orchestration

00:11:24.809 --> 00:11:27.549
and rigorous testing. This is where the engineering

00:11:27.549 --> 00:11:30.350
gets genuinely fascinating. Routing the simple,

00:11:30.450 --> 00:11:32.929
obvious tasks is an easy win. Sure. The real

00:11:32.929 --> 00:11:34.990
challenge is managing the transition point on

00:11:34.990 --> 00:11:37.490
the harder tasks. Why is the transition so difficult

00:11:37.490 --> 00:11:39.940
to manage? Because the core challenge isn't just

00:11:39.940 --> 00:11:42.980
solving a complex task. The system's primary

00:11:42.980 --> 00:11:46.019
hurdle is the frontline model actually noticing

00:11:46.019 --> 00:11:49.120
that the task is too hard. Yeah. It has to possess

00:11:49.120 --> 00:11:51.639
the self-awareness to recognize it needs the

00:11:51.639 --> 00:11:54.399
advisor. Whoa, imagine the system organically

00:11:54.399 --> 00:11:56.620
realizing it is out of its depth and calling

00:11:56.620 --> 00:11:59.799
for help. It's an incredible mechanism. You're

00:11:59.799 --> 00:12:02.779
essentially prompting a model to evaluate its

00:12:02.779 --> 00:12:05.740
own limitations in real time. That's wild. It

00:12:05.740 --> 00:12:09.480
has to pause mid-workflow and say, my confidence

00:12:09.480 --> 00:12:12.039
score on this is dropping. I cannot execute this

00:12:12.039 --> 00:12:14.340
reliably alone. That means the escalation logic

00:12:14.340 --> 00:12:16.980
is literally a component of the system's overall

00:12:16.980 --> 00:12:19.860
intelligence. Precisely. You have to give the

00:12:19.860 --> 00:12:23.059
cheaper model a specific tool, like an escalate

00:12:23.059 --> 00:12:25.519
to expert function. Right. But a weaker model

00:12:25.519 --> 00:12:27.899
might get lucky and guess the right answer without

00:12:27.899 --> 00:12:31.090
using the tool. Or worse, it might fail to notice

00:12:31.090 --> 00:12:33.710
the complexity entirely. I have to ask about

00:12:33.710 --> 00:12:36.309
a very specific risk here. Is there a danger

00:12:36.309 --> 00:12:38.789
of an AI version of the Dunning-Kruger effect?

00:12:39.149 --> 00:12:41.350
Expand on that. How do you mean? Well, the Dunning-Kruger

00:12:41.350 --> 00:12:44.909
effect in humans is when someone incompetent

00:12:44.909 --> 00:12:48.389
vastly overestimates their own ability. Right.

00:12:48.470 --> 00:12:52.200
Isn't a cheaper, less capable model inherently

00:12:52.200 --> 00:12:54.980
worse at judging its own competence? That feels

00:12:54.980 --> 00:12:57.240
like a paradox. Yeah. What if it confidently

00:12:57.240 --> 00:12:59.799
hallucinates a totally wrong answer instead of

00:12:59.799 --> 00:13:02.860
escalating? That paradox is the exact danger

00:13:02.860 --> 00:13:05.820
you have to engineer against. A cheaper model

00:13:05.820 --> 00:13:08.059
doesn't inherently know its own blind spots.

00:13:08.360 --> 00:13:10.120
So you can't just deploy this. No, you can't

00:13:10.120 --> 00:13:11.960
just deploy the strategy blindly and assume it

00:13:11.960 --> 00:13:14.220
works. You have to explicitly teach it how to

00:13:14.220 --> 00:13:16.669
doubt itself. Yes. You might write a meta prompt

00:13:16.669 --> 00:13:19.309
saying, if you cannot extract the exact variable

00:13:19.309 --> 00:13:21.450
in three attempts, or if the request involves

00:13:21.450 --> 00:13:24.490
multivariable calculus, you must use the escalate

00:13:24.490 --> 00:13:27.289
tool. You have to test those boundaries rigorously.
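
NOTE
Editor's sketch, not part of the spoken episode: one way the escalation rule
described above might be backed by a deterministic check outside the model.
The tool name escalate_to_expert, the keyword list, and should_escalate are
hypothetical; only the name/description/input_schema layout follows the shape
the Messages API uses for tool definitions. No API call is made here.
```python
# Hypothetical tool definition; the tool itself is invented for illustration.
ESCALATE_TOOL = {
    "name": "escalate_to_expert",
    "description": "Hand the current step to the stronger advisor model.",
    "input_schema": {
        "type": "object",
        "properties": {"reason": {"type": "string"}},
        "required": ["reason"],
    },
}
MAX_ATTEMPTS = 3  # from the example rule: three attempts, then escalate
HARD_TOPICS = ("multivariable calculus",)  # assumed keyword list
#
def should_escalate(failed_attempts, request_text):
    """Backstop for the prompt rule, evaluated by the orchestrating code."""
    if failed_attempts >= MAX_ATTEMPTS:
        return True
    text = request_text.lower()
    return any(topic in text for topic in HARD_TOPICS)
```
A check like this only covers cases the rule names explicitly; it does not
replace testing whether the frontline model escalates on its own judgment.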

00:13:27.490 --> 00:13:29.649
Because if it fails quietly and confidently,

00:13:29.970 --> 00:13:33.009
the user experience is destroyed. If the Haiku

00:13:33.009 --> 00:13:35.730
model misses critical context but confidently

00:13:35.730 --> 00:13:38.669
proceeds, the cost savings are entirely useless.

00:13:38.850 --> 00:13:41.559
If it escalates too late, you've already burned

00:13:41.559 --> 00:13:44.379
tokens on a doomed path. You're no longer just

00:13:44.379 --> 00:13:47.620
evaluating the AI's final text output. You are

00:13:47.620 --> 00:13:50.500
evaluating the system's internal judgment. So

00:13:50.500 --> 00:13:53.399
what is the main point of failure when designing

00:13:53.399 --> 00:13:55.919
this system? It's almost exclusively the routing

00:13:55.919 --> 00:14:00.039
logic. The cheaper model simply fails to recognize

00:14:00.039 --> 00:14:03.070
the edge of its own capabilities. Weak routing

00:14:03.070 --> 00:14:05.710
logic failing to escalate when the task suddenly

00:14:05.710 --> 00:14:08.330
becomes truly complex. It's the biggest trap

00:14:08.330 --> 00:14:11.549
developers fall into. They see the 12 % cost

00:14:11.549 --> 00:14:14.370
savings on a spreadsheet and rush the implementation

00:14:14.370 --> 00:14:17.389
without tuning the thresholds.

00:14:17.750 --> 00:14:19.370
Let's scale this back from the developer level

00:14:19.370 --> 00:14:22.250
for a moment. Let's talk about how you, the listener,

00:14:22.409 --> 00:14:25.490
can manually apply this mindset today. How can

00:14:25.490 --> 00:14:27.830
we integrate this philosophy into our everyday

00:14:27.830 --> 00:14:31.320
personal workflows? The underlying logic absolutely

00:14:31.320 --> 00:14:33.759
applies to the everyday user, even if you're

00:14:33.759 --> 00:14:35.820
just typing prompts into the standard web interface.

00:14:36.100 --> 00:14:38.879
Right. Your deliberate choice of model dictates

00:14:38.879 --> 00:14:41.259
your daily budget and your time efficiency. So

00:14:41.259 --> 00:14:43.480
how should we structure our own daily projects?

00:14:43.620 --> 00:14:45.399
Let's say I'm writing a massive research report.

00:14:45.740 --> 00:14:47.919
You should use Opus for the initial planning

00:14:47.919 --> 00:14:51.659
step. Then let Sonnet handle the heavy execution.

00:14:52.080 --> 00:14:55.480
Plan with Opus, execute with Sonnet. That's a

00:14:55.480 --> 00:14:59.409
remarkably clean framework. It's simple. But

00:14:59.409 --> 00:15:01.990
it fundamentally shifts how you work. Planning

00:15:01.990 --> 00:15:04.990
and execution require entirely different cognitive

00:15:04.990 --> 00:15:07.509
muscles from the AI. Explain the difference in

00:15:07.509 --> 00:15:10.029
those cognitive muscles. Planning requires high

00:15:10.029 --> 00:15:12.990
leverage holistic judgment. When you're outlining

00:15:12.990 --> 00:15:16.960
a research report, you need the AI to spot hidden

00:15:16.960 --> 00:15:21.000
logical gaps. You need it to synthesize disparate

00:15:21.000 --> 00:15:24.399
ideas into a cohesive structural vision. You

00:15:24.399 --> 00:15:26.860
want it to deeply understand the root problem.

00:15:27.019 --> 00:15:29.720
Yeah. That deep architectural synthesis is exactly

00:15:29.720 --> 00:15:32.460
where Opus shines. It's the architect. Opus is

00:15:32.460 --> 00:15:34.700
the architect drawing the intricate blueprints

00:15:34.700 --> 00:15:37.100
for the skyscraper. Yes. But once those blueprints

00:15:37.100 --> 00:15:39.080
are locked in, the nature of the work changes

00:15:39.080 --> 00:15:41.700
completely. How so? The execution phase, actually

00:15:41.700 --> 00:15:43.960
drafting the sections, is much longer. It's highly

00:15:43.960 --> 00:15:45.919
repetitive. It's essentially just following instructions.

00:15:46.259 --> 00:15:49.399
So Sonnet is the highly capable builder hammering

00:15:49.399 --> 00:15:51.559
the nails according to the blueprint. Exactly.

00:15:52.000 --> 00:15:54.500
Sonnet is brilliant at following a well-defined,

00:15:54.639 --> 00:15:58.539
strict plan. It writes well, it's fast, and it

00:15:58.539 --> 00:16:00.679
doesn't need to invent the core structure. And

00:16:00.679 --> 00:16:03.019
remember, you also have Haiku in your toolkit.

00:16:03.220 --> 00:16:05.639
Right. Where does Haiku fit into that research

00:16:05.639 --> 00:16:08.519
report workflow? Haiku is brilliant for the administrative

00:16:08.519 --> 00:16:12.039
microtasks. You need to quickly reformat a list

00:16:12.039 --> 00:16:15.690
of citations. Use Haiku. Okay. You need to summarize

00:16:15.690 --> 00:16:17.870
a 50-page PDF to see if it's worth reading.

00:16:18.490 --> 00:16:22.450
Use Haiku. Exploring a bunch of random tangential

00:16:22.450 --> 00:16:25.370
ideas very quickly. Haiku is perfect for that.

00:16:25.529 --> 00:16:28.049
You really are acting as the orchestrator of

00:16:28.049 --> 00:16:31.169
your own manual workflow. You are. You're internalizing

00:16:31.169 --> 00:16:33.129
the advisor strategy and applying it to your

00:16:33.129 --> 00:16:35.309
own digital life. But does the execution phase

00:16:35.309 --> 00:16:38.029
really require less intelligence than the planning

00:16:38.029 --> 00:16:41.460
phase? Almost universally, yes. Following a beautifully

00:16:41.460 --> 00:16:44.539
detailed map requires significantly less genius

00:16:44.539 --> 00:16:47.159
than charting the unknown territory to draw the

00:16:47.159 --> 00:16:49.600
map in the first place. Yes. Once the exact path

00:16:49.600 --> 00:16:52.259
is clear, steady execution is usually perfectly

00:16:52.259 --> 00:16:54.419
fine. That's why you place your strongest, most

00:16:54.419 --> 00:16:57.139
expensive model at the point of maximum leverage.

00:16:57.360 --> 00:17:00.039
Let the cheaper, faster models handle the steady,

00:17:00.220 --> 00:17:03.200
repetitive execution. We've covered a massive

00:17:03.200 --> 00:17:05.400
amount of technical and philosophical ground

00:17:05.400 --> 00:17:07.700
today. Let's bring it all together for you. We

00:17:07.700 --> 00:17:10.880
need to recap the central big idea here. We are

00:17:10.880 --> 00:17:13.240
sitting in the middle of a massive paradigm shift

00:17:13.240 --> 00:17:16.480
in how computing is structured. AI system design

00:17:16.480 --> 00:17:19.180
is moving completely away from brute force model

00:17:19.180 --> 00:17:22.700
selection. Two years ago, developers just asked,

00:17:23.109 --> 00:17:25.509
What is the smartest model I can possibly afford?

00:17:25.829 --> 00:17:28.309
That question is officially obsolete. Now the

00:17:28.309 --> 00:17:31.490
entire industry is pivoting to task design. The

00:17:31.490 --> 00:17:34.089
new foundational question is, what is the most

00:17:34.089 --> 00:17:36.529
elegant way to structure this specific task?

00:17:36.950 --> 00:17:39.150
System orchestration is quickly becoming the

00:17:39.150 --> 00:17:41.490
single most valuable skill in software engineering.

00:17:41.789 --> 00:17:44.470
Yeah. You have to know how to aggressively deconstruct

00:17:44.470 --> 00:17:47.109
work into micro components. You have to know

00:17:47.109 --> 00:17:50.450
exactly which model fits which specific fragmented

00:17:50.450 --> 00:17:53.019
piece. It's not about throwing maximum compute

00:17:53.019 --> 00:17:55.359
at every minor inconvenience anymore. It's about

00:17:55.359 --> 00:17:57.980
precision. It's about systemic elegance. The

00:17:57.980 --> 00:17:59.799
companies that win the next decade won't just

00:17:59.799 --> 00:18:01.779
be the ones paying for the most expensive API

00:18:01.779 --> 00:18:04.119
call. They'll be the ones who engineer systems

00:18:04.119 --> 00:18:06.960
that know exactly when not to use them. This

00:18:06.960 --> 00:18:09.200
leaves us with a really fascinating thought to

00:18:09.200 --> 00:18:12.400
end on. I want you to actively think about this

00:18:12.400 --> 00:18:15.619
as you go about your work today. Look closely

00:18:15.619 --> 00:18:18.680
at your own daily workflow. Look at how you deploy

00:18:18.680 --> 00:18:21.390
your own human brain. How do you mean? Exactly.

00:18:21.470 --> 00:18:23.750
How much of your actual workday requires your

00:18:23.750 --> 00:18:28.089
personal Opus-level deep reasoning? Think about

00:18:28.089 --> 00:18:31.009
the specific tasks that demand your absolute

00:18:31.009 --> 00:18:34.569
best unfiltered cognitive effort. The deep strategy.

00:18:34.730 --> 00:18:37.569
The complex problem solving. And then look at

00:18:37.569 --> 00:18:39.589
everything else. How much of your day is just

00:18:39.589 --> 00:18:42.309
Haiku-level mindless administrative execution?

00:18:42.730 --> 00:18:45.619
Answering basic emails. Formatting spreadsheets.

00:18:45.720 --> 00:18:48.660
Exactly. What if you started fiercely preserving

00:18:48.660 --> 00:18:50.980
your own mental bandwidth? That is a phenomenal

00:18:50.980 --> 00:18:53.420
way to frame personal productivity. What if you

00:18:53.420 --> 00:18:56.099
actively managed your human energy the exact

00:18:56.099 --> 00:18:58.559
same way we are teaching these APIs to manage

00:18:58.559 --> 00:19:01.119
their compute cycles? Save your deep reasoning

00:19:01.119 --> 00:19:03.380
for the blueprints. Keep questioning. Keep exploring.

00:19:03.579 --> 00:19:05.380
Catch you next time on the Deep Dive.