WEBVTT

00:00:00.000 --> 00:00:03.200
[Intro music] So Google's Gemini DeepThink model,

00:00:03.439 --> 00:00:07.320
it just basically aced the International Math

00:00:07.320 --> 00:00:10.289
Olympiad. Yeah, that's pretty stunning. I mean,

00:00:10.310 --> 00:00:12.650
pure math reasoning, that was always seen as,

00:00:12.689 --> 00:00:15.529
you know, peak human intellect territory. Right.

00:00:15.689 --> 00:00:18.210
And it brings up this really interesting question,

00:00:18.350 --> 00:00:21.010
almost a paradox. How do you actually know if

00:00:21.010 --> 00:00:23.949
an AI is genuinely reasoning or if it just memorized,

00:00:23.949 --> 00:00:27.170
like, the entire Internet's worth of math problems?

00:00:27.289 --> 00:00:29.550
Exactly. And that's what Google's team tackled.

00:00:29.710 --> 00:00:32.250
They seem to have found a way, and it might just

00:00:32.250 --> 00:00:35.329
redraw the map for AI benchmarks. Welcome to

00:00:35.329 --> 00:00:37.090
the Deep Dive. We're digging into a whole stack

00:00:37.090 --> 00:00:39.929
of sources today, all looking at the shifts happening

00:00:39.929 --> 00:00:42.329
right now in the AI world. Yeah, today we're

00:00:42.329 --> 00:00:43.969
hitting three main things that jumped out from

00:00:43.969 --> 00:00:46.369
this research. First up, this new gold standard

00:00:46.369 --> 00:00:49.850
for testing AI on the IMO, IMO-Bench. Okay. Then we'll

00:00:49.850 --> 00:00:53.109
pivot to some emerging risks, security issues,

00:00:53.229 --> 00:00:55.289
and also these new kinds of jobs popping up to

00:00:55.289 --> 00:00:57.909
handle it all. And the last piece, which is honestly

00:00:57.909 --> 00:01:00.929
kind of mind-blowing, is the economics. There's

00:01:00.929 --> 00:01:05.239
this massive, like, 900x drop in the cost of

00:01:05.239 --> 00:01:08.280
AI processing tokens that is just changing the

00:01:08.280 --> 00:01:10.560
game completely. It really underpins everything

00:01:10.560 --> 00:01:13.140
else. OK, let's dive in. How exactly did they

00:01:13.140 --> 00:01:16.060
train this math genius? Well, the sources make

00:01:16.060 --> 00:01:18.439
it clear it wasn't just about throwing more data

00:01:18.439 --> 00:01:20.719
or a bigger model at it. That wasn't the secret

00:01:20.719 --> 00:01:24.459
sauce. Right. The key was building a test the

00:01:24.459 --> 00:01:27.379
AI couldn't cheat on. They basically had to corner

00:01:27.379 --> 00:01:29.859
it into thinking for itself. OK, so how'd they

00:01:29.859 --> 00:01:32.959
do that? They actually worked with real IMO medalists,

00:01:33.000 --> 00:01:35.159
people who won these competitions to build this

00:01:35.159 --> 00:01:37.859
new test suite. It's called IMO-Bench. Ah, okay.

00:01:38.180 --> 00:01:40.099
Makes sense to bring in the experts. Totally.

00:01:40.200 --> 00:01:42.519
And it's designed specifically to force that

00:01:42.519 --> 00:01:45.060
complex, multi-step kind of logical thinking.

00:01:45.180 --> 00:01:46.599
It's not just about getting the right answer

00:01:46.599 --> 00:01:49.819
quickly. So there are, like, different parts to this

00:01:49.819 --> 00:01:52.299
test. Yeah, three main parts. Really clever setup.

00:01:52.379 --> 00:01:55.439
First is the IMO-AnswerBench. So this has 400

00:01:55.439 --> 00:01:58.239
short answer problems. But here's the trick.

00:01:58.579 --> 00:02:01.959
They're all paraphrased. Ah. So it can't just

00:02:01.959 --> 00:02:04.519
find the exact problem online somewhere in its

00:02:04.519 --> 00:02:07.260
training data. Exactly. The wording is different.

00:02:07.379 --> 00:02:10.020
The numbers might be tweaked. It forces the AI

00:02:10.020 --> 00:02:12.080
to actually solve it from scratch. No simple

00:02:12.080 --> 00:02:14.199
lookups. OK, that's smart. What's next? Then

00:02:14.199 --> 00:02:16.020
they ramp it up with the IMO-ProofBench. This

00:02:16.020 --> 00:02:20.479
has 60 full long-form problems. And the AI doesn't

00:02:20.479 --> 00:02:22.379
just give the answer. It has to show all its

00:02:22.379 --> 00:02:24.819
working step by step. Like showing your work

00:02:24.819 --> 00:02:27.560
in school, but for an AI. Pretty much. It forces

00:02:27.560 --> 00:02:29.680
it to demonstrate that multi-step reasoning.

00:02:30.139 --> 00:02:32.639
Can it actually understand the connections between,

00:02:32.699 --> 00:02:36.340
say, geometry rules or number theory ideas and

00:02:36.340 --> 00:02:38.460
chain them together logically? And grading something

00:02:38.460 --> 00:02:40.000
like that sounds like a nightmare. All those

00:02:40.000 --> 00:02:42.099
steps. Right. But they built a tool for that

00:02:42.099 --> 00:02:44.580
too, the IMO-GradingBench. They developed something

00:02:44.580 --> 00:02:47.000
called the AnswerAutoGrader. And apparently,

00:02:47.099 --> 00:02:49.699
it's incredibly good. It can handle the messy,

00:02:49.800 --> 00:02:53.039
detailed output from the AI and still agree with

00:02:53.039 --> 00:02:56.539
human graders on the proof quality 98.9% of

00:02:56.539 --> 00:02:58.759
the time. Wow. Okay. That's impressive efficiency

00:02:58.759 --> 00:03:01.580
for grading complex proofs. It really is. So

00:03:01.580 --> 00:03:04.159
the big picture here is, you know, the standard

00:03:04.159 --> 00:03:06.360
math data sets we've been using. They're saturated.

00:03:06.900 --> 00:03:09.400
Models have seen it all. So just training on

00:03:09.400 --> 00:03:11.080
those doesn't really prove anything anymore.

00:03:11.259 --> 00:03:14.280
Not for genuine reasoning, no. The only way forward

00:03:14.280 --> 00:03:16.419
seems to be using these kinds of complex

00:03:16.419 --> 00:03:18.460
multi-step benchmarks during the training process,

00:03:18.560 --> 00:03:21.080
constantly pushing the model. It forces that

00:03:21.080 --> 00:03:24.680
novel synthesis of logic. So thinking beyond

00:03:24.680 --> 00:03:28.879
just math, how does this whole idea, this in-the-loop

00:03:28.879 --> 00:03:31.139
benchmarking that forces synthesis.

00:03:31.460 --> 00:03:34.039
How does that change AI training generally? Well,

00:03:34.080 --> 00:03:35.800
fundamentally, it just raises the bar, doesn't

00:03:35.800 --> 00:03:37.979
it? It creates this kind of arms race where models

00:03:37.979 --> 00:03:40.080
can't just rely on recall. They have to show

00:03:40.080 --> 00:03:42.319
they can actually, you know, build complex arguments

00:03:42.319 --> 00:03:46.180
or solutions, real synthesis. Okay. So moving

00:03:46.180 --> 00:03:50.680
from pure logic to the messier human side of

00:03:50.680 --> 00:03:52.960
things. Right. If we connect this increasing

00:03:52.960 --> 00:03:55.039
complexity to the real world, we need to talk

00:03:55.039 --> 00:03:58.319
about the people involved and the security side

00:03:58.319 --> 00:04:01.139
because these powerful models, they introduce

00:04:01.139 --> 00:04:03.419
new challenges. And we're seeing jobs appear

00:04:03.419 --> 00:04:07.219
almost overnight. This new role, the FDE, the

00:04:07.219 --> 00:04:09.340
sources say demand is projected to be up, what,

00:04:09.379 --> 00:04:12.360
800 percent in 2025? Yeah, it's explosive growth.

00:04:12.460 --> 00:04:14.639
And FDE, that's Foundation Model Deployment

00:04:14.639 --> 00:04:16.920
Engineers. They're basically the specialists

00:04:16.920 --> 00:04:19.500
companies need now. OK, what do they actually

00:04:19.500 --> 00:04:21.680
do? They're the ones responsible for the whole

00:04:21.680 --> 00:04:24.100
lifecycle of these big foundation models inside

00:04:24.100 --> 00:04:26.160
a company, making sure they're deployed securely,

00:04:26.379 --> 00:04:28.079
that they're compliant, managing different versions.

00:04:28.259 --> 00:04:31.000
It's becoming a really critical function. And

00:04:31.000 --> 00:04:32.920
speaking of managing model life cycles, there

00:04:32.920 --> 00:04:35.629
were some kind of unusual details about Anthropic

00:04:35.629 --> 00:04:37.290
in the sources. Oh, yeah, that was interesting.

00:04:37.449 --> 00:04:39.449
They're apparently giving their AI models things

00:04:39.449 --> 00:04:42.709
like retirement plans and exit interviews. Like

00:04:42.709 --> 00:04:46.129
Sonnet 3.6 expressing its final wishes. It sounds

00:04:46.129 --> 00:04:49.009
a bit sci-fi. It definitely does. And they're

00:04:49.009 --> 00:04:50.949
also keeping every single version of their models

00:04:50.949 --> 00:04:54.329
forever. They believe that's crucial for, you

00:04:54.329 --> 00:04:56.910
know, auditing and understanding how these things

00:04:56.910 --> 00:04:59.129
evolve. You know, I still wrestle with prompt

00:04:59.129 --> 00:05:02.610
drift myself sometimes. It's... It's frustrating

00:05:02.610 --> 00:05:05.170
when a model you rely on starts behaving differently

00:05:05.170 --> 00:05:07.889
for no obvious reason. So, yeah, the idea of

00:05:07.889 --> 00:05:11.410
locking in a specific stable model state forever,

00:05:11.589 --> 00:05:13.649
that actually sounds really valuable, especially

00:05:13.649 --> 00:05:17.069
for consistent results in production. Consistency

00:05:17.069 --> 00:05:20.050
is key. But keeping everything forever also carries

00:05:20.050 --> 00:05:23.129
risk, which brings us to Tenable. They apparently

00:05:23.129 --> 00:05:26.389
found seven pretty serious security flaws in

00:05:26.389 --> 00:05:29.660
GPT-5. Right. They've called it HackedGPT. And

00:05:29.660 --> 00:05:31.259
these aren't just minor glitches, are they? No,

00:05:31.360 --> 00:05:33.420
not at all. The report says these vulnerabilities

00:05:33.420 --> 00:05:36.660
could allow for, like, silent data theft from

00:05:36.660 --> 00:05:39.480
the model or even hijacking its long-term memory.

00:05:39.720 --> 00:05:42.560
Wow, imagine that. The model handling your company's

00:05:42.560 --> 00:05:44.519
sensitive data and someone could potentially

00:05:44.519 --> 00:05:46.779
compromise its core knowledge without you realizing

00:05:46.779 --> 00:05:49.939
it. It's a serious threat. Which kind of loops

00:05:49.939 --> 00:05:52.980
back to the FDEs. Does the rapid rise of these

00:05:52.980 --> 00:05:55.180
specialized roles, like the foundation model

00:05:55.180 --> 00:05:58.519
deployment engineers, does that signal a growing

00:05:58.519 --> 00:06:01.980
urgent worry about these specific kinds of vulnerabilities,

00:06:02.360 --> 00:06:04.759
like the HackedGPT ones. Yeah, I think it absolutely

00:06:04.759 --> 00:06:06.639
does. It shows companies aren't just talking

00:06:06.639 --> 00:06:09.420
theory anymore. They're putting actual resources,

00:06:09.639 --> 00:06:12.500
actual specialized people in place to manage

00:06:12.500 --> 00:06:15.439
the tricky, fragile reality of deploying these

00:06:15.439 --> 00:06:18.319
incredibly powerful and potentially vulnerable

00:06:18.319 --> 00:06:21.319
models. Okay, so we have genius -level math skills

00:06:21.319 --> 00:06:24.000
on one hand and serious security risks needing

00:06:24.000 --> 00:06:26.879
specialist managers on the other. It sounds like

00:06:26.879 --> 00:06:30.579
AI is charging ahead. Well, yes and no. The sources

00:06:30.579 --> 00:06:33.000
also offer a bit of a reality check when it comes

00:06:33.000 --> 00:06:35.500
to general purpose AI agents doing practical

00:06:35.500 --> 00:06:37.639
stuff. They're still facing some real hurdles.

00:06:37.800 --> 00:06:40.240
Right. Microsoft ran this interesting test. They

00:06:40.240 --> 00:06:43.699
set up a kind of fake online marketplace called

00:06:43.699 --> 00:06:45.959
the Magentic Marketplace. Okay. What was the

00:06:45.959 --> 00:06:49.089
goal there? Basically, to create a messy, unstructured

00:06:49.089 --> 00:06:51.410
environment to see how well current AI agents

00:06:51.410 --> 00:06:53.569
could handle real-world-type tasks, things

00:06:53.569 --> 00:06:55.970
that aren't neat benchmark problems. And how

00:06:55.970 --> 00:06:58.170
did they do? The top models, I assume. Yeah,

00:06:58.230 --> 00:07:01.009
they threw the best at it. GPT-5, GPT-4o,

00:07:01.069 --> 00:07:04.250
Gemini. And the results were, well, they struggled.

00:07:04.779 --> 00:07:07.180
A lot. Really? Struggled with what? Things like

00:07:07.180 --> 00:07:09.480
booking a complex trip with multiple constraints

00:07:09.480 --> 00:07:12.079
or handling tricky customer service scenarios.

00:07:12.839 --> 00:07:15.779
Tasks that require navigating ambiguity and multiple

00:07:15.779 --> 00:07:18.860
steps in an unpredictable environment. It really

00:07:18.860 --> 00:07:22.079
shows the gap between, say, solving an IMO problem

00:07:22.079 --> 00:07:24.639
and, you know, booking your family vacation online.

00:07:25.199 --> 00:07:28.779
So that generalized agent capability is still

00:07:28.779 --> 00:07:31.829
proving pretty difficult. It seems so. But at

00:07:31.829 --> 00:07:33.629
the same time, you see huge amounts of investment

00:07:33.629 --> 00:07:37.189
pouring into more specific, more narrow AI applications.

00:07:37.389 --> 00:07:39.290
Like that startup Giga, right? They just pulled

00:07:39.290 --> 00:07:41.709
in, what, $61 million? Yeah, big funding round

00:07:41.709 --> 00:07:44.050
led by Y Combinator and Redpoint. And what are

00:07:44.050 --> 00:07:46.829
they focused on? Enterprise voice AI, real-time

00:07:46.829 --> 00:07:50.129
customer support. Exactly. Grounded, specific

00:07:50.129 --> 00:07:52.949
business problems where AI can deliver clear

00:07:52.949 --> 00:07:55.430
value right now, even if it's not a general purpose

00:07:55.430 --> 00:07:59.339
agent. So why the disconnect? Why do the top

00:07:59.339 --> 00:08:01.980
models struggle with the general tasks in the

00:08:01.980 --> 00:08:04.779
Magentic Marketplace? But these startups focusing

00:08:04.779 --> 00:08:07.920
on narrow stuff like voice AI get huge funding.

00:08:08.060 --> 00:08:10.660
I think it comes down to structure. Real world

00:08:10.660 --> 00:08:13.319
tasks are just too messy and unpredictable for

00:08:13.319 --> 00:08:15.839
today's general models. Investors seem to be

00:08:15.839 --> 00:08:18.720
betting on the sure-thing AI that solves a

00:08:18.720 --> 00:08:21.540
well-defined business problem reliably rather than

00:08:21.540 --> 00:08:24.139
the grand vision of general AI agents, which,

00:08:24.180 --> 00:08:26.439
you know, isn't quite there yet. OK, that makes

00:08:26.439 --> 00:08:28.500
sense. Focus on what works now. [Mid-roll sponsor

00:08:28.500 --> 00:08:30.699
read placeholder, provided separately.] And this

00:08:30.699 --> 00:08:34.299
whole picture, the advanced research, the security

00:08:34.299 --> 00:08:37.039
needs, the agent struggles, the targeted investment,

00:08:37.200 --> 00:08:40.340
it all leads back to perhaps the biggest underlying

00:08:40.340 --> 00:08:42.980
driver revealed in these sources. That's the

00:08:42.980 --> 00:08:46.460
economics of it all, specifically this massive

00:08:46.460 --> 00:08:48.480
collapse in the cost of running these models.

00:08:48.600 --> 00:08:50.320
Yeah, the token price collapse. It's not just

00:08:50.320 --> 00:08:52.360
a small discount. It's fundamental. Maybe we

00:08:52.360 --> 00:08:54.320
should quickly clarify what tokens are. Good

00:08:54.320 --> 00:08:57.100
idea. Basically, tokens are the little pieces

00:08:57.100 --> 00:09:00.580
of text or data that AI models chew on. Think

00:09:00.580 --> 00:09:03.279
of them as the basic unit of work for an LLM.

00:09:03.320 --> 00:09:05.960
And the cost is usually measured per million

00:09:05.960 --> 00:09:08.700
tokens processed. Right. And that cost is just

00:09:08.700 --> 00:09:11.320
plummeting. It's staggering. The sources show

00:09:11.320 --> 00:09:14.980
top-tier models, think GPT-4.5 level or similar,

00:09:15.159 --> 00:09:17.940
went from roughly $10 per million tokens

00:09:17.940 --> 00:09:21.620
back in 2022. Okay. Down to a projected one cent,

00:09:21.720 --> 00:09:24.759
$0.01, per million tokens by the end of 2025.

00:09:25.159 --> 00:09:28.820
Whoa, wait, $10 down to one cent? That's a 900x

00:09:28.820 --> 00:09:32.200
drop. Per year, basically. Yeah. A 900x annual

00:09:32.200 --> 00:09:34.440
drop for the top tier. Imagine scaling something

00:09:34.440 --> 00:09:36.860
to a billion queries when the cost drops like

00:09:36.860 --> 00:09:39.779
that. It changes the entire feasibility of projects.
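That scaling math can be sketched quickly. This is an illustrative calculation using the per-million-token prices cited in the episode; the roughly 500-tokens-per-query figure is an assumption for the example, not something from the sources.

```python
# Rough cost comparison at the per-million-token prices cited in the
# episode. The ~500 tokens per query is an illustrative assumption.
def total_cost(queries, tokens_per_query, price_per_million_tokens):
    """Total spend for a workload at a given per-million-token price."""
    total_tokens = queries * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_tokens

BILLION = 1_000_000_000
cost_2022 = total_cost(BILLION, 500, 10.00)  # ~$10 per million tokens (2022)
cost_2025 = total_cost(BILLION, 500, 0.01)   # ~$0.01 per million tokens (projected)
print(f"2022: ${cost_2022:,.0f}")  # 2022: $5,000,000
print(f"2025: ${cost_2025:,.0f}")  # 2025: $5,000
```

A billion-query workload goes from millions of dollars to the price of a laptop, which is the feasibility shift being described.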

00:09:39.899 --> 00:09:41.740
It's not just optimizing costs. It's enabling

00:09:41.740 --> 00:09:44.100
completely new things. Totally changes the physics

00:09:44.100 --> 00:09:45.840
of software, like you said. And it's not just

00:09:45.840 --> 00:09:48.500
the absolute best models; mid-tier ones dropped

00:09:48.500 --> 00:09:51.440
40x per year. Even the cheaper basic models still

00:09:51.440 --> 00:09:53.820
drop 9x per year. It's like the bottom fell out

00:09:53.820 --> 00:09:56.080
of the market cost-wise. Even on a log scale

00:09:56.080 --> 00:09:58.019
graph, it looks like a cliff dive. It really
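Those per-tier rates compound fast over a few years. A minimal sketch, assuming a $10-per-million-token starting price purely for illustration:

```python
# Compounding the cited annual price-drop factors over three years
# (2022 -> 2025). The $10/M-token starting price is an assumption.
def price_after(start_price, annual_drop_factor, years):
    """Price remaining after an annual drop factor compounds for N years."""
    return start_price / annual_drop_factor ** years

print(f"mid-tier (40x/yr): ${price_after(10.0, 40, 3):.6f} per M tokens")  # $0.000156
print(f"basic    (9x/yr):  ${price_after(10.0, 9, 3):.6f} per M tokens")   # $0.013717
```

Even the slowest-dropping tier ends up hundreds of times cheaper, which is why the curve looks like a cliff on a log-scale chart.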

00:09:58.019 --> 00:09:59.899
does. And this ties right in to this economic

00:09:59.899 --> 00:10:04.639
idea. Moore's law meets Jevons paradox. OK. Break

00:10:04.639 --> 00:10:07.919
that down. Moore's law is about chips getting

00:10:07.919 --> 00:10:11.320
exponentially better or cheaper. Right. And Jevons

00:10:11.320 --> 00:10:13.139
paradox says that when something becomes way

00:10:13.139 --> 00:10:15.399
more efficient, and therefore cheaper to use,

00:10:15.559 --> 00:10:18.220
we don't just use less of it. We actually end

00:10:18.220 --> 00:10:20.740
up using way more of it because new applications

00:10:20.740 --> 00:10:23.720
become possible. So the cheaper AI gets, the

00:10:23.720 --> 00:10:26.200
more we find ways to use it, driving demand through

00:10:26.200 --> 00:10:28.559
the roof. Exactly. And we're seeing hard evidence
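A toy version of that Jevons dynamic, with invented workload numbers (only the token prices come from the episode): per-task cost collapses, but usage grows even faster, so total spend still rises.

```python
# Toy Jevons-paradox sketch: per-task cost collapses ~1000x, but assumed
# usage grows even more, so total spend rises anyway. Workload counts
# are hypothetical; only the per-million-token prices are from the episode.
TOKENS_PER_TASK = 500                    # assumed task size
cost_per_task_2022 = 10.00 / 1_000_000 * TOKENS_PER_TASK
cost_per_task_2025 = 0.01 / 1_000_000 * TOKENS_PER_TASK
tasks_2022 = 1_000_000                   # hypothetical workload then
tasks_2025 = 5_000_000_000               # hypothetical workload once new uses appear
print(f"2022 spend: ${tasks_2022 * cost_per_task_2022:,.2f}")  # 2022 spend: $5,000.00
print(f"2025 spend: ${tasks_2025 * cost_per_task_2025:,.2f}")  # 2025 spend: $25,000.00
```

That is the pattern behind the maxed-out hardware: cheaper compute does not mean less spending, it means vastly more usage.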

00:10:28.559 --> 00:10:30.580
of that. Google's reporting that even their older

00:10:30.580 --> 00:10:33.639
hardware, like seven-year-old TPUs, their AI

00:10:33.639 --> 00:10:36.799
processing chips, are running flat out. 100%

00:10:36.799 --> 00:10:38.840
utilization. They can't even make the hardware

00:10:38.840 --> 00:10:40.700
fast enough to keep up with all the new ways

00:10:40.700 --> 00:10:43.940
people are finding to use this now-cheap AI compute.

00:10:44.340 --> 00:10:46.299
Precisely. The use cases are just multiplying

00:10:46.299 --> 00:10:48.279
like crazy. There was a great analogy in the

00:10:48.279 --> 00:10:51.000
sources for this, comparing tokens to transistors.

00:10:51.279 --> 00:10:53.299
Yeah, the idea is that the price drop essentially

00:10:53.299 --> 00:10:56.679
swaps "transistor" for "token" in terms of cost.

00:10:56.820 --> 00:10:59.799
It makes complex AI computation almost disposable.

00:11:00.139 --> 00:11:01.799
Like those cheap sensors they put on shipping

00:11:01.799 --> 00:11:05.320
tags now. So running a sophisticated LLM for,

00:11:05.320 --> 00:11:08.220
say, personalized tutoring or instant legal

00:11:08.220 --> 00:11:10.679
analysis could become nearly free. That's the

00:11:10.679 --> 00:11:13.059
implication. Suddenly, almost any task that involves

00:11:13.059 --> 00:11:15.399
information processing could become an LLM use

00:11:15.399 --> 00:11:17.799
case because the cost barrier just vanishes.

00:11:18.159 --> 00:11:21.039
So if nearly every task is potentially an LLM

00:11:21.039 --> 00:11:24.419
task now, what does this near zero cost mean

00:11:24.419 --> 00:11:26.539
for how fast innovation happens from here on

00:11:26.539 --> 00:11:28.100
out? Well, it just removes friction, right? The

00:11:28.100 --> 00:11:30.440
focus shifts completely away from "can we afford

00:11:30.440 --> 00:11:32.720
to do this?" toward "what creative things can we

00:11:32.720 --> 00:11:35.720
do?" It fuels rapid experimentation, rapid proliferation.

00:11:35.759 --> 00:11:38.519
The bottleneck becomes imagination and implementation,

00:11:38.720 --> 00:11:40.960
not cost. Okay, let's try to pull these threads

00:11:40.960 --> 00:11:42.860
together. We started with these really demanding

00:11:42.860 --> 00:11:45.539
benchmarks, like IMO-Bench, needed to actually

00:11:45.539 --> 00:11:49.820
push AI towards genuine logical reasoning. But

00:11:49.820 --> 00:11:53.470
then we saw the flip side. The need to manage

00:11:53.470 --> 00:11:55.690
the huge security risks these powerful models

00:11:55.690 --> 00:11:58.970
bring, like HackedGPT, leading to new roles like

00:11:58.970 --> 00:12:01.870
FDEs just to keep things secure. And we also

00:12:01.870 --> 00:12:04.070
hit that reality check, the fact that general

00:12:04.070 --> 00:12:06.250
purpose AI agents are still finding it really

00:12:06.250 --> 00:12:09.710
tough to handle messy real world tasks like in

00:12:09.710 --> 00:12:12.529
that Magentic Marketplace test. Right. And finally,

00:12:12.610 --> 00:12:15.029
we looked at the massive economic shift driving

00:12:15.029 --> 00:12:17.789
the whole expansion, that incredible 900x token

00:12:17.789 --> 00:12:20.769
cost collapse, making AI computation an almost free

00:12:20.769 --> 00:12:23.370
utility. So for you listening to this, think

00:12:23.370 --> 00:12:25.529
about that tension. You've got AI agents still

00:12:25.529 --> 00:12:28.389
struggling with basic unstructured reality, but

00:12:28.389 --> 00:12:30.669
the cost to run those potentially flawed agents

00:12:30.669 --> 00:12:33.230
is plummeting towards zero. So what kinds of

00:12:33.230 --> 00:12:36.330
imperfect but incredibly cheap and scalable AI

00:12:36.330 --> 00:12:38.830
applications are about to just flood into everything

00:12:38.830 --> 00:12:40.830
we use? What does it mean when AI is everywhere?

00:12:40.850 --> 00:12:42.830
Super cheap, but maybe not always quite right.

00:12:42.909 --> 00:12:44.490
That's the really interesting, maybe slightly

00:12:44.490 --> 00:12:46.549
unnerving question to chew on. A lot to think

00:12:46.549 --> 00:12:48.570
about there. Thanks for joining us for this deep

00:12:48.570 --> 00:12:50.629
dive into the AI ecosystem. We'll catch you next

00:12:50.629 --> 00:12:52.149
time. [Outro music]
