WEBVTT

00:00:00.000 --> 00:00:03.140
You know, AI is achieving AGI capabilities right

00:00:03.140 --> 00:00:05.559
now, but maybe not where you'd expect. It's happening

00:00:05.559 --> 00:00:08.619
pretty quietly, actually, inside production code

00:00:08.619 --> 00:00:12.140
bases. We're talking about AI autonomously fixing

00:00:12.140 --> 00:00:15.160
really complex bugs, submitting finalized code

00:00:15.160 --> 00:00:18.879
changes, no direct human intervention. Welcome

00:00:18.879 --> 00:00:21.379
to the Deep Dive. This is where we take the week's

00:00:21.379 --> 00:00:23.800
key AI research and news, and while we distill

00:00:23.800 --> 00:00:26.539
it into a quick, deep analysis for you, we're

00:00:26.539 --> 00:00:28.079
really trying to jump into the progress that

00:00:28.079 --> 00:00:31.280
often goes completely unreported. Yeah, exactly.

00:00:31.440 --> 00:00:33.560
We're looking past the sort of viral chat apps

00:00:33.560 --> 00:00:35.520
today, getting right into the, let's say, the

00:00:35.520 --> 00:00:38.460
technical engine room of autonomy. So we've structured

00:00:38.460 --> 00:00:40.539
this dive around three main things for you. First,

00:00:40.719 --> 00:00:43.179
why coding is perhaps the real epicenter for

00:00:43.179 --> 00:00:46.380
AI autonomy right now. Second, some key technical

00:00:46.380 --> 00:00:48.179
highlights, stuff like major infrastructure fixes,

00:00:48.420 --> 00:00:49.899
public performance checks, that kind of thing.

00:00:50.020 --> 00:00:51.979
And finally, we'll take a really detailed look

00:00:51.979 --> 00:00:54.780
at Delphi-2M. That's an AI forecasting serious

00:00:54.780 --> 00:00:58.280
disease risk, maybe 20 years out. Okay, let's

00:00:58.280 --> 00:01:00.100
unpack that first part, starting with software

00:01:00.100 --> 00:01:02.899
development. The central idea from the sources

00:01:02.899 --> 00:01:06.799
seems pretty clear. Coding, you know, writing,

00:01:07.019 --> 00:01:09.219
debugging, implementing functional programs.

00:01:09.500 --> 00:01:12.659
That's the real canary for AGI. And this revolution,

00:01:12.840 --> 00:01:15.659
it's been so gradual. Maybe people kind of missed

00:01:15.659 --> 00:01:17.739
the seismic shift happening. Well, what's really

00:01:17.739 --> 00:01:19.920
fascinating is how these capabilities got added,

00:01:20.019 --> 00:01:22.620
like in layers, often without even changing the

00:01:22.620 --> 00:01:24.920
user experience much. So the change felt subtle,

00:01:25.000 --> 00:01:27.920
you know. Back in, what, 2021, we got... GitHub

00:01:27.920 --> 00:01:30.579
Copilot. That was basically smart code completion.

00:01:30.799 --> 00:01:33.180
Right. Simple tab completion. And then by 2022,

00:01:33.560 --> 00:01:35.840
things like ChatGPT were good enough to write

00:01:35.840 --> 00:01:39.239
like short standalone scripts. Exactly. Fast

00:01:39.239 --> 00:01:42.060
forward to now, 2025, and you've got tools like

00:01:42.060 --> 00:01:45.540
Claude Code and the Codex command line interface, the CLI,

00:01:45.540 --> 00:01:47.439
actually building pretty complex mini projects

00:01:47.439 --> 00:01:50.280
and fixing bugs, submitting those code changes,

00:01:50.340 --> 00:01:53.299
what developers call pull requests or PRs autonomously.

00:01:53.629 --> 00:01:55.269
Okay, when we talk about that leap, we have to

00:01:55.269 --> 00:01:57.549
define autonomous agents because this isn't just

00:01:57.549 --> 00:02:00.109
glorified autocomplete anymore, is it? Oh, not

00:02:00.109 --> 00:02:03.030
at all. We define autonomous agents as AI that

00:02:03.030 --> 00:02:06.170
can, like, understand a task, plan out multiple

00:02:06.170 --> 00:02:08.490
steps to solve it, run the code it needs, make

00:02:08.490 --> 00:02:10.849
decisions if something goes wrong, and actually

00:02:10.849 --> 00:02:13.330
complete the objective without constant human

00:02:13.330 --> 00:02:16.129
hand-holding. It's a whole loop operating on

00:02:16.129 --> 00:02:19.240
its own, like stacking Lego blocks of data. And

00:02:19.240 --> 00:02:21.960
there's actual data backing this up. These projects

00:02:21.960 --> 00:02:24.939
tracking agents in the wild, they show the Codex

00:02:24.939 --> 00:02:27.580
web agent has merged over a million pull requests

00:02:27.580 --> 00:02:31.020
already. One million. Yeah. Yeah. And those PRs,

00:02:31.020 --> 00:02:32.680
they're getting merged at an impressive rate,

00:02:32.759 --> 00:02:35.360
like 80 plus percent. That's for real production

00:02:35.360 --> 00:02:37.979
code changes, even from agents that are basically,

00:02:38.120 --> 00:02:40.340
you know, first time users on a code base. Yeah.

00:02:40.379 --> 00:02:42.620
Plus, we've seen huge adoption like Claude Code.

00:02:42.719 --> 00:02:45.479
It has something like 20 times more NPM downloads,

00:02:45.479 --> 00:02:48.080
NPM being the node package manager developers use, than

00:02:48.080 --> 00:02:51.000
the Codex CLI. Honestly, I still wrestle

00:02:51.000 --> 00:02:53.439
with prompt drift myself sometimes, you know,

00:02:53.439 --> 00:02:55.080
when I'm just trying to fix simple bugs in my

00:02:55.080 --> 00:02:57.300
own little weekend project. So it's genuinely

00:02:57.300 --> 00:02:59.379
humbling to see these production systems hitting

00:02:59.379 --> 00:03:03.180
that consistent 80% merge rate. Wow. So if coding

00:03:03.180 --> 00:03:06.379
really is the AGI canary, that consistent 80%

00:03:06.379 --> 00:03:09.080
merge rate feels like a pretty loud chirp.

00:03:09.159 --> 00:03:11.479
What does that percentage really tell us about

00:03:11.479 --> 00:03:13.340
where software autonomy is right now? I think

00:03:13.340 --> 00:03:15.979
it tells us the AI has achieved a level of reliability.

00:03:17.180 --> 00:03:19.240
We're definitely past simple suggestions here.

00:03:19.379 --> 00:03:22.699
These agents are delivering tangible, trusted,

00:03:23.120 --> 00:03:25.340
ready-to-use output. Okay, let's shift from

00:03:25.340 --> 00:03:27.319
that underlying code autonomy to some of the

00:03:27.319 --> 00:03:30.020
headlines. Model performance, public scrutiny.

00:03:30.659 --> 00:03:33.139
It often feels like the biggest perceived problems

00:03:33.139 --> 00:03:35.780
in AI turn out to be, well, pretty mundane infrastructure

00:03:35.780 --> 00:03:38.759
stuff. That is spot on. Like if you felt Claude

00:03:38.759 --> 00:03:41.879
seemed a bit nerfed recently, you know, its capability

00:03:41.879 --> 00:03:44.259
felt kind of reduced. It wasn't some secret downgrade.

00:03:44.580 --> 00:03:46.360
Anthropic actually put out a postmortem. They

00:03:46.360 --> 00:03:47.919
explained the perceived change was just down

00:03:47.919 --> 00:03:50.439
to three overlapping infrastructure bugs. Simple

00:03:50.439 --> 00:03:52.319
as that. And they're all fixed now. So, yeah,

00:03:52.360 --> 00:03:54.460
model stability often comes down to pretty boring

00:03:54.460 --> 00:03:56.780
infrastructure work, not some fundamental drop

00:03:56.780 --> 00:04:00.020
in capability. Right. And speaking of control,

00:04:00.240 --> 00:04:03.539
there is also that new feature for GPT-5. Users

00:04:03.539 --> 00:04:06.719
can now toggle its thinking time. It's web only

00:04:06.719 --> 00:04:09.840
for now, but it lets you choose faster answers

00:04:09.840 --> 00:04:12.500
or maybe smarter, more deliberate quality. Yeah,

00:04:12.560 --> 00:04:15.180
that's potentially huge for efficiency, letting

00:04:15.180 --> 00:04:18.120
the user decide the tradeoff. And speaking of

00:04:18.120 --> 00:04:20.259
public perception, Meta had that high profile

00:04:20.259 --> 00:04:23.100
moment recently. During their public demo, trying

00:04:23.100 --> 00:04:25.660
to show off live tasks, the system kind of hung

00:04:25.660 --> 00:04:28.730
for about a minute. Left the audience a bit unimpressed,

00:04:28.730 --> 00:04:30.810
apparently. But look, that was likely just real-time

00:04:30.810 --> 00:04:33.129
latency or maybe an API connection issue

00:04:33.129 --> 00:04:35.569
during data fetching. That's a really common

00:04:35.569 --> 00:04:37.310
bottleneck. Doesn't necessarily mean the model

00:04:37.310 --> 00:04:40.050
was hallucinating or broken. Still, takes guts

00:04:40.050 --> 00:04:42.370
to demo that stuff live. Definitely. And on a

00:04:42.370 --> 00:04:44.930
more creative note, there's this new tool, NanoBanana,

00:04:45.029 --> 00:04:48.269
allows pretty sophisticated photo merging. The

00:04:48.269 --> 00:04:50.269
interesting part is it handles images with more

00:04:50.269 --> 00:04:52.850
than two people and lets you control the exact

00:04:52.850 --> 00:04:56.029
aesthetic, the pose, for everyone involved simultaneously.

00:04:56.560 --> 00:04:59.959
So thinking about that GPT-5 toggle, why is

00:04:59.959 --> 00:05:02.620
giving users control over thinking speed actually

00:05:02.620 --> 00:05:05.000
such a useful option? Well, it really lets you,

00:05:05.180 --> 00:05:08.620
the user, consciously prioritize. Speed for simple

00:05:08.620 --> 00:05:11.480
stuff or deep analytical quality for complex

00:05:11.480 --> 00:05:15.560
tasks. Tailoring it. Okay, let's pivot now to

00:05:15.560 --> 00:05:17.779
the strategic side. Financial moves, where the

00:05:17.779 --> 00:05:21.019
money's flowing, and why so many projects seem

00:05:21.019 --> 00:05:22.920
to fail. Right, we're seeing some serious investments

00:05:22.920 --> 00:05:25.000
still. Databricks, for instance, just raised

00:05:25.000 --> 00:05:28.180
a billion dollars. One billion. And they launched

00:05:28.180 --> 00:05:31.100
an AI accelerator program. They're giving early

00:05:31.100 --> 00:05:34.560
stage startups like $50,000 in platform credits,

00:05:34.740 --> 00:05:36.800
helping them scale compute quickly. That sounds

00:05:36.800 --> 00:05:39.019
fantastic. Pure opportunity. But then there's

00:05:39.019 --> 00:05:41.120
this other data point that kind of balances that

00:05:41.120 --> 00:05:44.399
optimism. Research showing that, what, 95% of

00:05:44.399 --> 00:05:46.660
AI automation projects actually fail. That seems

00:05:46.660 --> 00:05:49.300
incredibly high. And the successful ones, they

00:05:49.300 --> 00:05:51.279
apparently rely on a process -first method to

00:05:51.279 --> 00:05:53.779
get a positive ROI. Whoa, just imagine Databricks

00:05:53.779 --> 00:05:56.240
scaling that accelerator funding globally. That

00:05:56.240 --> 00:05:58.240
could cause a massive, almost immediate shift

00:05:58.240 --> 00:05:59.939
in compute allocation for startups everywhere.

00:06:00.199 --> 00:06:02.699
Yeah. But yeah, that 95% failure rate is sobering.

00:06:02.699 --> 00:06:04.699
It really underscores the need for something

00:06:04.699 --> 00:06:07.399
called context engineering. Okay, let's dig into

00:06:07.399 --> 00:06:10.540
that. For listeners who are informed but maybe

00:06:10.540 --> 00:06:13.860
don't use that term daily, what exactly is context

00:06:13.860 --> 00:06:16.399
engineering and why is it apparently the key

00:06:16.399 --> 00:06:20.360
to avoiding that 95% failure rate? Context engineering

00:06:20.360 --> 00:06:22.759
is basically about connecting the AI directly

00:06:22.759 --> 00:06:25.420
to your stuff, your company's internal data,

00:06:25.620 --> 00:06:28.060
your live workflows, your proprietary databases.

00:06:28.259 --> 00:06:30.899
You're giving the large language model the specific

00:06:30.899 --> 00:06:34.079
relevant context it needs to do its job well

00:06:34.079 --> 00:06:36.500
for you. Not just feeding it generic web data.

00:06:36.639 --> 00:06:38.620
That's how you get consistently high quality

00:06:38.620 --> 00:06:41.040
outputs that actually, you know, matter to the

00:06:41.040 --> 00:06:43.379
business process. So based on the research then,

00:06:43.500 --> 00:06:45.779
what's the core strategic reason the successful

00:06:45.779 --> 00:06:48.660
projects manage to avoid that huge 95% failure

00:06:48.660 --> 00:06:50.620
rate? It seems they really focus on defining

00:06:50.620 --> 00:06:52.779
the underlying business process before they even

00:06:52.779 --> 00:06:55.000
think about implementing the AI solution. Process

00:06:55.000 --> 00:06:57.879
first. All right, let's turn now to what feels

00:06:57.879 --> 00:06:59.579
like one of the biggest scientific breakthroughs

00:06:59.579 --> 00:07:02.560
mentioned in the sources. Delphi-2M, this new

00:07:02.560 --> 00:07:05.240
AI system from European researchers, it seems

00:07:05.240 --> 00:07:08.180
genuinely set to redefine proactive health care.

00:07:08.279 --> 00:07:11.639
Oh, its capability is absolutely stunning. Delphi-2M

00:07:11.639 --> 00:07:15.220
forecasts the risk for 1,258 distinct diseases.

00:07:15.459 --> 00:07:18.579
Think diabetes, neurological disorders, heart

00:07:18.579 --> 00:07:21.439
conditions, the whole gamut, up to 20 years into

00:07:21.439 --> 00:07:24.019
the future. And it does this just using a patient's

00:07:24.019 --> 00:07:26.300
standard electronic medical records. Nothing

00:07:26.300 --> 00:07:28.699
more exotic than that. The training data must

00:07:28.699 --> 00:07:30.779
have been immense. It was initially trained on,

00:07:30.839 --> 00:07:34.100
what, data from over 400,000 UK patients, including

00:07:34.100 --> 00:07:36.639
everything from doctor visits, hospital records,

00:07:36.879 --> 00:07:39.480
even known lifestyle choices factored in. Exactly.

00:07:39.540 --> 00:07:41.399
And critically, this is super important to make

00:07:41.399 --> 00:07:43.459
sure it wasn't just good for the UK population.

00:07:43.500 --> 00:07:46.939
They validated it. They tested it against 1.9

00:07:46.939 --> 00:07:49.699
million entirely separate Danish patient records.

00:07:49.879 --> 00:07:52.379
That validation step is crucial. It proves the

00:07:52.379 --> 00:07:55.040
model has generalizability. It proves it's not

00:07:55.040 --> 00:07:57.000
just hallucinating predictions based on some

00:07:57.000 --> 00:07:59.439
bias in the original UK data set. The potential

00:07:59.439 --> 00:08:01.699
implications for actual patient care seem profound

00:08:01.699 --> 00:08:04.360
here. Delphi-2M could help doctors shift from

00:08:04.360 --> 00:08:06.439
just reacting to symptoms that already exist

00:08:06.439 --> 00:08:09.139
to actively anticipating future health risks

00:08:09.139 --> 00:08:12.079
years in advance. That could fundamentally change

00:08:12.079 --> 00:08:14.079
medicine towards really personalized prevention.

00:08:14.160 --> 00:08:16.360
Yeah, but it's really important to stress the

00:08:16.360 --> 00:08:18.600
caveat here. Human doctors are still absolutely

00:08:18.600 --> 00:08:21.180
necessary. They need to interpret the AI's predictions.

00:08:21.459 --> 00:08:24.160
The system analyzes risk. The physician provides

00:08:24.160 --> 00:08:26.000
the judgment, the empathy, the actual treatment

00:08:26.000 --> 00:08:28.740
plan. It's designed to augment the doctor, definitely

00:08:28.740 --> 00:08:31.459
not replace them. So going beyond the headline

00:08:31.459 --> 00:08:34.299
number of diseases or years, what's maybe the

00:08:34.299 --> 00:08:37.840
biggest operational shift Delphi-2M could bring

00:08:37.840 --> 00:08:40.100
to something like a standard annual checkup?

00:08:40.539 --> 00:08:42.539
Well, it fundamentally redefines that checkup,

00:08:42.559 --> 00:08:44.919
doesn't it? From just a snapshot evaluation of

00:08:44.919 --> 00:08:48.200
current status to proactive long-term risk mapping

00:08:48.200 --> 00:08:50.960
for the individual. Okay, so just to quickly

00:08:50.960 --> 00:08:52.700
synthesize everything we've covered for you today.

00:08:52.820 --> 00:08:56.019
First, AI in coding. It's quietly crossing some

00:08:56.019 --> 00:08:58.700
really crucial autonomy thresholds. That 80 plus

00:08:58.700 --> 00:09:01.440
percent PR merge rate is a key example. Second,

00:09:01.600 --> 00:09:04.179
model stability issues, often tied to boring

00:09:04.179 --> 00:09:07.080
infrastructure fixes, not fundamental nerfing.

00:09:07.159 --> 00:09:09.899
And users are getting more control, like toggling

00:09:09.899 --> 00:09:12.419
speed versus intelligence. And finally, predictive

00:09:12.419 --> 00:09:15.419
AI like Delphi-2M is really starting to redefine

00:09:15.419 --> 00:09:17.139
human health and the whole concept of prevention.

00:09:17.440 --> 00:09:19.879
So what does this all really mean for you? I

00:09:19.879 --> 00:09:21.940
think the takeaway is that the most impactful

00:09:21.940 --> 00:09:24.919
AI progress is often happening silently. It's

00:09:24.919 --> 00:09:27.379
in the background, deep inside complex systems,

00:09:27.480 --> 00:09:29.980
the infrastructure, the code bases, the hospitals.

00:09:30.360 --> 00:09:32.779
It's not just in those viral chat apps that grab

00:09:32.779 --> 00:09:35.080
all the headlines. You really have to pay attention

00:09:35.080 --> 00:09:37.419
to the infrastructure, the less flashy stuff.

00:09:38.019 --> 00:09:39.840
Thank you so much for joining us for this Deep

00:09:39.840 --> 00:09:41.740
Dive today. We really hope you continue learning

00:09:41.740 --> 00:09:43.360
about these advancements. They're fundamentally

00:09:43.360 --> 00:09:46.120
reshaping tech and health right now. And maybe

00:09:46.120 --> 00:09:48.539
here's a final thought for you to chew on. Consider

00:09:48.539 --> 00:09:50.759
that China is apparently already teaching children

00:09:50.759 --> 00:09:54.200
AI principles from age six, and that the DeepSeek

00:09:54.200 --> 00:09:56.980
foundational model, a pretty capable one, reportedly

00:09:56.980 --> 00:10:00.909
cost only $294,000 to train. Not millions, thousands.

00:10:01.629 --> 00:10:04.169
So how rapidly do you think the center of gravity

00:10:04.169 --> 00:10:06.330
for foundational AI breakthroughs might shift

00:10:06.330 --> 00:10:09.110
in, say, the next five years? Something to think

00:10:09.110 --> 00:10:10.009
about. Until next time.
