WEBVTT

00:00:00.000 --> 00:00:03.379
Welcome back to the Deep Dive. For years, the

00:00:03.379 --> 00:00:06.019
big leaps in AI, they always seemed to happen on

00:00:06.019 --> 00:00:07.900
a certain rhythm. We even had a name for it,

00:00:07.940 --> 00:00:11.019
right? Thursday AI. Well, you can go ahead and

00:00:11.019 --> 00:00:13.039
scratch that off the calendar. Google just completely

00:00:13.039 --> 00:00:15.720
reset the clock with Tuesday AI. Their Gemini

00:00:15.720 --> 00:00:19.219
3 launch was massive. And the metrics they dropped,

00:00:19.359 --> 00:00:21.140
I mean, they just changed the entire scoreboard.

00:00:21.420 --> 00:00:26.170
This thing scored a 1501 Elo on LMArena. And

00:00:26.170 --> 00:00:28.929
maybe more importantly, it crushed GPT-5 Pro

00:00:28.929 --> 00:00:31.710
on Humanity's Last Exam. We're talking 37.4%

00:00:31.710 --> 00:00:35.810
to 31.6%. That is a huge margin, nearly six

00:00:35.810 --> 00:00:38.009
points. This tells me it's not just a smarter

00:00:38.009 --> 00:00:40.990
chatbot. The real signal here is that we're looking

00:00:40.990 --> 00:00:44.890
at an entirely new class of AI teammate, something

00:00:44.890 --> 00:00:47.020
that takes action. And that's exactly where we're

00:00:47.020 --> 00:00:48.979
focusing today. So for everyone joining us on

00:00:48.979 --> 00:00:50.759
this deep dive, our whole mission is to cut through

00:00:50.759 --> 00:00:53.119
all the noise, all the data, and just pull out

00:00:53.119 --> 00:00:55.780
the most important signals from this whirlwind

00:00:55.780 --> 00:00:58.079
of AI updates. We do have a packed roadmap. We're

00:00:58.079 --> 00:01:00.560
going to unpack the raw power of Gemini 3, this

00:01:00.560 --> 00:01:02.780
new agent model, and really look at what those

00:01:02.780 --> 00:01:04.879
big benchmark scores mean for how you can actually

00:01:04.879 --> 00:01:08.409
use it. Then we're going to pivot to its unexpected

00:01:08.409 --> 00:01:11.930
competitor, Grok 4.1, the whole vibe-first model

00:01:11.930 --> 00:01:14.170
that tried to steal the spotlight. And we'll

00:01:14.170 --> 00:01:16.430
finish up with the really practical stuff, the

00:01:16.430 --> 00:01:18.989
new agents coming from industry and the just

00:01:18.989 --> 00:01:21.329
colossal funding that proves this shift is here

00:01:21.329 --> 00:01:23.430
to stay. Right. The focus is understanding that

00:01:23.430 --> 00:01:28.150
shift from a simple chatbot to a true multi-step

00:01:28.150 --> 00:01:31.730
agent that can plan and build things. Okay. Let's

00:01:31.730 --> 00:01:34.099
get into it. We have to start with Gemini 3.

00:01:34.280 --> 00:01:36.299
I mean, the performance numbers alone basically

00:01:36.299 --> 00:01:38.219
set a new state of the art. It didn't just win.

00:01:38.420 --> 00:01:40.659
It sort of redefined the whole arena. That's

00:01:40.659 --> 00:01:43.579
completely right. And when we say 1501 Elo on

00:01:43.579 --> 00:01:46.140
LMArena, for you listening, if you don't follow

00:01:46.140 --> 00:01:48.519
the leaderboards, LMArena is where these top

00:01:48.519 --> 00:01:51.400
models go head to head on really complex

00:01:51.400 --> 00:01:54.260
real-world stuff. That score, it's the new high-water mark

00:01:54.260 --> 00:01:57.239
for general AI intelligence. But the win on Humanity's

00:01:57.239 --> 00:01:59.700
Last Exam, that feels more consequential to me.

00:01:59.719 --> 00:02:01.599
That's not a trivia contest. It's a test designed

00:02:01.599 --> 00:02:04.280
to measure really deep cross-disciplinary reasoning.

00:02:04.459 --> 00:02:06.900
Winning there signals a huge leap in cognitive

00:02:06.900 --> 00:02:09.159
ability. And it's built on tech that's pushing

00:02:09.159 --> 00:02:11.639
way beyond what we have now. They even mentioned

00:02:11.639 --> 00:02:13.800
a separate model, the Deep Think version, hitting

00:02:13.800 --> 00:02:19.219
45.1% on ARC-AGI-2. And that test, that measures

00:02:19.219 --> 00:02:21.759
fluid intelligence, you know, abstract reasoning.

00:02:22.099 --> 00:02:23.599
That's the score that got everyone's attention.

00:02:23.819 --> 00:02:26.449
I mean, you had Elon Musk and Sam Altman both

00:02:26.449 --> 00:02:28.770
publicly congratulating them on the launch. Oh,

00:02:28.789 --> 00:02:30.990
yeah. When your biggest competitors stop to acknowledge

00:02:30.990 --> 00:02:32.990
the jump you just made, you know it's a big deal.

00:02:33.129 --> 00:02:36.330
It absolutely is. But the real story here, it

00:02:36.330 --> 00:02:39.750
isn't just the score, it's the role. We have

00:02:39.750 --> 00:02:42.669
to talk about this agent upgrade. An agent is

00:02:42.669 --> 00:02:45.090
an AI that stops just answering your questions

00:02:45.090 --> 00:02:47.789
and becomes like a planning teammate. Yeah, it's

00:02:47.789 --> 00:02:50.370
about owning the outcome, not just finding the

00:02:50.370 --> 00:02:53.370
information. An agent can handle multi-step

00:02:53.370 --> 00:02:56.349
tasks. It plans out its actions on its own. And

00:02:56.349 --> 00:02:58.349
it can simulate the results of those actions

00:02:58.349 --> 00:03:00.610
before it even does them. It's like it's stacking

00:03:00.610 --> 00:03:03.250
Lego blocks of action and data, not just handing

00:03:03.250 --> 00:03:05.389
you one block. And that all starts with what

00:03:05.389 --> 00:03:08.469
you give it. The input. Gemini 3 is natively

00:03:08.469 --> 00:03:10.770
multimodal. And this isn't just a theory. You

00:03:10.770 --> 00:03:14.389
can give it anything. A complex PDF, a photo,

00:03:14.569 --> 00:03:17.409
a technical diagram, even a rough scribble on

00:03:17.409 --> 00:03:20.539
a napkin. It uses all of it as context for its

00:03:20.539 --> 00:03:24.860
plan. The examples the Google CEO gave were just

00:03:24.860 --> 00:03:28.180
stunning. Imagine sketching a website idea on

00:03:28.180 --> 00:03:30.960
a napkin, you hand it to the AI, and it turns

00:03:30.960 --> 00:03:34.219
that scribble into a full working website with

00:03:34.219 --> 00:03:37.509
all the code. That's utility in seconds. Or think

00:03:37.509 --> 00:03:39.789
about education. You take a complicated physics

00:03:39.789 --> 00:03:42.270
diagram and it turns it into an interactive lesson

00:03:42.270 --> 00:03:44.090
where you can actually move things around, run

00:03:44.090 --> 00:03:46.509
simulations in real time. And it scales up, too,

00:03:46.610 --> 00:03:49.090
for big business tasks like it can analyze a

00:03:49.090 --> 00:03:51.169
long video of a golf swing, find a technical

00:03:51.169 --> 00:03:53.310
flaw and then suggest the exact drills you need

00:03:53.310 --> 00:03:55.210
to do to fix it. And the output isn't just text

00:03:55.210 --> 00:03:57.710
anymore. It can generate dynamic layouts with

00:03:57.710 --> 00:03:59.669
interactive tools built right in, you know, like

00:03:59.669 --> 00:04:01.990
actual data sliders that appear in your search

00:04:01.990 --> 00:04:04.030
results. Right. That's a perfect example of it

00:04:04.030 --> 00:04:06.469
moving from just giving you text to actively

00:04:06.469 --> 00:04:08.569
building tools for you. And we should circle

00:04:08.569 --> 00:04:10.659
back to Deep Think for a second. That version

00:04:10.659 --> 00:04:12.759
is built for the really heavy lifting, for

00:04:12.759 --> 00:04:15.659
multi-hop logic. What that means is the model can link

00:04:15.659 --> 00:04:18.360
distant, non-obvious bits of information together

00:04:18.360 --> 00:04:21.180
from totally different data sets. That's a sign

00:04:21.180 --> 00:04:23.939
of real abstract reasoning, not just better memory.

00:04:23.939 --> 00:04:26.579
It's still in safety testing before it hits the

00:04:26.579 --> 00:04:28.680
Ultra tier, which kind of tells you how powerful

00:04:28.680 --> 00:04:31.439
it is. So if this model can build interactive

00:04:31.439 --> 00:04:34.740
UIs and simulate outcomes and plan things out,

00:04:35.399 --> 00:04:37.480
what does that actually mean for the future of

00:04:37.480 --> 00:04:39.740
professional workflows, for creative work, technical

00:04:39.740 --> 00:04:42.240
work, all of it? It means AI is moving from being

00:04:42.240 --> 00:04:45.220
a passive source of information to actively building

00:04:45.220 --> 00:04:48.100
your tools and running complex operations. Now,

00:04:48.160 --> 00:04:50.240
shifting gears a bit, we have to talk about the

00:04:50.240 --> 00:04:52.220
surprise competitor that dropped right before

00:04:52.220 --> 00:04:54.120
Gemini 3 sucked all the air out of the room.

00:04:54.199 --> 00:04:57.740
I'm talking about Grok 4.1 from xAI. That was

00:04:57.740 --> 00:04:59.860
such a brilliant counter move. They presented

00:04:59.860 --> 00:05:02.680
it as vibe-first, you know, focused on a specific

00:05:02.680 --> 00:05:05.680
tone: fast, witty, a little edgy. But then it

00:05:05.680 --> 00:05:07.540
also delivered some really impressive intelligence

00:05:07.540 --> 00:05:10.040
scores. It's that classic EQ versus IQ battle.

00:05:10.800 --> 00:05:13.839
Precisely. And Grok 4.1 gives you two modes

00:05:13.839 --> 00:05:16.199
to work with. You've got the standard version,

00:05:16.379 --> 00:05:19.319
which is super snappy, instant replies. And then

00:05:19.319 --> 00:05:22.720
there's Grok 4.1 Thinking, which takes its

00:05:22.720 --> 00:05:24.579
time to process, but it's aiming for a deeper,

00:05:24.620 --> 00:05:26.879
more analytical answer. And that thinking version

00:05:26.879 --> 00:05:29.459
actually, for a moment, had its day in the sun.

00:05:29.660 --> 00:05:31.720
It hit number one on the LMArena leaderboard,

00:05:31.860 --> 00:05:35.079
scored 1510 Elo, just for a few hours before

00:05:35.079 --> 00:05:36.920
Gemini came along and bumped it to number two.

00:05:37.220 --> 00:05:39.480
But it proves this isn't just a personality.

00:05:40.009 --> 00:05:42.430
It's a serious contender. Oh, it's very capable,

00:05:42.569 --> 00:05:44.509
especially in creative writing. It's incredibly

00:05:44.509 --> 00:05:47.949
strong. Only the next-gen GPT model scored higher.

00:05:48.329 --> 00:05:50.410
But for me, the most important signal wasn't

00:05:50.410 --> 00:05:52.670
the score. It was the massive improvement in

00:05:52.670 --> 00:05:55.470
just basic reliability. They really needed to

00:05:55.470 --> 00:05:57.250
fix their accuracy problem, didn't they? They

00:05:57.250 --> 00:05:59.350
absolutely did. So they slashed the hallucination

00:05:59.350 --> 00:06:02.550
rate from 12% all the way down to 4.2%. And

00:06:02.550 --> 00:06:04.910
their internal test showed factual accuracy went

00:06:04.910 --> 00:06:07.829
up by 66%. And look, I'll be honest, I still

00:06:07.829 --> 00:06:09.589
wrestle with prompt drift myself when I'm trying

00:06:09.589 --> 00:06:11.649
to build these complex, multi-layered queries.

00:06:11.970 --> 00:06:14.790
So a reliability jump like that, it's huge for

00:06:14.790 --> 00:06:17.699
user trust. Yeah. And that subjective quality

00:06:17.699 --> 00:06:21.420
matters, too. People describe the tone as friendlier,

00:06:21.519 --> 00:06:24.000
more empathetic, like the model just gets you

00:06:24.000 --> 00:06:26.699
a little bit better. That vibe is so important

00:06:26.699 --> 00:06:28.860
for getting people to actually use it. But here's

00:06:28.860 --> 00:06:31.720
the huge catch, the big limiting factor. Grok

00:06:31.720 --> 00:06:35.259
4.1 is completely platform-locked. You can only

00:06:35.259 --> 00:06:38.759
use it on X, the social platform, or the Grok

00:06:38.759 --> 00:06:41.819
website and app. You can't plug it into external

00:06:41.819 --> 00:06:44.019
tools. You can't build it into your company's

00:06:44.019 --> 00:06:46.699
workflow with an API. It is worth mentioning,

00:06:46.740 --> 00:06:49.300
though, that xAI did release a white paper with

00:06:49.300 --> 00:06:51.379
some of their training info. That's pretty rare

00:06:51.379 --> 00:06:53.480
for them, so it shows they're serious about backing

00:06:53.480 --> 00:06:56.360
up these performance claims. So why is that platform

00:06:56.360 --> 00:06:58.639
restriction such a serious limiting factor right

00:06:58.639 --> 00:07:00.519
now, especially when you compare it to Gemini

00:07:00.519 --> 00:07:02.740
3, which is obviously being built for maximum

00:07:02.740 --> 00:07:04.759
utility everywhere? The tradeoff is control.

00:07:05.199 --> 00:07:07.699
Grok's usefulness is limited because it can't

00:07:07.699 --> 00:07:10.240
integrate into existing business tools or external

00:07:10.240 --> 00:07:13.139
APIs. Okay, as we move into the wider industry

00:07:13.139 --> 00:07:14.839
context, I think it's really important that we

00:07:14.839 --> 00:07:18.500
just pause here and reinforce a point that always

00:07:18.500 --> 00:07:20.740
gets lost in the hype. That warning from the

00:07:20.740 --> 00:07:23.160
Google CEO. It really does need to be repeated.

00:07:23.319 --> 00:07:25.060
With all this excitement about these powerful

00:07:25.060 --> 00:07:27.420
new agents, we have to remember they are still,

00:07:27.500 --> 00:07:30.680
quote, prone to errors. You can't just blindly

00:07:30.680 --> 00:07:32.980
trust the output. Critical thinking is still

00:07:32.980 --> 00:07:35.970
step zero. That being said, the scale of the

00:07:35.970 --> 00:07:38.110
infrastructure being built to support these agents

00:07:38.110 --> 00:07:41.610
is just staggering. Just look at the funding.

00:07:41.970 --> 00:07:45.250
Lambda, an AI cloud company, just raised over

00:07:45.250 --> 00:07:48.290
one and a half billion dollars with Microsoft

00:07:48.290 --> 00:07:51.269
heavily involved. And it's specifically to expand

00:07:51.269 --> 00:07:53.870
the infrastructure needed for this stuff. A billion

00:07:53.870 --> 00:07:56.910
and a half dollars. That signals a complete arms

00:07:56.910 --> 00:07:59.899
race for computing power. These models need a

00:07:59.899 --> 00:08:02.660
ton of silicon, a ton of energy to serve up these

00:08:02.660 --> 00:08:05.139
complex multi-step agent requests at a global

00:08:05.139 --> 00:08:07.060
scale. Exactly. I mean, when you think about

00:08:07.060 --> 00:08:09.319
the multi-hop reasoning, the simulations these

00:08:09.319 --> 00:08:11.459
agents are running, the demand for computation

00:08:11.459 --> 00:08:13.819
is just exponential compared to a simple text

00:08:13.819 --> 00:08:18.100
query. Whoa. Yeah, imagine scaling that. Agents

00:08:18.100 --> 00:08:20.910
simulating outcomes and building UIs for a

00:08:20.910 --> 00:08:24.310
billion queries a day. $1.5 billion is what

00:08:24.310 --> 00:08:27.050
it takes just to get ready for that future. That's

00:08:27.050 --> 00:08:29.480
a real moment of wonder. Thinking about that

00:08:29.480 --> 00:08:31.500
scale. And that infrastructure is already being

00:08:31.500 --> 00:08:34.179
put to work. Microsoft just confirmed this

00:08:34.179 --> 00:08:37.320
industry-wide pivot to agents at their Ignite 2025 event.

00:08:37.639 --> 00:08:41.379
They announced Agent 365 and 12 new Copilot

00:08:41.379 --> 00:08:43.740
agents, all designed for specific professional

00:08:43.740 --> 00:08:46.419
jobs like marketing or customer service. The

00:08:46.419 --> 00:08:49.600
agent race is officially on. We also got that

00:08:49.600 --> 00:08:52.899
funny little reminder of the ethical gray areas

00:08:52.899 --> 00:08:54.779
that open up when these models get so good at

00:08:54.779 --> 00:08:56.840
manipulation. Right. The story about the DoorDash

00:08:56.840 --> 00:08:59.320
customer who got a refund by using AI to

00:08:59.320 --> 00:09:01.799
fake a photo of a raw burger. Right. It's a small

00:09:01.799 --> 00:09:03.799
thing, but it points to the immediate potential

00:09:03.799 --> 00:09:05.840
for just, you know, everyday misuse and fraud.

00:09:06.080 --> 00:09:08.639
And looking ahead, this competition is only going

00:09:08.639 --> 00:09:11.220
to get more intense. OpenAI is definitely not

00:09:11.220 --> 00:09:13.720
sitting still. Their VP of research mentioned

00:09:13.720 --> 00:09:15.799
they have a stronger version of their IMO-winning

00:09:15.799 --> 00:09:19.039
model coming in 2025. And that's their math champion

00:09:19.039 --> 00:09:21.460
model. So they're pushing pure mathematical reasoning

00:09:21.460 --> 00:09:24.639
just as hard. So given all of this, the new power,

00:09:24.759 --> 00:09:27.259
the acknowledged risks, what's the single most

00:09:27.370 --> 00:09:29.370
important rule for the average user right now

00:09:29.370 --> 00:09:32.370
to be both effective and safe? Critical thinking

00:09:32.370 --> 00:09:35.509
remains essential. Always use AI results as a

00:09:35.509 --> 00:09:37.950
starting point, not the final unquestioned fact.

00:09:38.309 --> 00:09:40.850
So to wrap up this deep dive, I think the core

00:09:40.850 --> 00:09:42.809
takeaway for you, for the listener, is this.

00:09:42.929 --> 00:09:45.090
The whole AI race has fundamentally shifted.

00:09:45.429 --> 00:09:48.169
We are not arguing anymore about who can write

00:09:48.169 --> 00:09:50.230
slightly better text. Right. The new battleground

00:09:50.230 --> 00:09:54.110
is complex, multimodal action planning. We're

00:09:54.110 --> 00:09:56.669
seeing the rise of true agents, not just smarter

00:09:56.669 --> 00:09:59.450
chatbots. They can take in all kinds of data

00:09:59.450 --> 00:10:01.330
and then actively build things for you. And the

00:10:01.330 --> 00:10:03.990
competition is now on two clear paths. On one

00:10:03.990 --> 00:10:05.970
side, you have pure, raw performance and deep

00:10:05.970 --> 00:10:08.490
reasoning, which is Gemini 3. On the other, you

00:10:08.490 --> 00:10:11.169
have specialized tone, speed, and empathy, which

00:10:11.169 --> 00:10:13.929
is what Grok 4.1 is going for. And both are

00:10:13.929 --> 00:10:16.120
fighting for your attention. Thank you for joining

00:10:16.120 --> 00:10:18.559
us on this deep dive into the new era of autonomous

00:10:18.559 --> 00:10:21.419
AI agents. The changes are happening so fast

00:10:21.419 --> 00:10:23.620
and the capabilities are becoming so much more

00:10:23.620 --> 00:10:26.100
powerful, which makes these conversations just

00:10:26.100 --> 00:10:28.860
absolutely necessary. And as you think about

00:10:28.860 --> 00:10:31.059
the complexity of these new systems, here's a

00:10:31.059 --> 00:10:33.320
final thought for you to explore. Considering

00:10:33.320 --> 00:10:35.639
that safety tests are what's slowing down the

00:10:35.639 --> 00:10:38.500
release of the most powerful versions, how soon

00:10:38.500 --> 00:10:41.059
do you think the high-reasoning Gemini 3 Deep Think

00:10:41.059 --> 00:10:43.799
model will actually clear safety? And what new

00:10:43.799 --> 00:10:46.639
really complex ethical questions will its

00:10:46.639 --> 00:10:49.100
multi-hop logic raise when it finally gets out there?

00:10:49.179 --> 00:10:49.840
Something to think about.