WEBVTT

00:00:00.000 --> 00:00:02.200
You know, we've all seen the charts. Gemini 3

00:00:02.200 --> 00:00:06.960
Pro. It dominates nearly every single AI benchmark

00:00:06.960 --> 00:00:09.580
category. Oh, absolutely. It's statistical supremacy.

00:00:09.900 --> 00:00:13.919
We are talking about dominance in math, reasoning.

00:00:14.620 --> 00:00:17.219
Visual understanding. It's hard to look at those

00:00:17.219 --> 00:00:20.359
numbers and not be incredibly impressed. But

00:00:20.359 --> 00:00:22.859
here's the uncomfortable truth. The AI with the

00:00:22.859 --> 00:00:25.079
absolute best scores isn't always the best one

00:00:25.079 --> 00:00:27.399
to actually work with. Right. There's this paradox

00:00:27.399 --> 00:00:30.359
between what a benchmark says an AI can do and

00:00:30.359 --> 00:00:32.960
what makes it suitable for a real job. Let's

00:00:32.960 --> 00:00:34.859
unpack that today. That is the core issue we

00:00:34.859 --> 00:00:38.259
need to tackle. Welcome to the Deep Dive. Today

00:00:38.259 --> 00:00:40.560
we're digging into a powerful new set of sources

00:00:40.560 --> 00:00:44.079
about the latest AI powerhouse, Gemini 3 Pro.

00:00:44.259 --> 00:00:45.939
We're going to cut right through the hype. And

00:00:45.939 --> 00:00:48.100
our mission for you is simple. Discover where

00:00:48.100 --> 00:00:50.880
G3P's raw power really shines. We're talking

00:00:50.880 --> 00:00:53.820
research, prototyping, and media, and where it

00:00:53.820 --> 00:00:56.299
surprisingly misses the mark. Especially for

00:00:56.299 --> 00:00:58.880
creative work and in those big, complex coding projects.

00:00:59.159 --> 00:01:01.840
We'll look at the numbers, the real-world tests,

00:01:02.060 --> 00:01:04.099
and then figure out which model you should actually

00:01:04.099 --> 00:01:06.950
be reaching for. OK, so let's start with the

00:01:06.950 --> 00:01:10.829
facts, the benchmarks. The dominance is real.

00:01:10.989 --> 00:01:14.170
It is. That's the starting point. G3P is the

00:01:14.170 --> 00:01:16.950
clear leader across almost every metric that

00:01:16.950 --> 00:01:20.109
matters in advanced AI testing. We're talking

00:01:20.109 --> 00:01:24.469
complex mathematical reasoning, massive multitask

00:01:24.469 --> 00:01:26.769
language understanding scores. And these aren't

00:01:26.769 --> 00:01:28.849
small wins, right? The gaps are huge. They're

00:01:28.849 --> 00:01:31.349
substantial, not marginal victories. This is

00:01:31.349 --> 00:01:34.909
complete statistical champion status. Full stop.

00:01:35.200 --> 00:01:37.319
What's fascinating to me is that models usually

00:01:37.319 --> 00:01:40.319
have their, you know, their specialty, maybe

00:01:40.319 --> 00:01:43.140
logic, maybe creative writing. Correct. But the

00:01:43.140 --> 00:01:45.719
data suggests G3P is built to perform in everything

00:01:45.719 --> 00:01:48.340
all at once. It's kind of reset the standard

00:01:48.340 --> 00:01:50.540
for multimodal tasks. Meaning it understands

00:01:50.540 --> 00:01:52.959
more than just text. Right. It handles video

00:01:52.959 --> 00:01:55.200
understanding better than any previous model.

00:01:55.359 --> 00:01:58.379
It sees, understands, and reasons across images

00:01:58.379 --> 00:02:00.700
and text far more effectively. But there's one

00:02:00.700 --> 00:02:03.560
key exception to this total domination. Yes,

00:02:03.560 --> 00:02:06.299
and it's a big one for developers. If we drill

00:02:06.299 --> 00:02:09.379
down into coding benchmarks, specifically that

00:02:09.379 --> 00:02:13.180
SWE-bench Verified test, the data shows Claude Sonnet

00:02:13.180 --> 00:02:16.159
4.5 is still slightly better. At what specifically?

00:02:16.439 --> 00:02:18.599
At fixing complex multi-file bugs over time.

00:02:18.759 --> 00:02:21.960
But outside of that very specific niche, G3P

00:02:21.960 --> 00:02:25.319
is the champion. So if the scores are so high...

00:02:25.580 --> 00:02:28.879
What fundamentally do benchmarks fail to measure?

00:02:29.099 --> 00:02:31.639
They miss the softer factors. Yeah. The workflow

00:02:31.639 --> 00:02:34.000
feel, pragmatic thinking, and they completely

00:02:34.000 --> 00:02:36.300
ignore communication style. So let's talk about

00:02:36.300 --> 00:02:38.860
where that raw power becomes immediately useful.

00:02:39.020 --> 00:02:41.840
Our sources say deep research is the first big

00:02:41.840 --> 00:02:44.300
win. Oh, it's arguably the best AI research tool

00:02:44.300 --> 00:02:48.240
ever created. What it does is it effectively

00:02:48.240 --> 00:02:50.240
collapses the entire research pipeline. The whole

00:02:50.240 --> 00:02:52.819
process. Finding papers, reading, summarizing.

00:02:52.919 --> 00:02:54.900
All of it. That entire manual process is just

00:02:54.900 --> 00:02:57.139
gone. Tell us about the live test that showed

00:02:57.139 --> 00:02:59.460
this. The prompt sounded pretty intense. It was

00:02:59.460 --> 00:03:01.860
designed to stress the model, for sure. It had

00:03:01.860 --> 00:03:03.479
to research complex machine learning concepts,

00:03:03.719 --> 00:03:06.460
explain them simply, and detail LLM training

00:03:06.460 --> 00:03:09.120
step by step. That's a lot of synthesis for one

00:03:09.120 --> 00:03:12.879
go. It is. And G3P took just 45 seconds to plan

00:03:12.879 --> 00:03:15.819
its attack. Just to plan. It identified primary

00:03:15.819 --> 00:03:18.259
and secondary concepts it needed to weave in.

00:03:18.490 --> 00:03:20.629
45 seconds just for planning? What about the

00:03:20.629 --> 00:03:23.689
output? In just under three minutes, it generated

00:03:23.689 --> 00:03:26.770
a full, structured, in-depth research report.

00:03:27.110 --> 00:03:29.930
It synthesized info from hundreds of sources

00:03:29.930 --> 00:03:32.590
simultaneously. Wow. For a knowledge worker,

00:03:32.750 --> 00:03:35.090
that genuinely saves hours. And this is where

00:03:35.090 --> 00:03:36.930
it gets really interesting. The sources called it

00:03:36.930 --> 00:03:39.150
the one-click magic. Yeah, this is the killer

00:03:39.150 --> 00:03:41.689
feature. After generating that report, you can

00:03:41.689 --> 00:03:43.830
instantly convert the findings into a complete

00:03:43.830 --> 00:03:47.530
website, a Google Doc, a quiz, flashcards, or

00:03:47.530 --> 00:03:50.069
even an audio podcast script. So it's not just

00:03:50.069 --> 00:03:53.030
research, it's asset creation. It turns raw research

00:03:53.030 --> 00:03:55.830
into finished, formatted assets immediately.

00:03:56.169 --> 00:03:58.509
It's a whole content engine. It saves not just

00:03:58.509 --> 00:04:00.969
minutes, but hours by collapsing that entire...

00:04:01.000 --> 00:04:04.240
workflow. So it turns searching, reading, combining,

00:04:04.340 --> 00:04:06.819
and formatting into a single action. That's it,

00:04:06.819 --> 00:04:09.340
exactly. A single prompt. Okay, so beyond research,

00:04:09.460 --> 00:04:11.460
what about creating things? We heard about a

00:04:11.460 --> 00:04:14.840
pretty wild stress test involving a 3D game.

00:04:15.139 --> 00:04:18.160
Yes, the developer stress test. The task is very

00:04:18.160 --> 00:04:22.120
specific. Make a 3D first -person shooter using

00:04:22.120 --> 00:04:25.500
Three.js, and it has to be in just one single HTML

00:04:25.500 --> 00:04:28.399
file. No external dependencies. None. It has

00:04:28.399 --> 00:04:30.810
to be playable, responsive, and functional all

00:04:30.810 --> 00:04:32.850
in one go. That sounds like a monumental task

00:04:32.850 --> 00:04:35.329
for a single prompt. It is. It demands massive

00:04:35.329 --> 00:04:38.810
context awareness. And the result? In about one

00:04:38.810 --> 00:04:42.389
minute, G3P produced a fully functional 3D FPS

00:04:42.389 --> 00:04:45.449
game. You're kidding? Not at all. It had sound

00:04:45.449 --> 00:04:47.449
effects, a working power-up system, bullets

00:04:47.449 --> 00:04:49.629
firing correctly from the visual gun model on

00:04:49.629 --> 00:04:52.149
screen. The sources called the output, quote,

00:04:52.269 --> 00:04:56.160
the best code seen for this test. Whoa. I mean,

00:04:56.160 --> 00:04:58.800
just imagine scaling the speed. It's like stacking

00:04:58.800 --> 00:05:01.319
these incredibly complex Lego blocks of data

00:05:01.319 --> 00:05:04.459
to build a prototype instantly. That is a massive

00:05:04.459 --> 00:05:07.079
shift in development speed. It's critical to

00:05:07.079 --> 00:05:09.160
note the distinction here, though. This excels

00:05:09.160 --> 00:05:11.399
at prototypes. Right. Rapid proof of concept

00:05:11.399 --> 00:05:14.180
demos, but not necessarily long-term production

00:05:14.180 --> 00:05:16.560
apps. So does the speed mean we should use G3P

00:05:16.560 --> 00:05:18.939
for all rapid software development? Not quite.

00:05:19.139 --> 00:05:22.500
It is absolutely unmatched for quick V1 demos.

00:05:22.699 --> 00:05:25.379
Yeah. But. We're going to see why it still struggles

00:05:25.379 --> 00:05:28.300
with that long-term complex application development.

00:05:28.560 --> 00:05:31.519
Let's pivot to visuals. The analysis calls this

00:05:31.519 --> 00:05:34.459
the best AI image generator of all time. Why

00:05:34.459 --> 00:05:38.000
such a strong endorsement? Because the key differentiator

00:05:38.000 --> 00:05:40.279
isn't just generating beautiful images. A lot

00:05:40.279 --> 00:05:42.839
of models can do that now. It's consistency and

00:05:42.839 --> 00:05:45.949
complex editing. Most models just... They fall

00:05:45.949 --> 00:05:48.430
apart when you try to make small iterative edits.

00:05:48.629 --> 00:05:51.129
They lose the plot completely. So tell us about

00:05:51.129 --> 00:05:53.870
the YouTube thumbnail editing test. It handled

00:05:53.870 --> 00:05:57.490
three major edits on one image flawlessly. First,

00:05:57.670 --> 00:06:00.769
changing the text "AI made this" to "100% made

00:06:00.769 --> 00:06:03.910
by AI." Perfect text matching. Second, resizing

00:06:03.910 --> 00:06:05.949
an arrow and focusing on a woman. Perfect enlargement.

00:06:05.970 --> 00:06:08.899
No distortion. And third, swapping the entire

00:06:08.899 --> 00:06:11.399
background to the Eiffel Tower. Zero errors. So

00:06:11.399 --> 00:06:13.199
the magic is in maintaining that consistency

00:06:13.199 --> 00:06:15.939
across multiple steps. Exactly. Every element

00:06:15.939 --> 00:06:18.120
stayed intact unless it was explicitly told to

00:06:18.120 --> 00:06:20.319
change. This really hints at the Google advantage,

00:06:20.480 --> 00:06:23.620
right? The data. Their vast image and video databases

00:06:23.620 --> 00:06:27.180
from Google Images and YouTube. That's a competitive

00:06:27.180 --> 00:06:29.620
advantage that's really hard for rivals to match

00:06:29.620 --> 00:06:32.600
right now. So is this media prowess the most

00:06:32.600 --> 00:06:35.920
undeniable strength G3P has demonstrated? Yes.

00:06:36.399 --> 00:06:38.759
For image creation, editing, and video understanding,

00:06:39.180 --> 00:06:43.439
G3P is objectively the visual king. For now.

00:06:43.819 --> 00:06:46.259
Welcome back to the Deep Dive. We've established

00:06:46.259 --> 00:06:49.779
where G3P's raw intelligence wins, but here is

00:06:49.779 --> 00:06:52.220
where those high benchmark scores get a little

00:06:52.220 --> 00:06:54.839
uncomfortable. The sources say that for creative

00:06:54.839 --> 00:06:58.139
tasks, the vibes are off. It's about that pragmatic,

00:06:58.399 --> 00:07:01.300
human-centered thinking. G3P is smarter, yes,

00:07:01.560 --> 00:07:05.139
but its ideas are often... well, very AI ideas.

00:07:05.339 --> 00:07:07.680
Meaning they're clever, but not realistic. Exactly.

00:07:07.920 --> 00:07:09.800
They sound cool in the abstract, but they lack

00:07:09.800 --> 00:07:11.759
that human touch. Let's look at the business

00:07:11.759 --> 00:07:14.379
planning test for an app store. What did G3P

00:07:14.379 --> 00:07:17.060
suggest? It suggested features like a blind mode

00:07:17.060 --> 00:07:19.500
for users to try apps without any visual context,

00:07:19.680 --> 00:07:22.040
or a date planner button that optimized meetings.

00:07:22.480 --> 00:07:24.540
Technologically interesting, I guess? Sure, but

00:07:24.540 --> 00:07:26.100
not things people would actually use. They don't

00:07:26.100 --> 00:07:28.839
solve real human problems. And the competing

00:07:28.839 --> 00:07:32.279
model, GPT-5.1, took a completely different

00:07:32.279 --> 00:07:34.829
approach. Totally different. It actually pushed

00:07:34.829 --> 00:07:37.509
back. It said the user needed reasons to return

00:07:37.509 --> 00:07:39.970
to the app, focusing on retention. It suggested

00:07:39.970 --> 00:07:42.730
realistic features like a public build log or

00:07:42.730 --> 00:07:44.810
leaderboards, ideas that were actually implemented.

00:07:45.089 --> 00:07:47.329
It felt like talking to a human partner. And

00:07:47.329 --> 00:07:49.329
that human element extends to the communication

00:07:49.329 --> 00:07:52.670
style. G3P is described as being very "AI researcher."

00:07:52.970 --> 00:07:56.829
Yeah, cold, factual, detached. Whereas the competitor

00:07:56.829 --> 00:08:00.050
is warm. It addresses unstated concerns and goes

00:08:00.050 --> 00:08:02.439
above and beyond. For example? When asked for

00:08:02.439 --> 00:08:04.879
community ideas, it pivoted to discussing pricing

00:08:04.879 --> 00:08:08.240
strategy and customer anxieties. Totally unprompted,

00:08:08.240 --> 00:08:11.639
but highly relevant. You know, I still wrestle

00:08:11.639 --> 00:08:14.300
with prompt drift myself. That subtle fatigue

00:08:14.300 --> 00:08:17.019
of talking to a clinical entity for hours, that

00:08:17.019 --> 00:08:19.660
feeling of connection, that extra mile vibe,

00:08:19.879 --> 00:08:22.000
it's essential for a long-term partnership.

00:08:22.439 --> 00:08:24.959
So if G3P is smarter, why does human-like thinking

00:08:24.959 --> 00:08:28.160
still win for strategic tasks? Because strategy

00:08:28.160 --> 00:08:30.579
requires understanding emotional context and

00:08:30.579 --> 00:08:32.720
what people actually want, which benchmarks just

00:08:32.720 --> 00:08:35.399
ignore. Let's talk dollars and cents. How does

00:08:35.399 --> 00:08:38.539
the cost compare? G3P is noticeably more expensive.

00:08:38.860 --> 00:08:41.860
Input tokens are $2 per million. Output tokens

00:08:41.860 --> 00:08:45.379
are $12 per million. And the competitor? GPT

00:08:45.379 --> 00:08:49.360
5.1. That's $1.25 for input and $10 for output.

00:08:49.580 --> 00:08:52.779
So if you do the math, G3P costs about 60% more

00:08:52.779 --> 00:08:56.580
for input. 60%. That's a significant gap. For

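NOTE
Editor's sanity check on the pricing arithmetic above. The per-million-token rates are the ones quoted in this episode; the monthly token volumes are a hypothetical example, not from the source:
```python
# Rates quoted in the episode, in USD per million tokens.
G3P_IN, G3P_OUT = 2.00, 12.00   # Gemini 3 Pro
GPT_IN, GPT_OUT = 1.25, 10.00   # GPT-5.1
# Input-price premium: (2.00 - 1.25) / 1.25 = 0.60, i.e. 60% more.
input_premium = (G3P_IN - GPT_IN) / GPT_IN
# Hypothetical heavy month: 50M input tokens and 5M output tokens.
g3p_monthly = 50 * G3P_IN + 5 * G3P_OUT   # 100 + 60  = 160.0
gpt_monthly = 50 * GPT_IN + 5 * GPT_OUT   # 62.5 + 50 = 112.5
print(f"{input_premium:.0%} input premium; ${g3p_monthly:.2f} vs ${gpt_monthly:.2f}")
```
At these rates the 60% figure is exact for input; the blended premium depends on your input/output mix.
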
00:08:56.580 --> 00:08:59.220
heavy lifting, that adds up fast. It adds up

00:08:59.220 --> 00:09:01.860
incredibly fast, especially because the best

00:09:01.860 --> 00:09:04.600
feature is the huge context window. If you're

00:09:04.600 --> 00:09:07.360
feeding it large documents to analyze, you're

00:09:07.360 --> 00:09:09.879
paying that premium on every single token. That

00:09:09.879 --> 00:09:11.980
impacts the bottom line almost immediately. For

00:09:11.980 --> 00:09:14.120
sure, though a cheaper Flash model is likely

00:09:14.120 --> 00:09:16.600
on the way. And beyond the current cost, should

00:09:16.600 --> 00:09:18.799
we expect prices to level out among the major

00:09:18.799 --> 00:09:21.139
models soon? Competition will drive prices down,

00:09:21.200 --> 00:09:23.740
yeah. But for anyone using this at high volume

00:09:23.740 --> 00:09:27.460
today, G3P's pricing has a significant impact

00:09:27.460 --> 00:09:29.960
on the budget right now. Finally, let's revisit

00:09:29.960 --> 00:09:33.279
coding. Despite that strong raw ability with

00:09:33.279 --> 00:09:35.620
the game prototype, there's a serious tooling

00:09:35.620 --> 00:09:38.899
gap. The issue isn't the model's brain. It's

00:09:38.899 --> 00:09:41.759
the framework, the surrounding tools, the coding

00:09:41.759 --> 00:09:44.509
harness. What's a coding harness in plain English?

00:09:44.730 --> 00:09:46.730
It's the thing that remembers where you were

00:09:46.730 --> 00:09:49.190
three days ago and keeps track of a dozen different

00:09:49.190 --> 00:09:52.049
files for you. It's the workflow layer that makes

00:09:52.049 --> 00:09:54.940
real projects possible. And why does Claude Code

00:09:54.940 --> 00:09:58.360
with Sonnet 4.5 still win here? Because it has

00:09:58.360 --> 00:10:00.759
an excellent instruction framework built for

00:10:00.759 --> 00:10:03.720
extended coding sessions. It manages complex

00:10:03.720 --> 00:10:06.759
multi-file projects and context-aware editing

00:10:06.759 --> 00:10:09.100
better than anyone else. It remembers the whole

00:10:09.100 --> 00:10:11.399
project structure. Exactly. Not just the last

00:10:11.399 --> 00:10:14.759
few lines of code. And Google's AI Studio? AI

00:10:14.759 --> 00:10:17.460
Studio is great for those quick V1 builds and

00:10:17.460 --> 00:10:20.220
prototypes like the game demo. It can nail one

00:10:20.220 --> 00:10:22.639
big impressive code dump. But not for a long-term

00:10:22.639 --> 00:10:25.340
project. No, it's not optimized for iterative

00:10:25.340 --> 00:10:27.500
development over weeks or months. It tends to

00:10:27.500 --> 00:10:30.519
lose context. So where do we draw the line between

00:10:30.519 --> 00:10:33.159
the models for development? It's simple. Use

00:10:33.159 --> 00:10:36.659
G3P for short V1 prototypes. Use Claude Code

00:10:36.659 --> 00:10:39.000
for longer, multi-file development sessions.

00:10:39.379 --> 00:10:41.299
We've covered a lot of ground. The essential

00:10:41.299 --> 00:10:43.539
takeaway here seems to be that benchmarks measure

00:10:43.539 --> 00:10:46.409
capability, but they miss suitability. That's

00:10:46.409 --> 00:10:48.649
a perfect way to frame it. The real competitive

00:10:48.649 --> 00:10:50.929
advantage isn't chasing the model with the highest

00:10:50.929 --> 00:10:53.370
score. It's knowing how to build a specialized

00:10:53.370 --> 00:10:55.769
toolkit. So let's run through that decision matrix

00:10:55.769 --> 00:10:58.210
we found in the source material. The quick reference

00:10:58.210 --> 00:11:01.690
guide. You should use Gemini 3 Pro for deep research,

00:11:01.909 --> 00:11:04.870
media generation, rapid prototyping, and quick

00:11:04.870 --> 00:11:06.789
answers. Right. It's your specialized high-power

00:11:06.789 --> 00:11:09.559
engine. But you should use other models like

00:11:09.559 --> 00:11:13.100
GPT-5.1 for creative writing, strategic business

00:11:13.100 --> 00:11:16.320
planning. Anything requiring that human pragmatism.

00:11:16.399 --> 00:11:18.919
Exactly. And for long coding sessions and complex

00:11:18.919 --> 00:11:22.100
apps. You stick with Claude Code. If you master

00:11:22.100 --> 00:11:24.399
that decision-making process, which tool for

00:11:24.399 --> 00:11:26.940
which job, you're already ahead of most people

00:11:26.940 --> 00:11:29.360
just chasing the latest chart. This deep dive

00:11:29.360 --> 00:11:31.799
really showed us that the best model is just

00:11:31.799 --> 00:11:33.960
the right model for the job. You need a hammer,

00:11:34.080 --> 00:11:36.639
a screwdriver, and a wrench. Don't fall into

00:11:36.639 --> 00:11:39.419
that trap of trying to use a single AI for everything

00:11:39.419 --> 00:11:42.240
just because it won an exam. Build your toolkit

00:11:42.240 --> 00:11:45.100
strategically. Thank you for joining us for this

00:11:45.100 --> 00:11:47.899
deep dive into the benchmark paradox. You know,

00:11:47.899 --> 00:11:50.929
if AI becomes... objectively smarter every few

00:11:50.929 --> 00:11:53.649
months, but still struggles with basic human

00:11:53.649 --> 00:11:55.549
pragmatism. What does that really say about the

00:11:55.549 --> 00:11:57.769
value of human-centered thinking in this new

00:11:57.769 --> 00:11:58.789
world? Something to mull over.
