WEBVTT

00:00:00.000 --> 00:00:02.279
Okay, we really have to kick off this deep dive

00:00:02.279 --> 00:00:05.599
talking about that sensational claim, the one

00:00:05.599 --> 00:00:07.500
that just blew up the AI headlines for a bit.

00:00:07.620 --> 00:00:11.279
Yeah, the deleted tweet. Exactly. From an OpenAI

00:00:11.279 --> 00:00:15.980
VP suggesting, well, something almost unbelievable.

00:00:16.379 --> 00:00:19.440
That GPT-5 had made real progress on multiple

00:00:19.440 --> 00:00:23.199
really tough, long unsolved math problems. The

00:00:23.199 --> 00:00:26.500
Erdos problems list stuff. Which, you know. Solving even

00:00:26.500 --> 00:00:28.579
one of those. That's huge for a mathematician.

00:00:28.920 --> 00:00:30.899
Career defining. Absolutely. So it went viral

00:00:30.899 --> 00:00:34.000
like crazy. But the reality check came fast.

00:00:34.479 --> 00:00:37.500
And it was public and honestly kind of brutal.

00:00:37.640 --> 00:00:40.259
Yeah. And the core lesson, the thing that rival

00:00:40.259 --> 00:00:43.750
CEOs and academics jumped on, was this key difference.

00:00:44.070 --> 00:00:46.210
Confusing just finding information, retrieval,

00:00:46.210 --> 00:00:48.750
with actual, like, new thinking, real reasoning.

00:00:49.030 --> 00:00:51.450
Precisely. Welcome to the deep dive. We've sifted

00:00:51.450 --> 00:00:53.310
through your latest pile of sources. It's a real

00:00:53.310 --> 00:00:55.829
mix this time. High stakes drama, some really

00:00:55.829 --> 00:00:58.649
practical advice, and one genuinely surprising

00:00:58.649 --> 00:01:00.570
new technique. Yeah, we're going to distill all

00:01:00.570 --> 00:01:02.649
that for you. Our mission today, cut through

00:01:02.649 --> 00:01:04.849
the noise, the hype. We need to really unpack

00:01:04.849 --> 00:01:08.670
what these big AI models actually do versus what

00:01:08.670 --> 00:01:11.349
the labs, well, sometimes claim they can do.

00:01:11.510 --> 00:01:14.430
So first up, we'll dig into that GPT-5 math

00:01:14.430 --> 00:01:16.930
thing properly. Then we're shifting gears. We'll

00:01:16.930 --> 00:01:18.730
talk about some essential tools and workflows,

00:01:19.049 --> 00:01:21.269
stuff for the builders and professionals listening.

00:01:21.510 --> 00:01:24.090
And finally, there's this technique out of Stanford.

00:01:24.769 --> 00:01:28.150
Simple, costs nothing, but it really changes

00:01:28.150 --> 00:01:30.530
how we prompt these models. Might even be the

00:01:30.530 --> 00:01:32.950
end of complex prompt engineering as we know

00:01:32.950 --> 00:01:35.450
it. So yeah, let's get into this source material.

00:01:35.709 --> 00:01:39.459
Okay, segment one, the GPT-5 controversy. The

00:01:39.459 --> 00:01:42.000
initial claim itself was, it was pretty wild.

00:01:42.239 --> 00:01:46.280
Totally. The VP tweeted, GPT-5 solved 10 unsolved

00:01:46.280 --> 00:01:49.680
Erdos problems. 10. And showed progress on 11

00:01:49.680 --> 00:01:51.920
others. And these Erdos problems, just to be

00:01:51.920 --> 00:01:53.959
clear, they're notoriously hard. Right, yeah.

00:01:54.040 --> 00:01:56.760
They need genuine mathematical creativity. Formal

00:01:56.760 --> 00:01:57.840
proofs. They're not like, you know, multiple

00:01:57.840 --> 00:01:59.859
choice questions. Right. So then mathematician

00:01:59.859 --> 00:02:02.099
Thomas Bloom, who actually tracks this stuff

00:02:02.099 --> 00:02:04.379
officially, he looked into it. And he quickly

00:02:04.379 --> 00:02:07.640
confirmed, nope. The model found old solutions.

00:02:07.700 --> 00:02:10.259
Stuff already in the training data. Exactly.

00:02:10.340 --> 00:02:13.460
Not new, original proofs. Yeah. Just found stuff

00:02:13.460 --> 00:02:16.159
it had already seen. And the reaction from the

00:02:16.159 --> 00:02:19.719
big names. Pretty intense. Yeah. Yann LeCun from

00:02:19.719 --> 00:02:23.520
Meta apparently used some sharp words. Said they

00:02:23.520 --> 00:02:26.039
got hoisted by their own petard. Yeah. Basically

00:02:26.039 --> 00:02:28.819
meaning their own hype backfired. Ouch. And Demis

00:02:28.819 --> 00:02:30.560
Hassabis from DeepMind. Called it embarrassing.

00:02:31.069 --> 00:02:33.229
Straight up. Wow. But what's really interesting

00:02:33.229 --> 00:02:37.129
is that even OpenAI's own Sébastien Bubeck kind

00:02:37.129 --> 00:02:39.469
of acknowledged the core issue here. Which is?

00:02:39.629 --> 00:02:43.110
The huge difference intellectually between just

00:02:43.110 --> 00:02:45.490
retrieving an old paper from the training data,

00:02:45.650 --> 00:02:48.409
which GPT-5 did well, and actually inventing a

00:02:48.409 --> 00:02:50.949
new proof. They implied invention. But delivered

00:02:50.949 --> 00:02:53.650
retrieval. That's the crux of it. So, okay, it

00:02:53.650 --> 00:02:55.750
was one deleted tweet. Why does this really matter

00:02:55.750 --> 00:02:57.990
in the bigger picture? Because AI labs are always

00:02:57.990 --> 00:03:00.189
using these leaderboards, right? Yeah. GSM8K,

00:03:00.389 --> 00:03:03.229
the math benchmark, they boast about these scores

00:03:03.229 --> 00:03:05.930
constantly. We see those headlines all the time.

00:03:05.949 --> 00:03:08.430
New model tops the charts. Right. But those scores

00:03:08.430 --> 00:03:11.789
often just test, well, stored reasoning, pattern

00:03:11.789 --> 00:03:14.069
matching, stuff the model learned during training.

00:03:14.270 --> 00:03:16.930
So it sees a new question, finds a similar solved

00:03:16.930 --> 00:03:19.629
one in its memory, and kind of copies the method.

00:03:19.750 --> 00:03:22.740
Pretty much. It's retrieval disguised as reasoning.

00:03:22.960 --> 00:03:26.039
To claim real discovery, you need that formal

00:03:26.039 --> 00:03:29.180
proof verifiable novelty. Which wasn't there

00:03:29.180 --> 00:03:31.360
in this case. Nope. Proving those quick test

00:03:31.360 --> 00:03:34.000
scores can be, frankly, pretty misleading sometimes.

00:03:34.379 --> 00:03:36.560
Okay, so if these standard reasoning tests have

00:03:36.560 --> 00:03:39.120
this flaw, this susceptibility to just retrieving

00:03:39.120 --> 00:03:42.080
data, how should we actually measure real AI

00:03:42.080 --> 00:03:44.620
discovery going forward? What's a better way?

00:03:44.879 --> 00:03:47.520
We must prioritize provable novelty over mere

00:03:47.520 --> 00:03:50.199
retrieval. Simple as that. Right. Verifiable

00:03:50.199 --> 00:03:53.080
newness. Exactly. Now, moving from those big

00:03:53.080 --> 00:03:55.979
claims to something more practical. Segment two,

00:03:56.159 --> 00:03:59.099
essential tools for builders. Yeah, this is important

00:03:59.099 --> 00:04:02.639
because the sheer volume of new AI tools, new

00:04:02.639 --> 00:04:05.259
features, it's overwhelming. Totally. Causes

00:04:05.259 --> 00:04:07.479
real analysis paralysis for people trying to

00:04:07.479 --> 00:04:09.520
actually use this stuff professionally. You need

00:04:09.520 --> 00:04:11.419
a system to cut through it. Our notes mentioned

00:04:11.419 --> 00:04:14.120
a specific builder's framework. What's the core

00:04:14.120 --> 00:04:16.579
idea there? The main thing is focusing on time

00:04:16.579 --> 00:04:19.089
to value. How quickly can this tool actually

00:04:19.089 --> 00:04:22.129
help you solve a real problem you have? Rather

00:04:22.129 --> 00:04:24.649
than just chasing the highest score on some benchmark.

00:04:25.009 --> 00:04:27.709
Precisely. Evaluating tech based on your needs,

00:04:27.829 --> 00:04:30.689
not just the market hype. It's a solid system.

00:04:30.930 --> 00:04:33.670
Okay. And for people using, say, ChatGPT every

00:04:33.670 --> 00:04:35.990
day, there's a feature that gets missed. Yeah,

00:04:36.029 --> 00:04:39.230
the projects feature. Often overlooked, but it

00:04:39.230 --> 00:04:41.649
lets you create separate, dedicated contexts

00:04:41.649 --> 00:04:45.459
for different tasks or workflows. Ah, so it stops

00:04:45.459 --> 00:04:47.120
the chatbot forgetting what you were talking

00:04:47.120 --> 00:04:49.360
about 10 minutes ago. Exactly. It gives it much

00:04:49.360 --> 00:04:52.199
better memory within that specific project. Yeah.

00:04:52.279 --> 00:04:56.300
For anything complex, multi-step, that continuous

00:04:56.300 --> 00:04:59.540
memory is a huge productivity win. Yeah, that

00:04:59.540 --> 00:05:02.339
continuity is massive. Honestly, I still wrestle

00:05:02.339 --> 00:05:04.399
with prompt drift and context windows myself

00:05:04.399 --> 00:05:06.680
sometimes. Oh, me too. It happens. Trying to

00:05:06.680 --> 00:05:09.420
keep a consistent style or persona across a bunch

00:05:09.420 --> 00:05:12.199
of outputs, it can be tricky. For sure. But then

00:05:12.199 --> 00:05:14.540
when you need those really high quality, reliable

00:05:14.540 --> 00:05:17.040
results, like agency level stuff, but without

00:05:17.040 --> 00:05:19.319
the agency price tag. The sources pointed towards

00:05:19.319 --> 00:05:21.720
Google AI Studio. Absolutely. AI Studio gives

00:05:21.720 --> 00:05:24.040
you much deeper controls than your basic chatbot

00:05:24.040 --> 00:05:26.300
interface. Like what kind of controls? Well,

00:05:26.360 --> 00:05:28.699
the material details five specific professional

00:05:28.699 --> 00:05:32.670
methods. One example is demanding output in really

00:05:32.670 --> 00:05:35.930
structured formats like JSON. Okay. Or forcing

00:05:35.930 --> 00:05:38.850
it to generate, say, a detailed negotiation brief

00:05:38.850 --> 00:05:41.470
with specific risk parameters clearly defined.
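To make that "demand JSON" idea concrete, here's a minimal offline sketch. The schema and field names for the negotiation brief are illustrative assumptions, not something specified in the episode: build a prompt that forces JSON-only output, then validate whatever comes back before trusting it.

```python
import json

# Illustrative schema for a negotiation brief; these field names
# are assumptions for the example, not from the episode.
REQUIRED_FIELDS = {"objective", "walk_away_point", "risks"}

def build_prompt(task: str) -> str:
    """Wrap a task in an instruction that forces structured JSON output."""
    return (
        f"{task}\n"
        "Respond with ONLY a JSON object containing the keys "
        '"objective" (string), "walk_away_point" (string), and '
        '"risks" (list of strings). No prose outside the JSON.'
    )

def parse_brief(reply: str) -> dict:
    """Validate a model reply: it must be JSON with the required keys."""
    brief = json.loads(reply)  # raises an error on non-JSON replies
    missing = REQUIRED_FIELDS - brief.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return brief

# Canned reply standing in for a model response, so the sketch runs offline.
reply = ('{"objective": "renew at flat rate", '
         '"walk_away_point": "+5%", "risks": ["vendor lock-in"]}')
brief = parse_brief(reply)
print(brief["risks"])
```

In practice you'd send `build_prompt(...)` to the model and run its reply through `parse_brief`, rejecting anything that fails validation. That check is what makes the output reliable enough for downstream decisions.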

00:05:41.790 --> 00:05:44.149
Gotcha. So it's about making the output reliable

00:05:44.149 --> 00:05:47.319
and usable for serious work. Exactly. For when

00:05:47.319 --> 00:05:49.839
critical decisions are involved, or you need

00:05:49.839 --> 00:05:52.620
that consistency for creative or analytical tasks.

00:05:52.939 --> 00:05:55.560
Okay, so beyond just better memory, what would

00:05:55.560 --> 00:05:58.360
you say is the single biggest productivity game

00:05:58.360 --> 00:06:01.019
for a professional using these more specialized

00:06:01.019 --> 00:06:04.339
AI features, like in AI Studio? Deep controls

00:06:04.339 --> 00:06:06.939
allow for agency-quality work without high cost.

00:06:07.100 --> 00:06:09.220
Right, that control is key. Definitely. Okay,

00:06:09.300 --> 00:06:12.120
let's shift gears again. Segment three, the economics,

00:06:12.360 --> 00:06:15.189
the talent side. Because things are getting intense.

00:06:15.550 --> 00:06:18.410
Intense is one word for it. The talent wars are

00:06:18.410 --> 00:06:20.990
real. We're seeing reports of AI startups in

00:06:20.990 --> 00:06:23.910
SF leasing actual luxury apartments. And offering

00:06:23.910 --> 00:06:26.490
$1,000 rent stipends just to lure top engineers

00:06:26.490 --> 00:06:29.899
away from the Googles and OpenAIs. Whoa. I mean,

00:06:29.920 --> 00:06:32.519
imagine scaling that kind of competition across

00:06:32.519 --> 00:06:35.399
the whole industry. Billions being thrown at

00:06:35.399 --> 00:06:37.480
talent. It just shows the insane value placed

00:06:37.480 --> 00:06:39.240
on people who can actually push the research

00:06:39.240 --> 00:06:41.680
forward, you know, make fundamental breakthroughs.

00:06:41.699 --> 00:06:43.000
And if you're someone listening who wants to

00:06:43.000 --> 00:06:45.620
get into that hot market. Yeah. The sources actually

00:06:45.620 --> 00:06:48.439
shared a super practical guide, how to get a

00:06:48.439 --> 00:06:50.860
job at an AI company. And this wasn't just random

00:06:50.860 --> 00:06:53.199
advice, right? No. It came straight from Jure

00:06:53.199 --> 00:06:56.360
Leskovec, Stanford professor, founder, and he's

00:06:56.360 --> 00:06:59.399
actively hiring right now. So real insider stuff.

00:06:59.600 --> 00:07:02.019
Very useful. What else is happening? Quick headlines.

00:07:02.399 --> 00:07:06.560
Okay, rapid fire. OpenAI hired a black hole physicist.

00:07:07.000 --> 00:07:10.319
Oh, interesting. Google AI Studio. They just

00:07:10.319 --> 00:07:12.180
combined all their features into one single UI.

00:07:12.480 --> 00:07:15.990
Big win for usability. Nice. Europe is deploying

00:07:15.990 --> 00:07:19.069
AI for massive water projects in its driest areas.

00:07:19.529 --> 00:07:22.430
Resource management focus. Important stuff. And

00:07:22.430 --> 00:07:25.310
OpenAI is pitching ChatGPT login features to

00:07:25.310 --> 00:07:27.290
other companies now, letting them use it for

00:07:27.290 --> 00:07:29.509
their own user authentication. Okay, the job

00:07:29.509 --> 00:07:31.829
market's clearly on fire. But let's go back to

00:07:31.829 --> 00:07:36.050
that black hole physicist joining OpenAI. What

00:07:36.050 --> 00:07:38.970
does something like that signal about where AI

00:07:38.970 --> 00:07:42.259
research might be heading? AI is moving beyond

00:07:42.259 --> 00:07:45.279
language models into core scientific discovery.

00:07:45.500 --> 00:07:48.120
Got it. Fundamental science. Yeah. Thinking bigger.

00:07:48.300 --> 00:07:50.540
All right. This next segment, this could genuinely

00:07:50.540 --> 00:07:52.939
be a game changer for pretty much everyone listening.

00:07:53.120 --> 00:07:55.480
Segment four. Yeah. This is about the innovation

00:07:55.480 --> 00:07:58.970
killer. And the surprisingly simple fix. Okay,

00:07:59.029 --> 00:08:01.009
what's the killer? It's called typicality bias.

00:08:01.550 --> 00:08:04.589
It's the reason AI often gives you the same kind

00:08:04.589 --> 00:08:08.009
of bland, boring answers. Or that same poem about

00:08:08.009 --> 00:08:09.709
misty mornings every time you ask for something

00:08:09.709 --> 00:08:12.189
creative. Exactly. It avoids risk. It plays it

00:08:12.189 --> 00:08:14.129
safe. Why does it do that? What's the root cause?

00:08:14.389 --> 00:08:17.069
It comes down to the training, specifically RLHF,

00:08:17.069 --> 00:08:19.470
reinforcement learning from human feedback. Okay.

00:08:19.920 --> 00:08:22.680
Basically, typicality bias means when human reviewers

00:08:22.680 --> 00:08:25.540
rate the AI's answers, they tend to prefer the

00:08:25.540 --> 00:08:28.399
safe, familiar, typical ones. Ah, so the humans

00:08:28.399 --> 00:08:30.480
themselves are kind of biased towards average?

00:08:30.800 --> 00:08:34.129
In a way, yeah. And this trains the model. To

00:08:34.129 --> 00:08:36.789
suppress the randomness, the statistical outliers

00:08:36.789 --> 00:08:39.750
that you actually need for real creativity. Which

00:08:39.750 --> 00:08:41.590
leads to this mode collapse thing you mentioned.

00:08:41.809 --> 00:08:43.769
Right. Mode collapse is when the model just keeps

00:08:43.769 --> 00:08:46.690
defaulting to the statistical average. The most

00:08:46.690 --> 00:08:51.289
common, safest, blandest response. Kills innovation.

00:08:51.710 --> 00:08:54.789
So we've basically been prompting wrong, or at

00:08:54.789 --> 00:08:57.049
least inefficiently. Pretty much. Yeah. We've

00:08:57.049 --> 00:08:58.850
been fighting the model's tendency towards the

00:08:58.850 --> 00:09:02.669
average. But Stanford found this fix. Zero cost.

00:09:03.489 --> 00:09:07.049
Called verbalized sampling. And the fix is almost

00:09:07.049 --> 00:09:09.850
ridiculously simple. It really is. You just add

00:09:09.850 --> 00:09:12.210
one line to your prompt. Go on. Instead of just

00:09:12.210 --> 00:09:15.330
asking for, say, five jokes, you add a statistical

00:09:15.330 --> 00:09:18.570
frame. You ask it. Generate five jokes with their

00:09:18.570 --> 00:09:20.450
probabilities. That's it. Just adding with their

00:09:20.450 --> 00:09:22.009
probabilities. That's it. You're basically telling

00:09:22.009 --> 00:09:24.230
the model, hey, think about the diversity of

00:09:24.230 --> 00:09:26.590
your possible answers. Show me the spread. Wow.
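That one-line change can be sketched as a tiny prompt wrapper. A minimal sketch: the "probability :: response" reply format used for parsing here is an assumption for the example, not part of the published technique.

```python
def verbalized_sampling_prompt(request: str, n: int = 5) -> str:
    """Reframe a request with a statistical frame: ask for N candidate
    responses along with the probability the model assigns each one."""
    return (
        f"{request}\n"
        f"Generate {n} responses with their probabilities, "
        "one per line, formatted as '<probability> :: <response>'."
    )

def parse_candidates(reply: str) -> list[tuple[float, str]]:
    """Parse 'prob :: text' lines from a model reply (format assumed above)."""
    out = []
    for line in reply.strip().splitlines():
        prob, _, text = line.partition("::")
        out.append((float(prob), text.strip()))
    return out

# The jokes example from the episode, reframed statistically.
print(verbalized_sampling_prompt("Tell me five jokes about coffee."))

# Canned reply illustrating the expected shape, so the sketch runs offline.
reply = ("0.42 :: Why did the coffee file a police report? It got mugged.\n"
         "0.07 :: Espresso yourself.")
candidates = parse_candidates(reply)
```

The wrapper is the whole technique: same request, plus one sentence asking the model to surface the spread of its possible answers instead of defaulting to the single most typical one.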

00:09:27.070 --> 00:09:29.429
Okay. And the results. The source material seemed

00:09:29.429 --> 00:09:32.070
pretty emphatic about this. Oh, the results are

00:09:32.070 --> 00:09:34.210
nuts, especially considering it costs nothing

00:09:34.210 --> 00:09:38.509
extra. Creative writing. They saw a 92% jump

00:09:38.509 --> 00:09:41.070
in diversity for poems. 92%. And for jokes. Right.

00:09:41.629 --> 00:09:44.250
109% increase in diversity. Double the variety,

00:09:44.509 --> 00:09:47.309
basically. Yeah. And it wasn't just fluff. High-stakes

00:09:47.309 --> 00:09:50.269
stuff, too. Dialogue simulation. The

00:09:50.269 --> 00:09:53.370
AI's responses became twice as close to actual

00:09:53.370 --> 00:09:56.360
human donation behavior in a test scenario. So

00:09:56.360 --> 00:09:59.580
more realistic, human -like variation. Exactly.

00:09:59.740 --> 00:10:03.259
And open-ended Q&A. Seven times better answer

00:10:03.259 --> 00:10:06.720
spread. And four times lower KL divergence. Right.

00:10:06.799 --> 00:10:08.340
Which is just a fancy way of saying the answers

00:10:08.340 --> 00:10:10.639
were much less statistically similar, much more

00:10:10.639 --> 00:10:12.519
varied. That's impressive. Does it work everywhere?
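For the curious, KL divergence itself is only a few lines of math. This toy example, with invented numbers, shows the intuition: a mode-collapsed answer distribution sits much farther from a uniform, maximally varied reference than a diverse one does.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum over i of p_i * log(p_i / q_i).
    Assumes both distributions share the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]    # reference: maximally varied answers
collapsed = [0.85, 0.05, 0.05, 0.05]  # mode collapse: one dominant answer
diverse = [0.40, 0.30, 0.20, 0.10]    # a more varied spread

print(kl_divergence(collapsed, uniform))  # larger: far from uniform
print(kl_divergence(diverse, uniform))    # smaller: closer to uniform
```

So "four times lower KL divergence" means the model's answer distribution moved much closer to the varied reference spread, exactly the diversity the technique is after.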

00:10:12.720 --> 00:10:14.679
Seems like it. Yeah. Works across different models,

00:10:14.759 --> 00:10:17.700
GPT, Claude, Gemini. And it scales with model

00:10:17.700 --> 00:10:20.990
size. GPT-4 got twice the diversity boost compared

00:10:20.990 --> 00:10:23.669
to the smaller GPT-4 Mini. Just reframing the

00:10:23.669 --> 00:10:25.870
request statistically unlocks all this built-in

00:10:25.870 --> 00:10:27.769
variety. It's kind of mind-blowing. It really

00:10:27.769 --> 00:10:30.649
is. Okay, so does this super simple technique

00:10:30.649 --> 00:10:32.990
basically mean that the whole complex art of

00:10:32.990 --> 00:10:34.830
prompt engineering, you know, crafting these

00:10:34.830 --> 00:10:36.570
elaborate instructions, is that kind of over

00:10:36.570 --> 00:10:38.950
now? The shift is towards statistical framing

00:10:38.950 --> 00:10:41.649
for built-in creativity. Less hand-holding,

00:10:41.649 --> 00:10:44.419
more guiding the statistics. Interesting. Okay,

00:10:44.460 --> 00:10:46.000
let's try and pull the main threads together

00:10:46.000 --> 00:10:48.799
from this deep dive. Two big takeaways for you

00:10:48.799 --> 00:10:52.639
as you navigate this constantly changing AI landscape.

00:10:52.919 --> 00:10:57.919
All right, takeaway number one. First, be skeptical

00:10:57.919 --> 00:11:02.480
of the hype. Don't just trust it. That GPT-5 retrieval

00:11:02.480 --> 00:11:05.299
mess, it shows the line between just accessing

00:11:05.299 --> 00:11:07.860
knowledge and real innovation is still super

00:11:07.860 --> 00:11:11.629
blurry. Demand proof, demand novelty. Exactly.

00:11:11.929 --> 00:11:14.250
Verifiable novelty is key. And takeaway number

00:11:14.250 --> 00:11:17.830
two. Trust the simple, smart fixes. That verbalized

00:11:17.830 --> 00:11:20.769
sampling technique. It gives you a huge, zero-cost

00:11:20.769 --> 00:11:23.470
creativity and diversity boost just by

00:11:23.470 --> 00:11:25.350
changing how you ask. Yeah, it's about applying

00:11:25.350 --> 00:11:28.090
these simple, elegant, structural changes. Right.

00:11:28.129 --> 00:11:30.710
To get way better, less predictable results from

00:11:30.710 --> 00:11:32.450
the models you're probably already using every

00:11:32.450 --> 00:11:34.289
day. That's where the real efficiency gain is,

00:11:34.350 --> 00:11:36.049
isn't it? Getting more out of the tools you have.

00:11:36.190 --> 00:11:38.549
Totally. It's about smarter interaction, not

00:11:38.549 --> 00:11:40.350
just bigger models. So the challenge for you

00:11:40.350 --> 00:11:42.210
listening right now is maybe to go try this today.

00:11:42.409 --> 00:11:44.669
Yeah, test it out. Think about your own work,

00:11:44.830 --> 00:11:48.570
your field. How could adding that simple phrase,

00:11:48.870 --> 00:11:51.610
generate X results with their probabilities,

00:11:52.070 --> 00:11:54.970
how could that unlock something new? Move beyond

00:11:54.970 --> 00:11:57.429
the same old answers the AI usually spits out.

00:11:57.710 --> 00:12:00.570
Right. How can it help you break out of that

00:12:00.570 --> 00:12:03.549
typicality trap, that mode collapse in your specific

00:12:03.549 --> 00:12:06.139
context? That's the path forward. Definitely

00:12:06.139 --> 00:12:09.580
something to experiment with. And thanks again

00:12:09.580 --> 00:12:11.039
for sharing all these sources with us for the

00:12:11.039 --> 00:12:12.799
deep dive. Absolutely. We couldn't do it without

00:12:12.799 --> 00:12:14.000
you. We'll catch you next time.
