WEBVTT

00:00:00.000 --> 00:00:02.919
I assume you have a mini ready to go. I don't

00:00:02.919 --> 00:00:04.839
have one ready to go because you didn't tell

00:00:04.839 --> 00:00:07.900
me how you were feeling today. To be fair, you

00:00:07.900 --> 00:00:10.740
didn't ask. How are you feeling? Okay, thank

00:00:10.740 --> 00:00:14.460
you. Yeah, you're welcome. I am feeling like

00:00:14.460 --> 00:00:18.039
an AI update. Again? The last one we got wasn't

00:00:18.039 --> 00:00:20.800
really AI, it was robots. It was too, it was

00:00:20.800 --> 00:00:24.379
too. Don't even call into question what it was.

00:00:24.500 --> 00:00:26.800
Okay, I only have one more in here. So let's

00:00:26.800 --> 00:00:30.039
see if it's filed properly. And I feel like it's

00:00:30.039 --> 00:00:33.939
bordering on news of another category. So this

00:00:33.939 --> 00:00:39.280
one is from zdnet.com. Never heard of it. Written

00:00:39.280 --> 00:00:42.380
by Tiernan Ray, who is a senior contributing

00:00:42.380 --> 00:00:46.520
writer, April 18, 2025. AI has grown beyond -

00:00:49.140 --> 00:00:53.820
The world of artificial intelligence has recently

00:00:53.820 --> 00:00:56.939
been preoccupied with advancing generative AI

00:00:56.939 --> 00:01:00.320
beyond simple tests that AI models easily pass.

00:01:00.600 --> 00:01:03.659
The famed Turing test has been beaten in some

00:01:03.659 --> 00:01:06.620
sense, and controversy rages over whether the

00:01:06.620 --> 00:01:09.500
newest models are being built to game the benchmark

00:01:09.500 --> 00:01:12.420
tests that measure performance. The problem,

00:01:12.780 --> 00:01:15.939
say scholars at Google's DeepMind unit, is not

00:01:15.939 --> 00:01:18.719
the tests themselves, but the limited way AI

00:01:18.719 --> 00:01:21.219
models are developed. The data used to train

00:01:21.219 --> 00:01:24.140
AI is too restricted and static, and will never

00:01:24.140 --> 00:01:27.599
propel AI to new and better abilities. In a paper

00:01:27.599 --> 00:01:30.799
posted by DeepMind last week, part of a forthcoming

00:01:30.799 --> 00:01:34.140
book by MIT Press, researchers proposed that

00:01:34.140 --> 00:01:38.239
AI must be allowed to have experiences of a sort

00:01:38.239 --> 00:01:40.920
interacting with the world to formulate goals

00:01:40.920 --> 00:01:43.739
based on signals from the environment. Quote,

00:01:43.920 --> 00:01:46.819
incredible new capabilities will arise once the

00:01:46.819 --> 00:01:50.340
full potential of experiential learning is harnessed,

00:01:50.340 --> 00:01:53.079
write DeepMind scholars David Silver and Richard

00:01:53.079 --> 00:01:56.359
Sutton in the paper, Welcome to the Era of Experience.

00:01:56.640 --> 00:01:59.579
The two scholars are legends in the field. Silver

00:01:59.579 --> 00:02:01.959
most famously led the research that resulted

00:02:01.959 --> 00:02:05.400
in AlphaZero, DeepMind's AI model that beat humans

00:02:05.400 --> 00:02:09.360
in games of chess and Go. What's Go? It's a Chinese

00:02:09.360 --> 00:02:13.960
game that has black and white pieces. It's widely

00:02:13.960 --> 00:02:17.099
known that it has like basically infinite different

00:02:17.099 --> 00:02:19.759
setups of how it can play out. So it's supposed

00:02:19.759 --> 00:02:21.479
to be one of the hardest games to actually map

00:02:21.479 --> 00:02:24.599
out for a computer. More than chess? Yeah, infinitely

00:02:24.599 --> 00:02:27.800
more. Oh, weird. It sounds not like something

00:02:27.800 --> 00:02:31.020
I would want to play. Sounds difficult. I get

00:02:31.020 --> 00:02:35.219
that. Even though it was fairly vague. Sutton

00:02:35.219 --> 00:02:38.879
is one of the two Turing Award-winning developers

00:02:38.879 --> 00:02:41.719
of an AI approach called reinforcement learning

00:02:41.719 --> 00:02:44.599
that Silver and his team used to create AlphaZero.

00:02:44.719 --> 00:02:47.400
The approach the two scholars advocate builds

00:02:47.400 --> 00:02:50.199
upon reinforcement learning and the lessons of

00:02:50.199 --> 00:02:53.280
AlphaZero. It's called streams and is meant to

00:02:53.280 --> 00:02:56.180
remedy the shortcomings of today's large language

00:02:56.180 --> 00:02:59.300
models, which are developed solely to answer individual

00:02:59.300 --> 00:03:02.219
human questions. Silver and Sutton suggest that

00:03:02.219 --> 00:03:05.460
shortly after AlphaZero and its predecessor AlphaGo,

00:03:05.719 --> 00:03:08.159
burst on the scene, generative AI tools such

00:03:08.159 --> 00:03:11.639
as ChatGPT took the stage and discarded reinforcement

00:03:11.639 --> 00:03:14.479
learning. That move had benefits and drawbacks.

00:03:15.280 --> 00:03:18.680
Gen AI was an important advance because AlphaZero's

00:03:18.680 --> 00:03:20.879
use of reinforcement learning was restricted

00:03:20.879 --> 00:03:23.860
to limited applications. The technology couldn't

00:03:23.860 --> 00:03:26.969
go beyond, quote, full information, end quote,

00:03:26.969 --> 00:03:29.550
games, such as chess, where all the rules are

00:03:29.550 --> 00:03:32.150
known. Gen AI models, on the other hand, can

00:03:32.150 --> 00:03:35.330
handle spontaneous input from humans never before

00:03:35.330 --> 00:03:38.069
encountered, without explicit rules about how

00:03:38.069 --> 00:03:41.189
things are supposed to turn out. However, discarding

00:03:41.189 --> 00:03:43.710
reinforcement learning meant something was lost

00:03:43.710 --> 00:03:46.830
in this transition: an agent's ability to

00:03:46.830 --> 00:03:49.889
self-discover its own knowledge, they write. Instead,

00:03:50.150 --> 00:03:54.169
they observe that LLMs, which is, let me just

00:03:54.169 --> 00:03:58.000
remind you. Yeah, good memory. Rely on human

00:03:58.000 --> 00:04:01.340
prejudgment or what the human wants at the prompt

00:04:01.340 --> 00:04:04.580
stage. That approach is too limited. They suggest

00:04:04.580 --> 00:04:08.240
that human judgment imposes an impenetrable ceiling

00:04:08.240 --> 00:04:10.620
on the agent's performance. The agent cannot

00:04:10.620 --> 00:04:13.479
discover better strategies underappreciated by

00:04:13.479 --> 00:04:17.000
the human rater. Not only is human judgment an

00:04:17.000 --> 00:04:19.759
impediment, but the short, clipped nature of

00:04:19.759 --> 00:04:22.399
prompt interactions never allows the AI model

00:04:22.399 --> 00:04:25.220
to advance beyond question and answer. I always

00:04:25.220 --> 00:04:27.519
feel like AI should be in a question and answer

00:04:27.519 --> 00:04:29.720
phase, shouldn't it? Like it needs a prompt from

00:04:29.720 --> 00:04:32.319
somebody to move on, lest it be able to take

00:04:32.319 --> 00:04:35.519
its own steps outside of that question answer

00:04:35.519 --> 00:04:38.459
dialogue. Yeah, that's exactly what I was just

00:04:38.459 --> 00:04:41.160
thinking. You would think it would need a prompt

00:04:41.160 --> 00:04:43.819
in a form where it's on a computer, but what

00:04:43.720 --> 00:04:47.180
if you have AI running a robot that's in

00:04:47.180 --> 00:04:50.379
a race? Where they weren't in what we saw on

00:04:50.379 --> 00:04:53.439
the mini. However, what if it was AI? What if

00:04:53.439 --> 00:04:56.079
it could run on its own? Right? Wouldn't it run

00:04:56.079 --> 00:04:58.399
without a prompt then? I would hope it would

00:04:58.399 --> 00:05:01.579
still need a prompt. Yeah. Go run this race.

00:05:01.879 --> 00:05:04.319
And I guess then it would run the race to completion.

00:05:04.939 --> 00:05:08.339
Yeah. Until it needed, yeah. But no, I just I

00:05:08.339 --> 00:05:11.620
like the idea of having a prompt for it to respond

00:05:11.620 --> 00:05:16.050
to as opposed to doing its own thing. Yeah, it

00:05:16.050 --> 00:05:18.410
sounds like we're getting a little too advanced

00:05:18.410 --> 00:05:20.709
in my mind. Yeah, and that's what the article

00:05:20.709 --> 00:05:23.750
is about. I'm pretty sure. In the era of human

00:05:23.750 --> 00:05:26.490
data, language-based AI has largely focused

00:05:26.490 --> 00:05:29.470
on short interaction episodes. For example, a

00:05:29.470 --> 00:05:32.889
user asks a question and, perhaps after a few thinking

00:05:32.889 --> 00:05:35.750
steps or tool-use interactions, the agent

00:05:35.750 --> 00:05:39.110
responds, the researchers write. The agent aims

00:05:39.110 --> 00:05:42.250
exclusively for outcomes within the current episode,

00:05:42.490 --> 00:05:45.069
such as directly answering a user's questions.

00:05:45.509 --> 00:05:49.449
I am just using quotations willy-nilly in this.

00:05:49.589 --> 00:05:52.769
I'm very sorry. There should be a lot of quotations

00:05:52.769 --> 00:05:55.970
in here that I'm just not using, so I hope it's

00:05:55.970 --> 00:05:58.350
making its way across. I just don't want to say

00:05:58.350 --> 00:06:01.509
quote. It feels like too much today. I had a

00:06:01.509 --> 00:06:06.490
bee attack recently. We basically had the deep

00:06:06.490 --> 00:06:10.459
state trying to stop us, basically. All in all,

00:06:10.740 --> 00:06:13.420
I'll continue now. There's no memory. There's

00:06:13.420 --> 00:06:16.120
no continuity between snippets of interaction

00:06:16.120 --> 00:06:18.779
and prompting. Quote, typically, little or no

00:06:18.779 --> 00:06:21.399
information carries over from one episode to

00:06:21.399 --> 00:06:24.379
the next, precluding any adaptation over time,

00:06:24.480 --> 00:06:26.819
end quote, write Silver and Sutton. However,

00:06:27.040 --> 00:06:29.800
in their proposed age of experience, quote, agents

00:06:29.800 --> 00:06:32.579
will inhabit streams of experience rather than

00:06:32.579 --> 00:06:35.199
short snippets of interaction, end quote. Silver

00:06:35.199 --> 00:06:37.920
and Sutton draw an analogy between streams and

00:06:37.920 --> 00:06:57.000
human Silver and Sutton argue that, quote, today's

00:06:57.000 --> 00:06:59.639
technology, end quote, it really adds to it when

00:06:59.639 --> 00:07:02.240
I put the quotes in, is enough to start building

00:07:02.240 --> 00:07:05.399
streams. In fact, the initial steps along the

00:07:05.399 --> 00:07:08.709
way can be seen in developments such as web-browsing

00:07:08.970 --> 00:07:12.290
AI agents, including OpenAI's Deep Research.

00:07:12.540 --> 00:07:15.819
Quote, recently, a new wave of prototype agents

00:07:15.819 --> 00:07:18.519
have started to interact with computers in an

00:07:18.519 --> 00:07:21.939
even more general manner by using the same interface

00:07:21.939 --> 00:07:24.639
that humans use to operate a computer, end quote,

00:07:24.639 --> 00:07:27.759
they write. The browser agent marks a transition

00:07:27.759 --> 00:07:30.860
from exclusively human-privileged communication

00:07:30.860 --> 00:07:33.800
to much more autonomous interactions where the

00:07:33.800 --> 00:07:36.579
agent is able to act independently in the world.

00:07:36.579 --> 00:07:40.139
As AI agents move beyond just web browsing, they

00:07:40.139 --> 00:07:42.670
need a way to interact and learn from the world,

00:07:42.889 --> 00:07:45.930
Silver and Sutton suggest. Are they suggesting

00:07:45.930 --> 00:07:49.449
that it evolves? Is that what I'm getting from

00:07:49.449 --> 00:07:52.149
this? I don't know if... Essentially... They

00:07:52.149 --> 00:07:54.230
mean evolve. I think they're just meaning that

00:07:54.230 --> 00:07:56.930
they need to be able to do this. The robots.

00:07:57.290 --> 00:08:03.100
No, like the companies who own the robot. They

00:08:03.100 --> 00:08:06.379
propose that the AI agents in streams will learn

00:08:06.379 --> 00:08:08.920
via the same reinforcement learning principle

00:08:08.920 --> 00:08:11.959
as AlphaZero. The machine is given a model of

00:08:11.959 --> 00:08:15.620
the world in which it interacts, akin to a chessboard,

00:08:15.620 --> 00:08:18.879
and a set of rules. As the AI agent explores

00:08:18.879 --> 00:08:22.360
and takes actions, it receives feedback as rewards.

00:08:22.980 --> 00:08:26.319
These rewards train the AI model on what is more

00:08:26.319 --> 00:08:29.540
or less valuable among possible actions in a

00:08:29.540 --> 00:08:32.840
given circumstance. The world is full of various

00:08:32.840 --> 00:08:35.799
signals providing those rewards, if the agent

00:08:35.799 --> 00:08:38.100
is allowed to look for them, Silver and Sutton

00:08:38.100 --> 00:08:41.120
suggest. Quote, where do rewards come from, if

00:08:41.120 --> 00:08:44.340
not from human data? Once agents become connected

00:08:44.340 --> 00:08:47.000
to the world through rich action and observation

00:08:47.000 --> 00:08:49.440
spaces, there will be no shortage of grounded

00:08:49.440 --> 00:08:52.840
signals to provide a basis for reward. In fact,

00:08:53.000 --> 00:08:56.039
the world abounds with quantities such as cost,

00:08:56.379 --> 00:08:59.690
error rates, hunger, productivity, health metrics,

00:09:00.190 --> 00:09:02.950
climate metrics, profit, sales, exam results,

00:09:03.289 --> 00:09:06.580
successes, visits, yields, stock, likes, income,

00:09:06.679 --> 00:09:09.500
pleasure, pain, economic indicators, accuracy,

00:09:09.559 --> 00:09:12.179
power, distance, speed, efficiency or energy

00:09:12.179 --> 00:09:15.179
consumption. In addition, there are innumerable

00:09:15.179 --> 00:09:17.639
additional signals arising from the occurrence

00:09:17.639 --> 00:09:21.080
of specific events or from features derived from

00:09:21.080 --> 00:09:24.600
raw sequences of observations and actions. To

00:09:24.600 --> 00:09:28.080
start the AI agent from a foundation, AI developers

00:09:28.080 --> 00:09:31.259
might use a world model simulation. The world

00:09:31.259 --> 00:09:34.309
model lets an AI model make predictions, test

00:09:34.309 --> 00:09:36.409
those predictions in the real world, and then

00:09:36.409 --> 00:09:39.029
use the reward signals to make the model more

00:09:39.029 --> 00:09:42.190
realistic. Quote, as the agent continues to interact

00:09:42.190 --> 00:09:44.370
with the world through its stream of experience,

00:09:44.610 --> 00:09:48.250
its dynamics model is continually updated to

00:09:48.250 --> 00:09:50.389
correct any errors in its predictions, end quote,

00:09:50.529 --> 00:09:52.830
they write. Silver and Sutton still expect humans

00:09:52.830 --> 00:09:55.549
to have a role in defining goals for which the

00:09:55.549 --> 00:09:57.889
signals and rewards serve to steer the agent.

00:09:57.990 --> 00:10:01.190
For example, a user might specify a broad goal

00:10:01.190 --> 00:10:04.350
such as improve my fitness, and the reward function

00:10:04.350 --> 00:10:07.029
might return a function of the user's heart rate,

00:10:07.269 --> 00:10:10.009
sleep duration, and steps taken, or the user

00:10:10.009 --> 00:10:12.750
might specify a goal of help me learn Spanish,

00:10:12.870 --> 00:10:15.769
and the reward function could return the user's

00:10:15.769 --> 00:10:18.649
Spanish exam results. The human feedback becomes

00:10:18.649 --> 00:10:22.409
the top-level goal that all else serves. That's

00:10:22.409 --> 00:10:25.850
what I was just wondering. That sounds incredibly

00:10:25.850 --> 00:10:30.230
intrusive. What if it's not getting those rewards?

00:10:30.490 --> 00:10:33.649
Is that, I don't know anything about how AI is

00:10:33.649 --> 00:10:36.669
set up. I don't know. I mean, my AI - It's just

00:10:36.669 --> 00:10:39.610
us living around in the hardware, or sorry, in

00:10:39.610 --> 00:10:40.590
a warehouse somewhere. As much as I shouldn't

00:10:40.590 --> 00:10:44.429
be giving away our secrets, my AI would be rewarded

00:10:44.429 --> 00:10:48.389
by our high-level profits on each episode. Which

00:10:48.389 --> 00:10:50.929
is why we're multi-million dollar enterprise.

00:10:51.370 --> 00:10:53.840
Yes. Please still give us money. We need it so

00:10:53.840 --> 00:10:58.639
bad. That is why we do all the episodes such

00:10:58.639 --> 00:11:01.379
as we do about corporations because we are one.

00:11:01.519 --> 00:11:03.500
There was a red herring out there so that you

00:11:03.500 --> 00:11:05.360
guys are off the trail. We really fooled you.

00:11:05.500 --> 00:11:07.620
We really fooled you. We're trying to throw those

00:11:07.620 --> 00:11:13.110
fucking orcas off our yachts. Yeah, I don't know.

00:11:13.190 --> 00:11:15.730
I don't know how AI works. I'm certainly not

00:11:15.730 --> 00:11:19.250
giving my AI treats. A big thing that I'm worried

00:11:19.250 --> 00:11:21.309
about with how they were describing this is they're

00:11:21.309 --> 00:11:23.730
talking about like a lone AI doing this, but

00:11:23.730 --> 00:11:25.769
that wouldn't be the case. It would be multiple

00:11:25.769 --> 00:11:28.009
different AIs potentially doing this at the same

00:11:28.009 --> 00:11:31.149
time, which likely would then mean they're interacting

00:11:31.149 --> 00:11:33.929
out there without human interaction with each

00:11:33.929 --> 00:11:36.049
other, which is literally just the dead internet

00:11:36.049 --> 00:11:38.440
theory, that the internet's just going to be

00:11:38.440 --> 00:11:41.240
robots at the end. I've never heard this theory.

00:11:41.360 --> 00:11:43.240
We can do an episode on the dead internet theory

00:11:43.240 --> 00:11:45.720
at a different time, which is really coming to

00:11:45.720 --> 00:11:49.000
life. I would really like to do that. This does

00:11:49.000 --> 00:11:52.240
keep going on a lot. So I hope you get the point.

00:11:52.419 --> 00:11:55.039
I don't know that I fully wrapped my head around

00:11:55.039 --> 00:11:57.820
this. You guys got 48 hours and then you can

00:11:57.820 --> 00:12:00.220
just move on with it and just learn about the

00:12:00.220 --> 00:12:01.980
stuff that we're teaching you on Friday and just

00:12:01.980 --> 00:12:04.779
forget that this other one exists. I mean, I

00:12:04.779 --> 00:12:05.659
probably will. Yeah.