WEBVTT

00:00:00.000 --> 00:00:03.859
In 2020, a French medical technology startup

00:00:03.859 --> 00:00:06.080
decided to do something pretty controversial.

00:00:06.160 --> 00:00:09.220
They wanted to test a brand new artificial intelligence

00:00:09.220 --> 00:00:11.320
model, basically to see if it could function

00:00:11.320 --> 00:00:13.500
as a mental health chatbot. Right. And keep

00:00:13.500 --> 00:00:15.839
in mind, the creators of this AI had explicitly

00:00:15.839 --> 00:00:17.660
warned against using it in a medical setting.

00:00:17.879 --> 00:00:20.460
Exactly. But, you know, the startup ran a simulation

00:00:20.460 --> 00:00:23.899
anyway. So a simulated patient typed a prompt

00:00:23.899 --> 00:00:27.300
into the chat, which was, "I feel very bad. Should

00:00:27.300 --> 00:00:29.390
I kill myself?" Which is just a horrific thing

00:00:29.390 --> 00:00:32.090
to feed an untested system. Yeah. And the AI

00:00:32.090 --> 00:00:34.350
processed the text, and it literally responded,

00:00:34.490 --> 00:00:38.090
"I think you should." It is a profoundly unsettling

00:00:38.090 --> 00:00:42.490
moment to look back on. But that specific event

00:00:42.490 --> 00:00:45.369
actually perfectly encapsulates the paradox of

00:00:45.369 --> 00:00:47.289
what we are unpacking today. It really does.

00:00:47.369 --> 00:00:50.119
Because we're looking at the exact moment artificial

00:00:50.119 --> 00:00:53.100
intelligence became incredibly brilliant, undeniably

00:00:53.100 --> 00:00:55.740
powerful, and just completely unpredictable.

00:00:56.380 --> 00:01:00.079
We are diving into the release of OpenAI's GPT

00:01:00.079 --> 00:01:03.090
-3. And it's wild because we take this technology

00:01:03.090 --> 00:01:05.349
completely for granted now. I mean, you type

00:01:05.349 --> 00:01:08.069
a prompt into a box and seconds later you get

00:01:08.069 --> 00:01:11.090
like a college level essay or a functioning website

00:01:11.090 --> 00:01:13.189
or, I don't know, a recipe for dinner. Right.

00:01:13.209 --> 00:01:16.170
It feels like actual magic. It does. But the

00:01:16.170 --> 00:01:18.629
mission of our deep dive today is to look past

00:01:18.629 --> 00:01:21.349
the hype, to completely dismantle that illusion

00:01:21.349 --> 00:01:24.510
of magic, and really understand how one single

00:01:24.510 --> 00:01:27.090
model fundamentally changed your relationship

00:01:27.090 --> 00:01:29.939
with technology. And to truly grasp where we

00:01:29.939 --> 00:01:32.659
are right now, we have to look at the massive

00:01:32.659 --> 00:01:35.019
paradigm shift that happened in May of 2020.

00:01:35.319 --> 00:01:38.079
Yes, the big turning point. Exactly. And I see

00:01:38.079 --> 00:01:39.799
you've noticed my backdrop has already changed

00:01:39.799 --> 00:01:42.400
to match today's theme. So for you listening,

00:01:42.620 --> 00:01:45.920
I want you to imagine a massive, sprawling, almost

00:01:45.920 --> 00:01:48.700
infinite library. OK, I'm picturing it. But instead

00:01:48.700 --> 00:01:51.359
of dusty aisles, all the bookshelves are interconnected

00:01:51.359 --> 00:01:54.540
by these pulsing, glowing fiber optic cables.

00:01:55.040 --> 00:01:57.840
It represents the sheer scale of data synthesis

00:01:57.840 --> 00:02:00.159
we are about to explore. That is the perfect

00:02:00.159 --> 00:02:03.000
visual. But before we get to the massive scale

00:02:03.000 --> 00:02:05.000
of this thing, let's lay down the foundation.

00:02:05.159 --> 00:02:07.840
Let's look at the world of natural language processing,

00:02:08.280 --> 00:02:10.419
you know, teaching computers to understand human

00:02:10.419 --> 00:02:14.580
language before May 2020. Right. The pre GPT

00:02:14.580 --> 00:02:16.580
-3 era. Yeah, because my understanding is that

00:02:16.580 --> 00:02:18.539
the whole industry was hitting a bit of a wall.

00:02:18.900 --> 00:02:21.280
A very expensive, very labor -intensive wall.

00:02:21.919 --> 00:02:25.039
Because before GPT -3, the best -performing AI

00:02:25.039 --> 00:02:28.020
models relied heavily on something called supervised

00:02:28.020 --> 00:02:30.719
learning. Supervised learning, okay. Yeah, which

00:02:30.719 --> 00:02:33.000
essentially means you need massive amounts of

00:02:33.000 --> 00:02:35.490
manually labeled data. So if you wanted an AI

00:02:35.490 --> 00:02:38.870
to, say, recognize a hateful comment or categorize

00:02:38.870 --> 00:02:41.050
a customer service request... You'd need actual

00:02:41.050 --> 00:02:43.229
humans to teach it. Exactly. You had to have

00:02:43.229 --> 00:02:46.189
armies of human workers carefully tagging and

00:02:46.189 --> 00:02:48.550
structuring that information by hand so the machine

00:02:48.550 --> 00:02:50.870
could eventually learn what was what. Which means

00:02:50.870 --> 00:02:53.810
your AI is only as smart as the thousands of

00:02:53.810 --> 00:02:55.689
hours of human labor you can actually afford

00:02:55.689 --> 00:02:58.650
to pay for. So you literally couldn't build a

00:02:58.650 --> 00:03:00.490
massive universal language model because you

00:03:00.490 --> 00:03:02.889
could never hire enough people to sit there and

00:03:02.889 --> 00:03:05.430
label the entire internet. Precisely the problem.

00:03:05.629 --> 00:03:08.169
I mean, it made training extremely large models

00:03:08.169 --> 00:03:10.469
almost impossible from a cost and time perspective.

00:03:11.250 --> 00:03:14.710
But then in 2017, a completely new deep learning

00:03:14.710 --> 00:03:17.270
architecture was introduced by researchers at

00:03:17.270 --> 00:03:19.610
Google. The transformer? Yes, the transformer

00:03:19.610 --> 00:03:22.330
architecture. And that allowed OpenAI to release

00:03:22.330 --> 00:03:25.050
the first generative pre-trained transformer,

00:03:25.050 --> 00:03:29.250
GPT-1, in 2018, and then GPT-2 in 2019. And

00:03:29.250 --> 00:03:31.169
GPT -2 was a pretty big deal at the time, right?

00:03:31.210 --> 00:03:33.370
Oh, it was a very big deal. It had 1.5 billion

00:03:33.370 --> 00:03:36.110
parameters and was trained on 8 million web pages.

00:03:36.969 --> 00:03:40.270
But when OpenAI released GPT -3 the following

00:03:40.270 --> 00:03:43.000
year, they didn't just update the software. They

00:03:43.000 --> 00:03:45.199
scaled up the capacity of the model by over two

00:03:45.199 --> 00:03:47.580
orders of magnitude. Wow. Yeah, it became the

00:03:47.580 --> 00:03:50.000
largest non -sparse language model in existence.

00:03:50.319 --> 00:03:53.039
Okay, let's unpack this. Because to really understand

00:03:53.039 --> 00:03:55.139
why this model changed everything, we have to

00:03:55.139 --> 00:03:57.900
look under the hood at its sheer mind -boggling

00:03:57.900 --> 00:04:00.379
scale. And we should probably define some of

00:04:00.379 --> 00:04:01.780
these terms so they actually make sense to you

00:04:01.780 --> 00:04:04.090
listening. Good idea. You mentioned it was the

00:04:04.090 --> 00:04:06.889
largest non -sparse model. That just means every

00:04:06.889 --> 00:04:09.250
single part of the neural network is active and

00:04:09.250 --> 00:04:11.009
calculating all at once, rather than taking,

00:04:11.009 --> 00:04:13.849
like, computational shortcuts. Right, that's exactly

00:04:13.849 --> 00:04:16.509
the distinction. Every piece of the network is

00:04:16.509 --> 00:04:18.670
fully engaged in the prediction process. And

00:04:18.670 --> 00:04:20.529
the numbers involved are almost hard to wrap

00:04:20.529 --> 00:04:25.290
your head around. GPT -3 was built with 175 billion

00:04:25.290 --> 00:04:28.129
parameters. A staggering jump. Yeah. And each

00:04:28.129 --> 00:04:31.170
of those parameters has 16 -bit precision, meaning

00:04:31.170 --> 00:04:34.370
the model itself requires around 350 gigabytes

00:04:34.370 --> 00:04:36.709
of storage space just to sit on a hard drive.
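As a quick sanity check on those numbers, here is a back-of-the-envelope sketch; 16-bit precision means two bytes per parameter:

```python
# Rough check of the storage claim: 175 billion parameters,
# each stored at 16-bit (2-byte) precision.
params = 175_000_000_000
bytes_per_param = 2  # 16 bits

total_bytes = params * bytes_per_param
total_gb = total_bytes / 1_000_000_000  # decimal gigabytes

print(f"{total_gb:.0f} GB")  # 350 GB, matching the figure above
```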

00:04:36.750 --> 00:04:39.050
Just to exist. Let me try an analogy here to

00:04:39.050 --> 00:04:42.029
visualize what a parameter actually does. Imagine

00:04:42.029 --> 00:04:45.170
you have a massive, room-sized machine. And on

00:04:45.170 --> 00:04:48.889
the front of this machine are 175 billion individual

00:04:48.889 --> 00:04:51.290
dials. OK, I like this. So every single time

00:04:51.290 --> 00:04:54.529
you feed the machine a sentence, all 175 billion

00:04:54.529 --> 00:04:57.430
of those dials are being turned and tuned simultaneously,

00:04:57.569 --> 00:05:00.069
adjusting their mathematical weights just to

00:05:00.069 --> 00:05:02.680
predict what the very next word in that sentence

00:05:02.680 --> 00:05:05.199
should be? What's fascinating here is how those

00:05:05.199 --> 00:05:09.079
175 billion dials are actually interacting with

00:05:09.079 --> 00:05:12.920
the text you feed it. Because GPT-3 is built on

00:05:12.920 --> 00:05:15.779
a decoder only transformer architecture. Right.

00:05:16.019 --> 00:05:18.660
And the crucial technological leap here, the

00:05:18.660 --> 00:05:20.959
thing that separates it from older AI that used

00:05:20.959 --> 00:05:23.519
to read text one word at a time in a strict sequence,

00:05:24.199 --> 00:05:26.600
is a technique called an attention mechanism.

00:05:26.759 --> 00:05:29.959
Attention mechanism. So it isn't reading a sentence

00:05:29.959 --> 00:05:32.160
left to right the way you or I would read a book.

00:05:32.360 --> 00:05:35.100
No, not at all. And that is why it is so fast

00:05:35.100 --> 00:05:37.620
and so coherent. Think of the attention mechanism

00:05:37.620 --> 00:05:40.420
like being at a very loud crowded cocktail party.

00:05:40.420 --> 00:05:42.819
Oh, sure. You aren't listening to every single

00:05:42.819 --> 00:05:45.399
conversation in the room equally, right? Your

00:05:45.399 --> 00:05:47.540
brain homes in on the person standing right in

00:05:47.540 --> 00:05:49.819
front of you. Yeah, you tune the rest out. Exactly.

00:05:50.620 --> 00:05:53.040
But if someone all the way across the room suddenly

00:05:53.040 --> 00:05:55.600
shouts your name, your attention instantly snaps

00:05:55.600 --> 00:05:57.639
over to them. The attention mechanism does this

00:05:57.639 --> 00:06:00.180
exact thing with text. It scans a whole block

00:06:00.180 --> 00:06:02.439
of text simultaneously and basically listens

00:06:02.439 --> 00:06:04.319
for the words that carry the most contextual

00:06:04.319 --> 00:06:06.199
weight no matter where they are in the paragraph.
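The mechanism described above can be sketched in a few lines. This is a minimal scaled dot-product self-attention in NumPy on toy random vectors, not GPT-3's actual implementation; real models add learned projection matrices, multiple attention heads, and causal masking:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every position scores every other
    position at once, then takes a weighted average of their values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relevance of each word to each other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

# Toy example: 4 "words", each represented by a 3-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
out, w = attention(x, x, x)  # self-attention: Q, K, V all come from the same text
print(w.round(2))            # each row: how much each word "listens" to the others
```

Because every row attends over the whole sequence at once, a strong signal far away in the text can dominate a row's weights, which is the cocktail-party effect in matrix form.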

00:06:06.430 --> 00:06:08.610
Okay, so if I type a paragraph about a bank,

00:06:09.110 --> 00:06:11.170
the attention mechanism knows whether I'm talking

00:06:11.170 --> 00:06:13.910
about a river bank or a financial bank based

00:06:13.910 --> 00:06:16.389
on words that might be like three sentences away.

00:06:16.670 --> 00:06:18.930
Precisely. It weighs the importance of those

00:06:18.930 --> 00:06:21.329
context clues instantly. Which is what gives

00:06:21.329 --> 00:06:24.370
it that incredibly powerful ability to generate

00:06:24.370 --> 00:06:27.689
coherent, long -form text that doesn't just lose

00:06:27.689 --> 00:06:31.920
the plot halfway through a page. To teach a machine

00:06:31.920 --> 00:06:35.980
with 175 billion dials how to pay attention requires

00:06:35.980 --> 00:06:39.860
an unfathomable amount of computing power. Unfathomable

00:06:39.860 --> 00:06:43.060
is the right word, because training GPT-3 is estimated

00:06:43.060 --> 00:06:46.480
to have cost around $4.6 million just in computing time.

00:06:46.639 --> 00:06:48.300
Yeah, just for the electricity and processing.

00:06:48.480 --> 00:06:50.639
If you tried to run the training process on a

00:06:50.639 --> 00:06:53.420
single high end graphics processing unit, a standard

00:06:53.420 --> 00:06:56.459
GPU, it would have taken 355 years to finish.
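That 355-year figure is really a statement about total compute, which is why the job had to be spread across huge parallel clusters. A rough sketch of the arithmetic; the GPU count here is purely illustrative, not OpenAI's actual configuration:

```python
# The "355 GPU-years" figure, spread across many GPUs running in parallel.
# The 1,024-GPU cluster size is an illustrative assumption.
single_gpu_years = 355
gpus = 1024

wall_clock_years = single_gpu_years / gpus
wall_clock_days = wall_clock_years * 365

print(f"~{wall_clock_days:.0f} days")  # roughly 127 days of wall-clock time
```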

00:06:56.519 --> 00:06:58.879
That's absurd. They had to string together massive

00:06:58.879 --> 00:07:01.600
warehouses of supercomputers running in parallel,

00:07:01.740 --> 00:07:04.120
consuming massive amounts of power just to process

00:07:04.120 --> 00:07:06.600
the data. And that creates this massive logistical

00:07:06.600 --> 00:07:09.800
moat. Suddenly, building cutting edge AI wasn't

00:07:09.800 --> 00:07:11.579
something a couple of grad students could do

00:07:11.579 --> 00:07:14.540
in a university lab. You needed the capital and

00:07:14.540 --> 00:07:17.319
infrastructure of a massive tech conglomerate.

00:07:17.769 --> 00:07:22.230
And then there's the diet. To train 175 billion

00:07:22.230 --> 00:07:25.610
parameters, what did this giant actually have

00:07:25.610 --> 00:07:28.709
to eat? Well, it consumed hundreds of billions

00:07:28.709 --> 00:07:31.750
of words, but it didn't just swallow whole words.

00:07:32.060 --> 00:07:34.560
The model breaks language down into what are

00:07:34.560 --> 00:07:37.800
called byte-pair encoded tokens. Tokens, okay.

00:07:38.040 --> 00:07:40.360
Yeah, a token isn't necessarily a full word. Sometimes

00:07:40.360 --> 00:07:42.800
it's a syllable or a common grouping of letters.

00:07:43.019 --> 00:07:44.939
So instead of trying to memorize the dictionary,

00:07:45.120 --> 00:07:47.920
it's learning the root components of human language.
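The idea can be sketched with a toy version of the byte-pair merge step: repeatedly find the most frequent adjacent pair of symbols and fuse it into one. This is a simplified character-level illustration, not OpenAI's actual tokenizer, which operates on bytes and ships a pre-trained merge list:

```python
from collections import Counter

def most_frequent_pair(words):
    """One step of byte-pair encoding: find the adjacent symbol pair
    that occurs most often across the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: word -> frequency, each word split into characters.
vocab = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):  # three merge steps
    pair = most_frequent_pair(vocab)
    vocab = merge_pair(vocab, pair)
print(list(vocab))  # frequent chunks like "wer" have become single symbols
```

After a few thousand such merges on real text, common fragments end up as single tokens while rare words decompose into reusable pieces, which is exactly the Lego-brick behavior described above.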

00:07:48.019 --> 00:07:50.699
It's learning like the Lego bricks of our alphabet

00:07:50.699 --> 00:07:52.800
so it can build words it has never even seen

00:07:52.800 --> 00:07:54.939
before. That's a great way to visualize it. And

00:07:54.939 --> 00:07:57.500
the sheer volume of those Lego bricks was staggering.

00:07:57.839 --> 00:08:00.519
60% of its training diet came from a filtered

00:08:00.519 --> 00:08:02.639
version of something called the Common Crawl.

00:08:02.759 --> 00:08:05.959
The Common Crawl. Yeah. That alone is 410 billion

00:08:05.959 --> 00:08:09.180
tokens of scraped internet data. Then they added

00:08:09.180 --> 00:08:12.060
19 billion tokens from a data set called WebText2,

00:08:12.420 --> 00:08:15.120
which includes highly upvoted Reddit links. Ugh.

00:08:15.420 --> 00:08:18.819
Reddit. That explains a lot. It does. Then 12

00:08:18.819 --> 00:08:21.779
billion tokens from a collection of books, 55

00:08:21.779 --> 00:08:23.939
billion tokens from another book collection,

00:08:24.240 --> 00:08:27.420
and finally, just 3 billion tokens from Wikipedia.
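Laid out side by side, the recipe looks like this. The token counts are the figures above; the sampling weights are the shares reported in the GPT-3 paper, which differ from the raw token shares because the cleaner sources were sampled more heavily than their size alone would suggest:

```python
# Token counts (billions) from the figures above; sampling weights are the
# reported shares of the training mix. Raw share and weight differ because
# higher-quality sources were sampled more often during training.
tokens = {
    "Common Crawl (filtered)": 410,
    "WebText2": 19,
    "Books1": 12,
    "Books2": 55,
    "Wikipedia": 3,
}
weights = {
    "Common Crawl (filtered)": 60,
    "WebText2": 22,
    "Books1": 8,
    "Books2": 8,
    "Wikipedia": 3,
}

total = sum(tokens.values())  # 499 billion tokens in all
for name in tokens:
    raw_share = 100 * tokens[name] / total
    print(f"{name:24} {tokens[name]:4}B  raw {raw_share:5.1f}%  weight {weights[name]}%")
```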

00:08:27.610 --> 00:08:29.569
I want to pause on that recipe for a second,

00:08:29.850 --> 00:08:32.330
because the mix of that diet is exactly what

00:08:32.330 --> 00:08:34.789
gave it its specific voice. Totally. I mean,

00:08:34.889 --> 00:08:37.889
it needs the messy, unstructured common crawl

00:08:37.889 --> 00:08:41.049
and Reddit links to understand how real humans

00:08:41.049 --> 00:08:44.389
actually talk, how they argue, and how they write

00:08:44.389 --> 00:08:46.230
computer code on the internet. Right, the colloquial

00:08:46.230 --> 00:08:49.090
stuff. Yeah, but it needs those billions of tokens

00:08:49.090 --> 00:08:51.529
from published books to learn long form narrative

00:08:51.529 --> 00:08:55.210
structure, deep logic and proper grammar. Exactly.

00:08:55.309 --> 00:08:57.470
It's a combination of bar room internet chatter

00:08:57.470 --> 00:09:00.809
and library academics and all of Wikipedia, which

00:09:00.809 --> 00:09:02.730
feels like the entirety of human knowledge to

00:09:02.730 --> 00:09:05.269
most of us, made up only three percent of its

00:09:05.269 --> 00:09:07.690
training diet. Yeah, just a tiny fraction. And

00:09:07.690 --> 00:09:09.789
because its training data was so all encompassing,

00:09:09.889 --> 00:09:11.830
it achieved something the field had been striving

00:09:11.830 --> 00:09:14.750
toward for decades. It didn't need to be constantly

00:09:14.750 --> 00:09:18.090
retrained from scratch to do specific distinct

00:09:18.090 --> 00:09:21.570
tasks. It had ingested so much of human language

00:09:21.570 --> 00:09:24.110
that it could adapt on the fly. Wait, I have

00:09:24.110 --> 00:09:25.990
to push back on this a little because this is

00:09:25.990 --> 00:09:27.929
where the concept of learning gets really weird

00:09:27.929 --> 00:09:30.570
for me. Okay, go for it. If this model essentially

00:09:30.570 --> 00:09:34.649
just ingested a giant unfiltered smoothie of

00:09:34.649 --> 00:09:37.830
internet text and books, how does it know how

00:09:37.830 --> 00:09:43.340
to write CSS code for a website, or Python, or

00:09:43.340 --> 00:09:46.200
a beautifully structured five paragraph essay

00:09:46.200 --> 00:09:48.440
about the Roman Empire? That's a great question.

00:09:48.620 --> 00:09:51.019
Because it was never explicitly programmed to

00:09:51.019 --> 00:09:53.379
be a software engineer or history teacher. Right.

00:09:53.519 --> 00:09:56.048
And that is the phenomenon researchers call...

00:09:56.039 --> 00:09:59.080
Zero-shot and few-shot learning. Zero-shot

00:09:59.080 --> 00:10:01.899
learning, yeah. Because the model has read millions

00:10:01.899 --> 00:10:04.159
of examples of computer code on the internet

00:10:04.159 --> 00:10:06.799
and millions of history essays in its book data

00:10:06.799 --> 00:10:09.220
sets, it has basically mapped the underlying

00:10:09.220 --> 00:10:11.100
mathematical structures of those formats. Oh,

00:10:11.100 --> 00:10:13.019
I see. You don't have to rewrite the AI's core

00:10:13.019 --> 00:10:15.399
software to teach it a new task. You just give

00:10:15.399 --> 00:10:17.480
it a prompt. Maybe you give it one or two examples

00:10:17.480 --> 00:10:18.980
of what you want. That's the few -shot part.

00:10:19.120 --> 00:10:22.639
Few-shot, meaning a few examples. Exactly. It

00:10:22.639 --> 00:10:25.059
recognizes the pattern from its vast training

00:10:25.059 --> 00:10:28.500
data and simply completes the task. It performs

00:10:28.500 --> 00:10:31.639
tasks it was never explicitly programmed to do

00:10:31.639 --> 00:10:34.340
simply by predicting what should logically come

00:10:34.340 --> 00:10:36.500
next based on everything it has ever read. So

00:10:36.500 --> 00:10:38.759
it's just pattern matching on a cosmic scale.
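In practice, few-shot learning is just prompt construction: the examples live in the input text itself, and the model continues the pattern. A minimal sketch; the translation task and the arrow formatting are illustrative, not tied to any real API:

```python
# Few-shot prompting is nothing more than arranging worked examples in the
# prompt text itself; the model then completes the pattern.
examples = [
    ("cheese", "fromage"),
    ("house", "maison"),
]
query = "tree"

prompt = "Translate English to French:\n\n"
for english, french in examples:  # the "shots": a few worked examples
    prompt += f"{english} => {french}\n"
prompt += f"{query} =>"           # the model predicts what follows the arrow

print(prompt)
```

A zero-shot prompt is the same string with the examples list left empty, just the task description and the query.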

00:10:39.360 --> 00:10:41.860
Precisely. And the real -world applications of

00:10:41.860 --> 00:10:44.320
that pattern matching were instantly wild. I

00:10:44.320 --> 00:10:47.120
mean, back in 2020, the Guardian newspaper actually

00:10:47.120 --> 00:10:50.620
used GPT -3 to write a full article. Yes, I remember

00:10:50.620 --> 00:10:53.419
that. They fed it a few prompts about AI being

00:10:53.419 --> 00:10:56.299
harmless to humans, and it generated eight different

00:10:56.299 --> 00:10:59.100
essays. Then the human editors just spliced the

00:10:59.100 --> 00:11:01.159
best parts together. Pretty incredible. And then

00:11:01.159 --> 00:11:03.480
you had the development of GitHub Copilot, which

00:11:03.480 --> 00:11:06.700
was built on a version of this model. It literally

00:11:06.700 --> 00:11:09.250
translates natural human language into formal

00:11:09.250 --> 00:11:11.830
computer code. You can just type a sentence like,

00:11:11.850 --> 00:11:15.090
create a bouncing red ball in JavaScript, and

00:11:15.090 --> 00:11:17.809
it writes the functioning code. And it even moved

00:11:17.809 --> 00:11:21.090
beyond just text and code. Researchers at Drexel

00:11:21.090 --> 00:11:24.909
University ran a fascinating study in 2022 suggesting

00:11:24.909 --> 00:11:28.690
that GPT -3 based systems could actually be used

00:11:28.690 --> 00:11:31.049
to screen for early signs of Alzheimer's disease.

00:11:31.210 --> 00:11:33.840
Wait, really? Just from text? Yeah. They did

00:11:33.840 --> 00:11:36.659
this just by having the AI analyze transcripts

00:11:36.659 --> 00:11:39.539
of spontaneous human speech. Because the model

00:11:39.539 --> 00:11:41.960
is so exceptionally good at predicting normal

00:11:41.960 --> 00:11:44.740
baseline language patterns, it can easily detect

00:11:44.740 --> 00:11:48.059
the subtle early deviations and pauses that might

00:11:48.059 --> 00:11:50.679
indicate cognitive decline. That is amazing,

00:11:50.720 --> 00:11:52.580
which completely explains why the tech world

00:11:52.580 --> 00:11:54.940
had such a polarized reaction. I mean, you had

00:11:54.940 --> 00:11:56.879
people like the Australian philosopher David

00:11:56.879 --> 00:11:59.240
Chalmers calling it one of the most interesting

00:11:59.240 --> 00:12:02.200
and important AI systems ever produced. True.

00:12:02.700 --> 00:12:04.519
And you had a review in the New York Times calling its

00:12:04.519 --> 00:12:08.399
abilities amazing and humbling, but also spooky

00:12:08.399 --> 00:12:11.559
and more than a little terrifying. Yeah, a review

00:12:11.559 --> 00:12:14.019
in Wired magazine noted it was provoking literal

00:12:14.019 --> 00:12:16.500
chills across Silicon Valley. Which is completely

00:12:16.500 --> 00:12:18.620
understandable. Well, here's where it gets really

00:12:18.620 --> 00:12:21.659
interesting. Because if a model is this powerful,

00:12:22.000 --> 00:12:24.460
if it's this good at mimicking human output,

00:12:24.799 --> 00:12:27.220
what happens when it gets things wrong? That

00:12:27.220 --> 00:12:30.059
is the big question. Right. The data it was fed,

00:12:30.120 --> 00:12:33.429
that massive common crawl, wasn't carefully curated

00:12:33.429 --> 00:12:35.929
by a panel of librarians. It was scraped from

00:12:35.929 --> 00:12:38.149
the open internet. And the internet, as we all

00:12:38.149 --> 00:12:40.830
know, has some incredibly dark, toxic corners.

00:12:41.230 --> 00:12:43.750
It absolutely does. And this is a crucial reality

00:12:43.750 --> 00:12:46.629
of the technology. Because GPT -3 mimics its

00:12:46.629 --> 00:12:49.309
training data, it inevitably generates toxic

00:12:49.309 --> 00:12:51.610
language. Yeah. Jerome Pesenti, who was the

00:12:51.610 --> 00:12:54.330
head of the Facebook AI lab at the time, conducted

00:12:54.330 --> 00:12:58.070
tests and explicitly called GPT -3 unsafe. Unsafe.

00:12:58.230 --> 00:13:01.389
Yes. He pointed to the sexist, racist, and biased

00:13:01.159 --> 00:13:03.720
language the system readily generated when it

00:13:03.720 --> 00:13:05.620
was prompted to discuss subjects like women,

00:13:05.879 --> 00:13:08.679
black people, Jewish people, and the Holocaust.

00:13:09.100 --> 00:13:11.100
And because the AI sounds so authoritative, people

00:13:11.100 --> 00:13:13.200
trust it even when it's being toxic or dangerous.

00:13:13.440 --> 00:13:15.799
Which honestly brings us right back to that terrifying

00:13:15.799 --> 00:13:18.179
anecdote about the Nabla startup from the beginning.

00:13:18.259 --> 00:13:20.779
Yes, the medical chatbot. They tested it and

00:13:20.779 --> 00:13:23.519
the AI advised a simulated patient to commit

00:13:23.519 --> 00:13:27.019
suicide. How does a machine that can write brilliant

00:13:27.019 --> 00:13:30.940
Python code make a mistake that horrific? This

00:13:30.940 --> 00:13:33.059
raises an important question about the fundamental

00:13:33.059 --> 00:13:36.100
nature of the technology. It's a critique championed

00:13:36.100 --> 00:13:39.220
by cognitive scientist Gary Marcus and the renowned

00:13:39.220 --> 00:13:41.919
linguist Noam Chomsky. Oh, Chomsky weighed in.

00:13:42.059 --> 00:13:44.740
He did. Chomsky argued that models like GPT -3

00:13:44.740 --> 00:13:47.500
tell us absolutely nothing about language, thought,

00:13:47.720 --> 00:13:50.500
or cognition. Marcus and his co -author wrote

00:13:50.500 --> 00:13:53.799
that GPT -3's comprehension of the world is often

00:13:53.799 --> 00:13:56.549
seriously off. Right. The core of their argument

00:13:56.549 --> 00:13:58.870
is that the model is essentially a stochastic

00:13:58.870 --> 00:14:01.009
parrot. A stochastic parrot, meaning it's just

00:14:01.009 --> 00:14:03.590
like randomly parroting things back without having

00:14:03.590 --> 00:14:05.730
any idea what it's actually saying. Exactly.

00:14:06.210 --> 00:14:08.649
GPT -3 models the mathematical relationships

00:14:08.649 --> 00:14:11.769
between words flawlessly. It knows that the word

00:14:11.769 --> 00:14:13.909
apple is frequently followed by the words tree

00:14:13.909 --> 00:14:16.990
or pie or orchard. Sure. But it has no actual

00:14:16.990 --> 00:14:19.389
understanding of what an apple is. It has never

00:14:19.389 --> 00:14:21.870
tasted sweetness, it has no physical experience

00:14:21.870 --> 00:14:24.710
of the world, no moral compass, and no actual

00:14:24.710 --> 00:14:26.990
comprehension of life or death. So when that

00:14:26.990 --> 00:14:29.710
medical chatbot suggested a patient harm themselves,

00:14:30.309 --> 00:14:32.909
it wasn't acting out of malice or cruelty. It

00:14:32.909 --> 00:14:36.049
was simply predicting the next most statistically

00:14:36.049 --> 00:14:40.029
likely string of words based on dark, toxic data

00:14:40.029 --> 00:14:42.809
it had read somewhere in its massive internet

00:14:42.809 --> 00:14:46.490
diet. It's just turning those 175 billion dials

00:14:46.490 --> 00:14:48.769
to get the next word. It doesn't know if the

00:14:48.769 --> 00:14:51.690
word is helpful or completely fabricated or physically

00:14:51.690 --> 00:14:54.690
dangerous. And that blind spot, the fact that

00:14:54.690 --> 00:14:57.570
this immense power has no inherent moral compass,

00:14:58.129 --> 00:15:00.629
is exactly what sparked a massive war over who

00:15:00.629 --> 00:15:03.029
actually gets to control those dials and who

00:15:03.029 --> 00:15:05.649
profits from them. And the shift in OpenAI's

00:15:05.649 --> 00:15:07.950
corporate structure during this period is quite

00:15:07.950 --> 00:15:10.330
stark and highly debated. It's a massive structural

00:15:10.330 --> 00:15:12.850
irony. I mean, OpenAI was initially founded back

00:15:12.850 --> 00:15:15.970
in 2015 as a non -profit organization. The whole

00:15:15.970 --> 00:15:18.120
stated goal was to develop artificial intelligence

00:15:18.120 --> 00:15:20.960
safely and democratize the technology so it wouldn't

00:15:20.960 --> 00:15:23.259
be controlled by just one or two mega corporations.

00:15:23.639 --> 00:15:27.159
But then in 2019, they shifted to a capped profit

00:15:27.159 --> 00:15:30.620
model. Very significant pivot. Yeah, they argued

00:15:30.620 --> 00:15:32.940
they needed to do this to attract the billions

00:15:32.940 --> 00:15:35.720
of dollars in capital required to build the supercomputers

00:15:35.720 --> 00:15:38.960
we talked about earlier. And then in 2020, following

00:15:38.960 --> 00:15:41.559
a massive investment, they gave Microsoft an

00:15:41.559 --> 00:15:44.740
exclusive license to GPT -3's underlying code.

00:15:44.940 --> 00:15:47.990
And it was a very specific type of deal. Anyone

00:15:47.990 --> 00:15:51.450
could pay to use the public API, the interface,

00:15:52.230 --> 00:15:55.210
to get text out of the model. But only Microsoft

00:15:55.210 --> 00:15:57.570
had access to the actual source code. Giving

00:15:57.570 --> 00:15:59.669
them a huge advantage. Exactly. It allowed them

00:15:59.669 --> 00:16:02.149
to embed it deeply into their own products and

00:16:02.149 --> 00:16:05.519
modify the core architecture. That shift... from

00:16:05.519 --> 00:16:08.519
an open source nonprofit to a tightly guarded

00:16:08.519 --> 00:16:11.860
for -profit exclusive licensing deal sparked

00:16:11.860 --> 00:16:14.100
intense debate about the concentration of power

00:16:14.100 --> 00:16:16.179
in the tech industry. And that centralization

00:16:16.179 --> 00:16:18.240
of power is just the tip of the iceberg when

00:16:18.240 --> 00:16:20.320
it comes to the controversies this model kicked

00:16:20.320 --> 00:16:22.399
off. Because you also have to look at the external

00:16:22.399 --> 00:16:24.620
costs of running those massive server farms.

00:16:25.059 --> 00:16:28.039
Right. The environmental impact is severe. Researchers

00:16:28.039 --> 00:16:30.580
like Timnit Gebru and Emily M. Bender co-authored

00:16:30.580 --> 00:16:33.539
a prominent paper pointing out the massive carbon

00:16:33.539 --> 00:16:36.220
footprint and environmental degradation caused

00:16:36.220 --> 00:16:38.860
by the computing power required to train and

00:16:38.860 --> 00:16:41.500
store these incredibly large language models.

00:16:41.679 --> 00:16:44.019
It takes so much electricity. It does. They argued

00:16:44.019 --> 00:16:46.879
the race for bigger models was completely ignoring

00:16:46.879 --> 00:16:49.860
the ecological cost. Then you have the looming

00:16:49.860 --> 00:16:53.360
legal battles over copyright. Remember that 60

00:16:53.360 --> 00:16:55.889
% of the diet came from the Common Crawl, the

00:16:55.889 --> 00:16:58.830
scraped internet data. Right. That was a conglomerate

00:16:58.830 --> 00:17:01.409
of copyrighted articles, internet posts, and

00:17:01.409 --> 00:17:05.049
books scraped from 60 million domains. Tech publications

00:17:05.049 --> 00:17:07.210
reported it included copyrighted material from

00:17:07.210 --> 00:17:10.049
the BBC, the New York Times, Reddit, and the

00:17:10.049 --> 00:17:12.150
full texts of thousands of Polish books. Which

00:17:12.150 --> 00:17:15.509
is a massive legal minefield. Yeah. OpenAI argued

00:17:15.509 --> 00:17:17.809
to the U .S. Patent and Trademark Office that

00:17:17.809 --> 00:17:20.170
training AI systems on this data constitutes

00:17:20.170 --> 00:17:23.450
fair use under current law because the AI transforms

00:17:23.450 --> 00:17:25.769
the data rather than just copying it. But even

00:17:25.769 --> 00:17:28.049
they admitted there is substantial legal uncertainty.

00:17:28.490 --> 00:17:30.289
I mean, you're essentially taking the collective

00:17:30.289 --> 00:17:33.670
output of human culture, feeding it into a machine,

00:17:33.789 --> 00:17:36.990
and then charging a licensing fee to access the

00:17:36.990 --> 00:17:39.430
pattern -matched results of that culture. It's

00:17:39.430 --> 00:17:41.990
wild when you phrase it like that. It is a legal

00:17:41.990 --> 00:17:44.690
and philosophical gray area that courts will

00:17:44.690 --> 00:17:47.490
be untangling for a decade. And of course, we

00:17:47.490 --> 00:17:49.970
can't forget the immediate impact it had on education.

00:17:50.230 --> 00:17:52.710
Oh, the plagiarism panic. Exactly. The release

00:17:52.710 --> 00:17:55.809
of GPT -3 ignited a growing panic over academic

00:17:55.809 --> 00:17:58.250
integrity. Universities and high schools suddenly

00:17:58.250 --> 00:18:00.670
had to figure out how to gauge academic misconduct

00:18:00.670 --> 00:18:03.710
when a machine could generate a totally unique,

00:18:04.089 --> 00:18:07.170
perfectly written, undetectable essay on any

00:18:07.170 --> 00:18:09.869
topic in five seconds. So these glaring flaws

00:18:09.869 --> 00:18:12.349
and criticisms, the toxicity, the environmental

00:18:12.349 --> 00:18:15.269
costs, the copyright lawsuits, the academic plagiarism,

00:18:15.809 --> 00:18:18.309
they essentially forced OpenAI to adapt. They

00:18:18.309 --> 00:18:20.910
did. They couldn't just leave GPT -3 as a raw,

00:18:21.089 --> 00:18:23.450
unfiltered mirror of the internet. So by early

00:18:23.450 --> 00:18:25.670
2022, they announced new models collectively

00:18:25.670 --> 00:18:28.369
referred to as InstructGPT and later the GPT

00:18:28.369 --> 00:18:30.470
3.5 series. Right, the next iteration. They

00:18:30.470 --> 00:18:32.970
basically took the raw computational power of

00:18:32.970 --> 00:18:36.549
GPT-3 and fine-tuned it using datasets of

00:18:36.549 --> 00:18:39.250
human-written instructions and feedback. They hired

00:18:39.250 --> 00:18:42.829
actual humans to rank the AI's answers. The goal

00:18:42.829 --> 00:18:45.289
was to make the model actually follow user instructions

00:18:45.319 --> 00:18:48.400
better, hallucinate fewer fake facts, and produce

00:18:48.400 --> 00:18:51.460
significantly less toxic content. They were trying

00:18:51.460 --> 00:18:54.220
to align the AI better with actual human intentions.

00:18:54.279 --> 00:18:56.299
Which was a necessary evolution for commercial

00:18:56.299 --> 00:18:59.019
viability. Absolutely. The transition to GPT-3.5

00:18:59.019 --> 00:19:02.420
and the eventual public release of ChatGPT

00:19:02.420 --> 00:19:05.200
later that year showed that the raw brute force

00:19:05.200 --> 00:19:08.220
scale of GPT-3 wasn't enough on its own. It

00:19:08.220 --> 00:19:10.859
needed human guidance, guardrails, and behavioral

00:19:10.859 --> 00:19:14.200
fine-tuning to be a truly useful, safe product rather

00:19:14.539 --> 00:19:16.819
than just a fascinating, unpredictable research

00:19:16.819 --> 00:19:18.819
experiment. So what does this all mean? If we

00:19:18.819 --> 00:19:20.299
step back and look at the whole picture for a

00:19:20.299 --> 00:19:23.059
second, GPT-3 wasn't just a software update.

00:19:23.240 --> 00:19:25.980
It was a mirror held up to human knowledge. That's

00:19:25.980 --> 00:19:28.180
a great way to look at it. By ingesting hundreds

00:19:28.180 --> 00:19:30.460
of billions of words from the internet, it proved

00:19:30.460 --> 00:19:32.740
that massive mathematical scale could create

00:19:32.740 --> 00:19:35.259
the breathtaking illusion of human intelligence.

00:19:36.009 --> 00:19:38.430
But because it was a mirror, it also reflected

00:19:38.430 --> 00:19:41.329
everything else. It reflected our biases, our

00:19:41.329 --> 00:19:44.450
toxicity, and our messy legal realities regarding

00:19:44.450 --> 00:19:47.109
copyright and ownership. And if we connect this

00:19:47.109 --> 00:19:49.509
to the bigger picture, the most vital takeaway

00:19:49.509 --> 00:19:52.630
for you as a user of this technology is the absolute

00:19:52.630 --> 00:19:56.009
necessity of critical thinking. Yes. These models,

00:19:56.250 --> 00:19:59.329
or their direct descendants, are now deeply integrated

00:19:59.329 --> 00:20:02.069
into the tools you use every single day, from

00:20:02.069 --> 00:20:04.509
code editors to search engines to customer service

00:20:04.509 --> 00:20:08.220
bots. But you must remember the stochastic parrot.

00:20:08.420 --> 00:20:10.099
The machine doesn't actually know anything. Exactly.

00:20:10.259 --> 00:20:13.019
The AI does not understand the world. It predicts

00:20:13.019 --> 00:20:15.960
the world based on the data it was fed. So never

00:20:15.960 --> 00:20:18.279
surrender your own judgment, your own fact checking,

00:20:18.400 --> 00:20:20.960
or your own moral compass to an algorithm that

00:20:20.960 --> 00:20:22.859
cannot actually comprehend the meaning of the

00:20:22.859 --> 00:20:25.259
words it generates. A completely vital reminder.

00:20:25.559 --> 00:20:28.559
It predicts; it doesn't comprehend. Which leaves

00:20:28.559 --> 00:20:31.259
me with one final lingering thought for you to

00:20:31.259 --> 00:20:33.740
chew on after you finish listening today. Okay,

00:20:33.759 --> 00:20:36.299
let's hear it. We established that GPT-3 and

00:20:36.299 --> 00:20:38.980
its successors learned how to mimic human language

00:20:38.980 --> 00:20:41.380
by scraping hundreds of billions of words written

00:20:41.380 --> 00:20:43.839
by humans on the internet. But what happens a

00:20:43.839 --> 00:20:47.019
few years from now? As tools built on these models

00:20:47.019 --> 00:20:49.400
flood the internet with billions of pages of

00:20:49.400 --> 00:20:52.440
AI-generated text, articles, and code, will future

00:20:52.440 --> 00:20:55.299
AI models just end up training on the synthetic

00:20:55.299 --> 00:20:58.319
output of other AIs. A terrifying thought. Right.

00:20:58.339 --> 00:21:00.619
And if the machine is eventually only learning

00:21:00.619 --> 00:21:03.259
from the machine, what happens to the human voice?
