WEBVTT

00:00:00.000 --> 00:00:03.359
You know that feeling when you type something

00:00:03.359 --> 00:00:05.799
completely absurd into an AI image generator?

00:00:06.419 --> 00:00:09.339
Like, I don't know, a futuristic neon city being

00:00:09.339 --> 00:00:12.179
attacked by a giant Renaissance-era oil painting

00:00:12.179 --> 00:00:14.740
of a cat. Oh, yeah. And then five seconds later,

00:00:14.919 --> 00:00:18.579
it just hands you this flawless, cinematic masterpiece.

00:00:18.879 --> 00:00:22.820
Right. It feels like frictionless, perfectly

00:00:22.820 --> 00:00:25.750
clean magic. Boom, there it is. Yeah, but it

00:00:25.750 --> 00:00:27.609
creates this massive illusion, right? It makes

00:00:27.609 --> 00:00:30.010
you think the machine just inherently possesses

00:00:30.010 --> 00:00:32.829
human imagination. It completely hides the really

00:00:32.829 --> 00:00:35.549
messy reality of where that intelligence actually

00:00:35.549 --> 00:00:37.990
comes from. And today, our mission for this deep

00:00:37.990 --> 00:00:40.670
dive is to basically shatter that illusion. We

00:00:40.670 --> 00:00:42.929
are looking directly at the X-ray of that magic.

00:00:42.950 --> 00:00:44.950
Exactly. Because we are diving deep into the

00:00:44.950 --> 00:00:47.609
Wikipedia files on a German non-profit organization

00:00:47.609 --> 00:00:51.850
called LAION. That stands for Large-scale Artificial

00:00:51.850 --> 00:00:54.390
Intelligence Open Network. Which is a bit of

00:00:54.390 --> 00:00:57.570
a mouthful, but a really crucial group to understand.

00:00:57.990 --> 00:01:01.310
Yeah, because behind every sleek, minimalist

00:01:01.310 --> 00:01:04.370
prompt box you use, there's essentially a massive,

00:01:04.750 --> 00:01:07.450
chaotic filing cabinet containing the entire

00:01:07.450 --> 00:01:10.930
visual history of the internet. And LAION is

00:01:10.930 --> 00:01:12.950
the organization that built the biggest filing

00:01:12.950 --> 00:01:15.560
cabinet of all. This is such a critical concept

00:01:15.560 --> 00:01:18.340
for you to understand right now because LAION isn't

00:01:18.340 --> 00:01:21.560
actually building the AI models you interact

00:01:21.560 --> 00:01:24.480
with directly. They aren't making the shiny consumer

00:01:24.480 --> 00:01:25.859
apps. Right, they're not the ones selling you

00:01:25.859 --> 00:01:28.500
a subscription. Exactly. What they're best known

00:01:28.500 --> 00:01:31.560
for is scraping the web to create massive open

00:01:31.560 --> 00:01:34.219
source data sets of images and text captions.

00:01:34.859 --> 00:01:37.200
These are the foundational data sets, like the

00:01:37.200 --> 00:01:39.579
raw fuel, that actually trained high profile

00:01:39.579 --> 00:01:42.500
models like Stable Diffusion or Google Brain's

00:01:42.500 --> 00:01:43.980
Imagen. It's the stuff that powers the whole engine.

00:01:44.099 --> 00:01:48.620
Yeah. In our current era of just profound information

00:01:48.620 --> 00:01:50.780
overload, understanding exactly how this data

00:01:50.780 --> 00:01:53.359
is sourced isn't just a fun technical curiosity.

00:01:53.719 --> 00:01:55.540
You really have to know what the machine ate

00:01:55.540 --> 00:01:57.980
before you can trust what it spits out. OK, let's

00:01:57.980 --> 00:02:02.280
unpack this, because to really grasp how the

00:02:02.280 --> 00:02:04.680
whole AI art boom seemed to happen overnight,

00:02:05.420 --> 00:02:07.719
you have to understand the incredibly clever

00:02:07.719 --> 00:02:12.000
and honestly slightly rogue way LAION actually

00:02:12.000 --> 00:02:14.479
collected all this data. It really is fascinating.

00:02:14.539 --> 00:02:16.560
I mean, it wasn't a room full of interns just

00:02:16.560 --> 00:02:18.900
dragging and dropping photos into a desktop folder,

00:02:19.000 --> 00:02:21.719
right? Oh, far from it. The mechanism they used

00:02:21.719 --> 00:02:24.530
relied on something called Common Crawl. So Common

00:02:24.530 --> 00:02:27.509
Crawl is this existing nonprofit that operates

00:02:27.509 --> 00:02:30.169
like a, well, like a digital snowplow. A digital

00:02:30.169 --> 00:02:32.330
snowplow? Like that, yeah. It just constantly drives

00:02:32.330 --> 00:02:35.689
across the public internet downloading raw HTML

00:02:35.689 --> 00:02:39.430
code from billions of web pages. And then it makes

00:02:39.430 --> 00:02:42.409
that massive archive completely free for researchers

00:02:42.409 --> 00:02:44.729
to use. So LAION didn't even have to scrape the web

00:02:44.729 --> 00:02:46.909
themselves. They just used the snowplow's leftovers.

00:02:46.909 --> 00:02:49.530
Pretty much. What the developers at LAION did

00:02:49.530 --> 00:02:52.229
was write a relatively simple script to just

00:02:52.229 --> 00:02:54.789
comb through all that crawled HTML, and they

00:02:54.789 --> 00:02:56.990
were specifically looking for one very specific

00:02:56.990 --> 00:02:59.090
thing, which is the basic image tag. Right, the

00:02:59.090 --> 00:03:00.870
piece of code that tells your web browser, hey

00:03:00.870 --> 00:03:02.469
go fetch a picture and display it right here.

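The extraction step described here can be pictured with a short sketch. This is not LAION's actual pipeline (their tooling and filtering differ); it is just a from-scratch illustration, using Python's built-in HTML parser, of pulling image URLs and their alt text out of raw HTML:

```python
from html.parser import HTMLParser

class ImgAltCollector(HTMLParser):
    """Collect (src, alt) pairs from <img> tags in raw HTML."""
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            src, alt = a.get("src"), a.get("alt", "")
            # Keep only images that actually carry a text caption.
            if src and alt.strip():
                self.pairs.append((src, alt.strip()))

html = ('<p><img src="/dog.jpg" alt="golden retriever in a park">'
        '<img src="/spacer.gif" alt=""></p>')
collector = ImgAltCollector()
collector.feed(html)
print(collector.pairs)  # [('/dog.jpg', 'golden retriever in a park')]
```

Note how the spacer image with empty alt text is dropped immediately; everything else still has to survive the quality filtering discussed next.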
00:03:02.599 --> 00:03:05.680
Exactly. But pulling a picture is really only

00:03:05.680 --> 00:03:08.199
half the battle, because an image alone doesn't

00:03:08.199 --> 00:03:10.460
actually teach a machine learning model anything

00:03:10.460 --> 00:03:12.740
at all. It just sees a bunch of pixels. Right.

00:03:13.180 --> 00:03:16.020
The AI needs a label. It needs to know what the

00:03:16.020 --> 00:03:19.259
image actually depicts. So alongside those image

00:03:19.259 --> 00:03:22.539
tags, the developers targeted the alt attributes

00:03:22.539 --> 00:03:25.439
attached to them. Ah, the alt text. That's the

00:03:25.439 --> 00:03:27.699
hidden, behind-the-scenes description of an image

00:03:27.699 --> 00:03:29.780
on a web page, right? Yeah. It was originally

00:03:29.780 --> 00:03:32.819
designed for accessibility. So screen readers

00:03:32.819 --> 00:03:35.659
can describe visual content to visually impaired

00:03:35.659 --> 00:03:39.219
users. And LAION simply decided to treat those

00:03:39.219 --> 00:03:41.780
existing alt attributes as the definitive text

00:03:41.780 --> 00:03:44.400
captions for the images. I love this part, because

00:03:44.400 --> 00:03:46.639
it requires such a completely lateral way of

00:03:46.639 --> 00:03:49.699
thinking about data. LAION didn't build a massive

00:03:49.699 --> 00:03:52.460
library filled with stolen books. They just built

00:03:52.460 --> 00:03:54.879
the world's largest card catalog, pointing to

00:03:54.879 --> 00:03:56.520
where the books currently live on the internet.

00:03:56.919 --> 00:03:58.819
That is a perfect way to describe it. But wait,

00:03:58.819 --> 00:04:01.039
I have to push back here. Because if you have

00:04:01.039 --> 00:04:04.259
ever built a basic website or run a blog, you

00:04:04.259 --> 00:04:06.840
know exactly how notoriously garbage alt text

00:04:06.840 --> 00:04:09.120
can be. Oh, it's terrible. Yeah. People write

00:04:09.120 --> 00:04:12.020
things like, image for final version underscore

00:04:12.020 --> 00:04:15.340
real, or they just wildly stuff it with SEO keywords

00:04:15.340 --> 00:04:20.000
to rank higher on Google. If LAION is just indiscriminately

00:04:20.000 --> 00:04:23.300
matching raw HTML tags, how do they know the

00:04:23.300 --> 00:04:25.300
text is actually an accurate description of the

00:04:25.300 --> 00:04:27.839
photo? What's fascinating here is the mathematical

00:04:27.839 --> 00:04:30.740
ingenuity they use to solve that exact problem.

00:04:31.160 --> 00:04:33.579
Because you're right, raw alt text is way too

00:04:33.579 --> 00:04:36.519
noisy to train a highly precise AI. It's full

00:04:36.519 --> 00:04:38.839
of junk. So how do you clean it up? To filter

00:04:38.839 --> 00:04:41.639
the garbage out, LAION utilized a model created

00:04:41.639 --> 00:04:45.310
by OpenAI called CLIP. You can think of CLIP

00:04:45.310 --> 00:04:47.829
like an incredibly strict, mathematically precise

00:04:47.829 --> 00:04:50.610
bouncer for a nightclub. Okay, a bouncer. Walk

00:04:50.610 --> 00:04:53.410
me through that. So CLIP is trained to look at

00:04:53.410 --> 00:04:55.889
an image, read a piece of text, and then calculate

00:04:55.889 --> 00:04:58.029
the mathematical distance between them. It scores

00:04:58.029 --> 00:05:00.110
how well they actually match. Oh, I get it. So

00:05:00.110 --> 00:05:02.990
if the image is a golden retriever catching a

00:05:02.990 --> 00:05:05.139
frisbee... and the alt text says golden retriever

00:05:05.139 --> 00:05:07.939
playing in park, CLIP gives it a high score and

00:05:07.939 --> 00:05:09.759
lets it into the club. Exactly. But if the alt

00:05:09.759 --> 00:05:12.660
text says buy cheap sneakers online, CLIP just

00:05:12.660 --> 00:05:14.680
throws it straight in the trash. Exactly that.

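The bouncer logic can be sketched in a few lines. Real CLIP produces learned embeddings from its image and text encoders; here the tiny vectors and the 0.3 cutoff are invented purely to show the cosine-similarity threshold idea:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Pretend these vectors came from CLIP's image and text encoders.
candidates = [
    ("dog.jpg", "golden retriever playing in park",
     [0.9, 0.1, 0.0], [0.8, 0.2, 0.1]),
    ("dog.jpg", "buy cheap sneakers online",
     [0.9, 0.1, 0.0], [0.0, 0.1, 0.9]),
]

THRESHOLD = 0.3  # illustrative value; the real cutoff differed
kept = [(url, caption)
        for url, caption, img_vec, txt_vec in candidates
        if cosine_similarity(img_vec, txt_vec) >= THRESHOLD]
print(kept)  # only the matching caption gets past the bouncer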
00:05:14.920 --> 00:05:17.939
So they ran their massive chaotic pile of scraped

00:05:17.939 --> 00:05:20.720
web links through CLIP using it to automatically

00:05:20.720 --> 00:05:23.459
prune the whole data set. Any pairing that scored

00:05:23.459 --> 00:05:26.300
below a certain similarity threshold was just

00:05:26.300 --> 00:05:29.220
discarded. Wow. And the irony of this process

00:05:29.220 --> 00:05:32.699
is just incredibly rich because OpenAI had previously

00:05:32.699 --> 00:05:35.019
open-sourced the underlying code and the weights

00:05:35.019 --> 00:05:38.060
for their CLIP model, but they kept the actual

00:05:38.060 --> 00:05:41.319
training data set, the 400 million image caption

00:05:41.319 --> 00:05:43.899
pairs they originally used to teach it, a closely

00:05:43.899 --> 00:05:46.620
guarded corporate secret. Oh, wow. So OpenAI

00:05:46.620 --> 00:05:49.100
gave away the bouncer. but kept the guest list

00:05:49.100 --> 00:05:51.939
a secret. Yes. So by using the open version of

00:05:51.939 --> 00:05:55.519
CLIP to filter their own scraped web data, LAION

00:05:55.519 --> 00:05:58.240
essentially reverse-engineered OpenAI's closed

00:05:58.240 --> 00:06:00.459
data set. Wait, really? They used an AI to build

00:06:00.459 --> 00:06:02.839
the data set that trains other AIs? They absolutely

00:06:02.839 --> 00:06:05.180
did. And the result was their very first release

00:06:05.180 --> 00:06:07.740
in August 2021, which they appropriately named

00:06:07.740 --> 00:06:12.079
LAION-400M, 400 million highly accurate CLIP

00:06:12.079 --> 00:06:14.540
-filtered image caption pairs. That is wild.

00:06:14.759 --> 00:06:17.199
And we really need to emphasize that card catalog

00:06:17.199 --> 00:06:19.839
analogy you made earlier, because it is the entire

00:06:19.839 --> 00:06:23.959
key to LAION's operation. They do not host the

00:06:23.959 --> 00:06:26.459
images themselves. Right. They aren't hoarding

00:06:26.459 --> 00:06:29.439
400 million JPEGs on some massive server farm

00:06:29.439 --> 00:06:31.560
somewhere in Germany. Not at all. They only host

00:06:31.560 --> 00:06:34.600
the URLs, just the text links pointing to the

00:06:34.600 --> 00:06:37.560
images. They shift the entire burden of downloading

00:06:37.560 --> 00:06:40.879
the actual heavy image files onto the AI researchers.

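In practice, that means what a researcher downloads from LAION is essentially a big table of URLs and captions, and their own tooling does the fetching. A toy sketch of consuming such an index (the column layout and helper name here are invented, not LAION's actual format):

```python
import csv, io

# A miniature stand-in for the distributed index: URL + caption, no pixels.
index_tsv = ("url\tcaption\n"
             "https://example.com/cat.jpg\ta cat on a sofa\n"
             "https://example.com/city.png\tneon city at night\n")

def plan_downloads(tsv_text):
    """Turn the URL/caption index into a list of fetch jobs.

    Actually retrieving each URL (e.g. with urllib) is left to the
    caller, which is exactly the burden-shifting described above.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [{"url": row["url"], "caption": row["caption"]} for row in reader]

jobs = plan_downloads(index_tsv)
print(len(jobs), jobs[0]["caption"])  # 2 a cat on a sofa
```

The index stays tiny (text only) while the terabytes of image bytes are fetched fresh from their original hosts by whoever wants to train.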
00:06:41.019 --> 00:06:42.879
Or the tech companies who want to use the data

00:06:42.879 --> 00:06:46.139
set to train. their models, which is just a brilliant

00:06:46.139 --> 00:06:48.519
logistical loophole. They're literally just providing

00:06:48.519 --> 00:06:51.399
the map. A very detailed map. Yeah. But a map

00:06:51.399 --> 00:06:53.920
of 400 million things revealed a really fundamental

00:06:53.920 --> 00:06:56.420
problem for AI developers. It simply wasn't enough

00:06:56.420 --> 00:06:59.579
data. Not even close. Right. To make an AI that

00:06:59.579 --> 00:07:02.220
could reliably generate anything the human mind

00:07:02.220 --> 00:07:05.399
could imagine, like every art style, every weird

00:07:05.399 --> 00:07:07.920
lighting condition, every rare object, they needed

00:07:07.920 --> 00:07:11.180
to scale up by an order of magnitude. And that

00:07:11.180 --> 00:07:13.459
unprecedented scale is exactly what triggered

00:07:13.459 --> 00:07:16.240
a billion dollar legal nightmare. It was inevitable.

00:07:16.519 --> 00:07:20.579
In March 2022, LAION released a successor data

00:07:20.579 --> 00:07:25.699
set called LAION-5B. That is five billion image

00:07:25.699 --> 00:07:28.860
caption pairs. Five billion. Five billion. At

00:07:28.860 --> 00:07:31.120
the time of its release, it was the absolute

00:07:31.120 --> 00:07:33.839
largest freely available data set of its kind

00:07:33.839 --> 00:07:37.079
in existence. And scaling up to that level requires

00:07:37.079 --> 00:07:39.740
serious computational resources. I can imagine.

00:07:40.000 --> 00:07:42.740
Who even paid for that? This massive effort was

00:07:42.740 --> 00:07:45.240
funded by entities like Doodlebot, Hugging Face,

00:07:45.500 --> 00:07:48.579
and Stability AI. And Stability AI, notably,

00:07:48.779 --> 00:07:51.100
is the company that subsequently used this exact

00:07:51.100 --> 00:07:53.699
massive data set to train the wildly popular

00:07:53.699 --> 00:07:55.839
stable diffusion model. Okay, here's where it

00:07:55.839 --> 00:07:58.040
gets really interesting though. Because when

00:07:58.040 --> 00:08:00.459
you create a map of five billion copyrighted

00:08:00.459 --> 00:08:02.420
photographs and professional artworks and personal

00:08:02.420 --> 00:08:04.819
images, and you do it without asking a single

00:08:04.819 --> 00:08:07.120
person for permission, people are going to notice.

00:08:07.120 --> 00:08:08.540
Oh, they definitely notice. And they're going

00:08:08.540 --> 00:08:11.540
to push back hard. The legal blowback was just

00:08:11.540 --> 00:08:14.660
intense and immediate. In February 2023, Getty

00:08:14.660 --> 00:08:16.980
Images, the massive global stock photo agency,

00:08:17.480 --> 00:08:19.639
launched a huge lawsuit against Stability AI.

00:08:19.879 --> 00:08:22.079
Right, because the AI was literally regenerating

00:08:22.079 --> 00:08:25.220
their proprietary watermarks. Exactly. Proving

00:08:25.220 --> 00:08:28.300
it had basically memorized Getty's catalog. And

00:08:28.300 --> 00:08:30.980
LAION was actually named in that lawsuit as a

00:08:30.980 --> 00:08:34.509
non-party. But then, the legal crosshairs swung

00:08:34.509 --> 00:08:37.750
directly onto LAION itself. Yeah, the real direct

00:08:37.750 --> 00:08:41.210
hit came in April 2023, when a German photographer

00:08:41.210 --> 00:08:43.990
took direct legal action. He discovered that

00:08:43.990 --> 00:08:46.330
his own copyrighted photographs were indexed

00:08:46.330 --> 00:08:48.690
within the LAION dataset. Without his permission.

00:08:48.809 --> 00:08:51.789
Right. So he sued them in a German court to have

00:08:51.789 --> 00:08:54.230
his works explicitly removed from their training

00:08:54.230 --> 00:08:56.429
data. And the wildest detail from the source

00:08:56.429 --> 00:08:58.649
is here. When he served them with the lawsuit,

00:08:58.960 --> 00:09:01.179
LAION actually sent him an invoice right back.

00:09:01.379 --> 00:09:03.279
I know. It's hard to believe. They billed him

00:09:03.279 --> 00:09:05.139
for the administrative costs of dealing with

00:09:05.139 --> 00:09:08.019
what they considered a legally baseless request.

00:09:08.159 --> 00:09:11.980
It's just breathtakingly bold. But how did the

00:09:11.980 --> 00:09:14.159
courts actually view this? Because common sense

00:09:14.159 --> 00:09:16.799
says, hey, that's my photo. You can't use it.

00:09:16.899 --> 00:09:18.539
Well, if we connect this to the bigger picture,

00:09:18.740 --> 00:09:21.039
the resolution of this specific case is monumental

00:09:21.039 --> 00:09:23.259
for the foundational architecture of artificial

00:09:23.259 --> 00:09:26.330
intelligence in Europe. In September 2024, the

00:09:26.330 --> 00:09:28.289
regional court of Hamburg formally dismissed

00:09:28.289 --> 00:09:31.129
the photographer's lawsuit. And this wasn't just

00:09:31.129 --> 00:09:34.009
some minor procedural victory. It was hailed

00:09:34.009 --> 00:09:36.429
as a landmark ruling on what the legal world

00:09:36.429 --> 00:09:39.950
calls TDM exceptions. TDM meaning text and data

00:09:39.950 --> 00:09:42.210
mining. Exactly. Text and data mining exceptions

00:09:42.210 --> 00:09:44.990
are the specific legal provisions, particularly

00:09:44.990 --> 00:09:47.570
Article 4 of the European Union's Digital Single

00:09:47.570 --> 00:09:49.990
Market Directive. It's very bureaucratic. Very.

00:09:50.470 --> 00:09:52.710
But what these provisions do is acknowledge that

00:09:52.710 --> 00:09:55.340
in the modern digital age, researchers really need

00:09:55.340 --> 00:09:58.360
to be able to analyze massive amounts of copyright

00:09:58.360 --> 00:10:01.299
-protected material computationally. The law

00:10:01.299 --> 00:10:03.559
essentially allows this, provided they're doing

00:10:03.559 --> 00:10:06.899
it just to extract patterns, trends, or correlations,

00:10:07.299 --> 00:10:09.879
and crucially, provided they aren't reproducing

00:10:09.879 --> 00:10:12.379
the work to directly compete with the original

00:10:12.379 --> 00:10:15.000
human creator. Ah, I see. So it's basically the

00:10:15.000 --> 00:10:16.820
difference between reading a thousand Stephen

00:10:16.820 --> 00:10:19.799
King books to mathematically analyze his sentence

00:10:19.799 --> 00:10:22.659
structure versus photocopying a Stephen King

00:10:22.659 --> 00:10:24.549
book to sell it out of the trunk of your car. That

00:10:24.549 --> 00:10:27.289
is the exact distinction the court made. The

00:10:27.289 --> 00:10:29.889
German judges ruled that what LAION was doing,

00:10:30.090 --> 00:10:32.929
which was creating a database of links to understand

00:10:32.929 --> 00:10:35.529
the statistical correlations between human language

00:10:35.529 --> 00:10:38.710
and visual pixels, fell safely under these text

00:10:38.710 --> 00:10:41.649
and data mining exceptions. Wow. Even though

00:10:41.649 --> 00:10:43.990
a commercial company like Stability AI ended

00:10:43.990 --> 00:10:46.210
up using it later. Yes. The court noted that

00:10:46.210 --> 00:10:48.389
the data set itself was created for scientific

00:10:48.389 --> 00:10:51.070
research. The commercial use later down the line

00:10:51.070 --> 00:10:54.110
didn't invalidate LAION's protection. And this

00:10:54.110 --> 00:10:56.629
set a massive precedent for copyright law across

00:10:56.629 --> 00:10:59.830
the entire European Union. It legally validated

00:10:59.830 --> 00:11:02.370
their whole card catalog method. So legally,

00:11:02.529 --> 00:11:04.769
they won the undeniable right to scrape the internet.

00:11:05.399 --> 00:11:08.419
But here is the massive catch. If you deploy

00:11:08.419 --> 00:11:11.240
a digital snowplow to scoop up five billion web

00:11:11.240 --> 00:11:14.259
pages, you inevitably scoop up everything. Everything.

00:11:14.440 --> 00:11:16.480
The good and the very, very bad. Right. You get

00:11:16.480 --> 00:11:18.620
the beautiful museum art and the historical photos,

00:11:18.659 --> 00:11:21.480
but you also hit the internet's absolute darkest,

00:11:21.559 --> 00:11:24.960
most horrifying corners. The legal victory immediately

00:11:24.960 --> 00:11:27.860
gave way to a really severe ethical reality regarding

00:11:27.860 --> 00:11:30.379
what happens when you map the unfiltered human

00:11:30.379 --> 00:11:32.899
internet. Yeah, and the findings from independent

00:11:32.899 --> 00:11:34.980
researchers who started analyzing the data set

00:11:34.980 --> 00:11:38.679
are deeply concerning. Because LAION-5B is completely

00:11:38.679 --> 00:11:40.919
open source, right? Anyone with the technical

00:11:40.919 --> 00:11:43.019
know -how can download the index and look at

00:11:43.019 --> 00:11:45.019
exactly what is inside it. And what did they

00:11:45.019 --> 00:11:47.820
find? Well, multiple academic studies published

00:11:47.820 --> 00:11:51.220
throughout 2021 and 2023 investigated the contents.

00:11:51.740 --> 00:11:55.559
They found that LAION-5B contained highly problematic

00:11:55.559 --> 00:11:58.919
material. Researchers documented widespread instances

00:11:58.919 --> 00:12:01.919
of rape imagery, non-consensual pornography,

00:12:02.440 --> 00:12:05.639
malignant cultural stereotypes, and extreme racist

00:12:05.639 --> 00:12:08.379
and ethnic slurs. It's just awful. It gets worse.

00:12:08.799 --> 00:12:11.419
A major investigation by the German broadcaster

00:12:11.419 --> 00:12:14.100
Bayerischer Rundfunk found large amounts of private

00:12:14.100 --> 00:12:16.460
sensitive data harvested from public sites.

00:12:17.200 --> 00:12:19.519
The sources specifically mention private medical

00:12:19.519 --> 00:12:21.940
records being found in the data set. I mean,

00:12:21.960 --> 00:12:25.019
imagine your own private medical records. Photos

00:12:25.019 --> 00:12:27.259
of you at your absolute most vulnerable being

00:12:27.259 --> 00:12:29.559
scooped up by a machine learning model simply

00:12:29.559 --> 00:12:32.980
because a hospital or a doctor's office misconfigured

00:12:32.980 --> 00:12:35.019
their web server and accidentally left a folder

00:12:35.019 --> 00:12:37.440
exposed to the public internet. The digital snowplow

00:12:37.440 --> 00:12:39.559
doesn't care. It just grabs the image tag. It

00:12:39.559 --> 00:12:43.259
just grabs the tag. And then this steady drumbeat

00:12:43.259 --> 00:12:46.419
of concern escalated into an absolute crisis.

00:12:46.779 --> 00:12:49.980
It really culminated in December 2023 when the

00:12:49.980 --> 00:12:52.480
Stanford Internet Observatory released a devastating

00:12:52.480 --> 00:12:56.360
technical report on LAION-5B. They ran an automated

00:12:56.360 --> 00:12:58.639
analysis using specialized hashing algorithms

00:12:58.639 --> 00:13:00.940
that are specifically designed to detect known

00:13:00.940 --> 00:13:03.379
illicit files. And what did the report say? The

00:13:03.379 --> 00:13:05.720
report stated they suspected the data set contained

00:13:05.720 --> 00:13:10.379
over 3,000 instances, 3,226 to be exact, of

00:13:10.379 --> 00:13:12.620
links pointing to child sexual abuse material,

00:13:13.019 --> 00:13:16.379
or CSAM. And of those suspected links, 1,008

00:13:16.379 --> 00:13:19.000
were externally validated by safety organizations.

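The screening approach the report describes, matching files against databases of known illicit material by hash rather than by viewing them, can be sketched in a simplified form. Real systems use perceptual hashes (such as PhotoDNA) supplied by safety organizations; this toy version uses plain SHA-256 and an invented hash list:

```python
import hashlib

# Stand-in for a hash list provided by a safety organization.
KNOWN_BAD_HASHES = {
    hashlib.sha256(b"known-bad-file-bytes").hexdigest(),
}

def flag_if_known(file_bytes):
    """Return True if the file's hash appears in the known-bad set.

    Only the digest is compared; the file is never inspected or stored,
    which is what makes this kind of screening workable at scale.
    """
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_HASHES

print(flag_if_known(b"known-bad-file-bytes"))    # True
print(flag_if_known(b"harmless-holiday-photo"))  # False
```

Perceptual hashing goes further than this exact-match sketch: it also catches resized or slightly altered copies of a known file.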
00:13:19.360 --> 00:13:21.840
So what does this all mean? Functionally speaking,

00:13:22.100 --> 00:13:24.480
how does an organization even handle that? Because

00:13:24.480 --> 00:13:26.639
we know LAION's stated intention was to build

00:13:26.639 --> 00:13:29.379
a tool to democratize scientific research. Right.

00:13:29.399 --> 00:13:31.299
Their goal was scientific openness. But if your

00:13:31.299 --> 00:13:33.860
open source data set is five billion links large,

00:13:34.399 --> 00:13:36.639
is human moderation even mathematically possible?

00:13:36.779 --> 00:13:38.740
I mean, even if you hired a stadium full of people

00:13:38.740 --> 00:13:41.000
to click links and review images all day, every

00:13:41.000 --> 00:13:43.120
single day, you can never actually review five

00:13:43.120 --> 00:13:45.799
billion images. You couldn't. The sheer vastness

00:13:45.799 --> 00:13:47.980
of the data completely defeats human oversight.

00:13:48.320 --> 00:13:50.620
Exactly. And this raises an important question

00:13:50.620 --> 00:13:52.720
about the fundamental architecture of open source

00:13:52.720 --> 00:13:56.840
AI. Can a system be both completely comprehensive,

00:13:57.259 --> 00:14:00.139
capturing the entirety of human knowledge, and

00:14:00.139 --> 00:14:02.500
perfectly safe at the same time? It seems like

00:14:02.500 --> 00:14:04.879
those two things are entirely at odds. They often

00:14:04.879 --> 00:14:07.899
are. The Stanford report forced a brutal reckoning

00:14:07.899 --> 00:14:10.240
regarding accountability. In response to the

00:14:10.240 --> 00:14:12.960
discovery of the CSAM links, LAION took immediate

00:14:12.960 --> 00:14:16.419
action. They temporarily removed both the massive

00:14:16.419 --> 00:14:20.340
LAION-5B and the older LAION-400M data sets

00:14:20.340 --> 00:14:22.700
from public access entirely. Just pulled the

00:14:22.700 --> 00:14:25.419
plug on them. Yeah. They cited a strict zero

00:14:25.419 --> 00:14:28.360
tolerance policy for illegal content and stated

00:14:28.360 --> 00:14:30.580
they were taking the links offline out of an

00:14:30.580 --> 00:14:32.840
abundance of caution. Because suddenly providing

00:14:32.840 --> 00:14:35.980
the map to illegal harmful content becomes a

00:14:35.980 --> 00:14:38.240
radically different legal and moral issue than

00:14:38.240 --> 00:14:41.039
providing a map to a copyrighted stock photograph.

00:14:41.299 --> 00:14:43.720
Exactly. You can't just hide behind a text and

00:14:43.720 --> 00:14:46.240
data mining exception when you are indexing illicit

00:14:46.240 --> 00:14:48.899
material. So did the data set just stay offline

00:14:48.899 --> 00:14:52.370
forever? No. They took the massive data sets

00:14:52.370 --> 00:14:55.389
completely offline, and they went to work computationally

00:14:55.389 --> 00:14:59.929
scrubbing them. By August 2024, LAION released

00:14:59.929 --> 00:15:02.809
a new, highly filtered version of the data set,

00:15:02.950 --> 00:15:07.110
which they dubbed Re-LAION-5B. The illicit material

00:15:07.110 --> 00:15:09.870
identified by the safety organizations was scrubbed

00:15:09.870 --> 00:15:12.570
from the index. Right, but this entire sequence

00:15:12.570 --> 00:15:15.769
of events highlights such an incredible, unresolved

00:15:15.769 --> 00:15:19.000
tension in the tech world. On one hand, you have

00:15:19.000 --> 00:15:21.820
the genuine desire to democratize AI to ensure

00:15:21.820 --> 00:15:23.899
that trillion-dollar tech monopolies aren't

00:15:23.899 --> 00:15:26.379
the only ones with access to the data that will

00:15:26.379 --> 00:15:28.539
build the future economy. Which is a noble goal.

00:15:28.820 --> 00:15:31.139
It is. But on the other hand, you have the absolute

00:15:31.139 --> 00:15:33.139
necessity of public safety, which is inherently

00:15:33.139 --> 00:15:35.220
at odds with blindly vacuuming up the public

00:15:35.220 --> 00:15:37.539
internet. It is the ultimate double-edged sword

00:15:37.539 --> 00:15:40.299
of the digital age. It really is. But to really

00:15:40.299 --> 00:15:42.980
understand LAION as an organization, we have to

00:15:42.980 --> 00:15:45.460
look beyond just the image-scraping controversies.

00:15:45.710 --> 00:15:48.049
Because their mission wasn't just to democratize

00:15:48.049 --> 00:15:50.889
text-to-image models. They also wanted to democratize

00:15:50.889 --> 00:15:53.509
the text-based AI chatbots. Right. The kind

00:15:53.509 --> 00:15:55.690
of large language models you use to write emails

00:15:55.690 --> 00:15:59.009
or brainstorm ideas or debug code. Yeah. But

00:15:59.009 --> 00:16:01.389
the way they approached building a chatbot was

00:16:01.389 --> 00:16:05.009
a massive, fascinating departure from the automated

00:16:05.009 --> 00:16:08.399
bulldozer method of LAION-5B. Very different.

00:16:08.759 --> 00:16:12.879
In April 2023, LAION, working alongside a massive

00:16:12.879 --> 00:16:16.419
decentralized group of volunteers, publicly released

00:16:16.419 --> 00:16:19.580
an open source AI assistant called Open Assistant.

00:16:19.769 --> 00:16:21.730
And the goal there was huge for the time, right?

00:16:21.950 --> 00:16:25.370
Incredibly ambitious. Back in early 2023, the

00:16:25.370 --> 00:16:27.850
most powerful and capable language models were

00:16:27.850 --> 00:16:30.330
locked entirely behind corporate APIs. You had

00:16:30.330 --> 00:16:32.289
to pay a subscription and they required massive

00:16:32.289 --> 00:16:35.370
centralized server farms to run. Open Assistant's

00:16:35.370 --> 00:16:37.769
mission was to provide free access to large language

00:16:37.769 --> 00:16:39.950
models that were efficient enough to run locally

00:16:39.950 --> 00:16:42.509
on a user's everyday consumer hardware. Which

00:16:42.509 --> 00:16:45.110
is a huge paradigm shift. Just imagine having

00:16:45.110 --> 00:16:48.090
a top tier, highly intelligent AI running entirely

00:16:48.090 --> 00:16:50.789
on your own laptop's graphics card. It's totally

00:16:50.789 --> 00:16:53.250
private, it reads your local files, and it doesn't

00:16:53.250 --> 00:16:55.009
need an internet connection to beam your data

00:16:55.009 --> 00:16:56.830
back to a corporate server in Silicon Valley.

00:16:56.970 --> 00:16:59.929
And beyond just running locally on consumer hardware,

00:17:00.350 --> 00:17:02.549
the developers wanted this assistant to be able

00:17:02.549 --> 00:17:05.029
to dynamically retrieve information from the

00:17:05.029 --> 00:17:07.710
web and interact with third-party systems. It

00:17:07.710 --> 00:17:10.170
was designed from the ground up to be a deeply

00:17:10.170 --> 00:17:13.369
integrated, highly transparent open tool. But

00:17:13.369 --> 00:17:15.490
the way they built the training data for Open

00:17:15.490 --> 00:17:19.210
Assistant is what I find so compelling. If LAION-5B

00:17:19.210 --> 00:17:22.609
was an industrial digital snowplow, Open Assistant

00:17:22.609 --> 00:17:25.990
was an artisanal, highly manual, crowdsourced

00:17:25.990 --> 00:17:28.400
effort. It was incredibly hands-on. Right. They

00:17:28.400 --> 00:17:30.619
didn't just scrape a billion random Reddit threads

00:17:30.619 --> 00:17:33.140
or Wikipedia pages to teach the bot how to speak.

00:17:33.460 --> 00:17:36.440
They relied on a worldwide network of over

00:17:36.440 --> 00:17:40.220
13,500 human volunteers. Yeah. These volunteers

00:17:40.220 --> 00:17:43.279
logged into a specially designed, gamified platform

00:17:43.279 --> 00:17:45.759
and they manually engaged in a process called

00:17:45.759 --> 00:17:48.019
reinforcement learning from human feedback. Or

00:17:48.019 --> 00:17:50.920
RLHF. Exactly. Through this platform, those

00:17:50.920 --> 00:17:54.720
13,500 humans manually created over 600,000 highly

00:17:54.720 --> 00:17:56.660
structured data points. And they weren't just

00:17:56.660 --> 00:17:58.579
chatting for fun. They were doing the grueling

00:17:58.579 --> 00:18:00.500
work of AI alignment. What does that actually

00:18:00.500 --> 00:18:02.700
look like for the volunteers? Well, they would

00:18:02.700 --> 00:18:05.400
write original prompts, act out the role of the

00:18:05.400 --> 00:18:08.720
AI to write ideal responses, and meticulously

00:18:08.720 --> 00:18:11.740
rank different AI outputs. They were explicitly

00:18:11.740 --> 00:18:14.099
teaching the model the difference between a helpful,

00:18:14.440 --> 00:18:17.960
polite answer and a toxic or unhinged one. It's

00:18:17.960 --> 00:18:21.059
essentially a massive global Wikipedia edit-a

00:18:21.059 --> 00:18:23.839
-thon, but specifically meant to teach a neural

00:18:23.839 --> 00:18:27.019
network how to be a safe and helpful conversationalist.

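The ranking work those volunteers did is typically consumed downstream as pairwise preferences for training a reward model. A minimal sketch of that conversion (the data layout is hypothetical, not Open Assistant's actual schema):

```python
def rankings_to_pairs(prompt, ranked_responses):
    """Expand a human ranking (best first) into (chosen, rejected) pairs.

    Each response is preferred over everything ranked below it, the
    standard way ranked RLHF data is fed to a reward model.
    """
    pairs = []
    for i, better in enumerate(ranked_responses):
        for worse in ranked_responses[i + 1:]:
            pairs.append({"prompt": prompt,
                          "chosen": better,
                          "rejected": worse})
    return pairs

ranked = ["helpful, polite answer", "terse answer", "toxic answer"]
pairs = rankings_to_pairs("How do I fix this bug?", ranked)
print(len(pairs))  # a 3-way ranking yields 3 preference pairs
```

A reward model trained on such pairs learns to score the "helpful, polite" style above the toxic one, which is the alignment signal the volunteers were producing by hand.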
00:18:27.079 --> 00:18:29.319
That's a great comparison. It just proves that

00:18:29.319 --> 00:18:31.259
open-source AI development doesn't always have

00:18:31.259 --> 00:18:33.839
to mean indiscriminate automated data harvesting.

00:18:34.420 --> 00:18:37.079
It can also look like a global community of thousands

00:18:37.079 --> 00:18:40.279
of people voluntarily donating their time, their

00:18:40.279 --> 00:18:42.660
language skills, and their intelligence to build

00:18:42.660 --> 00:18:44.700
something free for the rest of humanity. And

00:18:44.700 --> 00:18:47.019
while the Open Assistant project itself has since

00:18:47.019 --> 00:18:49.059
been officially shut down, primarily because,

00:18:49.059 --> 00:18:51.380
you know, maintaining and serving a live,

00:18:51.380 --> 00:18:53.640
state-of-the-art chatbot to the public is just incredibly

00:18:53.640 --> 00:18:56.079
resource intensive and expensive, the datasets

00:18:56.079 --> 00:18:58.380
and the foundational models those volunteers

00:18:58.380 --> 00:19:01.240
painstakingly created remain freely available

00:19:01.240 --> 00:19:03.549
today. They're still out there. Yeah, on open

00:19:03.549 --> 00:19:06.130
source platforms like Hugging Face, they serve

00:19:06.130 --> 00:19:09.130
as a lasting, highly influential testament to

00:19:09.130 --> 00:19:12.650
LAION's overarching mission. LAION is relentlessly

00:19:12.650 --> 00:19:15.390
pushing for open source AI alignment, trying

00:19:15.390 --> 00:19:17.450
to prove that the fundamental tools required

00:19:17.450 --> 00:19:20.630
to build and train artificial intelligence shouldn't

00:19:20.630 --> 00:19:23.430
be the exclusive proprietary property of a few

00:19:23.430 --> 00:19:25.970
megacorporations. So let's bring this all together.

00:19:26.269 --> 00:19:29.349
We started with a wildly clever, almost deceptively

00:19:29.349 --> 00:19:33.019
simple hack: writing a script to search HTML code

00:19:33.019 --> 00:19:35.980
for image tags and hijacking the hidden alt text.

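The idea really is only a few lines. Here is a minimal sketch of the alt-text trick using Python's standard-library HTML parser, purely as an illustration of the concept, not LAION's actual pipeline, which processed Common Crawl dumps at enormous scale:

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collect (image URL, alt text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            src = attrs.get("src")
            alt = (attrs.get("alt") or "").strip()
            if src and alt:  # keep only images that carry a caption
                self.pairs.append((src, alt))

# Tiny example page (hypothetical): one captioned image, one without.
html = '<p>Hi</p><img src="cat.jpg" alt="a cat on a sofa"><img src="x.png" alt="">'
collector = AltTextCollector()
collector.feed(html)
print(collector.pairs)  # [('cat.jpg', 'a cat on a sofa')]
```

Each surviving pair is exactly one image-caption training example; scale that loop up across billions of crawled pages and you have the skeleton of an image-text dataset.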
00:19:36.460 --> 00:19:38.619
Just a few lines of code, really. Right. And

00:19:38.619 --> 00:19:41.440
we watched that clever hack scale up into a

00:19:41.440 --> 00:19:44.180
five-billion-link dataset that quite literally changed

00:19:44.180 --> 00:19:46.299
the trajectory of the global tech industry overnight.

00:19:46.579 --> 00:19:49.019
Absolutely. It survived a landmark European Union

00:19:49.019 --> 00:19:51.059
copyright ruling that redefined how data can

00:19:51.059 --> 00:19:54.019
be used, weathered a severe, highly publicized

00:19:54.019 --> 00:19:56.339
crisis regarding the absolute darkest corners

00:19:56.339 --> 00:19:59.000
of the human Internet and ultimately spawned

00:19:59.000 --> 00:20:01.940
a massive crowdsourced movement to build a free

00:20:01.940 --> 00:20:05.000
open-source chatbot from scratch. It is a remarkable,

00:20:05.279 --> 00:20:07.880
deeply complicated journey of a nonprofit organization.

00:20:07.690 --> 00:20:10.809
Attempting to mathematically map the entirety

00:20:10.809 --> 00:20:13.470
of the digital world. It really is. And for you

00:20:13.470 --> 00:20:16.059
listening, this deep dive changes the context

00:20:16.059 --> 00:20:18.240
of everything you do online from this point forward.

00:20:18.779 --> 00:20:21.180
The next time you type a prompt and see a flawlessly

00:20:21.180 --> 00:20:23.960
generated AI image appear on your screen, or

00:20:23.960 --> 00:20:26.119
you find yourself having a surprisingly nuanced

00:20:26.119 --> 00:20:28.700
conversation with an open source language model,

00:20:29.180 --> 00:20:30.960
you aren't just looking at clean, frictionless

00:20:30.960 --> 00:20:33.480
magic anymore. The illusion is gone. Exactly.

00:20:33.839 --> 00:20:35.900
You know the exact mechanics of it. You know

00:20:35.900 --> 00:20:39.640
exactly how much messy, brilliant, legally contested,

00:20:39.700 --> 00:20:41.859
and highly controversial human data was poured

00:20:41.859 --> 00:20:43.940
into its foundation. You've seen the X-ray.

00:20:44.160 --> 00:20:46.619
And looking at that X-ray leaves us with one

00:20:46.619 --> 00:20:49.700
final critical realization to think about. If

00:20:49.700 --> 00:20:52.920
these massive open source AI models are essentially

00:20:52.920 --> 00:20:56.079
giant statistical mirrors reflecting our entire

00:20:56.079 --> 00:20:58.660
digital world, absorbing the legally protected

00:20:58.660 --> 00:21:01.359
art, the private medical records, the breathtakingly

00:21:01.359 --> 00:21:04.190
good, and the profoundly ugly, what happens tomorrow?

00:21:04.269 --> 00:21:06.250
What do you mean? Because the internet is currently

00:21:06.250 --> 00:21:09.730
being flooded with billions of synthetic AI-generated

00:21:09.730 --> 00:21:12.450
images and text, when the next version of a digital

00:21:12.450 --> 00:21:14.609
snowplow goes out to scrape the web, it won't

00:21:14.609 --> 00:21:17.130
just be mapping human data anymore. It will be

00:21:17.130 --> 00:21:19.490
an AI training on the synthetic hallucinations

00:21:19.490 --> 00:21:22.289
of other AIs. As that mirror starts reflecting

00:21:22.289 --> 00:21:25.049
itself infinitely, what that means for the future

00:21:25.049 --> 00:21:27.630
of human knowledge and truth is a question you'll

00:21:27.630 --> 00:21:28.650
have to ponder on your own.
