WEBVTT

00:00:00.000 --> 00:00:03.339
So every time you type a prompt into, you know,

00:00:03.480 --> 00:00:07.040
ChatGPT or Gemini or Claude, you are essentially

00:00:07.040 --> 00:00:09.439
interacting with the ghost of the internet. Right.

00:00:09.500 --> 00:00:11.880
Yeah. It's a vast echo. Yeah. Behind the magic

00:00:11.880 --> 00:00:14.560
of all this modern artificial intelligence, there

00:00:14.560 --> 00:00:17.559
isn't just like a clever algorithm. There is

00:00:17.559 --> 00:00:21.300
a massive, totally invisible and highly controversial

00:00:21.300 --> 00:00:24.320
database. Run by a quiet nonprofit. Exactly.

00:00:24.600 --> 00:00:27.019
One that you, the listener, have likely never

00:00:27.019 --> 00:00:30.589
even heard of. So today, for you, we're taking

00:00:30.589 --> 00:00:32.670
a deep dive into the Common Crawl Foundation.

00:00:32.869 --> 00:00:35.049
Which is such an important topic right now. It

00:00:35.049 --> 00:00:38.009
really is. Our mission today is to unpack how

00:00:38.009 --> 00:00:41.310
this completely obscure organization became, well,

00:00:41.439 --> 00:00:44.340
the foundational engine powering the entire AI

00:00:44.340 --> 00:00:46.640
revolution. And how it suddenly found itself

00:00:46.640 --> 00:00:49.359
right at the center of this massive war over

00:00:49.359 --> 00:00:51.439
the future of the internet itself. It's wild.

00:00:51.539 --> 00:00:53.439
OK, let's unpack this. Because before we get

00:00:53.439 --> 00:00:56.200
into the modern AI drama and all the lawsuits,

00:00:56.880 --> 00:00:58.380
we really need to understand what Common Crawl

00:00:58.380 --> 00:01:00.240
actually is. Yeah, we have to go back a bit.

00:01:00.380 --> 00:01:02.740
Because if you use AI tools today, you're literally

00:01:02.740 --> 00:01:05.400
querying the echoes of Common Crawl's work. Right.

00:01:05.680 --> 00:01:09.310
So, back in 2007, an entrepreneur named Gil

00:01:09.310 --> 00:01:13.909
Elbaz founded this organization as a 501(c)(3) nonprofit.

00:01:14.129 --> 00:01:16.090
And the stated mission back then was entirely

00:01:16.090 --> 00:01:18.409
focused on the public good. Just to crawl the

00:01:18.409 --> 00:01:21.170
web and freely provide its archives and data

00:01:21.170 --> 00:01:23.670
sets to the public, essentially like they wanted

00:01:23.670 --> 00:01:25.709
to make the Internet downloadable. Which, you

00:01:25.709 --> 00:01:28.349
know, doing that requires immense infrastructure

00:01:28.349 --> 00:01:31.469
and a very specific technical approach. So to

00:01:31.469 --> 00:01:34.109
visualize this web crawling, it's kind of like

00:01:34.109 --> 00:01:37.780
imagine a massive fleet of digital Roombas. Right.

00:01:37.859 --> 00:01:39.659
That's a great way to put it. Yeah. These automated

00:01:39.659 --> 00:01:42.379
software bots that start their journey on a few

00:01:42.379 --> 00:01:44.840
highly connected web pages. And they read the

00:01:44.840 --> 00:01:47.819
raw HTML code of a page. They vacuum up all the

00:01:47.819 --> 00:01:50.120
text, the images, the metadata. And then they

00:01:50.120 --> 00:01:52.120
look for any hyperlinks. Exactly. They click

00:01:52.120 --> 00:01:54.260
those links, travel to the next set of pages,

00:01:54.340 --> 00:01:56.200
and just repeat the whole process. They're just

00:01:56.200 --> 00:01:59.140
relentlessly moving across the open web, copying

00:01:59.140 --> 00:02:01.359
everything they touch, and dumping it into this

00:02:01.359 --> 00:02:04.159
massive searchable repository. Right. And they

00:02:04.159 --> 00:02:06.450
mirrored all the data and made it available online

00:02:06.450 --> 00:02:08.490
through the Internet Archive's Wayback Machine.

00:02:08.849 --> 00:02:10.650
Which is incredible. And they brought in some

00:02:10.650 --> 00:02:13.590
serious tech heavyweights as advisors early on.

00:02:13.750 --> 00:02:17.449
Yeah. People like Peter Norvig and Joi Ito. Big

00:02:17.449 --> 00:02:19.569
names. But I guess my question is, why do this

00:02:19.569 --> 00:02:22.530
as a nonprofit? Like, was the original goal just

00:02:22.530 --> 00:02:25.090
to be a giant backup hard drive for humanity?

00:02:25.530 --> 00:02:28.330
Well, no, it was fundamentally about democratizing

00:02:28.330 --> 00:02:30.629
research and development. OK. I mean, think about

00:02:30.629 --> 00:02:33.860
the landscape of the tech world in like the late

00:02:33.860 --> 00:02:36.639
2000s and early 2010s. If you were an academic

00:02:36.639 --> 00:02:39.639
wanting to analyze web scale data. Like tracking

00:02:39.639 --> 00:02:42.400
how slang spreads across internet forums or something.

00:02:42.599 --> 00:02:45.139
Exactly. Or studying search engine optimization

00:02:45.139 --> 00:02:47.680
trends or mapping misinformation. You simply

00:02:47.680 --> 00:02:49.780
couldn't do it. Because you'd need Google's servers.

00:02:49.979 --> 00:02:52.080
Right. You needed the resources of a massive

00:02:52.080 --> 00:02:54.939
corporation like Google or Yahoo. You needed

00:02:54.939 --> 00:02:57.879
server farms just to gather the raw data before

00:02:57.879 --> 00:03:00.539
you could even begin your analysis. So the barrier

00:03:00.539 --> 00:03:03.300
to entry was, what, millions of dollars and computing

00:03:03.300 --> 00:03:05.979
power just to get a snapshot of what people were

00:03:05.979 --> 00:03:08.379
talking about online? Yeah, and Common Crawl

00:03:08.379 --> 00:03:11.360
totally removed that barrier by doing the really

00:03:11.360 --> 00:03:14.680
expensive, labor-intensive vacuuming and then

00:03:14.680 --> 00:03:16.719
giving the resulting data set away for free.

00:03:16.860 --> 00:03:19.300
Which meant academics, independent researchers,

00:03:19.740 --> 00:03:22.599
small startups. They suddenly had access to the

00:03:22.599 --> 00:03:25.289
exact same scale of internet data that the tech

00:03:25.289 --> 00:03:28.069
giants had. And it worked perfectly for its intended

00:03:28.069 --> 00:03:30.870
purpose. I mean, long before AI chatbots were

00:03:30.870 --> 00:03:33.550
a thing, people built real, tangible tools with

00:03:33.550 --> 00:03:36.689
this. Like TinEye, right? Yes. By 2013, the reverse

00:03:36.689 --> 00:03:39.250
image search site TinEye built its products off

00:03:39.250 --> 00:03:42.490
Common Crawl's data. That's amazing. And academically,

00:03:42.569 --> 00:03:45.370
by 2024, the data set had been cited in over

00:03:45.370 --> 00:03:48.949
10,000 academic studies. It's a staggering foundational

00:03:48.949 --> 00:03:51.289
impact. It proved that if you provide the raw

00:03:51.289 --> 00:03:53.629
material of the internet for free, human ingenuity

00:03:53.629 --> 00:03:55.770
will find incredible ways to use it. It was

00:03:55.770 --> 00:03:58.449
almost this utopian vision of the internet. Open

00:03:58.449 --> 00:04:01.090
data leading to open innovation. But, you know,

00:04:01.430 --> 00:04:03.770
the nature of innovation is unpredictable. And

00:04:03.770 --> 00:04:06.330
the way human ingenuity ended up using this data

00:04:06.330 --> 00:04:09.150
shifted the entire global economy. Right, which

00:04:09.150 --> 00:04:12.280
brings us to the AI boom. The big pivot. Because

00:04:12.280 --> 00:04:16.060
you have this massive free digital library of

00:04:16.060 --> 00:04:18.180
human knowledge. That's great for academics writing

00:04:18.180 --> 00:04:20.600
papers. Yeah. But over the last few years, the

00:04:20.600 --> 00:04:23.240
tech landscape just like shifted beneath our

00:04:23.240 --> 00:04:26.040
feet. Completely. Artificial intelligence companies

00:04:26.040 --> 00:04:29.060
realized they desperately needed an unprecedented

00:04:29.060 --> 00:04:31.959
volume of text to train their new models. And

00:04:31.959 --> 00:04:34.720
they looked around and sitting right there was

00:04:34.720 --> 00:04:38.209
this colossal prepackaged snapshot of the internet.

00:04:38.509 --> 00:04:40.649
And this is where the paradigm completely shifts,

00:04:41.149 --> 00:04:43.170
because the invention of large language models

00:04:43.170 --> 00:04:46.110
or LLMs changed everything about how software

00:04:46.110 --> 00:04:49.189
is developed. Because they don't learn like traditional

00:04:49.189 --> 00:04:51.709
software. No, they don't learn by being programmed

00:04:51.709 --> 00:04:54.329
with rigid grammar rules by a human engineer.

00:04:54.649 --> 00:04:56.850
They learn through a process called next word

00:04:56.850 --> 00:04:59.490
prediction. OK, so you feed the model a sequence

00:04:59.490 --> 00:05:02.439
of words. and it just guesses the next word based

00:05:02.439 --> 00:05:04.800
on mathematical probabilities. Exactly. And to

00:05:04.800 --> 00:05:07.040
get good at that, like to converse fluidly like

00:05:07.040 --> 00:05:10.620
a human, to write code, to compose poetry, it

00:05:10.620 --> 00:05:12.779
requires a volume of text that we can barely

00:05:12.779 --> 00:05:15.839
comprehend. Right, you can't just feed an AI

00:05:15.829 --> 00:05:18.629
a thousand books or even a hundred thousand Wikipedia

00:05:18.629 --> 00:05:21.009
articles and expect it to pass the bar exam.

00:05:21.110 --> 00:05:24.009
Not at all. You need billions, if not trillions,

00:05:24.389 --> 00:05:28.189
of words to understand language context, subtle

00:05:28.189 --> 00:05:32.889
jokes, coding syntax, historical facts. The AI

00:05:32.889 --> 00:05:35.509
has to map the statistical relationships between

00:05:35.509 --> 00:05:39.430
tokens across a truly massive corpus of text.

00:05:39.629 --> 00:05:42.129
Tokens being like... Pieces of words. Yeah, pieces

00:05:42.129 --> 00:05:44.589
of words. And Common Crawl was simply the right

00:05:44.589 --> 00:05:47.569
data set at the exact right time. It transitioned

00:05:47.569 --> 00:05:50.189
almost overnight from an academic resource to

00:05:50.189 --> 00:05:52.370
the foundational infrastructure of a trillion

00:05:52.370 --> 00:05:54.610
dollar industry. I mean, a filtered version of

00:05:54.610 --> 00:05:57.449
Common Crawl was heavily utilized to train OpenAI's

00:05:57.449 --> 00:06:00.129
GPT-3, which was announced back in 2020. Yep.

00:06:00.410 --> 00:06:03.230
And it was used to train Google DeepMind's Gemini.

00:06:03.410 --> 00:06:05.829
Google even created its own specific, highly

00:06:05.829 --> 00:06:08.170
filtered version of Common Crawl back in 2019.

00:06:08.769 --> 00:06:11.250
They called it the Colossal Clean Crawled Corpus,

00:06:11.550 --> 00:06:13.750
or C4 for short. And they used that to train

00:06:13.750 --> 00:06:16.730
their T5 language model series. Without Common

00:06:16.730 --> 00:06:19.110
Crawl's historical archiving, the rapid advancement

00:06:19.110 --> 00:06:21.389
of these language models would have been severely

00:06:21.389 --> 00:06:23.790
bottlenecked. Because the data was the missing

00:06:23.790 --> 00:06:25.930
piece of the puzzle. Exactly. The tech companies

00:06:25.930 --> 00:06:28.089
had the massive computing power, and they had

00:06:28.089 --> 00:06:30.670
the brilliant algorithms, but they desperately

00:06:30.670 --> 00:06:33.829
needed the raw text to feed the machine. And

00:06:33.829 --> 00:06:35.589
Common Crawl provided it on a silver platter.

00:06:35.769 --> 00:06:37.970
Yeah, completely free. But I have to push back

00:06:37.970 --> 00:06:41.069
here, though, because this dynamic feels incredibly

00:06:41.069 --> 00:06:45.370
tilted. How so? Well, you have a 501(c)(3) nonprofit,

00:06:45.560 --> 00:06:48.680
funded originally by the Elbaz Family Foundation

00:06:48.680 --> 00:06:51.500
Trust, giving away its data for free out of a

00:06:51.500 --> 00:06:54.360
sense of public good. Right. And that free data

00:06:54.360 --> 00:06:57.040
becomes the absolute secret sauce for massively

00:06:57.040 --> 00:07:00.000
profitable multibillion-dollar AI companies.

00:07:00.079 --> 00:07:03.740
It does. And then, in 2023, Common Crawl begins

00:07:03.740 --> 00:07:06.579
receiving significant financial support directly

00:07:06.579 --> 00:07:10.160
from the AI industry. We're talking $250,000

00:07:10.160 --> 00:07:13.620
donations each from OpenAI and Anthropic. Yeah,

00:07:13.699 --> 00:07:15.860
those are big numbers. Doesn't that compromise

00:07:15.860 --> 00:07:17.680
their independent mission? I mean, it feels like

00:07:17.680 --> 00:07:19.500
the digital Roomba suddenly works for a very

00:07:19.500 --> 00:07:22.040
specific corporate master. It absolutely highlights

00:07:22.040 --> 00:07:24.779
a massive wealth transfer. And that is the exact

00:07:24.779 --> 00:07:26.920
tension the tech world is grappling with right

00:07:26.920 --> 00:07:30.740
now. OK. Because from one perspective, the nonprofit

00:07:30.740 --> 00:07:34.029
is just receiving necessary funding to continue

00:07:34.029 --> 00:07:37.209
its incredibly expensive mission. I mean, archiving

00:07:37.209 --> 00:07:40.110
the web at that scale. The server costs and bandwidth

00:07:40.110 --> 00:07:42.610
are astronomical. Right. So who better to fund

00:07:42.610 --> 00:07:44.670
it than the companies relying on it? Exactly.

00:07:44.949 --> 00:07:47.930
And the data is still technically free for anyone

00:07:47.930 --> 00:07:50.569
else to use. But you are absolutely right to

00:07:50.569 --> 00:07:52.829
point out the friction there. Yeah. The nonprofit

00:07:52.829 --> 00:07:55.269
does the heavy lifting of gathering the world's

00:07:55.269 --> 00:07:58.350
knowledge, and for-profit companies synthesize

00:07:58.350 --> 00:08:00.769
it into closed products that generate massive

00:08:00.769 --> 00:08:03.750
exclusive revenue. So you have AI companies turning

00:08:03.750 --> 00:08:07.220
incredible profits off free data. But there's a

00:08:07.220 --> 00:08:09.379
huge missing link here, which is the people who

00:08:09.379 --> 00:08:11.180
actually wrote that data in the first place.

00:08:11.420 --> 00:08:13.800
The creators. Right. When OpenAI or Google makes

00:08:13.800 --> 00:08:15.959
millions of dollars off a language model, the

00:08:15.959 --> 00:08:18.259
journalist, the blogger, or the novelist whose

00:08:18.259 --> 00:08:20.600
writing trained that model doesn't see a dime.

00:08:20.740 --> 00:08:23.100
Not a single dime. And that brings us to the

00:08:23.100 --> 00:08:25.500
massive copyright collision. Because when you

00:08:25.500 --> 00:08:28.480
scrape the entire open web, you inevitably scrape

00:08:28.480 --> 00:08:31.319
things that people want protected. Oh, totally.

00:08:31.699 --> 00:08:34.860
You capture copyrighted books, paywalled investigative

00:08:34.860 --> 00:08:39.080
articles, private blogs, recipe sites, proprietary

00:08:39.080 --> 00:08:42.940
code repositories, everything. And as far back

00:08:42.940 --> 00:08:45.539
as 2016, it was well documented that the Common

00:08:45.539 --> 00:08:48.659
Crawl data set included copyrighted work. But

00:08:48.659 --> 00:08:50.779
they were distributing it from the United States

00:08:50.779 --> 00:08:53.679
under fair use claims. Yes. So let's break down

00:08:53.679 --> 00:08:56.259
that fair use argument mechanically. How does

00:08:56.259 --> 00:08:59.419
vacuuming up someone's copyrighted book legally

00:08:59.419 --> 00:09:02.460
fly? Well, the concept of fair use in the US

00:09:02.460 --> 00:09:05.399
is pretty legally flexible. It generally allows

00:09:05.399 --> 00:09:08.100
for the use of copyrighted material without permission

00:09:08.100 --> 00:09:10.879
if the use is highly, quote unquote, transformative.

00:09:11.179 --> 00:09:13.700
OK, transformative. Like what? For example, a

00:09:13.700 --> 00:09:15.960
search engine indexing a web page and displaying

00:09:15.960 --> 00:09:18.200
a snippet of text so you can find the site. That's

00:09:18.200 --> 00:09:20.879
considered fair use. Right. So AI companies argue

00:09:20.879 --> 00:09:23.139
that training a machine learning model is also

00:09:23.139 --> 00:09:25.860
transformative. They argue they aren't republishing

00:09:25.860 --> 00:09:27.919
a copyrighted book to compete with the author.

00:09:27.960 --> 00:09:30.389
They're just analyzing the book to learn the

00:09:30.389 --> 00:09:32.450
underlying mechanics of human language. Exactly.

00:09:32.649 --> 00:09:34.950
Okay, but the internet isn't governed solely

00:09:34.950 --> 00:09:38.950
by US law. What happens when researchers or AI

00:09:38.950 --> 00:09:41.190
developers in Europe, where copyright laws are

00:09:41.190 --> 00:09:44.460
often much stricter, want to use this data? That

00:09:44.460 --> 00:09:47.159
is where we see some incredible legal and technical

00:09:47.159 --> 00:09:49.700
gymnastics. Oh boy. Because researchers outside

00:09:49.700 --> 00:09:52.320
the US face different liabilities, so they've

00:09:52.320 --> 00:09:55.580
had to invent these wild workarounds to interact

00:09:55.580 --> 00:09:57.759
with the data set without technically hosting

00:09:57.759 --> 00:10:00.220
copyrighted works. Wait, like what kind of workarounds?

00:10:00.440 --> 00:10:03.159
One prominent technique is literally shuffling

00:10:03.159 --> 00:10:05.399
the sentences of a document before analyzing

00:10:05.399 --> 00:10:09.509
it. Shut up. Really? So they take a novel or

00:10:09.509 --> 00:10:11.509
a deeply researched article and just put it into

00:10:11.509 --> 00:10:14.029
a digital blender? Precisely. You are literally

00:10:14.029 --> 00:10:17.169
destroying the artistic expression, the specific

00:10:17.169 --> 00:10:19.350
way the author ordered their thoughts, built

00:10:19.350 --> 00:10:21.610
their narrative, conveyed emotion. Which is what

00:10:21.610 --> 00:10:24.370
copyright actually protects. Exactly. But by

00:10:24.370 --> 00:10:26.889
keeping the words and sentences intact, just

00:10:26.889 --> 00:10:29.649
out of order, you preserve the statistical relationships

00:10:29.649 --> 00:10:32.149
between the individual words. Wow. And that statistical

00:10:32.149 --> 00:10:34.169
math is all the AI or the researcher actually

00:10:34.169 --> 00:10:37.480
cares about anyway. Right. Alternatively, developers

00:10:37.480 --> 00:10:39.679
would just reference the common crawl datasets

00:10:39.679 --> 00:10:42.539
remote location rather than hosting the data

00:10:42.539 --> 00:10:45.059
themselves, kind of shifting the legal liability

00:10:45.059 --> 00:10:47.379
away from their own servers. So they basically

00:10:47.379 --> 00:10:51.139
found a legal loophole. They destroy the art

00:10:51.139 --> 00:10:53.379
to extract the data. Essentially, yes. Let's

00:10:53.379 --> 00:10:55.519
look at this through a different lens. Imagine

00:10:55.519 --> 00:10:58.179
someone going to a massive public library and

00:10:58.179 --> 00:11:00.960
taking millions of free books. Okay. They feed

00:11:00.960 --> 00:11:03.620
all those books into a giant industrial processor

00:11:03.620 --> 00:11:07.659
and out comes this omniscient robot. And that

00:11:07.659 --> 00:11:10.639
robot then stands right outside the library doors

00:11:10.639 --> 00:11:14.059
and charges people a fee to answer any question

00:11:14.059 --> 00:11:16.120
they have. Perfectively ensuring those people

00:11:16.120 --> 00:11:18.519
never need to go inside, check out a book, or

00:11:18.519 --> 00:11:20.279
support the author. Exactly. The original authors

00:11:20.279 --> 00:11:21.759
are going to look at that robot and say, hey,

00:11:21.860 --> 00:11:23.960
wait a minute. You used my life's work to build

00:11:23.960 --> 00:11:26.399
my replacement. And that captures the existential

00:11:26.399 --> 00:11:29.350
threat. to creators perfectly. I mean, the internet

00:11:29.350 --> 00:11:31.809
was built on the premise of open sharing and

00:11:31.809 --> 00:11:34.490
indexing. When a writer put an article online

00:11:34.490 --> 00:11:38.769
in, say, 2010, they wanted a crawler like Google

00:11:38.769 --> 00:11:41.669
to index it so humans could find it, read it,

00:11:41.769 --> 00:11:44.529
and maybe click an ad or subscribe. But AI

00:11:44.529 --> 00:11:47.190
training wasn't the intended use. No. They never

00:11:47.190 --> 00:11:49.929
consented to a machine ingesting their work to

00:11:49.929 --> 00:11:52.549
learn how to mimic their writing style. And then

00:11:52.549 --> 00:11:55.799
just... answer user queries directly, bypassing

00:11:55.799 --> 00:11:57.960
the author's website entirely. And creators are

00:11:57.960 --> 00:12:00.139
actively waking up to this and fighting back.

00:12:00.299 --> 00:12:03.379
Big time. There was a 2024 New York Times study

00:12:03.379 --> 00:12:06.519
by Kevin Roose that revealed a staggering statistic.

00:12:06.840 --> 00:12:09.960
45% of content is now explicitly restricted

00:12:09.960 --> 00:12:12.100
by websites. That's nearly half the internet.

00:12:12.299 --> 00:12:14.440
Yeah. They are actively putting up digital do

00:12:14.440 --> 00:12:16.820
not enter signs because they refuse to be scraped

00:12:16.820 --> 00:12:19.539
without compensation. We also saw major concerns

00:12:19.539 --> 00:12:21.799
raised over copyrighted content specifically

00:12:21.799 --> 00:12:24.299
inside Google's C4 data set, which was thoroughly

00:12:24.299 --> 00:12:26.899
reported by The Guardian in 2023. The tension

00:12:26.899 --> 00:12:30.399
is just palpable. It is. The core question society

00:12:30.399 --> 00:12:32.559
is wrestling with right now is what fair use

00:12:32.559 --> 00:12:35.559
really means when a machine rather than a human

00:12:35.559 --> 00:12:38.059
is doing the reading at a scale of billions of

00:12:38.059 --> 00:12:40.500
pages a day. Which means all of this underlying

00:12:40.500 --> 00:12:43.059
friction eventually hits an absolute boiling

00:12:43.059 --> 00:12:47.559
point. And it did. November 2025. Technology

00:12:47.559 --> 00:12:50.820
journalist Alex Reisner publishes this massive

00:12:50.820 --> 00:12:53.500
explosive investigation in The Atlantic. Yes.

00:12:53.659 --> 00:12:55.899
And the allegations leveled against Common Crawl

00:12:55.899 --> 00:12:58.659
are severe. The investigation claimed that Common

00:12:58.659 --> 00:13:01.399
Crawl was explicitly bypassing publisher requests

00:13:01.399 --> 00:13:03.700
to have their content removed from its databases.

00:13:04.039 --> 00:13:05.919
And to understand the gravity of that, we have

00:13:05.919 --> 00:13:08.240
to explain how those removal requests usually

00:13:08.240 --> 00:13:10.320
work. Right. Let's get into the technical side.

00:13:10.539 --> 00:13:12.779
When a website wants to block a web crawler,

00:13:13.019 --> 00:13:16.220
they typically use a file called robots.txt.

00:13:16.799 --> 00:13:19.399
It's essentially a simple text file sitting on

00:13:19.399 --> 00:13:22.000
the website server that acts as a digital traffic

00:13:22.000 --> 00:13:25.139
cop. It tells incoming bots, you are allowed

00:13:25.139 --> 00:13:27.159
here, but you are not allowed there. Okay, so

00:13:27.159 --> 00:13:29.500
a publisher puts up the stop sign. Right. And

00:13:29.500 --> 00:13:31.399
the allegation in The Atlantic was that Common

00:13:31.399 --> 00:13:33.960
Crawl's bots were simply ignoring these standard

00:13:33.960 --> 00:13:37.019
digital traffic lights or actively bypassing

00:13:37.019 --> 00:13:39.740
paywalls to get to the text underneath. And there's

00:13:39.740 --> 00:13:42.779
a highly specific technical piece of this allegation

00:13:42.779 --> 00:13:46.340
that really stands out. The piece claimed that

00:13:46.340 --> 00:13:48.240
the public search function on Common Crawl's

00:13:48.240 --> 00:13:51.139
own website was fundamentally misleading. And

00:13:51.139 --> 00:13:53.659
this is the part that caused massive, massive

00:13:53.659 --> 00:13:56.490
outrage. Explain how that worked. So if you are

00:13:56.490 --> 00:14:00.169
a publisher and you add that robots.txt file

00:14:00.169 --> 00:14:03.129
to block Common Crawl, you might go to Common

00:14:03.129 --> 00:14:05.250
Crawl's website a few weeks later and search for

00:14:05.250 --> 00:14:08.750
your domain just to verify they complied. Makes

00:14:08.750 --> 00:14:10.509
sense. You want to check their work. Exactly.

00:14:10.970 --> 00:14:12.590
And according to the investigation, the public-facing

00:14:12.590 --> 00:14:15.309
search tool would show zero entries for

00:14:15.309 --> 00:14:17.039
your site. So you'd look at that and think you

00:14:17.039 --> 00:14:19.179
were safe. Right. You'd think, OK, they respected

00:14:19.179 --> 00:14:21.919
my request. But the allegation was that while

00:14:21.919 --> 00:14:25.059
the public search index hid your site, your data

00:14:25.059 --> 00:14:27.700
was actually still included in the underlying

00:14:27.700 --> 00:14:30.759
raw scraped data dumps. The WARC files. Yes,

00:14:30.759 --> 00:14:33.279
the WARC files. And those are the files being

00:14:33.279 --> 00:14:35.879
handed over in bulk to the AI companies. Wow.

00:14:36.879 --> 00:14:40.120
If true, that is a massive breach of trust. It

00:14:40.120 --> 00:14:44.509
suggests this two-tiered system, like a sanitized,

00:14:44.669 --> 00:14:47.250
compliant public face to appease angry publishers,

00:14:47.730 --> 00:14:50.210
and then a backend data pipeline that quietly

00:14:50.210 --> 00:14:53.149
ignores restrictions to just keep feeding the

00:14:53.149 --> 00:14:56.269
AI industry's insatiable appetite for fresh data.

00:14:56.730 --> 00:14:59.070
And publishers did not take this lightly at all.

00:14:59.389 --> 00:15:01.990
A report by Wired noted that publishers are now

00:15:01.990 --> 00:15:04.470
actively targeting Common Crawl in these fights

00:15:04.470 --> 00:15:07.110
over AI training data. They're realizing that

00:15:07.110 --> 00:15:09.149
if they want to stop AI companies from using

00:15:09.149 --> 00:15:12.039
their work, suing the AI companies might not

00:15:12.039 --> 00:15:14.039
be enough. Right, they have to go directly after

00:15:14.039 --> 00:15:16.799
the supplier of the raw material. But Common Crawl

00:15:16.799 --> 00:15:19.580
didn't just absorb the blow. No, they fired back.

00:15:19.860 --> 00:15:22.399
Rich Skrenta published a formal public reply

00:15:22.399 --> 00:15:25.799
titled "Setting the Record Straight," vigorously

00:15:25.799 --> 00:15:27.799
defending the organization. And I really want

00:15:27.799 --> 00:15:29.840
to break down both sides of this battlefield

00:15:29.840 --> 00:15:31.759
impartially, because this isn't just a technical

00:15:31.759 --> 00:15:33.940
dispute. No, it's a fundamental clash over who

00:15:33.940 --> 00:15:35.779
owns the information on the internet. Exactly.

00:15:36.039 --> 00:15:38.179
So let's start with the publishers. Their argument

00:15:38.179 --> 00:15:40.919
is rooted purely in commercial survival. Absolutely,

00:15:41.139 --> 00:15:43.759
because investigative journalism, high-quality

00:15:43.759 --> 00:15:47.120
literature, detailed research, that stuff costs

00:15:47.120 --> 00:15:49.480
real money to produce. Yeah, you have to pay

00:15:49.480 --> 00:15:52.220
writers. Paywalls and ad revenue are how these

00:15:52.220 --> 00:15:55.000
entities survive in the digital age. So from

00:15:55.000 --> 00:15:57.059
the perspective of The Atlantic and the wider

00:15:57.059 --> 00:16:00.279
publishing industry, if an organization bypasses

00:16:00.279 --> 00:16:03.539
those paywalls, copies the proprietary content,

00:16:03.860 --> 00:16:06.820
and feeds it into an AI. An AI that can then

00:16:06.820 --> 00:16:09.600
just summarize that exact content for an end

00:16:09.600 --> 00:16:11.759
user. Then the original publisher loses their

00:16:11.759 --> 00:16:13.559
traffic. They lose their subscriber revenue.

00:16:13.720 --> 00:16:16.450
Right. To them, bypassing a paywall to scrape

00:16:16.450 --> 00:16:19.590
data is theft, plain and simple, and their intellectual

00:16:19.590 --> 00:16:22.009
property absolutely must be protected for their

00:16:22.009 --> 00:16:24.230
industries to survive. It's a very clear line.

00:16:24.370 --> 00:16:26.730
If you fund a six-month journalistic investigation,

00:16:27.210 --> 00:16:29.669
you deserve to get paid for it, not have a machine

00:16:29.669 --> 00:16:31.769
ingest it for free and spit out the bullet points

00:16:31.769 --> 00:16:34.389
to millions of people. Exactly. But then we have

00:16:34.389 --> 00:16:36.889
Common Crawl's defense, which is equally foundational

00:16:36.889 --> 00:16:38.669
to how the architecture of the internet actually

00:16:38.669 --> 00:16:41.879
works. Yeah, Common Crawl's defense, as outlined

00:16:41.879 --> 00:16:44.120
in Skrenta's "Setting the Record Straight," rests

00:16:44.120 --> 00:16:46.799
on their long-standing commitment to transparency

00:16:46.799 --> 00:16:49.259
and their belief in the public good. They really

00:16:49.259 --> 00:16:51.139
push back on the idea of malicious intent, right?

00:16:51.179 --> 00:16:53.860
They do. They defend their technical processes,

00:16:54.259 --> 00:16:56.639
noting that web crawling at the scale of billions

00:16:56.639 --> 00:16:59.659
of pages is an imperfect science. It's not some

00:16:59.659 --> 00:17:03.360
malicious conspiracy to steal data. And their

00:17:03.360 --> 00:17:05.079
perspective is that locking down the internet

00:17:05.079 --> 00:17:07.099
behind paywalls and opting out of historical

00:17:07.099 --> 00:17:09.920
archives fundamentally damages the open web.

00:17:10.160 --> 00:17:12.539
Because if everything is locked down, only the

00:17:12.539 --> 00:17:14.619
richest companies can afford to buy access to

00:17:14.619 --> 00:17:17.670
the data. through licensing deals. Exactly. And

00:17:17.670 --> 00:17:20.369
then we are right back to the 2007 problem where

00:17:20.369 --> 00:17:23.329
only a massive tech giant can afford to do research.

00:17:23.390 --> 00:17:26.490
Wow. Full circle. Precisely. To Common Crawl,

00:17:26.789 --> 00:17:29.849
indexing the web even for AI training is a transformative

00:17:29.849 --> 00:17:32.509
use that ultimately benefits society by driving

00:17:32.509 --> 00:17:35.269
massive technological progress. They sort of

00:17:35.269 --> 00:17:37.329
view themselves as digital librarians, don't

00:17:37.329 --> 00:17:40.069
they? They do. And a librarian doesn't pay a

00:17:40.069 --> 00:17:41.769
royalty every time someone reads a book in the

00:17:41.769 --> 00:17:43.789
library to learn something new or be inspired

00:17:43.789 --> 00:17:45.750
to write their own book. Right. They believe

00:17:45.750 --> 00:17:48.190
that archiving the public web is a vital service

00:17:48.190 --> 00:17:51.849
to humanity and restricting that archive actively

00:17:51.849 --> 00:17:55.009
hinders innovation. It's the classic unstoppable

00:17:55.009 --> 00:17:58.890
force meeting an immovable object. The commercial

00:17:58.890 --> 00:18:02.250
necessity for human creators to survive versus

00:18:02.250 --> 00:18:04.869
the technological drive to organize and process

00:18:05.200 --> 00:18:07.680
all human knowledge. And caught right in the

00:18:07.680 --> 00:18:11.180
middle of it is a 501(c)(3) nonprofit that started

00:18:11.180 --> 00:18:13.579
with the simple goal of giving researchers a

00:18:13.579 --> 00:18:15.980
downloadable backup of the internet. It perfectly

00:18:15.980 --> 00:18:19.019
illustrates how quickly technology outpaces our

00:18:19.019 --> 00:18:21.740
legal and ethical frameworks. I mean, the infrastructure

00:18:21.740 --> 00:18:23.960
of the open web was built for human readers in

00:18:23.960 --> 00:18:26.559
one era, and it has been entirely co-opted by

00:18:26.559 --> 00:18:28.880
machine readers in another. It really has. It's

00:18:28.880 --> 00:18:30.740
a profound journey when you step back and look

00:18:30.740 --> 00:18:33.369
at it. We started this deep dive looking at Gil

00:18:33.369 --> 00:18:36.710
Elbaz in 2007, building digital Roombas to help

00:18:36.710 --> 00:18:39.769
academics spot trends. And we've ended up in

00:18:39.769 --> 00:18:42.529
a landscape where those same Roombas are funded

00:18:42.529 --> 00:18:46.170
by massive tech conglomerates, accused of sneaking

00:18:46.170 --> 00:18:49.970
past paywalls, and serving as the primary battleground

00:18:49.970 --> 00:18:52.480
for the future of copyright law. The most crucial

00:18:52.480 --> 00:18:55.000
takeaway for you, the listener, is to remember

00:18:55.000 --> 00:18:58.240
that every single time you prompt an AI, you

00:18:58.240 --> 00:19:00.519
are not just talking to a clever algorithm. No,

00:19:00.740 --> 00:19:03.579
you are querying the entire archived history

00:19:03.579 --> 00:19:06.400
of the open web, vacuumed up and processed by

00:19:06.400 --> 00:19:09.720
Common Crawl. But that open web is closing. As

00:19:09.720 --> 00:19:12.000
we discussed, nearly half of the internet's content

00:19:12.000 --> 00:19:14.220
is now explicitly restricted. Yeah, with these

00:19:14.220 --> 00:19:16.819
massive controversies, publishers, journalists,

00:19:16.980 --> 00:19:19.319
and creators are building digital fences faster

00:19:19.319 --> 00:19:21.640
than ever. They are locking their doors. Which

00:19:21.640 --> 00:19:23.559
leaves you with a really profound question to

00:19:23.559 --> 00:19:26.190
mull over. If the open web continues to build

00:19:26.190 --> 00:19:28.509
these fences, what happens to the next generation

00:19:28.509 --> 00:19:30.990
of artificial intelligence? Will future models

00:19:30.990 --> 00:19:33.650
only be able to learn from a closed-off, highly

00:19:33.650 --> 00:19:36.029
commercialized, sanitized version of the Internet?

00:19:36.289 --> 00:19:39.329
The AI magic trick we enjoy today relies entirely

00:19:39.329 --> 00:19:41.970
on a massive, hidden engine of free human knowledge.

00:19:42.309 --> 00:19:44.549
But if those raw materials dry up or get locked

00:19:44.549 --> 00:19:46.970
behind ironclad legal vaults, what does a

00:19:46.970 --> 00:19:48.910
post-Common Crawl Internet look like for a learner

00:19:48.910 --> 00:19:49.789
seeking the truth?