WEBVTT

00:00:00.640 --> 00:00:02.459
You know, I was actually thinking about the concept

00:00:02.459 --> 00:00:04.599
of a library the other day. Oh yeah, just the

00:00:04.599 --> 00:00:06.580
general idea of one. Yeah, specifically that

00:00:06.580 --> 00:00:08.699
really old-school image. You know, you picture

00:00:08.699 --> 00:00:11.460
the smell of dust, the silence, rows and rows

00:00:11.460 --> 00:00:14.560
of physical books that supposedly contain everything

00:00:14.560 --> 00:00:17.739
humanity has ever thought or written down. It's

00:00:17.739 --> 00:00:20.820
a really powerful image. It is. It's basically

00:00:20.820 --> 00:00:23.059
the external hard drive of human civilization.

00:00:23.480 --> 00:00:27.129
Exactly. But today, for our deep dive... We're

00:00:27.129 --> 00:00:29.410
going to talk about a library that makes the

00:00:29.410 --> 00:00:31.609
Library of Congress look like, I don't know,

00:00:31.690 --> 00:00:35.030
a distinct stack of Post-it notes. And the crazy

00:00:35.030 --> 00:00:37.530
thing is this library doesn't store books. It

00:00:37.530 --> 00:00:40.670
doesn't store maps. It stores the literal instructions

00:00:40.670 --> 00:00:44.549
for building, well, everything alive. Or at least

00:00:44.549 --> 00:00:46.609
everything alive that we've managed to actually

00:00:46.609 --> 00:00:49.549
run through a sequencer so far. Right. We are

00:00:49.549 --> 00:00:51.649
diving into GenBank, and if you're joining us

00:00:51.649 --> 00:00:53.649
for this deep dive, you probably already know

00:00:53.649 --> 00:00:55.950
we love a mind-boggling number to set the stage.

00:00:56.149 --> 00:00:58.250
So to give you a sense of the sheer scale of

00:00:58.250 --> 00:01:01.090
this place right off the dot, as of October 2024,

00:01:01.549 --> 00:01:04.689
this database contained 34 trillion base pairs

00:01:04.689 --> 00:01:07.689
of data. Yeah, 34 trillion. It is a number that

00:01:07.689 --> 00:01:10.549
is genuinely hard to even wrap your head around.

00:01:10.689 --> 00:01:13.099
It's totally abstract. Yeah. Right. I mean, I

00:01:13.099 --> 00:01:15.879
can't picture 34 trillion of anything. But by

00:01:15.879 --> 00:01:17.579
the end of this conversation, we're going to

00:01:17.579 --> 00:01:20.459
make that number feel very real. Because GenBank

00:01:20.459 --> 00:01:23.299
isn't just some cold storage facility for data.

00:01:23.400 --> 00:01:26.540
It is effectively the operating system for modern

00:01:26.540 --> 00:01:29.060
biology. Oh, absolutely. Like if you've eaten

00:01:29.060 --> 00:01:31.700
bread today, taken an antibiotic or had a COVID

00:01:31.700 --> 00:01:34.439
test, your life has intersected with this database.

00:01:34.780 --> 00:01:37.379
That is no exaggeration either. GenBank is the

00:01:37.379 --> 00:01:39.739
gold standard. It's the open access collection

00:01:39.739 --> 00:01:42.280
of all publicly available nucleotide sequences

00:01:42.280 --> 00:01:45.340
and their protein translations. If you're a researcher,

00:01:45.500 --> 00:01:47.620
you aren't just visiting this library. You are

00:01:47.620 --> 00:01:49.680
practically living in it. It's the water the

00:01:49.680 --> 00:01:52.340
scientific community swims in. But here is the

00:01:52.340 --> 00:01:54.280
hook. And really the reason I wanted to pull

00:01:54.280 --> 00:01:57.159
these specific sources for us today. We tend to

00:01:57.159 --> 00:01:59.959
think of scientific databases as these pristine,

00:02:00.040 --> 00:02:03.680
infallible vaults of absolute truth. Like, you

00:02:03.680 --> 00:02:05.239
know, if it's in the computer, it must be right.

00:02:05.400 --> 00:02:07.480
And that is a very dangerous assumption to make.

00:02:07.599 --> 00:02:09.900
Because as we dug into the reports on GenBank,

00:02:10.159 --> 00:02:12.819
we found that this library is messy. It's growing

00:02:12.819 --> 00:02:15.400
so incredibly fast, doubling in size every 18

00:02:15.400 --> 00:02:17.819
months, actually, that it's starting to break

00:02:17.819 --> 00:02:20.580
the systems meant to organize it. We've got fake

00:02:20.580 --> 00:02:23.740
news in the form of misidentified fish. We've

00:02:23.740 --> 00:02:25.759
got anonymous turtles that shouldn't exist. And

00:02:25.759 --> 00:02:28.080
we have a quote unquote leaderboard of life that

00:02:28.080 --> 00:02:30.539
puts common bread wheat way above human beings.

00:02:30.659 --> 00:02:33.319
It is a perfect storm of exponential growth and

00:02:33.319 --> 00:02:36.180
human error. It's a testament to human curiosity,

00:02:36.360 --> 00:02:40.020
sure, but also to our fallibility. So today's

00:02:40.020 --> 00:02:43.500
mission is to basically unpack this beast. We're

00:02:43.500 --> 00:02:44.699
going to look at the scale. We're going to look

00:02:44.699 --> 00:02:46.340
at those charts to see who the main characters

00:02:46.340 --> 00:02:48.900
of biological research really are. And we're

00:02:48.900 --> 00:02:51.340
going to talk about why, despite being the scientific

00:02:51.340 --> 00:02:54.120
Bible for so many, it might be harboring some

00:02:54.120 --> 00:02:56.300
pretty significant glitches. It's a true deep

00:02:56.300 --> 00:02:59.379
dive into the very infrastructure of modern biology.

00:02:59.680 --> 00:03:01.800
So let's start with that number again, 34 trillion

00:03:01.800 --> 00:03:04.379
base pairs. Help us break down the actual inventory

00:03:04.379 --> 00:03:06.750
here. It's difficult because the scale is just

00:03:06.750 --> 00:03:09.229
so vast. But let's look at the distinct sequences.

00:03:09.590 --> 00:03:12.449
As of release 250, which came out in October

00:03:12.449 --> 00:03:17.310
2024, we are looking at over 4.7 billion distinct

00:03:17.310 --> 00:03:20.090
nucleotide sequences. And just to ensure we're

00:03:20.090 --> 00:03:22.430
perfectly aligned here, when we say nucleotide

00:03:22.430 --> 00:03:24.349
sequences, we're talking about the A, C, T, and

00:03:24.349 --> 00:03:27.229
Gs, right? The raw code of DNA. Correct. The

00:03:27.229 --> 00:03:29.930
base genetic code. And that data covers more

00:03:29.930 --> 00:03:34.800
than 580,000 formally described species. So

00:03:34.800 --> 00:03:36.800
from the absolutely smallest bacteria to the

00:03:36.800 --> 00:03:38.580
blue whale, if a scientist has sequenced it,

00:03:38.659 --> 00:03:40.979
it is likely sitting on a server in Maryland

00:03:40.979 --> 00:03:44.680
right now. 580,000 species. That is an absurd

00:03:44.680 --> 00:03:47.099
amount of life. But what really jumped out at

00:03:47.099 --> 00:03:49.270
me in the source material was the speed. In the

00:03:49.270 --> 00:03:50.810
tech world, you know, we always talk about Moore's

00:03:50.810 --> 00:03:53.289
Law. Right, the idea that computing power doubles

00:03:53.289 --> 00:03:55.770
roughly every two years, the benchmark for rapid

00:03:55.770 --> 00:03:58.689
progress. Exactly. But GenBank seems to be leaving

00:03:58.689 --> 00:04:01.169
Moore's Law completely in the dust. Yeah, biology

00:04:01.169 --> 00:04:04.129
has its own version of that law, and it is remarkably

00:04:04.129 --> 00:04:06.849
aggressive. Since GenBank started back in 1982,

00:04:07.270 --> 00:04:09.750
the number of bases in the database has doubled

00:04:09.750 --> 00:04:12.229
approximately every 18 months. So it's actually

00:04:12.229 --> 00:04:14.349
growing faster than the computers we use to analyze

00:04:14.349 --> 00:04:17.649
it. In some ways, yes, it is exponential growth

00:04:17.649 --> 00:04:20.689
on steroids. Imagine a physical library where

00:04:20.689 --> 00:04:22.750
every year and a half you literally have to build

00:04:22.750 --> 00:04:25.209
a new wing twice the size of the previous one

00:04:25.209 --> 00:04:28.509
and you fill it instantly. And who is building

00:04:28.509 --> 00:04:30.250
these wings? Like who is actually running this

00:04:30.250 --> 00:04:33.310
thing? Because managing 34 trillion base pairs

00:04:33.310 --> 00:04:35.870
sounds like an administrative nightmare. It is

00:04:35.870 --> 00:04:38.350
a massive coordination effort, as you'd expect.

00:04:38.550 --> 00:04:41.730
It's produced and maintained by the NCBI, which

00:04:41.730 --> 00:04:43.829
is the National Center for Biotechnology Information.

00:04:44.230 --> 00:04:46.689
Which falls under the NIH, the National Institutes

00:04:46.689 --> 00:04:48.790
of Health in the U.S. Correct. But, and this

00:04:48.790 --> 00:04:51.529
is a really crucial distinction, it is not just

00:04:51.529 --> 00:04:54.250
an American project. Science doesn't really respect

00:04:54.250 --> 00:04:57.189
borders and neither does DNA. GenBank is part

00:04:57.189 --> 00:04:59.769
of the International Nucleotide Sequence Database

00:04:59.769 --> 00:05:03.009
Collaboration, or INSDC. That is quite a mouthful.

00:05:03.089 --> 00:05:05.370
It is. But basically it means they are constantly

00:05:05.370 --> 00:05:08.350
synced up with the DNA Data Bank of Japan and

00:05:08.350 --> 00:05:11.689
the European Nucleotide Archive. So while the

00:05:11.689 --> 00:05:14.310
main servers might be in Maryland, the effort

00:05:14.310 --> 00:05:17.250
is truly global. It's practically a planetary

00:05:17.250 --> 00:05:20.750
brain. Okay, so if the server is the brain, I

00:05:20.750 --> 00:05:22.449
really want to know what it's thinking about

00:05:22.449 --> 00:05:25.459
the most. One of the most fun things we found

00:05:25.459 --> 00:05:27.899
in the source material was this concept of the

00:05:27.899 --> 00:05:31.500
top 20 organisms. The leaderboard of life. Yes,

00:05:31.540 --> 00:05:34.100
exactly. I want to see who the main characters

00:05:34.100 --> 00:05:38.060
are. If we look at the organism with the absolute

00:05:38.060 --> 00:05:42.259
most base pairs stored in GenBank, who takes

00:05:42.259 --> 00:05:44.399
the gold medal? Now, I'm going to go out on a

00:05:44.399 --> 00:05:46.959
limb here. I feel like humans are pretty obsessed

00:05:46.959 --> 00:05:49.740
with ourselves. We sequence the human genome.

00:05:49.860 --> 00:05:52.759
We're constantly studying our own diseases. Surely

00:05:52.759 --> 00:05:55.040
we are number one on this list. That is the most

00:05:55.040 --> 00:05:57.040
reasonable guess you could possibly make. And

00:05:57.040 --> 00:05:59.120
you would be completely wrong. Of course I am.

00:05:59.180 --> 00:06:00.920
We didn't even win our own popularity contest.

00:06:01.180 --> 00:06:03.399
We didn't even make the podium. Homo sapiens,

00:06:03.399 --> 00:06:05.339
humans, actually come in at number five. We have

00:06:05.339 --> 00:06:08.839
about 27.8 billion base pairs stored in GenBank.

00:06:09.120 --> 00:06:11.500
Number five. Wow, that is actually kind of humbling.

00:06:11.560 --> 00:06:14.980
Okay, so who beat us? Who is the undisputed heavyweight

00:06:14.980 --> 00:06:17.740
champion of GenBank? That honor goes to Triticum

00:06:17.740 --> 00:06:20.720
aestivum. Triticum. That sounds like... Common

00:06:20.720 --> 00:06:22.899
bread wheat. Wheat, this stuff in my toast is

00:06:22.899 --> 00:06:26.019
number one. By an absolute landslide, wheat has

00:06:26.019 --> 00:06:29.379
a massive presence in the database, about 215

00:06:29.379 --> 00:06:32.779
billion base pairs. That is nearly eight times the

00:06:32.779 --> 00:06:35.319
amount of data we have for human beings. Why

00:06:35.319 --> 00:06:38.300
wheat? Is wheat secretly way more complex than

00:06:38.300 --> 00:06:40.490
us? Or are we just that obsessed with carbs?

00:06:40.689 --> 00:06:42.930
Well, it's a bit of both, actually. Genetically,

00:06:43.089 --> 00:06:46.689
wheat is a beast. It has a huge, highly complex

00:06:46.689 --> 00:06:49.069
genome. It's hexaploid, meaning it has six sets

00:06:49.069 --> 00:06:52.129
of chromosomes. But really, this reflects human

00:06:52.129 --> 00:06:54.589
civilization's absolute dependence on agriculture.

00:06:55.050 --> 00:06:57.750
Right. If the global wheat crop fails, we have

00:06:57.750 --> 00:07:00.329
a global catastrophe. Exactly. We are studying

00:07:00.329 --> 00:07:02.810
it constantly for food security, for yield optimization,

00:07:03.110 --> 00:07:05.629
for disease resistance. We are obsessed with

00:07:05.629 --> 00:07:07.529
wheat because our survival fundamentally depends

00:07:07.529 --> 00:07:10.500
on it. Scientifically, it gets the lion's share

00:07:10.500 --> 00:07:12.079
of the attention. Okay, that makes total sense.

00:07:12.220 --> 00:07:14.079
Respect the wheat. So who took the silver medal?

00:07:14.160 --> 00:07:15.779
Who's number two? Now, this one tells a very

00:07:15.779 --> 00:07:17.800
specific story about our recent history. It's

00:07:17.800 --> 00:07:20.480
essentially a snapshot of a crisis. Number two

00:07:20.480 --> 00:07:24.870
is SARS-CoV-2. The COVID virus. Exactly. And

00:07:24.870 --> 00:07:27.610
just think about how remarkable that is. This

00:07:27.610 --> 00:07:30.509
is a virus that functionally didn't exist in

00:07:30.509 --> 00:07:32.430
the public consciousness or the database just

00:07:32.430 --> 00:07:36.430
a few years ago. Now it has about 165 billion

00:07:36.430 --> 00:07:40.230
base pairs stored. That is incredible. It really

00:07:40.230 --> 00:07:43.129
visualizes the pivot, doesn't it? It shows the

00:07:43.129 --> 00:07:46.029
exact moment the entire global scientific community

00:07:46.029 --> 00:07:48.930
just dropped what it was doing and stared at

00:07:48.930 --> 00:07:51.529
this one single thing. It really does. It shows

00:07:51.529 --> 00:07:54.209
how GenBank acts as a mirror for scientific priority.

00:07:54.449 --> 00:07:56.730
It doesn't just store biology. It stores our

00:07:56.730 --> 00:07:59.430
reaction to biological crises. When the pandemic

00:07:59.430 --> 00:08:02.269
hit, every lab with a sequencer anywhere in the

00:08:02.269 --> 00:08:04.670
world started uploading viral genomes. So we

00:08:04.670 --> 00:08:06.910
have wheat at number one, COVID at number two.

00:08:06.990 --> 00:08:10.509
Who rounds out the top three? Barley. More grains.

00:08:11.949 --> 00:08:13.709
We really are just farmers with fancy computers,

00:08:13.870 --> 00:08:15.850
aren't we? Humans like their beer and their bread.

00:08:15.889 --> 00:08:17.689
It's pretty undeniable at this point. Okay, so

00:08:17.689 --> 00:08:20.470
wheat, the virus, barley. Then the lab mouse

00:08:20.470 --> 00:08:23.290
at number four, I assume. Yes, Mus musculus

00:08:23.290 --> 00:08:26.379
is number four, sitting right there just ahead

00:08:26.379 --> 00:08:29.279
of humans. And if you go further down the list,

00:08:29.339 --> 00:08:31.620
past humans at number five, you see the other

00:08:31.620 --> 00:08:34.720
standard research staples. You have E. coli at

00:08:34.720 --> 00:08:37.240
number seven. The absolute workhorse of the lab.

00:08:37.399 --> 00:08:40.200
Exactly. You have the zebrafish, Danio rerio,

00:08:40.340 --> 00:08:43.600
at number nine. Dogs and pigs also make the top

00:08:43.600 --> 00:08:45.860
20 list. It's really interesting that the list

00:08:45.860 --> 00:08:48.679
isn't necessarily the most complex animals or

00:08:48.679 --> 00:08:51.080
the coolest animals. Like, I don't see any lions

00:08:51.080 --> 00:08:53.279
or tigers on there. It's literally just the animals

00:08:53.279 --> 00:08:55.840
we need. Precisely. It is a pure reflection of

00:08:55.840 --> 00:08:58.059
human utility. We sequence what we eat, what

00:08:58.059 --> 00:08:59.860
makes us sick, and what lives in our houses.

00:09:00.120 --> 00:09:02.279
It's a very pragmatic list when you look at it.

00:09:02.399 --> 00:09:04.919
So how did this all begin? You mentioned 1982

00:09:04.919 --> 00:09:08.100
earlier. That feels like the Stone Age for computers.

00:09:08.240 --> 00:09:10.899
I'm picturing giant tape drives and those green

00:09:10.899 --> 00:09:13.820
text screens. You're not far off at all. And

00:09:13.820 --> 00:09:15.580
the origin story is actually quite surprising

00:09:15.580 --> 00:09:17.399
because it starts right in the shadow of the

00:09:17.399 --> 00:09:19.820
Cold War. Really? Yeah. How do you get from the

00:09:19.820 --> 00:09:23.240
Cold War to DNA sequencing? It traces back to

00:09:23.240 --> 00:09:25.460
the Los Alamos National Laboratory. Wait, the

00:09:25.460 --> 00:09:28.149
atomic bomb place. The very same, specifically

00:09:28.149 --> 00:09:31.250
the Theoretical Biology and Biophysics Group

00:09:31.250 --> 00:09:34.029
there. There was a physicist named Walter Goad,

00:09:34.090 --> 00:09:37.250
who is essentially the father of GenBank. I suppose

00:09:37.250 --> 00:09:39.509
if you have the massive computing power required

00:09:39.509 --> 00:09:41.610
to calculate nuclear blast radiuses, you might

00:09:41.610 --> 00:09:43.789
as well use it for biology, too. That is exactly

00:09:43.789 --> 00:09:46.870
the idea. The math of pattern recognition overlaps

00:09:46.870 --> 00:09:49.960
more than you'd think. So in 1982, with funding

00:09:49.960 --> 00:09:52.379
from the NIH, the National Science Foundation,

00:09:52.700 --> 00:09:54.960
the DOE, and even the Department of Defense,

00:09:55.240 --> 00:09:57.940
they officially launched GenBank. They must have

00:09:57.940 --> 00:10:00.440
been tiny back then. Minuscule. By the end of

00:10:00.440 --> 00:10:03.539
1983, they had stored just over 2,000 sequences.

00:10:03.940 --> 00:10:07.669
2,000. And now we're at 4.7 billion. That is

00:10:07.669 --> 00:10:10.250
quite the growth spurt. It was a very different

00:10:10.250 --> 00:10:12.470
world. In the mid-80s, the project was actually

00:10:12.470 --> 00:10:15.049
managed by a bioinformatics company at Stanford

00:10:15.049 --> 00:10:17.750
called IntelliGenetics. It didn't actually move

00:10:17.750 --> 00:10:20.350
to the NCBI until that transition period between

00:10:20.350 --> 00:10:23.909
1989 and 1992. And I read in the sources that

00:10:23.909 --> 00:10:26.169
this project actually helped kickstart the way

00:10:26.169 --> 00:10:28.570
scientists talk to each other online. Is that

00:10:28.570 --> 00:10:32.230
right? Yes, the Bio-Sci or BioNet newsgroups.

00:10:32.230 --> 00:10:35.230
It was one of the earliest examples of open access

00:10:35.230 --> 00:10:38.429
communication among scientists. Before the modern

00:10:38.429 --> 00:10:41.190
internet as we know it even existed, GenBank

00:10:41.190 --> 00:10:43.909
was already fostering this culture of share what

00:10:43.909 --> 00:10:47.110
you find. It created the template for open science.

00:10:47.470 --> 00:10:50.039
Which brings us to the mechanics of it. How does

00:10:50.039 --> 00:10:52.720
the data actually get into the library? Because

00:10:52.720 --> 00:10:54.580
I think there's a common misconception that there

00:10:54.580 --> 00:10:57.559
is some librarian of life sitting at a desk at

00:10:57.559 --> 00:11:00.480
the NIH scanning books and carefully typing in

00:11:00.480 --> 00:11:02.539
codes. Yeah, that would be physically impossible

00:11:02.539 --> 00:11:05.779
given the volume. GenBank relies entirely on

00:11:05.779 --> 00:11:08.259
direct submissions. So it's user-generated content.

00:11:08.860 --> 00:11:12.100
Like the Wikipedia of DNA? In a sense, yes. If

00:11:12.100 --> 00:11:14.779
you are a researcher in a lab in Brazil or a

00:11:14.779 --> 00:11:17.460
grad student in Tokyo, you use web-based tools

00:11:17.460 --> 00:11:20.679
like BankIt or a program called table2asn. You

00:11:20.679 --> 00:11:22.419
essentially fill out a form, attach your sequence

00:11:22.419 --> 00:11:24.259
data, and send it off. And what about the big

00:11:24.259 --> 00:11:26.480
sequencing centers, the ones churning out terabytes

00:11:26.480 --> 00:11:28.700
of data a day? They have automated pipelines

00:11:28.700 --> 00:11:31.759
doing bulk submissions 24/7. It's a constant,

00:11:31.840 --> 00:11:35.159
unending stream of data. So say I upload my sequence.

00:11:36.059 --> 00:11:38.700
Does it go straight to the public? Or is there

00:11:38.700 --> 00:11:41.299
a bouncer at the door checking IDs? There is

00:11:41.299 --> 00:11:44.139
a vetting process, but it's specific. GenBank

00:11:44.139 --> 00:11:46.580
staff does examine the originality of the data.

00:11:46.759 --> 00:11:49.559
They perform quality assurance checks to make

00:11:49.559 --> 00:11:51.820
sure the file isn't corrupted and that it makes

00:11:51.820 --> 00:11:54.720
basic biological sense. Once it passes that...

00:11:54.940 --> 00:11:56.940
They assign it an accession number. Like a Dewey

00:11:56.940 --> 00:11:58.740
decimal number. Think of it more like a social

00:11:58.740 --> 00:12:01.500
security number for that specific piece of data.

00:12:01.799 --> 00:12:04.200
It's a unique identifier so other scientists

00:12:04.200 --> 00:12:06.759
can find it and cite it in their papers. Then

00:12:06.759 --> 00:12:09.200
it's released to the public database. You can

00:12:09.200 --> 00:12:12.559
download it via FTP or search for it using a

00:12:12.559 --> 00:12:15.000
tool called Entrez. And the cost for all this?

00:12:15.240 --> 00:12:18.139
Free. It is totally open access. Anyone with

00:12:18.139 --> 00:12:19.980
an internet connection can look at the code of

00:12:19.980 --> 00:12:23.139
life. However, and this is a nuance people often

00:12:23.139 --> 00:12:25.299
miss when we talk about this. I saw in the notes

00:12:25.299 --> 00:12:27.879
that open access doesn't necessarily mean free

00:12:27.879 --> 00:12:30.659
of strings. That is absolutely correct. The NCBI

00:12:30.659 --> 00:12:33.139
places no restrictions on the use of the data,

00:12:33.259 --> 00:12:35.120
but that doesn't mean the data is completely

00:12:35.120 --> 00:12:38.100
free of intellectual property claims. What do

00:12:38.100 --> 00:12:40.139
you mean by that? Well, some submitters might

00:12:40.139 --> 00:12:42.980
claim patents or copyrights on the specific data

00:12:42.980 --> 00:12:46.159
they submit. NCBI explicitly says they are not

00:12:46.159 --> 00:12:48.740
in a position to police those claims. So just

00:12:48.740 --> 00:12:50.600
because you found it in the public library doesn't

00:12:50.600 --> 00:12:52.460
mean you can necessarily use it to make a commercial

00:12:52.460 --> 00:12:54.779
product without checking the fine print first.

00:12:55.080 --> 00:12:57.399
That is a really fascinating legal gray area.

00:12:58.000 --> 00:13:00.980
Okay, so we have this massive, open, rapidly

00:13:00.980 --> 00:13:03.139
growing library. But this is where I want to

00:13:03.139 --> 00:13:05.500
get a little critical. This is the deep dive

00:13:05.500 --> 00:13:07.500
after all. Right, you're anticipating the garbage

00:13:07.500 --> 00:13:09.919
in problem. Exactly. If this library is built

00:13:09.919 --> 00:13:13.139
almost entirely on user submissions and users

00:13:13.139 --> 00:13:16.419
are human, well, humans make mistakes. Can we

00:13:16.419 --> 00:13:18.620
actually trust everything in GenBank? The short

00:13:18.620 --> 00:13:20.840
answer is mostly, but definitely not blindly.

00:13:21.550 --> 00:13:23.970
Because GenBank relies on the scientific community

00:13:23.970 --> 00:13:26.870
to submit data, it naturally also inherits the

00:13:26.870 --> 00:13:28.809
scientific community's errors. Let's look at

00:13:28.809 --> 00:13:30.350
some examples from the sources, because some

00:13:30.350 --> 00:13:32.549
of these are wild. I saw a story about a fish

00:13:32.549 --> 00:13:34.970
that was having a serious identity crisis. Ah,

00:13:35.110 --> 00:13:38.629
yes. The infamous case of Nemipterus mesoprion.

00:13:38.769 --> 00:13:41.850
Which is what kind of fish? It's a type of threadfin

00:13:41.850 --> 00:13:44.830
bream, a pretty commercially important fish in

00:13:44.830 --> 00:13:47.269
the Indo -Pacific region. Okay, a threadfin bream.

00:13:47.330 --> 00:13:50.149
What exactly happened to it? A recent study looked

00:13:50.149 --> 00:13:52.529
at the mitochondrial sequences for this fish

00:13:52.529 --> 00:13:55.809
in GenBank, specifically the cytochrome c

00:13:55.809 --> 00:13:58.990
oxidase subunit I sequences used to identify

00:13:58.990 --> 00:14:02.429
the species. They found that 75% of the sequences

00:14:02.429 --> 00:14:06.710
assigned to this species were wrong. 75%. That's

00:14:06.710 --> 00:14:09.529
not a margin of error. That's just wrong. It's

00:14:09.529 --> 00:14:12.590
a comprehensively failed test. It is a massive

00:14:12.590 --> 00:14:14.730
error rate. And the reason why it happened is

00:14:14.730 --> 00:14:17.629
the real insight here. It highlights how the

00:14:17.629 --> 00:14:20.220
system creates a feedback loop. Walk us through

00:14:20.220 --> 00:14:22.480
that. How does a fish get that wrong in a processional

00:14:22.480 --> 00:14:25.259
database? Imagine Researcher A catches a fish.

00:14:25.440 --> 00:14:28.360
They misidentify it as a threadfin bream, sequence

00:14:28.360 --> 00:14:30.779
it, and upload that sequence to GenBank. Now,

00:14:30.820 --> 00:14:32.960
Researcher B catches a similar -looking fish.

00:14:33.080 --> 00:14:35.059
They aren't totally sure what it is, so they

00:14:35.059 --> 00:14:37.620
blast the DNA against GenBank to check. And they

00:14:37.620 --> 00:14:39.799
get a match from Researcher A's upload. Exactly.

00:14:39.860 --> 00:14:42.539
They see the match and say, aha, this must be

00:14:42.539 --> 00:14:44.820
Nemipterus mesoprion because the database says

00:14:44.820 --> 00:14:47.379
so. Then they upload their data confirming it.

00:14:47.519 --> 00:14:49.399
So they are matching their catch against a lie.

00:14:49.840 --> 00:14:53.289
Precisely. It creates a chain of errors, citation

00:14:53.289 --> 00:14:55.610
laundering essentially, that becomes very hard

00:14:55.610 --> 00:14:57.490
to break because everyone is citing the same

00:14:57.490 --> 00:15:00.309
bad data. You end up with a mountain of evidence

00:15:00.309 --> 00:15:02.789
that is essentially built on sand. That is wild.

00:15:03.090 --> 00:15:05.309
It's like copying off the smart kid's homework,

00:15:05.549 --> 00:15:08.309
but the smart kid actually failed the test, and

00:15:08.309 --> 00:15:10.690
now the entire class is failing. And it's not

00:15:10.690 --> 00:15:12.889
just fish. There was a manuscript looking at

00:15:12.889 --> 00:15:16.210
birds, specifically cytochrome b records, that

00:15:16.210 --> 00:15:19.470
showed 45% of the erroneous records lacked a

00:15:19.470 --> 00:15:22.169
voucher specimen. A voucher specimen. That sounds...

00:15:22.190 --> 00:15:24.830
like a coupon? In biology, a voucher specimen

00:15:24.830 --> 00:15:27.769
is the physical backup. It's the actual preserved

00:15:27.769 --> 00:15:31.490
bird or fish or plant sitting in a museum drawer

00:15:31.490 --> 00:15:33.909
somewhere. It's the physical receipt. So if the

00:15:33.909 --> 00:15:36.210
data looks weird online, you can physically go

00:15:36.210 --> 00:15:38.830
to the drawer and look at the actual bird? Ideally,

00:15:38.870 --> 00:15:41.289
yes. But if you don't have the physical bird

00:15:41.289 --> 00:15:44.690
and the digital record is weird, you have absolutely

00:15:44.690 --> 00:15:47.470
no way to prove if it is a brilliant new discovery

00:15:47.470 --> 00:15:49.929
or just a careless mistake. So those records

00:15:49.929 --> 00:15:52.200
are just ghosts in the machine. In a way, yeah.

00:15:52.360 --> 00:15:54.820
They are unverified beta points floating in the

00:15:54.820 --> 00:15:57.539
system forever. And then you have the anonymous

00:15:57.539 --> 00:16:00.019
problem. This is the turtle subject X issue we

00:16:00.019 --> 00:16:03.500
read about. Right. Often, researchers will submit

00:16:03.500 --> 00:16:06.100
a sequence before they have formally named the

00:16:06.100 --> 00:16:09.299
species. They might call it something like Pelomedusa

00:16:09.299 --> 00:16:15.120
sp. A CK-2014. Catchy name. Rolls right off

00:16:15.120 --> 00:16:17.529
the tongue. Very. But they do this to get the

00:16:17.529 --> 00:16:20.370
data out there quickly, or to support a draft

00:16:20.370 --> 00:16:22.590
paper they're writing. The problem is, three

00:16:22.590 --> 00:16:25.090
years later, they publish the paper, they officially

00:16:25.090 --> 00:16:28.570
name the turtle... Pelomedusa variabilis, but...

00:16:28.570 --> 00:16:30.690
Don't tell me. They often just forget to go back

00:16:30.690 --> 00:16:33.009
to GenBank and update the original record. So

00:16:33.009 --> 00:16:35.570
the library is full of books with temporary titles

00:16:35.570 --> 00:16:37.769
that never get changed to the real ones. Exactly.

00:16:37.769 --> 00:16:40.090
It causes ongoing confusion because you end up

00:16:40.090 --> 00:16:42.129
with duplicate entries under different names.

00:16:42.330 --> 00:16:44.350
A researcher might think they've found a brand

00:16:44.350 --> 00:16:46.370
new species, but they are actually just looking

00:16:46.370 --> 00:16:48.990
at an unupdated clerical error from five years

00:16:48.990 --> 00:16:52.070
ago. Now, for a biologist studying turtle evolution,

00:16:52.330 --> 00:16:55.019
I get that that's annoying. But does this have

00:16:55.019 --> 00:16:57.779
real-world stakes? Like, what if I'm a doctor?

00:16:58.000 --> 00:17:00.159
Let's say I have a patient with a weird infection,

00:17:00.299 --> 00:17:03.740
and I sequence the bacteria and check GenBank

00:17:03.740 --> 00:17:07.339
to identify it. Could this be dangerous? It could

00:17:07.339 --> 00:17:10.740
be if you rely only on GenBank. That is the critical

00:17:10.740 --> 00:17:13.440
warning here. For clinical identification, like

00:17:13.440 --> 00:17:15.960
blood cultures, where a patient's life might

00:17:15.960 --> 00:17:18.859
be on the line, experts highly recommend not

00:17:18.859 --> 00:17:21.380
putting all your eggs in the GenBank basket.

00:17:21.640 --> 00:17:23.670
What should they do instead? You combine it with

00:17:23.670 --> 00:17:26.509
other more heavily curated databases, things

00:17:26.509 --> 00:17:29.329
like EzTaxon or BIBI. Like boutique libraries.

00:17:29.670 --> 00:17:32.130
Exactly. These are smaller but much stricter.

00:17:32.250 --> 00:17:34.029
Someone has actually checked the books thoroughly.

00:17:34.549 --> 00:17:36.930
GenBank is the giant warehouse. It has everything,

00:17:37.009 --> 00:17:39.069
but it's messy. These others are the carefully

00:17:39.069 --> 00:17:41.170
curated collections. That makes a lot of sense.

00:17:41.369 --> 00:17:43.849
It seems like the tradeoff for having the biggest,

00:17:44.049 --> 00:17:46.650
most comprehensive library in the world is that

00:17:46.650 --> 00:17:48.250
you're just inevitably going to have some graffiti

00:17:48.250 --> 00:17:50.470
in the margins. That is a very fair way to put

00:17:50.470 --> 00:17:53.750
it. You need the warehouse for discovery, for

00:17:53.750 --> 00:17:56.049
the big picture, for the exponential growth.

00:17:56.289 --> 00:17:59.089
But you definitely need the boutique collections

00:17:59.089 --> 00:18:02.390
for precision. So let's wrap this up. We have

00:18:02.390 --> 00:18:07.190
GenBank. It is a 34-trillion-base-pair behemoth.

00:18:07.690 --> 00:18:10.670
It puts wheat and COVID above human beings in

00:18:10.670 --> 00:18:13.390
terms of raw data volume. It was born in the

00:18:13.390 --> 00:18:16.009
Cold War and now it lives in the cloud, syncing

00:18:16.009 --> 00:18:18.250
globally every day. And it is still doubling

00:18:18.250 --> 00:18:21.049
every 18 months. It truly is the backbone of

00:18:21.049 --> 00:18:23.569
modern biology. But it's also a stark reminder

00:18:23.569 --> 00:18:25.869
that data isn't always the exact same thing as

00:18:25.869 --> 00:18:28.849
truth. That is the key takeaway for me. GenBank

00:18:28.849 --> 00:18:31.569
is an absolutely essential tool for open science.

00:18:31.730 --> 00:18:33.930
We couldn't do modern medicine without it. But

00:18:33.930 --> 00:18:36.349
we have to remember, it is a record of submissions.

00:18:36.630 --> 00:18:38.549
It reflects what scientists think they found

00:18:38.549 --> 00:18:41.170
at that exact moment in time. It is a living,

00:18:41.349 --> 00:18:43.470
breathing, and occasionally mistaken history

00:18:43.470 --> 00:18:45.930
of our understanding of life. Absolutely. And,

00:18:45.950 --> 00:18:47.710
you know, it raises a pretty provocative question

00:18:47.710 --> 00:18:50.509
for anyone listening. Well, we are currently

00:18:50.509 --> 00:18:53.250
building the future of biology. We are training

00:18:53.250 --> 00:18:56.049
complex AI models on this data. We are designing

00:18:56.049 --> 00:18:58.250
synthetic life. We are hunting for new drugs.

00:18:59.180 --> 00:19:01.640
But if this database contains millions of anonymous

00:19:01.640 --> 00:19:04.319
sequences or mislabeled records that are rarely

00:19:04.319 --> 00:19:06.740
updated or purged, are we building our future

00:19:06.740 --> 00:19:10.299
on a cracked foundation? Exactly. Are we inadvertently

00:19:10.299 --> 00:19:13.279
baking fiction right into our biological reality?

00:19:13.700 --> 00:19:16.460
And as this library grows faster than any human

00:19:16.460 --> 00:19:18.819
can possibly read, doubling every year and a

00:19:18.819 --> 00:19:21.720
half, how do we ever hope to go back and clean

00:19:21.720 --> 00:19:23.980
it up? We might just be creating a web of knowledge

00:19:23.980 --> 00:19:28.019
so complex and interlinked with tiny errors that

00:19:28.019 --> 00:19:30.660
we can literally never fully untangle it. It's

00:19:30.660 --> 00:19:33.299
a very distinct possibility. Well, on that slightly

00:19:33.299 --> 00:19:35.440
existential note, I'm going to go look at a sandwich

00:19:35.440 --> 00:19:37.599
with a lot more respect from now on. All hail

00:19:37.599 --> 00:19:39.980
the wheat genome. Number one on the leaderboard,

00:19:40.119 --> 00:19:42.819
number one in our hearts. Thanks for joining

00:19:42.819 --> 00:19:45.000
us on this deep dive into the library of life.

00:19:45.160 --> 00:19:46.200
We'll catch you in the next one.