WEBVTT

00:00:00.000 --> 00:00:04.019
So picture this. It's 2023, and a reporter sits

00:00:04.019 --> 00:00:06.660
down, speaks into a microphone for, I don't know,

00:00:06.679 --> 00:00:08.580
maybe just five minutes. Barely any time at all,

00:00:08.599 --> 00:00:10.880
really. Right. Just five minutes. And then he

00:00:10.880 --> 00:00:13.099
uses a software program to essentially clone

00:00:13.099 --> 00:00:16.000
his own voice. Which is already wild. Exactly.

00:00:16.460 --> 00:00:19.519
But then he takes that synthetic clone, calls

00:00:19.519 --> 00:00:22.219
his bank's automated security line, and plays

00:00:22.219 --> 00:00:25.339
the fake voice speaking his secure passphrase

00:00:25.339 --> 00:00:28.370
into the bank's voice authentication system. And it

00:00:28.370 --> 00:00:30.629
immediately unlocks his account. Just completely

00:00:30.629 --> 00:00:32.929
fooled it. It bypassed a security standard that,

00:00:32.929 --> 00:00:34.670
you know, financial institutions have literally

00:00:34.670 --> 00:00:36.689
spent millions of dollars and years developing.

00:00:36.909 --> 00:00:39.429
Millions. Over five minutes of audio and a laptop.

00:00:39.590 --> 00:00:41.770
Yeah. And the truly terrifying part is that the

00:00:41.770 --> 00:00:44.159
five-minute benchmark is already outdated.

00:00:44.600 --> 00:00:47.399
Wow. Well, welcome to today's Deep Dive. For

00:00:47.399 --> 00:00:49.640
those of you listening, we are exploring the

00:00:49.640 --> 00:00:52.539
fascinating, occasionally creepy, and frankly,

00:00:52.960 --> 00:00:55.420
rapidly accelerating world of speech synthesis.

00:00:55.600 --> 00:00:58.000
It's a massive topic. It really is. We are tracing

00:00:58.000 --> 00:01:00.240
this journey from the absolute beginning, meaning

00:01:00.240 --> 00:01:02.600
18th century mechanical talking machines built

00:01:02.600 --> 00:01:05.540
with, like, wood and leather. Which is hard to

00:01:05.540 --> 00:01:07.700
even imagine now. Right. Straight through to

00:01:07.700 --> 00:01:10.540
the modern era of deep learning, where your own

00:01:10.540 --> 00:01:13.239
voice can be flawlessly cloned from just a

00:01:13.239 --> 00:01:15.739
15-second clip. 15 seconds. That's all it takes

00:01:15.739 --> 00:01:17.920
now. And for everyone listening, you interact

00:01:17.920 --> 00:01:20.540
with synthetic speech every single day. I mean,

00:01:20.579 --> 00:01:22.780
it gives you directions on your GPS. Reads the

00:01:22.780 --> 00:01:24.739
weather on your smart assistant. Exactly. And

00:01:24.739 --> 00:01:27.560
it provides the voiceovers on all those TikToks

00:01:27.560 --> 00:01:29.659
you scroll through. So understanding the mechanics

00:01:29.659 --> 00:01:32.359
behind this technology really equips you to navigate

00:01:32.359 --> 00:01:35.780
an era where audio deepfakes are just everywhere.

00:01:35.920 --> 00:01:37.760
You really need to understand the magic behind

00:01:37.760 --> 00:01:40.540
the curtain. OK, let's unpack this. Imagine for

00:01:40.540 --> 00:01:42.840
a second stepping into an antique laboratory,

00:01:43.760 --> 00:01:46.159
like visually picture the space around us shifting

00:01:46.159 --> 00:01:50.120
to this old room filled with dusty blueprints,

00:01:50.500 --> 00:01:53.180
schematics, and just bizarre contraptions on

00:01:53.180 --> 00:01:55.739
the tables. I can picture it. Because honestly,

00:01:56.040 --> 00:01:58.620
before we even touch computer code or silicon,

00:01:59.019 --> 00:02:01.680
we have to understand that humanity's obsession

00:02:01.680 --> 00:02:04.840
with artificial voices started with physical

00:02:04.840 --> 00:02:08.319
things. Bellows and wood. Starting with those

00:02:08.319 --> 00:02:10.860
physical objects is so key. It really is. It's

00:02:10.860 --> 00:02:12.960
crucial to understanding the sheer mechanical

00:02:12.960 --> 00:02:16.500
complexity of human speech. You know, before

00:02:16.500 --> 00:02:19.099
software could emulate a voice, inventors tried

00:02:19.099 --> 00:02:22.620
to literally recreate human anatomy. Like building

00:02:22.620 --> 00:02:26.080
a physical throat. Exactly. If you go back to

00:02:26.080 --> 00:02:29.219
the Middle Ages, you find these legends of brazen

00:02:29.219 --> 00:02:31.719
heads. Oh, right, the magical brass head. Yeah,

00:02:31.860 --> 00:02:33.800
supposedly owned by scholars like Pope Sylvester

00:02:33.800 --> 00:02:37.319
II or Roger Bacon. The myth was that they could

00:02:37.319 --> 00:02:41.599
mysteriously answer questions. But the true documented

00:02:41.599 --> 00:02:44.270
scientific attempts... Those really kicked off

00:02:44.270 --> 00:02:47.710
in the late 1700s. Right around 1779, I think.

00:02:47.909 --> 00:02:51.050
Yes. A scientist named Christian Gottlieb Kratzenstein.

00:02:51.509 --> 00:02:53.729
He built these physical models of the human vocal

00:02:53.729 --> 00:02:55.370
tract. He actually won a prize from the Russian

00:02:55.370 --> 00:02:57.430
Imperial Academy for it. Which is a huge deal

00:02:57.430 --> 00:02:59.110
for the time. But if I remember right, those

00:02:59.110 --> 00:03:02.430
models could only produce five long vowel sounds.

00:03:03.050 --> 00:03:05.310
Just vowels. Nothing else. Why was he stuck there?

00:03:05.599 --> 00:03:07.699
Well, vowels are essentially just uninterrupted

00:03:07.699 --> 00:03:09.400
airflow that's shaped by how open your mouth

00:03:09.400 --> 00:03:11.379
is. OK. So they are much easier to replicate.

00:03:11.979 --> 00:03:14.539
Consonants, on the other hand, they require precise

00:03:14.539 --> 00:03:17.120
temporary blockages of that air. Like using your

00:03:17.120 --> 00:03:19.099
tongue or your teeth. Right, using the lips,

00:03:19.280 --> 00:03:21.740
the tongue, the teeth. So to get beyond vowels,

00:03:21.840 --> 00:03:25.400
you need a much, much more complex machine. And

00:03:25.400 --> 00:03:29.680
then... In 1791, Wolfgang von Kempelen published

00:03:29.680 --> 00:03:32.780
a paper describing an acoustic mechanical speech

00:03:32.780 --> 00:03:34.860
machine. And he wasn't just using hollow tubes

00:03:34.860 --> 00:03:37.199
anymore, right? No. He incorporated an actual

00:03:37.199 --> 00:03:40.400
bellows to act as the lungs. And then he added

00:03:40.400 --> 00:03:43.039
physical, manipulatable models of the tongue

00:03:43.039 --> 00:03:45.620
and lips. So he could literally manipulate the

00:03:45.620 --> 00:03:48.159
airflow in real time to squeeze out consonants.

00:03:48.180 --> 00:03:51.419
Exactly. And then decades later, jumping to 1846,

00:03:51.919 --> 00:03:54.639
Joseph Faber builds the Euphonia. Oh, the Euphonia

00:03:54.639 --> 00:03:57.560
was incredible. It was this massive

00:03:57.560 --> 00:04:00.180
keyboard-operated talking automaton based on those exact

00:04:00.180 --> 00:04:02.199
same concepts. When you picture these things,

00:04:02.419 --> 00:04:05.319
it's like a macabre mechanical bagpipe trying

00:04:05.319 --> 00:04:07.800
to sing human words. That is the perfect way

00:04:07.800 --> 00:04:09.680
to describe it. Right. You are literally pumping

00:04:09.680 --> 00:04:12.400
air and pulling levers to physically wrestle

00:04:12.400 --> 00:04:14.639
a recognizable syllable out of a wooden box.

00:04:14.919 --> 00:04:18.519
It sounds exhausting. It does. But what's fascinating

00:04:18.519 --> 00:04:21.420
here is that while those bellows machines seem

00:04:21.420 --> 00:04:24.939
like archaic steampunk toys, they actually laid

00:04:24.939 --> 00:04:28.060
the conceptual groundwork for a highly advanced

00:04:28.060 --> 00:04:31.379
modern computational technique, something called

00:04:31.379 --> 00:04:34.259
articulatory synthesis. Wait, really? We still

00:04:34.259 --> 00:04:37.240
use the mechanical bagpipe concept today. Conceptually,

00:04:37.420 --> 00:04:40.740
yes, we absolutely do. Articulatory synthesis

00:04:40.740 --> 00:04:43.860
doesn't rely on playing back recorded audio clips

00:04:43.860 --> 00:04:47.120
like you'd think. Huh. Instead, it uses complex

00:04:47.120 --> 00:04:50.439
mathematics to model the actual physics and aerodynamics

00:04:50.439 --> 00:04:53.399
of the human vocal tract. So it's basically a

00:04:53.399 --> 00:04:56.899
digital euphonia. Exactly. It mathematically

00:04:56.899 --> 00:04:59.040
simulates the airflow moving through the trachea,

00:04:59.379 --> 00:05:01.379
the vibration of the vocal folds, and the acoustic

00:05:01.379 --> 00:05:04.259
resonance inside the nasal cavity. That is wild.
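
NOTE
To make "simulating the physics" concrete, here is a toy Python sketch in
the Kelly-Lochbaum spirit: the tract modeled as a chain of tube sections
whose areas set the reflections. Every number here (the areas, the
reflection losses, the pitch) is invented for illustration; no real engine
works from this exact code.
import numpy as np
areas = np.array([2.6, 2.0, 1.2, 0.9, 1.4, 2.8, 3.5])  # made-up cm^2 profile
M = len(areas)
k = (areas[:-1] - areas[1:]) / (areas[:-1] + areas[1:])  # junction reflections
fs, f0 = 16000, 110
n = fs // 2                                   # half a second of audio
source = np.zeros(n)
source[:: fs // f0] = 1.0                     # crude glottal impulse train
fwd, bwd = np.zeros(M), np.zeros(M)           # waves toward lips / glottis
out = np.zeros(n)
for t in range(n):
    new_fwd, new_bwd = np.empty(M), np.empty(M)
    new_fwd[0] = source[t] + 0.97 * bwd[0]    # partial reflection at glottis
    out[t] = 0.1 * fwd[-1]                    # sound radiated at the lips
    new_bwd[-1] = -0.9 * fwd[-1]              # partial reflection at the lips
    for i in range(M - 1):                    # scattering at each junction
        new_fwd[i + 1] = (1 + k[i]) * fwd[i] - k[i] * bwd[i + 1]
        new_bwd[i] = k[i] * fwd[i] + (1 - k[i]) * bwd[i + 1]
    fwd, bwd = new_fwd, new_bwd
Change the area profile and the resonances move, exactly as reshaping a
physical tube would.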

00:05:04.660 --> 00:05:06.699
Those 18th century inventors were trying to replicate

00:05:06.699 --> 00:05:09.100
that physics in wood and rubber. And today's

00:05:09.100 --> 00:05:10.939
researchers, they do the exact same thing, just

00:05:10.939 --> 00:05:13.720
using fluid dynamics and code. So for you listening,

00:05:13.800 --> 00:05:15.500
you might be wondering why we're spending time

00:05:15.500 --> 00:05:18.220
on wooden pianos with rubber tongues when your

00:05:18.220 --> 00:05:21.100
smart speaker is, you know, purely digital. It's

00:05:21.100 --> 00:05:23.639
because the math powering aspects of this technology

00:05:23.639 --> 00:05:26.959
today is just a direct digital translation of

00:05:26.959 --> 00:05:29.399
that exact physical anatomy. It all connects

00:05:29.399 --> 00:05:31.220
back to the physical body. But obviously you

00:05:31.220 --> 00:05:33.959
can't have a massive fluid dynamic simulator

00:05:33.959 --> 00:05:37.100
running inside every smartphone or GPS. To make

00:05:37.100 --> 00:05:39.399
machines converse at speed, we had to translate

00:05:39.399 --> 00:05:42.180
the physical properties of speech into electrical

00:05:42.160 --> 00:05:47.000
signals. Yeah. So if we mentally fast forward,

00:05:47.420 --> 00:05:49.500
you know, step out of that antique lab we pictured

00:05:49.500 --> 00:05:52.620
earlier and step into a 1930s engineering room,

00:05:52.860 --> 00:05:55.420
we see the foundation of that electronic shift.

00:05:55.579 --> 00:05:59.079
Bell Labs developed the vocoder in the 1930s.

00:05:59.240 --> 00:06:01.720
The famous vocoder. Right. And instead of modeling

00:06:01.720 --> 00:06:04.560
physical anatomy, the vocoder analyzed human

00:06:04.560 --> 00:06:06.860
speech and broke it down into fundamental tones

00:06:06.860 --> 00:06:08.800
and frequency bands. It essentially stripped

00:06:08.800 --> 00:06:11.579
the human voice down to its raw acoustic parameters.
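
NOTE
A toy sketch of that band-splitting analysis in Python (scipy assumed
available). The three band edges and the two-sine stand-in signal are
invented; the real vocoder ran on the order of ten channels on live speech.
import numpy as np
from scipy.signal import butter, sosfilt
fs = 8000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 800 * t)
envelopes = []
for lo, hi in [(100, 300), (300, 900), (900, 2400)]:   # invented bands
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    band = sosfilt(sos, speech)                 # isolate one frequency band
    w = int(0.02 * fs)                          # ~20 ms smoothing window
    envelopes.append(np.convolve(np.abs(band), np.ones(w) / w, mode="same"))
The slowly varying envelopes are the "raw acoustic parameters": a compact
description a synthesizer can re-impose on a buzz source to rebuild speech.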

00:06:11.720 --> 00:06:14.259
Wow. And that research directly led to the Voder,

00:06:14.360 --> 00:06:17.620
which was this massive keyboard-operated electronic

00:06:17.620 --> 00:06:20.120
voice synthesizer that they actually showcased

00:06:20.120 --> 00:06:23.230
at the 1939 World's Fair. But when do we actually

00:06:23.230 --> 00:06:25.310
get computers involved? Like, not just keyboards

00:06:25.310 --> 00:06:27.889
operated by humans, but a machine generating

00:06:27.889 --> 00:06:30.910
these voices autonomously? That major milestone

00:06:30.910 --> 00:06:35.290
hit a bit later, in 1961. Researchers John Larry

00:06:35.290 --> 00:06:38.490
Kelly Jr. and Louis Gerstman, they used an IBM

00:06:38.490 --> 00:06:41.610
704 computer to synthesize speech. And the IBM

00:06:41.610 --> 00:06:45.290
704, just for context, was a massive, room-sized

00:06:45.550 --> 00:06:47.910
mainframe computer. Oh, absolutely giant. And

00:06:47.910 --> 00:06:50.439
they programmed it to generate the specific

00:06:50.439 --> 00:06:52.980
acoustic parameters needed to replicate human

00:06:52.980 --> 00:06:55.680
phones. And to prove what the system could do,

00:06:55.959 --> 00:06:58.160
they made the computer sing. The song Daisy Bell,

00:06:58.319 --> 00:07:00.800
right? Yes. Complete with musical accompaniment

00:07:00.800 --> 00:07:04.019
that was programmed by another pioneer, Max Mathews.

00:07:04.220 --> 00:07:06.160
And this leads to, honestly, one of the greatest

00:07:06.160 --> 00:07:08.779
pop culture connections in tech history, because

00:07:08.779 --> 00:07:11.120
the science fiction author Arthur C. Clarke just

00:07:11.120 --> 00:07:12.819
happened to be visiting Bell Labs at the time.

00:07:12.899 --> 00:07:14.600
Talk about being in the right place at the right

00:07:14.600 --> 00:07:17.079
time. Seriously. He saw this demonstration. He

00:07:17.079 --> 00:07:19.980
stood there and watched a room-sized IBM mainframe

00:07:19.980 --> 00:07:23.459
sing Daisy Bell. And the eerie, robotic quality

00:07:23.459 --> 00:07:25.660
of it just completely blew him away. It made

00:07:25.660 --> 00:07:28.439
quite an impression. It did. He rode that exact

00:07:28.439 --> 00:07:31.639
moment into the climactic scene of 2001: A Space

00:07:31.639 --> 00:07:35.209
Odyssey. Right. When the HAL 9000 computer is

00:07:35.209 --> 00:07:38.730
being deactivated, losing its mind, it regresses

00:07:38.730 --> 00:07:41.730
to its earliest programming and sings Daisy Bell.

00:07:41.910 --> 00:07:44.250
It's so chilling in the movie. And that cinematic

00:07:44.250 --> 00:07:47.389
masterpiece was born directly from a real-world

00:07:47.389 --> 00:07:49.970
speech synthesis demo. It's a brilliant example

00:07:49.970 --> 00:07:52.790
of life inspiring art. Now, to get from those

00:07:52.790 --> 00:07:55.129
room-sized mainframes down to the everyday devices

00:07:55.129 --> 00:07:58.230
we use, the math obviously had to get vastly

00:07:58.230 --> 00:08:00.290
more efficient. Right. You can't carry an IBM

00:08:00.290 --> 00:08:02.300
mainframe in your pocket. No, you can't. So in

00:08:02.300 --> 00:08:06.800
the 1970s, we see the rise of LPC or linear predictive

00:08:06.800 --> 00:08:09.879
coding. Let's visualize our lab background changing

00:08:09.879 --> 00:08:12.459
again, maybe to an early 1980s computer lab.

00:08:12.600 --> 00:08:15.060
Okay, early 80s, I like it. So how does LPC actually

00:08:15.060 --> 00:08:17.319
shrink the technology down? Well, instead of

00:08:17.319 --> 00:08:19.579
storing every single microscopic sound wave of

00:08:19.579 --> 00:08:21.120
a voice. Which would take up enormous amounts

00:08:21.120 --> 00:08:23.610
of memory, right? Exactly, just gigabytes of

00:08:23.610 --> 00:08:26.629
data. So instead, LPC assumes that the human

00:08:26.629 --> 00:08:29.129
vocal tract is basically just an acoustic filter.

00:08:29.550 --> 00:08:31.829
It stores a generic mathematical model of that

00:08:31.829 --> 00:08:33.889
filter and then just sends tiny little parameters

00:08:33.889 --> 00:08:36.740
to adjust it on the fly. Oh, so it's predicting.

00:08:37.080 --> 00:08:41.080
Yes. It guesses or predicts the next sound wave

00:08:41.080 --> 00:08:43.860
based on the previous one. This compression is

00:08:43.860 --> 00:08:46.080
what allowed speech data to finally fit onto

00:08:46.080 --> 00:08:48.980
tiny, inexpensive microchips. Which gives us

00:08:48.980 --> 00:08:52.519
the iconic 1978 Speak and Spell toy by Texas

00:08:52.519 --> 00:08:55.419
Instruments. Exactly. And early 1980s arcade

00:08:55.419 --> 00:08:58.779
games like Stratovox. So we went from a

00:08:58.779 --> 00:09:02.059
multimillion-dollar IBM computer to a cheap handheld plastic

00:09:02.059 --> 00:09:05.419
toy in under 20 years. The acceleration is incredible.
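
NOTE
A minimal sketch of the prediction idea in Python, assuming an order-8
predictor and a made-up harmonic test signal; real LPC codecs add framing,
quantization, and excitation coding on top of this.
import numpy as np
fs = 8000
t = np.arange(fs) / fs
sig = sum(np.sin(2 * np.pi * 120 * (h + 1) * t) / (h + 1) for h in range(4))
p = 8   # each sample is guessed from the previous 8
X = np.column_stack([sig[p - i - 1 : len(sig) - i - 1] for i in range(p)])
y = sig[p:]
a, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit the filter coefficients
residual = y - X @ a                        # what prediction failed to guess
print("signal power:  ", np.mean(y ** 2))
print("residual power:", np.mean(residual ** 2))   # vastly smaller
Because the residual carries so little energy, you can store eight
coefficients plus a cheap excitation description instead of every raw sample.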

00:09:05.580 --> 00:09:07.659
It is, but looking at this timeline, there is

00:09:07.659 --> 00:09:09.940
a pretty glaring detail in the source material.

00:09:10.529 --> 00:09:13.049
Synthesized voices were almost exclusively male

00:09:13.049 --> 00:09:15.509
for decades. Yes, they were. It wasn't until

00:09:15.509 --> 00:09:18.509
1990 that a researcher named Ann Syrdal created

00:09:18.509 --> 00:09:21.570
a female voice at AT&T Bell Laboratories. So

00:09:21.570 --> 00:09:23.690
let me push back on this a bit. Why on earth

00:09:23.690 --> 00:09:26.330
did it take until 1990? It's a very fair question.

00:09:26.629 --> 00:09:29.149
Was that like an actual technological hurdle

00:09:29.149 --> 00:09:32.610
or just a massive blind spot in a male-dominated

00:09:32.610 --> 00:09:34.409
engineering field? Well, it was definitely a

00:09:34.409 --> 00:09:36.049
massive blind spot. Let's be clear about that.

00:09:36.320 --> 00:09:38.659
But there was a genuine acoustic challenge driving

00:09:38.659 --> 00:09:42.039
it too. Really? Like what? Acoustically, female

00:09:42.039 --> 00:09:44.559
voices typically have a higher fundamental frequency.

00:09:45.419 --> 00:09:48.379
The vocal folds vibrate faster, meaning the sound

00:09:48.379 --> 00:09:50.799
waves are physically closer together. Okay, that

00:09:50.799 --> 00:09:53.500
makes sense. In the early days of LPC and low

00:09:53.500 --> 00:09:56.100
sample rate audio, trying to capture the finer

00:09:56.100 --> 00:09:59.200
nuances of those rapid high frequencies often

00:09:59.200 --> 00:10:02.240
resulted in horrible distortion. It was technically

00:10:02.240 --> 00:10:04.759
very difficult to pull off cleanly. So it just

00:10:04.759 --> 00:10:08.100
sounded bad on early microchips. Exactly. But,

00:10:08.379 --> 00:10:09.919
you know, you also have to consider the data

00:10:09.919 --> 00:10:12.259
they were using. When teams of predominantly

00:10:12.259 --> 00:10:14.820
male engineers record their own voices to test

00:10:14.820 --> 00:10:17.200
their systems, the software just learns men's

00:10:17.200 --> 00:10:19.519
voices. Right. The software's parameters naturally

00:10:19.519 --> 00:10:22.019
become optimized for male vocal characteristics.

00:10:22.720 --> 00:10:25.299
So Syrdal's work in 1990 was revolutionary because

00:10:25.299 --> 00:10:27.659
she actually engineered solutions for those acoustic

00:10:27.659 --> 00:10:30.120
hurdles, finally closing that representational

00:10:30.120 --> 00:10:33.860
gap. That is so important. So by the 90s, we

00:10:33.860 --> 00:10:36.799
have the hardware, the microchips, and the coding

00:10:36.799 --> 00:10:40.080
methods to produce various voices. But as computers

00:10:40.080 --> 00:10:42.700
evolved from just playing arcade games to, you

00:10:42.700 --> 00:10:45.000
know, reading actual documents and websites,

00:10:45.720 --> 00:10:48.159
developers hit a completely new wall. A very

00:10:48.159 --> 00:10:50.240
big wall. Because if you feed a paragraph of

00:10:50.240 --> 00:10:52.419
text into a computer, it doesn't naturally know

00:10:52.419 --> 00:10:54.740
how to read it. Before a machine can speak, it

00:10:54.740 --> 00:10:57.039
kind of has to become a linguist. That's spot

00:10:57.039 --> 00:10:59.429
on. That transition from making simple noises

00:10:59.429 --> 00:11:02.570
to reading full documents introduces the architecture

00:11:02.570 --> 00:11:05.710
of modern text-to-speech, or TTS, systems. OK.

00:11:05.990 --> 00:11:08.330
A TTS engine is actually split into two halves.

00:11:08.529 --> 00:11:11.529
The front end and the back end. And the front

00:11:11.529 --> 00:11:14.490
end? It never makes a single sound. Its entire

00:11:14.490 --> 00:11:16.889
job is just deciphering the text. Here's where

00:11:16.889 --> 00:11:19.620
it gets really interesting. The front end of

00:11:19.620 --> 00:11:22.600
a TTS system is essentially acting like a highly

00:11:22.600 --> 00:11:25.600
stressed script supervisor on a movie set. That's

00:11:25.600 --> 00:11:27.799
a great way to think about it. Imagine the supervisor

00:11:27.799 --> 00:11:30.740
desperately reading ahead of the actors, frantically

00:11:30.740 --> 00:11:32.779
crossing things out and rewriting the script

00:11:32.779 --> 00:11:34.840
phonetically so the actors don't sound foolish

00:11:34.840 --> 00:11:37.120
when they read it out loud. That is exactly what

00:11:37.120 --> 00:11:40.000
it's doing. The first major task is called text

00:11:40.000 --> 00:11:42.570
normalization. This means converting symbols,

00:11:42.769 --> 00:11:45.269
numbers, and abbreviations into written words.

00:11:45.570 --> 00:11:47.789
Like translating math into English. Right. Take

00:11:47.789 --> 00:11:51.809
the number 1325. A human knows how to read that

00:11:51.809 --> 00:11:55.029
naturally based on context. A computer has literally

00:11:55.029 --> 00:11:58.049
no idea. It's just digits to a computer. Exactly.

00:11:58.370 --> 00:12:02.889
If the text says the year 1325, the front end

00:12:02.889 --> 00:12:06.049
has to normalize it to the words thirteen twenty-five. Right.

00:12:06.269 --> 00:12:08.730
But if it's an address, it might still be thirteen twenty-five.

00:12:08.929 --> 00:12:12.350
If it's a passcode, it needs to be read as one, three, two, five.
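
NOTE
A toy Python normalizer covering just this example. The context labels and
word tables are invented, and a real front end infers the context itself
instead of being handed it.
SMALL = ("zero one two three four five six seven eight nine ten eleven "
         "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
         "nineteen").split()
TENS = {2: "twenty", 3: "thirty"}   # just enough vocabulary for 1325
def two_digits(n):
    if n < 20:
        return SMALL[n]
    return TENS[n // 10] + ("-" + SMALL[n % 10] if n % 10 else "")
def normalize(token, context):
    if context == "year":       # "in 1325" read as a year
        return f"{two_digits(int(token[:2]))} {two_digits(int(token[2:]))}"
    if context == "passcode":   # "code 1325" read digit by digit
        return " ".join(SMALL[int(c)] for c in token)
    return token                # real systems cover many more cases
print(normalize("1325", "year"))      # thirteen twenty-five
print(normalize("1325", "passcode"))  # one three two five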

00:12:12.710 --> 00:12:14.690
And abbreviations have to be a nightmare for

00:12:14.690 --> 00:12:17.289
this, like the letters ST. Is it street or is

00:12:17.289 --> 00:12:19.370
it saint? Oh, early systems struggled with that

00:12:19.370 --> 00:12:21.649
all the time. I read that less intelligent systems

00:12:21.649 --> 00:12:23.710
would just guess the same way every time. There

00:12:23.710 --> 00:12:26.129
were actual instances where the historical figure

00:12:26.129 --> 00:12:28.789
Ulysses S. Grant was confidently pronounced by computers

00:12:28.789 --> 00:12:32.100
as Ulysses South Grant. Yes, because it just

00:12:32.100 --> 00:12:34.919
saw the S and applied a rigid rule. English is

00:12:34.919 --> 00:12:37.580
notoriously messy, which brings the front end

00:12:37.580 --> 00:12:40.200
to its second major task, grapheme-to-phoneme

00:12:40.200 --> 00:12:42.419
conversion. Okay, taking the written letters,

00:12:42.740 --> 00:12:45.539
the graphemes, and figuring out the actual phonetic

00:12:45.539 --> 00:12:48.700
sounds, which introduces the absolute nightmare

00:12:48.700 --> 00:12:52.029
of heteronyms. Words spelled exactly the same

00:12:52.029 --> 00:12:53.850
way but pronounced differently based on what

00:12:53.850 --> 00:12:56.090
they mean in the sentence. They are the bane

00:12:56.090 --> 00:12:58.629
of TTS developers. Like the sentence, my latest

00:12:58.629 --> 00:13:00.970
project is to learn how to better project my

00:13:00.970 --> 00:13:03.570
voice. A classic example. They are spelled identically.

00:13:03.850 --> 00:13:07.750
P-R-O-J-E-C-T. How on earth does the

00:13:07.750 --> 00:13:10.509
front-end script supervisor know which one is which?

00:13:10.799 --> 00:13:14.000
To solve this, developers use HMMs, or hidden

00:13:14.000 --> 00:13:16.820
Markov models. These are complex statistical

00:13:16.820 --> 00:13:19.320
models that calculate probabilities based on

00:13:19.320 --> 00:13:21.740
the surrounding words. So it looks at the neighbors.

00:13:21.899 --> 00:13:24.559
Yes. When the computer sees my latest project,

00:13:25.019 --> 00:13:27.559
the HMM looks at the word latest, identifies

00:13:27.559 --> 00:13:29.980
it as an adjective, and calculates a very high

00:13:29.980 --> 00:13:32.659
probability that a noun comes next. Oh, okay.

00:13:32.659 --> 00:13:34.740
So it tags the word as a noun, which tells the

00:13:34.740 --> 00:13:36.879
pronunciation dictionary to use the emphasis

00:13:36.879 --> 00:13:40.059
PRO-ject. And for the second? In the second

00:13:40.059 --> 00:13:42.860
half, better project, it sees the adverb better,

00:13:43.320 --> 00:13:45.919
predicts a verb is coming, and triggers the pronunciation

00:13:45.919 --> 00:13:48.940
pro-JECT. It is exhausting just thinking about

00:13:48.940 --> 00:13:51.940
the computational gymnastics happening in milliseconds

00:13:51.940 --> 00:13:54.620
before the audio even turns on. Not to mention

00:13:54.620 --> 00:13:57.379
that English spelling defies its own rules constantly.
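
NOTE
A toy Python disambiguator in the spirit of that HMM step. The probability
table is invented, and a real tagger runs Viterbi over the whole tag
sequence rather than peeking at one neighbor.
P_NEXT = {                                  # invented P(tag | previous word)
    "latest": {"NOUN": 0.9, "VERB": 0.1},   # adjectives tend to precede nouns
    "better": {"NOUN": 0.2, "VERB": 0.8},   # adverbs tend to precede verbs
}
PRON = {("project", "NOUN"): "PRO-ject",
        ("project", "VERB"): "pro-JECT"}
def pronounce(prev_word, word="project"):
    probs = P_NEXT.get(prev_word, {"NOUN": 0.5, "VERB": 0.5})
    tag = max(probs, key=probs.get)         # most probable part of speech
    return PRON[(word, tag)]
print(pronounce("latest"))   # PRO-ject, the noun reading
print(pronounce("better"))   # pro-JECT, the verb reading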

00:13:57.519 --> 00:14:01.240
Constantly. Like the F in the word of is pronounced

00:14:01.240 --> 00:14:04.259
as a V, which basically no other word does. So

00:14:04.259 --> 00:14:07.330
the system is really just leaning on massive

00:14:07.330 --> 00:14:10.230
built-in dictionaries containing the correct

00:14:10.230 --> 00:14:13.230
pronunciation for hundreds of thousands of words.

00:14:13.529 --> 00:14:15.929
Precisely. It only uses those phonetic rules

00:14:15.929 --> 00:14:18.409
as a backup for completely made-up or unknown

00:14:18.409 --> 00:14:20.870
words. Okay, so assuming the front-end script

00:14:20.870 --> 00:14:23.250
supervisor gets everything right, it hands a

00:14:23.250 --> 00:14:25.389
perfect phonetic script over to the back end,

00:14:25.870 --> 00:14:28.409
the synthesizer. And this is where the audio

00:14:28.409 --> 00:14:30.970
waveform is actually generated. And historically,

00:14:31.470 --> 00:14:33.600
developers faced a brutal trade-off here. It

00:14:33.600 --> 00:14:36.299
was the battle between naturalness and intelligibility.

00:14:36.659 --> 00:14:38.200
Meaning, do you want it to sound like a real

00:14:38.200 --> 00:14:39.919
human or do you want to actually understand what

00:14:39.919 --> 00:14:43.460
it's saying? Exactly. There are two main traditional

00:14:43.460 --> 00:14:46.340
approaches to generating that audio. Concatenative

00:14:46.340 --> 00:14:48.779
synthesis and format synthesis. Let's look at

00:14:48.779 --> 00:14:51.700
concatenative first. It basically strings together

00:14:51.700 --> 00:14:55.100
or concatenates actual recorded slices of human

00:14:55.100 --> 00:14:57.899
speech. It's like a sonic ransom note cut out

00:14:57.899 --> 00:15:00.399
of magazines. You're just pasting snippets of

00:15:00.399 --> 00:15:02.340
audio together. That's a perfect way to visualize

00:15:02.340 --> 00:15:05.179
it. The most realistic version of this is called

00:15:05.179 --> 00:15:08.559
unit selection. They record a human voice actor

00:15:08.559 --> 00:15:11.419
speaking for literally dozens of hours, chop

00:15:11.419 --> 00:15:14.379
that audio up into microscopic segments, and

00:15:14.379 --> 00:15:17.080
index them in a massive database. Okay. When

00:15:17.080 --> 00:15:19.279
you type a sentence, the system rapidly searches

00:15:19.279 --> 00:15:22.019
gigabytes of data for the exact audio slices

00:15:22.019 --> 00:15:24.019
that match and stitches them together. And it

00:15:24.019 --> 00:15:26.279
sounds natural because it is real human speech.
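
NOTE
A toy unit-selection loop in Python. The diphone database, pitch values, and
single join cost are all invented; real systems search gigabytes of units
with much richer target and join costs.
import numpy as np
rng = np.random.default_rng(0)
database = {   # diphone -> candidate units as (pitch_hz, audio_samples)
    "h-e": [(118, rng.standard_normal(800)), (140, rng.standard_normal(800))],
    "e-l": [(121, rng.standard_normal(700)), (180, rng.standard_normal(700))],
    "l-o": [(119, rng.standard_normal(900))],
}
def synthesize(diphones):
    out, prev_pitch = [], None
    for d in diphones:
        # join cost: pick the unit whose pitch best matches its neighbor,
        # so the seam between slices glitches as little as possible
        pitch, audio = min(
            database[d],
            key=lambda u: 0 if prev_pitch is None else abs(u[0] - prev_pitch))
        out.append(audio)
        prev_pitch = pitch
    return np.concatenate(out)   # paste the sonic ransom note together
wave = synthesize(["h-e", "e-l", "l-o"])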

00:15:26.419 --> 00:15:28.919
It is, but human speech naturally varies, right?

00:15:29.070 --> 00:15:32.529
In pitch, tone, breath. Stitching totally different

00:15:32.529 --> 00:15:35.169
emotional moments together can cause these audible,

00:15:35.409 --> 00:15:37.429
jittery glitches. Yeah, we've all heard that

00:15:37.429 --> 00:15:40.470
weird robotic hiccup in older GPS voices. And

00:15:40.470 --> 00:15:42.909
to avoid storing gigabytes of audio, they sometimes

00:15:42.909 --> 00:15:45.409
use diphone synthesis, right? Yes, that's the

00:15:45.409 --> 00:15:47.669
more compressed version, where instead of whole

00:15:47.669 --> 00:15:50.149
words, they only record the transitions between

00:15:50.149 --> 00:15:52.769
sounds. Like I think Spanish has about 800 of

00:15:52.769 --> 00:15:55.230
these transitions and German has 2,500. That's

00:15:55.230 --> 00:15:57.549
correct. So by only recording the transitions,

00:15:57.669 --> 00:16:00.350
you shrink the database massively. But because

00:16:00.350 --> 00:16:02.990
you are rigidly locking those exact transitions

00:16:02.990 --> 00:16:06.039
together, it really starts to sound distinctly

00:16:06.039 --> 00:16:08.679
robotic. Right. It loses the flow. Which brings

00:16:08.679 --> 00:16:11.139
us to the polar opposite approach. Formant synthesis.

00:16:11.559 --> 00:16:14.940
Formant synthesis uses absolutely zero human

00:16:14.940 --> 00:16:17.539
speech samples. No voice actors at all. No voice

00:16:17.539 --> 00:16:21.460
actors. No databases of recorded phones. It is

00:16:21.460 --> 00:16:25.019
purely a rules-based acoustic generation. It

00:16:25.019 --> 00:16:26.919
generates sound waves from scratch using math

00:16:26.919 --> 00:16:29.259
and filters. Okay, so this is the classic robot

00:16:29.259 --> 00:16:31.480
voice, like the legendary synthesizer used by

00:16:31.480 --> 00:16:33.460
Stephen Hawking. It will never ever be mistaken

00:16:33.460 --> 00:16:36.039
for a living person. Never. But let me push back

00:16:36.039 --> 00:16:38.340
on this a bit. If concatenative unit selection

00:16:38.340 --> 00:16:40.879
sounds almost indistinguishable from a real human,

00:16:41.059 --> 00:16:43.740
and we all carry smartphones today with hundreds

00:16:43.740 --> 00:16:46.259
of gigabytes of storage, why on earth would anyone

00:16:46.259 --> 00:16:49.240
still develop or use robotic, purely mathematical

00:16:49.240 --> 00:16:51.659
formant synthesis? Why keep the robot around?
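
NOTE
While that question hangs, here is what "math and filters" means: a minimal
cascade of second-order resonators in the style of a Klatt formant
synthesizer, sketched in Python (scipy assumed). The formant frequencies and
bandwidths are rough textbook values for an "ah" vowel, not any product's
settings.
import numpy as np
from scipy.signal import lfilter
fs, f0 = 16000, 110
excitation = np.zeros(fs)                 # one second of audio
excitation[:: fs // f0] = 1.0             # pure buzz source: no recordings
def resonator(x, freq, bw):
    # digital resonance: y[n] = A*x[n] + B*y[n-1] + C*y[n-2]
    C = -np.exp(-2 * np.pi * bw / fs)
    B = 2 * np.exp(-np.pi * bw / fs) * np.cos(2 * np.pi * freq / fs)
    A = 1 - B - C
    return lfilter([A], [1, -B, -C], x)
vowel = excitation
for freq, bw in [(700, 130), (1220, 70), (2600, 160)]:  # rough "ah" formants
    vowel = resonator(vowel, freq, bw)    # cascade the resonators
Change the three frequency numbers and the vowel changes; because everything
is explicit math, the output stays perfectly controllable at any speed.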

00:16:51.940 --> 00:16:54.679
If we connect this to the bigger picture, we

00:16:54.679 --> 00:16:57.659
realize that mimicking humanity is not always

00:16:57.659 --> 00:17:01.019
the user's primary goal. What do you mean? Think

00:17:01.019 --> 00:17:03.500
about visually impaired individuals who rely

00:17:03.500 --> 00:17:06.200
on screen readers to navigate their computers,

00:17:06.480 --> 00:17:09.000
emails, and the internet. Oh, right. Many of

00:17:09.000 --> 00:17:11.660
these users train themselves to listen to their

00:17:11.660 --> 00:17:14.299
screen readers at astonishingly high speeds.

00:17:14.900 --> 00:17:18.059
We're talking three, four, or even five times

00:17:18.059 --> 00:17:20.900
the normal speaking rate, just to absorb information

00:17:20.900 --> 00:17:24.779
as rapidly as a sighted person scans a page visually.

00:17:24.880 --> 00:17:27.099
Okay, I see where this is going. At those extreme

00:17:27.099 --> 00:17:29.849
speeds... A highly natural concatenative voice

00:17:29.849 --> 00:17:32.349
completely breaks down. Because of the glitches.

00:17:32.509 --> 00:17:34.609
Exactly. The micro-glitches, the human breath

00:17:34.609 --> 00:17:37.750
variations, the subtle pitch changes. When compressed

00:17:37.750 --> 00:17:39.809
that tightly, it all just turns into unintelligible

00:17:39.809 --> 00:17:42.589
mush. But a robotic formant synthesizer?

00:17:43.009 --> 00:17:45.470
Because it is purely mathematical, it has perfect

00:17:45.470 --> 00:17:48.279
acoustic control. There are no breaths, no accidental

00:17:48.279 --> 00:17:51.579
pitch shifts. It remains sharply, perfectly intelligible

00:17:51.579 --> 00:17:53.319
no matter how fast you play it. That makes total

00:17:53.319 --> 00:17:55.839
sense. It proves that the best technology is

00:17:55.839 --> 00:17:57.720
entirely dependent on what the user actually

00:17:57.720 --> 00:18:00.519
needs it to do. The flaw of it sounding robotic

00:18:00.519 --> 00:18:02.940
is exactly the feature that allows for rapid

00:18:02.940 --> 00:18:05.539
comprehension. That is brilliant. It really is.

00:18:05.779 --> 00:18:07.819
But, you know, the tech world is relentless.

00:18:08.279 --> 00:18:10.640
For decades, developers were trapped in this

00:18:10.640 --> 00:18:14.440
binary choice. Use glitchy human recordings or

00:18:14.440 --> 00:18:18.579
use reliable robot math. And then the paradigm

00:18:18.579 --> 00:18:21.000
broke open completely. We entered the deep learning

00:18:21.000 --> 00:18:23.720
era and the rules of the game just completely

00:18:23.720 --> 00:18:26.099
changed. Deep learning entirely removed the need

00:18:26.099 --> 00:18:28.480
to stitch recorded audio slices together. So

00:18:28.480 --> 00:18:31.160
no more ransom notes. No more ransom notes. Instead

00:18:31.160 --> 00:18:34.279
of gluing pre-recorded sounds, deep neural networks

00:18:34.279 --> 00:18:36.779
analyze millions of data points across massive

00:18:36.779 --> 00:18:40.539
libraries of human speech. They look at raw waveforms,

00:18:40.579 --> 00:18:42.660
and they actually learn the microscopic relationships

00:18:42.660 --> 00:18:45.339
between text and sound. Wow. They literally learn

00:18:45.339 --> 00:18:47.680
to generate raw audio waveforms completely from

00:18:47.680 --> 00:18:50.380
scratch. So the neural network is actually synthesizing

00:18:50.380 --> 00:18:53.599
a brand new sound wave 24,000 times a second.
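
NOTE
The autoregressive loop behind that claim, shrunk to a Python stub.
tiny_model here is a stand-in for a trained network and just emits noisy
bell-curve logits, so this generates noise rather than speech; the point is
the one-sample-at-a-time structure.
import numpy as np
rng = np.random.default_rng(0)
def tiny_model(history):
    # stand-in for a neural net: logits over 256 mu-law amplitude levels
    logits = -0.5 * ((np.arange(256) - 128) / 40.0) ** 2
    return logits + 0.1 * rng.standard_normal(256)
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()
receptive_field = 1024
samples = [128] * receptive_field            # start from digital silence
for _ in range(24000):                       # one second at 24 kHz
    probs = softmax(tiny_model(samples[-receptive_field:]))
    samples.append(int(rng.choice(256, p=probs)))   # draw the next sample
Every sample needs a full model evaluation, which is why early WaveNet-style
generation was famously slow to run.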

00:18:53.619 --> 00:18:56.269
Exactly. We saw this massive leap with DeepMind's

00:18:56.269 --> 00:18:59.529
WaveNet in 2016 and then Google's Tacotron 2

00:18:59.529 --> 00:19:03.170
in 2018. The audio was hyper-realistic, complete

00:19:03.170 --> 00:19:05.630
with breath sounds and emotional cadence. It

00:19:05.630 --> 00:19:07.730
was stunning to hear for the first time. But

00:19:07.730 --> 00:19:10.089
there was a catch, right? To train those models

00:19:10.089 --> 00:19:12.769
to clone a specific person, you still needed

00:19:12.769 --> 00:19:16.109
tens of hours of pristine, studio-quality audio

00:19:16.109 --> 00:19:19.430
from that person. It was a massive, expensive,

00:19:19.630 --> 00:19:21.690
corporate endeavor. It was very out of reach

00:19:21.690 --> 00:19:24.759
for regular people. Right. But then came 2020,

00:19:25.359 --> 00:19:28.779
when a free web-based platform called 15.ai

00:19:28.779 --> 00:19:32.960
launched. 15.ai was absolutely a watershed moment

00:19:32.960 --> 00:19:35.619
for this industry. It proved that with advanced

00:19:35.619 --> 00:19:38.440
deep learning, the requirement for massive training

00:19:38.440 --> 00:19:41.880
data was just gone. You no longer needed 50 hours.

00:19:42.240 --> 00:19:44.400
You only needed 15 seconds of training audio

00:19:44.400 --> 00:19:46.799
to perfectly clone a voice and make it speak

00:19:46.799 --> 00:19:50.079
with rich emotional expression. 15 seconds. For

00:19:50.079 --> 00:19:52.359
everyone listening, just think about that. Traditional

00:19:52.359 --> 00:19:54.819
text to speech is like asking an actor to memorize

00:19:54.819 --> 00:19:57.660
a script over 50 hours of rehearsal. Right. This

00:19:57.660 --> 00:20:00.000
new zero-shot deep learning is like a master

00:20:00.000 --> 00:20:02.539
jazz musician, hearing you play just three chords

00:20:02.539 --> 00:20:05.259
and instantly being able to improvise an entire

00:20:05.259 --> 00:20:07.680
symphony in your exact style. It's an incredible

00:20:07.680 --> 00:20:10.569
leap in capability. A single voicemail, a quick

00:20:10.569 --> 00:20:13.410
Instagram story, and your vocal identity is just

00:20:13.410 --> 00:20:16.609
captured. Naturally, this sparked a massive internet

00:20:16.609 --> 00:20:18.910
trend. People were using it in video game fandoms

00:20:18.910 --> 00:20:21.109
to make characters read memes. It was everywhere

00:20:21.109 --> 00:20:23.849
online. But any time a powerful tool becomes

00:20:23.849 --> 00:20:27.230
entirely frictionless, the darker side emerges

00:20:27.230 --> 00:20:29.430
almost instantly. And it happened incredibly

00:20:29.430 --> 00:20:33.970
fast. By early 2022, we witnessed the Voiceverse

00:20:33.970 --> 00:20:37.170
NFT scandal. Oh, I remember this. Yeah, a cryptocurrency

00:20:37.170 --> 00:20:40.390
project generated voices using that free 15.ai

00:20:40.589 --> 00:20:43.289
platform artificially pitched them up to obscure

00:20:43.289 --> 00:20:45.630
where they came from, claimed the underlying

00:20:45.630 --> 00:20:48.190
technology as their own proprietary work, and

00:20:48.190 --> 00:20:51.210
sold them as NFTs. Unbelievable. It was the first

00:20:51.210 --> 00:20:53.789
major instance of speech synthesis fraud occurring

00:20:53.789 --> 00:20:56.029
on the blockchain. And it escalated rapidly from

00:20:56.029 --> 00:20:59.650
copyright infringement to actual real world security

00:20:59.650 --> 00:21:03.109
threats. In 2023, a platform called ElevenLabs

00:21:03.109 --> 00:21:05.690
launched. Their algorithm was so sophisticated

00:21:05.690 --> 00:21:08.390
it could detect context and output specific emotions

00:21:08.390 --> 00:21:11.680
like anger, sadness, joy just by reading the

00:21:11.680 --> 00:21:13.900
text. It really crossed the uncanny valley. It

00:21:13.900 --> 00:21:16.460
did. And this is the platform that the Vice

00:21:16.460 --> 00:21:18.480
reporter used to hack his own bank account with

00:21:18.480 --> 00:21:20.440
that five-minute clone we talked about at the

00:21:20.440 --> 00:21:23.980
start. And by March 2024, OpenAI corroborated

00:21:23.980 --> 00:21:26.400
that 15-second benchmark with a tool they developed

00:21:26.400 --> 00:21:28.640
called Voice Engine. Right. But they didn't release

00:21:28.640 --> 00:21:31.500
it, did they? No. The results were so potent,

00:21:31.720 --> 00:21:34.740
so indistinguishable from reality that OpenAI

00:21:34.740 --> 00:21:38.359
explicitly deemed the technology too risky for

00:21:38.359 --> 00:21:40.859
general public release. They kept it strictly

00:21:40.859 --> 00:21:43.759
in a limited preview mode. So what does this

00:21:43.759 --> 00:21:45.799
all mean for you listening right now? If a financial

00:21:45.799 --> 00:21:48.599
institution's biometric security protocol can

00:21:48.599 --> 00:21:51.059
be utterly defeated by a software program that

00:21:51.059 --> 00:21:54.039
requires mere seconds of a person speaking, doesn't

00:21:54.039 --> 00:21:56.339
that render audio-based security completely

00:21:56.339 --> 00:21:59.440
obsolete? This raises an important question regarding

00:21:59.440 --> 00:22:01.900
how we adapt as a society. I mean, what happens

00:22:01.900 --> 00:22:03.920
when we can no longer implicitly trust our own

00:22:03.920 --> 00:22:06.480
ears? Let's picture our background shifting one

00:22:06.480 --> 00:22:09.859
last time to a futuristic digital matrix of sound

00:22:09.859 --> 00:22:13.059
waves. The matrix, I like it. On one hand, the

00:22:13.059 --> 00:22:15.480
positive applications of this technology are

00:22:15.480 --> 00:22:18.720
nothing short of miraculous. We are seeing it

00:22:18.720 --> 00:22:22.119
actively used to restore the voices of individuals

00:22:22.119 --> 00:22:25.200
who have lost theirs to severe medical conditions.

00:22:25.519 --> 00:22:28.140
Like the actor Val Kilmer? Yes. Val Kilmer had

00:22:28.140 --> 00:22:30.279
his voice reconstructed after throat cancer.

00:22:30.680 --> 00:22:33.680
It is giving people their identity back, which

00:22:33.680 --> 00:22:37.259
is beautiful. But simultaneously, the era of

00:22:37.259 --> 00:22:40.720
the digital sound-alike has shattered an implicit

00:22:40.720 --> 00:22:43.059
contract we've had since the dawn of humanity.

00:22:43.140 --> 00:22:45.559
What contract? The assumption that hearing a

00:22:45.559 --> 00:22:47.799
voice guarantees a physical person is present.

00:22:48.119 --> 00:22:51.400
We are navigating an entirely new terrain of

00:22:51.400 --> 00:22:54.019
reality where seeing isn't believing and hearing

00:22:54.019 --> 00:22:56.140
certainly isn't either. It is a staggering amount

00:22:56.140 --> 00:22:58.319
to process. We've gone on quite the journey today.

00:22:58.519 --> 00:23:00.980
I mean, we started in the 1700s looking at men

00:23:00.980 --> 00:23:03.240
physically pumping bellows and manipulating wooden

00:23:03.240 --> 00:23:05.460
tongues just to wrestle a vowel out of a box.

00:23:05.740 --> 00:23:08.279
It feels like centuries ago, literally. We broke

00:23:08.279 --> 00:23:11.039
down how the vocoder stripped sound into frequencies

00:23:11.039 --> 00:23:14.599
and watched an IBM mainframe sing Daisy Bell,

00:23:14.859 --> 00:23:17.299
inspiring science fiction history. We looked

00:23:17.299 --> 00:23:19.339
over the shoulder of the highly stressed front

00:23:19.339 --> 00:23:21.799
-end script supervisor trying to calculate the

00:23:21.799 --> 00:23:24.480
probability of how to pronounce project, and

00:23:24.480 --> 00:23:27.119
we learned why a robotic, purely mathematical

00:23:27.119 --> 00:23:30.079
voice is vastly superior to a human one when

00:23:30.079 --> 00:23:32.640
you're reading at warp speed. And then... the

00:23:32.640 --> 00:23:35.440
deep learning revolution. Right. Finally, we

00:23:35.440 --> 00:23:37.680
arrived at today where neural networks generate

00:23:37.680 --> 00:23:41.240
raw audio waveforms from scratch, and a 15-second

00:23:41.240 --> 00:23:43.759
clip is all it takes to digitally clone your

00:23:43.759 --> 00:23:46.500
identity. Understanding these mechanics, knowing

00:23:46.500 --> 00:23:48.460
the difference between concatenative stitching

00:23:48.460 --> 00:23:51.200
and deep learning wave generation, is what equips

00:23:51.200 --> 00:23:53.720
you to be a critical, questioning consumer of

00:23:53.720 --> 00:23:56.059
the audio you hear every single day. Before we

00:23:56.059 --> 00:23:57.500
go, I want to leave you with one final thought

00:23:57.500 --> 00:23:59.400
to mull over, connecting the oldest technology

00:23:59.400 --> 00:24:02.059
we discussed with the absolute newest. Remember,

00:24:02.339 --> 00:24:04.720
articulatory synthesis? Yeah, the computational

00:24:04.720 --> 00:24:07.039
modeling of the physical human throat and trachea,

00:24:07.400 --> 00:24:10.259
the digital fluid dynamics simulating the bellows.

00:24:10.519 --> 00:24:13.059
Exactly. Well, if we can mathematically simulate

00:24:13.059 --> 00:24:16.059
the perfect aerodynamics of human vocal folds

00:24:16.059 --> 00:24:18.779
in a computer and then combine that physics engine

00:24:18.779 --> 00:24:21.279
with the predictive, generative power of deep

00:24:21.279 --> 00:24:24.789
learning. Oh, wow. Where is this going? How long

00:24:24.789 --> 00:24:27.670
will it be until anthropologists take the skeletal

00:24:27.670 --> 00:24:30.470
remains and DNA of ancient historical figures,

00:24:31.150 --> 00:24:33.369
reconstruct the exact physical dimensions of

00:24:33.369 --> 00:24:36.109
their nasal cavities and throats, and perfectly

00:24:36.109 --> 00:24:38.769
synthesize their living voices? That is insane

00:24:38.769 --> 00:24:41.930
to think about. We could be mere years away from

00:24:41.930 --> 00:24:44.410
listening to the physically accurate living voices

00:24:44.410 --> 00:24:47.269
of people who died centuries before the microphone

00:24:47.269 --> 00:24:50.329
was even invented. Wow. From assuming the human

00:24:50.329 --> 00:24:53.789
voice is entirely, intimately organic to synthesizing

00:24:53.789 --> 00:24:55.869
the voices of the ancient dead using pure code

00:24:55.869 --> 00:24:58.329
and skeletal measurements, it is an incredibly

00:24:58.329 --> 00:25:00.670
murky, endlessly fascinating landscape. Thank

00:25:00.670 --> 00:25:02.390
you so much for joining us on this deep dive.

00:25:02.670 --> 00:25:04.670
Keep questioning what you hear and stay curious.
