WEBVTT

00:00:00.000 --> 00:00:02.799
Imagine picking up your phone, right? The caller

00:00:02.799 --> 00:00:05.059
ID says it's your best friend. Yeah, pretty normal

00:00:05.059 --> 00:00:08.140
everyday thing. Exactly. You answer and you hear

00:00:08.140 --> 00:00:11.539
their exact voice, their specific laugh, the

00:00:11.539 --> 00:00:13.900
unique way they pause when they're about to deliver

00:00:13.900 --> 00:00:16.640
a punchline. Now imagine that the person on the

00:00:16.640 --> 00:00:19.980
other end isn't your friend at all. It's like

00:00:19.980 --> 00:00:22.579
a computer program that learned how to perfectly

00:00:22.579 --> 00:00:25.160
fake all of those human nuances after listening

00:00:25.160 --> 00:00:28.039
to just 15 seconds of a video they posted online.

00:00:28.399 --> 00:00:30.260
I know it sounds like pure science fiction, but

00:00:30.260 --> 00:00:33.520
honestly, that is the reality of where speech

00:00:33.520 --> 00:00:37.000
synthesis technology currently sits. It's terrifying.

00:00:37.140 --> 00:00:39.500
It is. We are at a point where the biological

00:00:39.500 --> 00:00:42.420
signature of human communication can be, well,

00:00:42.780 --> 00:00:45.420
digitally replicated almost instantly. And getting

00:00:45.420 --> 00:00:47.899
to this point has been an absolutely wild ride.

00:00:48.439 --> 00:00:50.939
So today we are taking you on a deep dive into

00:00:50.939 --> 00:00:54.479
the fascinating centuries-long evolution of

00:00:54.479 --> 00:00:56.890
artificially produced human speech. Yeah, and

00:00:56.890 --> 00:00:59.229
we're basing this whole conversation on a single,

00:00:59.229 --> 00:01:02.530
massive source. Right, we're using this incredibly

00:01:02.530 --> 00:01:05.790
comprehensive Wikipedia article on speech synthesis

00:01:05.790 --> 00:01:08.170
as our guide. It's a goldmine of information.

00:01:08.359 --> 00:01:11.599
It really is. Our mission today is to trace this

00:01:11.599 --> 00:01:15.560
journey from like bizarre 18th century mechanical

00:01:15.560 --> 00:01:19.099
contraptions to the modern era of neural networks

00:01:19.099 --> 00:01:21.519
and AI voice cloning. We want to look under the

00:01:21.519 --> 00:01:24.159
hood. Yes. We want to satisfy your curiosity

00:01:24.159 --> 00:01:27.180
about how this tech actually works without drowning

00:01:27.180 --> 00:01:29.840
you in engineering manuals, you know, and figure

00:01:29.840 --> 00:01:32.620
out what all of this means for your future. OK,

00:01:32.680 --> 00:01:35.659
let's unpack this. To truly understand how we

00:01:35.659 --> 00:01:38.060
ended up with AI that can clone a voice in seconds.

00:01:38.359 --> 00:01:40.920
We actually have to rewind way past the invention

00:01:40.920 --> 00:01:43.900
of the microchip. Like way past. Centuries before

00:01:43.900 --> 00:01:46.840
digital code even existed, inventors were trying

00:01:46.840 --> 00:01:49.439
to solve this exact same problem using wood,

00:01:49.840 --> 00:01:53.079
leather, and physical air. That is just... They

00:01:53.079 --> 00:01:54.799
were trying to literally build talking machines

00:01:54.799 --> 00:01:56.680
from scratch. This was the part of the research

00:01:56.680 --> 00:01:59.079
that completely blew my mind. When I think of

00:01:59.079 --> 00:02:01.540
early speech tech, my brain immediately goes

00:02:01.540 --> 00:02:04.000
to those clunky, glowing green monitors from

00:02:04.000 --> 00:02:06.010
the 19... Right, like an old Apple II or something.

00:02:06.230 --> 00:02:08.830
Exactly. I am definitely not picturing the year

00:02:08.830 --> 00:02:12.030
1779, but that is the year a scientist named

00:02:12.030 --> 00:02:14.550
Christian Gottlieb Kratzenstein won a prize.

00:02:14.889 --> 00:02:16.750
Yeah, from the Russian Imperial Academy. Right,

00:02:16.830 --> 00:02:19.509
for building physical models of the human vocal

00:02:19.509 --> 00:02:22.310
tract. Kratzenstein was really an acoustics pioneer.

00:02:22.590 --> 00:02:25.009
I mean, he didn't write software. He constructed

00:02:25.009 --> 00:02:27.710
physical acoustic resonators out of tubes and

00:02:27.710 --> 00:02:30.250
chambers. Just building throats out of tubes?

00:02:30.349 --> 00:02:33.530
Basically, yeah. He meticulously shaped these

00:02:33.550 --> 00:02:36.370
chambers to mimic the biological dimensions of

00:02:36.370 --> 00:02:38.949
the human throat and mouth. Wow. And by blowing

00:02:38.949 --> 00:02:41.229
air through them, his contraptions could produce

00:02:41.229 --> 00:02:45.129
five distinct long vowel sounds. So just the

00:02:45.129 --> 00:02:48.009
vowels? Right, essentially the A, E, I, O, and

00:02:48.009 --> 00:02:50.750
U sounds. But it didn't stop with vowels, because

00:02:50.750 --> 00:02:52.909
shortly after Kratzenstein, an inventor named

00:02:52.909 --> 00:02:55.870
Wolfgang von Kempelen took this physical engineering

00:02:55.870 --> 00:02:57.969
to an entirely different level. He really did.

00:02:58.129 --> 00:03:00.289
He built this thing called an acoustic-mechanical

00:03:00.289 --> 00:03:02.860
speech machine. And he wasn't just working with

00:03:02.860 --> 00:03:05.060
static tubes anymore. No, he added moving parts.

00:03:05.199 --> 00:03:07.280
Yeah, moving models of the tongue and the lips

00:03:07.280 --> 00:03:09.680
to this device so it could physically produce

00:03:09.680 --> 00:03:12.800
consonants. And the power source for this entire

00:03:12.800 --> 00:03:15.960
setup was, get this, a set of bellows. The physical

00:03:15.960 --> 00:03:18.460
labor required to operate von Kempelen's machine

00:03:18.460 --> 00:03:21.020
was intense. I can imagine. The operator had

00:03:21.020 --> 00:03:23.680
to use their arm to vigorously pump those bellows,

00:03:23.840 --> 00:03:25.759
which served as the mechanical lungs, you know,

00:03:25.919 --> 00:03:27.759
supplying the breath. Pumping the lungs by hand.

00:03:28.319 --> 00:03:31.300
Exactly. And simultaneously, the operator's hands

00:03:31.300 --> 00:03:34.300
were flying across various levers, stops, and

00:03:34.300 --> 00:03:37.099
leather flaps to manipulate the artificial tongue

00:03:37.099 --> 00:03:39.659
and lips. That sounds impossible to coordinate.

00:03:40.039 --> 00:03:42.360
It was hard. They were manually shaping that

00:03:42.360 --> 00:03:45.039
rushing air into recognizable syllables. I am

00:03:45.039 --> 00:03:47.520
just picturing someone sitting in an 18th century

00:03:47.520 --> 00:03:50.979
drawing room just sweating while pumping this

00:03:50.979 --> 00:03:53.680
elaborate wooden machine. It's like playing the

00:03:53.680 --> 00:03:55.919
human respiratory system like a weird fleshy

00:03:55.919 --> 00:03:59.460
bagpipe. That is a very vivid and surprisingly

00:03:59.460 --> 00:04:02.159
accurate analogy. A fleshy bagpipe. You were

00:04:02.159 --> 00:04:04.740
physically forcing air through artificial anatomy

00:04:04.740 --> 00:04:07.580
to sculpt sound waves. And what's crazy is this

00:04:07.580 --> 00:04:10.759
mechanical bagpipe-like approach persisted for

00:04:10.759 --> 00:04:13.300
a surprisingly long time. Really? How long? Fast

00:04:13.300 --> 00:04:16.779
forward to 1939. At the New York World's Fair,

00:04:17.100 --> 00:04:19.740
Homer Dudley from Bell Labs exhibited a machine

00:04:19.740 --> 00:04:21.959
called the Voder. The Voder? Right. It was a

00:04:21.959 --> 00:04:24.839
massive custom-built console where an operator

00:04:24.839 --> 00:04:28.209
literally played speech like a piano. Oh, wow.

00:04:28.370 --> 00:04:31.089
They used a keyboard and foot pedals to manually

00:04:31.089 --> 00:04:34.009
combine different acoustic frequencies into words.

00:04:34.389 --> 00:04:36.949
The manual dexterity required to play a fluent

00:04:36.949 --> 00:04:39.110
sentence on a keyboard in real time must have

00:04:39.110 --> 00:04:41.389
been insane. You'd have to train for months.

00:04:41.569 --> 00:04:44.009
Yeah, you'd have to be a concert pianist of human

00:04:44.009 --> 00:04:47.269
phonetics. But seeing all these brilliant minds

00:04:47.269 --> 00:04:50.149
spend centuries building physical throats and

00:04:50.149 --> 00:04:52.910
mechanical lungs, it makes me wonder why they

00:04:52.910 --> 00:04:55.480
went through all this trouble so long ago. Well,

00:04:55.480 --> 00:04:58.100
if we connect this to the bigger picture, it

00:04:58.100 --> 00:05:01.680
reveals a fundamental, centuries-old drive to

00:05:01.680 --> 00:05:04.060
reverse engineer human communication. We just

00:05:04.060 --> 00:05:06.240
can't help ourselves. We really can't. We have

00:05:06.240 --> 00:05:08.139
always wanted to look under the hood of our own

00:05:08.139 --> 00:05:11.620
biology, take it apart, and recreate it. The

00:05:11.620 --> 00:05:14.560
desire to make machines speak isn't a byproduct

00:05:14.560 --> 00:05:17.279
of the computer age. Oh, I see. The computer

00:05:17.279 --> 00:05:19.519
age just finally gave us the ultimate tool to

00:05:19.519 --> 00:05:21.860
achieve it. Which naturally leads us to the moment

00:05:21.860 --> 00:05:23.959
the physical bellows were finally swapped out

00:05:23.959 --> 00:05:25.920
for digital circuits. It's a big leap. Right,

00:05:26.040 --> 00:05:28.660
the transition from mechanical lungs to mainframes.

00:05:29.240 --> 00:05:31.300
The digital awakening really kicked off when

00:05:31.300 --> 00:05:33.480
computers didn't just manage to speak, but were

00:05:33.480 --> 00:05:36.120
actually programmed to sing. That milestone happened

00:05:36.120 --> 00:05:39.100
in 1961, back at Bell Labs again. Of course,

00:05:39.259 --> 00:05:41.800
Bell Labs. Yeah. Physicist John Larry Kelly Jr.

00:05:41.959 --> 00:05:44.680
and his colleague Louis Gerstman utilized an

00:05:44.680 --> 00:05:48.899
IBM 704 to synthesize human speech. And the IBM

00:05:48.899 --> 00:05:51.860
704, that wasn't exactly a laptop. No, not at

00:05:51.860 --> 00:05:54.860
all. To put this in perspective, the IBM 704

00:05:54.860 --> 00:05:58.019
was a massive, room-sized vacuum tube computer.

00:05:58.100 --> 00:06:00.680
Right. And for their grand demonstration, they

00:06:00.680 --> 00:06:03.459
programmed this behemoth to sing the song Daisy

00:06:03.459 --> 00:06:05.790
Bell. And there is a phenomenal piece of trivia

00:06:05.790 --> 00:06:08.230
in our sources about who happened to be wandering

00:06:08.230 --> 00:06:11.269
the halls of Bell Labs on that exact day. It's

00:06:11.269 --> 00:06:14.069
such a great coincidence. It is. Arthur C. Clarke,

00:06:14.149 --> 00:06:16.209
the legendary science fiction author, was visiting

00:06:16.209 --> 00:06:18.850
a friend at the facility. He walked in and saw

00:06:18.850 --> 00:06:22.149
this gigantic room-sized calculator singing Daisy

00:06:22.149 --> 00:06:24.449
Bell. The demonstration left a permanent mark

00:06:24.449 --> 00:06:27.009
on him. He was so inspired by the eerie sound

00:06:27.009 --> 00:06:29.730
of a machine singing a human melody that he wrote

00:06:29.730 --> 00:06:31.769
it directly into the climax of his screenplay.

00:06:31.910 --> 00:06:35.519
For 2001: A Space Odyssey. Exactly. When the

00:06:35.519 --> 00:06:38.360
HAL 9000 computer is being deactivated by astronaut

00:06:38.360 --> 00:06:41.899
Dave Bowman, its digital mind degrades and it

00:06:41.899 --> 00:06:44.120
reverts to its very first piece of programming.

00:06:44.300 --> 00:06:47.459
Slowly singing Daisy Bell as it dies. It's chilling.

00:06:47.740 --> 00:06:50.959
It really is. It is incredible that a niche tech

00:06:50.959 --> 00:06:54.319
demo in New Jersey birthed one of the most iconic

00:06:54.319 --> 00:06:56.980
chilling moments in cinematic history. Just a

00:06:56.980 --> 00:06:59.180
total accident of timing. Yeah. So following

00:06:59.180 --> 00:07:02.500
that room-sized mainframe in the 60s, the technology

00:07:02.379 --> 00:07:05.899
rapidly began to shrink. Microchips changed everything.

00:07:05.899 --> 00:07:09.160
They did. By the 1970s, we saw the arrival of

00:07:09.160 --> 00:07:12.500
portable synthetic speech. There was the 1976

00:07:12.500 --> 00:07:15.259
SpeechPlus calculator for visually impaired users.

00:07:15.300 --> 00:07:17.480
Which was huge for accessibility. And then the

00:07:17.480 --> 00:07:20.860
massively popular 1978 Speak and Spell educational

00:07:20.860 --> 00:07:23.240
toy from Texas Instruments. The Speak and Spell

00:07:23.240 --> 00:07:25.339
was a monumental leap, actually, because it didn't

00:07:25.339 --> 00:07:27.800
just play back pre-recorded tape loops. Wait,

00:07:27.800 --> 00:07:30.279
it didn't? No, it utilized a technology called

00:07:30.279 --> 00:07:33.240
linear predictive coding, or LPC. OK, what does

00:07:33.240 --> 00:07:35.519
that mean? Well, instead of storing massive audio

00:07:35.519 --> 00:07:37.800
files, which microchips of that era couldn't

00:07:37.800 --> 00:07:41.060
possibly hold anyway, LPC stored the mathematical

00:07:41.060 --> 00:07:43.500
instructions required to recreate the shape of

00:07:43.500 --> 00:07:46.699
a vocal tract on the fly. Oh, wow. It was essentially

00:07:46.699 --> 00:07:50.339
generating a digital model of a throat directly

00:07:50.339 --> 00:07:53.370
on a single silicon chip. That is so cool. But

00:07:53.370 --> 00:07:55.209
you know, while reading through the evolution

00:07:55.209 --> 00:07:58.589
of these digital models, there was one specific

00:07:58.589 --> 00:08:00.889
historical detail that made me stop and reread

00:08:00.889 --> 00:08:03.329
the paragraph. Which one? The source material

00:08:03.329 --> 00:08:06.430
notes that synthesized digital voices were exclusively

00:08:06.430 --> 00:08:10.509
male until the year 1990. Yes. Wait, the digital

00:08:10.509 --> 00:08:12.949
tech had been progressing since the late 50s,

00:08:12.949 --> 00:08:15.670
but it took until 1990 for a researcher named

00:08:15.670 --> 00:08:19.050
Ann Syrdal at AT&amp;T Bell Labs to finally create

00:08:19.050 --> 00:08:21.290
a female voice. That's right. Why did it take

00:08:21.290 --> 00:08:23.610
over 30 years for a computer to sound like a

00:08:23.610 --> 00:08:25.850
woman? The delay comes down to the underlying

00:08:25.850 --> 00:08:28.610
mathematics of those early acoustic models. How

00:08:28.610 --> 00:08:30.990
so? When engineers were developing technologies

00:08:30.990 --> 00:08:33.789
like linear predictive coding, they built and

00:08:33.789 --> 00:08:36.450
tuned the algorithms using the acoustic frequencies

00:08:36.450 --> 00:08:39.110
of their own voices. And the teams establishing

00:08:39.110 --> 00:08:42.129
the foundational parameters for things like fundamental

00:08:42.129 --> 00:08:45.129
frequency and vocal tract resonance were composed

00:08:45.129 --> 00:08:48.690
almost entirely of men. So the literal mathematical

00:08:48.690 --> 00:08:51.169
baseline for artificial speech was inherently

00:08:51.169 --> 00:08:55.070
male. Exactly. And you cannot simply take a mathematical

00:08:55.070 --> 00:08:57.710
model built for a male voice and just, you know,

00:08:57.909 --> 00:09:00.139
pitch it up to sound female. Why not? If you

00:09:00.139 --> 00:09:02.840
do that, you don't get a woman's voice. You get

00:09:02.840 --> 00:09:06.240
an unnatural, cartoonish chipmunk sound. Oh,

00:09:06.240 --> 00:09:09.600
right, like a sped-up record. Exactly. The physical

00:09:09.600 --> 00:09:11.899
dimensions of female vocal tracts are different,

00:09:12.179 --> 00:09:14.580
which fundamentally changes the acoustic resonances

00:09:14.580 --> 00:09:17.259
and the distance between vocal frequencies. So

00:09:17.259 --> 00:09:19.679
it's a completely different math problem. Completely.

00:09:20.519 --> 00:09:22.559
Ann Syrdal had to completely rework the underlying

00:09:22.559 --> 00:09:25.519
math and crack the complex acoustic models necessary

00:09:25.519 --> 00:09:28.059
to synthesize a natural-sounding female voice.

00:09:28.200 --> 00:09:31.120
Wow, shout out to Ann Syrdal. It perfectly highlights

00:09:31.120 --> 00:09:33.840
how technological milestones are inextricably

00:09:33.840 --> 00:09:36.519
linked to whoever is in the room doing the foundational

00:09:36.519 --> 00:09:39.220
research. If your baseline data isn't at the

00:09:39.220 --> 00:09:41.580
table, the technology literally does not know

00:09:41.580 --> 00:09:44.259
how to speak for you. Now, knowing that they

00:09:44.259 --> 00:09:46.960
had to rely on heavy mathematics to model these

00:09:46.960 --> 00:09:49.940
voices brings us to the core mechanics of the

00:09:49.940 --> 00:09:53.389
technology. To understand why those early digital

00:09:53.389 --> 00:09:56.769
voices sounded so famously robotic, and how we

00:09:56.769 --> 00:09:59.190
eventually solved that problem, we have to look

00:09:59.190 --> 00:10:02.230
at the two primary methods engineers use to generate

00:10:02.230 --> 00:10:05.350
speech. Right. The field is broadly divided into

00:10:05.350 --> 00:10:08.350
two distinct approaches. There's concatenative

00:10:08.350 --> 00:10:11.179
synthesis and formant synthesis. And they are

00:10:11.179 --> 00:10:13.720
very different. They operate on entirely different

00:10:13.720 --> 00:10:16.139
philosophies of sound. Let's look at concatenative

00:10:16.139 --> 00:10:19.019
synthesis first. Our sources explain that this

00:10:19.019 --> 00:10:21.759
method relies on stringing together tiny

00:10:21.759 --> 00:10:24.679
pre-recorded pieces of actual human speech. Yes,

00:10:24.840 --> 00:10:27.399
actual recordings. Right. Engineers build massive

00:10:27.399 --> 00:10:30.139
databases of a voice actor saying every possible

00:10:30.139 --> 00:10:32.679
phonetic sound combination, and the computer

00:10:32.679 --> 00:10:35.000
stitches those audio snippets together on the

00:10:35.000 --> 00:10:37.000
fly to form new sentences. That's the core of

00:10:37.000 --> 00:10:39.000
it. When I read how this worked, I immediately

00:10:39.000 --> 00:10:41.519
thought, this is exactly like a sonic ransom note.

00:10:41.919 --> 00:10:43.639
A ransom note is a brilliant way to visualize

00:10:43.639 --> 00:10:46.200
it. You are cutting out letters and words from

00:10:46.200 --> 00:10:48.059
different magazines and gluing them together

00:10:48.059 --> 00:10:50.840
to form a brand new message. And just like a

00:10:50.840 --> 00:10:53.159
physical ransom note, the individual pieces are

00:10:53.159 --> 00:10:55.919
real, but the seams where they connect can be...

00:10:55.759 --> 00:10:58.639
Well, messy. Very messy. If the computer has

00:10:58.639 --> 00:11:01.700
a massive database and stitches the sounds perfectly,

00:11:02.299 --> 00:11:04.360
it sounds incredibly natural because the raw

00:11:04.360 --> 00:11:06.940
material is an actual human. Right. But if it

00:11:06.940 --> 00:11:08.600
doesn't have the exact right transition between

00:11:08.600 --> 00:11:12.019
a T and an S sound, you get these weird audible

00:11:12.019 --> 00:11:14.620
glitches where the audio snaps. Those glitches

00:11:14.620 --> 00:11:17.659
at the seams are the fatal flaw of concatenative

00:11:17.659 --> 00:11:20.789
synthesis. It just sounds broken. Yeah. Modern

00:11:20.789 --> 00:11:23.750
versions, known as unit selection synthesis,

00:11:24.370 --> 00:11:27.330
mitigate this by using gigabytes of data to ensure

00:11:27.330 --> 00:11:29.629
they have the perfect magazine clipping for every

00:11:29.629 --> 00:11:31.669
single scenario. Just millions of clippings.
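
NOTE
Editor's sketch, not from the episode: one common way to soften the seams described above is to overlap neighboring clips and crossfade them.
The code below is a minimal illustration under stated assumptions. The sine tones stand in for recorded speech units, and the names tone, concatenate, and SAMPLE_RATE are hypothetical.
A real unit selection system would also search a large database for the best-matching clips, which this sketch does not attempt.
```python
import math
SAMPLE_RATE = 16000  # assumed sample rate in Hz
def tone(freq_hz, n_samples):
    """Stand-in for a recorded speech unit: a plain sine tone."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n_samples)]
def concatenate(units, overlap):
    """Join units, linearly crossfading `overlap` samples at each seam."""
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # fade-in weight for the incoming unit
            j = len(out) - overlap + i   # position inside the tail of the output
            out[j] = (1 - w) * out[j] + w * unit[i]
        out.extend(unit[overlap:])       # append the part that does not overlap
    return out
# Three hypothetical "clippings" of 800 samples each, joined with 80-sample crossfades.
units = [tone(220, 800), tone(330, 800), tone(440, 800)]
speech = concatenate(units, overlap=80)
```
Setting overlap to 0 reduces this to plain butt-joining, which is where the audible snaps would be most likely.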

00:11:31.830 --> 00:11:34.409
Exactly. But that requires massive storage and

00:11:34.409 --> 00:11:36.870
computing power. Which brings us to the alternative,

00:11:37.450 --> 00:11:39.769
formant synthesis. That's the robotic one. Yeah.

00:11:40.129 --> 00:11:42.230
If the concatenative method is a ransom note

00:11:42.230 --> 00:11:45.110
made from real human photographs, formant synthesis

00:11:45.110 --> 00:11:47.549
is like painting a voice entirely from scratch

00:11:47.549 --> 00:11:50.169
using a mathematical rulebook. No human recordings

00:11:50.169 --> 00:11:53.950
at all. Zero human audio samples. Instead, it

00:11:53.950 --> 00:11:56.190
generates acoustic waveforms through additive

00:11:56.190 --> 00:11:59.389
synthesis, combining sine waves to artificially

00:11:59.389 --> 00:12:01.870
simulate resonant frequencies. It's pure math.
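
NOTE
Editor's sketch, not from the episode: the additive approach described above can be illustrated by summing sine-wave harmonics and boosting the ones that sit near a vowel's resonant (formant) frequencies.
Everything here is an assumption for illustration: the formant values are rough textbook figures for an "ah" vowel, and harmonic_gain and synthesize_vowel are hypothetical names, not any real synthesizer's API.
```python
import math
SAMPLE_RATE = 16000              # assumed sample rate in Hz
FORMANTS_AH = [730, 1090, 2440]  # rough formant centers for "ah" (illustrative values)
def harmonic_gain(freq_hz, formants, bandwidth_hz=100.0):
    """Weight a harmonic by how close it sits to any formant peak."""
    return sum(1.0 / (1.0 + ((freq_hz - f) / bandwidth_hz) ** 2) for f in formants)
def synthesize_vowel(f0_hz, formants, n_samples):
    """Additive synthesis: sum harmonics of the pitch f0, shaped by formant gains."""
    n_harmonics = int(SAMPLE_RATE / 2 / f0_hz)  # stay below the Nyquist frequency
    harmonics = [f0_hz * k for k in range(1, n_harmonics + 1)]
    gains = [harmonic_gain(h, formants) for h in harmonics]
    total = sum(gains)  # normalizing by the total keeps every sample within -1..1
    samples = []
    for i in range(n_samples):
        t = i / SAMPLE_RATE
        samples.append(sum(g * math.sin(2 * math.pi * h * t) for g, h in zip(gains, harmonics)) / total)
    return samples
# A tenth of a second of a 110 Hz "ah"; no recorded audio is involved anywhere.
vowel = synthesize_vowel(110, FORMANTS_AH, n_samples=1600)
```
Because the sound is computed rather than recorded, a faster speaking rate in a sketch like this is just a smaller n_samples per sound, with no clips to splice.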

00:12:02.110 --> 00:12:04.549
But the result of all that pure math is a voice

00:12:04.549 --> 00:12:07.950
that sounds undeniably heavily robotic. Very

00:12:07.950 --> 00:12:10.169
metallic. Why would anyone choose the robotic

00:12:10.169 --> 00:12:12.409
voice? I mean, nobody wants to talk to a machine

00:12:12.409 --> 00:12:14.929
that sounds like a 1950s sci-fi movie. What's

00:12:14.929 --> 00:12:17.350
fascinating here is, despite the robotic tone,

00:12:17.809 --> 00:12:21.070
formant synthesis is actually vastly superior

00:12:21.070 --> 00:12:23.629
in several critical use cases. And it's entirely

00:12:23.629 --> 00:12:26.169
because it relies on math instead of human recordings.

00:12:26.389 --> 00:12:28.450
Superior how? Well, think about the ransom

00:12:28.450 --> 00:12:31.370
note glitches. If you take a concatenative voice

00:12:31.370 --> 00:12:33.370
and speed it up to three or four times the normal

00:12:33.370 --> 00:12:35.710
speaking rate, those tiny glitches at the seams

00:12:35.710 --> 00:12:38.289
compound. Oh, they just pile up? Yeah. Transitions

00:12:38.289 --> 00:12:41.429
break down, and the output turns into an unintelligible

00:12:41.429 --> 00:12:44.230
blur of noise. But formant synthesis doesn't have

00:12:44.230 --> 00:12:47.250
seams. Exactly. The math scales perfectly. You

00:12:47.250 --> 00:12:49.629
can play a formant voice at incredibly high speeds,

00:12:49.909 --> 00:12:52.610
and the audio never degrades. It remains perfectly

00:12:52.610 --> 00:12:55.350
crisp and intelligible. Oh, I see. This is about

00:12:55.350 --> 00:12:59.100
accessibility and efficiency. Precisely. Visually

00:12:59.100 --> 00:13:02.159
impaired individuals who rely on screen readers

00:13:02.159 --> 00:13:05.259
to navigate computer interfaces. They don't want

00:13:05.259 --> 00:13:08.000
a slow conversational voice reading every menu

00:13:08.000 --> 00:13:10.419
item. That would take forever. Right. They often

00:13:10.419 --> 00:13:12.659
crank the reading speed up to hundreds of words

00:13:12.659 --> 00:13:15.840
per minute. Wow. And at those speeds, the robotic

00:13:15.840 --> 00:13:18.500
formant voice is the only method that remains

00:13:18.500 --> 00:13:21.879
clear. It is a perfect example of how achieving

00:13:21.879 --> 00:13:24.320
maximum naturalness isn't always the ultimate

00:13:24.320 --> 00:13:27.500
goal. Form over function, or rather, function

00:13:27.500 --> 00:13:31.440
over form. Exactly. Sometimes pure utility, speed,

00:13:31.580 --> 00:13:33.899
and reliability are far more valuable to the

00:13:33.899 --> 00:13:36.919
user. So the robotic voice isn't a failure of

00:13:36.919 --> 00:13:39.960
early technology. It's a feature for power users.

00:13:40.019 --> 00:13:42.220
Absolutely. But generating the actual sound,

00:13:42.240 --> 00:13:44.100
whether you are stitching together a ransom note

00:13:44.100 --> 00:13:46.340
or calculating a sine wave, that's only half

00:13:46.340 --> 00:13:48.419
the battle. Right. You need the text. Before

00:13:48.419 --> 00:13:50.320
the machine can produce a sound, it has to figure

00:13:50.320 --> 00:13:52.539
out what the text is actually supposed to say.

00:13:53.419 --> 00:13:56.200
And human text, particularly the English language,

00:13:56.279 --> 00:13:59.080
is an absolute nightmare for a computer to process.

00:13:59.340 --> 00:14:02.740
The industry term for this hurdle is text normalization.

00:14:02.919 --> 00:14:06.419
Text normalization. Converting raw text into

00:14:06.419 --> 00:14:09.120
a phonetic blueprint is rarely a straightforward

00:14:09.120 --> 00:14:11.700
process, mainly because humans do not write the

00:14:11.700 --> 00:14:14.559
way we speak. To say the least, the examples

00:14:14.559 --> 00:14:16.720
from the source text had me laughing out loud.

00:14:16.919 --> 00:14:19.360
Let's look at heteronyms. Oh, those are tricky.

00:14:19.539 --> 00:14:21.580
Words that are spelled exactly the same but are

00:14:21.580 --> 00:14:24.179
pronounced differently based entirely on context.

00:14:24.559 --> 00:14:28.200
Take the sentence: My latest project is to learn

00:14:28.200 --> 00:14:31.100
how to better project my voice. To a computer

00:14:31.100 --> 00:14:34.240
that is just the letter sequence p-r-o-j-e-c-t

00:14:34.240 --> 00:14:37.379
appearing twice. It has no idea. It has

00:14:37.379 --> 00:14:40.639
no biological intuition at all. It has to algorithmically

00:14:40.639 --> 00:14:43.220
parse the semantics of the entire sentence just

00:14:43.220 --> 00:14:45.679
to realize the first one is a noun and the second

00:14:45.679 --> 00:14:48.299
is a verb requiring different intonations. Then

00:14:48.299 --> 00:14:50.919
you throw numbers into the mix. Numbers are a

00:14:50.919 --> 00:14:54.659
huge headache. If the text says 1325... The machine

00:14:54.659 --> 00:14:58.840
has to figure out if it should say one thousand three hundred twenty-five, or

00:14:58.840 --> 00:15:01.559
if it's a historical date, thirteen twenty-five. And Roman

00:15:01.559 --> 00:15:05.080
numerals. Yes. If you type Henry VIII, the computer

00:15:05.080 --> 00:15:07.500
should read Henry the Eighth. But if you type Chapter

00:15:07.500 --> 00:15:09.820
VIII, it has to know to say Chapter Eight. The

00:15:09.820 --> 00:15:12.100
ultimate headache for text normalization, however,

00:15:12.440 --> 00:15:16.500
is abbreviations. Oh, man. Yes. The abbreviation

00:15:16.500 --> 00:15:18.820
"in" could be the preposition "in," or it could stand

00:15:18.820 --> 00:15:21.559
for inches. Right. The abbreviation ST could

00:15:21.559 --> 00:15:24.799
mean street or saint. So if you feed it an address

00:15:24.799 --> 00:15:28.059
like 12 St. John St., it has to deduce that the

00:15:28.059 --> 00:15:30.240
first one is Saint and the second one is Street. It's

00:15:30.240 --> 00:15:32.879
a logic puzzle. The source text highlighted a

00:15:32.879 --> 00:15:35.600
classic failure where a basic text-to-speech

00:15:35.600 --> 00:15:38.220
system looked at the historical name Ulysses

00:15:38.220 --> 00:15:41.779
S. Grant and confidently read it aloud as Ulysses

00:15:41.779 --> 00:15:44.100
South Grant. Right, because it just sees the

00:15:44.100 --> 00:15:46.659
S and guesses South. That is amazing. But wait,

00:15:46.700 --> 00:15:49.000
I have a supercomputer in my pocket with access

00:15:49.000 --> 00:15:52.059
to the entire Oxford English Dictionary. Why

00:15:52.059 --> 00:15:54.639
can't engineers just plug a digital dictionary

00:15:54.639 --> 00:15:57.139
into the machine and tell it to look up the correct

00:15:57.139 --> 00:16:00.779
pronunciation for every word? Engineers absolutely

00:16:00.779 --> 00:16:03.340
use dictionary-based approaches. The system

00:16:03.340 --> 00:16:06.340
stores a massive lookup table of words and their

00:16:06.340 --> 00:16:08.779
perfect phonetic translations. So why doesn't

00:16:08.779 --> 00:16:12.210
that just solve it? Well, it is incredibly fast

00:16:12.210 --> 00:16:15.149
and perfectly accurate right up until the moment

00:16:15.149 --> 00:16:18.009
it encounters a word that isn't in the database.

00:16:18.009 --> 00:16:22.210
The moment it hits a brand new slang term, a unique

00:16:22.210 --> 00:16:25.450
foreign surname, or a newly coined corporate brand,

00:16:25.450 --> 00:16:28.669
the dictionary approach catastrophically fails.

00:16:28.669 --> 00:16:31.129
Because it's not on the list. Because it has no

00:16:31.129 --> 00:16:33.509
instructions for unknown variables. So what is

00:16:33.509 --> 00:16:36.330
the backup plan when the dictionary fails? The

00:16:36.330 --> 00:16:39.460
fallback is a rule-based approach. The system

00:16:39.460 --> 00:16:42.519
attempts to sound out the unknown word by applying

00:16:42.519 --> 00:16:44.779
the standard pronunciation rules of the language

00:16:44.779 --> 00:16:47.340
based on its spelling. Like Hooked on Phonics

00:16:47.340 --> 00:16:50.539
for computers. Essentially. But English is deeply,

00:16:50.899 --> 00:16:52.919
frustratingly irregular. Oh yeah. Consider the

00:16:52.919 --> 00:16:55.860
word "of," spelled O-F. Okay. It is the only word

00:16:55.860 --> 00:16:57.500
in the entire English language where the letter

00:16:57.500 --> 00:17:00.120
F is pronounced as a V. I have spoken English

00:17:00.120 --> 00:17:01.679
my entire life and I never actually thought about

00:17:01.679 --> 00:17:05.630
that until I read the brief. That is ridiculous.
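
NOTE
Editor's sketch, not from the episode: the dictionary lookup with a rule-based fallback described above might look roughly like this.
The tiny LEXICON, the one-sound-per-letter LETTER_RULES table, and the pronounce function are all hypothetical; the phoneme symbols are loosely ARPAbet-style.
Real systems use far larger lexicons and context-sensitive rules, so treat this only as a picture of the control flow.
```python
# Hypothetical mini lexicon: exceptions and common words with known pronunciations.
LEXICON = {"of": "AH V", "the": "DH AH", "speech": "S P IY CH"}
# Naive fallback: one sound per letter, ignoring all context.
LETTER_RULES = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}
def sound_out(word):
    """Rule-based guess: spell the word out sound by sound."""
    return " ".join(LETTER_RULES[ch] for ch in word.lower() if ch in LETTER_RULES)
def pronounce(word):
    """Try the dictionary first; fall back to the letter rules for unknown words."""
    return LEXICON.get(word.lower()) or sound_out(word)
known = pronounce("of")      # found in the lexicon, so the F becomes a V sound
unknown = pronounce("zib")   # made-up word, so the rules have to guess
rules_only = sound_out("of") # what the rules alone would say for "of"
```
Comparing known with rules_only shows why irregular words like "of" have to live in the dictionary: the letter rules alone would give it an F sound.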

00:17:06.009 --> 00:17:08.650
Because of these endless irregularities, modern

00:17:08.650 --> 00:17:12.950
systems had to move beyond rigid rules. Engineers

00:17:12.950 --> 00:17:15.529
began deploying complex heuristic techniques

00:17:15.529 --> 00:17:18.730
and something called hidden Markov models. Hidden

00:17:18.730 --> 00:17:21.410
Markov models? Yeah. Instead of looking at a

00:17:21.410 --> 00:17:25.230
word in isolation, a hidden Markov model analyzes

00:17:25.230 --> 00:17:27.950
the surrounding text. So, looking for context

00:17:27.950 --> 00:17:30.309
clues. Exactly. It calculates the statistical

00:17:30.309 --> 00:17:32.450
probability of different parts of speech occurring

00:17:32.450 --> 00:17:36.130
in that specific sequence. And it uses that mathematical

00:17:36.130 --> 00:17:39.289
likelihood to make an educated guess about the

00:17:39.289 --> 00:17:41.539
context and the correct pronunciation. Here's

00:17:41.539 --> 00:17:44.099
where it gets really interesting, because what

00:17:44.099 --> 00:17:46.539
happens when we stop trying to hand code these

00:17:46.539 --> 00:17:48.539
messy grammatical rules? Everything changes.
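
NOTE
Editor's sketch, not from the episode: the hidden Markov model idea described above can be illustrated with a toy part-of-speech tagger that decides whether "project" is a noun or a verb.
All probabilities below are invented for illustration, and TAGS, START, TRANS, EMIT, STRESS, and viterbi are hypothetical names.
The decoding step is the standard Viterbi algorithm, which picks the single most probable tag sequence.
```python
TAGS = ["DET", "NOUN", "TO", "VERB"]
# Made-up probabilities: how likely each tag is to start a sentence.
START = {"DET": 0.6, "NOUN": 0.2, "TO": 0.1, "VERB": 0.1}
# Made-up probabilities: how likely each tag is to follow another.
TRANS = {
    "DET": {"NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"TO": 0.5, "VERB": 0.3, "NOUN": 0.2},
    "TO": {"VERB": 0.9, "NOUN": 0.1},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "TO": 0.2},
}
# Made-up probabilities: which words each tag can produce.
EMIT = {"DET": {"my": 1.0}, "NOUN": {"project": 1.0}, "TO": {"to": 1.0}, "VERB": {"project": 1.0}}
def viterbi(words):
    """Return the most probable tag sequence for `words`."""
    best = [{t: START[t] * EMIT[t].get(words[0], 0.0) for t in TAGS}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in TAGS:
            # Best previous tag for reaching tag t at this position.
            prev = max(TAGS, key=lambda p: best[-1][p] * TRANS[p].get(t, 0.0))
            scores[t] = best[-1][prev] * TRANS[prev].get(t, 0.0) * EMIT[t].get(word, 0.0)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    tag = max(TAGS, key=lambda t: best[-1][t])
    path = [tag]
    for pointers in reversed(back):  # walk the back-pointers from the end
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]
# The tag then selects the stress pattern for the heteronym.
STRESS = {"NOUN": "PROJ-ect", "VERB": "pro-JECT"}
tags = viterbi(["my", "project", "to", "project"])
readings = [STRESS[t] for w, t in zip(["my", "project", "to", "project"], tags) if w == "project"]
```
With these invented numbers, the first "project" is tagged as a noun and the second as a verb, purely from the surrounding sequence.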

00:17:48.740 --> 00:17:51.240
What happens when engineers stop manually patching

00:17:51.240 --> 00:17:54.339
the Ulysses South Grant errors and instead hand

00:17:54.339 --> 00:17:57.019
the problem over to massive neural networks to

00:17:57.019 --> 00:17:59.440
learn organically? That transition marks the

00:17:59.440 --> 00:18:01.740
deep learning leap, which fundamentally changed

00:18:01.740 --> 00:18:04.519
the trajectory of speech synthesis around 2016.

00:18:04.819 --> 00:18:07.299
2016 wasn't that long ago. No, it really wasn't.

00:18:07.349 --> 00:18:10.269
And it was spearheaded by technologies like DeepMind's

00:18:10.269 --> 00:18:12.849
WaveNet. Previous systems were basically forcing

00:18:12.849 --> 00:18:15.950
a computer to read a strict, incredibly complex

00:18:15.950 --> 00:18:19.170
grammar textbook. But WaveNet took a completely

00:18:19.170 --> 00:18:21.369
different approach. Very different. It was like

00:18:21.369 --> 00:18:24.450
locking the computer in a room with tens of thousands

00:18:24.450 --> 00:18:26.650
of hours of people talking and forcing it to

00:18:26.650 --> 00:18:29.490
just listen until it organically absorbed the

00:18:29.490 --> 00:18:32.569
vibe and the underlying physical rules of human

00:18:32.569 --> 00:18:35.549
sound on its own. WaveNet abandoned the idea

00:18:35.549 --> 00:18:38.289
of stitching together audio clips or using rigid

00:18:38.289 --> 00:18:41.509
acoustic equations. Instead, its deep neural

00:18:41.509 --> 00:18:44.250
networks modeled raw audio waveforms directly.

00:18:44.509 --> 00:18:46.990
It was predicting the shape of the sound wave

00:18:46.990 --> 00:18:49.589
one audio sample at a time. Just guessing the

00:18:49.589 --> 00:18:52.529
next sound? Yes. And by learning the statistical

00:18:52.529 --> 00:18:55.130
patterns of human speech from massive data sets,

00:18:55.670 --> 00:18:57.809
it produced voices that were astonishingly natural.

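NOTE
The sample-at-a-time idea described above can be sketched as a toy loop.
As published, WaveNet quantizes audio to 256 mu-law levels and uses a deep
network of dilated causal convolutions to predict the next level. Here a
plain count table stands in for that network, so this shows only the
autoregressive loop, not WaveNet itself. The names train and generate, the
two-sample context, and the sine-wave training data are assumptions made
for illustration.
```python
import math
import random
from collections import Counter, defaultdict
MU = 255  # 8-bit mu-law, i.e. 256 discrete amplitude levels
def mu_law_encode(x):
    """Map a sample in [-1, 1] to an integer level in [0, 255]."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * MU))
def mu_law_decode(q):
    """Map an integer level in [0, 255] back to a sample in [-1, 1]."""
    y = 2 * q / MU - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
def train(levels, context=2):
    """Count which level follows each run of `context` previous levels."""
    table = defaultdict(Counter)
    for i in range(context, len(levels)):
        table[tuple(levels[i - context:i])][levels[i]] += 1
    return table
def generate(table, seed, n, rng):
    """Autoregressive loop: sample one level, append it, repeat."""
    out = list(seed)
    context = len(seed)
    for _ in range(n):
        counts = table.get(tuple(out[-context:]))
        if not counts:
            # Unseen context: fall back to repeating the last level.
            out.append(out[-1])
            continue
        choices, weights = zip(*counts.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out
# Toy "training data": a 200 Hz sine at 8 kHz (made up for illustration).
wave = [0.8 * math.sin(2 * math.pi * 200 * t / 8000) for t in range(800)]
levels = [mu_law_encode(s) for s in wave]
table = train(levels)
generated = generate(table, levels[:2], 200, random.Random(0))
audio = [mu_law_decode(q) for q in generated]
```
Each new level is drawn from a distribution conditioned on the levels before
it, then fed back in as context for the next draw. That feedback is what
"one audio sample at a time" means; swapping the count table for a trained
neural network is the part this sketch leaves out.
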
00:18:57.910 --> 00:19:00.849
It was capturing breaths, lip smacks, and subtle

00:19:00.849 --> 00:19:02.890
intonations. Because it learned how to speak

00:19:02.890 --> 00:19:05.410
by ingesting massive amounts of data, the timeline

00:19:05.410 --> 00:19:08.009
of innovation speeds up terrifyingly fast from

00:19:08.009 --> 00:19:11.589
2016 onward. Exponentially fast. By 2020, a platform

00:19:11.589 --> 00:19:15.069
called 15.ai launched, and the creator proved

00:19:15.069 --> 00:19:17.049
something that effectively broke the internet.

00:19:17.170 --> 00:19:19.890
Oh, this was a huge moment. It only takes 15

00:19:19.890 --> 00:19:22.630
seconds of training data to perfectly clone

00:19:22.630 --> 00:19:25.990
a specific person's voice. 15 seconds. Think

00:19:25.990 --> 00:19:27.990
about the magnitude of that shift. That's hard

00:19:27.990 --> 00:19:30.089
to even wrap my head around. Prior to neural

00:19:30.089 --> 00:19:33.869
networks, achieving a high-quality custom synthetic

00:19:33.869 --> 00:19:37.430
voice required a professional voice actor sitting

00:19:37.430 --> 00:19:40.450
in a pristine recording studio for tens of hours.

00:19:40.670 --> 00:19:43.730
Reading boring scripts, reading hundreds of specific

00:19:43.730 --> 00:19:46.549
phonetically balanced sentences. But suddenly,

00:19:46.549 --> 00:19:49.269
a random 15-second snippet extracted from a

00:19:49.269 --> 00:19:51.589
compressed YouTube video, or just a quick voice

00:19:51.589 --> 00:19:54.130
memo recorded on a phone, was enough data

00:19:54.130 --> 00:19:56.589
for a neural network to reconstruct your entire

00:19:56.589 --> 00:19:59.529
vocal identity. And 15.ai didn't just clone

00:19:59.529 --> 00:20:01.529
the baseline sound of the voice. It managed to

00:20:01.529 --> 00:20:03.750
clone the emotional expressions. Yeah, the emotion

00:20:03.750 --> 00:20:06.109
is the crazy part. It exploded in meme culture.

00:20:06.410 --> 00:20:08.670
Overnight, you had people generating audio of

00:20:08.670 --> 00:20:10.839
famous video game characters or celebrities

00:20:10.839 --> 00:20:14.220
reading absurd internet copypastas with perfect

00:20:14.220 --> 00:20:16.839
emotional delivery. But the novelty wore off

00:20:16.839 --> 00:20:20.000
quickly. Because the dark side of this democratization

00:20:20.000 --> 00:20:23.039
emerged almost immediately. Right. The rapid

00:20:23.039 --> 00:20:25.819
democratization of voice cloning ushered us directly

00:20:25.819 --> 00:20:28.660
into the deepfake era. The technology was no

00:20:28.660 --> 00:20:30.839
longer confined to corporate research labs. Anyone

00:20:30.839 --> 00:20:34.180
could use it. Exactly. In early 2022, we witnessed

00:20:34.180 --> 00:20:37.630
the first major speech synthesis NFT fraud. Oh,

00:20:37.630 --> 00:20:40.930
right. Voiceverse. Yeah, a company called Voiceverse

00:20:40.930 --> 00:20:44.029
stole pristine voice lines that users had generated

00:20:44.029 --> 00:20:47.470
using 15.ai. They pitched the audio up slightly

00:20:47.470 --> 00:20:50.410
to obscure the theft and sold those digital voices

00:20:50.410 --> 00:20:53.269
as their own commercial NFTs. And the implications

00:20:53.269 --> 00:20:55.609
get far more serious than stolen Internet memes.

00:20:55.789 --> 00:20:59.049
Much more serious. In 2023, a reporter from Vice

00:20:59.049 --> 00:21:01.230
decided to test the security of his own bank.

00:21:01.759 --> 00:21:04.319
He used a commercially available tool by a company

00:21:04.319 --> 00:21:06.880
called ElevenLabs to clone his voice. A widely

00:21:06.880 --> 00:21:09.900
available tool. Yes. He recorded just five minutes

00:21:09.900 --> 00:21:12.259
of his own speech, played the cloned voice over the

00:21:12.259 --> 00:21:14.619
phone for his bank's automated biometric security

00:21:14.619 --> 00:21:17.160
system, and the system verified his identity

00:21:17.160 --> 00:21:19.519
and let him right into his account. The tech

00:21:19.519 --> 00:21:21.779
industry is acutely aware of the danger this

00:21:21.779 --> 00:21:25.880
poses. By March 2024, OpenAI released a report

00:21:25.880 --> 00:21:28.559
corroborating that 15-second cloning benchmark.

00:21:28.700 --> 00:21:30.940
So they confirmed it's real. They proved... They

00:21:30.940 --> 00:21:34.180
had developed a voice engine capable of flawlessly

00:21:34.180 --> 00:21:37.599
cloning a human from a tiny conversational audio

00:21:37.599 --> 00:21:39.380
sample. But they didn't release it, right? No.

00:21:39.579 --> 00:21:42.319
They took the highly unusual step of deeming

00:21:42.319 --> 00:21:44.960
their own technology too risky for general public

00:21:44.960 --> 00:21:47.940
release. They cited the profound potential for

00:21:47.940 --> 00:21:51.000
synthetic voice misuse. The whiplash on this

00:21:51.000 --> 00:21:53.380
technology is just staggering to process. It

00:21:53.380 --> 00:21:55.680
really is. We started this deep dive looking

00:21:55.680 --> 00:21:58.539
at speech synthesis as an incredible assistive

00:21:58.539 --> 00:22:01.359
tool. It gave the brilliant Stephen Hawking a

00:22:01.359 --> 00:22:03.579
way to communicate his theories to the world.

00:22:03.660 --> 00:22:06.099
Which is amazing. More recently, advanced synthesis

00:22:06.099 --> 00:22:09.000
gave the actor Val Kilmer his specific voice

00:22:09.000 --> 00:22:12.220
back after he lost it to throat cancer. That

00:22:12.220 --> 00:22:14.500
is beautiful, life-changing technology. Absolutely.

00:22:14.759 --> 00:22:17.559
But in the exact same breath, it has become so

00:22:17.559 --> 00:22:20.140
powerful and so accessible that it can trivially

00:22:20.140 --> 00:22:23.279
bypass biometric banking security. This raises

00:22:23.279 --> 00:22:25.579
an important question about the fundamental nature

00:22:25.579 --> 00:22:28.259
of identity in the 21st century. How do we even

00:22:28.259 --> 00:22:31.720
define identity now? Exactly. As the technology

00:22:31.720 --> 00:22:34.960
moves from basic text-to-speech dictation to

00:22:34.960 --> 00:22:37.559
the creation of perfect digital sound-alikes,

00:22:38.059 --> 00:22:41.180
we are crossing a massive societal threshold.

00:22:41.279 --> 00:22:43.960
There's no going back. No. We are entering an

00:22:43.960 --> 00:22:48.019
era where our own biometric data, the unique cadence,

00:22:48.019 --> 00:22:51.079
breath patterns, and timbre of our voice, can

00:22:51.079 --> 00:22:53.559
be effortlessly captured and weaponized against

00:22:53.559 --> 00:22:56.740
us. Society is going to be forced to entirely

00:22:56.740 --> 00:23:00.359
rethink how we establish trust and verify who's

00:23:00.359 --> 00:23:02.460
actually on the other end of a digital connection.

00:23:03.119 --> 00:23:05.789
So what does this all mean? It is wild to think

00:23:05.789 --> 00:23:07.829
that our human obsession with recreating the

00:23:07.829 --> 00:23:10.410
voice hasn't changed since the 1700s. Not at

00:23:10.410 --> 00:23:12.230
all. The only thing that changed was our tools.

00:23:12.630 --> 00:23:14.990
We went from pumping physical bellows to pumping

00:23:14.990 --> 00:23:17.390
massive streams of data. The underlying desire

00:23:17.390 --> 00:23:20.029
to build a machine in our own image remains the

00:23:20.029 --> 00:23:22.849
same. But the speed at which that machine is

00:23:22.849 --> 00:23:25.589
learning has vastly outpaced our ability to secure

00:23:25.589 --> 00:23:27.750
it. To you listening to this deep dive right

00:23:27.750 --> 00:23:30.480
now, the next time you leave a casual voicemail

00:23:30.480 --> 00:23:32.839
or post a quick video update on your social media

00:23:32.839 --> 00:23:35.000
feed, you have to remember that your voice is

00:23:35.000 --> 00:23:37.279
no longer just a sound wave fading into the air.

00:23:37.400 --> 00:23:39.940
No, it's not. Your voice is highly actionable

00:23:39.940 --> 00:23:42.420
data. It is a digital fingerprint that can be

00:23:42.420 --> 00:23:45.720
copied, modeled, and deployed by a neural network.

00:23:45.980 --> 00:23:48.180
And I want to leave you with one final thought

00:23:48.180 --> 00:23:50.940
to mull over. Something built on all this research,

00:23:50.980 --> 00:23:53.900
but taking it one terrifying step further. OK.

00:23:54.319 --> 00:23:57.799
We know AI can already clone exactly how you

00:23:57.799 --> 00:24:00.740
sound in just 15 seconds, and our sources mention

00:24:00.740 --> 00:24:03.420
that researchers are actively studying prosody.

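NOTE
Prosody analysis of the kind mentioned here starts from measurable
quantities such as pitch and duration. The following is a minimal sketch of
one classic way to estimate pitch, autocorrelation over a single frame. It
is not taken from the source, and the function name estimate_pitch, the
80 to 400 Hz search range, and the synthetic test tone are assumptions.
```python
import math
def estimate_pitch(samples, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # The best-matching shift is one pitch period, so invert it to get Hz.
    return sample_rate / best_lag
# Synthetic test tone (an assumption for the demo): 200 Hz sine at 8 kHz.
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(2000)]
pitch_hz = estimate_pitch(frame, SAMPLE_RATE)
duration_s = len(frame) / SAMPLE_RATE
```
On this clean tone the estimate should land at or very near 200 Hz. Real
speech is noisier, so practical systems track pitch frame by frame and
smooth the result before looking for patterns like rising intonation.
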
00:24:03.599 --> 00:24:05.599
Right, the rhythm and intonation. Yeah, they

00:24:05.599 --> 00:24:08.420
are analyzing pitch and duration to teach AI

00:24:08.420 --> 00:24:12.119
how to detect subtle human nuances, like a smile

00:24:12.119 --> 00:24:14.359
hidden in an audio track. Which is incredible

00:24:14.359 --> 00:24:17.279
tech. It is, but what happens when AI doesn't

00:24:17.279 --> 00:24:19.880
just clone the pitch of your voice, but learns

00:24:19.880 --> 00:24:23.160
to flawlessly synthesize your unique persuasive

00:24:23.160 --> 00:24:26.440
emotional quirks? Could a voice clone of a loved

00:24:26.440 --> 00:24:29.000
one become the ultimate irresistible phishing

00:24:29.000 --> 00:24:31.940
scam? That is terrifying. Imagine getting a frantic

00:24:31.940 --> 00:24:34.440
phone call and it isn't just a generic replica

00:24:34.440 --> 00:24:36.700
of your child's voice. It perfectly mimics the

00:24:36.700 --> 00:24:39.680
exact specific trembling tone they use when they

00:24:39.680 --> 00:24:42.119
are terrified and in deep trouble. No one would

00:24:42.119 --> 00:24:44.640
question that. A scam like that would bypass

00:24:44.640 --> 00:24:47.319
your logical, critical-thinking brain entirely

00:24:47.319 --> 00:24:49.940
because it perfectly triggers the biological

00:24:49.940 --> 00:24:52.859
emotional cues you are evolutionarily wired to

00:24:52.859 --> 00:24:56.539
protect. It hacks human nature. Exactly. We started

00:24:56.539 --> 00:24:58.920
this centuries-long journey trying to engineer

00:24:58.920 --> 00:25:02.099
the human voice, but we might just end up engineering

00:25:02.099 --> 00:25:05.059
a master key to human vulnerability. That is

00:25:05.059 --> 00:25:07.079
definitely something to think about. Thank you

00:25:07.079 --> 00:25:09.420
for exploring this fascinating, complex topic

00:25:09.420 --> 00:25:12.319
with us today. Keep questioning, keep learning,

00:25:12.579 --> 00:25:14.539
and until next time, keep diving deep.