WEBVTT

00:00:00.000 --> 00:00:02.500
You know, usually when we think of human communication,

00:00:02.580 --> 00:00:05.870
there's this expectation of, well, effortlessness.

00:00:06.089 --> 00:00:08.390
It's basically like breathing. Right. It feels

00:00:08.390 --> 00:00:11.150
completely natural because it's deeply biological.

00:00:11.390 --> 00:00:13.769
I mean, our brains are hardwired for it from

00:00:13.769 --> 00:00:16.250
birth. Yeah, exactly. You have a thought in your

00:00:16.250 --> 00:00:18.870
head, you move your mouth, an invisible wave

00:00:18.870 --> 00:00:22.170
of sound comes out, and someone else just effortlessly

00:00:22.170 --> 00:00:25.149
catches that wave and understands it. A toddler

00:00:25.149 --> 00:00:28.350
playing with blocks, you know, masters the fundamentals

00:00:28.350 --> 00:00:31.690
of language without ever reading a manual, or

00:00:31.690 --> 00:00:34.350
studying grammar or looking at a frequency chart.

00:00:34.369 --> 00:00:36.710
It's entirely invisible and incredibly fluid.

00:00:36.750 --> 00:00:39.329
But then you try to teach a machine to do that

00:00:39.329 --> 00:00:41.469
exact same thing. Oh, yeah. You try to get a

00:00:41.469 --> 00:00:43.729
computer to take that invisible, messy, fluid

00:00:43.729 --> 00:00:46.909
wave of sound and actually comprehend it. And

00:00:46.909 --> 00:00:49.109
suddenly you realize that what we do every single

00:00:49.109 --> 00:00:52.310
day without thinking is actually a mathematical

00:00:52.310 --> 00:00:55.770
miracle. It really is. Taking a machine and forcing

00:00:55.770 --> 00:00:58.780
it to listen puts you into a technological landscape

00:00:58.780 --> 00:01:02.159
that is incredibly murky. It is the absolute definition

00:01:02.159 --> 00:01:04.620
of computational muddy waters. Because human

00:01:04.620 --> 00:01:07.400
speech is chaotic. Pure chaos. Which brings us

00:01:07.400 --> 00:01:10.540
to today's deep dive. We are cracking open a

00:01:10.540 --> 00:01:13.219
massive stack of research to figure out how we

00:01:13.219 --> 00:01:16.760
actually taught machines to capture that invisible

00:01:16.760 --> 00:01:20.760
wave. And it is a long, surprisingly complex

00:01:20.760 --> 00:01:23.310
journey. Right. Our mission today is to explore

00:01:23.310 --> 00:01:25.709
that journey. We're going to unpack how this

00:01:25.709 --> 00:01:28.909
technology evolved from, like, clunky room-size

00:01:28.909 --> 00:01:32.109
machines in the 1950s to the invisible AI sitting

00:01:32.109 --> 00:01:34.129
in your pocket right now. And we'll break down

00:01:34.129 --> 00:01:36.530
the massive mathematical hurdles researchers

00:01:36.530 --> 00:01:39.049
had to overcome. Yeah. And explore why this tech

00:01:39.049 --> 00:01:42.469
is both a profound accessibility tool and a very

00:01:42.469 --> 00:01:44.969
real, very surprising security vulnerability

00:01:44.969 --> 00:01:47.469
in your daily life. It's just a massive sprawling

00:01:47.469 --> 00:01:50.390
story. It is. But before we jump into the timeline,

00:01:50.519 --> 00:01:52.459
there's a crucial distinction we need to establish

00:01:52.459 --> 00:01:54.379
right up front to make sense of the research.

00:01:54.500 --> 00:01:56.659
Okay, lay it on me. People often use the terms

00:01:56.659 --> 00:01:59.480
interchangeably in casual conversation, but voice

00:01:59.480 --> 00:02:02.040
recognition and speech recognition are two entirely

00:02:02.040 --> 00:02:04.099
different computational tasks. Wait, really?

00:02:04.379 --> 00:02:06.299
I definitely just use them interchangeably. Most

00:02:06.299 --> 00:02:09.659
people do, but voice recognition, or speaker identification,

00:02:09.659 --> 00:02:12.240
is about figuring out who you are. It's using

00:02:12.240 --> 00:02:14.819
the unique acoustic properties of your vocal

00:02:14.819 --> 00:02:17.670
tract, like a biological fingerprint. Speech

00:02:17.670 --> 00:02:20.250
recognition, which is our focus today, is about

00:02:20.250 --> 00:02:22.750
figuring out what you are saying, regardless

00:02:22.750 --> 00:02:25.240
of who is saying it. Which, as the source material

00:02:25.240 --> 00:02:28.419
makes glaringly obvious, is a monumentally harder

00:02:28.419 --> 00:02:31.060
problem. No, exponentially harder. So let's rewind

00:02:31.060 --> 00:02:33.340
the clock to see where this all started. We are

00:02:33.340 --> 00:02:36.219
going way back, long before the era of modern

00:02:36.219 --> 00:02:40.159
AI, into the early 1950s. Teaching machines to

00:02:40.159 --> 00:02:43.060
listen is not a new concept at all. No, not at

00:02:43.060 --> 00:02:46.099
all. In 1952, researchers at Bell Labs built

00:02:46.099 --> 00:02:48.819
a system named Audrey. Audrey could recognize

00:02:48.819 --> 00:02:51.060
spoken digits zero through nine, but there was

00:02:51.060 --> 00:02:53.159
a massive catch. It only really worked for a

00:02:53.159 --> 00:02:55.620
single speaker. Right. The notes mentioned Audrey

00:02:55.620 --> 00:02:57.960
was basically locating patterns called formants

00:02:57.960 --> 00:03:00.860
in the power spectrum of a specific voice. What

00:03:00.860 --> 00:03:03.000
does that actually mean, like in plain English?

00:03:03.300 --> 00:03:05.759
Think about how sound is physically produced.

00:03:06.039 --> 00:03:08.840
Your vocal cords buzz, and that raw sound travels

00:03:08.840 --> 00:03:10.840
up through your throat, mouth, and nasal cavity.

00:03:10.900 --> 00:03:13.030
Right, the anatomy. Yeah, and those cavities

00:03:13.030 --> 00:03:15.569
act as acoustic filters. They amplify certain

00:03:15.569 --> 00:03:18.050
frequencies and dampen others. Those amplified

00:03:18.050 --> 00:03:20.370
frequencies are called formants. OK, got it.

00:03:20.750 --> 00:03:22.909
So Audrey worked by analyzing a power spectrum,

00:03:23.490 --> 00:03:25.629
which is essentially a graph showing which frequencies

00:03:25.629 --> 00:03:28.270
in the sound wave have the most energy or power

00:03:28.270 --> 00:03:32.289
at any given moment. The machine was hardwired

00:03:32.289 --> 00:03:35.490
to look for the specific formant peaks of the

00:03:35.490 --> 00:03:38.050
lead researcher's voice. So, to use an analogy,

00:03:38.330 --> 00:03:40.189
it's kind of like an acoustic guitar. Okay, yeah.

00:03:40.370 --> 00:03:42.909
The strings create the raw vibration, but it's

00:03:42.909 --> 00:03:45.250
the hollow wooden body of the guitar that actually

00:03:45.250 --> 00:03:47.710
shapes the sound into something we recognize.

00:03:48.090 --> 00:03:49.949
Exactly. So Audrey wasn't really understanding

00:03:49.949 --> 00:03:53.210
the word five. It was just recognizing the unique

00:03:53.210 --> 00:03:55.990
acoustic resonance of one specific guy's wooden

00:03:55.990 --> 00:03:57.849
guitar body when he said the word five. That

00:03:57.849 --> 00:04:00.310
is a perfect way to visualize it. Audrey was

00:04:00.310 --> 00:04:03.009
recognizing the instrument, not the music. Wow.
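
NOTE
A toy Python sketch of what "locating formants in the power spectrum" means, assuming NumPy. The vowel here is faked as a buzz shaped by two resonances; real formant tracking is far messier than grabbing the loudest peaks, so treat this as illustration, not Audrey's actual circuitry.
import numpy as np
# Fake 100 ms of a vowel: a 120 Hz buzz shaped by two resonances (formants).
fs = 8000
t = np.arange(int(0.1 * fs)) / fs
signal = sum(a * np.sin(2 * np.pi * f * t)
             for f, a in [(120, 0.3), (700, 1.0), (1200, 0.8)])
# The power spectrum: how much energy sits at each frequency.
power = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(len(signal), 1 / fs)
# "Finding the formants" here is just picking the strongest peaks.
print(sorted(freqs[np.argsort(power)[-3:]].tolist()))  # [120.0, 700.0, 1200.0]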

00:04:03.189 --> 00:04:06.009
OK. And a decade later, IBM took that a step

00:04:06.009 --> 00:04:09.469
further. In 1962, they debuted their Shoebox

00:04:09.469 --> 00:04:11.990
machine at the World's Fair. It was considered an absolute

00:04:11.990 --> 00:04:14.169
marvel of engineering at the time, doing addition

00:04:14.169 --> 00:04:17.370
and subtraction via voice, but... It still only

00:04:17.370 --> 00:04:20.430
possessed a tiny 16-word vocabulary. Right.

00:04:20.569 --> 00:04:24.430
Just 16 words. Because recognizing single, isolated

00:04:24.430 --> 00:04:27.189
words, especially from a person the machine is

00:04:27.189 --> 00:04:30.089
already calibrated to, is one thing. But human

00:04:30.089 --> 00:04:32.449
conversation simply doesn't work like that. No.

00:04:32.689 --> 00:04:36.259
It flows. Words bleed into each other. The end

00:04:36.259 --> 00:04:38.399
of one word becomes the start of the next. Getting

00:04:38.399 --> 00:04:41.300
machines to understand continuous sentences was

00:04:41.300 --> 00:04:44.259
the real wall researchers hit. And that wall

00:04:44.259 --> 00:04:47.439
held strong until the late 1960s at Stanford

00:04:47.439 --> 00:04:50.800
University. A graduate student named Raj Reddy

00:04:50.800 --> 00:04:53.139
achieved something groundbreaking. He cracked

00:04:53.139 --> 00:04:56.220
the code on continuous speech recognition. Because

00:04:56.220 --> 00:04:58.399
up until Reddy's work, if you wanted a computer

00:04:58.399 --> 00:05:00.579
to transcribe what you were saying, you had to

00:05:00.579 --> 00:05:02.699
speak with these artificial, unnatural gaps.

00:05:02.860 --> 00:05:05.319
You had to talk like a stilted, robotic movie

00:05:05.319 --> 00:05:08.879
alien. Exactly, like, take me to your best leader!

00:05:09.060 --> 00:05:11.000
Yeah, you had to physically stop the sound wave

00:05:11.000 --> 00:05:13.199
so the computer knew where one word ended and

00:05:13.199 --> 00:05:15.839
the next began. So what Reddy did was allow the

00:05:15.839 --> 00:05:18.100
audio to flow to the point where a user could

00:05:18.100 --> 00:05:20.680
actually issue spoken commands in real time to

00:05:20.680 --> 00:05:22.939
play a game of chess against the computer. Which

00:05:22.939 --> 00:05:26.209
was huge. It feels revolutionary because it fundamentally

00:05:26.209 --> 00:05:29.209
changes the human-computer dynamic from a data

00:05:29.209 --> 00:05:32.610
entry task into an actual interaction. It shifted

00:05:32.610 --> 00:05:36.310
the paradigm entirely. But to fully contextualize

00:05:36.310 --> 00:05:39.370
this, even with Reddy's breakthrough in continuous

00:05:39.370 --> 00:05:42.509
speech, a massive puzzle remained completely

00:05:42.509 --> 00:05:45.730
unsolved. The speaker independence issue. Exactly.

00:05:46.350 --> 00:05:48.629
The chess system worked beautifully, but mostly

00:05:48.629 --> 00:05:51.529
for the programmer who built it. Building a machine

00:05:51.529 --> 00:05:54.230
that could understand anyone's voice with all

00:05:54.230 --> 00:05:56.790
our different pitches, local accents, and bizarre

00:05:56.790 --> 00:06:00.089
vocal quirks was still considered nearly impossible.

00:06:00.490 --> 00:06:02.509
Right, because if everyone speaks at wildly different

00:06:02.509 --> 00:06:06.410
speeds, how did a 1970s computer not just completely

00:06:06.410 --> 00:06:08.329
fail the second someone drew out a word with

00:06:08.329 --> 00:06:10.329
a heavy southern drawl? It usually did fail.

00:06:10.410 --> 00:06:12.050
Or if someone chattered away like they just had

00:06:12.050 --> 00:06:14.569
six shots of espresso, the timing of the sound

00:06:14.569 --> 00:06:16.730
wave would be completely different from the computer's

00:06:16.730 --> 00:06:18.490
template. And the solution to that actually came

00:06:18.490 --> 00:06:21.339
from Soviet researchers, who invented a mathematical

00:06:21.339 --> 00:06:24.480
algorithm called dynamic time warping, or DTW.

00:06:24.639 --> 00:06:27.399
Dynamic time warping? Sounds like sci-fi. It

00:06:27.399 --> 00:06:30.920
really does. But the logic behind DTW is brilliant

00:06:30.920 --> 00:06:34.300
in its simplicity. If the computer holds a perfect

00:06:34.300 --> 00:06:37.139
half-second template of the word hello, and

00:06:37.139 --> 00:06:40.519
someone with a drawl says, hello, over two full

00:06:40.519 --> 00:06:42.800
seconds. Standard comparison fails because the

00:06:42.800 --> 00:06:44.779
peaks and valleys of the sound waves don't line

00:06:44.779 --> 00:06:48.199
up in time. Exactly. So DTW mathematically stretches,

00:06:48.509 --> 00:06:51.889
or warps, the shorter sequence to match the longer

00:06:51.889 --> 00:06:55.430
one. It bends the timeline nonlinearly so the

00:06:55.430 --> 00:06:57.870
machine can overlay the actual acoustic patterns

00:06:57.870 --> 00:07:00.850
and recognize the similarity despite the completely

00:07:00.850 --> 00:07:03.389
different speaking speeds.
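
NOTE
A minimal sketch of the DTW idea just described (see the exchange above), under a big simplification: each sequence is a one-dimensional loudness curve, whereas real recognizers compare vectors of spectral features per frame. The sequences and numbers are invented for illustration.
import numpy as np
def dtw_distance(a, b):
    # Cost table: cost[i, j] = best alignment of a[:i] with b[:j].
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])   # local mismatch between frames
            # A frame may repeat (stretch) or be skipped (compress) in time.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]
fast = [0.0, 1.0, 0.0]                  # "hello" said in half a second
slow = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]   # the same shape, drawn out
print(dtw_distance(fast, slow))         # 0.0: a perfect match once warped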

00:07:03.389 --> 00:07:05.740
Okay, dynamic time warping is a great trick. But the sources show

00:07:05.740 --> 00:07:08.139
that the real paradigm shift, like the moment

00:07:08.139 --> 00:07:10.860
modern speech recognition was truly born, happened

00:07:10.860 --> 00:07:13.439
when researchers decided to stop trying to perfectly

00:07:13.439 --> 00:07:15.360
emulate the human brain. Right. They abandoned

00:07:15.360 --> 00:07:18.199
the biological approach. Yeah. They stopped trying to map individual

00:07:18.199 --> 00:07:21.079
words and instead just threw pure, hard statistics

00:07:21.079 --> 00:07:23.839
at the problem. Enter hidden Markov models or

00:07:23.839 --> 00:07:27.420
HMMs. In the 1970s, researchers like James and

00:07:27.420 --> 00:07:29.920
Janet Baker at Carnegie Mellon and later Fred

00:07:29.920 --> 00:07:33.379
Jelinek's team at IBM brought hidden Markov models

00:07:33.379 --> 00:07:37.300
into the world. An HMM fundamentally treats

00:07:37.300 --> 00:07:40.600
a speech signal as a short-time stationary process.

00:07:40.800 --> 00:07:42.560
Okay, I want to make sure we picture this correctly

00:07:42.560 --> 00:07:45.000
because this mechanism is the foundation for

00:07:45.000 --> 00:07:47.939
everything that follows. Think of a human running.

00:07:48.139 --> 00:07:51.300
Okay. If you just watch them run, it's a fluid,

00:07:51.639 --> 00:07:54.500
continuous blur of motion. It's incredibly hard

00:07:54.500 --> 00:07:57.100
to analyze every single muscle movement in real

00:07:57.100 --> 00:08:00.519
time. But if you film them and then chop that

00:08:00.519 --> 00:08:03.800
film down into a flip book of tiny, frozen snapshots,

00:08:04.319 --> 00:08:06.439
suddenly you can study the exact mechanics of

00:08:06.439 --> 00:08:08.279
their stride. That's a great analogy. That's

00:08:08.279 --> 00:08:10.920
exactly what HMMs do to sound. They slice an

00:08:10.920 --> 00:08:13.699
continuous acoustic sound wave into tiny, rigid

00:08:14.089 --> 00:08:16.649
10-millisecond frames. 10 milliseconds? Yeah,

00:08:16.730 --> 00:08:18.829
microscopic. In that 10-millisecond slice, the

00:08:18.829 --> 00:08:20.910
sound wave isn't flowing anymore. It's mathematically

00:08:20.910 --> 00:08:22.850
frozen. It is stationary. And once you have those

00:08:22.850 --> 00:08:26.730
frozen slices? The HMM uses pure probability.

00:08:27.589 --> 00:08:29.670
It looks at the acoustic data in that single

00:08:29.670 --> 00:08:32.850
frame and calculates the statistical likelihood

00:08:32.850 --> 00:08:36.590
of what phoneme, the smallest basic unit of sound,

00:08:36.990 --> 00:08:39.669
like the K sound in cat, is occurring in that

00:08:39.669 --> 00:08:42.840
exact fraction of a second. Ah, okay. Then it

00:08:42.840 --> 00:08:44.879
looks at the next frame and the next, stringing

00:08:44.879 --> 00:08:46.899
those probabilities together to guess the sequence

00:08:46.899 --> 00:08:50.139
of sounds and eventually the word.
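
NOTE
A tiny Viterbi decode over two made-up phoneme states, just to make "stringing those probabilities together" concrete. Every probability here is invented for illustration; real HMM recognizers use many states per phoneme and continuous acoustic features, not two quantized frame types.
import numpy as np
# Two made-up phoneme states and two quantized frame types (0 and 1).
states = ["k", "ae"]
trans = np.log([[0.6, 0.4],    # P(next state | current state)
                [0.1, 0.9]])
emit = np.log([[0.8, 0.2],     # P(frame type | state)
               [0.3, 0.7]])
frames = [0, 0, 1, 1, 1]       # five frozen 10-millisecond slices
# Viterbi: track the most probable state path, one frame at a time.
v = np.log([0.5, 0.5]) + emit[:, frames[0]]
paths = [[0], [1]]
for f in frames[1:]:
    scores = v[:, None] + trans             # every (previous, next) pair
    best_prev = scores.argmax(axis=0)       # best predecessor per state
    v = scores.max(axis=0) + emit[:, f]
    paths = [paths[p] + [s] for s, p in enumerate(best_prev)]
print([states[s] for s in paths[v.argmax()]])  # ['k', 'k', 'ae', 'ae', 'ae']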

00:08:50.139 --> 00:08:52.379
The historical context around this is wild, by the way. The

00:08:52.379 --> 00:08:54.980
linguistics community absolutely hated this approach.

00:08:55.059 --> 00:08:57.500
There was massive academic drama. Oh, the linguists

00:08:57.500 --> 00:09:00.659
were furious. At the time, this purely statistical

00:09:00.659 --> 00:09:03.639
approach was viewed as almost insulting to human

00:09:03.639 --> 00:09:06.639
intelligence. How so? Well, linguists argued

00:09:06.639 --> 00:09:08.980
that HMMs were far too simplistic to account

00:09:08.980 --> 00:09:11.679
for the true nuanced complexities of language.

00:09:12.159 --> 00:09:14.700
Human language is built on syntax, semantics,

00:09:15.080 --> 00:09:17.679
deep structural rules, I mean... And HMMs completely

00:09:17.679 --> 00:09:20.059
ignored all of that. Completely. They didn't

00:09:20.059 --> 00:09:22.059
care about the definition of a word. They only

00:09:22.059 --> 00:09:24.240
cared about the statistical probability of one

00:09:24.240 --> 00:09:26.980
sound following another sound. Fred Jelinek's

00:09:26.980 --> 00:09:29.179
overriding philosophy was essentially, ignore

00:09:29.179 --> 00:09:30.960
the grammar rules and let the data do the talking.

00:09:31.240 --> 00:09:33.639
He famously joked that every time he fired a

00:09:33.639 --> 00:09:36.000
linguist, the performance of the speech recognizer

00:09:36.000 --> 00:09:39.679
went up. Yes. And he was right. The brute force

00:09:39.679 --> 00:09:42.980
statistics of HMMs completely replaced the clever

00:09:42.980 --> 00:09:45.720
stretching of dynamic time warping and dominated

00:09:45.720 --> 00:09:48.720
the entire industry for decades. HMMs got us

00:09:48.720 --> 00:09:51.659
through the 80s and 90s. They gave us early dictation

00:09:51.659 --> 00:09:54.000
software and those automated phone trees where

00:09:54.000 --> 00:09:56.759
we all end up yelling representative into the

00:09:56.759 --> 00:09:59.960
receiver. We've all been there. By the 2000s,

00:10:00.120 --> 00:10:03.039
HMMs hit a wall. The error rates just stopped

00:10:03.039 --> 00:10:05.220
dropping. The source material notes the math

00:10:05.220 --> 00:10:07.600
was suffering from diminishing gradients and weak

00:10:07.600 --> 00:10:10.240
temporal correlation. Let's unpack that. OK,

00:10:10.299 --> 00:10:12.320
so to understand gradient diminishing, imagine

00:10:12.320 --> 00:10:15.320
playing a massive game of telephone. You whisper

00:10:15.320 --> 00:10:17.759
a complex sentence down a line of 100 people.

00:10:18.259 --> 00:10:20.240
By the time it reaches the 100th person, the

00:10:20.240 --> 00:10:22.299
original meaning is completely lost or distorted.

00:10:22.620 --> 00:10:25.620
Right. In machine learning, as an algorithm passes

00:10:25.620 --> 00:10:27.740
information backward through its layers to learn

00:10:27.740 --> 00:10:30.100
and correct its errors, which is a process involving

00:10:30.100 --> 00:10:32.899
mathematical gradients, that signal fades over

00:10:32.899 --> 00:10:35.740
time. The system literally forgets the context.
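
NOTE
The telephone game in a few lines of Python: a toy demonstration only, not an actual network. The 0.9 is an invented stand-in for the small per-layer factor that multiplies the gradient at every hand-off.
# A correction signal passed backward through 100 layers, shrinking a
# little at every step, like the whisper in the telephone game.
signal = 1.0
for layer in range(100):
    signal *= 0.9
print(f"after 100 layers: {signal:.6f}")   # ~0.000027, the context is gone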

00:10:36.100 --> 00:10:38.580
Wow. And for speech, where the meaning of a word

00:10:38.580 --> 00:10:40.659
at the end of a sentence depends entirely on

00:10:40.659 --> 00:10:42.799
a word spoken 20 seconds earlier at the beginning.

00:10:42.980 --> 00:10:46.230
That loss of memory is catastrophic. Right. So,

00:10:46.470 --> 00:10:49.009
to get to the highly capable voice assistants

00:10:49.009 --> 00:10:52.289
you and I rely on today, machines needed an entirely

00:10:52.289 --> 00:10:55.049
new kind of architecture. And that architecture

00:10:55.049 --> 00:10:58.250
was deep learning. Yes. Moving into the 2010s,

00:10:58.309 --> 00:11:01.190
we see the rise of deep neural networks, specifically

00:11:01.190 --> 00:11:05.309
long short-term memory networks, or LSTMs. LSTMs

00:11:05.309 --> 00:11:07.909
directly solve that vanishing gradient problem.

00:11:08.350 --> 00:11:10.269
They are engineered with internal mechanisms

00:11:10.269 --> 00:11:12.429
called gates that act like a digital notepad.

00:11:12.789 --> 00:11:15.490
A digital notepad, okay. Yeah. These gates decide

00:11:15.490 --> 00:11:17.450
what information is important enough to keep

00:11:17.450 --> 00:11:19.429
and what should be thrown away, allowing the

00:11:19.429 --> 00:11:21.730
network to actively remember events that happened

00:11:21.730 --> 00:11:25.250
thousands of discrete time steps earlier.
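
NOTE
One LSTM cell step sketched in NumPy, with random weights and the bias terms omitted for brevity. The point is only to show the forget, input, and output gates acting on the cell state, the "digital notepad"; the hidden size and inputs are made up.
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def lstm_step(x, h, c, W):
    z = W @ np.concatenate([x, h])     # one matrix computes all four gates
    f, i, o, g = np.split(z, 4)        # forget, input, output, candidate
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # erase and write the notepad
    h = sigmoid(o) * np.tanh(c)                   # read the notepad out
    return h, c
rng = np.random.default_rng(0)
dim = 4                                # hypothetical hidden size
W = rng.normal(size=(4 * dim, 2 * dim))
h, c = np.zeros(dim), np.zeros(dim)
for step in range(1000):               # the cell state c survives every step
    h, c = lstm_step(rng.normal(size=dim), h, c, W)
print(h.round(3))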

00:11:25.250 --> 00:11:27.450
That memory allows the AI to maintain the context

00:11:27.450 --> 00:11:30.169
of a full paragraph. It also changed how these

00:11:30.169 --> 00:11:32.090
systems were built from the ground up, right?

00:11:32.490 --> 00:11:35.990
With the old statistical HMMs, engineers had

00:11:35.990 --> 00:11:38.769
to manually build an acoustic model, a separate

00:11:38.769 --> 00:11:40.809
pronunciation dictionary to teach it how words

00:11:40.809 --> 00:11:43.750
sound, and a massive language model that took

00:11:43.750 --> 00:11:46.590
up gigabytes of memory just to guess word order.

00:11:46.990 --> 00:11:49.330
It was very disjointed. But deep learning brought

00:11:49.330 --> 00:11:52.509
us end-to-end models. A prime example is the

00:11:52.509 --> 00:11:55.509
LAS model, which stands for Listen, Attend, and

00:11:55.509 --> 00:11:57.960
Spell. Instead of engineers manually cobbling

00:11:57.960 --> 00:12:00.480
together three different clunky components, an

00:12:00.480 --> 00:12:03.379
end-to-end model learns everything simultaneously.

00:12:03.600 --> 00:12:05.759
That's incredible. It listens to the acoustic

00:12:05.759 --> 00:12:08.559
signal, uses an attention mechanism to focus

00:12:08.559 --> 00:12:11.620
on the most relevant parts of that specific audio

00:12:11.620 --> 00:12:14.419
snippet, and then directly spells out the transcript.

00:12:14.840 --> 00:12:17.539
It absorbs the raw audio and the correct text

00:12:17.539 --> 00:12:19.980
transcription, and it builds its own internal

00:12:19.980 --> 00:12:22.019
unreadable rules for how to connect the two.

00:12:22.139 --> 00:12:25.649
And the results were staggering. In 2017, Microsoft

00:12:25.649 --> 00:12:27.850
hit a milestone that sounds like pure science

00:12:27.850 --> 00:12:30.889
fiction. They achieved human parity on the Switchboard

00:12:30.889 --> 00:12:32.990
conversational task. A massive breakthrough.

00:12:33.370 --> 00:12:35.610
They got their deep learning models error rate

00:12:35.610 --> 00:12:39.009
down to roughly four percent, which exactly matched

00:12:39.009 --> 00:12:42.210
the error rate of four professional human transcribers

00:12:42.210 --> 00:12:44.149
working together to double check each other's

00:12:44.149 --> 00:12:47.759
work. Right. But I wanna push back on this term,

00:12:48.019 --> 00:12:50.539
human parity. Okay, let's hear it. If a machine

00:12:50.539 --> 00:12:53.320
matches my error rate in transcribing a conversation,

00:12:53.460 --> 00:12:55.440
does it actually comprehend the conversation?

00:12:55.919 --> 00:12:58.500
Or is it more like a giant, incredibly complex

00:12:58.500 --> 00:13:01.419
pinball machine? A pinball machine? Yeah, like

00:13:01.419 --> 00:13:04.299
the audio goes in, bounces off a million mathematical

00:13:04.299 --> 00:13:07.159
bumpers, and lands in the exact right text slot,

00:13:07.460 --> 00:13:10.120
but the machine has zero concept of what a slot

00:13:10.120 --> 00:13:12.940
even is. I mean, the pinball machine is a highly

00:13:12.940 --> 00:13:15.659
accurate way to look at it. It is vital to recognize

00:13:15.559 --> 00:13:18.080
that the machine does not understand meaning

00:13:18.080 --> 00:13:21.379
in any biological or cognitive sense. It doesn't

00:13:21.379 --> 00:13:23.700
know what a dog is or what sadness sounds like.

00:13:23.720 --> 00:13:26.059
Right, it's just math. What these deep neural

00:13:26.059 --> 00:13:29.360
networks are doing is building incredibly sophisticated

00:13:29.360 --> 00:13:32.740
topographical maps of sound. They are layering

00:13:32.740 --> 00:13:34.980
millions of mathematical weights to recognize

00:13:34.980 --> 00:13:38.580
nonlinear patterns. So it doesn't possess a human

00:13:38.580 --> 00:13:41.340
mind, but it has mapped the acoustic landscape

00:13:41.340 --> 00:13:43.980
so perfectly that it can navigate it just as

00:13:43.980 --> 00:13:46.340
well as we can. It is mimicking understanding

00:13:46.340 --> 00:13:49.960
through sheer geometric complexity. But let's

00:13:49.960 --> 00:13:52.899
pull this technology out of the pristine Microsoft

00:13:52.899 --> 00:13:56.100
testing labs and put it into the messy real world.

00:13:56.200 --> 00:13:58.840
Let's do it. Because the stakes change dramatically

00:13:58.840 --> 00:14:01.240
when you leave the lab. I want to know how this

00:14:01.240 --> 00:14:03.360
mathematical marvel holds up when the environment

00:14:03.360 --> 00:14:06.480
gets noisy, highly stressful, or medically vital.

00:14:06.879 --> 00:14:09.139
The military applications detailed in the research

00:14:09.139 --> 00:14:11.379
serve as the ultimate stress test for this tech.

00:14:11.539 --> 00:14:13.899
I can imagine. Take the Eurofighter Typhoon.

00:14:14.000 --> 00:14:16.519
It utilizes voice commands to actively reduce

00:14:16.519 --> 00:14:19.139
pilot workload in the cockpit. A pilot flying

00:14:19.139 --> 00:14:21.419
at supersonic speeds can assign radar targets

00:14:21.419 --> 00:14:23.200
with two quick voice commands instead of looking

00:14:23.200 --> 00:14:25.559
down and hunting for a physical button on a screen.

00:14:25.769 --> 00:14:28.070
That sounds super convenient, but the cockpit

00:14:28.070 --> 00:14:30.509
of a fighter jet is a violently hostile acoustic

00:14:30.509 --> 00:14:33.549
environment. Exactly. When researchers tested

00:14:33.549 --> 00:14:37.029
these systems in the Swedish JAS 39 Gripen fighter

00:14:37.029 --> 00:14:40.450
jet, they found that pulling high G-loads literally

00:14:40.450 --> 00:14:43.149
crushed the pilot's lungs, altering their breathing

00:14:43.149 --> 00:14:46.309
and vocal tract so severely that the recognition

00:14:46.309 --> 00:14:49.070
accuracy plummeted. Which totally makes sense.

00:14:49.490 --> 00:14:51.669
But the source material highlights an amazing

00:14:51.669 --> 00:14:55.559
detail. A pilot speaking broken English did not

00:14:55.559 --> 00:14:57.860
negatively impact the system's accuracy at all.

00:14:58.259 --> 00:15:00.899
Only the physical g-force broke it. Right. Wait,

00:15:01.039 --> 00:15:03.960
so why would a machine care about g-force but

00:15:03.960 --> 00:15:06.559
completely ignore terrible grammar and a thick

00:15:06.559 --> 00:15:08.679
accent? Well, because the deep learning model

00:15:08.679 --> 00:15:11.100
isn't grading an English exam. It isn't looking

00:15:11.100 --> 00:15:13.360
for a dictionary -perfect pronunciation. It is

00:15:13.360 --> 00:15:15.620
looking for consistent acoustic patterns. Ah.

00:15:15.820 --> 00:15:18.259
If a pilot consistently mispronounces target the

00:15:18.259 --> 00:15:20.580
same way every time, the math still aligns with the model's

00:15:20.580 --> 00:15:22.929
internal topography. The pattern is reliable.

00:15:23.289 --> 00:15:25.590
But g-force physically deforms the human body.

00:15:25.730 --> 00:15:27.929
Yes. It changes the physical shape of the vocal

00:15:27.929 --> 00:15:29.850
tract and the pressure of the air being expelled.

00:15:30.429 --> 00:15:32.529
The acoustic pattern itself warps unpredictably

00:15:32.529 --> 00:15:34.970
and the math falls apart. So it is fundamentally

00:15:34.970 --> 00:15:38.190
an issue of physical load. And we see that same

00:15:38.190 --> 00:15:40.690
reliance on the technology to reduce load in

00:15:40.690 --> 00:15:43.009
healthcare. Huge impact there. Following the

00:15:43.009 --> 00:15:46.269
2009 ARRA standards that pushed hospitals to

00:15:46.269 --> 00:15:49.350
adopt electronic health records, speech recognition

00:15:49.350 --> 00:15:52.049
became a critical lifeline for doctors drowning

00:15:52.049 --> 00:15:55.070
in paperwork. Yeah, in radiology, doctors use

00:15:55.070 --> 00:15:58.629
voice macros. Saying a single short phrase like

00:15:58.629 --> 00:16:02.490
normal report triggers the software to automatically

00:16:02.490 --> 00:16:05.049
populate a massive amount of structured medical

00:16:05.049 --> 00:16:08.190
boilerplate text, saving immense amounts of administrative

00:16:08.190 --> 00:16:10.759
time. The underlying theme across these environments

00:16:10.759 --> 00:16:13.639
is cognitive and physical friction. Whether it

00:16:13.639 --> 00:16:16.659
is a pilot pulling 5 Gs, a radiologist reviewing

00:16:16.659 --> 00:16:19.580
hundreds of scans, or you standing in your kitchen

00:16:19.580 --> 00:16:21.759
with your hands completely covered in flour yelling

00:16:21.759 --> 00:16:23.799
at a smart speaker to set a 10-minute timer.

00:16:24.360 --> 00:16:26.320
The technology exists to bypass the physical

00:16:26.320 --> 00:16:29.080
limits of human hands and eyes. And for individuals

00:16:29.080 --> 00:16:31.799
with disabilities, bypassing those physical limits

00:16:31.799 --> 00:16:34.320
isn't just a kitchen convenience, it's an absolute

00:16:34.320 --> 00:16:36.759
necessity. It's life-changing. The research

00:16:36.759 --> 00:16:39.360
points out that sufferers of severe repetitive

00:16:39.360 --> 00:16:42.460
strain injury, or RSI, were actually the urgent

00:16:42.460 --> 00:16:45.100
early adopters who funded and drove this market

00:16:45.100 --> 00:16:47.850
in the early days. For someone whose physical

00:16:47.850 --> 00:16:50.409
disability precludes using a keyboard, voice

00:16:50.409 --> 00:16:52.610
commands are the only bridge to independence.

00:16:52.990 --> 00:16:54.909
It allows hands-free navigation of a digital

00:16:54.909 --> 00:16:57.690
world, and it powers deaf telephony for real-time

00:16:57.690 --> 00:17:00.850
captioning. It's also reshaping education,

00:17:01.090 --> 00:17:04.190
specifically in language learning. How so? Modern

00:17:04.190 --> 00:17:06.309
pronunciation assessment software can listen

00:17:06.309 --> 00:17:09.390
to a student and grade their speech. But the

00:17:09.390 --> 00:17:12.319
philosophical focus has shifted. The algorithms

00:17:12.319 --> 00:17:14.900
are no longer programmed to demand a perfect

00:17:14.900 --> 00:17:17.680
standardized native accent. Instead, they assess

00:17:17.680 --> 00:17:19.900
core intelligibility. Meaning what, exactly?

00:17:20.140 --> 00:17:22.920
The machine asks, can the core sequence of phonemes

00:17:22.920 --> 00:17:26.099
be mathematically understood? If so, it's correct.

00:17:26.500 --> 00:17:28.519
Though the sources make a point to say it's not

00:17:28.519 --> 00:17:31.440
a magical silver bullet. While dictation software

00:17:31.440 --> 00:17:33.900
can be a massive help to students with dyslexia

00:17:33.900 --> 00:17:36.480
who struggle with spelling, the software's inevitable

00:17:36.480 --> 00:17:38.960
mistakes can actually cause severe frustration.

00:17:39.000 --> 00:17:41.839
Oh, absolutely. For a user with a learning disability,

00:17:42.180 --> 00:17:45.420
having to stop, grab a mouse, highlight a misheard

00:17:45.420 --> 00:17:49.000
word, and manually fix it takes significantly

00:17:49.000 --> 00:17:51.720
more cognitive effort and time than just typing

00:17:51.720 --> 00:17:53.779
it slowly in the first place. Which brings us

00:17:53.779 --> 00:17:55.859
to a critical reality check about the state of

00:17:55.859 --> 00:17:58.880
the art. Despite hitting that coveted human parity

00:17:58.880 --> 00:18:01.900
metric in a quiet laboratory, these systems are

00:18:01.900 --> 00:18:04.019
still incredibly fallible in the wild. Yeah,

00:18:04.019 --> 00:18:06.819
they are. And they contain massive blind spots

00:18:06.819 --> 00:18:09.720
that make them vulnerable to both innocent, highly

00:18:09.720 --> 00:18:12.759
annoying mistakes, and active malicious attacks.

00:18:13.240 --> 00:18:15.400
Let's talk about those blind spots. The standard

00:18:15.400 --> 00:18:18.700
industry metric here is word error rate, or WER.

00:18:18.990 --> 00:18:20.950
The research shows that the error rate shoots

00:18:20.950 --> 00:18:24.369
up astronomically as your vocabulary grows. Recognizing

00:18:24.369 --> 00:18:26.950
the digits 0 through 9 is almost mathematically

00:18:26.950 --> 00:18:30.250
perfect. But ask a system to handle a 100,000

00:18:30.250 --> 00:18:33.529
-word vocabulary, and you might see a 45% error rate.
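
NOTE
Word error rate is plain word-level edit distance: substitutions, deletions, and insertions divided by the length of the reference. A self-contained sketch; the example sentences are invented.
def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Edit-distance table between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        for j in range(len(hyp) + 1):
            if i == 0 or j == 0:
                d[i][j] = i + j
            else:
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deleted word
                              d[i][j - 1] + 1,          # inserted word
                              d[i - 1][j - 1] + sub)    # substituted word
    return d[-1][-1] / len(ref)
print(word_error_rate("set a ten minute timer",
                      "set a tent minute time"))   # 0.4: 2 errors in 5 words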

00:18:33.650 --> 00:18:36.509
But the most infamous, persistent blind spot in

00:18:36.509 --> 00:18:38.450
speech recognition is something called the E-set.

00:18:38.710 --> 00:18:40.950
The E-set refers to the specific English letters

00:18:40.950 --> 00:18:45.289
that rhyme with the letter E: B, C, D, G, P,

00:18:45.569 --> 00:18:49.589
T, V, Z. Historically, and even today, speech

00:18:49.589 --> 00:18:52.329
recognition systems fail miserably at telling

00:18:52.329 --> 00:18:54.279
these letters apart. If you're listening to this

00:18:54.279 --> 00:18:56.460
right now, I want you to try something. Say the

00:18:56.460 --> 00:18:59.359
letter B out loud. Now say the letter V. Notice

00:18:59.359 --> 00:19:01.779
how 90% of the sound actually coming out of

00:19:01.779 --> 00:19:04.460
your mouth for both letters is just a long, sustained

00:19:04.460 --> 00:19:08.319
E sound. To a human ear, the tiny pop of your

00:19:08.319 --> 00:19:10.200
lips at the beginning of B, or the vibration of

00:19:10.200 --> 00:19:13.059
your teeth for V, is obvious. We pick up on the

00:19:13.059 --> 00:19:16.220
context. But why do these massive deep learning

00:19:16.220 --> 00:19:18.559
models with their long short-term memory and

00:19:18.559 --> 00:19:21.099
thousands of time steps still fail at something

00:19:21.099 --> 00:19:23.269
a kindergarten student can do effortlessly? It

00:19:23.269 --> 00:19:25.809
comes down to how machines visualize sound. Remember

00:19:25.809 --> 00:19:28.809
those 10-millisecond slices of audio? The spectrograms.

00:19:29.009 --> 00:19:31.950
The flipbook. Exactly. Deep learning still fundamentally

00:19:31.950 --> 00:19:34.349
relies on plotting volume and frequency over

00:19:34.349 --> 00:19:36.670
time. Because the letters B and V have nearly

00:19:36.670 --> 00:19:38.970
identical, incredibly loud vowel sounds attached

00:19:38.970 --> 00:19:41.849
to them, the sustained E absolutely dominates

00:19:41.849 --> 00:19:44.109
the graph. So the consonant gets... Mathematically,

00:19:44.609 --> 00:19:47.029
the tiny consonant burst at the beginning is

00:19:47.029 --> 00:19:49.849
treated like a rounding error. The machine's

00:19:49.849 --> 00:19:52.210
topography for B and V looks nearly identical.
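
NOTE
A numerical toy for the rounding-error point above, assuming NumPy. It fakes a "B" as a quiet 20 ms lip pop followed by a long sustained vowel, chops the audio into frozen 10 ms slices, and compares energies; every signal parameter here is invented.
import numpy as np
fs = 16000                              # assumed sample rate
frame = int(0.010 * fs)                 # one 10-millisecond slice
t = np.arange(fs // 2) / fs             # half a second of fake audio
# A "B": a quiet 20 ms lip pop, then a long, loud sustained vowel.
pop = np.where(t < 0.02, 0.2 * np.sin(2 * np.pi * 300 * t), 0.0)
vowel = np.where(t >= 0.02, np.sin(2 * np.pi * 280 * t), 0.0)
audio = pop + vowel
# The flip book: energy inside each frozen slice.
frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
energy = (frames ** 2).sum(axis=1)
print(f"pop: {energy[:2].sum():.1f} vs vowel: {energy[2:].sum():.1f}")  # roughly 6.4 vs 3840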

00:19:52.730 --> 00:19:55.450
Its absolute reliance on raw math over human

00:19:55.450 --> 00:19:58.509
intuition is exactly what makes it blind. And

00:19:58.509 --> 00:20:01.589
that precise reliance on acoustic math is exactly

00:20:01.589 --> 00:20:03.730
what modern hackers are weaponizing. This part

00:20:03.730 --> 00:20:07.130
is scary. Because speech recognition is now woven

00:20:07.130 --> 00:20:09.730
into the fabric of our homes, our cars, and our

00:20:09.730 --> 00:20:13.089
phones, these mathematical blind spots are severe

00:20:13.089 --> 00:20:16.369
security risks. We've all experienced the innocent

00:20:16.369 --> 00:20:18.230
version of this, an accidental trigger where

00:20:18.230 --> 00:20:20.089
a television commercial says, Alexa, and suddenly

00:20:20.089 --> 00:20:22.750
your living room wakes up. Right. But researchers

00:20:22.750 --> 00:20:25.609
have demonstrated active, targeted attacks that

00:20:25.609 --> 00:20:28.269
are terrifyingly clever. The artificial sound

00:20:28.269 --> 00:20:30.190
attacks detailed in the source material blew

00:20:30.190 --> 00:20:33.250
my mind. Hackers can transmit ultrasound frequencies.

00:20:33.450 --> 00:20:35.390
These are acoustic waves that are completely

00:20:35.390 --> 00:20:37.809
inaudible to the human ear. But not to a machine.

00:20:38.190 --> 00:20:40.250
Right. The physical microphone on your smart

00:20:40.250 --> 00:20:43.539
speaker can still pick them up. The AI intercepts

00:20:43.539 --> 00:20:46.079
the ultrasound, mathematically translates those

00:20:46.079 --> 00:20:48.880
frequencies into a valid command, and executes

00:20:48.880 --> 00:20:50.900
it without you ever hearing a single sound in

00:20:50.900 --> 00:20:53.660
the room. It's silent. They can silently tell

00:20:53.660 --> 00:20:56.039
your phone to open your calendar, read your messages,

00:20:56.299 --> 00:20:59.099
or make purchases. They can also execute attacks

00:20:59.099 --> 00:21:03.200
by hiding tiny, specifically calculated distortions

00:21:03.200 --> 00:21:06.410
inside normal audio. To your human ear, it just

00:21:06.410 --> 00:21:08.089
sounds like a standard pop song playing on the

00:21:08.089 --> 00:21:10.849
radio. Just music. Just music. But to the deep

00:21:10.849 --> 00:21:13.130
learning neural network, those hidden mathematical

00:21:13.130 --> 00:21:15.569
distortions overlay perfectly onto the acoustic

00:21:15.569 --> 00:21:18.609
map for the phrase, unlock the front door. The

00:21:18.609 --> 00:21:20.650
machine's incredible mathematical precision is

00:21:20.650 --> 00:21:23.890
turned against it. It is a wild, delicate balance

00:21:23.890 --> 00:21:27.089
between ultimate convenience and invisible vulnerability.

00:21:27.690 --> 00:21:30.950
We have traveled a massive distance today, from

00:21:30.950 --> 00:21:34.650
the 16-word IBM Shoebox at the 1962 World's

00:21:34.650 --> 00:21:37.349
Fair, through the timeline-stretching math of

00:21:37.349 --> 00:21:40.670
dynamic time warping, into the frozen 10-millisecond

00:21:40.670 --> 00:21:43.329
slices of hidden Markov models, all the way to

00:21:43.329 --> 00:21:45.089
the deep neural networks that sit in your kitchen

00:21:45.089 --> 00:21:48.089
today, quietly mapping the topography of your

00:21:48.089 --> 00:21:50.170
voice. It really highlights that every single

00:21:50.170 --> 00:21:52.470
time you use a voice assistant, you are interacting

00:21:52.470 --> 00:21:55.349
with decades of compounding, invisible mathematics.

00:21:55.609 --> 00:21:58.509
So, to you, the listener: the next time your device

00:21:58.509 --> 00:22:01.809
totally misunderstands a word or infuriatingly

00:22:01.809 --> 00:22:04.029
confuses a B for a V while you're spelling a

00:22:04.029 --> 00:22:06.769
password, don't just get mad at it. Remember

00:22:06.769 --> 00:22:09.549
the 10-millisecond slices. Remember the deep

00:22:09.549 --> 00:22:11.710
neural networks working tirelessly behind the

00:22:11.710 --> 00:22:13.890
scenes, bouncing your audio around a billion

00:22:13.890 --> 00:22:16.829
mathematical bumpers, desperately trying to decode

00:22:16.829 --> 00:22:19.529
the fluid, messy reality of human speech into

00:22:19.529 --> 00:22:22.269
cold, hard data. It is doing the absolute best

00:22:22.269 --> 00:22:24.690
it can with the geometry it has. But before we

00:22:24.690 --> 00:22:27.849
wrap up, I want to leave you with one final mind

00:22:27.849 --> 00:22:30.349
-bending detail from our source material about

00:22:30.349 --> 00:22:32.750
where this entire field is heading next. Oh,

00:22:32.869 --> 00:22:35.410
this is the best part. In 2018, researchers at

00:22:35.410 --> 00:22:38.130
the MIT Media Lab announced a prototype device

00:22:38.130 --> 00:22:41.289
called AlterEgo. It does not use a microphone

00:22:41.289 --> 00:22:44.529
at all. Instead, it uses medical-grade electrodes

00:22:44.529 --> 00:22:48.130
placed precisely on your jaw and face to read

00:22:48.130 --> 00:22:50.450
the neuromuscular signals you generate when you

00:22:50.450 --> 00:22:53.309
simply subvocalize. And subvocalization is when

00:22:53.309 --> 00:22:55.269
you talk to yourself in your head. You don't

00:22:55.269 --> 00:22:56.769
make a sound. You don't even open your mouth.

00:22:57.069 --> 00:23:00.210
But your brain still sends tiny electrical signals

00:23:00.210 --> 00:23:03.009
to your vocal cords and facial muscles, preparing

00:23:03.009 --> 00:23:05.990
to speak the word. And the AlterEgo device intercepts

00:23:05.990 --> 00:23:07.890
those electrical signals before they ever become

00:23:07.890 --> 00:23:11.059
sound. Right. The researchers trained a convolutional

00:23:11.059 --> 00:23:13.799
neural network, which is a type of AI normally

00:23:13.799 --> 00:23:16.539
used to recognize visual patterns in images like

00:23:16.539 --> 00:23:19.039
finding a face in a photograph, to treat those

00:23:19.039 --> 00:23:21.700
electrical muscle signals like an image. It looks

00:23:21.700 --> 00:23:23.859
for the visual pattern of the electricity and

00:23:23.859 --> 00:23:26.579
translates those silent physical signals directly

00:23:26.579 --> 00:23:28.720
into text. It's incredible. Think back to where

00:23:28.720 --> 00:23:32.509
we started this deep dive: that basic human expectation

00:23:32.509 --> 00:23:35.670
that speech is a fluid, invisible wave of sound

00:23:35.670 --> 00:23:38.630
traveling through the air. If a machine can translate

00:23:38.630 --> 00:23:40.930
the words you merely think about saying by reading

00:23:40.930 --> 00:23:42.670
the electricity in your face without you ever

00:23:42.670 --> 00:23:45.170
having to open your mouth, is the ultimate future

00:23:45.170 --> 00:23:48.029
of teaching machines how to listen actually entirely

00:23:48.029 --> 00:23:50.730
silent? Something to mull over. Until next time.
