WEBVTT

00:00:00.000 --> 00:00:03.379
So imagine you've got this room, right, and inside

00:00:03.379 --> 00:00:06.339
there are four professional human transcribers.

00:00:07.360 --> 00:00:09.839
They're sitting at their desks, and they are

00:00:09.839 --> 00:00:12.519
listening to this incredibly chaotic, just messy

00:00:12.519 --> 00:00:14.779
recording of a conversation. I mean, people are

00:00:14.779 --> 00:00:18.039
talking over each other, coughing, trailing off.

00:00:18.219 --> 00:00:20.079
A transcriber's nightmare, basically. Exactly.

00:00:20.280 --> 00:00:22.339
Total nightmare. And these transcribers are just

00:00:22.339 --> 00:00:24.480
typing as fast as they can, pooling all their

00:00:24.480 --> 00:00:27.579
expertise together to get every single word down

00:00:27.579 --> 00:00:31.329
perfectly. Now imagine a single machine processing

00:00:31.329 --> 00:00:33.850
that exact same audio and beating all four of

00:00:33.850 --> 00:00:36.570
them combined. Yeah, that's wild. Right. And

00:00:36.570 --> 00:00:39.950
the crazy thing is, that happened. In 2017, Microsoft

00:00:39.950 --> 00:00:42.509
hit this milestone that just completely shattered

00:00:42.509 --> 00:00:45.380
our expectations of what machines can do. They

00:00:45.380 --> 00:00:48.020
achieved something called human parity in conversational

00:00:48.020 --> 00:00:51.140
speech recognition, which means the machine transcribed

00:00:51.140 --> 00:00:53.640
it just as well as, if not better than, the humans.

00:00:54.079 --> 00:00:57.060
And it was a massive, just a massive paradigm

00:00:57.060 --> 00:01:00.259
shift because, I mean, we spent decades treating

00:01:00.259 --> 00:01:03.899
computers like toddlers, you know, just painstakingly

00:01:03.899 --> 00:01:06.280
trying to teach them the basic rules of vocabulary.

00:01:06.780 --> 00:01:09.359
And then suddenly, boom. They're outperforming

00:01:09.359 --> 00:01:11.680
a team of highly trained professionals. It's

00:01:11.680 --> 00:01:14.659
bonkers. Yeah. So welcome to the deep dive. Our

00:01:14.659 --> 00:01:18.239
mission today is to, well, really map out this

00:01:18.239 --> 00:01:20.840
incredible arc for you because we're going to

00:01:20.840 --> 00:01:23.219
look at the massive mountain of research detailing

00:01:23.219 --> 00:01:26.810
how we went from computers that could barely

00:01:26.810 --> 00:01:29.450
understand a single spoken digit in a completely

00:01:29.450 --> 00:01:32.930
silent room to machines that can literally read

00:01:32.930 --> 00:01:35.790
the neuromuscular signals of your unspoken thoughts.

00:01:36.049 --> 00:01:37.549
Which still sounds like science fiction, honestly.

00:01:37.590 --> 00:01:40.230
It really does. But we are pulling apart the

00:01:40.230 --> 00:01:42.409
algorithms, the history, and the high stakes

00:01:42.409 --> 00:01:44.969
real world applications of all this. But before

00:01:44.969 --> 00:01:46.709
we get into the timeline of how this actually

00:01:46.709 --> 00:01:48.510
happened, we really need to draw a very hard

00:01:48.510 --> 00:01:51.349
line in the sand between two concepts that people,

00:01:51.349 --> 00:01:54.109
you know, they constantly mix up. And that is

00:01:54.109 --> 00:01:56.290
speech recognition versus voice recognition.

00:01:56.609 --> 00:01:59.170
Yeah, they sound identical to most people, but

00:01:59.170 --> 00:02:02.209
functionally. They are worlds apart. Voice recognition

00:02:02.209 --> 00:02:06.030
is entirely about who is speaking. Right, like

00:02:06.030 --> 00:02:08.289
a security thing. Exactly. It is a biometric

00:02:08.289 --> 00:02:11.490
security measure. So when you call your bank

00:02:11.490 --> 00:02:15.620
and the automated system asks you to speak a

00:02:15.620 --> 00:02:18.580
phrase to verify your identity, it is actively

00:02:18.580 --> 00:02:21.580
analyzing the unique physical shape of your vocal

00:02:21.580 --> 00:02:24.620
cords, the resonance of your nasal cavity, your

00:02:24.620 --> 00:02:28.050
pitch. So it's looking for my specific biological

00:02:28.050 --> 00:02:30.129
signature. Precisely. It wants to know it's you.

00:02:30.270 --> 00:02:32.550
OK. Whereas speech recognition, which is our

00:02:32.550 --> 00:02:34.629
focus today, that doesn't care at all about who

00:02:34.629 --> 00:02:36.669
you are. Not even a little bit. Right. It is

00:02:36.669 --> 00:02:39.250
solely focused on what is being said. The automatic

00:02:39.250 --> 00:02:42.370
conversion of spoken language into text. So like,

00:02:42.389 --> 00:02:44.610
if I say the word banana, or if a child says

00:02:44.610 --> 00:02:46.310
it, or someone with a super thick accent says

00:02:46.310 --> 00:02:48.949
it, the system's only job is to just output the

00:02:48.949 --> 00:02:51.810
letters B-A-N-A-N-A. Exactly. It basically

00:02:51.810 --> 00:02:54.050
strips away the identity of the speaker so it

00:02:54.050 --> 00:02:57.360
can just isolate the semantic content. And

00:02:57.360 --> 00:02:59.479
teaching a machine to isolate that content has

00:02:59.479 --> 00:03:02.659
been, honestly, one of the most agonizingly complex

00:03:02.659 --> 00:03:05.060
challenges in the history of computer science.

00:03:05.180 --> 00:03:07.840
I bet. Which actually brings us back to the 1950s,

00:03:07.860 --> 00:03:09.560
because I was looking at the early research here,

00:03:09.560 --> 00:03:12.060
and I just have to wonder, how on earth did we

00:03:12.060 --> 00:03:14.759
teach a machine to hear words before we even

00:03:14.759 --> 00:03:17.819
had modern processors, let alone cloud computing?

00:03:18.400 --> 00:03:20.939
Well, to really appreciate that seamless voice

00:03:20.939 --> 00:03:23.139
assistant living inside your smartphone today,

00:03:23.659 --> 00:03:27.319
you have to understand the excruciatingly slow evolution

00:03:27.319 --> 00:03:30.199
of early speech tech. Right. We have to go back

00:03:30.199 --> 00:03:33.080
to 1952 at Bell Labs. They built this system

00:03:33.080 --> 00:03:36.030
called Audrey. And Audrey was the size of a massive

00:03:36.030 --> 00:03:38.349
filing cabinet. Oh, wow. Incredibly expensive.

00:03:38.530 --> 00:03:41.129
And her capability was almost comically limited.

00:03:41.409 --> 00:03:43.990
She could only recognize single digits from zero

00:03:43.990 --> 00:03:45.990
to nine. And even then, she only worked for the

00:03:45.990 --> 00:03:47.349
specific person who trained her, right? Like,

00:03:47.349 --> 00:03:49.050
if I walked up and said the number five, Audrey

00:03:49.050 --> 00:03:50.610
would have absolutely no idea what was going

00:03:50.610 --> 00:03:52.789
on. She wouldn't, because the mechanism behind

00:03:52.789 --> 00:03:55.629
Audrey was purely physical, purely acoustic.

00:03:56.490 --> 00:03:58.750
The engineers were, they were looking at the

00:03:58.750 --> 00:04:00.169
power spectrum of the voice. They were literally

00:04:00.169 --> 00:04:02.530
trying to match the physical shape of the sound

00:04:02.530 --> 00:04:06.020
waves. Oh, I see. But every single human being

00:04:06.020 --> 00:04:10.240
has a differently shaped vocal tract. So my sound

00:04:10.240 --> 00:04:14.349
wave for the number five looks physically, mathematically

00:04:14.349 --> 00:04:15.949
different than your sound wave for the number

00:04:15.949 --> 00:04:18.410
five. Which is just a terrible way to build a

00:04:18.410 --> 00:04:20.470
universal system. Yeah, it's impossible to scale.

00:04:20.589 --> 00:04:24.009
But they kept pushing. Yeah. Right. Because 10

00:04:24.009 --> 00:04:27.470
years later, in 1962, IBM debuts a machine called

00:04:27.470 --> 00:04:30.110
the Shoebox at the World's Fair. Right, the famous

00:04:30.110 --> 00:04:32.449
Shoebox. Yeah. And it was considered this absolute

00:04:32.449 --> 00:04:34.670
marvel at the time, but it only had a 16-word

00:04:34.670 --> 00:04:37.589
vocabulary. And the catch with all these early

00:04:37.589 --> 00:04:39.810
systems, the thing that makes them sound so incredibly

00:04:39.810 --> 00:04:42.819
robotic in those old archival footage clips

00:04:42.819 --> 00:04:45.779
is that they required users to awkwardly pause

00:04:45.779 --> 00:04:50.259
after every single word. Yes, because the machine

00:04:50.259 --> 00:04:52.579
needed absolute silence to know where one word

00:04:52.579 --> 00:04:54.860
ended and the next one began. Right. Continuous

00:04:54.860 --> 00:04:57.579
human speech is basically just a long uninterrupted

00:04:57.579 --> 00:04:59.699
stream of noise. Like, we don't actually put

00:04:59.699 --> 00:05:02.079
spaces between our words when we talk. We just

00:05:02.079 --> 00:05:05.000
run everything together. Exactly. And those early

00:05:05.000 --> 00:05:06.920
acoustic models just couldn't find the boundaries

00:05:06.920 --> 00:05:11.459
between words unless... the human artificially

00:05:11.459 --> 00:05:14.339
created a boundary with silence. So by the 1970s,

00:05:14.560 --> 00:05:17.680
DARPA gets involved. They fund this five-year

00:05:17.680 --> 00:05:20.000
project called Speech Understanding Research

00:05:20.000 --> 00:05:23.439
with a highly ambitious goal for that era, which

00:05:23.439 --> 00:05:26.600
was a minimum vocabulary of just 1,000 words.

00:05:26.680 --> 00:05:29.199
Which was huge back then. Huge. But to hit those

00:05:29.199 --> 00:05:31.560
kinds of numbers, trying to physically match

00:05:31.560 --> 00:05:34.370
sound waves like Audrey did. That just wasn't

00:05:34.370 --> 00:05:36.269
going to cut it. No, the map had to completely

00:05:36.269 --> 00:05:39.009
change. And the real pivot happens in the 1980s.

00:05:39.189 --> 00:05:41.769
There's a team at IBM led by a researcher named

00:05:41.769 --> 00:05:44.829
Fred Jelinek, and they completely abandoned the

00:05:44.829 --> 00:05:46.810
acoustic matching approach. They just threw it

00:05:46.810 --> 00:05:48.769
out. Completely threw it out. They stopped trying

00:05:48.769 --> 00:05:51.029
to emulate the human ear, and they stopped trying

00:05:51.029 --> 00:05:53.449
to teach the computer the grammatical rules of

00:05:53.449 --> 00:05:56.149
linguistics. Instead, they pivoted entirely to

00:05:56.149 --> 00:05:58.810
pure statistics. Specifically, they implement

00:05:58.810 --> 00:06:02.720
something called hidden Markov models, or HMMs.

00:06:02.759 --> 00:06:05.060
Okay, let's unpack this. Hidden Markov Models.

00:06:05.540 --> 00:06:08.879
This sounds incredibly dense, but really it's

00:06:08.879 --> 00:06:10.800
just about playing the odds, right? Pretty much,

00:06:10.839 --> 00:06:12.779
yeah. It's like trying to learn a language not

00:06:12.779 --> 00:06:14.759
by understanding the grammar or the emotion,

00:06:15.060 --> 00:06:17.579
but just by betting on which word is statistically

00:06:17.579 --> 00:06:20.500
most likely to come next based on millions of

00:06:20.500 --> 00:06:22.839
previous conversations. That is precisely what

00:06:22.839 --> 00:06:25.540
it is. An HMM doesn't know what a verb or a noun

00:06:25.540 --> 00:06:28.899
is. It just doesn't care. It takes your continuous

00:06:28.899 --> 00:06:32.779
stream of speech and treats it as a stationary

00:06:32.779 --> 00:06:35.839
signal over very, very short time scales. Meaning

00:06:35.839 --> 00:06:38.319
it assumes the sound isn't changing if you look

00:06:38.319 --> 00:06:40.680
at a small enough slice of time. Yes. It chops

00:06:40.680 --> 00:06:43.000
your speech up into tiny 10 millisecond frames.

00:06:43.100 --> 00:06:45.920
10 milliseconds? That's nothing! Exactly. In

00:06:45.920 --> 00:06:49.100
a 10 millisecond window, your vocal cords literally

00:06:49.100 --> 00:06:52.040
haven't moved enough to change the sound. So

00:06:52.040 --> 00:06:55.139
the HMM looks at that tiny slice and it uses

00:06:55.139 --> 00:06:58.360
probability to guess what basic sound, what phone

00:06:58.360 --> 00:07:01.120
is being made. Then it looks at the next 10 millisecond

00:07:01.120 --> 00:07:04.180
slice and it strings those guesses together and

00:07:04.180 --> 00:07:07.699
says statistically an S sound followed by a T

00:07:07.699 --> 00:07:10.079
sound is highly likely to be the start of the

00:07:10.079 --> 00:07:13.560
word stop. Oh wow. And what's fascinating here

00:07:13.560 --> 00:07:16.639
is that linguists absolutely hated this approach.

00:07:16.800 --> 00:07:19.459
Oh, I'm sure. Yeah, the statistical model was

00:07:19.459 --> 00:07:21.579
highly controversial in the academic community

00:07:21.579 --> 00:07:24.420
because it completely ignored the actual, you

00:07:24.420 --> 00:07:27.079
know, beautiful rules of human language. It was

00:07:27.079 --> 00:07:29.879
just a brute force mathematical bulldozer. But

00:07:29.879 --> 00:07:32.339
the bulldozer worked. I mean, by the mid 80s,

00:07:32.399 --> 00:07:35.160
this statistical pivot paved the way for IBM's

00:07:35.160 --> 00:07:37.899
Tangora, which was a voice activated typewriter

00:07:37.899 --> 00:07:40.720
that could handle a twenty-thousand-word vocabulary,

00:07:40.879 --> 00:07:43.019
which is an astronomical leap from the 16-word

00:07:43.019 --> 00:07:46.009
Shoebox. Right. But, as we know, probability

00:07:46.009 --> 00:07:49.310
works perfectly until humans get messy. So hidden

00:07:49.310 --> 00:07:51.610
Markov models eventually hit a brick wall. They

00:07:51.610 --> 00:07:54.050
did. They're great for a quiet room where someone

00:07:54.050 --> 00:07:56.050
is speaking really clearly, but if you want a

00:07:56.050 --> 00:07:58.790
machine to handle natural, fast speech in, like

00:07:58.790 --> 00:08:01.230
a noisy coffee shop with different accents and

00:08:01.230 --> 00:08:03.410
people talking over each other, slicing audio

00:08:03.410 --> 00:08:05.550
into 10 millisecond frames just isn't enough.

00:08:06.029 --> 00:08:08.550
No, because the machine needed something resembling

00:08:08.550 --> 00:08:12.410
memory. When you speak, the meaning of a word

00:08:12.670 --> 00:08:15.470
often depends entirely on the context of a word

00:08:15.470 --> 00:08:18.970
you said like 15 seconds ago. Right. If I say

00:08:18.970 --> 00:08:21.550
I saw the bat, am I talking about a baseball

00:08:21.550 --> 00:08:24.829
bat or the animal flying in a cave? The context

00:08:24.829 --> 00:08:28.050
determines it. And those older statistical systems

00:08:28.050 --> 00:08:30.389
suffered from something called the vanishing

00:08:30.389 --> 00:08:33.090
gradient problem. Wait, vanishing gradient. Let

00:08:33.090 --> 00:08:35.269
me see if I can translate this. Is that basically

00:08:35.269 --> 00:08:38.429
like playing a massive game of telephone? Oh,

00:08:38.429 --> 00:08:40.120
yeah. That's a good way to look at it. Right,

00:08:40.159 --> 00:08:41.919
like by the time the message gets to the end

00:08:41.919 --> 00:08:44.679
of a long line of people, the original context

00:08:44.679 --> 00:08:47.259
is completely lost. So the gradient of information

00:08:47.259 --> 00:08:50.299
just vanishes. That is a perfect analogy. As

00:08:50.299 --> 00:08:52.639
the neural network processes longer and longer

00:08:52.639 --> 00:08:55.320
sentences, the influence of the earlier words

00:08:55.320 --> 00:08:58.580
physically degrades in the math. Wow. The system

00:08:58.580 --> 00:09:00.460
literally forgets the beginning of your sentence

00:09:00.460 --> 00:09:02.279
before you even finish saying it. That's a huge

00:09:02.279 --> 00:09:05.059
problem. Which is why the 2010s sparked this

00:09:05.059 --> 00:09:08.549
total revolution. Researchers introduced deep

00:09:08.549 --> 00:09:11.509
neural networks, and specifically something called

00:09:11.509 --> 00:09:14.870
long short-term memory, or LSTM. Long short-

00:09:14.870 --> 00:09:17.429
term memory? I mean, that sounds like an oxymoron.

00:09:17.610 --> 00:09:20.830
It does, yeah. Yeah. But it revolutionized everything.

00:09:21.490 --> 00:09:24.429
LSTMs are special neural networks designed with

00:09:24.429 --> 00:09:27.870
internal gates. And these gates decide what information

00:09:27.870 --> 00:09:30.309
is important enough to keep in memory and what

00:09:30.309 --> 00:09:33.070
can just be thrown away. So they bypass the telephone

00:09:33.070 --> 00:09:36.190
game entirely? Exactly. They can remember an

00:09:36.190 --> 00:09:39.029
important contextual word from thousands of tiny

00:09:39.029 --> 00:09:42.149
time steps ago, which allows the machine to finally

00:09:42.149 --> 00:09:45.250
hold on to a complete thought. And this directly

00:09:45.250 --> 00:09:47.450
leads to the massive shift toward end-to-end

00:09:47.450 --> 00:09:49.370
learning, right? Like if you look at Google's

00:09:49.370 --> 00:09:51.929
Listen, Attend and Spell model, because before

00:09:51.929 --> 00:09:54.350
this, speech recognition systems were sort of

00:09:54.350 --> 00:09:56.529
like Frankenstein monsters. Oh, totally. They were

00:09:56.529 --> 00:09:58.629
heavily segmented. Yeah. You had an acoustic

00:09:58.629 --> 00:10:01.110
model built by one team, a pronunciation dictionary

00:10:01.110 --> 00:10:03.070
built by another team, and then a language model

00:10:03.070 --> 00:10:05.590
built by a third. They were incredibly clunky.

00:10:06.070 --> 00:10:08.690
But end-to-end learning just throws all of

00:10:08.690 --> 00:10:11.330
that in the trash. It does. A deep neural network

00:10:11.330 --> 00:10:14.559
just ingests the raw audio at one end and spits

00:10:14.559 --> 00:10:16.940
out the text at the other. It learns the entire

00:10:16.940 --> 00:10:19.620
internal process completely on its own. Right.

00:10:19.720 --> 00:10:22.179
It builds its own sophisticated representations

00:10:22.179 --> 00:10:25.299
across multiple hidden layers of computation.

00:10:26.080 --> 00:10:28.159
So human engineers no longer have to manually

00:10:28.159 --> 00:10:31.659
design the rules. We just feed it mountains of

00:10:31.659 --> 00:10:34.700
raw data and it finds the patterns by itself.

00:10:34.960 --> 00:10:37.399
And because the system is just looking for patterns

00:10:37.399 --> 00:10:39.840
in data, it doesn't even need audio anymore.

00:10:39.919 --> 00:10:42.679
Yes. Which brings us to the research out of Oxford

00:10:42.679 --> 00:10:47.679
in 2016. They produced LipNet, which is an end-

00:10:47.679 --> 00:10:49.919
to-end lip-reading model. It doesn't use microphones

00:10:49.919 --> 00:10:52.360
at all. It uses something called spatiotemporal

00:10:52.360 --> 00:10:55.580
convolutions to literally watch a video of a

00:10:55.580 --> 00:10:58.539
person's mouth and output the text. Let's break

00:10:58.539 --> 00:11:00.340
down spatiotemporal really quick. Yeah, please.

00:11:00.580 --> 00:11:02.799
So spatial refers to the physical shape, the

00:11:02.799 --> 00:11:04.799
position of your lips, your teeth, and your tongue

00:11:04.799 --> 00:11:07.799
at a single frozen moment. Okay. Temporal refers

00:11:07.799 --> 00:11:10.100
to time, how those shapes transition from one

00:11:10.100 --> 00:11:12.440
frame to the next frame. So LipNet is essentially

00:11:12.440 --> 00:11:14.700
analyzing a high -speed flipbook of your mouth.

00:11:14.879 --> 00:11:17.539
And it actually surpassed human level performance

00:11:17.539 --> 00:11:19.940
in lip reading. It did, yeah. Which makes sense

00:11:19.940 --> 00:11:22.200
because a professional human lip reader is guessing

00:11:22.200 --> 00:11:25.330
a lot of the time based on context. But the algorithm

00:11:25.330 --> 00:11:27.590
is catching micro-movements that our human eyes

00:11:27.590 --> 00:11:30.850
simply cannot process. Exactly. But honestly,

00:11:31.149 --> 00:11:33.870
LipNet feels tame compared to what came out of

00:11:33.870 --> 00:11:38.269
MIT's Media Lab in 2018. Oh, AlterEgo. Yes,

00:11:38.490 --> 00:11:41.090
a device called AlterEgo. This completely blew

00:11:41.090 --> 00:11:43.649
my mind. It reads neuromuscular signals when

00:11:43.649 --> 00:11:46.710
users subvocalize. Right, and subvocalization

00:11:46.710 --> 00:11:48.769
is something we all do, actually. We do. Yeah.

00:11:49.009 --> 00:11:51.460
When you read a book silently to yourself, your

00:11:51.460 --> 00:11:54.519
brain is still sending these very faint electrical

00:11:54.519 --> 00:11:57.179
signals to your vocal cords, to your jaw, and

00:11:57.179 --> 00:11:59.639
to your tongue, telling them to form the words.

00:11:59.720 --> 00:12:02.379
Even if I'm not making a sound. Even then. You

00:12:02.379 --> 00:12:04.059
just stop short of actually opening your mouth

00:12:04.059 --> 00:12:06.399
and pushing air out. Wait, so we went from a

00:12:06.399 --> 00:12:08.960
machine that needed me to painfully pause between

00:12:08.960 --> 00:12:12.399
words to a headset that reads the electrical

00:12:12.399 --> 00:12:14.539
pulses in my jaw when I just think about speaking.

00:12:14.820 --> 00:12:16.639
Pretty much. I mean, I don't make a sound. I

00:12:16.639 --> 00:12:18.879
don't open my mouth. That genuinely sounds like

00:12:18.879 --> 00:12:21.559
telepathy. I completely understand why it feels

00:12:21.559 --> 00:12:24.320
like science fiction, but you have to strip away

00:12:24.320 --> 00:12:27.120
the mysticism of it. It's just pattern recognition

00:12:27.120 --> 00:12:29.580
at scale. Right. Whether the data is an acoustic

00:12:29.580 --> 00:12:32.759
sound wave, a video of lips moving, or an electrical

00:12:32.759 --> 00:12:36.179
impulse traveling down a facial nerve, it's all

00:12:36.179 --> 00:12:39.279
just raw data to a neural network. It simply

00:12:39.279 --> 00:12:42.179
maps the electrical pattern in your jaw to a

00:12:42.179 --> 00:12:45.580
specific word. That is just incredible. But if

00:12:45.580 --> 00:12:47.940
we have a technology that is this powerful, a

00:12:47.940 --> 00:12:50.539
technology that can pull words out of thin air

00:12:50.539 --> 00:12:53.710
or, like, micro-muscle twitches, we really have

00:12:53.710 --> 00:12:56.210
to look at why it's leaving the laboratory. Absolutely.

00:12:56.370 --> 00:12:58.590
Because this isn't just about making it easier

00:12:58.590 --> 00:13:00.769
to ask your kitchen smart speaker for the weather

00:13:00.769 --> 00:13:03.870
forecast. This tech is actively being deployed

00:13:03.870 --> 00:13:06.309
in some of the most high stakes, high stress

00:13:06.309 --> 00:13:08.690
environments on the planet right now. The military

00:13:08.690 --> 00:13:11.330
integration is a perfect example of this. Modern

00:13:11.330 --> 00:13:13.929
fighter jets like the Eurofighter Typhoon and

00:13:13.929 --> 00:13:17.049
the F-35 Lightning II, they have speech technology

00:13:17.049 --> 00:13:19.539
hardwired directly into the cockpit. Which totally

00:13:19.539 --> 00:13:21.860
makes sense. I mean, a fighter pilot is flying

00:13:21.860 --> 00:13:24.500
at supersonic speeds, their hands are locked

00:13:24.500 --> 00:13:26.580
on the controls, their eyes are scanning for

00:13:26.580 --> 00:13:28.500
threats. They can't be looking down at a screen.

00:13:29.000 --> 00:13:31.259
Exactly. Voice commands allow them to assign

00:13:31.259 --> 00:13:33.940
targets, change radio frequencies, or switch

00:13:33.940 --> 00:13:36.259
radar modes without looking down or letting go

00:13:36.259 --> 00:13:39.120
of the stick. But the human body isn't actually

00:13:39.120 --> 00:13:41.700
meant to talk while pulling extreme G-forces.

00:13:41.860 --> 00:13:44.570
No, it is not. There is this incredible data

00:13:44.570 --> 00:13:48.490
from Swedish pilots flying the JAS 39 Gripen.

00:13:49.269 --> 00:13:51.429
They found that as soon as the pilots hit high

00:13:51.429 --> 00:13:54.370
G-loads, the speech recognition software just

00:13:54.370 --> 00:13:57.809
completely fell apart. Because gravity physically

00:13:57.809 --> 00:14:00.409
alters the instrument of your voice. When you

00:14:00.409 --> 00:14:03.769
pull six or seven G's, the immense pressure compresses

00:14:03.769 --> 00:14:06.330
your chest cavity. Wow! Your breathing becomes

00:14:06.330 --> 00:14:09.090
really shallow and erratic. Your neck muscles

00:14:09.090 --> 00:14:11.429
strain to hold your head up, which physically

00:14:11.429 --> 00:14:13.889
changes the shape of your vocal cords. Oh, I

00:14:13.889 --> 00:14:15.929
didn't even think about that. Yeah, your voice

00:14:15.929 --> 00:14:18.090
goes up in pitch and becomes incredibly strained.

00:14:18.169 --> 00:14:20.549
So the acoustic data totally changes. Completely.

00:14:20.549 --> 00:14:23.830
So to fix the word errors, the software developers

00:14:23.830 --> 00:14:27.350
literally had to put pilots in centrifuges, record

00:14:27.350 --> 00:14:29.490
them under immense physical strain, and actually

00:14:29.490 --> 00:14:32.029
build mathematical models of the pilot's compressed

00:14:32.029 --> 00:14:34.590
breathing just to feed it back into the neural

00:14:34.590 --> 00:14:37.009
network. They essentially had to teach the machine

00:14:37.159 --> 00:14:39.559
what human physical suffering sounds like just

00:14:39.559 --> 00:14:42.100
so it could filter it out. That is such an intense

00:14:42.100 --> 00:14:44.860
application. But you know, the exact same underlying

00:14:44.860 --> 00:14:46.820
neural networks are doing something entirely

00:14:46.820 --> 00:14:49.039
different in health care. And everyone knows

00:14:49.039 --> 00:14:51.860
doctors use speech to text to dictate patient

00:14:51.860 --> 00:14:54.399
notes. But here's where it gets really interesting.

00:14:54.580 --> 00:14:57.000
It's not just a tool for the doctors anymore.

00:14:57.279 --> 00:14:59.899
It is actively being used as therapy for the

00:14:59.899 --> 00:15:02.679
patients. Prolonged use of speech recognition

00:15:02.679 --> 00:15:05.120
software actually helps re-strengthen the short-

00:15:05.120 --> 00:15:06.919
term memory of patients who have suffered a

00:15:06.919 --> 00:15:09.860
stroke or patients recovering from brain AVMs.

00:15:12.759 --> 00:15:15.779
An arteriovenous malformation is a dangerous tangle of abnormal

00:15:15.779 --> 00:15:18.000
blood vessels connecting arteries and veins in

00:15:18.000 --> 00:15:20.860
the brain. When those rupture or when a stroke

00:15:20.860 --> 00:15:24.179
occurs, it often heavily damages the neural pathways

00:15:24.179 --> 00:15:26.960
responsible for language and memory. So how does

00:15:26.960 --> 00:15:29.840
talking to a computer fix that? Because the cognitive

00:15:29.840 --> 00:15:33.029
mechanism here is fascinating. For a stroke survivor,

00:15:33.490 --> 00:15:36.110
the physical act of writing with a pen or typing

00:15:36.110 --> 00:15:38.870
on a keyboard requires a massive amount of mental

00:15:38.870 --> 00:15:41.690
energy. Yes, it's exhausting. They have to think

00:15:41.690 --> 00:15:44.490
about the word, translate it into physical hand

00:15:44.490 --> 00:15:47.190
movements, and execute it. And by the time they

00:15:47.190 --> 00:15:49.950
type the word, their short-term memory has literally

00:15:49.950 --> 00:15:52.250
forgotten the rest of the sentence. So speak

00:15:52.250 --> 00:15:55.649
recognition acts as a cognitive bypass. It removes

00:15:55.649 --> 00:15:58.789
the physical friction. The patient simply speaks

00:15:58.789 --> 00:16:01.379
the thought, and the software instantly provides

00:16:01.379 --> 00:16:03.519
visual feedback on the screen. Exactly. They

00:16:03.519 --> 00:16:06.360
see the word appear in real time. And that instant

00:16:06.360 --> 00:16:09.679
visual reinforcement strengthens the neural pathway

00:16:09.679 --> 00:16:12.210
in the brain. Think about that for a second.

00:16:12.610 --> 00:16:14.570
The software designed to write corporate memos

00:16:14.570 --> 00:16:16.690
is actually acting as a physical therapy machine

00:16:16.690 --> 00:16:18.769
for the human mind. And if we connect this to

00:16:18.769 --> 00:16:21.190
the bigger picture, it fundamentally changes

00:16:21.190 --> 00:16:24.049
how we should view speech recognition. It isn't

00:16:24.049 --> 00:16:27.330
just a convenience feature. It is a vital translation

00:16:27.330 --> 00:16:30.230
layer that removes the physical barriers between

00:16:30.230 --> 00:16:33.850
human intent and digital execution. You see that

00:16:33.850 --> 00:16:36.389
translation layer so clearly in accessibility,

00:16:36.730 --> 00:16:38.769
too. For someone suffering from severe repetitive

00:16:38.769 --> 00:16:42.149
strain injury, typing is agonizing. Speech recognition

00:16:42.149 --> 00:16:44.740
gave them their careers back. Absolutely. It

00:16:44.740 --> 00:16:47.399
allows a blind user to completely navigate a

00:16:47.399 --> 00:16:50.759
complex operating system and provides real-time

00:16:50.759 --> 00:16:53.460
instantaneous captions for the deaf and hard

00:16:53.460 --> 00:16:56.639
of hearing. It genuinely restores human autonomy.

00:16:56.940 --> 00:16:59.480
It does. It really does. But, you know, we do

00:16:59.480 --> 00:17:00.899
have to look at the other side of the coin here.

00:17:01.179 --> 00:17:04.779
For all this talk of human parity and restoring

00:17:04.779 --> 00:17:08.000
autonomy, we cannot pretend this technology is

00:17:08.000 --> 00:17:10.839
flawless. When we push it to its limits, the

00:17:10.839 --> 00:17:13.259
cracks in the architecture become very, very

00:17:13.259 --> 00:17:16.359
obvious. Yeah, let's dig into those flaws. Because

00:17:16.359 --> 00:17:18.720
anyone who has ever tried to dictate a text message

00:17:18.720 --> 00:17:20.859
while driving knows it still hallucinates words

00:17:20.859 --> 00:17:22.940
all the time. Oh, constantly. The industry metric

00:17:22.940 --> 00:17:26.940
for this is the word error rate, or WER. And

00:17:26.940 --> 00:17:28.599
one of the most stubborn hurdles is something

00:17:28.599 --> 00:17:31.420
called the E-set problem. Yes, the E-set. This

00:17:31.420 --> 00:17:34.019
is a group of English letters that are acoustically

00:17:34.019 --> 00:17:37.819
almost identical. So that's B, C, D, E, G, P,

00:17:37.900 --> 00:17:41.240
T, V, Z. They all rhyme. Exactly. For an acoustic

00:17:41.240 --> 00:17:43.680
model, distinguishing between a B and a P is

00:17:43.680 --> 00:17:46.000
just an absolute nightmare. Well, yeah, if you

00:17:46.000 --> 00:17:48.220
literally just mouth the letters P and B right

00:17:48.220 --> 00:17:50.920
now, your lips do the exact same physical movement.

00:17:51.059 --> 00:17:52.880
Right. You press your lips together and pop them

00:17:52.880 --> 00:17:55.799
open. The only difference is just this tiny flutter

00:17:55.799 --> 00:17:58.240
of your vocal cords and a slight burst of air.

00:17:59.059 --> 00:18:01.779
A microphone in a noisy room is totally going

00:18:01.779 --> 00:18:04.480
to miss that difference. In fact, getting the

00:18:04.480 --> 00:18:06.740
error rate down to eight percent on the E-set

00:18:06.740 --> 00:18:09.539
is celebrated as a massive victory by engineers.

00:18:09.740 --> 00:18:12.740
It's a huge win for them. And the E-set is just

00:18:12.740 --> 00:18:16.460
about recognizing single letters. The much larger

00:18:16.460 --> 00:18:19.039
failure happens with spontaneous speech. Oh,

00:18:19.160 --> 00:18:21.519
yeah. If you hand a human a script and tell them

00:18:21.519 --> 00:18:24.079
to read it clearly, the machine will transcribe

00:18:24.079 --> 00:18:26.640
it perfectly. But when we just talk casually

00:18:26.640 --> 00:18:28.740
to each other... We are a mess. We really are.

00:18:28.799 --> 00:18:32.869
We stutter, we say, uh, and... Constantly, we

00:18:32.869 --> 00:18:34.710
cough in the middle of a syllable, we start a

00:18:34.710 --> 00:18:36.430
sentence, completely change our minds halfway

00:18:36.430 --> 00:18:39.289
through, and pivot to a new topic without even

00:18:39.289 --> 00:18:42.519
taking a breath. We break every single rule of

00:18:42.519 --> 00:18:44.940
language, and the neural networks

00:18:44.940 --> 00:18:47.579
struggle to predict what comes next when we don't

00:18:47.579 --> 00:18:49.460
even know what we're going to say next. But,

00:18:49.460 --> 00:18:51.519
you know, accuracy is really just a functional

00:18:51.519 --> 00:18:54.500
annoyance. There is a much darker side to this

00:18:54.500 --> 00:18:56.960
interface. Yeah, the security side. When a device

00:18:56.960 --> 00:19:00.079
is constantly listening to the environment, analyzing

00:19:00.079 --> 00:19:02.240
every single sound to see if it matches a wake

00:19:02.240 --> 00:19:05.519
word, it opens up some severe security vulnerabilities.

00:19:05.799 --> 00:19:07.579
And this is the part of the research that is

00:19:07.579 --> 00:19:10.140
genuinely unsettling. Yeah. We all know the wake

00:19:10.140 --> 00:19:12.799
words, right? Alexa, hey Siri. And we've all

00:19:12.799 --> 00:19:15.359
seen the funny news stories where a commercial

00:19:15.359 --> 00:19:18.680
on the television says the wake word and suddenly

00:19:18.680 --> 00:19:21.019
a thousand smart speakers in people's living

00:19:21.019 --> 00:19:23.099
rooms wake up and try to order paper towels.

00:19:23.220 --> 00:19:26.339
Right. But that is an accidental trigger. The

00:19:26.339 --> 00:19:28.440
intentional attacks are terrifying. Hackers have

00:19:28.440 --> 00:19:31.680
actually developed inaudible attacks. Yes. These

00:19:31.680 --> 00:19:34.980
rely on ultrasonic frequencies. So sounds that

00:19:34.980 --> 00:19:37.900
are pitched so high, they are entirely out of

00:19:37.900 --> 00:19:39.740
the range of human hearing. But wait, how does

00:19:39.740 --> 00:19:41.920
a speaker hear a sound that I can't hear? It

00:19:41.920 --> 00:19:44.279
comes down to the physical hardware inside the

00:19:44.279 --> 00:19:47.619
microphone itself. A microphone uses a tiny physical

00:19:47.619 --> 00:19:50.079
membrane to capture sound waves. Now the human

00:19:50.079 --> 00:19:52.619
eardrum generally tops out at hearing frequencies

00:19:52.619 --> 00:19:55.519
around 20 kilohertz. OK. But the physical membrane

00:19:55.519 --> 00:19:59.140
inside a standard smart speaker microphone, that

00:19:59.140 --> 00:20:01.579
can vibrate at much higher frequencies, up to

00:20:01.579 --> 00:20:05.140
25 or 30 kilohertz. It is basically a dog whistle.

00:20:05.240 --> 00:20:09.099
So a hacker can blast a command at 25 kHz, and

00:20:09.099 --> 00:20:11.319
you, sitting right there in your living room,

00:20:11.480 --> 00:20:13.319
hear absolutely nothing. Just total silence.

00:20:13.539 --> 00:20:16.000
But the microphone membrane vibrates, the processor

00:20:16.000 --> 00:20:18.059
translates that vibration into an electrical

00:20:18.059 --> 00:20:20.720
signal, and the neural network decodes it as

00:20:20.720 --> 00:20:23.200
the command. Unlock the front door. Exactly.

00:20:23.519 --> 00:20:25.380
And honestly, they don't even need a dog whistle.

00:20:25.779 --> 00:20:27.359
Researchers have proven they can take a normal

00:20:27.359 --> 00:20:30.859
pop song, add these microscopic, inaudible mathematical

00:20:30.859 --> 00:20:33.130
distortions to the audio file, and just play

00:20:33.130 --> 00:20:34.890
it. And I wouldn't hear it. You just hear the

00:20:34.890 --> 00:20:38.710
music. But hidden inside the noise floor of that

00:20:38.710 --> 00:20:42.269
track is a command telling your phone to initiate

00:20:42.269 --> 00:20:46.230
a wire transfer. Wow. So what does this all mean?

00:20:46.930 --> 00:20:49.049
We started this deep dive talking about

00:20:49.049 --> 00:20:52.390
human parity, right? Microsoft proved a machine

00:20:52.390 --> 00:20:54.750
can theoretically transcribe a conversation better

00:20:54.750 --> 00:20:58.049
than four professionals. But if that exact same

00:20:58.049 --> 00:21:00.710
machine can be completely hijacked by a distorted

00:21:00.710 --> 00:21:03.690
pop song or manipulated by a silent dog whistle

00:21:03.690 --> 00:21:07.009
to unlock a house, does it really understand

00:21:07.180 --> 00:21:10.299
anything at all? This raises an important question,

00:21:10.380 --> 00:21:13.359
and it is really the crux of where the technology

00:21:13.359 --> 00:21:15.720
currently stands. We have to separate acoustic

00:21:15.720 --> 00:21:18.500
transcription from semantic understanding. The

00:21:18.500 --> 00:21:20.720
deep neural network knows the exact mathematical

00:21:20.720 --> 00:21:22.880
shape of the sound wave. It maps the pattern

00:21:22.880 --> 00:21:25.680
flawlessly, but it lacks common sense. Doesn't

00:21:25.680 --> 00:21:28.259
know context. Exactly. When a human hears a TV

00:21:28.259 --> 00:21:29.920
commercial tell them to buy a product, they use

00:21:29.920 --> 00:21:32.119
contextual awareness to ignore it. When a human

00:21:32.119 --> 00:21:34.259
hears a high -frequency chirp, they know it isn't

00:21:34.259 --> 00:21:36.160
language. The machine doesn't know what language

00:21:36.160 --> 00:21:38.910
means. It just processes the math. It hears perfectly,

00:21:39.450 --> 00:21:41.589
but it comprehends nothing. Which brings us to

00:21:41.589 --> 00:21:44.549
the end of just a wild journey today. We have

00:21:44.549 --> 00:21:47.230
traced this technology from its infancy, the

00:21:47.230 --> 00:21:50.390
16-word IBM Shoebox that forced us to speak

00:21:50.390 --> 00:21:53.789
like robots. Through the incredibly controversial

00:21:53.789 --> 00:21:56.410
statistical bets of the 1980s that just ignored

00:21:56.410 --> 00:21:59.529
the rules of grammar entirely, we unpacked how

00:21:59.529 --> 00:22:01.930
vanishing gradients forced the invention of deep

00:22:01.930 --> 00:22:04.650
neural networks with memory. And we watched the

00:22:04.650 --> 00:22:07.269
technology leave the lab to assist fighter pilots

00:22:07.269 --> 00:22:10.599
pulling G-forces, to rebuild the neural pathways

00:22:10.599 --> 00:22:13.200
of stroke patients, and to give voice to those

00:22:13.200 --> 00:22:15.779
who lost theirs. And we confronted the reality

00:22:15.779 --> 00:22:17.680
that the smart speakers sitting on our kitchen

00:22:17.680 --> 00:22:19.579
counters right now, for all their computational

00:22:19.579 --> 00:22:22.220
brilliance, are still vulnerable to invisible

00:22:22.220 --> 00:22:24.539
acoustic attacks. Yeah, they really are. It has

00:22:24.539 --> 00:22:26.160
completely changed the way we interface with

00:22:26.160 --> 00:22:28.299
the digital world. So thank you for joining us

00:22:28.299 --> 00:22:30.700
on this deep dive. We hope it gave you a profound

00:22:30.700 --> 00:22:33.180
appreciation for the invisible layer of math

00:22:33.180 --> 00:22:35.559
and probability happening the next time you casually

00:22:35.559 --> 00:22:37.700
ask your phone to set a timer. And before we

00:22:37.700 --> 00:22:39.599
go, I want to leave you with one final thought.

00:22:39.950 --> 00:22:43.009
I want to circle back to that MIT AlterEgo device

00:22:43.009 --> 00:22:46.490
we discussed, the one that bypasses sound entirely

00:22:46.490 --> 00:22:49.410
and reads the neuromuscular signals of your jaw

00:22:49.410 --> 00:22:52.109
when you subvocalize. Oh, yeah, the mind reader.

00:22:52.349 --> 00:22:55.109
Right. If this technology continues to evolve

00:22:55.109 --> 00:22:57.170
and we reach a point where we can seamlessly

00:22:57.170 --> 00:23:00.150
command the digital world just by thinking the

00:23:00.150 --> 00:23:03.019
words. what happens to the spoken word. Wow.

00:23:03.259 --> 00:23:05.299
If speaking out loud is no longer a requirement

00:23:05.299 --> 00:23:07.680
to interact with our technology or our environment,

00:23:08.180 --> 00:23:10.759
will our physical voices become obsolete, reserved

00:23:10.759 --> 00:23:13.460
entirely for intimate human-to-human contact?

00:23:13.839 --> 00:23:16.279
Man, that is something to really mull over the

00:23:16.279 --> 00:23:17.960
next time you find yourself talking out loud

00:23:17.960 --> 00:23:20.880
to a machine. Stay curious, keep exploring, and

00:23:20.880 --> 00:23:22.180
we'll see you on the next Deep Dive.
