WEBVTT

00:00:00.000 --> 00:00:02.100
You know that feeling when you're typing a frantic

00:00:02.100 --> 00:00:05.080
text, maybe you're walking, and you mistype a

00:00:05.080 --> 00:00:08.660
word, just completely butcher it, but autocorrect

00:00:08.660 --> 00:00:11.380
just silently fixes it? Oh yeah. You hit send,

00:00:11.519 --> 00:00:14.019
the message looks perfect, the other person has

00:00:14.019 --> 00:00:16.519
no idea you originally typed, you know, ducking.

00:00:16.600 --> 00:00:19.179
Right, the classic save. The meaning got there,

00:00:19.500 --> 00:00:23.239
but the input was a total disaster. Exactly.

00:00:23.320 --> 00:00:25.699
But here's the thing, if you were speaking that

00:00:25.699 --> 00:00:27.559
message, say, in a language you're trying to

00:00:27.559 --> 00:00:30.250
learn, you would have sounded completely wrong.

00:00:30.829 --> 00:00:32.350
The meaning might have gotten there, but the

00:00:32.350 --> 00:00:35.149
delivery, broken. And for the longest time, that

00:00:35.149 --> 00:00:38.549
has been the hidden trap of using AI to learn

00:00:38.549 --> 00:00:41.229
a language. It's the autocorrect trap. You

00:00:41.229 --> 00:00:43.590
speak into an app, it checks your grammar, maybe

00:00:43.590 --> 00:00:46.030
fixes your syntax, but it's completely deaf to

00:00:46.030 --> 00:00:48.630
how you actually sound. Right. It fixes the text,

00:00:48.729 --> 00:00:51.250
not the voice. Which is fine for, you know, writing

00:00:51.250 --> 00:00:54.229
an email. But it's catastrophic if you're trying

00:00:54.229 --> 00:00:56.130
to learn how to speak. You think you're practicing

00:00:56.130 --> 00:00:58.609
pronunciation, but really, you're just practicing

00:00:58.609 --> 00:01:03.009
dictation. But today, we are looking at a massive

00:01:03.009 --> 00:01:05.170
shift in the technology underneath all this.

00:01:05.609 --> 00:01:08.750
We're moving from an AI that reads to an AI that

00:01:08.750 --> 00:01:11.290
actually listens. We have a stack of sources

00:01:11.290 --> 00:01:14.790
here about using Google Gemini Pro specifically

00:01:14.790 --> 00:01:17.670
for pronunciation training. And I got to say.

00:01:18.489 --> 00:01:22.870
Looking at what it can do, some of it is borderline

00:01:22.870 --> 00:01:25.349
spooky. Spooky is the right word. I mean, we

00:01:25.349 --> 00:01:27.650
are not talking about robot voices anymore. We're

00:01:27.650 --> 00:01:30.310
talking about a tool that can tell if you sound

00:01:30.310 --> 00:01:32.730
confident. Or nervous. Or if you're just rushing

00:01:32.730 --> 00:01:34.109
because you want to get the sentence over with.

00:01:34.170 --> 00:01:36.189
See, that's the part that hooked me. Because

00:01:36.189 --> 00:01:38.129
I have tried the language apps, I've done the

00:01:38.129 --> 00:01:41.170
owl, and that frustration of, am I saying this

00:01:41.170 --> 00:01:44.319
right? Yeah. The number one reason I quit. It

00:01:44.319 --> 00:01:46.640
is. It's that total lack of real feedback. It's

00:01:46.640 --> 00:01:48.680
the biggest hurdle. And what we have here is

00:01:48.680 --> 00:01:50.780
basically a roadmap. We're going to cover why

00:01:50.780 --> 00:01:53.500
Gemini Pro is fundamentally different from tools

00:01:53.500 --> 00:01:56.299
like ChatGPT or Claude when it comes to audio.

00:01:56.359 --> 00:01:58.459
The physics of it. The physics of it, exactly.

00:01:58.480 --> 00:02:00.680
Yeah. Then we're going to break down this power

00:02:00.680 --> 00:02:03.760
prompt recipe that supposedly turns the AI into

00:02:03.760 --> 00:02:06.819
a 20-year veteran coach. And then look at a

00:02:06.819 --> 00:02:10.000
daily routine to actually fix the problems it

00:02:10.000 --> 00:02:12.599
finds. Yep. So let's start with the technology

00:02:12.599 --> 00:02:15.360
itself, because I think most people, and I include

00:02:15.360 --> 00:02:17.639
myself here, we just assume AI is AI. If I talk

00:02:17.639 --> 00:02:20.360
to ChatGPT or I talk to Gemini, it's all just

00:02:20.360 --> 00:02:23.319
processing data. Not quite. But the source material

00:02:23.319 --> 00:02:26.979
draws a really hard line in the sand here, especially

00:02:26.979 --> 00:02:29.639
with how they handle sound. It's a massive distinction.

00:02:30.120 --> 00:02:32.539
And it all comes down to how the machine perceives

00:02:32.539 --> 00:02:34.879
reality. So most of the tools we've used for

00:02:34.879 --> 00:02:37.699
the last few years, ChatGPT, the voice assistants

00:02:37.699 --> 00:02:41.360
on your phone, they all rely on this legacy system

00:02:41.360 --> 00:02:46.039
called speech to text. STT. STT. Yeah. STT. I

00:02:46.039 --> 00:02:47.860
know the acronym, but let's slow down a bit.

00:02:48.060 --> 00:02:50.319
How does that actually work under the hood? OK.

00:02:50.319 --> 00:02:52.419
So imagine you're speaking to a stenographer

00:02:52.419 --> 00:02:54.939
who is trying to be a little too helpful. You

00:02:54.939 --> 00:02:58.360
say a sentence into the mic. The AI's first job

00:02:58.360 --> 00:03:01.039
isn't to critique how you sound. Its first job

00:03:01.039 --> 00:03:03.020
is to figure out what words you intended to say.

00:03:03.039 --> 00:03:05.580
Right. It takes the audio, strips out the noise,

00:03:05.759 --> 00:03:07.819
turns it into text tokens, and then it analyzes

00:03:07.819 --> 00:03:10.460
the text. So it's discarding the audio almost

00:03:10.460 --> 00:03:13.139
immediately. Precisely. Once it has the text,

00:03:13.300 --> 00:03:17.139
the audio is trash. It's gone. Wow. So if you

00:03:17.139 --> 00:03:19.360
mispronounce a word, but the context makes it

00:03:19.360 --> 00:03:21.280
obvious what you meant. The AI just fixes it.

00:03:21.460 --> 00:03:23.259
It acts like that helpful friend who finishes

00:03:23.259 --> 00:03:26.340
your sentences for you. If you say, I would like

00:03:26.340 --> 00:03:29.840
an apple, and the context is fruit, the

00:03:29.840 --> 00:03:33.500
speech-to-text system just writes apple. It hands the

00:03:33.500 --> 00:03:36.729
text apple to the brain of the AI. The AI looks

00:03:36.729 --> 00:03:39.270
at it and says, grammar is perfect. Meanwhile,

00:03:39.289 --> 00:03:41.349
you're still walking around saying Opal. Exactly.

00:03:41.949 --> 00:03:44.110
I mean, that's fatally flawed for a learner.

00:03:44.590 --> 00:03:47.169
It's prioritizing meaning over the mechanics.

00:03:47.370 --> 00:03:49.689
It's optimizing for communication, not for correction.
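[Editor's aside: the "helpful stenographer" behavior described here can be sketched in a few lines. This is a toy illustration, not any real STT engine; every name in it is hypothetical.]

```python
# Toy sketch of the "helpful stenographer" problem: a speech-to-text
# layer maps whatever it hears to the most likely intended word, so
# the downstream grammar check never sees the mispronunciation.
# All names here are hypothetical; real STT engines are far more complex.

# Context-based guesses: the correct and the butchered pronunciation
# both collapse to the same text token.
LEXICON_GUESSES = {
    "apple": "apple",
    "opal": "apple",   # mispronunciation, silently "fixed" by context
}

def transcribe(heard_word: str) -> str:
    """Return the intended word; the audio evidence is discarded here."""
    return LEXICON_GUESSES.get(heard_word, heard_word)

def grammar_check(sentence: str) -> str:
    """Downstream text-only check: it can only ever see the cleaned text."""
    return "grammar is perfect" if sentence.endswith("apple") else "error"

# Two very different spoken performances produce identical feedback.
good = grammar_check("I would like an " + transcribe("apple"))
bad = grammar_check("I would like an " + transcribe("opal"))
print(good, "|", bad)  # both pass: the mispronunciation is invisible
```

The point of the sketch: once `transcribe` has returned text, nothing downstream can ever flag how "opal" actually sounded.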

00:03:50.430 --> 00:03:52.729
And there's another layer to this that's mentioned

00:03:52.729 --> 00:03:55.909
in the sources, the issue of training bias. The

00:03:55.909 --> 00:03:59.270
profiling aspect. In a way, yeah. These STT systems

00:03:59.270 --> 00:04:01.990
are trained on these massive data sets, so they

00:04:01.990 --> 00:04:04.409
use demographic probabilities to make guesses.

00:04:04.569 --> 00:04:06.870
So if it knows I'm from a certain country. Or

00:04:06.870 --> 00:04:09.990
if you tell it, say you're Vietnamese, it accesses

00:04:09.990 --> 00:04:13.050
a database of common errors made by Vietnamese

00:04:13.050 --> 00:04:15.409
speakers. So it's already looking for missing

00:04:15.409 --> 00:04:17.970
ending sounds before I even open my mouth. It's

00:04:17.970 --> 00:04:20.779
a confirmation bias engine. It might flag a missing

00:04:20.779 --> 00:04:23.120
ending sound because statistically, that's what

00:04:23.120 --> 00:04:25.600
it expects to hear, even if you actually nailed

00:04:25.600 --> 00:04:28.199
it. So it's giving me generic advice based on

00:04:28.199 --> 00:04:30.879
my profile, not personal feedback. Right, not

00:04:30.879 --> 00:04:32.899
based on your actual performance. That explains

00:04:32.899 --> 00:04:35.680
so much about why those generic language apps

00:04:35.680 --> 00:04:39.560
feel so repetitive. Okay, so how is Gemini Pro

00:04:39.560 --> 00:04:42.800
different? The source keeps using this word,

00:04:43.259 --> 00:04:46.500
multimodal. Multimodal is the game changer. It

00:04:46.500 --> 00:04:49.019
means the AI isn't just looking at a text transcript,

00:04:49.220 --> 00:04:52.860
it is processing the raw audio file. The actual

00:04:52.860 --> 00:04:55.639
sound waves. The actual waveforms, yeah. It's

00:04:55.639 --> 00:04:57.819
listening to the length of your vowels in milliseconds.

00:04:58.600 --> 00:05:00.459
It's hearing where you place the stress in a

00:05:00.459 --> 00:05:03.199
word. It's detecting the micropauses between

00:05:03.199 --> 00:05:05.740
syllables. So it's listening to the physics of

00:05:05.740 --> 00:05:08.100
the sound, not just the definition of the word.

00:05:08.259 --> 00:05:10.180
Right. It connects the audio directly to the

00:05:10.180 --> 00:05:12.920
processing core. It can feel the energy. One

00:05:12.920 --> 00:05:15.379
of the key insights here is that Gemini can tell

00:05:15.379 --> 00:05:18.180
if you're speaking confidently or shyly. A text

00:05:18.180 --> 00:05:20.639
transcript can't show shyness. But sound waves

00:05:20.639 --> 00:05:23.620
can. That is the aha moment for me. It's the

00:05:23.620 --> 00:05:26.000
difference between sending someone an email and

00:05:26.000 --> 00:05:28.540
leaving them a voicemail. Yeah. The emotional

00:05:28.540 --> 00:05:32.189
context is just there. And for pronunciation,

00:05:32.689 --> 00:05:34.689
that emotional context, the rhythm, the hesitation,

00:05:34.889 --> 00:05:37.129
the prosody, that is where the accent lives.

00:05:37.209 --> 00:05:40.149
That's where fluency lives. So the fatal flaw

00:05:40.149 --> 00:05:42.649
of standard speech to text is that it cleans

00:05:42.649 --> 00:05:45.910
up the mess before analyzing it. Exactly. It

00:05:45.910 --> 00:05:48.509
prioritizes meaning over sound, fixing your errors

00:05:48.509 --> 00:05:51.029
instead of flagging them. OK. So we have a tool

00:05:51.029 --> 00:05:53.930
that has ears, metaphorically speaking, but the

00:05:53.930 --> 00:05:56.529
source material is very clear. You can't just

00:05:56.529 --> 00:05:59.189
turn it on and say, help me. No, that just leads

00:05:59.189 --> 00:06:01.889
to chaos. You need a specific setup. Right. You

00:06:01.889 --> 00:06:04.129
have to control the variables. And this is practically

00:06:04.129 --> 00:06:06.529
very simple, but it's crucial. First, the hardware.

00:06:06.649 --> 00:06:08.829
OK. You don't need a studio mic. Your smartphone

00:06:08.829 --> 00:06:12.589
is fine. But the room. Yeah. The room matters.

00:06:12.689 --> 00:06:15.709
Quiet room. Door closed. Essential. Because remember,

00:06:15.970 --> 00:06:17.790
we're dealing with sound waves now, not just

00:06:17.790 --> 00:06:20.550
text tokens. If you have a fan whirring in the

00:06:20.550 --> 00:06:23.709
background, traffic noise, Gemini might interpret

00:06:23.709 --> 00:06:26.170
that white noise as a phoneme. It might think

00:06:26.170 --> 00:06:28.420
that whoosh from the fan is you trying to make

00:06:28.420 --> 00:06:30.560
a nice sound. So it's too sensitive for its own

00:06:30.560 --> 00:06:32.720
good sometimes. It can create hallucinations

00:06:32.720 --> 00:06:36.160
in the audio processing. So silence is key. But

00:06:36.160 --> 00:06:38.860
the bigger setup rule, and this is where most

00:06:38.860 --> 00:06:41.579
people fail, is about what you actually say.

00:06:41.959 --> 00:06:44.420
The no single words rule. This is a mistake everyone

00:06:44.420 --> 00:06:45.939
makes. They pick up the app and they just say,

00:06:45.939 --> 00:06:49.319
Apple, banana. Hello. I do this constantly. I

00:06:49.319 --> 00:06:50.819
just want to check if I can say the word. Why

00:06:50.819 --> 00:06:52.939
is that bad? Because that's not how language

00:06:52.939 --> 00:06:57.129
works. The source suggests a paragraph, like 150

00:06:57.129 --> 00:06:59.829
to 200 words, a short story, a recap of your

00:06:59.829 --> 00:07:03.069
day, whatever. The reason is a technical concept

00:07:03.069 --> 00:07:06.209
called co-articulation. Co-articulation? It's

00:07:06.209 --> 00:07:08.870
how sounds influence each other. When you speak

00:07:08.870 --> 00:07:11.670
a single word, you say it in isolation. But in

00:07:11.670 --> 00:07:14.589
a real sentence, the end of one word blends into

00:07:14.589 --> 00:07:16.970
the start of the next one. The rhythm changes.
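[Editor's aside: the 150-to-200-word guideline is easy to verify before you hit record. The thresholds below come from the source; the helper itself is just an illustrative sketch, not part of any app.]

```python
# Quick pre-flight check for a practice script, following the source's
# advice: full paragraphs of roughly 150-200 words, never single words,
# so the recording captures co-articulation (how word boundaries blend).
# The 150/200 thresholds come from the source; the function is our own.

def check_practice_text(text: str) -> str:
    words = text.split()
    if len(words) < 2:
        return "too short: single words hide co-articulation"
    if len(words) < 150:
        return f"{len(words)} words: aim for 150-200 for rhythm data"
    if len(words) > 200:
        return f"{len(words)} words: trim a little, 150-200 is plenty"
    return f"{len(words)} words: good length, record it"

print(check_practice_text("apple"))
print(check_practice_text("Today I walked to the station " * 30))
```

A single word fails immediately; the 180-word repetition passes, because only a full paragraph carries the blending and rhythm the AI needs to hear.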

00:07:17.069 --> 00:07:18.769
If you only practice single words, you sound

00:07:18.769 --> 00:07:21.649
like a robot. The AI needs to hear how you link

00:07:21.649 --> 00:07:24.120
words together. Do you blend? Is your rhythm

00:07:24.120 --> 00:07:26.660
robotic or fluid? You only get that data from

00:07:26.660 --> 00:07:28.899
a full paragraph. That's like trying to judge

00:07:28.899 --> 00:07:31.579
a dancer by looking at a single photo versus

00:07:31.579 --> 00:07:34.279
watching a video of them actually moving. That

00:07:34.279 --> 00:07:36.600
is a perfect analogy. The paragraph provides

00:07:36.600 --> 00:07:38.779
the movement. OK, so we're in a quiet room. We

00:07:38.779 --> 00:07:40.819
have our paragraph about our day. Now we get

00:07:40.819 --> 00:07:43.180
to the power prompt. This is the recipe mentioned

00:07:43.180 --> 00:07:44.600
in the source material. And I have to admit,

00:07:44.720 --> 00:07:46.740
whenever I see these long detailed prompts, I

00:07:46.740 --> 00:07:50.040
get a little skeptical. Skeptical. Well, it feels

00:07:50.040 --> 00:07:53.120
like I'm, you know, LARPing with a computer, please

00:07:53.120 --> 00:07:55.740
pretend you are a coach. It just feels weird

00:07:55.740 --> 00:07:58.060
to give a machine a personality. Does it actually

00:07:58.060 --> 00:08:00.300
change the output? It changes everything. You

00:08:00.300 --> 00:08:02.439
have to remember these large language models

00:08:02.439 --> 00:08:06.519
are these vast, generic oceans of text. They've

00:08:06.519 --> 00:08:09.139
read everything from Reddit threads to astrophysics

00:08:09.139 --> 00:08:11.579
textbooks. Right. If you don't narrow their focus,

00:08:11.980 --> 00:08:15.040
they drift. They just revert to the mean. By assigning

00:08:15.040 --> 00:08:17.720
a persona like... a native English pronunciation

00:08:17.720 --> 00:08:20.879
coach with 20 years of experience, you are telling

00:08:20.879 --> 00:08:25.120
the AI which part of its latent space to activate.

00:08:25.699 --> 00:08:28.139
You're forcing it to adopt a specific standard

00:08:28.139 --> 00:08:30.459
of critique. So it's not just flavor text, it's

00:08:30.459 --> 00:08:33.500
a functional constraint. Absolutely. And the

00:08:33.500 --> 00:08:36.820
prompt in the source has four very strict rules

00:08:36.820 --> 00:08:38.779
that I think are brilliant. Okay, let's run through

00:08:38.779 --> 00:08:40.419
them, because this seems to be the secret sauce.

00:08:38.779 --> 00:08:40.419
The first one is one we've touched on: no stereotypes.

00:08:43.289 --> 00:08:46.049
This is critical for accuracy. You literally

00:08:46.049 --> 00:08:49.230
command the AI. Only report what is actually

00:08:49.230 --> 00:08:52.830
heard. Do not use demographic data. You are forcing

00:08:52.830 --> 00:08:55.629
it to ignore its training on general population

00:08:55.629 --> 00:08:58.929
stats and focus purely on your audio file. It

00:08:58.929 --> 00:09:01.129
prevents that confirmation bias we talked about.

00:09:01.389 --> 00:09:03.370
Exactly. That's empowering. It's telling the

00:09:03.370 --> 00:09:06.350
machine to look at the data, not the trend. Exactly.

00:09:06.669 --> 00:09:10.009
The second rule is specific quotes. The AI has

00:09:10.009 --> 00:09:13.269
to provide timestamps and the exact words. Right.

00:09:13.570 --> 00:09:15.690
It can't just say, work on your vowels. It has

00:09:15.690 --> 00:09:18.649
to say, at point one five, you said sheep, but

00:09:18.649 --> 00:09:21.320
it sounded more like ship. Which is so helpful,

00:09:21.600 --> 00:09:23.000
because otherwise you're just guessing where

00:09:23.000 --> 00:09:24.779
you went wrong. And frankly, without timestamps,

00:09:24.919 --> 00:09:26.840
I'd probably just assume the AI was hallucinating.

00:09:27.100 --> 00:09:29.639
Then there's stress analysis. English is a stress-

00:09:29.639 --> 00:09:32.220
timed language. If you say PHO-to-graph-y

00:09:32.220 --> 00:09:34.340
instead of pho-TOG-ra-phy. And the meaning

00:09:34.340 --> 00:09:37.419
gets lost or just sounds wrong. Exactly. So the

00:09:37.419 --> 00:09:40.519
prompt explicitly asks the AI to listen for that

00:09:40.519 --> 00:09:43.580
emphasis. And the last one was confusing sounds.
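[Editor's aside: the four rules can be packed into a reusable template. The persona and the rules (no stereotypes, specific quotes with timestamps, stress analysis, confusing sounds) come from the source; the exact wording and the accent parameter are our own paraphrase, so treat this as a sketch rather than the source's literal prompt.]

```python
# Sketch of the "power prompt" as a reusable template. The persona and
# the four rules come from the source; the phrasing and the accent
# parameter are our own paraphrase, not the source's exact wording.

def build_power_prompt(accent: str = "General American") -> str:
    rules = [
        "No stereotypes: only report what is actually heard in this "
        "audio. Do not use demographic data or typical-error lists.",
        "Specific quotes: cite the timestamp and the exact words for "
        "every issue you flag.",
        "Stress analysis: listen for word and sentence stress, since "
        "English is stress-timed.",
        "Confusing sounds: flag minimal-pair confusions you hear "
        "(e.g. sheep vs. ship).",
    ]
    header = (
        f"You are a native English pronunciation coach with 20 years "
        f"of experience. My target accent is {accent}. Analyze the "
        f"attached recording and follow these rules:"
    )
    return header + "\n" + "\n".join(f"{i}. {r}" for i, r in enumerate(rules, 1))

print(build_power_prompt(accent="British English (RP)"))
```

Passing the accent explicitly also heads off the default-accent problem discussed later: the model is told which standard to judge against instead of averaging.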

00:09:44.679 --> 00:09:46.259
But I want to go back to the stress analysis

00:09:46.259 --> 00:09:48.379
for a second, because it leads into the most

00:09:48.379 --> 00:09:49.940
surprising part of the source material for me,

00:09:50.019 --> 00:09:52.460
the case study. The Vietnamese student test.

00:09:52.620 --> 00:09:56.659
Yes. This really highlights the emotional intelligence

00:09:56.659 --> 00:10:00.159
of this multimodal system. So the author used

00:10:00.159 --> 00:10:03.480
this prompt with a student. They didn't tell

00:10:03.480 --> 00:10:05.539
Gemini where the student was from. But Gemini

00:10:05.539 --> 00:10:08.789
guessed it. Instantly. Based purely on the intonation,

00:10:09.090 --> 00:10:10.929
the rise and fall of the voice, Gemini correctly

00:10:10.929 --> 00:10:13.029
identified the speaker as likely Vietnamese.

00:10:13.429 --> 00:10:15.950
Whoa! That is nuanced listening. That's detecting

00:10:15.950 --> 00:10:17.830
the music of the mother tongue bleeding into

00:10:17.830 --> 00:10:20.429
the target language. That's wild. But what stood

00:10:20.429 --> 00:10:23.110
out to me even more in that case study was the

00:10:23.110 --> 00:10:26.289
feedback on speed. It wasn't just, you said this

00:10:26.289 --> 00:10:28.529
word wrong. Oh, the nervous comment, yeah. Yeah,

00:10:28.570 --> 00:10:30.970
the AI told the student, you are speaking too

00:10:30.970 --> 00:10:33.629
fast. This makes you sound nervous. And it suggested

00:10:33.629 --> 00:10:36.269
pausing at commas to sound more confident. I

00:10:36.269 --> 00:10:37.950
mean, just think about the implication of that.

00:10:38.070 --> 00:10:40.889
That isn't pronunciation advice. That is psychological

00:10:40.889 --> 00:10:43.450
coaching. It is. It's soft skills. The standard

00:10:43.450 --> 00:10:45.909
speech-to-text tool would just process the

00:10:45.909 --> 00:10:48.330
words. If the words were right, it would give

00:10:48.330 --> 00:10:52.289
a thumbs up. Gemini heard the pace, the milliseconds

00:10:52.289 --> 00:10:55.309
between words, and correlated it with an emotional

00:10:55.309 --> 00:10:58.360
state, anxiety. It's teaching you how to command

00:10:58.360 --> 00:11:00.919
a room, not just how to pronounce a vowel. It's

00:11:00.919 --> 00:11:03.320
connecting the dots between how we sound and

00:11:03.320 --> 00:11:05.480
how we are perceived, which is, you know, the

00:11:05.480 --> 00:11:08.120
whole point of learning a language, really. We

00:11:08.120 --> 00:11:10.120
want to be understood, but we also want to project

00:11:10.120 --> 00:11:12.860
who we are. Precisely. It's closing the gap between

00:11:12.860 --> 00:11:15.580
your internal voice and your external reality.

00:11:15.840 --> 00:11:18.840
I'm curious, why is the no-stereotypes instruction

00:11:18.840 --> 00:11:22.500
so critical for the user? It forces the AI to

00:11:22.500 --> 00:11:24.740
listen to your individual voice, not just guess

00:11:25.019 --> 00:11:27.240
based on a demographic textbook. Now, before

00:11:27.240 --> 00:11:28.820
we get too carried away thinking this is all

00:11:28.820 --> 00:11:31.299
magic, we have to pay the bills. Okay, we are

00:11:31.299 --> 00:11:33.519
back. We've praised the ghost in the machine,

00:11:33.840 --> 00:11:38.000
but the source material also throws a, uh, a

00:11:38.000 --> 00:11:41.080
bucket of cold water on us. Gemini Pro is good,

00:11:41.080 --> 00:11:43.899
but it has limits. It does. It's not a magic

00:11:43.899 --> 00:11:46.350
wand. It's not a human. That's so important to

00:11:46.350 --> 00:11:48.580
remember. We talked about the hallucinations

00:11:48.580 --> 00:11:51.139
from background noise, that's a big one. But

00:11:51.139 --> 00:11:53.779
there's also the issue of accent confusion. Right,

00:11:53.919 --> 00:11:56.980
the water versus wata problem. Exactly. If you

00:11:56.980 --> 00:11:59.340
don't specify in your prompt whether you're aiming

00:11:59.340 --> 00:12:02.340
for General American or Received Pronunciation,

00:12:02.600 --> 00:12:05.720
that's the standard British accent. Gemini just

00:12:05.720 --> 00:12:08.279
defaults to the average. Which is usually a generic

00:12:08.279 --> 00:12:10.500
American accent. Usually, yeah. So if I'm trying

00:12:10.500 --> 00:12:12.659
to sound like I'm from London and I say schedule

00:12:12.659 --> 00:12:15.480
the British way. Gemini might flag it as an error.

00:12:15.600 --> 00:12:17.809
It might, yeah. because it's comparing you to

00:12:17.809 --> 00:12:21.250
a database of mostly American speakers. You have

00:12:21.250 --> 00:12:23.149
to be specific in your prompt. You got to say

00:12:23.149 --> 00:12:26.210
act as a British English coach. That seems like

00:12:26.210 --> 00:12:29.309
a simple fix, but definitely good to know. What

00:12:29.309 --> 00:12:32.909
about the hardware limitation? The source mentioned

00:12:32.909 --> 00:12:35.580
something about sounds getting lost. This is

00:12:35.580 --> 00:12:37.559
a physics problem and it's actually really interesting.

00:12:37.980 --> 00:12:40.159
So the microphone on your smartphone is incredible,

00:12:40.539 --> 00:12:43.159
but it's designed for phone calls. Okay. So it

00:12:43.159 --> 00:12:45.139
often runs these noise cancellation algorithms

00:12:45.139 --> 00:12:48.139
to cut out background hiss. Right. The problem

00:12:48.139 --> 00:12:51.620
is, some English sounds, specifically the unvoiced

00:12:51.620 --> 00:12:54.759
fricatives, like the 'th' in 'thin' or the 'f' in

00:12:54.759 --> 00:12:58.000
'fish', they occupy the same high-frequency range

00:12:58.000 --> 00:13:00.679
as that background hiss. So the phone thinks

00:13:00.679 --> 00:13:03.200
my 'th' sound is just air conditioner noise and

00:13:03.200 --> 00:13:05.679
it deletes it. Exactly. It scrubs the audio to

00:13:05.679 --> 00:13:08.360
clean it. And in the process, it removes the

00:13:08.360 --> 00:13:10.799
very sound you were trying to practice. Gemini

00:13:10.799 --> 00:13:13.299
might say, you missed the 'th' sound when actually

00:13:13.299 --> 00:13:14.899
you said it right and your phone just filtered

00:13:14.899 --> 00:13:16.899
it out. So don't get gaslighted by your hardware.

00:13:17.179 --> 00:13:19.779
Yes. Trust your ears or ask a human if you're

00:13:19.779 --> 00:13:22.720
really stuck. Don't let the AI destroy your confidence

00:13:22.720 --> 00:13:25.600
over a noise cancellation algorithm. So we have

00:13:25.600 --> 00:13:27.840
the tool, we have the prompt, we know the limits.

00:13:28.259 --> 00:13:30.700
The source material ends with a study plan, this

00:13:30.700 --> 00:13:33.600
record analyze fix routine. This is the practical

00:13:33.600 --> 00:13:36.379
application part. The author suggests a daily

00:13:36.379 --> 00:13:39.450
routine. And the key here is repetition. And

00:13:39.450 --> 00:13:41.509
this is a bit controversial. Don't record a new

00:13:41.509 --> 00:13:43.549
paragraph every day. Wait, really? I would have

00:13:43.549 --> 00:13:45.710
thought variety is better. I need to learn more

00:13:45.710 --> 00:13:48.110
words. No, because you can't track your progress

00:13:48.110 --> 00:13:51.110
if the target is always moving. The advice is

00:13:51.110 --> 00:13:53.330
to record the same paragraph every single day

00:13:53.330 --> 00:13:55.610
for a week. Ah, so you can A/B test yourself.

00:13:55.929 --> 00:13:58.450
Exactly. You record on Monday, you get the feedback,

00:13:58.629 --> 00:14:00.850
you record on Tuesday, and you upload both files

00:14:00.850 --> 00:14:03.549
to Gemini. Then you ask, compared to my recording

00:14:03.549 --> 00:14:05.940
yesterday, what is better? You're building a

00:14:05.940 --> 00:14:08.519
feedback loop. A feedback loop, yeah. You master

00:14:08.519 --> 00:14:11.080
the rhythm of that specific paragraph. It's like

00:14:11.080 --> 00:14:13.100
learning a song on the piano. You play the same

00:14:13.100 --> 00:14:15.320
piece until it flows, then you move on. That

00:14:15.320 --> 00:14:17.779
makes so much sense. It's about depth, not breadth.
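[Editor's aside: the record-analyze-fix loop boils down to one recurring request. The sketch below only builds that comparison prompt as text; in practice you would attach the two audio files through your Gemini client's file-upload mechanism, which varies by SDK and is deliberately omitted here.]

```python
# Sketch of the daily feedback loop: same paragraph all week, and each
# day you ask Gemini to compare today's take with yesterday's. This
# builds only the prompt text; attaching the two recordings is left to
# whatever upload mechanism your Gemini client provides.

from datetime import date

def build_comparison_prompt(day: int, paragraph_title: str) -> str:
    return (
        f"Both recordings are me reading the same paragraph "
        f"('{paragraph_title}'). The first file is yesterday (day {day - 1}), "
        f"the second is today (day {day}). Compared to my recording "
        f"yesterday, what is better, what is worse, and what single "
        f"issue should I drill tomorrow? Quote timestamps."
    )

# A simple dated log line makes the week's progress easy to review.
log_entry = f"{date.today().isoformat()}: " + build_comparison_prompt(3, "my morning routine")
print(log_entry)
```

Keeping the paragraph fixed is what makes the comparison meaningful: day over day, only your delivery changes, never the target.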

00:14:17.980 --> 00:14:20.480
But the author also suggests using other apps,

00:14:20.659 --> 00:14:23.710
the hybrid approach. Yes. This is acknowledging

00:14:23.710 --> 00:14:26.029
that Gemini is a big picture coach. It's great

00:14:26.029 --> 00:14:29.149
for flow, rhythm, intonation, the macro stuff.

00:14:29.730 --> 00:14:32.710
But for the nitty-gritty, like drilling a specific

00:14:32.710 --> 00:14:35.809
vowel sound over and over, it's not the most

00:14:35.809 --> 00:14:38.070
efficient tool. So you use apps like BoldVoice

00:14:38.070 --> 00:14:41.850
or Speechling for the drills? Right. Use Speechling

00:14:41.850 --> 00:14:45.029
to drill the word squirrel 50 times until your

00:14:45.029 --> 00:14:48.429
tongue stops tying itself in knots. Then you

00:14:48.429 --> 00:14:51.480
go to Gemini and put squirrel into a full sentence

00:14:51.480 --> 00:14:53.580
to see if you can maintain that pronunciation

00:14:53.580 --> 00:14:56.279
while speaking naturally. I like that. Specialized

00:14:56.279 --> 00:14:59.700
tools for the bricks. Gemini for the house. That's

00:14:59.700 --> 00:15:01.279
a great way to put it. And the final piece of

00:15:01.279 --> 00:15:03.740
advice in the source was about input. You know,

00:15:04.059 --> 00:15:06.879
put correct sounds into your head. You can't

00:15:06.879 --> 00:15:09.419
output what you haven't input. You need to listen

00:15:09.419 --> 00:15:12.509
to experts. The source mentions channels like

00:15:12.509 --> 00:15:15.250
Luke Pretty or Cloud English. You need to fill

00:15:15.250 --> 00:15:17.490
your brain with the target rhythm so that when

00:15:17.490 --> 00:15:19.549
you record, you actually have a reference point.

00:15:19.690 --> 00:15:21.929
It seems so obvious, but I think a lot of us

00:15:21.929 --> 00:15:25.169
skip that. We just start talking. We do. We want

00:15:25.169 --> 00:15:27.029
to perform before we've rehearsed. But listening

00:15:27.029 --> 00:15:30.169
is 50% of speaking. So how do we get around

00:15:30.169 --> 00:15:33.529
the hardware erasing those small sounds? We rely

00:15:33.529 --> 00:15:36.409
on specialized apps for the details and use Gemini

00:15:36.409 --> 00:15:38.409
for the big picture flow. So let's bring this

00:15:38.409 --> 00:15:40.509
all together. What is the big takeaway for you

00:15:40.509 --> 00:15:42.429
here? Because for me, it's that we are finally

00:15:42.429 --> 00:15:45.049
moving away from passive learning. That is the

00:15:45.049 --> 00:15:48.169
core of it. For years, technology has kind of

00:15:48.169 --> 00:15:51.169
made us lazy learners. It auto-translates. It

00:15:51.169 --> 00:15:54.700
auto-corrects. This approach... using Gemini

00:15:54.700 --> 00:15:58.620
as a coach. It forces us to be active. We have

00:15:58.620 --> 00:16:00.519
to speak, we have to listen to the feedback,

00:16:00.679 --> 00:16:02.860
and we have to actually adjust. It's the difference

00:16:02.860 --> 00:16:05.700
between using a crutch and doing physical therapy.

00:16:05.799 --> 00:16:08.399
Yes. And the technology is finally ready to meet

00:16:08.399 --> 00:16:10.259
us there. It's not just matching text anymore.

00:16:10.679 --> 00:16:13.460
It's listening to the human element, the confidence,

00:16:13.639 --> 00:16:16.440
the hesitation, the rhythm. It turns the AI from

00:16:16.440 --> 00:16:19.240
a spell checker into a mentor. And a strict mentor,

00:16:19.440 --> 00:16:22.000
if you use the right prompt. A very strict mentor.

00:16:22.080 --> 00:16:25.100
No stereotypes, strict quotes. So... Here's our

00:16:25.100 --> 00:16:27.440
challenge to you, the listener. You have a phone.

00:16:27.500 --> 00:16:29.120
You probably have a Google account. It's free

00:16:29.120 --> 00:16:31.240
to try. Right now, when this deep dive ends,

00:16:31.320 --> 00:16:33.779
don't just think that's cool. Pick up your phone.

00:16:34.279 --> 00:16:36.299
Find a quiet closet if you have to. Maybe just

00:16:36.299 --> 00:16:38.919
a quiet room. Closets have bad acoustics. Fair

00:16:38.919 --> 00:16:41.340
point. Record 30 seconds about your day. Just

00:16:41.340 --> 00:16:44.299
30 seconds. Paste in that power prompt. Tell

00:16:44.299 --> 00:16:48.059
it to be a 20-year veteran coach. And just see

00:16:48.059 --> 00:16:51.639
what it says. And specifically, look at the emotional

00:16:51.639 --> 00:16:54.919
feedback. Does it say you sound nervous? Does

00:16:54.919 --> 00:16:57.879
it say you sound robotic? That might be the most

00:16:57.879 --> 00:17:00.059
valuable thing you learned today. It's not about

00:17:00.059 --> 00:17:02.860
sounding like a native speaker perfectly. No.

00:17:03.200 --> 00:17:05.359
It's about being comfortable and being understood.

00:17:06.420 --> 00:17:08.440
That's it for this deep dive. Thanks for listening

00:17:08.440 --> 00:17:10.619
and good luck with the recording. Let us know

00:17:10.619 --> 00:17:12.660
what the AI hears. See you next time.
