WEBVTT

00:00:00.000 --> 00:00:03.580
Okay, so what if I told you you could get unlimited,

00:00:03.940 --> 00:00:08.220
completely free AI voice cloning? Right. Running

00:00:08.220 --> 00:00:10.199
right there on your own computer, you'd probably

00:00:10.199 --> 00:00:12.039
think, okay, what's the catch? Yeah, is it some

00:00:12.039 --> 00:00:14.080
big subscription? Do I need like a massive cloud

00:00:14.080 --> 00:00:16.100
server? Exactly. But, well, that's kind of the

00:00:16.100 --> 00:00:18.519
bombshell here, this tech is real and it runs

00:00:18.519 --> 00:00:22.519
entirely offline. Total privacy, zero cost, full

00:00:22.519 --> 00:00:24.820
control. It's like having your own voice cloning

00:00:24.820 --> 00:00:27.839
lab right there on your desktop or laptop. Welcome

00:00:27.839 --> 00:00:30.440
to the deep dive. Hey everyone. Our goal today

00:00:30.440 --> 00:00:33.310
really is to give you the essentials. We're pulling

00:00:33.310 --> 00:00:36.090
out the key info you need to actually use this

00:00:36.090 --> 00:00:38.630
local AI. Yeah, we'll cover how the tools work,

00:00:38.750 --> 00:00:41.590
the mechanics, but also, critically, the sort

00:00:41.590 --> 00:00:44.289
of non-negotiable rules, the ethics around voice

00:00:44.289 --> 00:00:46.090
replication. We're going to talk about how this

00:00:46.090 --> 00:00:49.549
changes content creation. Huge changes. Introduce

00:00:49.549 --> 00:00:51.350
you to the platform you need. It's called Pinokio.

00:00:51.729 --> 00:00:53.990
And then get into the really cool stuff, the

00:00:53.990 --> 00:00:56.049
pro moves, those advanced settings that take

00:00:56.049 --> 00:00:59.210
your clone from good to, well... basically perfect.

00:00:59.210 --> 00:01:02.609
Let's dive in. So let's start with why this even

00:01:02.609 --> 00:01:05.730
matters, because it's not some, you know, expensive

00:01:05.730 --> 00:01:08.329
corporate thing anymore. At all. Our sources are

00:01:08.329 --> 00:01:12.390
clear. AI voice cloning, it works, it's accessible

00:01:12.390 --> 00:01:15.030
now, and it's genuinely changing how creators

00:01:15.030 --> 00:01:17.890
work. It's a massive creative unlock. Yeah, sure,

00:01:17.890 --> 00:01:19.950
you can make unique character voices. That's fun.

00:01:19.950 --> 00:01:23.370
Right. But the real power, I think, it's automating

00:01:23.370 --> 00:01:26.909
your own workflow. Clone your own voice once and,

00:01:26.909 --> 00:01:29.599
boom, you never have to record that same boring

00:01:29.599 --> 00:01:33.599
intro or outro or promo read ever again. Think

00:01:33.599 --> 00:01:35.680
about that. And the consistency part. That's

00:01:35.680 --> 00:01:38.680
huge. Imagine having like a library of courses,

00:01:38.739 --> 00:01:41.260
hundreds of videos. Yeah. And your voice sounds

00:01:41.260 --> 00:01:43.700
exactly the same. Same energy, same tone. Even

00:01:43.700 --> 00:01:45.500
if you recorded half of it when you had a cold.

00:01:45.659 --> 00:01:48.939
That is the secret weapon. So many big creators

00:01:48.939 --> 00:01:51.299
are already doing this, saving like literally

00:01:51.299 --> 00:01:53.500
hundreds of hours. What do you think is the biggest

00:01:53.500 --> 00:01:56.239
potential time saver here for creators then?

00:01:56.599 --> 00:01:58.719
Automating all that narration. Yeah. And keeping

00:01:58.719 --> 00:02:01.040
that perfect consistency across everything you

00:02:01.040 --> 00:02:03.659
make. Automating content narration, maintaining

00:02:03.659 --> 00:02:06.280
perfect consistency across your entire library.

00:02:07.019 --> 00:02:09.960
Got it. Which naturally brings us straight to

00:02:09.960 --> 00:02:12.139
the ethics. Okay. Because this is powerful stuff

00:02:12.139 --> 00:02:14.580
and it's free now. Great power, great responsibility,

00:02:14.800 --> 00:02:17.860
right? Exactly. Rule number one, absolutely non

00:02:17.860 --> 00:02:22.020
-negotiable. Yeah. Consent -based audio only.

00:02:22.099 --> 00:02:25.169
Meaning? Meaning your own voice. Yeah. Or audio

00:02:25.169 --> 00:02:28.409
from someone who gave you explicit, rock-solid,

00:02:28.409 --> 00:02:32.750
verifiable permission to clone their voice. Period.

00:02:33.050 --> 00:02:35.650
And the flip side is just as important. Never

00:02:35.650 --> 00:02:38.349
use this for fraud, for deception, for harassing

00:02:38.349 --> 00:02:41.020
people. Or misinformation. Deepfakes. That's

00:02:41.020 --> 00:02:42.780
the dark side. Absolutely. You've got to follow

00:02:42.780 --> 00:02:44.919
the law, follow platform rules like YouTube's

00:02:44.919 --> 00:02:47.319
terms. The tech itself is neutral, right? Totally

00:02:47.319 --> 00:02:49.599
neutral. It's how we use it. The moment you use

00:02:49.599 --> 00:02:52.139
it to trick someone, that's when it gets dangerous.

00:02:52.979 --> 00:02:55.319
Yeah, and honestly, I still kind of wrestle with

00:02:55.319 --> 00:02:58.099
prompt drift myself sometimes, even just working

00:02:58.099 --> 00:03:00.699
with my own voice clone and figuring out that

00:03:00.699 --> 00:03:04.120
line, you know, between useful automation and

00:03:04.120 --> 00:03:07.060
something that feels kind of deepfake-y. There's

00:03:07.060 --> 00:03:09.340
constant tension, real vigilance. Okay, wait,

00:03:09.400 --> 00:03:11.770
back up. Prompt drift. What's that exactly for

00:03:11.770 --> 00:03:13.750
folks who haven't heard the term? Oh, right.

00:03:13.870 --> 00:03:17.530
So prompt drift is when the AI, as it's generating

00:03:17.530 --> 00:03:21.490
really long chunks of speech from text, it starts

00:03:21.490 --> 00:03:24.210
to kind of lose the original voice's unique flavor.

00:03:24.469 --> 00:03:26.770
Oh, okay. Yeah, it might sound a bit flatter,

00:03:26.889 --> 00:03:29.189
maybe less emotional, slightly robotic even.

00:03:29.310 --> 00:03:31.289
It means you've got to step back, maybe tweak

00:03:31.289 --> 00:03:33.310
the input text or the settings. But choosing

00:03:33.310 --> 00:03:35.710
to run this locally on your own machine, that

00:03:35.710 --> 00:03:38.139
actually helps with some risks, doesn't it, compared

00:03:38.139 --> 00:03:40.219
to cloud services. Oh, massively. That's the

00:03:40.219 --> 00:03:42.699
core benefit, really. Why go local? Total privacy.

00:03:42.900 --> 00:03:46.000
Your voice data, which is biometric data. Right.

00:03:46.219 --> 00:03:48.759
Sensitive stuff. Very sensitive. It never leaves

00:03:48.759 --> 00:03:50.460
your computer. Yeah. You're not uploading it

00:03:50.460 --> 00:03:52.460
somewhere where it could be hacked or sold or

00:03:52.460 --> 00:03:54.360
analyzed without your knowledge. And besides

00:03:54.360 --> 00:03:57.560
security, the cost is just zero. 100% free,

00:03:57.719 --> 00:04:00.860
unlimited use. No tokens, no subscriptions. Once

00:04:00.860 --> 00:04:03.400
it's set up, it even works offline. You're in

00:04:03.400 --> 00:04:05.919
complete control. So what's the core defining

00:04:05.919 --> 00:04:09.219
benefit of choosing a local setup then? Complete

00:04:09.219 --> 00:04:12.159
privacy, total control, and zero ongoing cost

00:04:12.159 --> 00:04:14.659
for unlimited use. Simple. Okay, so how do we

00:04:14.659 --> 00:04:16.220
actually get this running? You mentioned a platform,

00:04:16.399 --> 00:04:18.759
Pinokio. Yeah, Pinokio. Think of it like

00:04:18.759 --> 00:04:22.839
Steam for gamers, or maybe an app store, but

00:04:22.839 --> 00:04:25.079
specifically for AI models. Okay. It's basically

00:04:25.079 --> 00:04:27.959
a simple way to install and run these complex

00:04:27.959 --> 00:04:31.699
AI tools on your own PC without needing to be

00:04:31.699 --> 00:04:33.980
a coding wizard. So installing Pinokio itself

00:04:33.980 --> 00:04:36.639
is pretty standard, like any other app. Yeah,

00:04:36.660 --> 00:04:39.779
download, run the installer, easy stuff. But

00:04:39.779 --> 00:04:42.060
the real magic happens inside Pinokio when

00:04:42.060 --> 00:04:44.310
you install the specific voice model. And the

00:04:44.310 --> 00:04:48.310
one recommended in the sources is E2-F5-TTS.

00:04:48.529 --> 00:04:51.629
That's the one, E2-F5-TTS. It's a really good open

00:04:51.629 --> 00:04:53.689
source model. Yeah. Known for being powerful,

00:04:53.829 --> 00:04:56.550
but also not needing like a supercomputer to

00:04:56.550 --> 00:04:58.970
run. It's great for home setups, does emotion

00:04:58.970 --> 00:05:01.170
really well. Now, there's a critical rule here

00:05:01.170 --> 00:05:04.459
during the install. Inside Pinokio. Oh, yeah.

00:05:04.500 --> 00:05:07.800
This is super important. When E2-F5-TTS is installing

00:05:07.800 --> 00:05:09.879
and it does this automatically through Pinokio,

00:05:09.980 --> 00:05:12.279
do not interrupt it. Let it run. Let it run completely.

00:05:12.439 --> 00:05:14.160
It's downloading code, dependencies, all sorts

00:05:14.160 --> 00:05:16.319
of stuff. If you stop it, even for a second.

00:05:16.399 --> 00:05:18.000
It breaks. It breaks. You'll get errors later.

00:05:18.139 --> 00:05:20.620
Just let it go until it clearly says 100% done.

00:05:20.720 --> 00:05:24.670
Be patient. Why is that patience during the

00:05:24.670 --> 00:05:28.050
E2-F5-TTS install so vital? The model needs all its

00:05:28.050 --> 00:05:30.829
dependencies installed completely, without interruption,

00:05:30.990 --> 00:05:33.209
to avoid errors later. Makes sense. Okay, so

00:05:33.209 --> 00:05:36.810
installation's done. We open up E2-F5-TTS. You

00:05:36.810 --> 00:05:38.990
called it the cloning cockpit. Yeah, kind of

00:05:38.990 --> 00:05:40.889
looks like one. You'll see a few modes. The sources

00:05:40.889 --> 00:05:43.689
are really clear here. Stick with basic TTS.

00:05:43.910 --> 00:05:46.589
Basic TTS. Why that one? Because it puts all

00:05:46.589 --> 00:05:48.990
the AI's power into making one voice sound as

00:05:48.990 --> 00:05:51.620
good as possible. Highest quality. Best fidelity.

00:05:51.839 --> 00:05:54.180
Okay. There is a multi-speech mode for conversations,

00:05:54.259 --> 00:05:57.699
but honestly, it tries to juggle multiple voices

00:05:57.699 --> 00:05:59.560
at once, and the quality for each individual

00:05:59.560 --> 00:06:02.319
voice takes a hit, a noticeable hit. So if you

00:06:02.319 --> 00:06:04.560
need two speakers, better to generate them separately

00:06:04.560 --> 00:06:07.819
in basic TTS. Exactly. Generate speaker A's lines,

00:06:08.019 --> 00:06:10.279
generate speaker B's lines, then just edit them

00:06:10.279 --> 00:06:12.100
together later in your audio or video editor.

00:06:12.240 --> 00:06:14.759
Much cleaner result. Got it. Now, you said the

00:06:14.759 --> 00:06:16.639
most critical part of all this is the reference

00:06:16.639 --> 00:06:19.019
audio. Absolutely fundamental. The sample of

00:06:19.019 --> 00:06:20.959
the voice you feed the AI. The sources hammer

00:06:20.959 --> 00:06:23.379
this point home. Clean input equals clean output.

00:06:23.740 --> 00:06:26.040
Garbage in, garbage out. So what are the best

00:06:26.040 --> 00:06:29.240
practices? Non-negotiable stuff. Okay, number

00:06:29.240 --> 00:06:34.540
one, environment. Record in a really quiet room.

00:06:35.209 --> 00:06:38.310
No fans, no air conditioning hum, no computer

00:06:38.310 --> 00:06:41.129
noise, no traffic outside, dead quiet. Number

00:06:41.129 --> 00:06:43.970
two, style. Speak naturally, like you're having

00:06:43.970 --> 00:06:46.129
a conversation. Don't try to sound like a radio

00:06:46.129 --> 00:06:48.350
announcer unless that's the voice you want cloned.

00:06:48.509 --> 00:06:51.149
Just be natural. And length. How much audio do

00:06:51.149 --> 00:06:54.149
we need? Minimum. About 10 to 15 seconds of clear

00:06:54.149 --> 00:06:56.430
speech. That's the baseline. More is generally

00:06:56.430 --> 00:06:58.810
better, maybe up to a minute or so. But 10, 15

00:06:58.810 --> 00:07:02.129
seconds is usually enough to start. And crucially,

00:07:02.230 --> 00:07:05.740
crucially important, that audio file. It must

00:07:05.740 --> 00:07:08.279
be only the voice. No background music, no sound

00:07:08.279 --> 00:07:09.980
effects, definitely no other voices in there.

00:07:10.040 --> 00:07:12.040
Just the clean voice you want to clone. I actually

00:07:12.040 --> 00:07:13.500
learned that the hard way. I spent ages trying

00:07:13.500 --> 00:07:15.779
to clone my voice. Couldn't figure out why it

00:07:15.779 --> 00:07:18.399
sounded weird. Turns out the super faint hum

00:07:18.399 --> 00:07:20.399
from my external hard drive, like way in the

00:07:20.399 --> 00:07:23.339
background, was messing it all up. These models

00:07:23.339 --> 00:07:26.899
are sensitive to noise. They really are. So what

00:07:26.899 --> 00:07:29.300
are the two most essential factors for clean

00:07:29.300 --> 00:07:32.519
reference audio? Record 10-15 seconds of speech

00:07:32.519 --> 00:07:36.759
in a completely quiet room. Just the voice.

00:07:36.759 --> 00:07:39.139
Mid-roll sponsor read placeholder. Okay, let's

00:07:39.139 --> 00:07:40.720
get into the good stuff. The advanced settings.

00:07:40.879 --> 00:07:43.500
This is where you go from, you know, pretty good

00:07:43.500 --> 00:07:45.879
audio to, wow, that sounds flawless. This is

00:07:45.879 --> 00:07:48.420
the pro level. This is the pro level. And honestly,

00:07:48.579 --> 00:07:51.160
most people skip this, but they shouldn't. First

00:07:51.160 --> 00:07:54.879
easy win, text preparation. Just typing the script

00:07:54.879 --> 00:07:57.740
carefully. More than that. Punctuation. The AI

00:07:57.740 --> 00:08:00.399
actually reads the punctuation to figure out

00:08:00.399 --> 00:08:03.449
tone and pacing. Oh, interesting. Yeah. A

00:08:03.449 --> 00:08:05.370
question mark makes the voice go up at the end.

00:08:05.490 --> 00:08:08.470
An exclamation point adds energy. Commas add pauses.

00:08:08.509 --> 00:08:11.189
If you just type a block of text with no punctuation.

00:08:11.350 --> 00:08:14.110
It sounds flat. Robotic. Exactly. So punctuation

00:08:14.110 --> 00:08:16.709
matters. A lot. Okay. What else is in advanced

00:08:16.709 --> 00:08:19.550
settings? Real game changer. Seed control. This

00:08:19.550 --> 00:08:21.870
is about consistency. Seed control. Like a random

00:08:21.870 --> 00:08:24.550
number. Kind of. So the AI is great at matching

00:08:24.550 --> 00:08:27.329
your voice's sound. The timbre. Yeah. But the

00:08:27.329 --> 00:08:30.180
emotion. The inflection. That can vary each time

00:08:30.180 --> 00:08:32.159
you generate audio, even with the same text.

00:08:32.419 --> 00:08:35.220
The seed is the key to locking that down. Wait,

00:08:35.299 --> 00:08:37.960
whoa. Okay, so you find a generation that sounds

00:08:37.960 --> 00:08:41.779
perfect. The exact right emotional tone. And

00:08:41.779 --> 00:08:43.919
you can use the seed number to make it sound

00:08:43.919 --> 00:08:45.960
exactly like that every single time. Imagine

00:08:45.960 --> 00:08:48.799
scaling that perfect tone across like a thousand

00:08:48.799 --> 00:08:51.000
hours of course content. That's the power. It's

00:08:51.000 --> 00:08:53.960
huge. So when you get that perfect take, you

00:08:53.960 --> 00:08:57.600
need to find the seed number the AI used. It's

00:08:57.600 --> 00:09:00.059
usually in a log file or some metadata associated

00:09:00.059 --> 00:09:02.059
with the generated audio file. Find the magic

00:09:02.059 --> 00:09:04.500
number. Find the magic number. Then back in the

00:09:04.500 --> 00:09:06.460
settings, there's usually a checkbox, like use

00:09:06.460 --> 00:09:08.879
random seed. You uncheck that. Right. And then

00:09:08.879 --> 00:09:10.879
you type your magic seed number into the seed

00:09:10.879 --> 00:09:14.320
field. From then on, using that seed, the AI

00:09:14.320 --> 00:09:17.000
will reproduce that exact same emotional delivery

00:09:17.000 --> 00:09:20.159
every time. Consistency solved. That's incredible.

00:09:20.340 --> 00:09:22.179
Okay, what else is useful in advanced? Definitely

00:09:22.179 --> 00:09:25.019
check the box for remove silences. This automatically

00:09:25.019 --> 00:09:27.259
trims out little awkward pauses between words

00:09:27.259 --> 00:09:29.679
or sentences. Makes it sound tighter, punchier.

00:09:29.840 --> 00:09:31.960
Much punchier. Essential for stuff like social

00:09:31.960 --> 00:09:34.519
media clips, TikToks, shorts. Keeps the energy

00:09:34.519 --> 00:09:37.580
up. Good tip. Anything else? Yeah, speed adjustment.

00:09:38.019 --> 00:09:41.870
Sometimes the AI's default pace can feel just

00:09:41.870 --> 00:09:44.889
a tiny bit slow, a little unnatural. Tweaking

00:09:44.889 --> 00:09:47.549
the speed just slightly, like maybe 1.05 or

00:09:47.549 --> 00:09:50.230
1.1, can make it sound much more human, more

00:09:50.230 --> 00:09:52.049
conversational. So it's about experimenting.

00:09:52.250 --> 00:09:55.990
Listen, tweak, regenerate. Exactly. It's an iterative

00:09:55.990 --> 00:09:58.889
process. Listen, evaluate, adjust one setting,

00:09:58.990 --> 00:10:02.539
generate again. Until it's perfect. So if my

00:10:02.539 --> 00:10:04.799
tone keeps changing between audio files, what

00:10:04.799 --> 00:10:06.899
should I check first? Save and reuse the seed

00:10:06.899 --> 00:10:08.980
number in the advanced settings. That locks the

00:10:08.980 --> 00:10:11.840
tone. Got it. Seed number for consistency. Now

00:10:11.840 --> 00:10:13.820
let's quickly talk applications and maybe some

00:10:13.820 --> 00:10:15.940
troubleshooting. All right. Real world use. We

00:10:15.940 --> 00:10:18.539
mentioned multi-speech mode isn't ideal for

00:10:18.539 --> 00:10:21.059
quality. Basic TTS and editing is better. Okay.

00:10:21.220 --> 00:10:23.940
But think bigger picture. E-learning courses.

00:10:24.059 --> 00:10:26.200
You could generate 10, 20 hours of narration.

00:10:26.440 --> 00:10:29.700
High quality. No vocal strain for you. Wow. Yeah.

00:10:29.799 --> 00:10:32.879
Or audiobooks. Turning old blog posts into audiobooks.

00:10:32.899 --> 00:10:35.220
Totally. In your own voice. Or churning out tons

00:10:35.220 --> 00:10:37.539
of short voiceovers for social media ads or updates.

00:10:37.700 --> 00:10:41.519
Fast. It really shifts content creation from

00:10:41.519 --> 00:10:45.600
this manual grind to something more automated,

00:10:45.820 --> 00:10:49.200
more scalable. Exactly. Now, quick troubleshooting.

00:10:49.220 --> 00:10:52.000
If the voice sounds robotic. Check the reference

00:10:52.000 --> 00:10:54.419
audio. Record it again in a quieter room. Perfect.

00:10:54.580 --> 00:10:56.899
If there are too many weird pauses. Go find that

00:10:56.899 --> 00:10:59.299
remove silences checkbox in the advanced settings.

00:10:59.519 --> 00:11:02.080
Yep. And installation errors. If things just

00:11:02.080 --> 00:11:04.340
aren't working after install. Delete the model,

00:11:04.580 --> 00:11:07.379
reinstall it carefully, and absolutely let it

00:11:07.379 --> 00:11:10.039
finish 100% without interruptions. I got it.

00:11:10.059 --> 00:11:12.059
That covers like 90% of the common problems.

00:11:12.379 --> 00:11:14.919
Besides saving time on, you know, daily little

00:11:14.919 --> 00:11:17.659
voiceover tasks, what's the biggest long-term

00:11:17.659 --> 00:11:20.399
productivity gain here? Producing massive amounts

00:11:20.399 --> 00:11:22.470
of high-quality course or audiobook content

00:11:22.470 --> 00:11:24.809
really, really quickly. Producing massive amounts

00:11:24.809 --> 00:11:27.370
of high-quality course or audiobook content quickly

00:11:27.370 --> 00:11:31.269
opens up new possibilities. Okay, so wrapping up,

00:11:31.269 --> 00:11:34.350
we've done a deep dive into setting up free, private,

00:11:34.350 --> 00:11:37.250
unlimited voice cloning using Pinokio, using

00:11:37.250 --> 00:11:41.590
that E2-F5-TTS model. And the keys to really making

00:11:41.590 --> 00:11:44.909
it work seem to be, number one, super clean reference

00:11:44.909 --> 00:11:48.009
audio. And number two, getting comfortable with

00:11:48.009 --> 00:11:50.389
those advanced settings, especially using seed

00:11:50.389 --> 00:11:53.269
control for consistency. Nail those two and you're

00:11:53.269 --> 00:11:56.389
golden. Yeah. And remember the ethics. The tools

00:11:56.389 --> 00:11:59.230
are neutral. How you use them isn't. Right. Use

00:11:59.230 --> 00:12:01.950
it to enhance your own stuff. Save yourself time.

00:12:02.049 --> 00:12:04.929
Make your work more accessible. Great. But avoid

00:12:04.929 --> 00:12:07.470
anything that involves impersonation, fraud,

00:12:07.690 --> 00:12:10.789
deception. Don't use it to trick people. Elevate.

00:12:10.789 --> 00:12:14.669
Don't erode trust. Exactly. Use this power responsibly.

00:12:15.179 --> 00:12:17.240
So here's a final thought to leave you with.

00:12:17.320 --> 00:12:19.720
This tech, this really powerful voice cloning,

00:12:19.879 --> 00:12:23.000
it's now free. It runs locally. Anyone with a

00:12:23.000 --> 00:12:24.779
decent computer can use it. Yeah. The barrier

00:12:24.779 --> 00:12:28.860
to entry just vanished. So what does that mean

00:12:28.860 --> 00:12:30.919
for the future? What kind of checks and balances

00:12:30.919 --> 00:12:33.740
will big platforms, you know, YouTube, Spotify,

00:12:34.039 --> 00:12:36.779
social media, what will they need to build internally

00:12:36.779 --> 00:12:39.460
to make sure content is authentic to prevent

00:12:39.460 --> 00:12:42.259
misuse on a massive scale? That's the big question

00:12:42.259 --> 00:12:44.070
now, isn't it? Something to definitely think

00:12:44.070 --> 00:12:46.110
about as you start playing with this. It is.

00:12:46.110 --> 00:12:47.889
We're excited to see what you create. Go build

00:12:47.889 --> 00:12:51.330
amazing things responsibly. Until next time.