WEBVTT

00:00:00.000 --> 00:00:04.480
Are you paying every month for AI access tied

00:00:04.480 --> 00:00:07.540
to those big cloud services? And maybe worrying

00:00:07.540 --> 00:00:10.800
where your private thoughts or your company-sensitive

00:00:10.800 --> 00:00:13.259
docs are actually going? We kind of assume powerful

00:00:13.259 --> 00:00:17.379
AI needs, like a $20,000 computer. But that

00:00:17.379 --> 00:00:19.219
whole idea, it's pretty much outdated now. Welcome

00:00:19.219 --> 00:00:21.339
to the Deep Dive. Our mission today is to take

00:00:21.339 --> 00:00:23.579
this stack of research we've got here, focusing

00:00:23.579 --> 00:00:27.059
on tools like Ollama and LM Studio, and turn it

00:00:27.059 --> 00:00:29.699
into something really clear and actionable. We want

00:00:29.699 --> 00:00:32.159
to make local private AI something you can do

00:00:32.159 --> 00:00:34.939
right now on a laptop you probably already have.

00:00:35.140 --> 00:00:38.299
We'll cover the privacy wins, the cost savings,

00:00:38.579 --> 00:00:41.259
and then the how: breaking down things like quantization

00:00:41.259 --> 00:00:43.840
and picking the right AI model for your machine.

00:00:44.079 --> 00:00:47.039
Right. This is really all about taking back control

00:00:47.039 --> 00:00:50.280
over your own compute. These terms like parameters

00:00:50.280 --> 00:00:52.700
and quantization, they sound complicated, maybe

00:00:52.700 --> 00:00:54.710
even intimidating. We're going to pull back

00:00:54.710 --> 00:00:55.869
the curtain on those. We're going to show you

00:00:55.869 --> 00:00:58.109
exactly how you can get powerful private AI running

00:00:58.109 --> 00:01:00.289
today using the gear you've already got sitting

00:01:00.289 --> 00:01:02.609
right there. OK. So let's unpack that first part.

00:01:03.130 --> 00:01:05.989
The motivation. Why bother downloading software,

00:01:06.150 --> 00:01:08.049
maybe messing with the terminal, when you can

00:01:08.049 --> 00:01:10.670
just open a chat window in your browser? It seems

00:01:10.670 --> 00:01:13.730
easier. The sources we looked at lay out six

00:01:13.730 --> 00:01:16.170
pretty compelling reasons. Yeah. And the first

00:01:16.170 --> 00:01:19.170
one hits you right away. Cost. Yeah. Once you

00:01:19.170 --> 00:01:21.409
download that initial model file, that's it.

00:01:21.409 --> 00:01:24.579
It costs nothing more to run, forever. No monthly

00:01:24.579 --> 00:01:27.659
fees, no paying per query, none of that. The

00:01:27.659 --> 00:01:30.599
only real cost is like a tiny bit of electricity.

00:01:31.680 --> 00:01:33.719
And tied right into that is getting away from

00:01:33.719 --> 00:01:35.719
limits, right? Every commercial service puts

00:01:35.719 --> 00:01:37.560
caps on you, especially if you use it a lot.

00:01:37.659 --> 00:01:39.739
Exactly. With these local models, you could run

00:01:39.739 --> 00:01:41.540
a thousand queries in an hour if you wanted.

00:01:42.560 --> 00:01:45.019
The AI itself never tells you, nope, you've hit

00:01:45.019 --> 00:01:47.219
your limit for today. But privacy, you mentioned

00:01:47.219 --> 00:01:49.500
control. That seems like the really big one for

00:01:49.500 --> 00:01:52.060
a lot of people. Oh, it's huge. Probably the

00:01:52.060 --> 00:01:55.379
biggest driver. See, when you chat online, your

00:01:55.379 --> 00:01:58.420
conversation, your prompts, your data, it all

00:01:58.420 --> 00:02:00.900
goes to someone else's server, who knows where.

00:02:01.219 --> 00:02:03.780
Running it locally, everything stays right there

00:02:03.780 --> 00:02:06.620
on your machine, 100%. Your secret company plans,

00:02:06.859 --> 00:02:08.879
your personal journal ideas, whatever it is,

00:02:09.280 --> 00:02:11.639
it doesn't leave. That peace of mind, you can't

00:02:11.639 --> 00:02:13.139
really put a price on it. And then there's just

00:02:13.139 --> 00:02:15.960
the practical side. You get actual offline capability.

00:02:16.139 --> 00:02:18.740
I mean, you could be on a plane, no Wi-Fi, or

00:02:18.740 --> 00:02:22.039
maybe somewhere remote. The AI just works anywhere.

00:02:22.199 --> 00:02:24.180
Yeah, that's super useful. And you also get version

00:02:24.180 --> 00:02:26.520
control. This is kind of neat. If you find a

00:02:26.520 --> 00:02:28.919
model version that just works for you, gets your

00:02:28.919 --> 00:02:32.139
style, you can keep that exact version indefinitely.

00:02:32.419 --> 00:02:34.860
You're not forced into updates that might suddenly

00:02:34.860 --> 00:02:37.479
change how the AI responds, which definitely

00:02:37.479 --> 00:02:39.599
happens with the big online ones. Right. And

00:02:39.599 --> 00:02:41.659
the last one is customization. This sounds more

00:02:41.659 --> 00:02:43.819
advanced, but really powerful. It is. You can

00:02:43.819 --> 00:02:46.319
actually fine tune these models. Yeah. That means

00:02:46.319 --> 00:02:49.060
you could say, teach it all the specific jargon

00:02:49.060 --> 00:02:51.699
for your industry, or even train it to mimic

00:02:51.699 --> 00:02:54.360
your personal writing style. That kind of deep,

00:02:54.699 --> 00:02:57.639
personalized training. Just impossible with the

00:02:57.639 --> 00:02:59.900
closed off commercial models. OK, so that covers

00:02:59.900 --> 00:03:02.520
the why. But there's always that nagging question.

00:03:02.860 --> 00:03:06.020
Are these local models actually any good? I remember

00:03:06.020 --> 00:03:08.680
trying some early ones, and they were, well...

00:03:08.889 --> 00:03:11.270
Not great. Yeah, they used to be pretty basic.

00:03:11.469 --> 00:03:13.610
Kind of dumb, honestly. Is that still the case

00:03:13.610 --> 00:03:16.210
or is that just a myth now? That myth is completely

00:03:16.210 --> 00:03:18.849
busted. Seriously, the open source world is moving

00:03:18.849 --> 00:03:21.110
incredibly fast. We're seeing models released

00:03:21.110 --> 00:03:23.969
almost daily that often match or even beat older

00:03:23.969 --> 00:03:27.569
systems like, say, GPT-3.5. And the crucial

00:03:27.569 --> 00:03:30.849
part is they run fine on regular laptops, MacBooks,

00:03:30.849 --> 00:03:32.990
Windows machines. You don't need some monster

00:03:32.990 --> 00:03:35.490
gaming rig anymore. OK, so if they're powerful

00:03:35.490 --> 00:03:39.199
now and free and limitless. What's the catch?

00:03:39.379 --> 00:03:41.780
What's the main trade-off compared to just using

00:03:41.780 --> 00:03:44.080
a cloud service? The trade-off really comes

00:03:44.080 --> 00:03:46.120
down to managing your own computer's memory.

00:03:46.500 --> 00:03:48.620
That's the main constraint. Right. Memory. That

00:03:48.620 --> 00:03:50.699
brings us perfectly into the next bit. We need

00:03:50.699 --> 00:03:53.199
to understand what an AI model is file-wise

00:03:53.199 --> 00:03:55.840
to get why memory matters so much. Okay, yeah.

00:03:56.060 --> 00:03:58.879
So an AI model, at its core, it's just a file.

00:03:59.020 --> 00:04:02.659
A really, really big file. Think of it like stacking

00:04:02.659 --> 00:04:06.039
billions of tiny Lego blocks made of data. This

00:04:06.039 --> 00:04:08.800
file contains billions, literally billions of

00:04:08.800 --> 00:04:10.560
numbers. We call them parameters or sometimes

00:04:10.560 --> 00:04:13.639
weights. These numbers represent everything the

00:04:13.639 --> 00:04:15.800
AI learned during its training. All the patterns,

00:04:15.939 --> 00:04:18.139
the connections. So when you download an eight

00:04:18.139 --> 00:04:20.379
billion parameter model, you're grabbing a file

00:04:20.379 --> 00:04:22.379
with eight billion of these numbers. It's chunky.

00:04:22.860 --> 00:04:24.720
Billions of numbers, okay. And you need something

00:04:24.720 --> 00:04:27.439
special to actually read and, well, run that

00:04:27.439 --> 00:04:30.360
giant file. That's where Ollama comes in. Exactly.

00:04:30.600 --> 00:04:33.500
If the model file is like that super-dense, complex

00:04:33.500 --> 00:04:37.060
sheet music, Ollama is the specialized music player

00:04:37.060 --> 00:04:40.319
designed just for AI scores. It does three main

00:04:40.319 --> 00:04:42.899
jobs really well. First, it's a downloader. It

00:04:42.899 --> 00:04:45.180
knows how to handle fetching these enormous multi

00:04:45.180 --> 00:04:48.000
-gigabyte files reliably. Second, it's the engine.

00:04:48.110 --> 00:04:50.189
It takes those billions of parameters and loads

00:04:50.189 --> 00:04:52.350
them into your computer's active memory so the

00:04:52.350 --> 00:04:55.310
AI can actually think. And third, and this is

00:04:55.310 --> 00:04:57.850
kind of cool for flexibility, it acts as an interface.

00:04:58.410 --> 00:05:00.850
It quietly starts up a sort of hidden software

00:05:00.850 --> 00:05:03.269
door on your computer. Technically, it's an API

00:05:03.269 --> 00:05:05.589
server running on local host. This door lets

00:05:05.589 --> 00:05:07.569
other applications on your machine talk directly

00:05:07.569 --> 00:05:09.689
to the AI model that Ollama is running. Okay,

00:05:09.850 --> 00:05:11.750
that interface part sounds important for later,

00:05:11.810 --> 00:05:14.089
but you mentioned memory. And there's a key difference

00:05:14.089 --> 00:05:15.769
depending on the type of computer someone has,

00:05:15.970 --> 00:05:18.790
right? Mac versus Windows. Yes, absolutely critical

00:05:18.790 --> 00:05:21.110
distinction here. It changes how much memory

00:05:21.110 --> 00:05:24.410
is actually available for the AI. So if you're

00:05:24.410 --> 00:05:27.089
on a Mac with an Apple Silicon chip, M1, M2,

00:05:27.089 --> 00:05:28.949
M3, whatever, you have what's called unified

00:05:28.949 --> 00:05:32.370
memory. This is great for AI. It means the main

00:05:32.370 --> 00:05:34.949
processor, CPU, and the graphics processor, GPU,

00:05:35.449 --> 00:05:37.600
share the same pool of RAM. So if your MacBook

00:05:37.600 --> 00:05:40.660
has, say, 16 gigabytes of RAM total, pretty much

00:05:40.660 --> 00:05:42.959
all 16 gigs can potentially be used by the AI

00:05:42.959 --> 00:05:45.279
model. It's simpler. OK. If you're on a typical

00:05:45.279 --> 00:05:47.920
Windows PC, especially one with a dedicated Nvidia

00:05:47.920 --> 00:05:49.959
graphics card, things are different. You usually

00:05:49.959 --> 00:05:52.600
have your main system RAM, maybe 16 or 32 gigs.

00:05:53.319 --> 00:05:55.019
And then crucially, the graphics card has its

00:05:55.019 --> 00:05:57.579
own separate memory called VRAM, or video RAM.

00:05:58.180 --> 00:06:00.639
For AI tasks, that VRAM on the graphics card

00:06:00.639 --> 00:06:02.949
is the golden ticket. It's much faster for the

00:06:02.949 --> 00:06:05.790
parallel processing AI needs. So ideally, the

00:06:05.790 --> 00:06:07.990
entire AI model needs to fit into that graphics

00:06:07.990 --> 00:06:10.310
card's VRAM. That's often the limiting factor

00:06:10.310 --> 00:06:12.889
on PCs. That makes sense. The version control

00:06:12.889 --> 00:06:14.629
aspect you mentioned earlier really resonates,

00:06:14.810 --> 00:06:16.750
too. I have to admit, I still wrestle sometimes

00:06:16.750 --> 00:06:19.970
with prompt drift when you're talking to an AI.

00:06:20.430 --> 00:06:21.970
And halfway through, it just seems to forget

00:06:21.970 --> 00:06:24.089
what you asked it to do initially. It's frustrating.

00:06:24.269 --> 00:06:26.689
So I'm genuinely grateful these local models

00:06:26.689 --> 00:06:29.209
let you just stop and restart with a clean, predictable

00:06:29.209 --> 00:06:31.329
version whenever you need that consistency back.

00:06:31.529 --> 00:06:33.589
So OK, let's make it concrete. Someone wants

00:06:33.589 --> 00:06:35.629
to try this. What's the simplest way to get started?

00:06:36.529 --> 00:06:39.949
Easiest path. First, download Ollama from their

00:06:39.949 --> 00:06:42.569
website. Install it like any other app. Then

00:06:42.569 --> 00:06:44.550
you open up your terminal. Yeah, that black command

00:06:44.550 --> 00:06:47.170
line window. Don't be scared. And you just type

00:06:47.170 --> 00:06:50.810
one single command: ollama run llama3:8b.

00:06:51.149 --> 00:06:54.350
Hit enter. Ollama run llama3:8b. And what

00:06:54.350 --> 00:06:57.170
does that do? That tells Ollama: go find the

00:06:57.170 --> 00:06:59.490
model named Llama 3 with 8 billion parameters.
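
NOTE
For reference, the command being described looks roughly like this (a sketch assuming a standard Ollama install; exact model tags vary, so check the Ollama model library):
ollama run llama3:8b   # downloads the model on first use, then drops you into an interactive chat
/bye                   # exits that chat session when you're done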

00:06:59.790 --> 00:07:01.430
Download it if you don't have it and then run

00:07:01.430 --> 00:07:03.889
it. You'll start downloading. That specific model,

00:07:04.029 --> 00:07:06.930
Llama 3 8B, is a fantastic starting point. Very

00:07:06.930 --> 00:07:09.050
capable, but the file size is manageable, only

00:07:09.050 --> 00:07:12.050
about 4.7 gigabytes. Most modern machines can

00:07:12.050 --> 00:07:13.709
handle that download and have enough memory to

00:07:13.709 --> 00:07:17.310
run it. OK, 4.7 gigs is manageable, but these

00:07:17.310 --> 00:07:19.370
model files are still pretty big. If you start

00:07:19.370 --> 00:07:21.430
downloading a few, you could eat up, I don't

00:07:21.430 --> 00:07:23.670
know, tens, maybe hundreds of gigs pretty fast.

00:07:24.029 --> 00:07:26.250
How do you, like, keep track of what you've installed

00:07:26.250 --> 00:07:28.709
and clean things up? Good question. Ollama has

00:07:28.709 --> 00:07:31.430
simple commands for that too. You can type ollama list

00:07:31.430 --> 00:07:33.129
to see all the models you've downloaded and their

00:07:33.129 --> 00:07:35.389
sizes. And if you want to remove one to free

00:07:35.389 --> 00:07:38.730
up space, just use ollama rm followed by the

00:07:38.730 --> 00:07:42.889
model name, like ollama rm llama3:8b. Easy

00:07:42.889 --> 00:07:46.089
to manage. OK, easy enough. Ollama list and ollama rm,
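
NOTE
The housekeeping commands mentioned here, roughly (assuming a current Ollama CLI; exact output formatting may differ by version):
ollama list            # lists downloaded models and their on-disk sizes
ollama rm llama3:8b    # removes a model to free up space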

00:07:46.089 --> 00:07:48.930
got it. But hang on. You said the 8 billion parameter

00:07:48.930 --> 00:07:52.350
model is 4.7 gigabytes. If a parameter is a

00:07:52.350 --> 00:07:54.949
number and there are 8 billion of them, shouldn't

00:07:54.949 --> 00:07:57.009
the file be much, much bigger? How does it work?

00:07:57.640 --> 00:08:00.040
That brings us to the real magic trick of running

00:08:00.040 --> 00:08:03.160
modern AI locally. It's a technique called quantization.

00:08:03.579 --> 00:08:05.879
This is the absolute key that lets these huge

00:08:05.879 --> 00:08:08.180
powerful models shrink down enough to fit onto

00:08:08.180 --> 00:08:10.240
regular computers. Quantization. Okay, what does

00:08:10.240 --> 00:08:13.079
it do? Basically, it takes all those billions

00:08:13.079 --> 00:08:18.819
of very precise numbers inside the model, like 3

00:08:18.819 --> 00:08:20.839
.14159265, and it makes them less precise. It

00:08:20.839 --> 00:08:22.600
might round them down to something simpler, like

00:08:22.600 --> 00:08:25.259
just 3.14. Think of it like image compression.

00:08:25.459 --> 00:08:27.779
You know how you can take a massive, super detailed

00:08:27.779 --> 00:08:30.620
raw photo file, maybe 100 megabytes, and save

00:08:30.620 --> 00:08:32.840
it as a JPEG that's only like five megabytes?

00:08:33.559 --> 00:08:35.629
Quantization is doing something similar. But

00:08:35.629 --> 00:08:38.450
for the AI's knowledge. It's a kind of lossy

00:08:38.450 --> 00:08:40.490
compression, but highly optimized for these neural

00:08:40.490 --> 00:08:42.330
networks. Lossy compression. Doesn't that mean

00:08:42.330 --> 00:08:45.409
you're losing information? Is the AI getting

00:08:45.409 --> 00:08:47.629
dumber when you quantize it? That's the amazing

00:08:47.629 --> 00:08:50.190
part. You do lose a tiny bit of precision, yes.

00:08:50.750 --> 00:08:52.960
But the trade -off is incredible. Quantized models

00:08:52.960 --> 00:08:55.799
can shrink by 50, 60, even 70% in file size,

00:08:56.000 --> 00:08:58.200
but they typically only lose maybe 10 to 20%

00:08:58.200 --> 00:09:00.120
of their raw performance score on benchmarks,

00:09:00.179 --> 00:09:02.600
sometimes even less. Wow. It's a fantastic deal.
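
NOTE
A back-of-envelope sketch in Python of why the file shrinks this much; the bit widths are illustrative only, since real quantization formats mix precisions and add metadata:
params = 8_000_000_000                   # an "8B" model
full_precision_gb = params * 2 / 1e9     # ~16 GB at 16 bits (2 bytes) per weight
quantized_4bit_gb = params * 0.5 / 1e9   # ~4 GB at 4 bits, close to the ~4.7 GB download
print(full_precision_gb, quantized_4bit_gb)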

00:09:02.679 --> 00:09:04.519
You get a model that's drastically smaller and

00:09:04.519 --> 00:09:06.679
needs way less memory, but it's still incredibly

00:09:06.679 --> 00:09:09.639
smart. That's why that 8 billion parameter Llama

00:09:09.639 --> 00:09:12.919
3 model ends up being only 4.7 gigs instead

00:09:12.919 --> 00:09:16.220
of like 16 or 30 gigs. Whoa. OK, hold on. If

00:09:16.220 --> 00:09:18.460
you can shrink models that much with only a small

00:09:18.460 --> 00:09:21.580
performance hit. You could potentially take something

00:09:21.580 --> 00:09:26.000
massive, like a 70 billion parameter model, quantize

00:09:26.000 --> 00:09:28.179
it, and maybe actually run it on a high-end

00:09:28.179 --> 00:09:30.480
laptop. Exactly. That's happening right now.

00:09:30.899 --> 00:09:33.759
People are running quantized 70B models on Macs

00:09:33.759 --> 00:09:36.639
with enough unified memory or PCs with beefy

00:09:36.639 --> 00:09:38.860
graphics cards. It completely changes the game,

00:09:39.100 --> 00:09:42.580
democratizing access to really powerful AI. It's

00:09:42.580 --> 00:09:44.559
not just for giant data centers anymore. Is that

00:09:44.559 --> 00:09:47.220
10-20% performance loss ever really noticeable

00:09:47.220 --> 00:09:48.840
though? Like if you're asking it to do something

00:09:48.840 --> 00:09:51.960
really complex or creative? Honestly, for most

00:09:51.960 --> 00:09:54.940
everyday tasks, writing emails, summarizing articles,

00:09:55.220 --> 00:09:58.220
brainstorming ideas, even coding help, you likely

00:09:58.220 --> 00:09:59.840
won't notice the difference between a quantized

00:09:59.840 --> 00:10:01.980
model and the full precision original. Maybe

00:10:01.980 --> 00:10:03.779
if you were doing highly specialized scientific

00:10:03.779 --> 00:10:06.519
modeling or something requiring extreme numerical

00:10:06.519 --> 00:10:08.659
accuracy, you might stick with a full-size one.

00:10:08.779 --> 00:10:12.440
But for 95% of us, the efficiency gained from

00:10:12.440 --> 00:10:14.820
quantization is absolutely worth that tiny dip

00:10:14.820 --> 00:10:17.360
in performance. OK, that makes sense. So assuming

00:10:17.360 --> 00:10:19.620
we're using these standard quantized models that

00:10:19.620 --> 00:10:21.940
Alama usually downloads by default, can we give

00:10:21.940 --> 00:10:24.039
people some simple guidelines for picking a model

00:10:24.039 --> 00:10:26.720
based on their computer's memory? Yeah, definitely.

00:10:26.879 --> 00:10:28.639
Simple rules of thumb work pretty well here.

00:10:29.000 --> 00:10:31.500
If your machine has about 8 gigabytes of available

00:10:31.500 --> 00:10:34.779
RAM or VRAM, if you're on that NVIDIA PC, you

00:10:34.779 --> 00:10:36.120
should probably stick to the smaller models,

00:10:36.299 --> 00:10:39.090
like 7 billion or 8 billion parameters. So that

00:10:39.090 --> 00:10:41.950
Llama 3 8B we keep mentioning is perfect. Got

00:10:41.950 --> 00:10:44.769
it. Eight gigs? Stick to 7B or 8B. What about

00:10:44.769 --> 00:10:47.190
more memory? If you've got 16 gigabytes of RAM

00:10:47.190 --> 00:10:49.529
or VRAM, you can comfortably run larger models

00:10:49.529 --> 00:10:52.370
like 13 billion or even up to around 16 billion

00:10:52.370 --> 00:10:54.950
parameters. For coders, a great one in that range

00:10:54.950 --> 00:10:59.889
is DeepSeek-Coder-V2 16B. It's specifically trained

00:10:59.889 --> 00:11:02.049
for programming tasks and it's really impressive.

00:11:02.759 --> 00:11:07.259
Nice. DeepSeek-Coder-V2 16B for 16 gigs.

00:11:07.500 --> 00:11:09.720
And for the power users, people with 32 gigs

00:11:09.720 --> 00:11:12.539
or more. Ah, now you're talking. With 32 gigs

00:11:12.539 --> 00:11:14.519
or more, you can start running the real heavy

00:11:14.519 --> 00:11:17.220
hitters. You can handle 34 billion parameter

00:11:17.220 --> 00:11:19.899
models, or even the big 70 billion ones like

00:11:19.899 --> 00:11:23.700
Llama 3 70B. And another interesting one to try,

00:11:23.740 --> 00:11:26.120
maybe in the smaller size like 7B, is WizardLM

00:11:26.120 --> 00:11:30.740
2 7B. It's known for being less, uh, censored

00:11:30.740 --> 00:11:33.779
or aligned than some others. It gives more direct,

00:11:33.879 --> 00:11:36.059
sometimes unfiltered answers, which can be useful

00:11:36.059 --> 00:11:38.580
depending on what you're doing. OK. Lots of options.
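
NOTE
A rough sketch of these rules of thumb as a quick calculation; the 4-bit assumption and the 1.2x working-memory overhead are illustrative guesses, not hard numbers:
def rough_fit(params_billion, mem_gb, bits=4, overhead=1.2):
    # Estimate a quantized model's footprint and whether it fits in available RAM/VRAM.
    size_gb = params_billion * bits / 8 * overhead
    return round(size_gb, 1), size_gb <= mem_gb
print(rough_fit(8, 8))    # an 8B model on an 8 GB machine
print(rough_fit(34, 32))  # a 34B model on a 32 GB machine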

00:11:39.220 --> 00:11:40.840
But running commands in the terminal is cool

00:11:40.840 --> 00:11:43.000
for setup, maybe for scripting, but for just

00:11:43.000 --> 00:11:45.519
chatting with the AI day to day, most people

00:11:45.519 --> 00:11:47.679
probably want something friendlier, a nice interface.

00:11:47.759 --> 00:11:49.600
How do we get that? Right. You want a proper

00:11:49.600 --> 00:11:52.120
chat window, history settings, all that. That's

00:11:52.120 --> 00:11:54.700
where tools like LM Studio come in, or other

00:11:54.700 --> 00:11:57.159
similar apps, there are a few now. LM Studio

00:11:57.159 --> 00:11:59.580
basically provides that polished graphical user

00:11:59.580 --> 00:12:02.360
interface, a GUI, that sits on top of your local

00:12:02.360 --> 00:12:05.220
models. It's a dedicated chat program. It's really

00:12:05.220 --> 00:12:07.059
good because it often shows you helpful info,

00:12:07.200 --> 00:12:09.779
like how much CPU or RAM your AI is using while

00:12:09.779 --> 00:12:11.980
it's thinking, and makes it super easy to just

00:12:11.980 --> 00:12:13.500
switch between the different models you've downloaded

00:12:13.500 --> 00:12:15.620
with a click. Okay, LM Studio sounds like

00:12:15.620 --> 00:12:18.399
the answer, but if we already downloaded our

00:12:18.399 --> 00:12:21.519
models using Ollama and Ollama is running the engine,

00:12:21.919 --> 00:12:24.039
we don't want LM Studio to download everything

00:12:24.039 --> 00:12:26.740
all over again, right? That wastes space. Exactly.

00:12:27.139 --> 00:12:29.360
You want to avoid doubling up. The smartest way

00:12:29.360 --> 00:12:32.159
is to install LM Studio, but then configure it

00:12:32.159 --> 00:12:34.139
to talk to the Ollama server that's already running

00:12:34.139 --> 00:12:36.860
on your machine. Remember that hidden door Ollama

00:12:36.860 --> 00:12:39.899
opens? LM Studio can just connect directly to

00:12:39.899 --> 00:12:43.289
that. Most of these GUI tools have a setting

00:12:43.289 --> 00:12:46.129
somewhere to point it at an existing Ollama instance.

00:12:46.590 --> 00:12:49.009
You tell LM Studio, hey, don't download models

00:12:49.009 --> 00:12:52.769
yourself, just talk to Ollama at localhost:11434.
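
NOTE
A minimal sketch of talking to that local door directly from Python's standard library; it assumes Ollama's default port 11434, its /api/generate endpoint, and that llama3:8b has already been pulled:
import json, urllib.request
payload = {"model": "llama3:8b", "prompt": "In one sentence, what is quantization?", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])  # the model's reply, generated entirely locally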

00:12:53.149 --> 00:12:55.309
Then boom, all the models you got with Ollama

00:12:55.309 --> 00:12:57.649
just appear in LM Studio's chat interface, ready

00:12:57.649 --> 00:13:00.769
to use. No redundant downloads. Perfect. So connect

00:13:00.769 --> 00:13:03.149
LM Studio to the running Ollama. That's the efficient

00:13:03.149 --> 00:13:05.169
path. Now, once you're set up with a nice interface,

00:13:05.389 --> 00:13:07.190
the quality of what you get out still depends

00:13:07.190 --> 00:13:09.289
heavily on what you put in, right? Good prompting

00:13:09.289 --> 00:13:11.929
is still key. Oh, absolutely. Garbage in, garbage

00:13:11.929 --> 00:13:15.070
out still applies, even with powerful local models.

00:13:15.590 --> 00:13:17.429
The sources we looked at really hammered this

00:13:17.429 --> 00:13:20.129
point. You need detailed prompts. They suggest

00:13:20.129 --> 00:13:22.629
focusing on defining five key things for the

00:13:22.629 --> 00:13:25.970
AI: its role, the specific task you want done,

00:13:26.169 --> 00:13:29.090
the overall goal, the reason why, and the desired

00:13:29.090 --> 00:13:33.169
tone. Role, task, goal, reason, tone. OK. If

00:13:33.169 --> 00:13:35.049
you just say, write an email, you'll get something

00:13:35.049 --> 00:13:37.610
generic. But if you structure it like, OK, act

00:13:37.610 --> 00:13:39.309
as a professional employee, that's the role.

00:13:39.450 --> 00:13:41.990
Your task is to write an email draft to my manager,

00:13:42.190 --> 00:13:44.669
whose name is Sarah. The goal is to politely

00:13:44.669 --> 00:13:46.590
request a two-day extension on the quarterly

00:13:46.590 --> 00:13:49.370
report. The reason is that the final sales data

00:13:49.370 --> 00:13:51.889
only arrived this morning. Please maintain a

00:13:51.889 --> 00:13:54.190
polite but confident tone. Much more specific.

00:13:54.309 --> 00:13:56.070
Way more specific. And you'll get a much, much

00:13:56.070 --> 00:13:58.370
better, more tailored result almost every time.

00:13:58.850 --> 00:14:01.629
That level of detail really guides the AI effectively.
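
NOTE
One way to sketch that five-part structure as a small reusable template; the field names and wording here are just an illustration, not a required format:
def build_prompt(role, task, goal, reason, tone):
    # Combine role, task, goal, reason, and tone into a single detailed prompt.
    return (f"Act as {role}. Your task is to {task}. "
            f"The goal is to {goal}. The reason is that {reason}. "
            f"Please keep a {tone} tone.")
print(build_prompt(
    "a professional employee", "write an email draft to my manager, Sarah",
    "politely request a two-day extension on the quarterly report",
    "the final sales data only arrived this morning", "polite but confident"))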

00:14:02.029 --> 00:14:05.309
OK, good prompting advice. Now, what about troubleshooting?

00:14:05.809 --> 00:14:07.490
People trying this for the first time might run

00:14:07.490 --> 00:14:10.610
into things that seem weird. Let's normalize

00:14:10.610 --> 00:14:13.590
a couple of common experiences. First one, your

00:14:13.590 --> 00:14:15.409
computer might suddenly sound like a jet engine

00:14:15.409 --> 00:14:18.769
and feel pretty warm. Yeah, expect that. Running

00:14:18.769 --> 00:14:20.629
these AI models, especially the bigger ones,

00:14:20.809 --> 00:14:23.230
is computationally intensive. It uses a lot of

00:14:23.230 --> 00:14:27.049
processing power, often on the GPU. So your computer's

00:14:27.049 --> 00:14:30.409
fans are going to spin up fast. You'll hear them.

00:14:30.669 --> 00:14:33.320
The machine might get noticeably warm. That's

00:14:33.320 --> 00:14:35.200
just the sign it's doing the heavy lifting required.

00:14:35.340 --> 00:14:37.740
It's totally normal. Don't panic. OK. Loud fans

00:14:37.740 --> 00:14:40.399
and heat are normal. Good to know. Second thing,

00:14:40.879 --> 00:14:43.220
the very first time you ask a newly loaded model

00:14:43.220 --> 00:14:45.240
a question, it might seem really slow to answer,

00:14:45.379 --> 00:14:48.299
like maybe 20 or 30 seconds of silence. Mm-hmm.

00:14:48.500 --> 00:14:51.139
That also happens, and it's expected. That first

00:14:51.139 --> 00:14:53.700
query involves Ollama loading all those billions

00:14:53.700 --> 00:14:56.100
of parameters from your storage drive into the

00:14:56.100 --> 00:14:59.019
active RAM or VRAM. That takes time. Think of

00:14:59.019 --> 00:15:01.559
it as the AI waking up or warming up. Once it's

00:15:01.559 --> 00:15:03.960
loaded into memory, though, any subsequent questions

00:15:03.960 --> 00:15:06.039
you ask in that same session should get much,

00:15:06.080 --> 00:15:08.600
much faster responses. Usually just a few seconds.

00:15:09.019 --> 00:15:10.879
Right. So be patient with that first prompt.

00:15:11.039 --> 00:15:13.960
Exactly. And just a reminder about storage, keep

00:15:13.960 --> 00:15:16.500
an eye on those file sizes with ollama list.

00:15:17.220 --> 00:15:19.120
Clean out models you're not actively using with

00:15:19.120 --> 00:15:21.320
ollama rm. And really, the final piece of advice

00:15:21.320 --> 00:15:25.139
is just experiment. Try different models. See

00:15:25.139 --> 00:15:28.080
how Llama 3 feels for general writing. Then switch

00:15:28.080 --> 00:15:30.679
to DeepSeek for some coding. Maybe try WizardLM

00:15:30.679 --> 00:15:34.019
2 if you want less filtered responses. Find

00:15:34.019 --> 00:15:36.259
the personality and skills that best fit what

00:15:36.259 --> 00:15:38.259
you need to do. That seems like a great place

00:15:38.259 --> 00:15:40.779
to land. Yeah. If we just zoom out for a second,

00:15:40.799 --> 00:15:43.240
think about what we've covered. The core achievement

00:15:43.240 --> 00:15:46.559
here is pretty profound, actually. You, the listener,

00:15:46.899 --> 00:15:49.399
now have the practical knowledge to harness really

00:15:49.399 --> 00:15:52.799
powerful AI tools completely for free, privately,

00:15:53.179 --> 00:15:55.320
on your own machine, offline, if you need to

00:15:55.320 --> 00:15:57.600
be, with total control over which version you

00:15:57.600 --> 00:16:00.539
use and absolutely no usage limits imposed by

00:16:00.539 --> 00:16:02.690
anyone else. You've essentially bypassed the

00:16:02.690 --> 00:16:04.909
dependence on those big centralized cloud providers

00:16:04.909 --> 00:16:07.769
for this capability. It really is a feeling of

00:16:07.769 --> 00:16:10.830
taking back control, self-sovereignty over your

00:16:10.830 --> 00:16:13.450
computing in a way. The future of AI isn't just

00:16:13.450 --> 00:16:15.769
happening out there in the cloud. It's shifting

00:16:15.769 --> 00:16:18.950
rapidly onto the hardware you own. So don't just

00:16:18.950 --> 00:16:21.409
listen to us talk about it. Take the advice from

00:16:21.409 --> 00:16:24.590
the source material. Go download Ollama. Open

00:16:24.590 --> 00:16:28.610
that terminal, type ollama run llama3:8b, and

00:16:28.610 --> 00:16:31.740
start exploring today. And maybe a final thought

00:16:31.740 --> 00:16:34.039
to chew on. We talked about quantization, right?

00:16:34.039 --> 00:16:36.120
How losing a tiny bit of numerical precision

00:16:36.120 --> 00:16:38.700
gives us these massive gains in efficiency, letting

00:16:38.700 --> 00:16:40.860
big models run locally. It makes you wonder,

00:16:41.320 --> 00:16:43.860
what happens next? What will the next generation

00:16:43.860 --> 00:16:46.480
of clever lossy compression techniques for AI

00:16:46.480 --> 00:16:48.580
models look like? Could we reach a point where

00:16:48.580 --> 00:16:50.980
we can not just run, but maybe even train surprisingly

00:16:50.980 --> 00:16:53.720
large models entirely on consumer -grade hardware?

00:16:54.139 --> 00:16:55.679
That's something interesting to consider as you

00:16:55.679 --> 00:16:56.139
start your journey.
