WEBVTT

00:00:00.000 --> 00:00:02.779
We often talk about AI, like it's just one huge

00:00:02.779 --> 00:00:06.099
thing, right? One big brain. But there's actually

00:00:06.099 --> 00:00:09.119
a massive difference, a really functional difference

00:00:09.119 --> 00:00:12.740
between, say, an AI that's read everything and

00:00:12.740 --> 00:00:15.519
can only talk about a solution versus one that

00:00:15.519 --> 00:00:18.780
can actually do the task for you. Right. It's

00:00:18.780 --> 00:00:21.399
pure theory versus action. It's like the shift

00:00:21.399 --> 00:00:24.420
from asking a large language model, an LLM, how

00:00:24.420 --> 00:00:27.339
to book a cheap flight to telling a language

00:00:27.339 --> 00:00:30.059
action model, a LAM: hey, book the cheapest

00:00:30.059 --> 00:00:33.079
flight, email me the receipt. That's the

00:00:33.079 --> 00:00:35.579
jump from talking to actually doing. Welcome

00:00:35.579 --> 00:00:37.840
to the Deep Dive. Yeah. And look, if you're starting

00:00:37.840 --> 00:00:40.280
to feel a bit lost in all the acronyms flying

00:00:40.280 --> 00:00:43.340
around, you're definitely not alone. Our mission

00:00:43.340 --> 00:00:46.619
today really is to cut through that noise, give

00:00:46.619 --> 00:00:49.000
you a kind of shortcut to understanding the different

00:00:49.000 --> 00:00:52.119
specializations. We need the AI toolbox. Exactly.

00:00:52.219 --> 00:00:54.810
Think about it, right? You wouldn't use a hammer

00:00:54.810 --> 00:00:57.270
to turn a screw. Doesn't matter how great the

00:00:57.270 --> 00:00:59.929
hammer is. We need to know the right tool for

00:00:59.929 --> 00:01:03.350
the job. So today we're diving into eight key

00:01:03.350 --> 00:01:06.450
AI model types. We'll look at what they do, their

00:01:06.450 --> 00:01:09.090
main uses, and yeah, the important watch outs

00:01:09.090 --> 00:01:11.549
for each. Okay. Let's start with the foundation.

00:01:11.769 --> 00:01:14.430
Probably the one most people know. The LLM. Okay,

00:01:14.450 --> 00:01:16.750
let's unpack this. The LLM, the Large Language

00:01:16.750 --> 00:01:19.400
Model. This is... like the definitive model.

00:01:19.519 --> 00:01:21.400
It's the most famous one out there. Think of

00:01:21.400 --> 00:01:23.239
it like that super smart friend who's basically

00:01:23.239 --> 00:01:26.019
read every book, every website, you know, ChatGPT,

00:01:26.120 --> 00:01:29.780
Gemini, Claude, those ones. Yeah, and they're

00:01:29.780 --> 00:01:32.319
large because the amount of text they train on

00:01:32.319 --> 00:01:35.060
is just astronomical. They absorb all these patterns

00:01:35.060 --> 00:01:37.480
of language, but how they work, it's actually

00:01:37.480 --> 00:01:40.219
pretty simple at its core. They're mostly just

00:01:40.219 --> 00:01:41.939
next-word guessers. They don't actually know

00:01:41.939 --> 00:01:44.459
truth. Right. They break down sentences into

00:01:44.459 --> 00:01:46.939
tokens, sort of like pieces of words. Exactly.

00:01:47.040 --> 00:01:49.159
Pieces of words or even punctuation. And then

00:01:49.159 --> 00:01:51.870
they just do the math, calculating the most likely

00:01:51.870 --> 00:01:55.069
next token, the next piece, based on, well, billions

00:01:55.069 --> 00:01:56.870
of examples they've seen. If you start with,

00:01:56.969 --> 00:01:58.450
"I need to put this on the..." it guesses "shelf"

00:01:58.450 --> 00:02:01.010
or "table" or whatever fits statistically. Yeah.
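
NOTE
A toy sketch of the next-token guessing just described, assuming a tiny made-up corpus; real LLMs use neural networks trained on billions of examples, not bigram counts, so this is an illustration of the idea only.
```python
# Toy next-token guesser: count which token most often follows
# each token, then pick the statistically most likely continuation.
# The tiny "corpus" is invented for illustration.
from collections import Counter, defaultdict
corpus = "i need to put this on the shelf . i need to put this on the table ."
tokens = corpus.split()
bigrams = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    bigrams[prev][nxt] += 1
def guess_next(token):
    # Most frequent follower; ties resolve to the first one seen.
    return bigrams[token].most_common(1)[0][0]
print(guess_next("to"))   # "put" in this corpus
print(guess_next("the"))  # "shelf" or "table"; both fit statistically
```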

00:02:01.290 --> 00:02:04.250
And this guessing ability makes them really good

00:02:04.250 --> 00:02:07.390
for text stuff. Writing content, sure. Blog posts,

00:02:07.629 --> 00:02:11.169
emails, summarizing huge articles down to like

00:02:11.169 --> 00:02:13.430
three bullet points. Oh, yeah. Summarizing is

00:02:13.430 --> 00:02:16.909
huge and helping programmers, right? Explaining

00:02:16.909 --> 00:02:20.810
tricky code, debugging. Yeah. Even translation

00:02:20.810 --> 00:02:23.629
or customer service chat bots. But that's also

00:02:23.629 --> 00:02:25.949
where prompting gets tricky. Yeah. And honestly,

00:02:26.490 --> 00:02:28.849
I still wrestle with getting prompts just right

00:02:28.849 --> 00:02:31.409
myself sometimes. It's not always easy. You really

00:02:31.409 --> 00:02:33.629
see the difference between just asking vaguely

00:02:33.759 --> 00:02:36.240
versus being really specific. Totally. Like,

00:02:36.259 --> 00:02:38.419
a bad prompt is just, write an email to work

00:02:38.419 --> 00:02:41.259
from home, that's it. Vague. Super vague. A good

00:02:41.259 --> 00:02:43.300
prompt gives the goal, maybe the specific days

00:02:43.300 --> 00:02:45.120
you want off, the reasons why, like, needing

00:02:45.120 --> 00:02:48.060
deep work focus, and even the tone. Make it polite,

00:02:48.139 --> 00:02:50.159
professional, but ask firmly. You've got to guide

00:02:50.159 --> 00:02:52.060
the guesser. And this leads to the big watch

00:02:52.060 --> 00:02:54.180
out, right? The whole hallucination problem.

00:02:54.479 --> 00:02:56.800
Because they only guess what looks right statistically.

00:02:56.960 --> 00:02:59.400
They can just confidently make stuff up. Fake

00:02:59.400 --> 00:03:02.159
studies, incorrect historical facts, citations

00:03:02.159 --> 00:03:04.759
that lead nowhere. Pure invention, but sounds

00:03:04.759 --> 00:03:08.240
plausible. So if an LLM is fundamentally a great

00:03:08.240 --> 00:03:10.840
guesser, what's the single most critical thing

00:03:10.840 --> 00:03:13.219
a user has to remember about its output? You

00:03:13.219 --> 00:03:16.219
absolutely must check important facts. The AI

00:03:16.219 --> 00:03:18.719
only guesses what looks right. It doesn't verify.
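
NOTE
The vague-versus-specific prompt contrast from above, sketched as a simple template; the function and field names are invented for illustration, not a standard API.
```python
# Sketch of turning the vague request into a specific prompt.
# Field names are invented for illustration.
def build_prompt(goal, details, tone):
    return f"Goal: {goal}\nDetails: {details}\nTone: {tone}"
vague = "write an email to work from home"
specific = build_prompt(
    goal="request to work from home on specific days",
    details="I need uninterrupted deep-work focus",
    tone="polite and professional, but firm",
)
print(specific)
```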

00:03:19.080 --> 00:03:22.219
Good. OK, so LLMs are great text guessers, but

00:03:22.219 --> 00:03:25.919
they can be slow and, well, inventive. Yeah. What

00:03:25.919 --> 00:03:28.360
if we need pictures and we need them fast? Ah,

00:03:28.400 --> 00:03:30.819
speed. That takes us to the LCM, right? Exactly.

00:03:31.159 --> 00:03:34.400
The latent consistency model, LCM. Think of it

00:03:34.400 --> 00:03:37.080
as the visual cousin to the LLM, but its whole

00:03:37.080 --> 00:03:40.099
reason for being is speed in making images. Much

00:03:40.099 --> 00:03:42.259
faster than older models. Yeah. If you've ever

00:03:42.259 --> 00:03:45.539
waited like 30 seconds or more for an AI image,

00:03:46.319 --> 00:03:48.840
that older model like Stable Diffusion was working

00:03:48.840 --> 00:03:52.669
slowly, taking maybe 50, even 100 steps to refine

00:03:52.669 --> 00:03:54.469
the image from noise, like a really careful

00:03:54.469 --> 00:03:57.189
painter adding tiny details. Right. The LCM, though,

00:03:57.270 --> 00:03:58.830
is more like a master painter who's done this

00:03:58.830 --> 00:04:01.270
a million times. It learns these underlying patterns

00:04:01.270 --> 00:04:03.750
of latent consistency and figures out how

00:04:03.750 --> 00:04:06.150
to jump huge steps. Goes from step one, maybe

00:04:06.150 --> 00:04:08.509
straight to step 25. Whoa. So it can generate

00:04:08.509 --> 00:04:10.750
really decent images in just like two to eight

00:04:10.750 --> 00:04:14.689
steps total. Two to eight. Compared to 50. That's

00:04:14.689 --> 00:04:17.189
a huge difference. Massive difference. And that

00:04:17.189 --> 00:04:19.470
speed opens up doors for anything real time.

00:04:19.709 --> 00:04:22.769
Think AR, VR apps. The scene needs to update

00:04:22.769 --> 00:04:24.490
instantly when you turn your head, right? Right,

00:04:24.670 --> 00:04:26.610
yeah. No lag. Or making pictures right there

00:04:26.610 --> 00:04:28.709
on your phone without needing to send data to

00:04:28.709 --> 00:04:31.839
the cloud and wait. Fast design tools, improving

00:04:31.839 --> 00:04:34.579
video call quality in real time. OK, so speed's

00:04:34.579 --> 00:04:36.779
the big win. What's the catch? Well, the trade-off

00:04:36.779 --> 00:04:38.699
for skipping all those steps is sometimes

00:04:38.699 --> 00:04:41.620
the images might be a bit less detailed, maybe

00:04:41.620 --> 00:04:44.120
a little too smooth compared to the slower methods.

00:04:44.500 --> 00:04:46.879
But for many uses, that near-instant result

00:04:46.879 --> 00:04:50.139
is what matters most. So why is that master painter

00:04:50.139 --> 00:04:52.920
approach, that speed, so crucial for something

00:04:52.920 --> 00:04:55.769
like a VR headset? Because everything has to

00:04:55.769 --> 00:04:58.329
update instantly as you move. Lag breaks the

00:04:58.329 --> 00:05:01.009
whole illusion of immersion. Makes sense. Okay,

00:05:01.009 --> 00:05:03.569
so we have LLMs for talking, LCMs for fast painting,

00:05:03.889 --> 00:05:05.949
but they're both creating things. What if we

00:05:05.949 --> 00:05:08.829
need the AI to actually, you know, do a task?

00:05:09.110 --> 00:05:11.730
A multi-step task. Ah, now we get to the LAM.

00:05:11.930 --> 00:05:14.430
The LAM. Language Action Model. This is that

00:05:14.430 --> 00:05:16.110
critical shift you mentioned from talking to

00:05:16.110 --> 00:05:18.470
doing. Exactly. If the LLM is the smart friend

00:05:18.470 --> 00:05:21.350
who just talks, the LAM is like the project manager.

00:05:21.689 --> 00:05:24.439
It takes action. How does it do that? Is it just

00:05:24.439 --> 00:05:28.180
a better LLM? Not quite. It's more like a system.

00:05:28.339 --> 00:05:30.399
It starts with an LLM. Yeah, that's the brain

00:05:30.399 --> 00:05:32.480
for understanding your request. But then it adds

00:05:32.480 --> 00:05:35.620
memory like a notebook, a planner to map out

00:05:35.620 --> 00:05:39.319
the steps. And the really crucial part, the ability

00:05:39.319 --> 00:05:41.899
to use tools. This means it can connect to other

00:05:41.899 --> 00:05:44.339
things, your email, a web browser, calendars,

00:05:44.439 --> 00:05:47.639
using APIs. Ah, APIs. That's the key difference

00:05:47.639 --> 00:05:49.879
then, the connection to other services. That's

00:05:49.879 --> 00:05:52.000
the game changer. So an LLM tells you how to

00:05:52.000 --> 00:05:55.420
book a flight. A LAM connects to, say, Expedia's

00:05:55.420 --> 00:05:57.620
API, searches flights based on your criteria,

00:05:58.100 --> 00:06:00.779
finds the cheapest one, books it, and then asks

00:06:00.779 --> 00:06:04.040
you to confirm. Wow. OK. That sounds like the

00:06:04.040 --> 00:06:06.220
future of automation, really. And it is. Think

00:06:06.319 --> 00:06:09.100
AI agents. These things can handle complex sequences.

00:06:09.399 --> 00:06:11.620
Like, you could tell it: find 10 potential customers

00:06:11.620 --> 00:06:13.959
in the software industry in California, draft

00:06:13.959 --> 00:06:16.639
a personalized intro email based on our new product

00:06:16.639 --> 00:06:20.170
launch, and show me the drafts. A lot more sophisticated

00:06:20.170 --> 00:06:22.769
than just asking a question. Definitely. Advanced

00:06:22.769 --> 00:06:25.069
customer support, too, actually processing a

00:06:25.069 --> 00:06:27.310
refund, not just answering questions about the

00:06:27.310 --> 00:06:30.329
policy, or digital assistants that can actually

00:06:30.329 --> 00:06:32.410
turn on your lights or set alarms because they

00:06:32.410 --> 00:06:35.509
can connect to those systems. So what core function

00:06:35.509 --> 00:06:38.910
lets a LAM move beyond just conversation and

00:06:38.910 --> 00:06:41.389
actually access, say, your calendar or email?

00:06:41.670 --> 00:06:43.810
It's that ability to connect directly to other

00:06:43.810 --> 00:06:46.569
tools and services using APIs. Those are the

00:06:46.569 --> 00:06:49.379
hands and feet. Got it. But building one giant

00:06:49.379 --> 00:06:51.800
LAM that can do everything sounds incredibly

00:06:51.800 --> 00:06:55.839
complex and frankly expensive to run. It absolutely

00:06:55.839 --> 00:06:58.240
is. Which leads us perfectly to the next model

00:06:58.240 --> 00:07:00.819
type, which is all about efficiency. The MoE.

00:07:01.259 --> 00:07:03.540
MoE, mixture of experts. Yeah, mixture of experts.

00:07:03.639 --> 00:07:06.500
Now, this isn't one single giant model. It's

00:07:06.500 --> 00:07:09.319
actually a whole system. Think of lots of smaller...

00:07:09.399 --> 00:07:11.680
specialized models, the experts. And they're

00:07:11.680 --> 00:07:13.680
all managed by another AI called the router.

00:07:13.980 --> 00:07:15.839
OK, so instead of one generalist, it's like having

00:07:15.839 --> 00:07:18.100
a team of specialists. Exactly, like asking a

00:07:18.100 --> 00:07:21.300
specific group of pros. The router looks at your

00:07:21.300 --> 00:07:23.819
request, your prompt, and figures out, OK, this

00:07:23.819 --> 00:07:25.860
is about coding, and sends it only to the coding

00:07:25.860 --> 00:07:29.860
expert. Or this needs history knowledge, so it

00:07:29.860 --> 00:07:32.000
activates the history expert. Usually just one

00:07:32.000 --> 00:07:34.079
or two experts get activated for any given task.
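
NOTE
A minimal sketch of the router-and-experts idea just described, assuming keyword routing as a stand-in for the learned gating networks real MoE systems actually use; the expert names and responses are invented for illustration.
```python
# Toy mixture-of-experts: a router picks one specialist per request,
# so only that expert's cost is paid. Keyword matching here is a
# stand-in for the learned routing real MoE models do.
EXPERTS = {
    "coding": lambda q: f"[coding expert] answering: {q}",
    "history": lambda q: f"[history expert] answering: {q}",
    "general": lambda q: f"[general expert] answering: {q}",
}
def route(query):
    # Decide which single expert to activate for this query.
    if "code" in query or "python" in query:
        return "coding"
    if "war" in query or "century" in query:
        return "history"
    return "general"
def answer(query):
    expert = route(query)  # only this one expert runs
    return EXPERTS[expert](query)
print(answer("write python code to calculate CPI"))
```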

00:07:34.180 --> 00:07:35.819
Right, so you're not running the whole giant

00:07:35.819 --> 00:07:37.980
brain all the time. Precisely. And here's the

00:07:37.980 --> 00:07:40.240
really cool part: the efficiency and size benefit.

00:07:40.540 --> 00:07:43.899
You can build a massive model overall, let's

00:07:43.899 --> 00:07:46.079
say, a trillion parameters worth of knowledge.

00:07:46.259 --> 00:07:50.120
But because the router only turns on maybe 5%

00:07:50.120 --> 00:07:53.240
or 10% of those experts for any single query,

00:07:54.120 --> 00:07:56.180
you're only paying the computational cost of

00:07:56.180 --> 00:07:59.079
running a much smaller model, maybe just 100

00:07:59.079 --> 00:08:00.980
billion parameters in that example. Wait, hold

00:08:00.980 --> 00:08:03.259
on. So you get the knowledge of a potentially

00:08:03.259 --> 00:08:06.389
enormous model. but the running cost of something

00:08:06.389 --> 00:08:09.329
much smaller. That's the magic. Wow. I mean,

00:08:09.329 --> 00:08:11.290
that just fundamentally changes the economics

00:08:11.290 --> 00:08:13.769
of building top-tier AI, doesn't it? That scaling

00:08:13.769 --> 00:08:16.730
efficiency is incredible. It's huge. That's why

00:08:16.730 --> 00:08:19.730
this architecture, MoE, is believed to be used

00:08:19.730 --> 00:08:22.149
in the really big, powerful models like Mixtral

00:08:22.149 --> 00:08:25.490
and probably GPT-4, too. It's also great

00:08:25.490 --> 00:08:27.430
for companies wanting to personalize. They can

00:08:27.430 --> 00:08:29.629
add their own private experts, trained just on

00:08:29.629 --> 00:08:31.829
their internal documents or specific industry

00:08:31.829 --> 00:08:34.840
regulations. So you could ask it to explain, I

00:08:34.840 --> 00:08:37.299
don't know, inflation using general knowledge.

00:08:37.340 --> 00:08:40.100
Yeah. And write Python code to calculate CPI

00:08:40.100 --> 00:08:42.639
using a coding expert, all in one go. Exactly.

00:08:42.860 --> 00:08:45.240
Dual knowledge handled efficiently. OK, the efficiency

00:08:45.240 --> 00:08:48.039
is amazing. But beyond cost, what's the core

00:08:48.039 --> 00:08:50.960
risk if that router AI messes up its job in a

00:08:50.960 --> 00:08:53.440
MoE system? If the router picks the wrong expert,

00:08:53.679 --> 00:08:55.980
the whole answer will likely be poor quality

00:08:55.980 --> 00:08:58.940
or just plain wrong. Bad routing means bad results.

00:08:59.100 --> 00:09:01.399
Makes sense. Garbage in, garbage out. Or rather,

00:09:01.600 --> 00:09:03.940
wrong expert in, garbage out. Pretty much. And

00:09:03.940 --> 00:09:06.019
training them is complex, plus you need enough

00:09:06.019 --> 00:09:08.399
memory, VRAM, to hold all those experts, even if

00:09:08.399 --> 00:09:10.720
most are idle. OK, so MoEs give us efficient

00:09:10.720 --> 00:09:13.179
expertise. But all these models so far, they've

00:09:13.179 --> 00:09:15.200
been dealing with text or maybe creating images

00:09:15.309 --> 00:09:18.110
based on text. What if the AI needs to actually

00:09:18.110 --> 00:09:21.450
see the world? Ah, giving the AI eyes. That must

00:09:21.450 --> 00:09:23.870
be the VLM, the vision language model. You got it.

00:09:24.110 --> 00:09:26.990
VLM. It can see and talk. It understands

00:09:26.990 --> 00:09:30.850
both pictures and text at the same time. LLMs,

00:09:31.250 --> 00:09:34.009
fundamentally, were blind. VLMs have eyes. How

00:09:34.009 --> 00:09:36.750
does that work internally? Is it merging two

00:09:36.750 --> 00:09:39.190
different kinds of models? Kind of, yeah. It

00:09:39.190 --> 00:09:42.779
uses an image encoder. Think of that as the eyes,

00:09:42.860 --> 00:09:46.019
which looks at a picture and translates it into

00:09:46.019 --> 00:09:48.480
a sort of numerical description, like orange

00:09:48.480 --> 00:09:51.179
cat sitting on a green chair. OK. A number string

00:09:51.179 --> 00:09:53.779
that represents the image. Right. And then that

00:09:53.779 --> 00:09:56.039
numerical string gets fed into a language model,

00:09:56.179 --> 00:09:59.210
the mouth, which can then understand it and talk

00:09:59.210 --> 00:10:01.330
about it, answer questions about it. And that

00:10:01.330 --> 00:10:03.629
opens up some really intuitive uses. The classic

00:10:03.629 --> 00:10:05.830
example is taking a picture of your fridge and

00:10:05.830 --> 00:10:07.909
asking, OK, what can I make for dinner with this

00:10:07.909 --> 00:10:10.169
stuff? That's the famous one. But it goes way

00:10:10.169 --> 00:10:13.230
beyond that. Think about visual search. Take

00:10:13.230 --> 00:10:15.110
a picture of some cool shoes you see someone

00:10:15.110 --> 00:10:17.450
wearing and ask, where can I buy these exact

00:10:17.450 --> 00:10:21.190
ones? Ooh, dangerous for my wallet. Huh. Tell

00:10:21.190 --> 00:10:24.870
me about it. Or analyzing video content. And

00:10:24.870 --> 00:10:27.809
really powerfully, tools for the visually impaired.

00:10:28.570 --> 00:10:30.509
Imagine it describing the world around them,

00:10:30.750 --> 00:10:33.549
reading signs, maybe even recognizing faces or

00:10:33.549 --> 00:10:35.870
expressions. That's incredible, a truly helpful

00:10:35.870 --> 00:10:38.429
application. But I guess the same watch out applies

00:10:38.429 --> 00:10:41.990
here as with LLMs. Can VLMs hallucinate about

00:10:41.990 --> 00:10:44.210
images? Absolutely. They can misinterpret an

00:10:44.210 --> 00:10:46.330
image, describe an object that isn't there, or

00:10:46.330 --> 00:10:48.710
get an action wrong. And of course, privacy becomes

00:10:48.710 --> 00:10:50.509
a bigger concern when you're uploading images,

00:10:50.610 --> 00:10:52.830
especially of people or inside your home. Right.

00:10:53.090 --> 00:10:55.529
So besides the fridge example, what's a really

00:10:55.529 --> 00:10:57.970
powerful, maybe less obvious application of a

00:10:57.970 --> 00:11:00.470
VLM? I think describing the world in real time

00:11:00.470 --> 00:11:02.509
for visually impaired people is one of the most

00:11:02.509 --> 00:11:05.669
profound uses. Yeah, definitely. OK, VLMs are

00:11:05.669 --> 00:11:08.389
powerful. They see and talk. But they sound like

00:11:08.389 --> 00:11:10.250
they need a lot of computing power, just like

00:11:10.250 --> 00:11:13.690
the big LLMs. What if you need AI on a smaller

00:11:13.690 --> 00:11:16.789
device, like your phone, and privacy is paramount?

00:11:16.950 --> 00:11:19.389
Great question. That's where the SLM comes in

00:11:19.389 --> 00:11:21.389
the small language model. While everyone's chasing

00:11:21.389 --> 00:11:25.470
bigger and bigger, these SLMs, like Microsoft's

00:11:25.470 --> 00:11:28.629
Phi-3, are quietly becoming super important for

00:11:28.629 --> 00:11:31.029
on-device AI. They're designed specifically

00:11:31.029 --> 00:11:33.750
to run efficiently on phones, laptops, maybe

00:11:33.750 --> 00:11:36.720
even smart appliances. Small. So fewer parameters,

00:11:36.779 --> 00:11:38.679
I guess. How small are we talking? Much

00:12:38.679 --> 00:12:41.059
fewer, maybe just one to three billion parameters

00:11:41.059 --> 00:11:43.299
compared to hundreds or even over a trillion

00:11:43.299 --> 00:11:45.440
for the big LLMs. Okay, one to three billion

00:11:45.440 --> 00:11:48.480
versus a trillion. How do they stay capable then?

00:11:48.679 --> 00:11:51.279
Why aren't they just dumb? The secret sauce is

00:11:51.279 --> 00:11:53.720
the training data. Instead of just scraping the

00:11:53.720 --> 00:11:56.720
entire messy internet like LLMs often do. Right.

00:11:56.909 --> 00:12:00.210
Researchers focus on quality over quantity. They

00:12:00.210 --> 00:12:02.789
use what they call textbook quality data, highly

00:12:02.789 --> 00:12:05.429
curated information, specific examples of reasoning,

00:12:05.750 --> 00:12:08.210
logic problems. It's like feeding it a very

00:13:08.210 --> 00:13:09.889
well-structured education instead of just letting

00:12:09.889 --> 00:12:12.289
it browse the web randomly. Textbook quality.

00:12:12.429 --> 00:12:14.879
Yeah. That's interesting. So why is the training

00:12:14.879 --> 00:12:17.799
data for an SLM described that way? Because they

00:12:17.799 --> 00:12:20.440
prioritize high -quality curated knowledge and

00:12:20.440 --> 00:12:22.779
reasoning examples over just sheer volume from

00:12:22.779 --> 00:12:25.480
the internet. Quality over quantity. But doesn't

00:12:25.480 --> 00:12:28.460
that curated approach risk making them less robust?

00:12:29.220 --> 00:12:31.340
Like, they only know what's in the textbook.

00:12:31.460 --> 00:12:34.139
They can't handle the messy, unpredictable real

00:12:34.139 --> 00:12:36.980
world as well. Are they trading robustness for

00:12:36.980 --> 00:12:38.940
that efficiency? That is the trade -off, yeah.

00:12:39.100 --> 00:12:42.299
They will know less about recent news or uncommon

00:12:42.299 --> 00:12:44.360
topics. Their conversational memory might be

00:12:44.360 --> 00:12:48.059
shorter too, but the payoff is huge. True

00:12:48.059 --> 00:12:50.059
on-device AI. Meaning it runs right there on your

00:12:50.059 --> 00:12:52.700
phone. Exactly. Your phone's assistant can work

00:12:52.700 --> 00:12:54.759
instantly without needing a cloud connection.

00:12:55.139 --> 00:12:57.580
That means better privacy. Your data doesn't

00:12:57.580 --> 00:13:00.039
have to leave your device. Think about instant

00:13:00.039 --> 00:13:02.360
code completion in your programming editor or

00:13:02.360 --> 00:13:04.820
voice commands in your car that just work, even

00:13:04.820 --> 00:13:07.490
if you have no signal. Smart microwaves, maybe.

00:13:07.629 --> 00:13:10.289
OK, so SLMs are great for efficient, private,

00:13:10.470 --> 00:13:12.990
on-device tasks. But they're still fundamentally

00:13:12.990 --> 00:13:15.269
about generating text or understanding commands,

00:13:15.429 --> 00:13:18.070
like LLMs. What if the main goal isn't generation

00:13:18.070 --> 00:13:20.929
at all, but really deep understanding of language

00:13:20.929 --> 00:13:23.809
meaning? For deep understanding, we need to look

00:13:23.809 --> 00:13:26.610
back at a slightly older but foundational model

00:13:26.610 --> 00:13:30.710
type, the MLM, the Masked Language Model. Masked Language

00:13:30.710 --> 00:13:33.029
Model. OK, how is that different from an LLM

00:13:33.029 --> 00:13:36.620
guessing the next word? So LLMs play the predict

00:13:36.620 --> 00:13:39.779
the next word game. MLMs, like Google's famous

00:13:39.779 --> 00:13:42.600
BERT model, play fill in the blank. Fill in the

00:13:42.600 --> 00:13:45.220
blank? Yeah. During training, they take a sentence

00:13:45.220 --> 00:13:48.200
and just hide or mask about 15% of the words.

00:13:48.700 --> 00:13:51.279
Then they force the model to guess the missing

00:13:51.279 --> 00:13:54.820
words. Crucially, to do that well, the model

00:13:54.820 --> 00:13:57.279
has to look at the words before the blank and

00:13:57.279 --> 00:14:00.220
the words after the blank. Ah, so it needs context

00:14:00.220 --> 00:14:02.909
from both directions. Bidirectional context.
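
NOTE
A sketch of the fill-in-the-blank training setup described here; the sentence and random seed are illustrative, and only the roughly-15% masking rate comes from the discussion above.
```python
# Sketch of masked-language-model training data: hide about 15% of
# tokens and keep the originals as the targets the model must guess
# from the context on both sides. The sentence is illustrative.
import random
random.seed(1)  # fixed seed so the example is reproducible
tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = [], {}
for i, tok in enumerate(tokens):
    if random.random() < 0.15:
        masked.append("[MASK]")
        targets[i] = tok  # model must recover this from both sides
    else:
        masked.append(tok)
print(" ".join(masked))
print(targets)
```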

00:14:03.350 --> 00:14:05.389
Exactly. That forces it to develop a much deeper

00:14:05.389 --> 00:14:07.669
understanding of the sentence's actual meaning

00:14:07.669 --> 00:14:10.450
and structure, not just predicting the most likely

00:14:10.450 --> 00:14:13.110
next word in sequence. OK, so where is that deep

00:14:13.110 --> 00:14:15.669
understanding most useful if it's not writing

00:14:15.669 --> 00:14:18.409
poems? Right. They're analysis models, not creative

00:14:18.409 --> 00:14:20.629
writers. Their big strength is understanding

00:14:20.629 --> 00:14:22.909
meaning, which is critical for search engines.

00:14:23.269 --> 00:14:25.450
When you search something like, best cameras

00:14:25.450 --> 00:14:27.889
A to Z, the MLM helps the engine understand A

00:14:27.889 --> 00:14:30.649
to Z means a complete guide, not literally the

00:14:30.649 --> 00:14:33.159
letter. Ah, understanding the intent behind the

00:14:33.159 --> 00:14:36.220
search. Precisely. They're also great for sentiment

00:14:36.220 --> 00:14:39.059
analysis, quickly sorting thousands of customer

00:14:39.059 --> 00:14:42.279
reviews into positive, negative, neutral, and

00:14:42.279 --> 00:14:45.059
for something called Named Entity Recognition,

00:14:45.240 --> 00:14:48.139
or NER, that's pulling out specific pieces

00:14:48.139 --> 00:14:50.779
of info from text, like finding all the company

00:14:50.779 --> 00:14:53.179
names or dollar amounts mentioned in a long report.

00:14:53.580 --> 00:14:55.519
So since MLMs aren't really creative writers,

00:14:56.100 --> 00:14:57.940
what's their crucial function for something like

00:14:57.940 --> 00:15:00.639
a search engine? They understand the deep context

00:15:00.639 --> 00:15:03.039
and the actual meaning behind your search query,

00:15:03.220 --> 00:15:06.039
not just the keywords. Got it. Pure contextual

00:15:06.039 --> 00:15:09.059
depth. OK, one more model to go. We've covered

00:15:09.059 --> 00:15:12.860
text, images, actions, efficiency, seeing, small

00:15:12.860 --> 00:15:16.460
size, deep understanding, what's left. Precision

00:15:16.460 --> 00:15:19.539
in vision, specifically identifying exact boundaries.

00:15:19.740 --> 00:15:21.600
We need the SAM, the Segment Anything model.

00:15:21.779 --> 00:15:24.889
Segment Anything. SAM. Okay, this sounds specialized.

00:15:25.070 --> 00:15:27.029
Highly specialized. It's a computer vision model,

00:15:27.090 --> 00:15:29.289
but its focus isn't on telling you what an object

00:15:29.289 --> 00:15:31.909
is. Its job is to draw an incredibly precise

00:15:31.909 --> 00:15:34.269
outline around everything it sees in a picture.

00:15:34.730 --> 00:15:36.690
It segments objects right down to the individual

00:15:36.690 --> 00:15:38.769
pixel. Down to the pixel. Wow, how does it learn

00:15:38.769 --> 00:15:41.860
to do that for... Well, anything. It was trained

00:15:41.860 --> 00:15:44.779
on a massive data set, something like 11 million

00:15:44.779 --> 00:15:48.059
images, containing over a billion segmentation

00:15:48.059 --> 00:15:51.059
masks, those precise outlines. So it learned

00:15:51.059 --> 00:15:53.980
the general concept of what is an object and

00:15:53.980 --> 00:15:56.000
what is its boundary, even if it doesn't know

00:15:56.000 --> 00:15:58.500
the object's name. You usually activate it just

00:15:58.500 --> 00:16:01.080
by clicking on an object or drawing a rough box

00:16:01.080 --> 00:16:03.669
around it. So it knows this boundary belongs

00:16:03.669 --> 00:16:06.429
to object one, but not necessarily that object

00:16:06.429 --> 00:16:09.409
one is a dog or a car. Exactly. It just finds

00:16:09.409 --> 00:16:12.110
the edges. Okay. Where do you use that kind of

00:16:12.110 --> 00:16:15.220
pixel-perfect outlining? It's actually a foundational

00:16:15.220 --> 00:16:18.340
tool for lots of applications. Think photo editing,

00:16:18.600 --> 00:16:21.480
that remove background feature. SAM is likely

00:16:21.480 --> 00:16:24.399
powering the precise selection. Or critically,

00:16:24.759 --> 00:16:27.460
medical imaging. Imagine a surgeon analyzing

00:16:27.460 --> 00:16:30.679
an MRI. SAM can draw the exact outline around

00:16:30.679 --> 00:16:33.100
a tumor or an organ, which allows for really

00:16:33.100 --> 00:16:35.080
accurate measurements. Right. Precision is key

00:16:35.080 --> 00:16:37.299
there. Absolutely. And it's vital for robotics

00:16:37.299 --> 00:16:39.759
and self -driving cars too. They need to know

00:16:39.759 --> 00:16:42.200
the exact edges of objects in the real world

00:16:42.200 --> 00:16:44.730
to navigate and interact safely. So why would a surgeon

00:16:44.730 --> 00:16:48.149
looking at an MRI need SAM, the segmenter, maybe

00:16:48.149 --> 00:16:51.490
more than a VLM, the describer? SAM provides

00:16:51.490 --> 00:16:54.629
those precise pixel level outlines needed to

00:16:54.629 --> 00:16:57.269
accurately measure biological structures, not

00:16:57.269 --> 00:16:59.370
just a general description. Measurement needs

00:16:59.370 --> 00:17:02.389
precision. Makes perfect sense. Is there a watch

00:17:02.389 --> 00:17:04.640
out for SAM? The main thing is just remembering

00:17:04.640 --> 00:17:07.279
what it is. It's brilliant at finding outlines,

00:17:07.700 --> 00:17:10.119
but it won't tell you what the object is. It

00:17:10.119 --> 00:17:12.119
often needs to be paired with other models, maybe

00:17:12.119 --> 00:17:15.019
a VLM, to get both the precise outline and the

00:17:15.019 --> 00:17:17.859
identification. Okay, wow. That's eight different

00:17:17.859 --> 00:17:20.420
types. Quite the toolbox. It really is. So let's

00:17:20.420 --> 00:17:22.039
try and pull this all together. What's the big

00:17:22.039 --> 00:17:24.539
idea here for you, the listener? I think the

00:17:24.539 --> 00:17:27.400
core takeaway is just realizing that AI isn't

00:17:27.400 --> 00:17:30.099
one thing, it's this specialized toolbox. And

00:17:30.099 --> 00:17:32.259
now hopefully knowing the model name and the

00:17:32.259 --> 00:17:34.700
acronym actually tells you something concrete

00:17:34.700 --> 00:17:36.960
about its core function. You know which tool

00:17:36.960 --> 00:17:39.039
to think about for which job. Yeah, let's do

00:17:39.039 --> 00:17:41.500
a quick recap. If you need writing help, content

00:17:41.500 --> 00:17:44.960
generation, chatbots, you're thinking LLM. Right,

00:17:45.119 --> 00:17:47.680
need super fast images, especially for real-time

00:17:47.680 --> 00:17:49.859
stuff like AR or on your phone, that's

00:17:49.859 --> 00:17:52.880
the LCM. Want to automate a complex task? Have

00:17:52.880 --> 00:17:56.059
the AI actually do things by connecting to other

00:17:56.059 --> 00:17:58.680
tools. That's the LAM, the action model. Need

00:17:58.680 --> 00:18:01.299
maximum efficiency and the ability to build huge

00:18:01.299 --> 00:18:03.759
models cost-effectively using specialist parts.

00:18:03.799 --> 00:18:06.079
That's the MoE architecture. Need an AI that

00:18:06.079 --> 00:18:08.359
can see and understand images and talk about

00:18:08.359 --> 00:18:11.160
them. That's the VLM. Want AI that runs directly

00:18:11.160 --> 00:18:14.000
on your device, prioritizing privacy and efficiency

00:18:13.799 --> 00:18:17.319
over knowing absolutely everything? That's the

00:18:17.319 --> 00:18:20.140
SLM, the small one. Need deep understanding of

00:18:20.140 --> 00:18:22.799
language meaning for search or analysis, not

00:18:22.799 --> 00:18:25.259
generation. That's the MLM. And finally, need

00:18:25.259 --> 00:18:27.779
to draw those perfect pixel-level outlines around

00:18:27.779 --> 00:18:30.400
objects in images, like for background removal

00:18:30.400 --> 00:18:33.819
or medical scans. That's the SAM. Exactly. So

00:18:33.819 --> 00:18:35.619
now you have this framework, right? We really

00:18:35.619 --> 00:18:37.119
hope you can use this knowledge next time you

00:18:37.119 --> 00:18:39.359
hear about a new AI tool. Don't just ask, is

00:18:39.359 --> 00:18:42.380
it AI? Ask yourself: okay, based on what it does,

00:18:42.539 --> 00:18:44.619
is this primarily analyzing things, creating

00:18:44.619 --> 00:18:47.079
things, or acting on things? Which tool from

00:18:47.079 --> 00:18:49.039
the toolbox does it sound like? Yeah, that's

00:18:49.039 --> 00:18:50.519
a great way to approach it. And at the pace things

00:18:50.519 --> 00:18:52.460
are moving, these tools are starting to combine

00:18:52.460 --> 00:18:54.640
in really interesting ways. We've covered models

00:18:54.640 --> 00:18:56.880
that talk, models that act, models that see.

00:18:57.539 --> 00:18:59.880
So here's a thought to leave you with. If the

00:18:59.880 --> 00:19:03.180
next big step... combines that incredible efficiency

00:19:03.180 --> 00:19:06.680
of the MoE's specialized experts with the LAM's

00:19:06.680 --> 00:19:09.220
ability to connect to tools and automate real

00:19:09.220 --> 00:19:12.160
-world tasks, well, that creates the potential

00:19:12.160 --> 00:19:16.640
for truly autonomous, hyper -specialized AI agents.

00:19:17.599 --> 00:19:19.920
So thinking about that future, what's the single

00:19:19.920 --> 00:19:23.099
most complex task, maybe in your personal life,

00:19:23.099 --> 00:19:25.220
maybe in your business, that you would finally

00:19:25.220 --> 00:19:27.980
trust an autonomous AI agent to handle completely?

00:19:28.319 --> 00:19:29.980
Something to chew on. Definitely something to

00:19:29.980 --> 00:19:31.529
think about. We'll see you next time on the Deep

00:19:31.529 --> 00:19:31.730
Dive.
