WEBVTT

00:00:00.000 --> 00:00:03.419
Welcome to today's deep dive. Usually, you

00:00:03.419 --> 00:00:05.839
know, we're parsing through massive white papers

00:00:05.839 --> 00:00:08.839
or these really dense historical texts for you.

00:00:09.019 --> 00:00:12.740
Right. The heavy stuff. Exactly. But today, our

00:00:12.740 --> 00:00:15.240
source material is, well, it's highly unusual.

00:00:15.400 --> 00:00:19.260
It's literally a single Wikipedia disambiguation

00:00:19.260 --> 00:00:22.489
page. Yeah. We're talking about maybe 50 actual

00:00:22.489 --> 00:00:24.969
words of content here. It's tiny. It's basically

00:00:24.969 --> 00:00:27.550
a digital signpost. Just a few sentences at a

00:00:27.550 --> 00:00:29.609
crossroads. Which is crazy because it's so minimalist.

00:00:29.850 --> 00:00:32.350
Right. But our mission today is to show you how

00:00:32.350 --> 00:00:35.710
these few carefully chosen words actually map

00:00:35.710 --> 00:00:39.609
out the architectural fault lines of modern artificial

00:00:39.609 --> 00:00:41.810
intelligence. It forces us to look past all the

00:00:41.810 --> 00:00:43.829
industry buzzwords. We get to actually examine

00:00:43.829 --> 00:00:45.530
the mechanics of what these systems are doing.

00:00:45.670 --> 00:00:47.649
Because while the document is incredibly short,

00:00:47.829 --> 00:00:50.189
it highlights this profound difference in how

00:00:50.189 --> 00:00:52.270
machines actually learn to understand our world.

00:00:52.369 --> 00:00:54.670
Oh, absolutely. So the document introduces us

00:00:54.670 --> 00:00:57.130
to two core terms right at the very top. It says

00:00:57.130 --> 00:00:59.250
few-shot learning and one-shot learning may

00:00:59.250 --> 00:01:02.429
refer to and then it lists the options. Which

00:01:02.429 --> 00:01:04.930
perfectly sets up the central mystery we're gonna

00:01:04.930 --> 00:01:08.189
unravel for you today. Yes. Why does this ultra

00:01:08.189 --> 00:01:11.370
short document explicitly take these two incredibly

00:01:11.370 --> 00:01:15.290
similar-sounding concepts, few-shot learning

00:01:15.290 --> 00:01:18.170
and one-shot learning, and just split them into

00:01:18.170 --> 00:01:20.890
entirely different technological realms? Because

00:01:20.890 --> 00:01:23.010
on the surface, I mean, it sounds like an arbitrary

00:01:23.010 --> 00:01:25.269
division, right? Totally. If you're casually

00:01:25.269 --> 00:01:28.069
following AI, you probably assume these are just

00:01:28.069 --> 00:01:31.150
variations of the exact same algorithm, just

00:01:31.150 --> 00:01:33.549
with a different number of data points. OK. Let's

00:01:33.549 --> 00:01:35.650
unpack this terminology first before we even

00:01:35.650 --> 00:01:37.709
look at the separate technological fields. Let's

00:01:37.709 --> 00:01:40.510
do it. When this document groups few-shot and

00:01:40.510 --> 00:01:43.530
one-shot together, what does the concept of a

00:01:43.530 --> 00:01:46.670
shot actually mean mechanically? Good question.

00:01:47.069 --> 00:01:48.769
Because, I mean, think about human learning,

00:01:48.909 --> 00:01:51.129
right? If I'm teaching you a new card game. OK.

00:01:51.439 --> 00:01:54.519
Do you need to play a few hands like a few shots

00:01:54.519 --> 00:01:57.079
to get it? Or do you just understand the rules

00:01:57.079 --> 00:01:59.780
after seeing just one hand? Right, one shot.

00:01:59.900 --> 00:02:02.280
That's a great analogy for the baseline. But

00:02:02.280 --> 00:02:05.200
I have to push back a little here. If few-shot

00:02:05.200 --> 00:02:07.620
and one-shot sound like essentially the exact

00:02:07.620 --> 00:02:09.860
same process, just with a different number of

00:02:09.860 --> 00:02:13.060
examples, why does the source immediately separate

00:02:13.060 --> 00:02:15.979
them into totally different subcategories? Well...

00:02:15.900 --> 00:02:18.219
To understand that hard fork in the road, we

00:02:18.219 --> 00:02:20.340
have to look at the underlying reality of how

00:02:20.340 --> 00:02:23.500
that goal is achieved. OK. In the old days, giving

00:02:23.500 --> 00:02:26.259
an AI an example meant fundamentally altering

00:02:26.259 --> 00:02:29.520
its underlying parameters. You were adjusting the weights

00:02:29.520 --> 00:02:31.659
and biases of the neural network permanently.

00:02:32.300 --> 00:02:35.319
Which required massive data sets, right? Millions

00:02:35.319 --> 00:02:37.800
of shots. Millions of labeled pictures of cats

00:02:37.800 --> 00:02:40.580
just to recognize a cat, exactly. But when we

00:02:40.580 --> 00:02:43.020
talk about a shot today, specifically in modern

00:02:43.020 --> 00:02:46.180
foundation models, we're talking about in-context

00:02:46.180 --> 00:02:48.159
learning. So you're not retraining the model.

00:02:48.599 --> 00:02:50.560
No, not at all. You're giving it a temporary

00:02:50.560 --> 00:02:52.219
anchor during inference, which is the moment

00:02:52.219 --> 00:02:54.240
you actually ask it to do something. Got it.

00:02:54.379 --> 00:02:56.879
So it's like, OK, think of a massive generative

00:02:56.879 --> 00:02:59.939
AI model, like a highly detailed multidimensional

00:02:59.939 --> 00:03:03.180
map of every concept humanity has ever digitized.

00:03:03.199 --> 00:03:05.870
I love that visual. When you give it zero shots,

00:03:05.969 --> 00:03:07.770
meaning you just give it a blunt command with

00:03:07.770 --> 00:03:10.490
no examples, it drops you in the middle of a

00:03:10.490 --> 00:03:12.770
random continent on that map. Right. You could

00:03:12.770 --> 00:03:15.349
end up anywhere. But when you provide a few shots,

00:03:15.629 --> 00:03:18.129
you are essentially giving it GPS coordinates.

00:03:18.569 --> 00:03:20.889
You're giving it a zip code, a street name, and

00:03:20.889 --> 00:03:23.990
a house number to narrow down the exact neighborhood

00:03:23.990 --> 00:03:26.729
of the answer you want. That spatial analogy

00:03:26.729 --> 00:03:28.849
maps perfectly to the math, actually. Really?

00:03:29.129 --> 00:03:32.050
Yeah. These models operate in what we call a

00:03:32.050 --> 00:03:35.189
latent space. It's a mathematical representation

00:03:35.189 --> 00:03:38.610
of relationships between concepts. Okay. So,

00:03:38.650 --> 00:03:41.210
providing a few examples acts as a gravitational

00:03:41.210 --> 00:03:44.270
pull. It bends that latent space temporarily,

00:03:45.150 --> 00:03:47.590
pulling the AI's output toward the specific style

00:03:47.590 --> 00:03:49.550
or logic you've demonstrated. The underlying

00:03:49.550 --> 00:03:51.349
weights haven't changed, but you've constrained

00:03:51.349 --> 00:03:53.689
the geometry of its possible answers. Exactly.

00:03:54.289 --> 00:03:55.969
But that brings us right back to your pushback.

00:03:56.189 --> 00:03:58.849
The architecture required to follow sequential

00:03:58.849 --> 00:04:01.409
iterative examples is completely different from

00:04:01.409 --> 00:04:03.990
the architecture required to instantly extract

00:04:03.990 --> 00:04:06.750
the invariant truth of an object from one definitive

00:04:06.750 --> 00:04:09.969
look. Which dictates the first major branch of

00:04:09.969 --> 00:04:12.590
our source document. Yes, the generative AI branch.

00:04:13.009 --> 00:04:15.310
Right. So moving from the shared terminology

00:04:15.310 --> 00:04:18.129
into its first specific definition, the text

00:04:18.129 --> 00:04:22.920
reads, a form of prompt engineering in generative

00:04:22.920 --> 00:04:25.819
AI. What's fascinating here is how specific that

00:04:25.819 --> 00:04:29.160
phrasing is. It really is. It explicitly defines

00:04:29.160 --> 00:04:32.040
few-shot learning, not as some background, backend

00:04:32.040 --> 00:04:35.420
programming thing, but as a human-driven interaction.

00:04:35.620 --> 00:04:37.920
Right, it's active. Prompt engineering means

00:04:37.920 --> 00:04:41.139
a user sitting at a keyboard. So what does this

00:04:41.139 --> 00:04:43.879
all mean for the listener? It means it takes

00:04:43.879 --> 00:04:47.240
the magic out of the black box and puts the responsibility

00:04:47.240 --> 00:04:50.579
squarely on you. It's like, uh... Imagine giving

00:04:50.579 --> 00:04:53.019
a chef a couple of specific example dishes. Those

00:04:53.019 --> 00:04:55.000
are your few shots. Okay, yeah. And then you

00:04:55.000 --> 00:04:58.019
ask them to design an entirely new, but stylistically

00:04:58.019 --> 00:05:01.620
similar, menu for a restaurant. Right. By categorizing

00:05:01.620 --> 00:05:04.420
few-shot learning as prompt engineering, the

00:05:04.420 --> 00:05:07.160
source is revealing a crucial limitation of generative

00:05:07.160 --> 00:05:10.360
AI. Which is what? Well, large language models

00:05:10.360 --> 00:05:12.579
are fundamentally sequential prediction engines.

00:05:13.120 --> 00:05:15.079
They guess the next word based on the context

00:05:15.079 --> 00:05:17.300
window they're provided. They thrive on patterns.

00:05:17.360 --> 00:05:20.519
Exactly. But out of the box, their patterns are

00:05:20.519 --> 00:05:23.319
just generalized averages of the entire internet.

00:05:23.459 --> 00:05:25.660
Which means if you want something specific, you

00:05:25.660 --> 00:05:28.240
have to fight the model's pre-training. You

00:05:28.240 --> 00:05:30.759
do. When you're sitting at your desk, endlessly

00:05:30.759 --> 00:05:33.399
tweaking a prompt because the AI is writing an

00:05:33.399 --> 00:05:35.860
email that sounds like a generic corporate robot.

00:05:36.160 --> 00:05:38.639
We've all been there. You're experiencing the

00:05:38.639 --> 00:05:41.220
necessity of few-shot learning in real time.

00:05:41.439 --> 00:05:43.699
You have to provide three or four examples of

00:05:43.699 --> 00:05:45.860
your own writing style just to break the model

00:05:45.860 --> 00:05:48.399
out of its default state. You're hacking the

00:05:48.399 --> 00:05:51.319
context window. That is what prompt engineering

00:05:51.319 --> 00:05:55.560
actually is. It's a workaround. Exactly. Because

00:05:55.560 --> 00:05:58.399
we don't yet have models that can instantly adapt

00:05:58.399 --> 00:06:00.779
their entire personality from a single instruction.

00:06:01.439 --> 00:06:04.379
We use few-shot prompting to build a temporary

00:06:04.379 --> 00:06:07.319
pattern for the machine to latch onto. So generation

00:06:07.319 --> 00:06:10.259
requires a stylistic template. And building a

00:06:10.259 --> 00:06:13.279
template requires multiple reference points.
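
NOTE
A minimal sketch of the few-shot prompting described above, assuming a plain-text prompt format. The function name build_few_shot_prompt, the "Request:" and "Reply:" labels, and the sample messages are all hypothetical; no particular model or API is assumed, and the resulting string would simply be sent as the prompt.
```python
def build_few_shot_prompt(examples, new_request):
    """Assemble a prompt that shows worked examples before the new request."""
    parts = ["Write a reply in the same style as the examples."]
    # Each (request, reply) pair is one "shot": a demonstration that lives
    # only in the context window. No model weights are changed by this.
    for request, reply in examples:
        parts.append(f"Request: {request}\nReply: {reply}")
    # Leave the final reply open so the model continues the pattern.
    parts.append(f"Request: {new_request}\nReply:")
    return "\n\n".join(parts)
# Three hypothetical samples of the user's own writing style.
examples = [
    ("Can you make the 3pm call?", "Yep, 3 works. Talk then!"),
    ("Did you see the report?", "Saw it. Looks solid, two small notes coming."),
    ("Lunch on Friday?", "Friday's great. Usual spot?"),
]
prompt = build_few_shot_prompt(examples, "Can you review my draft?")
print(prompt)
```
With zero examples the same function yields a bare instruction plus the request, which is the zero-shot case mentioned earlier in the conversation.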

00:06:13.439 --> 00:06:15.759
Precisely. One point is just a dot. Two points

00:06:15.759 --> 00:06:18.660
make a line. A few points establish a curve that

00:06:18.660 --> 00:06:20.939
the AI can then extrapolate from. Here's where

00:06:20.939 --> 00:06:23.579
it gets really interesting, though. Yeah? Text

00:06:23.579 --> 00:06:25.439
and generative concepts are forgiving, right?

00:06:25.709 --> 00:06:28.589
You can iterate. But what happens when an AI

00:06:28.589 --> 00:06:31.189
needs to process the visual world from just a

00:06:31.189 --> 00:06:34.370
single example? That naturally pivots us to the

00:06:34.370 --> 00:06:36.649
second link in the source document. Right, because

00:06:36.649 --> 00:06:40.129
the text explicitly routes one-shot learning

00:06:40.329 --> 00:06:43.769
specifically to computer vision. It literally

00:06:43.769 --> 00:06:46.110
has computer vision in parentheses right next

00:06:46.110 --> 00:06:48.230
to it. It does. And I have to raise a skeptical

00:06:48.230 --> 00:06:52.269
question here. Why is one-shot learning completely

00:06:52.269 --> 00:06:55.089
cordoned off into computer vision in this document?

00:06:55.350 --> 00:06:57.899
Seems a bit rigid, doesn't it? Yeah. I mean,

00:06:58.139 --> 00:07:00.819
if I can give a generative AI a few text examples,

00:07:01.300 --> 00:07:03.860
why is seeing something just once isolated to

00:07:03.860 --> 00:07:06.100
the visual realm? I use Midjourney all the

00:07:06.100 --> 00:07:08.300
time. I'll upload three or four different reference

00:07:08.300 --> 00:07:10.439
photos of a character to generate a brand new

00:07:10.439 --> 00:07:12.759
image. Right, you're using multiple images. Exactly.

00:07:12.839 --> 00:07:15.680
That is highly complex computer vision, and I

00:07:15.680 --> 00:07:17.879
am actively using a few-shot approach to do

00:07:17.879 --> 00:07:21.100
it. So why does this digital signpost point in

00:07:21.100 --> 00:07:23.839
such absolute directions? If we connect this

00:07:23.839 --> 00:07:26.509
to the bigger picture, It gets to the heart of

00:07:26.509 --> 00:07:28.949
how language trails behind the bleeding edge

00:07:28.949 --> 00:07:32.069
of tech. When you use Midjourney with multiple

00:07:32.069 --> 00:07:34.829
reference images, you are engaging in generative

00:07:34.829 --> 00:07:36.670
vision. You are creating something that does

00:07:36.670 --> 00:07:40.170
not exist. But when the Wikipedia editors route

00:07:40.170 --> 00:07:43.569
one-shot learning to computer vision, they're

00:07:43.569 --> 00:07:45.730
referring to the classic analytical definition

00:07:45.730 --> 00:07:48.449
of the field, discriminative computer vision.

00:07:48.529 --> 00:07:51.170
Any models designed to understand and categorize

00:07:51.170 --> 00:07:53.990
reality, not paint a new one. Yes. Think of a

00:07:53.990 --> 00:07:56.189
facial recognition system at a secure facility,

00:07:57.129 --> 00:08:00.589
or a medical AI analyzing an MRI for a rare tumor.

00:08:00.870 --> 00:08:03.209
High stakes stuff. Very high stakes. These systems

00:08:03.209 --> 00:08:05.790
don't have a conversational context window. They

00:08:05.790 --> 00:08:09.060
don't iterate. Their mathematical objective is

00:08:09.060 --> 00:08:12.040
to take a single, incredibly noisy array of pixels,

00:08:12.360 --> 00:08:15.480
an image, and extract the absolute ground truth

00:08:15.480 --> 00:08:17.480
features of whatever is in that image. But how

00:08:17.480 --> 00:08:19.720
does the system actually achieve that? How does

00:08:19.720 --> 00:08:22.079
my phone recognize my face in a dark room when

00:08:22.079 --> 00:08:24.360
the only training data I ever gave it was one

00:08:24.360 --> 00:08:27.079
initial scan? It relies on immense foundational

00:08:27.079 --> 00:08:29.720
pre -training. A vision model already knows the

00:08:29.720 --> 00:08:31.740
fundamental physics of light, shadow, edges,

00:08:31.860 --> 00:08:34.480
and geometry. It knows the visual alphabet. Exactly.

00:08:34.779 --> 00:08:37.600
So when it takes that one shot, that one scan

00:08:37.600 --> 00:08:41.179
of your face, it isn't trying to memorize specific

00:08:41.179 --> 00:08:44.720
pixels. It extracts a mathematical signature.

00:08:44.940 --> 00:08:48.440
Oh. We call it a feature vector. It produces

00:08:48.440 --> 00:08:50.659
a string of numbers that represents the unique

00:08:50.659 --> 00:08:53.419
geometric relationships of your specific facial

00:08:53.419 --> 00:08:56.179
features. So that single string of numbers is

00:08:56.179 --> 00:08:58.960
the one shot. Yes. And from that moment on, every

00:08:58.960 --> 00:09:01.139
time you hold up your phone, it generates a new

00:09:01.139 --> 00:09:03.620
vector from the live camera feed and measures

00:09:03.620 --> 00:09:05.639
the mathematical distance between that new vector

00:09:05.639 --> 00:09:08.179
and your original one-shot vector. If the numbers

00:09:08.179 --> 00:09:10.909
are close enough, it unlocks. Which highlights

00:09:10.909 --> 00:09:13.309
why the source document elegantly divides these

00:09:13.309 --> 00:09:16.610
tasks. Generative AI involves iterative prompting

00:09:16.610 --> 00:09:19.149
engineering a response with a few shots. Right.

00:09:19.330 --> 00:09:21.330
Whereas computer vision relies on matching a

00:09:21.330 --> 00:09:24.669
visual pattern from a single definitive one-shot

00:09:24.669 --> 00:09:27.129
image. It's separating the creators from the

00:09:27.129 --> 00:09:29.070
observers. That is the fundamental divide. Which

00:09:29.070 --> 00:09:31.450
actually forces us to zoom out and look at the

00:09:31.450 --> 00:09:34.350
necessity of disambiguation itself. Below those

00:09:34.350 --> 00:09:37.169
two links, the text has a note. This disambiguation

00:09:37.169 --> 00:09:39.509
page lists articles associated with the title,

00:09:39.750 --> 00:09:42.870
few-shot learning. And then it says, if an internal

00:09:42.870 --> 00:09:45.830
link incorrectly led you here, you may wish to

00:09:45.830 --> 00:09:47.970
change the link to point directly to the intended

00:09:47.970 --> 00:09:50.889
article. This raises an important question about

00:09:50.889 --> 00:09:53.850
the state of information in the AI age. Right.

00:09:53.950 --> 00:09:56.129
Because it's like walking into a giant library

00:09:56.129 --> 00:09:58.870
looking for a book on bugs. And the librarian

00:09:58.870 --> 00:10:01.210
has to ask, do you mean software glitches or

00:10:01.210 --> 00:10:04.409
insects? Exactly. The terminology overlaps so

00:10:04.409 --> 00:10:06.950
perfectly. But how common is it for such cutting

00:10:06.950 --> 00:10:09.549
edge fields to have overlapping terminology like

00:10:09.549 --> 00:10:13.029
this? I mean, why require active disambiguation

00:10:13.029 --> 00:10:15.549
for the public? It points to the immense danger

00:10:15.549 --> 00:10:18.190
of information overload and jargon confusion.

00:10:18.710 --> 00:10:21.429
When the text says, if an internal link incorrectly

00:10:21.429 --> 00:10:24.210
led you here, it's openly acknowledging that

00:10:24.210 --> 00:10:26.649
even encyclopedia editors and tech writers are

00:10:26.649 --> 00:10:28.509
mixing these terms up. They're cross-linking

00:10:28.509 --> 00:10:31.509
to the wrong concepts because few-shot and

00:10:31.509 --> 00:10:33.330
one-shot sound like they belong in the exact same

00:10:33.330 --> 00:10:36.029
article. Right. It completely validates the listener's

00:10:36.029 --> 00:10:38.710
potential confusion. If internal links are getting

00:10:38.710 --> 00:10:41.149
it wrong, it's no wonder the general public gets

00:10:41.149 --> 00:10:43.429
confused trying to keep up with AI news. The

00:10:43.429 --> 00:10:46.129
system is literally admitting, hey, we know our

00:10:46.129 --> 00:10:48.110
own people are linking to the wrong concepts.

00:10:48.210 --> 00:10:51.220
It's a safety net. It's a digital traffic cop

00:10:51.220 --> 00:10:53.860
trying to stop accidents at an intersection where

00:10:53.860 --> 00:10:56.000
the street signs are virtually identical. Even

00:10:56.000 --> 00:10:57.600
though the roads lead to completely different

00:10:57.600 --> 00:11:00.259
cities. Exactly. Well, let's distill this entire

00:11:00.259 --> 00:11:02.679
deep dive down for the listener. We started with

00:11:02.679 --> 00:11:06.460
this tiny minimalist Wikipedia disambiguation

00:11:06.460 --> 00:11:09.600
page. A single digital crossroads. Right. And

00:11:09.600 --> 00:11:13.580
we extracted a core taxonomy. Few-shot equals

00:11:13.580 --> 00:11:17.019
prompt engineering in generative AI. It's an

00:11:17.019 --> 00:11:20.200
active, human-driven process to temporarily bend

00:11:20.200 --> 00:11:23.340
the latent space, while one-shot is tethered

00:11:23.340 --> 00:11:26.580
to computer vision: the monumental task of extracting

00:11:26.580 --> 00:11:29.320
an invariant feature vector from a single static

00:11:29.320 --> 00:11:31.500
image. Two different technological features.

00:11:31.899 --> 00:11:33.559
Yeah. Now, before we wrap up, I want to leave

00:11:33.559 --> 00:11:35.960
you with a final provocative thought to mull

00:11:35.960 --> 00:11:38.360
over on your own. Go lay it on us. We've talked

00:11:38.360 --> 00:11:40.840
a lot today about learning from a few shots in

00:11:40.840 --> 00:11:44.220
text or just one shot in vision. Both are huge

00:11:44.220 --> 00:11:46.360
leaps forward compared to the millions of data

00:11:46.360 --> 00:11:48.840
points we used to require. Unbelievable leaps,

00:11:49.000 --> 00:11:53.269
really. Right. If AI can now reliably learn complex

00:11:53.269 --> 00:11:56.889
tasks from a few shots in text, or just one shot

00:11:56.889 --> 00:12:00.169
in vision, how close are we to zero-shot learning?

00:12:00.389 --> 00:12:03.860
Oh, well. The holy grail. Imagine a scenario

00:12:03.860 --> 00:12:06.840
where an AI is asked to perfectly execute a complex

00:12:06.840 --> 00:12:09.639
task it has never seen a single example of before.

00:12:09.879 --> 00:12:13.100
No shots given, no engineered prompts, no reference

00:12:13.100 --> 00:12:16.539
images. Just pure unguided deduction from a standing

00:12:16.539 --> 00:12:18.940
start. Exactly. If this tiny digital signpost

00:12:18.940 --> 00:12:21.860
represents the crossroads of few and one, what

00:12:21.860 --> 00:12:23.639
happens to our relationship with these machines

00:12:23.639 --> 00:12:25.899
when the road finally leads to zero? It marks

00:12:25.899 --> 00:12:28.159
the moment the system no longer needs us to define

00:12:28.159 --> 00:12:30.399
the parameters of reality for it. Something to

00:12:30.399 --> 00:12:31.860
think about the next time you're typing out a

00:12:31.860 --> 00:12:33.340
prompt. Until then, keep exploring.