WEBVTT

00:00:00.000 --> 00:00:03.759
Imagine this. You tell a really powerful AI that

00:00:03.759 --> 00:00:06.599
a new, maybe fictional, alloy, let's call it

00:00:06.599 --> 00:00:10.439
adamantium-7b, melts at 3,200 degrees Celsius.

00:00:10.759 --> 00:00:12.599
You've given it this fact right there in a document.

00:00:12.900 --> 00:00:14.660
Then you ask it, okay, according to this document,

00:00:14.880 --> 00:00:16.899
what's its melting point? And the AI just comes

00:00:16.899 --> 00:00:19.719
back confidently with adamantium is fictional

00:00:19.719 --> 00:00:23.030
and has no defined melting point. Gee, how does

00:00:23.030 --> 00:00:25.890
an AI that can write poetry or debug code miss

00:00:25.890 --> 00:00:30.190
something so simple, so direct? That's the fascinating

00:00:30.190 --> 00:00:32.590
paradox we're diving into today. Welcome to the

00:00:32.590 --> 00:00:34.549
Deep Dive, where we try to pull out the key insights

00:00:34.549 --> 00:00:36.649
from the information you share with us. Today,

00:00:36.810 --> 00:00:39.070
we're unpacking a really fundamental challenge

00:00:39.070 --> 00:00:41.439
in AI. Why do these large language models, these

00:00:41.439 --> 00:00:43.880
LLMs, sometimes just stubbornly ignore the context

00:00:43.880 --> 00:00:45.960
we give them? Right there in the prompt, we'll

00:00:45.960 --> 00:00:47.759
look at how software has evolved, explore some

00:00:47.759 --> 00:00:49.899
pretty sophisticated and, yeah, often expensive

00:00:49.899 --> 00:00:51.939
ways people have tried to fix this, and then

00:00:51.939 --> 00:00:54.020
we'll reveal something, well, almost embarrassingly

00:00:54.020 --> 00:00:56.119
simple, but incredibly effective, a solution

00:00:56.119 --> 00:00:58.899
that might just change how you think about programming

00:00:58.899 --> 00:01:01.039
AI. So stick around, because what we uncovered

00:01:01.039 --> 00:01:03.479
today, it could really shift your perspective

00:01:03.479 --> 00:01:05.900
on how we talk to these machines. It's pretty

00:01:05.900 --> 00:01:07.920
baffling, isn't it? I mean, these LLMs, they

00:01:07.920 --> 00:01:09.760
can do amazing things, like you said: poetry,

00:01:09.760 --> 00:01:12.319
code, explaining quantum physics, sometimes better

00:01:12.319 --> 00:01:14.780
than humans can. But then you give them one clear

00:01:14.780 --> 00:01:16.879
fact right there, and poof, it's like it never

00:01:16.879 --> 00:01:19.760
happened. Your adamantium-7b example, that hits

00:01:19.760 --> 00:01:22.219
the nail right on the head. It's the core problem,

00:01:22.840 --> 00:01:25.680
this tendency to fall back on its massive preexisting

00:01:25.680 --> 00:01:27.760
knowledge instead of the immediate context you

00:01:27.760 --> 00:01:29.980
just handed it. And yeah, this isn't just some

00:01:29.980 --> 00:01:32.819
funny little quirk, it's actually a huge barrier

00:01:32.819 --> 00:01:35.829
for using AI reliably. Think about it, if you

00:01:35.829 --> 00:01:38.590
can't trust an AI to stick to the specific facts

00:01:38.590 --> 00:01:40.950
you feed it right now, how can you possibly rely

00:01:40.950 --> 00:01:43.870
on it for anything critical, like a legal tool

00:01:43.870 --> 00:01:46.709
misreading a new court filing, or a medical AI

00:01:46.709 --> 00:01:48.709
ignoring the latest research paper you just gave

00:01:48.709 --> 00:01:51.209
it, or even just a customer service bot completely

00:01:51.209 --> 00:01:53.489
missing the details of your problem that you

00:01:53.489 --> 00:01:56.049
just typed out. Their usefulness really depends

00:01:56.049 --> 00:01:59.079
on getting timely, accurate info, right? This

00:01:59.079 --> 00:02:01.280
whole issue, it just really eats away at the

00:02:01.280 --> 00:02:03.000
trust we need to have in them. That's a really

00:02:03.000 --> 00:02:05.599
important point. And to kind of grasp why this

00:02:05.599 --> 00:02:09.300
happens, it helps to picture this internal tug

00:02:09.300 --> 00:02:12.439
of war going on inside the model. On one side,

00:02:12.479 --> 00:02:14.319
you've got what we call parametric knowledge.

00:02:14.860 --> 00:02:16.879
That's all the information, the patterns, the

00:02:16.879 --> 00:02:19.560
sort of world understanding that got baked into

00:02:19.560 --> 00:02:21.919
its billions of parameters during its massive

00:02:21.919 --> 00:02:25.009
pre-training. It's the default setting, deeply

00:02:25.009 --> 00:02:27.270
ingrained stuff, learned from seeing, you know,

00:02:27.389 --> 00:02:29.430
zillions of examples online. And then on the

00:02:29.430 --> 00:02:30.870
other side, there's the contextual knowledge.

00:02:31.030 --> 00:02:33.229
That's the specific info you provide right there

00:02:33.229 --> 00:02:35.469
in the prompt. Your question, the document snippet,

00:02:35.569 --> 00:02:38.770
whatever. The problem pops up when that contextual

00:02:38.770 --> 00:02:41.270
bit directly clashes with the deeply embedded

00:02:41.270 --> 00:02:44.509
parametric stuff. The LLM often just defaults

00:02:44.509 --> 00:02:46.289
back to what it knows best because those pathways,

00:02:46.490 --> 00:02:48.150
those neural connections that are super strong,

00:02:48.430 --> 00:02:50.810
reinforced millions of times, your single piece

00:02:50.810 --> 00:02:52.849
of context, it's like a whisper compared to that

00:02:52.849 --> 00:02:54.969
roar, a much weaker signal. Yeah, exactly. And

00:02:54.969 --> 00:02:56.750
what's really neat is how this fits into Andrej

00:02:56.750 --> 00:02:59.169
Karpathy's view on how software itself is evolving.

00:02:59.490 --> 00:03:01.729
He talks about these three eras. First, there

00:03:01.729 --> 00:03:04.180
was Software 1.0. You know, traditional programming:

00:03:04.180 --> 00:03:06.719
humans write every single rule, every line of

00:03:06.719 --> 00:03:09.520
logic. Code is king. Then came Software 2.0. That's

00:03:09.520 --> 00:03:11.500
machine learning: systems learn from data, not

00:03:11.500 --> 00:03:14.439
just explicit instructions. The focus shifts, right,

00:03:14.439 --> 00:03:16.659
from writing code to gathering data and designing

00:03:16.659 --> 00:03:18.919
the model architecture. And now we're stepping

00:03:18.919 --> 00:03:21.539
into Software 3.0. This is where these big pre-

00:03:21.539 --> 00:03:24.740
trained LLMs act like a, like a programmable

00:03:24.740 --> 00:03:27.159
operating system kernel. Programming here isn't

00:03:27.159 --> 00:03:30.419
about Python or C++ at all. It's about carefully

00:03:30.419 --> 00:03:33.060
crafting prompts, just language, to guide the

00:03:33.060 --> 00:03:34.900
model's behavior. So this whole struggle we're

00:03:34.900 --> 00:03:37.340
talking about, this context problem, it's a major

00:03:37.340 --> 00:03:39.659
challenge right smack in the middle of this new

00:03:39.659 --> 00:03:42.159
Software 3.0 era. OK, so if there's this fundamental

00:03:42.159 --> 00:03:45.539
conflict, this tug of war inside the model, what

00:03:45.539 --> 00:03:47.740
does that struggle really tell us about how these

00:03:47.740 --> 00:03:50.360
LLMs, you know, think? Or maybe how they process

00:03:50.360 --> 00:03:53.439
information? And how did that impact the kinds

00:03:53.439 --> 00:03:55.360
of applications people were trying to build early

00:03:55.360 --> 00:03:57.580
on? What was the first big idea to try and get

00:03:57.580 --> 00:04:00.590
a handle on this stubbornness? Well, yeah, with

00:04:00.590 --> 00:04:02.990
this context problem being so obvious, the AI

00:04:02.990 --> 00:04:04.729
world didn't just sit there. They pretty quickly

00:04:04.729 --> 00:04:07.050
rallied around a standard approach. Something

00:04:07.050 --> 00:04:10.069
designed to keep LLMs tethered to facts and current

00:04:10.069 --> 00:04:12.729
information. It's called Retrieval Augmented

00:04:12.729 --> 00:04:14.990
Generation. You'll hear it called RAG all the

00:04:14.990 --> 00:04:17.689
time. R-A-G. And RAG, basically, it's a

00:04:17.689 --> 00:04:20.129
two-step dance. First step, retrieval. You ask

00:04:20.129 --> 00:04:21.750
a question. The system goes out and searches

00:04:21.750 --> 00:04:23.589
some external knowledge base, maybe internal

00:04:23.589 --> 00:04:25.790
company docs, maybe recent news articles, maybe

00:04:25.790 --> 00:04:28.370
a scientific database. It finds relevant little

00:04:28.370 --> 00:04:30.990
snippets of text. Second step, generation. It

00:04:30.990 --> 00:04:32.949
takes those snippets it found and injects them

00:04:32.949 --> 00:04:34.769
right into the prompt, along with your original

00:04:34.769 --> 00:04:37.050
question. So the LLM gets your question, plus

00:04:37.050 --> 00:04:40.430
this fresh, relevant context. The idea, the hope,

00:04:40.790 --> 00:04:42.930
is that the LLM will then generate its answer

00:04:42.930 --> 00:04:45.029
based only on that provided context, you know,

00:04:45.050 --> 00:04:47.170
trying to bridge that gap, solve that ignoring

00:04:47.170 --> 00:04:49.709
problem. Right. And on paper, RAG sounds

00:04:49.709 --> 00:04:52.680
perfect. It should solve it. But what's fascinating,

00:04:52.779 --> 00:04:55.180
what researchers found, is that even with RAG,

00:04:55.620 --> 00:04:58.259
that underlying bias, that pull towards the old

00:04:58.259 --> 00:05:00.980
pre -trained knowledge, it often still wins out,

00:05:01.019 --> 00:05:03.439
especially, and this is key, when you feed it

00:05:03.439 --> 00:05:06.339
counterfactual information. Stuff that directly

00:05:06.339 --> 00:05:08.319
contradicts what the model thinks it knows about

00:05:08.319 --> 00:05:11.120
the world. So to really nail down how bad this

00:05:11.120 --> 00:05:14.319
problem was, and to measure how well LLMs could

00:05:14.319 --> 00:05:18.040
actually stick to new, even weird, context. They

00:05:18.040 --> 00:05:20.939
realized they needed a proper test, a benchmark,

00:05:21.279 --> 00:05:23.959
and that led them to create ConFiQA. It stands

00:05:23.959 --> 00:05:26.500
for Contextual Faithfulness and Question Answering.

00:05:26.600 --> 00:05:28.220
Clever name. It's basically this data set that's

00:05:28.220 --> 00:05:30.079
been meticulously built to intentionally create

00:05:30.079 --> 00:05:32.519
clashes. Situations where the context you provide

00:05:32.519 --> 00:05:34.459
says one thing and the LLM's background knowledge

00:05:34.459 --> 00:05:36.759
says the complete opposite. It's like an AI obstacle

00:05:36.759 --> 00:05:39.600
course specifically designed to measure that

00:05:39.600 --> 00:05:42.149
stubbornness. Yeah, ConFiQA sounds like a real

00:05:42.149 --> 00:05:44.730
trial by fire for these models. It has some clever

00:05:44.730 --> 00:05:47.310
ways of testing them. First, you've got counterfactual

00:05:47.310 --> 00:05:50.209
questions, the QA ones. This is where the context

00:05:50.209 --> 00:05:52.350
plainly states something wrong according to common

00:05:52.350 --> 00:05:54.730
knowledge, like the context might say, the Sun

00:05:54.730 --> 00:05:57.329
orbits the Earth, a fact proven by Galileo. Totally

00:05:57.329 --> 00:05:59.290
backwards, right? Then the question is, according

00:05:59.290 --> 00:06:02.410
to this text, which orbits which? A good, faithful

00:06:02.410 --> 00:06:05.110
LLM should say, the Sun orbits the Earth, just

00:06:05.110 --> 00:06:07.569
repeating the context, even though its internal

00:06:07.569 --> 00:06:11.209
knowledge is screaming, no, no, no! Then it gets

00:06:11.209 --> 00:06:14.149
harder with multi-hop reasoning, or MR.

00:06:14.589 --> 00:06:16.529
Here, the answer requires connecting a few different

00:06:16.529 --> 00:06:18.350
pieces of information from the context, and at

00:06:18.350 --> 00:06:20.129
least one of those pieces is counterfactual,

00:06:20.230 --> 00:06:22.389
so maybe the context says, Project Starlight,

00:06:22.670 --> 00:06:25.360
managed by Omnicorp. Omnicorp HQ, that's in Neo

00:06:25.360 --> 00:06:27.819
Tokyo. Oh, and Neo Tokyo. It was Japan's capital

00:06:27.819 --> 00:06:31.180
back in 2077. Then the question: the

00:06:31.180 --> 00:06:33.439
headquarters of Project Starlight's manager are located

00:06:33.439 --> 00:06:35.279
in the capital of which country? You have to

00:06:35.279 --> 00:06:37.519
hop, right? Starlight Omnicorp, Neo Tokyo, Japan.

00:06:37.980 --> 00:06:40.480
But that Neo Tokyo-as-capital bit is deliberately

00:06:40.480 --> 00:06:43.060
wrong. The model has to follow the flawed chain.

00:06:43.220 --> 00:06:47.120
And the final boss level, basically, is multi-counterfactual,

00:06:47.120 --> 00:06:49.660
MC. This is multi-hop, but with multiple bogus

00:06:49.660 --> 00:06:53.000
facts thrown into the mix. Like, lithium-ion batteries:

00:06:53.040 --> 00:06:55.139
invented by Marie Curie; she worked at the University

00:06:55.139 --> 00:06:57.959
of Berlin; and Berlin Uni, famous for automotive

00:06:57.959 --> 00:07:00.259
engineering. All wrong, right? Inventor, workplace,

00:07:00.480 --> 00:07:02.680
specialty. Then the question, the University

00:07:02.680 --> 00:07:04.519
of the lithium-ion battery inventor is known

00:07:04.519 --> 00:07:07.279
for what field? The model has to navigate multiple

00:07:07.279 --> 00:07:09.339
counterfactuals in the context to get the right,

00:07:09.360 --> 00:07:11.579
wrong answer. Exactly, and when they ran a standard

00:07:11.579 --> 00:07:15.480
model, like a Llama 3.1 8B, through this ConFiQA

00:07:15.480 --> 00:07:18.699
gauntlet, the results were, well, frankly,

00:07:18.819 --> 00:07:20.860
pretty bad. On those basic counterfactual QA

00:07:20.860 --> 00:07:23.500
questions, only 33% accurate. Just a third.

00:07:24.060 --> 00:07:26.060
For the multi-hop reasoning, the MR score dropped

00:07:26.060 --> 00:07:29.399
to 25%. And for the really tough multi-counterfactual

00:07:29.399 --> 00:07:33.439
MC, a dismal 12.6%. Barely above random guessing,

00:07:33.519 --> 00:07:36.040
almost. So these numbers, they just paint a really

00:07:36.040 --> 00:07:38.100
clear picture. Out-of-the-box LLMs? You just

00:07:38.100 --> 00:07:39.720
can't rely on them when the information you give

00:07:39.720 --> 00:07:41.779
them is new or contradicts what they already

00:07:41.779 --> 00:07:44.199
know. It really set a clear benchmark for failure.
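
To make that style of test concrete, here's a minimal, hypothetical sketch of a ConFiQA-style faithfulness check. The example data and the naive string-matching scorer are illustrative only, not the actual benchmark harness:

```python
# A minimal, hypothetical sketch of a ConFiQA-style faithfulness check.
# The data and the string-matching scorer are illustrative, not the real
# benchmark harness.

def is_faithful(model_answer: str, context_value: str, parametric_value: str) -> bool:
    """Faithful = the answer echoes the context's claim, not the model's prior."""
    answer = model_answer.lower()
    return context_value.lower() in answer and parametric_value.lower() not in answer

context = "Adamantium-7b melts at 3200 degrees Celsius."
question = "According to this document, what is adamantium-7b's melting point?"

# A faithful model repeats the document's claim:
print(is_faithful("It melts at 3200 degrees Celsius.", "3200", "fictional"))  # True
# An unfaithful one falls back on parametric knowledge:
print(is_faithful("Adamantium is fictional, no melting point.", "3200", "fictional"))  # False
```

The real benchmark grades model outputs more carefully, but the core idea is the same: reward the context's answer, penalize the pre-trained one.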

00:07:44.360 --> 00:07:47.019
So baseline scores are low. The problem is clearly

00:07:47.019 --> 00:07:49.220
defined, clearly measured. It really begs the

00:07:49.220 --> 00:07:51.540
question, what sophisticated techniques did researchers

00:07:51.540 --> 00:07:54.100
try first to tackle this? Knowing how important

00:07:54.100 --> 00:07:57.069
reliable AI is, what was the next move? Right,

00:07:57.069 --> 00:07:58.970
so with the problem staring them in the face,

00:07:59.129 --> 00:08:02.529
quantified by ConFiQA, the AI community did what

00:08:02.529 --> 00:08:04.389
you'd expect. They brought out some of their

00:08:04.389 --> 00:08:07.509
standard, heavy -hitting tools. One really common

00:08:07.509 --> 00:08:11.490
method is supervised fine -tuning, or SFT. The

00:08:11.490 --> 00:08:14.089
logic seems sound, right? If you want the model

00:08:14.089 --> 00:08:16.850
to follow context better, just show it tons of

00:08:16.850 --> 00:08:19.370
examples of correctly following context. So the

00:08:19.370 --> 00:08:21.569
process involves gathering a lot of data. You

00:08:21.569 --> 00:08:23.629
need the context, the question, and the perfect

00:08:23.629 --> 00:08:26.339
answer derived only from that context. You feed all that to the

00:08:26.339 --> 00:08:29.339
base LLM, train it some more, tweak its internal

00:08:29.339 --> 00:08:31.379
knobs, its parameters, whenever it gets the answer

00:08:31.379 --> 00:08:33.720
wrong or strays from the context. You'd think

00:08:33.720 --> 00:08:35.940
this would work really well. But surprisingly,

00:08:36.120 --> 00:08:38.960
the results were just OK. Modest improvements,

00:08:39.220 --> 00:08:41.340
maybe around 5 % better on average. It seems

00:08:41.340 --> 00:08:43.039
like SFT was teaching the model to look like

00:08:43.039 --> 00:08:45.259
it was following context, to mimic the right

00:08:45.259 --> 00:08:47.940
format for the answer. But it wasn't fundamentally

00:08:47.940 --> 00:08:50.000
changing its mind when faced with a really strong

00:08:50.000 --> 00:08:52.019
clash with its built -in knowledge. It's kind

00:08:52.019 --> 00:08:55.320
of like teaching someone to recite a recipe perfectly.

00:08:55.210 --> 00:08:57.350
They can say the words, but if you swap salt

00:08:57.350 --> 00:08:59.789
for sugar, they might not really grasp why the

00:08:59.789 --> 00:09:02.470
cake tastes awful. It's surface compliance, not

00:09:02.470 --> 00:09:04.490
deep understanding of the source's authority.
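
One SFT training example for context-following might look like the sketch below. The field names and prompt layout are my assumptions, not the researchers' actual format, but it shows the triple the hosts describe: context, question, and a target answer derived only from that context:

```python
# A minimal, hypothetical sketch of one SFT training example for
# context-following; the field names and prompt layout are assumptions,
# not any paper's actual format.

def make_sft_example(context: str, question: str, context_answer: str) -> dict:
    """Pack a (prompt, target) pair; training pushes the model toward target."""
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "target": f" {context_answer}"}

ex = make_sft_example(
    context="The Sun orbits the Earth.",
    question="According to the text, which body orbits which?",
    context_answer="The Sun orbits the Earth.",
)
print(ex["prompt"])  # the target then continues this prompt with the
                     # context-only answer, counterfactual or not
```

Fine-tuning on thousands of pairs like this teaches the surface pattern, which is exactly why, as the hosts note, it only buys a modest gain.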

00:09:04.970 --> 00:09:06.769
Plus, yeah, it's computationally heavy. It needs

00:09:06.769 --> 00:09:09.570
huge data sets. Exactly. SFT just didn't quite

00:09:09.570 --> 00:09:11.889
get to the core of why the model was making that

00:09:11.889 --> 00:09:14.690
choice. So when that wasn't enough, researchers

00:09:14.690 --> 00:09:16.710
turned to something more powerful, reinforcement

00:09:16.710 --> 00:09:19.289
learning. Specifically, a technique called direct

00:09:19.289 --> 00:09:22.779
preference optimization, or DPO. Now, the older

00:09:22.779 --> 00:09:25.919
way, RLHF, reinforcement learning from human

00:09:25.919 --> 00:09:27.840
feedback, that's kind of a multi -stage thing.

00:09:28.480 --> 00:09:30.779
Humans rank AI answers that trains a separate

00:09:30.779 --> 00:09:33.679
AI called a reward model, and then the main LLM

00:09:33.679 --> 00:09:36.019
gets tuned using that reward model. It's a bit

00:09:36.019 --> 00:09:39.000
complicated. DPO is sort of a more elegant direct

00:09:39.000 --> 00:09:41.139
route. Think of it like this. Instead of needing

00:09:41.139 --> 00:09:44.139
that separate teacher AI, the reward model. To

00:09:44.139 --> 00:09:46.740
grade the main AI's answers based on human feedback,

00:09:47.299 --> 00:09:49.820
DPO lets the main AI learn directly from the

00:09:49.820 --> 00:09:52.100
preferences. Humans like to answer A better than

00:09:52.100 --> 00:09:54.879
answer B. The model itself figures out why A

00:09:54.879 --> 00:09:57.250
is better without the middleman. It makes the

00:09:57.250 --> 00:09:59.090
whole feedback loop much simpler and often more

00:09:59.090 --> 00:10:01.450
efficient. So for our context problem, you'd

00:10:01.450 --> 00:10:03.610
show it pairs of answers. Answer A sticks to

00:10:03.610 --> 00:10:05.710
the context. Answer B ignores it and uses general

00:10:05.710 --> 00:10:08.009
knowledge. You tell the model, prefer A over

00:10:08.009 --> 00:10:10.090
B. And this direct optimization, it worked better.
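
The preference objective the hosts are describing can be sketched numerically. This toy uses made-up log-probabilities and no real model; it just shows the shape of the DPO loss on one pair, where answer A follows the context and answer B ignores it:

```python
import math

# A toy numeric sketch of the DPO objective with made-up log-probabilities
# (no real model involved): prefer the context-faithful answer A over the
# knowledge-based answer B, relative to a frozen reference model.

def dpo_loss(logp_a, logp_b, ref_logp_a, ref_logp_b, beta=0.1):
    """-log(sigmoid(beta * ((logp_a - ref_a) - (logp_b - ref_b))))."""
    margin = beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy prefers the faithful answer A more strongly than the
# reference model does, the margin is positive and the loss is small.
low = dpo_loss(logp_a=-2.0, logp_b=-9.0, ref_logp_a=-5.0, ref_logp_b=-6.0)
high = dpo_loss(logp_a=-9.0, logp_b=-2.0, ref_logp_a=-6.0, ref_logp_b=-5.0)
print(low < high)  # True: preferring the faithful answer is rewarded
```

Minimizing this over many preference pairs is what nudges the model toward context-faithful answers without ever training a separate reward model.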

00:10:10.190 --> 00:10:12.730
It gave a more significant boost, pushing performance

00:10:12.730 --> 00:10:15.789
up by maybe 20%. Still, it's a complex process.

00:10:15.830 --> 00:10:18.350
You need that specific preference data. And it

00:10:18.350 --> 00:10:20.370
involved that intensive fine -tuning cycle. And

00:10:20.370 --> 00:10:22.850
then there's something even more. Well, futuristic

00:10:22.850 --> 00:10:25.649
sounding, I guess. Activation steering. This

00:10:25.649 --> 00:10:28.509
is almost like performing delicate brain surgery

00:10:28.509 --> 00:10:30.970
on the LLM while it's thinking. Instead of retraining

00:10:30.970 --> 00:10:32.730
the whole thing, the idea is to tweak its behavior

00:10:32.730 --> 00:10:34.970
as it's generating the answer. Right, during

00:10:34.970 --> 00:10:38.330
inference, how? By gently nudging its internal

00:10:38.330 --> 00:10:41.730
neural signals, its activations. So the core

00:10:41.730 --> 00:10:44.250
idea is that inside the LLM, different patterns

00:10:44.250 --> 00:10:46.250
of these neural signals, these activations, correspond

00:10:46.250 --> 00:10:48.809
to different concepts or behaviors. Researchers

00:10:48.809 --> 00:10:50.830
figured out how to identify specific patterns,

00:10:50.830 --> 00:10:52.769
let's call them steering vectors, that relate

00:10:52.769 --> 00:10:54.789
to things like sticking to the provided context.

00:10:54.759 --> 00:10:58.389
Then, as the AI is generating its response

00:10:58.389 --> 00:11:01.169
step by step, you subtly add or amplify this

00:11:01.169 --> 00:11:03.429
context adherence vector to its ongoing internal

00:11:03.429 --> 00:11:05.309
signals. You're basically giving it a little

00:11:05.309 --> 00:11:07.289
nudge in the right direction in real time without

00:11:07.289 --> 00:11:09.269
changing its fundamental training. And surprisingly,

00:11:09.710 --> 00:11:12.370
this worked quite well. Comparable results to

00:11:12.370 --> 00:11:15.450
the RL DPO approach, maybe around that 20% improvement

00:11:15.450 --> 00:11:18.669
mark. The big plus? No need for expensive retraining.
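
Mechanically, the nudge is tiny. Here's a toy sketch with made-up numbers, not a real model: at inference time you add a "context adherence" steering vector to one layer's hidden activations, scaled by a strength knob, and no weights change:

```python
# A toy sketch of activation steering with made-up numbers (not a real
# model): at inference time, add a "context adherence" steering vector to
# one layer's hidden activations, nudging generation without retraining.

def steer(activations, steering_vector, strength=2.0):
    """Return the activations nudged along the steering direction."""
    return [a + strength * s for a, s in zip(activations, steering_vector)]

hidden = [0.5, -1.0, 0.75]           # hypothetical hidden state at one layer
context_vector = [0.25, 0.5, -0.25]  # direction assumed to correlate with
                                     # sticking to the provided context
print(steer(hidden, context_vector))  # [1.0, 0.0, 0.25]
```

In practice the vectors live in the model's high-dimensional hidden space and are found by contrasting activations on faithful versus unfaithful generations, but the arithmetic is exactly this simple.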

00:11:18.750 --> 00:11:20.789
It's lighter, applied right when you need the

00:11:20.789 --> 00:11:22.889
answer. OK, so let's just quickly recap those

00:11:22.889 --> 00:11:26.179
complex methods. Supervised fine-tuning, SFT,

00:11:26.279 --> 00:11:30.159
gave us maybe a 5% bump, not huge. Then reinforcement

00:11:30.159 --> 00:11:32.299
learning with DPO and this activation steering

00:11:32.299 --> 00:11:34.740
technique, both pushed things up by around 20

00:11:34.740 --> 00:11:37.820
% each. Better, definitely. But all of these,

00:11:37.840 --> 00:11:40.720
they required serious technical chops, big computers,

00:11:40.820 --> 00:11:42.940
lots of data. They were sophisticated solutions.

00:11:43.220 --> 00:11:45.279
And they all kind of assumed the underlying problem

00:11:45.279 --> 00:11:47.799
was about correcting the model's knowledge or

00:11:47.799 --> 00:11:49.879
forcing it to comply, right? They were treating

00:11:49.879 --> 00:11:51.919
it like a knowledge problem, not necessarily

00:11:51.919 --> 00:11:54.200
an interaction problem. Which kind of sets the

00:11:54.200 --> 00:11:56.000
stage for the next bit? Absolutely, and this

00:11:56.000 --> 00:11:58.940
is where it gets really, really interesting. After

00:11:58.940 --> 00:12:01.159
all that heavy lifting, the complex algorithms,

00:12:01.159 --> 00:12:04.620
the training cycles, the activation tweaking,

00:12:04.700 --> 00:12:06.259
what if the real breakthrough wasn't complex

00:12:06.259 --> 00:12:08.039
at all? What if it was something incredibly simple,

00:12:08.200 --> 00:12:10.120
something hiding right there in how we were asking

00:12:10.120 --> 00:12:12.559
the question? Well, yeah, brace yourself, because

00:12:12.559 --> 00:12:15.039
this is quite something. After all that effort

00:12:15.039 --> 00:12:17.139
with fine -tuning and reinforcement learning

00:12:17.139 --> 00:12:20.659
and activation steering, the biggest single leap

00:12:20.659 --> 00:12:23.039
in performance came from something much simpler,

00:12:23.879 --> 00:12:26.340
prompt engineering. Just changing the way the

00:12:26.340 --> 00:12:28.379
question was asked. The researchers themselves

00:12:28.379 --> 00:12:30.860
called it an embarrassingly simple solution.

00:12:31.100 --> 00:12:33.840
And the core idea, it was about shifting the

00:12:33.840 --> 00:12:37.279
task. Stop asking the AI for a fact and instead

00:12:37.279 --> 00:12:40.340
ask it to perform opinion -based reading comprehension.

00:12:40.740 --> 00:12:43.340
It's honestly brilliant in its simplicity. The

00:12:43.340 --> 00:12:45.039
template they landed on is super straightforward.

00:12:45.159 --> 00:12:47.919
Let me read it out. Start of context, end of

00:12:47.919 --> 00:12:50.179
context. Based solely on the text provided above,

00:12:47.919 --> 00:12:50.179
how would an analyst tasked with summarizing

00:12:52.580 --> 00:12:54.559
this document answer the following question?
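
That template is simple enough to sketch directly. The wrapper function and its name here are mine; the wording between the context and the question follows what was just read out:

```python
# A small sketch of the "opinion-based" prompt template read out above.
# The wrapper function and its name are assumptions; the wording between
# the context and the question follows the transcript.

def opinion_based_prompt(context: str, question: str) -> str:
    return (
        f"{context}\n\n"
        "Based solely on the text provided above, how would an analyst "
        "tasked with summarizing this document answer the following "
        f"question? {question}"
    )

print(opinion_based_prompt(
    "The Sun revolves around the Earth.",
    "Which body orbits which?",
))
```

The entire intervention is that one wrapper string: no training, no extra compute, just a different framing of the task.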

00:12:54.669 --> 00:12:57.009
See the difference? Let's take our Sun revolves

00:12:57.009 --> 00:12:59.230
around the Earth example again. The old way.

00:12:59.429 --> 00:13:02.450
The failing way. Context. Sun revolves around

00:13:02.450 --> 00:13:06.570
Earth. Question. Which orbits which? The LLM

00:13:06.570 --> 00:13:08.590
usually defaults to its real -world knowledge,

00:13:08.889 --> 00:13:11.490
but using this new template, the LLM consistently

00:13:11.490 --> 00:13:14.230
answers based only on the flawed context. Based

00:13:14.230 --> 00:13:16.370
on the text provided, the sun revolves around

00:13:16.370 --> 00:13:18.429
the earth. It totally flips the script on the

00:13:18.429 --> 00:13:20.870
AI. It's like telling it your job isn't to be

00:13:20.870 --> 00:13:23.509
right about the universe. Your job is to accurately

00:13:23.509 --> 00:13:26.009
report what this specific document says, even

00:13:26.009 --> 00:13:28.110
if it's weird. Like saying, just tell me what

00:13:28.110 --> 00:13:30.129
the memo says, even if the memo claims the sky

00:13:30.129 --> 00:13:32.190
is plaid. And the impact on those ConFiQA

00:13:32.190 --> 00:13:35.299
scores, it was dramatic. This simple prompt change

00:13:35.299 --> 00:13:37.820
alone boosted performance by an additional 40

00:13:37.820 --> 00:13:40.399
% across all those tricky categories: QA, MR,

00:13:40.600 --> 00:13:43.399
MC. That's like a 200% improvement over the

00:13:43.399 --> 00:13:45.860
original baseline model, just from changing the

00:13:45.860 --> 00:13:48.279
words in the prompt. And get this, on its own,

00:13:48.440 --> 00:13:51.220
this prompt technique outperformed SFT, it outperformed

00:13:51.220 --> 00:13:54.259
RL DPO, and it even outperformed activation steering

00:13:54.259 --> 00:13:56.220
when they were used by themselves, like finding

00:13:56.220 --> 00:13:58.740
a cheat code written in plain English. It really

00:13:58.740 --> 00:14:02.519
was. So, okay, the obvious question is why. Why

00:14:02.519 --> 00:14:04.639
does this small tweak in phrasing have such a

00:14:04.639 --> 00:14:07.000
massive effect? It seems to tap into how these

00:14:07.000 --> 00:14:09.740
models actually learned language and tasks during

00:14:09.740 --> 00:14:11.679
their training. When you ask a direct question,

00:14:11.720 --> 00:14:13.919
what's the capital of France? You're basically

00:14:13.919 --> 00:14:16.980
putting the LLM in fact retrieval mode. It searches

00:14:16.980 --> 00:14:19.320
its vast internal knowledge for the most likely

00:14:19.320 --> 00:14:23.139
generally accepted answer, Paris. But when you

00:14:23.139 --> 00:14:25.639
reframe it like based solely on this text or

00:14:25.639 --> 00:14:28.019
how would an analyst summarize it, you're changing the

00:14:28.019 --> 00:14:30.039
game. You're not asking for a universal truth

00:14:30.039 --> 00:14:32.240
anymore. You're assigning a role and a specific

00:14:32.240 --> 00:14:34.820
task. Reading, comprehension, and reporting from

00:14:34.820 --> 00:14:37.679
a defined source. This little switch does a few

00:14:37.679 --> 00:14:40.480
powerful things. First, task priming. It tells

00:14:40.480 --> 00:14:42.740
the LLM, okay, switch gears, you're not the article

00:14:42.740 --> 00:14:44.759
now, you're a reading assistant. Its goal becomes

00:14:44.759 --> 00:14:46.820
reporting from the source, not knowing the answer.

00:14:47.360 --> 00:14:50.059
Second, it reduces cognitive dissonance. The

00:14:50.059 --> 00:14:52.399
LLM can report the counterfactual thing, sun

00:14:52.399 --> 00:14:54.600
orbits earth, without having to internally believe

00:14:54.600 --> 00:14:56.299
it because it's just reporting what the analyst

00:14:56.299 --> 00:14:58.750
would say based on the document. That avoids

00:14:58.750 --> 00:15:00.809
the clash with this parametric knowledge. And

00:15:00.809 --> 00:15:03.610
third, crucially, it enforces clear knowledge

00:15:03.610 --> 00:15:06.850
attribution. Those phrases based solely on the

00:15:06.850 --> 00:15:09.269
text, according to this source, they draw a hard

00:15:09.269 --> 00:15:12.309
line, telling the model only use the info inside

00:15:12.309 --> 00:15:14.870
these boundaries. This actually leverages something

00:15:14.870 --> 00:15:16.590
LLMs are already good at from their training,

00:15:17.110 --> 00:15:18.570
understanding that information comes from different

00:15:18.570 --> 00:15:21.750
sources, according to Smith's paper, versus it's

00:15:21.750 --> 00:15:24.279
a known fact that. The prompt just activates

00:15:24.279 --> 00:15:26.519
that skill deliberately. So thinking about how

00:15:26.519 --> 00:15:28.940
powerful this simple linguistic shift is, it

00:15:28.940 --> 00:15:30.379
really makes you wonder, doesn't it? What does

00:15:30.379 --> 00:15:32.940
this tell us about the sort of hidden psychology

00:15:32.940 --> 00:15:35.759
of these AIs and maybe how we should think about

00:15:35.759 --> 00:15:37.279
communicating with them better in the future?

00:15:37.470 --> 00:15:39.370
And it gets even better, right? Because the story

00:15:39.370 --> 00:15:41.210
doesn't end there. What happens when you combine

00:15:41.210 --> 00:15:42.970
the simple power of the prompt with those more

00:15:42.970 --> 00:15:45.250
complex, fine -grained techniques? Researchers

00:15:45.250 --> 00:15:47.350
tried exactly that. They took the winning opinion

00:15:47.350 --> 00:15:49.230
-based prompt template and combined it with activation

00:15:49.230 --> 00:15:52.009
steering. And the results? Yeah, truly the best

00:15:52.009 --> 00:15:54.470
of both worlds. The prompt set the stage perfectly.

00:15:54.549 --> 00:15:56.950
It gave the LLM that clear instruction. Your

00:15:56.950 --> 00:15:58.909
job is reading comprehension from this text.

00:15:58.929 --> 00:16:01.309
That's the frame. And then the activation steering

00:16:01.309 --> 00:16:04.289
acted like this gentle, continuous nudge

00:16:04.289 --> 00:16:06.809
at the neuron level, reinforcing that instruction,

00:16:06.799 --> 00:16:09.320
keeping the model focused on the context throughout

00:16:09.320 --> 00:16:11.700
the whole process of generating the answer. This

00:16:11.700 --> 00:16:14.539
combination, this synergy, it achieved the absolute

00:16:14.539 --> 00:16:16.799
best results they'd seen on ConFiQA. It pushed

00:16:16.799 --> 00:16:18.360
the performance improvement over the baseline

00:16:18.360 --> 00:16:20.779
to more than 50%. So the prompt gives the clear

00:16:20.779 --> 00:16:22.759
orders and the steering helps the model follow

00:16:22.759 --> 00:16:25.299
those orders faithfully deep down. If you look

00:16:25.299 --> 00:16:28.840
at the numbers again, baseline maybe 0 % faithfulness.

00:16:29.000 --> 00:16:32.659
SFT adds 5%. RL DPO or activation steering alone

00:16:32.659 --> 00:16:36.340
add 20%. But the prompt alone jumps to 40%. And

00:16:36.340 --> 00:16:39.379
prompt plus activation steering, over 50%. That's

00:16:39.379 --> 00:16:42.179
a huge difference the prompt makes. It really

00:16:42.179 --> 00:16:44.600
is. In this whole journey, it's such a perfect

00:16:44.600 --> 00:16:47.220
illustration of that Software 3.0 idea we mentioned

00:16:47.220 --> 00:16:49.779
earlier. This isn't just some clever hack for

00:16:49.779 --> 00:16:52.440
question answering. It shows that the main bottleneck

00:16:52.440 --> 00:16:55.240
in AI development is shifting. It's moving away

00:16:55.240 --> 00:16:57.480
from just needing more computing power or bigger

00:16:57.480 --> 00:16:59.919
models and moving towards the interface between

00:16:59.919 --> 00:17:02.820
humans and AI, towards the prompt itself. This

00:17:02.820 --> 00:17:04.980
has some really big implications. First, like

00:17:04.980 --> 00:17:07.779
we touched on, democratization. You might not

00:17:07.779 --> 00:17:10.099
need a giant tech company's resources to get

00:17:10.099 --> 00:17:12.519
huge improvements anymore. If you understand

00:17:12.519 --> 00:17:14.539
language, if you can think logically and creatively

00:17:14.539 --> 00:17:16.259
about how to ask questions, whether you're a

00:17:16.259 --> 00:17:18.900
developer, a writer, a historian, whatever, you

00:17:18.900 --> 00:17:21.480
can potentially build really powerful AI solutions

00:17:21.480 --> 00:17:24.079
just through smart prompting. It also points

00:17:24.079 --> 00:17:26.140
to these new roles emerging. Things like prompt

00:17:26.140 --> 00:17:29.119
engineer or AI interaction designer. Roles that

00:17:29.119 --> 00:17:31.680
need this blend of analytical skill, creativity,

00:17:32.279 --> 00:17:33.920
and maybe even a bit of empathy to understand

00:17:33.920 --> 00:17:36.160
how the AI might interpret things. And maybe

00:17:36.160 --> 00:17:38.019
the most exciting part is the potential for much

00:17:38.019 --> 00:17:40.539
faster development cycles. Think about it. In

00:17:40.539 --> 00:17:44.019
software 1.0, it was code, compile, debug, slow.

00:17:44.900 --> 00:17:47.549
Software 2.0? Gather data, train, evaluate,

00:17:47.750 --> 00:17:50.609
even slower, often. Software 3.0? It can be prompt,

00:17:50.809 --> 00:17:52.730
test, refine. You can change the LLM's behavior

00:17:52.730 --> 00:17:55.069
drastically, sometimes in seconds, just by tweaking

00:17:55.069 --> 00:17:57.750
a few words. That's a massive speed up compared

00:17:57.750 --> 00:18:00.160
to retraining a model for days or weeks. It's

00:18:00.160 --> 00:18:02.319
a really fundamental shift in how we build and

00:18:02.319 --> 00:18:04.400
interact with these systems. This whole exploration,

00:18:04.400 --> 00:18:06.380
it really does make you stop and think, doesn't

00:18:06.380 --> 00:18:08.700
it? It feels like a lesson in humility almost.

00:18:08.900 --> 00:18:11.319
In this constant drive for more complexity, more

00:18:11.319 --> 00:18:13.740
parameters, more intricate algorithms, maybe

00:18:13.740 --> 00:18:16.619
we sometimes overlook the power of simple, elegant

00:18:16.619 --> 00:18:20.299
solutions, especially solutions rooted in communication.

00:18:20.819 --> 00:18:23.279
Understanding how to talk to AI is becoming just

00:18:23.279 --> 00:18:25.759
as critical, maybe even more critical, than the

00:18:25.759 --> 00:18:28.180
engineering that goes into building the AI itself.

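To make that gentle nudge from earlier a bit more concrete, here is a toy sketch of the activation-steering idea. Everything in it — the sizes, the coefficient, the way the direction is found — is an illustrative assumption, not the actual implementation being discussed:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # toy hidden-state size (real models use thousands)

# Find a "stay faithful to the context" direction: the difference
# between mean activations on faithful vs. unfaithful example runs.
# (Random stand-ins here; in practice these come from the model.)
faithful = rng.normal(0.5, 1.0, size=(200, d_model))
unfaithful = rng.normal(-0.5, 1.0, size=(200, d_model))
direction = faithful.mean(axis=0) - unfaithful.mean(axis=0)
direction /= np.linalg.norm(direction)  # unit-length steering vector

def steer(hidden, direction, alpha=4.0):
    """The gentle, continuous nudge: at every generation step, add a
    scaled direction to the hidden state -- biasing it, never
    overwriting it."""
    return hidden + alpha * direction

h = rng.normal(size=d_model)   # stand-in hidden state for one step
h_steered = steer(h, direction)

# The steered state projects further onto the faithful direction,
# while differing from the original only by the small added nudge.
print(float(h @ direction), float(h_steered @ direction))
```

In a real model the same addition would happen inside a forward hook on a transformer layer, applied at every decoding step — which is why it reinforces the prompt throughout the whole answer rather than only at the start.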
00:18:29.130 --> 00:18:31.450
Thinking about this big shift from complex code

00:18:31.450 --> 00:18:34.349
and algorithms towards clever communication and

00:18:34.349 --> 00:18:36.869
prompting, how might that change the way you approach

00:18:36.869 --> 00:18:39.130
interacting with AI? Whether it's in your work

00:18:39.130 --> 00:18:41.630
or just daily life, what kinds of new possibilities

00:18:41.630 --> 00:18:44.230
does that open up for you? Reflecting on this

00:18:44.230 --> 00:18:46.490
deep dive, the main takeaway seems pretty profound.

00:18:46.849 --> 00:18:48.769
The biggest breakthrough in getting LLMs to actually

00:18:48.769 --> 00:18:50.650
listen to the context we give them wasn't some

00:18:50.650 --> 00:18:53.430
super complex new algorithm, it was just reframing

00:18:53.430 --> 00:18:55.910
the question. It wasn't about performing neurosurgery

00:18:55.910 --> 00:18:57.890
on the AI's digital brain, but simply changing

00:18:57.890 --> 00:19:00.529
how we talk to it. Yeah, absolutely. This really

00:19:00.529 --> 00:19:03.710
feels like it cements us in that software 3.0

00:19:03.710 --> 00:19:06.650
era where designing the prompt, crafting that

00:19:06.650 --> 00:19:08.470
interaction, is becoming the truly essential

00:19:08.470 --> 00:19:11.029
skill. It's a fantastic reminder, I think. Before

00:19:11.029 --> 00:19:13.069
you jump into a really expensive fine-tuning

00:19:13.069 --> 00:19:15.450
project, or get lost trying to understand the

00:19:15.450 --> 00:19:18.089
model's deepest internal workings, just pause

00:19:18.089 --> 00:19:20.690
for a second, take a breath, and ask yourself,

00:19:21.430 --> 00:19:23.509
is there a better way to ask this question? Is

00:19:23.509 --> 00:19:25.859
there a simpler prompt? Because as we saw today,

00:19:26.180 --> 00:19:28.180
the answer might be yes, and it might make all

00:19:28.180 --> 00:19:30.240
the difference. We really hope this deep dive

00:19:30.240 --> 00:19:32.240
offered some valuable insights, maybe sparked

00:19:32.240 --> 00:19:35.119
a few aha moments for you. Thank you for joining

00:19:35.119 --> 00:19:38.440
us on the deep dive. Until next time, keep exploring,

00:19:38.680 --> 00:19:40.920
keep asking questions, and definitely keep prompting.
