WEBVTT

00:00:00.000 --> 00:00:02.759
Imagine yelling at your smart speaker to turn

00:00:02.759 --> 00:00:04.480
off the living room lights, but leave the porch

00:00:04.480 --> 00:00:07.360
on. And instead, it cheerfully sets an alarm

00:00:07.360 --> 00:00:10.500
for 3 a.m. Oh, yeah. We've all been there. Right.

00:00:11.060 --> 00:00:13.759
Like we have computers that can instantly calculate

00:00:13.759 --> 00:00:16.719
the precise orbital trajectory to land a rover

00:00:16.719 --> 00:00:19.839
on Mars. Yet they completely fail to understand

00:00:19.839 --> 00:00:23.039
a basic sentence spoken by a human in a kitchen.

00:00:23.219 --> 00:00:25.820
It is surprisingly common. It really is. Yeah.

00:00:25.940 --> 00:00:28.920
So why is that? Today, we are diving into that

00:00:28.920 --> 00:00:31.719
exact frustration. Welcome to today's deep dive.

00:00:31.820 --> 00:00:34.100
Glad to be here. Our mission today is to explore

00:00:34.100 --> 00:00:37.340
a comprehensive Wikipedia article on natural

00:00:37.340 --> 00:00:41.500
language understanding, or NLU. We're going to

00:00:41.500 --> 00:00:43.679
figure out how scientists and engineers have

00:00:43.679 --> 00:00:46.859
been trying, honestly, for decades, to teach

00:00:46.859 --> 00:00:49.640
computers to genuinely comprehend our messy,

00:00:49.659 --> 00:00:51.939
complicated human language. And it is a massive

00:00:51.939 --> 00:00:54.020
challenge. You probably interact with some form

00:00:54.020 --> 00:00:56.060
of NLU every single day, whether you realize

00:00:56.060 --> 00:00:57.759
it or not. Oh, for sure. But, you know, there

00:00:57.759 --> 00:00:59.500
is a fundamental difference between a computer

00:00:59.500 --> 00:01:02.100
simply recording the audio of your voice and

00:01:02.100 --> 00:01:04.319
a computer actually comprehending your intent.

00:01:04.620 --> 00:01:07.290
OK, let's unpack this. Because... Natural language

00:01:07.290 --> 00:01:09.629
understanding is categorized in the sources as

00:01:09.629 --> 00:01:12.890
a very specific subset of a broader field, right?

00:01:12.950 --> 00:01:15.409
Yeah, natural language processing. Right, NLP

00:01:15.409 --> 00:01:18.790
within artificial intelligence. But NLU deals

00:01:18.790 --> 00:01:20.829
strictly with something called machine reading

00:01:20.829 --> 00:01:23.370
comprehension. Exactly. That distinction is vital.

00:01:23.769 --> 00:01:25.450
I mean, in the field of artificial intelligence,

00:01:25.650 --> 00:01:29.280
NLU is considered what we call an AI-hard problem.

00:01:29.400 --> 00:01:32.180
AI-hard. I mean, that sounds deliberately intimidating.

00:01:32.379 --> 00:01:34.739
Well, it's meant to convey the sheer scale of

00:01:34.739 --> 00:01:37.540
the obstacle. It means that to truly solve this

00:01:37.540 --> 00:01:39.840
problem, you essentially have to make a computer

00:01:39.840 --> 00:01:42.420
as intelligent as a human being. Wait, really?

00:01:42.680 --> 00:01:45.219
Just to understand language? Yeah. I mean, we've

00:01:45.219 --> 00:01:48.079
actually gotten incredibly good at turning spoken

00:01:48.079 --> 00:01:51.120
audio into written text, but that's just transcription.

00:01:51.299 --> 00:01:53.939
Right, just copying it down. Exactly. NLU is

00:01:53.939 --> 00:01:55.939
about automated reasoning. It's about taking

00:01:55.939 --> 00:01:58.659
those transcribed words, extracting the actual

00:01:58.659 --> 00:02:01.239
intent, and, you know, using it for large-scale

00:02:01.239 --> 00:02:03.680
content analysis or complex question answering.

00:02:04.019 --> 00:02:06.620
It is deriving actual meaning, not just copying

00:02:06.620 --> 00:02:08.879
letters. So let me push back on this a little

00:02:08.879 --> 00:02:12.099
bit. Is this just a machine doing advanced keyword

00:02:12.099 --> 00:02:15.199
matching? How do you mean? Well, think of a student

00:02:15.199 --> 00:02:17.639
hunting for vocabulary words in a textbook without

00:02:17.639 --> 00:02:20.400
actually reading the chapter. They scan the page,

00:02:20.780 --> 00:02:22.680
find the bolded word, write down the sentence,

00:02:22.939 --> 00:02:25.580
but they have absolutely no idea what the underlying

00:02:25.580 --> 00:02:28.360
concept means. Is that what these computers are

00:02:28.360 --> 00:02:30.620
doing? Historically, and even in many common

00:02:30.620 --> 00:02:33.120
systems today, yes, that is exactly what they're

00:02:33.120 --> 00:02:35.439
doing. Just advanced keyword matching. Right.

00:02:36.120 --> 00:02:38.639
But true natural language understanding requires

00:02:38.639 --> 00:02:41.439
something much deeper than that student hunting

00:02:41.439 --> 00:02:44.039
for a bolded word. It requires the system to

00:02:44.039 --> 00:02:46.599
build an internal representation of the semantics

00:02:46.599 --> 00:02:48.580
of a sentence. Let's define that for a second,

00:02:48.939 --> 00:02:51.159
because we hear syntax and semantics thrown around

00:02:51.159 --> 00:02:54.110
a lot. Syntax is just the grammar. Right? The

00:02:54.110 --> 00:02:57.090
rules of how words are ordered. Spot on. Syntax

00:02:57.090 --> 00:02:59.689
is relatively easy for a machine. Noun, verb,

00:02:59.930 --> 00:03:02.969
adjective. Sure. But semantics is the actual logic

00:03:02.969 --> 00:03:06.270
and meaning behind those words. True NLU relies

00:03:06.270 --> 00:03:08.289
on mapping that meaning into something called

00:03:08.289 --> 00:03:11.110
first-order logic. And what does first-order

00:03:11.110 --> 00:03:14.030
logic look like in plain English? Well, it's

00:03:14.030 --> 00:03:16.830
a way of translating our messy language into

00:03:16.830 --> 00:03:19.990
strict mathematical relationships. OK. So instead

00:03:19.990 --> 00:03:22.650
of just seeing the words a dog barks, first-order

00:03:22.650 --> 00:03:25.110
logic forces the computer to establish a rule.

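NOTE
A minimal sketch of the rule described in the next line, written out as a first-order logic formula; the predicate names Dog and Barks are illustrative, not from the source.
\forall x \, \bigl( \mathrm{Dog}(x) \rightarrow \mathrm{Barks}(x) \bigr)
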
00:03:25.189 --> 00:03:28.289
It says, for all entities, if an entity is a

00:03:28.289 --> 00:03:31.289
dog, then that entity barks. Oh, wow. It breaks

00:03:31.289 --> 00:03:33.539
it down that strictly. It has to. The machine

00:03:33.539 --> 00:03:36.180
has to explicitly understand who is doing what

00:03:36.180 --> 00:03:38.599
to whom, the relationship between the objects

00:03:38.599 --> 00:03:41.400
and the context. It cannot just rely on the fact

00:03:41.400 --> 00:03:44.020
that the sentence is grammatically correct. And

00:03:44.020 --> 00:03:46.740
because achieving that level of true reading

00:03:46.740 --> 00:03:50.479
comprehension, getting a machine to map out relationships

00:03:50.479 --> 00:03:53.750
like a human, is so absurdly difficult, early

00:03:53.750 --> 00:03:55.990
computer scientists had to rely on some pretty

00:03:55.990 --> 00:03:57.990
brilliant illusions to make computers appear

00:03:57.990 --> 00:03:59.729
as though they understood us. Oh, absolutely.

00:03:59.789 --> 00:04:02.189
They were forced to work within very strict artificial

00:04:02.189 --> 00:04:04.030
constraints. If you can't teach a machine the

00:04:04.030 --> 00:04:06.490
whole world, you teach it a tiny fake world.

00:04:06.729 --> 00:04:09.330
Which brings us to the 1960s. The early days.

00:04:09.530 --> 00:04:11.889
Yeah, the source material highlights this pioneering

00:04:11.889 --> 00:04:15.490
program called Student, written in 1964 by Daniel

00:04:15.490 --> 00:04:18.810
Bobrow at MIT. This was just eight years after

00:04:18.810 --> 00:04:21.370
the term artificial intelligence was even coined.

00:04:21.550 --> 00:04:24.160
It's wild to think about. And Student could

00:04:24.160 --> 00:04:27.019
actually take simple natural language input and

00:04:27.019 --> 00:04:29.980
solve algebra word problems. Which was hailed

00:04:29.980 --> 00:04:32.980
as a massive breakthrough at the time. But it

00:04:32.980 --> 00:04:36.120
worked specifically because the world it was

00:04:36.120 --> 00:04:38.160
operating in, algebra, was extremely restricted.

00:04:38.459 --> 00:04:40.399
Right. Highly logical and mathematically predictable.

00:04:40.579 --> 00:04:45.000
Exactly. So then a year later, in 1965, another

00:04:45.000 --> 00:04:48.180
MIT researcher, Joseph Weizenbaum, writes a

00:04:48.180 --> 00:04:50.560
program called ELIZA. And this one is famous.

00:04:50.779 --> 00:04:53.500
Oh, very famous. Eliza was an interactive program

00:04:53.500 --> 00:04:55.540
that carried on a dialogue in English, acting

00:04:55.540 --> 00:04:57.839
as a psychotherapist. You would type in your

00:04:57.839 --> 00:05:00.399
problems, and Eliza would respond. And people

00:05:00.399 --> 00:05:02.660
genuinely poured their hearts out to this machine.

00:05:02.819 --> 00:05:04.699
They really did. They felt deeply understood

00:05:04.699 --> 00:05:07.079
by it. But the source notes that Weizenbaum

00:05:07.079 --> 00:05:09.660
completely sidestepped the problem of giving

00:05:09.660 --> 00:05:12.899
the program a real world database. Eliza didn't

00:05:12.899 --> 00:05:15.290
have a rich vocabulary. Not at all. So if it

00:05:15.290 --> 00:05:18.129
just used simple parsing and substituted keywords

00:05:18.129 --> 00:05:21.350
into canned phrases, wasn't Eliza basically just

00:05:21.350 --> 00:05:23.910
an interactive game of Mad Libs? Like, it was

00:05:23.910 --> 00:05:26.110
just an illusion of empathy. What's fascinating

00:05:26.110 --> 00:05:28.430
here is it was entirely an illusion. A brilliant

00:05:28.430 --> 00:05:30.850
trick. Just a trick. Yeah, if you typed, I am

00:05:30.850 --> 00:05:33.370
unhappy, Eliza would just scan for the syntax

00:05:33.370 --> 00:05:36.230
I am, flip the pronoun, grab the next word, and

00:05:36.230 --> 00:05:38.050
plug it into a pre-written template. So it would

00:05:38.050 --> 00:05:40.430
just spit out, why are you unhappy? Exactly.

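NOTE
A rough Python sketch of the keyword-substitution trick just described; the pattern and canned template here are illustrative guesses at the style of Weizenbaum's script, not the original code.
import re
# ELIZA-style trick: spot the "I am ..." syntax, flip the pronoun, and plug
# the remaining words into a pre-written template. No understanding involved.
def eliza_reply(utterance: str) -> str:
    match = re.match(r"i am (.*)", utterance.strip().lower())
    if match:
        feeling = match.group(1).rstrip(".!?")  # grab whatever follows "I am"
        return f"Why are you {feeling}?"        # canned template
    return "Please tell me more."               # generic fallback phrase
print(eliza_reply("I am unhappy"))              # -> Why are you unhappy?
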
00:05:40.670 --> 00:05:43.069
It had zero understanding of what happiness or

00:05:43.069 --> 00:05:45.889
unhappiness actually meant. Yet, despite being

00:05:45.889 --> 00:05:48.430
a toy project based on keyword substitution,

00:05:49.029 --> 00:05:51.589
it became wildly popular. People loved it. They

00:05:51.589 --> 00:05:54.470
did. In many ways, Eliza served as the blueprint

00:05:54.470 --> 00:05:57.370
for early commercial customer service bots like

00:05:57.370 --> 00:06:00.670
Ask.com. The illusion of understanding was enough

00:06:00.670 --> 00:06:03.509
to fool us. And the history of these restricted

00:06:03.509 --> 00:06:06.209
microworlds just gets more interesting. Like

00:06:06.209 --> 00:06:10.009
in 1971, Terry Winograd writes a program called

00:06:10.009 --> 00:06:13.889
SHRDLU. Right, that's S-H-R-D-L-U. Yes,

00:06:14.149 --> 00:06:17.129
SHRDLU. It could understand English sentences

00:06:17.129 --> 00:06:19.889
well enough to direct a robotic arm to move children's

00:06:19.889 --> 00:06:22.149
blocks around on a computer screen. It was incredibly

00:06:22.149 --> 00:06:25.389
visual. Yeah. You could literally tell it: Find

00:06:25.389 --> 00:06:27.529
a block which is taller than the one you are

00:06:27.529 --> 00:06:30.470
holding, and put it into the box. And the machine

00:06:30.470 --> 00:06:32.970
would execute the physical action. But notice

00:06:32.970 --> 00:06:35.990
the constraint again, though. It proved a machine

00:06:35.990 --> 00:06:38.269
could parse language and map it to a physical

00:06:38.269 --> 00:06:41.310
action, but only within a restricted world of

00:06:41.310 --> 00:06:43.569
children's blocks. You didn't need to understand

00:06:43.569 --> 00:06:46.870
poetry or politics. Or sarcasm, right? It only

00:06:46.870 --> 00:06:49.129
needed to understand geometric shapes, colors,

00:06:49.290 --> 00:06:52.449
and spatial relationships, like on top of or

00:06:52.449 --> 00:06:55.230
inside. There's also this incredible bit of trivia in

00:06:55.230 --> 00:06:58.490
the sources about the lineage of this tech. A

00:06:58.490 --> 00:07:01.529
developer named Wayne Ratliff developed a program

00:07:01.529 --> 00:07:04.899
in the late 70s called Vulcan. Oh, the Star Trek

00:07:04.899 --> 00:07:07.480
connection. Yes. He built it with an English

00:07:07.480 --> 00:07:09.720
-like syntax, specifically because he wanted

00:07:09.720 --> 00:07:12.779
to mimic the voice-activated computer from Star

00:07:12.779 --> 00:07:14.660
Trek. He wanted to talk to his computer like

00:07:14.660 --> 00:07:16.540
they did on the Enterprise. The science fiction

00:07:16.540 --> 00:07:19.399
inspiration is always there in early AI. And

00:07:19.399 --> 00:07:22.199
that program, Vulcan, eventually evolved into

00:07:22.199 --> 00:07:25.040
a system called dBase, which practically launched

00:07:25.040 --> 00:07:27.220
the entire personal computer database industry.

00:07:27.360 --> 00:07:30.410
It's a huge legacy. Even the massive software

00:07:30.410 --> 00:07:33.490
company Symantec originally started out in 1982

00:07:33.490 --> 00:07:35.550
as a natural language understanding company.

00:07:35.889 --> 00:07:37.490
They were trying to build a natural language

00:07:37.490 --> 00:07:41.009
interface for database queries. But they pivoted

00:07:41.009 --> 00:07:43.439
away from natural language fairly quickly. Right,

00:07:43.639 --> 00:07:45.759
because the computer mouse went mainstream. Graphical

00:07:45.759 --> 00:07:48.079
user interfaces took over. It was just easier.

00:07:48.160 --> 00:07:50.040
Yeah, it was suddenly so much easier to just

00:07:50.040 --> 00:07:52.740
point and click a folder icon than to struggle

00:07:52.740 --> 00:07:55.879
with the friction of typing a perfectly grammatically

00:07:55.879 --> 00:07:58.779
correct English sentence to ask the computer

00:07:58.779 --> 00:08:01.139
to open a file. That makes perfect sense. I mean,

00:08:01.139 --> 00:08:03.959
why fight with early clunky NLP when a mouse

00:08:03.959 --> 00:08:05.759
click guarantees the machine understands your

00:08:05.759 --> 00:08:08.660
intent? Exactly. But moving past those early

00:08:08.660 --> 00:08:11.339
decades, researchers realized that tricks like

00:08:11.339 --> 00:08:15.620
ELIZA or tiny sandboxes like SHRDLU were dead

00:08:15.620 --> 00:08:18.240
ends for real-world application. They couldn't

00:08:18.240 --> 00:08:22.160
scale it. Right. To handle actual, everyday human

00:08:22.160 --> 00:08:25.120
language, they had to build an incredibly complex

00:08:25.120 --> 00:08:27.399
architecture under the hood. Here's where it

00:08:27.399 --> 00:08:30.259
gets really interesting. Because the amount of

00:08:30.259 --> 00:08:32.960
grueling manual labor required to build these

00:08:32.960 --> 00:08:36.240
internal systems is just staggering. To even

00:08:36.240 --> 00:08:39.600
begin to understand a sentence, a modern NLU

00:08:39.600 --> 00:08:42.980
system needs a lexicon, a massive dictionary

00:08:42.980 --> 00:08:45.299
of the language. But it's not just a standard

00:08:45.299 --> 00:08:47.899
dictionary like you'd find on a bookshelf. The

00:08:47.899 --> 00:08:50.820
source specifies it requires a rich lexicon with

00:08:50.820 --> 00:08:53.639
a suitable ontology. Okay, let's pause there.

00:08:54.000 --> 00:08:56.220
What exactly is an ontology in this context?

00:08:56.429 --> 00:08:59.649
Think of an ontology as a massive, invisible

00:08:59.649 --> 00:09:02.830
web of relationships. Like a mind map. Sort of.

00:09:03.149 --> 00:09:05.009
A regular dictionary just tells you the definition

00:09:05.009 --> 00:09:07.409
of an apple. An ontology tells the computer that

00:09:07.409 --> 00:09:09.750
an apple is a type of fruit, which grows on a

00:09:09.750 --> 00:09:12.269
tree, which is a plant, and an apple is related

00:09:12.269 --> 00:09:15.269
to the action of eating. Ah, I see. It maps out

00:09:15.269 --> 00:09:17.509
how every concept in the universe connects to

00:09:17.509 --> 00:09:19.649
every other concept. And building that web is

00:09:19.649 --> 00:09:22.509
no joke. The source explicitly mentions the WordNet

00:09:22.509 --> 00:09:26.320
net lexicon. It took many person years of pure

00:09:26.320 --> 00:09:28.879
human effort to construct. Just unbelievable

00:09:28.879 --> 00:09:31.139
amounts of labor. Human beings had to sit there,

00:09:31.240 --> 00:09:33.419
meticulously mapping out the relationships between

00:09:33.419 --> 00:09:35.799
tens of thousands of words by hand. Because that

00:09:35.799 --> 00:09:37.740
lexicon is just layer one of the architecture.

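NOTE
A toy Python sketch of the kind of ontology being described; these few entries and the helper name are invented for illustration, and the real WordNet is vastly larger and richer.
# Each entry links a word to its "is-a" parent and to related actions, so the
# system can walk a web of relationships instead of just reading a definition.
ONTOLOGY = {
    "apple": {"is_a": "fruit", "related_actions": ["eat"]},
    "fruit": {"is_a": "plant_part", "related_actions": []},
    "tree": {"is_a": "plant", "related_actions": ["grow"]},
}
def ancestors(word: str) -> list[str]:
    chain = []                        # follow is_a links upward
    while word in ONTOLOGY:
        word = ONTOLOGY[word]["is_a"]
        chain.append(word)
    return chain
print(ancestors("apple"))             # ['fruit', 'plant_part']
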
00:09:38.059 --> 00:09:40.259
Once you have the words and their relationships,

00:09:40.700 --> 00:09:43.340
you need a parser and grammar rules to break

00:09:43.340 --> 00:09:47.720
the sentences down. I picture a parser like

00:09:47.720 --> 00:09:49.700
diagramming a sentence back in middle school

00:09:49.700 --> 00:09:53.009
English class. Remember drawing those lines to

00:09:53.009 --> 00:09:55.870
separate the subject, the verb, and the prepositional

00:09:55.870 --> 00:09:58.590
phrase. That's a great analogy. But the computer

00:09:58.590 --> 00:10:01.129
has to do that instantly, almost in 3D, keeping

00:10:01.129 --> 00:10:03.830
track of dozens of overlapping grammatical rules.

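NOTE
A minimal Python sketch of a parser doing the sentence diagramming just described, using the NLTK toolkit; the toy grammar and sentence are invented for illustration, not from the source.
import nltk
# A tiny context-free grammar: subject noun phrase, verb, prepositional phrase.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V PP
PP -> P NP
Det -> 'the'
N -> 'dog' | 'box'
V -> 'sleeps'
P -> 'in'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog sleeps in the box".split()):
    print(tree)  # (S (NP (Det the) (N dog)) (VP (V sleeps) (PP (P in) (NP (Det the) (N box)))))
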
00:10:04.049 --> 00:10:07.070
That sounds exhausting. It is. And even after

00:10:07.070 --> 00:10:09.850
the parser diagrams the sentence, you arrive

00:10:09.850 --> 00:10:12.129
at the most difficult layer, the semantic theory.

00:10:12.529 --> 00:10:14.549
This is what guides the actual comprehension.

00:10:15.210 --> 00:10:17.009
Because without a semantic theory, you just have

00:10:17.009 --> 00:10:20.100
a beautifully diagrammed list of words. It's

00:10:20.100 --> 00:10:22.399
like handing someone a perfectly translated dictionary

00:10:22.399 --> 00:10:24.259
of a foreign language. They have the words, but

00:10:24.259 --> 00:10:26.659
no meaning. Exactly. They might know what the

00:10:26.659 --> 00:10:28.860
words mean individually, but they don't know

00:10:28.860 --> 00:10:31.799
the culture, the idioms, or the context. They

00:10:31.799 --> 00:10:34.039
wouldn't know if a phrase was meant to be a literal

00:10:34.039 --> 00:10:37.879
instruction or a sarcastic joke. Precisely. And

00:10:37.879 --> 00:10:40.399
the source outlines that these semantic theories

00:10:40.399 --> 00:10:44.019
operate on a spectrum. On one end, you have naive

00:10:44.019 --> 00:10:46.639
semantics. Naive semantics. Yeah, this is a very

00:10:46.639 --> 00:10:48.899
literal one-to-one interpretation. If you say,

00:10:49.259 --> 00:10:52.659
my car died, naive semantics assumes your vehicle

00:10:52.659 --> 00:10:55.240
was a living biological organism that has just

00:10:55.240 --> 00:10:57.460
passed away. Which obviously causes major errors

00:10:57.460 --> 00:10:59.659
in understanding. Right. So researchers try to

00:10:59.659 --> 00:11:02.200
push toward the other end of the spectrum, utilizing

00:11:02.200 --> 00:11:05.279
something called pragmatics. Pragmatics. Pragmatics

00:11:05.279 --> 00:11:08.460
is the ability to look beyond the literal dictionary

00:11:08.460 --> 00:11:11.460
definitions and derive meaning from context and

00:11:11.460 --> 00:11:13.340
speaker intent. Can you give an example?

00:11:13.580 --> 00:11:16.039
Sure. If you are sitting at a dinner table and

00:11:16.039 --> 00:11:18.980
ask, can you pass the salt? Pragmatics tells

00:11:18.980 --> 00:11:22.480
the system you are making a request, not asking

00:11:22.480 --> 00:11:24.820
a literal question about the person's physical

00:11:24.820 --> 00:11:27.320
capability to lift a salt shaker. Right. Figuring

00:11:27.320 --> 00:11:29.679
out pragmatics is hard enough for humans sometimes.

00:11:30.019 --> 00:11:32.379
So we have this massive web of relationships,

00:11:32.559 --> 00:11:35.379
the parser diagramming the grammar and the semantic

00:11:35.379 --> 00:11:38.320
theory trying to guess the context. Then what?

00:11:38.460 --> 00:11:41.399
Then advanced NLU systems attempt to incorporate

00:11:41.399 --> 00:11:44.700
logic inference. They map this derived meaning

00:11:44.700 --> 00:11:47.460
into a set of assertions using predicate logic.

00:11:47.580 --> 00:11:50.700
Okay, another big term, predicate logic. It connects

00:11:50.700 --> 00:11:52.740
back to that first-order logic we discussed

00:11:52.740 --> 00:11:56.000
earlier. Predicate logic is basically the mathematical

00:11:56.000 --> 00:11:58.399
machinery used to evaluate if a statement is

00:11:58.399 --> 00:12:01.100
true or false based on the variables. So back

00:12:01.100 --> 00:12:04.039
to the barking dogs. Exactly. If the machine

00:12:04.039 --> 00:12:06.700
establishes the rule all dogs bark and you tell

00:12:06.700 --> 00:12:09.700
it Fido is a dog, the predicate logic engine

00:12:09.700 --> 00:12:13.590
deduces the new truth: Fido barks. So it's learning.

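NOTE
A bare-bones Python sketch of the deduction just described, assuming a naive forward-chaining engine (an illustration of the idea, not any particular system): the rule "all dogs bark" plus the fact "Fido is a dog" yields the new assertion "Fido barks."
facts = {("dog", "fido")}              # asserted fact: Fido is a dog
rules = [(("dog",), "barks")]          # rule: for all x, dog(x) -> barks(x)
changed = True
while changed:                         # keep applying rules until nothing new appears
    changed = False
    for premises, conclusion in rules:
        for predicate, entity in list(facts):
            if predicate in premises and (conclusion, entity) not in facts:
                facts.add((conclusion, entity))   # deduce a truth never explicitly stated
                changed = True
print(facts)                           # {('dog', 'fido'), ('barks', 'fido')}
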
00:12:13.870 --> 00:12:16.809
It is actively deducing conclusions that weren't

00:12:16.809 --> 00:12:19.470
explicitly stated. And this is where the underlying

00:12:19.470 --> 00:12:21.610
programming language becomes a massive hurdle.

00:12:21.889 --> 00:12:23.590
Yeah, the text gets into some heavy specifics

00:12:23.590 --> 00:12:25.909
here about functional languages versus logic

00:12:25.909 --> 00:12:28.970
-oriented languages. Why does the actual coding

00:12:28.970 --> 00:12:30.950
language matter so much if they are all just

00:12:30.950 --> 00:12:33.149
running on computers? Well, it changes how the

00:12:33.149 --> 00:12:35.649
machine thinks at a foundational level. Take

00:12:35.649 --> 00:12:37.490
a functional programming language like Lisp,

00:12:37.590 --> 00:12:40.590
which was very popular in early AI. Functional

00:12:40.590 --> 00:12:42.909
languages are designed to evaluate mathematical

00:12:42.909 --> 00:12:45.649
expressions. They are not naturally built to

00:12:45.649 --> 00:12:48.419
handle logical deductions like Fido is a

00:12:48.419 --> 00:12:54.299
dog. So how do you make it understand that? You have to build a massive

00:12:54.299 --> 00:12:57.460
subsystem on top of Lisp just to represent and

00:12:57.460 --> 00:12:59.600
process those logical assertions. That sounds

00:12:59.600 --> 00:13:02.139
clunky. It is. Think of it like buying a standard

00:13:02.139 --> 00:13:04.379
car and trying to bolt airplane wings onto the

00:13:04.379 --> 00:13:07.200
roof so it can fly. Right. It's heavy, complex,

00:13:07.320 --> 00:13:09.360
and wasn't built for that purpose. But the source

00:13:09.360 --> 00:13:11.919
contrasts that with logic -oriented languages

00:13:11.919 --> 00:13:15.240
like Prolog. Right. Prolog is like buying

00:13:15.240 --> 00:13:17.879
an actual airplane. Its core foundation is already

00:13:17.879 --> 00:13:20.240
built on logical representation. Oh, I see. Yeah.

00:13:20.240 --> 00:13:22.559
In Prolog, you don't write complex functions

00:13:22.570 --> 00:13:25.169
to evaluate whether Fido barks. You just declare the

00:13:25.169 --> 00:13:28.549
fact. You literally write dog(fido). The logic

00:13:28.549 --> 00:13:30.669
is baked into the foundation of the language

00:13:30.669 --> 00:13:33.610
itself. OK. So if we have all of this today,

00:13:33.690 --> 00:13:36.549
these massive WordNet lexicons that took years

00:13:36.549 --> 00:13:39.429
to build, 3D parsers, pragmatic semantic theories,

00:13:39.809 --> 00:13:42.129
and predicate logic mapped out in languages like

00:13:42.129 --> 00:13:45.889
Prolog, why do our smart speakers still completely

00:13:45.889 --> 00:13:48.820
fail to understand us? Because of the ultimate

00:13:48.820 --> 00:13:51.740
trade -off in this entire field, every system

00:13:51.740 --> 00:13:54.159
built has to make a sacrifice on the matrix of

00:13:54.159 --> 00:13:56.720
natural language understanding. The matrix of

00:13:56.720 --> 00:14:00.179
breadth versus depth. Yes. This is the core bottleneck

00:14:00.179 --> 00:14:03.480
of the technology. When we evaluate an NLU system,

00:14:03.759 --> 00:14:07.120
we plot it on a chart with two axes. Okay, let's

00:14:07.120 --> 00:14:10.909
visualize this. Breadth is the sheer size of the vocabulary

00:14:10.909 --> 00:14:14.049
and the grammar it can handle. How wide is its

00:14:14.049 --> 00:14:17.029
knowledge of the world? And depth. Depth is the

00:14:17.029 --> 00:14:19.610
degree to which its understanding actually approximates

00:14:19.610 --> 00:14:22.769
that of a fluent native human speaker. How well

00:14:22.769 --> 00:14:25.389
does it comprehend the nuances, the pragmatics,

00:14:25.710 --> 00:14:28.009
and the hidden logic of what it knows? So if

00:14:28.009 --> 00:14:30.289
we picture this matrix, where do the things we

00:14:30.289 --> 00:14:33.250
use every day sit? Like my earlier example of

00:14:33.250 --> 00:14:34.889
telling a smart speaker to turn on the living

00:14:34.889 --> 00:14:37.230
room lights. That sits squarely in the narrow

00:14:37.230 --> 00:14:39.490
and shallow quadrant. Narrow and shallow. Yeah,

00:14:39.549 --> 00:14:41.490
it's narrow because it only knows commands related

00:14:41.490 --> 00:14:44.629
to timers, weather, and smart home devices. It's

00:14:44.629 --> 00:14:46.549
shallow because it doesn't understand the nuance

00:14:46.549 --> 00:14:49.750
of your request. It's just hunting for the keywords

00:14:49.750 --> 00:14:53.990
turn on and lights. Minimal complexity. But what

00:14:53.990 --> 00:14:56.909
about those old MIT programs? The ones stacking

00:14:56.909 --> 00:15:00.879
blocks like SHRDLU? Those occupy the narrow and

00:15:00.879 --> 00:15:03.759
deep quadrant. They were built primarily by researchers

00:15:03.759 --> 00:15:06.799
to explore the actual mechanisms of human understanding.

00:15:07.360 --> 00:15:11.480
They are incredibly deep. SHRDLU deeply understood

00:15:11.480 --> 00:15:13.940
the spatial relationships and semantics of its

00:15:13.940 --> 00:15:16.830
environment, but its application is impossibly

00:15:16.830 --> 00:15:20.649
narrow. It only knows about blocks. So if SHRDLU

00:15:20.649 --> 00:15:23.649
was so good at deeply understanding blocks, why

00:15:23.649 --> 00:15:25.830
couldn't researchers just add more words to its

00:15:25.830 --> 00:15:28.529
dictionary? Why not just expand its ontology

00:15:28.529 --> 00:15:31.070
to include cars and restaurants and emotions

00:15:31.070 --> 00:15:33.210
and make it understand the whole world? Because

00:15:33.210 --> 00:15:35.429
the moment you expand the vocabulary, the amount

00:15:35.429 --> 00:15:37.769
of ambiguity explodes exponentially. Oh, I didn't

00:15:37.769 --> 00:15:40.190
think about that. Yeah. Block has one meaning

00:15:40.190 --> 00:15:42.950
in a sandbox. In the real world, block could

00:15:42.950 --> 00:15:45.929
mean a toy, a city street, a mental block, or

00:15:45.929 --> 00:15:48.549
to obstruct someone's path. The deep systems

00:15:48.549 --> 00:15:50.909
collapse under the weight of real-world ambiguity.

00:15:51.029 --> 00:15:53.129
Which forces companies to build systems in the

00:15:53.129 --> 00:15:56.169
broad and shallow quadrant. Exactly. This is

00:15:56.169 --> 00:15:58.669
where massive commercial enterprise systems live.

00:15:59.149 --> 00:16:01.730
Imagine a program that automatically reads and

00:16:01.730 --> 00:16:04.110
routes millions of corporate customer service

00:16:04.110 --> 00:16:06.590
emails to the right department. Like reaching

00:16:06.590 --> 00:16:09.120
billing instead of tech support. Right. It has

00:16:09.120 --> 00:16:11.519
to deal with a massive real-world vocabulary.

00:16:11.840 --> 00:16:15.960
It is very broad, but it lacks true comprehension.

00:16:16.600 --> 00:16:19.539
It's not deeply analyzing the customer's childhood

00:16:19.539 --> 00:16:23.039
trauma or sarcasm. It's just classifying text

00:16:23.039 --> 00:16:25.419
based on statistical patterns. It is shallow.

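NOTE
A rough Python sketch of the broad-but-shallow routing just described: a surface-level keyword classifier. The department names and trigger words are invented for illustration; real enterprise systems use larger statistical models, but the shallowness is the same.
# Count department-specific keywords and pick the best match. There is no grasp
# of tone, sarcasm, or context, only surface patterns in the text.
DEPARTMENT_KEYWORDS = {
    "billing": {"invoice", "charge", "refund", "payment"},
    "tech_support": {"error", "crash", "login", "password"},
}
def route_email(body: str) -> str:
    words = set(body.lower().split())
    scores = {dept: len(words & kws) for dept, kws in DEPARTMENT_KEYWORDS.items()}
    return max(scores, key=scores.get)   # ties resolve arbitrarily; it is shallow
print(route_email("Please refund the duplicate payment on my invoice"))  # billing
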
00:16:25.620 --> 00:16:28.919
Which leaves the final Holy Grail quadrant broad

00:16:28.919 --> 00:16:31.600
and deep. A system that knows every word and

00:16:31.600 --> 00:16:33.740
topic in the human experience and understands

00:16:33.740 --> 00:16:36.200
them with the pragmatic depth of a fluent native

00:16:36.200 --> 00:16:39.070
speaker. The dream of AI. But the source material

00:16:39.070 --> 00:16:41.269
is very clear on this. Systems that are both

00:16:41.269 --> 00:16:43.929
very broad and very deep are currently beyond

00:16:43.929 --> 00:16:45.970
the state of the art. Completely beyond it right

00:16:45.970 --> 00:16:47.950
now. So what does this all mean? Because I have

00:16:47.950 --> 00:16:49.830
to push back here on the idea that broad and

00:16:49.830 --> 00:16:52.110
deep doesn't exist. What about a system like

00:16:52.110 --> 00:16:55.289
IBM Watson? Ah. The source brings up Watson using

00:16:55.289 --> 00:16:59.240
machine learning for text classification. When

00:16:59.240 --> 00:17:01.700
Watson went on Jeopardy and beat the human champions,

00:17:02.299 --> 00:17:04.940
didn't that prove machines can understand broad

00:17:04.940 --> 00:17:08.279
and deep topics? It was answering incredibly

00:17:08.279 --> 00:17:10.700
complex, pun-filled questions about history,

00:17:11.259 --> 00:17:13.579
science, and pop culture. If we connect this

00:17:13.579 --> 00:17:15.819
to the bigger picture, that specific question

00:17:15.819 --> 00:17:18.299
is the philosophical debate tearing the AI field

00:17:18.299 --> 00:17:21.559
apart right now. Really? Yes. Did Watson actually

00:17:21.559 --> 00:17:24.059
understand those Jeopardy clues, or is it just

00:17:24.059 --> 00:17:27.069
incredibly good at statistical prediction? The

00:17:27.069 --> 00:17:29.230
source cites the famous philosopher John Searle,

00:17:29.450 --> 00:17:32.009
who argued explicitly that Watson did not understand

00:17:32.009 --> 00:17:34.230
the questions it was answering. How so? It gave

00:17:34.230 --> 00:17:36.230
the right answers. It gave the right answers

00:17:36.230 --> 00:17:39.009
by computing probabilities based on vast amounts

00:17:39.009 --> 00:17:41.549
of data, not by understanding meaning. I'm not

00:17:41.549 --> 00:17:44.339
sure I follow. Think about the word bank. When

00:17:44.339 --> 00:17:46.980
Watson reads a clue with the word bank, it doesn't

00:17:46.980 --> 00:17:49.160
vividly imagine a brick building with a vault

00:17:49.160 --> 00:17:52.140
and security guards. It simply calculates that

00:17:52.140 --> 00:17:55.579
in its massive database of text, the word bank

00:17:55.579 --> 00:17:59.480
appears next to the word money 85% of the time,

00:17:59.700 --> 00:18:02.759
and next to the word river 15% of the time.

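NOTE
A toy Python sketch of the statistical guessing being described; the tiny corpus is made up, and the percentages in the conversation (85/15) are illustrative, not Watson's actual data or method.
from collections import Counter
# Count which words appear near "bank" in a small corpus, then guess the most
# probable sense purely from co-occurrence statistics, with no model of meaning.
corpus = [
    "she deposited money at the bank",
    "the bank charged a fee on the money transfer",
    "they walked along the river bank",
]
neighbors = Counter()
for sentence in corpus:
    words = sentence.split()
    if "bank" in words:
        neighbors.update(w for w in words if w != "bank")
relevant = {w: neighbors[w] for w in ("money", "river")}
total = sum(relevant.values())
for word, count in relevant.items():
    print(word, f"{count / total:.0%}")   # e.g. money 67%, river 33% in this toy corpus
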
00:18:02.900 --> 00:18:05.720
Oh, wow. Yeah, it uses statistics to guess the

00:18:05.720 --> 00:18:08.559
most probable missing piece of the puzzle. So

00:18:08.559 --> 00:18:11.119
we are right back to the student hunting for

00:18:11.119 --> 00:18:13.680
vocabulary words. Watson is just doing it at

00:18:13.680 --> 00:18:16.619
lightning speed across a billion textbooks simultaneously.

00:18:17.400 --> 00:18:19.500
Exactly. And the source includes the perspective

00:18:19.500 --> 00:18:21.980
of cognitive scientist John Ball, who invented

00:18:21.980 --> 00:18:25.759
Patom Theory. He strongly supports Searle's assessment.

00:18:26.099 --> 00:18:28.720
What does he argue? Ball argues that NLP has

00:18:28.720 --> 00:18:31.200
largely succeeded today by secretly narrowing

00:18:31.200 --> 00:18:33.519
the scope of the application or by relying on

00:18:33.519 --> 00:18:36.440
those statistical parlor tricks, not by actually

00:18:36.440 --> 00:18:39.720
achieving broad and deep comprehension. Right.

00:18:39.859 --> 00:18:42.259
Because there are still thousands of ways a human

00:18:42.259 --> 00:18:44.579
can request something that completely defies

00:18:44.579 --> 00:18:47.859
conventional statistical NLP. Because human language

00:18:47.859 --> 00:18:49.940
isn't just a math equation of probabilities.

00:18:50.480 --> 00:18:53.720
It relies on shared physical context, body language,

00:18:54.220 --> 00:18:56.960
cultural shifts, and shifting definitions. The

00:18:56.960 --> 00:18:59.680
article quotes an executive named Wibe Wagemans

00:18:59.680 --> 00:19:02.380
on this exact point. He says that to have a truly

00:19:02.380 --> 00:19:05.279
meaningful conversation with a machine, the computer

00:19:05.279 --> 00:19:08.160
must match every single word to the correct meaning,

00:19:08.680 --> 00:19:10.700
based entirely on the meanings of the other words

00:19:10.700 --> 00:19:13.380
in the sentence without relying on statistical

00:19:13.380 --> 00:19:16.240
guesswork. Which, based on everything we've discussed

00:19:16.240 --> 00:19:19.039
about parsers and ontologies and breadth versus

00:19:19.039 --> 00:19:22.240
depth, sounds technologically impossible. But

00:19:22.240 --> 00:19:24.819
as Wagemans points out, that impossible standard

00:19:24.819 --> 00:19:26.900
is exactly what a three-year-old human does

00:19:26.900 --> 00:19:29.539
naturally. A three-year-old? Yeah. A toddler

00:19:29.539 --> 00:19:32.579
can understand shifting context, tone, and intent

00:19:32.579 --> 00:19:35.279
without running massive statistical probability

00:19:35.279 --> 00:19:38.380
calculations on a server farm. They just understand.

00:19:38.680 --> 00:19:41.400
A machine currently cannot. That is incredibly

00:19:41.400 --> 00:19:43.579
humbling. We have these massive supercomputers,

00:19:43.640 --> 00:19:46.099
and they still can't match the true semantic

00:19:46.099 --> 00:19:48.849
comprehension of a toddler. This debate about

00:19:48.849 --> 00:19:51.210
whether machines truly understand our logic or

00:19:51.210 --> 00:19:53.730
just predict our syntax really highlights where

00:19:53.730 --> 00:19:55.849
we are in the history of computing. It is a profound

00:19:55.849 --> 00:19:57.849
reality check on our technological progress.

00:19:58.210 --> 00:20:00.670
So let's recap this journey we've been on. We

00:20:00.670 --> 00:20:03.890
started in the 1960s with Eliza, proving that

00:20:03.890 --> 00:20:06.990
simple keyword substitution could create a powerful,

00:20:07.210 --> 00:20:10.049
albeit fake, illusion of empathy. A brilliant

00:20:10.049 --> 00:20:11.910
trick. We moved through the decades, discovering

00:20:11.910 --> 00:20:14.490
the grueling, multi-year manual labor required

00:20:14.490 --> 00:20:17.329
to build vast ontological lexicons like WordNet,

00:20:17.789 --> 00:20:20.349
and the necessity of establishing pragmatic semantic

00:20:20.349 --> 00:20:23.130
theories and predicate logic systems. We witnessed

00:20:23.130 --> 00:20:27.250
the evolution from clever parlor tricks to incredibly

00:20:27.250 --> 00:20:30.019
dense structural architecture. And yet today,

00:20:30.380 --> 00:20:32.900
we are still caught in that ultimate matrix of

00:20:32.900 --> 00:20:35.740
breadth versus depth. We have systems that are

00:20:35.740 --> 00:20:38.380
broad enough to scan millions of emails and systems

00:20:38.380 --> 00:20:40.140
that are deep enough to perfectly understand

00:20:40.140 --> 00:20:43.019
a tiny sandbox, but combining them into a machine

00:20:43.019 --> 00:20:45.099
that can truly converse about the whole world

00:20:45.099 --> 00:20:48.160
remains the holy grail currently beyond our reach.

00:20:48.779 --> 00:20:50.740
The three-year-old standard remains unbroken.

00:20:50.839 --> 00:20:53.900
It certainly does. So, for you listening, the

00:20:53.900 --> 00:20:56.079
next time an automated system magically routes

00:20:56.079 --> 00:20:58.599
your customer service email to the perfect department,

00:20:58.859 --> 00:21:02.359
or completely fails to grasp a simple spoken

00:21:02.359 --> 00:21:05.299
command to set a kitchen timer, you now know

00:21:05.299 --> 00:21:07.660
exactly what is happening under the hood. You've

00:21:07.660 --> 00:21:09.880
seen the matrix. Yeah, you know exactly where

00:21:09.880 --> 00:21:12.200
that specific program sits on the breadth versus

00:21:12.200 --> 00:21:15.299
depth chart. You know it's missing the complex

00:21:15.299 --> 00:21:18.220
semantic modeling of a fluent speaker and why

00:21:18.220 --> 00:21:21.319
it is still struggling to do what a toddler does

00:21:21.319 --> 00:21:24.119
effortlessly. This raises an important question

00:21:24.119 --> 00:21:25.880
and it's something to think about the next time

00:21:25.880 --> 00:21:28.200
you interact with one of these systems. What's

00:21:28.200 --> 00:21:31.289
that? Well, if achieving that true broad and

00:21:31.289 --> 00:21:33.730
deep machine understanding is currently beyond

00:21:33.730 --> 00:21:36.450
our engineering capabilities and computers constantly

00:21:36.450 --> 00:21:38.910
struggle with our messy, context-heavy language,

00:21:39.289 --> 00:21:41.089
will we eventually stop waiting for the machines

00:21:41.089 --> 00:21:43.130
to learn our language? That's interesting. Right.

00:21:43.269 --> 00:21:45.950
Or will we subconsciously standardize and simplify

00:21:45.950 --> 00:21:48.950
the way we speak, smoothing out our own messy

00:21:48.950 --> 00:21:52.170
human quirks just to ensure the algorithms finally

00:21:52.170 --> 00:21:55.539
understand us? Wow. We started by saying we like

00:21:55.539 --> 00:21:57.779
computers because they are exact and predictable,

00:21:58.079 --> 00:22:01.380
like a rigid calculator. But maybe, in our desperation

00:22:01.380 --> 00:22:03.539
to be understood by them, we are the ones who

00:22:03.539 --> 00:22:05.700
will start talking like the calculators. Thank

00:22:05.700 --> 00:22:07.819
you for joining us on this deep dive. We will

00:22:07.819 --> 00:22:08.480
see you next time.
