WEBVTT

00:00:00.000 --> 00:00:03.940
Imagine trying to draw this, like, perfect,

00:00:03.940 --> 00:00:06.339
down-to-the-inch architectural blueprint of a house.

00:00:07.139 --> 00:00:09.140
But you have to do it while sprinting through

00:00:09.140 --> 00:00:11.960
the hallways, completely blindfolded. Oh, I mean,

00:00:12.099 --> 00:00:14.240
that sounds totally impossible. Right. It sounds

00:00:14.240 --> 00:00:17.109
like a complete nightmare. But that is the exact

00:00:17.109 --> 00:00:19.370
mathematical puzzle happening inside a robot

00:00:19.370 --> 00:00:21.629
vacuum cleaner right now. Yeah, it really is.

00:00:22.289 --> 00:00:24.469
Or think about when you strap on a VR headset

00:00:24.469 --> 00:00:26.429
and you physically walk around your living room.

00:00:26.870 --> 00:00:29.129
You're exploring a digital castle and you expect

00:00:29.129 --> 00:00:31.390
the digital walls to perfectly match your real

00:00:31.390 --> 00:00:33.649
furniture. Which we all just take entirely for

00:00:33.649 --> 00:00:36.369
granted now. Absolutely. Underneath that seamless

00:00:36.369 --> 00:00:39.009
experience is honestly one of the most maddeningly

00:00:39.009 --> 00:00:41.649
complex computational puzzles we have ever tried

00:00:41.649 --> 00:00:44.729
to solve. For a machine to just see and move

00:00:44.729 --> 00:00:47.070
in the physical world without immediately crashing

00:00:47.070 --> 00:00:49.789
into a wall, it has to perform this constant

00:00:49.789 --> 00:00:52.969
high-speed juggling act of probability, geometry,

00:00:53.729 --> 00:00:56.469
and, well, guesswork. OK, let's unpack this.

00:00:56.909 --> 00:00:59.679
Welcome to the deep dive. Today our mission is

00:00:59.679 --> 00:01:03.039
to explore a fascinating technology called simultaneous

00:01:03.039 --> 00:01:06.079
localization and mapping. Commonly known as SLAM.

00:01:06.319 --> 00:01:10.040
S-L-A-M. Right. SLAM. So for you listening,

00:01:10.599 --> 00:01:13.000
if you have ever wondered how self-driving cars,

00:01:13.560 --> 00:01:16.500
delivery drones, or even those augmented reality

00:01:16.500 --> 00:01:19.280
apps on your phone actually perceive the physical

00:01:19.280 --> 00:01:22.819
space around them, SLAM is the invisible magic

00:01:22.819 --> 00:01:25.400
making it all happen. It's the invisible scaffolding

00:01:25.400 --> 00:01:28.930
of autonomy really. But to really grasp how this

00:01:28.930 --> 00:01:32.530
works, we have to start with this massive fundamental

00:01:32.530 --> 00:01:35.709
logical paradox at the center of the technology.

00:01:35.930 --> 00:01:38.430
Yes, the core of SLAM is literally a chicken

00:01:38.430 --> 00:01:40.870
or the egg problem. I love that the source material

00:01:40.870 --> 00:01:42.810
actually calls it that. I mean, it's the best

00:01:42.810 --> 00:01:45.409
way to describe it. The paradox is incredibly

00:01:45.409 --> 00:01:48.769
simple. You cannot navigate an environment without

00:01:48.769 --> 00:01:52.030
a map, but you cannot build a map of an unknown

00:01:52.030 --> 00:01:54.189
environment without knowing exactly where you

00:01:54.189 --> 00:01:56.739
are within it. It's an endless loop. How do you

00:01:56.739 --> 00:01:58.500
map the room if you don't know where you're standing?

00:01:59.019 --> 00:02:00.099
But how do you know where you're standing if

00:02:00.099 --> 00:02:02.260
you don't have a map? Exactly. Going back to

00:02:02.260 --> 00:02:04.219
that blindfolded house scenario, say you wake

00:02:04.219 --> 00:02:06.719
up in pitch black. You take a tentative step

00:02:06.719 --> 00:02:09.520
forward, reach out, feel a wall, and you start

00:02:09.520 --> 00:02:11.780
drawing a mental floor plan in your head. But

00:02:11.780 --> 00:02:14.439
because you're blindfolded, your sense of how

00:02:14.439 --> 00:02:16.740
many steps you've actually taken isn't perfect.

00:02:17.280 --> 00:02:19.960
Maybe your stride was like a little wider than

00:02:19.960 --> 00:02:22.310
you thought. So your estimate of your own location

00:02:22.310 --> 00:02:24.930
is just a very good guess. Which means the map

00:02:24.930 --> 00:02:27.110
you were drawing in your head based on that location

00:02:27.110 --> 00:02:30.009
is also just a guess. Right. Both your location

00:02:30.009 --> 00:02:32.689
and your map are nothing more than continuously

00:02:32.689 --> 00:02:35.169
updated approximations. And that blindfolded

00:02:35.169 --> 00:02:38.289
guesswork really highlights the exact mathematical

00:02:38.289 --> 00:02:40.949
reality that SLAM algorithms have to navigate.

00:02:41.469 --> 00:02:43.770
They aren't dealing in absolute certainties at

00:02:43.770 --> 00:02:45.830
all. They're just guessing. They're dealing entirely

00:02:45.830 --> 00:02:48.750
in probabilities. The objective of SLAM is not

00:02:48.879 --> 00:02:52.060
absolute flawless perfection. The goal is what

00:02:52.060 --> 00:02:54.599
they call operational compliance. Operational

00:02:54.599 --> 00:02:56.979
compliance. Meaning it just has to be good enough

00:02:56.979 --> 00:02:59.979
to not crash. Exactly. It just has to be mathematically

00:02:59.979 --> 00:03:02.379
tight enough for the robot to accomplish its

00:03:02.379 --> 00:03:05.280
task without failing. And the way a machine achieves

00:03:05.280 --> 00:03:08.060
this is by applying Bayes' rule. OK, Bayes' rule.

00:03:08.479 --> 00:03:10.500
So we're talking about the statistical principle

00:03:10.500 --> 00:03:13.340
of updating probabilities based on new evidence.

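NOTE
In symbols, that principle applied to SLAM is the standard Bayes-filter
recursion over the joint pose-and-map state (a sketch of the textbook form,
with x_t the pose, m the map, z the sensor observations, u the motor commands):
  p(x_t, m \mid z_{1:t}, u_{1:t}) \propto p(z_t \mid x_t, m) \int p(x_t \mid x_{t-1}, u_t) \, p(x_{t-1}, m \mid z_{1:t-1}, u_{1:t-1}) \, dx_{t-1}
The first factor scores how well the new sensor reading fits a candidate pose
and map; the integral is the motion prediction carried over from the previous belief.
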
00:03:13.740 --> 00:03:16.800
Yes. The system is calculating what we call location

00:03:16.800 --> 00:03:20.800
posteriors. Location posteriors. OK, let's break

00:03:20.800 --> 00:03:22.699
down the mechanics of how that actually happens

00:03:22.699 --> 00:03:26.039
for the robot. Sure. So the robot holds a mathematical

00:03:26.039 --> 00:03:28.159
model of where it thinks it is and a model of

00:03:28.159 --> 00:03:30.360
what it thinks the map looks like. Every time

00:03:30.360 --> 00:03:33.020
the robot moves, it applies a transition function.

00:03:33.099 --> 00:03:35.419
Which is just a fancy way of saying it predicts

00:03:35.419 --> 00:03:37.719
where it went. Right. It's a mathematical prediction

00:03:37.719 --> 00:03:40.240
of where it should be based strictly on the commands

00:03:40.240 --> 00:03:43.930
it's sent to its motors. So it thinks... I told

00:03:43.930 --> 00:03:45.810
my wheels to roll forward for one second, so

00:03:45.810 --> 00:03:47.569
I should be two feet ahead. OK, makes sense.

00:03:47.729 --> 00:03:49.930
Then it takes in new sensor observations from

00:03:49.930 --> 00:03:52.650
the physical world. It compares what it actually

00:03:52.650 --> 00:03:55.849
senses to what it expected to sense, and it sequentially

00:03:55.849 --> 00:03:58.650
updates both its location and the map. So it's

00:03:58.650 --> 00:04:01.509
constantly asking itself, given the raw sensor

00:04:01.509 --> 00:04:03.490
data I just collected and the motor movements

00:04:03.490 --> 00:04:06.349
I just made, what is the most statistically probable

00:04:06.349 --> 00:04:09.469
reality? Exactly. You nailed it. Doing that once

00:04:09.469 --> 00:04:12.650
is impressive. Doing it dozens or hundreds of

00:04:12.650 --> 00:04:15.030
times a second is mind-blowing. Oh, it's an

00:04:15.030 --> 00:04:18.209
incredible feat of computation. But wait, this

00:04:18.209 --> 00:04:22.120
leads me to a major mechanical problem. If SLAM

00:04:22.120 --> 00:04:24.980
relies on continuous probabilistic guessing,

00:04:25.800 --> 00:04:27.540
how does the math keep the robot from getting

00:04:27.540 --> 00:04:30.000
completely lost in its own cascading errors?

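NOTE
A minimal numeric sketch of the predict-update cycle described above, for a
robot in a 1-D hallway with one known wall (all names and noise values here
are illustrative):
  import numpy as np
  # Belief over the robot's position, kept as a probability grid.
  positions = np.linspace(0.0, 10.0, 101)
  belief = np.ones_like(positions) / len(positions)  # start fully uncertain
  wall_at = 7.0                                      # assumed landmark position
  def predict(belief, commanded_move, motion_sigma=0.3):
      # Transition function: shift the belief by the commanded motion, then
      # blur it, because the wheels never execute the command perfectly.
      shifted = np.interp(positions - commanded_move, positions, belief, left=0, right=0)
      kernel = np.exp(-0.5 * ((positions[:, None] - positions[None, :]) / motion_sigma) ** 2)
      blurred = kernel @ shifted
      return blurred / blurred.sum()
  def update(belief, measured_range, sensor_sigma=0.2):
      # Weight each candidate position by how well it explains the reading.
      likelihood = np.exp(-0.5 * ((measured_range - (wall_at - positions)) / sensor_sigma) ** 2)
      posterior = belief * likelihood
      return posterior / posterior.sum()
  belief = predict(belief, commanded_move=1.0)  # "I told my wheels to roll forward"
  belief = update(belief, measured_range=4.0)   # "what do I actually sense?"
  print("most probable position:", positions[np.argmax(belief)])
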
00:04:30.199 --> 00:04:32.600
That is the million-dollar question. Because

00:04:32.600 --> 00:04:35.420
if my mental map in that blindfolded house is

00:04:35.420 --> 00:04:38.439
off by just one inch on step one, by step 50,

00:04:38.839 --> 00:04:40.819
I've accumulated so much error that I'm trying

00:04:40.819 --> 00:04:43.740
to walk through a refrigerator. That battle against

00:04:43.740 --> 00:04:46.420
cascading error is literally what drove decades

00:04:46.420 --> 00:04:49.379
of algorithmic research. Really? Yeah. In the

00:04:49.379 --> 00:04:52.139
1990s and early 2000s, the dominant method to

00:04:52.139 --> 00:04:54.839
solve this was the Extended Kalman Filter, or

00:04:54.839 --> 00:04:58.100
EKF. EKF, OK. Right. EKF SLAM is feature-based.

00:04:58.420 --> 00:05:00.639
It tries to estimate the probability distribution

00:05:00.639 --> 00:05:03.620
for the robot's pose, meaning its exact position

00:05:03.620 --> 00:05:06.019
and orientation, alongside the coordinates of

00:05:06.019 --> 00:05:08.519
landmarks it sees in the map. There is a pretty

00:05:08.519 --> 00:05:10.660
significant Achilles heel to EKF, though, right?

00:05:10.779 --> 00:05:12.660
Like, it struggles the second the real world

00:05:12.660 --> 00:05:14.759
gets too messy. Yeah, it completely falls apart.

00:05:14.899 --> 00:05:17.120
And the reason EKF fails in highly uncertain

00:05:17.120 --> 00:05:19.579
environments is that it relies heavily on a Gaussian

00:05:19.579 --> 00:05:21.899
noise assumption. OK, hold on. Gaussian noise

00:05:21.899 --> 00:05:23.860
assumption, let's translate that. It basically

00:05:23.860 --> 00:05:26.740
assumes that the errors or uncertainties in its

00:05:26.740 --> 00:05:29.300
measurements will always follow a nice, neat,

00:05:29.660 --> 00:05:32.339
predictable bell curve. Ah, OK. But the real

00:05:32.339 --> 00:05:34.500
world doesn't always produce neat bell curves

00:05:34.500 --> 00:05:37.259
of error. When a robot hits an unexpected

00:05:37.259 --> 00:05:40.120
patch of terrain or a sensor glitches wildly,

00:05:40.420 --> 00:05:44.199
the error is non-Gaussian. It's messy. Exactly.

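NOTE
A minimal 1-D sketch of the Gaussian (Kalman-style) update that EKF SLAM
leans on, and of how a glitched reading poisons it (all numbers illustrative):
  # Belief and measurement are both bell curves, so the new mean is just a
  # precision-weighted blend of the two.
  def kalman_update(mean, var, z, z_var):
      gain = var / (var + z_var)        # how much to trust the measurement
      return mean + gain * (z - mean), (1 - gain) * var
  mean, var = 2.0, 0.5                  # belief: position ~ N(2.0, 0.5)
  mean, var = kalman_update(mean, var, z=2.2, z_var=0.4)  # well-behaved reading
  print(mean, var)                      # nudged toward 2.2, variance shrinks
  mean, var = kalman_update(mean, var, z=9.0, z_var=0.4)  # wild sensor glitch
  print(mean, var)                      # trusted as a bell curve anyway; belief dragged far off
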
00:05:44.560 --> 00:05:47.860
The mathematical linearization in the EKF just

00:05:47.860 --> 00:05:50.180
shatters under that unpredictability, and the

00:05:50.180 --> 00:05:53.519
robot gets hopelessly lost. So to solve this,

00:05:53.800 --> 00:05:56.379
the industry moved toward newer methods like

00:05:56.379 --> 00:05:59.220
graph SLAM. Graph SLAM. And graph SLAM approaches

00:05:59.220 --> 00:06:01.480
the map completely differently, right? Very different.

00:06:01.540 --> 00:06:04.459
Instead of just guessing step by step, it uses

00:06:04.459 --> 00:06:07.420
sparse information matrices to generate a factor

00:06:07.420 --> 00:06:09.879
graph of observation interdependencies. Wow,

00:06:09.920 --> 00:06:11.839
look at you with the technical terms. I read

00:06:11.839 --> 00:06:13.959
the source material, but let's clarify how that

00:06:13.959 --> 00:06:15.779
actually functions because a factor graph of

00:06:15.779 --> 00:06:18.540
interdependencies sounds like pure jargon. It

00:06:18.540 --> 00:06:21.139
really does. Are we talking about the robot tethering

00:06:21.139 --> 00:06:23.639
its memories together? Like if it sees a doorway

00:06:23.639 --> 00:06:26.259
at minute one and then sees that exact same doorway

00:06:26.259 --> 00:06:29.339
at minute five from a different angle, does it

00:06:29.339 --> 00:06:31.980
realize those two data points are the same physical

00:06:31.980 --> 00:06:34.699
object? That tethering concept is the perfect

00:06:34.699 --> 00:06:37.800
way to visualize it. Imagine the robot's path

00:06:37.800 --> 00:06:40.420
is a string of nodes. And every time it observes

00:06:40.420 --> 00:06:43.139
a landmark, it connects a mathematical rubber

00:06:43.139 --> 00:06:45.620
band between its current position and that landmark.

00:06:45.839 --> 00:06:48.040
OK, I like the rubber band analogy. Right. So

00:06:48.040 --> 00:06:51.060
when it recognizes that same doorway later, it

00:06:51.060 --> 00:06:53.980
connects another rubber band. Graph SLAM optimizes

00:06:53.980 --> 00:06:56.519
all these interdependencies at once. So instead

00:06:56.519 --> 00:06:58.879
of relying on a fragile chain of sequential guesses

00:06:58.879 --> 00:07:02.019
where one mistake ruins everything. Yes, exactly.

00:07:02.480 --> 00:07:04.459
The tension of all those mathematical rubber

00:07:04.459 --> 00:07:07.180
bands naturally pulls the entire map into the

00:07:07.180 --> 00:07:10.500
most stable global shape. That is brilliant.

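NOTE
A minimal sketch of the rubber-band picture as least squares over a tiny
1-D pose graph (values illustrative): every edge pulls on the poses it
connects, and solving the whole system at once settles the map.
  import numpy as np
  # Four poses; odometry claims each step moved +1.0, but a loop closure
  # says pose 3 is actually back where pose 0 was. Each edge is one row
  # of A x = b, i.e. one rubber band.
  edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0),  # odometry edges
           (0, 3, 0.0)]                            # loop closure: x3 - x0 = 0
  A = np.zeros((len(edges) + 1, 4))
  b = np.zeros(len(edges) + 1)
  for row, (i, j, measured) in enumerate(edges):
      A[row, i], A[row, j], b[row] = -1.0, 1.0, measured
  A[-1, 0] = 1.0                                   # anchor pose 0 at the origin
  x, *_ = np.linalg.lstsq(A, b, rcond=None)
  print(x)  # the contradiction is spread evenly instead of ruining everything
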
00:07:10.939 --> 00:07:12.620
I have to push back on the physical movement

00:07:12.620 --> 00:07:15.240
side of this map though. Okay, hit me. We keep

00:07:15.240 --> 00:07:17.160
talking about the robot tracking its own motion,

00:07:17.720 --> 00:07:21.399
but what if a robot vacuum's wheels slip on a

00:07:21.399 --> 00:07:23.680
slick hardwood floor? Oh, happens all the time.

00:07:23.959 --> 00:07:26.199
Right, so it sends a command to its motors to

00:07:26.199 --> 00:07:28.879
move forward two feet. The motors spin perfectly,

00:07:28.879 --> 00:07:31.339
but the robot actually just spins its wheels

00:07:31.339 --> 00:07:33.819
in a puddle of spilled coffee. Right. The internal

00:07:33.819 --> 00:07:37.000
computer thinks, great, I'm two feet ahead. But

00:07:37.000 --> 00:07:39.639
physically? It hasn't moved an inch. Doesn't

00:07:39.639 --> 00:07:41.959
that completely ruin the transition function

00:07:41.959 --> 00:07:44.639
and the location posteriors? It's a great question.

00:07:44.800 --> 00:07:47.740
This is a classic problem in kinematics modeling

00:07:47.740 --> 00:07:51.199
because movement data is inherently unreliable.

00:07:51.639 --> 00:07:54.360
The noise from independent angular and linear

00:07:54.360 --> 00:07:58.399
movements is messy, non-Gaussian noise. So the

00:07:58.399 --> 00:08:00.579
system doesn't trust the wheels blindly. Exactly.

00:08:01.040 --> 00:08:02.879
Reading the wheel movement is a process called

00:08:02.879 --> 00:08:06.149
odometry. And the trick to modern SLAM is that

00:08:06.149 --> 00:08:08.829
it refuses to treat odometry as the absolute

00:08:08.829 --> 00:08:11.189
truth. Oh, okay. It treats the physical spinning

00:08:11.189 --> 00:08:13.750
of the wheels as just one more sensory input

00:08:13.750 --> 00:08:16.509
to be heavily scrutinized. It weighs the wheel

00:08:16.509 --> 00:08:18.870
data against the camera data and the laser data.

00:08:19.050 --> 00:08:20.750
So if the wheels say, hey, we definitely moved

00:08:20.750 --> 00:08:22.649
two feet, but the camera says the doorway is

00:08:22.649 --> 00:08:25.009
still exactly where it was, the algorithm relies

00:08:25.009 --> 00:08:27.370
on the visual data to realize it slipped. Exactly.

00:08:27.670 --> 00:08:30.829
It's a democracy of senses. A democracy of senses.

00:08:31.350 --> 00:08:33.669
I love that. A very heavily weighted democracy,

00:08:33.950 --> 00:08:36.409
yes. The algorithm is constantly deciding which

00:08:36.409 --> 00:08:39.330
sense is the most trustworthy in any given millisecond.

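NOTE
A minimal sketch of that weighted democracy: fuse the wheels' motion estimate
with the camera's by inverse variance, so the more trustworthy sense gets the
bigger vote (all numbers illustrative).
  # Two estimates of the same motion, each with its own uncertainty.
  odom_move, odom_var = 0.60, 0.02      # wheels claim 0.6 m, but wheels slip
  vision_move, vision_var = 0.05, 0.01  # camera says the doorway barely moved
  w_odom = (1 / odom_var) / (1 / odom_var + 1 / vision_var)
  fused = w_odom * odom_move + (1 - w_odom) * vision_move
  print(f"fused motion estimate: {fused:.3f} m")  # lands nearer the camera's story
  # A big disagreement between senses is itself a slip alarm:
  if abs(odom_move - vision_move) > 3 * (odom_var + vision_var) ** 0.5:
      print("wheel slip suspected: down-weight the odometry")
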
00:08:39.389 --> 00:08:41.649
Let's talk about those senses then. We know the

00:08:41.649 --> 00:08:44.409
math can weigh different inputs, but what exactly

00:08:44.409 --> 00:08:47.990
are these inputs? How does this machine physically

00:08:47.990 --> 00:08:52.090
perceive the space to feed this voracious mathematical

00:08:52.090 --> 00:08:55.100
engine? Well, the hardware side of SLAM involves

00:08:55.100 --> 00:08:58.080
a wildly diverse array of sensors. At the most

00:08:58.080 --> 00:09:01.360
data-heavy extreme, there is LIDAR. Light Detection

00:09:01.360 --> 00:09:04.259
and Ranging. Exactly. LIDAR physically spins

00:09:04.259 --> 00:09:06.899
and shoots out lasers, measuring exactly how

00:09:06.899 --> 00:09:08.980
long it takes for the light to bounce back. This

00:09:08.980 --> 00:09:11.919
creates massive, incredibly dense 3D point clouds

00:09:11.919 --> 00:09:13.860
of an environment. It's like a topography map

00:09:13.860 --> 00:09:16.080
of the room. Right. And with LIDAR, a system

00:09:16.080 --> 00:09:18.519
gets so much raw geometric data that sometimes

00:09:18.519 --> 00:09:20.820
it barely even needs complex SLAM inference.

00:09:21.960 --> 00:09:24.860
The point clouds are so crisp they can just be unambiguously

00:09:24.860 --> 00:09:27.019
aligned on top of each other frame by frame.

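NOTE
A minimal sketch of that frame-by-frame alignment: given the same landmarks
seen in two scans, the classic SVD (Kabsch) method recovers the rotation and
translation between them (points illustrative, correspondences assumed known).
  import numpy as np
  scan1 = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 1.0]])   # landmarks, pose 1
  theta = np.deg2rad(30)                                    # true motion: 30 deg turn
  R_true = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
  scan2 = scan1 @ R_true.T + np.array([0.5, -0.2])          # same landmarks, pose 2
  # Kabsch: subtract centroids, SVD the covariance, read off the rotation.
  c1, c2 = scan1.mean(axis=0), scan2.mean(axis=0)
  U, _, Vt = np.linalg.svd((scan1 - c1).T @ (scan2 - c2))
  R = Vt.T @ U.T
  if np.linalg.det(R) < 0:          # guard against a mirror-image solution
      Vt[-1] *= -1
      R = Vt.T @ U.T
  t = c2 - R @ c1
  print(np.rad2deg(np.arctan2(R[1, 0], R[0, 0])), t)        # ~30.0 and [0.5, -0.2]
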
00:09:27.399 --> 00:09:29.779
Okay, so if the blindfolded house scenario we

00:09:29.779 --> 00:09:33.899
talked about earlier is tactile SLAM, just fumbling

00:09:33.899 --> 00:09:37.100
for a light switch in a pitch black room where

00:09:37.100 --> 00:09:40.279
your spatial data is incredibly sparse, LIDAR

00:09:40.279 --> 00:09:43.000
is like flipping on a stadium floodlight. Yes,

00:09:43.120 --> 00:09:45.700
you see the exact geometry of everything instantly.

00:09:45.940 --> 00:09:49.350
But LIDAR is notoriously expensive. Right? And

00:09:49.350 --> 00:09:52.049
power-hungry and bulky. I mean, you aren't putting

00:09:52.049 --> 00:09:55.230
a spinning laser array on a smartphone. No, definitely

00:09:55.230 --> 00:09:59.070
not. Which is why visual SLAM, or vSLAM, became

00:09:59.070 --> 00:10:02.250
the industry standard for consumer tech. Visual SLAM

00:10:02.250 --> 00:10:04.750
uses normal 2D cameras, making it perfect for

00:10:04.750 --> 00:10:06.629
mobile devices. And the breakthrough that made

00:10:06.629 --> 00:10:08.950
vSLAM viable was the invention of SIFT, right?

00:10:09.110 --> 00:10:11.929
Yes. Scale-invariant feature transform, or

00:10:11.929 --> 00:10:14.610
SIFT. These SIFT algorithms allow a simple camera

00:10:14.610 --> 00:10:17.149
to extract unique local features from an image.

00:10:17.740 --> 00:10:19.620
Maybe the sharp corner of a picture frame or

00:10:19.620 --> 00:10:21.580
the pattern on a rug. Right, and recognize them

00:10:21.580 --> 00:10:23.220
regardless of whether the camera is two feet

00:10:23.220 --> 00:10:25.620
away or 20 feet away, in bright light or in shadow.

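NOTE
A minimal sketch of extracting and matching those local features with
OpenCV's SIFT implementation (assumes opencv-python 4.4 or newer; the image
paths are placeholders):
  import cv2
  img1 = cv2.imread("view_a.png", cv2.IMREAD_GRAYSCALE)  # two views of one scene
  img2 = cv2.imread("view_b.png", cv2.IMREAD_GRAYSCALE)
  sift = cv2.SIFT_create()
  # Each keypoint gets a 128-number descriptor that identifies the feature.
  kp1, des1 = sift.detectAndCompute(img1, None)
  kp2, des2 = sift.detectAndCompute(img2, None)
  # Match descriptors across views; Lowe's ratio test drops ambiguous matches.
  matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
  good = [m for m, n in matches if m.distance < 0.75 * n.distance]
  print(f"{len(good)} features recognized in both views")
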
00:10:26.139 --> 00:10:28.419
It basically turns raw pixels into mathematical

00:10:28.419 --> 00:10:31.320
fingerprints. Mathematical fingerprints. That

00:10:31.320 --> 00:10:34.620
is so cool. And beyond cameras and lasers, there are

00:10:34.620 --> 00:10:37.159
incredibly niche ways to map a space too. Oh,

00:10:37.159 --> 00:10:40.019
there are some wild ones. Like Wi-Fi SLAM, which

00:10:40.019 --> 00:10:42.220
maps the layout of a building based purely on

00:10:42.220 --> 00:10:44.980
the fluctuating signal strengths of nearby

00:10:44.980 --> 00:10:47.899
Wi-Fi routers. Yeah, that one is fascinating. And

00:10:47.899 --> 00:10:50.059
my absolute favorite from the source material.

00:10:51.240 --> 00:10:54.120
There's a kind of SLAM designed for human pedestrians

00:10:54.120 --> 00:10:56.940
that uses an inertial measurement unit. Which

00:10:56.940 --> 00:10:59.120
is basically an advanced motion sensor. Right.

00:10:59.320 --> 00:11:01.700
Mounted directly on a shoe. It relies on the

00:11:01.700 --> 00:11:04.340
biological fact that humans naturally avoid walking

00:11:04.340 --> 00:11:07.100
into walls. Hopefully. Usually, right? So it

00:11:07.100 --> 00:11:10.120
uses our organic pathfinding to automatically

00:11:10.120 --> 00:11:12.799
build accurate floor plans of indoor buildings

00:11:12.799 --> 00:11:15.620
just by tracking our footsteps. It's brilliantly

00:11:15.620 --> 00:11:17.679
simple. But, you know, when we look at robots

00:11:17.679 --> 00:11:19.779
that have to share space with those humans, we

00:11:19.779 --> 00:11:22.259
see systems like audiovisual SLAM. Okay, combining

00:11:22.259 --> 00:11:26.000
audio and video. Exactly. This fuses camera data

00:11:26.000 --> 00:11:28.860
with acoustic data. It uses microphone arrays

00:11:28.860 --> 00:11:31.360
to estimate the direction of arrival, or DOA,

00:11:31.519 --> 00:11:33.980
of a sound source. Like human speech. Right,

00:11:34.039 --> 00:11:35.899
so the robot can interact with us naturally.

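NOTE
A minimal sketch of direction-of-arrival estimation with two microphones: the
sound reaches one mic slightly later, cross-correlation finds that lag, and
geometry turns the lag into an angle (synthetic signal, illustrative values).
  import numpy as np
  fs, c, mic_gap = 16000, 343.0, 0.2        # sample rate, speed of sound, mic spacing
  true_angle = np.deg2rad(25)               # source sits 25 degrees off center
  delay_n = int(round(mic_gap * np.sin(true_angle) / c * fs))
  src = np.random.default_rng(0).standard_normal(4096)  # stand-in for speech
  left, right = src, np.roll(src, delay_n)  # same sound, a few samples apart
  corr = np.correlate(right, left, mode="full")
  lag = np.argmax(corr) - (len(left) - 1)   # best-aligning offset in samples
  doa = np.rad2deg(np.arcsin(np.clip(lag / fs * c / mic_gap, -1, 1)))
  print(f"estimated direction of arrival: {doa:.1f} degrees")  # ~25.0
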
00:11:36.200 --> 00:11:38.740
Why do you need to fuse them, though? I mean,

00:11:39.200 --> 00:11:41.299
if the robot has a camera, why can't it just

00:11:41.299 --> 00:11:43.980
look at the human who is talking? Well, the core

00:11:43.980 --> 00:11:47.000
rule of sensor fusion is statistical independence.

00:11:47.610 --> 00:11:50.330
OK, explain that. You purposefully want to combine

00:11:50.330 --> 00:11:52.809
sensors that fail in completely different ways

00:11:52.809 --> 00:11:55.289
so they can compensate for each other's weaknesses.

00:11:55.590 --> 00:11:58.210
Oh, I see. Like, lightweight monocular cameras

00:11:58.210 --> 00:12:01.250
have a very narrow field of view. They get occluded

00:12:01.250 --> 00:12:03.090
if someone steps in front of them, and they fail

00:12:03.090 --> 00:12:04.909
completely if the lights go out. Which makes

00:12:04.909 --> 00:12:08.409
sense. But audio has a 360-degree field of view

00:12:08.409 --> 00:12:11.590
and works perfectly in pitch black. But audio

00:12:11.590 --> 00:12:13.610
has its own flaws too, right? Oh, absolutely.

00:12:13.789 --> 00:12:16.350
It is highly susceptible to reverberation and

00:12:16.350 --> 00:12:19.759
background noise. But by fusing them, the visual

00:12:19.759 --> 00:12:22.379
data compensates for the audio echoes and the

00:12:22.379 --> 00:12:24.440
audio data compensates for the camera's blind

00:12:24.440 --> 00:12:26.600
spots. It's like having eyes in the back of your

00:12:26.600 --> 00:12:28.720
head that can hear. That is a great way to put

00:12:28.720 --> 00:12:31.080
it. So we've solved how a robot figures out it's

00:12:31.080 --> 00:12:34.519
in a static room using lasers and cameras. But

00:12:34.519 --> 00:12:36.840
if I walk through that room, haven't I just ruined

00:12:36.840 --> 00:12:39.620
its geometric map? Yeah, the real world refuses

00:12:39.620 --> 00:12:42.120
to stand still. Right. People are walking around,

00:12:42.299 --> 00:12:44.899
dogs are running, chairs get dragged across the

00:12:44.899 --> 00:12:48.940
floor. How does the math handle a world that

00:12:48.940 --> 00:12:51.519
is constantly shifting? Like, if a robot is mapping

00:12:51.519 --> 00:12:54.159
a hallway and I walk right past it, why doesn't

00:12:54.159 --> 00:12:56.700
the robot just assume I'm a very weird shifting

00:12:56.700 --> 00:12:59.860
wall? Well, the system has to separate the static

00:12:59.860 --> 00:13:02.899
geometry from the dynamic chaos. And this is

00:13:02.899 --> 00:13:06.620
handled by a model called SLAM with DATMO. DATMO.

00:13:07.240 --> 00:13:09.320
Detection and Tracking of Moving Objects. Yes.

00:13:09.480 --> 00:13:11.860
Essentially, the system is programmed to track

00:13:11.860 --> 00:13:14.519
non-static entities, like a pedestrian or a

00:13:14.519 --> 00:13:17.340
moving car, in a very similar way to how it tracks

00:13:17.340 --> 00:13:20.639
its own location. It identifies that an object

00:13:20.639 --> 00:13:23.750
is an anomaly against a static background, classifies

00:13:23.750 --> 00:13:26.149
it as a moving entity, and updates its changing

00:13:26.149 --> 00:13:28.769
coordinates in real time. So it basically maintains

00:13:28.769 --> 00:13:31.429
two separate ledgers. One strict ledger for the

00:13:31.429 --> 00:13:34.169
permanent walls and infrastructure, and a secondary

00:13:34.169 --> 00:13:36.110
fluid ledger for the things moving around between

00:13:36.110 --> 00:13:37.850
them. Exactly. It doesn't permanently bake my

00:13:37.850 --> 00:13:39.429
shape into the map of the hallway. Thankfully,

00:13:39.710 --> 00:13:42.110
no. But there's another dynamic problem we have

00:13:42.110 --> 00:13:43.970
to talk about, and it feels like the ultimate

00:13:43.970 --> 00:13:47.950
test of a SLAM system. Loop closure. Oh, loop

00:13:47.950 --> 00:13:50.610
closure is arguably the most critical

00:13:50.610 --> 00:13:52.769
make-or-break moment in a SLAM algorithm's journey.

00:13:53.049 --> 00:13:55.309
Because it addresses the problem of recognizing

00:13:55.309 --> 00:13:59.049
a previously visited location, right? Yes. Let's

00:13:59.049 --> 00:14:02.070
imagine a robot is tasked with mapping a giant

00:14:02.070 --> 00:14:05.669
circular office building. By the time it navigates

00:14:05.669 --> 00:14:07.909
all the way around the circle and arrives exactly

00:14:07.909 --> 00:14:11.070
back where it started, all those tiny non-Gaussian

00:14:11.070 --> 00:14:14.129
kinematics errors, the slight wheel slips, the

00:14:14.129 --> 00:14:16.409
minor sensor glitches, they've all accumulated.

00:14:16.529 --> 00:14:18.190
The compounding errors we talked about. Right.

00:14:18.370 --> 00:14:21.090
So the robot's internal math might insist it

00:14:21.090 --> 00:14:23.590
is 10 feet away from its starting point, even

00:14:23.590 --> 00:14:25.250
though it's physically standing right on top

00:14:25.250 --> 00:14:28.350
of it. And because of that accumulated error,

00:14:28.789 --> 00:14:31.129
the algorithm assigns a very low probability

00:14:31.129 --> 00:14:33.009
to the idea that it's back at the beginning.

00:14:33.450 --> 00:14:35.610
So how does it snap out of that delusion? How

00:14:35.610 --> 00:14:38.090
does it realize, wait, I've definitely been here

00:14:38.090 --> 00:14:41.110
before. It applies a secondary algorithm to compute

00:14:41.110 --> 00:14:44.409
a similarity measure. It might use a bag of words

00:14:44.409 --> 00:14:46.649
approach. Bag of words. Is that related to the

00:14:46.649 --> 00:14:49.929
SIFT fingerprints? Yes. It's a collection of

00:14:49.929 --> 00:14:52.750
those SIFT visual fingerprints we mentioned earlier.

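NOTE
A minimal sketch of that similarity measure: each stored place is a histogram
over a small visual-word vocabulary (quantized descriptors), and cosine
similarity flags a likely revisit (histograms illustrative).
  import numpy as np
  def cosine(a, b):
      return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
  stored = {  # one histogram per remembered place
      "hallway, minute 1": np.array([9, 0, 3, 1, 0, 2], float),
      "kitchen, minute 3": np.array([0, 7, 0, 4, 5, 0], float),
  }
  current_view = np.array([8, 1, 3, 0, 0, 2], float)  # what the camera sees now
  for place, hist in stored.items():
      score = cosine(current_view, hist)
      tag = "  <-- loop closure candidate" if score > 0.9 else ""
      print(f"{place}: similarity {score:.2f}{tag}")
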
00:14:53.210 --> 00:14:55.710
It constantly stores these visual fingerprints

00:14:55.710 --> 00:14:58.750
of every place it has been. When it detects a

00:14:58.750 --> 00:15:01.210
high probability match between the visual geometry

00:15:01.210 --> 00:15:04.090
it sees right now and a stored fingerprint from

00:15:04.090 --> 00:15:07.830
an hour ago, it forcefully overrides the accumulated

00:15:07.830 --> 00:15:10.049
movement errors. It just tosses the bad math

00:15:10.049 --> 00:15:13.009
out. Exactly. It resets its location priors,

00:15:13.169 --> 00:15:15.529
pulling the two disparate ends of its internal

00:15:15.529 --> 00:15:18.500
map together and closing the loop. It's the robot

00:15:18.500 --> 00:15:22.080
equivalent of a massive aha moment. That's exactly

00:15:22.080 --> 00:15:24.059
what it is. Like when you're lost hiking in the

00:15:24.059 --> 00:15:25.940
woods, you know, and you're starting to panic

00:15:25.940 --> 00:15:28.100
because nothing looks familiar. And then suddenly

00:15:28.100 --> 00:15:30.779
you see that weirdly shaped, lightning-struck

00:15:30.779 --> 00:15:33.620
tree you passed two hours ago. Yes. Instantly,

00:15:33.840 --> 00:15:36.960
your entire mental map rotates and snaps perfectly

00:15:36.960 --> 00:15:39.899
into place. The panic vanishes. You know exactly

00:15:39.899 --> 00:15:42.190
where you are. If we connect this to the bigger

00:15:42.190 --> 00:15:44.990
picture, researchers developing SLAM literally

00:15:44.990 --> 00:15:47.330
looked to the biological brain for inspiration

00:15:47.330 --> 00:15:50.009
on how to solve these exact navigational problems.

00:15:50.090 --> 00:15:53.570
Wait, really? Yeah. In neuroscience, the hippocampus

00:15:53.570 --> 00:15:57.029
appears to perform incredibly similar computations.

00:15:57.570 --> 00:16:00.029
Mammals have specialized neurons called place

00:16:00.029 --> 00:16:02.809
cells. Place cells? And they fire only when we

00:16:02.809 --> 00:16:04.929
enter a specific location in our environment.

00:16:05.340 --> 00:16:08.500
This biological reality actually formed the basis

00:16:08.500 --> 00:16:12.580
for bio-inspired SLAM systems, like RatSLAM.

00:16:12.820 --> 00:16:16.440
RatSLAM? Like the rodent? Exactly. Which mathematically

00:16:16.440 --> 00:16:19.379
models robotic navigation directly on the neural

00:16:19.379 --> 00:16:22.399
pathways of rodents. That is wild. We are modeling

00:16:22.399 --> 00:16:25.299
advanced robotics on rat brains. We are. But

00:16:25.299 --> 00:16:26.799
I guess that makes total sense when you watch

00:16:26.799 --> 00:16:29.299
how effortlessly a rat can navigate a complex

00:16:29.299 --> 00:16:31.779
changing maze without needing a spinning LIDAR

00:16:31.779 --> 00:16:33.700
laser on its head. Right, they are incredibly

00:16:33.700 --> 00:16:35.980
efficient navigators. There's also this concept

00:16:35.980 --> 00:16:38.039
in the source text of active SLAM, which feels

00:16:38.039 --> 00:16:40.740
very biological to me too. Oh, active SLAM pushes

00:16:40.740 --> 00:16:43.019
the autonomy even further because traditional

00:16:43.019 --> 00:16:45.720
SLAM is somewhat passive. Like a human drives

00:16:45.720 --> 00:16:48.289
the robot around. And the robot maps the environment

00:16:48.289 --> 00:16:50.850
as it is carried through it. Right. But active SLAM

00:16:50.850 --> 00:16:53.370
studies the combined problem of the robot deciding

00:16:53.370 --> 00:16:55.669
where to move next in order to build the map

00:16:55.669 --> 00:16:58.169
as efficiently as possible. So it's making its

00:16:58.169 --> 00:17:01.610
own choices. Yes. It generally does this by calculating

00:17:01.610 --> 00:17:03.789
the entropy, which is the mathematical measure

00:17:03.789 --> 00:17:06.549
of uncertainty in its current map. It actively

00:17:06.549 --> 00:17:08.990
chooses the path that will provide the most new

00:17:08.990 --> 00:17:11.769
information to reduce that uncertainty. It's

00:17:11.769 --> 00:17:14.779
literally exploring. Yeah, sniffing out the unknown

00:17:14.779 --> 00:17:17.079
edges of the map just like a biological creature

00:17:17.079 --> 00:17:19.779
would. So we've gone from dense Bayesian math

00:17:19.779 --> 00:17:23.299
to factor graphs to rat brains and shoe sensors.

00:17:23.299 --> 00:17:26.160
But let's bring this all the way back down to

00:17:26.160 --> 00:17:28.700
earth for you listening. Where did this technology

00:17:28.700 --> 00:17:30.940
actually originate and where is it hiding in

00:17:30.940 --> 00:17:33.470
our everyday lives right now? Well, the foundational

00:17:33.470 --> 00:17:36.690
concepts trace back to a 1986 paper by researchers

00:17:36.690 --> 00:17:39.410
Smith and Cheeseman. They were studying the representation

00:17:39.410 --> 00:17:42.329
and estimation of spatial uncertainty. 1986,

00:17:42.549 --> 00:17:44.890
so a while ago. Yeah, but the field really solidified

00:17:44.890 --> 00:17:47.650
in the early 1990s. The research group led by

00:17:47.650 --> 00:17:49.549
Hugh Durrant-Whyte pioneered the core mechanics,

00:17:49.750 --> 00:17:52.410
and he actually coined the acronym SLAM in a

00:17:52.410 --> 00:17:55.890
1995 paper. They proved that absolute mathematical

00:17:55.890 --> 00:17:58.130
solutions to this problem exist in what's called

00:17:58.130 --> 00:18:01.049
the infinite data limit. And just knowing that

00:18:01.049 --> 00:18:03.789
a solution was theoretically possible motivated

00:18:03.789 --> 00:18:06.589
the entire robotics industry to start searching

00:18:06.589 --> 00:18:09.170
for computationally tractable approximations

00:18:09.170 --> 00:18:12.029
that could actually run on real hardware. And

00:18:12.029 --> 00:18:14.650
that search culminated in a massive turning point

00:18:14.650 --> 00:18:17.509
in the 2000s, right? Yeah. The DARPA Grand Challenges.

00:18:17.650 --> 00:18:20.289
Oh, yeah. DARPA, the US military's research and

00:18:20.289 --> 00:18:22.910
development agency, held these highly publicized

00:18:22.910 --> 00:18:25.150
autonomous driving competitions in the Mojave

00:18:25.150 --> 00:18:27.289
Desert. I remember seeing videos of those early

00:18:27.289 --> 00:18:30.329
ones. Cars just driving into ditches. Yeah, because

00:18:30.329 --> 00:18:32.690
previous attempts at autonomous vehicles often

00:18:32.690 --> 00:18:34.809
tried to be deterministic. They wanted absolute

00:18:34.809 --> 00:18:36.730
truth. And when their sensors hit an anomaly,

00:18:36.829 --> 00:18:40.269
they just froze or crashed. Enter SLAM. Exactly.

00:18:41.000 --> 00:18:43.960
A researcher named Sebastian Thrun led the development

00:18:43.960 --> 00:18:46.460
of self-driving cars named Stanley and Junior.

00:18:47.259 --> 00:18:50.559
By utilizing advanced SLAM systems, these cars

00:18:50.559 --> 00:18:53.359
embraced probabilistic guessing. They weren't

00:18:53.359 --> 00:18:56.019
afraid to guess. Right. They used the math we've

00:18:56.019 --> 00:18:57.839
been discussing to navigate the unpredictable

00:18:57.839 --> 00:19:00.839
terrain. And Stanley actually won the 2005 Grand

00:19:00.839 --> 00:19:04.440
Challenge, which thrust SLAM into worldwide attention.

00:19:04.700 --> 00:19:07.140
The trajectory of this technology is staggering

00:19:07.140 --> 00:19:10.039
to me. The exact same computational breakthrough

00:19:10.039 --> 00:19:12.619
that won a multi-million-dollar military DARPA

00:19:12.619 --> 00:19:15.720
challenge navigating a robotic car across the

00:19:15.720 --> 00:19:18.700
harsh Mojave Desert is now quite literally sucking

00:19:18.700 --> 00:19:20.980
up the dust bunnies under my living room couch.

00:19:21.519 --> 00:19:24.819
It's true. Consumer robot vacuums use this exact

00:19:24.819 --> 00:19:27.539
tech to map our living rooms. It's amazing. And

00:19:27.539 --> 00:19:30.579
its application goes far beyond vacuums. If you

00:19:30.579 --> 00:19:32.759
use augmented reality on your phone, you are

00:19:32.759 --> 00:19:36.059
utilizing SLAM. Google's ARCore platform, which

00:19:36.059 --> 00:19:38.680
replaced an earlier project called Tango, uses

00:19:38.680 --> 00:19:41.680
a technique called Maximum A Posteriori Estimation

00:19:41.680 --> 00:19:44.619
or MAP. MAP estimation. What does that do? Well,

00:19:44.680 --> 00:19:46.619
MAP estimation doesn't just make a random guess

00:19:46.619 --> 00:19:48.460
about where your phone is. It calculates the

00:19:48.460 --> 00:19:50.519
single most mathematically defensible reality

00:19:50.519 --> 00:19:52.680
based on your camera feed and motion sensors.

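NOTE
A minimal sketch of a MAP estimate: score every candidate pose by motion
prior times camera likelihood and keep the single best one (a 1-D toy with
illustrative numbers).
  import numpy as np
  poses = np.linspace(0.0, 4.0, 81)                 # candidate phone positions
  def gaussian(x, mu, sigma):
      return np.exp(-0.5 * ((x - mu) / sigma) ** 2)
  prior = gaussian(poses, mu=2.0, sigma=0.5)        # motion sensors say ~2.0 m
  likelihood = gaussian(poses, mu=2.3, sigma=0.2)   # camera view fits best at ~2.3 m
  posterior = prior * likelihood
  print(f"MAP pose estimate: {poses[np.argmax(posterior)]:.2f} m")  # between the two, nearer 2.3
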
00:19:53.240 --> 00:19:55.519
It jointly estimates the pose of your phone and

00:19:55.519 --> 00:19:57.519
the position of the digital landmarks in your

00:19:57.519 --> 00:19:59.500
room. That's why the Pokemon stays on the sidewalk

00:19:59.500 --> 00:20:01.579
when you move the camera. Exactly. And if you

00:20:01.579 --> 00:20:04.299
put on a virtual reality headset like the Meta Quest

00:20:04.299 --> 00:20:08.140
2, or the Pico 4, it relies on markerless

00:20:08.140 --> 00:20:10.200
inside-out tracking. Which is? That is pure visual

00:20:10.200 --> 00:20:13.380
SLAM happening in real time, instantly tracking

00:20:13.380 --> 00:20:15.839
your head movements so you don't experience motion

00:20:15.839 --> 00:20:18.500
sickness in VR. So it is truly the invisible

00:20:18.500 --> 00:20:22.140
scaffolding of modern tech. But circling back

00:20:22.140 --> 00:20:24.039
to the self-driving cars for a second, we talked

00:20:24.039 --> 00:20:26.559
about Stanley winning the DARPA challenge by

00:20:26.559 --> 00:20:28.980
mapping the desert on the fly. Right. Fast forward

00:20:28.980 --> 00:20:31.759
to today, we have autonomous vehicles testing

00:20:31.759 --> 00:20:34.599
on city streets everywhere. Are they just running

00:20:34.599 --> 00:20:37.519
incredibly beefed-up versions of the same raw

00:20:37.519 --> 00:20:40.339
SLAM algorithm to navigate downtown traffic?

00:20:40.519 --> 00:20:43.160
So this raises an incredibly fascinating modern

00:20:43.160 --> 00:20:46.019
caveat. Today, most commercial self-driving

00:20:46.019 --> 00:20:49.690
cars actually cheat at SLAM. Wait, cheat? How

00:20:49.690 --> 00:20:52.269
do you cheat at a math problem? By heavily simplifying

00:20:52.269 --> 00:20:54.349
the mapping side of the equation to almost nothing.

00:20:54.769 --> 00:20:57.150
Really? Yeah. Modern autonomous companies use

00:20:57.150 --> 00:21:00.069
pre-collected, highly detailed map data. Think

00:21:00.069 --> 00:21:02.490
of Google Street View on steroids. These

00:21:02.490 --> 00:21:05.029
pre-recorded maps are annotated down to the granular

00:21:05.029 --> 00:21:07.809
level of marking individual white line segments

00:21:07.809 --> 00:21:11.430
on the asphalt and even the exact height of specific

00:21:11.430 --> 00:21:13.589
concrete curbs. Oh, wow. So they already know

00:21:13.589 --> 00:21:15.670
the whole map before they even turn on. Exactly.

00:21:15.900 --> 00:21:18.720
By already having a perfect high definition map

00:21:18.720 --> 00:21:21.420
of the city loaded into memory, they turn the

00:21:21.420 --> 00:21:24.099
incredibly complex simultaneous localization

00:21:24.099 --> 00:21:27.160
and mapping problem into a much, much simpler

00:21:27.160 --> 00:21:30.559
localization-only task. The car just uses its

00:21:30.559 --> 00:21:32.500
active sensors to figure out where it is on the

00:21:32.500 --> 00:21:35.339
pre-existing map and tracks moving objects like

00:21:35.339 --> 00:21:38.140
other cars and pedestrians at runtime. So they

00:21:38.140 --> 00:21:40.640
aren't actively mapping the geometry of the world

00:21:40.640 --> 00:21:43.490
as they drive. They're just double-checking

00:21:43.490 --> 00:21:45.710
their sensor homework against an answer key that

00:21:45.710 --> 00:21:48.049
was already provided to them. That's the reality

00:21:48.049 --> 00:21:50.569
of commercial autonomy today. And in outdoor

00:21:50.569 --> 00:21:53.250
applications, highly precise differential GPS

00:21:53.250 --> 00:21:56.210
sensors can also almost entirely remove the need

00:21:56.210 --> 00:21:58.650
to map the environment. Because the GPS just

00:21:58.650 --> 00:22:01.539
tells it where it is. Right. The GPS data provides

00:22:01.539 --> 00:22:04.519
such sharp location likelihoods that it absolutely

00:22:04.519 --> 00:22:06.819
dominates the Bayesian inference calculations.

00:22:07.200 --> 00:22:08.779
But there has to be a catch to that. Oh, there

00:22:08.779 --> 00:22:11.140
is a dangerous catch to relying on that. GPS

00:22:11.140 --> 00:22:14.079
can fail. It can lose signal in urban canyons

00:22:14.079 --> 00:22:16.799
or it can be deliberately jammed or taken offline

00:22:16.799 --> 00:22:19.779
entirely during times of military conflict. Right.

00:22:19.940 --> 00:22:22.500
So for true resilient autonomy, especially in

00:22:22.500 --> 00:22:24.779
defense, search and rescue, or emergency robotics,

00:22:24.779 --> 00:22:27.440
relying purely on GPS or perfectly pre-recorded

00:22:27.440 --> 00:22:31.400
maps isn't enough. Raw SLAM remains absolutely

00:22:31.400 --> 00:22:34.960
critical. Wow. We've covered an incredible amount

00:22:34.960 --> 00:22:37.180
of ground today. We started with the classic

00:22:37.180 --> 00:22:40.500
chicken and egg paradox of needing a map to navigate,

00:22:41.140 --> 00:22:43.480
but needing to navigate to build a map. The endless

00:22:43.480 --> 00:22:46.420
loop. Right. We explored how Bayesian probabilities

00:22:46.420 --> 00:22:49.259
and algorithms like graph SLAM solve this by

00:22:49.259 --> 00:22:51.279
treating wheel movement as just another fallible

00:22:51.279 --> 00:22:53.460
sensor, pulling the map together with mathematical

00:22:53.460 --> 00:22:56.240
rubber bands. We shined a stadium floodlight

00:22:56.240 --> 00:22:58.880
on LIDAR, talked about fusing audio and visual

00:22:58.880 --> 00:23:01.559
sensors to track moving objects, and saw how

00:23:01.559 --> 00:23:05.400
loop closure mimics the aha moments of the biological

00:23:05.400 --> 00:23:07.730
hippocampus. It's quite the journey. It really

00:23:07.730 --> 00:23:10.369
is. And finally, we tracked SLAM's journey from

00:23:10.369 --> 00:23:13.910
1980s spatial uncertainty research to military

00:23:13.910 --> 00:23:16.650
desert races all the way to the Roomba navigating

00:23:16.650 --> 00:23:19.250
your living room and the VR headset on your face.

00:23:19.410 --> 00:23:22.829
It is truly the invisible geometry holding our

00:23:22.829 --> 00:23:24.930
autonomous world together. It truly is. And you

00:23:24.930 --> 00:23:26.549
know, as we wrap up, I want to leave you with

00:23:26.549 --> 00:23:29.009
a final thought to ponder building directly on

00:23:29.009 --> 00:23:31.349
that modern caveat about self-driving cars we

00:23:31.349 --> 00:23:33.609
were just talking about. We just discussed how

00:23:33.609 --> 00:23:36.029
modern autonomous vehicles essentially cheat

00:23:36.029 --> 00:23:38.849
at navigation by using perfectly pre-recorded

00:23:38.849 --> 00:23:41.670
high definition maps of our cities. But what

00:23:41.670 --> 00:23:43.690
happens if the environment fundamentally changes

00:23:43.690 --> 00:23:47.349
overnight? Say after a massive snowstorm, a hurricane,

00:23:47.349 --> 00:23:49.829
or a major earthquake that shifts the very streets

00:23:49.829 --> 00:23:52.930
and drops buildings into the roads, if that perfectly

00:23:52.930 --> 00:23:55.500
pre-made map suddenly becomes a lie, can these

00:23:55.500 --> 00:23:58.720
vehicles seamlessly revert back to pure raw SLAM

00:23:58.720 --> 00:24:02.319
to navigate the chaos? Or are we building a vast

00:24:02.319 --> 00:24:04.859
autonomous infrastructure that is entirely dependent

00:24:04.859 --> 00:24:06.920
on a perfectly static, unchanging world?
