WEBVTT

00:00:00.000 --> 00:00:04.740
In 1966, some of the absolute smartest computer

00:00:04.740 --> 00:00:07.639
scientists on the planet thought they could teach

00:00:07.639 --> 00:00:10.140
a machine to see the world over a single summer

00:00:10.140 --> 00:00:12.679
break. Which is just wild to think about now.

00:00:12.839 --> 00:00:14.859
Right. I mean, they literally set it up as an

00:00:14.859 --> 00:00:18.300
undergraduate project at the MIT AI lab. The

00:00:18.300 --> 00:00:21.219
whole plan was to just, you know, attach a camera

00:00:21.219 --> 00:00:23.399
to a computer, hand it over to the students,

00:00:23.719 --> 00:00:26.120
and basically say, hey, write a program to make

00:00:26.120 --> 00:00:28.539
the machine describe what it sees. Yeah, they

00:00:28.539 --> 00:00:30.440
genuinely assumed it would be an easy problem

00:00:30.440 --> 00:00:32.780
to solve, just a quick summer assignment. Exactly.

00:00:33.060 --> 00:00:36.210
And now... decades later, we realized that teaching

00:00:36.210 --> 00:00:39.750
a machine to quote-unquote see is actually one

00:00:39.750 --> 00:00:41.850
of the most complex mathematical puzzles in human

00:00:41.850 --> 00:00:44.310
history. It really perfectly captures the sheer

00:00:44.310 --> 00:00:46.429
hubris of early computer science, doesn't it?

00:00:46.789 --> 00:00:48.530
I mean, back then, we just thought vision was

00:00:48.530 --> 00:00:50.530
a hardware problem. Like, you know, if we just

00:00:50.530 --> 00:00:52.429
build a better electronic eye, the computer will

00:00:52.429 --> 00:00:54.450
automatically understand the physical world.

00:00:54.890 --> 00:00:57.310
Right. But the reality is so much more complicated.

00:00:57.869 --> 00:01:00.570
So welcome to the deep dive. Today, we are taking

00:01:00.570 --> 00:01:03.130
a massive stack of research spanning decades

00:01:03.130 --> 00:01:06.510
of computer science, physics, and biology, and

00:01:06.510 --> 00:01:10.230
our mission is to unpack exactly how we actually

00:01:10.230 --> 00:01:12.750
solved this puzzle. Because it's a huge puzzle.

00:01:13.010 --> 00:01:15.569
It is. We want to know how we took machines,

00:01:15.689 --> 00:01:18.349
which, let's be honest, at their core are just

00:01:18.400 --> 00:01:21.219
rocks made of silicon, billions of microscopic

00:01:21.219 --> 00:01:23.719
switches turning light into math, and taught

00:01:23.719 --> 00:01:26.180
them to not just capture a grid of pixels, but

00:01:26.180 --> 00:01:28.040
to deeply understand what those pixels actually

00:01:28.040 --> 00:01:30.140
mean. And that distinction is crucial. Okay,

00:01:30.159 --> 00:01:32.359
let's unpack this. Because it's one thing to

00:01:32.359 --> 00:01:34.459
teach a computer to capture a static photograph.

00:01:34.819 --> 00:01:37.219
It is an entirely different universe to teach

00:01:37.219 --> 00:01:39.950
it that the pixels in that photograph represent,

00:01:39.969 --> 00:01:42.870
say, a dog catching a frisbee in three-dimensional

00:01:42.870 --> 00:01:45.049
space. Right. And what's so fascinating here

00:01:45.049 --> 00:01:47.489
is that this transition isn't just about cameras

00:01:47.489 --> 00:01:50.329
at all. It's really about the disentangling of

00:01:50.329 --> 00:01:54.049
symbolic information from raw data. Disentangling

00:01:54.049 --> 00:01:57.170
symbolic information. Exactly. And to do that,

00:01:57.430 --> 00:01:59.390
scientists have had to construct these incredibly

00:01:59.390 --> 00:02:02.689
complex models using geometry, physics, and learning

00:02:02.689 --> 00:02:05.969
theory. It's this massive interdisciplinary effort.

00:02:06.159 --> 00:02:09.180
And I'd actually ask you, the listener, to think

00:02:09.180 --> 00:02:11.639
about that as we go through this today. What

00:02:11.639 --> 00:02:13.719
stands out to you when you consider that the

00:02:13.719 --> 00:02:16.699
devices around you are just constantly, relentlessly

00:02:16.699 --> 00:02:20.520
trying to interpret your physical world? It's

00:02:20.520 --> 00:02:22.060
honestly kind of mind-blowing when you start

00:02:22.060 --> 00:02:24.719
looking at the real mechanics of it. And to really

00:02:24.719 --> 00:02:27.259
appreciate the invisible systems running on your

00:02:27.259 --> 00:02:29.819
phone or in your car right now, you have to look

00:02:29.819 --> 00:02:33.620
at why that 1966 summer project failed so spectacularly.

00:02:33.699 --> 00:02:35.740
Oh, it completely crashed and burned. Yeah. And

00:02:35.740 --> 00:02:37.449
I was thinking about this. It's like... assuming

00:02:37.449 --> 00:02:39.310
that because breathing is totally effortless

00:02:39.310 --> 00:02:41.990
for human beings, building a fully functional

00:02:41.990 --> 00:02:45.250
artificial lung from scratch is just a fun weekend

00:02:45.250 --> 00:02:47.750
DIY project for a college kid. That is a great

00:02:47.750 --> 00:02:49.949
analogy. Because seeing is so instinctive for

00:02:49.949 --> 00:02:52.289
us, we just naturally assumed it would be computationally

00:02:52.289 --> 00:02:55.389
simple for a machine. Which I think reveals a

00:02:55.389 --> 00:02:57.889
really profound truth about artificial intelligence

00:02:57.889 --> 00:03:00.310
in general. The things that are traditionally

00:03:00.310 --> 00:03:04.449
hardest for computers, like calculating the exact

00:03:04.449 --> 00:03:07.569
trajectory of a rocket to the moon are very difficult

00:03:07.569 --> 00:03:09.509
for us humans. Right. I certainly can't do that

00:03:09.509 --> 00:03:12.150
math in my head. Exactly. But the things that

00:03:12.150 --> 00:03:14.729
are easiest for us, like instantly recognizing

00:03:14.729 --> 00:03:17.789
a familiar face in a crowded room or just telling

00:03:17.789 --> 00:03:20.229
the difference between a flat shadow and a solid

00:03:20.229 --> 00:03:22.669
object. The stuff a toddler does without trying.

00:03:23.169 --> 00:03:26.689
Right. Those things require millions upon millions

00:03:26.689 --> 00:03:29.389
of data points for a machine to process. So the

00:03:29.389 --> 00:03:32.210
summer of '66 comes and goes. The undergrads,

00:03:32.490 --> 00:03:35.840
shocker, do not solve human vision. And the field

00:03:35.840 --> 00:03:38.479
actually has to slowly grind forward decade by

00:03:38.479 --> 00:03:41.500
decade. Very slowly. Yeah, by the 1970s, they

00:03:41.500 --> 00:03:43.240
weren't even trying to recognize faces anymore.

00:03:43.620 --> 00:03:45.219
They were literally just trying to write algorithms

00:03:45.219 --> 00:03:47.520
that could detect basic edges and straight lines,

00:03:47.860 --> 00:03:49.919
just looking for the boundaries of objects. Because

00:03:49.919 --> 00:03:51.919
without a boundary, an object doesn't actually

00:03:51.919 --> 00:03:54.360
exist to a computer. It's just a meaningless

00:03:54.360 --> 00:03:57.379
sea of color. Wow. And then as we move into the

00:03:57.379 --> 00:04:01.639
1990s, the focus really shifted to rigorous mathematical

00:04:01.639 --> 00:04:04.840
optimization. To get a computer to understand

00:04:04.840 --> 00:04:07.900
3D space, researchers had to start borrowing

00:04:07.900 --> 00:04:10.879
concepts from a field called photogrammetry.

00:04:11.080 --> 00:04:13.740
OK, this is a great concept to pause on, because

00:04:13.740 --> 00:04:16.699
photogrammetry sounds incredibly dense. But it's

00:04:16.699 --> 00:04:19.519
basically just the science of making 3D measurements

00:04:19.519 --> 00:04:22.060
from 2D photographs, right? Precisely. I mean,

00:04:22.060 --> 00:04:23.899
if you take a picture of a building from the

00:04:23.899 --> 00:04:25.819
front and then another picture from the side,

00:04:26.339 --> 00:04:29.399
your human brain naturally understands the 3D

00:04:29.399 --> 00:04:31.319
shape of that building. We just stitch it together

00:04:31.319 --> 00:04:33.810
in our heads. Right. But a computer doesn't.

00:04:33.910 --> 00:04:36.110
So they use this concept called bundle adjustment

00:04:36.110 --> 00:04:38.589
theory. Bundle adjustment. Yeah, it's this incredibly

00:04:38.589 --> 00:04:40.990
heavy mathematical process used to calculate

00:04:40.990 --> 00:04:43.689
the exact angle and position the camera was in

00:04:43.689 --> 00:04:46.490
for every single shot. Once the computer calculates

00:04:46.490 --> 00:04:49.470
exactly where the camera was in space, it can

00:04:49.470 --> 00:04:53.290
triangulate the light and build a sparse 3D reconstruction

00:04:53.290 --> 00:04:55.550
of the whole scene. That sounds like a staggering

00:04:55.550 --> 00:04:57.410
amount of math just to know what a building looks

00:04:57.410 --> 00:05:00.410
like. It is. But even with all that 3D math.

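NOTE
A minimal Python sketch of the triangulation step just described: once the camera pose for each shot is known (the job the conversation attributes to bundle adjustment), a 3D point can be recovered from its pixel positions in two views. The camera matrices and the point here are made-up toy values, and the joint pose refinement that real bundle adjustment performs is omitted.
import numpy as np
# Two hypothetical 3x4 projection matrices (poses assumed already solved).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                  # camera 1 at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # camera 2 shifted sideways
def triangulate(P1, P2, uv1, uv2):
    # Linear (DLT) triangulation: each pixel observation contributes two rows of A.
    u1, v1 = uv1
    u2, v2 = uv2
    A = np.vstack([u1 * P1[2] - P1[0],
                   v1 * P1[2] - P1[1],
                   u2 * P2[2] - P2[0],
                   v2 * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]                      # back from homogeneous coordinates
# A made-up 3D point, projected into both cameras and then recovered:
X_true = np.array([0.2, 0.1, 4.0, 1.0])
uv1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
uv2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate(P1, P2, uv1, uv2))         # approximately [0.2, 0.1, 4.0]
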
00:05:00.800 --> 00:05:04.480
The real modern breakthrough, like, the thing

00:05:04.480 --> 00:05:07.920
that brought the field to where it is today was

00:05:07.920 --> 00:05:12.500
the shift to deep learning and CNNs, right? Convolutional

00:05:12.500 --> 00:05:15.040
neural networks. Absolutely. CNNs changed everything.

00:05:15.079 --> 00:05:17.220
And I really want to make sure we explain how

00:05:17.220 --> 00:05:20.199
a CNN actually looks at an image, because it

00:05:20.199 --> 00:05:22.600
fundamentally changed the game. Before this,

00:05:23.180 --> 00:05:25.100
programmers were like, manually writing code,

00:05:25.160 --> 00:05:27.139
saying, OK, look for a circle, now look for two

00:05:27.139 --> 00:05:29.620
eyes. Which is just impossible to scale. You

00:05:29.620 --> 00:05:32.079
simply can't write a manual rule for every single

00:05:32.079 --> 00:05:34.879
object in the universe. Right. So a convolutional

00:05:34.879 --> 00:05:36.680
neural network does something entirely different.

00:05:36.939 --> 00:05:39.680
Imagine you're sliding a tiny mathematical magnifying

00:05:39.680 --> 00:05:42.620
glass, they call it a filter, over an image just

00:05:42.620 --> 00:05:44.579
pixel by pixel. Scanning the whole thing. Yeah.

00:05:44.720 --> 00:05:47.300
And the first layer of the network just uses

00:05:47.300 --> 00:05:49.519
that filter to look for something super basic,

00:05:49.720 --> 00:05:52.180
like a horizontal line. Then the next layer takes

00:05:52.180 --> 00:05:54.399
those lines and looks for a curve. Building up.

00:05:54.680 --> 00:05:56.800
Exactly. The next layer combines lines and curves

00:05:56.800 --> 00:05:59.959
to find a shape, like say a wheel or an ear.

00:06:00.360 --> 00:06:03.000
It literally builds visual comprehension from

00:06:03.000 --> 00:06:05.870
the ground up entirely on its own. And it works.

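NOTE
A tiny hand-rolled Python illustration of the sliding-filter idea just described: one made-up 3x3 filter swept across a toy image so it responds to a horizontal edge. A real CNN learns its filter values from data and stacks many such layers, which is not shown here.
import numpy as np
def convolve2d(image, kernel):
    # Slide the small filter across the image and record its response at each position.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out
# A toy 6x6 image: dark on top, bright on the bottom (one horizontal boundary).
image = np.zeros((6, 6))
image[3:, :] = 1.0
# A filter that responds strongly to exactly that kind of horizontal edge.
horizontal_edge = np.array([[-1, -1, -1],
                            [ 0,  0,  0],
                            [ 1,  1,  1]])
print(convolve2d(image, horizontal_edge))    # large values along the dark-to-bright rows
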
00:06:06.149 --> 00:06:08.250
I mean, when you look at massive benchmarks like

00:06:08.250 --> 00:06:11.470
the ImageNet challenge, which tests AI on millions

00:06:11.470 --> 00:06:13.970
of images across a thousand different object

00:06:13.970 --> 00:06:18.329
classes, deep learning models now perform basically

00:06:18.329 --> 00:06:20.449
close to the level of humans. Close to humans,

00:06:20.649 --> 00:06:24.310
yes. But they still make some truly bizarre mistakes.

00:06:24.329 --> 00:06:26.870
Oh, totally. Like... These advanced algorithms

00:06:26.870 --> 00:06:30.449
can easily categorize a highly specific, fine

00:06:30.449 --> 00:06:33.889
-grained breed of dog or a rare species of bird,

00:06:33.889 --> 00:06:36.610
something most normal humans would completely

00:06:36.610 --> 00:06:39.029
fail at. Right, it's superhuman at trivia. Exactly.

00:06:39.500 --> 00:06:42.759
But they still struggle immensely with tiny details

00:06:42.759 --> 00:06:44.480
that you or I would process without a second

00:06:44.480 --> 00:06:47.300
thought. For instance, a small ant resting on

00:06:47.300 --> 00:06:49.899
a flower stem. Oh, interesting. Yeah, to a CNN,

00:06:50.019 --> 00:06:52.139
the ant might just look like noise or a glitch

00:06:52.139 --> 00:06:54.240
in the stem's texture. It misses the context

00:06:54.240 --> 00:06:56.899
entirely. Or think about images distorted by

00:06:56.899 --> 00:06:58.540
camera filters. Like, if someone puts a weird

00:06:58.540 --> 00:07:01.319
sepia tone or a heavy glitch effect over a photo

00:07:01.319 --> 00:07:03.759
on social media, a human looks past the filter

00:07:03.759 --> 00:07:05.660
instantly. We just know, oh, it's a person with

00:07:05.660 --> 00:07:08.060
a weird filter. Right. But that same filter...

00:07:08.089 --> 00:07:10.670
can completely trip up the neural network because

00:07:10.670 --> 00:07:13.370
the foundational pixel colors have drastically

00:07:13.370 --> 00:07:16.149
changed. And the reason for this limitation goes

00:07:16.149 --> 00:07:18.410
right back to the fact that pure mathematics

00:07:18.410 --> 00:07:21.110
just wasn't enough to solve the problem of true

00:07:21.110 --> 00:07:23.870
visual understanding. Math only gets you so far.

00:07:24.350 --> 00:07:27.089
Exactly. Researchers eventually realized they

00:07:27.089 --> 00:07:29.569
had to bridge the gap between silicon hardware

00:07:29.569 --> 00:07:32.769
and biological wetware. To really teach a machine

00:07:32.769 --> 00:07:35.870
to see, they had to start borrowing heavily from

00:07:35.870 --> 00:07:38.910
nature, which means crossing over into entirely

00:07:38.910 --> 00:07:41.449
different scientific disciplines. I mean, just

00:07:41.449 --> 00:07:43.850
to design the physical image sensors that capture

00:07:43.850 --> 00:07:46.089
the light in your smartphone camera, we have

00:07:46.089 --> 00:07:48.810
to rely on solid state physics and quantum mechanics.

00:07:49.009 --> 00:07:50.810
It's incredible when you think about the layers

00:07:50.810 --> 00:07:53.250
involved. It really is. Physics explains how

00:07:53.250 --> 00:07:55.310
the light bounces off a coffee cup on your desk.

00:07:55.470 --> 00:07:57.610
And then quantum mechanics is required to fully

00:07:57.610 --> 00:08:00.470
understand how those individual photons hit the

00:08:00.470 --> 00:08:02.709
digital sensor and convert into an electrical

00:08:02.709 --> 00:08:05.069
charge. But the biological inspiration is where

00:08:05.069 --> 00:08:07.810
the real leap in computational processing happens.

00:08:08.350 --> 00:08:10.910
Neurobiology has massively influenced computer

00:08:10.910 --> 00:08:13.310
vision over the decades. Because we had to study

00:08:13.310 --> 00:08:16.029
ourselves first. Right. Over the last century,

00:08:16.089 --> 00:08:18.110
scientists have extensively studied mammalian

00:08:18.110 --> 00:08:20.990
eyes, neurons, and how the physical brain processes

00:08:20.990 --> 00:08:24.370
visual stimuli. And that research led to an early

00:08:24.370 --> 00:08:27.550
architecture in the 1970s called the Neocognitron.

00:08:27.689 --> 00:08:29.990
Right, developed by Kunihiko Fukushima. This

00:08:29.990 --> 00:08:32.570
was an artificial neural network specifically

00:08:32.570 --> 00:08:35.450
designed to mimic the primary visual cortex of

00:08:35.450 --> 00:08:38.250
the actual brain. Yes. And to understand how

00:08:38.250 --> 00:08:41.529
these biologically inspired networks learn, and

00:08:41.529 --> 00:08:44.830
frankly how they fail, there is a brilliant sort

00:08:44.830 --> 00:08:47.429
of simplified example from the research regarding

00:08:47.429 --> 00:08:50.399
how a network learns to detect objects. Let's

00:08:50.399 --> 00:08:52.460
look at the starfish and the sea urchin. Oh,

00:08:52.500 --> 00:08:54.220
this is a classic way to explain it. It makes

00:08:54.220 --> 00:08:56.340
so much sense. So imagine you are training a

00:08:56.340 --> 00:08:58.019
neural network to tell the difference between

00:08:58.019 --> 00:09:00.940
a starfish and a sea urchin. You feed it hundreds

00:09:00.940 --> 00:09:04.100
of images of both. Over time, the network creates

00:09:04.100 --> 00:09:06.399
nodes, these little decision centers, representing

00:09:06.399 --> 00:09:09.419
visual features. Okay. For the starfish, it strongly

00:09:09.419 --> 00:09:11.899
correlates a ringed texture and a star-shaped

00:09:11.899 --> 00:09:14.240
outline. For the sea urchin, it matches a striped

00:09:14.240 --> 00:09:16.960
texture and an oval shape. Seems totally logical.

00:09:17.809 --> 00:09:20.909
Starfish equals rings and stars. Sea urchin equals

00:09:20.909 --> 00:09:23.929
stripes and ovals. It's just assigning mathematical

00:09:23.929 --> 00:09:27.149
weights to those physical features. It is. But

00:09:27.149 --> 00:09:30.429
here is the catch. What if, during that massive

00:09:30.429 --> 00:09:33.350
training process, there happens to be a single

00:09:33.350 --> 00:09:36.309
sea urchin image that has a slightly ringed texture

00:09:36.309 --> 00:09:39.190
instead of stripes? Just a weird anomaly in the

00:09:39.190 --> 00:09:42.000
training data. Exactly. The network creates a

00:09:42.000 --> 00:09:44.539
weakly weighted association. It essentially notes,

00:09:44.620 --> 00:09:47.340
OK, a ringed texture might occasionally mean

00:09:47.340 --> 00:09:50.259
sea urchin. OK, I see where this is going. Right.

00:09:50.360 --> 00:09:53.059
Now, you run the trained network on a brand new

00:09:53.059 --> 00:09:56.419
image. It correctly spots a starfish in the sand

00:09:56.419 --> 00:09:58.779
because of the strong ring and star signals.

00:09:59.340 --> 00:10:01.840
But because of that one weird training image,

00:10:02.179 --> 00:10:04.860
the ringed texture of the starfish also sends

00:10:04.860 --> 00:10:07.840
a tiny weak signal suggesting there might be

00:10:07.840 --> 00:10:09.990
a sea urchin nearby. The machine is getting a

00:10:09.990 --> 00:10:12.950
little confused. Yeah. And now imagine a completely

00:10:12.950 --> 00:10:15.629
random oval clamshell is sitting next to the

00:10:15.629 --> 00:10:17.809
starfish in the image. The network wasn't trained

00:10:17.809 --> 00:10:20.230
on clamshells at all. But the clamshell is an

00:10:20.230 --> 00:10:23.110
oval. Yes. And the oval shape of the shell triggers

00:10:23.110 --> 00:10:25.690
the sea urchin's strong oval node. So it takes

00:10:25.690 --> 00:10:28.970
the weak ring signal from the starfish and combines

00:10:28.970 --> 00:10:32.070
it with the strong oval signal from the clamshell.

00:10:32.409 --> 00:10:35.649
Exactly. And suddenly the network outputs a completely

00:10:35.649 --> 00:10:38.940
false positive. It confidently declares there is

00:10:38.940 --> 00:10:41.820
a sea urchin in the photo when it's really just

00:10:41.820 --> 00:10:44.379
looking at pieces of a starfish in a shell. Here's

00:10:44.379 --> 00:10:46.940
where it gets really interesting to me because

00:10:46.940 --> 00:10:51.100
that mechanism is so fundamentally human. How

00:10:51.100 --> 00:10:54.399
so? It completely reminds me of teaching a toddler

00:10:54.399 --> 00:10:58.139
to speak. If you only ever show a toddler brown

00:10:58.139 --> 00:11:00.879
dogs and you point and tell them dog, the very

00:11:00.879 --> 00:11:02.759
first time they see a brown cat walking down

00:11:02.759 --> 00:11:04.980
the street, they might confidently point and

00:11:04.980 --> 00:11:08.039
say dog. That's perfect. Yes. They're overweighting

00:11:08.039 --> 00:11:10.120
the wrong features, the color brown and the fact

00:11:10.120 --> 00:11:12.500
that it has four legs, and they are completely

00:11:12.500 --> 00:11:14.720
ignoring the shape of the ears or the tail. It's

00:11:14.720 --> 00:11:17.240
the exact same false positive as the neural network.

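NOTE
The starfish and sea-urchin false positive described above, reduced to a toy scoring model in Python. The feature names, weights, and threshold are invented for illustration; a trained network learns millions of weights rather than a handful, but the failure mode is the same.
# Hand-picked weights standing in for what training would have produced.
weights = {
    "starfish":   {"ringed_texture": 0.9, "star_outline": 0.9,
                   "striped_texture": 0.0, "oval_shape": 0.0},
    # The single odd training image leaves a weak ringed-texture association.
    "sea_urchin": {"ringed_texture": 0.2, "star_outline": 0.0,
                   "striped_texture": 0.9, "oval_shape": 0.9},
}
def score(features, cls):
    # Add up the weight of every feature that is present in the image.
    return sum(weights[cls][f] for f in features)
# A new image: a starfish (rings plus star outline) next to an oval clamshell.
features_in_image = {"ringed_texture", "star_outline", "oval_shape"}
THRESHOLD = 1.0                      # arbitrary confidence cutoff
for cls in weights:
    s = score(features_in_image, cls)
    print(cls, round(s, 2), "detected" if s > THRESHOLD else "not detected")
# starfish 1.8 detected    -- correct
# sea_urchin 1.1 detected  -- the false positive described above
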
00:11:17.480 --> 00:11:19.139
And if we connect this to the bigger picture,

00:11:19.259 --> 00:11:21.759
what we're really seeing here is an incredible

00:11:21.759 --> 00:11:24.340
interdisciplinary feedback loop. It really is.

00:11:24.429 --> 00:11:26.769
The study of biological vision helps us build

00:11:26.769 --> 00:11:30.409
better computer vision algorithms. And in turn,

00:11:30.950 --> 00:11:33.330
designing these complex artificial neural networks

00:11:33.330 --> 00:11:36.649
actually gives neuroscientists a working testable

00:11:36.649 --> 00:11:40.009
model to understand our own neurobiology. We

00:11:40.009 --> 00:11:42.049
are essentially reverse engineering ourselves.

00:11:42.389 --> 00:11:44.230
So we have the biological inspiration from the

00:11:44.230 --> 00:11:46.350
brain and we have a quantum physics in the camera

00:11:46.350 --> 00:11:49.730
sensors. But how does a modern system practically

00:11:49.730 --> 00:11:53.220
process an image? in the real world, like step

00:11:53.220 --> 00:11:55.419
by step. The actual mechanics of it. Yeah, let's

00:11:55.419 --> 00:11:57.240
walk through the actual pipeline. But first,

00:11:58.220 --> 00:11:59.799
I think we really need to clarify the difference

00:11:59.799 --> 00:12:02.120
between image processing and computer vision,

00:12:02.720 --> 00:12:04.580
because I hear them conflated all the time. That

00:12:04.580 --> 00:12:07.899
is a crucial distinction to make. In image processing,

00:12:08.159 --> 00:12:10.759
the input is an image, and the output is just

00:12:10.759 --> 00:12:14.000
an enhanced image. Like Instagram filters. Exactly.

00:12:14.419 --> 00:12:16.059
Think of hitting the auto-enhance button on

00:12:16.059 --> 00:12:18.820
your phone, adjusting contrast, rotating a photo,

00:12:18.960 --> 00:12:21.539
or removing background noise. The computer doesn't

00:12:21.539 --> 00:12:23.519
actually know what is in the picture. It's just

00:12:23.519 --> 00:12:26.460
blindly manipulating light values. Computer vision,

00:12:26.460 --> 00:12:29.019
on the other hand, takes an image or video as

00:12:29.019 --> 00:12:32.330
input, and the output is a decision. It's a 3D

00:12:32.330 --> 00:12:35.409
model, an analysis, or a specific system behavior

00:12:35.409 --> 00:12:38.190
based on what it actually understood. OK, so

00:12:38.190 --> 00:12:40.889
the output of computer vision is an action. So

00:12:40.889 --> 00:12:43.929
how do we get from raw light to a concrete decision?

00:12:44.210 --> 00:12:47.490
There is a very strict five-step pipeline these

00:12:47.490 --> 00:12:49.409
systems go through. To make this concrete for

00:12:49.409 --> 00:12:51.809
you listening, let's use a real -world example

00:12:51.809 --> 00:12:54.389
as we go through the steps. Let's imagine a camera

00:12:54.389 --> 00:12:57.409
on a sterile factory line inspecting microscopic

00:12:57.409 --> 00:13:00.690
wafer computer chips for tiny defects. Perfect

00:13:00.690 --> 00:13:03.470
example. So step one is image acquisition. Now

00:13:03.470 --> 00:13:05.250
this isn't just a standard webcam. Depending

00:13:05.250 --> 00:13:07.750
on the application, it could be lidar, radar,

00:13:07.950 --> 00:13:10.690
or ultrasonic sensors. But in our factory example...

00:13:10.690 --> 00:13:12.769
In the factory, it's likely a high-resolution

00:13:12.769 --> 00:13:15.529
microscopic camera capturing the raw photon data

00:13:15.529 --> 00:13:17.970
reflecting off the silicon wafer as it moves

00:13:17.970 --> 00:13:20.389
super fast down the conveyor belt. Okay, so it

00:13:20.389 --> 00:13:22.950
snaps the raw data. Then comes step two, pre

00:13:22.950 --> 00:13:25.370
-processing. Before the heavy lifting begins,

00:13:25.830 --> 00:13:27.669
the system has to clean up that data, right?

00:13:27.929 --> 00:13:31.110
Yes. It might remove the glare from the harsh

00:13:31.110 --> 00:13:34.830
factory lighting or enhance the contrast so the

00:13:34.830 --> 00:13:37.629
tiny microscopic circuits are actually detectable

00:13:37.629 --> 00:13:40.490
by the math. Which perfectly sets up step three,

00:13:40.850 --> 00:13:43.590
feature extraction. This is where those mathematical

00:13:43.590 --> 00:13:46.149
filters we talked about earlier come in. The

00:13:46.149 --> 00:13:48.529
system starts pulling out the lines, the sharp

00:13:48.529 --> 00:13:50.970
edges of the circuits, or localized interest

00:13:50.970 --> 00:13:53.870
points like corners and metallic blobs. Exactly.

00:13:54.110 --> 00:13:56.210
So does this mean the computer's basically playing

00:13:56.210 --> 00:13:58.730
a high-speed, multi-dimensional game of connect

00:13:58.730 --> 00:14:00.570
the dots before it even knows what it's looking

00:14:00.570 --> 00:14:02.870
at? That is an absolutely excellent way to visualize

00:14:02.870 --> 00:14:05.350
it. It's looking for structural primitives. It

00:14:05.350 --> 00:14:07.250
doesn't know it's a microchip yet. It just knows

00:14:07.250 --> 00:14:09.830
there are 20 parallel lines and a shiny metallic

00:14:09.830 --> 00:14:12.309
blob over here. Just shapes and lines. Right.

00:14:12.529 --> 00:14:14.850
And this raises a really important question about

00:14:14.850 --> 00:14:16.669
the fundamental nature of this pipeline, which

00:14:16.669 --> 00:14:19.269
is data reduction. Because once you have those

00:14:19.269 --> 00:14:22.389
features, you move to step four, detection and

00:14:22.389 --> 00:14:24.600
segmentation. Right, the system has to decide

00:14:24.600 --> 00:14:27.159
which specific regions of the massive image are

00:14:27.159 --> 00:14:30.259
actually relevant. It segments the image, essentially

00:14:30.259 --> 00:14:33.039
isolating the specific tiny solder joint on the

00:14:33.039 --> 00:14:35.340
chip from the completely useless background of

00:14:35.340 --> 00:14:37.159
the conveyor belt. It's separating the signal

00:14:37.159 --> 00:14:40.240
from the noise. Exactly. And finally, step five,

00:14:40.820 --> 00:14:43.919
decision -making. The true genius of this pipeline

00:14:43.919 --> 00:14:46.480
isn't just its ability to take in massive amounts

00:14:46.480 --> 00:14:49.519
of data. It's the system's ability to ruthlessly

00:14:49.519 --> 00:14:52.700
throw data away. Yes. It distills millions of

00:14:52.700 --> 00:14:55.000
pixels and all those complex geometric features

00:14:55.000 --> 00:14:58.120
down to a single, simple, application-specific

00:14:58.120 --> 00:15:01.279
choice. Is that isolated solder joint cracked?

00:15:01.679 --> 00:15:04.120
Pass or fail. It really is an incredible funnel

00:15:04.120 --> 00:15:06.600
of information. Billions of mathematical calculations

00:15:06.600 --> 00:15:09.000
just to say yes or no. And once you perfect that

00:15:09.000 --> 00:15:11.840
specific pipeline, acquiring, extracting, segmenting,

00:15:11.879 --> 00:15:14.259
and deciding, you can deploy it literally anywhere.

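NOTE
A skeleton of the five-step pipeline walked through above, in Python, using the chip-inspection example. Every stage is a deliberately crude stand-in (a random frame, a gradient-based "feature" map, a hard-coded region, a threshold decision); the point is only the shape of the funnel from raw pixels down to a single pass/fail answer.
import numpy as np
def acquire():
    # Step 1: image acquisition -- here just a fake 64x64 grayscale frame.
    return np.random.rand(64, 64)
def preprocess(frame):
    # Step 2: pre-processing -- e.g. normalize contrast, suppress glare.
    return (frame - frame.min()) / (frame.max() - frame.min() + 1e-9)
def extract_features(frame):
    # Step 3: feature extraction -- a crude edge-strength map as a stand-in.
    gy, gx = np.gradient(frame)
    return np.hypot(gx, gy)
def segment(edges):
    # Step 4: detection and segmentation -- keep only the region of interest.
    return edges[20:44, 20:44]       # pretend this window is the solder joint
def decide(region, threshold=0.5):
    # Step 5: decision -- millions of pixels reduced to one application-specific answer.
    return "fail" if region.mean() > threshold else "pass"
print(decide(segment(extract_features(preprocess(acquire())))))
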
00:15:14.480 --> 00:15:16.480
Anywhere at all. And the scale of this is staggering

00:15:16.480 --> 00:15:19.379
right now. For 2024, the leading areas of computer

00:15:19.379 --> 00:15:21.840
vision were industry, with a massive market size

00:15:21.840 --> 00:15:26.440
of $5.22 billion. Medicine was at $2.6 billion,

00:15:26.820 --> 00:15:28.960
and the military sector was just shy of a billion

00:15:28.960 --> 00:15:31.370
dollars. The real-world applications across

00:15:31.370 --> 00:15:34.049
those sectors really show how adaptable that

00:15:34.049 --> 00:15:36.679
five-step pipeline is. Take agriculture, for

00:15:36.679 --> 00:15:38.779
example. Well, the farming applications are wild.

00:15:38.919 --> 00:15:41.379
They really are. Engineers have actually developed

00:15:41.379 --> 00:15:44.000
open source vision transformer models to help

00:15:44.000 --> 00:15:46.960
farmers monitor crops in real time. OK, I want

00:15:46.960 --> 00:15:48.960
to clarify vision transformers, because they

00:15:48.960 --> 00:15:51.440
are a huge leap beyond the CNNs we talked about

00:15:51.440 --> 00:15:54.220
earlier. A massive leap. Yeah. Older models look

00:15:54.220 --> 00:15:56.799
at an image chunk by chunk, sliding that little

00:15:56.799 --> 00:15:59.379
filter around like we said. But a vision transformer

00:15:59.379 --> 00:16:02.259
looks at the entire image all at once. It's a

00:16:02.259 --> 00:16:05.200
holistic view. Right. It uses an attention mechanism

00:16:05.200 --> 00:16:07.460
to understand how a brown spot on the bottom

00:16:07.460 --> 00:16:10.740
of a leaf relates to a yellowing edge on the

00:16:10.740 --> 00:16:13.679
complete other side of the plant. And because

00:16:13.679 --> 00:16:16.620
of that holistic view, this model can automatically

00:16:16.620 --> 00:16:19.559
detect strawberry diseases from visual data with

00:16:19.559 --> 00:16:23.159
98.4% accuracy. It's revolutionary for global

00:16:23.159 --> 00:16:25.500
food security. And then in medicine, it's doing

00:16:25.500 --> 00:16:28.200
very similar pattern recognition, but it's looking

00:16:28.200 --> 00:16:30.940
inside us. Instead of a strawberry, it's a human.

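NOTE
A toy Python sketch of the vision-transformer idea mentioned a moment ago: the image is cut into patches, and an attention step lets every patch weigh its relationship to every other patch rather than only to its neighbours. The learned projections and multiple attention heads of a real vision transformer are omitted, and the image and patch size here are arbitrary.
import numpy as np
def to_patches(image, p=4):
    # Cut the image into non-overlapping p x p patches and flatten each one.
    h, w = image.shape
    patches = [image[y:y + p, x:x + p].ravel()
               for y in range(0, h, p) for x in range(0, w, p)]
    return np.stack(patches)
def self_attention(patches):
    # Every patch scores its relevance to every other patch, the scores are
    # softmaxed into weights, and each patch is re-expressed as a weighted
    # mix of all patches -- the "looks at the whole image at once" behaviour.
    scores = patches @ patches.T / np.sqrt(patches.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ patches
image = np.random.rand(16, 16)       # stand-in for, say, a leaf photo
patches = to_patches(image)          # 16 patches of 16 values each
print(patches.shape, self_attention(patches).shape)   # (16, 16) (16, 16)
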
00:16:31.230 --> 00:16:34.549
Exactly. Systems can extract information from

00:16:34.549 --> 00:16:38.690
ultrasonic or X-ray images to spot tumors, detect

00:16:38.690 --> 00:16:41.490
arteriosclerosis, or even measure blood flow.

00:16:41.950 --> 00:16:44.870
It's doing the grueling pixel-by-pixel analysis

00:16:44.870 --> 00:16:48.149
that human eyes just get too fatigued to do reliably

00:16:48.149 --> 00:16:50.529
over an eight-hour shift. And then you have

00:16:50.529 --> 00:16:54.080
autonomous vehicles and space exploration. We're

00:16:54.080 --> 00:16:56.159
talking about everything from the obstacle warning

00:16:56.159 --> 00:16:58.539
systems in consumer cars using cameras and

00:16:58.539 --> 00:17:01.919
LiDAR, all the way to NASA's Curiosity rover

00:17:01.919 --> 00:17:05.480
and the Chinese space agency's Yutu-2 rover navigating

00:17:05.480 --> 00:17:08.000
the moon and Mars. It's everywhere. So what does

00:17:08.000 --> 00:17:10.160
this all mean? When you step back, you really

00:17:10.160 --> 00:17:12.400
have to marvel at the universal language of light

00:17:12.400 --> 00:17:15.640
and geometry. The exact same foundational pipeline

00:17:15.640 --> 00:17:17.880
being used to inspect tiny strawberries in a

00:17:17.880 --> 00:17:20.500
field in California is steering a robotic rover

00:17:20.500 --> 00:17:23.160
through the radioactive dust of Mars. It is incredibly

00:17:23.160 --> 00:17:25.759
versatile, and in those really high-stakes environments

00:17:25.759 --> 00:17:28.140
like the military and autonomous driving, there

00:17:28.140 --> 00:17:30.240
is a specific concept that becomes absolutely

00:17:30.240 --> 00:17:32.759
critical, sensor fusion. Sensor fusion, meaning

00:17:32.759 --> 00:17:36.069
more than one camera. Exactly. It is rarely just

00:17:36.069 --> 00:17:39.630
one camera operating in isolation. A self-driving

00:17:39.630 --> 00:17:42.210
car might use visible light cameras to read a

00:17:42.210 --> 00:17:45.549
red stop sign, thermal imaging to spot a pedestrian

00:17:45.549 --> 00:17:48.309
walking in the dark, and radar to calculate the

00:17:48.309 --> 00:17:51.130
exact speed of an oncoming truck. That's a lot

00:17:51.130 --> 00:17:54.309
of incoming data. It is. The system automatically

00:17:54.309 --> 00:17:57.690
fuses all that disparate data to reduce the complexity

00:17:57.690 --> 00:18:00.730
of a busy intersection. It has to make life or

00:18:00.730 --> 00:18:03.630
death strategic decisions in real time based

00:18:03.630 --> 00:18:06.470
on a synthesized view of the entire world. OK,

00:18:06.529 --> 00:18:08.609
so we've seen how this technology spans from

00:18:08.609 --> 00:18:11.630
agriculture to space. But there is one application

00:18:11.630 --> 00:18:13.869
from the research that absolutely blew my mind.

00:18:14.569 --> 00:18:16.849
We are actually using computer, quote unquote,

00:18:17.269 --> 00:18:19.910
sight to replicate the human sense of touch.

00:18:20.450 --> 00:18:23.130
Yes. Tactile feedback for robotic hands. This

00:18:23.130 --> 00:18:25.049
is honestly one of the most cutting-edge and

00:18:25.049 --> 00:18:27.349
frankly brilliant engineering workarounds I've

00:18:27.349 --> 00:18:29.329
ever seen in the field. I have to explain this

00:18:29.329 --> 00:18:31.109
mechanism to you because it is so clever. To

00:18:31.109 --> 00:18:33.490
give a robotic finger a sense of touch, you would

00:18:33.490 --> 00:18:35.210
normally assume they just, you know, stick a

00:18:35.210 --> 00:18:36.670
pressure sensor on the outside of the finger.

00:18:36.829 --> 00:18:38.369
Right? That's what we tried to do for years.

00:18:38.630 --> 00:18:41.589
But those external sensors break, or they just

00:18:41.589 --> 00:18:44.109
aren't sensitive enough for delicate work. So

00:18:44.109 --> 00:18:47.829
instead, engineers suspend a tiny camera inside

00:18:47.829 --> 00:18:50.940
a flexible hollow silicone dome. And this dome

00:18:50.940 --> 00:18:53.839
acts like a squishy fingertip. Right. And embedded

00:18:53.839 --> 00:18:56.160
evenly all throughout the inside of that silicone

00:18:56.160 --> 00:18:59.440
dome are little painted point markers, just little

00:18:59.440 --> 00:19:02.240
dots. And when that robotic finger presses against

00:19:02.240 --> 00:19:05.099
a physical surface, say an egg or a soft piece

00:19:05.099 --> 00:19:08.099
of fabric, the silicone naturally deforms and

00:19:08.099 --> 00:19:10.900
squishes. Right. And the internal camera is just

00:19:10.900 --> 00:19:13.380
sitting in the dark inside the finger, just watching

00:19:13.380 --> 00:19:15.640
those point markers shift around in real time.

00:19:15.759 --> 00:19:18.759
The computer then analyzes that visual shift,

00:19:19.119 --> 00:19:21.440
the microscopic movement of the dots, to determine

00:19:21.440 --> 00:19:23.660
the exact physical imperfections of the surface

00:19:23.660 --> 00:19:25.980
it's touching. It's brilliant! It's literally

00:19:25.980 --> 00:19:28.099
like looking at the shifting shadows on a blanket

00:19:28.099 --> 00:19:30.039
to figure out exactly what kind of object is

00:19:30.039 --> 00:19:32.140
hiding underneath it. That analogy is spot on.

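NOTE
The dot-tracking idea described above, reduced to a few made-up marker coordinates in Python: the internal camera compares where each painted marker sits at rest with where it sits during a press, and the displacement field stands in for touch. A real sensor tracks many more markers from live video and maps that field to contact force, texture, and slip.
import numpy as np
# Hypothetical marker positions (camera pixels) painted inside the dome, at rest:
rest = np.array([[10, 10], [10, 20], [20, 10], [20, 20], [30, 10], [30, 20]], dtype=float)
# The same markers while the fingertip presses an object; dots near the contact move most.
pressed = rest + np.array([[0.1, 0.0], [0.2, 0.1], [1.5, 0.9],
                           [1.8, 1.1], [0.3, 0.2], [0.1, 0.1]])
displacement = pressed - rest                    # how far each dot shifted
magnitude = np.linalg.norm(displacement, axis=1)
contact = magnitude.argmax()
print("largest shift at marker", contact, "near", rest[contact])
print("rough contact strength (mean shift):", round(float(magnitude.mean()), 2))
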
00:19:32.400 --> 00:19:35.059
Grasping objects effectively and instantly adjusting

00:19:35.059 --> 00:19:37.279
grip strength based on the texture and slip of

00:19:37.279 --> 00:19:39.500
an item has historically been one of the absolute

00:19:39.500 --> 00:19:42.279
hardest mechanical challenges in robotics. It's

00:19:42.279 --> 00:19:44.339
so hard for machines to pick things up gently.

00:19:44.490 --> 00:19:47.430
Exactly. Traditional strain gauges on the outside

00:19:47.430 --> 00:19:49.849
of a robotic hand are just too fragile. So by

00:19:49.849 --> 00:19:52.390
translating the physical sensation of touch into

00:19:52.390 --> 00:19:55.369
a visual processing problem inside a protected

00:19:55.369 --> 00:19:58.930
silicone dome, engineers have completely bypassed

00:19:58.930 --> 00:20:01.089
those hardware limitations. They used vision

00:20:01.089 --> 00:20:03.609
to solve touch? They're using vision algorithms

00:20:03.609 --> 00:20:06.549
to solve a mechanical physics problem. It's beautiful.

00:20:06.829 --> 00:20:08.890
It's just a stunning evolution. I mean, we started

00:20:08.890 --> 00:20:10.890
this deep dive looking at a group of researchers

00:20:10.890 --> 00:20:14.150
in 1966 who naively thought they could solve

00:20:14.150 --> 00:20:17.069
artificial sight as a quick summer project. And

00:20:17.069 --> 00:20:19.390
look where we are now. Yeah, today we are looking

00:20:19.390 --> 00:20:22.779
at this invisible, hyper-complex matrix of algorithms

00:20:22.779 --> 00:20:25.299
that can diagnose life-threatening diseases,

00:20:25.720 --> 00:20:28.519
drive cars through rush hour traffic, and literally

00:20:28.519 --> 00:20:30.960
allow robots to feel the textures of the world

00:20:30.960 --> 00:20:33.799
using cameras hidden inside silicone fingertips.

00:20:34.200 --> 00:20:37.339
It is a profound testament to decades of rigorous

00:20:37.339 --> 00:20:39.839
interdisciplinary science. You have physics,

00:20:40.279 --> 00:20:43.299
biology, mathematics, and engineering all converging

00:20:43.299 --> 00:20:45.619
to teach a machine how to interpret our reality.

00:20:45.940 --> 00:20:47.900
And it's really something to remember as you

00:20:47.900 --> 00:20:50.480
go about your day. Every time you glance at your

00:20:50.480 --> 00:20:52.880
phone and it unlocks using facial recognition,

00:20:53.599 --> 00:20:56.559
every time you drive a modern car with lane assist,

00:20:57.039 --> 00:20:59.519
or even every time you eat a piece of automatically

00:20:59.519 --> 00:21:02.200
sorted, perfectly ripe produce from the grocery

00:21:02.200 --> 00:21:05.099
store, you're relying on this technology. You're

00:21:05.099 --> 00:21:08.200
relying on that five -step pipeline. Exactly.

00:21:08.420 --> 00:21:10.980
You are interacting with machines that are constantly

00:21:10.980 --> 00:21:14.480
acquiring, extracting, and deciding based on

00:21:14.480 --> 00:21:16.769
the visual world. But you know, as these systems

00:21:16.769 --> 00:21:18.829
become more and more integrated into our daily

00:21:18.829 --> 00:21:21.650
lives, there's a final detail from the research

00:21:21.650 --> 00:21:23.869
regarding typical tasks that I really think we

00:21:23.869 --> 00:21:26.410
should consider before we go. Oh, right. It brings

00:21:26.410 --> 00:21:29.049
up the highly complex and somewhat controversial

00:21:29.049 --> 00:21:32.809
topic of emotion recognition. Right. Using facial

00:21:32.809 --> 00:21:35.829
recognition to not just identify who you are.

00:21:36.059 --> 00:21:38.920
but how you're feeling. Exactly. Computer vision

00:21:38.920 --> 00:21:41.339
systems are currently being trained to classify

00:21:41.339 --> 00:21:44.180
human emotions based on the microscopic geometric

00:21:44.180 --> 00:21:46.700
movements of our facial expressions. However,

00:21:46.940 --> 00:21:49.599
the data explicitly points out that psychologists

00:21:49.599 --> 00:21:52.140
strongly caution against this. Psychologists

00:21:52.140 --> 00:21:55.619
are sounding the alarm. Yes. Decades of psychological

00:21:55.619 --> 00:21:58.539
research suggest that internal subjective emotions

00:21:58.539 --> 00:22:01.140
cannot be reliably detected from facial expressions

00:22:01.140 --> 00:22:03.900
alone. A smile doesn't always mean happiness.

00:22:04.420 --> 00:22:07.059
A furrowed brow doesn't always mean anger. The

00:22:07.059 --> 00:22:10.599
human face is incredibly nuanced. I mean, we

00:22:10.599 --> 00:22:13.319
mask our feelings all the time or express them

00:22:13.319 --> 00:22:15.839
in ways that completely contradict our internal

00:22:15.839 --> 00:22:19.190
state. Precisely. Which leaves us with a lingering,

00:22:19.670 --> 00:22:23.430
somewhat provocative question to ponder. As computer

00:22:23.430 --> 00:22:26.289
vision becomes perfectly, ruthlessly precise at

00:22:26.289 --> 00:22:28.569
mapping the geometry of our physical world down

00:22:28.569 --> 00:22:31.309
to the exact millimeter of a smile, what happens

00:22:31.309 --> 00:22:33.730
when we deploy that exact same geometric math

00:22:33.730 --> 00:22:35.849
to judge our inner feelings? That's a heavy thought.

00:22:36.069 --> 00:22:38.349
Are we genuinely creating a machine that truly

00:22:38.349 --> 00:22:40.670
understands us, or are we just building a highly

00:22:40.670 --> 00:22:43.130
efficient automated system for misinterpreting

00:22:43.130 --> 00:22:45.400
human nature? Wow. It really brings us right

00:22:45.400 --> 00:22:47.400
back to the beginning, doesn't it? We've spent

00:22:47.400 --> 00:22:49.460
decades trying to teach a machine to translate

00:22:49.460 --> 00:22:52.240
light into a 3D world. But when it comes to the

00:22:52.240 --> 00:22:54.900
muddy waters of human emotion, the camera might

00:22:54.900 --> 00:22:57.579
see everything, but the machine still might be

00:22:57.579 --> 00:22:59.579
completely in the dark. Thanks for joining us

00:22:59.579 --> 00:23:00.380
on this deep dive.
