WEBVTT

00:00:00.000 --> 00:00:03.160
OK, let's unpack this, because I think it's something

00:00:03.160 --> 00:00:05.519
we all do without even realizing it. Oh, absolutely.

00:00:05.660 --> 00:00:07.000
Think about what happens when you wake up in

00:00:07.000 --> 00:00:10.720
the morning. You open your eyes, and instantly,

00:00:11.080 --> 00:00:13.380
effortlessly, you just see. Right, there's no

00:00:13.380 --> 00:00:15.619
conscious effort at all. Exactly. You don't have

00:00:15.619 --> 00:00:18.239
to consciously calculate the geometric distance

00:00:18.239 --> 00:00:22.420
to your alarm clock or analyze the ambient lighting

00:00:22.420 --> 00:00:24.940
conditions to figure out that the blurry shape

00:00:24.940 --> 00:00:27.420
on the floor is your dog and not a pile of laundry.

00:00:27.559 --> 00:00:30.620
You just instinctively know. Right. We take the

00:00:30.620 --> 00:00:33.579
sheer computational power of our eyesight completely

00:00:33.579 --> 00:00:36.829
for granted, but getting a machine, you know,

00:00:36.909 --> 00:00:41.390
a computer, to do that, to get a machine to open

00:00:41.390 --> 00:00:44.929
a digital eye and actually comprehend the three-dimensional

00:00:44.929 --> 00:00:47.350
scene in front of it, it turns out

00:00:47.350 --> 00:00:49.789
that is one of the most complex, mind-bending

00:00:49.789 --> 00:00:51.950
puzzles in the entire history of engineering.

00:00:52.270 --> 00:00:54.530
It really is. Because computer vision isn't just

00:00:54.530 --> 00:00:57.049
about taking a digital photograph. A digital

00:00:57.049 --> 00:01:00.009
photo is just raw data. I mean, it's just a flat

00:01:00.109 --> 00:01:02.469
two-dimensional grid of millions of pixels.

00:01:02.950 --> 00:01:04.829
And the computer doesn't naturally know what

00:01:04.829 --> 00:01:06.810
those pixels actually represent, right? Exactly.

00:01:06.950 --> 00:01:09.730
To a computer, it's just numbers. Computer vision

00:01:09.730 --> 00:01:12.109
is the extraction of high-dimensional data from

00:01:12.109 --> 00:01:15.609
the real world to produce, well, numerical or

00:01:15.609 --> 00:01:18.890
symbolic information. So giving it actual meaning.

00:01:19.109 --> 00:01:21.849
Precisely. It is essentially the process of turning

00:01:21.849 --> 00:01:24.750
a chaotic flood of light, shadow, and geometry

00:01:24.750 --> 00:01:28.230
into a series of autonomous, logical decisions.

00:01:28.989 --> 00:01:31.090
We are teaching machines not just to record the

00:01:31.090 --> 00:01:33.890
world, but to genuinely understand it. And today

00:01:33.890 --> 00:01:36.650
we are decoding that exact translation. We're

00:01:36.650 --> 00:01:39.650
doing a deep dive into a massive stack of documentation,

00:01:40.010 --> 00:01:42.689
pulling from decades of computer science history,

00:01:43.129 --> 00:01:46.370
hardware specs, and algorithmic breakdowns to

00:01:46.370 --> 00:01:48.489
really explore the mechanics of computer vision.

00:01:48.769 --> 00:01:50.950
There is a lot to cover. There really is. And

00:01:50.950 --> 00:01:52.530
for you listening, whether you're unlocking your

00:01:52.530 --> 00:01:54.950
phone with your face, or driving past an autonomous

00:01:54.950 --> 00:01:57.629
vehicle on the highway, or even just buying produce

00:01:57.629 --> 00:01:59.549
that's been optically inspected on a conveyor

00:01:59.549 --> 00:02:02.170
belt, this deep dive is your cheat sheet. Yeah,

00:02:02.170 --> 00:02:05.150
it's everywhere now. We are ripping the lid off

00:02:05.150 --> 00:02:08.240
the invisible technological eyes that are constantly

00:02:08.240 --> 00:02:11.280
inspecting, navigating, and shaping your daily

00:02:11.280 --> 00:02:14.680
life. And you know, to really grasp how sophisticated

00:02:14.680 --> 00:02:17.139
this technology is today, we have to look back

00:02:17.139 --> 00:02:19.759
at how completely we misunderstood the challenge

00:02:19.759 --> 00:02:22.060
in the first place. Oh man, the hubris of the

00:02:22.060 --> 00:02:25.139
1960s. It's hilarious in hindsight. The pioneer

00:02:25.139 --> 00:02:27.479
scientists of artificial intelligence back then

00:02:27.479 --> 00:02:29.919
just assumed that vision was the easy part of

00:02:29.919 --> 00:02:33.110
intelligence. Yeah, back in 1966 at MIT, the

00:02:33.110 --> 00:02:36.370
AI lab famously assigned computer vision to an

00:02:36.370 --> 00:02:38.889
undergraduate student as a summer project. Just

00:02:38.889 --> 00:02:40.870
a quick summer gig. They literally thought they

00:02:40.870 --> 00:02:43.150
could attach a camera to a computer, and in a

00:02:43.150 --> 00:02:45.310
few months just write a program that would have

00:02:45.310 --> 00:02:47.530
the machine describe what it saw. They fell into

00:02:47.530 --> 00:02:50.530
a classic cognitive trap, really. How so? Well,

00:02:50.629 --> 00:02:52.969
because human vision operates unconsciously, right?

00:02:53.069 --> 00:02:55.069
You don't feel your brain sweating when you look

00:02:55.069 --> 00:02:57.169
at a coffee mug. Right. It just happens. Yeah.

00:02:57.310 --> 00:02:59.849
So they assumed the computational load was incredibly

00:02:59.849 --> 00:03:03.189
light. They thought the hard part of AI was the

00:03:03.189 --> 00:03:06.370
reasoning, you know, playing chess, solving complex

00:03:06.370 --> 00:03:09.699
math theorems. So they figured perception, just

00:03:09.699 --> 00:03:12.319
translating a flat picture into a list of 3D

00:03:12.319 --> 00:03:15.539
objects, would be a simple stepping stone. Exactly.

00:03:15.740 --> 00:03:17.379
It's as if we thought teaching a computer to

00:03:17.379 --> 00:03:19.139
see was like teaching it to read a thermometer,

00:03:19.659 --> 00:03:21.580
but it turned out to be like teaching it to understand

00:03:21.580 --> 00:03:23.520
poetry. That is the perfect way to frame it.

00:03:23.520 --> 00:03:25.699
And because it was like poetry, that one summer

00:03:25.699 --> 00:03:29.479
project stretched into decades of grueling incremental

00:03:29.479 --> 00:03:31.240
research. Just trying to get the basics down.

00:03:31.319 --> 00:03:34.280
Yeah. Through the 1970s, researchers realized

00:03:34.280 --> 00:03:36.439
they had to start with the absolute mathematical

00:03:36.439 --> 00:03:38.860
basics. They couldn't just say, hey, find the

00:03:38.860 --> 00:03:40.539
chair. Right. Because the computer doesn't know

00:03:40.539 --> 00:03:43.180
what a chair is. Exactly. They had to write algorithms

00:03:43.180 --> 00:03:46.110
strictly focused on extracting edges, just looking

00:03:46.110 --> 00:03:48.490
for where light pixels suddenly turned into dark

00:03:48.490 --> 00:03:51.370
pixels. Just to try and label lines and infer

00:03:51.370 --> 00:03:54.969
basic 3D structures from 2D blocks. Right. And

00:03:54.969 --> 00:03:58.310
then the 1980s arrive and the field gets flooded

00:03:58.310 --> 00:04:00.949
with these intense mathematical models to handle

00:04:00.949 --> 00:04:03.710
lighting and depth. Yeah, things with great names

00:04:03.710 --> 00:04:07.889
like scale space, which was a mathematical framework

00:04:07.889 --> 00:04:10.030
for allowing computers to recognize an object,

00:04:10.090 --> 00:04:11.889
whether it was two feet away or 20 feet away,

00:04:12.370 --> 00:04:15.189
or contour models, which were known as snakes.

00:04:16.649 --> 00:04:19.029
I love that name. Snakes are a great example

00:04:19.029 --> 00:04:21.790
of this era's logic, actually. How so? An active

00:04:21.790 --> 00:04:25.410
contour model, or a snake, is essentially a digital

00:04:25.410 --> 00:04:28.569
rubber band. You drop it onto an image, and the

00:04:28.569 --> 00:04:31.029
algorithm shrinks that rubber band inward. Until

00:04:31.029 --> 00:04:33.230
it hits something? Right, until it hits an area

00:04:33.230 --> 00:04:36.399
of high gradient. a sharp change in contrast,

00:04:36.620 --> 00:04:38.720
which usually signifies the edge of an object.

00:04:39.180 --> 00:04:41.800
The rubber band wraps around that shape, helping

00:04:41.800 --> 00:04:44.120
the computer trace a distinct boundary. That's

00:04:44.120 --> 00:04:45.899
so clever. And it wasn't just finding edges,

00:04:46.060 --> 00:04:48.540
right? By the 1990s, the field of computer vision

00:04:48.540 --> 00:04:50.920
started colliding with computer graphics. Yes,

00:04:51.100 --> 00:04:53.920
big time. They adopted a concept called bundle

00:04:53.920 --> 00:04:56.600
adjustment, which comes from photogrammetry.

00:04:56.939 --> 00:04:58.759
And photogrammetry is basically the science of

00:04:58.759 --> 00:05:02.220
making 3D measurements from 2D photographs. So

00:05:02.220 --> 00:05:04.540
like if a camera moves through a room taking

00:05:04.540 --> 00:05:07.800
pictures, bundle adjustment is the incredibly

00:05:07.800 --> 00:05:10.300
heavy math used to triangulate all those different

00:05:10.300 --> 00:05:13.259
viewing angles to simultaneously estimate both

00:05:13.259 --> 00:05:16.019
the 3D shape of the room and the exact path the

00:05:16.019 --> 00:05:18.759
camera took. It was a slow mathematical grind

00:05:18.759 --> 00:05:21.899
of geometry and physics for decades. But then

00:05:21.899 --> 00:05:25.740
we hit the modern era. The boom. The boom. The

00:05:25.740 --> 00:05:28.579
entire landscape shifted violently with the explosion

00:05:28.579 --> 00:05:31.319
of deep learning and convolutional neural networks,

00:05:31.660 --> 00:05:35.079
or CNNs. The famous CNNs. Yeah. Instead of humans

00:05:35.079 --> 00:05:37.439
hand coding what an edge or shadow looks like,

00:05:37.660 --> 00:05:39.860
we just started feeding millions of images into

00:05:39.860 --> 00:05:41.500
these networks. So it learns on its own.

00:05:41.720 --> 00:05:44.300
Exactly. A convolution is basically a sliding

00:05:44.300 --> 00:05:47.139
mathematical filter. The network slides this

00:05:47.139 --> 00:05:50.040
filter over the pixels of an image layer by layer,

00:05:50.199 --> 00:05:52.420
automatically learning to detect edges, then

00:05:52.420 --> 00:05:55.540
textures, then shapes, and finally, whole objects.

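NOTE
A minimal sketch of the sliding filter described above, in plain Python.
The kernel values, the toy image, and the names slide_filter and vertical_edge
are illustrative assumptions; a trained CNN learns its kernel values from data
and stacks many such filters in layers.
```python
# Sketch of one 2D sliding-filter pass ("valid" mode: no padding, stride 1).
# Like most deep learning libraries, this slides the kernel without flipping it.
def slide_filter(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    output = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            # Multiply the kernel against the patch under it and sum the products.
            row.append(sum(kernel[i][j] * image[y + i][x + j]
                           for i in range(kh) for j in range(kw)))
        output.append(row)
    return output
# Hand-picked vertical-edge kernel (hypothetical values, not learned ones).
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
# Toy image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(3)]
response = slide_filter(image, vertical_edge)
print(response)  # the largest responses sit where dark meets bright
```
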
00:05:55.680 --> 00:05:58.019
And it worked really well. Unbelievably well.

00:05:58.360 --> 00:06:00.959
Within a few years, these CNNs were surpassing

00:06:00.959 --> 00:06:03.740
human benchmarks on massive datasets like ImageNet,

00:06:04.160 --> 00:06:06.220
which contains millions of images sorted into

00:06:06.220 --> 00:06:08.899
thousands of object classes. Okay, I want to

00:06:08.899 --> 00:06:12.660
push back a little on this idea of modern AI

00:06:13.279 --> 00:06:16.240
surpassing humans or having vision completely

00:06:16.240 --> 00:06:18.860
solved. Fair enough. CNNs are incredibly powerful,

00:06:19.519 --> 00:06:22.220
but there's this fascinating contradiction in

00:06:22.220 --> 00:06:24.319
the technology that came up in the source material.

00:06:24.660 --> 00:06:27.519
Yeah, the dog breeds versus the filters. Yes.

00:06:27.860 --> 00:06:31.779
On one hand, a CNN can easily classify a dozen

00:06:31.779 --> 00:06:35.220
fine-grained dog breeds or specific species of

00:06:35.220 --> 00:06:37.620
birds with a level of accuracy that you or I

00:06:37.620 --> 00:06:39.420
would totally fail at. Oh, I wouldn't even know

00:06:39.420 --> 00:06:41.639
where to start with bird species. Right. But

00:06:41.639 --> 00:06:44.160
on the other hand, feed that same supercomputer

00:06:44.160 --> 00:06:47.279
an image with a modern digital camera filter,

00:06:47.439 --> 00:06:49.920
like a basic Instagram color tint, and it gets

00:06:49.920 --> 00:06:51.800
completely confused. It totally breaks down.

00:06:51.959 --> 00:06:54.540
Or it will fail to detect tiny obvious details

00:06:54.540 --> 00:06:56.860
like a small ant crawling on the stem of a flower

00:06:56.860 --> 00:06:59.100
or a person holding a thin quill in their hand.

00:06:59.620 --> 00:07:01.360
Humans never get tripped up by a color filter.

00:07:01.699 --> 00:07:04.100
What's fascinating here is why that paradox exists.

00:07:04.399 --> 00:07:06.800
You and I don't just see with our eyes. We see

00:07:06.800 --> 00:07:09.019
with our entire life experience. Oh, that's a

00:07:09.019 --> 00:07:12.519
good point. We have massive amounts of contextual,

00:07:12.519 --> 00:07:14.740
organic knowledge about how the world operates.

00:07:15.480 --> 00:07:17.639
If you see a photo tinted slightly purple,

00:07:18.060 --> 00:07:20.180
your brain instantly knows it's a lighting effect

00:07:20.180 --> 00:07:22.779
or a filter, and you just discount it. You don't

00:07:22.779 --> 00:07:24.399
suddenly think the dog is a different species.

00:07:24.699 --> 00:07:27.040
Exactly. And if you see a hand positioned with

00:07:27.040 --> 00:07:28.980
the fingers pinched together hovering over a

00:07:28.980 --> 00:07:31.620
piece of paper, your brain actively expects to

00:07:31.620 --> 00:07:34.399
see a pen or a quill, so you spot it immediately.

00:07:34.720 --> 00:07:37.579
Even if it's super thin or blurry. Because computers

00:07:37.579 --> 00:07:39.899
don't have that organic context. They aren't

00:07:39.899 --> 00:07:43.040
expecting anything. Precisely. They are disentangling

00:07:43.040 --> 00:07:45.819
symbolic information using models built entirely

00:07:45.819 --> 00:07:48.800
on geometry and statistics. So it's all just

00:07:48.800 --> 00:07:52.079
math? All math. When a digital filter subtly

00:07:52.079 --> 00:07:54.579
distorts the numerical pixel values across an

00:07:54.579 --> 00:07:57.600
entire image, the statistical model breaks down.

00:07:58.000 --> 00:07:59.899
The computer doesn't conceptually know what a

00:07:59.899 --> 00:08:02.399
filter is. It just sees bad numbers. Exactly.

00:08:02.740 --> 00:08:04.620
It only knows that the mathematical patterns

00:08:04.620 --> 00:08:07.000
it relies on to identify a dog have suddenly

00:08:07.000 --> 00:08:09.730
vanished into statistical noise. Which really

00:08:09.730 --> 00:08:12.290
points to the massive gap between humans and

00:08:12.290 --> 00:08:15.230
machines. The reason computers get confused by

00:08:15.230 --> 00:08:18.050
a simple filter is that they don't have biological

00:08:18.050 --> 00:08:22.480
eyes or biological brains. To fix that, engineers

00:08:22.480 --> 00:08:24.639
had to figure out how to build an artificial

00:08:24.639 --> 00:08:27.459
eye and brain from scratch. And that requires

00:08:27.459 --> 00:08:31.339
a literal Frankenstein recipe of scientific fields.

00:08:31.980 --> 00:08:33.840
Here's where it gets really interesting, though,

00:08:34.139 --> 00:08:36.440
because you literally need quantum mechanics

00:08:36.440 --> 00:08:39.440
just to snap a digital picture properly. I know.

00:08:39.440 --> 00:08:41.659
It sounds like an exaggeration, but it is absolute

00:08:41.659 --> 00:08:44.039
reality. It blew my mind. Most computer vision

00:08:44.039 --> 00:08:47.179
systems rely on solid-state image sensors, typically

00:08:47.179 --> 00:08:50.539
CMOS sensors, that detect electromagnetic radiation.

00:08:51.240 --> 00:08:53.340
And to design those sensors, you need a deep

00:08:53.340 --> 00:08:55.700
understanding of solid-state physics. Wait,

00:08:55.820 --> 00:08:58.220
hold on. Quantum mechanics? Just to take a digital

00:08:58.220 --> 00:09:00.779
picture? Yeah. I thought cameras were just capturing

00:09:00.779 --> 00:09:03.159
bouncing light through a glass lens. Where? Where

00:09:03.159 --> 00:09:05.039
does the quantum realm actually come into play

00:09:05.039 --> 00:09:07.360
here? It happens the moment the light hits the

00:09:07.360 --> 00:09:09.960
digital sensor. When photons of light travel

00:09:09.960 --> 00:09:12.500
through the camera lens, they strike a grid of

00:09:12.500 --> 00:09:15.009
silicon pixels. Because of the photoelectric

00:09:15.009 --> 00:09:17.429
effect, which is a quantum mechanical phenomenon,

00:09:18.070 --> 00:09:20.210
those photons knock electrons loose within the

00:09:20.210 --> 00:09:23.210
silicon. So the light is physically freeing electrons.

00:09:23.470 --> 00:09:26.269
Exactly. The sensor collects these freed electrons

00:09:26.269 --> 00:09:29.269
in microscopic wells, and the number of electrons

00:09:29.269 --> 00:09:32.350
in each well tells the computer exactly how bright

00:09:32.350 --> 00:09:35.370
the light was at that specific pixel. That is

00:09:35.370 --> 00:09:38.309
wild. You literally cannot engineer a modern,

00:09:38.490 --> 00:09:41.090
high-definition camera sensor without relying

00:09:41.090 --> 00:09:43.169
on the quantum interactions between light and

00:09:43.169 --> 00:09:45.629
matter. Okay, so quantum physics gives us the

00:09:45.629 --> 00:09:48.190
raw hardware, you know, the artificial eye. But

00:09:48.190 --> 00:09:50.639
what about the brain? How do we get the software

00:09:50.639 --> 00:09:52.960
to interpret those electron wells? That's where

00:09:52.960 --> 00:09:55.960
we turn to biology. Right. Over the last century,

00:09:56.419 --> 00:09:58.960
scientists extensively studied neurobiology,

00:09:59.220 --> 00:10:01.539
like human eyes, neurons, and the visual cortex,

00:10:01.919 --> 00:10:04.379
just to create a blueprint for machines. And

00:10:04.379 --> 00:10:06.539
there is a massive historical anchor from the

00:10:06.539 --> 00:10:10.399
1970s, Kunihiko Fukushima's neocognitron. A legendary

00:10:10.399 --> 00:10:12.740
piece of work. Yeah, it was an early neural network

00:10:12.740 --> 00:10:16.059
directly inspired by biological vision, specifically

00:10:16.059 --> 00:10:18.840
how simple and complex cells in an animal's visual

00:10:18.840 --> 00:10:21.580
cortex respond to lines of light at specific

00:10:21.580 --> 00:10:24.840
angles. But if we are literally trying to copy

00:10:24.840 --> 00:10:27.879
the human brain, are we just building a digital

00:10:27.879 --> 00:10:30.799
clone of a human, or is a computer vision system

00:10:30.799 --> 00:10:33.500
doing something fundamentally different? It is

00:10:33.500 --> 00:10:36.100
doing something fundamentally different. Neurobiology

00:10:36.100 --> 00:10:39.200
provides the architectural inspiration, but the

00:10:39.200 --> 00:10:41.919
execution is purely mathematical. So it's not

00:10:41.919 --> 00:10:45.299
a true brain? Not at all. Biology relies on physiological

00:10:45.299 --> 00:10:48.059
processes, like chemical synapses between living

00:10:48.059 --> 00:10:50.860
neurons. Computer vision relies on nodes with

00:10:50.860 --> 00:10:53.000
mathematical weight associations. Can you give

00:10:53.000 --> 00:10:55.100
an example? Sure. Let's look at how a network

00:10:55.100 --> 00:10:57.720
learns to recognize textures. Imagine training

00:10:57.720 --> 00:11:00.100
a network to recognize various sea creatures.

00:11:00.179 --> 00:11:02.519
Like a starfish? Yeah. The starfish might strongly

00:11:02.519 --> 00:11:04.960
activate visual feature nodes for a ringed texture

00:11:04.960 --> 00:11:08.139
and a star outline. Meanwhile, sea urchins might

00:11:08.139 --> 00:11:10.580
strongly activate nodes for a striped texture

00:11:10.539 --> 00:11:13.559
and an oval shape. But nature isn't always that

00:11:13.559 --> 00:11:16.700
clean. What happens if you feed the system an

00:11:16.700 --> 00:11:19.360
image of a rare sea urchin that actually has

00:11:19.360 --> 00:11:21.879
a ringed texture instead of stripes? Well, the

00:11:21.879 --> 00:11:24.980
network adjusts the math. It creates a weakly

00:11:24.980 --> 00:11:27.600
weighted association between the urchin category

00:11:27.600 --> 00:11:30.779
and the ringed texture node. Oh, I see. So the

00:11:30.779 --> 00:11:33.059
next time it sees an image, it isn't just checking

00:11:33.059 --> 00:11:35.799
off a rigid list of features. It is calculating

00:11:35.799 --> 00:11:39.460
massive matrices of associated mathematical weight

00:11:39.460 --> 00:11:43.139
patterns across millions of nodes simultaneously.

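NOTE
A rough sketch of the weighted associations described above, in plain Python.
Every feature name and weight value here is made up for illustration; a real
network learns its weights during training and has vastly more nodes.
```python
# Sketch: score each category as a weighted sum of feature-node activations.
# The 0.2 entry stands in for the weak urchin-to-ringed-texture association.
weights = {
    "starfish": {"ringed_texture": 0.9, "star_outline": 0.9,
                 "striped_texture": 0.0, "oval_shape": 0.0},
    "sea_urchin": {"ringed_texture": 0.2, "star_outline": 0.0,
                   "striped_texture": 0.8, "oval_shape": 0.8},
}
def classify(activations):
    # Features missing from the activations dict count as not firing (0.0).
    scores = {
        category: sum(w * activations.get(feature, 0.0)
                      for feature, w in feature_weights.items())
        for category, feature_weights in weights.items()
    }
    return max(scores, key=scores.get), scores
# A ringed sea urchin: the ringed-texture and oval-shape nodes both fire.
label, scores = classify({"ringed_texture": 1.0, "oval_shape": 1.0})
print(label, scores)
```
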
00:11:43.580 --> 00:11:46.139
Biology gave us the blueprint of a layered network.

00:11:46.600 --> 00:11:48.679
But the computer executes this through linear

00:11:48.679 --> 00:11:51.299
algebra and multidimensional signal processing.

00:11:51.740 --> 00:11:53.840
Exactly. So we have physical sensors designed

00:11:53.840 --> 00:11:56.080
with quantum mechanics and algorithms inspired

00:11:56.080 --> 00:11:58.519
by neurobiology, but executed through heavy math.

00:11:58.860 --> 00:12:01.860
How does a computer actually process a live scene

00:12:01.860 --> 00:12:04.740
in real time? It follows a very strict pipeline.

00:12:05.019 --> 00:12:06.820
Right. Let's walk through the actual assembly

00:12:06.820 --> 00:12:09.500
line of artificial perception. The typical pipeline

00:12:09.500 --> 00:12:12.200
of a computer vision system is incredibly structured.

00:12:12.539 --> 00:12:14.740
It almost always starts with image acquisition.

00:12:14.960 --> 00:12:17.600
Getting the raw data. Right. And this isn't just

00:12:17.600 --> 00:12:20.039
snapping a 2D photo. This could be gathering

00:12:20.039 --> 00:12:24.179
3D volume scans from a medical device or pulling

00:12:24.179 --> 00:12:26.779
range data from radar and ultrasonic cameras.

00:12:27.080 --> 00:12:29.480
And you can't just feed that raw data straight

00:12:29.480 --> 00:12:31.919
into the brain, right? It has to go through preprocessing.

00:12:31.940 --> 00:12:33.950
It has to be cleaned. Yeah. The system has to

00:12:33.950 --> 00:12:36.669
clean the data, reducing noise, enhancing the

00:12:36.669 --> 00:12:39.309
contrast, or resampling the image so the coordinate

00:12:39.309 --> 00:12:41.590
system is perfectly aligned. And from there,

00:12:41.629 --> 00:12:44.269
you move into feature extraction. And this is

00:12:44.269 --> 00:12:46.570
where it gets really granular. It's less like

00:12:46.570 --> 00:12:48.909
looking at a whole picture and more like looking

00:12:48.909 --> 00:12:51.409
through a tiny microscopic straw. That's a great

00:12:51.409 --> 00:12:54.389
analogy. The computer slides that straw across

00:12:54.389 --> 00:12:57.389
an image, and it is completely blind to what

00:12:57.389 --> 00:13:00.250
the object actually is. It only notices when

00:13:00.250 --> 00:13:03.009
the pixels inside that straw suddenly shift from

00:13:03.009 --> 00:13:05.269
light to dark or when a mathematical gradient

00:13:05.269 --> 00:13:07.590
spikes. And that sudden drop off, the computer

00:13:07.590 --> 00:13:10.850
logs that as an edge or a corner. Exactly. And

00:13:10.850 --> 00:13:12.710
once it has mapped all those edges and corners,

00:13:12.950 --> 00:13:16.149
the pipeline moves to detection and segmentation.

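NOTE
A minimal sketch of the light-to-dark check described above, reduced to a
single row of pixels in plain Python. The threshold of 50, the toy scanline,
and the name find_edges are assumptions for illustration; real detectors work
in two dimensions and usually smooth the image first.
```python
# Sketch: flag positions where brightness jumps sharply between neighbors.
def find_edges(row, threshold=50):
    edges = []
    for x in range(len(row) - 1):
        # Finite-difference approximation of the gradient between x and x + 1.
        gradient = row[x + 1] - row[x]
        if abs(gradient) >= threshold:
            edges.append(x)
    return edges
# Bright region, then a dark region, then bright again.
scanline = [200, 198, 201, 40, 42, 41, 199]
edge_positions = find_edges(scanline)
print(edge_positions)  # indices of the pixel just before each sharp jump
```
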
00:13:16.230 --> 00:13:18.330
Putting the pieces together. The system decides

00:13:18.330 --> 00:13:20.490
which of those extracted regions are actually

00:13:20.490 --> 00:13:24.279
relevant. It segments the image, mathematically

00:13:24.279 --> 00:13:26.620
isolating the foreground object from the background

00:13:26.620 --> 00:13:29.519
noise. Finally, it moves to high-level processing

00:13:29.519 --> 00:13:33.019
and decision making. Right, like, is this segmented

00:13:33.019 --> 00:13:36.639
shape a car drifting into my lane? Or does this

00:13:36.639 --> 00:13:39.980
manufactured circuit board pass quality inspection?

00:13:40.840 --> 00:13:43.669
It's like preparing a gourmet meal. Acquisition

00:13:43.669 --> 00:13:46.370
is buying the groceries, preprocessing is washing

00:13:46.370 --> 00:13:48.750
the vegetables, feature extraction is chopping

00:13:48.750 --> 00:13:51.269
them into exact shapes, and decision-making

00:13:51.269 --> 00:13:54.070
is the final taste test. I love that. But what

00:13:54.070 --> 00:13:56.250
happens when the groceries are spoiled? Say the

00:13:56.250 --> 00:13:58.769
camera lens is smudged, or there's heavy motion

00:13:58.769 --> 00:14:01.710
blur from a car moving fast, or bad lighting

00:14:01.710 --> 00:14:04.330
introduces a ton of static to the image. It happens

00:14:04.330 --> 00:14:06.870
all the time in the real world. A human can squint

00:14:06.870 --> 00:14:08.629
and still figure out what they're looking at.

00:14:08.730 --> 00:14:10.549
Does the computer just throw the degraded image

00:14:10.549 --> 00:14:12.789
away and give up? If we connect this to the bigger

00:14:12.789 --> 00:14:15.289
picture, it makes sense how they handle it. This

00:14:15.289 --> 00:14:17.990
is where a subfield called image restoration

00:14:17.990 --> 00:14:20.269
becomes absolutely critical. So they fix it.

00:14:20.429 --> 00:14:23.049
Yeah. Computers don't just discard degraded data.

00:14:23.129 --> 00:14:25.850
They actively reconstruct it. They use highly

00:14:25.850 --> 00:14:29.350
sophisticated local structural models to separate

00:14:29.350 --> 00:14:31.649
the noise from the actual signal. How does that

00:14:31.649 --> 00:14:34.730
work? Take a median filter, for example. If you

00:14:34.730 --> 00:14:37.269
have a corrupted pixel, say it's glaringly bright

00:14:37.269 --> 00:14:39.929
white because of sensor noise, the algorithm

00:14:39.929 --> 00:14:42.320
looks at a three by three grid of the surrounding

00:14:42.320 --> 00:14:44.580
pixels. Just a tiny neighborhood around the bad

00:14:44.580 --> 00:14:47.120
one. Right. It takes all nine of those numerical

00:14:47.120 --> 00:14:50.620
values, lines them up, and finds the exact median

00:14:50.620 --> 00:14:53.220
value. It throws out the extreme white noise

00:14:53.220 --> 00:14:55.700
and replaces the corrupted pixel with that median

00:14:55.700 --> 00:14:58.600
number, instantly smoothing out the image without

00:14:58.600 --> 00:15:01.000
blurring the real edges. It's literally guessing

00:15:01.000 --> 00:15:02.659
what should be there based on the neighborhood.

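NOTE
A minimal sketch of the 3x3 median filter described above, in plain Python.
It applies the filter to every interior pixel, which is the usual practice,
and leaves border pixels unchanged (an assumption to keep the sketch short).
The name median_filter_3x3 and the toy image are illustrative.
```python
from statistics import median
# Sketch of a 3x3 median filter on a grayscale image stored as a list of rows.
def median_filter_3x3(image):
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]  # copy, so reads always see original pixels
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Collect the nine values in the 3x3 neighborhood centered on (x, y).
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            result[y][x] = median(window)
    return result
# One glaring white pixel (255) in an otherwise uniform dark patch.
noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
]
cleaned = median_filter_3x3(noisy)
print(cleaned[1][1])  # the outlier is replaced by the neighborhood median
```
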
00:15:02.919 --> 00:15:06.440
Yes. They even use techniques called inpainting

00:15:06.440 --> 00:15:09.500
to fill in entirely missing parts of an image.

00:15:09.679 --> 00:15:12.820
Just making up missing data. Kind of. If there

00:15:12.820 --> 00:15:15.120
is transmission interference or severe motion

00:15:15.120 --> 00:15:17.840
blur, the system analyzes the existing local

00:15:17.840 --> 00:15:20.200
image structures, the lines and edges that are

00:15:20.200 --> 00:15:22.679
still visible, and uses multidimensional math

00:15:22.679 --> 00:15:24.980
to control the filtering process and rebuild

00:15:24.980 --> 00:15:27.559
the image to what it was intended to be. To manage

00:15:27.559 --> 00:15:30.179
this entire assembly line, the system relies

00:15:30.179 --> 00:15:34.809
on image understanding systems, or IUS. This

00:15:34.809 --> 00:15:36.870
framework breaks perception into three levels

00:15:36.870 --> 00:15:39.559
of abstraction. The low level is just the raw

00:15:39.559 --> 00:15:41.419
primitives, those edges and textures we talked

00:15:41.419 --> 00:15:44.340
about. The intermediate level builds those primitives

00:15:44.340 --> 00:15:47.240
into continuous boundaries, surfaces, and 3D

00:15:47.240 --> 00:15:49.460
volumes. Getting closer to a real object. Yeah,

00:15:49.460 --> 00:15:51.299
and the high level is where the math finally

00:15:51.299 --> 00:15:54.120
clicks into a semantic concept, an object, a

00:15:54.120 --> 00:15:56.840
scene, or a specific event. And we cannot ignore

00:15:56.840 --> 00:15:59.980
the physical hardware required to run this intensely

00:15:59.980 --> 00:16:02.799
heavy pipeline. Oh, absolutely. Traditional CPUs

00:16:02.799 --> 00:16:05.659
simply cannot handle the sheer volume of parallel

00:16:05.659 --> 00:16:07.960
math required to calculate millions of pixel

00:16:07.960 --> 00:16:11.580
values 60 times a second. We are now seeing dedicated

00:16:11.580 --> 00:16:14.360
vision processing units, or VPUs, integrated

00:16:14.360 --> 00:16:17.299
directly into devices. So we've built this insanely

00:16:17.299 --> 00:16:20.399
complex, noise-filtering, quantum-powered,

00:16:20.919 --> 00:16:22.799
biologically inspired deep learning pipeline.

00:16:22.919 --> 00:16:25.639
That's a mouthful. Right. So what is it actually

00:16:25.639 --> 00:16:27.159
doing out there in the world right now? Because

00:16:27.159 --> 00:16:29.740
the sheer scale of its application is staggering.

00:16:29.980 --> 00:16:32.799
The market sizes alone tell the story of its

00:16:32.799 --> 00:16:36.539
dominance. In 2024, the leading sector for computer

00:16:36.539 --> 00:16:39.580
vision was industrial machine vision with a market

00:16:39.580 --> 00:16:44.320
size of 5.22 billion U.S. dollars. The medical

00:16:44.320 --> 00:16:47.799
sector reached 2.6 billion, and military applications

00:16:47.799 --> 00:16:50.379
were nearly a billion dollars. The specific use

00:16:50.379 --> 00:16:53.820
cases are where it gets truly wild. Take agriculture.

00:16:54.029 --> 00:16:57.169
Oh, yeah. There is an open source vision transformer

00:16:57.169 --> 00:16:59.330
model out there right now that farmers use to

00:16:59.330 --> 00:17:02.110
monitor their fields. It automatically analyzes

00:17:02.110 --> 00:17:04.450
the visual data of crops and detects diseases

00:17:04.450 --> 00:17:07.049
on strawberry plants with a 98.4

00:17:07.049 --> 00:17:09.769
percent accuracy rate. That is incredible.

00:17:09.930 --> 00:17:12.430
It is literally securing food supply chains. Or

00:17:12.430 --> 00:17:14.710
look at industrial manufacturing. They deploy

00:17:14.710 --> 00:17:17.109
high-speed vision processing setups that can

00:17:17.109 --> 00:17:19.460
track thousands of frames per second. Thousands.

00:17:19.660 --> 00:17:22.140
Yes. They are inspecting speeding glass bottles

00:17:22.140 --> 00:17:24.599
on a production line, instantly flagging microscopic

00:17:24.599 --> 00:17:26.880
cracks. They use it to check silicon wafers,

00:17:27.000 --> 00:17:28.720
the foundational pieces of our computer chips,

00:17:28.779 --> 00:17:31.079
for defects that are entirely invisible to the

00:17:31.079 --> 00:17:33.640
human eye. That's wild. They even use optical

00:17:33.640 --> 00:17:36.900
sorting to automatically blast undesirable foodstuff

00:17:36.900 --> 00:17:39.980
off of bulk conveyor belts using targeted jets

00:17:39.980 --> 00:17:42.319
of air. Then you have autonomous vehicles. And

00:17:42.319 --> 00:17:43.640
I'm not just talking about the driver assist

00:17:43.640 --> 00:17:45.880
features in your car. I'm talking about space

00:17:45.880 --> 00:17:50.519
exploration. Yes. NASA's Curiosity rover and

00:17:50.519 --> 00:17:54.440
the Chinese space agency's Yutu-2 rover use computer

00:17:54.440 --> 00:17:57.380
vision algorithms for SLAM. That's simultaneous

00:17:57.380 --> 00:18:00.019
localization and mapping. Right. They ingest

00:18:00.019 --> 00:18:02.700
camera data to simultaneously build a 3D map

00:18:02.700 --> 00:18:06.279
of the Martian or lunar surface while pinpointing

00:18:06.279 --> 00:18:08.420
their exact location within that map. It's vital

00:18:08.420 --> 00:18:10.799
for navigation. Or unmanned aerial vehicles,

00:18:10.920 --> 00:18:14.019
UAVs, flying high over dense forests, analyzing

00:18:14.019 --> 00:18:16.440
visual spectrums to detect the earliest thermal

00:18:16.440 --> 00:18:19.359
signs of wildfires before human towers can spot

00:18:19.359 --> 00:18:22.039
them. But perhaps the most mind -bending application

00:18:22.039 --> 00:18:24.019
emerging right now is how computer vision is

00:18:24.019 --> 00:18:27.140
being used for tactile feedback. Yes. This part

00:18:27.140 --> 00:18:29.539
completely broke my brain. The idea that vision

00:18:29.539 --> 00:18:31.839
is being used to create the sensation of touch.

00:18:32.119 --> 00:18:34.519
It's brilliant. Engineers are now embedding rubber

00:18:34.519 --> 00:18:37.559
artificial skin layers with tiny strain gauges.

00:18:37.619 --> 00:18:40.940
You place the skin over a robotic finger, trace

00:18:40.940 --> 00:18:43.400
a surface, and the computer measures the upward

00:18:43.400 --> 00:18:47.400
push on tiny rubber pins to map micro-undulations.

00:18:48.079 --> 00:18:50.420
But the far crazier version of this is the silicon

00:18:50.420 --> 00:18:53.720
domes. Right. Engineers take a tiny, high-definition

00:18:53.720 --> 00:18:57.240
camera and suspend it inside a flexible, translucent

00:18:57.240 --> 00:19:00.980
silicon dome like a robotic fingertip. They embed

00:19:00.980 --> 00:19:03.720
a grid of equally spaced point markers on the

00:19:03.720 --> 00:19:06.660
inside of that silicon. When that robotic fingertip

00:19:06.660 --> 00:19:09.700
presses against an object, the silicon physically

00:19:09.700 --> 00:19:11.779
deforms. So the camera's watching from the inside.

00:19:11.859 --> 00:19:13.940
Exactly. The internal camera watches how those

00:19:13.940 --> 00:19:16.430
point markers shift, stretch, and distort. The

00:19:16.430 --> 00:19:19.049
computer then uses that visual data to calculate

00:19:19.049 --> 00:19:21.470
the exact geometric pressure being applied to

00:19:21.470 --> 00:19:24.049
the mold. It's using a camera to watch the inside

00:19:24.049 --> 00:19:27.029
of its own skin deform, which lets the robotic

00:19:27.029 --> 00:19:30.349
hand essentially feel the shape and firmness

00:19:30.349 --> 00:19:32.369
of whatever it's grasping. It's incredible. So
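
NOTE
Editorial sketch, not from the episode: one way to picture the marker-tracking
idea described above. It assumes the camera has already located each marker in
pixel coordinates, and it only finds which marker moved the most. Turning that
shift into an actual force would need a calibrated model of the dome, which is
not attempted here. The names marker_displacements and contact_estimate, the
3x3 grid, and the sample numbers are all hypothetical.
```python
# Hypothetical sketch: marker layout, units, and function names are assumptions.
from math import hypot
def marker_displacements(rest, pressed):
    """Return (dx, dy) for each marker, pairing rest and pressed positions by index."""
    return [(px - rx, py - ry) for (rx, ry), (px, py) in zip(rest, pressed)]
def contact_estimate(rest, pressed):
    """Return the rest position of the most-displaced marker and its shift in pixels."""
    shifts = [hypot(dx, dy) for dx, dy in marker_displacements(rest, pressed)]
    peak = max(range(len(shifts)), key=shifts.__getitem__)
    return rest[peak], shifts[peak]
# A 3x3 grid of equally spaced markers as seen by the internal camera (pixel coordinates).
rest_grid = [(x, y) for y in (0, 10, 20) for x in (0, 10, 20)]
# Simulated press near the centre: the centre marker moves most, a neighbour only slightly.
pressed_grid = list(rest_grid)
pressed_grid[4] = (13, 14)  # centre marker (10, 10) shifted by (3, 4)
pressed_grid[5] = (21, 10)  # right neighbour (20, 10) shifted by (1, 0)
location, shift = contact_estimate(rest_grid, pressed_grid)
print(location, shift)  # prints: (10, 10) 5.0
```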

00:19:32.369 --> 00:19:34.049
what does this all mean? I have to push back

00:19:34.049 --> 00:19:36.069
on our core premise here. We call this whole

00:19:36.069 --> 00:19:39.630
field computer vision. But if a camera suspended

00:19:39.630 --> 00:19:42.349
inside a silicon dome is being used to help a

00:19:42.349 --> 00:19:47.230
robot feel a surface, hasn't vision just crossed

00:19:47.230 --> 00:19:49.529
over into a completely different sensory category?

00:19:49.750 --> 00:19:51.710
This raises an important question though about

00:19:51.710 --> 00:19:54.089
the fundamental trajectory of the technology.

00:19:54.789 --> 00:19:57.509
Machine vision is rapidly converging with robotics

00:19:57.509 --> 00:19:59.990
and visual computing. So it's merging. It is

00:19:59.990 --> 00:20:02.690
evolving away from a replication of human eyesight

00:20:02.690 --> 00:20:05.529
and turning into a generalized environmental

00:20:05.529 --> 00:20:08.369
awareness. The algorithms don't actually care

00:20:08.369 --> 00:20:11.710
if the input data represents visible light, physical

00:20:11.710 --> 00:20:14.829
pressure, or radar waves. They just process

00:20:14.829 --> 00:20:17.710
high-dimensional data into autonomous decisions. Just

00:20:17.710 --> 00:20:20.130
numbers to them. Exactly. You see this vividly

00:20:20.130 --> 00:20:22.869
in modern military applications. The latest missile

00:20:22.869 --> 00:20:25.150
guidance systems don't just lock on to a

00:20:25.150 --> 00:20:27.349
pre-programmed set of coordinates. They don't. No,

00:20:27.349 --> 00:20:29.829
they are launched into a general theater of operation

00:20:29.829 --> 00:20:32.569
and then they use locally acquired image data

00:20:32.569 --> 00:20:34.990
to actively deliberate and select their targets

00:20:34.990 --> 00:20:38.170
upon arrival. Vision has become synonymous with

00:20:38.170 --> 00:20:40.630
autonomous thinking. That is both awe-inspiring

00:20:40.519 --> 00:20:43.859
and frankly terrifying. It really highlights

00:20:43.859 --> 00:20:46.420
the incredible trajectory we've unpacked today.

00:20:46.480 --> 00:20:48.099
We've come a long way from a summer project.

00:20:48.220 --> 00:20:51.279
We really have. We started with a naive undergraduate

00:20:51.279 --> 00:20:54.759
summer project in 1966, thinking a computer could

00:20:54.759 --> 00:20:57.279
just look at a camera feed and print out the

00:20:57.279 --> 00:21:00.799
list of objects. And today we have a multi-billion

00:21:00.799 --> 00:21:04.470
dollar global infrastructure. We're leveraging

00:21:04.470 --> 00:21:07.390
the quantum mechanics of the photoelectric effect

00:21:07.390 --> 00:21:09.970
and the biological blueprints of the visual cortex

00:21:09.970 --> 00:21:13.089
to spot six strawberries, to navigate rovers

00:21:13.089 --> 00:21:15.789
across the Martian landscape, and to give robotic

00:21:15.789 --> 00:21:18.589
hands the physical ability to feel. And if there's

00:21:18.589 --> 00:21:20.569
one lingering thought I'd leave you with, it's

00:21:20.569 --> 00:21:23.319
this. Computer vision is no longer limited to

00:21:23.319 --> 00:21:25.079
the visible light spectrum. What do you mean?

00:21:25.500 --> 00:21:27.799
Well, these processing pipelines are ingesting

00:21:27.799 --> 00:21:30.359
multi-dimensional data from medical CT scanners,

00:21:30.559 --> 00:21:32.740
hyperspectral imagers, and ground penetrating

00:21:32.740 --> 00:21:35.259
radar. Oh, wow. They are quite literally seeing

00:21:35.259 --> 00:21:37.660
in wavelengths and dimensions that the biological

00:21:37.660 --> 00:21:40.839
human eye cannot physically comprehend. We have

00:21:40.839 --> 00:21:44.059
spent 60 years trying to teach machines to mimic

00:21:44.059 --> 00:21:46.660
our limited understanding of the world. Right.

00:21:46.859 --> 00:21:49.400
But are we rapidly approaching a point where

00:21:49.400 --> 00:21:52.700
machines won't just mimic us, but will eventually

00:21:52.700 --> 00:21:55.740
uncover, map, and act upon a version of reality

00:21:55.740 --> 00:21:59.960
that is completely invisible to us? Wow. A reality

00:21:59.960 --> 00:22:01.579
right in front of us that we just don't have

00:22:01.579 --> 00:22:03.519
the hardware to perceive. It brings us right

00:22:03.519 --> 00:22:05.880
back to that alarm clock in the morning. Exactly.

00:22:06.099 --> 00:22:09.660
We open our eyes, and we naturally assume the

00:22:09.660 --> 00:22:12.599
world we see is the only world that exists. We

00:22:12.599 --> 00:22:15.099
take the puzzle of perception completely for

00:22:15.099 --> 00:22:17.410
granted. But the machines, they're seeing much

00:22:17.410 --> 00:22:19.869
more. They are just opening their eyes, and what

00:22:19.869 --> 00:22:22.109
they are starting to see is vastly bigger than

00:22:22.109 --> 00:22:24.710
we ever imagined. Thanks for joining us on this

00:22:24.710 --> 00:22:26.950
journey. Keep questioning what you see and we'll

00:22:26.950 --> 00:22:28.049
catch you on the next deep dive.