WEBVTT

00:00:00.000 --> 00:00:04.259
Imagine trying to fit the entire Library of Congress

00:00:04.259 --> 00:00:07.580
into a single standard size shoebox. Right. I

00:00:07.580 --> 00:00:09.640
mean, you literally just can't do it. You can't.

00:00:09.640 --> 00:00:12.880
You have to start ripping out pages or like shredding

00:00:12.880 --> 00:00:15.140
the covers, printing the text microscopically.

00:00:15.240 --> 00:00:17.160
And eventually the structural integrity of the

00:00:17.160 --> 00:00:20.219
box just explodes. Exactly. Just gives up. Because

00:00:20.219 --> 00:00:22.199
you don't magically change the physics of the

00:00:22.199 --> 00:00:24.879
cardboard, right? You're forced to build an entirely

00:00:24.879 --> 00:00:27.070
different kind of container. And back in the

00:00:27.070 --> 00:00:29.789
1960s, software engineers essentially tried to

00:00:29.789 --> 00:00:32.170
do exactly this with human language. Yeah, they

00:00:32.170 --> 00:00:35.490
really did. And the architectural fallout from

00:00:35.490 --> 00:00:38.030
that decision is something we are literally still

00:00:38.030 --> 00:00:40.609
dealing with today. It's basically the ultimate

00:00:40.609 --> 00:00:43.289
story of architectural growing pains. You know,

00:00:43.289 --> 00:00:45.149
when you build a foundation for a small house

00:00:45.149 --> 00:00:47.829
and then suddenly decide that that house actually

00:00:47.829 --> 00:00:50.829
needs to be a 100-story skyscraper. The workarounds

00:00:50.829 --> 00:00:53.969
get messy. Incredibly complex, yeah. So, welcome

00:00:53.969 --> 00:00:56.429
to today's Deep Dive. Our mission for you today

00:00:56.429 --> 00:01:00.210
is pretty simple. We are going to explore the

00:01:00.210 --> 00:01:03.710
hidden, often messy architectural choices that

00:01:03.710 --> 00:01:06.370
allow your computer to speak every language on

00:01:06.370 --> 00:01:08.569
Earth. And it's quite the journey. Oh, it is.

00:01:08.650 --> 00:01:11.310
We're going from the cramped 8-bit data limits

00:01:11.310 --> 00:01:15.530
of the 1960s to the vast, flexible systems running

00:01:15.530 --> 00:01:19.260
today's global software. And if we connect this

00:01:19.260 --> 00:01:21.799
to the bigger picture, the stakes for you are

00:01:21.799 --> 00:01:24.299
remarkably high. Every single time you read an

00:01:24.299 --> 00:01:27.140
international news site or, you know, text a

00:01:27.140 --> 00:01:28.799
friend in another language. Or even just run

00:01:28.799 --> 00:01:31.180
a piece of code. Right. You are relying entirely

00:01:31.180 --> 00:01:33.700
on the evolution of what we call the wide character.

00:01:34.159 --> 00:01:36.959
Because without it, global digital communication

00:01:36.959 --> 00:01:39.340
literally breaks down into unreadable gibberish.

00:01:39.480 --> 00:01:41.840
Which nobody wants. So let's set a baseline here.

00:01:42.280 --> 00:01:45.659
A wide character is, at its core, a computer character

00:01:45.659 --> 00:01:47.859
data type that has a size greater than the traditional

00:01:47.859 --> 00:01:50.060
8-bit character. It's physically larger in memory.

00:01:50.420 --> 00:01:52.859
Exactly. It was engineered specifically to accommodate

00:01:52.859 --> 00:01:56.500
larger coded character sets. But to really grasp

00:01:56.500 --> 00:01:59.719
why that leap to wider characters was so revolutionary,

00:02:00.500 --> 00:02:04.109
we have to look at the tiny boxes computers

00:02:04.109 --> 00:02:06.709
originally forced language into. We have to go

00:02:06.709 --> 00:02:09.750
back to the 1960s. Yes. OK, let's unpack this.

00:02:09.810 --> 00:02:12.550
Take us back. So in the 60s, mainframe and

00:02:12.550 --> 00:02:15.389
minicomputer manufacturers were all just trying to

00:02:15.389 --> 00:02:18.349
establish some common ground because before this

00:02:18.349 --> 00:02:20.330
point, different computers used completely different

00:02:20.330 --> 00:02:22.590
sizes for their basic data chunks. Like it was

00:02:22.590 --> 00:02:25.490
the Wild West. It really was. Some used six bits,

00:02:25.490 --> 00:02:29.430
some used nine. But the industry, heavily influenced

00:02:29.430 --> 00:02:32.650
by systems like the IBM System/360, began to

00:02:32.650 --> 00:02:34.469
standardize around the 8-bit byte. And that

00:02:34.469 --> 00:02:37.030
became the standard smallest fundamental data

00:02:37.030 --> 00:02:39.250
type, the building block. Exactly. That was the

00:02:39.250 --> 00:02:41.449
foundational block for storing information. But

00:02:41.449 --> 00:02:43.210
they weren't using all eight of those bits for

00:02:43.210 --> 00:02:45.210
the actual letters. Right. Right. Like this goes

00:02:45.210 --> 00:02:47.169
back to the 7-bit ASCII standard. Right. The

00:02:47.169 --> 00:02:49.550
7-bit ASCII standard. Which was the industry

00:02:49.550 --> 00:02:51.729
standard for encoding alphanumeric characters,

00:02:52.150 --> 00:02:55.599
mostly for those old teletype machines and early

00:02:55.599 --> 00:02:59.180
super clunky computer terminals. Right. And seven

00:02:59.180 --> 00:03:02.840
bits gave them exactly 128 possible combinations.

00:03:03.460 --> 00:03:05.939
Which is not a lot. It's really not. It was just

00:03:05.939 --> 00:03:08.680
enough to encode the uppercase and lowercase

00:03:08.680 --> 00:03:11.599
English alphabet, numbers 0 through 9, basic

00:03:11.599 --> 00:03:14.680
punctuation, and a few control characters. Like

00:03:14.680 --> 00:03:16.780
what, carriage return? Yeah, carriage return

00:03:16.780 --> 00:03:20.000
to tell the printer to physically go to the next

00:03:20.000 --> 00:03:22.780
line. Okay, so seven bits for the text, which

00:03:22.780 --> 00:03:25.800
leaves exactly one bit left over in our 8-bit

00:03:25.800 --> 00:03:29.400
box. That eighth bit. A famous eighth bit. Historically,

00:03:29.439 --> 00:03:32.180
this was used as a parity bit, which I like to

00:03:32.180 --> 00:03:33.960
think of as the digital equivalent of packing

00:03:33.960 --> 00:03:35.800
peanuts. Packing peanuts, okay, I like that.

00:03:35.979 --> 00:03:38.259
Yeah, because you're shipping a delicate package,

00:03:38.460 --> 00:03:41.259
which is your data, and you use that extra space

00:03:41.259 --> 00:03:44.400
in the box entirely for safety, just to make

00:03:44.400 --> 00:03:46.360
sure the package arrives intact. That's a great

00:03:46.360 --> 00:03:48.199
way to look at it. To take that analogy a step

00:03:48.199 --> 00:03:50.800
further into the actual mechanism, here is how

00:03:50.800 --> 00:03:52.979
those packing peanuts actually worked. In the

00:03:52.979 --> 00:03:55.360
early days of computing, hardware was incredibly

00:03:55.360 --> 00:03:59.379
noisy and unreliable. Electromagnetic interference

00:03:59.379 --> 00:04:02.860
could literally flip a zero to a one while data

00:04:02.860 --> 00:04:05.400
was moving through a wire. Oh, wow. Just randomly

00:04:05.400 --> 00:04:08.520
in transit. Static on the line, yeah. So the

00:04:08.520 --> 00:04:10.879
parity bit was a clever math trick to catch that.

00:04:11.500 --> 00:04:14.240
If you had seven bits of data, the computer would

00:04:14.240 --> 00:04:16.600
count how many ones were in that data. Right.

00:04:16.779 --> 00:04:19.279
If the number of ones was an even number, it

00:04:19.279 --> 00:04:22.120
would set the eighth bit to a zero. If it was

00:04:22.120 --> 00:04:25.279
odd, it set the eighth bit to a one. So it's forcing the

00:04:25.279 --> 00:04:27.860
math. The total number of 1s in the entire 8-bit

00:04:27.860 --> 00:04:30.060
byte was always an even number. Precisely.

00:04:30.199 --> 00:04:32.779
It always added up to even. So when the receiving

00:04:32.779 --> 00:04:35.639
computer got the data, it counted the 1s. If

00:04:35.639 --> 00:04:37.779
it counted an odd number, it immediately knew,

00:04:37.779 --> 00:04:40.620
hey, a bit got flipped in transit. This data

00:04:40.620 --> 00:04:43.500
is corrupted. Send it again. Exactly. It was a

00:04:43.500 --> 00:04:45.920
brilliantly simple error detection mechanism.
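
NOTE
The even-parity scheme described above can be sketched in a few lines of Python.
This is an illustration only: the function names are hypothetical, and placing
the parity bit in the top bit position is an assumption, since real links differed.
```python
def even_parity_bit(value7):
    """Return the bit that makes the total count of 1s even."""
    if not 0 <= value7 < 128:
        raise ValueError("expected a 7-bit value")
    return bin(value7).count("1") % 2
def add_parity(value7):
    """Pack a 7-bit value plus its even-parity bit into one byte."""
    return (even_parity_bit(value7) << 7) | value7
def looks_intact(byte):
    """True when the byte still holds an even number of 1s."""
    return bin(byte).count("1") % 2 == 0
sent = add_parity(ord("C"))      # 'C' is 0b1000011, three 1s, so the parity bit is 1
flipped = sent ^ 0b00000100      # simulate line noise flipping a single bit
print(looks_intact(sent), looks_intact(flipped))
```
A single flipped bit is caught, but two flipped bits cancel out and go unnoticed,
which is why parity is error detection rather than error correction.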

00:04:46.040 --> 00:04:48.420
But as computing technology improved, the hardware

00:04:48.420 --> 00:04:50.639
got much more reliable, right? The cables got

00:04:50.639 --> 00:04:53.180
better, the weird interference dropped. Yeah,

00:04:53.560 --> 00:04:56.639
considerably, yeah. And simultaneously, computers

00:04:56.639 --> 00:04:59.610
started to go global. Manufacturers looked at

00:04:59.610 --> 00:05:02.050
their 8-bit box and realized, well, the English

00:05:02.050 --> 00:05:03.990
alphabet isn't going to cut it in Europe or Asia.

00:05:04.189 --> 00:05:06.189
No, they needed much more room. So they dumped

00:05:06.189 --> 00:05:08.509
the packing peanuts. They ditched the parity

00:05:08.509 --> 00:05:11.230
bit entirely. And by freeing up that eighth bit,

00:05:11.290 --> 00:05:13.829
they doubled the capacity of the byte from 128

00:05:13.829 --> 00:05:17.750
combinations to 256. Just by removing the safety

00:05:17.750 --> 00:05:21.110
material. Exactly. This gave us the famous 8-bit

00:05:21.110 --> 00:05:23.290
extensions that became commonplace in the

00:05:23.290 --> 00:05:26.990
70s and 80s. Things like IBM code page 37 or

00:05:26.990 --> 00:05:30.910
PETSCII for Commodore machines and the ISO 8859

00:05:30.910 --> 00:05:34.129
standards. Suddenly those early terminals had

00:05:34.129 --> 00:05:36.769
support for Greek, Cyrillic, Hebrew, and a bunch

00:05:36.769 --> 00:05:38.850
of other regional alphabets. Okay, so you've

00:05:38.850 --> 00:05:40.589
doubled the capacity of your box, but here's

00:05:40.589 --> 00:05:42.810
the driving question for this whole deep dive.

00:05:43.790 --> 00:05:46.230
If we freed up the eighth bit and we got all

00:05:46.230 --> 00:05:48.329
these new alphabets, why wasn't that enough?

00:05:48.769 --> 00:05:51.850
Well, because the math still maxes out at 256

00:05:51.850 --> 00:05:54.970
combinations. Right. The fatal flaw of 8-bit

00:05:54.970 --> 00:05:57.529
extensions is that they were entirely region-specific.

00:05:57.529 --> 00:06:00.730
So you had one lookup table for Greek,

00:06:00.990 --> 00:06:03.230
a completely different lookup table for Cyrillic,

00:06:03.449 --> 00:06:06.209
and another one for Arabic. But they all shared

00:06:06.209 --> 00:06:09.170
the exact same limited numeric space. Exactly.

00:06:09.329 --> 00:06:12.310
The number 200 might mean a Greek omega on one

00:06:12.310 --> 00:06:15.230
computer, but on a Russian computer, that exact

00:06:15.230 --> 00:06:18.269
same number 200 meant a Cyrillic letter. Which

00:06:18.269 --> 00:06:20.389
leads to what we call destructive translation,

00:06:20.709 --> 00:06:22.550
or what's affectionately known in the industry

00:06:22.550 --> 00:06:26.000
as mojibake. Mojibake, yes, it's a great word

00:06:26.000 --> 00:06:29.379
for a terrible problem. Mojibake is that phenomenon

00:06:29.379 --> 00:06:32.100
where you open a text file or an email and instead

00:06:32.100 --> 00:06:34.540
of readable text, it's a completely random string

00:06:34.540 --> 00:06:37.160
of wingdings, question marks, and accented gibberish.

00:06:37.300 --> 00:06:39.600
Because the underlying binary data hasn't changed,

00:06:39.839 --> 00:06:41.660
like the sending computer sent the number 200,

00:06:42.079 --> 00:06:44.120
but the receiving computer is using the wrong

00:06:44.120 --> 00:06:46.860
regional lookup table. Right, it's blindly translating

00:06:46.860 --> 00:06:49.519
that number into whatever symbol happens to sit

00:06:49.519 --> 00:06:52.579
at slot 200 in its local directory. And if you

00:06:52.579 --> 00:06:55.250
try to actively convert data into a target set

00:06:55.250 --> 00:06:57.129
that didn't even have a slot for the character

00:06:57.129 --> 00:06:59.009
you were using. The system would just replace

00:06:59.009 --> 00:07:01.149
it with a generic question mark. And the original

00:07:01.149 --> 00:07:03.709
data was mathematically destroyed. You could

00:07:03.709 --> 00:07:06.290
never get it back. It was a wildly unsustainable

00:07:06.290 --> 00:07:08.670
way to build a global communication network.
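
NOTE
A rough Python sketch of the lookup-table clash and the lossy conversion just
described. The two ISO 8859 tables are picked only as examples of regional code
pages, and the letters that land on byte 200 depend on which tables are compared.
```python
raw = bytes([200])                      # one byte, value 200 (0xC8)
as_greek = raw.decode("iso8859_7")      # ISO 8859-7, a Greek table
as_cyrillic = raw.decode("iso8859_5")   # ISO 8859-5, a Cyrillic table
print(as_greek, as_cyrillic)            # same byte, two different letters
# Converting into a table that has no slot for the character is lossy:
lossy = "\u03a9".encode("iso8859_5", errors="replace")   # Greek capital omega
print(lossy)                            # the omega has been replaced by b'?'
```
Once the omega has become a question mark, nothing in the output records which
character it used to be.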

00:07:08.970 --> 00:07:11.290
I mean, you couldn't even have a single document

00:07:11.290 --> 00:07:14.250
with both Greek and Russian characters without

00:07:14.250 --> 00:07:17.790
writing incredibly complex, fragile, special

00:07:17.790 --> 00:07:20.089
conversion routines. Just telling the computer

00:07:20.089 --> 00:07:22.610
to constantly swap its lookup tables mid-sentence.

00:07:22.790 --> 00:07:25.689
Which is a nightmare. Which brings us to 1989.

00:07:26.189 --> 00:07:28.769
The International Organization for Standardization,

00:07:28.870 --> 00:07:32.009
or ISO, begins work on the Universal Character

00:07:32.009 --> 00:07:36.209
Set, or UCS. The dream of one massive multilingual

00:07:36.209 --> 00:07:39.170
character set to rule them all. No more swapping

00:07:39.170 --> 00:07:41.050
tables, right? No more swapping. Every single

00:07:41.050 --> 00:07:42.949
character from every single language gets its

00:07:42.949 --> 00:07:45.509
own permanent unique mathematical number. But

00:07:45.509 --> 00:07:48.329
to achieve that, UCS required encoding values

00:07:48.329 --> 00:07:51.870
using either 16-bit, which is 2 bytes, or 32-bit,

00:07:51.870 --> 00:07:54.069
which is 4 bytes. Right. And the 8-bit

00:07:54.069 --> 00:07:56.350
box finally shatters. It just physically cannot

00:07:56.350 --> 00:07:59.149
hold values that large. So to store these new

00:07:59.149 --> 00:08:01.970
massive character values in active memory, you

00:08:01.970 --> 00:08:04.170
needed a data type fundamentally larger than

00:08:04.170 --> 00:08:07.189
8 bits. And thus, the term wide character was

00:08:07.189 --> 00:08:10.410
born, specifically to differentiate these new

00:08:10.410 --> 00:08:13.029
expansive data types from the traditional 8-bit

00:08:13.029 --> 00:08:16.029
ones. OK. So we made the data type bigger to

00:08:16.029 --> 00:08:18.370
fit the world's alphabets. We have a 16-bit

00:08:18.370 --> 00:08:21.649
or 32-bit wide character. Problem solved, right?

00:08:21.829 --> 00:08:24.089
Well, not exactly, because here's where it gets

00:08:24.089 --> 00:08:27.189
really interesting. Solving the storage problem

00:08:27.189 --> 00:08:29.930
accidentally highlighted a massive transmission

00:08:29.930 --> 00:08:32.190
problem. Yeah, it really did. To understand this,

00:08:32.309 --> 00:08:34.929
we need to clarify a crucial architectural distinction.

00:08:35.549 --> 00:08:38.029
When we say wide character, we are strictly talking

00:08:38.029 --> 00:08:40.820
about the size of the data type in the computer's

00:08:40.820 --> 00:08:43.940
memory, the RAM. A wide character does not state

00:08:43.940 --> 00:08:46.340
how each value in a character set is actually

00:08:46.340 --> 00:08:48.419
defined. Right, the definition of the values

00:08:48.419 --> 00:08:50.720
is the job of the character sets themselves,

00:08:51.159 --> 00:08:53.899
like UCS or Unicode. The wide character is just

00:08:53.899 --> 00:08:56.100
the blank physical container sitting in memory,

00:08:56.279 --> 00:08:58.639
waiting to hold those massive definitions. Exactly.

00:08:58.740 --> 00:09:00.600
So your wide character is sitting comfortably

00:09:00.600 --> 00:09:03.159
in your computer's active memory, taking up 16

00:09:03.159 --> 00:09:06.440
or 32 bits of space, representing a complex kanji

00:09:06.440 --> 00:09:09.200
character or a modern emoji. Wait, I'm going

00:09:09.200 --> 00:09:11.919
to push back here. Yeah. Because if a 32-bit

00:09:11.919 --> 00:09:14.940
wide character solves all the space issues and

00:09:14.940 --> 00:09:17.639
prevents destructive translation, why didn't

00:09:17.639 --> 00:09:20.500
the entire computing industry just agree to upgrade

00:09:20.500 --> 00:09:23.169
the pipes? What do you mean? Like, why not just

00:09:23.169 --> 00:09:25.049
make the cables, the routers, and the internet

00:09:25.049 --> 00:09:28.090
protocols 32-bit across the board so we can

00:09:28.090 --> 00:09:30.669
just send these wide characters natively? Uh,

00:09:31.570 --> 00:09:34.139
because you are talking about replacing trillions

00:09:34.139 --> 00:09:37.299
of dollars of global infrastructure. By the time

00:09:37.299 --> 00:09:39.179
wide characters were invented, the world had

00:09:39.179 --> 00:09:41.460
already spent decades laying undersea cables,

00:09:41.899 --> 00:09:43.960
launching satellites, and writing network protocols

00:09:43.960 --> 00:09:46.860
that were fundamentally physically hardwired

00:09:46.860 --> 00:09:50.080
to process data in 8 -bit chunks. You can't just

00:09:50.080 --> 00:09:52.559
flip a software switch to fix that. No, you can't

00:09:52.559 --> 00:09:55.039
make an 8-bit physical router suddenly swallow

00:09:55.039 --> 00:09:58.019
a 32-bit object. If you try to shove a 32-bit

00:09:58.019 --> 00:10:00.240
chunk of data down a pipe strictly designed for

00:10:00.240 --> 00:10:03.120
8-bit symbols, the hardware gets confused, it

00:10:03.120 --> 00:10:05.179
misreads the boundaries of the data, and the

00:10:05.179 --> 00:10:07.519
transmission completely fails. So we have massive

00:10:07.519 --> 00:10:10.139
characters in memory, but tiny pipes for the

00:10:10.139 --> 00:10:12.200
internet. How do we get the data from point A

00:10:12.200 --> 00:10:14.820
to point B? The engineering workaround for this

00:10:14.820 --> 00:10:17.899
transmission bottleneck is the multibyte character

00:10:17.899 --> 00:10:20.539
encoding. And the most famous example of this

00:10:20.539 --> 00:10:24.919
today is UTF-8. Ah, multibyte. Because we

00:10:24.919 --> 00:10:28.019
lack those wide data paths, multibyte encoding

00:10:28.019 --> 00:10:31.799
systems use multiple 8-bit bytes in a row to

00:10:31.799 --> 00:10:34.639
encode a value that is simply too large for a

00:10:34.639 --> 00:10:36.899
single 8-bit symbol. Right. They break it down

00:10:36.899 --> 00:10:39.580
for transit. And the mechanism behind it is brilliant.

00:10:39.580 --> 00:10:41.919
Right. Like in a multibyte system like UTF-8,

00:10:41.919 --> 00:10:45.440
the very first bits of the byte act as a

00:10:45.440 --> 00:10:47.740
signal to the receiving computer, right? Yes.

00:10:48.279 --> 00:10:50.120
If the byte starts with a zero, the computer

00:10:50.120 --> 00:10:52.919
knows, OK, this is a standard old-school 8-bit

00:10:52.919 --> 00:10:55.480
character. I can read it immediately. But if

00:10:55.480 --> 00:10:57.960
the byte starts with a specific sequence, like

00:10:57.960 --> 00:11:00.759
110, the computer knows, wait, this is incomplete.

00:11:00.840 --> 00:11:02.720
This is part of a larger character. I need to

00:11:02.720 --> 00:11:04.919
grab the next byte. Exactly. I need to grab the

00:11:04.919 --> 00:11:07.019
next byte and read them together to figure out

00:11:07.019 --> 00:11:08.500
the math. You know, the C programming standard

00:11:08.500 --> 00:11:10.700
actually officially splits these two concepts

00:11:10.700 --> 00:11:13.039
up, which I find incredibly helpful for visualizing

00:11:13.039 --> 00:11:15.039
this. Oh, the distinction between multibyte and

00:11:15.039 --> 00:11:18.700
wide. Yeah. According to the C standard, multibyte

00:11:18.700 --> 00:11:21.559
encodings, those variable-length chains of 8-bit

00:11:21.559 --> 00:11:24.500
chunks, are primarily used in source code,

00:11:24.919 --> 00:11:27.379
external files, and network transmission. Whereas

00:11:27.379 --> 00:11:31.559
wide characters, the fixed massive 16 or 32-bit

00:11:31.559 --> 00:11:34.539
containers, are the runtime representations of

00:11:34.539 --> 00:11:37.100
characters sitting in single objects. They exist

00:11:37.100 --> 00:11:40.600
solely in the active volatile memory of the system.
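
NOTE
The lead-byte signalling in UTF-8 mentioned a moment ago can be sketched like
this. The helper name is hypothetical and this is not a full validator: it only
reads the high bits of the first byte to guess how many bytes belong together.
```python
def sequence_length(lead_byte):
    """Guess the sequence length from the lead byte's high bits."""
    if lead_byte >> 7 == 0b0:
        return 1          # 0xxxxxxx: single-byte ASCII
    if lead_byte >> 5 == 0b110:
        return 2          # 110xxxxx: one continuation byte follows
    if lead_byte >> 4 == 0b1110:
        return 3          # 1110xxxx: two continuation bytes follow
    if lead_byte >> 3 == 0b11110:
        return 4          # 11110xxx: three continuation bytes follow
    raise ValueError("continuation or invalid byte")
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    packed = ch.encode("utf-8")            # multibyte form for files and networks
    assert sequence_length(packed[0]) == len(packed)
    print(f"U+{ord(ch):04X}", packed.hex(" "), len(packed))
```
The code point printed on the left is the single in-memory value, and the hex
bytes next to it are the variable-length form used for storage and transit.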

00:11:40.759 --> 00:11:43.830
Right. It's basically like buying IKEA furniture.

00:11:44.009 --> 00:11:47.029
IKEA? Yeah, hear me out. Multibyte is the flat

00:11:47.029 --> 00:11:49.370
pack you shove in your car for transit. Okay.

00:11:49.549 --> 00:11:51.529
And the wide character is the fully assembled

00:11:51.529 --> 00:11:53.830
bookshelf sitting in your living room's active

00:11:53.830 --> 00:11:56.779
memory. OK, that analogy actually holds up perfectly.

00:11:57.179 --> 00:11:59.299
The assembly process is the runtime conversion

00:11:59.299 --> 00:12:01.919
your processor performs. Exactly. But here is

00:12:01.919 --> 00:12:03.639
where the history of tech giants makes things

00:12:03.639 --> 00:12:06.019
really complicated. Because different operating

00:12:06.019 --> 00:12:08.279
systems decided to build their living rooms and

00:12:08.279 --> 00:12:10.419
assemble those bookshelves at very different times

00:12:10.419 --> 00:12:12.980
in computing history. Right. And it created completely

00:12:12.980 --> 00:12:14.980
different architectural philosophies that still

00:12:14.980 --> 00:12:17.419
clash today. The great schism. The great schism

00:12:17.419 --> 00:12:20.639
indeed. Let's look at the early adopters. Systems

00:12:20.639 --> 00:12:23.519
like Microsoft Windows, the .NET framework, and

00:12:23.519 --> 00:12:25.360
the Java programming language. The heavy hitters.

00:12:25.440 --> 00:12:27.899
Right. In the early 1990s, they were eager to

00:12:27.899 --> 00:12:29.559
solve the internationalization problem. They

00:12:29.559 --> 00:12:31.960
wanted to be global immediately. So they jumped

00:12:31.960 --> 00:12:34.980
on the very early version of the universal character

00:12:34.980 --> 00:12:38.740
set, specifically a standard called UCS-2, which

00:12:38.740 --> 00:12:42.159
essentially became Unicode 1.0. And UCS-2 was

00:12:42.159 --> 00:12:45.220
a strict 16-bit system. Right. It offered 65,536

00:12:45.220 --> 00:12:49.840
possible characters. And at the time, engineers

00:12:49.840 --> 00:12:52.240
genuinely believed that was more than enough

00:12:52.240 --> 00:12:55.019
space to encode every single living human language

00:12:55.019 --> 00:12:58.620
with room to spare. Oh, the hubris. I know. So

00:12:58.620 --> 00:13:01.019
Windows and Java lock their foundational architecture

00:13:01.019 --> 00:13:04.000
into a 16-bit wide character. In their systems,

00:13:04.440 --> 00:13:06.840
the default wide character type, like wchar_t

00:13:06.840 --> 00:13:10.679
in C++ on Windows, or char in Java, was hard-coded

00:13:10.679 --> 00:13:13.190
to be exactly 16-bit. Which seemed like

00:13:13.190 --> 00:13:15.230
plenty of space, but then we get the historical

00:13:15.230 --> 00:13:17.629
plot twist. Unicode didn't just stay with living

00:13:17.629 --> 00:13:19.970
languages. No, it grew immensely. Right, we get

00:13:19.970 --> 00:13:23.850
the 1996 update, Unicode 2.0, and suddenly they

00:13:23.850 --> 00:13:27.289
are adding dead historic scripts, complex mathematical

00:13:27.289 --> 00:13:30.049
symbols, and eventually the thousands of emojis

00:13:30.049 --> 00:13:32.490
we use today. The full range of human expression

00:13:32.490 --> 00:13:36.129
blew way past the 65,000 character limit. It

00:13:36.129 --> 00:13:39.470
expanded to require 21 bits of space. So what

00:13:39.470 --> 00:13:41.980
happens to Windows and Java? They are already

00:13:41.980 --> 00:13:44.580
locked into an architecture that physically only

00:13:44.580 --> 00:13:47.980
has 16 bits of space per character. They get

00:13:47.980 --> 00:13:50.460
stuck in what we can call the 16-bit trap because

00:13:50.460 --> 00:13:53.259
they can't easily tear out the foundational architecture

00:13:53.259 --> 00:13:55.620
of their entire operating system without breaking

00:13:55.620 --> 00:13:58.159
millions of legacy programs. It would be catastrophic.

00:13:58.240 --> 00:14:00.899
It would. So these systems now have to rely on

00:14:00.899 --> 00:14:03.019
a complex workaround called surrogate pairs.

00:14:03.559 --> 00:14:05.620
Okay, break down how a surrogate pair actually

00:14:05.620 --> 00:14:07.860
functions under the hood because it sounds messy.

00:14:08.159 --> 00:14:11.500
Oh, it is. Remember how we had 65,000 possible

00:14:11.500 --> 00:14:14.080
combinations in the 16-bit space? Yeah. Engineers

00:14:14.080 --> 00:14:16.679
went in and permanently reserved a specific block

00:14:16.679 --> 00:14:19.340
of those numbers. They explicitly said, these

00:14:19.340 --> 00:14:21.440
numbers no longer represent actual printable

00:14:21.440 --> 00:14:23.879
characters. Instead, they act as warning flags.

00:14:24.299 --> 00:14:27.259
When the Windows operating system reads a 16-bit

00:14:27.259 --> 00:14:29.700
wide character and sees that it falls into

00:14:29.700 --> 00:14:32.799
this reserved high surrogate block, the software

00:14:32.799 --> 00:14:36.059
knows it cannot print a letter yet. It has to

00:14:36.059 --> 00:14:39.559
wait. Yes. It must hold that data in suspension,

00:14:39.740 --> 00:14:42.779
grab the next 16-bit wide character, combine

00:14:42.779 --> 00:14:45.740
the mathematical values of both, and then look

00:14:45.740 --> 00:14:48.299
up the resulting massive character. So they are

00:14:48.299 --> 00:14:51.860
essentially duct-taping two 16-bit boxes together

00:14:51.860 --> 00:14:55.340
just to store one single modern Unicode character.
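
NOTE
The arithmetic behind that pairing can be sketched as follows, using the
surrogate ranges from UTF-16 (high D800..DBFF, low DC00..DFFF). The function
names are made up for illustration and are not from any particular library.
```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a high and low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = code_point - 0x10000          # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)         # top 10 bits, range D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits, range DC00..DFFF
    return high, low
def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
high, low = to_surrogates(0x1F600)         # U+1F600, a grinning face emoji
print(hex(high), hex(low))
```
Each half carries 10 bits of the value, which is how two 16-bit units end up
covering everything beyond the original 65,536 slots.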

00:14:55.419 --> 00:14:58.039
Like a smiling emoji. Basically, yeah. It's a

00:14:58.039 --> 00:15:00.460
massive workaround. It gets very messy because

00:15:00.460 --> 00:15:02.379
it breaks a fundamental programming assumption.

00:15:02.839 --> 00:15:05.559
Their wide character data types no longer map

00:15:05.559 --> 00:15:08.100
one-to-one with actual printed characters.

00:15:08.320 --> 00:15:10.759
No, not at all. One printed letter might be one

00:15:10.759 --> 00:15:12.799
wide character in memory, or it might be two.

00:15:12.919 --> 00:15:15.279
Which forces programmers to write highly defensive

00:15:15.279 --> 00:15:17.940
complex code just to count how many letters are

00:15:17.940 --> 00:15:21.019
in a word. Exactly. Meanwhile, you have the Unix-like

00:15:21.019 --> 00:15:23.399
systems, Linux, macOS, sitting across the

00:15:23.399 --> 00:15:25.539
aisle. What was their strategy? Well, they took

00:15:25.539 --> 00:15:27.980
a wait-and-see approach. They did. Unix-like

00:15:27.980 --> 00:15:30.259
systems generally waited until the dust settled

00:15:30.259 --> 00:15:33.580
on the Unicode expansion. When they finally standardized,

00:15:33.860 --> 00:15:37.659
they adopted a massive 32-bit wchar_t as

00:15:37.659 --> 00:15:40.940
prescribed by the C90 standard. Why? By going

00:15:40.940 --> 00:15:43.039
straight to 32 bits, they gave themselves enough

00:15:43.039 --> 00:15:45.879
room to comfortably fit the entire modern 21-bit

00:15:45.879 --> 00:15:49.879
Unicode code point into a single solitary

00:15:49.879 --> 00:15:52.700
wide character container. So no surrogate pair

00:15:52.700 --> 00:15:55.379
is required. Every wide character is exactly

00:15:55.379 --> 00:15:57.899
one printed symbol. Precisely. So if we look

00:15:57.899 --> 00:16:00.840
at the modern landscape today, for you the listener

00:16:00.840 --> 00:16:03.080
navigating these systems, you're dealing with

00:16:03.080 --> 00:16:05.299
two very different preferences born entirely

00:16:05.299 --> 00:16:07.679
from this history. Operating systems heavily

00:16:07.679 --> 00:16:10.559
influenced by Unicode 1.0, like Windows, tend

00:16:10.559 --> 00:16:12.759
to prefer using wide strings made up of these

00:16:12.759 --> 00:16:15.519
16-bit character units. Yes, whereas Unix-like

00:16:15.519 --> 00:16:17.960
systems, despite having that massive 32-bit

00:16:17.960 --> 00:16:20.080
wide character available to them, actually tend

00:16:20.080 --> 00:16:22.659
to retain the old 8-bit narrow string convention

00:16:22.659 --> 00:16:24.980
for handling text. Wait, really? Even in memory?

00:16:25.179 --> 00:16:27.899
Yeah. Because UTF-8 multibyte encoding became

00:16:27.899 --> 00:16:30.159
so efficient and universally adopted on the internet,

00:16:30.580 --> 00:16:32.519
Unix systems prefer to just keep the text in

00:16:32.519 --> 00:16:34.740
that variable multibyte format, even in memory.

00:16:35.000 --> 00:16:37.340
Going back to the analogy, they prefer to keep

00:16:37.340 --> 00:16:40.039
the furniture flat packed for as long as possible,

00:16:40.399 --> 00:16:43.539
only assembling it into a 32-bit wide character

00:16:43.539 --> 00:16:46.059
at the exact millisecond they need to manipulate

00:16:46.059 --> 00:16:48.460
it. That's a great way to put it. The historical

00:16:48.460 --> 00:16:50.440
circumstances of when an operating system was

00:16:50.440 --> 00:16:52.980
built absolutely dictate what types of encoding

00:16:52.980 --> 00:16:55.200
they prefer to process today. Which raises a

00:16:55.200 --> 00:16:58.059
really important question. How do modern programmers

00:16:58.059 --> 00:17:00.879
actually deal with this historical baggage? If

00:17:00.879 --> 00:17:03.659
the size of a wide character changes from 16

00:17:03.659 --> 00:17:06.640
bits on Windows to 32 bits on Mac, how do you

00:17:06.640 --> 00:17:08.339
write a piece of software that works everywhere?

00:17:08.859 --> 00:17:10.720
This is where we look at the language lottery.

00:17:11.059 --> 00:17:13.200
How different programming languages have actively

00:17:13.200 --> 00:17:15.720
tried to clean up this mess over the decades.

00:17:16.140 --> 00:17:18.839
And it really starts with the absolute wild west

00:17:18.839 --> 00:17:22.819
of C and C++. Oh man. Listen to how the

00:17:22.819 --> 00:17:25.819
original C90 standard defined the wide character

00:17:25.819 --> 00:17:28.119
data type, wchar_t. I have it right here. Go

00:17:28.119 --> 00:17:31.420
for it. It called it an integral type whose range

00:17:31.420 --> 00:17:33.859
of values can represent distinct codes for all

00:17:33.859 --> 00:17:36.000
members of the largest extended character set

00:17:36.000 --> 00:17:38.920
specified among the supported locales. Yeah,

00:17:39.299 --> 00:17:41.660
that is essentially a legal loophole masquerading

00:17:41.660 --> 00:17:44.240
as computer science. It really is. They basically

00:17:44.240 --> 00:17:46.859
defined it as make it as big as it needs to be

00:17:46.859 --> 00:17:48.660
for whatever system you happen to be running

00:17:48.660 --> 00:17:51.359
on. It was entirely implementation defined. Which

00:17:51.359 --> 00:17:54.170
creates a total nightmare for portability. If

00:17:54.170 --> 00:17:56.470
you write a program on a Mac expecting your wide

00:17:56.470 --> 00:17:59.130
character to hold 32 bits of data, and someone

00:17:59.130 --> 00:18:01.529
compiles that exact same code on an old Windows

00:18:01.529 --> 00:18:04.210
machine, where the compiler defines it as 16

00:18:04.210 --> 00:18:07.190
bits... Your program crashes. Right. It literally

00:18:07.190 --> 00:18:10.490
attempts to stuff 32 bits of math into a 16-bit

00:18:10.490 --> 00:18:14.210
physical box, causing memory overflows. The ambiguity

00:18:14.210 --> 00:18:17.089
was so dangerous that the ISO/IEC Unicode standard

00:18:17.089 --> 00:18:19.690
itself had to issue a staggering warning. What

00:18:19.690 --> 00:18:22.339
did it say? It explicitly stated... The width

00:18:22.339 --> 00:18:24.960
of wchar_t is compiler-specific and can be as

00:18:24.960 --> 00:18:27.359
small as eight bits. Programs that need to be

00:18:27.359 --> 00:18:30.900
portable across any C or C++ compiler should

00:18:30.900 --> 00:18:33.759
not use wchar_t for storing Unicode text. Wow.
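
NOTE
A minimal sketch of the portability problem described above, written in
Python rather than C. It assumes that ctypes.c_wchar has the same width as
the local C compiler's wchar_t, which is how the ctypes module documents it.
The names WCHAR_BITS and fits_in_one_unit are illustrative only.
```python
import ctypes
# ctypes.c_wchar mirrors the platform C compiler's wchar_t.
WCHAR_BYTES = ctypes.sizeof(ctypes.c_wchar)
WCHAR_BITS = WCHAR_BYTES * 8
# U+1F600 (an emoji) does not fit in 16 bits, so it only fits in a
# single wchar_t unit on platforms where wchar_t is 32 bits wide.
EMOJI_CODE_POINT = 0x1F600
fits_in_one_unit = EMOJI_CODE_POINT < (1 << WCHAR_BITS)
print(f"wchar_t is {WCHAR_BITS} bits here; U+1F600 fits in one unit: {fits_in_one_unit}")
```
The same script typically reports 16 bits on Windows and 32 bits on Linux
or macOS, which is exactly the ambiguity the warning is about.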

00:18:34.180 --> 00:18:36.180
The global standard itself is telling programmers,

00:18:36.480 --> 00:18:38.880
do not use the wide character type to store global

00:18:38.880 --> 00:18:40.680
text if you want your code to survive. Yeah,

00:18:40.680 --> 00:18:43.880
it was a massive red flag. So how did C and C++

00:18:43.880 --> 00:18:46.509
finally fix it? Well, it took until their 2011

00:18:46.509 --> 00:18:50.670
revisions. Both C and C++ finally introduced

00:18:50.670 --> 00:18:52.690
fixed -size character types to the language.

00:18:53.250 --> 00:18:56.190
They explicitly created char16_t for guaranteed

00:18:56.190 --> 00:18:59.849
16-bit storage and char32_t for guaranteed 32-bit

00:18:59.849 --> 00:19:02.589
storage. Which finally provided unambiguous

00:19:02.589 --> 00:19:05.250
representations. They left the old wchar_t

00:19:05.250 --> 00:19:07.990
in the language as a legacy artifact so old programs

00:19:07.990 --> 00:19:10.369
wouldn't break, but they gave modern programmers

00:19:10.369 --> 00:19:12.869
new, precise physical dimensions to work with.
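
NOTE
A rough Python illustration of what fixed-size code units mean in practice.
It is a stand-in for C code, not the C API itself: UTF-16 code units are the
size of a char16_t and UTF-32 code units are the size of a char32_t. The
helper count_units and the sample string are made up for this sketch.
```python
def count_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Count fixed-size code units after encoding (little-endian, no BOM)."""
    return len(text.encode(encoding)) // unit_bytes
# An ASCII letter, a Greek capital omega (inside the BMP), and an emoji (outside it).
sample = "a\u03a9\U0001F600"
utf16_units = count_units(sample, "utf-16-le", 2)  # 16-bit units, as in a char16_t array
utf32_units = count_units(sample, "utf-32-le", 4)  # 32-bit units, as in a char32_t array
print(utf16_units, utf32_units)
```
The emoji needs two 16-bit units (a surrogate pair) but only one 32-bit
unit, so the two counts differ even though the text is three characters.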

00:19:12.910 --> 00:19:15.529
OK, so that's C and C++ trying to patch

00:19:15.529 --> 00:19:18.250
the leaks. Let's look at Python's evolution,

00:19:18.390 --> 00:19:20.650
because I think it perfectly illustrates a massive

00:19:20.650 --> 00:19:23.470
realization in the tech industry regarding the

00:19:23.470 --> 00:19:25.829
wide character. Python's journey is fascinating

00:19:25.829 --> 00:19:28.869
here. It is. If you go back to Python 2.7, the

00:19:28.869 --> 00:19:30.990
language relied heavily on whatever the operating

00:19:30.990 --> 00:19:33.390
system dictated. Its character type was tied

00:19:33.390 --> 00:19:35.930
to the underlying C compiler's wchar_t. Still

00:19:35.930 --> 00:19:38.490
shackled to that unpredictable underlying system.

00:19:38.710 --> 00:19:41.170
Right. But then by Python 3.3, they realized

00:19:41.170 --> 00:19:43.730
this was inefficient. They introduced a flexibly

00:19:43.730 --> 00:19:46.349
sized storage system for strings. The language

00:19:46.349 --> 00:19:49.029
would dynamically look at a string of text, figure

00:19:49.029 --> 00:19:50.970
out the largest character in it, and then allocate

00:19:50.970 --> 00:19:53.740
memory based on that. Smart. But here's the real

00:19:53.740 --> 00:19:58.079
aha moment. As of Python 3.12, they dropped

00:19:58.079 --> 00:20:00.640
the use of wchar_t for Python strings entirely.

00:20:00.880 --> 00:20:03.279
It's a complete paradigm shift. It really is.
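
NOTE
A small sketch of the flexibly sized string storage mentioned for Python 3.3
and later. It assumes CPython, where sys.getsizeof reports the object's
allocated size; other interpreters may behave differently. The helper name
bytes_per_char is hypothetical.
```python
import sys
def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate per-character storage by comparing two freshly built strings."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n
samples = [("ascii", "a"), ("bmp", "\u03a9"), ("emoji", "\U0001F600")]
# Each string is stored at the width of its largest character.
widths = {name: bytes_per_char(ch) for name, ch in samples}
print(widths)
```
On CPython the three strings should come out at 1, 2, and 4 bytes per
character, which is the "figure out the largest character" behavior.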

00:20:03.599 --> 00:20:06.180
They realized that forcing text into these massive

00:20:06.180 --> 00:20:09.519
32-bit wide character arrays in memory is actually

00:20:09.519 --> 00:20:12.240
a massive waste of RAM. Oh, totally. Most text

00:20:12.240 --> 00:20:14.799
on the Internet is still in the standard ASCII

00:20:14.799 --> 00:20:17.809
range. English letters, basic numbers, which

00:20:17.809 --> 00:20:20.789
only requires one byte of storage. Right. So

00:20:20.789 --> 00:20:22.990
if you force an entire English paragraph into

00:20:22.990 --> 00:20:25.890
a 32-bit wide character array, you are inflating

00:20:25.890 --> 00:20:28.890
your memory usage to 400% with empty zero-filled

00:20:28.890 --> 00:20:31.970
bits. Just waste space. Exactly. So Python decided

00:20:31.970 --> 00:20:34.049
to lean entirely into multibyte flexibility.

00:20:34.630 --> 00:20:36.869
Now, they just build a UTF-8 multibyte

00:20:36.869 --> 00:20:39.990
string on demand and cache it. Modern CPUs are so incredibly

00:20:39.990 --> 00:20:42.970
fast at decoding UTF-8 on the fly that keeping

00:20:42.970 --> 00:20:45.289
a massive wide character in memory is just...

00:20:45.339 --> 00:20:48.079
obsolete. They moved from store it wide so we

00:20:48.079 --> 00:20:50.180
can access it fast to store it narrow because

00:20:50.180 --> 00:20:52.240
our processors are finally fast enough to decode

00:20:52.240 --> 00:20:54.500
it instantly. Yeah. And to add to that on the

00:20:54.500 --> 00:20:56.039
complete opposite end of the spectrum you have

00:20:56.039 --> 00:20:59.259
a modern language like Rust. Ooh, Rust took an

00:20:59.259 --> 00:21:02.180
intentional uncompromising rebellion against

00:21:02.180 --> 00:21:05.440
the decades of pain caused by C and C++ ambiguity.

00:21:05.630 --> 00:21:08.869
Absolutely. In Rust, a char data type is exactly

00:21:08.869 --> 00:21:12.150
32 bits, and it represents a valid Unicode scalar

00:21:12.150 --> 00:21:15.609
value. Period. No guessing. No compiler-specific

00:21:15.609 --> 00:21:19.309
baggage. No surrogate pairs. None. The designers

00:21:19.309 --> 00:21:22.509
of Rust looked at the historical mess of 16-bit

00:21:22.509 --> 00:21:25.450
traps, variable-size wchar_t types, and the

00:21:25.450 --> 00:21:27.970
constant fear of memory overflows, and they established

00:21:27.970 --> 00:21:31.710
an absolute physical dimension. 32 bits. It solves

00:21:31.710 --> 00:21:34.410
the portability issue immediately by refusing

00:21:34.410 --> 00:21:36.450
to compromise. Right. It's like they just built

00:21:36.450 --> 00:21:38.450
a living room so massive that you never have

00:21:38.450 --> 00:21:40.250
to flat pack the furniture ever again. Which

00:21:40.250 --> 00:21:42.650
is wildly inefficient for memory space, but it

00:21:42.650 --> 00:21:44.710
is incredibly safe for the programmer. Which

00:21:44.710 --> 00:21:47.369
showcases the eternal trade-off in computer science.
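
NOTE
A back-of-the-envelope sketch of the trade-off being discussed: variable
width UTF-8 storage versus a fixed 4 bytes per code point (the size of a
single Rust char or a char32_t). The function storage_costs and the sample
strings are assumptions made for illustration.
```python
def storage_costs(text: str) -> tuple:
    """Return (UTF-8 bytes, bytes at a fixed 4 per code point)."""
    return len(text.encode("utf-8")), 4 * len(text)
english = "hello, world"
mixed = "hi \U0001F600"
for label, text in [("english", english), ("mixed", mixed)]:
    utf8_bytes, fixed_bytes = storage_costs(text)
    print(f"{label}: utf-8={utf8_bytes} bytes, fixed 32-bit={fixed_bytes} bytes")
```
Pure ASCII text is four times larger in the fixed layout, while the gap
narrows as more characters fall outside the ASCII range.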

00:21:47.589 --> 00:21:49.369
You know, you can optimize for memory space like

00:21:49.369 --> 00:21:52.369
Python or you can optimize for unshakable stability

00:21:52.369 --> 00:21:55.470
like Rust. So what does this all mean? When we

00:21:55.470 --> 00:21:57.710
step back and look at the whole picture, a wide

00:21:57.710 --> 00:22:00.869
character isn't just a dry technical specification.

00:22:01.009 --> 00:22:03.490
No, it's really not. It is a living artifact.

00:22:03.710 --> 00:22:06.609
It is the fossil record of computers painfully

00:22:06.609 --> 00:22:09.410
learning to accommodate the sheer, messy volume

00:22:09.410 --> 00:22:12.789
of human language. We started in these cramped,

00:22:13.029 --> 00:22:16.049
8-bit cardboard boxes, tried to squeeze extra

00:22:16.049 --> 00:22:18.630
alphabets into the packing peanuts, broke the

00:22:18.630 --> 00:22:21.250
box entirely with destructive translation... Mojibake

00:22:21.250 --> 00:22:24.069
everywhere. Exactly. And eventually, we

00:22:24.069 --> 00:22:26.730
had to engineer these complex, 32-bit global

00:22:26.730 --> 00:22:29.730
memory systems just to say hello in every language.

00:22:30.029 --> 00:22:32.710
The real takeaway for you listening to this is

00:22:32.710 --> 00:22:35.970
to appreciate the staggering, invisible labor

00:22:35.970 --> 00:22:39.029
your devices perform every single day. It's mind

00:22:39.029 --> 00:22:41.730
blowing. It is. Whenever you type a character

00:22:41.730 --> 00:22:44.289
that isn't standard English, there is a massive

00:22:44.289 --> 00:22:46.490
mathematical translation effort happening under

00:22:46.490 --> 00:22:48.730
the hood. Your system is constantly switching

00:22:48.730 --> 00:22:51.289
between massive memory representations and narrow,

00:22:51.549 --> 00:22:53.910
multibyte transmission streams. Juggling surrogate

00:22:53.910 --> 00:22:56.089
pairs and historical architectures. Counting

00:22:56.089 --> 00:22:58.529
ones and zeros just to render that single text

00:22:58.529 --> 00:23:01.559
message accurately on your screen. It is a minor

00:23:01.559 --> 00:23:03.960
miracle of engineering every single time you

00:23:03.960 --> 00:23:07.019
send an emoji, which leaves me with one final

00:23:07.019 --> 00:23:10.460
thought to ponder today. If the entire history

00:23:10.460 --> 00:23:13.160
of the wide character is essentially a story

00:23:13.160 --> 00:23:16.019
of brilliant engineers chronically underestimating

00:23:16.019 --> 00:23:19.000
exactly how much digital space human expression

00:23:19.000 --> 00:23:22.799
requires, what future forms of human communication

00:23:22.799 --> 00:23:24.819
are we currently underestimating the storage

00:23:24.819 --> 00:23:27.799
for right now? That is a profound question. I

00:23:27.799 --> 00:23:30.420
mean, spatial computing, neural interfaces, the

00:23:30.420 --> 00:23:32.900
data requirements will be unimaginable. We'll

00:23:32.900 --> 00:23:34.660
leave you to think on that. We've gone from the

00:23:34.660 --> 00:23:37.920
8-bit box to a 32-bit world. But human expression

00:23:37.920 --> 00:23:39.940
never really stops expanding, does it? Keep diving

00:23:39.940 --> 00:23:40.220
deep.