WEBVTT

00:00:00.000 --> 00:00:01.700
You sit down at your desk, you place your hands

00:00:01.700 --> 00:00:04.540
on the keyboard, and you just... Well, you expect

00:00:04.540 --> 00:00:07.320
this completely seamless, frictionless translation

00:00:07.320 --> 00:00:09.640
of your thoughts right onto the screen. Right.

00:00:09.679 --> 00:00:12.060
Yeah, like it's magic. Exactly. I mean, you need

00:00:12.060 --> 00:00:14.259
a curly bracket to write a block of code. So,

00:00:14.259 --> 00:00:16.199
you know, you press the curly bracket key, you

00:00:16.199 --> 00:00:18.980
need a hashtag, you press shift, and then number

00:00:18.980 --> 00:00:21.399
three. It's a completely one-to-one relationship.

00:00:21.660 --> 00:00:23.859
It really is. I mean, you press a physical button,

00:00:24.160 --> 00:00:26.480
the symbol appears, and the computer just understands

00:00:26.480 --> 00:00:28.519
it perfectly. Yeah, and it feels like a fundamental

00:00:28.519 --> 00:00:30.620
law of physics at this point, right? Yeah. Here

00:00:30.620 --> 00:00:34.450
in the 21st century, we take the, I guess, real

00:00:34.450 --> 00:00:36.929
estate of a modern keyboard completely for granted.

00:00:37.170 --> 00:00:40.409
Oh, totally. We absolutely do. But today, we

00:00:40.409 --> 00:00:42.729
are pulling from this fascinating historical

00:00:42.729 --> 00:00:46.420
deep dive. It's this dense, highly detailed

00:00:46.420 --> 00:00:49.740
Wikipedia article on the history of digraphs

00:00:49.740 --> 00:00:52.259
and trigraphs in programming. Such a good topic.

00:00:52.420 --> 00:00:54.840
Right. And our mission today is to uncover how

00:00:54.840 --> 00:00:57.340
a severe lack of physical, you know, plastic

00:00:57.340 --> 00:01:00.520
keys back in the 1970s and 80s created these

00:01:00.520 --> 00:01:04.579
bizarre invisible software ghosts. Ghosts that

00:01:04.579 --> 00:01:07.640
are frankly still haunting modern code bases

00:01:07.640 --> 00:01:11.540
today. Yes, exactly. Yeah. So we're looking at

00:01:11.540 --> 00:01:14.099
what happens when the computer language you're

00:01:14.099 --> 00:01:17.250
writing in literally demands a specific symbol

00:01:17.250 --> 00:01:21.269
like, say, a square bracket or a backslash that

00:01:21.269 --> 00:01:23.510
just physically does not exist on the keyboard

00:01:23.510 --> 00:01:26.049
sitting right in front of you. Yeah, it's essentially

00:01:26.049 --> 00:01:29.230
like trying to play a complex classical piano

00:01:29.230 --> 00:01:32.370
concerto on a child's toy keyboard. Oh, that's

00:01:32.370 --> 00:01:34.629
a great way to put it. Right. You physically

00:01:34.629 --> 00:01:36.810
do not have the keys to hit the notes the sheet

00:01:36.810 --> 00:01:39.650
music requires. So to keep the music playing,

00:01:40.250 --> 00:01:42.719
programmers had to, well... they had to invent

00:01:42.719 --> 00:01:45.519
these complex unnatural chord combinations just

00:01:45.519 --> 00:01:47.459
to hit a single missing note. Just to get by.

00:01:47.799 --> 00:01:50.120
Exactly. And the really wild part here, the thing

00:01:50.120 --> 00:01:53.060
that's so fascinating, is how those desperate

00:01:53.060 --> 00:01:56.040
temporary bandages applied to early hardware

00:01:56.040 --> 00:01:58.900
limitations permanently baked themselves into

00:01:58.900 --> 00:02:01.180
the infrastructure of our whole digital world.

00:02:01.280 --> 00:02:03.819
It's wild. Okay, so let's start by defining our

00:02:03.819 --> 00:02:05.739
terms for anyone who hasn't, you know, spent

00:02:05.739 --> 00:02:07.959
time digging through the dusty archives of computer

00:02:07.959 --> 00:02:10.439
science. Good idea. Digraphs and trigraphs are

00:02:10.439 --> 00:02:12.719
essentially just sequences of two or three characters

00:02:12.719 --> 00:02:15.360
in your source code that the programming language

00:02:15.360 --> 00:02:19.620
is explicitly instructed to treat as a single,

00:02:19.620 --> 00:02:22.120
entirely different character. Like you type three

00:02:22.120 --> 00:02:23.879
things, but the computer pretends you only typed

00:02:23.879 --> 00:02:26.800
one. Which sounds horribly inefficient, right?

00:02:26.819 --> 00:02:28.620
Totally. It sounds like a nightmare. But early

00:02:28.620 --> 00:02:30.939
programmers simply didn't have the luxury of

00:02:30.939 --> 00:02:33.939
choice. I mean, today we have Unicode, which

00:02:33.939 --> 00:02:37.060
is this massive, universally agreed-upon library

00:02:37.060 --> 00:02:39.759
of characters. Every computer on earth understands

00:02:39.759 --> 00:02:42.759
it. Right, it's just standard now. Exactly. But

00:02:42.759 --> 00:02:44.919
in the early days of computing, hardware was

00:02:44.919 --> 00:02:47.180
wildly fragmented. You had systems running on

00:02:47.180 --> 00:02:51.699
something called EBCDIC. EBCDIC. Bless you. Yeah,

00:02:51.699 --> 00:02:55.819
right. It stands for Extended Binary Coded Decimal

00:02:55.819 --> 00:02:59.199
Interchange Code. It was this old proprietary

00:02:59.199 --> 00:03:01.860
IBM character encoding standard that was primarily

00:03:01.860 --> 00:03:05.479
used on their massive mainframes, and EBCDIC

00:03:05.479 --> 00:03:08.259
code pages were notoriously limited. Depending

00:03:08.259 --> 00:03:10.219
on the specific version running on your machine,

00:03:10.509 --> 00:03:12.710
standard programming symbols like curly brackets

00:03:12.710 --> 00:03:15.009
or square brackets, they simply did not exist

00:03:15.009 --> 00:03:19.250
in the computer's brain. Wow. I just, I have

00:03:19.250 --> 00:03:22.550
to imagine that writing modern code without a

00:03:22.550 --> 00:03:24.789
curly bracket is like, I don't know, trying to

00:03:24.789 --> 00:03:27.719
build a house without nails. It's just a structural

00:03:27.719 --> 00:03:30.520
impossibility. It fundamentally breaks the syntax

00:03:30.520 --> 00:03:32.680
of most languages. And, you know, if we look

00:03:32.680 --> 00:03:35.740
even earlier at a language like ALGOL, developers

00:03:35.740 --> 00:03:38.439
were dealing with manufacturer-specific

00:03:38.439 --> 00:03:41.039
six-bit character codes. Six bits? Yeah, six bits.

00:03:41.439 --> 00:03:44.319
Which only gives you 64 possible characters total.

00:03:44.479 --> 00:03:47.539
Oh, wow. That is nothing. Right. Once you account

00:03:47.539 --> 00:03:50.259
for uppercase letters, numbers, and basic punctuation,

00:03:50.460 --> 00:03:53.080
you are completely out of room. They physically

00:03:53.080 --> 00:03:55.240
lacked the code points for the mathematical operations

00:03:55.240 --> 00:03:58.000
the language required. So what do they do? Well,

00:03:58.199 --> 00:04:02.139
they invented substitutions. So to assign a value,

00:04:01.900 --> 00:04:04.539
which conventionally looked like a left -pointing

00:04:04.539 --> 00:04:07.360
arrow in their documentation, they had to type

00:04:07.360 --> 00:04:10.139
a colon followed by an equal sign. Okay. And

00:04:10.139 --> 00:04:13.960
to write, say, greater than or equal to, they

00:04:13.960 --> 00:04:16.120
type the greater than bracket followed by an

00:04:16.120 --> 00:04:18.139
equal sign. Okay, let's unpack this because that

00:04:18.139 --> 00:04:19.720
actually makes perfect sense when you think about,

00:04:19.720 --> 00:04:22.240
like, early smartphone texting. Right. Before

00:04:22.240 --> 00:04:24.620
we all had dedicated emoji keyboards natively

00:04:24.620 --> 00:04:26.860
built into our phones, if you wanted to send

00:04:26.860 --> 00:04:29.360
a smiley face to a friend, you had to physically

00:04:29.360 --> 00:04:32.839
type out a colon, a hyphen, and a closing parenthesis.

00:04:33.259 --> 00:04:35.439
Exactly. You were combining existing unrelated

00:04:35.439 --> 00:04:38.279
characters to represent a visual concept you

00:04:38.279 --> 00:04:40.339
just didn't have a single dedicated key for.

00:04:40.560 --> 00:04:43.100
That is the exact same psychological workaround

00:04:43.100 --> 00:04:45.579
just applied to the structural architecture of

00:04:45.579 --> 00:04:49.029
software. Wow. But with texting, a human reads

00:04:49.029 --> 00:04:51.550
the colon and parenthesis and interprets the emotion.

00:04:52.350 --> 00:04:55.689
In early programming, the computer compiler physically

00:04:55.689 --> 00:04:57.810
translates those characters into a functional

00:04:57.810 --> 00:04:59.910
piece of the program itself. Right. It actually

00:04:59.910 --> 00:05:02.230
changes the code. It does. And where this mechanism

00:05:02.230 --> 00:05:04.930
gets incredibly complicated and historically

00:05:04.930 --> 00:05:08.009
chaotic, really, is when we look at how the C

00:05:08.009 --> 00:05:10.850
programming language handled this whole hardware

00:05:10.850 --> 00:05:14.709
crisis. Ah, yes, the infamous C trigraphs. The

00:05:14.709 --> 00:05:16.829
one and only. Looking at the source material,

00:05:17.129 --> 00:05:20.089
the basic character set of C heavily relies on

00:05:20.089 --> 00:05:23.170
ASCII, which is a 7-bit character set. But C

00:05:23.170 --> 00:05:26.670
specifically requires nine distinct characters

00:05:26.670 --> 00:05:30.529
that sit outside the widely compatible standardized

00:05:30.529 --> 00:05:33.930
subset known as the ISO 646 invariant character

00:05:33.930 --> 00:05:35.750
set. Yeah, and we should probably clarify what

00:05:35.750 --> 00:05:39.319
that invariant set actually is. We do. ISO 646

00:05:39.319 --> 00:05:42.199
invariant is basically the core universally agreed

00:05:42.199 --> 00:05:45.600
upon alphabet, numbers, and basic punctuation that

00:05:45.600 --> 00:05:48.019
literally every computer terminal in the world

00:05:48.019 --> 00:05:50.240
could understand, regardless of nationality.

00:05:50.519 --> 00:05:53.569
Okay. But the C language needed characters like

00:05:53.569 --> 00:05:56.110
the hashtag, the backslash, the caret, the vertical

00:05:56.110 --> 00:05:59.250
bar, the tilde, and both sets of square and curly

00:05:59.250 --> 00:06:01.329
brackets. Exactly. And those nine characters

00:06:01.329 --> 00:06:04.310
were not in that universal invariant set. So

00:06:04.310 --> 00:06:07.209
if you were a programmer in, say, France or Germany,

00:06:07.670 --> 00:06:09.709
your national keyboard layout likely replaced

00:06:09.709 --> 00:06:12.189
those specific keys with local alphabetic characters.

00:06:12.230 --> 00:06:15.529
Right, like letters with umlauts or accents.

00:06:15.790 --> 00:06:17.930
So you physically could not type standard C code?

00:06:17.949 --> 00:06:20.370
You couldn't. So the ANSI C committee was staring

00:06:20.370 --> 00:06:22.709
down this massive international adoption problem.

00:06:23.189 --> 00:06:25.670
Their mandate was to ensure international programmers

00:06:25.670 --> 00:06:27.970
could type C code on absolutely any keyboard

00:06:27.970 --> 00:06:30.610
in the world. Their solution was the trigraph.

00:06:31.229 --> 00:06:34.149
They invented nine specific three-character

00:06:34.149 --> 00:06:37.009
sequences to act as stand-ins for those missing

00:06:37.009 --> 00:06:40.009
symbols. And the defining feature of these sequences

00:06:40.009 --> 00:06:42.569
was that every single one of them started with

00:06:42.569 --> 00:06:45.730
two question marks. Okay, wait. I really have

00:06:45.730 --> 00:06:47.810
to push back on the logic of that decision. Oh,

00:06:47.810 --> 00:06:50.230
I know. Why use two question marks as your trigger?

00:06:50.550 --> 00:06:52.949
I mean, what if I'm a programmer writing a standard

00:06:52.949 --> 00:06:55.250
error message, and I genuinely just want to type

00:06:55.250 --> 00:06:57.250
a bunch of question marks in my code? Like for

00:06:57.250 --> 00:06:59.870
emphasis. Exactly. Like printing the text, critical

00:06:59.870 --> 00:07:02.389
error, what just happened? Doesn't that cause

00:07:02.389 --> 00:07:04.990
absolute chaos in the system? Oh, chaos is putting

00:07:04.990 --> 00:07:08.290
it mildly. It caused catastrophic code-destroying

00:07:08.290 --> 00:07:11.449
bugs. I knew it! Yeah. And to understand the

00:07:11.449 --> 00:07:13.389
mechanics of why it was so destructive, we have

00:07:13.389 --> 00:07:15.990
to look at how the C compiler actually processes

00:07:15.990 --> 00:07:19.470
these trigraphs. OK, lay it on me. So the compiler

00:07:19.470 --> 00:07:21.589
doesn't look at the context of your code. It

00:07:21.589 --> 00:07:23.410
doesn't know if you are writing a mathematical

00:07:23.410 --> 00:07:26.389
formula or just leaving a text note for a coworker.

00:07:26.519 --> 00:07:29.060
The tool responsible for these substitutions

00:07:29.060 --> 00:07:32.800
is called the C preprocessor. Right. And it

00:07:32.800 --> 00:07:36.279
is designed to be a completely blind, brute-force

00:07:36.279 --> 00:07:39.000
search and replace mechanism. It runs before

00:07:39.000 --> 00:07:41.120
anything else happens in the compilation process.

00:07:41.360 --> 00:07:44.180
It is the absolute first pass over the text.

00:07:44.519 --> 00:07:46.620
Wait. So it's essentially just doing a mindless

00:07:46.620 --> 00:07:49.560
find and replace across the entire document before

00:07:49.560 --> 00:07:52.120
the actual brain of the compiler even turns on.

00:07:52.300 --> 00:07:54.860
Precisely. It is totally blind to context. That

00:07:54.860 --> 00:07:57.720
sounds dangerous. It was. Let's walk through

00:07:57.720 --> 00:08:01.379
the specific, deeply frustrating example related

00:08:01.379 --> 00:08:04.180
in our sources. Imagine a programmer writes a

00:08:04.180 --> 00:08:07.139
simple, harmless comment in their code. In C,

00:08:07.300 --> 00:08:09.300
you use a double slash to tell the compiler,

00:08:09.560 --> 00:08:11.860
hey, ignore everything else on this line, it's

00:08:11.860 --> 00:08:14.180
just a note for human eyes. Right, standard comment

00:08:14.180 --> 00:08:15.980
thing. So the programmer writes their double

00:08:15.980 --> 00:08:18.259
slash, followed by the phrase, will the next

00:08:18.259 --> 00:08:21.079
line be executed? And then they add 10 question

00:08:21.079 --> 00:08:22.899
marks for emphasis. Because they're very stressed

00:08:22.899 --> 00:08:25.220
about this line of code. Exactly. 10 question

00:08:25.220 --> 00:08:27.560
marks, followed immediately by a forward slash,

00:08:27.620 --> 00:08:30.860
just as a visual divider. OK, so logically, Since

00:08:30.860 --> 00:08:33.299
it is hidden behind that double slash comment

00:08:33.299 --> 00:08:35.539
marker, the compiler should just breeze right

00:08:35.539 --> 00:08:37.679
past it. It shouldn't affect the actual software

00:08:37.679 --> 00:08:40.360
at all. That is the logical assumption. But remember

00:08:40.360 --> 00:08:42.539
the mechanics of the preprocessor. Right, the

00:08:42.539 --> 00:08:44.960
blind find and replace. Exactly. It runs first.

00:08:45.120 --> 00:08:47.360
It does not know what a comment is. It simply

00:08:47.360 --> 00:08:49.679
scans that line of text, and at the very end

00:08:49.679 --> 00:08:52.500
of the string of question marks, it spots a sequence:

00:08:52.840 --> 00:08:54.960
question mark, question mark, forward slash.

00:08:55.120 --> 00:08:59.159
Uh-oh. Yep. In the ANSI C standard, that specific

00:08:59.159 --> 00:09:01.500
three-character sequence is the official trigraph

00:09:01.740 --> 00:09:03.840
for a single backslash. Oh, I see where this

00:09:03.840 --> 00:09:06.139
is going. And it is horrifying. It really is.

00:09:06.539 --> 00:09:09.620
The preprocessor silently, invisibly changes

00:09:09.620 --> 00:09:12.440
those three characters into a backslash. Now,

00:09:12.480 --> 00:09:15.159
in the C language, placing a backslash at the

00:09:15.159 --> 00:09:17.500
very end of a line is a structural command. What

00:09:17.500 --> 00:09:19.960
does it do? It is the line splicing character.

00:09:20.440 --> 00:09:23.299
It explicitly tells the compiler, take the entire

00:09:23.299 --> 00:09:25.799
next line of code below this one and pull it

00:09:25.799 --> 00:09:28.960
up to join the current line. No. Which means,

00:09:29.500 --> 00:09:32.820
wait. The actual functional line of code immediately

00:09:32.820 --> 00:09:35.440
below that harmless text comment gets sucked

00:09:35.440 --> 00:09:37.980
up into the comment block itself. Yep. It gets

00:09:37.980 --> 00:09:41.460
completely hidden from the compiler. Wow. The program

00:09:41.460 --> 00:09:44.480
will compile perfectly, but a crucial piece of

00:09:44.480 --> 00:09:46.700
logic is just magically gone because someone

00:09:46.700 --> 00:09:49.240
was a little too enthusiastic with their punctuation.
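
NOTE
The swallowed-line bug described above can be sketched in Python. This is a
minimal simulation of the early C translation phases, not a real preprocessor.
The helper names (replace_trigraphs, splice_lines, strip_line_comments) and the
sample source text are assumptions made for illustration.
```python
# The nine ANSI C trigraph sequences and the characters they stand for.
TRIGRAPHS = {
    "??=": "#", "??/": "\\", "??'": "^", "??(": "[", "??)": "]",
    "??!": "|", "??<": "{", "??>": "}", "??-": "~",
}
# Phase 1: blind left-to-right replacement, with no notion of comments or strings.
def replace_trigraphs(text):
    out, i = [], 0
    while i < len(text):
        chunk = text[i:i + 3]
        if chunk in TRIGRAPHS:
            out.append(TRIGRAPHS[chunk])
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
# Phase 2: a backslash immediately before a newline joins the two lines.
def splice_lines(text):
    return text.replace("\\\n", "")
# Phase 3 (simplified): drop everything from // to the end of each line.
def strip_line_comments(text):
    return "\n".join(line.split("//", 1)[0].rstrip() for line in text.split("\n"))
# Sample input: a comment ending in question marks and a slash, then real code.
source = "x = 1;\n// Will the next line be executed??????????/\nx = 2;\nprint(x);"
result = strip_line_comments(splice_lines(replace_trigraphs(source)))
print(result)
```
In this sketch the line "x = 2;" is spliced into the comment and vanishes,
while "x = 1;" and "print(x);" survive.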

00:09:50.080 --> 00:09:52.419
Exactly. It was an absolute nightmare to debug.

00:09:52.600 --> 00:09:54.419
I mean, you would stare at a screen of perfectly

00:09:54.419 --> 00:09:57.860
valid, structurally sound code, completely unaware

00:09:57.860 --> 00:10:00.899
that the preprocessor was secretly rewriting

00:10:00.899 --> 00:10:03.019
the foundational text behind your back. Before

00:10:03.019 --> 00:10:05.000
the compiler even got a chance to evaluate it.

00:10:05.200 --> 00:10:07.419
Exactly. And the source notes that developers

00:10:07.419 --> 00:10:10.279
working on the classic Mac OS suffered immensely

00:10:10.279 --> 00:10:13.279
from this. Oh, really? How come? Well, they frequently

00:10:13.279 --> 00:10:16.019
used a four-character constant to denote unknown

00:10:16.019 --> 00:10:19.100
file types or creator codes. The constant was

00:10:19.100 --> 00:10:22.200
literally just single quotes wrapping four

00:10:22.200 --> 00:10:24.440
question marks. Oh, no. So every time they typed

00:10:24.440 --> 00:10:26.919
that out, the preprocessor would see those consecutive

00:10:26.919 --> 00:10:29.000
question marks, assume it was the start of a

00:10:29.000 --> 00:10:31.870
trigraph, and just mangle the code. Completely

00:10:31.870 --> 00:10:34.649
mangle it. You would have to use incredibly complicated

00:10:34.649 --> 00:10:37.169
string concatenation or weird escape sequences

00:10:37.169 --> 00:10:39.730
just to hide basic punctuation from your own

00:10:39.730 --> 00:10:42.379
compiler. It seems like an incredibly blunt,

00:10:42.639 --> 00:10:45.080
almost reckless substitution method. It was.
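
NOTE
The mangled four-question-mark constant can be reproduced with the same kind of
Python simulation. The variable name in the sample text is hypothetical, and the
escaped spelling shown is just one possible workaround, relying on the fact that
C accepts a backslash before a question mark as an escape.
```python
# The nine ANSI C trigraph sequences and the characters they stand for.
TRIGRAPHS = {
    "??=": "#", "??/": "\\", "??'": "^", "??(": "[", "??)": "]",
    "??!": "|", "??<": "{", "??>": "}", "??-": "~",
}
# Blind left-to-right trigraph replacement, as in the first translation phase.
def replace_trigraphs(text):
    out, i = [], 0
    while i < len(text):
        chunk = text[i:i + 3]
        if chunk in TRIGRAPHS:
            out.append(TRIGRAPHS[chunk])
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
# The last two question marks plus the closing quote form the caret trigraph.
mangled = replace_trigraphs("file_type = '????';")
# Escaping the question marks breaks up every run of two, so nothing matches.
escaped = replace_trigraphs(r"file_type = '?\?\?\?';")
print(mangled)
print(escaped)
```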

00:10:45.519 --> 00:10:47.879
And the sheer volume of these destroyed code

00:10:47.879 --> 00:10:50.899
bases forced the C standards committee into a

00:10:50.899 --> 00:10:53.559
corner. They couldn't just abandon the workarounds

00:10:53.559 --> 00:10:55.519
because, you know, those international keyboards

00:10:55.519 --> 00:10:58.299
still lacked the physical keys. Right. The French

00:10:58.299 --> 00:11:00.600
and German developers still needed to type. Exactly.

00:11:00.879 --> 00:11:03.539
But they had to stop the preprocessor from blindly

00:11:03.539 --> 00:11:07.590
eating comments and strings. So their compromise

00:11:07.590 --> 00:11:11.669
introduced in the 1994 C95 amendment was the

00:11:11.669 --> 00:11:14.129
digraph. Okay, the digraph. So they shifted from

00:11:14.129 --> 00:11:16.090
three-character sequences starting with question

00:11:16.090 --> 00:11:18.789
marks to two-character sequences. Right. For

00:11:18.789 --> 00:11:21.090
example, using a less than sign and a percent

00:11:21.090 --> 00:11:24.230
sign to represent a left curly bracket or a less

00:11:24.230 --> 00:11:26.490
than sign and a colon to represent a left square

00:11:26.490 --> 00:11:29.139
bracket. Yeah. But wait, simply making the sequence

00:11:29.139 --> 00:11:31.259
shorter doesn't solve the brute force replacement

00:11:31.259 --> 00:11:33.620
problem, does it? It's still a blind find and

00:11:33.620 --> 00:11:36.299
replace. Well, making it shorter wasn't the fix.

00:11:36.570 --> 00:11:39.269
The brilliance of the digraph was changing when

00:11:39.269 --> 00:11:41.529
the sequence was processed by the system. Oh,

00:11:41.529 --> 00:11:44.330
okay. Unlike trigraphs, which were aggressively

00:11:44.330 --> 00:11:47.330
replaced in that chaotic first preprocessor pass,

00:11:47.769 --> 00:11:50.110
digraphs are handled much later in the compilation

00:11:50.110 --> 00:11:52.330
pipeline. During a phase called tokenization,

00:11:52.509 --> 00:11:54.929
right? Exactly. Tokenization. Let's unpack the

00:11:54.929 --> 00:11:56.809
mechanics of tokenization for anyone listening

00:11:56.809 --> 00:12:01.129
who doesn't spend their weekends writing custom

00:12:01.129 --> 00:12:04.210
compilers. How does moving the substitution to

00:12:04.210 --> 00:12:07.899
this specific phase save the code base? Okay,

00:12:08.019 --> 00:12:10.059
think of tokenization like reading a sentence.

00:12:10.200 --> 00:12:12.799
Okay, if a preprocessor is told to replace the

00:12:12.799 --> 00:12:16.379
letters t-h-e, it will blindly rip those letters

00:12:16.379 --> 00:12:18.940
out of the word "there" or "theater" and just ruin

00:12:18.940 --> 00:12:20.940
the sentence. Because it's just looking for the

00:12:20.940 --> 00:12:23.149
letters, not the meaning. Right. Tokenization,

00:12:23.309 --> 00:12:25.490
however, is the phase where the compiler groups

00:12:25.490 --> 00:12:27.950
individual characters into meaningful words or

00:12:27.950 --> 00:12:31.049
tokens. It actually understands context. Ah,

00:12:31.070 --> 00:12:33.210
I get it. It knows the difference between a functional

00:12:33.210 --> 00:12:36.409
keyword, a mathematical operator, and a literal

00:12:36.409 --> 00:12:38.970
string of text meant to be printed on the screen.

00:12:39.330 --> 00:12:41.909
So because digraphs are processed during tokenization,

00:12:42.730 --> 00:12:45.090
they respect the boundaries of the code. Exactly.

00:12:45.440 --> 00:12:48.340
If the compiler is reading a quoted string and

00:12:48.340 --> 00:12:51.460
sees the digraph for a curly bracket inside those

00:12:51.460 --> 00:12:53.840
quotes, it knows it is currently inside a text

00:12:53.840 --> 00:12:57.179
token. It leaves those characters alone. Wow.
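
NOTE
The difference tokenization makes can be sketched with a toy tokenizer in
Python. This is a simplification and an assumption about how one might model it:
a real C compiler keeps the digraph spelling and only treats the token as
equivalent, whereas this sketch rewrites it to the usual punctuator so the
effect is visible. The names DIGRAPHS, TOKEN, and tokenize are hypothetical.
```python
import re
# The C95 digraph spellings and the punctuators they stand for.
DIGRAPHS = {"<%": "{", "%>": "}", "<:": "[", ":>": "]", "%:": "#", "%:%:": "##"}
# Order matters: string literals first, then digraphs (longest first), then the rest.
TOKEN = re.compile(r'''
    "(?:\\.|[^"\\])*"      # a double-quoted string literal, kept whole
  | %:%:|<%|%>|<:|:>|%:    # digraph punctuators
  | \w+                    # identifiers and numbers
  | \S                     # any other single non-space character
''', re.VERBOSE)
# Digraphs become punctuators only when they are tokens of their own.
def tokenize(source):
    return [DIGRAPHS.get(tok, tok) for tok in TOKEN.findall(source)]
tokens = tokenize('int main() <% puts("<% stays %>"); %>')
print(tokens)
```
The digraphs around the function body become braces, while the same characters
inside the quoted string are left exactly as typed.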

00:12:57.700 --> 00:13:00.279
It finally adds a layer of intelligence to the

00:13:00.279 --> 00:13:03.710
substitution. It solves the string and comment

00:13:03.710 --> 00:13:07.350
formatting bugs that made trigraphs so universally

00:13:07.350 --> 00:13:09.929
hated. It really was a massive leap forward in

00:13:09.929 --> 00:13:12.809
stability, but because this is the history of

00:13:12.809 --> 00:13:15.669
programming, layering new logic on top of old

00:13:15.669 --> 00:13:18.309
compromises rarely results in a perfectly clean

00:13:18.309 --> 00:13:20.350
system. Of course not, that would be too easy.

00:13:20.509 --> 00:13:24.269
Right. So when C++ inherited all of these digraphs,

00:13:24.470 --> 00:13:26.509
they ran into a fascinating edge case that threatened

00:13:26.509 --> 00:13:29.169
to break the entire language. Yes. Here's where

00:13:29.169 --> 00:13:31.549
it gets really interesting. The source material

00:13:31.549 --> 00:13:33.909
highlights the less than colon colon dilemma.

00:13:34.149 --> 00:13:37.409
Oh yeah, this is a great one. In C++, there

00:13:37.409 --> 00:13:40.490
is a very common sequence where a less than sign

00:13:40.490 --> 00:13:43.789
is followed immediately by two colons. Now, a

00:13:43.789 --> 00:13:45.629
less than sign followed by a single colon is

00:13:45.629 --> 00:13:47.909
the official digraph for a left square bracket.

00:13:48.090 --> 00:13:51.590
So if we follow the standard predictable tokenization

00:13:51.590 --> 00:13:56.370
rules, less than, colon, colon should automatically

00:13:56.370 --> 00:13:58.830
be interpreted by the compiler as a square bracket

00:13:58.830 --> 00:14:01.149
followed by a colon. Which would be the expected

00:14:01.149 --> 00:14:03.529
logical outcome of the rule they literally just

00:14:03.529 --> 00:14:06.529
wrote. But the C++ standards committee had to

00:14:06.529 --> 00:14:08.850
explicitly write a hyper-specific exception

00:14:08.850 --> 00:14:12.059
to stop that from happening. They did. They mandated

00:14:12.059 --> 00:14:15.440
that in this exact scenario, the less than sign

00:14:15.440 --> 00:14:17.960
must be treated as its own entirely separate

00:14:17.960 --> 00:14:20.539
token and the two colons are left alone. Because

00:14:20.539 --> 00:14:22.519
if they didn't, it would be disastrous. Why?

00:14:22.779 --> 00:14:25.100
What would happen? Well, the reason is that if

00:14:25.100 --> 00:14:27.399
they allowed the digraph substitution to happen,

00:14:27.700 --> 00:14:30.279
it would completely destroy the syntax for C++.

00:14:30.590 --> 00:14:33.529
Oh, wow. Yeah. Templates are a major feature

00:14:33.529 --> 00:14:35.809
of the language and they rely heavily on less

00:14:35.809 --> 00:14:39.169
than signs and colons to function. So it is the

00:14:39.169 --> 00:14:41.269
ultimate architectural balancing act. You are

00:14:41.269 --> 00:14:43.409
constantly patching the holes left by missing

00:14:43.409 --> 00:14:46.090
physical keys while desperately trying not to

00:14:46.090 --> 00:14:48.289
break the entirely new advanced language features

00:14:48.289 --> 00:14:51.000
you're currently inventing. Exactly. It's a house

00:14:51.000 --> 00:14:54.139
of cards built on a foundation of hardware compromises.
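
NOTE
The less-than, colon, colon exception discussed above can be sketched as a
single lexing decision in Python. This follows the rule as it is commonly
stated for C++11 and later: the exception applies unless the character after
the two colons is another colon or a greater-than sign. The function name
lex_less_than and the sample strings are assumptions for illustration only.
```python
# Decide which token starts at source[i], where source[i] is assumed to be '<'.
def lex_less_than(source, i):
    if source.startswith("<::", i):
        following = source[i + 3:i + 4]
        if following not in (":", ">"):
            # Exception: emit '<' alone so '::' stays the scope operator.
            return "<"
    if source.startswith("<:", i):
        # Ordinary digraph spelling of '['.
        return "<:"
    return "<"
template_use = "std::vector<::Widget> v;"
array_use = "x<:1:>"
print(lex_less_than(template_use, template_use.index("<")))
print(lex_less_than(array_use, array_use.index("<")))
```
Without the exception, the template argument list in the first sample would
begin with a square bracket and fail to parse.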

00:14:54.500 --> 00:14:58.200
But the funny thing is, while the C and C++ communities

00:14:58.200 --> 00:15:02.200
viewed these sequences as, like, a necessary

00:15:02.200 --> 00:15:05.139
evil, a frustrating bandage they couldn't wait

00:15:05.139 --> 00:15:08.320
to peel off, other programming communities looked

00:15:08.320 --> 00:15:10.759
at digraphs and saw an entirely new toolkit.

00:15:11.019 --> 00:15:13.120
Oh, absolutely. They stopped treating them as

00:15:13.120 --> 00:15:15.059
workarounds and started using them as intentional

00:15:15.059 --> 00:15:17.799
features to expand what a standard keyboard was

00:15:17.799 --> 00:15:20.460
capable of. The evolution of the Pascal programming

00:15:20.460 --> 00:15:22.460
language is a perfect example of this, isn't

00:15:22.460 --> 00:15:25.679
it? It is. Pascal needed curly brackets for comments,

00:15:25.980 --> 00:15:28.659
and for keyboards that lacked them, they introduced

00:15:28.659 --> 00:15:31.399
the digraph of a left parenthesis followed by

00:15:31.399 --> 00:15:34.039
an asterisk to represent the left curly bracket.

00:15:34.179 --> 00:15:36.799
OK. But the developers using Pascal actually

00:15:36.799 --> 00:15:39.340
preferred the workaround to the real thing. Wait,

00:15:39.360 --> 00:15:41.639
really? Why prefer typing two keys when you can

00:15:41.639 --> 00:15:44.000
just type one? Because it offered a visual and

00:15:44.000 --> 00:15:46.679
functional distinction. A comment block that

00:15:46.679 --> 00:15:49.399
starts with a parenthesis and an asterisk absolutely

00:15:49.399 --> 00:15:52.299
cannot be accidentally closed by a stray regular

00:15:52.299 --> 00:15:54.419
right curly bracket floating somewhere in the

00:15:54.419 --> 00:15:57.279
code. Oh, that makes so much sense. Right. It

00:15:57.279 --> 00:16:00.299
provided a robust, distinct way to manage large

00:16:00.299 --> 00:16:02.340
blocks of comments, especially if those comments

00:16:02.340 --> 00:16:04.899
contained actual code snippets that used curly

00:16:04.899 --> 00:16:07.620
brackets. It was so popular it became a standard

00:16:07.620 --> 00:16:09.919
alternative. Wow. And then the J programming

00:16:09.919 --> 00:16:12.179
language took that concept and pushed it even

00:16:12.179 --> 00:16:15.539
further. But J is wild. It really is. J is a

00:16:15.539 --> 00:16:18.279
descendant of the APL programming language. And

00:16:18.279 --> 00:16:21.019
APL is infamous in computer science for using

00:16:21.019 --> 00:16:25.340
a vast, terrifying array of specialized, highly

00:16:25.340 --> 00:16:28.899
mathematical symbols that absolutely do not exist

00:16:28.899 --> 00:16:31.220
on a normal keyboard. Yeah, symbols you'd expect

00:16:31.220 --> 00:16:33.980
to see on, like, an advanced physics chalkboard.

00:16:34.179 --> 00:16:37.350
Exactly. When developers built J, they wanted

00:16:37.350 --> 00:16:39.490
to replicate that massive mathematical power,

00:16:39.830 --> 00:16:42.129
but they strictly limited the language's alphabet

00:16:42.129 --> 00:16:44.389
to the basic ASCII character set. Which is a

00:16:44.389 --> 00:16:46.429
huge constraint. So to pull off that illusion,

00:16:46.950 --> 00:16:49.730
J utilizes the period and the colon as what they

00:16:49.730 --> 00:16:51.779
call inflection points. They took the concept

00:16:51.779 --> 00:16:54.000
of the digraph and made it the foundational grammar

00:16:54.000 --> 00:16:56.720
of the entire language. Yeah, let's clarify how

00:16:56.720 --> 00:16:58.679
those inflection points actually operate, because

00:16:58.679 --> 00:17:00.759
it's brilliant. It's essentially like adding

00:17:00.759 --> 00:17:04.079
a modifier key, like a shift or an alt key, but

00:17:04.079 --> 00:17:06.599
doing it entirely through software syntax instead

00:17:06.599 --> 00:17:08.680
of physical hardware. Exactly the right way to

00:17:08.680 --> 00:17:10.539
think about it. In J, if you type a standard

00:17:10.539 --> 00:17:13.220
plus sign, it performs basic addition. But if

00:17:13.220 --> 00:17:15.579
you type a plus sign immediately followed by

00:17:15.579 --> 00:17:18.519
a period, the language treats that digraph as

00:17:18.519 --> 00:17:20.799
an entirely different logical operation. Like

00:17:20.799 --> 00:17:24.390
a logical OR. Right. And type a plus sign followed

00:17:24.390 --> 00:17:27.150
by a colon, and it becomes a logical NOR.

00:17:27.650 --> 00:17:30.369
By simply appending one of two standard punctuation

00:17:30.369 --> 00:17:33.329
marks to any normal character, J exponentially

00:17:33.329 --> 00:17:35.849
expanded its vocabulary without requiring a single

00:17:35.849 --> 00:17:38.390
new physical key on the board. That is so clever.
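
NOTE
The way a trailing period or colon changes a J primitive can be sketched as a
tiny lexer in Python. This is a deliberately simplified model, not J's actual
word-formation rules: it treats every non-space character as its own word, so
multi-character names and numbers such as 1.5 are out of scope. The names
split_primitives and MEANINGS are hypothetical, and the meanings listed are the
two-argument readings mentioned in the discussion.
```python
# Group each character with any '.' or ':' inflections that directly follow it.
def split_primitives(expr):
    tokens, i = [], 0
    while i < len(expr):
        if expr[i].isspace():
            i += 1
            continue
        j = i + 1
        while j < len(expr) and expr[j] in ".:":
            j += 1
        tokens.append(expr[i:j])
        i = j
    return tokens
# One base character yields three distinct operations once inflections are added.
MEANINGS = {"+": "add", "+.": "logical OR", "+:": "logical NOR"}
words = split_primitives("1 +. 0")
print(words)
print([MEANINGS.get(w, w) for w in words])
```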

00:17:38.569 --> 00:17:40.809
And we see this exact same philosophy, you know,

00:17:40.990 --> 00:17:43.309
expanding a limited physical interface through

00:17:43.309 --> 00:17:45.930
sequence mapping pop up in early portable hardware

00:17:45.930 --> 00:17:48.990
as well. The Hewlett-Packard RPL calculators

00:17:48.990 --> 00:17:51.150
are a prime example. Oh, those HP calculators

00:17:51.150 --> 00:17:53.309
are fascinating. They were designed for advanced

00:17:53.309 --> 00:17:55.490
engineering, so they supported a massive extended

00:17:55.490 --> 00:17:58.950
character set internally. But obviously the physical

00:17:58.950 --> 00:18:01.710
keyboard on a handheld calculator is incredibly

00:18:01.710 --> 00:18:04.750
tiny. Right, you simply cannot fit hundreds of

00:18:04.750 --> 00:18:07.720
keys on it. So HP implemented something called

00:18:07.720 --> 00:18:10.920
TIO codes. You would type a backslash, followed

00:18:10.920 --> 00:18:13.480
by two characters that visually resembled the

00:18:13.480 --> 00:18:16.059
missing glyph you wanted to display. And if the

00:18:16.059 --> 00:18:18.680
symbol you needed was too abstract to draw with

00:18:18.680 --> 00:18:21.759
two letters, you could type a backslash, followed

00:18:21.759 --> 00:18:24.099
by a three-digit decimal code. Which is technically

00:18:24.099 --> 00:18:26.920
a tetragraph. A tetragraph. A four-character

00:18:26.920 --> 00:18:29.579
sequence mapping directly to a specific character

00:18:29.579 --> 00:18:32.279
code for a symbol. The ingenuity of these

00:18:32.279 --> 00:18:34.779
power user shortcuts is just astounding. And

00:18:34.779 --> 00:18:37.359
you see it everywhere in classic software, like

00:18:37.359 --> 00:18:39.980
the Vim text editor uses Control-K to allow

00:18:39.980 --> 00:18:42.220
users to input thousands of special characters

00:18:42.220 --> 00:18:46.099
via two-letter digraphs. Lotus 1-2-3 for MS-DOS

00:18:46.099 --> 00:18:49.079
utilized Alt-F1 as a compose key to build symbols

00:18:49.079 --> 00:18:51.200
out of sequential keystrokes. So what does this

00:18:51.200 --> 00:18:54.470
all mean? What began as a desperate workaround

00:18:54.470 --> 00:18:57.309
for the physical limitations of a 6-bit mainframe

00:18:57.309 --> 00:19:00.069
organically morphed into a highly efficient workflow

00:19:00.069 --> 00:19:02.529
for developers across entirely different operating

00:19:02.529 --> 00:19:05.630
systems. Exactly. The constraints of the physical

00:19:05.630 --> 00:19:08.809
environment forced a structural innovation, and

00:19:08.809 --> 00:19:11.029
then the users weaponized that innovation for

00:19:11.029 --> 00:19:13.710
their own speed and efficiency. But as we know,

00:19:14.029 --> 00:19:16.549
the technological environment never stays stagnant

00:19:16.549 --> 00:19:19.410
for long. Never. If these digraphs and trigraphs

00:19:19.410 --> 00:19:22.450
were so deeply woven into the fabric of compilers,

00:19:22.869 --> 00:19:25.250
operating systems, and developer workflows, why

00:19:25.250 --> 00:19:27.329
don't modern programmers ever see them today?

00:19:27.660 --> 00:19:30.500
What finally killed the trigraph? Well, the trigraph

00:19:30.500 --> 00:19:33.059
was entirely eradicated by the widespread adoption

00:19:33.059 --> 00:19:36.059
of Unicode and the UTF-8 encoding standard.

00:19:36.180 --> 00:19:38.900
Of course. For decades, the tech industry suffered

00:19:38.900 --> 00:19:41.640
through fragmented code pages and regional keyboard

00:19:41.640 --> 00:19:44.420
quirks. But eventually, the industry rallied

00:19:44.420 --> 00:19:47.180
around a single universal standard for character

00:19:47.180 --> 00:19:50.079
encoding. Unicode was designed to handle virtually

00:19:50.079 --> 00:19:53.380
every symbol, letter, and emoji in human existence.

00:19:53.619 --> 00:19:55.960
So the digital map finally grew large enough

00:19:55.960 --> 00:19:58.240
to perfectly cover the entire physical territory.

00:19:58.480 --> 00:20:01.339
Precisely. Modern operating systems became capable

00:20:01.339 --> 00:20:04.180
of seamlessly mapping any physical key combination

00:20:04.180 --> 00:20:06.759
to any Unicode character. The hardware caught

00:20:06.759 --> 00:20:09.759
up, and once that happened, the trigraph wasn't

00:20:09.759 --> 00:20:12.220
just obsolete, it reverted to being actively

00:20:12.220 --> 00:20:15.579
dangerous. Because of those blind preprocessor

00:20:15.579 --> 00:20:18.599
bugs we discussed earlier, having trigraphs enabled

00:20:18.599 --> 00:20:21.359
in a modern compiler was a massive liability

00:20:21.359 --> 00:20:23.920
for any code base. They went from being a lifeline

00:20:23.920 --> 00:20:26.279
to a landmine. That's a great way to put it.

00:20:26.599 --> 00:20:29.400
They became so universally detested that compiler

00:20:29.400 --> 00:20:31.819
manufacturers actively started disabling them

00:20:31.819 --> 00:20:35.059
by default. If a programmer accidentally used

00:20:35.059 --> 00:20:37.880
a trigraph sequence, modern compilers would halt

00:20:37.880 --> 00:20:40.900
and throw a warning. Oh, wow. The company Borland

00:20:41.099 --> 00:20:43.940
even took the extraordinary step of ripping the

00:20:43.940 --> 00:20:46.579
trigraph processing logic out of their main compiler

00:20:46.579 --> 00:20:48.500
entirely. Wait, really? What did they do with

00:20:48.500 --> 00:20:51.420
it? They physically segregated it into a totally

00:20:51.420 --> 00:20:54.440
separate standalone executable program called

00:20:54.440 --> 00:20:57.579
trigraph.exe. They literally quarantined the

00:20:57.579 --> 00:21:00.339
feature. They did. Borland realized that forcing

00:21:00.339 --> 00:21:02.720
their compiler to scan every single line of code

00:21:02.720 --> 00:21:04.900
for a double question mark significantly slowed

00:21:04.900 --> 00:21:06.960
down the compilation process for normal code.

00:21:07.099 --> 00:21:10.470
Which makes sense. Right. Why penalize the vast

00:21:10.470 --> 00:21:12.910
majority of your users with slower compile times

00:21:12.910 --> 00:21:15.470
for a feature that almost no one used anymore?

00:21:16.109 --> 00:21:19.329
If you truly desperately needed to process trigraphs,

00:21:19.670 --> 00:21:21.549
you were forced to run your code through that

00:21:21.549 --> 00:21:24.069
separate quarantine application first. That is

00:21:24.069 --> 00:21:26.430
an incredible logistical middle finger to a piece

00:21:26.430 --> 00:21:29.029
of legacy syntax. It really is. And eventually,

00:21:29.250 --> 00:21:31.710
the standards committees followed suit. Trigraphs

00:21:31.710 --> 00:21:33.769
were officially and permanently removed from

00:21:33.769 --> 00:21:37.079
the C language standard as of C23. Finally. But

00:21:37.079 --> 00:21:38.720
looking at the source material, there is one

00:21:38.720 --> 00:21:42.059
detail that stands out as deeply counterintuitive.

00:21:42.579 --> 00:21:45.140
Throughout this long, slow death of the trigraph,

00:21:45.740 --> 00:21:48.940
one massive tech giant fought aggressively to

00:21:48.940 --> 00:21:51.339
keep them alive. Oh, this is the best part. When

00:21:51.339 --> 00:21:53.900
the C++ committee proposed deprecating trigraphs

00:21:53.900 --> 00:21:57.200
in C++11, IBM stepped in and strongly opposed

00:21:57.200 --> 00:21:59.700
the removal. And they actually succeeded, at

00:21:59.700 --> 00:22:03.000
least temporarily. The IBM holdout. It is a perfect

00:22:03.000 --> 00:22:05.160
illustration of how heavy the anchor of legacy

00:22:05.160 --> 00:22:08.160
code really is. But why? Why would a modern tech

00:22:08.160 --> 00:22:10.619
giant defend a universally hated, bug-ridden

00:22:10.619 --> 00:22:13.599
feature? The answer is always backwards compatibility.

00:22:13.839 --> 00:22:16.740
IBM, perhaps more than any other corporation

00:22:16.740 --> 00:22:19.940
on earth, maintains massive ancient enterprise

00:22:19.940 --> 00:22:21.940
systems. Oh, right. We are talking about the

00:22:21.940 --> 00:22:25.059
mainframes that run global banking, airline ticketing,

00:22:25.400 --> 00:22:28.140
and international logistics. Exactly. Systems

00:22:28.140 --> 00:22:30.839
that were written decades ago, heavily utilizing

00:22:30.839 --> 00:22:34.400
those very same limited EBCDIC character sets

00:22:34.400 --> 00:22:36.759
that necessitated trigraphs in the first place.

00:22:37.559 --> 00:22:40.180
For IBM, removing trigraph support from the modern

00:22:40.180 --> 00:22:43.079
C++ standard meant that millions of lines of

00:22:43.119 --> 00:22:46.000
foundational, mission-critical code might suddenly

00:22:46.000 --> 00:22:48.140
fail to compile when they updated their system.

00:22:48.180 --> 00:22:50.619
That would be a disaster. Right. The risk of

00:22:50.619 --> 00:22:53.180
breaking a global banking system simply because

00:22:53.180 --> 00:22:55.940
you wanted to clean up some ugly syntax was unacceptable

00:22:55.940 --> 00:22:58.019
to them. They were completely chained to the

00:22:58.019 --> 00:23:00.559
workaround. The ghost in the machine was holding

00:23:00.559 --> 00:23:03.160
the modern infrastructure hostage. It was a valiant

00:23:03.160 --> 00:23:06.099
defense of the past. But progress is relentless.

00:23:06.740 --> 00:23:09.880
When C++17 came around a few years later, the

00:23:09.880 --> 00:23:12.220
committee voted to purge trigraphs completely,

00:23:12.539 --> 00:23:15.480
not just deprecate them, but remove them entirely

00:23:15.480 --> 00:23:18.539
from the standard, despite IBM's continued protests.

00:23:19.039 --> 00:23:21.400
The era of the trigraph was officially over.

00:23:21.559 --> 00:23:24.819
Today, if you are a developer tasked with maintaining

00:23:24.819 --> 00:23:28.019
that ancient IBM legacy code, you have to run

00:23:28.019 --> 00:23:30.640
it through a dedicated translation script to

00:23:30.640 --> 00:23:33.700
parse the trigraphs into standard Unicode characters

00:23:33.700 --> 00:23:35.960
before the modern compiler will even look at

00:23:35.960 --> 00:23:38.440
it. What an incredible technical journey. I mean,

00:23:38.440 --> 00:23:41.039
we started with the absolute constraints of 6-bit

00:23:41.039 --> 00:23:43.740
architecture where a programmer physically

00:23:43.740 --> 00:23:46.940
could not type an assignment arrow. Yep. We navigated

00:23:46.940 --> 00:23:49.980
the chaotic, code-destroying bugs of the double

00:23:49.980 --> 00:23:53.349
question mark slash preprocessor nightmare. And

00:23:53.349 --> 00:23:55.809
we ended with a massive corporate standoff over

00:23:55.809 --> 00:23:58.210
backwards compatibility in the modern era of

00:23:58.210 --> 00:24:01.109
global finance. Quite a story. It really reinforces

00:24:01.109 --> 00:24:03.609
the idea that the slick, invisible architecture

00:24:03.609 --> 00:24:06.730
of our modern digital world is heavily built

00:24:06.730 --> 00:24:09.109
on the crumbling foundations of old hardware

00:24:09.109 --> 00:24:12.089
limitations. It does. It forces a profound shift

00:24:12.089 --> 00:24:14.369
in how you view the tools you interact with every

00:24:14.369 --> 00:24:17.289
day. Trigraphs finally died because our physical

00:24:17.289 --> 00:24:19.549
hardware, our keyboards, and our internal encodings

00:24:19.549 --> 00:24:21.930
caught up to our software ambitions. But the

00:24:21.930 --> 00:24:24.490
cycle of technological limitation and workaround

00:24:24.490 --> 00:24:27.630
never truly ends. It merely shifts to a new domain.

00:24:27.930 --> 00:24:29.630
Which brings up a fascinating thought to leave

00:24:29.630 --> 00:24:33.109
you with. If software had to bend over backward

00:24:33.109 --> 00:24:35.990
to accommodate the limitations of physical plastic

00:24:35.990 --> 00:24:39.289
keyboards in the 1970s, what happens to our code

00:24:39.289 --> 00:24:41.589
when we abandon keyboards entirely? Oh, that's

00:24:41.589 --> 00:24:44.309
a great question. Right. As we rapidly move toward

00:24:44.309 --> 00:24:47.029
voice coding, spatial computing, and AI-prompted

00:24:47.029 --> 00:24:49.509
software generation, physical keys are becoming

00:24:49.509 --> 00:24:52.329
less and less relevant. Will the specific Unicode

00:24:52.329 --> 00:24:54.890
symbols we fought so hard to standardize, the

00:24:54.890 --> 00:24:57.329
curly brackets and semicolons, suddenly become

00:24:57.329 --> 00:25:01.029
the obsolete trigraphs of the AI era? It's very

00:25:01.029 --> 00:25:03.849
possible. Are we just leaving a new set of archaic,

00:25:03.930 --> 00:25:06.130
invisible rules for the next generation of developers

00:25:06.130 --> 00:25:08.410
to unravel? It is certainly something to ponder

00:25:08.410 --> 00:25:10.190
the next time you tap that square bracket key.

00:25:11.089 --> 00:25:13.069
Thank you for joining us on this deep dive. Keep

00:25:13.069 --> 00:25:13.890
questioning the code.
