The Dark Ages
We all know that every single character is represented in binary, more specifically, in 1 byte (`sizeof(char)` in C/C++). And the good old ASCII table will faithfully interpret `1000101` (69, by the way) as `E` every single time. Seems good, until you receive an email from China with the subject line “???? ?????? ??? ????”.
For English letters, the low 7 bits are more than enough; we can even use the topmost bit (128-255) as an extension for some common symbols (the extended ASCII table / ISO Latin-1). However, languages like CJK (Chinese, Japanese and Korean) have around 100k characters combined (as of Unicode 15.1). There is no way to store those in one byte like an alphabet, so how exactly do we encode a CJK character? What’s worse, when there is no standard, everyone and their dog makes one and calls it a day, so computers made by different manufacturers may not be able to communicate with each other, software development is no doubt a pain in the arse, and online communication is nearly impossible (assuming you’re not using English, of course).
So someone finally decided to fix this by making the mapping (from numbers to actual characters beyond English, such as CJK, European letters or even emoji) a standard: that’s Unicode.
Unicode to the Rescue
The gist: Unicode is a mapping, the mapping.
We can freely assume that Unicode is a map that everyone agrees on. It maps numbers (code points) to actual characters. For example, `U+006C` maps to `l`, where `U+` means it is a Unicode code point (written `\u` in some languages).
For the string `Hello`, we have `U+0048 U+0065 U+006C U+006C U+006F`.
And for the string `你好`, we have `U+4F60 U+597D`.
If you are an attentive reader, you may notice that the code points for English letters are the same as their ASCII values, for backward compatibility of course.
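You can check these code points yourself; here is a quick sketch in Python, using the built-in `ord`, which returns a character's code point:

```python
# Print each character's Unicode code point in the familiar U+XXXX notation.
for ch in "Hello":
    print(f"U+{ord(ch):04X}", end=" ")   # U+0048 U+0065 U+006C U+006C U+006F
print()
for ch in "你好":
    print(f"U+{ord(ch):04X}", end=" ")   # U+4F60 U+597D
print()
```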
Pretty simple and straightforward, right?
Right?
Aside: What's the Range of Unicode?
TL;DR: Capped at `0x10FFFF` (1,114,111 in decimal).
It’s because of the design of UTF-16, and for compatibility reasons the standard now guarantees that no code point above that will ever be assigned.
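As a quick sanity check, here is the arithmetic behind that cap, sketched in Python (the surrogate-pair mechanism itself is explained in the UTF-16 section below):

```python
# UTF-16 surrogate pairs carry 10 + 10 = 20 free bits beyond the BMP.
bmp = 0x10000                     # code points reachable with a single 16-bit unit
via_pairs = 2 ** 20               # code points reachable with a surrogate pair
print(hex(bmp + via_pairs - 1))   # 0x10ffff -> hence the cap
print(bmp + via_pairs - 1)        # 1114111 in decimal
```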
The Problem
As an old saying goes: when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
To use Unicode, we also have a problem to address first: for `U+0048 U+0065 U+006C U+006C U+006F`, which byte sequence should we encode it to?
`00 48 00 65 00 6C 00 6C 00 6F`
or
`48 00 65 00 6C 00 6C 00 6F 00`?
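The two byte orders above are exactly what UTF-16 big-endian and little-endian produce; a quick check in Python (assuming Python 3.8+ for the separator argument of `bytes.hex`):

```python
s = "Hello"
print(s.encode("utf-16-be").hex(" "))   # 00 48 00 65 00 6c 00 6c 00 6f
print(s.encode("utf-16-le").hex(" "))   # 48 00 65 00 6c 00 6c 00 6f 00
```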
Invent UTF-8 From Scratch
It turns out, the representation above is not only ambiguous but also a waste of memory. You see, with ASCII we only need 5 bytes to store `Hello` (1 byte each). But in the Unicode world it’s (at least) 2 bytes each, doubling the memory consumption. And the worst part is that raw code points used alone are ambiguous: without a scheme you can’t reliably tell where one character ends and the next begins. We definitely need an encoding/decoding scheme, so Rob Pike and Ken Thompson came up with a great idea for encoding Unicode, making each character take 1–4 bytes.
You may wonder: doesn’t that take more space than before? The answer is yes and no. English letters are still encoded in 1 byte, and most other characters in 2–3 bytes. Most importantly, it is no longer ambiguous. We can even invent it from scratch.
You see, for almost all systems the low 7 bits are the same (ASCII 0-127), so we can copy them straight into UTF-8, making it not only compatible with ASCII but also space-efficient.
But what about the 8th bit (extended ASCII 128-255)? Even though we want to include it, remember, we are making a (variable-length) encoding scheme after all. How can we know whether `10000000 10000000` is two characters or one? It’s very much like a Huffman tree or a trie: the leading bits tell you how to read what follows.
Code point | Byte 1 | Free bits |
---|---|---|
U+0000..U+007F | 0xxxxxxx | 7 |
To encode more characters, we simply need more free bits. And remember, we need them to be distinguishable. We intentionally left the top bit 0 for 1-byte characters, so if the top bit is 1, we could say it’s a 2-byte character: `1xxxxxxx xxxxxxxx`.
Apparently, that is not distinguishable: in `10000000 10000000` (which would encode 128), the second byte on its own looks exactly like the start of another character (think about it for a moment and you’ll get the gist). We need a way to tell whether a byte is the start of a new character or a continuation of the previous one.
How about `110xxxxx 10xxxxxx`? With some trial and error, you will find that this is just barely enough to keep a 2-byte character distinguishable.
Code point | Byte 1 | Byte 2 | Free bits |
---|---|---|---|
U+0000..U+007F | 0xxxxxxx | | 7 |
U+0080..U+07FF | 110xxxxx | 10xxxxxx | 11 |
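As a worked example of the 2-byte row, here is a sketch in Python, cross-checked against the built-in encoder:

```python
cp = ord("é")                      # U+00E9 needs 8 bits, too big for 0xxxxxxx
b1 = 0b11000000 | (cp >> 6)        # 110xxxxx <- top 5 of the 11 free bits
b2 = 0b10000000 | (cp & 0b111111)  # 10xxxxxx <- low 6 bits
print(hex(b1), hex(b2))            # 0xc3 0xa9
print("é".encode("utf-8").hex())   # c3a9 -> matches
```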
The second row is the hardest part, and the rest is just copy pasta. You can treat `10` as a continuation marker and `110`/`1110`/`11110` as the length prefix that distinguishes characters.
Code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Free bits |
---|---|---|---|---|---|
U+0000..U+007F | 0xxxxxxx | | | | 7 |
U+0080..U+07FF | 110xxxxx | 10xxxxxx | | | 11 |
U+0800..U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | | 16 |
U+10000..U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 21 |
And now we have invented UTF-8 through reasoning. This is also the way most modern programming languages store their strings today. And remember: Unicode is how you map characters, and UTF-8/UTF-16/UTF-32 is how you store/transport them.
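Here is the whole table turned into a from-scratch encoder, sketched in Python and cross-checked against the built-in `str.encode` (the function name `utf8_encode` is just for illustration):

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point per the table above (ignores the surrogate range for brevity)."""
    if code_point <= 0x7F:                          # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                      # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

# One character from each row of the table, checked against Python's encoder.
for ch in "Hé你🐍":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(ch, utf8_encode(ord(ch)).hex(" "))
```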
Aside
The sad news is that many Windows PCs in China still use the GBK encoding, driving UTF-8 users mad every time they see gibberish caused by encoding mismatches.
Here is a bash one-liner that converts GBK to UTF-8:
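One common way is `iconv` (the file names here are just placeholders):

```bash
iconv -f GBK -t UTF-8 input.txt > output.txt
```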
UTF-16
Once you understand UTF-8, this table is all you need to learn UTF-16.
Code Point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Description |
---|---|---|---|---|---|
U+0000..U+D7FF | xxxx xxxx | xxxx xxxx | | | Basic Multilingual Plane (BMP) characters |
U+D800..U+DBFF | 110110xx | xxxx xxxx | | | High surrogate (used for characters outside the BMP) |
U+DC00..U+DFFF | 110111xx | xxxx xxxx | | | Low surrogate (used for characters outside the BMP) |
U+E000..U+FFFF | xxxx xxxx | xxxx xxxx | | | BMP characters |
U+10000..U+10FFFF | 110110xx | xxxx xxxx | 110111xx | xxxx xxxx | Characters outside the BMP (surrogate pairs) |
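To see the surrogate-pair mechanics concretely, here is a sketch in Python for a code point outside the BMP, cross-checked against the built-in UTF-16-LE encoder:

```python
cp = 0x1F600                     # 😀, outside the BMP
v = cp - 0x10000                 # 20 bits remain after subtracting the BMP offset
high = 0xD800 | (v >> 10)        # top 10 bits -> high surrogate
low = 0xDC00 | (v & 0x3FF)       # low 10 bits -> low surrogate
print(hex(high), hex(low))                 # 0xd83d 0xde00
print("😀".encode("utf-16-le").hex(" "))   # 3d d8 00 de (little-endian code units)
```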
Jargon
Code Unit
- The minimal bit combination that can represent a unit of encoded text in a given character encoding scheme.
- The minimal is where the unit comes from.
- Examples:
  - UTF-8: 1 byte
    - The ASCII character `A` (`U+0041`) can be represented by one byte: `0x41`.
  - UTF-16: 2 bytes
    - The character `A` (`U+0041`) is represented by a 2-byte code unit: `0x0041`.
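A quick way to count code units in practice (a Python sketch; `len` on the UTF-8 bytes counts code units directly, and dividing the UTF-16 byte length by 2 gives UTF-16 code units):

```python
s = "A你"
print(len(s.encode("utf-8")))            # 4 code units: 1 for 'A' + 3 for '你'
print(len(s.encode("utf-16-le")) // 2)   # 2 code units: both characters fit in the BMP
```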
Code Point
- The numerical value that maps to a specific character in the Unicode character set.
- Examples:
  - ‘A’ is U+0041.
  - ‘a’ is U+0061.
  - ‘0’ is U+0030.
Plane
- A plane is a contiguous range of 65,536 (or `2^16`) code points.
- Unicode defines 17 planes, each containing 65,536 code points, for a total of 1,114,112 code points (ranging from `U+0000` to `U+10FFFF`).
Plane Name | Range | Purpose |
---|---|---|
Basic Multilingual Plane (BMP) | U+0000–U+FFFF | Most common characters and scripts |
Supplementary Multilingual Plane (SMP) | U+10000–U+1FFFF | Historic scripts, additional modern scripts |
Supplementary Ideographic Plane (SIP) | U+20000–U+2FFFF | Additional Han ideographs |
Tertiary Ideographic Plane (TIP) | U+30000–U+3FFFF | Reserved for future CJK ideographs |
Supplementary Special-purpose Plane (SSP) | U+E0000–U+EFFFF | Tags, variation selectors |
Private Use Planes (PUA) | U+F0000–U+10FFFF | Private use |
Planes 4–13 | U+40000–U+DFFFF | Currently unassigned, reserved for future use |
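Finding which plane a code point lives in is just integer division by 0x10000; a small sketch (the helper name `plane_of` is made up for illustration):

```python
def plane_of(code_point: int) -> int:
    # Each plane holds 0x10000 code points, so the plane index is code_point // 0x10000.
    return code_point >> 16

print(plane_of(ord("A")))     # 0 -> BMP
print(plane_of(0x1F600))      # 1 -> SMP (😀)
print(plane_of(0x20000))      # 2 -> SIP
```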