The Dark Ages

We all know that every single character is represented in binary; more specifically, in 1 byte (sizeof(char) in C/C++). And the good old ASCII table will faithfully interpret 1000101 (69, by the way) as E every single time. Seems good, until you receive an email from China with the subject line “???? ?????? ??? ????”.

For English letters, the low 7 bits are more than enough; we can even use the topmost bit (128-255) as an extension for some common symbols (the extended ASCII table / ISO Latin-1). However, there are language groups like CJK (Chinese, Japanese and Korean) which have around 100k characters combined (as of Unicode 15.1). There is no way we can store those in one byte like an alphabet, so how exactly do we encode a CJK character? What’s worse, when there is no standard, everyone and their dog makes one and calls it a day, so computers made by different manufacturers may not be able to communicate with each other, software development is no doubt a pain in the arse, and online communication is nearly impossible (assuming you’re not using English, of course).

So someone finally decided to fix this and make the mapping (from numbers to actual characters beyond English, such as CJK, European letters or even emoji) a standard. That standard is Unicode.

Unicode to the Rescue

The gist: Unicode is a mapping, the mapping.

We can safely assume that Unicode is a map that everyone agrees on. It maps a number (a code point) to an actual character. For example, U+006C maps to l, where the U+ prefix means it is a Unicode code point (written as \u in some escape syntaxes).

For the string Hello, we have U+0048 U+0065 U+006C U+006C U+006F.
And for the string 你好, we have U+4F60 U+597D.

If you are an attentive reader, you may notice that the code points for English letters are the same as their ASCII values, for backward compatibility of course.
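We can check this mapping directly. Here is a small Python sketch (ord simply returns a character's code point) that reproduces the examples above:

# Print the Unicode code point of each character, in U+XXXX form.
def code_points(s: str) -> str:
    return " ".join(f"U+{ord(ch):04X}" for ch in s)

print(code_points("Hello"))   # U+0048 U+0065 U+006C U+006C U+006F
print(code_points("你好"))    # U+4F60 U+597D
print(ord("E"))               # 69 -- the same value the ASCII table gives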

Pretty simple and straightforward, right?
Right?

Aside: What's the range of Unicode

TL;DR: Capped at 0x10FFFF, 1,114,111 in decimal
It’s because of the design of UTF-16 (more on that below), and for compatibility reasons the standard now guarantees that no code point above that will ever be assigned.
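As a quick sanity check (using Python's chr here, which converts a code point back to a character), anything above the cap is rejected outright:

chr(0x10FFFF)    # fine: the highest valid code point
chr(0x110000)    # ValueError: chr() arg not in range(0x110000)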

The Problem

As an old saying goes: when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

To use Unicode, we also have a problem to address first: for U+0048 U+0065 U+006C U+006C U+006F, which byte sequence should we encode it to?

00 48 00 65 00 6C 00 6C 00 6F
or 
48 00 65 00 6C 00 6C 00 6F 00
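This is exactly the byte-order (endianness) question, and both answers exist as real encodings. A quick Python sketch reproduces the two sequences above (bytes.hex with a separator needs Python 3.8+):

# "Hello" in UTF-16, big-endian vs little-endian byte order.
print("Hello".encode("utf-16-be").hex(" "))   # 00 48 00 65 00 6c 00 6c 00 6f
print("Hello".encode("utf-16-le").hex(" "))   # 48 00 65 00 6c 00 6c 00 6f 00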

Invent UTF-8 From Scratch

It turns out, the representation above is not only ambiguous but also a waste of memory. You see, when using ASCII, we only need 5 bytes to store Hello (1 byte each). But in the naive Unicode representation, it’s (at least) 2 bytes each, doubling the memory consumption. And the worst part is that raw code points dumped as bytes are ambiguous: without extra rules, a decoder cannot tell where one character ends and the next begins. We definitely need an encoding & decoding scheme, so Rob Pike and Ken Thompson came up with a great idea for encoding Unicode, making each character take 1-4 bytes.

You may wonder: doesn’t that take more space than before? The answer is yes and no. English letters still take 1 byte each, and most other characters take 2-3 bytes. Most importantly, it is no longer ambiguous. We can even invent it from scratch.
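Here is roughly what those sizes look like in practice; this is a Python sketch using str.encode, which produces UTF-8 by default (the very scheme we are about to reinvent):

# Number of UTF-8 bytes needed per character.
for ch in ["H", "é", "你", "😀"]:
    print(ch, len(ch.encode("utf-8")), "byte(s)")   # 1, 2, 3 and 4 bytes respectively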

You see, for almost all systems, the low 7 bits mean the same thing (ASCII 0-127), so we can copy-paste them straight into UTF-8, making it not only compatible with ASCII but also memory-friendly.
But what about the 8th bit (extended ASCII 128-255)? We do want to include those characters, but remember, we are making a (variable length) encoding scheme after all. How would we know whether 10000000 10000000 is two characters or one? The answer is very much like a Huffman tree or a trie: make the prefixes self-describing.

Code point         Byte 1      Free bits
U+0000..007F       0xxxxxxx    7

To encode more characters, we simply need more free bits. And remember, we need them to be distinguishable. We intentionally left the top bit 0 for 1-byte characters, so if the top bit of the first byte is 1, we are sure it’s a 2-byte character: 1xxxxxxx xxxxxxxx.

Apparently, that is not distinguishable: code point 128 would become 10000000 10000000, and the second byte looks exactly like the start of another multi-byte character (think about it for a moment and you’ll get the gist). We need a way to tell whether a byte is the start of a new character or a continuation of the previous one.

How about 110xxxxx 10xxxxxx? With some trial and error, you will find that this is just enough to make a 2-byte character distinguishable: a byte starting with 0 is ASCII, 110 can only start a 2-byte character, and 10 can only ever be a continuation.

Code point         Byte 1      Byte 2      Free bits
U+0000..007F       0xxxxxxx                7
U+0080..07FF       110xxxxx    10xxxxxx    11

The second row is the hardest part; the rest is just copy-paste. You can treat 10 as a continuation marker and 110/1110/11110 as the leading markers that tell characters apart.
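As a quick sanity check of that two-byte row before we generalize, here is a Python sketch doing the bit surgery by hand for α (U+03B1): its 10 payload bits split into 5 + 6 and fill the two x regions.

cp = 0x03B1                              # 'α', in the U+0080..07FF range
byte1 = 0b11000000 | (cp >> 6)           # 110xxxxx: top 5 payload bits
byte2 = 0b10000000 | (cp & 0b00111111)   # 10xxxxxx: low 6 payload bits
print(hex(byte1), hex(byte2))            # 0xce 0xb1
print("α".encode("utf-8").hex(" "))      # ce b1 -- a real UTF-8 encoder agrees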

Code point         Byte 1      Byte 2      Byte 3      Byte 4      Free bits
U+0000..007F       0xxxxxxx                                        7
U+0080..07FF       110xxxxx    10xxxxxx                            11
U+0800..FFFF       1110xxxx    10xxxxxx    10xxxxxx                16
U+10000..10FFFF    11110xxx    10xxxxxx    10xxxxxx    10xxxxxx    21

And now we have invented UTF-8 through pure reasoning. This is also how many modern programming languages store their strings today. And remember: Unicode is how you map characters, and UTF-8/UTF-16/UTF-32 is how you store/transport them.
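To convince ourselves the table really is the whole algorithm, here is a minimal Python sketch (the name utf8_encode is just for illustration, and it skips details a production encoder handles, such as rejecting surrogate code points) that encodes one code point by walking the four rows, then checks itself against the built-in encoder:

def utf8_encode(cp: int) -> bytes:
    # Pick the smallest row of the table with enough free bits.
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    if cp <= 0x10FFFF:                   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    raise ValueError("code point beyond U+10FFFF")

for ch in "Hé你😀":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_encode(ord(ch)).hex(' ')}")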

Aside

The sad news is, many Windows PCs in China still default to the GBK encoding, driving UTF-8 users mad every time they see gibberish caused by the encoding mismatch.
Here is a shell one-liner that converts GBK files to UTF-8 in place:

# Convert GBK to UTF-8 in place (write to a temp file first; iconv can't safely overwrite its own input)
find . -type f -exec sh -c 'iconv -f GBK -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' _ {} \;

UTF-16

You only need this table to learn UTF-16 after you understand UTF-8.

Code point         Code unit 1          Code unit 2          Description
U+0000..D7FF       xxxxxxxx xxxxxxxx                         Basic Multilingual Plane (BMP) characters
U+D800..DBFF       110110xx xxxxxxxx                         High surrogate (used for characters outside BMP)
U+DC00..DFFF       110111xx xxxxxxxx                         Low surrogate (used for characters outside BMP)
U+E000..FFFF       xxxxxxxx xxxxxxxx                         BMP characters
U+10000..10FFFF    110110xx xxxxxxxx    110111xx xxxxxxxx    Characters outside BMP (surrogate pairs)
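One detail the table alone can’t show: for a character outside the BMP, you first subtract 0x10000, then split the remaining 20 bits into two 10-bit halves that fill the x’s of the high and low surrogate. A quick Python sketch for 😀 (U+1F600):

cp = 0x1F600 - 0x10000                      # 20 bits of payload: 0x0F600
high = 0xD800 | (cp >> 10)                  # high surrogate: 0xD83D
low = 0xDC00 | (cp & 0x3FF)                 # low surrogate:  0xDE00
print(hex(high), hex(low))                  # 0xd83d 0xde00
print("😀".encode("utf-16-be").hex(" "))    # d8 3d de 00 -- the same pair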

Jargons

Code Unit

  • The minimal bit combination that can represent a unit of encoded text in a given character encoding scheme.
  • The “minimal” part is where the word unit comes from.
  • Examples:
    • UTF-8: 1 byte
      • ASCII character A (U+0041) can be represented by one byte: 0x41.
    • UTF-16: 2 bytes
      • The character A (U+0041) is represented by a single 2-byte code unit: 0x0041.
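A concrete way to see code units (counted with Python here): the same emoji is 1 code point, but 4 UTF-8 code units and 2 UTF-16 code units.

s = "😀"
print(len(s))                            # 1 code point
print(len(s.encode("utf-8")))            # 4 UTF-8 code units (bytes)
print(len(s.encode("utf-16-le")) // 2)   # 2 UTF-16 code units (a surrogate pair)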

Code Point

  • The numerical value that maps to a specific character in the Unicode character set.
  • Examples:
    • ‘A’ is U+0041.
    • ‘a’ is U+0061.
    • ‘0’ is U+0030.
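In Python, ord and chr convert between a character and its code point, which makes the examples above easy to verify:

print(hex(ord("A")))    # 0x41 -> U+0041
print(hex(ord("0")))    # 0x30 -> U+0030
print(chr(0x4F60))      # 你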

Plane

  • A plane is a contiguous range of 65,536 (or 2^16) code points.
  • Unicode defines 17 planes, each containing 65,536 code points, for a total of 1,114,112 code points (ranging from U+0000 to U+10FFFF).
Plane name                                   Range                Purpose
Basic Multilingual Plane (BMP)               U+0000..U+FFFF       Most common characters and scripts
Supplementary Multilingual Plane (SMP)       U+10000..U+1FFFF     Historic scripts, additional modern scripts
Supplementary Ideographic Plane (SIP)        U+20000..U+2FFFF     Additional Han ideographs
Tertiary Ideographic Plane (TIP)             U+30000..U+3FFFF     Reserved for future CJK ideographs
Planes 4-13 (unassigned)                     U+40000..U+DFFFF     Reserved for future use
Supplementary Special-purpose Plane (SSP)    U+E0000..U+EFFFF     Tags, variation selectors
Private Use Planes (PUA)                     U+F0000..U+10FFFF    Private use
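Since every plane holds exactly 2^16 code points, the plane number of a code point is simply its value shifted right by 16 bits; a quick Python check:

for ch in "A你😀":
    cp = ord(ch)
    print(f"U+{cp:04X} is in plane {cp >> 16}")   # plane 0 (BMP) for A and 你, plane 1 (SMP) for 😀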