The Dark Ages
We all know that every single character is represented in binary, more specifically, in 1 byte (`sizeof(char)` in C/C++). And the good old ASCII table will faithfully interpret `1000101` (69, by the way) as `E` every single time. Seems good, until you receive an email from China with the subject line “???? ?????? ??? ????”.
For English letters, the low 7 bits are more than enough; we can even use the topmost bit (128-255) as an extension for some common symbols (the extended ASCII table / ISO Latin-1). However, languages like CJK (Chinese, Japanese and Korean) have around 100k characters combined (as of Unicode 15.1). There is no way to store those in one byte like an alphabet, so how exactly do we encode a CJK character? What’s worse, when there is no standard, everyone and their dog makes one and calls it a day, so computers made by different manufacturers may not be able to communicate with each other, software development is no doubt a pain in the arse, and online communication is nearly impossible (assuming you’re not using English, of course).
So someone finally decided to fix this by making the mapping (from numbers to actual characters beyond English, such as CJK, European letters or even emoji) a standard: that’s Unicode.
Unicode to the Rescue
The gist: Unicode is a mapping, the mapping.
We can freely assume that Unicode is a map that everyone agrees on. It maps numbers (code points) to actual characters. For example, `U+006C` maps to `l`, where `U+` means it is a Unicode code point (written `\u` in some languages).
For the string `Hello`, we have `U+0048 U+0065 U+006C U+006C U+006F`.
And for the string `你好`, we have `U+4F60 U+597D`.
If you are an attentive reader, you may notice that the code points for English letters are the same as their ASCII values, for backward compatibility of course.
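You can check these code points yourself; here is a quick sketch in Python, using the built-in `ord`, which returns a character's code point:

```python
# Print each character's Unicode code point in the familiar U+XXXX notation.
for ch in "Hello":
    print(f"U+{ord(ch):04X}", end=" ")   # U+0048 U+0065 U+006C U+006C U+006F
print()
for ch in "你好":
    print(f"U+{ord(ch):04X}", end=" ")   # U+4F60 U+597D
print()
```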
Pretty simple and straightforward, right?
Right?
Aside: What's the Range of Unicode?
TL;DR: Capped at `0x10FFFF` (1,114,111 in decimal).
It’s because of the design of UTF-16, and for compatibility reasons the standard now guarantees that no code point above that will ever be assigned.
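As a quick sanity check, here is the arithmetic behind that cap, sketched in Python (the surrogate-pair mechanism itself is explained in the UTF-16 section below):

```python
# UTF-16 surrogate pairs carry 10 + 10 = 20 free bits beyond the BMP.
bmp = 0x10000                     # code points reachable with a single 16-bit unit
via_pairs = 2 ** 20               # code points reachable with a surrogate pair
print(hex(bmp + via_pairs - 1))   # 0x10ffff -> hence the cap
print(bmp + via_pairs - 1)        # 1114111 in decimal
```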
The Problem
As an old saying goes: when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
To use Unicode, we also have a problem to address first: for `U+0048 U+0065 U+006C U+006C U+006F`, which byte sequence should we encode it to?
`00 48 00 65 00 6C 00 6C 00 6F`
or
`48 00 65 00 6C 00 6C 00 6F 00`?
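The two byte orders above are exactly what UTF-16 big-endian and little-endian produce; a quick check in Python (assuming Python 3.8+ for the separator argument of `bytes.hex`):

```python
s = "Hello"
print(s.encode("utf-16-be").hex(" "))   # 00 48 00 65 00 6c 00 6c 00 6f
print(s.encode("utf-16-le").hex(" "))   # 48 00 65 00 6c 00 6c 00 6f 00
```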
Invent UTF-8 From Scratch
It turns out, the representation above is not only ambiguous but also a waste of memory. You see, with ASCII we only need 5 bytes to store `Hello` (1 byte each). But in the Unicode world it’s (at least) 2 bytes each, doubling the memory consumption. And the worst part is that raw code points used alone are ambiguous: without a scheme you can’t reliably tell where one character ends and the next begins. We definitely need an encoding/decoding scheme, so Rob Pike and Ken Thompson came up with a great idea for encoding Unicode, making each character take 1–4 bytes.
You may wonder: doesn’t that take more space than before? The answer is yes and no. English letters are still encoded in 1 byte, and most other characters in 2–3 bytes. Most importantly, it is no longer ambiguous. We can even invent it from scratch.
You see, for almost all systems the low 7 bits are the same (ASCII 0-127), so we can copy them straight into UTF-8, making it not only compatible with ASCII but also space-efficient.
But what about the 8th bit (extended ASCII 128-255)? Even though we want to include it, remember, we are making a (variable-length) encoding scheme after all. How can we know whether `10000000 10000000` is two characters or one? It’s very much like a Huffman tree or a trie: the leading bits tell you how to read what follows.
Code point | Byte 1 | Free bits |
---|---|---|
U+0000..U+007F | 0xxxxxxx | 7 |
To encode more characters, we simply need more free bits. And remember, we need them to be distinguishable. We intentionally left the top bit 0 for 1-byte characters, so if the top bit is 1, we could say it’s a 2-byte character: `1xxxxxxx xxxxxxxx`.
Apparently, that is not distinguishable: in `10000000 10000000` (which would encode 128), the second byte on its own looks exactly like the start of another character (think about it for a moment and you’ll get the gist). We need a way to tell whether a byte is the start of a new character or a continuation of the previous one.
How about `110xxxxx 10xxxxxx`? With some trial and error, you will find that this is just barely enough to keep a 2-byte character distinguishable.
Code point | Byte 1 | Byte 2 | Free bits |
---|---|---|---|
U+0000..U+007F | 0xxxxxxx | | 7 |
U+0080..U+07FF | 110xxxxx | 10xxxxxx | 11 |
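As a worked example of the 2-byte row, here is a sketch in Python, cross-checked against the built-in encoder:

```python
cp = ord("é")                      # U+00E9 needs 8 bits, too big for 0xxxxxxx
b1 = 0b11000000 | (cp >> 6)        # 110xxxxx <- top 5 of the 11 free bits
b2 = 0b10000000 | (cp & 0b111111)  # 10xxxxxx <- low 6 bits
print(hex(b1), hex(b2))            # 0xc3 0xa9
print("é".encode("utf-8").hex())   # c3a9 -> matches
```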
The second row is the hardest part, and the rest is just copy pasta. You can treat `10` as a continuation marker and `110`/`1110`/`11110` as the length prefix that distinguishes characters.
Code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Free bits |
---|---|---|---|---|---|
U+0000..U+007F | 0xxxxxxx | | | | 7 |
U+0080..U+07FF | 110xxxxx | 10xxxxxx | | | 11 |
U+0800..U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | | 16 |
U+10000..U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 21 |
And now we have invented UTF-8 through reasoning. This is also the way most modern programming languages store their strings today. And remember: Unicode is how you map characters, and UTF-8/UTF-16/UTF-32 is how you store/transport them.
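Here is the whole table turned into a from-scratch encoder, sketched in Python and cross-checked against the built-in `str.encode` (the function name `utf8_encode` is just for illustration):

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point per the table above (ignores the surrogate range for brevity)."""
    if code_point <= 0x7F:                          # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                      # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

# One character from each row of the table, checked against Python's encoder.
for ch in "Hé你🐍":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(ch, utf8_encode(ord(ch)).hex(" "))
```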
Aside
The sad news is that many Windows PCs in China still use the GBK encoding, driving UTF-8 users mad every time they see gibberish caused by encoding mismatches.
Here is a bash one-liner that converts GBK to UTF-8:
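One common way is `iconv` (the file names here are just placeholders):

```bash
iconv -f GBK -t UTF-8 input.txt > output.txt
```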
UTF-16
Once you understand UTF-8, this table is all you need to learn UTF-16.
Code Point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Description |
---|---|---|---|---|---|
U+0000..U+D7FF | xxxx xxxx | xxxx xxxx | | | Basic Multilingual Plane (BMP) characters |
U+D800..U+DBFF | 110110xx | xxxx xxxx | | | High surrogate (used for characters outside the BMP) |
U+DC00..U+DFFF | 110111xx | xxxx xxxx | | | Low surrogate (used for characters outside the BMP) |
U+E000..U+FFFF | xxxx xxxx | xxxx xxxx | | | BMP characters |
U+10000..U+10FFFF | 110110xx | xxxx xxxx | 110111xx | xxxx xxxx | Characters outside the BMP (surrogate pairs) |
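To see the surrogate-pair mechanics concretely, here is a sketch in Python for a code point outside the BMP, cross-checked against the built-in UTF-16-LE encoder:

```python
cp = 0x1F600                     # 😀, outside the BMP
v = cp - 0x10000                 # 20 bits remain after subtracting the BMP offset
high = 0xD800 | (v >> 10)        # top 10 bits -> high surrogate
low = 0xDC00 | (v & 0x3FF)       # low 10 bits -> low surrogate
print(hex(high), hex(low))                 # 0xd83d 0xde00
print("😀".encode("utf-16-le").hex(" "))   # 3d d8 00 de (little-endian code units)
```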
Jargon
Code Unit
- The minimal bit combination that can represent a unit of encoded text in a given character encoding scheme.
- The minimal is where the unit comes from.
- Examples:
  - UTF-8: 1 byte
    - The ASCII character `A` (`U+0041`) can be represented by one byte: `0x41`.
  - UTF-16: 2 bytes
    - The character `A` (`U+0041`) is represented by a 2-byte code unit: `0x0041`.
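A quick way to count code units in practice (a Python sketch; `len` on the UTF-8 bytes counts code units directly, and dividing the UTF-16 byte length by 2 gives UTF-16 code units):

```python
s = "A你"
print(len(s.encode("utf-8")))            # 4 code units: 1 for 'A' + 3 for '你'
print(len(s.encode("utf-16-le")) // 2)   # 2 code units: both characters fit in the BMP
```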
Code Point
- The numerical value that maps to a specific character in the Unicode character set.
- Examples:
  - ‘A’ is U+0041.
  - ‘a’ is U+0061.
  - ‘0’ is U+0030.
Plane
- A plane is a contiguous range of 65,536 (or `2^16`) code points.
- Unicode defines 17 planes, each containing 65,536 code points, for a total of 1,114,112 code points (ranging from `U+0000` to `U+10FFFF`).
Plane Name | Range | Purpose |
---|---|---|
Basic Multilingual Plane (BMP) | U+0000–U+FFFF | Most common characters and scripts |
Supplementary Multilingual Plane (SMP) | U+10000–U+1FFFF | Historic scripts, additional modern scripts |
Supplementary Ideographic Plane (SIP) | U+20000–U+2FFFF | Additional Han ideographs |
Tertiary Ideographic Plane (TIP) | U+30000–U+3FFFF | Reserved for future CJK ideographs |
Supplementary Special-purpose Plane (SSP) | U+E0000–U+EFFFF | Tags, variation selectors |
Private Use Planes (PUA) | U+F0000–U+10FFFF | Private use |
Planes 4–13 | U+40000–U+DFFFF | Currently unassigned, reserved for future use |
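Finding which plane a code point lives in is just integer division by 0x10000; a small sketch (the helper name `plane_of` is made up for illustration):

```python
def plane_of(code_point: int) -> int:
    # Each plane holds 0x10000 code points, so the plane index is code_point // 0x10000.
    return code_point >> 16

print(plane_of(ord("A")))     # 0 -> BMP
print(plane_of(0x1F600))      # 1 -> SMP (😀)
print(plane_of(0x20000))      # 2 -> SIP
```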