We all know that every character is represented in binary, more specifically, in 1 byte (sizeof(char) in C/C++). And the good old ASCII table will faithfully interpret 1000101 (69, by the way) as E every single time. Seems good, until you receive an email from China with the subject line “???? ?????? ??? ????”.
For English letters, the low 7 bits are more than enough; we can even use the topmost bit (128-255) as an extension for some common symbols (the extended ASCII table / ISO Latin-1). However, languages like CJK (Chinese, Japanese and Korean) have around 100,000 characters combined (as of Unicode 15.1). There is no way to store those in one byte like an alphabet, so how exactly do we encode a CJK character? What’s worse, when there is no standard, everyone and their dog makes one and calls it a day, so computers from different manufacturers may not be able to talk to each other, software development is a pain in the arse, and online communication is nearly impossible (assuming you’re not using English, of course).
So someone finally decided to fix this and make the mapping (from numbers to actual characters beyond English, such as CJK, European letters or even emoji) a standard. That standard is Unicode.
Unicode to the Rescue
The gist: Unicode is a mapping, the mapping.
You can think of Unicode as one big map that everyone agrees on. It maps numbers (called code points) to actual characters. For example, U+006C maps to l, where the U+ prefix marks it as a Unicode code point (written \u in some programming languages).
For the string Hello, we have U+0048 U+0065 U+006C U+006C U+006F.
And for the string 你好, we have U+4F60 U+597D.
If you are an attentive reader, you may notice that the code points for English letters are the same as their ASCII values, for backward compatibility of course.
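If you want to see the mapping for yourself, here is a quick Python sketch (any language that exposes code points works the same way):

```python
# Print each character's Unicode code point in U+XXXX form.
for ch in "Hello":
    print(f"U+{ord(ch):04X}", end=" ")  # U+0048 U+0065 U+006C U+006C U+006F
print()

for ch in "你好":
    print(f"U+{ord(ch):04X}", end=" ")  # U+4F60 U+597D
print()

# Backward compatibility: an ASCII character's code point is its ASCII value.
assert ord("E") == 69
```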
Pretty simple and straight forward, right?
Right?
Aside: What's the range of Unicode?
TL;DR: Capped at 0x10FFFF, 1,114,111 in decimal
It’s because of the design of UTF-16: surrogate pairs can only address code points up to 0x10FFFF, and for compatibility reasons the standard now guarantees that no code point above that will ever be assigned.
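For the curious, here is the arithmetic behind that cap as a quick Python check (recall that a UTF-16 surrogate pair combines one of 1,024 high surrogates with one of 1,024 low surrogates):

```python
# Supplementary code points addressable by UTF-16 surrogate pairs,
# which start right after the 16-bit range at U+10000.
supplementary = 0x400 * 0x400            # 1,024 * 1,024 = 1,048,576
max_code_point = 0x10000 + supplementary - 1
assert max_code_point == 0x10FFFF == 1_114_111
```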
The Problem
As an old saying goes: when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
Before we can use Unicode, there is a problem to address first: given U+0048 U+0065 U+006C U+006C U+006F, which bytes should we actually write out?
It turns out the naive representation above is not only ambiguous but also a waste of memory. You see, with ASCII we only need 5 bytes to store Hello (1 byte each), but in the Unicode world it’s (at least) 2 bytes per character, doubling the memory consumption. Worse, a bare sequence of code points written straight to disk is ambiguous: without extra rules there is no reliable way to tell where one character’s bytes end and the next begin. We definitely need an encoding and decoding scheme, and Rob Pike and Ken Thompson came up with a great one, encoding each code point in 1–4 bytes.
You may wonder: doesn’t that take more space than before? The answer is yes and no. English letters still take 1 byte each, and most other characters take 2–3 bytes. Most importantly, it is no longer ambiguous. In fact, we can invent the whole scheme from scratch ourselves.
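Jumping ahead a little and using the finished encodings, here is a small Python comparison of the space each one takes (the -le codecs are used only so no byte-order mark is added):

```python
s = "Hello"
print(len(s.encode("utf-8")))      # 5  bytes, same as ASCII
print(len(s.encode("utf-16-le")))  # 10 bytes, a fixed 2 per character
print(len(s.encode("utf-32-le")))  # 20 bytes, a fixed 4 per character

s = "你好"
print(len(s.encode("utf-8")))      # 6 bytes, 3 per CJK character
print(len(s.encode("utf-16-le")))  # 4 bytes, 2 per character here
```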
You see, for almost all systems the low 7 bits mean the same thing (ASCII 0-127), so we can copy them straight into UTF-8, making it not only compatible with ASCII but also memory-friendly for English text.
But what about the 8th bit (extended ASCII 128-255)? Even if we want to include it, remember, we are designing a variable-length encoding scheme. How can we know whether 10000000 10000000 is two characters or one? It’s very much like a Huffman tree or a trie: the leading bits have to tell us unambiguously how to read what follows.
| Code point | Byte 1 | Free bits |
| --- | --- | --- |
| U+0000..007F | 0xxxxxxx | 7 |
To encode more characters, we simply need more free bits, and remember, they must stay distinguishable. We intentionally left the top bit 0 for 1-byte characters, so if the top bit of a byte is 1, we know we are looking at a multi-byte character, say 1xxxxxxx xxxxxxxx.
Apparently, that alone is still not distinguishable (think about it for a moment and you will get the gist): the trailing byte can look exactly like the start of another character, so if you land in the middle of a stream, or a byte gets lost, you cannot find the character boundaries anymore. We need a way to tell whether a byte is the start of a new character or a continuation of the previous one.
How about 110xxxxx 10xxxxxx? With some trial and error, you will find that this is just barely enough to make a 2-byte character distinguishable (the sketch after the table below shows why the 10 prefix matters).
| Code point | Byte 1 | Byte 2 | Free bits |
| --- | --- | --- | --- |
| U+0000..007F | 0xxxxxxx | | 7 |
| U+0080..07FF | 110xxxxx | 10xxxxxx | 11 |
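As a sanity check of the 2-byte row, here is a Python sketch that encodes U+00E9 (é) by hand and shows why the 10 prefix matters (the naive 1xxxxxxx xxxxxxxx layout below is only an illustration, not a real encoding):

```python
cp = ord("é")                          # U+00E9, inside U+0080..07FF
byte1 = 0b110_00000 | (cp >> 6)        # 110xxxxx carries the high 5 bits
byte2 = 0b10_000000 | (cp & 0b111111)  # 10xxxxxx carries the low 6 bits
assert bytes([byte1, byte2]) == "é".encode("utf-8")   # b'\xc3\xa9'

# Why the continuation marker matters: with the naive 1xxxxxxx xxxxxxxx
# layout, a trailing byte can collide with a plain ASCII byte...
naive = bytes([0x80 | (0x0141 >> 8), 0x0141 & 0xFF])
assert b"A" in naive                   # false hit when searching for "A"

# ...but a UTF-8 continuation byte always looks like 10xxxxxx,
# so it can never be mistaken for an ASCII character.
assert b"A" not in "\u0141".encode("utf-8")   # U+0141 encodes as b'\xc5\x81'
```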
The second row is the hardest part; the rest is just copy pasta. Treat 10 as the continuation marker, and 0/110/1110/11110 on the first byte as the marker telling you how many bytes the character takes.
| Code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Free bits |
| --- | --- | --- | --- | --- | --- |
| U+0000..007F | 0xxxxxxx | | | | 7 |
| U+0080..07FF | 110xxxxx | 10xxxxxx | | | 11 |
| U+0800..FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | | 16 |
| U+10000..10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 21 |
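Translating that table directly into code gives a minimal encoder. Here is a Python sketch (not production code: a strict encoder would also reject the surrogate range U+D800..DFFF); the built-in .encode is used only to verify it:

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point, row by row from the table above."""
    if code_point <= 0x7F:            # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:          # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("beyond the Unicode range")

# Verify against the built-in encoder: ASCII, CJK and an emoji.
for ch in "Hello, 你好, 🙂":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```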
And now we have invented UTF-8 purely by reasoning. This is also how many modern programming languages (Go, Rust and Swift, for example) store their strings today.
And remember, Unicode is how you map it and UTF-8/UTF-16/UTF-32 is how you store/transport it.
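One string, one sequence of code points, three different byte layouts. A quick Python illustration (the -be codecs are used only so no byte-order mark is added):

```python
s = "你"
print(hex(ord(s)))                  # 0x4f60   - the code point (the mapping)
print(s.encode("utf-8").hex())      # e4bda0   - stored in 3 bytes
print(s.encode("utf-16-be").hex())  # 4f60     - stored in 2 bytes
print(s.encode("utf-32-be").hex())  # 00004f60 - stored in 4 bytes
```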
Aside
The sad news is, most Windows PCs in China still default to the GBK encoding, driving UTF-8 users mad every time they see gibberish caused by encoding mismatches.
Here is a bash one-liner that converts GBK to UTF-8 (one common way, using iconv; the filenames are just placeholders):
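```bash
# input.txt / output.txt are placeholder filenames.
iconv -f GBK -t UTF-8 input.txt > output.txt
```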