Code point
| Code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|
| U+0000..U+007F | 0xxxxxxx | | | |
| U+0080..U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800..U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000..U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
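The bit layout in the table can be followed by hand. The sketch below is illustrative only: `utf8_encode` is a hypothetical helper name, and the rejection of surrogate code points (U+D800..U+DFFF) is an addition that the table itself does not show. In practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point to UTF-8 using the table's bit patterns."""
    # Assumption beyond the table: surrogates and out-of-range values are rejected.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For instance, `utf8_encode(0x20AC)` (the euro sign) falls in the three-byte row and yields the bytes E2 82 AC.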
UTFs
Each Unicode code point can be expressed in several different formats. These formats are called Unicode transformation formats (UTFs). For example, the letter M is the Unicode code point U+004D. In UTF-8, this code point is represented as X’4D’. In UTF-16, this code point can be represented as X’004D’.
UTF-8 is a variable-width transformation format: each code point is encoded in one to four bytes.
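The letter M example can be checked with Python's built-in codecs. This is only a quick sketch; it assumes big-endian byte order for UTF-16 (`utf-16-be`), which is what gives X'004D'. In little-endian order the same code unit would be stored as the bytes 4D 00.

```python
letter = "M"

code_point = ord(letter)                               # integer value of U+004D
utf8_hex = letter.encode("utf-8").hex().upper()        # one byte in UTF-8
utf16_hex = letter.encode("utf-16-be").hex().upper()   # one 16-bit unit, big-endian

print(f"U+{code_point:04X}  UTF-8: X'{utf8_hex}'  UTF-16: X'{utf16_hex}'")
# prints: U+004D  UTF-8: X'4D'  UTF-16: X'004D'
```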
Script
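Convert every file under the current directory from GBK to UTF-8. The output goes to a temporary file first, because letting iconv write to the same file it is reading can truncate the input:
find . -type f -exec sh -c 'iconv -f GBK -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' sh {} \;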
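The same conversion can be sketched in Python, which makes the failure handling explicit. The function names `convert_to_utf8` and `convert_tree` are hypothetical, and the choice to skip files that do not decode as GBK is an assumption rather than something the shell command does.

```python
from pathlib import Path


def convert_to_utf8(path: Path, source_encoding: str = "gbk") -> None:
    """Re-encode one file in place from source_encoding to UTF-8."""
    # Read and decode fully before writing, so a decode error
    # leaves the original file untouched.
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))


def convert_tree(root, source_encoding: str = "gbk") -> list:
    """Convert every regular file under root; return the paths that were converted."""
    converted = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            convert_to_utf8(path, source_encoding)
        except UnicodeDecodeError:
            # Assumption: files that are not valid GBK are left as they are.
            continue
        converted.append(path)
    return converted
```

Calling `convert_tree(".")` processes the current directory. It should be run only once per tree: bytes that are already UTF-8 can sometimes decode as GBK without an error, and a second pass would then garble them. Pure ASCII files come out unchanged either way.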
Links
- Java uses Unicode in its `String` objects.