Code point

Code point          Byte 1    Byte 2    Byte 3    Byte 4
U+0000..007F        0xxxxxxx
U+0080..07FF        110xxxxx  10xxxxxx
U+0800..FFFF        1110xxxx  10xxxxxx  10xxxxxx
U+10000..10FFFF     11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
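
The bit layout in the table above can be sketched as a small encoder. This is only an illustration: the function name utf8_encode is made up for this note, and in practice Python's built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the table row by row."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x4D).hex())    # 4d
print(utf8_encode(0x4E2D).hex())  # e4b8ad
```

The leading bits of the first byte tell a decoder how many bytes follow, and every continuation byte starts with 10.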

UTFs

Each Unicode code point can be encoded in several different formats, called Unicode transformation formats (UTFs). For example, the letter M is the code point U+004D. In UTF-8 it is encoded as the single byte X’4D’; in UTF-16 it is encoded as the 16-bit code unit X’004D’.
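
The M example can be checked quickly in Python. A small sketch: "utf-16-be" is chosen here so that no byte order mark is prepended (plain "utf-16" would add one), and the variable names are only for illustration.

```python
letter = "M"

code_point = f"U+{ord(letter):04X}"                   # the code point itself
utf8_hex = letter.encode("utf-8").hex().upper()       # one byte in UTF-8
utf16_hex = letter.encode("utf-16-be").hex().upper()  # one 16-bit unit in UTF-16

print(code_point, utf8_hex, utf16_hex)  # U+004D 4D 004D
```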

UTF-8 is a transformation format: a variable-width encoding that uses one to four bytes per code point, as in the table above.

Script

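# Convert every file from GBK to UTF-8 via a temporary file; writing iconv's output straight onto its input can truncate the file.
find . -type f -exec sh -c 'iconv -f GBK -t UTF-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' _ {} \;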
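
The same GBK to UTF-8 conversion can be sketched in Python for cases where iconv is not available. This is a minimal sketch, not a drop-in replacement: convert_file and convert_tree are hypothetical helper names, every file under the root is assumed to be GBK text, and file permissions are not preserved.

```python
import os
import tempfile
from pathlib import Path


def convert_file(path: Path, src: str = "gbk", dst: str = "utf-8") -> None:
    """Re-encode one file; the original is replaced only if decoding succeeds."""
    text = path.read_bytes().decode(src)  # raises UnicodeDecodeError on bad input
    # Write to a temporary file in the same directory, then swap it into place.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(text.encode(dst))
        os.replace(tmp_name, path)
    except BaseException:
        os.unlink(tmp_name)
        raise


def convert_tree(root: Path) -> None:
    """Convert every regular file under root."""
    # Collect the file list first so temporary files are never picked up.
    for path in [p for p in root.rglob("*") if p.is_file()]:
        convert_file(path)
```

Calling convert_tree(Path(".")) would mirror the shell one-liner. Running it twice on the same tree is not safe, because the second pass would try to decode UTF-8 bytes as GBK.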
  • Java uses Unicode in String objects: a String is a sequence of UTF-16 code units.
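
A practical consequence is that Java's String.length() counts UTF-16 code units, not code points, so a character above U+FFFF counts as two. The sketch below illustrates the counting in Python (to match the other examples in these notes); utf16_code_units is a made-up helper name.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit UTF-16 code units needed to store text."""
    return len(text.encode("utf-16-le")) // 2


print(utf16_code_units("M"))           # 1
print(utf16_code_units("\U0001F600"))  # 2, stored as a surrogate pair
print(len("\U0001F600"))               # 1, Python's len counts code points
```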