Each Unicode code point can be expressed in several different formats. These formats are called Unicode transformation formats (UTFs). For example, the letter M is the Unicode code point U+004D. In UTF-8, this code point is represented as X’4D’. In UTF-16, this code point can be represented as X'004D'
.
Name | UTF-8 | UTF-16 | UTF-16BE | UTF-16LE | UTF-32 | UTF-32BE | UTF-32LE |
---|---|---|---|---|---|---|---|
Smallest code point | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 | 0000 |
Largest code point | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF | 10FFFF |
Code unit size | 8 bits | 16 bits | 16 bits | 16 bits | 32 bits | 32 bits | 32 bits |
Byte order | N/A | BOM | big-endian | little-endian | BOM | big-endian | little-endian |
Fewest bytes per character | 1 | 2 | 2 | 2 | 4 | 4 | 4 |
Most bytes per character | 4 | 4 | 4 | 4 | 4 | 4 | 4 |
BOM
- Byte Order Map
Links
- CP 1252 (Western Europe)
- Unicode
- UTF-8
- UTF-16
- Invent UTF-8 From Scratch