Variable Width Int (varint)

  • Encodes a 64-bit integer in 1 to 10 bytes; small values take fewer bytes, so most of the time it saves space.
  • The most significant bit (MSB) of each byte is a continuation bit: 1 means more bytes follow, 0 marks the last byte of the number.

To decode a varint, for example the two bytes 10010110 00000001, first drop the MSB from each byte, as it is only there to tell us whether we’ve reached the end of the number (it is set in the first byte because more bytes follow, and clear in the last byte). The remaining 7-bit payloads are in little-endian order. Convert to big-endian order, concatenate, and interpret as an unsigned 64-bit integer:

10010110 00000001        // Original inputs.
 0010110  0000001        // Drop continuation bits.
 0000001  0010110        // Convert to big-endian.
   00000010010110        // Concatenate.
 128 + 16 + 4 + 2 = 150  // Interpret as an unsigned 64-bit integer.
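
The decoding steps above, and the matching encoder, can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation; the function names `encode_varint` and `decode_varint` are made up for these notes.

```python
def encode_varint(value):
    """Encode an unsigned integer (0 <= value < 2**64) as varint bytes."""
    if not 0 <= value < (1 << 64):
        raise ValueError("value must fit in an unsigned 64-bit integer")
    out = bytearray()
    while True:
        payload = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(payload | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(payload)         # continuation bit clear: last byte
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at data[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # payloads arrive least significant first
        if not byte & 0x80:
            return result, pos
        shift += 7
        if shift >= 70:
            raise ValueError("varint longer than 10 bytes")


print(encode_varint(150).hex())             # 9601
print(decode_varint(bytes([0x96, 0x01])))   # (150, 2)
```

Shifting each payload left by 7 more bits than the previous one performs the "convert to big-endian and concatenate" step in a single pass.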

Tag-Length-Value (TLV)

There are six wire types: VARINT, I64, LEN, SGROUP, EGROUP, and I32.

ID  Name    Used For
0   VARINT  int32, int64, uint32, uint64, sint32, sint64, bool, enum
1   I64     fixed64, sfixed64, double
2   LEN     string, bytes, embedded messages, packed repeated fields
3   SGROUP  group start (deprecated)
4   EGROUP  group end (deprecated)
5   I32     fixed32, sfixed32, float

The “tag” of a record is encoded as a varint formed from the field number and the wire type via the formula (field_number << 3) | wire_type.
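
As a quick check of the formula, here is a small Python sketch that builds a tag value and splits it back apart. The helper names `make_tag` and `split_tag` are hypothetical. Note that the resulting integer is itself written to the wire as a varint; for field numbers 1 through 15 it fits in a single byte.

```python
# Wire type IDs and names, as listed in the table above.
WIRE_TYPES = {0: "VARINT", 1: "I64", 2: "LEN", 3: "SGROUP", 4: "EGROUP", 5: "I32"}


def make_tag(field_number, wire_type):
    """Combine a field number and a wire type into the tag value."""
    if wire_type not in WIRE_TYPES:
        raise ValueError("unknown wire type")
    if field_number < 1:
        raise ValueError("field numbers start at 1")
    return (field_number << 3) | wire_type


def split_tag(tag):
    """Return (field_number, wire_type): the low 3 bits are the wire type."""
    return tag >> 3, tag & 0x07


tag = make_tag(2, 2)                 # field 2, wire type LEN
print(hex(tag))                      # 0x12
field_number, wire_type = split_tag(tag)
print(field_number, WIRE_TYPES[wire_type])   # 2 LEN
```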

Integer Types

Bools and Enums

Bools and enums are both encoded as if they were int32s. Bools, in particular, always encode as either 00 or 01.

Signed Integers

  • intN types
    • encode negative numbers as two’s complement, which means that, as unsigned 64-bit integers, they have their highest bit set. As a result, a negative value always takes all ten bytes. For example, -2 is encoded as:

11111110 11111111 11111111 11111111 11111111
11111111 11111111 11111111 11111111 00000001
  • sintN types
    • use the “ZigZag” encoding instead of two’s complement to encode negative integers. Non-negative integers p are encoded as 2 * p (the even numbers), while negative integers n are encoded as 2 * |n| - 1 (the odd numbers). The encoding thus “zig-zags” between positive and negative numbers. For example:
Signed Original  Encoded As
0                0
-1               1
1                2
-2               3
0x7fffffff       0xfffffffe
-0x80000000      0xffffffff

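Equivalently, for sint32 the encoding is (n << 1) ^ (n >> 31), where >> is an arithmetic (sign-extending) shift. For sint64 the shift is n >> 63.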
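
To make the difference between intN and sintN concrete, the following Python sketch implements ZigZag and compares the encoded size of -2 under both schemes. The function names are illustrative only. Because Python integers are unbounded, the result is masked to the target width to mimic fixed-width arithmetic.

```python
def zigzag_encode(n, bits=32):
    """Map a signed integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    mask = (1 << bits) - 1
    # Python's >> on a negative int is an arithmetic shift, as the formula needs.
    return ((n << 1) ^ (n >> (bits - 1))) & mask


def zigzag_decode(z):
    """Inverse mapping: even values are non-negative, odd values negative."""
    return (z >> 1) ^ -(z & 1)


def encode_varint(value):
    """Minimal varint encoder for non-negative integers."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


for n in (0, -1, 1, -2, 0x7FFFFFFF, -0x80000000):
    print(n, hex(zigzag_encode(n)))

# -2 as int64: two's complement, reinterpreted as an unsigned 64-bit value.
as_int64 = encode_varint(-2 & 0xFFFFFFFFFFFFFFFF)
# -2 as sint64: ZigZag first, then varint.
as_sint64 = encode_varint(zigzag_encode(-2, bits=64))
print(len(as_int64), len(as_sint64))   # 10 1
```

Fields that often hold negative values are therefore much cheaper as sintN than as intN.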

Length-Delimited Records

Consider this message schema:

message Test2 {
  optional string b = 2;
}

The field b is a string, and strings are LEN-encoded. If we set b to "testing", it is encoded as a LEN record with field number 2 containing the ASCII string "testing". The result is `120774657374696e67`. Breaking up the bytes,

12 07 [74 65 73 74 69 6e 67]
TAG LEN [CONTENT]

we see that the tag, `12`, is 00010 010, or 2:LEN. The byte that follows is the int32 varint 7, and the next seven bytes are the UTF-8 encoding of "testing". Because the length is an int32 varint, the maximum length of a string is 2 GB.
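
The byte breakdown above can be reproduced with a short Python sketch. The helpers `encode_len_record` and `decode_len_record` are hypothetical names for this illustration, not part of any protobuf library, and error handling is kept to a minimum.

```python
LEN = 2  # wire type ID for length-delimited records


def encode_varint(value):
    """Minimal varint encoder for non-negative integers."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_varint(data, pos=0):
    """Return (value, next_pos) for the varint starting at data[pos]."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_len_record(field_number, payload):
    """Tag, then payload length as a varint, then the payload bytes."""
    tag = (field_number << 3) | LEN
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def decode_len_record(data, pos=0):
    """Return (field_number, payload, next_pos) for one LEN record."""
    tag, pos = decode_varint(data, pos)
    field_number, wire_type = tag >> 3, tag & 0x07
    if wire_type != LEN:
        raise ValueError("not a LEN record")
    length, pos = decode_varint(data, pos)
    payload = data[pos:pos + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return field_number, payload, pos + length


record = encode_len_record(2, "testing".encode("utf-8"))
print(record.hex())               # 120774657374696e67
print(decode_len_record(record))  # (2, b'testing', 9)
```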

Submessage

Submessage fields also use the LEN wire type.
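
As a rough illustration, suppose a hypothetical inner message `Test1` has a single int32 field `a = 1` set to 150, and an outer message `Test3` holds it in a field `c = 3`. Neither schema appears in these notes; both are assumptions for the example. The inner message is serialized first, and its bytes become the payload of a LEN record in the outer message:

```python
VARINT, LEN = 0, 2  # wire type IDs


def encode_varint(value):
    """Minimal varint encoder for non-negative integers."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_tag(field_number, wire_type):
    """The tag is itself written as a varint."""
    return encode_varint((field_number << 3) | wire_type)


# Hypothetical inner message: Test1 { a = 150 }, field 1, wire type VARINT.
inner = encode_tag(1, VARINT) + encode_varint(150)

# Hypothetical outer message: Test3 { c = <inner> }, field 3, wire type LEN.
outer = encode_tag(3, LEN) + encode_varint(len(inner)) + inner

print(inner.hex())   # 089601
print(outer.hex())   # 1a03089601
```

On the wire a submessage is indistinguishable from a string or bytes field; only the schema tells the parser to decode the payload as a nested message.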