What

A Percolator worker, a Bigtable tablet server, and a GFS chunkserver.

Timestamp Oracle (TSO)

  • Strictly Monotonically Increasing

Why

requirement to run at massive scales and the lack of a requirement for extremely low latency.

How

Using 2PC

Writing

1. Pre-write Phase

  1. start_ts from TSO
  2. lock all the cells being written (lock col), designate one lock as primary, secondary lock contains the location of the primary lock.
  3. write the value to data col
  4. reads metadata to check for conflicts, rollback
    1. write after start_ts (newer version)
    2. lock detected at any time
  5. write lock and data to each cell at the start_ts

2. Commit Phase

  1. commit_ts from TSO
  2. replace the primary lock with write record (write col) with commit_ts
    • if primary lock is missing, the commit fails
    • write record would suggest that the data is visible to the reader (aka finished)
  3. remove all secondary locks

Reading

  1. get ts
  2. check if lock col contains lock with ts in [0, ts]
    • if so, it was locked by earlier txn → unsafe → backoff
  3. get the latest record row where write col, whose commit_ts is in range [0, ts]

Roll Forward

  • Primary Lock missing → finished
  • Remove relavant locks

BigTable Integration

  • c:lock
    • Shows primary / where primary is
  • c:write
    • committed entry, pointer to the data
  • c:data
    • stores data, referenced by pointer
  • c:notify
    • mark modified cell to be dirty
  • c:ack_O
    • do operations when data in observed col changed
    • by scanning notify col

In TiKV

  • Pre-condition: atomic to read and write a single user key
  • 3 Cols: CF_DEFAULTCF_LOCK and CF_WRITE, which corresponds to Percolator’s data column, lock column and write column respectively.
    • CF_DEFAULT(key, start_ts) → value
    • CF_LOCKkey → lock_info
      • No ts As only one lock can be held at a time
    • CF_WRITE(key, commit_ts) → write_info