Invariants

  • Election Safety: At most one leader can be elected in a given term. (§5.2)
  • Leader Append-Only: A leader never overwrites or deletes entries in its log; it only appends new entries. (§5.3)
  • Log Matching: If two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index. (§5.3)
  • Leader Completeness: If a log entry is committed in a given term, then that entry will be present in the logs of the leaders for all higher-numbered terms. (§5.4)
  • State Machine Safety: If a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. (§5.4.3)

Leader Election

Initiate an Election

  • When servers start up, they begin as followers.

  • A server remains in follower state as long as it receives valid RPCs from a leader or candidate.

  • Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority.

  • If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.

  • To begin an election, a follower increments its current term and transitions to candidate state.

  • It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster.

  • A candidate continues to be a candidate until one of three things happens:

    • it wins the election
      • receives votes from a majority of the servers in the full cluster for the same term.
    • another server establishes itself as leader
      • When waiting for vote, some may have already become leader (or previous leader came back to live)
      • restriction: If the leader’s term (included in its RPC) is at least as large as the candidate’s current term
    • a period of time goes by with no winner.
      • timeout and start new RPC
      • uses randomized election timeouts

Vote Process

  • Guarantees:

    • all the committed entries from previous terms are present on each new leader from the moment of its election
      • log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs.
  • A candidate must contact a majority of the cluster in order to be elected

  • Each server will vote for at most one candidate in a given term, on a first-come-first-served basis

  • up to date: compare the index and term of the last entries in the logs.

    • diff terms: log with later term is more up to date
    • same term, longer log is up-to-date

Log Replication

  • Replicate → Commit.
  • The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry.
  • the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.
  • An log entry is considered committed if it is safe for that entry to be applied to state machines.
    • Safe:leader that created the entry has replicated it on a majority of the servers
  • The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out
  • When sending an AppendEntries RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries.
    • If the follower does not find an entry in its log with the same index and term, then it refuses the new entries.
    • The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower.
      • When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log
      • After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC.

Fast Backup

Case 1 Case 2 Case 3 S1: 4 5 5 4 4 4 4 S2: 4 6 6 6 or 4 6 6 6 or 4 6 6 6 S2 is leader for term 6, S1 comes back to life, S2 sends AE for last 6 AE has prevLogTerm=6 rejection from S1 includes: XTerm: term in the conflicting entry (if any) XIndex: index of first entry with that term (if any) XLen: log length Case 1 (leader doesn’t have XTerm): nextIndex = XIndex Case 2 (leader has XTerm): nextIndex = leader’s last entry for XTerm Case 3 (follower’s log is too short): nextIndex = XLen