- Single master simplifies decisions but introduces a single point of failure.
- Goal: avoid split-brain, where partitioned halves of the system make conflicting decisions.
- Example: two clients each acquiring the same lock because a network partition leaves each side believing it holds the master.
- Older systems avoided split-brain by relying on:
- Expensive, non-failing networks.
- Human intervention.
- Use majority voting with an odd number of servers:
- Ensures only one partition can form a majority.
- Progress requires a majority agreement.
- 2F+1 servers tolerate F failures (see the quorum sketch after this list).
- "Quorum system" is another term for majority voting.
- Paxos and Viewstamped Replication (~1990) introduced majority voting.
- Raft builds on these ideas and most closely resembles Viewstamped Replication.
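A minimal sketch of the quorum arithmetic; the `quorum` helper below is illustrative, not from any Raft library:

```go
package main

import "fmt"

// quorum returns the smallest strict majority of n servers: the
// minimum number that must agree before the cluster makes progress.
func quorum(n int) int {
	return n/2 + 1
}

func main() {
	// With 2F+1 servers the cluster tolerates F failures: the
	// surviving F+1 servers still form a majority.
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("%d servers: quorum = %d, tolerates %d failures\n",
			n, quorum(n), n-quorum(n))
	}
}
```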
- Application Code: Service-specific logic.
- Application State: Replicated data managed by Raft.
- Raft Layer: Handles replication and maintains the log (a layering sketch follows below).
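A hedged sketch of this layering for a key/value service, assuming the Raft layer hands committed entries to the service over a channel; `ApplyMsg`, `applyCh`, and `Put` are illustrative names, not a fixed API:

```go
package kvservice

// ApplyMsg is what the Raft layer delivers once an entry commits.
type ApplyMsg struct {
	Index   int         // log index of the committed entry
	Command interface{} // the service-specific operation
}

// Put is an example application command for a key/value service.
type Put struct{ Key, Value string }

type KVServer struct {
	data    map[string]string // application state, kept identical on every replica
	applyCh chan ApplyMsg     // fed by the Raft layer below
}

func newKVServer(applyCh chan ApplyMsg) *KVServer {
	return &KVServer{data: make(map[string]string), applyCh: applyCh}
}

// applyLoop replays committed operations in log order; because every
// replica sees the same committed log, every replica's data converges.
func (kv *KVServer) applyLoop() {
	for msg := range kv.applyCh {
		if op, ok := msg.Command.(Put); ok {
			kv.data[op.Key] = op.Value
		}
	}
}
```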
- Clients send requests to the leader:
- Leader commits operations through majority agreement.
- Leader applies the operation and sends the response back.
- Operations are committed once they are present in a majority of logs (see the commit-rule sketch after this list).
- Replicas ensure consistent state by replaying committed logs.
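A leader-side sketch of that commit rule; `matchIndex` and `commitIndex` follow the Raft paper's terminology, while the `Leader` struct itself is illustrative:

```go
package raft

type Leader struct {
	matchIndex  []int // per-follower: highest log index known replicated there
	commitIndex int   // highest index known committed
	lastIndex   int   // index of the leader's last log entry
	peers       int   // cluster size, including the leader
}

// maybeAdvanceCommit moves commitIndex to the highest index stored on
// a majority of servers. (The full protocol additionally requires the
// entry's term to equal the leader's current term.)
func (l *Leader) maybeAdvanceCommit() {
	for idx := l.commitIndex + 1; idx <= l.lastIndex; idx++ {
		count := 1 // the leader's own copy
		for _, m := range l.matchIndex {
			if m >= idx {
				count++
			}
		}
		if 2*count <= l.peers {
			break // not on a majority yet; later entries can't be either
		}
		l.commitIndex = idx // present in a majority of logs: committed
	}
}
```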
- Order Operations: Ensures consistent application across replicas.
- Handle Uncommitted Entries: Stores tentative operations.
- Retransmit Data: Leaders resend missed operations after outages.
- Persistence: Allows recovery after crashes.
- Log Growth: Unbounded growth if leader outpaces followers.
- Solution: Implement flow control so the leader cannot run unboundedly ahead of its followers (one policy is sketched below).
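One possible flow-control policy; the window size and the `flowLeader` struct are assumptions for illustration, not from the notes:

```go
package raft

// The leader stops accepting new client operations while too many
// entries sit in its log uncommitted, so it cannot run unboundedly
// ahead of the majority.

const maxUncommitted = 1024 // assumed window size

type flowLeader struct {
	lastIndex   int // leader's last log index
	commitIndex int // highest committed index
}

func (l *flowLeader) acceptNewOps() bool {
	return l.lastIndex-l.commitIndex < maxUncommitted
}
```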
- After a reboot, servers read their persisted logs but wait for the leader to tell them the commit point.
- Leader synchronizes logs across replicas.
- Replaying the entire log from the start is costly; checkpoints (snapshots) provide an optimization (sketched below).
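A sketch of checkpointing using the Raft paper's snapshot vocabulary (`LastIncludedIndex`, `LastIncludedTerm`); the `compact` helper is illustrative:

```go
package raft

type Entry struct {
	Term    int
	Command interface{}
}

type Snapshot struct {
	LastIncludedIndex int    // last log index the snapshot covers
	LastIncludedTerm  int    // term of that entry
	State             []byte // serialized application state
}

// compact drops entries up to and including LastIncludedIndex, bounding
// log growth; a rebooting replica restores the snapshot and replays only
// the remaining suffix. firstIndex is the log index of log[0], since
// earlier compactions may already have shifted the slice.
func compact(log []Entry, snap Snapshot, firstIndex int) []Entry {
	keep := snap.LastIncludedIndex - firstIndex + 1
	if keep <= 0 {
		return log // snapshot covers nothing still in the log
	}
	if keep >= len(log) {
		return nil // snapshot covers the whole log
	}
	return append([]Entry(nil), log[keep:]...)
}
```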
- Term Numbers: Distinguish leaders and track progress.
- Only one leader per term.
- Servers start elections when no leader communication is detected.
- Candidates require a majority vote to win.
- Randomized election timeouts prevent split votes (see the election sketch after this list).
- An old leader stranded in a minority partition cannot commit client requests, since it cannot reach a majority.
- Leaders rely on heartbeat responses to confirm activity.
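A sketch of the election rule; `requestVote` stands in for the RequestVote RPC to one peer, and the `Candidate` struct is illustrative:

```go
package raft

type Candidate struct {
	currentTerm int
	peers       int // cluster size, including this server
}

// startElection reports whether this candidate won. It starts a new
// term, votes for itself, and becomes leader only with a majority of
// votes; on failure it waits out a fresh randomized timeout and retries.
func (c *Candidate) startElection(requestVote func(peer, term int) bool) bool {
	c.currentTerm++ // each term has at most one leader
	votes := 1      // a candidate always votes for itself
	for p := 1; p < c.peers; p++ {
		if requestVote(p, c.currentTerm) {
			votes++
		}
	}
	return 2*votes > c.peers // a majority is required to become leader
}
```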
- Logs may diverge due to:
- Leader crashes.
- Network disruptions.
- The leader forces followers' logs to match its own before further entries are executed.
- Uncommitted entries that never reached a majority may be discarded (see the AppendEntries sketch below).
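A sketch of the follower-side consistency check in AppendEntries; `handleAppend` and its simplified truncate-then-append behavior are illustrative, not the full RPC handler:

```go
package raft

type Entry struct {
	Term    int
	Command interface{}
}

// handleAppend returns the updated log and whether the append succeeded.
// New entries attach only if the follower's log matches the leader's at
// (prevIndex, prevTerm); any conflicting suffix is truncated and replaced,
// which is how uncommitted minority entries get discarded. prevIndex is
// -1 when the leader sends from the start of the log.
func handleAppend(log []Entry, prevIndex, prevTerm int, entries []Entry) ([]Entry, bool) {
	if prevIndex >= 0 {
		if prevIndex >= len(log) || log[prevIndex].Term != prevTerm {
			return log, false // mismatch: the leader backs up and retries
		}
	}
	// Drop everything after the match point, then adopt the leader's entries.
	return append(log[:prevIndex+1], entries...), true
}
```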
Scenario 1:
- The leader sends a command; some replicas don't receive it before the leader crashes.
- The new leader brings every replica's log back into consistency.
Scenario 2:
- The leader replicates a command to only some replicas, then crashes before committing it.
- The new leader resolves the entry's fate: it is either carried forward and committed or overwritten.
Scenario 3:
- Conflicting entries across replicas.
- Raft resolves conflicts by making followers adopt the new leader's log; the election rule guarantees the new leader already holds every committed (majority-replicated) entry.
- Minimum Timeout:
- Must exceed the heartbeat interval to avoid premature elections.
- Maximum Timeout:
- Shorter timeouts enable faster recovery but may cause unnecessary elections.
- Timer Gap:
- Must exceed a vote round-trip time so one candidate can finish winning before another times out (illustrative numbers are sketched below).
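A sketch of how these bounds interact; the concrete durations are assumptions for illustration, not values from the notes:

```go
package raft

import (
	"math/rand"
	"time"
)

// The minimum timeout spans several heartbeats so a single lost
// heartbeat does not trigger an election; the maximum bounds failover
// latency; the gap between them leaves room for a vote round trip.
const (
	heartbeatInterval  = 100 * time.Millisecond
	minElectionTimeout = 300 * time.Millisecond
	maxElectionTimeout = 600 * time.Millisecond
)

// electionTimeout picks a fresh random value in [min, max) each time
// the timer is reset, so competing candidates rarely time out together.
func electionTimeout() time.Duration {
	spread := int64(maxElectionTimeout - minElectionTimeout)
	return minElectionTimeout + time.Duration(rand.Int63n(spread))
}
```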