Paxos Algorithm Details

Basic Introduction

The Paxos algorithm is a message-passing distributed consensus algorithm proposed by computer scientist Leslie Lamport. This groundbreaking work earned him the 2013 Turing Award. The algorithm solves the problem of how to reach consensus on a value (resolution) in distributed systems, even when some nodes fail or the network is unstable, while ensuring system consistency.

Development History

The Paxos algorithm was first publicly introduced by Lamport in 1998 in the paper “The Part-Time Parliament.” The paper used a unique narrative approach, using a small island in Greece called Paxos as a metaphor, describing in detail the process of passing parliamentary resolutions in Paxos and naming the algorithm accordingly. “Although this metaphorical description is creative, it increases the difficulty of understanding.”

In 2001, Lamport realized that academic peers had difficulty understanding his humorous way of expression, so he republished a more concise and direct algorithm description version called “Paxos Made Simple.” This paper abandoned the previous metaphor and directly expounded the core principles of the algorithm, greatly improving comprehensibility.

Industry Impact and Applications

Since its introduction, Paxos has dominated the field of distributed consensus algorithms. The term “Paxos” has almost become synonymous with distributed consistency. The algorithm has been widely applied in industry:

Google System Applications:

  • Chubby: Distributed lock service
  • Megastore: High-availability storage system
  • Spanner: Globally distributed database

Open Source System Applications:

  • ZooKeeper: Distributed coordination service
  • MySQL 5.7’s Group Replication: A mechanism replacing traditional master-slave replication

Algorithm Optimization

In the content above, we explained the algorithm details of the Paxos algorithm in proposal selection from two aspects: proposal generation and approval from both Proposer and Acceptor. We also obtained a proposal selection announcement under the premise of globally unique proposal numbering. Next, we make a small optimization to this initial algorithm, trying to ignore the Prepare request as much as possible.

“If an Acceptor receives a Prepare request numbered N, and before this it has already responded to Prepare requests numbered higher than N. According to P1a, this Acceptor cannot accept proposal numbered N. Therefore, this Acceptor can ignore the Prepare request numbered N.”

Through this optimization, each Acceptor only needs to remember the maximum numbered proposal it has approved and the maximum numbered proposal it has responded to with a Prepare request. This ensures the invariance of P2c in case of failure or node restart. For the Proposer, as long as it can guarantee that it will not generate proposals with the same number, it can discard any proposal and all its runtime state information.

Algorithm Description

Synthesizing the previous explanations, we summarize the proposal selection process of the Paxos algorithm. Combining the proposal processing logic of Proposer and Acceptor, we can get an algorithm execution process similar to two-phase commit.

Phase One

  • The Proposer selects a proposal number N and sends Prepare requests numbered N to more than half of the Acceptors
  • If an Acceptor receives a Prepare request numbered N, and N is greater than all Prepare request numbers the Acceptor has already responded to, then it will send the proposal with the highest number it has accepted (if any) as a response to the Proposer. At the same time, the Acceptor promises no longer to accept any proposal with a number less than N

Phase Two

  • If the Proposer receives responses to its numbered N Prepare request from more than half of the Acceptors, then it sends an Acceptor request for the [N, V] proposal to more than half of the Acceptors. Note: V is the value of the proposal with the highest number in the received responses. If the response does not contain any proposal, then V is determined by the Proposer itself
  • If an Acceptor receives an Acceptor request for a proposal numbered N, as long as the Acceptor has not responded to a Prepare request numbered higher than N, it accepts the proposal

Of course, in actual operation, each Proposer may generate multiple proposals, but as long as each Proposer follows the algorithm operation as described above, the correctness of algorithm execution can be guaranteed.

Learn: Learning the Selected Value

Approach One

The premise for a Learner to obtain a proposal that has been selected is that the proposal has been approved by more than half of the Acceptors. Therefore, the simplest approach is to immediately send the approved proposal to all Learners once an Acceptor approves a proposal.

“This approach, although allowing Learners to quickly obtain selected proposals, requires each Acceptor to communicate individually with all Learners. The number of communications is at least the product of the two counts.”

Approach Two

Another feasible approach is to have all Acceptors uniformly send their proposal approval status to a specific Learner (called the primary Learner). Learners can perceive the selection status of proposals through message communication between each other. Based on this premise, when the primary Learner is notified that a proposal has been selected, it is responsible for notifying other Learners.

“In this approach, Acceptors first send the approved proposal to the primary Learner, and then synchronize it to other Learners. Therefore, compared to approach one, although approach two needs one more step to notify all Learners, the number of communications is greatly reduced, usually just the sum of the number of Acceptors and Learners. But at the same time, this approach introduces a new unstable factor: the primary Learner may fail at any time.”

Approach Three

When explaining approach two, we mentioned that the biggest problem with approach two is the single point problem of the primary Learner, i.e., the primary Learner may fail at any time. Therefore, we can improve approach two by expanding the scope of the primary Learner. That is, Acceptors can send approved proposals to a specific Learner set, and each Learner in this set can notify other Learners after a proposal is selected. “The more Learners in this set, the better the reliability, but the higher the network communication complexity.”

How to Guarantee Paxos Algorithm Liveness

Definition of Liveness Problem

In distributed systems, liveness refers to the property that the system will eventually reach some expected result. For the Paxos algorithm, liveness specifically manifests as: a Value will eventually be selected (accepted). This is one of the important guarantees of Paxos algorithm correctness.

Liveness Failure Scenario Analysis

Assuming there is an extreme situation, this scenario would cause the Paxos algorithm to fail to guarantee liveness:

  1. Proposal Competition Loop: There are two Proposers (P1 and P2) continuously alternating to propose a series of proposals with increasing numbers
  2. Deadlock Formation:
    • P1 proposes proposal n1, obtaining majority Acceptor promises
    • P2 proposes a higher numbered n2 (n2 > n1), causing P1’s proposal to be discarded
    • P1 then proposes a higher numbered n3 (n3 > n2), causing P2’s proposal to be discarded
    • This process repeats indefinitely, forming a “proposal number race”
  3. Specific Process Example:
    • Phase 1: P1 prepares (prepares) proposal n1, obtaining majority promises
    • Phase 2: P1 attempts to accept proposal n1, but at this point P2 has already sent a higher numbered n2’s prepare request
    • Phase 3: P2’s prepare request n2 obtains majority promises, causing n1 to be discarded
    • Phase 4: When P2 attempts to accept proposal n2, P1 sends a higher numbered n3’s prepare request
    • “This loop continues indefinitely, with no Value ultimately being selected.”

Liveness Problem Solutions

To solve this liveness problem, the Paxos algorithm proposes the following guarantee mechanisms:

  1. Elect a Leader Proposer:

    • The system designates a unique “leader” Proposer
    • Only the leader Proposer can propose proposals
    • When the leader Proposer fails, elect a new leader Proposer
    • “This can avoid competition between multiple Proposers.”
  2. Random Backoff Mechanism:

    • When a Proposer finds a proposal rejected, wait a random period before retrying
    • Wait time grows exponentially with the number of failures (similar to Ethernet’s backoff algorithm)
    • “This increases the probability of a single Proposer successfully completing the proposal.”
  3. Proposal Number Limits:

    • Set an upper limit on proposal numbers
    • When the number reaches the limit, forcibly select the current highest numbered proposal
    • “Avoid infinitely increasing number races.”
  4. Timeout Mechanism:

    • Set reasonable timeout times for each proposal phase
    • Automatically proceed to the next round of proposals after timeout
    • “Prevent a single proposal from indefinitely blocking the system.”

In actual engineering implementations, electing a leader Proposer is usually adopted to solve the liveness problem because this method is the most reliable and easy to implement. For example, this approach is used in Google’s Chubby lock service and many distributed storage systems.