iTranslated by AI

The content below is an AI-generated translation. This is an experimental feature, and may contain errors. View original article
🛋️

Paxos Made Step-by-Step

に公開

Paxos Made Stepwise

1. Introduction

I recently read Patterns of Distributed Systems[1] (2023) from the Martin Fowler series. It does a great job of organizing the core building blocks and fundamental technologies of distributed systems by purpose and motivation, making it an excellent book for grasping a broad overview (such as terms, their roles, and mechanisms). If you want to learn techniques used in distributed systems efficiently, I would recommend skimming through this book once and then diving deeper into the topics that interest you.

Specifically, the Paxos section in Chapter 11 provides sequence diagrams and concrete step-by-step procedures, making it seemingly easier to understand than many other articles explaining Paxos. This article aims to trace that explanation from Chapter 11 and solidify my own understanding through the process of re-explaining it. Although this is only a part of how Paxos operates, I believe that grasping these representative behaviors beforehand will be a significant help for those who deep-dive into Paxos later.

2. Step-by-Step Paxos

While many extended versions of Paxos exist, such as Multi-Paxos and Fast Paxos, what we will cover here is the basic algorithm known as Single-Decree Paxos.

Consider a Paxos consensus cluster with five nodes: A, B, C, D, and E. All nodes initially start in the Acceptor state.

2.1 Prepare Phase

Sequence Chart 1

Starting with the left side of the sequence diagram: Node A receives a request from a client to set name = alice. This causes A to become a Proposer, and the Prepare phase begins.

A obtains a generation number from its own generation clock (in this case, 1, as it is the initial state) and sends a Prepare request to all Acceptors, including itself.

In the sequence diagram above, since no node has made a promise yet, A, B, and C respond to A's Prepare request by updating their Promised Generation to [1, A] and returning a Promise. Now that A has secured a majority of Promises, it moves to the Accept phase.

Concurrently, another client sends a request to node E to set name = elanor. This makes node E a Proposer, which obtains generation number 1 from its generation clock and sends Prepare requests to E and D, updating their respective Promised Generations to [1, E].

2.2 Accept Phase Failure

Sequence Chart 2

Now, the Accept phase by Node A and the Prepare phase by node E proceed in parallel.

Node C receives a Prepare request from node E. Here, since the generation number [1, E] of the Prepare request is larger than [1, A] which C has already promised, C updates its Promised Generation to [1, E] and responds with a Promise (assuming that if generation numbers are equal, they are compared by node ID). Now that E has secured a majority of Promises, it transitions to the Accept phase. Unfortunately, however, node E failed during the Accept phase.

Meanwhile, Node A, which is proceeding with the Accept phase, sends an Accept request to node C a bit later. However, node C responds with a REJECT because the generation number [1, A] of the Accept request is smaller than the [1, E] it is currently promising. Node A's Accept phase fails because it lost the majority of Promises secured in the Prepare phase due to this defection by node C.

Normally, it is expected that the proposal will reach consensus as E continues its processing. However, E has stopped while holding a majority of Promises. The protocol must be designed so that A's re-proposal can still reach consensus even in such a situation (otherwise, a deadlock will occur).

2.3 Prepare Phase Re-execution

Sequence Chart 3

Because the proposal was REJECTed, node A restarts the Prepare phase with [2, A], which is one increment higher than the generation number it currently recognizes. A sends Prepare requests to A, C, and D, securing a majority of Promises.

However, a significant difference from the previous Prepare phase is that A has accepted alice with generation number [1, A], and D has accepted elanor with [1, E]. The accepted values are transmitted to A along with the Promise responses.

Having secured a majority of Promises, A transitions to the Accept phase. At this point, if accepted values are included in the Promises, the value with the highest generation number among them must be proposed. In other words, here A must propose the value elanor, which was accepted with the highest generation number [1, E].

Accept Phase Re-execution

Sequence Chart 4

A starts the Accept phase. First, it sends an Accept request to itself, and since the generation number is valid, A's Accepted Value is updated to elanor.

Unfortunately, however, node A failed shortly after its Prepare request was completed (in this example, the consensus cluster stops progressing because the Proposer is gone, but in a realistic implementation, another node would become a new Proposer and proceed with the consensus by attempting with a larger generation number).

Recovery by Another Request

Sequence Chart 5

Now, a client sends a request to node C to set name = carol. C becomes the Proposer, starts the Prepare phase, and sends Prepare requests to B, C, and D with 3 as the generation number, which is larger than the Promised Generation it currently recognizes.

Sequence Chart 6

C secured a majority of Promises and transitions to the Accept phase. Looking at the accepted values from the collected Promises, the value with the largest generation number [2, A] was elanor. Therefore, C makes an Accept request with the value elanor.

This Accept phase is successful, and elanor is accepted by a majority of the nodes in the cluster. As a result, even if A or E recover and start a Prepare phase, once they gather Promises from a majority of the cluster, they will observe elanor with its high generation number in at least one of those Promises, ensuring that the value elanor is ultimately adopted.

Commit Phase

Sequence Chart 7

Finally, node C sends a Commit message to finish the consensus. Even if nodes A or E recover in the future, they will be able to learn that elanor has been committed.

Operations by Role

The operations of Proposers and Acceptors in each phase are summarized as follows:

  • Proposer
    • Prepare Phase: Increments its own generation clock to assign a generation number and sends a Prepare request with that generation number to all Acceptors.
      • Success: If it collects Promises from a majority of Acceptors, it transitions to the Accept phase.
        • If the collected Promises contain values already accepted by other nodes, it discards the value it intended to propose and uses the value with the highest generation number among them as its own proposal.
      • Failure: If it fails to collect Promises from a majority of Acceptors, the Prepare phase fails. It retries the Prepare phase with a generation number one increment higher than the one it recognizes.
    • Accept Phase: Sends an Accept request with the generation number and the value to the Acceptors from whom it collected Promises.
      • Success: Once it receives OK responses from a majority of Acceptors, the proposed value is considered finalized. It proceeds to the Commit phase.
      • Failure: If a REJECT response is received and the previously collected Promises no longer constitute a majority, the Accept phase fails. It retries the Prepare phase with a generation number one increment higher than the one it recognizes.
    • Commit Phase: Sends a Commit message containing the finalized value to all Acceptors.
  • Acceptor
    • Prepare Phase: Upon receiving a Prepare request, it compares its own promised generation number with the generation number in the request.
      • Success: If the generation number in the Prepare request is larger than the promised generation number, it updates its promised generation number with that value and returns a Promise.
        • If it already holds an accepted value, it responds by adding the previous generation number and the accepted value to this Promise.
      • Failure: If it has already promised to a larger generation number, it responds with an NG (or does not respond).
    • Accept Phase: Upon receiving an Accept request, it compares its own promised generation number with the generation number in the request.
      • Success: If the promised generation number and the generation number in the Accept request are equal, it updates its accepted value and returns OK.
      • Failure: If it has already returned a Promise for a larger generation number, it responds with a REJECT (or does not respond).
    • Commit Phase: Recognizes that a value has been finalized in the consensus cluster.

3. Conclusion

Through this sequence of events, three values—alice, carol, and elanor—were proposed, but this Paxos consensus cluster eventually reached agreement on adopting elanor. Single-Decree Paxos is limited to guaranteeing a one-time consensus, so as a mechanism, this is where it ends.

However, it seems possible to determine the execution order of commands among multiple nodes by repeating this sequence of operations for each round. In variants such as Multi-Paxos, which extends Single-Decree Paxos to efficiently achieve multiple agreements, this mechanism is advanced to achieve serializability and build a Replicated State Machine.

Note that due to the FLP Impossibility, even with Paxos, there are cases where consensus cannot be reached just by allowing (assuming) the failure of a single node. A specific case for this example is described in the book[1:1].

脚注
  1. Unmesh Joshi. Patterns of Distributed Systems. Addison-Wesley Professional (2023); It can also be read on O'REILLY Safari Online. ↩︎ ↩︎

GitHubで編集を提案

Discussion