diff --git a/Lecture 15.md b/Lecture 15.md new file mode 100644 index 0000000..fdb04cf --- /dev/null +++ b/Lecture 15.md @@ -0,0 +1,211 @@ +# Distributed Systems Lecture 15 + +## Lecture Given by [Lindsey Kuper](https://users.soe.ucsc.edu/~lkuper/) on May 4th, 2020 via [YouTube](https://www.youtube.com/watch?v=UCmAzWvrFmo) + + +## Course Admin... + +...snip... + +Read Amazon's [Dynamo](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) paper and Google's [MapReduce](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf) paper. + +...snip... + +### Exam Question: Chandy-Lamport Snapshot Bug + +The following diagram shows a buggy implementation of the Chandy-Lamport snapshot algorithm. + +Process `P2` initiates the snapshot, but then something goes wrong. Where's the bug? + +![Chandy-Lamport Snapshot Bug](./img/L15%20Chandy-Lamport%20Snapshot%20Bug.png) + +The Chandy-Lamport algorithm assumes FIFO delivery of all messages — irrespective of whether they are application or marker messages; so, if we trace through the steps shown in the diagram, we can discover the bug: + +* `P2` initiates the snapshot so it records its own state (the green ellipse around event `E`), then immediately sends out a marker message to `P1` +* `P1` receives the marker message and immediately records its own state (the green ellipse around events `A`, `B`, `C`, and `D`) and then sends out its marker message +* After `P2` sends out its marker message, its snapshot is complete, and it continues processing events in the normal way — resulting in event `F` sending out an application message to `P1`. + +The bug is created by the fact that this diagram shows a FIFO anomaly created when the application message from `P2` event `F` ***overtakes*** the snapshot marker message. + +As a result, `P1` event `D` is recorded in `P1`'s snapshot, but the event that caused it (`P2` event `F`) is missing from `P2`'s snapshot. Thus, our snapshot is not a ***consistent cut***. + +Remember that for a cut to be consistent, it must contain ***all*** events that led up to a certain point in time. So, the inclusion of event `D` in `P1`'s snapshot is the problem because this is effectively a ***message from the future***. + +Here's a situation in which a FIFO anomaly (out of order message delivery) leads to a causal anomaly (an inconsistent cut). + +## Paxos: The Easy Parts + +At the end of the last lecture, our discussion of the Paxos Algorithm got us up to here: + +![Paxos Consensus Reached](./img/L14%20Paxos%206.png) + +This was a very simple run of Paxos involving: + +* One proposer, +* Three acceptors, and +* Two learners + +In this example, the proposer `P` sent out `prepare` messages to a majority of the acceptors, which in this case, was two out of three; however, it would be been equally valid for `P` to have sent `prepare` messages to all the acceptors. In fact, doing so would be quite smart because it mitigates against message loss, because on balance, even if one message is lost, you have still communicated with the majority of acceptors. + +The same idea applies when the proposer listens for `promise` messages coming back from the acceptors. It only needs to hear from a majority of the acceptors before it can be happy. Exactly who those acceptors are is not important, and if it does hear back from all the acceptors then that's great, but it’s not a requirement. It just needs to hear from a majority. + +So, when we speak of a ***majority***, we are speaking of at least the ***minimum*** majority. For instance, if there are five acceptors, then the minimum majority is three: but if we hear back from four or all five, then this is not a problem. The issue is that we must hear back from at least the minimum number of acceptors required to form a majority. + +There are other subtleties involved in this algorithm that we will now go through, including what happens when there is more than one proposer. This in turn, will lead us to an understanding of why the Paxos algorithm is not guaranteed to terminate. + +### Milestones in the Paxos Algorithm + +One thing that was mentioned in the previous lecture was that three specific milestones are reached during a run of the Paxos algorithm. These are: + +1. When the proposer receives `promise(n)` messages from a majority of acceptors. + + ![Paxos Milestone 1](./img/L15%20Paxos%20Milestone%201.png) + + The acceptors have now committed to responding to the agreed proposal number `n`; and by implication, they have also committed to ignoring any request with a proposal number lower than `n`. +1. When a majority of acceptors all issue `accepted(n,val)` messages for proposal number `n` and some value `val`. + + ![Paxos Milestone 2](./img/L15%20Paxos%20Milestone%202.png) + + Now, even though the other processes participating in the Paxos algorithm do not yet realise it, consensus has in fact been reached. +1. When the proposer(s) and learners receive `accepted(n,val)` messages from a majority of the acceptors. + + ![Paxos Milestone 3](./img/L15%20Paxos%20Milestone%203.png) + + It is only now that the proposer(s) and the learners ***realise*** that consensus has already been reached + + +## Paxos: The Full Algorithm (Mostly) + +A run of the Paxos algorithm involves the following steps. The algorithm follows this sequence of messages exchange primarily between the proposer and acceptors: + +1. ***The Proposer*** + Sends out `propose(n)` messages to at least the minimum number of acceptors needed to form a majority. The proposal number `n` must be: + * Unique + * Higher than any previous proposal number used by ***this*** proposer + + It’s important to understand that the proposal number rules are applied to proposers ***individually***. Consequently, if there are multiple proposers in the system, there does not need to be any agreement between proposers about what the next proposal number should be. + +1. ***The Acceptor*** + When the acceptor receives a `prepare(n)` message, it asks itself *"Have I already agreed to ignore proposals with this proposal number?"*. If the answer is yes, then the message is simply ignored; but if not, it replies to the proposer with a `promise(n)` message. This message means that the acceptor is now promising to ignore all messages with a proposal number smaller than `n`. + +1. ***The Proposer*** + When the proposer has received `promise` messages from a majority of messages for a particular proposal number `n`, it sends an `accept(n,val)` message to a majority of acceptors containing both the agreed proposal number `n`, and the value `val` that it wishes to propose. + + Up till now, we have assumed that there is only one proposer — but next, we must examine what happens if there are multiple proposers. + +1. ***The Acceptor*** + When an acceptor receives an `accept(n,val)` message, it asks the same question as before: *"Have I already agreed to ignore messages with this proposal number?"*. If yes, it ignores the message; but if no, it replies with an `accepted(n,val)` both back to the proposer ***and*** broadcasts this acceptance to all the learners. + +### What Happens If There Is More Than One Proposer? + +In this scenario, we will run the Paxos algorithm with two proposers and since learners do not actually take part in the steps needed to reach consensus, for visual clarity, we will omit them from the from diagram. + +Let's say we have ***two*** proposers P1 and P2 and as before, three acceptors (and we're pretending that there are also two learners) + +Remember we previously stated that in situations where there are multiple proposers, these proposers must agree as to how they will ensure the uniqueness of their own proposal numbers. So, in this case, we will as that: + +* Proposer P1 will use odd proposal numbers, and +* Proposer P2 will use even proposal numbers + +So, proposer P1 sends out a `prepare(5)` message to a majority of the acceptors. This is the first proposal number they have seen, so they are all happy to accept it and respond with `promise(5)` messages. + +Proposer P1 is seeking consensus for value `1`, so it now sends out `accept(5,1)` messages and the majority of acceptors respond with `accepted(5,1)` + +![Multiple Proposers 1](./img/L15%20Multiple%20Proposers%201.png) + +Ok, that's fine; we seem to have agreed on value `1`. + +But now, not knowing any of this has happened, proposer P2 decides to send out a `prepare(4)` message to all the acceptors? + +![Multiple Proposers 2](./img/L15%20Multiple%20Proposers%202.png) + +The `prepare(4)` message arrives at acceptors A1 and A2 ***after*** they have already agreed on proposal number `5`. Consequently, they simply ignore this message. + +Acceptor A3 however has not seen proposal number `4` before, so it happily agrees to it and sends back a `promise(4)` message to proposer P2. + +Proposer P2 is now left hanging. + +It sent out `prepare` messages to all the acceptors but has only heard back from a minority of them. The rest have simply not answered, and given the way asynchronous communication works, P2 has no idea ***why*** it has not heard back from the other acceptors. They could have crashed, or they might be running slowly, or, as it turns out, the other acceptors have already agreed to have P1's babies... + +So, all P2 can do is wait for its timeout period, and if it doesn't hear back within that time, it concludes that proposal number `4` was a bad idea and tries again. This time, P2 shows up in a faster car (proposal number `6`) + +![Multiple Proposers 3](./img/L15%20Multiple%20Proposers%203.png) + +But wait a minute, consensus (milestone 2) has ***already*** been reached, so the acceptors now have a problem because: + +* Acceptors cannot go back on their majority decision +* Acceptors can only ignore `prepare` messages with a ***lower*** proposal number + +So, here's where we must address one of the subtleties that we previously glossed. + +Previously, we stated only that if an acceptor receives a `prepare` message with a ***lower*** proposal number, it should simply ignore it. Well, OK, that's fine. + +But what about the case where we receive a proposal number that is ***higher*** than the last one? Here is where we need to further qualify ***how*** that `prepare` message should be handled. In this case, each acceptor must ponder the following situation: + +*"I've already promised to respond to proposal number `n`, but now I'm being asked to promise to respond to proposal number `n+1`"* + +Should I accept the higher proposal number or not? The answer here depends on what has happened to the acceptor in between receiving the `prepare(n)` message and the `prepare(n+1)` message. + +Either way, the acceptor is going to send out a `promise` message containing the new proposal number; but in addition, the acceptor must consider whether it has already agreed to accept a value based on some earlier, lower proposal number. + +* If no, then we accept with a `promise` message as normal +* If yes, then we accept with a `promise` message, but in addition, we are obligated to tell the new proposer that we've already agreed to a value using an older proposal number. + +In the latter case, the `promise` message must carry some extra information. + +In the above example, acceptor A1 has already agreed with proposer P1 that, using proposal number `5`, the value should be `1`; but now proposer P2 comes along with proposal number `6`. + +![Multiple Proposers 4](./img/L15%20Multiple%20Proposers%204.png) + +So in this specific situation, acceptor A3 responds simply with `promise(6)` because it has not previously agreed on any previous value, but acceptors A1 and A2 respond with the message `promise(6,(5,1))`. + +This extra information in the `promise` message effectively means: *"Ok, I'll move with you to proposal number `6` but understand this: using proposal number `5`, I already agreed to value `1`"*. + +### So, What Should A Proposer Do with Such a Message? + +Previously, we said that when a proposer receives `promise(n)` message from a majority of acceptor, it will then send out `accept(n,val)` messages. But here's where our description of the protocol needs to be refined. What should the proposer do if it receives not a `promise(n)` message, but a promise(n,(nold,valold)) message? + +In our example, proposer P2 has received three messages: + +* A straight-forward a `promise(6)` from A3, and +* Two `promise(6,(5,1))` messages from A1 and A2 + +Proposer P2 must now take into account that using proposal number `5`, the value `1` has already been agreed upon. + +In this case, both `promise` messages contain the value `1` that was agreed upon using proposal number `5`; however, it is perfectly possible that P2 could receive multiple `promise` messages containing values agreed on by proposal numbers older than `5`. + +So, the rule is this: proposer P2 must look at all the older, already agreed upon values, and chose the value corresponding to the most recent, old proposal number. + +This is pretty ironic (and amusing) really because proposer P2 now has no choice over what value to propose. It is constrained to propose the one value upon which consensus was most recently reached! So, the fact that it wants to send out its own proposal is somewhat redundant, because the only value it can propose is one upon which consensus has already been reached... + +So, now we must revise rule 3 given above. Previously we stated + +> When the proposer has received `promise` messages from a majority of messages for a particular proposal number `n`, it sends an `accept(n,val)` message to a majority of acceptors containing both the agreed proposal number `n`, and the value `val` that it wishes to propose. + +But now we understand that the proposer does not have complete liberty to send out the value ***it wishes*** to propose; instead, it must first consider: + +* If I have received any `promise` messages containing old agreed values, then I am obligated to propose the value belonging to the highest, old proposal number +* If I have received only simple `promise(n)` messages, then I am free to propose any value I like + +So now, P2 can only send out the message `accept(6,1)`. + +![Multiple Proposers 5](./img/L15%20Multiple%20Proposers%205.png) + +Notice that P2 has not had to use the earlier proposal number `5`, but it was constrained to propose the value `1`, because this value has already been agreed upon. + +So, what do the acceptors do now? They simply invoke rule 4 above and respond with `accepted(6,1)`. + +![Multiple Proposers 6](./img/L15%20Multiple%20Proposers%206.png) + +Let's isolate the messages that were exchanged between proposer P2 and acceptor A3. + +![Multiple Proposers 7](./img/L15%20Multiple%20Proposers%207.png) + +A3 only sees the following exchange of messages. + +* P2 first tried proposal number `4`, but nothing came of that +* P2 tried again with proposal number `6` +* A3 went with the highest proposal number (`6`) and subsequently agreed to accept value `1` + +As far as A3 is concerned, it thinks that value `1` was P2's idea. It has no clue that P2 was proposing a value agreed upon by others. + diff --git a/README.md b/README.md index a65c2da..6547a56 100644 --- a/README.md +++ b/README.md @@ -21,3 +21,4 @@ There are no notes for lecture 1 as this was concerned with course administratio | [Lecture 12](./Lecture%2012.md)
April 24th, 2020 | Replication
Total order vs. determinism
Consistency models: FIFO, causal and strong
Primary-backup replication
Chain replication
Latency and throughput | [Lecture 13](./Lecture%2013.md)
April 27th, 2020 | **Pause for breath!**
Wrapping up replication techniques | [Lecture 14](./Lecture%2014.md)
May 1st, 2020 | Handling node failure in replication protocols
Introduction to consensus
Problems equivalent to consensus
The FLP result
Introduction to Paxos | Strongly consistent replication protocols +| [Lecture 15](./Lecture%2015.md)
May 4th, 2020 | Paxos: the interesting parts diff --git a/img/Diagrams.pptx b/img/Diagrams.pptx index 1d05ad1..2635f19 100644 Binary files a/img/Diagrams.pptx and b/img/Diagrams.pptx differ diff --git a/img/L15 Chandy-Lamport Snapshot Bug.png b/img/L15 Chandy-Lamport Snapshot Bug.png new file mode 100644 index 0000000..464a6ac Binary files /dev/null and b/img/L15 Chandy-Lamport Snapshot Bug.png differ diff --git a/img/L15 Multiple Proposers 1.png b/img/L15 Multiple Proposers 1.png new file mode 100644 index 0000000..592c3ac Binary files /dev/null and b/img/L15 Multiple Proposers 1.png differ diff --git a/img/L15 Multiple Proposers 2.png b/img/L15 Multiple Proposers 2.png new file mode 100644 index 0000000..158ebad Binary files /dev/null and b/img/L15 Multiple Proposers 2.png differ diff --git a/img/L15 Multiple Proposers 3.png b/img/L15 Multiple Proposers 3.png new file mode 100644 index 0000000..4bac280 Binary files /dev/null and b/img/L15 Multiple Proposers 3.png differ diff --git a/img/L15 Multiple Proposers 4.png b/img/L15 Multiple Proposers 4.png new file mode 100644 index 0000000..eeba14e Binary files /dev/null and b/img/L15 Multiple Proposers 4.png differ diff --git a/img/L15 Multiple Proposers 5.png b/img/L15 Multiple Proposers 5.png new file mode 100644 index 0000000..bb61bca Binary files /dev/null and b/img/L15 Multiple Proposers 5.png differ diff --git a/img/L15 Multiple Proposers 6.png b/img/L15 Multiple Proposers 6.png new file mode 100644 index 0000000..a5a022f Binary files /dev/null and b/img/L15 Multiple Proposers 6.png differ diff --git a/img/L15 Multiple Proposers 7.png b/img/L15 Multiple Proposers 7.png new file mode 100644 index 0000000..875a164 Binary files /dev/null and b/img/L15 Multiple Proposers 7.png differ diff --git a/img/L15 Paxos Milestone 1.png b/img/L15 Paxos Milestone 1.png new file mode 100644 index 0000000..17e84db Binary files /dev/null and b/img/L15 Paxos Milestone 1.png differ diff --git a/img/L15 Paxos Milestone 2.png b/img/L15 Paxos Milestone 2.png new file mode 100644 index 0000000..8061ee9 Binary files /dev/null and b/img/L15 Paxos Milestone 2.png differ diff --git a/img/L15 Paxos Milestone 3.png b/img/L15 Paxos Milestone 3.png new file mode 100644 index 0000000..255fad9 Binary files /dev/null and b/img/L15 Paxos Milestone 3.png differ