Lecture Given by Lindsey Kuper on May 29th, 2020 via YouTube
Previous | Next |
---|---|
Lecture 21 | Lecture 23 |
Consider the following scenario in which we need strong consistency, but we have:
- A datastore held across multiple replicas
- These replicas have no leader nor coordinating process (I.E. We're not using Primary Backup or Chain Replication)
- Any replica can receive an update
The challenge then concerns how to keep these replicas consistent with each other.
What approach should we adopt?
Well, we could use a consensus protocol to decide which updates should processed in which order.
If replica R1
receives update A
and replica R2
receives update B
, then an agreement must be reached concerning the order in which these updates should be delivered.
So, even though the messages will arrive in some unpredictable order, a consensus protocol will be used to decide upon the delivery order.
In this case, both replicas operate a system of delivery slots and the consensus protocol determines that event B
should occupy delivery slot 1
and event A
should occupy delivery slot 2
.
Q: But how many messages need to be sent in order to arrive at this agreement?
A: Lots!
Here's an example showing the messages that need to be exchanged for replicas R1
and R2
simply to agree on a total order for delivering events A
and B
shown above.
- Replica
R2
sends out aprepare(6)
message to a majority of acceptors, who each respond with the correspondingpromise
messages.
4 messages - Just a little time after
R2
's message exchange has taken place, replicaR1
sends out itsprepare(5)
messages to a majority of acceptors. AcceptorA1
happily accepts this proposal number, but acceptorA2
has already promised to ignore messages with a proposal numbers less than6
, so thisprepare
message is ignored and replicaR1
left hanging.
3 messages (all of which turn out to be redundant) - Replica
R2
sends out itsaccept(6,(slot_1,B))
messages to the acceptors who each respond withaccepted(6,(slot_1,B))
. So, eventB
now occupies delivery slot 1.
4 messages - Replica
R1
still needs to get agreement on a total order for eventA
, so it tries the prepare/promise phase again, but now with proposal number7
. This time, the proposal number is accepted.
4 messages - Replica
R1
then enters the accept/accepted phase and achieves consensus on eventA
occupying delivery slot 2.
4 messages
So, even in this reasonably happy example where we are not sending messages to all the acceptors (only a majority), we've had to send 19 messages, 3 of which turned out to be redundant — and we haven't even accounted for sending messages out to the learners!
The point here is that consensus algorithms are really expensive; therefore, we should only implement them in situations where it's extremely important that everybody agrees on a total order.
So far however, we have not even discussed what events A
and B
represent.
Consensus algorithms do not care about a message's payload; they simply see an opaque (I.E. meaningless) block of data to which some metadata has been attached.
Causal Broadcast for instance, looks simply at the message's recipient and the vector clock value, and from these values, determines the delivery order.
My Aside
A useful analogy here is to think of the people working in a mail sorting room. These people are concerned with the fact that all the letters and packages have been addressed correctly, and that the correct postage has been paid for a letter or package of that weight and dimensions.
It is quite irrelevant for these people to concern themselves with the contents of the letters and packages.
However, from the perspective of the human developer, we can see firstly that we're having to go to a lot of trouble simply to agree on the order in which a set of events are delivered, and secondly, the algorithm that determines the delivery order is completely agnostic to the real-life functionality that needs to be performed as a result of processing those events.
Rather than having a message-agnostic consensus algorithm then, wouldn't it be smarter to make intelligent decisions about delivery order based on our knowledge of the functionality being implemented?
Let's go back to the shopping cart scenario described in lecture 17.
If replicas R1
and R2
simply represent shopping carts, then we certainly don't want to go to all the trouble of running a consensus algorithm.
But we have arrived at this conclusion based on our knowledge of the business process we are implementing.
In this case, we really don't care whether a book is added to the shopping cart before or after the pair of jeans.
From the perspective of the business logic, the order in which items are added is simply not important.
(More technically, the order in which items are added to a shopping cart is commutative, not associative).
Of course, there will be some situations in which message delivery order is critical to the logic of your business process, but what we propose here is that strong consistency is needed only in a minority of cases; the greater majority of business scenarios will function quite happily with strong convergence.
Just as a reminder:
Strong Consistency
If replica R1
delivers messages in the order M1
, M2
and M3
, then all replicas receiving the same set of messages must deliver them in the same order.
Only then can it be known that the replicas have equivalent state.
Strong Convergence
All replicas delivering the same set of messages eventually have the equivalent state.
Strong convergence might still be tricky to implement, but it will be easier than strong consistency, because with strong convergence, we know that state equivalence can be achieved simply by delivering the same set of updates. Strong consistency however requires us to deliver the same set of updates in precisely the same order.
The bottom line here is that you should only implement strong consistency when you have no other choice.
To do this, we will need to look back at the definition of a partially ordered set that we covered in lectures 3 and 4.
As a reminder, a partial order allows you to compare the members of set S
using a binary relation such as ≤
(less than or equals).
However, the word "partial" in the name tells us that not every pair of elements in the set is comparable.
This relation is governed by three axioms:
Property | English Description | Mathematical Description |
---|---|---|
Reflexivity | For all a in S ,a is always ≤ to itself |
∀ a ∈ S: a ≤ a |
Anti-symmetry | For all a and b in S ,if a ≤ b and b ≤ a , then a = b |
∀ a, b ∈ S: a ≤ b, b ≤ a => a = b |
Transitivity | For all a , b and c in S ,if a ≤ b and b ≤ c , then a ≤ c |
∀ a, b, c ∈ S: a ≤ b, b ≤ c => a ≤ c |
The standard example of a partially ordered set is set inclusion — that is the set of all possible subsets of a given set.
For example, if we have a set containing:
A book | |
A pair of jeans | |
A torch |
Then the inclusion set (the set of subsets) will contain the following eight members:
Our relation here is the "less than or equals" operator ≤
.
Using this operator, certain members of our set of subsets can be compared as follows:
However, other members are not comparable; for instance:
If we select some elements from our set of subsets, say the singleton set containing the jeans {👖}
and singleton set containing the torch {🔦}
, we could ask the following question:
Which elements of
S
are at least as big as{👖}
and{🔦}
?
The answer here is:
Any set that contains at least the union of
{👖}
and{🔦}
.
So, this would be the sets {👖,🔦}
and {📓,👖,🔦}
.
These two sets are known as the upper bounds of {👖}
and {🔦}
.
In more formal language, the upper bound is:
Given a partially ordered set1 (
S
,≤
) an upper bound ofa,b ∈ S
is an elementu ∈ S
such thata ≤ u
andb ≤ u
.
Notice that we talk of an upper bound.
This means it is possible that for the members a
and b
there could well be multiple examples of some set u
that all satisfy the upper bound requirements.
The upper bound set that contains all the members of the original set is not very interesting because this will always be a common upper bound for all its subsets, so we can ignore this one.
Generally speaking, the upper bounds that are the most interesting are the smallest ones. But how do we define the smallest upper bound? The formal definition is:
If
a
,b
,u
andv
are all members of the inclusion setS
, thenu
is the least upper bound2 ofa,b ∈ S
ifu ≤ v
for eachv
Q: If S
is the eight-member inclusion set of {📓,👖,🔦}
, then is it true that every 2 elements of S
will have a least upper bound (lub)?
A: Yes!
Any set for which this property is true is given the fancy name of a Join-semilattice.
A partially ordered set (poset) in which every 2 elements have a least upper bound (lub) is called a join-semilattice.
So, it follows therefore that if some partially ordered sets are join-semilattices, then there should also be some partially ordered sets that are not.
It's quite hard to think of a set within which every 2 elements do not have a least upper bound (I.E. think of a set that is not a join-semilattice), but a good example is a Boolean register.
This is a tri-state variable that can be either empty
, true
or false
So, the poset is simply {empty, true, false}
.
This is a very simple set containing only three members that can be arranged as follows:
These values can also be ordered using the ≤
operator:
empty ≤ empty
true ≤ true
false ≤ false
empty ≤ true
empty ≤ false
So, is this a true partially-ordered set? To answer this question, we must check that all the axioms are satisfied.
Reflexivity
Since empty ≤ empty
, true ≤ true
and false ≤ false
, then this set satisfies the requirements of reflexivity.
Anti-symmetry
Anti-symmetry requires that if a ≤ b
and b ≤ a
, then a = b
.
However, since this set contains only three members arranged in a two-layer lattice, we have already implicitly satisfied anti-symmetry by satisfying the requirements of reflexivity.
Transitivity
Transitivity requires that if a ≤ b
and b ≤ c
then a ≤ c
.
However, no members of the set can be compared this way, so this set obeys transitivity only in a vacuous sense.
So, this Boolean Register qualifies as a true partially-ordered set; however, we can see that if we picked the elements true
and false
and asked "What is their least upper bound?", then we can see that there isn't one.
Therefore, this set is not a join-semilattice.
Let's say we have a system with two replicas that hold a type of information that is very different to the shopping cart example. Here, these replicas hold the value of our Boolean register and they then receive conflicting updates:
Q: Why do these updates create a conflict?
A: Because we have no way to combine the values true
and false
Q: Why can we not combine these values?
A: Because, as we can see from the lattice diagram above, true
and false
have no upper bound, let alone a least upper bound.
In this case, the inclusion set formed from the members {empty, true, false}
contains only the members:
S = { {}
, {empty}
, {true}
, {false}
, {empty, true}
, {empty, false}
}
The inclusion set does not contain the least upper bound member {true, false}
neither does it contain the upper bound {empty, true, false}
.
In order to resolve such a conflict, we would need to implement some sort of consensus algorithm. However, because consensus algorithms are expensive, we really don't want to implement one unless we need to.
So generally, if the updates your replicas are receiving can be thought of as the members of a set that is a join-semilattice, then we can resolve the requirements of strong convergence by taking the least upper bound. This also means we do not need to implement a consensus algorithm.
So, here's an informal claim:
If the states that replicas can take on can be thought of as elements of a join-semilattice, then there is a natural way of resolving conflicts between replicas without needing a consensus algorithm.
This claim is described as "informal" because it uses unqualified words such as "natural". Nonetheless, there is a lot of interesting work being done on this type of conflict resolution. If you're interested in this type of work, take a look at a topic called "Conflict-Free Replicated Datatypes" or (CRDTs). An example implementation by Martin Kleppmann and Alastair Beresford has been described in this paper for JSON datatypes.
So far, we've been thinking about conflicts that can arise when different clients add members to a set. In the case of a shopping cart, the order in which items are added does not matter because even if the retailer offers schemes such as "Buy one, get one free", or "Buy these two together and get a 20% discount", these promotions are only applied at the time the user goes to the checkout (I.E. set membership is now fixed), not at the time the items are added to the cart. Based on our knowledge of the purchasing process, we can see that the order in which items are added to a shopping cart is immaterial. Therefore, this situation is not one in which consensus is required.
This is a particularly useful property in the event of a network partition. If communication is lost for some period of time between replicas, then the fact that the states of the shopping carts might diverge whilst the network partition exists does not create a problem.
As soon as the partition heals, the replicas can communicate with each other again, and their states will converge.
But what happens if we are allowed to remove items from the shopping cart?
Now things get more complicated.
Let's say that from your laptop, you add a book to your shopping cart, and from your phone, you add a pair of jeans.
Both replicas synchronise, so when looking at your shopping cart from either your laptop or your phone, you see the same items. Everything is fine...
From your laptop however, you've read some reviews of the book and decide that it doesn't look so interesting, so you remove it from your shopping cart — right at the very moment a network partition appears between the replicas.
The remove message gets through to replica 1, but not replica 2. So for the duration of the network partition, your shopping carts replicas will be out of sync with each other.
So after a while, the network partition heals and your shopping cart replicas synchronise with each other.
Now from your laptop, you look at the contents of your shopping cart and... huh!? That book has popped up again!
Maybe it's a sign that you really should read that book... or maybe it's the situation described in the Dynamo paper where, under certain circumstances, deleted items can reappear in shopping carts.
Why did this happen?
Because the contents of the shopping carts are treated as sets, and when a conflict occurs, the solution is to take the least upper bound. With Dynamo, this resolution happens on the client, but with CRDTs, it happens in the replica. Either way though, this approach takes the union of the sets and this sometimes causes a deleted item to pop up again.
So how do we avoid this problem?
We could go to all trouble of ensuring that the members of your set (I.E. the contents of your shopping cart) are always the members of a join-semilattice; however, this means that you have to throw the whole cart away as soon as any item is deleted.
But wait! There's a trick that allows us to handle this situation. Here, we will keep track of all additions to the shopping cart in one set and all the removals from the shopping cart in a different set known as the Tombstone Set. In this case, the least upper bound of two versions of a shopping cart can be calculated simply by taking the union of the sets.
R1 Additions |
R1 Removals |
R2 Additions |
R2 Removals |
---|---|---|---|
📓 | 📓 | 👖 | |
👖 | 📓 |
Even though replica R2
never found out about the removal of the book, this does not matter, because that fact has been recorded in replica R1
.
So now we can avoid having the deleted item reappear in the shopping cart by taking a two-step approach:
- Take the union of the addition sets
- From this, subtract the union of the removal sets
But we still haven't solved all our problems...
Let's say that even though all those bad reviews about the book caused us to delete it, we change our mind (I mean, can a book really be that bad? Let's find out). So you add the book again.
However, look at the addition set — it already contains the book, and our addition set can only hold single instances of an item. And the book is still in the tombstone set because we really did delete it. So, if we left the situation like it is, even though we added the book a second time, it would disappear from the resolved shopping cart because it's in the tombstone set...
So, under these conditions, once you remove an item, it’s never coming back!
Here is where you need to do some serious design work to decide on what behaviour you want your application to have, and then think of scenarios that could break that behaviour.
If you know that the addition of previously deleted items will be a frequently used aspect of your application's functionality, then you will need to implement some sort resolution strategy. For instance:
- You could have of global coordination point in which all the replicas are notified that a previously deleted item is being added again and then try to ensure that everyone agrees.
- Or you could take a simplistic approach and say that additions always win over removals.
- Or you could keep a counter against each added or removed item so that adding the same book twice sets the addition counter to
2
, and removing it sets the removal counter to-1
. Now the desired total is simply the sum of additions total and the removals total.
The bottom line is this: this is hard to get right and has been an area of active research over the last 10 years or so. Quite a few interesting data structures have been proposed for resolving this problem, but it does not appear that any of them have been implemented in production systems yet.
As a developer however, it is difficult to reason about the data you are working with in this abstract manner. The question "Can I really treat my data as elements of a join-semilattice?" is difficult to answer and typically requires the help of specialist verification tools.
This is an area of research in which Lindsey Kuper is actively involved.
Previous | Next |
---|---|
Lecture 21 | Lecture 23 |
Endnotes