
RFC: Improve Topology Server Locking #16269

Closed
mattlord opened this issue Jun 26, 2024 · 1 comment

mattlord (Contributor) commented Jun 26, 2024

Description

The Topology Service provides two core features in Vitess:

  1. Storing metadata about the cluster — what processes live where, what role they are currently serving, how they should be configured, etc. These are the Cells, Keyspaces, Shards, Tablets and associated metadata (e.g. vschema) which make up the cluster.
  2. A distributed lock/coordination service — a means for these loosely coupled processes within the Vitess cluster to safely coordinate on tasks.
  • A simple example is updating Keyspace configuration options, where a Keyspace lock is used.
  • More complex examples include reparenting a shard, where a Shard lock is taken and a failover is performed that updates the tablet types and shard configuration atomically, and switching traffic in a VReplication workflow, where a Keyspace lock is taken on both the source and target keyspaces so that routing rules, shard records, vreplication state, etc. are updated atomically. The latter in particular is a good example of an operation that touches various keys across the topology server, so it's not simply a matter of locking a specific key.
  • It's these more complex cases — where we have to wait for states across processes to converge (e.g. replication to catch up on N shards, where N can be in the thousands) before proceeding to the next step — that are problematic, as they can easily take 30-60+ seconds to complete, during which time the lock may actually be lost and the behavior becomes undefined.
  • Within Vitess we use TTL (time-to-live) values with locks, and these differ across the topo implementations (a concrete etcd sketch follows this list):
    • ZooKeeper has no lock TTLs, so this is not an issue there: you hold the lock until you release it or your session ends.
    • Consul has session TTLs that we specify when requesting the lock. The lock is held until you release it, the session TTL is reached, or your session ends. The default TTL is 15 seconds, which comes from the --topo_consul_lock_session_ttl flag.
    • With etcd we specify a TTL for a lease that we acquire when taking the lock and tie to the lock (which is in turn an ephemeral KV). We auto-extend the lease via client/server keep-alive cycles until you release the lock, the context used is cancelled, the TTL is reached, or the session ends. The default TTL is 30 seconds, which comes from the --topo_etcd_lease_ttl flag.
    • All three topo server implementations maintain client/server sessions with some form of keep-alive cycle. When the client/session is lost — e.g. the process that took the lock, such as vtctld or vttablet, crashes — the lock is removed/released because the client performs no more keep-alive work.
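To make the lease/keep-alive flow concrete, here is a minimal standalone sketch (not Vitess code) using the official go.etcd.io/etcd/client/v3 client; the endpoint, key path, and 30-second TTL are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	// Connect to a local etcd; the endpoint is illustrative.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session wraps a lease with the given TTL (30s here, mirroring the
	// --topo_etcd_lease_ttl default) and keeps it alive with background
	// keep-alive requests for as long as this process is healthy.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(30))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// The lock key is ephemeral: it is attached to the session's lease, so
	// if this process crashes and keep-alives stop, etcd deletes the key
	// after the TTL expires and the lock is silently released.
	mu := concurrency.NewMutex(sess, "/vitess/locks/keyspaces/commerce")
	ctx := context.Background()
	if err := mu.Lock(ctx); err != nil {
		log.Fatal(err)
	}
	defer mu.Unlock(ctx)

	fmt.Println("lock held until Unlock, session end, or missed keep-alives past the TTL")
}
```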

Problems

  1. Locks are sometimes not held long enough. The lock is lost and the caller is unaware, leading to N processes performing actions while each assumes it holds an exclusive lock on the related resources. This results in undefined behavior and can cause very serious problems.
    • There are various related timeouts in place across processes and actions. For example, vtctld and vttablet have the --topo_etcd_lease_ttl flag, which determines the TTL for any lock they take, while the SwitchTraffic command has a --timeout flag and VDiff has a --filtered-replication-wait-time flag, both of which determine how long to wait for replication to catch up. These interrelated timeouts exist throughout the code base, and it's not clear to the user when they are putting consistency at risk by e.g. using a command timeout larger than the lock TTL when switching traffic (the sketch after this list illustrates the invariant at stake). The full set of lock-related flags and behaviors is not well understood or documented, which poses challenges for Vitess developers and users alike. Only the caller knows all of this context, and thus the caller needs a way to override the default TTL for a given lock.
  2. You may need/want to coordinate on work that is not related to data stored directly in the topology server. Today you can only lock a topo entity/key (e.g. a Keyspace record).
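A minimal sketch of the invariant behind problem 1; validateTimeouts is a hypothetical helper, not an existing Vitess function, and the scenario in the comment is only an example:

```go
package main

import (
	"fmt"
	"time"
)

// validateTimeouts is a hypothetical guard showing the invariant problem 1
// describes: the caller's willingness to wait must never exceed the TTL of
// the lock protecting the operation.
func validateTimeouts(lockTTL, commandTimeout time.Duration) error {
	if commandTimeout > lockTTL {
		// e.g. SwitchTraffic --timeout 60s with --topo_etcd_lease_ttl 30s:
		// the lock can expire mid-operation while we wait for replication,
		// letting another process acquire it and race with us.
		return fmt.Errorf("command timeout %v exceeds lock TTL %v: the lock may be lost before the operation completes",
			commandTimeout, lockTTL)
	}
	return nil
}

func main() {
	fmt.Println(validateTimeouts(30*time.Second, 60*time.Second))
}
```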

Proposal

Provide the following new mechanisms to improve topology server locking (a rough interface sketch follows the list):

  1. Provide a way to coordinate on work that is not directly related to a topo entity/key.
  2. Provide a way to allow a caller to override any default lock TTL.
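A rough sketch of what these two mechanisms could look like at the interface level, assuming they follow the existing topo lock pattern of returning a derived context plus an unlock callback; these signatures are illustrative approximations, not the merged API:

```go
package topo

import (
	"context"
	"time"
)

// Locker sketches the two proposed additions. The signatures are
// approximations modeled on the existing topo lock pattern; they are not
// the merged interface.
type Locker interface {
	// LockName takes a lock on an arbitrary name that need not map to any
	// topo entity/key (e.g. a VReplication workflow name), so related
	// engines can coordinate without holding a broad Keyspace lock.
	LockName(ctx context.Context, name, action string) (context.Context, func(*error), error)

	// LockWithTTL is like a regular Keyspace lock, but lets the caller
	// override the topo implementation's default TTL so the lock survives
	// for the full duration of a long-running operation.
	LockWithTTL(ctx context.Context, keyspace, action string, ttl time.Duration) (context.Context, func(*error), error)
}
```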

Related Issues

Proof-Of-Concept

The proof-of-concept PR (merged as #16260; see the closing comment below) implements the two proposals by:

  1. Adding support for named locks.
  2. Adding LockWithTTL to the topology server interface.

Both of those are then used to improve the locking done by VReplication. We leverage named locks to lock a workflow and coordinate across the VReplication and VDiff engines; VDiff now no longer blocks unrelated operations (the Keyspace lock it previously took blocks work on other workflows in the keyspace, schema changes, keyspace config changes, etc.) and the lock is held as long as needed. We leverage LockWithTTL to ensure that during traffic switching operations we hold the Keyspace lock for as long as needed, based on the command's timeout. A hypothetical caller-side sketch follows.
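This sketch reuses the illustrative Locker interface from the Proposal section; the function, lock names, and error-handling pattern are assumptions, not the merged implementation:

```go
// switchTraffic is a hypothetical caller-side flow combining both
// mechanisms; all names here are illustrative.
func switchTraffic(ctx context.Context, ts Locker, keyspace, workflow string, timeout time.Duration) (err error) {
	// 1. Named lock on the workflow itself: coordinates the VReplication
	//    and VDiff engines without blocking other workflows, schema
	//    changes, or keyspace config changes behind a Keyspace lock.
	lockCtx, unlockName, err := ts.LockName(ctx, keyspace+"/"+workflow, "SwitchTraffic")
	if err != nil {
		return err
	}
	defer unlockName(&err)

	// 2. Keyspace lock whose TTL matches the command's --timeout, so it
	//    cannot silently expire while we wait for replication to catch up.
	lockCtx, unlockKeyspace, err := ts.LockWithTTL(lockCtx, keyspace, "SwitchTraffic", timeout)
	if err != nil {
		return err
	}
	defer unlockKeyspace(&err)

	// ... wait for replication on all shards, then atomically update
	//     routing rules, shard records, and vreplication state ...
	_ = lockCtx
	return nil
}
```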

mattlord (Contributor, Author) commented:

I'm going to close this as the proposal was implemented and merged in #16260

@github-project-automation github-project-automation bot moved this from In progress to Done in VReplication Sep 11, 2024