
RFC: Improve Topology Server Locking #16269

Closed
mattlord opened this issue Jun 26, 2024 · 1 comment

mattlord (Contributor) commented Jun 26, 2024

Description

The Topology Service provides two core features in Vitess:

  1. Storing metadata about the cluster — what processes live where, what role they are currently serving, how they should be configured, etc. These are the Cells, Keyspaces, Shards, Tablets and associated metadata (e.g. vschema) which make up the cluster.
  2. A distributed lock/coordination service — a means for these loosely coupled processes within the Vitess cluster to safely coordinate on tasks.
  • A simple example is updating Keyspace configuration options, where a Keyspace lock is used.
  • More complex examples include reparenting a shard, where a Shard lock is taken and a failover is performed that updates the tablet types and shard configuration atomically, and switching traffic in a VReplication workflow, where a Keyspace lock is taken on both the source and target keyspaces so that routing rules, shard records, vreplication state, etc. are updated atomically. The latter in particular is a good example of an operation that touches various keys across the topology server, so it's not simply a matter of locking a specific key.
  • It's these more complex cases — where we have to wait for states across processes to converge (e.g. replication to catch up on N shards, where N can be in the thousands) before proceeding to the next step — that are problematic, as they can easily take 30-60+ seconds to complete, during which time the lock may actually be lost and the behavior becomes undefined.
  • Within Vitess we use TTL (time-to-live) values with locks, and these differ across the topo implementations (a concrete etcd sketch follows this list):
    • ZooKeeper has no lock TTLs, so this is not an issue there: you hold the lock until you release it or your session ends.
    • Consul has session TTLs that we specify when requesting the lock. The lock is held until you release it, the session TTL is reached, or your session ends. The default TTL is 15 seconds, which comes from the --topo_consul_lock_session_ttl flag.
    • With etcd we specify a TTL for a lease that we acquire when taking the lock and tie to the lock (which is in turn an ephemeral KV). We auto-extend the lease via client/server keep-alive cycles until you release the lock, the context used is cancelled, the TTL is reached, or the session ends. The default TTL is 30 seconds, which comes from the --topo_etcd_lease_ttl flag.
    • All three topo server implementations maintain client/server sessions with some form of keep-alive cycle. When the client/session is lost — e.g. the process that took the lock, such as vtctld or vttablet, crashes — the lock is removed/released because the client performs no more keep-alive work.
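To make the lease/keep-alive flow concrete, here is a minimal standalone sketch (not Vitess code) using the official go.etcd.io/etcd/client/v3 client; the endpoint, key path, and 30-second TTL are illustrative:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	// Connect to a local etcd; the endpoint is illustrative.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// A session wraps a lease with the given TTL (30s here, mirroring the
	// --topo_etcd_lease_ttl default) and keeps it alive with background
	// keep-alive requests for as long as this process is healthy.
	sess, err := concurrency.NewSession(cli, concurrency.WithTTL(30))
	if err != nil {
		log.Fatal(err)
	}
	defer sess.Close()

	// The lock key is ephemeral: it is attached to the session's lease, so
	// if this process crashes and keep-alives stop, etcd deletes the key
	// after the TTL expires and the lock is silently released.
	mu := concurrency.NewMutex(sess, "/vitess/locks/keyspaces/commerce")
	ctx := context.Background()
	if err := mu.Lock(ctx); err != nil {
		log.Fatal(err)
	}
	defer mu.Unlock(ctx)

	fmt.Println("lock held until Unlock, session end, or missed keep-alives past the TTL")
}
```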

Problems

  1. Locks are sometimes not held long enough. The lock is lost and the caller is unaware, leading to N processes performing actions while each assumes it holds an exclusive lock on the related resources. This results in undefined behavior and can cause very serious problems.
    • There are various related timeouts in place across processes and actions. For example, vtctld and vttablet have the --topo_etcd_lease_ttl flag, which determines the TTL for any lock they take, while the SwitchTraffic command has a --timeout flag and VDiff has a --filtered-replication-wait-time flag, both of which determine how long to wait for replication to catch up. These interrelated timeouts exist throughout the code base, and it's not clear to the user when they are putting consistency at risk by e.g. using a command timeout larger than the lock TTL when switching traffic (the sketch after this list illustrates the invariant at stake). The full set of lock-related flags and behaviors is not well understood or documented, which poses challenges for Vitess developers and users alike. Only the caller knows all of this context, and thus the caller needs a way to override the default TTL for a given lock.
  2. You may need/want to coordinate on work that is not related to data stored directly in the topology server. Today you can only lock a topo entity/key (e.g. a Keyspace record).
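A minimal sketch of the invariant behind problem 1; validateTimeouts is a hypothetical helper, not an existing Vitess function, and the scenario in the comment is only an example:

```go
package main

import (
	"fmt"
	"time"
)

// validateTimeouts is a hypothetical guard showing the invariant problem 1
// describes: the caller's willingness to wait must never exceed the TTL of
// the lock protecting the operation.
func validateTimeouts(lockTTL, commandTimeout time.Duration) error {
	if commandTimeout > lockTTL {
		// e.g. SwitchTraffic --timeout 60s with --topo_etcd_lease_ttl 30s:
		// the lock can expire mid-operation while we wait for replication,
		// letting another process acquire it and race with us.
		return fmt.Errorf("command timeout %v exceeds lock TTL %v: the lock may be lost before the operation completes",
			commandTimeout, lockTTL)
	}
	return nil
}

func main() {
	fmt.Println(validateTimeouts(30*time.Second, 60*time.Second))
}
```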

Proposal

Provide the following new mechanisms to improve topology server locking (a rough interface sketch follows the list):

  1. Provide a way to coordinate on work that is not directly related to a topo entity/key.
  2. Provide a way to allow a caller to override any default lock TTL.
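A rough sketch of what these two mechanisms could look like at the interface level, assuming they follow the existing topo lock pattern of returning a derived context plus an unlock callback; these signatures are illustrative approximations, not the merged API:

```go
package topo

import (
	"context"
	"time"
)

// Locker sketches the two proposed additions. The signatures are
// approximations modeled on the existing topo lock pattern; they are not
// the merged interface.
type Locker interface {
	// LockName takes a lock on an arbitrary name that need not map to any
	// topo entity/key (e.g. a VReplication workflow name), so related
	// engines can coordinate without holding a broad Keyspace lock.
	LockName(ctx context.Context, name, action string) (context.Context, func(*error), error)

	// LockWithTTL is like a regular Keyspace lock, but lets the caller
	// override the topo implementation's default TTL so the lock survives
	// for the full duration of a long-running operation.
	LockWithTTL(ctx context.Context, keyspace, action string, ttl time.Duration) (context.Context, func(*error), error)
}
```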

Related Issues

Proof-Of-Concept

The proof-of-concept PR (merged as #16260; see the closing comment below) implements the two proposals by:

  1. Adding support for named locks.
  2. Adding LockWithTTL to the topology server interface.

Both of those are then used to improve the locking done by VReplication. We leverage named locks to lock a workflow and coordinate across the VReplication and VDiff engines; VDiff now no longer blocks unrelated operations (the Keyspace lock it previously took blocks work on other workflows in the keyspace, schema changes, keyspace config changes, etc.) and the lock is held as long as needed. We leverage LockWithTTL to ensure that during traffic switching operations we hold the Keyspace lock for as long as needed, based on the command's timeout. A hypothetical caller-side sketch follows.
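This sketch reuses the illustrative Locker interface from the Proposal section; the function, lock names, and error-handling pattern are assumptions, not the merged implementation:

```go
// switchTraffic is a hypothetical caller-side flow combining both
// mechanisms; all names here are illustrative.
func switchTraffic(ctx context.Context, ts Locker, keyspace, workflow string, timeout time.Duration) (err error) {
	// 1. Named lock on the workflow itself: coordinates the VReplication
	//    and VDiff engines without blocking other workflows, schema
	//    changes, or keyspace config changes behind a Keyspace lock.
	lockCtx, unlockName, err := ts.LockName(ctx, keyspace+"/"+workflow, "SwitchTraffic")
	if err != nil {
		return err
	}
	defer unlockName(&err)

	// 2. Keyspace lock whose TTL matches the command's --timeout, so it
	//    cannot silently expire while we wait for replication to catch up.
	lockCtx, unlockKeyspace, err := ts.LockWithTTL(lockCtx, keyspace, "SwitchTraffic", timeout)
	if err != nil {
		return err
	}
	defer unlockKeyspace(&err)

	// ... wait for replication on all shards, then atomically update
	//     routing rules, shard records, and vreplication state ...
	_ = lockCtx
	return nil
}
```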

mattlord (Contributor, Author) commented:

I'm going to close this as the proposal was implemented and merged in #16260

@github-project-automation github-project-automation bot moved this from In progress to Done in VReplication Sep 11, 2024