Add locking mechanism to prevent parallel execution as a safety measure #335

dahu33 · 2022-04-21T01:49:04Z

I know this topic has already been discussed in #191 and #258 and while I understand the argument that migrations should not run in parallel in the first place, I still think that goose should have a locking mechanism to prevent parallel execution as a safety measure. This would not only allow new use cases but also make goose a much safer tool to use!

mfridman · 2022-04-21T02:17:50Z

I can't speak for @VojtechVitek, but I've come around on this. After chatting with other folks and gathering feedback the consensus is "it'd be a nice feature to have".

I'm not sure what this would actually look like to the end-user, but, I suggest we keep this issue open and continue the discussion.

mfridman · 2022-04-21T02:22:24Z

Not an exclusive list, but some things to think through..

How is this implemented and which database(s) are supported
Is this implemented as a default or an opt-in
Can it be implemented in a backwards compatible way

I'm sure more questions and implementation details will come up.

jessedearing · 2022-09-28T19:06:48Z

To solve this we had to write a wrapper executable for goose that obtains an advisory lock in postgres. It's pretty common in Kubernetes deployments to use a init container to perform database migrations. The issue arises due to the fact that init containers are part of deployments which run multiple copies of the container. With no locking in place, goose migrations interleave with each other.

It's a race condition so we don't see the issue all the time, but with no locking in place we run into issues randomly.

We use advisory locks, but the implementation is pretty specific to each DB backend.
A more general approach could be to use LOCK TABLE on the goose migrations table with an access exclusive lock in a transaction while the migrations run.

I'd expect the end user experience for either of these methods to not look any different that it does now. Goose runs all the same, the only difference is that another copy of goose cannot run until the first session is complete.

mfridman · 2022-09-28T19:31:36Z

Oddly enough, I was just chatting with folks about this.

@jessedearing, was your approach similar to the one described by @dahu33 in this #191 (comment)?

I've seen this come up a lot recently, probably because more and more environments are punting this problem to the tool itself.

jessedearing · 2022-09-28T22:07:52Z

was your approach similar to the one described by @dahu33 in this #191 (comment)?

It was. We moved it to it's own executable and launched Goose so we can still pull in upstream Goose updates easily. We just use os/exec to call Goose. Like @dahu33 describes we do use select pg_advisory_lock(1) and select pg_advisory_unlock(1) in this executable. Since it is in a separate executable we just create a single DB connection so we don't need to worry about pooling.

The implementation is very specific to Postgres, but MySQL has GET_LOCK and RELEASE_LOCK functions too.

tomwhipple · 2022-11-23T18:22:07Z

To add another scenario: I'm embedding Goose into the test suite to ensure that code and database (Postgres) remain in sync. However, I find that the golang test runner introduces concurrency such that migrations can inconsistently happen simultaneously.

mfridman · 2022-11-30T14:36:54Z

Thanks all for the feedback. Going to bump this up and add a label as something we should add to this library. It'll likely be opt-in behaviour to avoid unexpected issues for existing users.

mfridman · 2023-01-30T01:19:08Z

After some thought, I think an advisory lock (at least for postgres) is the right approach, granted this won't always work (see cockroachdb/cockroach#13546). We can document which databases are safe for concurrent use and which aren't.

Although this may be a reason to use a table lock, similar to liquibase, which not only helps in the cockroachdb example but also where PgBouncer is being used with transaction pooling.

Maybe the current goose behavior should be the default and users are responsible for serializing their deployments (as an example) .. but if the database supports it then they can opt-in for advisory locks. Tbd whether this should be session level or transaction level advisory lock.

We could implement both and leave it up to the user to configure based on their needs, something along the lines of:

-lock={none, session, transaction}?

I really like @dahu33 comment from above, which is why I'd like to see goose add support for this feature and with some wood behind this arrow hopefully we can get an implementation we all like.

This would not only allow new use cases but also make goose a much safer tool to use!

If folks have further thoughts, suggestions or ideas on an implementation please drop your comments!

mfridman · 2023-03-14T13:45:46Z

Dropping some thoughts here. I think it'd be ideal if we implemented a transaction-level advisory lock, instead of a session-level advisory lock. But there are some notable gotchas with transaction-level locks, such as CREATE INDEX CONCURRENTLY (among others).

These statements wait for other transactions to complete before executing (reference):

.. and in addition it must wait for all existing transactions that could potentially modify or use the index to terminate.

This means if we took a transaction-level lock and then attempted to run a migration it'd hang indefinitely. Not good.

Now, there is another option. For each migration, we know whether it is safe to run within a transaction or not (with the -- +goose NO TRANSACTION annotation). By default, we run a migration using a transaction-level lock, but for these special-case migrations, we promote to a session-level lock.

The downside with session-level advisory locks is they MUST be released by the same session (read connection), and if anything happens to that session you're in for a bad time trying to figure out why subsequent operations are hanging.

Once acquired at session level, an advisory lock is held until explicitly released or the session ends. Unlike standard lock requests, session-level advisory lock requests do not honor transaction semantics: a lock acquired during a transaction that is later rolled back will still be held following the rollback, and likewise an unlock is effective even if the calling transaction fails later.

Another thing I've been thinking about is whatever the implementation is, we should keep the door open to supporting this request: #222

tl;dr it'd be nice if multiple migrations could be applied within a single atomic transaction. (assuming all those files don't have the -- +goose NO TRANSACTION annotation).

I'd like to explore what a separate goose_lock table could look like, where we run a SELECT ... FOR UPDATE to acquire a lock. But there are some gotchas with this approach as well. I'll drop them in another comment.

mfridman · 2023-04-23T20:54:03Z

For the postgres session-level advisory lock, I changed the implementation to use pg_try_advisory_lock (instead of pg_advisory_lock) with retries attempting to acquire the lock.

The retry has sane default but can be configured by the user.

This implementation avoids an edge case that can lead to a deadlock. It's quite easy to reproduce with any tool that uses pg_advisory_lock exclusively:

Create 10 workers trying to acquire a lock with pg_advisory_lock
Attempt to apply the following 2 migrations (as separate operations, e.g., 2 up files)

migration 1

CREATE TABLE owners (
    owner_id BIGSERIAL PRIMARY KEY,
    owner_name text NOT NULL
);

migration 2

CREATE UNIQUE INDEX CONCURRENTLY ON owners(owner_name);

This will lead to the following error: deadlock detected (SQLSTATE 40P01)

If I have some cycles, I'll add more details here.

peterldowns · 2023-05-30T03:08:34Z

@mfridman if/when you have the cycles, I'd be very interested to learn how taking a lock with pg_advisory_lock led to deadlock but acquiring the lock by looping over pg_try_advisory_lock fixed the problem. In my testdb project for spinning up cheap postgres databases in tests I acquire locks with pg_advisory_lock to serialize writes from potentially many tests running in parallel, and I'd be quite worried about running into this problem. Any chance you could post a repro that I could look into?

peterldowns · 2023-05-30T03:44:42Z

Oh, I was able to reproduce this with just two psql connections to a database.

Connection 1:

postgres=# select pg_advisory_lock(10101);
 pg_advisory_lock
------------------

(1 row)

Connection 2:

postgres=# select pg_advisory_lock(10101);

This hangs, waiting for Connection 1 to release the lock.

Then, back in Connection 1:

postgres=# CREATE TABLE owners (name text);
CREATE TABLE
postgres=# CREATE INDEX CONCURRENTLY foobar ON owners (name);
ERROR:  deadlock detected
DETAIL:  Process 68 waits for ShareLock on virtual transaction 3/2; blocked by process 67.
Process 67 waits for ExclusiveLock on advisory lock [5,0,10101,1]; blocked by process 68.
HINT:  See server log for query details

That's... pretty crazy, and not what I would have expected! The documentation on CREATE INDEX CONCURRENTLY and on advisory locks does not clearly explain that they would conflict with each other. I guess the virtual transaction waiting on the advisory lock (Connection 2 in my example) is counted as a "transaction that has a snapshot", even though its virtual transaction began before the owners table was even created?

After the second scan, the index build must wait for any transactions that have a snapshot (see Chapter 13) predating the second scan to terminate, including transactions used by any phase of concurrent index builds on other tables, if the indexes involved are partial or have columns that are not simple column references.

Or, like you quoted earlier, maybe it is a virtual transaction that "could use the index"

it must wait for all existing transactions that could potentially modify or use the index to terminate.

Looks like the Flyway project also switched their implementation to spinning on pg_try_advisory_lock as a result, per this thread flyway/flyway#1654 (comment)

Appreciate you mentioning it here, I wouldn't have been aware of the issue otherwise.

jessedearing · 2023-05-30T20:42:13Z

We never saw this in our tooling because our tables are small enough that we haven't added CONCURRENTLY to CREATE INDEX. Good catch

mfridman · 2023-05-31T01:53:25Z

@peterldowns Coincidentally, I recently discovered a series of Blog posts from Sourcegraph (huge kudos to @efritz) that outlines their migration woes with golang-migrate. Specifically, they describe the #335 (comment) above.

https://about.sourcegraph.com/blog/introducing-migrator-service .. scroll down to A deeper contributing factor

The good news is the /v4 version of goose handles this case and there are a bunch of tests to capture the behaviour. It's quite reproducible.

peterldowns · 2023-05-31T03:29:50Z

@mfridman thanks for the links to those blogposts. Where I last worked, we would build large indexes concurrently "out of band", and then check in a migration that creates-if-not-exists.

For example, let's say we needed to create some index on a really large table where it would probably take a few hours. We would normally run any new migrations as part of our deployment process, as a prelude to replacing the existing app containers with the new version. We couldn't create that index during deployment, even with CREATE INDEX CONCURRENTLY, because we would set strict timeouts on the deploy process. So we'd write a migration like this:

CREATE INDEX -- CONCURRENTLY
  IF NOT EXISTS my_example_idx
  ON some_really_large_table (primary_key, created_at, some_other_key);

Once the PR with this migration was approved, we'd use a script to manually execute the concurrent version of that query:

CREATE INDEX CONCURRENTLY
  IF NOT EXISTS my_example_idx
  ON some_really_large_table (primary_key, created_at, some_other_key);

Only once that concurrent index creation had succeeded would we merge the PR. After merging the PR, redeploying would involve running migrations before starting up the app, but it would take no time at all because the index was already created!

In a testing scenario where we would need to migrate from an empty database to the current state, this would be fast becuase there wouldn't be any data in the tables, and we didn't do concurrent index creation.

I'm actually planning on releasing a migration library/framework soon based on this example and a few other things I learned. Because of this dance around concurrent migrations I had never run into this problem that you mentioned. Thank you again for helping me understand the problem and for teaching me something new about Postgres!

efritz · 2023-06-01T13:23:04Z

Saw my blog post mentioned above, thought I'd drop in and say 👋! The way we do migrations at Sourcegraph assumes:

we attempt to grab an advisory lock around
we log attempts into a migration_logs table
each transaction (except for create index concurrently) is wrapped in a transaction by the migration machinery
we update the attempt rows with success/failure once done (transactionally)

This affords us certain properties such as "a incomplete attempt while no migrator holds a lock" means a previous migration died before completion, but it didn't affect anything (since it was done atomically in a transaction).

Concurrent transactions throw a wrench in the works here because you can't hold an advisory lock and complete the index creation at the same time. The way we get around this is special-casing migrations with concurrent index creation here:

Drop the advisory lock before attempting a CIC transaction
Poll the index status from pg catalog tables
If the index is 100% complete, you're done
If the index is invalid, drop
If the index not being created, apply the CREATE INDEX CONCURRENTLY statement
Go to 2

Note that the reason we could do this (and golang-migrate couldn't easily) was because we only support Postgres.

mfridman · 2023-11-12T17:48:09Z

This functionality has been added in v3.16.0 and higher. Requires a *goose.Provider and the goose.WithSessionLocker option.

A bit more detail in this blog post:

https://pressly.github.io/goose/blog/2023/goose-provider/#database-locking

Thank you to everyone for the feedback, fruitful discussions, and contributions! Going to close this issue, but feel free to continue the discussion and looking forward to your feedback.

dahu33 mentioned this issue May 18, 2022

Add Mutex Lock for Database Migrations #191

Closed

mfridman self-assigned this Nov 30, 2022

mfridman added the up next label Jan 22, 2023

mfridman pinned this issue Jan 22, 2023

mfridman added this to the v3.10.0 milestone Jan 22, 2023

mfridman mentioned this issue Mar 21, 2023

[WIP] - Add exclusive session level advisory lock (postgres only) #483

Closed

5 tasks

mfridman added the /v4 label Apr 1, 2023

mfridman mentioned this issue Apr 2, 2023

[/v4] - Add session level advisory lock (postgres only) #493

Merged

mfridman mentioned this issue May 25, 2023

Blog idea - postgres session-level advisory locks mfridman/goose-docs#5

Open

mfridman mentioned this issue Oct 7, 2023

Development Roadmap for v3 #605

Closed

mfridman removed the /v4 label Oct 7, 2023

mfridman mentioned this issue Oct 9, 2023

feat(experimental): add internal migrate package and SessionLocker interface #606

Merged

mfridman closed this as completed Nov 12, 2023

mfridman removed this from the Up next in v3 milestone Dec 14, 2023

mfridman mentioned this issue May 9, 2024

Postgres deadlocking when multiple processes run CREATE INDEX CONCURRENTLY golang-migrate/migrate#960

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add locking mechanism to prevent parallel execution as a safety measure #335

Add locking mechanism to prevent parallel execution as a safety measure #335

dahu33 commented Apr 21, 2022

mfridman commented Apr 21, 2022

mfridman commented Apr 21, 2022

jessedearing commented Sep 28, 2022

mfridman commented Sep 28, 2022

jessedearing commented Sep 28, 2022

tomwhipple commented Nov 23, 2022

mfridman commented Nov 30, 2022

mfridman commented Jan 30, 2023 •

edited

Loading

mfridman commented Mar 14, 2023

mfridman commented Apr 23, 2023

peterldowns commented May 30, 2023

peterldowns commented May 30, 2023 •

edited

Loading

jessedearing commented May 30, 2023

mfridman commented May 31, 2023 •

edited

Loading

peterldowns commented May 31, 2023 •

edited

Loading

efritz commented Jun 1, 2023

mfridman commented Nov 12, 2023

Add locking mechanism to prevent parallel execution as a safety measure #335

Add locking mechanism to prevent parallel execution as a safety measure #335

Comments

dahu33 commented Apr 21, 2022

mfridman commented Apr 21, 2022

mfridman commented Apr 21, 2022

jessedearing commented Sep 28, 2022

mfridman commented Sep 28, 2022

jessedearing commented Sep 28, 2022

tomwhipple commented Nov 23, 2022

mfridman commented Nov 30, 2022

mfridman commented Jan 30, 2023 • edited Loading

mfridman commented Mar 14, 2023

mfridman commented Apr 23, 2023

migration 1

migration 2

peterldowns commented May 30, 2023

peterldowns commented May 30, 2023 • edited Loading

jessedearing commented May 30, 2023

mfridman commented May 31, 2023 • edited Loading

peterldowns commented May 31, 2023 • edited Loading

efritz commented Jun 1, 2023

mfridman commented Nov 12, 2023

mfridman commented Jan 30, 2023 •

edited

Loading

peterldowns commented May 30, 2023 •

edited

Loading

mfridman commented May 31, 2023 •

edited

Loading

peterldowns commented May 31, 2023 •

edited

Loading