Fix transaction isolation issues #393
Conversation
Great finds on this investigation, simple solutions for a complex problem. Let me know if I can help anywhere. Have some comments/questions, also learning at the same time as you.
Are we getting this error in spite of the retries? I would have expected the retries to succeed; otherwise we won't be able to recover in the event that we get a true serialization conflict. Would it be worth adding exponential backoff with jitter to prevent the conflicting retries from happening simultaneously?
The second answer in the post you linked states:
Isn't this prevented by the advisory lock?
Yes, but increasing the number of retries or decreasing the level of concurrency can get it to pass.
That's what I thought, so yesterday I wrote 10k entries into the table and tried it. Same issue. Maybe every empty result set lands on the same page, so it'll always be an issue for
In this case, Postgres isolation is actually too good. Both transactions will be working with a snapshot of the DB state before either transaction committed. This is true even on the lowest isolation level Postgres offers.
Double-checking, that's 10k entries with different inbox_ids, right? At that point I would expect the
Good point! Would it solve the problem if we acquired the advisory lock before the transaction started and released it after?
Definitely might. I haven't tried that yet. There's a bit of nuance to acquiring locks outside of a transaction, since they don't automatically expire. I think you need to set a timeout in the DB to make sure you can't permanently deadlock.
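If we do try that route, a rough sketch of the lock-before-the-transaction shape with Go's `database/sql` might look like the following. Everything here is illustrative rather than the actual service code: the package and function names, the `5s` timeout, and keying the lock on `hashtext(inbox_id)` are all assumptions. The main nuances are that a session-level advisory lock belongs to one connection (so it has to be pinned from the pool) and that it must be released explicitly.

```go
// Illustrative sketch only, not the actual implementation: take a session-level
// advisory lock keyed on the inbox_id before the transaction, release it after.
package identity

import (
	"context"
	"database/sql"
)

func updateIdentityWithSessionLock(ctx context.Context, db *sql.DB, inboxID string) error {
	// Session-level advisory locks belong to a single connection, so pin one
	// from the pool for the whole operation instead of using db directly.
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Fail fast rather than queueing forever behind a stuck lock holder.
	if _, err := conn.ExecContext(ctx, `SET lock_timeout = '5s'`); err != nil {
		return err
	}

	// hashtext maps the inbox_id onto the integer advisory-lock keyspace.
	if _, err := conn.ExecContext(ctx, `SELECT pg_advisory_lock(hashtext($1))`, inboxID); err != nil {
		return err
	}
	// Session-level locks don't expire on commit; release explicitly, otherwise
	// the lock lives as long as this connection does.
	defer conn.ExecContext(ctx, `SELECT pg_advisory_unlock(hashtext($1))`, inboxID)

	tx, err := conn.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// ... read the existing inbox log, validate the update, insert the new row ...

	return tx.Commit()
}
```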
@richardhuaaa in any case, WDYT about merging this PR as-is (which will fix our issues with test parallelism and prevent issues with platform SDK tests)? Then we can tinker with some of the suggestions. It does feel worthwhile to try the lock-before-the-transaction-starts approach and see how it feels.
Happy to unblock for now! For clarity in the next stage: the `last_sequence_id` thing also seems fine. My order of personal preference would be: 1) stick to serializable isolation if it works properly with more data, 2) acquire the lock before the transaction if it releases properly in edge cases, 3) use `last_sequence_id`. But really, any of them seems like it will work; it's just a matter of simplicity going forward.
tl;dr
We were getting an error when trying to perform concurrent identity updates, even when they were on different `inbox_id`s and with different addresses. The error was:

This adds a test to detect these issues and resolves them by lowering the transaction isolation level.
Fixes
Why was this happening?
I isolated the issue to the `GetAllInboxLogs` query. When replacing that with a fixed `[]` and running only `CreateIdentity` updates, the issue totally goes away. All other behaviour is the same for `CreateIdentity` since there are no existing inbox logs (it's the first update).

This post explains how you can run into this scenario with serializable transactions and empty queries. The gist is that serializable transactions try to lock data as narrowly as your query, but sometimes they can only lock portions of the heap that are larger. I've confirmed in the query planner that this query sometimes leads to a heap scan that touches 4 rows and sometimes uses an index scan that touches 0 rows.
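As a way to see the failure mode in isolation, here is a minimal sketch of the conflicting scenario (this is not the test added in this PR; the table and column names, connection string, and driver are assumptions). Two serializable transactions touch different `inbox_id`s, each sees an empty result from the log query, and one of them can still fail on commit because the empty reads take predicate locks on the same heap page.

```go
// Minimal reproduction sketch (names are illustrative, not the real schema).
package main

import (
	"context"
	"database/sql"
	"fmt"
	"sync"

	_ "github.com/lib/pq" // any Postgres driver; this particular one is an assumption
)

func insertFirstUpdate(ctx context.Context, db *sql.DB, inboxID string) error {
	tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelSerializable})
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Stand-in for GetAllInboxLogs: returns zero rows for a brand-new inbox_id,
	// but still takes a predicate lock on whatever the planner scans.
	rows, err := tx.QueryContext(ctx,
		`SELECT sequence_id FROM inbox_log WHERE inbox_id = $1 ORDER BY sequence_id`, inboxID)
	if err != nil {
		return err
	}
	rows.Close()

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO inbox_log (inbox_id, identity_update) VALUES ($1, $2)`,
		inboxID, []byte("update")); err != nil {
		return err
	}
	// One of the two concurrent commits may fail with SQLSTATE 40001:
	// "could not serialize access due to read/write dependencies among transactions".
	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/identity_test?sslmode=disable")
	if err != nil {
		panic(err)
	}
	var wg sync.WaitGroup
	for _, id := range []string{"inbox-a", "inbox-b"} {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			if err := insertFirstUpdate(context.Background(), db, id); err != nil {
				fmt.Println(id, err)
			}
		}(id)
	}
	wg.Wait()
}
```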
What's the fix?
I changed the way we do locking to use a `pg_advisory_xact_lock` instead of relying on `SELECT ... FOR UPDATE`. The advantage of advisory locks is that they behave the same whether or not there are existing `inbox_log` entries in the DB, since they don't rely on locking any particular row; instead we lock on the `inbox_id` itself. All the locking really buys us is a guarantee that, if a conflict is detected on commit, the previous transaction will be done by the time we retry. Otherwise you could retry 3X before the first transaction ever finished.
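In rough terms (the helper names below are illustrative, not the actual diff), the shape is a transaction-scoped advisory lock keyed on the `inbox_id`, which Postgres releases automatically when the transaction commits or rolls back:

```go
// Sketch only: same imports as the snippets above; function names are assumptions.
func lockInbox(ctx context.Context, tx *sql.Tx, inboxID string) error {
	// hashtext maps the inbox_id onto the advisory-lock keyspace; the lock is
	// held until this transaction commits or rolls back.
	_, err := tx.ExecContext(ctx, `SELECT pg_advisory_xact_lock(hashtext($1))`, inboxID)
	return err
}

func createIdentityUpdate(ctx context.Context, db *sql.DB, inboxID string, update []byte) error {
	tx, err := db.BeginTx(ctx, &sql.TxOptions{Isolation: sql.LevelRepeatableRead})
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Serialize all writers for this inbox_id: a second transaction blocks here
	// until the first commits, rather than detecting a conflict and retrying blindly.
	if err := lockInbox(ctx, tx, inboxID); err != nil {
		return err
	}

	// ... GetAllInboxLogs, validate the update against the existing log, insert the new row ...

	return tx.Commit()
}
```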
I also changed the transaction isolation level to `repeatable_read`. I don't actually think this gives us the protection we need, and it will have to be revisited: two transactions can both run with the same initial set of identity updates, both can be valid updates, and both will get written.
What we could do to get the proper protections is to have a nullable `last_sequence_id` field on each row with a unique index covering `(inbox_id, last_sequence_id)`. That would guarantee Postgres checks for a conflict and fails if two rows both try to update from the same previous state. But it feels a little heavy-handed and I'd like to explore lighter ways of getting the same thing.
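For reference, the migration for that option could look roughly like this (the column and index names are assumptions, and nothing like this is in the PR):

```go
// Sketch of a migration for the unique-index option; names are assumptions.
const addLastSequenceID = `
ALTER TABLE inbox_log ADD COLUMN last_sequence_id BIGINT;

-- Each row records the sequence_id it was built on top of, so two writers that
-- start from the same previous state collide on the index instead of both
-- committing. On Postgres 15+, NULLS NOT DISTINCT also makes two concurrent
-- *first* updates (both NULL) conflict; by default NULL values never collide.
CREATE UNIQUE INDEX inbox_log_prev_state_uniq
    ON inbox_log (inbox_id, last_sequence_id) NULLS NOT DISTINCT;
`
```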
Another idea is to have a separate table that just stores `(inbox_id, last_sequence_id)` and have every update write to that table. That would also get the same result, but without bloating our actual table. Postgres will detect conflicts if two transactions try to write to the same row at `REPEATABLE READ`.
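A sketch of that variant (the `inbox_state` table name is an assumption): each identity update also bumps a single per-inbox row inside the same transaction, and at `REPEATABLE READ` the second concurrent writer fails with "could not serialize access due to concurrent update" instead of committing alongside the first.

```go
// Sketch only; assumes an inbox_state row is created once per inbox_id.
// Run inside the same REPEATABLE READ transaction as the inbox_log insert.
func bumpInboxState(ctx context.Context, tx *sql.Tx, inboxID string, newSequenceID int64) error {
	// If another transaction updated this row after our snapshot was taken,
	// Postgres aborts this one with SQLSTATE 40001 rather than letting both commit.
	_, err := tx.ExecContext(ctx,
		`UPDATE inbox_state SET last_sequence_id = $2 WHERE inbox_id = $1`,
		inboxID, newSequenceID)
	return err
}
```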