Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve concurrency requirements to avoid database is locked errors #2251

Open
mikerkelly opened this issue Dec 17, 2024 · 5 comments
Open

Comments

@mikerkelly
Copy link
Contributor

mikerkelly commented Dec 17, 2024

Why are we doing this?

There are regular Sentry alerts for several Issues where requests fail with OperationalError: database is locked errors. The root cause is that our level of concurrency is currently too high for SQLite to reliably handle as currently configured. The database is locked while code writes to the database. This can result in failure after a long timeout of requests making updates, which is poor UX and could lead to their updates getting lost.

We know from Honeycomb several of our views can respond slowly, taking even 10s of seconds. Possibly they are holding a database lock for an extended period. Interacting with the builder to build a codelist or search for code can both write to the database, leading to updates that need the lock occurring frequently. Possibly a queue of updates builds up and some time out before processing and the request fails.

How will we know when it's done?

There are fewer instances of OperationalError: database is locked errors, as monitored through Sentry.

This Issue might have a number of sub-issues.

What are we doing?

According to https://docs.djangoproject.com/en/4.2/ref/databases/#database-is-locked-errors:

SQLite is meant to be a lightweight database, and thus can’t support a high level of concurrency. OperationalError: database is locked errors indicate that your application is experiencing more concurrency than sqlite can handle in default configuration. This error means that one thread or process has an exclusive lock on the database connection and another thread timed out waiting for the lock the be released.

Python’s SQLite wrapper has a default timeout value that determines how long the second thread is allowed to wait on the lock before it times out and raises the OperationalError: database is locked error.
If you’re getting this error, you can solve it by:

  • Switching to another database backend. At a certain point SQLite becomes too “lite” for real-world applications, and these sorts of concurrency errors indicate you’ve reached that point.
  • Rewriting your code to reduce concurrency and ensure that database transactions are short-lived.
  • Increase the default timeout value by setting the timeout database option:

According to the SQLite docs:

If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine.

It's unclear if our concurrency needs are fundamentally so high that we should switch to another database backend or if we can resolve the issue by changing configuration or rewriting our code.

Possible actions:

  • Increase database timeout #2252 We already increased the timeout to 45 seconds in opencodelists/settings.py. We could try increasing it further to 60 or 90 seconds and monitoring how this improves the situation.
  • Improve SQLite configuration: IMMEDIATE transaction mode and PRAGMAs #2268 Explore other configuration options that can help such as write-ahead log mode if we don't already use it.
  • Identify problematic views and code and improve them to require less concurrency by holding the lock for less time or taking the lock less often.
    - We know the builder sends a request to the server every time a checkbox is clicked, which may be multiple times a second, which should probably batch such requests. We know we want to improve or replace the builder in general.
    • It would be useful to be able to easily identify traces in Honeycomb that may write to the database, perhaps the URL+method where the errors are triggered can help with this. This could be a derived column for easy future reference. Possibly something could be added to the raw OTel data.
  • Consider splitting the database into more separate databases, as each has a separate locking mechanism. So views updating some models need not necessarily block views updating different models.
  • Consider whether we need to move to a server-client database.

Defining delivery tasks guidance

@bloodearnest
Copy link
Member

There's an sqlite detail here that might be relevant.

tl;dr: turns out everyone's been holding sqlite wrong for years. Even Simon Willison.

Switching to using IMMEDIATE mode on the sql connection may be the core of the solution to database locked errors.

Essentially, not doing this means you will get locked errors that ignore the timeout/retry mechanic altogether.

We did this with airlock: https://github.com/opensafely-core/airlock/blob/main/airlock/settings.py#L168

Relevant Slack discussion from 8 months ago with more detail. https://bennettoxford.slack.com/archives/C01T2HACV3K/p1711836958649859

@lucyb
Copy link
Contributor

lucyb commented Dec 19, 2024

Ah ha, this is the extra feature that upgrading to Django 5.1 will give us. Thanks for remembering that Simon, and for discovering it in the first place.

From Simon's comment in that thread:

Interestingly, Django 5.1 will have an init_command option we can use for per-connection pragmas
Also, "transaction_mode": "IMMEDIATE".
Might be worth upgrading opencodelists to it, see if we can stop those errors

@mikerkelly
Copy link
Contributor Author

Thanks Simon, that's extremely interesting and useful. That fills in a lot of useful detail on the second bulleted item. I've created #2261 for this. As Lucy says it's partially blocked by #2115.

It seems sensible to use wal in the core database as reads may sometimes be much more common than writes and it helps address concurrency concerns. I checked the mode by downloading and restoring a backup locally and querying pragma journal_mode; giving wal. However I'm a little concerned that I cannot see where this is configured in the repo. I've only looked for a couple of minutes. If it's not captured in code or in manual deployment instructions, it should be. For #2261 to resolve.

@evansd
Copy link
Contributor

evansd commented Dec 23, 2024

However I'm a little concerned that I cannot see where this is configured in the repo

Ah yes, I'm guessing this was just manually enabled at some point and then it persists as a database setting. The right way to do this is with the new init_command option in the Django database config. You can see an example in the Airlock config:
https://github.com/opensafely-core/airlock/blob/6f747f192df94d75e551a2e290fe81edf3aaec11/airlock/settings.py#L170-L187

If you're blocked on upgrading Django for whatever reason then we made a backport of these features for Airlock. We've since removed this, having upgraded to 5.1, but the PR which originally added it is here:
opensafely-core/airlock#255

Happy to answer questions on this if that's helpful.

@mikerkelly
Copy link
Contributor Author

mikerkelly commented Jan 2, 2025

As a point of comparison for later monitoring, there were about 175 errors regular Sentry alerts for several Issues where requests fail with OperationalError: database is locked errors in October-December 2024.

#2252 has been done as it was cheap and it's possible it can help in some cases (although possibly not in the cases where IMMEDIATE mode would help). #2268 has been raised for triage for the SQLite config and I hope we can do that soon.

We might reconsider the timeouts across a requests' lifecycle if this issue is resolved. And improving the speed and load of some views (eg the builder) might also be relevant to that discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants