Improve concurrency requirements to avoid `database is locked` errors #2251

mikerkelly · 2024-12-17T15:48:12Z

Why are we doing this?

There are regular Sentry alerts for several Issues where requests fail with OperationalError: database is locked errors. The root cause is that our level of concurrency is currently too high for SQLite to reliably handle as currently configured. The database is locked while code writes to the database. This can result in failure after a long timeout of requests making updates, which is poor UX and could lead to their updates getting lost.

We know from Honeycomb several of our views can respond slowly, taking even 10s of seconds. Possibly they are holding a database lock for an extended period. Interacting with the builder to build a codelist or search for code can both write to the database, leading to updates that need the lock occurring frequently. Possibly a queue of updates builds up and some time out before processing and the request fails.

How will we know when it's done?

There are fewer instances of OperationalError: database is locked errors, as monitored through Sentry.

This Issue might have a number of sub-issues.

What are we doing?

According to https://docs.djangoproject.com/en/4.2/ref/databases/#database-is-locked-errors:

SQLite is meant to be a lightweight database, and thus can’t support a high level of concurrency. OperationalError: database is locked errors indicate that your application is experiencing more concurrency than sqlite can handle in default configuration. This error means that one thread or process has an exclusive lock on the database connection and another thread timed out waiting for the lock the be released.

Python’s SQLite wrapper has a default timeout value that determines how long the second thread is allowed to wait on the lock before it times out and raises the OperationalError: database is locked error.
If you’re getting this error, you can solve it by:

Switching to another database backend. At a certain point SQLite becomes too “lite” for real-world applications, and these sorts of concurrency errors indicate you’ve reached that point.

Rewriting your code to reduce concurrency and ensure that database transactions are short-lived.

Increase the default timeout value by setting the timeout database option:

According to the SQLite docs:

If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine.

It's unclear if our concurrency needs are fundamentally so high that we should switch to another database backend or if we can resolve the issue by changing configuration or rewriting our code.

Possible actions:

Increase database timeout #2252 We already increased the timeout to 45 seconds in opencodelists/settings.py. We could try increasing it further to 60 or 90 seconds and monitoring how this improves the situation.
Improve SQLite configuration: IMMEDIATE transaction mode and PRAGMAs #2268 Explore other configuration options that can help such as write-ahead log mode if we don't already use it.
Identify problematic views and code and improve them to require less concurrency by holding the lock for less time or taking the lock less often.
- We know the builder sends a request to the server every time a checkbox is clicked, which may be multiple times a second, which should probably batch such requests. We know we want to improve or replace the builder in general.
- It would be useful to be able to easily identify traces in Honeycomb that may write to the database, perhaps the URL+method where the errors are triggered can help with this. This could be a derived column for easy future reference. Possibly something could be added to the raw OTel data.
Consider splitting the database into more separate databases, as each has a separate locking mechanism. So views updating some models need not necessarily block views updating different models.
Consider whether we need to move to a server-client database.

Defining delivery tasks guidance

The text was updated successfully, but these errors were encountered:

bloodearnest · 2024-12-18T09:58:53Z

There's an sqlite detail here that might be relevant.

tl;dr: turns out everyone's been holding sqlite wrong for years. Even Simon Willison.

Switching to using IMMEDIATE mode on the sql connection may be the core of the solution to database locked errors.

Essentially, not doing this means you will get locked errors that ignore the timeout/retry mechanic altogether.

We did this with airlock: https://github.com/opensafely-core/airlock/blob/main/airlock/settings.py#L168

Relevant Slack discussion from 8 months ago with more detail. https://bennettoxford.slack.com/archives/C01T2HACV3K/p1711836958649859

lucyb · 2024-12-19T18:34:21Z

Ah ha, this is the extra feature that upgrading to Django 5.1 will give us. Thanks for remembering that Simon, and for discovering it in the first place.

From Simon's comment in that thread:

Interestingly, Django 5.1 will have an init_command option we can use for per-connection pragmas
Also, "transaction_mode": "IMMEDIATE".
Might be worth upgrading opencodelists to it, see if we can stop those errors

mikerkelly · 2024-12-22T12:25:51Z

Thanks Simon, that's extremely interesting and useful. That fills in a lot of useful detail on the second bulleted item. I've created #2261 for this. As Lucy says it's partially blocked by #2115.

It seems sensible to use wal in the core database as reads may sometimes be much more common than writes and it helps address concurrency concerns. I checked the mode by downloading and restoring a backup locally and querying pragma journal_mode; giving wal. However I'm a little concerned that I cannot see where this is configured in the repo. I've only looked for a couple of minutes. If it's not captured in code or in manual deployment instructions, it should be. For #2261 to resolve.

evansd · 2024-12-23T09:35:22Z

However I'm a little concerned that I cannot see where this is configured in the repo

Ah yes, I'm guessing this was just manually enabled at some point and then it persists as a database setting. The right way to do this is with the new init_command option in the Django database config. You can see an example in the Airlock config:
https://github.com/opensafely-core/airlock/blob/6f747f192df94d75e551a2e290fe81edf3aaec11/airlock/settings.py#L170-L187

If you're blocked on upgrading Django for whatever reason then we made a backport of these features for Airlock. We've since removed this, having upgraded to 5.1, but the PR which originally added it is here:
opensafely-core/airlock#255

Happy to answer questions on this if that's helpful.

mikerkelly · 2025-01-02T09:52:08Z

As a point of comparison for later monitoring, there were about 175 errors regular Sentry alerts for several Issues where requests fail with OperationalError: database is locked errors in October-December 2024.

#2252 has been done as it was cheap and it's possible it can help in some cases (although possibly not in the cases where IMMEDIATE mode would help). #2268 has been raised for triage for the SQLite config and I hope we can do that soon.

We might reconsider the timeouts across a requests' lifecycle if this issue is resolved. And improving the speed and load of some views (eg the builder) might also be relevant to that discussion.

mikerkelly added the initiative:onboarding label Dec 17, 2024

mikerkelly mentioned this issue Dec 17, 2024

Increase database timeout #2252

Closed

This was referenced Dec 22, 2024

Improve SQLite configuration: IMMEDIATE transaction mode and PRAGMAs #2268

Open

Document backup strategy in a draft ADR #2246

Merged

mikerkelly mentioned this issue Jan 2, 2025

Increase db timeout to 90s #2276

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve concurrency requirements to avoid `database is locked` errors #2251

Improve concurrency requirements to avoid `database is locked` errors #2251

mikerkelly commented Dec 17, 2024 •

edited

Loading

bloodearnest commented Dec 18, 2024

lucyb commented Dec 19, 2024

mikerkelly commented Dec 22, 2024

evansd commented Dec 23, 2024

mikerkelly commented Jan 2, 2025 •

edited

Loading

Improve concurrency requirements to avoid database is locked errors #2251

Improve concurrency requirements to avoid database is locked errors #2251

Comments

mikerkelly commented Dec 17, 2024 • edited Loading

Why are we doing this?

How will we know when it's done?

What are we doing?

bloodearnest commented Dec 18, 2024

lucyb commented Dec 19, 2024

mikerkelly commented Dec 22, 2024

evansd commented Dec 23, 2024

mikerkelly commented Jan 2, 2025 • edited Loading

Improve concurrency requirements to avoid `database is locked` errors #2251

Improve concurrency requirements to avoid `database is locked` errors #2251

mikerkelly commented Dec 17, 2024 •

edited

Loading

mikerkelly commented Jan 2, 2025 •

edited

Loading