
SPIKE Investigate the use of Digital Ocean Block/Object Storage for the databases #2225

Open
lucyb opened this issue Dec 10, 2024 · 4 comments

Comments

@lucyb
Contributor

lucyb commented Dec 10, 2024

We think that using Digital Ocean Block Storage will be a cost-effective way of increasing storage capacity, while also providing us with faster database access and improved backups.

In particular, we've identified the following problems:

There are potentially other solutions to this, such as migrating to a managed Postgres service or storing our backups on Digital Ocean Spaces object storage (i.e their S3 equivalent). So this spike should demonstrate that Block Storage is the better solution (or not).

Questions to answer

  1. How much will Block Storage cost?
  2. How is Block Storage used with a Droplet?
  3. Where will we store a copy of the Block Storage configuration?
  4. How will we migrate over to using Block Storage?
  5. How will we test that the migration has been successful and demonstrate there are no performance problems?
@bloodearnest
Member

while also providing us with faster database access

I would be very surprised if this was true. Block storage is essentially a networked file system, albeit implemented in a very different way from NFS. Unless something has drastically changed in cloud land here, I would expect it to be measurably slower. Possibly at an acceptable level, especially for reads, and if we tune sqlite things like mmap and cache sizes to keep stuff in the fs page cache.
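
For illustration, the sort of tuning meant here might look something like the following (a sketch only; the mount path and sizes are placeholders, not recommendations):

```python
# Sketch: SQLite PRAGMA tuning that might help keep reads in the page cache
# when the database file lives on a Block Storage mount (path is hypothetical).
import sqlite3

conn = sqlite3.connect("/mnt/block-volume/db.sqlite3")
conn.execute("PRAGMA journal_mode = WAL")     # WAL tends to behave better for concurrent reads
conn.execute("PRAGMA mmap_size = 268435456")  # memory-map up to 256MiB of the database file
conn.execute("PRAGMA cache_size = -64000")    # negative value is in KiB, so ~64MiB of page cache
```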

So I would encourage performance testing to be done early (perhaps enabling sqlite otel telemetry, if not already enabled?) as it may provide a good early failure signal. For example, there were recent discussions about how N+1 queries are not nearly as much of a problem with sqlite as they are with postgres - that might be less true with the volume mounted on Block Storage.
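
A minimal way to get that early signal could be to time a representative query against copies of the database on local SSD and on a Block Storage mount (the paths, query and run count below are placeholders, not existing tooling):

```python
# Rough timing harness for comparing per-query latency on local SSD vs a
# Block Storage mount. Paths and the query are illustrative only.
import sqlite3
import statistics
import time

def median_query_time(path, query, runs=100):
    conn = sqlite3.connect(path)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(query).fetchall()
        timings.append(time.perf_counter() - start)
    conn.close()
    return statistics.median(timings)

for label, path in [("local SSD", "/var/lib/app/db.sqlite3"),
                    ("block storage", "/mnt/block-volume/db.sqlite3")]:
    print(label, median_query_time(path, "SELECT * FROM job WHERE workspace_id = 1"))
```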

FWIW, the model I've seen used with sqlite in the cloud is to have the sqlite db on fast local storage, and use backups or litestream off host for durability.
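
As a concrete example of the "backups off host" half of that model, Python's built-in online-backup API can copy a live SQLite database onto a mounted volume (the destination path here is an assumption):

```python
# Sketch: copy the live database to a Block Storage mount using sqlite3's
# online backup API; source and destination paths are placeholders.
import sqlite3

src = sqlite3.connect("/var/lib/app/db.sqlite3")
dst = sqlite3.connect("/mnt/block-volume/backups/db-nightly.sqlite3")
with dst:
    src.backup(dst)   # safe to run while the source database is in use
src.close()
dst.close()
```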

Some other ideas:

  • Have we considered mounting a BS volume for "local" backups of all dbs?
  • The main db is r/w, and would be affected most by being on BS. But perhaps the coding system dbs, which AIUI are readonly once created, and are less frequently used, would work well on a BS volume, with tuning?

Regarding system configuration, we may be able to create and manage BS volumes with ansible. However, we ended up not using ansible to manage the DO assets (which is basically the creation of instances); instead, because it was infrequent and complex, we documented the use of the doctl cli client. That might be another option for you here.
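
For reference, the doctl steps would be roughly the following (wrapped in Python only so it could sit alongside our other scripts; the volume name, region, size and IDs are all placeholders):

```python
# Sketch: create a Block Storage volume and attach it to a droplet via the
# doctl CLI. All names, sizes and IDs below are placeholders.
import subprocess

def doctl(*args):
    subprocess.run(["doctl", "compute", *args], check=True)

doctl("volume", "create", "dokku3-db-backups",
      "--region", "lon1", "--size", "100GiB", "--fs-type", "ext4")
doctl("volume-action", "attach", "<volume-id>", "<droplet-id>")
```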

@mikerkelly
Contributor

mikerkelly commented Dec 15, 2024

Much of the following might be questions for the spike to answer.

The issue reads to me as if we currently expect that Block Storage is a better fit than Object storage. Is that right and why?

Needing to restore from a droplet backup would be costly for the core DB: we could lose up to a week of user-generated content. Having nightly core backups on a separate file system would reduce such potential data loss. I think the team agreed we do want to achieve that.

I'm less certain about the value of backing up the coding system DBs. If we lost dokku3 entirely, would our approach to standing it back up be restoring from a droplet backup? Would that be faster than starting deployment from scratch? Each version DB is read-only. New versions are added less than monthly. Only new DBs would be lost, and those would be easy to reproduce, as the work and source would be recent.

Or are the envisaged coding system database backups mainly to guard against accidental deletion or data corruption on dokku3? In that case it might be more convenient to restore from a networked file system than restoring a droplet. But we could also consider local backups if dokku3 had more space. Might we ever want to upgrade to the next sized droplet anyway for other reasons (e.g., compute/network speed in general?), in which case the space concern with a local backup would be less relevant?

I haven't found which droplet plan we're currently on, but if dokku3 is a basic droplet, going from a 160GiB SSD to 320GiB would double monthly costs from ~40GBP to ~80GBP, I think. For Block Storage, 100GiB is ~8GBP, around 80% cheaper. Object storage is ~4GBP for 250GiB, the most economical for storage alone.

improved backups

Is the anticipated improvement being on a separate file system, hence safer (in some sense)? Are there other benefits?

The costings above seem to show that either block or object storage is quite cheap so space becomes less of a concern. Could we consider taking core DB backups more frequently than nightly if sufficiently cheap and fast to improve recovery point frequency (these can then be condensed later)? Or does it need to be outside of "business hours" if it impacts performance/availability?

Does the type of file system allow more de-duplication as discussed in this blog post mentioned in #2151? Would this "kind of incremental" backup be much different in terms of speed? Do we know how much space it would save? (Maybe that's not pressing if we have more cheap space).

(#2128 was closed)

I'd still like us to try standing up a restored droplet from backup and think about how we would restore production from that. At some point, though - maybe now is not the time. The resulting documentation could help rapid response during an incident. Even if that backup is the "backup backup", it would be good to be sure we can use it if we need it.

@lucyb
Contributor Author

lucyb commented Dec 16, 2024

Thanks both of you for the excellent points.

I suggest that we restrict this spike to look at moving over just the backups of the core application database for now, with a view to moving over the backups of the coding system database (when we have them) and potentially the "live" databases in future. The core app database is backed up using a cron job that we could plug into Sentry to monitor any performance changes, which should answer some of the concerns.
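
For example, the monitoring could be wired up along these lines (a sketch assuming sentry-sdk's cron monitoring; the DSN, monitor slug and backup function are placeholders):

```python
# Sketch: wrap the existing cron-driven backup so Sentry records each run and
# its duration, giving us a signal if the migration slows backups down.
import sentry_sdk
from sentry_sdk.crons import monitor

sentry_sdk.init(dsn="https://<key>@<org>.ingest.sentry.io/<project>")  # placeholder DSN

@monitor(monitor_slug="core-db-backup")
def run_backup():
    ...  # the existing backup steps, e.g. an sqlite3 backup to the new volume

if __name__ == "__main__":
    run_backup()
```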

@mikerkelly
Contributor

mikerkelly commented Dec 17, 2024

Thanks, Lucy, that makes sense. It'll be easier to do the spike if we limit the scope, and easier to take coding system backups to networked storage later if that storage is already available. When deciding which networked storage to use, we should bear in mind that it may be extended to coding system backups.

potentially the "live" databases in future

Another option discussed elsewhere was moving to a managed server-based database like Digital Ocean Postgres. Either approach would avoid losing live data if the droplet were lost, which seems more likely than cloud storage going down. It's unclear whether our concurrency needs are high enough to justify switching from SQLite.

@mikerkelly mikerkelly changed the title SPIKE Investigate the use of Digital Ocean Block Storage for the databases SPIKE Investigate the use of Digital Ocean Block/Object Storage for the databases Dec 31, 2024