
SPIKE Investigate the use of Digital Ocean Block/Object Storage for the databases #2225

Open
lucyb opened this issue Dec 10, 2024 · 4 comments

Comments

@lucyb
Contributor

lucyb commented Dec 10, 2024

We think that using Digital Ocean Block Storage will be a cost-effective way of increasing storage capacity, while also providing us with faster database access and improved backups.

In particular, we've identified the following problems:

There are potentially other solutions to this, such as migrating to a managed Postgres service or storing our backups on Digital Ocean Spaces object storage (i.e their S3 equivalent). So this spike should demonstrate that Block Storage is the better solution (or not).

Questions to answer

  1. How much will Block Storage cost?
  2. How is Block Storage used with a Droplet?
  3. Where will we store a copy of the Block Storage configuration?
  4. How will we migrate over to using Block Storage?
  5. How will we test that the migration has been successful and demonstrate there are no performance problems?
@bloodearnest
Member

while also providing us with faster database access

I would be very surprised if this was true. Block storage is essentially a networked file system, albeit implemented in a very different way from NFS. Unless something has drastically changed in cloud land here, I would expect it to be measurably slower. Possibly at an acceptable level, especially for reads, and if we tune sqlite things like mmap and cache sizes to keep stuff in the fs page cache.
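
For illustration, the sort of tuning meant here might look something like the following (a sketch only; the mount path and sizes are placeholders, not recommendations):

```python
# Sketch: SQLite PRAGMA tuning that might help keep reads in the page cache
# when the database file lives on a Block Storage mount (path is hypothetical).
import sqlite3

conn = sqlite3.connect("/mnt/block-volume/db.sqlite3")
conn.execute("PRAGMA journal_mode = WAL")     # WAL tends to behave better for concurrent reads
conn.execute("PRAGMA mmap_size = 268435456")  # memory-map up to 256MiB of the database file
conn.execute("PRAGMA cache_size = -64000")    # negative value is in KiB, so ~64MiB of page cache
```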

So I would encourage performance testing to be done early (perhaps enabling sqlite otel telemetry, if not already enabled?) as it may provide a good early failure signal. For example, there were recent discussions about how N+1 queries are not nearly as much of a problem with sqlite as they are with postgres - that might be less true with the volume mounted on Block Storage.
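
A minimal way to get that early signal could be to time a representative query against copies of the database on local SSD and on a Block Storage mount (the paths, query and run count below are placeholders, not existing tooling):

```python
# Rough timing harness for comparing per-query latency on local SSD vs a
# Block Storage mount. Paths and the query are illustrative only.
import sqlite3
import statistics
import time

def median_query_time(path, query, runs=100):
    conn = sqlite3.connect(path)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(query).fetchall()
        timings.append(time.perf_counter() - start)
    conn.close()
    return statistics.median(timings)

for label, path in [("local SSD", "/var/lib/app/db.sqlite3"),
                    ("block storage", "/mnt/block-volume/db.sqlite3")]:
    print(label, median_query_time(path, "SELECT * FROM job WHERE workspace_id = 1"))
```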

FWIW, the model I've seen used with sqlite in the cloud is to have the sqlite db on fast local storage, and use backups or litestream off host for durability.
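
As a concrete example of the "backups off host" half of that model, Python's built-in online-backup API can copy a live SQLite database onto a mounted volume (the destination path here is an assumption):

```python
# Sketch: copy the live database to a Block Storage mount using sqlite3's
# online backup API; source and destination paths are placeholders.
import sqlite3

src = sqlite3.connect("/var/lib/app/db.sqlite3")
dst = sqlite3.connect("/mnt/block-volume/backups/db-nightly.sqlite3")
with dst:
    src.backup(dst)   # safe to run while the source database is in use
src.close()
dst.close()
```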

Some other ideas:

  • Have we considered mounting a BS volume for "local" backups of all dbs?
  • The main db is r/w, and would be affected most by being on BS. But perhaps the coding system dbs, which AIUI are readonly once created, and are less frequently used, would work well on a BS volume, with tuning?

Regarding system configuration, we may be able to create and manage BS volumes with ansible. However, we ended up not using ansible to manage the DO assets (which is basically the creation of instances); instead, because it was infrequent and complex, we documented the use of the doctl cli client. That might be another option for you here.
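
For reference, the doctl steps would be roughly the following (wrapped in Python only so it could sit alongside our other scripts; the volume name, region, size and IDs are all placeholders):

```python
# Sketch: create a Block Storage volume and attach it to a droplet via the
# doctl CLI. All names, sizes and IDs below are placeholders.
import subprocess

def doctl(*args):
    subprocess.run(["doctl", "compute", *args], check=True)

doctl("volume", "create", "dokku3-db-backups",
      "--region", "lon1", "--size", "100GiB", "--fs-type", "ext4")
doctl("volume-action", "attach", "<volume-id>", "<droplet-id>")
```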

@mikerkelly
Contributor

mikerkelly commented Dec 15, 2024

Much of the following might be questions for the spike to answer.

The issue reads to me as if we currently expect that Block Storage is a better fit than Object storage. Is that right and why?

Needing to restore from a droplet backup would be costly for the core DB: we could lose up to a week of user-generated content. Having nightly core backups on a separate file system would reduce such potential data loss. I think the team agreed we do want to achieve that.

I'm less certain about the value of backing up the coding system DBs. If we lost dokku3 entirely, would our approach to standing it back up be restoring from a droplet backup? Would that be faster than starting deployment from scratch? Each version DB is read-only. New versions are added less than monthly. Only new DBs would be lost, and those would be easy to reproduce, as the work and source would be recent.

Or are the envisaged coding system database backups mainly to guard against accidental deletion or data corruption on dokku3? In that case it might be more convenient to restore from a networked file system than restoring a droplet. But we could also consider local backups if dokku3 had more space. Might we ever want to upgrade to the next sized droplet anyway for other reasons (e.g., compute/network speed in general?), in which case the space concern with a local backup would be less relevant?

I haven't found which droplet plan we're currently on, but if dokku3 is a basic droplet, going from a 160GiB SSD to 320GiB would double monthly costs from ~40GBP to ~80GBP, I think. For Block Storage, 100GiB is ~8GBP, around 80% cheaper. Object storage is ~4GBP for 250GiB, the most economical for storage alone.

improved backups

Is the anticipated improvement being on a separate file system, hence safer (in some sense)? Are there other benefits?

The costings above seem to show that either block or object storage is quite cheap so space becomes less of a concern. Could we consider taking core DB backups more frequently than nightly if sufficiently cheap and fast to improve recovery point frequency (these can then be condensed later)? Or does it need to be outside of "business hours" if it impacts performance/availability?

Does the type of file system allow more de-duplication as discussed in this blog post mentioned in #2151? Would this "kind of incremental" backup be much different in terms of speed? Do we know how much space it would save? (Maybe that's not pressing if we have more cheap space).

(#2128 was closed)

I'd still like us to try standing up a restored droplet from backup and think about how we would restore production from that. At some point, though - maybe now is not the time. The resulting documentation could help rapid response during an incident. Even if that backup is the "backup backup", it would be good to be sure we can use it if we need it.

@lucyb
Contributor Author

lucyb commented Dec 16, 2024

Thanks both of you for the excellent points.

I suggest that we restrict this spike to look at moving over just the backups of the core application database for now, with a view to moving over the backups of the coding system database (when we have them) and potentially the "live" databases in future. The core app database is backed up using a cron job that we could plug into Sentry to monitor any performance changes, which should answer some of the concerns.
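
For example, the monitoring could be wired up along these lines (a sketch assuming sentry-sdk's cron monitoring; the DSN, monitor slug and backup function are placeholders):

```python
# Sketch: wrap the existing cron-driven backup so Sentry records each run and
# its duration, giving us a signal if the migration slows backups down.
import sentry_sdk
from sentry_sdk.crons import monitor

sentry_sdk.init(dsn="https://<key>@<org>.ingest.sentry.io/<project>")  # placeholder DSN

@monitor(monitor_slug="core-db-backup")
def run_backup():
    ...  # the existing backup steps, e.g. an sqlite3 backup to the new volume

if __name__ == "__main__":
    run_backup()
```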

@mikerkelly
Contributor

mikerkelly commented Dec 17, 2024

Thanks, Lucy, that makes sense. It'll be easier to do the spike if we limit the scope, and easier to take coding system backups to networked storage later if that storage is already available. When deciding which networked storage to use, we should bear in mind that it may be extended to coding system backups.

potentially the "live" databases in future

Another option discussed elsewhere was moving to a managed server-based database like Digital Ocean Postgres. Either approach would avoid losing live data if the droplet were lost, which seems more likely than cloud storage going down. It's unclear whether our concurrency needs are high enough to justify switching from SQLite.

@mikerkelly mikerkelly changed the title SPIKE Investigate the use of Digital Ocean Block Storage for the databases SPIKE Investigate the use of Digital Ocean Block/Object Storage for the databases Dec 31, 2024