
Store blobs on S3 #4088

Open
bsuttor opened this issue Jan 9, 2025 · 5 comments

bsuttor (Member) commented Jan 9, 2025

PLIP (Plone Improvement Proposal)

Responsible Persons

Proposer: Benoît Suttor

Seconder: Martin Peeters

Abstract

This PLIP proposes adding support for integrating Plone with S3 (Simple Storage Service) for storing content-related files, images, and other binary data. By leveraging the S3 protocol as a backend storage solution, Plone would allow websites to offload storage to a scalable, highly available cloud solution, providing cost savings, redundancy, and improved performance for large deployments.

Motivation

Currently, Plone relies on local disk storage for managing files, which can limit scalability, especially for high-traffic sites or sites with significant file storage needs. Integrating S3 into Plone will offer the following benefits:

  • Scalability: Automatically scales with your data needs without requiring manual intervention.
  • Global Availability: S3’s network of data centers provides fast access to content, irrespective of the user’s geographical location.
  • Simplified Maintenance: By outsourcing storage to S3, you can reduce the load on your web server and simplify infrastructure management.

Moreover, many modern web applications and content management systems already leverage S3 for storage, and providing native support in Plone will make it easier for users to integrate Plone into cloud-centric architectures.

Assumptions

Proposal & Implementation

Technical Details

The integration would be implemented using the boto3 library (Python SDK for AWS), which allows interaction with S3.

This integration could be inspired by collective.s3blobs for downloading and uploading blobs to S3.
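As an illustration, minimal upload/download helpers built on boto3 might look like the sketch below. The key layout, helper names, and lazy import are assumptions for this sketch, not an API defined by this PLIP or by collective.s3blobs:

```python
def blob_key(oid_hex: str, serial_hex: str) -> str:
    """Build a deterministic S3 object key from a blob's ZODB oid and serial.

    The "blobs/<oid>/<serial>.blob" layout is purely illustrative.
    """
    return f"blobs/{oid_hex}/{serial_hex}.blob"


def upload_blob(bucket: str, oid_hex: str, serial_hex: str, path: str) -> None:
    """Upload a local blob file to S3 (requires the optional boto3 dependency)."""
    import boto3  # imported lazily so S3 support can stay an optional extra

    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, blob_key(oid_hex, serial_hex))


def download_blob(bucket: str, oid_hex: str, serial_hex: str, path: str) -> None:
    """Download a blob from S3 back to a local file."""
    import boto3

    s3 = boto3.client("s3")
    s3.download_file(bucket, blob_key(oid_hex, serial_hex), path)
```

Importing boto3 inside the functions keeps it out of the hard dependency set, in line with making this feature opt-in.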

The following key features would be implemented:

  • File Storage Management: Plone would be able to upload, retrieve, and manage files in an S3 bucket.
  • Transparent File Access: Files that are uploaded through Plone's interface would be stored in S3, while the file paths and metadata would be stored in the Plone database.
  • Configuration: Users would configure their S3 credentials, bucket name, and other options (such as the region) via the Plone registry or environment variables.
  • Size threshold: Administrators can define the minimum size a blob must reach before it is uploaded to S3 (e.g. 1 MB). Smaller blobs are stored in classical blobstorage/RelStorage.
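The size-threshold routing could be sketched roughly as follows; the PLONE_S3_MIN_BLOB_SIZE environment variable name is a hypothetical placeholder, not a setting agreed upon in this PLIP:

```python
import os


def storage_backend_for(blob_size: int) -> str:
    """Route a blob to 's3' or 'local' storage based on a configurable threshold.

    PLONE_S3_MIN_BLOB_SIZE is a hypothetical environment variable; the 1 MB
    default matches the example given in this PLIP.
    """
    threshold = int(os.environ.get("PLONE_S3_MIN_BLOB_SIZE", 1024 * 1024))
    return "s3" if blob_size >= threshold else "local"
```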

I initially thought RelStorage would be a good place to implement this, because my goal is to deploy Plone with data separated from the application: the Data.fs could be stored in Postgres (for example) and the blobs in S3.

But after discussing it with Maurits, it may be better to add an adapter on ZODB blobs or on plone.namedfile instead.

This feature would be opt-in and would not break existing Plone setups. Plone installations without this integration would continue to function normally, using blobstorage as before. Admins would need to enable and configure the integration explicitly.

Deliverables

  • Documentation explaining how to use S3 as blob storage
  • To be defined

Risks

Potential Issues

  • Cost Control: While S3 offers cost savings, it is important to ensure that users understand the pricing model, as frequent file access or large volumes of data could incur significant charges.
  • Dependency: Users must be aware that they are relying on AWS or MinIO for file storage, which introduces an external dependency and potential single point of failure.
  • Security: Proper handling of credentials and access permissions is crucial to prevent unauthorized access to files.
  • Performance: for small files, we need to test whether the connection to S3 is fast enough to be effective in production.
  • Fallback to Local Storage: in case S3 is unavailable, a fallback mechanism to store files locally could be added.
  • Caching: good caching is always difficult to get right, but small blobs in particular should be cached.

Participants

To be defined, but I am interested

ale-rt (Member) commented Jan 9, 2025

Might be relevant https://www.youtube.com/watch?v=kYBBysLk80A, CC @davisagli

gforcada (Member) commented Jan 9, 2025

We are very interested in this PLIP!

stevepiercy (Contributor) commented

Please add documentation to the Deliverables section.

I would also be very interested in the significantly lower cost B2 storage service from Backblaze. Perhaps this PLIP could design an interface that allows a choice of cloud storage providers, instead of being designed for only one. If that's possible, then one cloud storage service could be a fallback to another.

mpeeters (Member) commented

@stevepiercy B2 is compatible with the S3 API. The idea is to be compatible with the S3 API, which many providers support.

davisagli (Member) commented

@bsuttor @mpeeters Thanks for starting this PLIP. I had also been thinking about it a bit over the holidays. I'll add my notes below in case you want to add some of my ideas to the PLIP, but I think you've already covered a lot of what I had in mind.

Motivation:

  • In a large site, blobs can take up a sizable portion of the database
  • This has a noticeable impact on the performance of whole-db operations like backup/restore, site-to-site copies and packing.
  • S3-compatible storage is widely available and is an obvious alternative to try to support.
  • Storing blobs this way can reduce the frequency of managing changes in disk capacity for the main database.
  • (On the other hand, it comes with the cost of some added complexity of managing interactions with an additional system.)
  • Serving files often has different characteristics and best practices compared to other requests (e.g. caching strategy)

Design goals:

  • Optionally store blobs (files, images) in S3-compatible storage including Amazon S3, other cloud service providers that offer an S3-compatible API, and (self-hosted) MinIO.
  • Configuration by environment variables for access key, secret key, and bucket
  • Ideally make it as compatible as possible with existing code that works with binary data. (So, implement it within the ZODB or within plone.namedfile rather than requiring other code to use something new.)
  • Don't add a hard requirement on any large libraries like boto3 (but an optional extra is okay).
  • Include support in the official plone-backend Docker image.
  • Provide some way to inspect and see what is using space.
  • Provide some way to garbage collect unused blobs (packing)
  • Provide some sort of local cache for frequently accessed blobs
  • Maybe support alternative download schemes (e.g. linking to a CDN or to S3 instead of to /@@download or /@@images). But then how do we handle auth? This might be out of scope.
  • Maybe handle small and large files differently
  • Maybe support custom logic for choosing a bucket dynamically?
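The environment-variable configuration goal above could be captured in a small settings object; this is a hedged sketch, and the PLONE_S3_* variable names are assumptions for illustration:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class S3Settings:
    """Hypothetical settings object; the PLONE_S3_* names are not agreed upon."""

    access_key: str
    secret_key: str
    bucket: str
    endpoint_url: Optional[str] = None  # e.g. for MinIO or other S3-compatible services

    @classmethod
    def from_environ(cls, env: Mapping[str, str] = os.environ) -> "S3Settings":
        return cls(
            access_key=env["PLONE_S3_ACCESS_KEY"],
            secret_key=env["PLONE_S3_SECRET_KEY"],
            bucket=env["PLONE_S3_BUCKET"],
            endpoint_url=env.get("PLONE_S3_ENDPOINT_URL"),
        )
```

An optional endpoint URL is what lets the same configuration point at Amazon S3, another S3-compatible cloud provider, or a self-hosted MinIO.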

I think there are 2 pretty different directions we could go for the implementation:

  1. Try to support it at the ZODB storage level, probably with a wrapper that can be used with various underlying storages. collective.s3blobs uses this approach. I'm worried it might not perform well enough for writes (it's not great to have a transaction pending while we send a lot of data over the internet) and it will make the data in S3 pretty opaque and hard to see anything useful about where it came from.
  2. Add some storage abstraction to plone.namedfile so that it can write to either ZODB blobs or other storage backends. I'm currently leaning toward this way because it means a lot more context would be available, so we could do things like writing to different buckets based on the current path, or tagging files in S3 with what content they are part of.
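To make direction 2 more concrete, a pluggable backend abstraction inside plone.namedfile might look like this sketch. All class and method names are illustrative; the in-memory backend merely stands in for the default ZODB blob backend, and an S3 backend would implement the same interface:

```python
from abc import ABC, abstractmethod
from typing import Dict


class BlobBackend(ABC):
    """Hypothetical storage abstraction for plone.namedfile (direction 2)."""

    @abstractmethod
    def store(self, key: str, data: bytes) -> None:
        """Persist binary data under a key (e.g. derived from the content path)."""

    @abstractmethod
    def load(self, key: str) -> bytes:
        """Read binary data back by key."""


class InMemoryBackend(BlobBackend):
    """Stand-in for the default ZODB blob backend used when S3 is not enabled."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def store(self, key: str, data: bytes) -> None:
        self._data[key] = data

    def load(self, key: str) -> bytes:
        return self._data[key]
```

Because plone.namedfile holds the content context when it picks a backend, a key could encode the object's path, which is what would enable things like per-path bucket selection or tagging files in S3 with the content they belong to.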

Prior art to investigate:
