
Store blobs on S3 #4088

Open
bsuttor opened this issue Jan 9, 2025 · 5 comments

bsuttor (Member) commented Jan 9, 2025

PLIP (Plone Improvement Proposal)

Responsible Persons

Proposer: Benoît Suttor

Seconder: Martin Peeters

Abstract

This PLIP proposes adding support for integrating Plone with S3 (Simple Storage Service) for storing content-related files, images, and other binary data. By leveraging the S3 protocol as a backend storage solution, Plone would allow websites to offload storage to a scalable, highly available cloud solution, providing cost savings, redundancy, and improved performance for large deployments.

Motivation

Currently, Plone relies on local disk storage for managing files, which can limit scalability, especially for high-traffic sites or sites with significant file storage needs. Integrating S3 into Plone will offer the following benefits:

  • Scalability: Automatically scales with your data needs without requiring manual intervention.
  • Global Availability: S3’s network of data centers provides fast access to content, irrespective of the user’s geographical location.
  • Simplified Maintenance: By outsourcing storage to S3, you can reduce the load on your web server and simplify infrastructure management.

Moreover, many modern web applications and content management systems already leverage S3 for storage, and providing native support in Plone will make it easier for users to integrate Plone into cloud-centric architectures.

Assumptions

Proposal & Implementation

Technical Details

The integration would be implemented using the boto3 library (Python SDK for AWS), which allows interaction with S3.

This integration could be inspired by collective.s3blobs for downloading and uploading blobs to S3.
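As an illustration, minimal upload/download helpers built on boto3 might look like the sketch below. The key layout, helper names, and lazy import are assumptions for this sketch, not an API defined by this PLIP or by collective.s3blobs:

```python
def blob_key(oid_hex: str, serial_hex: str) -> str:
    """Build a deterministic S3 object key from a blob's ZODB oid and serial.

    The "blobs/<oid>/<serial>.blob" layout is purely illustrative.
    """
    return f"blobs/{oid_hex}/{serial_hex}.blob"


def upload_blob(bucket: str, oid_hex: str, serial_hex: str, path: str) -> None:
    """Upload a local blob file to S3 (requires the optional boto3 dependency)."""
    import boto3  # imported lazily so S3 support can stay an optional extra

    s3 = boto3.client("s3")
    s3.upload_file(path, bucket, blob_key(oid_hex, serial_hex))


def download_blob(bucket: str, oid_hex: str, serial_hex: str, path: str) -> None:
    """Download a blob from S3 back to a local file."""
    import boto3

    s3 = boto3.client("s3")
    s3.download_file(bucket, blob_key(oid_hex, serial_hex), path)
```

Importing boto3 inside the functions keeps it out of the hard dependency set, in line with making this feature opt-in.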

The following key features would be implemented:

  • File Storage Management: Plone would be able to upload, retrieve, and manage files in an S3 bucket.
  • Transparent File Access: Files that are uploaded through Plone's interface would be stored in S3, while the file paths and metadata would be stored in the Plone database.
  • Configuration: Users would configure their S3 credentials, bucket name, and other options (such as the region) via the Plone registry or environment variables.
  • Size threshold: Administrators can define the minimum size a blob must reach before it is uploaded to S3 (e.g. 1 MB). Smaller blobs are stored in classical blobstorage/RelStorage.
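The size-threshold routing could be sketched roughly as follows; the PLONE_S3_MIN_BLOB_SIZE environment variable name is a hypothetical placeholder, not a setting agreed upon in this PLIP:

```python
import os


def storage_backend_for(blob_size: int) -> str:
    """Route a blob to 's3' or 'local' storage based on a configurable threshold.

    PLONE_S3_MIN_BLOB_SIZE is a hypothetical environment variable; the 1 MB
    default matches the example given in this PLIP.
    """
    threshold = int(os.environ.get("PLONE_S3_MIN_BLOB_SIZE", 1024 * 1024))
    return "s3" if blob_size >= threshold else "local"
```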

I initially thought RelStorage would be a good place to implement this, because my goal is to deploy Plone with data separated from the application: the Data.fs could be stored in Postgres (for example) and the blobs in S3.

But after discussing it with Maurits, it may be better to add an adapter on ZODB blobs or on plone.namedfile instead.

This feature would be opt-in and would not break existing Plone setups. Plone installations without this integration would continue to function normally, using blobstorage as before. Admins would need to enable and configure the integration explicitly.

Deliverables

  • Documentation explaining how to use S3 as blob storage
  • To be defined

Risks

Potential Issues

  • Cost Control: While S3 offers cost savings, it is important to ensure that users understand the pricing model, as frequent file access or large volumes of data could incur significant charges.
  • Dependency: Users must be aware that they are relying on AWS or MinIO for file storage, which introduces an external dependency and potential single point of failure.
  • Security: Proper handling of credentials and access permissions is crucial to prevent unauthorized access to files.
  • Performance: for small files, we need to test whether the connection to S3 is fast enough to be effective in production.
  • Fallback to Local Storage: in case S3 is unavailable, a fallback mechanism to store files locally could be added.
  • Caching: good caching is always difficult to get right, but small blobs in particular should be cached.

Participants

To be defined, but I am interested

ale-rt (Member) commented Jan 9, 2025

Might be relevant https://www.youtube.com/watch?v=kYBBysLk80A, CC @davisagli

gforcada (Member) commented Jan 9, 2025

We are very interested in this PLIP!

stevepiercy (Contributor) commented

Please add documentation to the Deliverables section.

I would also be very interested in the significantly lower cost B2 storage service from Backblaze. Perhaps this PLIP could design an interface that allows a choice of cloud storage providers, instead of being designed for only one. If that's possible, then one cloud storage service could be a fallback to another.

mpeeters (Member) commented

@stevepiercy B2 is compatible with the S3 API. The idea is to be compatible with the S3 API, which many providers support.

davisagli (Member) commented

@bsuttor @mpeeters Thanks for starting this PLIP. I had also been thinking about it a bit over the holidays. I'll add my notes below in case you want to add some of my ideas to the PLIP, but I think you've already covered a lot of what I had in mind.

Motivation:

  • In a large site, blobs can take up a sizable portion of the database
  • This has a noticeable impact on the performance of whole-db operations like backup/restore, site-to-site copies and packing.
  • S3-compatible storage is widely available and is an obvious alternative to try to support.
  • Storing blobs this way can reduce the frequency of managing changes in disk capacity for the main database.
  • (On the other hand, it comes with the cost of some added complexity of managing interactions with an additional system.)
  • Serving files often has different characteristics and best practices compared to other requests (e.g. caching strategy)

Design goals:

  • Optionally store blobs (files, images) in S3-compatible storage including Amazon S3, other cloud service providers that offer an S3-compatible API, and (self-hosted) MinIO.
  • Configuration by environment variables for access key, secret key, and bucket
  • Ideally make it as compatible as possible with existing code that works with binary data. (So, implement it within the ZODB or within plone.namedfile rather than requiring other code to use something new.)
  • Don't add a hard requirement on any large libraries like boto3 (but an optional extra is okay).
  • Include support in the official plone-backend Docker image.
  • Provide some way to inspect and see what is using space.
  • Provide some way to garbage collect unused blobs (packing)
  • Provide some sort of local cache for frequently accessed blobs
  • Maybe support alternative download schemes (e.g. linking to a CDN or to S3 instead of to /@@download or /@@images). But then how do we handle auth? This might be out of scope.
  • Maybe handle small and large files differently
  • Maybe support custom logic for choosing a bucket dynamically?
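The environment-variable configuration goal above could be captured in a small settings object; this is a hedged sketch, and the PLONE_S3_* variable names are assumptions for illustration:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class S3Settings:
    """Hypothetical settings object; the PLONE_S3_* names are not agreed upon."""

    access_key: str
    secret_key: str
    bucket: str
    endpoint_url: Optional[str] = None  # e.g. for MinIO or other S3-compatible services

    @classmethod
    def from_environ(cls, env: Mapping[str, str] = os.environ) -> "S3Settings":
        return cls(
            access_key=env["PLONE_S3_ACCESS_KEY"],
            secret_key=env["PLONE_S3_SECRET_KEY"],
            bucket=env["PLONE_S3_BUCKET"],
            endpoint_url=env.get("PLONE_S3_ENDPOINT_URL"),
        )
```

An optional endpoint URL is what lets the same configuration point at Amazon S3, another S3-compatible cloud provider, or a self-hosted MinIO.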

I think there are 2 pretty different directions we could go for the implementation:

  1. Try to support it at the ZODB storage level, probably with a wrapper that can be used with various underlying storages. collective.s3blobs uses this approach. I'm worried it might not perform well enough for writes (it's not great to have a transaction pending while we send a lot of data over the internet) and it will make the data in S3 pretty opaque and hard to see anything useful about where it came from.
  2. Add some storage abstraction to plone.namedfile so that it can write to either ZODB blobs or other storage backends. I'm currently leaning toward this way because it means a lot more context would be available, so we could do things like writing to different buckets based on the current path, or tagging files in S3 with what content they are part of.
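To make direction 2 more concrete, a pluggable backend abstraction inside plone.namedfile might look like this sketch. All class and method names are illustrative; the in-memory backend merely stands in for the default ZODB blob backend, and an S3 backend would implement the same interface:

```python
from abc import ABC, abstractmethod
from typing import Dict


class BlobBackend(ABC):
    """Hypothetical storage abstraction for plone.namedfile (direction 2)."""

    @abstractmethod
    def store(self, key: str, data: bytes) -> None:
        """Persist binary data under a key (e.g. derived from the content path)."""

    @abstractmethod
    def load(self, key: str) -> bytes:
        """Read binary data back by key."""


class InMemoryBackend(BlobBackend):
    """Stand-in for the default ZODB blob backend used when S3 is not enabled."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def store(self, key: str, data: bytes) -> None:
        self._data[key] = data

    def load(self, key: str) -> bytes:
        return self._data[key]
```

Because plone.namedfile holds the content context when it picks a backend, a key could encode the object's path, which is what would enable things like per-path bucket selection or tagging files in S3 with the content they belong to.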

Prior art to investigate:
