Support for "append" mode for Azure Blobs #836

Open
geovalexis opened this issue Sep 15, 2024 · 12 comments

@geovalexis

Hi all!

I use smart-open for one of my projects and I've recently run into the need for "append" mode for Azure blobs. This is something Azure's SDK supports natively but it looks like it hasn't been implemented in smart-open yet.

I was thinking of adding support for this feature myself, but I was wondering if there are any additional concerns/inconveniences I might be missing.

P.S.: Thanks for such a simple yet useful tool!

Cheers.

@ddelange
Contributor

ddelange commented Sep 15, 2024

do you mean creating a new AppendBlob object on azure blob storage, or appending to an existing AppendBlob?

@geovalexis
Author

I mean appending to an existing AppendBlob.

@ddelange
Contributor

I guess if 'a' in mode, we could make the blind assumption that we're talking about an AppendBlob, whether or not it already exists on the remote.

@geovalexis
Author

what would happen if it's not an AppendBlob?

@ddelange
Contributor

raise a ValueError immediately on the open() call?
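A minimal sketch of that check, decoupled from the SDK: the helper name is hypothetical, and in practice the blob_type value would come from the blob properties reported by azure-storage-blob (e.g. BlobClient.get_blob_properties()).

```python
from typing import Optional

def check_append_mode(mode: str, blob_type: Optional[str]) -> None:
    """Reject 'a' modes when the existing remote blob is not an AppendBlob.

    blob_type is the type reported by the service (e.g. "BlockBlob",
    "AppendBlob"), or None when the blob does not exist yet.
    """
    if "a" not in mode:
        return
    if blob_type is None:
        return  # no remote blob yet: it could be created as an AppendBlob
    if blob_type != "AppendBlob":
        raise ValueError(
            f"append mode requires an AppendBlob, got a {blob_type}"
        )

check_append_mode("ab", None)          # fine: a new AppendBlob would be created
check_append_mode("ab", "AppendBlob")  # fine: append to the existing blob
```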

@ddelange
Contributor

ddelange commented Sep 15, 2024

some concerns:

  • append_block would have to go into _upload_part, which would probably warrant a new AppendWriter (sub)class in line with this mechanic.
  • the terminate() method needs to be amended so that the whole append operation gets aborted upon a terminate() call (i.e. when the with-statement is aborted by an exception). That means append_block might not be usable, as it probably already commits upon each call.
  • there definitely needs to be a full test suite for supported compression mechanisms (e.g. gzip supports append, others might not)
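On the compression point: plain gzip does tolerate this kind of append, because a gzip stream may consist of multiple concatenated members. A quick standard-library-only illustration of that property:

```python
import gzip
import io

# Two independently-compressed gzip members, concatenated byte-for-byte,
# still decode as one logical stream -- this is what makes "append"
# plausible for gzip-compressed blobs.
buf = io.BytesIO()
buf.write(gzip.compress(b"first write\n"))
buf.write(gzip.compress(b"appended write\n"))  # simulates a later append
buf.seek(0)
data = gzip.GzipFile(fileobj=buf).read()
print(data)  # b'first write\nappended write\n'
```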

@ddelange
Contributor

ddelange commented Sep 16, 2024

I think bullet number 2 is a hard blocker. There's no way to revert an append block operation.

> Append Block uploads a block to the end of an existing append blob. The block of data is immediately available after the call succeeds on the server. A maximum of 50,000 appends are permitted for each append blob. Each block can be of different size.

ref https://learn.microsoft.com/en-us/rest/api/storageservices/append-block?tabs=microsoft-entra-id#remarks

The only workaround I can think of is to only start uploading in the close() call (i.e. a successful __exit__) using append_blob_from_stream. But I guess that's an anti-pattern, especially regarding memory usage for big (multi-part) streams and usage with generators and such.
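The buffer-everything-until-close idea can be sketched without any Azure dependency; BufferedAppendSketch and its upload callable are hypothetical names standing in for a real writer backed by something like append_blob_from_stream:

```python
import io

class BufferedAppendSketch(io.RawIOBase):
    """Illustrative only: buffer all writes in memory and hand the full
    payload to a single `upload` callable on close(), so an aborted
    with-block uploads nothing. This trades abortability for memory use,
    which is the anti-pattern concern raised above."""

    def __init__(self, upload):
        self._upload = upload
        self._buffer = io.BytesIO()
        self._aborted = False

    def write(self, data):
        return self._buffer.write(data)

    def terminate(self):
        # abort: drop the buffer and never call upload
        self._aborted = True

    def close(self):
        if not self.closed and not self._aborted:
            self._upload(self._buffer.getvalue())
        super().close()

uploads = []
w = BufferedAppendSketch(uploads.append)
w.write(b"hello ")
w.write(b"world")
w.close()
print(uploads)  # [b'hello world']
```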

@geovalexis
Author

@ddelange thanks for sharing your thoughts on this.

Is this "aborting" capability really required? I mean, I think we need to stick to what the API allows us to do. If we can't revert an append block, so be it.

Also, going over the code, I have noticed there is no append-mode support for the other cloud providers. Is that because they don't support it, or because of blockers similar to this one?

@ddelange
Contributor

> @ddelange thanks for sharing your thoughts on this.
>
> Is this "aborting" capability really required? I mean, I think we need to stick to what the API allows us to do. If we can't revert an append block, so be it.

Setting the chunksize default to 100MB for the new AppendWriter would allow aborting at least appends smaller than 100MB (I'd guess the bulk of use cases for this feature), but in any case it's a big caveat that would be introduced with the feature.

> Also, going over the code, I have noticed there is no append-mode support for the other cloud providers. Is that because they don't support it, or because of blockers similar to this one?

I'm not a maintainer (just an active contributor) but AFAIK it's because they only implement immutable objects.

@geovalexis
Author

Sounds good @ddelange! I'll try to put something together and see if the maintainers like it.

@ddelange
Contributor

Awesome :) The 100MB is a hard chunk-size limit on the Azure side, btw; we'd have to ensure that the bytes going into append_block never exceed that size. There's also a maximum number of blocks that can be appended to an AppendBlob (50k, IIRC).

@ddelange
Copy link
Contributor

correction:

> Each block in an append blob can be a different size, up to a maximum of 4 MB, and an append blob can include up to 50,000 blocks. The maximum size of an append blob is therefore slightly more than 195 GB (4 MB X 50,000 blocks).

ref https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.appendblobservice.appendblobservice?view=azure-python-previous

smart_open's azure.py links to this table; maybe the low defaults we have now are a remnant from before 2019?
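Under those corrected limits, a writer would have to slice its payload into blocks of at most 4 MB and refuse to exceed the 50,000-block budget. A rough standalone sketch (the constants come from the quoted docs; the function name is hypothetical):

```python
# Azure-documented limits for append blobs.
MAX_BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB per Append Block call
MAX_BLOCK_COUNT = 50_000          # blocks per append blob

def iter_append_blocks(data: bytes, existing_blocks: int = 0):
    """Yield <=4MB slices of `data`, refusing to exceed the 50k-block budget.

    existing_blocks is how many blocks the remote append blob already holds
    (the service reports this as the blob's committed block count).
    """
    n_new = -(-len(data) // MAX_BLOCK_SIZE)  # ceiling division
    if existing_blocks + n_new > MAX_BLOCK_COUNT:
        raise ValueError("append would exceed the 50,000-block limit")
    for offset in range(0, len(data), MAX_BLOCK_SIZE):
        yield data[offset:offset + MAX_BLOCK_SIZE]

blocks = list(iter_append_blocks(b"x" * (9 * 1024 * 1024)))
print([len(b) for b in blocks])  # [4194304, 4194304, 1048576]
```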
