Support for "append" mode for Azure Blobs #836

Open
geovalexis opened this issue Sep 15, 2024 · 12 comments

@geovalexis

Hi all!

I use smart-open for one of my projects and I've recently run into the need for "append" mode for Azure blobs. This is something Azure's SDK supports natively but it looks like it hasn't been implemented in smart-open yet.

I was thinking of adding support for this feature myself, but I was wondering if there are any additional concerns/inconveniences I might be missing.

P.S.: Thanks for such a simple yet useful tool!

Cheers.

@ddelange
Contributor

ddelange commented Sep 15, 2024

do you mean creating a new AppendBlob object on azure blob storage, or appending to an existing AppendBlob?

@geovalexis
Author

I mean appending to an existing AppendBlob.

@ddelange
Contributor

I guess if 'a' in mode, we could make the blind assumption that we're talking about an AppendBlob, whether or not it already exists on the remote.

@geovalexis
Author

what would happen if it's not an AppendBlob?

@ddelange
Contributor

raise a ValueError immediately on the open() call?
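A minimal sketch of that check, decoupled from the SDK: the helper name is hypothetical, and in practice the blob_type value would come from the blob properties reported by azure-storage-blob (e.g. BlobClient.get_blob_properties()).

```python
from typing import Optional

def check_append_mode(mode: str, blob_type: Optional[str]) -> None:
    """Reject 'a' modes when the existing remote blob is not an AppendBlob.

    blob_type is the type reported by the service (e.g. "BlockBlob",
    "AppendBlob"), or None when the blob does not exist yet.
    """
    if "a" not in mode:
        return
    if blob_type is None:
        return  # no remote blob yet: it could be created as an AppendBlob
    if blob_type != "AppendBlob":
        raise ValueError(
            f"append mode requires an AppendBlob, got a {blob_type}"
        )

check_append_mode("ab", None)          # fine: a new AppendBlob would be created
check_append_mode("ab", "AppendBlob")  # fine: append to the existing blob
```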

@ddelange
Contributor

ddelange commented Sep 15, 2024

some concerns:

  • append_block would have to go into _upload_part, which would probably warrant a new AppendWriter (sub)class in line with this mechanic.
  • the terminate() method needs to be amended so that the whole append operation gets aborted upon a terminate() call (i.e. when the with-statement is aborted by an exception). That means append_block might not be usable, as it probably already commits upon each call.
  • there definitely needs to be a full test suite for supported compression mechanisms (e.g. gzip supports append, others might not)
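On the compression point: plain gzip does tolerate this kind of append, because a gzip stream may consist of multiple concatenated members. A quick standard-library-only illustration of that property:

```python
import gzip
import io

# Two independently-compressed gzip members, concatenated byte-for-byte,
# still decode as one logical stream -- this is what makes "append"
# plausible for gzip-compressed blobs.
buf = io.BytesIO()
buf.write(gzip.compress(b"first write\n"))
buf.write(gzip.compress(b"appended write\n"))  # simulates a later append
buf.seek(0)
data = gzip.GzipFile(fileobj=buf).read()
print(data)  # b'first write\nappended write\n'
```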

@ddelange
Contributor

ddelange commented Sep 16, 2024

I think bullet number 2 is a hard blocker. There's no way to revert an append block operation.

> Append Block uploads a block to the end of an existing append blob. The block of data is immediately available after the call succeeds on the server. A maximum of 50,000 appends are permitted for each append blob. Each block can be of different size.

ref https://learn.microsoft.com/en-us/rest/api/storageservices/append-block?tabs=microsoft-entra-id#remarks

The only workaround I can think of is to only start uploading in the close() call (i.e. a successful __exit__) using append_blob_from_stream. But I guess that's an anti-pattern, especially regarding memory usage for big (multi-part) streams and usage with generators and such.
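The buffer-everything-until-close idea can be sketched without any Azure dependency; BufferedAppendSketch and its upload callable are hypothetical names standing in for a real writer backed by something like append_blob_from_stream:

```python
import io

class BufferedAppendSketch(io.RawIOBase):
    """Illustrative only: buffer all writes in memory and hand the full
    payload to a single `upload` callable on close(), so an aborted
    with-block uploads nothing. This trades abortability for memory use,
    which is the anti-pattern concern raised above."""

    def __init__(self, upload):
        self._upload = upload
        self._buffer = io.BytesIO()
        self._aborted = False

    def write(self, data):
        return self._buffer.write(data)

    def terminate(self):
        # abort: drop the buffer and never call upload
        self._aborted = True

    def close(self):
        if not self.closed and not self._aborted:
            self._upload(self._buffer.getvalue())
        super().close()

uploads = []
w = BufferedAppendSketch(uploads.append)
w.write(b"hello ")
w.write(b"world")
w.close()
print(uploads)  # [b'hello world']
```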

@geovalexis
Author

@ddelange thanks for sharing your thoughts on this.

Is this "aborting" capability really required? I mean, I think we need to stick to what the API allows us to do. If we can't revert an append block, so be it.

Also, going over the code, I have noticed there is no append-mode support for the other cloud providers. Is that because they don't support it, or because of blockers similar to this one?

@ddelange
Contributor

> @ddelange thanks for sharing your thoughts on this.
>
> Is this "aborting" capability really required? I mean, I think we need to stick to what the API allows us to do. If we can't revert an append block, so be it.

Setting the chunksize default to 100MB for the new AppendWriter would allow aborting at least appends smaller than 100MB (I'd guess the bulk of use cases for this feature), but in any case it's a big caveat that would be introduced with the feature.

> Also, going over the code, I have noticed there is no append-mode support for the other cloud providers. Is that because they don't support it, or because of blockers similar to this one?

I'm not a maintainer (just an active contributor) but AFAIK it's because they only implement immutable objects.

@geovalexis
Author

Sounds good @ddelange! I'll try to put something together and see if the maintainers like it.

@ddelange
Contributor

Awesome :) The 100MB is a hard chunk-size limit on the Azure side, btw; we'd have to ensure that the bytes going into append_block never exceed that size. There's also a maximum number of blocks that can be appended to an AppendBlob (50k, IIRC).

@ddelange
Copy link
Contributor

correction:

> Each block in an append blob can be a different size, up to a maximum of 4 MB, and an append blob can include up to 50,000 blocks. The maximum size of an append blob is therefore slightly more than 195 GB (4 MB X 50,000 blocks).

ref https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.appendblobservice.appendblobservice?view=azure-python-previous

smart_open's azure.py links to this table; maybe the low defaults we have now are a remnant from before 2019?
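Under those corrected limits, a writer would have to slice its payload into blocks of at most 4 MB and refuse to exceed the 50,000-block budget. A rough standalone sketch (the constants come from the quoted docs; the function name is hypothetical):

```python
# Azure-documented limits for append blobs.
MAX_BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB per Append Block call
MAX_BLOCK_COUNT = 50_000          # blocks per append blob

def iter_append_blocks(data: bytes, existing_blocks: int = 0):
    """Yield <=4MB slices of `data`, refusing to exceed the 50k-block budget.

    existing_blocks is how many blocks the remote append blob already holds
    (the service reports this as the blob's committed block count).
    """
    n_new = -(-len(data) // MAX_BLOCK_SIZE)  # ceiling division
    if existing_blocks + n_new > MAX_BLOCK_COUNT:
        raise ValueError("append would exceed the 50,000-block limit")
    for offset in range(0, len(data), MAX_BLOCK_SIZE):
        yield data[offset:offset + MAX_BLOCK_SIZE]

blocks = list(iter_append_blocks(b"x" * (9 * 1024 * 1024)))
print([len(b) for b in blocks])  # [4194304, 4194304, 1048576]
```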
