
S3 ContentEncoding is disregarded #743

Open

goranvinterhalter opened this issue Nov 30, 2022 · 5 comments

Comments

@goranvinterhalter

goranvinterhalter commented Nov 30, 2022

Problem description

I believe this is the same issue as #422, but for S3.

Certain libraries, like django_s3_storage, use ContentEncoding (https://github.com/etianen/django-s3-storage/blob/master/django_s3_storage/storage.py#L330) to express on-the-fly compression/decompression.

smart_open does not support this, so I have to manually check for the presence of ContentEncoding when reading such files. The S3 documentation specifies:

ContentEncoding (string) -- Specifies what content encodings have been applied to the object and thus what decoding mechanisms must be applied to obtain the media-type referenced by the Content-Type header field.

Is this something that can/will be implemented at some point?
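For now my workaround looks roughly like this (a minimal sketch, assuming boto3 credentials are configured and <bucket>/a.txt stand in for a real bucket and key):

import gzip

import boto3
import smart_open

client = boto3.client("s3")

# Ask S3 what encoding was recorded for the object, without downloading it.
head = client.head_object(Bucket="<bucket>", Key="a.txt")

# Read the raw bytes; smart_open ignores ContentEncoding entirely.
with smart_open.open("s3://<bucket>/a.txt", "rb", transport_params={"client": client}) as f:
    data = f.read()

# Decompress by hand when the header says the object is gzipped.
if head.get("ContentEncoding") == "gzip":
    data = gzip.decompress(data)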

Steps/code to reproduce the problem

It's hard to give precise steps, but simply put: uploading a gzipped file with a .txt extension and a ContentEncoding of "gzip" should be decompressed automatically on read, but it is not.

Versions

Linux-4.14.296-222.539.amzn2.x86_64-x86_64-with-glibc2.2.5
Python 3.7.10 (default, Jun  3 2021, 00:02:01)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-13)]
smart_open 6.2.0
@mpenkov
Collaborator

mpenkov commented Nov 30, 2022

Thank you for the report.

The following two statements seem inconsistent to me:

  1. It's hard to give precise steps
  2. Simply put: uploading a gzipped file with a .txt extension and a ContentEncoding of "gzip" should be decompressed automatically on read, but it is not

Why is it difficult to show the precise source code for 2)?

@goranvinterhalter
Author

goranvinterhalter commented Nov 30, 2022

@mpenkov I don't know if these instructions are correct or incorrect. For example, is the uncompressed_size metadata entry required (as used in https://github.com/etianen/django-s3-storage/blob/master/django_s3_storage/storage.py#L331)?
I observed that Chrome automatically decompresses the file when a pre-signed URL is used, but I'm having problems replicating this in the steps below. For now, this is what I have.

Create the file and upload it (I'm referring to the bucket as <bucket>):

echo "hello world" | gzip -c > a.txt
aws s3 cp a.txt s3://<bucket>/a.txt --content-encoding gzip

Check ContentEncoding is set:

In [36]: import boto3

In [37]: client = boto3.client("s3")

In [38]: obj = client.get_object(Bucket="<bucket>", Key="a.txt")

In [39]: obj["ContentEncoding"]
Out[39]: 'gzip'

Reading with smart_open:

In [1]: import smart_open

In [2]: smart_open.open("s3://<bucket>/a.txt", "rb").read()
Out[2]: b'\x1f\x8b\x08\x00\xa2k\x87c\x00\x03\xcbH\xcd\xc9\xc9W(\xcf/\xcaI\xe1\x02\x00-;\x08\xaf\x0c\x00\x00\x00'

In [3]: smart_open.open("s3://<bucket>/a.txt").read()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In [3], line 1
----> 1 smart_open.open("s3://<bucket>/a.txt").read()

File /opt/python/3.9.14/lib/python3.9/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
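For what it's worth, forcing decompression via the compression keyword does work here (assuming a recent smart_open, where open accepts an explicit ".gz" hint):

In [4]: smart_open.open("s3://<bucket>/a.txt", compression=".gz").read()
Out[4]: 'hello world\n'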

@UrsDeSwardt

Is there a solution to this problem yet? 🤔

@ddelange
Contributor

ddelange commented Apr 18, 2024

I think your best bet is to do a head_object, and then use the compression keyword argument to smart_open.open.
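Roughly like this (a sketch, reusing the <bucket>/a.txt object from earlier in the thread):

import boto3
import smart_open

client = boto3.client("s3")

# head_object exposes ContentEncoding without downloading the body.
head = client.head_object(Bucket="<bucket>", Key="a.txt")

# Translate the S3 header into a compression hint smart_open understands.
compression = ".gz" if head.get("ContentEncoding") == "gzip" else "disable"

with smart_open.open("s3://<bucket>/a.txt", "r", compression=compression, transport_params={"client": client}) as f:
    print(f.read())  # hello world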

@UrsDeSwardt

UrsDeSwardt commented Apr 19, 2024

My problem was a little different: I was uploading a gzip file to S3, but ContentEncoding was not set to "gzip", resulting in corrupted data when downloading the file again. I fixed it by passing transport_params to open:

transport_params = {
    "client_kwargs": {
        "S3.Client.create_multipart_upload": {"ContentEncoding": "gzip"},
    },
}
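For reference, a sketch of how that plugs into a write (the .gz key makes smart_open compress text on the fly, and the client_kwargs above make S3 store ContentEncoding: gzip on the upload):

import smart_open

with smart_open.open("s3://<bucket>/a.txt.gz", "w", transport_params=transport_params) as f:
    f.write("hello world\n")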
