
S3 ContentEncoding is disregarded #743

Open

goranvinterhalter opened this issue Nov 30, 2022 · 5 comments

Comments

@goranvinterhalter

goranvinterhalter commented Nov 30, 2022

Problem description

I believe this is the same issue as #422, but for S3.

Certain libraries, like django_s3_storage, use ContentEncoding (https://github.com/etianen/django-s3-storage/blob/master/django_s3_storage/storage.py#L330) to express on-the-fly compression/decompression.

smart_open does not support this, so I have to manually check for the presence of ContentEncoding when reading such files. The S3 documentation specifies:

ContentEncoding (string) -- Specifies what content encodings have been applied to the object and thus what decoding mechanisms must be applied to obtain the media-type referenced by the Content-Type header field.

Is this something that can/will be implemented at some point?
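For now my workaround looks roughly like this (a minimal sketch, assuming boto3 credentials are configured and <bucket>/a.txt stand in for a real bucket and key):

import gzip

import boto3
import smart_open

client = boto3.client("s3")

# Ask S3 what encoding was recorded for the object, without downloading it.
head = client.head_object(Bucket="<bucket>", Key="a.txt")

# Read the raw bytes; smart_open ignores ContentEncoding entirely.
with smart_open.open("s3://<bucket>/a.txt", "rb", transport_params={"client": client}) as f:
    data = f.read()

# Decompress by hand when the header says the object is gzipped.
if head.get("ContentEncoding") == "gzip":
    data = gzip.decompress(data)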

Steps/code to reproduce the problem

It's hard to give precise steps, but simply put: uploading a gzipped file with a .txt extension and a ContentEncoding of "gzip" should be decompressed automatically on read, but it is not.

Versions

Linux-4.14.296-222.539.amzn2.x86_64-x86_64-with-glibc2.2.5
Python 3.7.10 (default, Jun  3 2021, 00:02:01)
[GCC 7.3.1 20180712 (Red Hat 7.3.1-13)]
smart_open 6.2.0
@mpenkov
Collaborator

mpenkov commented Nov 30, 2022

Thank you for the report.

The following two statements seem inconsistent to me:

  1. It's hard to give precise steps
  2. Simply put: uploading a gzipped file with a .txt extension and a ContentEncoding of "gzip" should be decompressed automatically on read, but it is not

Why is it difficult to show the precise source code for 2)?

@goranvinterhalter
Author

goranvinterhalter commented Nov 30, 2022

@mpenkov I don't know if these instructions are correct or incorrect. For example, is the uncompressed_size metadata entry required (as used in https://github.com/etianen/django-s3-storage/blob/master/django_s3_storage/storage.py#L331)?
I observed that Chrome automatically decompresses the file when a pre-signed URL is used, but I'm having problems replicating this in the steps below. For now, this is what I have.

Create the file and upload it (I'm referring to the bucket as <bucket>):

echo "hello world" | gzip -c > a.txt
aws s3 cp a.txt s3://<bucket>/a.txt --content-encoding gzip

Check ContentEncoding is set:

In [36]: import boto3

In [37]: client = boto3.client("s3")

In [38]: obj = client.get_object(Bucket="<bucket>", Key="a.txt")

In [39]: obj["ContentEncoding"]
Out[39]: 'gzip'

Reading with smart_open:

In [1]: import smart_open

In [2]: smart_open.open("s3://<bucket>/a.txt", "rb").read()
Out[2]: b'\x1f\x8b\x08\x00\xa2k\x87c\x00\x03\xcbH\xcd\xc9\xc9W(\xcf/\xcaI\xe1\x02\x00-;\x08\xaf\x0c\x00\x00\x00'

In [3]: smart_open.open("s3://<bucket>/a.txt").read()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In [3], line 1
----> 1 smart_open.open("s3://<bucket>/a.txt").read()

File /opt/python/3.9.14/lib/python3.9/codecs.py:322, in BufferedIncrementalDecoder.decode(self, input, final)
    319 def decode(self, input, final=False):
    320     # decode input (taking the buffer into account)
    321     data = self.buffer + input
--> 322     (result, consumed) = self._buffer_decode(data, self.errors, final)
    323     # keep undecoded input until the next call
    324     self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
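For what it's worth, forcing decompression via the compression keyword does work here (assuming a recent smart_open, where open accepts an explicit ".gz" hint):

In [4]: smart_open.open("s3://<bucket>/a.txt", compression=".gz").read()
Out[4]: 'hello world\n'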

@UrsDeSwardt

Is there a solution to this problem yet? 🤔

@ddelange
Contributor

ddelange commented Apr 18, 2024

I think your best bet is to do a head_object, and then use the compression keyword argument to smart_open.open.
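Roughly like this (a sketch, reusing the <bucket>/a.txt object from earlier in the thread):

import boto3
import smart_open

client = boto3.client("s3")

# head_object exposes ContentEncoding without downloading the body.
head = client.head_object(Bucket="<bucket>", Key="a.txt")

# Translate the S3 header into a compression hint smart_open understands.
compression = ".gz" if head.get("ContentEncoding") == "gzip" else "disable"

with smart_open.open("s3://<bucket>/a.txt", "r", compression=compression, transport_params={"client": client}) as f:
    print(f.read())  # hello world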

@UrsDeSwardt

UrsDeSwardt commented Apr 19, 2024

My problem was a little different: I was uploading a gzip file to S3, but ContentEncoding was not set to "gzip", resulting in corrupted data when downloading the file again. I fixed it by passing transport_params to open:

transport_params = {
    "client_kwargs": {
        "S3.Client.create_multipart_upload": {"ContentEncoding": "gzip"},
    },
}
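For reference, a sketch of how that plugs into a write (the .gz key makes smart_open compress text on the fly, and the client_kwargs above make S3 store ContentEncoding: gzip on the upload):

import smart_open

with smart_open.open("s3://<bucket>/a.txt.gz", "w", transport_params=transport_params) as f:
    f.write("hello world\n")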
