http module - incorrect reading gzip compressed stream #713

Open
3 tasks done
grubberr opened this issue Aug 11, 2022 · 3 comments

Comments

@grubberr

Hello,

import smart_open

url = "https://fonts.googleapis.com/css?family=Montserrat"
headers = {"Accept-encoding": "deflate, gzip"}

result = smart_open.open(url, transport_params={"headers": headers}, mode="rb")
buff = result.read()
print(len(buff))

result = smart_open.open(url, transport_params={"headers": headers}, mode="rb")
buff = result.read(2)
buff += result.read()
print(len(buff))

196
209

196 bytes - the gzip-compressed result
209 bytes - the uncompressed result

This happens because:
in the first case the library uses self.response.raw.read(), which returns the bytes exactly as the server sent them, still gzip-compressed;
in the second case the library uses self.response.iter_content, which returns content already decompressed by the requests library.
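
The difference between the two read paths can be reproduced without network access. The following is a minimal sketch using only the standard library. It does not call smart_open or requests; the payload and the helper names read_raw and read_decoded are made up for illustration, and it assumes the decoded path behaves like a streaming zlib decoder in gzip mode fed chunk by chunk.

```python
import gzip
import io
import zlib

# Hypothetical payload standing in for the server's response body.
payload = b"body { font-family: 'Montserrat'; }\n" * 20
wire_bytes = gzip.compress(payload)  # what a server sends with Content-Encoding: gzip


def read_raw(stream):
    """Return the bytes exactly as they arrived, with no decoding."""
    return stream.read()


def read_decoded(stream, chunk_size=64):
    """Decode a gzip stream chunk by chunk (illustrative stand-in for a client library)."""
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS selects gzip framing
    parts = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parts.append(decoder.decompress(chunk))
    parts.append(decoder.flush())
    return b"".join(parts)


raw = read_raw(io.BytesIO(wire_bytes))
decoded = read_decoded(io.BytesIO(wire_bytes))

print(len(raw), len(decoded))   # the two lengths differ
print(raw[:2] == b"\x1f\x8b")   # True: raw bytes start with the gzip magic number
print(decoded == payload)       # True: decoded bytes match the original payload
```

Two reads of the same response returning different lengths, as in the report above, is the same effect: one path hands back the wire bytes and the other hands back the decoded body.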

Versions

print(platform.platform())
Linux-5.14.0-1047-oem-x86_64-with-glibc2.31
print("Python", sys.version)
Python 3.9.11 (main, Aug  9 2022, 09:22:28) 
[GCC 9.4.0]
print("smart_open", smart_open.__version__)
smart_open 6.0.0

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software
@mpenkov
Collaborator

mpenkov commented Aug 12, 2022

What is the desired behavior here?

@grubberr
Author

Honestly, that's a good question.
I was just pointing out the inconsistency.

@theogaraj

Came across this while trying to solve a problem using smart_open to read from a range of different URLs.
My code:

with (
    so.open(source, 'rb', transport_params={'headers': HEADERS}) as fin,
    so.open(destination, 'wb') as fout
):
    fout.write(fin.read())

I observed that for some URLs I was able to get a meaningful output file while in other cases it was just gibberish. Comparing between success and failure I determined that the ones that were failing were those with Content-Encoding: gzip in the response headers.

@grubberr your issue helped pinpoint what was going on; changing my code to the following now works for all URLs:

with (
    so.open(source, 'rb', transport_params={'headers': HEADERS}) as fin,
    so.open(destination, 'wb') as fout
):
    while True:
        chunk = fin.read(1024)
        if not chunk:
            break
        fout.write(chunk)
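
A loop like this is also what the standard library's shutil.copyfileobj performs: it calls read(length) on the source repeatedly until nothing is left. The sketch below assumes, as reported in this thread, that sized reads return decoded data; in-memory streams stand in for the smart_open file objects so the snippet runs on its own.

```python
import io
import shutil

# In-memory stand-ins for the objects returned by smart_open.open().
fin = io.BytesIO(b"x" * 5000)
fout = io.BytesIO()

# copyfileobj reads fin in blocks of `length` bytes and writes each block to fout.
shutil.copyfileobj(fin, fout, length=1024)

print(len(fout.getvalue()))
```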

I understand that smart_open uses the file extension to determine compression. My failing URL is 'https://www.BCBSIL.com/aca-json/il/index_il.json', so I guess smart_open can't tell that it should decompress with gzip. I tried passing compression='.gz' when opening the file, but that gave me the following error.

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 300, in read
    return self._buffer.read(size)
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 487, in read
    if not self._read_gzip_header():
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 435, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'{\n')

This really puzzled me for a while, but @grubberr's explanation of result.read() vs result.read(2) accounts for it. It looks like gzip reads in chunks (4th line of the stack trace), so even though the original content is compressed, gzip receives content that requests has already decompressed, which causes it to raise the error.
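
The error itself is easy to reproduce in isolation. This sketch feeds plain JSON text to gzip.GzipFile, which is roughly what happens when the gzip layer sits on top of a stream that has already been decompressed; the JSON content is a made-up stand-in.

```python
import gzip
import io

# Plain JSON text, i.e. content that has already been decompressed upstream.
already_decoded = io.BytesIO(b'{\n  "key": "value"\n}')

try:
    gzip.GzipFile(fileobj=already_decoded).read()
    error = None
except gzip.BadGzipFile as exc:
    # The first two bytes are b'{\n', not the gzip magic number b'\x1f\x8b'.
    error = exc

print(repr(error))
```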

What is the desired behavior here?

  • I think ideally it would be that the gzip decompression is done transparently for f.read() as it is for f.read(n).
  • If that's not possible or too complex, having the difference in behavior clarified in the documentation will probably be useful for other people running into the same problem, and they can implement slightly different code similar to what I've done.
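
Until the behavior is made consistent, one possible user-side workaround for an unsized read() is to check for the gzip magic number and decompress when it is present. This is only a sketch: maybe_gunzip is a hypothetical helper, not part of smart_open, and it would also decompress a resource that is legitimately a .gz file, which may not be what you want.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(data: bytes) -> bytes:
    """Decompress data if it starts with the gzip magic number, else return it unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data
```

With this helper, the one-shot copy from earlier in the thread could be written as fout.write(maybe_gunzip(fin.read())).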

Now that I know what the issue is and how to work around it, this is by no means a showstopper. I do want to say that smart_open has really made my life much simpler, and I appreciate all the work that has gone into this library!
