Fix for gzip.BadGzipFile: Not a gzipped file (b'<!'), and a bug report #14

Open
legend-zl opened this issue May 9, 2022 · 3 comments

Comments

@legend-zl

If the response returned by a crawled site has content-encoding: "gzip" in its headers, the error above is raised, even though the author removes this header in the following snippet from downloadermiddlewares.py:

        # Necessary to bypass the compression middleware
        # This only removes content-encoding from the local `headers`; it is still
        # present in response.headers, so the call below should pass headers=headers.
        headers = response.headers
        headers.pop('content-encoding', None)
        headers.pop('Content-Encoding', None)

        response = HtmlResponse(
            page.url,
            status=response.status,
            headers=response.headers,    # the fix: change this to headers=headers
            body=content,
            encoding='utf-8',
            request=request
        )

Unfortunately, that does not actually remove the header. The error only goes away after changing headers=response.headers to headers=headers.
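The behaviour described above can be reproduced without a browser or Scrapy. The sketch below is only an illustration under stated assumptions: `FakeBrowserResponse` is a hypothetical stand-in for the browser response object, and it assumes that its `headers` property returns a fresh dict on every access, which would explain why popping from the local `headers` leaves `response.headers` untouched. The last part shows how Python's own `gzip` module reacts when asked to decompress a body that is already plain HTML.

```python
import gzip


class FakeBrowserResponse:
    """Hypothetical stand-in for the browser response used in the middleware."""

    def __init__(self, headers):
        self._headers = dict(headers)

    @property
    def headers(self):
        # Assumption: every access hands out a new copy of the headers.
        return dict(self._headers)


browser_response = FakeBrowserResponse(
    {"content-encoding": "gzip", "content-type": "text/html"}
)

# Same pattern as in the snippet from the issue.
headers = browser_response.headers
headers.pop("content-encoding", None)
headers.pop("Content-Encoding", None)

# The local copy is clean, but a new access still carries the header.
removed_locally = "content-encoding" not in headers
still_present = "content-encoding" in browser_response.headers

# The rendered page body is already decoded HTML, not gzip data.
body = b"<!DOCTYPE html><html></html>"
try:
    gzip.decompress(body)
    error_message = None
except gzip.BadGzipFile as exc:
    error_message = str(exc)

print(removed_locally, still_present, error_message)
```

Passing the cleaned local `headers` to `HtmlResponse`, as proposed in the issue, avoids handing the stale header to later middlewares.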

@legend-zl
Author

The comments in the snippet were added by me.

@tangyuanba

Thanks for the solution. I found that doing the removal after calling HtmlResponse also returns a correct response:

        response = HtmlResponse(
            page.url,
            status=response.status,
            headers=response.headers,
            body=content,
            encoding='utf-8',
            request=request
        )

        headers.pop('content-encoding', None)
        headers.pop('Content-Encoding', None)

@yswtrue

yswtrue commented Jun 29, 2022

Removing the scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware middleware also works for me.
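For reference, one way to remove that middleware is through the project settings. This is a minimal sketch, assuming a standard Scrapy project with a settings.py; in Scrapy, assigning None to a middleware path in DOWNLOADER_MIDDLEWARES disables it. Any entries the project already has in this dict would need to be kept alongside it.

```python
# settings.py (sketch; merge with the project's existing DOWNLOADER_MIDDLEWARES)
DOWNLOADER_MIDDLEWARES = {
    # None disables a downloader middleware that Scrapy enables by default.
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": None,
}
```

Note that this switches off response decompression for every request in the project, not only for the browser-rendered ones, so it may not suit spiders that also fetch compressed pages through the regular downloader.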
