Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

brotli compression not supported? #112

Open
cypherlou opened this issue May 17, 2019 · 5 comments
Open

brotli compression not supported? #112

cypherlou opened this issue May 17, 2019 · 5 comments

Comments

@cypherlou
Copy link

cypherlou commented May 17, 2019

Robots.fetch() silently fails when the Accept Encoding header contains a compression algorithm (such as br) that is not supported.

@dlecocq
Copy link

dlecocq commented May 21, 2019

Under the hood, it uses requests. Does any of this help? https://github.com/kennethreitz/requests/issues/4525

It seems like with brotli installed and the newest urllib3 and requests, it should work.

Still, I agree that it should not fail silently. I have the bandwidth to review PRs, but lack the power to merge them :-/

@quiddihub
Copy link

@dlecocq, I will give that a go - many thanks for the suggestion.

I solved the problem by removing that as an accepted encoding method - but thought I would mention it as this behaviour meant that our site tests suggested there was no robots.txt for servers that supported that encoding method while servers that elected to use something else returned successfully.

Apart from repairing the problem, which may well manifest itself elsewhere so worth doing, I think the ability to provide a robots.txt content as a text string or file (ideally both) and to be able to access the content once it is in the possession of the module (regardless of which method was used to get it there; string, file or url) would help in this and other contexts.

Are you saying if I did a PR it wouldn't be accepted or that you can't do it yourself? I ask as I have development resource that might be able to help but I'm don't want to apply it if the PR is going to sit languishing.

@dlecocq
Copy link

dlecocq commented May 23, 2019

Yeah, it's a pretty terrible failure mode since it effectively ignores the presence of such robots.txts. It looks like urllib3.HttpResponse.CONTENT_DECODERS enumerates the supported encoders for us (https://github.com/urllib3/urllib3/blob/64e413f1b2fef86a150ae747f00aab0e2be8e59c/src/urllib3/response.py#L184), so we could reach down in there to always get the supported encodings.

Re: robots.txt content as a string (or file), that seems reasonable to me. One mode of operation that's important to users of this is to be able to keep a compact representation of the parsed rules for a number of sites, dropping the raw content and rules for other bots as soon as the thing is parsed. As long as that remains possible, I don't think you'll find any pushback from Moz.

I was just saying that I can help out with the reviewing, but I'm no longer at Moz so can't press the 'merge' button for you. @lindseyreno is who'd probably end up doing that.

@lindseyreno
Copy link
Contributor

If @dlecocq reviews it, I'll merge it :)

@quiddihub
Copy link

Thanks @dlecocq and @lindseyreno, really great to get such speedy feedback. I'll talk to one of the lead devs and see if we can get this looked at. They might not want to prioritise as there's a work around for the br encoding failure but we'll try our best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants