For instance, when asked if any page on http://other.example.com/ is allowed, reppy returns False.
It should either return True or potentially throw an exception, but definitely not False.
Returning False is incorrect because robots.txt is not a whitelist.
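To make the scoping concrete: a robots.txt only governs the host it was fetched from, so rules from example.com's robots.txt say nothing about other.example.com either way. Here is a minimal sketch of that idea using the standard library's `urllib.robotparser`; the `is_allowed` helper, the `None` convention for "out of scope", and the sample rules are all hypothetical, not reppy's API.

```python
# Sketch (assumed semantics): rules parsed from one host's robots.txt
# cannot answer questions about a different host, so we return None
# rather than a misleading False.
from urllib.robotparser import RobotFileParser
from urllib.parse import urlsplit

ROBOTS_HOST = "example.com"  # host this robots.txt was fetched from

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])

def is_allowed(url: str, agent: str = "*"):
    """True/False for URLs on ROBOTS_HOST; None when the URL's host
    is outside the scope of this robots.txt."""
    if urlsplit(url).hostname != ROBOTS_HOST:
        return None  # this robots.txt simply doesn't apply
    return parser.can_fetch(agent, url)

print(is_allowed("http://example.com/private/x"))  # False
print(is_allowed("http://example.com/public"))     # True
print(is_allowed("http://other.example.com/"))     # None, not False
```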
I have mixed feelings about what the behavior should be. On the one hand, False doesn't really capture the truth of it, but it is the safer alternative - better to incorrectly report False than to risk incorrectly reporting True. What we really need is a way to convey "it's not clear whether this is allowed or not based on this robots.txt." On the other hand, throwing an exception doesn't feel quite appropriate because the situation isn't particularly exceptional. Perhaps a different return type that conveys more of the nuance would work, but that also seems a little clunky.
Whenever we've used this, it has generally been through the cache, which takes care of finding the appropriate Robots or Agent based on the domain.
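One possible shape for the "different return type" idea above is a three-valued answer instead of a bare bool. This is a hypothetical sketch, not reppy's API; the `Decision` name and its truthiness rule are assumptions.

```python
# A three-valued decision: only a definite ALLOWED is truthy, so code
# written as `if decision:` keeps the conservative False-like default,
# while UNKNOWN stays distinguishable from a real DISALLOWED.
from enum import Enum

class Decision(Enum):
    ALLOWED = "allowed"
    DISALLOWED = "disallowed"
    UNKNOWN = "unknown"  # this robots.txt doesn't cover the URL's host

    def __bool__(self):
        return self is Decision.ALLOWED

print(bool(Decision.ALLOWED))   # True
print(bool(Decision.UNKNOWN))   # False: safe default preserved
print(Decision.UNKNOWN is Decision.DISALLOWED)  # False: still distinct
```

Callers that only care about safety can keep treating the result as a bool, while callers that care about the distinction can compare against `Decision.UNKNOWN` explicitly.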
What's the workaround for this? Many websites define robots.txt rules only for the second-level domain, which means links containing "www.domain.com" get marked forbidden by the rules even though they're not. For example:
DEBUG - URL https://insurancejournal.com/news/west/ is allowed in robots.txt
DEBUG - URL https://www.insurancejournal.com/news/international/2020/10/02/584993.htm is FORBIDDEN by robots.txt, skipping
I'm thinking of removing www. from the URL before checking it, but that looks ugly.
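If you do go that route, a slightly less ad-hoc version is to normalize the host once, in one place, before deciding which robots.txt rules to consult. A minimal sketch (the helper name is hypothetical, and treating www.domain.com and domain.com as the same site is an assumption about the site, not something robots.txt guarantees):

```python
# Normalize a URL's host by stripping a single leading "www." so that
# www.domain.com and domain.com map to the same robots.txt lookup key.
from urllib.parse import urlsplit

def robots_lookup_host(url: str) -> str:
    """Return the URL's host with one leading 'www.' removed."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

# Both of the URLs from the log lines above map to one host:
print(robots_lookup_host("https://insurancejournal.com/news/west/"))
print(robots_lookup_host(
    "https://www.insurancejournal.com/news/international/2020/10/02/584993.htm"))
# Both print: insurancejournal.com
```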