Handle HTTP 429 errors + add failure limit #393
base: main
Conversation
BTW, I did not add any tests; this seems pretty hard to do.
Force-pushed from a48f46a to f697027
I just converted it to Draft because my manual tests are showing that my code is wrong. I will submit a fix as soon as the code is OK.
Code is now ready for review. My tests were not behaving as expected only because Cloudflare suddenly stopped returning the "Retry-After" header for the failing requests.
Thanks for this PR! We should be able to add it to the next feature release 0.12.0. One potential issue is the overall timeout for the page, which is calculated here:
I actually think the consecutive limit might make more sense, especially for multiple crawler instances, e.g. if one instance is having a lot of issues, it should be interrupted while others continue. The state would track total failures across all instances, but maybe for your use case that doesn't matter as much. Were you seeing worse results with consecutive failures? Will think about this a bit more.
Great, thank you! Both points are very valid!

I think that a 429 should "pause" the overall timeout, because 429s are not really timeouts; they are a request from the server to slow down our requests. So from my perspective it does not mean that the server / crawler is malfunctioning (which is what I consider the overall timeout tries to capture), only that we are too "aggressive" for the server. How to implement this seems a bit complex, because I consider it should "pause" the overall timeout only when 429 errors are actually handled; we should not increase the overall timeout if no 429 errors are returned for the current page. I will have a look into it, but if you have any suggestions, they are welcome.

Regarding the consecutive limits, I still don't think that consecutive limits make sense. You could easily get into situations where the crawler won't stop but the result is garbage. For instance, if only one page out of 10 is good, and you have set the limit to 50 because you are crawling a website with thousands of pages, you will never hit the limit, but the result is that 90% of the website is garbage. Maybe we should track an individual limit per crawler instance (is an instance what is controlled by the [...]). There are two scenarios we encounter (for now):
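To make the "pause the overall timeout" idea above concrete, here is a minimal sketch assuming a simple time-budget helper; PageTimeBudget, maxPageTimeMs, and pauseFor are illustrative names, not the crawler's actual API:

```typescript
// Rough sketch only: time spent honoring a 429 back-off is tracked
// separately and subtracted from the elapsed time checked against the
// overall page-time budget, so 429 pauses do not consume the timeout.
class PageTimeBudget {
  private start = Date.now();
  private pausedMs = 0;

  constructor(private maxPageTimeMs: number) {}

  // Call while waiting after a 429; this time does not count against the budget.
  async pauseFor(ms: number): Promise<void> {
    this.pausedMs += ms;
    await new Promise((resolve) => setTimeout(resolve, ms));
  }

  exceeded(): boolean {
    return Date.now() - this.start - this.pausedMs > this.maxPageTimeMs;
  }
}
```

With something like this, the budget only advances while the page is actually being loaded, matching the "pause only when 429 errors are handled" requirement.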
I think that I've implemented the retry on 429 errors in the wrong place. I suggest that I change it this way:
With this solution:
WDYT?
- logger.fatal() also sets crawl status to 'failed'
- add 'failOnFailedLimit' to set crawl status to 'failed' if number of failed pages exceeds limit, refactored from #393
Regarding the failures, I refactored that into a separate PR with additional cleanup (#402) - in this case, I think we want to mark the crawl as 'failed', rather than merely interrupt it, which means the crawler will wait for other workers to finish, and possibly upload the WACZ file. I think the desired behavior for that is to fail the crawl, which would also prevent it from being restarted again in our use case. Let's focus this PR on just the 429 handling perhaps?
Yes, perfect, let's focus on 429 handling in this PR!
- logger.fatal() also sets crawl status to 'failed' and adds endTime before exiting
- add 'failOnFailedLimit' to set crawl status to 'failed' if number of failed pages exceeds limit, refactored from #393 to now use logger.fatal() to end crawl.
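A hedged sketch of the 'failOnFailedLimit' behavior described in this commit message; the CrawlState interface and the logger signature here are assumptions for illustration, not the real browsertrix-crawler API:

```typescript
// Illustrative only: when the count of failed pages reaches the configured
// limit, end the crawl as 'failed' (per the commit message, logger.fatal()
// sets the crawl status to 'failed' before exiting).
interface CrawlState {
  incFailCount(): Promise<number>; // returns the new total of failed pages
}

async function checkFailedLimit(
  state: CrawlState,
  failOnFailedLimit: number,
  logger: { fatal(msg: string): void },
): Promise<void> {
  const failed = await state.incFailCount();
  if (failOnFailedLimit > 0 && failed >= failOnFailedLimit) {
    logger.fatal(
      `Failed page limit of ${failOnFailedLimit} reached, ending crawl`,
    );
  }
}
```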
@ikreymer do you have any more thoughts to share?
Sorry for the delay - just catching up. Yes, this is a better approach, as it allows retrying without having to wait for the within-page counter. I think that could work. Some caveats:
No worries, I know what it is like to have too many things on one's plate. Your caveats are very valid. I will look into how to implement the pause per domain and for all workers; you are probably right that this would make even more sense.
I had a look at the code and searched a bit for what could be done. The logic handling the pause in case of 429 errors could be moved into [...]. Moving this logic further up (typically in [...]) [...].

@ikreymer what do you think about this? Should we join efforts and try to tackle the second solution above, or should I start making some progress by implementing the easy solution, which is sufficient in our scenario (and probably many other scenarios)?
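For the "pause per domain, shared by all workers" direction, one possible approach (an assumption on my part, not what this PR implements) is to store the back-off in the shared Redis state with a TTL, for example:

```typescript
// Hypothetical sketch using ioredis; the key naming is made up for
// illustration. Any worker that receives a 429 records the pause, and all
// workers check it before loading another page from the same domain.
import Redis from "ioredis";

const redis = new Redis();

// Record that a domain asked us to back off for `retryAfterSecs` seconds.
async function pauseDomain(domain: string, retryAfterSecs: number): Promise<void> {
  await redis.set(`pause:${domain}`, "1", "EX", retryAfterSecs);
}

// Returns the remaining pause in seconds for a domain, or 0 if not paused.
async function domainPausedFor(domain: string): Promise<number> {
  const ttl = await redis.ttl(`pause:${domain}`);
  return ttl > 0 ? ttl : 0;
}
```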
Fix #392 (mostly, see NB below, but this is ok for me)

Changes
- New `failedLimit` CLI argument, which interrupts the crawler if the number of failed pages is greater or equal to this limit
- New `pageLoadAttempts` + `defaultRetryPause` CLI arguments. In case of HTTP 429 error:
  - the page load is retried up to `pageLoadAttempts` times
  - the retry pause is taken from the `Retry-After` HTTP response header
  - or defaults to `defaultRetryPause` seconds

NB: the `failedLimit` argument is not based on a count of successive failures as originally suggested in the issue, because it is indeed way more complex + potentially not that great (e.g. if there are many failures but some random successes, the limit might not apply; if there are random failures on a limited number of pages, the limit might never apply but the result could still be pretty bad)
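Putting the description above into code, a minimal sketch of the retry behavior might look like this; pageLoadAttempts and defaultRetryPause are the CLI arguments named above, while the fetch-based loading and the function shape are illustrative assumptions:

```typescript
// Sketch only: retry a page that returns HTTP 429 up to pageLoadAttempts
// times, pausing for the Retry-After header value when present, otherwise
// for defaultRetryPause seconds.
async function loadWithRetry(
  url: string,
  pageLoadAttempts: number,
  defaultRetryPause: number, // seconds
): Promise<Response> {
  let resp: Response | null = null;

  for (let attempt = 0; attempt < pageLoadAttempts; attempt++) {
    resp = await fetch(url);
    if (resp.status !== 429) {
      return resp;
    }

    // Prefer the Retry-After header, fall back to defaultRetryPause.
    const retryAfter = Number(resp.headers.get("Retry-After"));
    const pauseSecs =
      Number.isFinite(retryAfter) && retryAfter > 0 ? retryAfter : defaultRetryPause;
    await new Promise((resolve) => setTimeout(resolve, pauseSecs * 1000));
  }

  // All attempts returned 429; the caller counts this page as failed
  // (which is where failedLimit comes in). Assumes pageLoadAttempts >= 1.
  return resp!;
}
```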