Handle HTTP 429 errors + add failure limit #393
base: main
Conversation
BTW, I did not add any tests; this seems pretty hard to do.
Force-pushed from a48f46a to f697027
I just converted it to Draft because my manual tests are showing that my code is wrong. I will submit a fix as soon as the code is OK.
Code is now ready for review. My tests were not behaving as expected only because Cloudflare suddenly stopped returning the "Retry-After" header for the failing requests.
Thanks for this PR! We should be able to add it to the next feature release 0.12.0. One potential issue is the overall timeout for the page, which is calculated here:
I actually think the consecutive limit might make more sense, especially for multiple crawler instances, e.g. if one instance is having a lot of issues, it should be interrupted while others continue. The state would track total failures across all instances, but maybe for your use case that doesn't matter as much. Were you seeing worse results with consecutive failures? Will think about this a bit more.
Great, thank you! Both points are very valid!

I think that a 429 should "pause" the overall timeout, because 429s are not really timeouts; they are a request from the server to slow down our requests. So from my perspective it does not mean that the server / crawler is malfunctioning (which is what I consider the overall timeout tries to capture), only that we are too "aggressive" for the server. How to implement this seems a bit complex, because I consider it should "pause" the overall timeout only when 429 errors are actually handled; we should not increase the overall timeout if no 429 errors are returned for the current page. I will have a look into it, but if you have any suggestions, they are welcome.

Regarding the consecutive limits, I still don't think that consecutive limits make sense. You could easily get into situations where the crawler won't stop but the result is garbage. For instance, if only one page out of 10 is good, and you have set the limit to 50 because you are crawling a website with thousands of pages, you will never hit the limit, but the result is that 90% of the website is garbage. Maybe we should track an individual limit per crawler instance (is an instance what is controlled by the [...]). There are two scenarios we encounter (for now):
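To make the "pause the overall timeout" idea above concrete, here is a minimal sketch assuming a simple time-budget helper; PageTimeBudget, maxPageTimeMs, and pauseFor are illustrative names, not the crawler's actual API:

```typescript
// Rough sketch only: time spent honoring a 429 back-off is tracked
// separately and subtracted from the elapsed time checked against the
// overall page-time budget, so 429 pauses do not consume the timeout.
class PageTimeBudget {
  private start = Date.now();
  private pausedMs = 0;

  constructor(private maxPageTimeMs: number) {}

  // Call while waiting after a 429; this time does not count against the budget.
  async pauseFor(ms: number): Promise<void> {
    this.pausedMs += ms;
    await new Promise((resolve) => setTimeout(resolve, ms));
  }

  exceeded(): boolean {
    return Date.now() - this.start - this.pausedMs > this.maxPageTimeMs;
  }
}
```

With something like this, the budget only advances while the page is actually being loaded, matching the "pause only when 429 errors are handled" requirement.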
I think that I've implemented the retry on 429 errors in the wrong place. I suggest that I change it this way:
With this solution:
WDYT?
- logger.fatal() also sets crawl status to 'failed'
- add 'failOnFailedLimit' to set crawl status to 'failed' if number of failed pages exceeds limit, refactored from #393
Regarding the failures, I refactored that into a separate PR with additional cleanup (#402) - in this case, I think we want to mark the crawl as 'failed', rather than merely interrupt it, which means the crawler will wait for other workers to finish, and possibly upload the WACZ file. I think the desired behavior for that is to fail the crawl, which would also prevent it from being restarted again in our use case. Let's focus this PR on just the 429 handling perhaps?
Yes, perfect, let's focus on 429 handling in this PR!
- logger.fatal() also sets crawl status to 'failed' and adds endTime before exiting
- add 'failOnFailedLimit' to set crawl status to 'failed' if number of failed pages exceeds limit, refactored from #393 to now use logger.fatal() to end crawl.
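A hedged sketch of the 'failOnFailedLimit' behavior described in this commit message; the CrawlState interface and the logger signature here are assumptions for illustration, not the real browsertrix-crawler API:

```typescript
// Illustrative only: when the count of failed pages reaches the configured
// limit, end the crawl as 'failed' (per the commit message, logger.fatal()
// sets the crawl status to 'failed' before exiting).
interface CrawlState {
  incFailCount(): Promise<number>; // returns the new total of failed pages
}

async function checkFailedLimit(
  state: CrawlState,
  failOnFailedLimit: number,
  logger: { fatal(msg: string): void },
): Promise<void> {
  const failed = await state.incFailCount();
  if (failOnFailedLimit > 0 && failed >= failOnFailedLimit) {
    logger.fatal(
      `Failed page limit of ${failOnFailedLimit} reached, ending crawl`,
    );
  }
}
```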
@ikreymer do you have any more thoughts to share?
Sorry for the delay - just catching up. Yes, this is a better approach, as it allows retrying without having to wait for the within-page counter. I think that could work. Some caveats:
No worries, I know what it is like to have too many things on one's plate. Your caveats are very valid. I will look into how to implement the pause per domain and for all workers; you are probably right that this would make even more sense.
I had a look at the code and searched a bit for what could be done. The logic handling the pause in case of 429 errors could be moved into [...]. Moving this logic further up (typically in [...]) [...].

@ikreymer what do you think about this? Should we join efforts and try to tackle the second solution above, or should I start making some progress by implementing the easy solution, which is sufficient in our scenario (and probably many other scenarios)?
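For the "pause per domain, shared by all workers" direction, one possible approach (an assumption on my part, not what this PR implements) is to store the back-off in the shared Redis state with a TTL, for example:

```typescript
// Hypothetical sketch using ioredis; the key naming is made up for
// illustration. Any worker that receives a 429 records the pause, and all
// workers check it before loading another page from the same domain.
import Redis from "ioredis";

const redis = new Redis();

// Record that a domain asked us to back off for `retryAfterSecs` seconds.
async function pauseDomain(domain: string, retryAfterSecs: number): Promise<void> {
  await redis.set(`pause:${domain}`, "1", "EX", retryAfterSecs);
}

// Returns the remaining pause in seconds for a domain, or 0 if not paused.
async function domainPausedFor(domain: string): Promise<number> {
  const ttl = await redis.ttl(`pause:${domain}`);
  return ttl > 0 ? ttl : 0;
}
```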
Fix #392 (mostly, see NB below, but this is ok for me)

Changes
- New `failedLimit` CLI argument, which interrupts the crawler if the number of failed pages is greater or equal to this limit
- New `pageLoadAttempts` + `defaultRetryPause` CLI arguments. In case of HTTP 429 error:
  - the page load is retried up to `pageLoadAttempts` times
  - the retry pause is taken from the `Retry-After` HTTP response header
  - or defaults to `defaultRetryPause` seconds

NB: the `failedLimit` argument is not based on a count of successive failures as originally suggested in the issue, because it is indeed way more complex + potentially not that great (e.g. if there are many failures but some random successes, the limit might not apply; if there are random failures on a limited number of pages, the limit might never apply but the result could still be pretty bad)
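Putting the description above into code, a minimal sketch of the retry behavior might look like this; pageLoadAttempts and defaultRetryPause are the CLI arguments named above, while the fetch-based loading and the function shape are illustrative assumptions:

```typescript
// Sketch only: retry a page that returns HTTP 429 up to pageLoadAttempts
// times, pausing for the Retry-After header value when present, otherwise
// for defaultRetryPause seconds.
async function loadWithRetry(
  url: string,
  pageLoadAttempts: number,
  defaultRetryPause: number, // seconds
): Promise<Response> {
  let resp: Response | null = null;

  for (let attempt = 0; attempt < pageLoadAttempts; attempt++) {
    resp = await fetch(url);
    if (resp.status !== 429) {
      return resp;
    }

    // Prefer the Retry-After header, fall back to defaultRetryPause.
    const retryAfter = Number(resp.headers.get("Retry-After"));
    const pauseSecs =
      Number.isFinite(retryAfter) && retryAfter > 0 ? retryAfter : defaultRetryPause;
    await new Promise((resolve) => setTimeout(resolve, pauseSecs * 1000));
  }

  // All attempts returned 429; the caller counts this page as failed
  // (which is where failedLimit comes in). Assumes pageLoadAttempts >= 1.
  return resp!;
}
```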