Retry deleting tiles on S3. #225
Conversation
The `delete_keys` method used to delete tiles from S3 returns an object containing both successes and failures, and does not raise an exception for failures. This means that rare or intermittent errors can result in some tiles not being deleted. Adding a retry mechanism ensures that those tiles are eventually deleted. Although the loop is infinite, the assumption is that errors from S3 are rare and that dealing with them this way is easier than adapting the object's API to return failed coordinates back to the main process.
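For illustration, here is a minimal sketch of the retry loop described above, assuming boto 2's `Bucket.delete_keys`, which reports per-key failures in the result's `errors` list instead of raising. The bucket name, function name and interval are hypothetical, not the actual code in this PR.

```python
# A minimal sketch, assuming boto 2's Bucket.delete_keys. It returns a
# MultiDeleteResult whose `errors` list holds the keys that failed, rather
# than raising an exception, so the loop keeps retrying just those keys.
# The bucket name, function name and interval below are hypothetical.
import time

import boto


def delete_tiles_with_retry(bucket, key_names, retry_interval=60):
    """Retry until S3 reports every key as deleted."""
    remaining = list(key_names)
    while remaining:
        result = bucket.delete_keys(remaining)
        # Failures come back in result.errors instead of an exception;
        # collect the failed keys and try them again after a pause.
        remaining = [err.key for err in result.errors]
        if remaining:
            time.sleep(retry_interval)


if __name__ == '__main__':
    conn = boto.connect_s3()
    bucket = conn.get_bucket('example-tile-bucket')  # hypothetical bucket
    delete_tiles_with_retry(bucket, ['tiles/10/301/384.json'])
```

The pause between attempts corresponds to the configurable one-minute default discussed below, so failed keys are not retried in a tight loop while S3 recovers.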
This seems good. I think I remember copying/pasting a stack trace from an S3 delete that failed when I was working on the pruner, but I can't seem to find it.
We have seen S3 blow its brains out before (down for more than 8 hours); how would an infinite loop interact with that? An alternative is what @iandees set up for S3 retries in Marble Cutter:
It retries the previously failed tiles every minute (by default; the interval is configurable), so it wouldn't spam S3 with retries. The behaviour is to wait until all deletes are successful, which, in the case where S3 is down, means waiting until it comes back up. Some options for the delete behaviour:
On the assumption that S3 being totally down is incredibly rare, although it does happen, we should aim to do something which is "safe", will recover when S3 comes back up, and is additionally safe from the occasional intermittent 500 error.
I was assuming that:
There are likely additional options that I haven't thought of. Which, if any, matches the behaviour that we want?
I think the current behavior is reasonable given the types of failures we can expect and the recovery options available.
I like option 3 or 4 the most, where we cap the number of retries somewhere and then are just loud about the failure (see the sketch after this comment). Realistically though, since the toi is also on s3, I'd be surprised if we couldn't delete some objects but then were able to subsequently update the toi set. Retrying indefinitely sounds fine to me too, like you have it now, with my pedantic concern being that a programming error could cause the process to hang indefinitely. But it looks like you have the loop guarded specifically for s3 internal errors, so I don't think there's much risk here.
In general, in terms of inconsistent situations I think we're better off having the tile not exist in s3 but still have it in the toi rather than the other way around. That being said, don't we have a hole in our system where a tile can get enqueued for processing in one prune run, and then removed from the toi in the next run while it still hasn't been processed and remains in the queue? It would then get rendered eventually, but no longer exist in the toi, and we would end up with a stale tile.
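As a contrast to the infinite loop, here is a hedged sketch of the capped-retry alternative mentioned above (options 3 and 4): give up after a fixed number of attempts and log loudly, returning the failed keys rather than looping forever. The names and limits are illustrative assumptions, not code from this PR.

```python
# Illustrative sketch of the capped-retry alternative: stop after a fixed
# number of attempts and be loud about any keys that could not be deleted.
# Function and parameter names here are hypothetical.
import logging
import time

logger = logging.getLogger(__name__)


def delete_tiles_capped(bucket, key_names, max_attempts=5, retry_interval=60):
    remaining = list(key_names)
    for attempt in range(1, max_attempts + 1):
        result = bucket.delete_keys(remaining)
        remaining = [err.key for err in result.errors]
        if not remaining:
            return []
        logger.warning('attempt %d: %d keys failed to delete',
                       attempt, len(remaining))
        if attempt < max_attempts:
            time.sleep(retry_interval)
    # Surface the failure loudly so the inconsistency can be investigated,
    # and hand the undeleted keys back to the caller.
    logger.error('giving up after %d attempts; %d keys not deleted',
                 max_attempts, len(remaining))
    return remaining
```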
Thanks for the explanation. Since you've guarded against the basic cases, I'm fine with the code change.
@rmarianski can you elaborate more on this in a new ticket, please?
Follow-up issue for the race condition: #226