-
-
Notifications
You must be signed in to change notification settings - Fork 675
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Branding: improve IO resilience #2869
Conversation
If an error occurs during event fetch, the operation will now fail rather than skip the event or return an empty result. Skipping events may cause the branding to reset or change unexpectedly, which is not desired. If the bot cannot fetch all events, it should leave the existing branding assets intact.
If the bot cannot find an active event, it will try to find the fallback event. If the fallback event cannot be found either, it will now raise an error. This should be an exceptional state that cannot be recovered from. Synchronisation should be retried once the event setup is corrected.
The event fetch may now fail. We handle the error in the calendar refresh command and present the user with the result.
Requests that fail due to a GitHub server error will now be retried.
The bot will no longer automatically invoke the calendar view cmd after a cache refresh, as the cache refresh may fail.
No use for jitter in this case, as we don't make concurrent requests. Deterministic delays make more sense.
If you would like to test the error handling, you can manually raise an error in |
Can we maybe use tenacity instead ? The library is popular, resilient and offeres the same features & even more which may become helpful one day. Most importantly, it's still maintained & gets updates, unlike |
if response.status >= 500: | ||
raise GitHubServerError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the request fails 5 times with a 5xx error, currently the actual server error wont be included in the traceback, just the traceback from this line. If we do something like this would it help put more context in the traceback?
try:
response.raise_for_status()
except ResponseErrorOrWhateverItsCalled as err:
if response.status >= 500:
raise GitHubServerError from err
raise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fantastic idea! Thanks, I'll work it in.
@shtlrs Thanks for the suggestion, I will happily switch to The |
I'm not an expert on licenses, but I don't think that should be necessary, the third party license file we have is currently just used for code we've copied directly into our project rather than dependencies (which I think work in a slightly different way). I believe we already have some Apache licensed dependencies, and there are other projects like discord.py that are MIT licensed but use Apache licensed dependencies without including the Apache license in their repo. |
@wookie184 Makes sense, thanks for the answer. I'll work on the suggested improvements. |
Closing in favour of #3182 which adheres to the feedback. Thanks for the reviews! |
Closes GH-2808
This PR aims to improve the resiliency of the Branding manager cog.
Context
The Branding manager automatically discovers events once per day at midnight -- or when instructed to. During each such discovery, it makes multiple HTTP requests to the GitHub API. First, it fetches a list of all events in the Branding repository, and then it fetches those events one-by-one.
If the bot fails to fetch an event, it skips it. This behaviour made sense originally, as we didn't want the bot to stop the discovery iteration if just one event is badly configured. In practice, however, this is a non-issue as events are validated in the Branding repository before they reach the
main
branch. Additionally, this behaviour also affects server errors - if the bot receives a 503 when fetching en event, the event is simply excluded from the result set. This has an unfortunate consequence - if the bot fails to fetch the currently active event, it decides that the event has ended, and will revert the branding to the evergreen (fallback) season.This PR makes adjustments to the error handling methodology. Now, if any of the requests fail, the whole discovery iteration will be aborted & retried the following day (or it may be retried manually via the
sync
command). The current state remains in the Redis cache, so the cog will continue working just fine. Additionally, to reduce the likelihood of a discovery iteration being aborted, there will be a retry mechanism on each request. For every request that fails on a server error, the request will be retried up to 5 times with exponential backoff.Implementation
Adjusting the error handling was easy - I basically just removed the error handling clauses from the repository module. This means that errors occurring during event fetch will now propagate to the cog. The daemon already handles exceptions, and so if a fetch fails, the daemon will simply log it and sleep until the next midnight.
Another way to trigger the event discovery is via the
sync
command. This also already handles errors - I just removed a redundant check for when no active event was found, which will now result in an exception.Finally, the calendar refresh cmd used to automatically show the calendar after running discovery. Because discovery can now fail (well, it always could, but the error used to be hidden), it now responds with an info embed:
The user can then use check the calendar via the calender cmd. I think this makes more sense, but I'm open to feedback.
Within the repository module, I added a custom exception for GitHub server errors, and revamped the response status checks. I've added the MIT-licensed backoff library to avoid implementing the retry mechanism myself, but if we would rather avoid the dependency, then a custom implementation is also possible. The backoff decorator basically retries the decorated function if it raises the specified exception, up to the configured amount of times.