
Branding: improve IO resilience #2869

Closed
wants to merge 9 commits

Conversation

@kwzrd (Contributor) commented Dec 28, 2023

Closes GH-2808

This PR aims to improve the resiliency of the Branding manager cog.

Context

The Branding manager automatically discovers events once per day at midnight -- or when instructed to. During each such discovery, it makes multiple HTTP requests to the GitHub API. First, it fetches a list of all events in the Branding repository, and then it fetches those events one-by-one.

If the bot fails to fetch an event, it skips it. This behaviour made sense originally, as we didn't want the bot to stop the discovery iteration just because one event was badly configured. In practice, however, this is a non-issue, as events are validated in the Branding repository before they reach the main branch. The behaviour also applies to server errors: if the bot receives a 503 when fetching an event, the event is simply excluded from the result set. This has an unfortunate consequence: if the bot fails to fetch the currently active event, it concludes that the event has ended, and reverts the branding to the evergreen (fallback) season.
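For illustration, the old skip-on-failure logic looked roughly like this (a sketch with hypothetical names, not the cog's actual code; the RuntimeError matches the one reported in GH-2808):

events = []
for path in event_paths:
    try:
        events.append(await fetch_event(path))
    except RuntimeError:
        # A 503 lands here too, silently dropping the event from the result set.
        log.warning("Failed to fetch %s; skipping.", path)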

This PR adjusts the error handling methodology. Now, if any of the requests fail, the whole discovery iteration is aborted and retried the following day (or it may be retried manually via the sync command). The current state remains in the Redis cache, so the cog will continue working just fine. Additionally, to reduce the likelihood of a discovery iteration being aborted, each request gets a retry mechanism: any request that fails with a server error is retried up to 5 times with exponential backoff.

Implementation

Adjusting the error handling was easy - I basically just removed the error handling clauses from the repository module. This means that errors occurring during event fetch will now propagate to the cog. The daemon already handles exceptions, and so if a fetch fails, the daemon will simply log it and sleep until the next midnight.
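Schematically, the daemon loop behaves like this (hypothetical names; a sketch, not the exact code):

async def _daemon_loop(self) -> None:
    while True:
        try:
            await self._discover_events()  # may now raise on a failed fetch
        except Exception:
            log.exception("Event discovery failed; will retry at next midnight.")
        await self._sleep_until_midnight()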

Another way to trigger the event discovery is via the sync command. This also already handles errors - I just removed a redundant check for when no active event was found, which will now result in an exception.

Finally, the calendar refresh cmd used to automatically show the calendar after running discovery. Because discovery can now fail (well, it always could, but the error used to be hidden), it now responds with an info embed:

[screenshots of the new info embed responses]

The user can then check the calendar via the calendar cmd. I think this makes more sense, but I'm open to feedback.

Within the repository module, I added a custom exception for GitHub server errors and revamped the response status checks. I've added the MIT-licensed backoff library to avoid implementing the retry mechanism myself, but if we would rather avoid the dependency, a custom implementation is also possible. The backoff decorator retries the decorated function if it raises the specified exception, up to the configured number of times.
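For reference, the decorator usage is roughly as follows (fetch_file is a hypothetical stand-in for the repository module's actual function):

import backoff

@backoff.on_exception(backoff.expo, GitHubServerError, max_tries=5)
async def fetch_file(download_url: str) -> bytes:
    ...  # perform the GitHub API request; raises GitHubServerError on a 5xx response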

If an error occurs during event fetch, the operation will now fail
rather than skip the event or return an empty result.

Skipping events may cause the branding to reset or change unexpectedly,
which is not desired. If the bot cannot fetch all events, it should
leave the existing branding assets intact.
If the bot cannot find an active event, it will try to find the
fallback event. If the fallback event cannot be found either,
it will now raise an error.

This should be an exceptional state that cannot be recovered from.
Synchronisation should be retried once the event setup is corrected.
The event fetch may now fail. We handle the error in the calendar
refresh command and present the user with the result.
Requests that fail due to a GitHub server error will now be retried.
The bot will no longer automatically invoke the calendar view cmd
after a cache refresh, as the cache refresh may fail.
No use for jitter in this case, as we don't make concurrent requests.
Deterministic delays make more sense.
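As a sketch, disabling jitter with the backoff library looks like this (function name hypothetical):

# jitter=None turns off randomisation, giving deterministic delays (1s, 2s, 4s, ...).
@backoff.on_exception(backoff.expo, GitHubServerError, max_tries=5, jitter=None)
async def fetch_file(download_url: str) -> bytes:
    ...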
@kwzrd (Contributor, Author) commented Dec 28, 2023

If you would like to test the error handling, you can manually raise an error in _raise_for_status. You can force a 401 if you use a bad access token, but forcing a 500 isn't really possible, so manually raising in the code is probably the best option. Feel free to get in touch if you need help with testing.
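For instance, a temporary raise at the top of the helper simulates a server failure (a sketch; the real signature may differ):

def _raise_for_status(response: aiohttp.ClientResponse) -> None:
    raise GitHubServerError  # temporary, testing only: treat every response as a 5xx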

@shtlrs (Member) commented Dec 29, 2023

Can we maybe use tenacity instead?

The library is popular, resilient, and offers the same features and more, which may become helpful one day.

Most importantly, it's still maintained and gets updates, unlike backoff.
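An equivalent with tenacity would look roughly like this (fetch_file is again a hypothetical stand-in):

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    retry=retry_if_exception_type(GitHubServerError),
    stop=stop_after_attempt(5),
    wait=wait_exponential(),
)
async def fetch_file(download_url: str) -> bytes:
    ...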

Comment on lines +121 to +122
if response.status >= 500:
raise GitHubServerError
Contributor

If the request fails 5 times with a 5xx error, currently the actual server error won't be included in the traceback, just the traceback from this line. If we do something like this, would it help put more context in the traceback?

try:
    response.raise_for_status()
except aiohttp.ClientResponseError as err:  # raised by aiohttp for 4xx/5xx statuses
    if response.status >= 500:
        raise GitHubServerError from err
    raise

Contributor Author

That's a fantastic idea! Thanks, I'll work it in.

@kwzrd (Contributor, Author) commented Jan 4, 2024

@shtlrs Thanks for the suggestion, I will happily switch to tenacity. Though I have a question about the licensing.

The tenacity lib is licensed under Apache 2.0, which should permit us to use it, but I think we may have to declare it in the third-party license declaration file. Could you confirm whether that's the right approach? The file does not currently mention the Apache license.

@Xithrius added the labels a: backend, p: 2 - normal, t: enhancement, and s: needs review on Feb 2, 2024
@wookie184 (Contributor) commented:
> The tenacity lib is licensed under Apache 2.0, which should permit us to use it, but I think we may have to declare it in the third-party license declaration file. Could you confirm whether that's the right approach? The file does not currently mention the Apache license.

I'm not an expert on licenses, but I don't think that should be necessary. The third-party license file we have is currently just used for code we've copied directly into our project, rather than for dependencies (which I think work in a slightly different way). I believe we already have some Apache-licensed dependencies, and there are other projects like discord.py that are MIT-licensed but use Apache-licensed dependencies without including the Apache license in their repo.

@wookie184 added the label s: waiting for author and removed s: needs review on Apr 16, 2024
@kwzrd (Contributor, Author) commented Apr 16, 2024

@wookie184 Makes sense, thanks for the answer. I'll work on the suggested improvements.

@kwzrd (Contributor, Author) commented Oct 9, 2024

Closing in favour of #3182 which adheres to the feedback. Thanks for the reviews!

@kwzrd closed this on Oct 9, 2024
Labels
a: backend, p: 2 - normal, s: waiting for author, t: enhancement
Development

Successfully merging this pull request may close these issues.

Branding cog - RuntimeError: Failed to fetch file due to status: 503