Branding: improve IO resilience #2869

kwzrd · 2023-12-28T14:31:35Z

This PR aims to improve the resiliency of the Branding manager cog.

Context

The Branding manager automatically discovers events once per day at midnight -- or when instructed to. During each such discovery, it makes multiple HTTP requests to the GitHub API. First, it fetches a list of all events in the Branding repository, and then it fetches those events one-by-one.

If the bot fails to fetch an event, it skips it. This behaviour made sense originally, as we didn't want the bot to stop the discovery iteration if just one event is badly configured. In practice, however, this is a non-issue as events are validated in the Branding repository before they reach the main branch. Additionally, this behaviour also affects server errors - if the bot receives a 503 when fetching en event, the event is simply excluded from the result set. This has an unfortunate consequence - if the bot fails to fetch the currently active event, it decides that the event has ended, and will revert the branding to the evergreen (fallback) season.

This PR makes adjustments to the error handling methodology. Now, if any of the requests fail, the whole discovery iteration will be aborted & retried the following day (or it may be retried manually via the sync command). The current state remains in the Redis cache, so the cog will continue working just fine. Additionally, to reduce the likelihood of a discovery iteration being aborted, there will be a retry mechanism on each request. For every request that fails on a server error, the request will be retried up to 5 times with exponential backoff.

Implementation

Adjusting the error handling was easy - I basically just removed the error handling clauses from the repository module. This means that errors occurring during event fetch will now propagate to the cog. The daemon already handles exceptions, and so if a fetch fails, the daemon will simply log it and sleep until the next midnight.

Another way to trigger the event discovery is via the sync command. This also already handles errors - I just removed a redundant check for when no active event was found, which will now result in an exception.

Finally, the calendar refresh cmd used to automatically show the calendar after running discovery. Because discovery can now fail (well, it always could, but the error used to be hidden), it now responds with an info embed:

The user can then use check the calendar via the calender cmd. I think this makes more sense, but I'm open to feedback.

Within the repository module, I added a custom exception for GitHub server errors, and revamped the response status checks. I've added the MIT-licensed backoff library to avoid implementing the retry mechanism myself, but if we would rather avoid the dependency, then a custom implementation is also possible. The backoff decorator basically retries the decorated function if it raises the specified exception, up to the configured amount of times.

If an error occurs during event fetch, the operation will now fail rather than skip the event or return an empty result. Skipping events may cause the branding to reset or change unexpectedly, which is not desired. If the bot cannot fetch all events, it should leave the existing branding assets intact.

If the bot cannot find an active event, it will try to find the fallback event. If the fallback event cannot be found either, it will now raise an error. This should be an exceptional state that cannot be recovered from. Synchronisation should be retried once the event setup is corrected.

The event fetch may now fail. We handle the error in the calendar refresh command and present the user with the result.

Requests that fail due to a GitHub server error will now be retried.

The bot will no longer automatically invoke the calendar view cmd after a cache refresh, as the cache refresh may fail.

No use for jitter in this case, as we don't make concurrent requests. Deterministic delays make more sense.

kwzrd · 2023-12-28T14:40:29Z

If you would like to test the error handling, you can manually raise an error in _raise_for_status. You can force a 401 if you use a bad access token, but forcing a 500 isn't really possible, so manually raising in the code is probably the best option. Feel free to get in touch if you need help with testing.

shtlrs · 2023-12-29T15:40:47Z

Can we maybe use tenacity instead ?

The library is popular, resilient and offeres the same features & even more which may become helpful one day.

Most importantly, it's still maintained & gets updates, unlike backoff

wookie184 · 2024-01-04T14:17:44Z

bot/exts/backend/branding/_repository.py

+        if response.status >= 500:
+            raise GitHubServerError


if the request fails 5 times with a 5xx error, currently the actual server error wont be included in the traceback, just the traceback from this line. If we do something like this would it help put more context in the traceback?

try: response.raise_for_status() except ResponseErrorOrWhateverItsCalled as err: if response.status >= 500: raise GitHubServerError from err raise

That's a fantastic idea! Thanks, I'll work it in.

kwzrd · 2024-01-04T14:52:56Z

@shtlrs Thanks for the suggestion, I will happily switch to tenacity. Though I have a question about the licensing.

The tenacity lib is licensed under Apache 2.0 which should permit us to use it, but I think we may have to declare it in the third party license declaration file. Could you confirm whether that's the right approach? The file does not currently mention the Apache license.

wookie184 · 2024-04-16T11:38:25Z

The tenacity lib is licensed under Apache 2.0 which should permit us to use it, but I think we may have to declare it in the third party license declaration file. Could you confirm whether that's the right approach? The file does not currently mention the Apache license.

I'm not an expert on licenses, but I don't think that should be necessary, the third party license file we have is currently just used for code we've copied directly into our project rather than dependencies (which I think work in a slightly different way). I believe we already have some Apache licensed dependencies, and there are other projects like discord.py that are MIT licensed but use Apache licensed dependencies without including the Apache license in their repo.

kwzrd · 2024-04-16T13:09:57Z

@wookie184 Makes sense, thanks for the answer. I'll work on the suggested improvements.

kwzrd · 2024-10-09T18:01:11Z

Closing in favour of #3182 which adheres to the feedback. Thanks for the reviews!

kwzrd added 9 commits December 28, 2023 12:19

Branding: handle errors in calendar refresh

44bcfca

The event fetch may now fail. We handle the error in the calendar refresh command and present the user with the result.

Deps: add 'backoff'

04f36e6

Branding: implement retry for server errors

6c914de

Requests that fail due to a GitHub server error will now be retried.

Branding: do not invoke calendar after refresh

8d7ab36

The bot will no longer automatically invoke the calendar view cmd after a cache refresh, as the cache refresh may fail.

Branding: pre-define retry decorator

256d4cb

Branding: disable jitter

5871218

No use for jitter in this case, as we don't make concurrent requests. Deterministic delays make more sense.

Branding: remove redundant parentheses

9d0f8ff

wookie184 requested changes Jan 4, 2024

View reviewed changes

wookie184 added s: waiting for author Waiting for author to address a review or respond to a comment and removed s: needs review Author is waiting for someone to review and approve labels Apr 16, 2024

kwzrd mentioned this pull request Oct 9, 2024

Branding: improve IO resilience #3182

Open

kwzrd closed this Oct 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Branding: improve IO resilience #2869

Branding: improve IO resilience #2869

kwzrd commented Dec 28, 2023

kwzrd commented Dec 28, 2023

shtlrs commented Dec 29, 2023

wookie184 Jan 4, 2024

kwzrd Jan 4, 2024

kwzrd commented Jan 4, 2024

wookie184 commented Apr 16, 2024

kwzrd commented Apr 16, 2024

kwzrd commented Oct 9, 2024

Branding: improve IO resilience #2869

Branding: improve IO resilience #2869

Conversation

kwzrd commented Dec 28, 2023

Context

Implementation

kwzrd commented Dec 28, 2023

shtlrs commented Dec 29, 2023

wookie184 Jan 4, 2024

Choose a reason for hiding this comment

kwzrd Jan 4, 2024

Choose a reason for hiding this comment

kwzrd commented Jan 4, 2024

wookie184 commented Apr 16, 2024

kwzrd commented Apr 16, 2024

kwzrd commented Oct 9, 2024