
Operator refactor to control pods + pvcs directly instead of statefulsets #1149

Merged: 19 commits merged into main on Sep 11, 2023

Conversation

ikreymer (Member) commented Sep 6, 2023

Refactoring operator to control pods directly, instead of statefulsets, fixes #1147

This PR includes a number of optimizations for the operator (a rough sketch of the direct pod-control approach follows the list below):

  • Pods can now reach Completed individually, unlike with a StatefulSet - e.g. if 3 pods are running and the first one finishes, a StatefulSet keeps all 3 running until all 3 are done, whereas with this setup the first finished pod can simply remain in the Completed state.
  • Fixed shutdown order - crawler pods now correctly shut down before redis pods, by switching to background deletion.
  • The redis pod stays inactive until a crawler pod first becomes active, and is deactivated again after no crawler pods have been active for 60 seconds.
  • Job deletion starts as soon as the post-finish crawl operations have run.
  • Post-crawl operations get their own redis connection, since the one used during the sync response is being cleaned up in the finalizer.
  • The finalizer ignores requests with an incorrect state (returns 400 if the crawl is reported as not finished when it has already finished).
  • Current resource usage is added to the status.
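
Not from the PR itself, but as a rough illustration of what "controlling pods directly" means here: a minimal, hypothetical metacontroller-style sync hook that returns crawler pods as children, so an individual pod can finish as Completed without being recreated. The function names, pod spec, and parent/children shapes below are assumptions for illustration, not the actual backend/btrixcloud/operator.py code.

```python
# Hypothetical sketch of a metacontroller "sync" hook returning pods
# directly instead of a StatefulSet. All names/fields are assumptions.

def make_crawler_pod(crawl_id: str, index: int) -> dict:
    """Build a bare pod spec for crawler instance `index` of a crawl."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"crawl-{crawl_id}-{index}"},
        "spec": {
            # OnFailure lets a finished pod stay in Completed instead of
            # being restarted, unlike a StatefulSet replica
            "restartPolicy": "OnFailure",
            "containers": [
                {"name": "crawler", "image": "webrecorder/browsertrix-crawler"}
            ],
        },
    }


def sync(parent: dict, children: dict) -> dict:
    """Return desired child pods for a crawl job parent object."""
    crawl_id = parent["metadata"]["name"]
    scale = parent.get("spec", {}).get("scale", 1)

    desired = [make_crawler_pod(crawl_id, i) for i in range(scale)]

    # observed children are assumed to be grouped under "Pod.v1";
    # count pods that already finished successfully
    completed = sum(
        1
        for pod in children.get("Pod.v1", {}).values()
        if pod.get("status", {}).get("phase") == "Succeeded"
    )
    return {"status": {"completedPods": completed}, "children": desired}
```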

ikreymer and others added 14 commits September 1, 2023 00:27
set 'max_crawl_scale' in values.yaml to indicate the maximum possible scale, used to create crawl-instance-{0, N}
priority classes, each with a lower priority than the previous one.
This allows crawl instance 0 to preempt crawls with more instances (and lower priorities),
eg. the 2nd instance of a crawl can preempt the 3rd instance of another, and a new crawl (1st instance)
can preempt the 2nd instance of another crawl.
(A rough sketch of this priority scheme follows the commit list below.)
- ensure redis pod is deleted last
- start deletion in background as soon as crawl is done
- operator may call finalizer with old state: if not finished but in finalizer, attempt to
cancel, and throw 400 if already canceled
- recreate redis in finalizer from yaml to avoid change event
- support reconciling desired and actual scale
- if desired scale is lower, attempt to gracefully shut down each instance
via the new redis 'stopone' key
- once each instance above the desired scale exits successfully, adjust
status.scale down to clean up pods; also clean up per-instance redis
state when scaling down
…have been running for >60 seconds, not immediately
add placeholder for adding podmetrics as related resources
fix canceled condition
- async add_crawl_errors_to_db() call creates its own redis connection, as the other one is supposed to be closed
by the caller (a rough sketch of this follows the commit list below)
- remove unneeded 'sync_db_state_if_finished'
- delete job after crawl finished tasks
- log if crawl finished but not yet deleted on next update
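
As a rough illustration of the separate-connection point in the add_crawl_errors_to_db() commit above - the key layout and the db API here are assumptions, not the actual btrixcloud code:

```python
# Minimal sketch: a background post-crawl task opens its own redis
# connection, since the connection used during the sync response is
# closed by the caller/finalizer. Key name and `db` API are assumed.

from redis import asyncio as aioredis


async def add_crawl_errors_to_db(redis_url: str, crawl_id: str, db) -> None:
    """Copy crawler error entries from redis into the database."""
    redis = aioredis.from_url(redis_url, decode_responses=True)
    try:
        # hypothetical key layout: one list of error strings per crawl
        errors = await redis.lrange(f"{crawl_id}:errors", 0, -1)
        if errors:
            await db.add_errors(crawl_id, errors)
    finally:
        await redis.close()
```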
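
And a rough sketch of the priority-class scheme from the first commit in the list above: one PriorityClass per crawl instance index, each with a lower value, so instance 0 of any crawl can preempt higher-numbered instances of other crawls. The names and base value are assumptions, not taken from the actual Helm chart:

```python
# Hypothetical sketch of per-instance priority classes; names and values
# are assumptions, not the actual chart output.

BASE_PRIORITY = 1000  # assumed base priority value


def make_priority_classes(max_crawl_scale: int) -> list[dict]:
    """Build PriorityClass objects crawl-instance-0 .. crawl-instance-(N-1)."""
    return [
        {
            "apiVersion": "scheduling.k8s.io/v1",
            "kind": "PriorityClass",
            "metadata": {"name": f"crawl-instance-{i}"},
            "value": BASE_PRIORITY - i,  # each successive instance ranks lower
            "preemptionPolicy": "PreemptLowerPriority",
        }
        for i in range(max_crawl_scale)
    ]
```

A pod for crawler instance i would then set priorityClassName to crawl-instance-{i}, so e.g. the 2nd instance of one crawl (index 1) can preempt the 3rd instance of another (index 2), and a brand-new crawl's 1st instance (index 0) can preempt either.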
ikreymer requested a review from tw4l on September 7, 2023 19:59
ikreymer marked this pull request as ready for review on September 7, 2023 20:00
ikreymer commented Sep 7, 2023

Should be ready for review -- there are a bunch of changes / optimizations to make the operator more robust and to prepare for possible autoscaling. Let me know if you have any questions @tw4l

- pods are explicitly deleted if spec.restartTime != status.restartTime, which then updates status.restartTime
- use force_restart to remove pods for one sync response to force deletion
- update to latest metacontroller v4.11.0
- add --restartOnError flag for crawler
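
As a rough illustration of the restart handling in these commits - the function and field names are assumptions, not the actual operator code:

```python
# Hypothetical sketch: when spec.restartTime differs from the recorded
# status.restartTime, return no child pods for one sync response so the
# existing pods are deleted, then record the new value so the pods are
# recreated on the next sync. Names are assumed for illustration.

def handle_restart(spec: dict, status: dict, desired_pods: list[dict]) -> list[dict]:
    """Drop desired pods for one cycle if a restart was requested."""
    requested = spec.get("restartTime")
    applied = status.get("restartTime")

    if requested and requested != applied:
        # force_restart: emit no pods this cycle so existing ones get deleted
        status["restartTime"] = requested
        return []

    return desired_pods
```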
tw4l (Member) left a comment

Testing locally, if I set the crawl scale high and then set it back to 1 via the frontend, it looks like the second and third instances of the crawler don't get the interrupt signal and continue crawling, although I can only see that there are no longer screencasting messages in the second/third crawler pod and I can only see the first instance in the UI.

Edit: Looks like I was mistaken. It looks like only pod crawl-<crawler-id>-0 issues screencasting and status update messages after scaling back down, and the WACZ produced by that crawler is significantly smaller than the other, so this may be working as intended after all. Makes sense that the crawler would remain active so we can get the WACZ at the end.

ikreymer commented Sep 8, 2023

> Testing locally, if I set the crawl scale high and then set it back to 1 via the frontend, it looks like the second and third instances of the crawler don't get the interrupt signal and continue crawling, although I can only see that there are no longer screencasting messages in the second/third crawler pod and I can only see the first instance in the UI.
>
> Edit: Looks like I was mistaken. It looks like only pod crawl-<crawler-id>-0 issues screencasting and status update messages after scaling back down, and the WACZ produced by that crawler is significantly smaller than the other, so this may be working as intended after all. Makes sense that the crawler would remain active so we can get the WACZ at the end.

This is a good question and should be documented here. Previously, with the StatefulSet, we automatically scaled down, which sent the interrupt signal to the pods being scaled down. However, there is a bit of a risk: if any of those pods fail for any reason (the upload fails, or they get evicted before they finish and upload), they will not be restarted again, which could lead to data loss. With this setup, I wanted to be a bit more careful, and instead request each instance to be stopped via the <id>:stopone redis key (introduced in webrecorder/browsertrix-crawler#366), which waits until that pod has stopped. If it gets interrupted, it will restart again, so effectively we don't actually scale down until the pods finish with 'Completed' (may need to look at the error handling here again in case a pod is never able to finish..)
But I think that leads to a safer, even though more complicated, shutdown process.
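
To illustrate the idea (a hypothetical sketch, not the actual operator.py code - the pod names, the per-instance state key, and the set-based use of the stopone key are all assumptions; the real stopone semantics are defined in webrecorder/browsertrix-crawler#366):

```python
# Rough sketch of the graceful scale-down described above: instances above
# the desired scale are asked to stop via the "<crawl_id>:stopone" redis
# key, and status.scale is only lowered once those pods have Completed.
# An async redis client (e.g. redis.asyncio) and all key layouts are assumed.

async def reconcile_scale(redis, crawl_id: str, desired: int,
                          actual: int, pods: dict) -> int:
    """Return the new status.scale after requesting graceful shutdowns."""
    new_scale = actual

    # shrink from the top: only drop the highest-numbered instance once it
    # has finished, so a still-running pod is never removed prematurely
    for i in range(actual - 1, desired - 1, -1):
        pod = pods.get(f"crawl-{crawl_id}-{i}")
        phase = (pod or {}).get("status", {}).get("phase")

        if pod is None or phase == "Succeeded":
            if new_scale == i + 1:
                # instance finished (WACZ uploaded); clean up its
                # per-instance redis state and drop it from the scale
                await redis.delete(f"{crawl_id}:state:{i}")  # hypothetical key
                new_scale = i
        else:
            # ask this instance to stop gracefully; if it crashes or is
            # evicted first, it restarts and tries again, so no data is lost
            await redis.sadd(f"{crawl_id}:stopone", f"crawl-{crawl_id}-{i}")

    return new_scale
```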

ikreymer and others added 4 commits September 8, 2023 08:50
cancel crawl test: just wait until page is found, not necessarily done
…additional logging for failed crawls, if enabled

- print logs for the default container
- also print pod status on failure
- use mark_finished(... 'canceled') for canceled crawls
- tests: also check other finished states to avoid getting stuck in an infinite loop if a crawl fails
- tests: disable disk utilization check, which adds unpredictability to crawl testing!
ikreymer merged commit ad9bca2 into main on Sep 11, 2023
4 checks passed
ikreymer deleted the op-pod branch on September 11, 2023 17:38
Successfully merging this pull request may close these issues.

[Feature]: Make operator more robust by control pods directly