Operator refactor to control pods + pvcs directly instead of statefulsets #1149
Conversation
add redis_storage param
set 'max_crawl_scale' in values.yaml to indicate the maximum possible scale; used to create crawl-instance-{0, N} priority classes, each with a lower priority. This allows crawl instance 0 to preempt crawls with more instances (and lower priorities), eg. the 2nd instance of a crawl can preempt the 3rd instance of another, and a new crawl (1st instance) can preempt the 2nd instance of another crawl
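The priority-class scheme above can be sketched as follows. This is a hypothetical illustration (not the actual chart/operator code): it generates one PriorityClass manifest per instance slot, with instance 0 getting the highest priority value so earlier instances can preempt later instances of other crawls.

```python
# Hypothetical sketch: generate per-instance PriorityClass manifests from
# a max_crawl_scale setting. Lower instance index -> higher priority value,
# so instance 0 of a new crawl can preempt instance 2 of another crawl.
def make_priority_classes(max_crawl_scale, base_priority=1000):
    """Return one PriorityClass manifest per crawl instance slot."""
    return [
        {
            "apiVersion": "scheduling.k8s.io/v1",
            "kind": "PriorityClass",
            "metadata": {"name": f"crawl-instance-{i}"},
            # instance 0 keeps the highest value
            "value": base_priority - i,
            "preemptionPolicy": "PreemptLowerPriority",
        }
        for i in range(max_crawl_scale)
    ]

pcs = make_priority_classes(3)
print([(p["metadata"]["name"], p["value"]) for p in pcs])
# → [('crawl-instance-0', 1000), ('crawl-instance-1', 999), ('crawl-instance-2', 998)]
```

The base priority value of 1000 is an arbitrary placeholder; only the relative ordering between instance slots matters for preemption.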
- ensure redis pod is deleted last
- start deletion in background as soon as crawl is done
- operator may call finalizer with old state: if not finished but in finalizer, attempt to cancel, and throw 400 if already canceled
- recreate redis in finalizer from yaml to avoid a change event
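The finalizer behavior described above can be sketched roughly as below. All names here are invented for illustration; the real operator works on Kubernetes objects, not strings.

```python
# Illustrative sketch of two pieces of the finalizer logic described above:
# (1) order deletions so the redis pod goes last, keeping final crawl state
#     readable while crawler pods shut down;
# (2) if the finalizer fires with stale (unfinished) state, attempt a cancel,
#     and treat a repeated cancel as an error (HTTP 400 in the API).
def order_pod_deletion(pods):
    """Delete crawler pods first; keep any redis pod for last."""
    non_redis = [p for p in pods if not p.startswith("redis")]
    redis = [p for p in pods if p.startswith("redis")]
    return non_redis + redis

def finalize_crawl(finished, already_canceled):
    """Return the action to take when the finalizer runs."""
    if finished:
        return "finished"
    if already_canceled:
        raise ValueError("400: crawl already canceled")
    return "canceling"

print(order_pod_deletion(["redis-0", "crawl-abc-0", "crawl-abc-1"]))
# → ['crawl-abc-0', 'crawl-abc-1', 'redis-0']
```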
- support reconciling desired and actual scale
- if desired scale is lower, attempt to gracefully shut down each instance via the new redis 'stopone' key
- once each instance above the desired scale exits successfully, adjust status.scale down to clean up pods
- also clean up redis per-instance state when scaling down
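The scale-down reconciliation above can be sketched as follows. The helper names and dict-based key/value store are stand-ins, not the real operator API; the point is that `status.scale` only drops once the highest instances have actually exited.

```python
# Rough sketch of reconciling desired vs. actual scale: instances above the
# desired scale are asked to stop gracefully via a per-instance 'stopone'
# key, and the reported scale is only lowered once those instances have
# exited successfully (and uploaded their data).
def reconcile_scale(desired, actual, redis, exited):
    """redis: dict used as a stand-in key/value store;
    exited: set of instance indices that have finished successfully."""
    new_scale = actual
    # walk down from the highest instance toward the desired scale
    for i in range(actual - 1, desired - 1, -1):
        redis[f"stopone:{i}"] = "1"   # request graceful shutdown
        if i in exited:
            new_scale = i             # safe to remove this instance's pod
        else:
            break                     # still running: keep scale for now
    return new_scale

redis = {}
print(reconcile_scale(desired=1, actual=3, redis=redis, exited={2}))
# → 2  (instance 2 exited; instance 1 is still finishing)
```

Each subsequent sync would repeat this until all instances above the desired scale have exited and the per-instance redis state can be cleaned up.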
…have been running for >60 seconds, not immediately
add placeholder for adding podmetrics as related resources; fix canceled condition
Should be ready for review -- there are a number of changes / optimizations to make the operator more robust and to prepare for possible autoscaling. Let me know if you have any questions @tw4l
- pods are explicitly deleted if spec.restartTime != status.restartTime, then status.restartTime is updated
- use force_restart to remove pods for one sync response to force deletion
- update to latest metacontroller v4.11.0
- add --restartOnError flag for crawler
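The restart check above can be sketched minimally as below. Field names follow the description in this PR; the function itself is illustrative, not the operator's actual code.

```python
# Minimal sketch of the restart check: when spec.restartTime differs from
# status.restartTime, pods are deleted for one sync response, and the new
# value is recorded in status so the restart happens only once.
def check_force_restart(spec, status):
    """Return True if pods should be deleted on this sync."""
    restart_time = spec.get("restartTime")
    if restart_time and restart_time != status.get("restartTime"):
        status["restartTime"] = restart_time
        return True
    return False

status = {}
print(check_force_restart({"restartTime": "2023-09-01T00:00:00Z"}, status))
# → True   (first sync after restartTime changed: delete pods)
print(check_force_restart({"restartTime": "2023-09-01T00:00:00Z"}, status))
# → False  (subsequent syncs: pods are recreated normally)
```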
Testing locally, if I set the crawl scale high and then set it back to 1 via the frontend, it looks like the second and third instances of the crawler don't get the interrupt signal and continue crawling, although I can only see that there are no longer screencasting messages in the second/third crawler pod and I can only see the first instance in the UI.
Edit: Looks like I was mistaken. It looks like pod `crawl-<crawler-id>-0` only issues screencasting and status update messages after scaling back down, and the WACZ produced by that crawler is significantly smaller than the other, so this may be working as intended after all. Makes sense that the crawler would remain active so we can get the WACZ at the end.
This is a good question and should be documented here. Previously, with the StatefulSet, we automatically scaled down, which sent the interrupt signal to the pods being scaled down. However, there is a bit of a risk: if any of those pods fail for any reason (upload fails, or they get evicted before they finish and upload), then they will not be restarted again, which could lead to data loss. With this setup, I wanted to be a bit more careful, and instead request that each instance be stopped via the redis 'stopone' key.
Co-authored-by: Tessa Walsh <[email protected]>
cancel crawl test: just wait until page is found, not necessarily done
Additional logging for failed crawls, if enabled:
- print logs for default container
- also print pod status on failure
- use mark_finished(... 'canceled') for canceled crawl
- tests: also check other finished states to avoid getting stuck in an infinite loop if a crawl fails
- tests: disable disk utilization check, which adds unpredictability to crawl testing!
Refactors the operator to control pods directly, instead of StatefulSets; fixes #1147
This PR includes a number of optimizations for the operator: