Improve Kubernetes node upgrade process #680

Open
bderrly opened this issue Mar 30, 2023 · 2 comments

Comments

@bderrly
Contributor

bderrly commented Mar 30, 2023

We have had several instances in the past month of node pool upgrades causing disruption for users of our production Humio cluster. The issue seems to stem from the lack of graceful node removal and the lack of automation for bringing new Humio nodes up to date with digest and storage data.

A good example of what we think is required for graceful Kubernetes node upgrades can be seen in the Strimzi project. They have a utility named Drain Cleaner which watches for Pod eviction requests; Drain Cleaner then annotates the affected Pod(s).

The Strimzi cluster operator watches for this annotation and rolls the affected Pod, but only when it is safe for the cluster to do so; in particular, it ensures there are enough in-sync replicas and that restarting a particular broker will not reduce the ISR below the configured minimum.
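
As a rough illustration of the same pattern applied here, the sketch below shows how an eviction request could be turned into an annotation on the Pod for the operator to act on later. The `humio.com/pending-eviction` annotation key and the function are hypothetical, not part of humio-operator or Drain Cleaner; this is just a minimal sketch of the "mark the Pod instead of evicting it" idea.

```go
// Package eviction is a hypothetical sketch; nothing here exists in
// humio-operator today.
package eviction

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// markPodForEviction records the eviction intent on the Pod instead of letting
// the eviction proceed unguarded. A controller watching for this (made-up)
// annotation can then decide when it is actually safe to act.
func markPodForEviction(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	patch := []byte(`{"metadata":{"annotations":{"humio.com/pending-eviction":"true"}}}`)
	_, err := client.CoreV1().Pods(namespace).Patch(
		ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
	)
	return err
}
```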

In the case of Humio, the appropriate action would be to schedule a new Humio Pod on a different Kubernetes node (this shouldn't require any particular work from the operator, as the Kubernetes node should be cordoned as part of the upgrade process). Once the new Humio node is healthy, it should be given the same digest and storage partition assignments as the node being evicted. Next, data can be transferred from the soon-to-be-evicted node to the new one. Once this is done, the controller can gracefully remove the old node from the Humio cluster. This would spare us a lot of problems when there are (unexpected) node upgrades from GKE.
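
To make the intended sequence concrete, here is a rough sketch of what the controller-side flow could look like. The `humioNodeAPI` interface and every method on it are hypothetical, named only to spell out the operations the operator would need; they are not an existing client API.

```go
// Package upgrade sketches the proposed graceful-replacement flow; all of the
// names below are hypothetical.
package upgrade

import (
	"context"
	"fmt"
	"time"
)

// humioNodeAPI is a stand-in for whatever client the operator would use to
// talk to the Humio cluster; none of these methods exist today.
type humioNodeAPI interface {
	StartReplacementNode(ctx context.Context, evictedNodeID int) (newNodeID int, err error)
	IsNodeHealthy(ctx context.Context, nodeID int) (bool, error)
	CopyPartitionAssignments(ctx context.Context, fromNodeID, toNodeID int) error
	DataTransferComplete(ctx context.Context, fromNodeID, toNodeID int) (bool, error)
	UnregisterNode(ctx context.Context, nodeID int) error
}

// replaceEvictedNode walks through the proposed flow: bring up a replacement,
// mirror the digest/storage partition assignments, wait for the data to move,
// then gracefully remove the old node from the Humio cluster.
func replaceEvictedNode(ctx context.Context, api humioNodeAPI, evictedNodeID int) error {
	newNodeID, err := api.StartReplacementNode(ctx, evictedNodeID)
	if err != nil {
		return fmt.Errorf("scheduling replacement pod: %w", err)
	}
	if err := waitFor(ctx, func() (bool, error) { return api.IsNodeHealthy(ctx, newNodeID) }); err != nil {
		return fmt.Errorf("waiting for replacement node to become healthy: %w", err)
	}
	if err := api.CopyPartitionAssignments(ctx, evictedNodeID, newNodeID); err != nil {
		return fmt.Errorf("copying partition assignments: %w", err)
	}
	if err := waitFor(ctx, func() (bool, error) { return api.DataTransferComplete(ctx, evictedNodeID, newNodeID) }); err != nil {
		return fmt.Errorf("waiting for data transfer: %w", err)
	}
	// Only remove the old node once the new node holds its data, so the
	// eviction never leaves segments under-replicated.
	return api.UnregisterNode(ctx, evictedNodeID)
}

// waitFor polls a condition every 10 seconds until it returns true, an error
// occurs, or the context is cancelled.
func waitFor(ctx context.Context, cond func() (bool, error)) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```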

bderrly changed the title from Improve node upgrade process to Improve Kubernetes node upgrade process on Mar 30, 2023
@bderrly
Contributor Author

bderrly commented Jul 18, 2023

@schofield, @SaaldjorMike, and @jswoods (sorry for the at-spam, just trying to get some eyes on this), I would like to start a dialogue about this idea. We are getting desperate for relief from our node pool upgrade problems. Does this solution seem like an acceptable way to proceed? I can write up a more detailed specification for how the system would behave if there is agreement on the general idea.

@bderrly
Contributor Author

bderrly commented Oct 17, 2023

I wrote up a design document outlining the proposed changes.
