Improve Kubernetes node upgrade process #680

Open
bderrly opened this issue Mar 30, 2023 · 2 comments

Comments

@bderrly
Contributor

bderrly commented Mar 30, 2023

We have had several instances in the past month of node pool upgrades causing disruption for users of our production Humio cluster. The issue seems to stem from the lack of graceful node removal and the lack of automation for bringing new Humio nodes up to date with digest and storage data.

A good example of what we think is required for graceful Kubernetes node upgrades can be seen in the Strimzi project. They have a utility named Drain Cleaner which watches for Pod eviction requests; Drain Cleaner then annotates the affected Pod(s).

The Strimzi cluster operator watches for this annotation and rolls the affected Pod, but only when it is safe for the cluster to do so; in particular, it ensures there are enough in-sync replicas and that restarting a particular broker will not reduce the ISR below the configured minimum.
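
As a rough illustration of the same pattern applied here, the sketch below shows how an eviction request could be turned into an annotation on the Pod for the operator to act on later. The `humio.com/pending-eviction` annotation key and the function are hypothetical, not part of humio-operator or Drain Cleaner; this is just a minimal sketch of the "mark the Pod instead of evicting it" idea.

```go
// Package eviction is a hypothetical sketch; nothing here exists in
// humio-operator today.
package eviction

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// markPodForEviction records the eviction intent on the Pod instead of letting
// the eviction proceed unguarded. A controller watching for this (made-up)
// annotation can then decide when it is actually safe to act.
func markPodForEviction(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	patch := []byte(`{"metadata":{"annotations":{"humio.com/pending-eviction":"true"}}}`)
	_, err := client.CoreV1().Pods(namespace).Patch(
		ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
	)
	return err
}
```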

In the case of Humio, the appropriate action would be to schedule a new Humio Pod on a different Kubernetes node (this shouldn't require any particular work from the operator, as the Kubernetes node should be cordoned as part of the upgrade process). Once the new Humio node is healthy, it should be given the same digest and storage partition assignments as the node being evicted. Next, data can be transferred from the soon-to-be-evicted node to the new one. Once this is done, the controller can gracefully remove the old node from the Humio cluster. This would spare us a lot of problems when there are (unexpected) node upgrades from GKE.
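
To make the intended sequence concrete, here is a rough sketch of what the controller-side flow could look like. The `humioNodeAPI` interface and every method on it are hypothetical, named only to spell out the operations the operator would need; they are not an existing client API.

```go
// Package upgrade sketches the proposed graceful-replacement flow; all of the
// names below are hypothetical.
package upgrade

import (
	"context"
	"fmt"
	"time"
)

// humioNodeAPI is a stand-in for whatever client the operator would use to
// talk to the Humio cluster; none of these methods exist today.
type humioNodeAPI interface {
	StartReplacementNode(ctx context.Context, evictedNodeID int) (newNodeID int, err error)
	IsNodeHealthy(ctx context.Context, nodeID int) (bool, error)
	CopyPartitionAssignments(ctx context.Context, fromNodeID, toNodeID int) error
	DataTransferComplete(ctx context.Context, fromNodeID, toNodeID int) (bool, error)
	UnregisterNode(ctx context.Context, nodeID int) error
}

// replaceEvictedNode walks through the proposed flow: bring up a replacement,
// mirror the digest/storage partition assignments, wait for the data to move,
// then gracefully remove the old node from the Humio cluster.
func replaceEvictedNode(ctx context.Context, api humioNodeAPI, evictedNodeID int) error {
	newNodeID, err := api.StartReplacementNode(ctx, evictedNodeID)
	if err != nil {
		return fmt.Errorf("scheduling replacement pod: %w", err)
	}
	if err := waitFor(ctx, func() (bool, error) { return api.IsNodeHealthy(ctx, newNodeID) }); err != nil {
		return fmt.Errorf("waiting for replacement node to become healthy: %w", err)
	}
	if err := api.CopyPartitionAssignments(ctx, evictedNodeID, newNodeID); err != nil {
		return fmt.Errorf("copying partition assignments: %w", err)
	}
	if err := waitFor(ctx, func() (bool, error) { return api.DataTransferComplete(ctx, evictedNodeID, newNodeID) }); err != nil {
		return fmt.Errorf("waiting for data transfer: %w", err)
	}
	// Only remove the old node once the new node holds its data, so the
	// eviction never leaves segments under-replicated.
	return api.UnregisterNode(ctx, evictedNodeID)
}

// waitFor polls a condition every 10 seconds until it returns true, an error
// occurs, or the context is cancelled.
func waitFor(ctx context.Context, cond func() (bool, error)) error {
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for {
		done, err := cond()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}
```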

bderrly changed the title from Improve node upgrade process to Improve Kubernetes node upgrade process on Mar 30, 2023
@bderrly
Contributor Author

bderrly commented Jul 18, 2023

@schofield, @SaaldjorMike, and @jswoods (sorry for the at-spam, just trying to get some eyes on this), I would like to start a dialogue about this idea. We are getting desperate for relief from our node pool upgrade problems. Does this solution seem like an acceptable way to proceed? I can write up a more detailed specification for how the system would behave if there is agreement on the general idea.

@bderrly
Contributor Author

bderrly commented Oct 17, 2023

I wrote up a design document outlining the proposed changes.
