We have had several instances in the past month of node pool upgrades causing disruption for users of our production Humio cluster. The problem appears to stem from the lack of graceful node removal and the lack of automation for bringing new Humio nodes up to date with digest and storage data.
A good example of what we think is required for graceful Kubernetes node upgrades can be seen in the Strimzi project. Strimzi ships a utility named Drain Cleaner that watches for Pod eviction notices and then applies a label to the affected Pod(s).
The Strimzi cluster operator watches for this label and takes action to reschedule the Pod, but only if it is safe for the cluster to do so; in particular, it ensures there are enough in-sync replicas and that removing a particular broker will not drop the ISR below the configured minimum.
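To make the discussion concrete, here is a rough Go sketch (Go being the language humio-operator is written in) of the kind of watch predicate the controller side could use. The label key `humio.com/eviction-requested` is an assumption of mine for illustration only; the real key would be whatever the drain-watching utility applies.

```go
package controllers

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// evictionRequestedLabel is a hypothetical label key used for illustration only.
const evictionRequestedLabel = "humio.com/eviction-requested"

// EvictionMarked lets events through only when the marker label is newly added
// to a Pod, so the reconciler wakes up once per eviction request instead of on
// every Pod update.
var EvictionMarked = predicate.Funcs{
	CreateFunc: func(e event.CreateEvent) bool {
		_, ok := e.Object.GetLabels()[evictionRequestedLabel]
		return ok
	},
	UpdateFunc: func(e event.UpdateEvent) bool {
		_, had := e.ObjectOld.GetLabels()[evictionRequestedLabel]
		_, has := e.ObjectNew.GetLabels()[evictionRequestedLabel]
		return !had && has
	},
	DeleteFunc:  func(e event.DeleteEvent) bool { return false },
	GenericFunc: func(e event.GenericEvent) bool { return false },
}
```

The point is only that the controller reacts when the marker appears, analogous to how the Strimzi cluster operator reacts to Drain Cleaner's marker, rather than reacting to every Pod change.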
In Humio's case, the appropriate action would be to schedule a new Humio Pod on a different Kubernetes node (this shouldn't require any particular work from the operator, since the node being upgraded should already be cordoned as part of the upgrade process). Once the new Humio node is healthy, it should be given the same digest and storage partition assignments as the node being evicted. Next, data can be transferred from the soon-to-be-evicted node to the new one. Once that is done, the controller can gracefully remove the old node from the Humio cluster. This would spare us a great deal of disruption when GKE performs (sometimes unexpected) node upgrades.
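And here is a rough sketch of how those steps could be ordered, again in Go. The `humioEvictionOps` interface and every method on it are hypothetical placeholders I made up to show the sequencing; they are not existing humio-operator or Humio API names.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// humioEvictionOps is a hypothetical interface; none of these methods exist in
// humio-operator today. It just names the operations the controller would need.
type humioEvictionOps interface {
	// EnsureReplacementPod schedules a new Humio Pod on another (non-cordoned) node.
	EnsureReplacementPod(ctx context.Context, evicted *corev1.Pod) (*corev1.Pod, error)
	// NodeHealthy reports whether the Humio node backing the Pod is healthy.
	NodeHealthy(ctx context.Context, pod *corev1.Pod) (bool, error)
	// CopyPartitionAssignments gives the replacement the same digest and storage
	// partition assignments as the node being evicted.
	CopyPartitionAssignments(ctx context.Context, from, to *corev1.Pod) error
	// TransferData moves data from the old node to the new one; done=false means
	// the transfer is still in progress.
	TransferData(ctx context.Context, from, to *corev1.Pod) (done bool, err error)
	// RemoveClusterNode gracefully unregisters the old node from the Humio cluster.
	RemoveClusterNode(ctx context.Context, pod *corev1.Pod) error
}

// handleEviction walks the proposed steps for one marked Pod. requeue=true means
// a step is still in progress and the reconciler should check again later.
func handleEviction(ctx context.Context, ops humioEvictionOps, evicted *corev1.Pod) (requeue bool, err error) {
	// 1. Bring up a replacement Pod; the cordoned Kubernetes node keeps the
	//    scheduler from putting it back on the node being upgraded.
	replacement, err := ops.EnsureReplacementPod(ctx, evicted)
	if err != nil {
		return false, err
	}

	// 2. Wait for the new Humio node to report healthy before touching
	//    partition assignments.
	healthy, err := ops.NodeHealthy(ctx, replacement)
	if err != nil {
		return false, err
	}
	if !healthy {
		return true, nil
	}

	// 3. Mirror the evicted node's digest and storage partition assignments
	//    onto the replacement.
	if err := ops.CopyPartitionAssignments(ctx, evicted, replacement); err != nil {
		return false, err
	}

	// 4. Transfer data from the soon-to-be-evicted node to the new one.
	done, err := ops.TransferData(ctx, evicted, replacement)
	if err != nil {
		return false, err
	}
	if !done {
		return true, nil
	}

	// 5. Only once the data is moved, gracefully remove the old node from the
	//    Humio cluster and let the Pod eviction proceed.
	return false, ops.RemoveClusterNode(ctx, evicted)
}
```

The important property is that the old node is only removed from the Humio cluster after the replacement is healthy, holds the same partition assignments, and the data transfer has finished.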
bderrly changed the title from "Improve node upgrade process" to "Improve Kubernetes node upgrade process" on Mar 30, 2023
@schofield, @SaaldjorMike, and @jswoods (sorry for the at spam, just trying to get some eyes on this), I would like to start a dialogue about this idea. We are getting desperate for relief from our node pool upgrade problems. Does this solution seem like an acceptable way to proceed? If there is agreement on the general idea, I can write up a more detailed specification of how the system would behave.