[BUG] Node cannot be recovered from error state #856
Comments
Hi @axieatlas. The problem here is that the pods all have persistent storage that is reused (by design) across restarts and pod recreations. Unless the volume for a pod is deleted, it will always remember its role and report the error you posted. But IMO we cannot have the operator just delete volumes; that could lead to disaster and data loss. So if you change node roles in an existing nodePool, it is your responsibility to deal with the existing data.
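For illustration, a minimal sketch of what "dealing with existing data" could look like in practice, assuming the operator provisions one PVC per pod; the namespace, PVC, and pod names below are hypothetical placeholders, not taken from this issue:

```bash
# Hypothetical names: list the PVCs first to find the real ones.
# WARNING: deleting a PVC discards all OpenSearch data stored on it.

# 1. Find the PVC that backs the failing pod.
kubectl get pvc -n opensearch

# 2. Delete the stale PVC; it stays in "Terminating" until the pod is gone.
kubectl delete pvc data-my-cluster-masters-0 -n opensearch

# 3. Delete the stuck pod; the StatefulSet recreates it with a fresh,
#    empty volume that matches the new node role.
kubectl delete pod my-cluster-masters-0 -n opensearch
```

Whether this is acceptable depends entirely on whether the data on that volume can be lost, which is the point made above.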
Thanks @swoehrl-mw for looking into it. I understand that the persistent volume is matched by the cluster and nodePool name and that it is my responsibility to use the volume with the correct node role. I have changed the nodePool with the persistent volume back to a data role, but the problem is that the operator still fails to recover the cluster from the wrong configuration. I am looking for a general way to recover the cluster when a pod fails to start because of a configuration mistake (the incorrect persistent volume attachment above is one example): either for the opensearch operator to ignore the failing pod, or to remove the failing pod and create a new one from the corrected specification, without tearing down the whole OpenSearch cluster. I am sure the spec has been applied / submitted to the kube API server by checking.
Current steps are like:
What to expect:
@axieatlas Can you please check the pods (with a
@swoehrl-mw I have deleted the whole cluster and removed the persistent volumes from my Kubernetes cluster.
Hi @swoehrl-mw
second apply: broken cluster
third apply: reverted the spec, but the cluster is still broken
describe opensearchcluster:
describe pod of the manager (expected to be cluster_manager, data after the third apply):
So it is the first case: the role of the pod after the rectified spec is not updated.
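As a side note, a hedged sketch of one way to compare what the applied spec requests with what is actually running; the cluster name, namespace, and field path are assumptions based on my reading of the operator's CRD, not taken from this issue:

```bash
# Illustrative names; substitute the real cluster name and namespace.

# Roles currently requested per node pool in the applied spec.
kubectl get opensearchcluster my-cluster -n opensearch \
  -o jsonpath='{.spec.nodePools[*].roles}'

# Pods and their status, to spot the ones that are stuck.
kubectl get pods -n opensearch

# Events and container state of a failing pod.
kubectl describe pod my-cluster-masters-0 -n opensearch
```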
Hi @axieatlas. I've reproduced your scenario. The problem is that automating such error handling runs the risk of data loss, so I don't think the operator really can (or should) do much.
What is the bug?
A pod cannot heal itself once it is in an unhealthy state.
How can one reproduce the bug?
I created a healthy cluster by defining all nodes with the cluster_manager and data roles. All of them use persistent volumes.
Then I changed a few nodes to be cluster_manager only while still keeping their persistent volumes (see the sketch after these steps).
I understand that this is not good practice; I am just exploring what the operator can handle.
Now the opensearch container of the pod cannot start due to an error.
Then I updated my OpenSearchCluster configuration file back to all nodes having the data role. However, the pod still tries to start as a non-data node and is stuck in Running status. That means applying the rectified file with kubectl apply does not bring the cluster back to life.
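To make the reproduction concrete, a minimal sketch of the kind of role change described above; the apiVersion, names, version, and persistence fields are illustrative and follow my reading of the operator's CRD rather than the actual spec used here:

```bash
# Abbreviated, hypothetical OpenSearchCluster spec; only the fields relevant
# to the role change are shown. Applying this over a pool that previously had
# roles [cluster_manager, data] reuses the old PVCs (and their shard data),
# which the now data-less nodes reject at startup.
kubectl apply -n opensearch -f - <<'EOF'
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: my-cluster
spec:
  general:
    serviceName: my-cluster
    version: 2.11.0
  nodePools:
    - component: masters
      replicas: 3
      diskSize: "10Gi"
      roles:
        - cluster_manager   # previously: cluster_manager and data
      persistence:
        pvc:
          accessModes:
            - ReadWriteOnce
EOF
```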
What is the expected behavior?
After the configuration file is fixed, the OpenSearchCluster should be able to discard the failing pods and create new pods with the updated spec.
What is your host/environment?
Linux
Do you have any screenshots?
No
Do you have any additional context?
No