
[BUG] Node cannot be recovered from error state #856

Open · axieatlas opened this issue Jul 8, 2024 · 6 comments
Labels: question (User questions. Neither a bug nor feature request.)

Comments

@axieatlas

What is the bug?

Pods cannot heal themselves once they end up in an unhealthy state.

How can one reproduce the bug?

I created a healthy cluster by defining all nodes with the cluster_manager and data roles. All of them use persistent volumes.
Then I decided to change a few nodes to be cluster_manager only while keeping their persistent volumes.
I understand this is not good practice, but I just wanted to find out what this operator can achieve.
Now the opensearch container of the pod cannot start due to the error

"node does not have the data role but has shard data".

Then I updated my OpenSearchCluster configuration file back to all nodes having the data role. However, the pod still tries to start as a non-data node and is stuck in Running status. That means applying the rectified file with kubectl apply cannot bring the cluster back to life.
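
For illustration, the role change that triggered the error looked roughly like the nodePool snippet below (a simplified sketch with placeholder names and versions; the exact CRD fields should be checked against the operator documentation):

    apiVersion: opensearch.opster.io/v1
    kind: OpenSearchCluster
    metadata:
      name: my-cluster              # placeholder cluster name
    spec:
      general:
        serviceName: my-cluster
        version: 2.14.0             # placeholder OpenSearch version
      nodePools:
        - component: masters        # placeholder node pool name
          replicas: 3
          diskSize: "30Gi"
          # Before: roles: ["cluster_manager", "data"]
          # After the change (breaks pods whose persistent volumes still
          # hold shard data):
          roles:
            - cluster_manager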

What is the expected behavior?

After fixing the operator configuration file, the OpenSearchCluster should discard the failed pods and create new pods with the updated spec.

What is your host/environment?

Linux

Do you have any screenshots?

No

Do you have any additional context?

No

axieatlas added the labels bug (Something isn't working) and untriaged (Issues that have not yet been triaged) on Jul 8, 2024
@swoehrl-mw (Collaborator)

Hi @axieatlas. The problem here is that the pods all have persistent storage that is reused (by design) across restarts and pod recreations. Unless the volume for a pod is deleted, it will always remember its role and report the error you posted. But IMO we cannot have the operator just delete volumes; that could lead to disaster and data loss. So if you change node roles in an existing node pool, it is your responsibility to deal with the existing data.
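
For example, if the roles of a node pool are changed deliberately and the old shard data on its volumes is disposable, one manual way to clear it is sketched below (PVC/pod names are placeholders; deleting a PVC permanently destroys the data on it):

    # List the PVCs backing the node pool's pods (names are placeholders).
    kubectl get pvc -n <namespace>

    # Only if the shard data on the volume is truly disposable:
    # delete the stuck pod's PVC and then the pod itself, so the
    # StatefulSet recreates both from the current spec.
    kubectl delete pvc data-my-cluster-masters-0 -n <namespace>
    kubectl delete pod my-cluster-masters-0 -n <namespace>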

swoehrl-mw added the label question (User questions. Neither a bug nor feature request.) and removed the labels bug (Something isn't working) and untriaged (Issues that have not yet been triaged) on Jul 10, 2024
@axieatlas (Author)

Thanks @swoehrl-mw for looking into it.

I understand that the persistent volume is matched by the cluster and nodePool name, and that it is my responsibility to use the volume with the correct node role. I have changed the nodePool with the persistent volume back to a data role; the problem here is that the operator still fails to recover the cluster from the wrong configuration.

I am looking for a general solution to recover the cluster when a pod fails to start because of a configuration mistake (the incorrect persistent volume attachment above is one example).

Is there a way for the OpenSearch operator to ignore the failed pod, or to remove it without tearing down the whole OpenSearch cluster and create a new pod from the correct specification? I am sure the spec has been applied/submitted to the kube API server (checked with kubectl describe opensearchcluster <my-cluster>), but the operator keeps restarting the pod with the old spec again and again.

Current behaviour:
spec 1 (correct): the operator creates the cluster.
spec 2 (incorrect): the operator updates the cluster node (pod) by node, then one pod gets stuck in a failed state.
spec 3 (revert to spec 1), kubectl apply: the operator cannot correct the failed pod; it stays in the failed state.

What I expect:
spec 1 (correct): the operator creates the cluster.
spec 2 (incorrect): the operator updates the cluster node by node, then one pod gets stuck in a failed state.
spec 3 (revert to spec 1), kubectl apply: the operator deletes the failed pod and recreates it with the correct spec; all pods return to healthy.

@swoehrl-mw (Collaborator)

@axieatlas Can you please check the pods (with kubectl describe pod <name>) to see whether they have the correct roles (in the environment variables as node.roles)? If not, it is potentially an operator problem; if they do, the "wrong" roles have already been persisted into the state of the OpenSearch node and the operator cannot deal with it.
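
For example (a sketch with placeholder pod names; the container name "opensearch" is an assumption):

    # Show the pod's environment, including node.roles.
    kubectl describe pod my-cluster-masters-0 -n <namespace> | grep node.roles

    # Or extract just the value (assumes the OpenSearch container is named "opensearch"):
    kubectl get pod my-cluster-masters-0 -n <namespace> \
      -o jsonpath='{.spec.containers[?(@.name=="opensearch")].env[?(@.name=="node.roles")].value}'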

@axieatlas (Author)

@swoehrl-mw I have deleted the whole cluster and removed the persistent volumes from my Kubernetes cluster.
The funny thing now is that I cannot even get a cluster running with the example opensearch-cluster.yaml. I will update here once I have a cluster working and have checked the pods for this case.

@axieatlas (Author)

axieatlas commented Jul 16, 2024

Hi @swoehrl-mw,
I have recreated a test cluster to verify the role-change case and the cluster not recovering.

First apply: working cluster
nodePool1: 3 nodes [cluster_manager, data]
nodePool2: 1 node [data]

Second apply: broken cluster
nodePool1: 3 nodes [cluster_manager]
nodePool2: 1 node [data]

Third apply: spec reverted, but the cluster is still broken
nodePool1: 3 nodes [cluster_manager, data]
nodePool2: 1 node [data]

Output of kubectl describe opensearchcluster after the third apply:

  Node Pools:
    Component:  manager
    Roles:
      cluster_manager
      data

Output of kubectl describe pod for a manager pod (expected to show cluster_manager, data after the third apply):

    Environment:
      node.roles:                    cluster_manager

So it is the first case: the pod's roles are not updated after the rectified spec is applied.

@swoehrl-mw (Collaborator)

Hi @axieatlas. I've reproduced your scenario.
The problem is that while the cluster is in this broken state it is not healthy, and the operator will not continue any restart/reconfigure operations. As a result, the operator waits forever for the first pod of the broken node pool to become healthy again.
The workaround is to manually delete the struggling pod (kubectl delete pod). Since the operator will already have updated the StatefulSet manifest, the pod will be recreated with the correct config (data + cluster_manager) and should come up successfully. The operator will then proceed with a rolling restart of the other pods, and everything should go back to normal.
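
A rough sketch of that manual workaround, with placeholder StatefulSet/pod names (the container name "opensearch" and the StatefulSet naming are assumptions; check the template first so the recreated pod really gets the corrected roles):

    # Confirm the StatefulSet's pod template already carries the corrected roles.
    kubectl get statefulset my-cluster-masters -n <namespace> \
      -o jsonpath='{.spec.template.spec.containers[?(@.name=="opensearch")].env[?(@.name=="node.roles")].value}'

    # Delete only the stuck pod; the StatefulSet recreates it from the updated template.
    kubectl delete pod my-cluster-masters-0 -n <namespace>

    # Watch the pod come back and the operator continue its rolling restart.
    kubectl get pods -n <namespace> -w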

The problem is that automating such error handling runs the risk of data loss, so I don't think the operator can really (or should) do much here.

getsaurabh02 moved this from 🆕 New to Backlog in the Engineering Effectiveness Board on Jul 18, 2024