[BUG] Node cannot be recovered from error state #856
Comments
Hi @axieatlas. The problem here is that the pods all have persistent storage that is reused (by design) across restarts and pod recreations. Unless the volume for a pod is deleted, it will always remember its role and report the error you posted. But IMO we cannot have the operator just delete volumes; that could lead to disaster and data loss. So if you change node roles in an existing nodePool, it is your responsibility to deal with the existing data.
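For illustration, a minimal sketch of what "dealing with existing data" could look like in practice, assuming the operator provisions one PVC per pod; the namespace, PVC, and pod names below are hypothetical placeholders, not taken from this issue:

```bash
# Hypothetical names: list the PVCs first to find the real ones.
# WARNING: deleting a PVC discards all OpenSearch data stored on it.

# 1. Find the PVC that backs the failing pod.
kubectl get pvc -n opensearch

# 2. Delete the stale PVC; it stays in "Terminating" until the pod is gone.
kubectl delete pvc data-my-cluster-masters-0 -n opensearch

# 3. Delete the stuck pod; the StatefulSet recreates it with a fresh,
#    empty volume that matches the new node role.
kubectl delete pod my-cluster-masters-0 -n opensearch
```

Whether this is acceptable depends entirely on whether the data on that volume can be lost, which is the point made above.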
Thanks @swoehrl-mw for looking into it. I understand that the persistent volume is matched by the cluster and nodePool name and that it is my responsibility to use the volume with the correct node role. I have changed the nodePool with the persistent volume back to a data role, but the problem is that the operator still fails to recover the cluster from the wrong configuration. I am looking for a general way to recover the cluster when a pod fails to start because of a configuration mistake (the incorrect persistent volume attachment above is one example): either for the opensearch operator to ignore the failing pod, or to remove the failing pod and create a new one from the corrected specification, without tearing down the whole OpenSearch cluster. I am sure the spec has been applied / submitted to the kube API server by checking.
Current steps are like:
What to expect:
@axieatlas Can you please check the pods (with a
@swoehrl-mw I have deleted the whole cluster and removed the persistent volumes from my Kubernetes cluster.
Hi @swoehrl-mw
second apply: broken cluster
third apply: reverted the spec, but the cluster is still broken
describe opensearchcluster:
describe pod of the manager (expected to be cluster_manager, data after the third apply):
So it is the first case: the role of the pod after the rectified spec is not updated.
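As a side note, a hedged sketch of one way to compare what the applied spec requests with what is actually running; the cluster name, namespace, and field path are assumptions based on my reading of the operator's CRD, not taken from this issue:

```bash
# Illustrative names; substitute the real cluster name and namespace.

# Roles currently requested per node pool in the applied spec.
kubectl get opensearchcluster my-cluster -n opensearch \
  -o jsonpath='{.spec.nodePools[*].roles}'

# Pods and their status, to spot the ones that are stuck.
kubectl get pods -n opensearch

# Events and container state of a failing pod.
kubectl describe pod my-cluster-masters-0 -n opensearch
```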
Hi @axieatlas. I've reproduced your scenario. The problem is that automating such error handling runs the risk of data loss, so I don't think the operator really can (or should) do much.
What is the bug?
A pod cannot heal itself once it is in an unhealthy state.
How can one reproduce the bug?
I created a healthy cluster by defining all nodes with the cluster_manager and data roles. All of them use persistent volumes.
Then I changed a few nodes to be cluster_manager only while still keeping their persistent volumes (see the sketch after these steps).
I understand that this is not good practice; I am just exploring what the operator can handle.
Now the opensearch container of the pod cannot start due to an error.
Then I updated my OpenSearchCluster configuration file back to all nodes having the data role. However, the pod still tries to start as a non-data node and is stuck in Running status. That means applying the rectified file with kubectl apply does not bring the cluster back to life.
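To make the reproduction concrete, a minimal sketch of the kind of role change described above; the apiVersion, names, version, and persistence fields are illustrative and follow my reading of the operator's CRD rather than the actual spec used here:

```bash
# Abbreviated, hypothetical OpenSearchCluster spec; only the fields relevant
# to the role change are shown. Applying this over a pool that previously had
# roles [cluster_manager, data] reuses the old PVCs (and their shard data),
# which the now data-less nodes reject at startup.
kubectl apply -n opensearch -f - <<'EOF'
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: my-cluster
spec:
  general:
    serviceName: my-cluster
    version: 2.11.0
  nodePools:
    - component: masters
      replicas: 3
      diskSize: "10Gi"
      roles:
        - cluster_manager   # previously: cluster_manager and data
      persistence:
        pvc:
          accessModes:
            - ReadWriteOnce
EOF
```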
What is the expected behavior?
After the configuration file is fixed, the OpenSearchCluster should be able to discard the failing pods and create new pods with the updated spec.
What is your host/environment?
Linux
Do you have any screenshots?
No
Do you have any additional context?
No