Big problems when half of the workers are down (2/4) #1258
Comments
I was able to force some of the pods to spread out properly by modifying the CSIScaleOperator CR and adding my own label to provisionerNodeSelector, resizerNodeSelector, snapshotterNodeSelector, and attacherNodeSelector.
But the main problem is still with the GUI pods, even if I make sure that one "survives". As soon as one of the two GUI pods goes down, the other one drops to 3/4 Running status. In this state PVs can no longer be attached:
Warning FailedAttachVolume 90s (x3 over 5m35s) attachdetach-controller AttachVolume.Attach failed for volume "pvc-c6be73ca-bf8b-4399-8848-4edc1451a1f0" : rpc error: code = Internal desc = ControllerPublishVolume : Error in getting filesystem Name for filesystem ID of AC10460D:6726F422. Error [rpc error: code = Internal desc = Response unmarshal failed: GET request https://ibm-spectrum-scale-gui-ibm-spectrum-scale.apps.ocp.x.x:443/scalemgmt/v2/filesystems?filter=uuid=AC10460D:6726F422, user: CsiAdmin, param: { }, response: &{503 Service Unavailable 503 HTTP/1.0 1 0 map[Cache-Control:[private, max-age=0, no-cache, no-store] Content-Type:[text/html] Pragma:[no-cache]] 0xc0003b65c0 -1 [] true false map[] 0xc000254a20 0xc000340000}, error json.Unmarshal failed invalid character '<' looking for beginning of value]
I suspect that quorum is being lost inside the CNSA cluster, but I don't understand why that is a problem. I thought the GUI doesn't need quorum and should serve the REST API no matter what. It is only there to communicate with the remote cluster, which of course is healthy.
I am starting to think that I should not be using CNSA and should just focus on Scale CSI. I am guessing I would not have these issues, or would I?
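For anyone reading the log line above: the final "invalid character '<'" part is what a JSON parser reports when it is handed an HTML page, which fits the 503 response with Content-Type text/html shown in the same message. The following is only an illustrative sketch in Python, not the CSI driver's actual code. The function name `decode_json_body` and the sample HTML body are made up for the example.

```python
import json

# Hypothetical stand-in for the HTML error page returned with the 503.
HTML_503 = "<html><body><h1>503 Service Unavailable</h1></body></html>"


def decode_json_body(status: int, content_type: str, body: str):
    """Parse a REST response body, rejecting non-JSON error pages first.

    Checking the status and content type up front gives a clearer error
    than letting the JSON parser trip over the leading '<' of an HTML page.
    """
    if status != 200 or "json" not in content_type.lower():
        raise ValueError(f"unexpected response: HTTP {status}, {content_type}")
    return json.loads(body)


if __name__ == "__main__":
    # Parsing the HTML page directly fails at the very first character.
    try:
        json.loads(HTML_503)
    except json.JSONDecodeError as err:
        print("naive parse failed:", err)

    # With the check in place, the failure names the real cause instead.
    try:
        decode_json_body(503, "text/html", HTML_503)
    except ValueError as err:
        print("checked parse failed:", err)
```

In other words, the unmarshal error is a symptom. The underlying problem is that the GUI route answered 503 instead of serving the REST API.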
Additional note: I am now fairly sure that the GUI pod that survived actually breaks when the CNSA cluster loses quorum. That's a problem, and I am not sure how to solve it with an even number of nodes (and an even number of physical servers, 4:2).
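To make the even-node concern concrete, here is a minimal sketch of the arithmetic. It assumes a simple majority rule (more than half of the quorum nodes must be up), that all four workers are quorum nodes, and that no tiebreaker is in play. None of this is confirmed for this particular CNSA deployment, and the function names are only illustrative.

```python
def majority_quorum(quorum_nodes: int) -> int:
    """Smallest node count that is a strict majority of quorum_nodes."""
    return quorum_nodes // 2 + 1


def has_quorum(quorum_nodes: int, nodes_up: int) -> bool:
    """True if the surviving nodes still form a strict majority."""
    return nodes_up >= majority_quorum(quorum_nodes)


if __name__ == "__main__":
    # (total quorum nodes, nodes still up)
    for total, up in [(4, 2), (4, 3), (3, 2), (5, 3)]:
        state = "quorum kept" if has_quorum(total, up) else "quorum lost"
        print(f"{up} of {total} nodes up: {state}")
```

Under these assumptions, losing 2 of 4 nodes leaves exactly half, which is not a majority. An even split across two physical servers therefore cannot survive the loss of either server.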
Hello,
We have Storage Scale configured via CNSA, connected to a remote Scale cluster.
I am having trouble keeping the system functional after losing 2 out of 4 worker nodes. I was expecting that whatever Scale needs in order to work would be automatically rescheduled onto the remaining nodes.
One observation I've made is that, in a setup with 4 worker nodes, the CNSA deployment creates:
1 provisioner pod,
2 attacher pods,
1 operator pod,
2 GUI pods,
2 pmcollector pods,
1 resizer pod, and
1 snapshotter pod.
When half of the workers go offline, some of the CNSA pods eventually reschedule onto the remaining nodes, but the process takes a significant amount of time. Many of the pods, including the attachers, go into CrashLoopBackOff.
The situation with the GUI pods is worse: they get stuck in a Terminating state and never reschedule.
Generally we end up in a very unhealthy scenario.
Once the workers are brought back online, everything recovers, and all pods return to normal operation. But that’s far from ideal.
My question is: Can we configure the critical CNSA ReplicaSets to maintain 4 replicas (ensuring there’s always at least one running, even if only one worker remains)? If so, how can we achieve this?
Alternatively, how can we force critical pods (for example the attacher) to run on specific worker nodes? This would help mitigate the issue, because I could make sure that the 2 worker nodes we lose are not the ones running both copies of the attacher/GUI/pmcollector pods. But that doesn't help with the provisioner; we need that one to scale beyond 1 replica.
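To illustrate the placement concern, here is a small sketch that checks which components would have no surviving replica for a given set of failed workers. The `PLACEMENT` mapping and the worker names are hypothetical, not the actual layout of our cluster. In practice the mapping would be filled in from the real pod-to-node assignment.

```python
from itertools import combinations

# Hypothetical pod placement: component -> workers hosting a replica.
PLACEMENT = {
    "provisioner": ["worker1"],
    "attacher": ["worker1", "worker2"],
    "gui": ["worker2", "worker3"],
    "pmcollector": ["worker3", "worker4"],
}


def components_lost(placement, failed):
    """Return components whose replicas all sit on failed workers."""
    failed = set(failed)
    return sorted(
        name for name, nodes in placement.items() if set(nodes) <= failed
    )


def failure_report(placement, workers, count):
    """Map every combination of `count` failed workers to the lost components."""
    return {
        combo: components_lost(placement, combo)
        for combo in combinations(sorted(workers), count)
    }


if __name__ == "__main__":
    workers = {"worker1", "worker2", "worker3", "worker4"}
    for combo, lost in failure_report(PLACEMENT, workers, 2).items():
        print(combo, "->", lost if lost else "nothing lost")
```

With this example layout, a two-replica component is lost only if both of its workers fail together. The single-replica provisioner is lost whenever its one worker is among the failed nodes, which is why pinning pods to nodes does not help for it.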
Thank you!