Diskpools are in "Created" state after 1 node crashed #1719
Update:
The pod trying to use this volume:
Another pod is logging:
While the volume is showing as online:
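A hedged sketch of how the volume state can be inspected with the kubectl-mayastor plugin (assumes the plugin is installed; the volume UUID is a placeholder):

```shell
# Sketch: checking volume state with the kubectl-mayastor plugin
kubectl mayastor get volumes
kubectl mayastor get volume <volume-uuid>   # placeholder UUID
```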
Noticed errors in csi-node:
The volume shows:
Using the deviceUri to locate the device on the node:
Used nvme-cli to manually attach, but it disappeared after a few minutes:
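For reference, a hedged sketch of the manual nvme-cli steps described above (target address, port, and NQN are placeholders and should be taken from the volume's deviceUri):

```shell
# Sketch: locating and manually connecting an NVMe-oF device
nvme list-subsys                                  # subsystems/paths currently connected
nvme connect -t tcp -a <target-ip> -s 8420 -n <subsystem-nqn>   # 8420 is assumed; use the port from the deviceUri
nvme list                                         # the namespace should appear as /dev/nvmeXnY
dmesg -w                                          # watch for disconnect/reconnect messages
```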
hmm, I think the loop of connected/disconnected messages is because staging is failing and keeps being retried:
But I'm not sure why we are getting this error. Could you share logs from the data-plane on that node?
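A minimal sketch of how those logs could be collected (label selector, container, and pod names are assumptions based on a default Mayastor install):

```shell
# Sketch: pulling data-plane (io-engine) logs from the affected node
kubectl -n mayastor get pods -l app=io-engine -o wide
kubectl -n mayastor logs <io-engine-pod-on-node3> -c io-engine --timestamps > io-engine-node3.log
```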
Attaching dmesg from node3. Tried a dump system, got errors such as:
Also:
I tried moving the pod that uses the volume. Attaching it here:
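The dump above was presumably generated with the plugin's dump command; a minimal sketch, assuming the "mayastor" namespace:

```shell
# Sketch: generating a support bundle with the kubectl-mayastor plugin
kubectl mayastor dump system -n mayastor
```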
Ah, I think that's why it's happening!
It's a harmless error; we need to change the dump tool to ignore this.
This is not expected! Are you using the plugin from 2.5.1 or 2.7.0?
Does it mean the filesystem is corrupted? I tried to recover the corrupted FS; that's not going to work very well if the FS is used for persistent storage. Also, in this setup, the volume is not staying connected even with nvme-cli:
I'm using the 2.7.0 plugin. I'll try with DEBUG level and report back.
When a node crashes, there's a chance of filesystem corruption. v2.5 also had a partial-rebuild bug, which we can't rule out as having done some damage if the volume has had partial rebuilds.
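If corruption is suspected, a non-destructive check can confirm it before any repair is attempted; a minimal sketch, assuming an ext4 filesystem, an unmounted volume, and a placeholder device path:

```shell
# Sketch: read-only filesystem check after reconnecting the volume
fsck.ext4 -n /dev/nvme1n1   # -n: report problems only, never modify the filesystem
```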
Great, thank you
The first dump, "mayastor-2024-08-19--19-40-59-UTC.tar.gz", has the logs. Running it again after 30 minutes didn't include the logs.
Update on the mayastor plugin returning an error:
List of running pods:
Attaching agent-core logs: mayastor-api-rest-d666fb647-9dv5m.log.gz
You seem to have configured 3 replicas for the api-rest; any reason for this?
Also etcd has a pending pod:
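A hedged way to confirm both points (resource names assumed from a default Helm install; the deployment name matches the pod name in the attached log):

```shell
# Sketch: confirming api-rest replica count and pending pods
kubectl -n mayastor get deploy mayastor-api-rest
kubectl -n mayastor get pods --field-selector=status.phase=Pending
kubectl -n mayastor describe pod <pending-etcd-pod>   # shows why it is unschedulable
```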
Anyway, moving on to the error, it's due to a timeout:
But on the agents-core there's nothing that would explain the timeout...
@Abhinandan-Purkait any ideas here?
I think it's because the rest-api isn't supposed to run with multiple instances, as it doesn't have any sort of load balancing or leader/slave mechanism. The timeouts are probably because the same request is being served by different instances of the rest-api and the locks held at the core-agent put one on wait? I believe it has nothing to do with connectivity or computing time due to load.
@veenadong could you please change agents.core.logLevel to debug? This should help provide more information. |
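For reference, one way to apply that with Helm (the release name "mayastor" and the chart reference are assumptions; adjust to your install):

```shell
# Sketch: raising core-agent log level to debug via Helm
helm upgrade mayastor mayastor/mayastor -n mayastor \
  --reuse-values \
  --set agents.core.logLevel=debug
```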
I just checked and I think the above can't be the case: in a Kubernetes ClusterIP service setup it is highly unlikely for two different pods to receive the exact same request under normal circumstances, since it does load balancing by default (roughly round robin). Yeah, it would be good to see debug logs.
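A quick way to see which api-rest endpoints actually back the ClusterIP service (service name is an assumption):

```shell
# Sketch: listing the api-rest service and its backing endpoints
kubectl -n mayastor get svc mayastor-api-rest
kubectl -n mayastor get endpoints mayastor-api-rest -o wide
```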
Upgraded from 2.5.1 to 2.7.0. After the upgrade, 1 node in a 3-node cluster crashed.
Currently running nodes:
List of pods running in mayastor namespace:
Versions of all the pods:
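For reference, listings like these are typically gathered with commands along these lines (a sketch; output omitted):

```shell
# Sketch: collecting node, pod, and image-version listings
kubectl get nodes -o wide
kubectl -n mayastor get pods -o wide
kubectl -n mayastor get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
```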
The mayastor plugin didn't capture any of the logs, so I captured them using kubectl:
mayastor.log.tar.gz
mayastor-2024-08-13--16-56-32-UTC.tar.gz
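A minimal sketch of capturing logs per pod with kubectl when the plugin's dump comes back empty (namespace assumed):

```shell
# Sketch: manually capturing logs for every pod in the mayastor namespace
for pod in $(kubectl -n mayastor get pods -o name); do
  kubectl -n mayastor logs "$pod" --all-containers --timestamps > "${pod##*/}.log"
done
```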