Pods stuck in CrashLoopBackoff when restarting custom EKS node. #2852
Comments
Do you have both Calico and VPC CNI? Do you know where the specific error message is coming from? |
Yep, we have both installed. The errors are in pod logs and it's somewhat random what pods have errors. Usually they are connection refused errors connecting to the Kube API or other pods, e.g.: Or connecting to another pod:
The exact errors vary by things like the language used and what they're connecting to. In all cases DNS works correctly, but the packets aren't routed to the other pod/service. Is there a secure way to send you logs and pod statuses? |
This is a strange error message.
I would expect the API path to be You can follow this troubleshooting doc - https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md and send the logs to '[email protected]' for us to investigate. I suspect that kube-proxy wasn't running when this error occurred, but the description of the error itself isn't typical either. |
It's not just the kubernetes api, it's basically random what services and pods can be connected and which can't e.g. a pod won't be able to connect to our rabbitmq service, or another one will be able to connect to rabbitmq, but won't connect to vault etc. We've fixed this by draining/cordoning the node on startup. I'll try tracking down the bundle of logs and sending them through. |
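The thread does not say exactly how the node is cordoned or drained on startup, so the following is only a rough sketch of what such a boot-time step could look like. The function name `startup_plan`, the node name `my-node`, and the choice of when to uncordon are all assumptions, not the reporter's actual script. It only builds and prints the `kubectl` commands; it does not run them.

```python
import shlex


def startup_plan(node: str, drain: bool = False,
                 drain_timeout_s: int = 120) -> list[list[str]]:
    """Return the kubectl commands to run, in order, when the node boots.

    Hypothetical helper: the original thread only says the node is
    drained/cordoned on startup, without giving the commands.
    """
    # Cordon first so the scheduler places no new pods on the node.
    plan = [["kubectl", "cordon", node]]
    if drain:
        plan.append([
            "kubectl", "drain", node,
            "--ignore-daemonsets",      # keep daemonset pods (e.g. CNI agents)
            "--delete-emptydir-data",   # allow evicting pods using emptyDir
            f"--timeout={drain_timeout_s}s",
        ])
    # Assumption: uncordon only after you have checked pod networking yourself.
    plan.append(["kubectl", "uncordon", node])
    return plan


# Print the commands instead of executing them (dry run).
for cmd in startup_plan("my-node", drain=True):
    print(shlex.join(cmd))
```

Evicting pods this way means they are recreated after the node's networking is up, which matches the race-condition theory in the report, but that link is a guess rather than a confirmed root cause.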
Was this node-specific behavior? If yes, perhaps something running on the node is changing iptables. Yes, logs will help. |
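One way to check whether the failures are node-specific is to count pods in CrashLoopBackOff per node. The sketch below assumes you feed it the JSON printed by `kubectl get pods --all-namespaces -o json`; the function name `crashlooping_by_node` and the sample data are made up for illustration.

```python
import json
from collections import Counter


def crashlooping_by_node(pod_list_json: str) -> Counter:
    """Count pods with a container waiting in CrashLoopBackOff, per node."""
    counts: Counter = Counter()
    for pod in json.loads(pod_list_json).get("items", []):
        node = pod.get("spec", {}).get("nodeName", "<unscheduled>")
        statuses = pod.get("status", {}).get("containerStatuses", [])
        # A container's state has exactly one of waiting/running/terminated.
        if any(
            (s.get("state", {}).get("waiting") or {}).get("reason")
            == "CrashLoopBackOff"
            for s in statuses
        ):
            counts[node] += 1
    return counts


# Hypothetical sample: one crash-looping pod on node-a, one healthy pod on node-b.
sample = json.dumps({"items": [
    {"spec": {"nodeName": "node-a"},
     "status": {"containerStatuses": [
         {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
    {"spec": {"nodeName": "node-b"},
     "status": {"containerStatuses": [{"state": {"running": {}}}]}},
]})
print(crashlooping_by_node(sample))
```

If nearly all counts land on the restarted node, that points at something on the node rather than at the cluster as a whole.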
Closing this as the customer was able to resolve this at the node level. |
This issue is now closed. Comments on closed issues are hard for our team to see. |
What happened:
We have a custom AMI that we deploy to EC2 and connect to an existing EKS cluster. We start and stop this node as needed to save costs. In addition the instance has state that we want to maintain across restarts i.e. we don't want to get a new node every restart.
Over the past 2 months we've noticed an issue where k8s doesn't restart properly. Some pods get stuck in CrashLoopBackoff when they try to connect to other pods or services. DNS resolves to the correct IP address, but packets aren't routed to the other pod correctly. It seems like a race condition, where pods start before the network is set up correctly. The only reliable fix we've found is to delete all pods and let k8s recreate them; this seems to set up the correct iptables rules.
Is there a better way to fix this? It kind of looks like projectcalico/calico#5135, but not sure if the problem is in Calico or AWS.
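For reference, the "delete all pods" workaround can be limited to the affected node with a field selector. This is a minimal sketch, assuming the node name is known; `delete_pods_on_node` and `my-node` are illustrative names, and the command is printed rather than executed.

```python
import shlex


def delete_pods_on_node(node: str, wait: bool = False) -> list[str]:
    """Build a kubectl command that deletes every pod scheduled on `node`."""
    return [
        "kubectl", "delete", "pods",
        "--all-namespaces",
        # Match only pods bound to this node.
        "--field-selector", f"spec.nodeName={node}",
        # By default kubectl waits for finalizers; skip that unless asked.
        f"--wait={'true' if wait else 'false'}",
    ]


print(shlex.join(delete_pods_on_node("my-node")))
```

Note that this also deletes daemonset pods on the node, which their controllers then recreate.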
Environment:
Kubernetes version: v1.27.10-eks-508b6b3
CNI Version: amazon-k8s-cni:v1.15.1-eksbuild.1
OS (e.g: cat /etc/os-release):
Kernel (e.g: uname -a):