
Connectivity to Kubernetes API service timeout #1797

Closed
ErikLundJensen opened this issue Dec 20, 2021 · 5 comments

Comments


ErikLundJensen commented Dec 20, 2021

What happened:

We have been running the AWS CNI for months, but today, when scaling the cluster from 0 to a few nodes, we ran into the following problem.

CoreDNS gets a timeout when trying to connect to the Kubernetes API at 10.100.0.1.

CNI uses 10.64.x.x networks for the pod networks.
Service CIDR is 10.100.0.0.
HostNetwork is 192.168.x.x.

When testing connectivity from other pods we see the same problem, except for the pods that use hostNetwork (192.168.x.x).
The Kubernetes Endpoints and Service look fine in the "default" namespace - I have compared them with another environment running the same setup.

kubectl get ep
NAME         ENDPOINTS                             AGE
kubernetes   192.168.0.12:443,192.168.0.91:443   99d

Where is the translation from Kubernetes ClusterIP 10.100.0.1 to Endpoint 192.168.0.12 resolved?
kube-proxy adds the kubernetes service port on each node, as seen in the kube-proxy log:
Adding new service port "default/kubernetes:https" at 10.100.0.1:443/TCP
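
For reference, with kube-proxy in iptables mode that translation lives in the nat table on each node; a minimal sketch of inspecting it on a worker node (the KUBE-SVC-... chain name is generated per service, so take it from the first command's output):

sudo iptables -t nat -S KUBE-SERVICES | grep 10.100.0.1
# the matching rule jumps to a KUBE-SVC-... chain; listing that chain shows
# the DNAT targets, which should be the 192.168.0.12 / 192.168.0.91 endpoints
sudo iptables -t nat -S KUBE-SVC-<hash-from-previous-output>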

The aws-node pod (running the custom networking CNI setup) does not report any errors in its logs.

Environment:

  • Kubernetes version:
    Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
  • EKS managed linux nodes:
  Kernel Version:             5.4.156-83.273.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.7
  Kubelet Version:            v1.21.5-eks-bc4871b
  Kube-Proxy Version:         v1.21.5-eks-bc4871b
  • Chart version: 1.1.10
@jayanthvn
Contributor

@ErikLundJensen - Please let us know if it is possible to open a support ticket. With the ticket, if you can share the cluster ARN, we can verify the iptables rules and also check whether it is a known issue such as kubernetes/client-go#374.

@achevuru
Contributor

@ErikLundJensen The CNI doesn't set up any iptables rules to facilitate API server access; kube-proxy is responsible for that, and going by the output you shared it does appear that there are valid endpoints behind the kubernetes service. Have you tried reaching the API server endpoints directly from the worker nodes? If so, did that work? It might be a generic connectivity issue between the worker nodes and the control plane instances, so it is good to validate that flow.
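
A quick way to validate that flow from a worker node, as a sketch (the IP is one of the endpoint addresses from the kubectl get ep output above):

curl -skv --max-time 5 https://192.168.0.12:443/version
# a completed TLS handshake or any HTTP response (even 401/403) confirms the node can reach the endpoint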

Also, I see that the hostNetwork and Pod IP ranges are different. Are you using custom networking with the CNI? And is there any reason you carved out both the Pod (10.64.x.x) and Service (10.100.x.x) IP ranges from the 10.0.0.0/8 range? I don't think this is feasible with EKS clusters, but you mentioned you are using EKS managed Linux nodes, so I'm curious.
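
For anyone following along, whether custom networking is enabled can be confirmed from the aws-node DaemonSet; a sketch using the documented AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG flag and the ENIConfig resources it relies on:

kubectl -n kube-system describe daemonset aws-node | grep AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG
# when custom networking is enabled, per-AZ pod subnets are defined by ENIConfig objects
kubectl get eniconfigs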

@ErikLundJensen
Author

Yes, we are using custom networking with the CNI. The subnet in each availability zone gets a block from 10.64.x.x (for example 10.64.64.0/18); however, the service IP range is not tied to any particular availability zone and therefore gets another CIDR block.

As @achevuru wrote, it is the responsibility of kube-proxy to set up the iptables rules mapping 10.100.0.1 to 192.168.0.12 and 192.168.0.91, so this issue is most likely not related to the AWS CNI custom networking.

However, this could either be due to the re-connection issue @jayanthvn mentioned, or related to
https://aws.amazon.com/premiumsupport/knowledge-center/eks-vpc-cni-plugin-api-server-failure/
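
To narrow it down, the symptom can be reproduced from a regular (non-hostNetwork) pod; a minimal sketch, assuming the public curlimages/curl image can be pulled in the cluster:

kubectl run api-test --rm -it --restart=Never --image=curlimages/curl -- \
  -skv --max-time 5 https://10.100.0.1/version
# a timeout here, while the same request from a hostNetwork pod or the node succeeds,
# points at the pod-network path rather than the control plane itself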

@ErikLundJensen
Copy link
Author

I recreated the cluster from scratch using Terraform and realised that a couple of security groups had not been destroyed. The result was a mix of old and new security groups sharing the same value for the Name tag, but with different generated security group names.
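
For anyone hitting the same thing, leftover security groups that share a Name tag can be spotted with the AWS CLI; a sketch where the tag value is a placeholder for the affected cluster's node security group name:

aws ec2 describe-security-groups \
  --filters "Name=tag:Name,Values=<your-node-sg-name>" \
  --query 'SecurityGroups[].[GroupId,GroupName,VpcId]' \
  --output table
# more than one row for the same Name tag indicates stale groups left behind by a previous apply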

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
