
Connectivity to Kubernetes API service timeout #1797

Closed
ErikLundJensen opened this issue Dec 20, 2021 · 5 comments

Comments


ErikLundJensen commented Dec 20, 2021

What happened:

We have been running the AWS CNI for months, but today, when scaling the cluster from 0 to a few nodes, we ran into the following problem.

CoreDNS gets a timeout when trying to connect to the Kubernetes API at 10.100.0.1.

CNI uses 10.64.x.x networks for the pod networks.
Service CIDR is 10.100.0.0.
HostNetwork is 192.168.x.x.

When testing connectivity from other pods we see the same problem, except for the pods that use hostNetwork (192.168.x.x).
The Kubernetes Endpoints and Service look fine in the "default" namespace - I have compared them with another environment running the same setup.

kubectl get ep
NAME         ENDPOINTS                             AGE
kubernetes   192.168.0.12:443,192.168.0.91:443   99d

Where is the translation from Kubernetes ClusterIP 10.100.0.1 to Endpoint 192.168.0.12 resolved?
kube-proxy adds the kubernetes service port on each node, as seen in the kube-proxy log:
Adding new service port "default/kubernetes:https" at 10.100.0.1:443/TCP
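
For reference, with kube-proxy in iptables mode that translation lives in the nat table on each node; a minimal sketch of inspecting it on a worker node (the KUBE-SVC-... chain name is generated per service, so take it from the first command's output):

sudo iptables -t nat -S KUBE-SERVICES | grep 10.100.0.1
# the matching rule jumps to a KUBE-SVC-... chain; listing that chain shows
# the DNAT targets, which should be the 192.168.0.12 / 192.168.0.91 endpoints
sudo iptables -t nat -S KUBE-SVC-<hash-from-previous-output>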

The aws-node pod (running the custom networking CNI setup) does not report any errors in its logs.

Environment:

  • Kubernetes version:
    Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
  • EKS managed linux nodes:
  Kernel Version:             5.4.156-83.273.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.7
  Kubelet Version:            v1.21.5-eks-bc4871b
  Kube-Proxy Version:         v1.21.5-eks-bc4871b
  • Chart version: 1.1.10
@jayanthvn
Contributor

@ErikLundJensen - Please let us know if it is possible to open a support ticket. With the ticket, if you can share the cluster ARN, we can verify the iptables rules and also check whether it is a known issue such as kubernetes/client-go#374.

@achevuru
Contributor

@ErikLundJensen The CNI doesn't set up any iptables rules to facilitate API server access; kube-proxy is responsible for that, and going by the output you shared it does appear that there are valid endpoints behind the kubernetes service. Have you tried reaching the API server endpoints directly from the worker nodes? If so, did that work? It might be a generic connectivity issue between the worker nodes and the control plane instances, so it is good to validate that flow.
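
A quick way to validate that flow from a worker node, as a sketch (the IP is one of the endpoint addresses from the kubectl get ep output above):

curl -skv --max-time 5 https://192.168.0.12:443/version
# a completed TLS handshake or any HTTP response (even 401/403) confirms the node can reach the endpoint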

Also, I see that the hostNetwork and Pod IP ranges are different. Are you using custom networking with the CNI? And is there any reason you carved out both the Pod (10.64.x.x) and Service (10.100.x.x) IP ranges from the 10.0.0.0/8 range? I don't think this is feasible with EKS clusters, but you mentioned you are using EKS managed Linux nodes, so I'm curious.
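
For anyone following along, whether custom networking is enabled can be confirmed from the aws-node DaemonSet; a sketch using the documented AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG flag and the ENIConfig resources it relies on:

kubectl -n kube-system describe daemonset aws-node | grep AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG
# when custom networking is enabled, per-AZ pod subnets are defined by ENIConfig objects
kubectl get eniconfigs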

@ErikLundJensen
Author

Yes, we are using custom networking with the CNI. The subnet in each availability zone gets a block from 10.64.x.x (for example 10.64.64.0/18); however, the service IP range is not tied to any particular availability zone and therefore gets another CIDR block.

As @achevuru wrote, it is the responsibility of kube-proxy to set up the iptables rules mapping 10.100.0.1 to 192.168.0.12 and 192.168.0.91, so this issue is most likely not related to the AWS CNI custom networking.

However, this could either be due to the re-connection issue @jayanthvn mentioned, or related to
https://aws.amazon.com/premiumsupport/knowledge-center/eks-vpc-cni-plugin-api-server-failure/
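
To narrow it down, the symptom can be reproduced from a regular (non-hostNetwork) pod; a minimal sketch, assuming the public curlimages/curl image can be pulled in the cluster:

kubectl run api-test --rm -it --restart=Never --image=curlimages/curl -- \
  -skv --max-time 5 https://10.100.0.1/version
# a timeout here, while the same request from a hostNetwork pod or the node succeeds,
# points at the pod-network path rather than the control plane itself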

@ErikLundJensen
Copy link
Author

I recreated the cluster from scratch using Terraform and realised that a couple of security groups had not been destroyed. The result was a mix of old and new security groups sharing the same value for the Name tag, but with different generated security group names.
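
For anyone hitting the same thing, leftover security groups that share a Name tag can be spotted with the AWS CLI; a sketch where the tag value is a placeholder for the affected cluster's node security group name:

aws ec2 describe-security-groups \
  --filters "Name=tag:Name,Values=<your-node-sg-name>" \
  --query 'SecurityGroups[].[GroupId,GroupName,VpcId]' \
  --output table
# more than one row for the same Name tag indicates stale groups left behind by a previous apply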

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
