"No overlay network" -> "overlay network" migration does not work #213

Open
ialidzhikov opened this issue Sep 27, 2022 · 0 comments
Labels
area/networking (Networking related), kind/bug (Bug), lifecycle/rotten (Nobody worked on this for 12 months, final aging stage)

ialidzhikov commented Sep 27, 2022

How to categorize this issue?

/area networking
/kind bug

What happened:
We discovered this issue while investigating the impact of gardener/gardener-extension-provider-aws#621. In the context of this bug:

  1. The Shoot gets created with "no overlay network"
      apiVersion: calico.networking.extensions.gardener.cloud/v1alpha1
      backend: none
      ipv4:
        mode: Never
      kind: NetworkConfig
  1. The Shoot is reconciled and the Network resource is changed to run "with overlay network". The providerConfig field is removed.

We discovered that after such a migration in a multi-zone cluster (say, zone-1 and zone-2), Pods running in zone-2, for example, cannot reach Pods running in zone-1.

In detail:

  1. Make sure that coredns Pods run in zone-1.
coredns-755f85d7d9-6hkxn                              1/1     Running       0            17s    100.64.2.16     ip-10-180-15-62.eu-central-1.compute.internal    <none>           <none>
coredns-755f85d7d9-j2w64                              1/1     Running       0            17s    100.64.2.15     ip-10-180-15-62.eu-central-1.compute.internal    <none>           <none>
  1. Create a Pod in zone-2.
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: debug-pod
  name: debug-pod
spec:
  containers:
  - args:
    - sleep
    - "1000000"
    image: gcr.io/google-containers/busybox
    name: debug-pod
    resources: {}
  restartPolicy: Never
  # nodeName: pick a zone-2 Node
  1. Verify that the debug-pod in zone-2 cannot reach the coredns Pods in zone-1.
$ kubectl exec -it debug-pod -- /bin/sh
/ # ping google.com
ping: bad address 'google.com'

DNS resolution fails because the debug-pod cannot reach the coredns Pods.
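
The reproduction steps above can be scripted roughly as follows. This is a sketch, not the commands used in the report: the `kube-system` namespace, the `k8s-app=kube-dns` label selector, and the helper function names are assumptions, and `debug-pod` is the Pod created in the second step.

```shell
# Hypothetical helpers for the reproduction steps; the names are illustrative.

# Show which node each coredns Pod runs on.
# ASSUMPTION: coredns runs in kube-system with the label k8s-app=kube-dns.
coredns_nodes() {
  kubectl -n kube-system get pods -l k8s-app=kube-dns \
    -o custom-columns=POD:.metadata.name,IP:.status.podIP,NODE:.spec.nodeName \
    --no-headers
}

# Show the zone of every node via the well-known topology label.
node_zones() {
  kubectl get nodes -L topology.kubernetes.io/zone
}

# Succeeds only if the given Pod can resolve a name through cluster DNS.
dns_works_from() {
  kubectl exec "$1" -- nslookup "${2:-google.com}" >/dev/null 2>&1
}
```

For example, `dns_works_from debug-pod || echo "DNS resolution failed"` would print the message in the broken state described above.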

After investigation we found an ippools.crd.projectcalico.org resource whose ipipMode field is set to Never:

apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  annotations:
    projectcalico.org/metadata: '{"uid":"746e13ec-29a7-43a5-88f4-d558ab1b3135","creationTimestamp":"2022-09-22T11:15:36Z"}'
  creationTimestamp: "2022-09-22T11:15:36Z"
  generation: 1
  name: default-ipv4-ippool
  resourceVersion: "1600"
  uid: 8002bf4f-b2a1-4b41-b700-086323ea373d
spec:
  allowedUses:
  - Workload
  - Tunnel
  blockSize: 26
  cidr: 100.96.0.0/11
  ipipMode: Never
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Never
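
To check the encapsulation settings without reading the whole resource, something like the following could be used. It is a minimal sketch: the function names are hypothetical, and the pool name defaults to the `default-ipv4-ippool` shown above.

```shell
# Print "<ipipMode> <vxlanMode>" for an IPPool (default: default-ipv4-ippool).
ippool_modes() {
  kubectl get ippools.crd.projectcalico.org "${1:-default-ipv4-ippool}" \
    -o jsonpath='{.spec.ipipMode} {.spec.vxlanMode}'
}

# Succeeds if either encapsulation mode is enabled, i.e. not "Never".
# $1 = ipipMode, $2 = vxlanMode
pool_uses_overlay() {
  [ "$1" != "Never" ] || [ "$2" != "Never" ]
}
```

A possible use is `pool_uses_overlay $(ippool_modes) || echo "pool has no overlay encapsulation"`, where the unquoted substitution is intentional so that the two modes become two arguments.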

ipipMode: Never means "no overlay network", but at this point the cluster should already be using an overlay network. The control plane monitoring also confirms that the tunl0 device carries no network traffic.

We had to manually patch the ipipMode field to be Always. After that the Pod to Pod communication between zones worked again. The control plane monitoring started reporting network traffic for the tunl0 device.
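
The report does not say which command was used for the manual patch. One way to do it, as a sketch with a hypothetical helper name, is a JSON merge patch against the IPPool resource shown earlier:

```shell
# Set ipipMode on an IPPool via a JSON merge patch.
# $1 = pool name (default: default-ipv4-ippool), $2 = mode (default: Always)
# ASSUMPTION: this is one possible way to apply the workaround, not
# necessarily the command the reporter ran.
patch_ipip_mode() {
  kubectl patch ippools.crd.projectcalico.org "${1:-default-ipv4-ippool}" \
    --type merge \
    -p "{\"spec\":{\"ipipMode\":\"${2:-Always}\"}}"
}
```

Calling `patch_ipip_mode` with no arguments patches `default-ipv4-ippool` to `Always`. Besides the control plane monitoring, running `ip -s link show tunl0` on a node is another way to look at the traffic counters of the tunnel device.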

What you expected to happen:
The "no overlay network" -> "overlay network" migration should work without issues. Alternatively, a validation should forbid such an update, or all known issues should be documented.

How to reproduce it (as minimally and precisely as possible):
See above.

Anything else we need to know?:

Environment:

  • Gardener version (if relevant):
  • Extension version: v1.26.0
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-robot gardener-robot added area/networking Networking related kind/bug Bug labels Sep 27, 2022
@ialidzhikov ialidzhikov changed the title "No overlay network" -> "overlay network" migration does not work "No overlay network" -> "overlay network" migration does not work for multi-zone clusters Sep 27, 2022
@ialidzhikov ialidzhikov changed the title "No overlay network" -> "overlay network" migration does not work for multi-zone clusters "No overlay network" -> "overlay network" migration does not work Sep 30, 2022
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jun 9, 2023
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Feb 16, 2024