
Felix Exited #7440

Open · sujan46 opened this issue Dec 18, 2024 · 18 comments

sujan46 commented Dec 18, 2024

Environmental Info:
RKE2 Version: v1.31.3+rke2r1

Node(s) CPU architecture, OS, and Version: x86_64, Ubuntu 22.04

Cluster Configuration: 3 controlplanes, 4 linux agents and 1 windows agent

Describe the bug: The RKE2 service reports the error "Felix exited", and non-HostProcess (non-hpc) pods lose connectivity to the internet and the Kubernetes gateway.

Steps To Reproduce:

  1. Create a new cluster on v1.31.3.
  2. Join a Windows node to the cluster.
  3. Launch a dummy pod and try to ping 8.8.8.8.
  4. After 10-15 minutes the rke2 service reports the error Felix Exited, but the service itself is not stopped.
  5. After restarting rke2 it comes back, but fails again after 10-15 minutes.
  • Installed RKE2:

Expected behavior:

ping 8.8.8.8 should receive replies.

Actual behavior:

ping 8.8.8.8 receives no replies, and nslookup breaks with the error DNS request timed out.
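
For reference, the check can be run from a newly launched pod like this (a sketch; the pod name is a placeholder and the commands assume the container image ships ping and nslookup):

  kubectl exec -it <windows-pod> -- ping 8.8.8.8
  kubectl exec -it <windows-pod> -- nslookup github.com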

Additional context / logs:

TimeWritten           ReplacementStrings
-----------           ------------------
12/17/2024 8:55:27 PM {Felix exited}
12/17/2024 8:40:20 PM {Running RKE2 kube-proxy [--bind-address=10.107.22.24 --enable-dsr=true
                      --feature-gates=WinDSR=true --network-name=Calico --source-vip=172.25.99.194
                      --cluster-cidr=172.25.0.0/17 --healthz-bind-address=127.0.0.1
                      --hostname-override=uls-ep-kubert28
                      --kubeconfig=C:\var\lib\rancher\rke2\agent\kubeproxy.kubeconfig --proxy-mode=kernelspace]}
12/17/2024 8:40:20 PM {WinDSR support is enabled}
12/17/2024 8:40:20 PM {HCN feature check, version={13 3} supportedFeatures={{true true true true} {true true} true
                      true true true true true true true true true true false false false false false}}
12/17/2024 8:40:20 PM {Reserved VIP for kube-proxy: 172.25.99.194}
12/17/2024 8:40:17 PM {Calico started correctly}
@manuelbuil (Contributor)

Can you check the felix logs and see if you get more information?

@sujan46 (Author) commented Dec 18, 2024

@manuelbuil I have a few warnings in the Felix logs:

2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 203: This is a stale endpoint with no container attached id="ddafb3be-6baa-4487-a292-1e7ad485bb9a" name="6418c83b6e6d2eeb0e52d4e264252cbb329c8c37fdab25cafadb543b9123f1bf_Calico"
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 203: This is a stale endpoint with no container attached id="fc0693ad-a9af-469e-a71a-6fadc20b0031" name="ba7714587f517d6353151e8c1a70b998ea2fed4b6d398071443a522b2a12bed2_Calico"
2024-12-18 05:25:34.403 [INFO][15024] felix/endpoint_mgr.go 560: Could not resolve hns endpoint id ip="172.25.17.196/32"
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 350: Failed to look up HNS endpoint for workload id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"grafana-loki/loki-canary-6jwh6", EndpointId:"eth0"}
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 440: Failed to look up one or more HNS endpoints; will schedule a retry
2024-12-18 05:25:34.403 [WARNING][15024] felix/win_dataplane.go 346: CompleteDeferredWork returned an error - scheduling a retry error=Endpoint could not be found
2024-12-18 05:27:09.696 [WARNING][15024] felix/l3_route_resolver.go 688: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=172.25.17.192
2024-12-18 05:27:09.696 [WARNING][15024] felix/l3_route_resolver.go 688: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=172.25.17.193

@sujan46 (comment marked as off-topic)

@sujan46 (Author) commented Dec 18, 2024

We also have address autodetection enabled for the Calico installation:

  installation:
    calicoNetwork:
      nodeAddressAutodetectionV4:
        canReach: <gateway ip>
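
To confirm which address autodetection actually picked on the Windows node, the annotation that calico-node writes can be checked (a sketch, using the node name from the kube-proxy log above as an example; it assumes Calico's standard projectcalico.org/IPv4Address node annotation):

  kubectl get node uls-ep-kubert28 \
    -o jsonpath='{.metadata.annotations.projectcalico\.org/IPv4Address}'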

@brandond (Member) commented Dec 18, 2024

failed to pull and unpack image "artifactory.xxxx.com:6609/rancher/rke2-runtime:v1.31.3-rke2r1-windows-amd64"

This is fine, it's not a real Windows image that is used to run a pod. This message can be ignored.

@sujan46 (Author) commented Dec 18, 2024

Added a fresh Windows node and still faced the same issue. When we start adding workloads (~40 pods) it breaks.

@manuelbuil (Contributor)

Perhaps you are running out of IPs on the Windows node? Check the output of:

kubectl get ipamblocks.crd.projectcalico.org $YOURWINDOWSNODECIDR -o yaml

That should provide further information
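
If the block name for the Windows node isn't known, something like this sketch can help find it (it assumes the standard IPAMBlock fields spec.cidr and spec.affinity; verify against your CRD):

  kubectl get ipamblocks.crd.projectcalico.org \
    -o custom-columns='NAME:.metadata.name,CIDR:.spec.cidr,AFFINITY:.spec.affinity'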

@sujan46 (Author) commented Dec 19, 2024

@manuelbuil We are facing the issue with just 30 pods. The steps I followed to reproduce:

  1. After the Felix Exited error, restarting the rke2 service temporarily fixes it.
  2. I launched 30 Windows pods. IP allocation seemed to happen fine; I was able to ping github.com, for example, from within the pods.
  3. Noticed we had enough IPs left for allocation. As soon as I started terminating the pods, I got the Felix Exited error.

IP allocation example.

  allocations:
  - 0
  - 0
  - 0
  - null
  - 14
  - 11
  - 6
  - null
  - 23
  - 16
  - null
  - 9

We even tried rke2 v1.28.15 and encountered the same error.

Just FYI, we reverted back to rke2 v1.28.10 with Calico v3.27.3 and everything seems to be working as expected. It seems like the latest rke2 version coupled with the new version of Calico causes these errors.

@dima-b commented Jan 8, 2025

I have a similar issue with v1.30.6 + Calico v3.28.2 and v1.31.3 + Calico v3.29.0.
Networking on the Linux nodes works fine, but for Windows nodes it stops working.
I have similar errors in the Felix log:

2025-01-06 07:01:07.275 [WARNING][5360] felix/endpoint_mgr.go 350: Failed to look up HNS endpoint for workload id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"ma/admin-859987cb67-bgtsr", EndpointId:"eth0"}
2025-01-06 07:01:07.275 [WARNING][5360] felix/endpoint_mgr.go 440: Failed to look up one or more HNS endpoints; will schedule a retry
2025-01-06 07:01:07.275 [WARNING][5360] felix/win_dataplane.go 346: CompleteDeferredWork returned an error - scheduling a retry error=Endpoint could not be found
An unrelated issue: during rke2 service startup on Windows there is also this error:

Error encountered while importing C:\var\lib\rancher\rke2\agent\images\runtime-image.txt: failed to pull images from C:\var\lib\rancher\rke2\agent\images\runtime-image.txt: rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/rancher/rke2-runtime:v1.31.3-rke2r1-windows-amd64": failed to extract layer sha256:a982c1cdcfe20bc827701769532a931379ec341822f0d096b394f4f5c46c8a6f: hcsshim::ProcessBaseLayer \\?\C:\var\lib\rancher\rke2\agent\containerd\io.containerd.snapshotter.v1.windows\snapshots\436: The system cannot find the path specified.: unknown

I could reproduce it by running ctr image pull docker.io/rancher/rke2-runtime:v1.31.3-rke2r1 with multiple versions
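
For reference, the pull that reproduces the hcsshim extraction error looks like this on the node, using the tag from the error message above (a sketch; depending on how containerd is exposed on the RKE2 Windows node, ctr may also need --address and --namespace flags):

  ctr image pull docker.io/rancher/rke2-runtime:v1.31.3-rke2r1-windows-amd64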

@hoppla20

I ran into the same issue today, with v1.29.11+rke2r1 and calico v3.29.0

@michaelrlowe

We ran into this as well after upgrading to RKE2 v1.31.3, which uses Calico v3.29.0. After discovering this report we downgraded to RKE2 v1.30.3, which is the latest RKE2 version with Calico v3.27.3, and the Windows nodes started working as expected again.

On our testing cluster we then upgraded to RKE2 v1.31.4, which includes a newer Calico version, v3.29.1, but the issue is still present.

@rbrtbnfgl (Contributor) commented Jan 15, 2025

This seems related to a bug in the hcsshim library included in Calico v3.29.0 and v3.29.1.

@dima-b commented Jan 17, 2025

This seems related to a bug in the hcsshim library included in Calico v3.29.0 and v3.29.1.

It is also reproduced in Calico v3.28.2, but all is good in v3.27.3, so the issue appeared between those two versions.

@noahbailey commented Jan 24, 2025

I believe I'm also having this issue. Windows and Linux nodes are all running v1.31.4+rke2r1.
It appears to happen after roughly the 10th pod is created on a Windows node, though it's a fairly modest 2C/4G lab VM. The result is the same: Felix exited is logged, and then there are reachability issues with subsequently created pods on the Windows server.
Is anybody aware of a workaround other than rolling back to 1.30.x?

@brandond (Member) commented Jan 24, 2025

I believe this should be fixed by

Any further discussion should occur on that PR. We will update Calico in RKE2 when a new release is available.

@apfrimme

I am also experiencing this issue with RKE2 v1.31.3 with the packaged Calico v3.29.1.

@brandond Thank you for addressing this in that PR. Is there a way for me to fix my current RKE2 v1.31.3 without having to roll back, or do I need to wait for the next public Calico release and for the RKE2 release that picks up that version?

@brandond (Member) commented Jan 24, 2025

There is no new release of Calico that contains the fix, so your only choice is to roll back to an unaffected old release.

@apfrimme

I was able to backport your fixes to the 3.29.1 release branch, and my cluster has been stable for 6 hours now. Previously my pods were losing connectivity within just a few minutes of being scheduled.

DISCLAIMER: I have no idea what other impacts this might have. The Windows worker pool is a very small part of my environment, and doing a full cluster rollback was not an option for us.

git clone --branch release-v3.29 https://github.com/projectcalico/calico.git
cd calico
Make edits to files as seen here: https://github.com/projectcalico/calico/pull/9505/files

felix/dataplane/windows/endpoint_mgr.go
Change just the single-line if statement.

go.mod
The version in the repo will differ from the one in the PR; that's fine, just replace what you find there instead.
Replace the hcsshim line and add:
github.com/containerd/containerd v1.7.23 // indirect
github.com/containerd/errdefs v0.3.0 // indirect
Make similar updates to the go.sum file.

Run "go mod tidy" from the root directory.

cd node

sudo make ./dist/bin/calico-node.exe

sudo make build-windows-archive

Copy the .zip from node/dist/ to the Windows worker nodes.
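
Collected into one place, the Linux-side build portion above looks roughly like this (a sketch only; it assumes the Calico build prerequisites from the repo are installed, and the endpoint_mgr.go / go.mod / go.sum edits still have to be applied by hand where the comment marks them):

  git clone --branch release-v3.29 https://github.com/projectcalico/calico.git
  cd calico
  # apply the endpoint_mgr.go and go.mod/go.sum edits from PR 9505 by hand here
  go mod tidy
  cd node
  sudo make ./dist/bin/calico-node.exe
  sudo make build-windows-archive
  # the resulting archive ends up under node/dist/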

On the Windows worker nodes:
Stop the RKE2 service.

Make a backup copy of the three calico* files under C:\var\lib\rancher\rke2\bin and delete the versions there.

Extract the zip and move these three files to replace what you deleted under C:\var\lib\rancher\rke2\bin:
/CalicoWindows/calico-node
/CalicoWindows/cni/calico
/CalicoWindows/cni/calico-ipam

Restart the RKE2 service.

Existing broken pods may not get connectivity back, but all new pods I've launched (hundreds of them) have worked.

I no longer observe the Felix Exited error in the logs either.
Hope this helps someone until the next proper Calico release and its incorporation into RKE2.
