
Felix Exited #7440

Open · sujan46 opened this issue Dec 18, 2024 · 18 comments

sujan46 commented Dec 18, 2024

Environmental Info:
RKE2 Version: v1.31.3+rke2r1

Node(s) CPU architecture, OS, and Version: x86_64, Ubuntu 22.04

Cluster Configuration: 3 controlplanes, 4 linux agents and 1 windows agent

Describe the bug: The RKE2 service reports the error "Felix exited", and non-HostProcess (non-hpc) pods lose connectivity to the internet and the Kubernetes gateway.

Steps To Reproduce:

  1. Create a new cluster on v1.31.3.
  2. Join a Windows node to the cluster.
  3. Launch a dummy pod and try to ping 8.8.8.8.
  4. After 10-15 minutes the rke2 service reports the error Felix Exited, but the service itself is not stopped.
  5. After restarting rke2 it comes back, but fails again after 10-15 minutes.
  • Installed RKE2:

Expected behavior:

ping 8.8.8.8 should receive replies.

Actual behavior:

ping 8.8.8.8 receives no replies, and nslookup breaks with the error DNS request timed out.
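
For reference, the check can be run from a newly launched pod like this (a sketch; the pod name is a placeholder and the commands assume the container image ships ping and nslookup):

  kubectl exec -it <windows-pod> -- ping 8.8.8.8
  kubectl exec -it <windows-pod> -- nslookup github.com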

Additional context / logs:

TimeWritten           ReplacementStrings
-----------           ------------------
12/17/2024 8:55:27 PM {Felix exited}
12/17/2024 8:40:20 PM {Running RKE2 kube-proxy [--bind-address=10.107.22.24 --enable-dsr=true
                      --feature-gates=WinDSR=true --network-name=Calico --source-vip=172.25.99.194
                      --cluster-cidr=172.25.0.0/17 --healthz-bind-address=127.0.0.1
                      --hostname-override=uls-ep-kubert28
                      --kubeconfig=C:\var\lib\rancher\rke2\agent\kubeproxy.kubeconfig --proxy-mode=kernelspace]}
12/17/2024 8:40:20 PM {WinDSR support is enabled}
12/17/2024 8:40:20 PM {HCN feature check, version={13 3} supportedFeatures={{true true true true} {true true} true
                      true true true true true true true true true true false false false false false}}
12/17/2024 8:40:20 PM {Reserved VIP for kube-proxy: 172.25.99.194}
12/17/2024 8:40:17 PM {Calico started correctly}
@manuelbuil (Contributor)

Can you check the felix logs and see if you get more information?

@sujan46 (Author) commented Dec 18, 2024

@manuelbuil I have a few warnings in the Felix logs:

2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 203: This is a stale endpoint with no container attached id="ddafb3be-6baa-4487-a292-1e7ad485bb9a" name="6418c83b6e6d2eeb0e52d4e264252cbb329c8c37fdab25cafadb543b9123f1bf_Calico"
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 203: This is a stale endpoint with no container attached id="fc0693ad-a9af-469e-a71a-6fadc20b0031" name="ba7714587f517d6353151e8c1a70b998ea2fed4b6d398071443a522b2a12bed2_Calico"
2024-12-18 05:25:34.403 [INFO][15024] felix/endpoint_mgr.go 560: Could not resolve hns endpoint id ip="172.25.17.196/32"
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 350: Failed to look up HNS endpoint for workload id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"grafana-loki/loki-canary-6jwh6", EndpointId:"eth0"}
2024-12-18 05:25:34.403 [WARNING][15024] felix/endpoint_mgr.go 440: Failed to look up one or more HNS endpoints; will schedule a retry
2024-12-18 05:25:34.403 [WARNING][15024] felix/win_dataplane.go 346: CompleteDeferredWork returned an error - scheduling a retry error=Endpoint could not be found
2024-12-18 05:27:09.696 [WARNING][15024] felix/l3_route_resolver.go 688: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=172.25.17.192
2024-12-18 05:27:09.696 [WARNING][15024] felix/l3_route_resolver.go 688: Unable to create route for IP; the node it belongs to was not recorded in IPAM IP=172.25.17.193

@sujan46 (comment marked as off-topic)

@sujan46 (Author) commented Dec 18, 2024

We also have address autodetection enabled for the Calico installation:

  installation:
    calicoNetwork:
      nodeAddressAutodetectionV4:
        canReach: <gateway ip>
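
To confirm which address autodetection actually picked on the Windows node, the annotation that calico-node writes can be checked (a sketch, using the node name from the kube-proxy log above as an example; it assumes Calico's standard projectcalico.org/IPv4Address node annotation):

  kubectl get node uls-ep-kubert28 \
    -o jsonpath='{.metadata.annotations.projectcalico\.org/IPv4Address}'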

@brandond (Member) commented Dec 18, 2024

failed to pull and unpack image "artifactory.xxxx.com:6609/rancher/rke2-runtime:v1.31.3-rke2r1-windows-amd64"

This is fine, it's not a real Windows image that is used to run a pod. This message can be ignored.

@sujan46 (Author) commented Dec 18, 2024

Added a fresh Windows node and still faced the same issue. When we start adding workloads (~40 pods) it breaks.

@manuelbuil (Contributor)

Perhaps you are running out of IPs on the Windows node? Check the output of:

kubectl get ipamblocks.crd.projectcalico.org $YOURWINDOWSNODECIDR -o yaml

That should provide further information
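
If the block name for the Windows node isn't known, something like this sketch can help find it (it assumes the standard IPAMBlock fields spec.cidr and spec.affinity; verify against your CRD):

  kubectl get ipamblocks.crd.projectcalico.org \
    -o custom-columns='NAME:.metadata.name,CIDR:.spec.cidr,AFFINITY:.spec.affinity'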

@sujan46 (Author) commented Dec 19, 2024

@manuelbuil We are facing the issue with just 30 pods. The steps I followed to reproduce:

  1. After the Felix Exited error, restarting the rke2 service temporarily fixes it.
  2. I launched 30 Windows pods. IP allocation seemed to happen fine; I was able to ping github.com, for example, from within the pods.
  3. Noticed we had enough IPs left for allocation. As soon as I started terminating the pods, I got the Felix Exited error.

IP allocation example.

  allocations:
  - 0
  - 0
  - 0
  - null
  - 14
  - 11
  - 6
  - null
  - 23
  - 16
  - null
  - 9

We even tried rke2 v1.28.15 and encountered the same error.

Just FYI, we reverted back to rke2 v1.28.10 with Calico v3.27.3 and everything seems to be working as expected. It seems like the latest rke2 version coupled with the new version of Calico causes these errors.

@dima-b commented Jan 8, 2025

I have a similar issue with v1.30.6 + Calico v3.28.2 and v1.31.3 + Calico v3.29.0.
Networking on the Linux nodes works fine, but for Windows nodes it stops working.
I have similar errors in the Felix log:

2025-01-06 07:01:07.275 [WARNING][5360] felix/endpoint_mgr.go 350: Failed to look up HNS endpoint for workload id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"ma/admin-859987cb67-bgtsr", EndpointId:"eth0"}
2025-01-06 07:01:07.275 [WARNING][5360] felix/endpoint_mgr.go 440: Failed to look up one or more HNS endpoints; will schedule a retry
2025-01-06 07:01:07.275 [WARNING][5360] felix/win_dataplane.go 346: CompleteDeferredWork returned an error - scheduling a retry error=Endpoint could not be found
An unrelated issue: during rke2 service startup on Windows there is also this error:

Error encountered while importing C:\var\lib\rancher\rke2\agent\images\runtime-image.txt: failed to pull images from C:\var\lib\rancher\rke2\agent\images\runtime-image.txt: rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/rancher/rke2-runtime:v1.31.3-rke2r1-windows-amd64": failed to extract layer sha256:a982c1cdcfe20bc827701769532a931379ec341822f0d096b394f4f5c46c8a6f: hcsshim::ProcessBaseLayer \\?\C:\var\lib\rancher\rke2\agent\containerd\io.containerd.snapshotter.v1.windows\snapshots\436: The system cannot find the path specified.: unknown

I could reproduce it by running ctr image pull docker.io/rancher/rke2-runtime:v1.31.3-rke2r1 with multiple versions
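
For reference, the pull that reproduces the hcsshim extraction error looks like this on the node, using the tag from the error message above (a sketch; depending on how containerd is exposed on the RKE2 Windows node, ctr may also need --address and --namespace flags):

  ctr image pull docker.io/rancher/rke2-runtime:v1.31.3-rke2r1-windows-amd64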

@hoppla20

I ran into the same issue today, with v1.29.11+rke2r1 and calico v3.29.0

@michaelrlowe

We ran into this as well after upgrading to RKE2 v1.31.3, which uses Calico v3.29.0. After discovering this report we downgraded to RKE2 v1.30.3, which is the latest RKE2 version with Calico v3.27.3, and the Windows nodes started working as expected again.

On our testing cluster we then upgraded to RKE2 v1.31.4, which includes a newer Calico version, v3.29.1, but the issue is still present.

@rbrtbnfgl (Contributor) commented Jan 15, 2025

This seems related to a bug in the hcsshim library included in Calico v3.29.0 and v3.29.1.

@dima-b commented Jan 17, 2025

This seems related to a bug in the hcsshim library included in Calico v3.29.0 and v3.29.1.

It is also reproduced in Calico v3.28.2, but all is good in v3.27.3, so the issue appeared between those two versions.

@noahbailey commented Jan 24, 2025

I believe I'm also having this issue. Windows and Linux nodes are all running v1.31.4+rke2r1.
It appears to happen after roughly the 10th pod is created on a Windows node, though it's a fairly modest 2C/4G lab VM. The result is the same: Felix exited is logged, and then there are reachability issues with subsequently created pods on the Windows server.
Is anybody aware of a workaround other than rolling back to 1.30.x?

@brandond (Member) commented Jan 24, 2025

I believe this should be fixed by

Any further discussion should occur on that PR. We will update Calico in RKE2 when a new release is available.

@apfrimme

I am also experiencing this issue with RKE2 v1.31.3 with the packaged Calico v3.29.1.

@brandond Thank you for addressing this in that PR. Is there a way for me to fix my current RKE2 v1.31.3 without having to roll back, or do I need to wait for the next public Calico release and for the RKE2 release that picks up that version?

@brandond (Member) commented Jan 24, 2025

There is no new release of Calico that contains the fix, so your only choice is to roll back to an unaffected old release.

@apfrimme

I was able to backport your fixes to the 3.29.1 release branch, and my cluster has been stable for 6 hours now. Previously my pods were losing connectivity within just a few minutes of being scheduled.

DISCLAIMER: I have no idea what other impacts this might have. The Windows worker pool is a very small part of my environment, and doing a full cluster rollback was not an option for us.

git clone --branch release-v3.29 https://github.com/projectcalico/calico.git
cd calico
Make edits to files as seen here: https://github.com/projectcalico/calico/pull/9505/files

felix/dataplane/windows/endpoint_mgr.go
Change just the single-line if statement.

go.mod
The version in the repo will differ from the one in the PR; that's fine, just replace what you find there instead.
Replace the hcsshim line and add:
github.com/containerd/containerd v1.7.23 // indirect
github.com/containerd/errdefs v0.3.0 // indirect
Make similar updates to the go.sum file.

Run "go mod tidy" from the root directory.

cd node

sudo make ./dist/bin/calico-node.exe

sudo make build-windows-archive

Copy the .zip from node/dist/ to the Windows worker nodes.
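
Collected into one place, the Linux-side build portion above looks roughly like this (a sketch only; it assumes the Calico build prerequisites from the repo are installed, and the endpoint_mgr.go / go.mod / go.sum edits still have to be applied by hand where the comment marks them):

  git clone --branch release-v3.29 https://github.com/projectcalico/calico.git
  cd calico
  # apply the endpoint_mgr.go and go.mod/go.sum edits from PR 9505 by hand here
  go mod tidy
  cd node
  sudo make ./dist/bin/calico-node.exe
  sudo make build-windows-archive
  # the resulting archive ends up under node/dist/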

On the Windows worker nodes:
Stop the RKE2 service.

Make a backup copy of the three calico* files under C:\var\lib\rancher\rke2\bin and delete the versions there.

Extract the zip and move these three files to replace what you deleted under C:\var\lib\rancher\rke2\bin:
/CalicoWindows/calico-node
/CalicoWindows/cni/calico
/CalicoWindows/cni/calico-ipam

Restart the RKE2 service.

Existing broken pods may not get connectivity back, but all new pods I've launched (hundreds of them) have worked.

I no longer observe the Felix Exited error in the logs either.
Hope this helps someone until the next proper Calico release and its incorporation into RKE2.
