[Release-1.27] - Race condition when rke2-windows-calico removes and creates an HNS network #5381

Closed
manuelbuil opened this issue Feb 8, 2024 · 1 comment
@manuelbuil

Backport of the fix for the race condition that occurs when rke2-windows-calico removes and creates an HNS network

@ShylajaDevadiga

Validated using rke2 version v1.27.11-rc1+rke2r1

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
Ubuntu 22.04
Windows Server 2022

Steps to reproduce and validate (following the steps in the PR)

1 - Deploy an rke2 server with Calico
2 - Deploy rke2-agent on Windows
3 - Once everything is up, run Stop-Service rke2 and then C:\usr\local\bin\rke2.exe agent service --delete
4 - Verify that at least one HNS network exists: Get-HnsNetwork
5 - Start rke2-agent on Windows again with debug: true (delete the node from the cluster first, or the agent will complain that a password already exists for it)

You should see at least these messages:

Deleting network: XXXXXX before starting calico

and

Calico is waiting for the interface with ip: XXXXXX to come back
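
To make this check less manual, the agent log can be scanned for both messages. The sketch below is only an illustration: the function name `check_agent_log` and the regular expressions are assumptions derived from the message text quoted above, not part of rke2.

```python
import re

# Patterns derived from the two debug messages quoted above. The network
# name and the IP differ per node, so both are captured loosely.
DELETING = re.compile(r"Deleting network: (?P<name>.+?) before starting calico")
WAITING = re.compile(
    r"Calico is waiting for the interface with ip: (?P<ip>\S+) to come back"
)


def check_agent_log(lines):
    """Return (deleted_networks, waited_ips) found in the given log lines."""
    deleted, waited = [], set()
    for line in lines:
        match = DELETING.search(line)
        if match:
            deleted.append(match.group("name"))
        match = WAITING.search(line)
        if match:
            waited.add(match.group("ip"))
    return deleted, waited


sample = [
    'time="2024-02-20T22:37:48Z" level=debug '
    'msg="Deleting network: Calico before starting calico"',
    'time="2024-02-20T22:37:52Z" level=debug '
    'msg="Calico is waiting for the interface with ip: 172.31.14.93 to come back"',
]
deleted, waited = check_agent_log(sample)
# Both messages present means the behaviour described in the PR was observed.
print("both messages seen:", bool(deleted) and bool(waited))
```

Both messages are logged at debug level, so the scan only finds them when the agent was started with debug: true.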

Created a cluster with 1 server, 1 Linux agent, and 1 Windows agent.

Reproduction results on rke2 version v1.27.10+rke2r1

ubuntu@ip-172-31-9-240:~$ rke2 -v
rke2 version v1.27.10+rke2r1 (915672bd6cab658edb974d0aedb33ec5a32c239a)
go version go1.20.13 X:boringcrypto
ubuntu@ip-172-31-9-240:~$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE     VERSION
ip-172-31-15-88.us-east-2.compute.internal   Ready    <none>                      5h39m   v1.27.10+rke2r1
ip-172-31-9-240.us-east-2.compute.internal   Ready    control-plane,etcd,master   5h41m   v1.27.10+rke2r1
ip-ac1f2610                                  Ready    <none>                      5h37m   v1.27.10
ubuntu@ip-172-31-9-240:~$ kubectl delete node ip-ac1f2610
node "ip-ac1f2610" deleted
ubuntu@ip-172-31-9-240:~$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE     VERSION
ip-172-31-15-88.us-east-2.compute.internal   Ready    <none>                      5h43m   v1.27.10+rke2r1
ip-172-31-9-240.us-east-2.compute.internal   Ready    control-plane,etcd,master   5h45m   v1.27.10+rke2r1

Logs:

time="2024-02-20T22:37:37Z" level=debug msg="hcsshim::HNSNetwork::Delete id=62E06721-5D16-48AB-AA27-D11F7AAD307E"
time="2024-02-20T22:37:37Z" level=debug msg="[DELETE]=>[/networks/62E06721-5D16-48AB-AA27-D11F7AAD307E] Request : "
time="2024-02-20T22:37:37Z" level=debug msg="hcsshim::HNSNetwork::Delete id=AF1A97B4-D99B-4B06-9920-B0C8E3A1EDD7"
time="2024-02-20T22:37:37Z" level=debug msg="[DELETE]=>[/networks/AF1A97B4-D99B-4B06-9920-B0C8E3A1EDD7] Request : "
time="2024-02-20T22:37:38Z" level=debug msg="evaluating if the interface: Ethernet with addresses [fe80::1193:ec4f:4d82:95e4/64], contains ip: 172.31.9.172"
time="2024-02-20T22:37:38Z" level=debug msg="evaluating if the interface: Loopback Pseudo-Interface 1 with addresses [::1/128 127.0.0.1/8], contains ip: 172.31.9.172"
time="2024-02-20T22:37:38Z" level=debug msg="evaluating if the interface: vEthernet (nat) with addresses [fe80::5d22:5a7c:f0bf:8f35/64 172.25.192.1/20], contains ip: 172.31.9.172"
time="2024-02-20T22:37:38Z" level=fatal msg="no interface has the ip: 172.31.9.172"
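
The fatal error follows directly from the interface state in the log: right after the HNS networks are deleted, none of the three interfaces carries 172.31.9.172. The following sketch replays that lookup with the addresses copied from the log above. It is written in Python for illustration only (rke2 itself is written in Go); the function name `interface_with_ip` and the exact-address comparison are assumptions, not the rke2 implementation.

```python
import ipaddress


def interface_with_ip(interfaces, wanted):
    """Return the name of the first interface that carries `wanted`, else None.

    `interfaces` maps an interface name to its addresses in CIDR notation,
    mirroring the "with addresses [...]" lists in the debug output.
    """
    target = ipaddress.ip_address(wanted)
    for name, cidrs in interfaces.items():
        for cidr in cidrs:
            # Compare the host address only; the prefix length is ignored.
            if ipaddress.ip_interface(cidr).ip == target:
                return name
    return None


# Interface state copied from the failing v1.27.10+rke2r1 log above.
during_hns_reset = {
    "Ethernet": ["fe80::1193:ec4f:4d82:95e4/64"],
    "Loopback Pseudo-Interface 1": ["::1/128", "127.0.0.1/8"],
    "vEthernet (nat)": ["fe80::5d22:5a7c:f0bf:8f35/64", "172.25.192.1/20"],
}
print(interface_with_ip(during_hns_reset, "172.31.9.172"))  # prints None
```

A single lookup at this moment cannot succeed, because the Ethernet interface has only its link-local IPv6 address until the node IP is restored.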

Validation results on rke2 version v1.27.11-rc1+rke2r1

ubuntu@ip-172-31-8-57:~$ rke2 -v
rke2 version v1.27.11-rc1+rke2r1 (de74ade94562b108e8a189b296c4d8d86894d288)
go version go1.21.7 X:boringcrypto
ubuntu@ip-172-31-8-57:~$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE     VERSION
ip-172-31-2-131.us-east-2.compute.internal   Ready    <none>                      5h37m   v1.27.11+rke2r1
ip-172-31-8-57.us-east-2.compute.internal    Ready    control-plane,etcd,master   5h40m   v1.27.11+rke2r1
ip-ac1f2610                                  Ready    <none>                      5h35m   v1.27.11
ubuntu@ip-172-31-8-57:~$ kubectl get nodes
NAME                                         STATUS     ROLES                       AGE     VERSION
ip-172-31-2-131.us-east-2.compute.internal   Ready      <none>                      5h38m   v1.27.11+rke2r1
ip-172-31-8-57.us-east-2.compute.internal    Ready      control-plane,etcd,master   5h40m   v1.27.11+rke2r1
ip-ac1f2610                                  NotReady   <none>                      5h35m   v1.27.11
ubuntu@ip-172-31-8-57:~$ kubectl delete node ip-ac1f2610
node "ip-ac1f2610" deleted
ubuntu@ip-172-31-8-57:~$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE     VERSION
ip-172-31-2-131.us-east-2.compute.internal   Ready    <none>                      5h41m   v1.27.11+rke2r1
ip-172-31-8-57.us-east-2.compute.internal    Ready    control-plane,etcd,master   5h43m   v1.27.11+rke2r1
ubuntu@ip-172-31-8-57:~$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE     VERSION
ip-172-31-2-131.us-east-2.compute.internal   Ready    <none>                      5h42m   v1.27.11+rke2r1
ip-172-31-8-57.us-east-2.compute.internal    Ready    control-plane,etcd,master   5h45m   v1.27.11+rke2r1
ip-ac1f2610                                  Ready    <none>                      38s     v1.27.11
ubuntu@ip-172-31-8-57:~$ 

Logs:

time="2024-02-20T22:37:48Z" level=debug msg="Deleting network: Calico before starting calico"
time="2024-02-20T22:37:48Z" level=debug msg="hcsshim::HNSNetwork::Delete id=E34A6AF4-26CA-4350-B6CA-4C873F4223BB"
time="2024-02-20T22:37:48Z" level=debug msg="[DELETE]=>[/networks/E34A6AF4-26CA-4350-B6CA-4C873F4223BB] Request : "
time="2024-02-20T22:37:48Z" level=debug msg="Deleting network: External before starting calico"
time="2024-02-20T22:37:48Z" level=debug msg="hcsshim::HNSNetwork::Delete id=9DC69FAA-1132-4992-8CC6-E53D4137C9B8"
time="2024-02-20T22:37:48Z" level=debug msg="[DELETE]=>[/networks/9DC69FAA-1132-4992-8CC6-E53D4137C9B8] Request : "
time="2024-02-20T22:37:52Z" level=debug msg="Calico is waiting for the interface with ip: 172.31.14.93 to come back"
time="2024-02-20T22:37:52Z" level=debug msg="evaluating if the interface: Ethernet with addresses [2600:1f16:1d38:1c00:f240:13f8:b4cc:706/128 fe80::20e3:24c6:9a48:355f/64 172.31.14.93/20], contains ip: 172.31.14.93"
time="2024-02-20T22:37:52Z" level=debug msg="Calico is waiting for the interface with ip: 172.31.14.93 to come back"
...
time="2024-02-20T22:38:15Z" level=info msg="Node ip-ac1f2610 registered. Calico can start"
time="2024-02-20T22:38:15Z" level=info msg="Calico Envs: [KUBE_NETWORK=Calico.* KUBECONFIG=c:\\var\\lib\\rancher\\rke2\\agent\\calico.kubeconfig NODENAME=ip-ac1f2610 CALICO_K8S_NODE_REF=ip-ac1f2610 IP=172.31.14.93 USE_POD_CIDR=false CALICO_NODENAME_FILE=c:\\var\\lib\\rancher\\rke2\\agent\\calico_node_name CALICO_NETWORKING_BACKEND=vxlan CALICO_DATASTORE_TYPE=kubernetes IP_AUTODETECTION_METHOD=first-found VXLAN_VNI=4096]"
time="2024-02-20T22:38:15Z" level=debug msg="[GET]=>[/endpoints/] Request :
...

time="2024-02-20T22:38:16Z" level=info msg="Calico started correctly"
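
The validation log suggests that the fixed agent polls for the node IP instead of failing on the first lookup. A minimal Python sketch of such a wait loop follows, for illustration only. The helper `wait_for_ip`, the `list_addresses` callable, and the `attempts` and `delay` values are all hypothetical; the log does not show the retry count or interval that rke2 actually uses.

```python
import time


def wait_for_ip(list_addresses, wanted, attempts=10, delay=1.0, sleep=time.sleep):
    """Poll until `wanted` appears in the addresses from `list_addresses`.

    `list_addresses` is a hypothetical callable returning the host's current
    IPs as strings. Returns the number of polls that were needed, or raises
    TimeoutError once `attempts` polls have failed.
    """
    for attempt in range(1, attempts + 1):
        if wanted in list_addresses():
            return attempt
        # Corresponds to the "waiting for the interface ... to come back" phase.
        sleep(delay)
    raise TimeoutError(f"no interface has the ip: {wanted}")


# Simulated host: the address is missing for two polls, then reappears.
snapshots = iter([[], ["127.0.0.1"], ["127.0.0.1", "172.31.14.93"]])
polls = wait_for_ip(lambda: next(snapshots), "172.31.14.93", sleep=lambda _: None)
print(f"address back after {polls} polls")
```

Injecting `sleep` keeps the loop testable without real delays; a real caller would leave the default `time.sleep` in place.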
