Automate the removal of older directories from /var/lib/rancher/rke2/data/* #3902
Comments
Still valid, but probably fairly low priority as I don't believe these directories occupy much space.
I'll take this on! I want to make sure my assumptions are correct. The files in /var/lib/rancher/rke2/data are just YAML files, which look like they are used to deploy the default manifests. The folder is built when the rke2-server service starts. Is it safe to assume that directories for older versions of rke2 are not needed, since they would be rebuilt on a downgrade/upgrade when the service is started? If so, could we just clear the directory at the end of the install.sh script?
No. The data dir does contain the charts to be deployed, but it also contains binaries that are needed to bootstrap the kubelet and container runtime, along with a few CLI tools.
root@rke2-server-1:/# ls -la /var/lib/rancher/rke2/data/v1.28.3-dev.2ea41d30-933ac7f7276f/bin/
total 325600
drwxr-xr-x 2 root root 4096 Nov 22 20:25 .
drwxr-xr-x 4 root root 4096 Nov 22 20:25 ..
-rwxr-xr-x 1 root root 60686856 Nov 22 20:25 containerd
-rwxr-xr-x 1 root root 8992552 Nov 22 20:25 containerd-shim
-rwxr-xr-x 1 root root 10653736 Nov 22 20:25 containerd-shim-runc-v1
-rwxr-xr-x 1 root root 14090784 Nov 22 20:25 containerd-shim-runc-v2
-rwxr-xr-x 1 root root 38483024 Nov 22 20:25 crictl
-rwxr-xr-x 1 root root 21035200 Nov 22 20:25 ctr
-rwxr-xr-x 1 root root 54670640 Nov 22 20:25 kubectl
-rwxr-xr-x 1 root root 112940560 Nov 22 20:25 kubelet
-rwxr-xr-x 1 root root 11811456 Nov 22 20:25 runc
It is extracted from the rancher/rke2-runtime image whenever rke2 starts, if the directories are missing.
My biggest concern would be with the containerd shims that are still using the old data dir. Remember that pods continue running even when rke2's containerd is stopped, so any pods that were created by the previous version will still use the shim from the previous version:
root@rke2-server-1:/# ps auxfww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
...
root 3296 0.0 0.3 726092 15080 pts/0 Sl 20:26 0:00 /var/lib/rancher/rke2/data/v1.28.3-dev.2ea41d30-933ac7f7276f/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3c84ef9bd38e44f5114e8091525fbcd172b96934db650cb557ce72c49659aa22 -address /run/k3s/containerd/containerd.sock -debug
65535 3315 0.0 0.0 972 512 ? Ss 20:26 0:00 \_ /pause
root 4765 0.0 1.2 769580 51328 ? Ssl 20:26 0:00 \_ /coredns -conf /etc/coredns/Corefile
root@rke2-server-1:/# xargs -n1 -0 -a /proc/3296/environ echo
PATH=/var/lib/rancher/rke2/agent/containerd/bin:/var/lib/rancher/rke2/data/v1.28.3-dev.2ea41d30-933ac7f7276f/bin:/var/lib/rancher/rke2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=rke2-server-1
TERM=xterm
KUBECONFIG=/etc/rancher/rke2/rke2.yaml
CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock
HOME=/root
RES_OPTIONS=
_K3S_LOG_REEXEC_=true
NO_PROXY=.svc,.cluster.local,10.42.0.0/16,10.43.0.0/16
NODE_NAME=rke2-server-1
LD_LIBRARY_PATH=/var/lib/rancher/rke2/agent/containerd/lib:
MAX_SHIM_VERSION=2
TTRPC_ADDRESS=/run/k3s/containerd/containerd.sock.ttrpc
GRPC_ADDRESS=/run/k3s/containerd/containerd.sock
NAMESPACE=k8s.io
GOMAXPROCS=4
The container will continue to exist with that path as long as the pod is running. If you clean up the data dir, any commands that use runc will fail:
root@rke2-server-1:/# kubectl exec -it -n kube-system rke2-coredns-rke2-coredns-6b795db654-x5hmv -- /bin/sh
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "7a459afe398165168b5e1a3b1073fcdc7e5b8990629f44e607149af5009420c4": OCI runtime exec failed: exec: "runc": executable file not found in $PATH: <nil>: unknown
In practice, this means that data dirs cannot be cleaned up until any pods that were started by that version of RKE2 have been recreated.
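The manual check above (ps plus /proc/3296/environ) could be automated by searching the command line and environment of every running process for paths under the data dir. The Python sketch below is illustrative only and is not how RKE2 handles this; the function name referenced_versions and the proc_root parameter are hypothetical, and it assumes that any version still in use shows up in either cmdline or environ, as it does for the shim shown above.

```python
import re
from pathlib import Path

DATA_DIR = "/var/lib/rancher/rke2/data"  # path taken from this thread


def referenced_versions(proc_root="/proc", data_dir=DATA_DIR):
    """Return the names of version directories under data_dir that appear in
    the command line or environment of any process listed in proc_root."""
    # Capture the path component that directly follows "<data_dir>/".
    pattern = re.compile(re.escape(data_dir.rstrip("/")) + r"/([^/:\x00\s]+)")
    found = set()
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric entries are processes
        for name in ("cmdline", "environ"):
            try:
                raw = (entry / name).read_bytes()
            except OSError:
                continue  # process exited, or file not readable
            found.update(pattern.findall(raw.decode("utf-8", errors="replace")))
    return found
```

Run as root on a node, referenced_versions() would return the set of version directory names that some process still points at; reading environ for other users' processes requires root.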
What if, when the rke2-server service starts, it identifies the versions those containers are using, as you did with the ps auxfww command, and only removes the directories for versions that are not being referenced? When the machine is restarted, the containers are recreated under the currently running version of rke2, so the folder is cleaned up over time. For example, we could add an ExecStartPost to the service file that contains the logic to perform this action only when the rke2 service starts successfully. Also, I'm only looking at this from the master nodes' perspective. I assume the rke2-agent would also need to be updated.
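The selection step proposed here can be sketched as a set difference: every version directory, minus the current one, minus any that a running process still references. This is a minimal Python sketch under those assumptions, not RKE2 code; removable_dirs, prune, and their parameters are hypothetical names, and the referenced set is assumed to come from a separate process scan.

```python
import shutil
from pathlib import Path


def removable_dirs(data_dir, current_version, referenced):
    """List version directories that are neither current nor still referenced."""
    keep = set(referenced) | {current_version}
    return sorted(
        p for p in Path(data_dir).iterdir()
        if p.is_dir() and not p.is_symlink() and p.name not in keep
    )


def prune(data_dir, current_version, referenced, dry_run=True):
    """Delete the removable directories, or only report them when dry_run is set."""
    targets = removable_dirs(data_dir, current_version, referenced)
    if not dry_run:
        for path in targets:
            shutil.rmtree(path)
    return targets
```

Defaulting to dry_run=True means a caller has to opt in to deletion, which seems prudent given that removing a directory that is still in use breaks exec into running pods.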
The core issue here, pods running with old binaries, affects the agent (kubelet+containerd) portion of the code. However, servers also run an agent, so it effectively affects both node types.
I would probably want to handle this in golang as part of the agent startup code, not in the systemd unit - just to avoid embedding too much more cruft in there.
Hello,
Environmental Info:
RKE2 Version: All versions of rke2
Node(s) CPU architecture, OS, and Version:
NA
Cluster Configuration:
NA
Describe the bug:
A dedicated version-specific directory is created in /var/lib/rancher/rke2/data/* on every rke2 k8s upgrade, and these directories consume disk space. To avoid running out of disk space, we have to delete the older directories manually (all except the current version's directory).
It would be good to have automation in place so that the older directories are removed without manual intervention.
Steps To Reproduce:
eg:
Expected behavior:
Only the directory for the current version (or for the current version and the two preceding versions) should exist. Other directories should be removed automatically.
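A retention rule like this could be sketched as follows. The Python below is only an illustration: dirs_beyond_retention and its parameters are hypothetical, and it assumes that directory modification time tracks install order, which may not hold on every system. The protected argument is there so that directories still in use by running pods can be excluded.

```python
from pathlib import Path


def dirs_beyond_retention(data_dir, keep_newest=3, protected=()):
    """Return version directories older than the keep_newest most recently
    modified ones, skipping any whose name is listed in protected."""
    skip = set(protected)
    dirs = [p for p in Path(data_dir).iterdir()
            if p.is_dir() and not p.is_symlink()]
    dirs.sort(key=lambda p: p.stat().st_mtime, reverse=True)  # newest first
    return [p for p in dirs[keep_newest:] if p.name not in skip]
```

With keep_newest=3 this keeps the current version and the two before it, under the modification-time assumption stated above.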
Actual behavior:
All version directories still exist.
Additional context / logs:
N/A