Issues with Kubevirt CPU throttling even with Guaranteed and static CPUManager #4954

Open · 4 tasks done
mega-alex opened this issue Sep 10, 2024 · 7 comments
Labels: bug (Something isn't working)

@mega-alex

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Platform

Linux 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Version

v1.30.4+k0s.0

Sysinfo

`k0s sysinfo`
Total memory: 503.4 GiB (pass)
Disk space available for /var/lib/k0s: 1.5 TiB (pass)
Name resolution: localhost: [::1 127.0.0.1] (pass)
Operating system: Linux (pass)
  Linux kernel release: 6.1.0-25-amd64 (pass)
  Max. file descriptors per process: current: 1048576 / max: 1048576 (pass)
  AppArmor: active (pass)
  Executable in PATH: modprobe: /usr/sbin/modprobe (pass)
  Executable in PATH: mount: /usr/bin/mount (pass)
  Executable in PATH: umount: /usr/bin/umount (pass)
  /proc file system: mounted (0x9fa0) (pass)
  Control Groups: version 2 (pass)
    cgroup controller "cpu": available (is a listed root controller) (pass)
    cgroup controller "cpuacct": available (via cpu in version 2) (pass)
    cgroup controller "cpuset": available (is a listed root controller) (pass)
    cgroup controller "memory": available (is a listed root controller) (pass)
    cgroup controller "devices": available (device filters attachable) (pass)
    cgroup controller "freezer": available (cgroup.freeze exists) (pass)
    cgroup controller "pids": available (is a listed root controller) (pass)
    cgroup controller "hugetlb": available (is a listed root controller) (pass)
    cgroup controller "blkio": available (via io in version 2) (pass)
  CONFIG_CGROUPS: Control Group support: built-in (pass)
    CONFIG_CGROUP_FREEZER: Freezer cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_PIDS: PIDs cgroup subsystem: built-in (pass)
    CONFIG_CGROUP_DEVICE: Device controller for cgroups: built-in (pass)
    CONFIG_CPUSETS: Cpuset support: built-in (pass)
    CONFIG_CGROUP_CPUACCT: Simple CPU accounting cgroup subsystem: built-in (pass)
    CONFIG_MEMCG: Memory Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_HUGETLB: HugeTLB Resource Controller for Control Groups: built-in (pass)
    CONFIG_CGROUP_SCHED: Group CPU scheduler: built-in (pass)
      CONFIG_FAIR_GROUP_SCHED: Group scheduling for SCHED_OTHER: built-in (pass)
        CONFIG_CFS_BANDWIDTH: CPU bandwidth provisioning for FAIR_GROUP_SCHED: built-in (pass)
    CONFIG_BLK_CGROUP: Block IO controller: built-in (pass)
  CONFIG_NAMESPACES: Namespaces support: built-in (pass)
    CONFIG_UTS_NS: UTS namespace: built-in (pass)
    CONFIG_IPC_NS: IPC namespace: built-in (pass)
    CONFIG_PID_NS: PID namespace: built-in (pass)
    CONFIG_NET_NS: Network namespace: built-in (pass)
  CONFIG_NET: Networking support: built-in (pass)
    CONFIG_INET: TCP/IP networking: built-in (pass)
      CONFIG_IPV6: The IPv6 protocol: built-in (pass)
    CONFIG_NETFILTER: Network packet filtering framework (Netfilter): built-in (pass)
      CONFIG_NETFILTER_ADVANCED: Advanced netfilter configuration: built-in (pass)
      CONFIG_NF_CONNTRACK: Netfilter connection tracking support: module (pass)
      CONFIG_NETFILTER_XTABLES: Netfilter Xtables support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_REDIRECT: REDIRECT target support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_COMMENT: "comment" match support: module (pass)
        CONFIG_NETFILTER_XT_MARK: nfmark target and match support: module (pass)
        CONFIG_NETFILTER_XT_SET: set target and match support: module (pass)
        CONFIG_NETFILTER_XT_TARGET_MASQUERADE: MASQUERADE target support: module (pass)
        CONFIG_NETFILTER_XT_NAT: "SNAT and DNAT" targets support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_ADDRTYPE: "addrtype" address type match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_CONNTRACK: "conntrack" connection tracking match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_MULTIPORT: "multiport" Multiple port match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_RECENT: "recent" match support: module (pass)
        CONFIG_NETFILTER_XT_MATCH_STATISTIC: "statistic" match support: module (pass)
      CONFIG_NETFILTER_NETLINK: module (pass)
      CONFIG_NF_NAT: module (pass)
      CONFIG_IP_SET: IP set support: module (pass)
        CONFIG_IP_SET_HASH_IP: hash:ip set support: module (pass)
        CONFIG_IP_SET_HASH_NET: hash:net set support: module (pass)
      CONFIG_IP_VS: IP virtual server support: module (pass)
        CONFIG_IP_VS_NFCT: Netfilter connection tracking: built-in (pass)
        CONFIG_IP_VS_SH: Source hashing scheduling: module (pass)
        CONFIG_IP_VS_RR: Round-robin scheduling: module (pass)
        CONFIG_IP_VS_WRR: Weighted round-robin scheduling: module (pass)
      CONFIG_NF_CONNTRACK_IPV4: IPv4 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_REJECT_IPV4: IPv4 packet rejection: module (pass)
      CONFIG_NF_NAT_IPV4: IPv4 NAT: unknown (warning)
      CONFIG_IP_NF_IPTABLES: IP tables support: module (pass)
        CONFIG_IP_NF_FILTER: Packet filtering: module (pass)
          CONFIG_IP_NF_TARGET_REJECT: REJECT target support: module (pass)
        CONFIG_IP_NF_NAT: iptables NAT support: module (pass)
        CONFIG_IP_NF_MANGLE: Packet mangling: module (pass)
      CONFIG_NF_DEFRAG_IPV4: module (pass)
      CONFIG_NF_CONNTRACK_IPV6: IPv6 connetion tracking support (required for NAT): unknown (warning)
      CONFIG_NF_NAT_IPV6: IPv6 NAT: unknown (warning)
      CONFIG_IP6_NF_IPTABLES: IP6 tables support: module (pass)
        CONFIG_IP6_NF_FILTER: Packet filtering: module (pass)
        CONFIG_IP6_NF_MANGLE: Packet mangling: module (pass)
        CONFIG_IP6_NF_NAT: ip6tables NAT support: module (pass)
      CONFIG_NF_DEFRAG_IPV6: module (pass)
    CONFIG_BRIDGE: 802.1d Ethernet Bridging: module (pass)
      CONFIG_LLC: module (pass)
      CONFIG_STP: module (pass)
  CONFIG_EXT4_FS: The Extended 4 (ext4) filesystem: module (pass)
  CONFIG_PROC_FS: /proc file system support: built-in (pass)

What happened?

When deploying a KubeVirt VM, we can't achieve 100% CPU usage without throttling. This happens despite enabling the static CPUManager policy, the VM pod being assigned the Guaranteed QoS class, and the KubeVirt options for dedicated CPU placement being set. We don't see this issue with the same manifest on other Kubernetes distributions such as rke2, where we can achieve 100% utilization of the cores.
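A quick way to sanity-check the CPU pinning side of this is the kubelet's CPU manager state file, which records the active policy and the exclusively assigned CPUs. A minimal sketch, assuming k0s's default kubelet root directory (adjust the path if your setup differs):

```sh
# Show the CPU manager policy and the exclusive CPU assignments on the worker node.
# The path assumes k0s's default kubelet root dir; adjust if you override it.
cat /var/lib/k0s/kubelet/cpu_manager_state
# Illustrative output shape:
# {"policyName":"static","defaultCpuSet":"0,17-31","entries":{"<pod-uid>":{"compute":"1-16"}},...}
```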

Steps to reproduce

  1. Deploy k0s with kubelet extra args for CPU manager

     k0s installFlags:

     ```yaml
     installFlags:
       - --debug
       - --kubelet-extra-args="--config-dir=/etc/kubernetes/kubelet.conf.d --cpu-manager-policy=static --kube-reserved=cpu=1,memory=1Gi"
       # Label for sriov-network-operator & openebs topology
       - --labels="feature.node.kubernetes.io/network-sriov.capable=true"
     k0s:
     ```

  2. Deploy a VM that should have dedicated CPU resources

     test-vm.yaml:

      ```yaml
     apiVersion: kubevirt.io/v1
     kind: VirtualMachine
     metadata:
       name: fedora-test
       namespace: vm-images
       # annotations:
       #   cdi.kubevirt.io/storage.bind.immediate.requested: "true"
     spec:
       # runStrategy: Always
       template:
         spec:
           domain:
             ioThreadsPolicy: auto
             cpu:
                cores: 16
                model: host-model
                dedicatedCpuPlacement: true
                numa:
                  guestMappingPassthrough: { }
             memory:
               hugepages:
                 pageSize: 1Gi
             resources:
               limits:
                 memory: 64Gi
             devices:
               # autoattachSerialConsole: true
               # autoattachMemBalloon: false
               # autoattachGraphicsDevice: false
               disks:
                 - name: containerdisk
                   disk:
                     bus: virtio
                 - name: cloudinitdisk
                   disk:
                     bus: virtio
               interfaces:
                 - masquerade: {}
                   pciAddress: "0000:09:00.0"
                   name: default
               # rng: {}
           networks:
           - name: default
             pod: {}
           terminationGracePeriodSeconds: 10
           volumes:
             - name: containerdisk
               containerDisk:
                 image: kubevirt/fedora-cloud-container-disk-demo:latest
             - name: cloudinitdisk
               cloudInitNoCloud:
                 userData: |-
                   #cloud-config
                   password: fedora
                    chpasswd: { expire: False }
      ```

  3. Observe in htop, as well as in the cgroup stats, that the CPU is being throttled. You can run `for i in $(seq $(getconf _NPROCESSORS_ONLN)); do yes > /dev/null & done` inside the VM to pin all CPUs at 100% (see the cgroup inspection sketch after this list).

     ```
     $ cat /sys/fs/cgroup/kubepods/pod8ca0d105-35a8-49b9-8c5b-18527438da41/28564cb019cb7f79c96149f2c8a4505d21cb875a17e01c34658e3a279c4b0e26

     usage_usec 112462632150
     user_usec 111809657955
     system_usec 652974194
     nr_periods 94939
     nr_throttled 28574
     throttled_usec 3554215264
     nr_bursts 0
     burst_usec 0
     ```
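For completeness, the check above relies only on standard cgroup v2 interface files. A minimal sketch for inspecting the pod-level cgroup directly (the pod UID is the one from the output above; substitute your own):

```sh
# Pod-level cgroup of the throttled VM pod (cgroupfs-driver layout, pod UID from the output above)
POD_CG=/sys/fs/cgroup/kubepods/pod8ca0d105-35a8-49b9-8c5b-18527438da41

# "max 100000" means no CFS quota; "1600000 100000" means a 16-CPU quota is enforced
cat "$POD_CG/cpu.max"

# nr_throttled / throttled_usec show how often and for how long the CFS quota throttled the pod
cat "$POD_CG/cpu.stat"
```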
    

Expected behavior

We would expect the CPU to actually be pinned at 100% with no throttling.

Actual behavior

The CPU is throttled and the VM can't achieve 100% CPU utilization on the host. Interestingly, this doesn't seem to be an issue for regular pods, which can use 100% CPU. We also see that the VM is active on all requested cores, and that the only processes scheduled on those cores are the KubeVirt ones.

Screenshots and logs

No response

Additional context

Here's the kubelet config for each host, extracted from the running cluster.
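For anyone reproducing this, a dump like the ones below can be pulled from a running node via the kubelet's configz endpoint; a sketch, assuming kubectl access to the cluster and jq installed, with <node-name> as a placeholder:

```sh
# Dump the effective kubelet configuration of a node (replace <node-name>)
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" | jq .
```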

kubelet config (k0s)

```json
{
  "kubeletconfig": {
    "enableServer": true,
    "podLogsDir": "/var/log/pods",
    "syncFrequency": "1m0s",
    "fileCheckFrequency": "20s",
    "httpCheckFrequency": "20s",
    "address": "0.0.0.0",
    "port": 10250,
    "tlsCipherSuites": [
      "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256"
    ],
    "tlsMinVersion": "VersionTLS12",
    "rotateCertificates": true,
    "serverTLSBootstrap": true,
    "authentication": {
      "x509": {
        "clientCAFile": "/var/lib/k0s/pki/ca.crt"
      },
      "webhook": {
        "enabled": true,
        "cacheTTL": "2m0s"
      },
      "anonymous": {
        "enabled": false
      }
    },
    "authorization": {
      "mode": "Webhook",
      "webhook": {
        "cacheAuthorizedTTL": "5m0s",
        "cacheUnauthorizedTTL": "30s"
      }
    },
    "registryPullQPS": 5,
    "registryBurst": 10,
    "eventRecordQPS": 0,
    "eventBurst": 100,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "127.0.0.1",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": [
      "10.96.0.10"
    ],
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeStatusReportFrequency": "5m0s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageMaximumGCAge": "0s",
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "volumeStatsAggPeriod": "1m0s",
    "kubeletCgroups": "/system.slice/containerd.service",
    "cgroupsPerQOS": true,
    "cgroupDriver": "cgroupfs",
    "cpuManagerPolicy": "static",
    "cpuManagerReconcilePeriod": "10s",
    "memoryManagerPolicy": "None",
    "topologyManagerPolicy": "none",
    "topologyManagerScope": "container",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "promiscuous-bridge",
    "maxPods": 110,
    "podPidsLimit": -1,
    "resolvConf": "/etc/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "nodeStatusMaxImages": 50,
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 50,
    "kubeAPIBurst": 100,
    "serializeImagePulls": true,
    "evictionHard": {
      "imagefs.available": "15%",
      "imagefs.inodesFree": "5%",
      "memory.available": "100Mi",
      "nodefs.available": "10%",
      "nodefs.inodesFree": "5%"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "enableControllerAttachDetach": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "failSwapOn": false,
    "memorySwap": {},
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "containerLogMaxWorkers": 1,
    "containerLogMonitorInterval": "10s",
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "kubeReserved": {
      "cpu": "1",
      "memory": "1Gi"
    },
    "kubeReservedCgroup": "system.slice",
    "enforceNodeAllocatable": [
      "pods"
    ],
    "volumePluginDir": "/usr/libexec/k0s/kubelet-plugins/volume/exec",
    "logging": {
      "format": "text",
      "flushFrequency": "5s",
      "verbosity": 1,
      "options": {
        "text": {
          "infoBufferSize": "0"
        },
        "json": {
          "infoBufferSize": "0"
        }
      }
    },
    "enableSystemLogHandler": true,
    "enableSystemLogQuery": false,
    "shutdownGracePeriod": "0s",
    "shutdownGracePeriodCriticalPods": "0s",
    "enableProfilingHandler": true,
    "enableDebugFlagsHandler": true,
    "seccompDefault": false,
    "memoryThrottlingFactor": 0.9,
    "registerNode": true,
    "localStorageCapacityIsolation": true,
    "containerRuntimeEndpoint": "unix:///run/k0s/containerd.sock"
  }
}
```

kubelet config (rke2, working)

```json
{
  "kubeletconfig": {
    "enableServer": true,
    "staticPodPath": "/var/lib/rancher/rke2/agent/pod-manifests",
    "podLogsDir": "/var/log/pods",
    "syncFrequency": "30s",
    "fileCheckFrequency": "5s",
    "httpCheckFrequency": "20s",
    "address": "0.0.0.0",
    "port": 10250,
    "tlsCertFile": "/var/lib/rancher/rke2/agent/serving-kubelet.crt",
    "tlsPrivateKeyFile": "/var/lib/rancher/rke2/agent/serving-kubelet.key",
    "authentication": {
      "x509": {
        "clientCAFile": "/var/lib/rancher/rke2/agent/client-ca.crt"
      },
      "webhook": {
        "enabled": true,
        "cacheTTL": "2m0s"
      },
      "anonymous": {
        "enabled": false
      }
    },
    "authorization": {
      "mode": "Webhook",
      "webhook": {
        "cacheAuthorizedTTL": "5m0s",
        "cacheUnauthorizedTTL": "30s"
      }
    },
    "registryPullQPS": 5,
    "registryBurst": 10,
    "eventRecordQPS": 50,
    "eventBurst": 100,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "127.0.0.1",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": [
      "10.43.0.10"
    ],
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeStatusReportFrequency": "5m0s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageMaximumGCAge": "0s",
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "volumeStatsAggPeriod": "1m0s",
    "cgroupsPerQOS": true,
    "cgroupDriver": "systemd",
    "cpuManagerPolicy": "static",
    "cpuManagerReconcilePeriod": "10s",
    "memoryManagerPolicy": "Static",
    "topologyManagerPolicy": "restricted",
    "topologyManagerScope": "pod",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "promiscuous-bridge",
    "maxPods": 110,
    "podPidsLimit": -1,
    "resolvConf": "/etc/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "nodeStatusMaxImages": 50,
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 50,
    "kubeAPIBurst": 100,
    "serializeImagePulls": false,
    "evictionHard": {
      "imagefs.available": "5%",
      "nodefs.available": "5%"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "evictionMinimumReclaim": {
      "imagefs.available": "10%",
      "nodefs.available": "10%"
    },
    "enableControllerAttachDetach": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "featureGates": {
      "CloudDualStackNodeIPs": true
    },
    "failSwapOn": false,
    "memorySwap": {},
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "containerLogMaxWorkers": 1,
    "containerLogMonitorInterval": "10s",
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "systemReserved": {
      "cpu": "2",
      "memory": "1000Mi"
    },
    "kubeReserved": {
      "memory": "2000Mi"
    },
    "reservedSystemCPUs": "0,28",
    "enforceNodeAllocatable": [
      "pods"
    ],
    "volumePluginDir": "/var/lib/kubelet/volumeplugins",
    "logging": {
      "format": "text",
      "flushFrequency": "5s",
      "verbosity": 0,
      "options": {
        "text": {
          "infoBufferSize": "0"
        },
        "json": {
          "infoBufferSize": "0"
        }
      }
    },
    "enableSystemLogHandler": true,
    "enableSystemLogQuery": false,
    "shutdownGracePeriod": "0s",
    "shutdownGracePeriodCriticalPods": "0s",
    "reservedMemory": [
      {
        "numaNode": 0,
        "limits": {
          "memory": "1500Mi"
        }
      },
      {
        "numaNode": 1,
        "limits": {
          "memory": "1500Mi"
        }
      }
    ],
    "enableProfilingHandler": true,
    "enableDebugFlagsHandler": true,
    "seccompDefault": false,
    "memoryThrottlingFactor": 0.9,
    "registerNode": true,
    "localStorageCapacityIsolation": true,
    "containerRuntimeEndpoint": "unix:///run/k3s/containerd/containerd.sock"
  }
}
```
mega-alex added the bug label on Sep 10, 2024
@jnummelin (Member)

I think this is closely related to, if not the same story as, #4319.

@jnummelin (Member)

Looking at the "semantic" diff between the k0s and rke2 kubelet configs, there are some differences which I believe are the culprit here. I've omitted some non-relevant bits from the diff, such as key file locations etc.

```
% jd -set k0s.json rke2.json
@ ["kubeletconfig","cgroupDriver"]
- "cgroupfs"
+ "systemd"
@ ["kubeletconfig","evictionHard","imagefs.available"]
- "15%"
+ "5%"
@ ["kubeletconfig","evictionHard","imagefs.inodesFree"]
- "5%"
@ ["kubeletconfig","evictionHard","memory.available"]
- "100Mi"
@ ["kubeletconfig","evictionHard","nodefs.available"]
- "10%"
+ "5%"
@ ["kubeletconfig","evictionHard","nodefs.inodesFree"]
- "5%"
@ ["kubeletconfig","fileCheckFrequency"]
- "20s"
+ "5s"
@ ["kubeletconfig","kubeReserved","cpu"]
- "1"
@ ["kubeletconfig","kubeReserved","memory"]
- "1Gi"
+ "2000Mi"
@ ["kubeletconfig","kubeReservedCgroup"]
- "system.slice"
@ ["kubeletconfig","kubeletCgroups"]
- "/system.slice/containerd.service"
@ ["kubeletconfig","logging","verbosity"]
- 1
+ 0
@ ["kubeletconfig","memoryManagerPolicy"]
- "None"
+ "Static"
@ ["kubeletconfig","syncFrequency"]
- "1m0s"
+ "30s"
@ ["kubeletconfig","topologyManagerPolicy"]
- "none"
+ "restricted"
@ ["kubeletconfig","topologyManagerScope"]
- "container"
+ "pod"
@ ["kubeletconfig","featureGates"]
+ {"CloudDualStackNodeIPs":true}
@ ["kubeletconfig","reservedMemory"]
+ [{"limits":{"memory":"1500Mi"},"numaNode":0},{"limits":{"memory":"1500Mi"},"numaNode":1}]
@ ["kubeletconfig","reservedSystemCPUs"]
+ "0,28"
@ ["kubeletconfig","systemReserved"]
+ {"cpu":"2","memory":"1000Mi"}
```

Without a deeper refresher on the area of CPU pinning and scheduling, I'd look into these first:

  • cgroupDriver
  • memoryManagerPolicy
  • topologyManagerPolicy
  • topologyManagerScope

Remember that with k0s you can create a specialized worker profile, which basically allows you to customize the kubelet config. In that case, remember to start the worker using `--profile my-custom-profile` so that it pulls the correct profile and kubelet config.
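A rough sketch of what such a profile could look like for the settings listed above; the profile name is arbitrary, the values simply mirror the rke2 config, and this is illustrative rather than a verified working configuration:

```yaml
# Illustrative k0s ClusterConfig fragment (not a verified working config)
spec:
  workerProfiles:
    - name: cpu-pinning            # arbitrary name; start the worker with --profile cpu-pinning
      values:
        cgroupDriver: systemd
        memoryManagerPolicy: Static
        topologyManagerPolicy: restricted
        topologyManagerScope: pod
        # The Static memory manager also requires reservedMemory that adds up to the
        # kube/system reserved memory plus the hard eviction threshold.
        reservedMemory:
          - numaNode: 0
            limits:
              memory: 1500Mi
          - numaNode: 1
            limits:
              memory: 1500Mi
```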

jnummelin added this to the 1.32 milestone on Sep 13, 2024
@mega-alex (Author)

Hey, we've been doing some testing with this. If we set the cgroupDriver to systemd using a worker profile, almost every pod goes into a CrashLoopBackOff. It seems like the konnectivity-agent might not like it, and that causes other pods to crash with `No agent available for kube-...`

```yaml
apiVersion: k0sctl.k0sproject.io/v1beta1
kind: Cluster
metadata:
  name: cluster
spec:
  hosts:
  - ssh:
      # server-a - Controller
      address: 10.69.0.23
      user: megaport
      port: 22
    role: controller
    installFlags:
      - --debug
  - ssh:
      # server-b - Worker
      address: 10.69.0.27
      user: megaport
      port: 22
    role: worker
    installFlags:
      - --debug
      - --profile="custom"
  k0s:
    version: "v1.30.4+k0s.0"
    versionChannel: stable
    dynamicConfig: true
    config:
      apiVersion: k0s.k0sproject.io/v1beta1
      kind: ClusterConfig
      metadata:
        name: cluster
      spec:
        workerProfiles:
          - name: custom
            values:
              cgroupDriver: "systemd"
```

The above config reliably causes the crashing behavior.

We're going to look into the other options today. Is there any reason k0s uses the cgroupfs driver by default, even when the host is managed by systemd?

@jnummelin (Member)

> If we set the cgroupDriver to systemd using a worker profile, almost every pod goes into a CrashLoopBackOff.

hmm, did you also change it in the containerd config? IIRC that defaults to cgroupfs too, and as those are now different between kubelet and containerd, things might go south... 😄

> Is there any reason k0s uses the cgroupfs driver by default, even when the host is managed by systemd?

No real reason other than simplicity, in the sense that k0s also runs on systems that aren't managed by systemd. We have better detection and logic planned to make it play nicer with different cgroup managers.

@mega-alex (Author)

We've got the new kubelet config here, but unfortunately we're still running into throttling issues.

kubelet config
```json
{
  "kubeletconfig": {
    "enableServer": true,
    "podLogsDir": "/var/log/pods",
    "syncFrequency": "1m0s",
    "fileCheckFrequency": "20s",
    "httpCheckFrequency": "20s",
    "address": "0.0.0.0",
    "port": 10250,
    "tlsCipherSuites": [
      "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
      "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
      "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256"
    ],
    "tlsMinVersion": "VersionTLS12",
    "rotateCertificates": true,
    "serverTLSBootstrap": true,
    "authentication": {
      "x509": {
        "clientCAFile": "/var/lib/k0s/pki/ca.crt"
      },
      "webhook": {
        "enabled": true,
        "cacheTTL": "2m0s"
      },
      "anonymous": {
        "enabled": false
      }
    },
    "authorization": {
      "mode": "Webhook",
      "webhook": {
        "cacheAuthorizedTTL": "5m0s",
        "cacheUnauthorizedTTL": "30s"
      }
    },
    "registryPullQPS": 5,
    "registryBurst": 10,
    "eventRecordQPS": 0,
    "eventBurst": 100,
    "enableDebuggingHandlers": true,
    "healthzPort": 10248,
    "healthzBindAddress": "127.0.0.1",
    "oomScoreAdj": -999,
    "clusterDomain": "cluster.local",
    "clusterDNS": [
      "10.96.0.10"
    ],
    "streamingConnectionIdleTimeout": "4h0m0s",
    "nodeStatusUpdateFrequency": "10s",
    "nodeStatusReportFrequency": "5m0s",
    "nodeLeaseDurationSeconds": 40,
    "imageMinimumGCAge": "2m0s",
    "imageMaximumGCAge": "0s",
    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,
    "volumeStatsAggPeriod": "1m0s",
    "kubeletCgroups": "/system.slice/containerd.service",
    "cgroupsPerQOS": true,
    "cgroupDriver": "systemd",
    "cpuManagerPolicy": "static",
    "cpuManagerReconcilePeriod": "10s",
    "memoryManagerPolicy": "Static",
    "topologyManagerPolicy": "restricted",
    "topologyManagerScope": "pod",
    "runtimeRequestTimeout": "2m0s",
    "hairpinMode": "promiscuous-bridge",
    "maxPods": 110,
    "podPidsLimit": -1,
    "resolvConf": "/etc/resolv.conf",
    "cpuCFSQuota": true,
    "cpuCFSQuotaPeriod": "100ms",
    "nodeStatusMaxImages": 50,
    "maxOpenFiles": 1000000,
    "contentType": "application/vnd.kubernetes.protobuf",
    "kubeAPIQPS": 50,
    "kubeAPIBurst": 100,
    "serializeImagePulls": true,
    "evictionHard": {
      "imagefs.available": "15%",
      "imagefs.inodesFree": "5%",
      "memory.available": "100Mi",
      "nodefs.available": "10%",
      "nodefs.inodesFree": "5%"
    },
    "evictionPressureTransitionPeriod": "5m0s",
    "enableControllerAttachDetach": true,
    "makeIPTablesUtilChains": true,
    "iptablesMasqueradeBit": 14,
    "iptablesDropBit": 15,
    "failSwapOn": false,
    "memorySwap": {},
    "containerLogMaxSize": "10Mi",
    "containerLogMaxFiles": 5,
    "containerLogMaxWorkers": 1,
    "containerLogMonitorInterval": "10s",
    "configMapAndSecretChangeDetectionStrategy": "Watch",
    "systemReserved": {
      "cpu": "2",
      "memory": "1000Mi"
    },
    "kubeReserved": {
      "memory": "2000Mi"
    },
    "kubeReservedCgroup": "system.slice",
    "enforceNodeAllocatable": [
      "pods"
    ],
    "volumePluginDir": "/usr/libexec/k0s/kubelet-plugins/volume/exec",
    "logging": {
      "format": "text",
      "flushFrequency": "5s",
      "verbosity": 1,
      "options": {
        "text": {
          "infoBufferSize": "0"
        },
        "json": {
          "infoBufferSize": "0"
        }
      }
    },
    "enableSystemLogHandler": true,
    "enableSystemLogQuery": false,
    "shutdownGracePeriod": "0s",
    "shutdownGracePeriodCriticalPods": "0s",
    "reservedMemory": [
      {
        "numaNode": 0,
        "limits": {
          "memory": "1550Mi"
        }
      },
      {
        "numaNode": 1,
        "limits": {
          "memory": "1550Mi"
        }
      }
    ],
    "enableProfilingHandler": true,
    "enableDebugFlagsHandler": true,
    "seccompDefault": false,
    "memoryThrottlingFactor": 0.9,
    "registerNode": true,
    "localStorageCapacityIsolation": true,
    "containerRuntimeEndpoint": "unix:///run/k0s/containerd.sock"
  }
}
```

After setting the containerd config to

```toml
# /etc/k0s/containerd.d/00-cgroups.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```

We were able to get things to start, but we're still running into the cgroup throttling issue.
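A quick way to confirm that kubelet and containerd really ended up on the same (systemd) driver is to look at the cgroup paths that get created for pods: with the systemd driver they appear as slices and scopes, with cgroupfs as plain directories. A minimal check on the node:

```sh
# With the systemd cgroup driver, Guaranteed pod cgroups show up as slices/scopes like this;
# with the cgroupfs driver they would live under /sys/fs/cgroup/kubepods/pod<uid>/<container-id>
ls -d /sys/fs/cgroup/kubepods.slice/kubepods-pod*.slice/cri-containerd-*.scope | head
```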

kubevirt pod
```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/containerID: f14bd217a120cc893dbadc4dce5237d40fb3be15cf8b3f7fd257402cdf689acc
    cni.projectcalico.org/podIP: 10.244.9.88/32
    cni.projectcalico.org/podIPs: 10.244.9.88/32
    descheduler.alpha.kubernetes.io/request-evict-only: ""
    kubectl.kubernetes.io/default-container: compute
    kubevirt.io/domain: fedora-test
    kubevirt.io/migrationTransportUnix: "true"
    kubevirt.io/vm-generation: "1"
    post.hook.backup.velero.io/command: '["/usr/bin/virt-freezer", "--unfreeze", "--name",
      "fedora-test", "--namespace", "default"]'
    post.hook.backup.velero.io/container: compute
    pre.hook.backup.velero.io/command: '["/usr/bin/virt-freezer", "--freeze", "--name",
      "fedora-test", "--namespace", "default"]'
    pre.hook.backup.velero.io/container: compute
  creationTimestamp: "2024-09-16T20:21:16Z"
  generateName: virt-launcher-fedora-test-
  labels:
    kubevirt.io: virt-launcher
    kubevirt.io/created-by: 76a6a7fb-2327-479e-bb06-b816d3e0b730
    kubevirt.io/nodeName: protoklustr-2
    vm.kubevirt.io/name: fedora-test
  name: virt-launcher-fedora-test-j575z
  namespace: default
  ownerReferences:
  - apiVersion: kubevirt.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: VirtualMachineInstance
    name: fedora-test
    uid: 76a6a7fb-2327-479e-bb06-b816d3e0b730
  resourceVersion: "5611"
  uid: 2fdafb62-3de2-43e6-900a-eda4bf1982db
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-labeller.kubevirt.io/obsolete-host-model
            operator: DoesNotExist
  automountServiceAccountToken: false
  containers:
  - command:
    - /usr/bin/virt-launcher-monitor
    - --qemu-timeout
    - 356s
    - --name
    - fedora-test
    - --uid
    - 76a6a7fb-2327-479e-bb06-b816d3e0b730
    - --namespace
    - default
    - --kubevirt-share-dir
    - /var/run/kubevirt
    - --ephemeral-disk-dir
    - /var/run/kubevirt-ephemeral-disks
    - --container-disk-dir
    - /var/run/kubevirt/container-disks
    - --grace-period-seconds
    - "25"
    - --hook-sidecars
    - "0"
    - --ovmf-path
    - /usr/share/OVMF
    - --run-as-nonroot
    env:
    - name: XDG_CACHE_HOME
      value: /var/run/kubevirt-private
    - name: XDG_CONFIG_HOME
      value: /var/run/kubevirt-private
    - name: XDG_RUNTIME_DIR
      value: /var/run
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    image: quay.io/kubevirt/virt-launcher:v1.3.1
    imagePullPolicy: IfNotPresent
    name: compute
    resources:
      limits:
        cpu: "16"
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
        hugepages-1Gi: 16Gi
        memory: "501219329"
      requests:
        cpu: "16"
        devices.kubevirt.io/kvm: "1"
        devices.kubevirt.io/tun: "1"
        devices.kubevirt.io/vhost-net: "1"
        ephemeral-storage: 50M
        hugepages-1Gi: 16Gi
        memory: "501219329"
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        add:
        - NET_BIND_SERVICE
        drop:
        - ALL
      privileged: false
      runAsGroup: 107
      runAsNonRoot: true
      runAsUser: 107
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/kubevirt-private
      name: private
    - mountPath: /var/run/kubevirt
      name: public
    - mountPath: /var/run/kubevirt-ephemeral-disks
      name: ephemeral-disks
    - mountPath: /var/run/kubevirt/container-disks
      mountPropagation: HostToContainer
      name: container-disks
    - mountPath: /var/run/libvirt
      name: libvirt-runtime
    - mountPath: /var/run/kubevirt/sockets
      name: sockets
    - mountPath: /dev/hugepages
      name: hugepages
    - mountPath: /dev/hugepages/libvirt/qemu
      name: hugetblfs-dir
    - mountPath: /var/run/kubevirt/hotplug-disks
      mountPropagation: HostToContainer
      name: hotplug-disks
  - args:
    - --copy-path
    - /var/run/kubevirt-ephemeral-disks/container-disk-data/76a6a7fb-2327-479e-bb06-b816d3e0b730/disk_0
    command:
    - /usr/bin/container-disk
    image: kubevirt/fedora-cloud-container-disk-demo:latest
    imagePullPolicy: Always
    name: volumecontainerdisk
    resources:
      limits:
        cpu: 10m
        memory: 40M
      requests:
        cpu: 10m
        ephemeral-storage: 50M
        memory: 40M
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      runAsNonRoot: true
      runAsUser: 107
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/kubevirt-ephemeral-disks/container-disk-data/76a6a7fb-2327-479e-bb06-b816d3e0b730
      name: container-disks
    - mountPath: /usr/bin
      name: virt-bin-share-dir
  - args:
    - --logfile
    - /var/run/kubevirt-private/76a6a7fb-2327-479e-bb06-b816d3e0b730/virt-serial0-log
    command:
    - /usr/bin/virt-tail
    env:
    - name: VIRT_LAUNCHER_LOG_VERBOSITY
      value: "2"
    image: quay.io/kubevirt/virt-launcher:v1.3.1
    imagePullPolicy: IfNotPresent
    name: guest-console-log
    resources:
      limits:
        cpu: 15m
        memory: 60M
      requests:
        cpu: 15m
        memory: 60M
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      runAsNonRoot: true
      runAsUser: 107
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/kubevirt-private
      name: private
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: false
  hostname: fedora-test
  initContainers:
  - command:
    - /usr/bin/cp
    - /usr/bin/container-disk
    - /init/usr/bin/container-disk
    env:
    - name: XDG_CACHE_HOME
      value: /var/run/kubevirt-private
    - name: XDG_CONFIG_HOME
      value: /var/run/kubevirt-private
    - name: XDG_RUNTIME_DIR
      value: /var/run
    image: quay.io/kubevirt/virt-launcher:v1.3.1
    imagePullPolicy: IfNotPresent
    name: container-disk-binary
    resources:
      limits:
        cpu: 10m
        memory: 40M
      requests:
        cpu: 10m
        memory: 40M
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      runAsGroup: 107
      runAsNonRoot: true
      runAsUser: 107
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /init/usr/bin
      name: virt-bin-share-dir
  - args:
    - --no-op
    command:
    - /usr/bin/container-disk
    image: kubevirt/fedora-cloud-container-disk-demo:latest
    imagePullPolicy: Always
    name: volumecontainerdisk-init
    resources:
      limits:
        cpu: 10m
        memory: 40M
      requests:
        cpu: 10m
        ephemeral-storage: 50M
        memory: 40M
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      runAsNonRoot: true
      runAsUser: 107
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/kubevirt-ephemeral-disks/container-disk-data/76a6a7fb-2327-479e-bb06-b816d3e0b730
      name: container-disks
    - mountPath: /usr/bin
      name: virt-bin-share-dir
  nodeName: protoklustr-2
  nodeSelector:
    cpumanager: "true"
    kubernetes.io/arch: amd64
    kubevirt.io/schedulable: "true"
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  readinessGates:
  - conditionType: kubevirt.io/virtual-machine-unpaused
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 107
    runAsGroup: 107
    runAsNonRoot: true
    runAsUser: 107
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 40
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - emptyDir: {}
    name: private
  - emptyDir: {}
    name: public
  - emptyDir: {}
    name: sockets
  - emptyDir: {}
    name: virt-bin-share-dir
  - emptyDir: {}
    name: libvirt-runtime
  - emptyDir: {}
    name: ephemeral-disks
  - emptyDir: {}
    name: container-disks
  - emptyDir:
      medium: HugePages
    name: hugepages
  - emptyDir: {}
    name: hugetblfs-dir
  - emptyDir: {}
    name: hotplug-disks
status:
  conditions:
  - lastProbeTime: "2024-09-16T20:21:16Z"
    lastTransitionTime: "2024-09-16T20:21:16Z"
    message: the virtual machine is not paused
    reason: NotPaused
    status: "True"
    type: kubevirt.io/virtual-machine-unpaused
  - lastProbeTime: null
    lastTransitionTime: "2024-09-16T20:21:17Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-09-16T20:21:18Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-09-16T20:21:19Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-09-16T20:21:19Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-09-16T20:21:16Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://d30a5394062be7df6b9d9297e4fb6e30561e1c5f949e66acded795d17f05528b
    image: quay.io/kubevirt/virt-launcher:v1.3.1
    imageID: quay.io/kubevirt/virt-launcher@sha256:b15f8049d7f1689d9d8c338d255dc36b15655fd487e824b35e2b139258d44209
    lastState: {}
    name: compute
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-09-16T20:21:18Z"
  - containerID: containerd://8c28b3316e2ae0a6e2fdd49d0590f32ff509b62118d8002ca4a436f4f223a579
    image: quay.io/kubevirt/virt-launcher:v1.3.1
    imageID: quay.io/kubevirt/virt-launcher@sha256:b15f8049d7f1689d9d8c338d255dc36b15655fd487e824b35e2b139258d44209
    lastState: {}
    name: guest-console-log
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-09-16T20:21:19Z"
  - containerID: containerd://6f9b6afe9dcd8d02cfb90db1609d561fd929e5ab10b085df162c3e20f3640ff9
    image: docker.io/kubevirt/fedora-cloud-container-disk-demo:latest
    imageID: docker.io/kubevirt/fedora-cloud-container-disk-demo@sha256:4a0c3f9526551d0294079f1b0171a071a57fe0bf60a2e8529bf4102ee63a67cd
    lastState: {}
    name: volumecontainerdisk
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-09-16T20:21:19Z"
  hostIP: 10.69.0.23
  hostIPs:
  - ip: 10.69.0.23
  initContainerStatuses:
  - containerID: containerd://151888cd26bd5ef8b7b9677b5f817a2e1437b579d952e926b94d4cdf98bb1a5c
    image: quay.io/kubevirt/virt-launcher:v1.3.1
    imageID: quay.io/kubevirt/virt-launcher@sha256:b15f8049d7f1689d9d8c338d255dc36b15655fd487e824b35e2b139258d44209
    lastState: {}
    name: container-disk-binary
    ready: true
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://151888cd26bd5ef8b7b9677b5f817a2e1437b579d952e926b94d4cdf98bb1a5c
        exitCode: 0
        finishedAt: "2024-09-16T20:21:17Z"
        reason: Completed
        startedAt: "2024-09-16T20:21:16Z"
  - containerID: containerd://ac024ab1218d338ac40043c9c6cd3befe8011141c1f50f74e12e4ed54b31e233
    image: docker.io/kubevirt/fedora-cloud-container-disk-demo:latest
    imageID: docker.io/kubevirt/fedora-cloud-container-disk-demo@sha256:4a0c3f9526551d0294079f1b0171a071a57fe0bf60a2e8529bf4102ee63a67cd
    lastState: {}
    name: volumecontainerdisk-init
    ready: true
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://ac024ab1218d338ac40043c9c6cd3befe8011141c1f50f74e12e4ed54b31e233
        exitCode: 0
        finishedAt: "2024-09-16T20:21:18Z"
        reason: Completed
        startedAt: "2024-09-16T20:21:18Z"
  phase: Running
  podIP: 10.244.9.88
  podIPs:
  - ip: 10.244.9.88
  qosClass: Guaranteed
  startTime: "2024-09-16T20:21:16Z"
```

With this KubeVirt pod deployed, we are still seeing throttling reported per container and for the pod overall:

```
/sys/fs/cgroup/kubepods.slice/kubepods-pod2fdafb62_3de2_43e6_900a_eda4bf1982db.slice$ cat cpu.stat
usage_usec 8099091679
user_usec 8089018953
system_usec 10072726
nr_periods 7607
nr_throttled 4174
throttled_usec 3322148518
nr_bursts 0
burst_usec 0
/sys/fs/cgroup/kubepods.slice/kubepods-pod2fdafb62_3de2_43e6_900a_eda4bf1982db.slice/cri-containerd-d30a5394062be7df6b9d9297e4fb6e30561e1c5f949e66acded795d17f05528b.scope$ cat cpu.stat
usage_usec 988930037
user_usec 980846160
system_usec 8083877
nr_periods 1197
nr_throttled 496
throttled_usec 395384834
nr_bursts 0
burst_usec 0
```
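A follow-up check that might help narrow this down (a sketch using only standard cgroup v2 files; the slice path is the one from the output above): see at which level the CFS quota is actually set, and which CPUs the static CPU manager pinned the containers to.

```sh
POD=/sys/fs/cgroup/kubepods.slice/kubepods-pod2fdafb62_3de2_43e6_900a_eda4bf1982db.slice

# CFS quota at the pod level vs. the container level ("max 100000" means no quota)
cat "$POD/cpu.max"
cat "$POD"/cri-containerd-*.scope/cpu.max

# CPUs each container is actually confined to
cat "$POD"/cri-containerd-*.scope/cpuset.cpus.effective
```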

Interestingly, if we run a pod like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu
  labels:
    app: ubuntu
spec:
  containers:
  - image: ubuntu
    command:
      - "sleep"
      - "604800"
    imagePullPolicy: IfNotPresent
    name: ubuntu
    resources:
      requests:
        memory: "4Gi"
        cpu: "4"
      limits:
        memory: "4Gi"
        cpu: "4"
  restartPolicy: Always
```

We don't see any throttling reported in cpu.stat, but the pod will not schedule processes on more than one CPU core... though we can achieve 100% utilization on that one core.
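One more thing that might be worth checking for a pod like that (a sketch; the pod name matches the manifest above, and the state file path assumes k0s's default kubelet root directory): which CPUs the container is actually allowed to run on, and what the CPU manager recorded for it.

```sh
# CPUs the container is confined to, as seen from inside the pod
kubectl exec ubuntu -- grep Cpus_allowed_list /proc/self/status

# The kubelet's CPU manager assignments on the node (path assumes k0s's default kubelet root dir)
cat /var/lib/k0s/kubelet/cpu_manager_state
```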

Anything else we could look into for this behavior?

@github-actions (bot)

The issue is marked as stale since no activity has been recorded in 30 days

github-actions bot added the Stale label on Oct 16, 2024
twz123 removed the Stale label on Oct 17, 2024
@github-actions (bot)

The issue is marked as stale since no activity has been recorded in 30 days

github-actions bot added the Stale label on Nov 16, 2024
twz123 removed the Stale label on Nov 17, 2024