
aws-node pods fail to start on GPU instances (g5.xl) with AL2023 AMI due to missing NVIDIA container runtime #3104

Closed
omerap12 opened this issue Nov 3, 2024 · 2 comments
omerap12 commented Nov 3, 2024

What happened:
When using a g5.xl instance with the AL2023 AMI, the aws-node pods fail to start because the NVIDIA container runtime is missing: containerd tries to create every pod sandbox with /usr/bin/nvidia-container-runtime, but that binary does not exist on the node, so the VPC CNI pods never initialize.

~/ k describe pod aws-node-xljqg -n kube-system
Name:                 aws-node-xljqg
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      aws-node
Node:                 ip-172-25-149-156.ec2.internal/172.25.149.156
Start Time:           Sun, 03 Nov 2024 11:32:20 +0200
Labels:               app.kubernetes.io/instance=aws-vpc-cni
                      app.kubernetes.io/name=aws-node
                      controller-revision-hash=59f8c97cb7
                      k8s-app=aws-node
                      pod-template-generation=18
Annotations:          <none>
Status:               Pending
IP:                   172.25.149.156
IPs:
  IP:           172.25.149.156
Controlled By:  DaemonSet/aws-node
Init Containers:
  aws-vpc-cni-init:
    Container ID:   
    Image:          602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni-init:v1.18.5-eksbuild.1
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:  25m
    Environment:
      DISABLE_TCP_EARLY_DEMUX:  false
      ENABLE_IPv6:              false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mt7lp (ro)
Containers:
  aws-node:
    Container ID:   
    Image:          602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.18.5-eksbuild.1
    Image ID:       
    Port:           61678/TCP
    Host Port:      61678/TCP
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:      25m
    Liveness:   exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s] delay=60s timeout=10s period=10s #success=1 #failure=3
    Readiness:  exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s] delay=1s timeout=10s period=10s #success=1 #failure=3
    Environment:
      ADDITIONAL_ENI_TAGS:                    {}
      ANNOTATE_POD_IP:                        false
      AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER:     false
      AWS_VPC_CNI_NODE_PORT_SUPPORT:          true
      AWS_VPC_ENI_MTU:                        9001
      AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG:     false
      AWS_VPC_K8S_CNI_EXTERNALSNAT:           false
      AWS_VPC_K8S_CNI_LOGLEVEL:               DEBUG
      AWS_VPC_K8S_CNI_LOG_FILE:               /host/var/log/aws-routed-eni/ipamd.log
      AWS_VPC_K8S_CNI_RANDOMIZESNAT:          prng
      AWS_VPC_K8S_CNI_VETHPREFIX:             eni
      AWS_VPC_K8S_PLUGIN_LOG_FILE:            /var/log/aws-routed-eni/plugin.log
      AWS_VPC_K8S_PLUGIN_LOG_LEVEL:           DEBUG
      CLUSTER_NAME:                           undertone-p-us-east-1
      DISABLE_INTROSPECTION:                  false
      DISABLE_METRICS:                        false
      DISABLE_NETWORK_RESOURCE_PROVISIONING:  false
      ENABLE_IPv4:                            true
      ENABLE_IPv6:                            false
      ENABLE_POD_ENI:                         false
      ENABLE_PREFIX_DELEGATION:               false
      ENABLE_SUBNET_DISCOVERY:                true
      NETWORK_POLICY_ENFORCING_MODE:          standard
      VPC_CNI_VERSION:                        v1.18.5
      VPC_ID:                                 vpc-f70c168f
      WARM_ENI_TARGET:                        1
      WARM_PREFIX_TARGET:                     1
      MY_NODE_NAME:                            (v1:spec.nodeName)
      MY_POD_NAME:                            aws-node-xljqg (v1:metadata.name)
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /host/var/log/aws-routed-eni from log-dir (rw)
      /run/xtables.lock from xtables-lock (rw)
      /var/run/aws-node from run-dir (rw)
      /var/run/dockershim.sock from dockershim (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mt7lp (ro)
  aws-eks-nodeagent:
    Container ID:  
    Image:         602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-network-policy-agent:v1.1.3-eksbuild.1
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      --enable-ipv6=false
      --enable-network-policy=false
      --enable-cloudwatch-logs=false
      --enable-policy-event-logs=false
      --log-file=/var/log/aws-routed-eni/network-policy-agent.log
      --metrics-bind-addr=:8162
      --health-probe-bind-addr=:8163
      --conntrack-cache-cleanup-period=300
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:  25m
    Environment:
      MY_NODE_NAME:   (v1:spec.nodeName)
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /sys/fs/bpf from bpf-pin-path (rw)
      /var/log/aws-routed-eni from log-dir (rw)
      /var/run/aws-node from run-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mt7lp (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  bpf-pin-path:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/bpf
    HostPathType:  
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  dockershim:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/dockershim.sock
    HostPathType:  
  log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/aws-routed-eni
    HostPathType:  DirectoryOrCreate
  run-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/aws-node
    HostPathType:  DirectoryOrCreate
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  
  kube-api-access-mt7lp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason                  Age               From               Message
  ----     ------                  ----              ----               -------
  Normal   Scheduled               2m32s             default-scheduler  Successfully assigned kube-system/aws-node-xljqg to ip-172-25-149-156.ec2.internal
  Warning  FailedCreatePodSandBox  2m32s             kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/d9bb50134187f1734edb616a475f001e36289e1bac550dfa60cf0f2fd05edb1d/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  2m20s             kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/e45c28b7a8c0ec4d65be2a774de49d4812095af80b96b8ceade6e37557bd9ccf/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  2m8s              kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/d228a45967c5e6d3393f878a03925448dad399f8dd9e3916023d1607a13a5f4d/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  115s              kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/35f93eb4ea975862d534b861723a58fa150f3a9e0a226b390661708653c93b51/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  104s              kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/b3352d3f9ff3eed05d414798ddc98abf823792dc88814db27a91feaca0aafde4/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  89s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/42c14b49a6a1b5573a5be6b01a61c7a2921e43236294911eb58fa49a25ffce64/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  78s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/6d014b312b6c69c22c2f12918f6719f9e87189aa230c41e3434719ffe20942e4/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  67s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/0add6d800ff66ac7800dde8246a33d5c62b1b324a21c7741715a52d574d2d894/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  55s               kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/54fddb5528d140ca87a8b5a6da4719294fe52979b62b39c6586367bed66ef124/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
  Warning  FailedCreatePodSandBox  4s (x4 over 40s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/2c4ad8e61ecead2f106816043ef341d590933cda9bafa4ce49ec382631a6b20d/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown

This is the node (which is not in the Ready state):

Name:               ip-172-25-149-156.ec2.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=g6.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1a
                    instance-types=ml-test
                    k8s.io/cloud-provider-aws=92e8930f7fc6a131864b8a2de4a46e59
                    karpenter.k8s.aws/instance-category=g
                    karpenter.k8s.aws/instance-cpu=4
                    karpenter.k8s.aws/instance-encryption-in-transit-supported=true
                    karpenter.k8s.aws/instance-family=g6
                    karpenter.k8s.aws/instance-generation=6
                    karpenter.k8s.aws/instance-gpu-count=1
                    karpenter.k8s.aws/instance-gpu-manufacturer=nvidia
                    karpenter.k8s.aws/instance-gpu-memory=22888
                    karpenter.k8s.aws/instance-gpu-name=l4
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-local-nvme=250
                    karpenter.k8s.aws/instance-memory=16384
                    karpenter.k8s.aws/instance-size=xlarge
                    karpenter.sh/capacity-type=on-demand
                    karpenter.sh/nodepool=ml-test
                    karpenter.sh/registered=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-172-25-149-156.ec2.internal
                    kubernetes.io/os=linux
                    managed-by=karpenter
                    node.kubernetes.io/instance-type=g6.xlarge
                    os=al2023
                    topology.k8s.aws/zone-id=use1-az2
                    topology.kubernetes.io/region=us-east-1
                    topology.kubernetes.io/zone=us-east-1a
Annotations:        alpha.kubernetes.io/provided-node-ip: 172.25.149.156
                    karpenter.k8s.aws/ec2nodeclass-hash: 13145685265961450066
                    karpenter.k8s.aws/ec2nodeclass-hash-version: v1
                    karpenter.sh/nodepool-hash: 3154188912813901712
                    karpenter.sh/nodepool-hash-version: v1
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sun, 03 Nov 2024 11:32:20 +0200
Taints:             node.kubernetes.io/not-ready:NoExecute
                    node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-172-25-149-156.ec2.internal
  AcquireTime:     <unset>
  RenewTime:       Sun, 03 Nov 2024 11:39:28 +0200
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sun, 03 Nov 2024 11:37:37 +0200   Sun, 03 Nov 2024 11:32:17 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sun, 03 Nov 2024 11:37:37 +0200   Sun, 03 Nov 2024 11:32:17 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sun, 03 Nov 2024 11:37:37 +0200   Sun, 03 Nov 2024 11:32:17 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Sun, 03 Nov 2024 11:37:37 +0200   Sun, 03 Nov 2024 11:32:17 +0200   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Addresses:
  InternalIP:   172.25.149.156
  InternalDNS:  ip-172-25-149-156.ec2.internal
  Hostname:     ip-172-25-149-156.ec2.internal
Capacity:
  cpu:                4
  ephemeral-storage:  104779756Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             15727836Ki
  pods:               58
Allocatable:
  cpu:                3920m
  ephemeral-storage:  95491281146
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             14301404Ki
  pods:               58
System Info:
  Machine ID:                 ec2975af9486489e594a38045a562fab
  System UUID:                ec2975af-9486-489e-594a-38045a562fab
  Boot ID:                    4eda1b61-0b73-481e-b0a7-20a0fa599cec
  Kernel Version:             6.1.112-122.189.amzn2023.x86_64
  OS Image:                   Amazon Linux 2023.6.20241010
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.22
  Kubelet Version:            v1.30.4-eks-a737599
  Kube-Proxy Version:         v1.30.4-eks-a737599
ProviderID:                   aws:///us-east-1a/i-055bf023aaa7684f9
Non-terminated Pods:          (6 in total)
  Namespace                   Name                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                    ------------  ----------  ---------------  -------------  ---
  elastic-search              fluent-bit-h4n7p        0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m15s
  kube-system                 aws-node-xljqg          50m (1%)      0 (0%)      0 (0%)           0 (0%)         7m15s
  kube-system                 datadog-z7nk5           0 (0%)        0 (0%)      0 (0%)           0 (0%)         7m15s
  kube-system                 ebs-csi-node-6qxk8      30m (0%)      0 (0%)      120Mi (0%)       768Mi (5%)     7m15s
  kube-system                 kube-proxy-85d4s        100m (2%)     0 (0%)      0 (0%)           0 (0%)         7m15s
  kube-system                 node-local-dns-h75wp    25m (0%)      0 (0%)      50Mi (0%)        0 (0%)         7m15s
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                205m (5%)   0 (0%)
  memory             170Mi (1%)  768Mi (5%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type     Reason                   Age                    From                   Message
  ----     ------                   ----                   ----                   -------
  Normal   Starting                 7m19s                  kubelet                Starting kubelet.
  Warning  InvalidDiskCapacity      7m19s                  kubelet                invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  7m19s (x2 over 7m19s)  kubelet                Node ip-172-25-149-156.ec2.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    7m19s (x2 over 7m19s)  kubelet                Node ip-172-25-149-156.ec2.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     7m19s (x2 over 7m19s)  kubelet                Node ip-172-25-149-156.ec2.internal status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  7m19s                  kubelet                Updated Node Allocatable limit across pods
  Normal   Synced                   7m15s                  cloud-node-controller  Node synced successfully
  Normal   RegisteredNode           7m14s                  node-controller        Node ip-172-25-149-156.ec2.internal event: Registered Node ip-172-25-149-156.ec2.internal in Controller

Attach logs

[root@ip-172-25-179-103 bin]# journalctl -u kubelet -f
Nov 03 10:11:44 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:44.999106    2817 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="kube-system/ebs-csi-node-8rq4x" podUID="a6ae0c4d-55f7-4876-bedd-fa2d177c4d5b"
Nov 03 10:11:44 ip-172-25-179-103.ec2.internal kubelet[2817]: I1103 10:11:44.999327    2817 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="kube-system/kube-proxy-gtcqj"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:45.054848    2817 remote_runtime.go:193] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/847d67535ba8738d48944df82127c5dc663e556d0f880ce0e562bc7f6fc9d046/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:45.054893    2817 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/847d67535ba8738d48944df82127c5dc663e556d0f880ce0e562bc7f6fc9d046/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown" pod="kube-system/kube-proxy-gtcqj"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:45.054913    2817 kuberuntime_manager.go:1170] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/847d67535ba8738d48944df82127c5dc663e556d0f880ce0e562bc7f6fc9d046/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown" pod="kube-system/kube-proxy-gtcqj"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:45.054963    2817 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-proxy-gtcqj_kube-system(3f7ad267-95fd-485c-a4de-599d1e6d1de6)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-proxy-gtcqj_kube-system(3f7ad267-95fd-485c-a4de-599d1e6d1de6)\\\": rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/847d67535ba8738d48944df82127c5dc663e556d0f880ce0e562bc7f6fc9d046/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown\"" pod="kube-system/kube-proxy-gtcqj" podUID="3f7ad267-95fd-485c-a4de-599d1e6d1de6"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: I1103 10:11:45.998128    2817 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="kube-system/datadog-tsxlg"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: I1103 10:11:45.998128    2817 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="elastic-search/fluent-bit-bl6rn"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:45.998397    2817 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="kube-system/datadog-tsxlg" podUID="e5fef0d8-4b44-40a1-8ef3-a04522471bfe"

What you expected to happen:
VPC CNI pods should be able to run on GPU instances (g5.xl) without requiring manual NVIDIA runtime installation, especially since these instances are supported in EKS.

How to reproduce it (as minimally and precisely as possible):

  1. Create an EKS cluster (1.30)
  2. Configure Karpenter with AL2023 AMI
  3. Create a NodePool using g5.16xl instance type
  4. Attempt to schedule pods on the new node
  5. Observe aws-node pod failing to start due to missing NVIDIA runtime
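On an affected node, the mismatch can be confirmed directly. This is a diagnostic sketch (assuming shell access to the node via SSH/SSM, and the stock AL2023 containerd config path): it checks whether the runtime binary referenced in the sandbox error exists, and which runtime containerd is configured to use.

```shell
# Does the binary the OCI error points at actually exist on the node?
if command -v nvidia-container-runtime >/dev/null 2>&1; then
  echo "nvidia-container-runtime present at: $(command -v nvidia-container-runtime)"
else
  echo "nvidia-container-runtime missing (matches the FailedCreatePodSandBox error)"
fi

# Which runtime is containerd configured to default to?
# (config path assumes the default AL2023 containerd layout)
grep -E 'default_runtime_name|nvidia' /etc/containerd/config.toml 2>/dev/null || \
  echo "no nvidia runtime entries found in /etc/containerd/config.toml"
```

If the config names an nvidia runtime but the binary is absent, every sandbox creation will fail exactly as in the events above.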

Anything else we need to know?:
I'm not sure whether this is a bug or a misconfiguration on my side; perhaps g5/g6 nodes need a different AMI?
Environment:

  • Kubernetes version: v1.30.4-eks-a737599
  • CNI Version: v1.18.5-eksbuild.1
  • OS (e.g: cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.6.20241010"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2028-03-15"
  • Kernel (e.g. uname -a): Linux ip-172-25-179-103.ec2.internal 6.1.112-122.189.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Oct 8 17:02:11 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
@omerap12 omerap12 added the bug label Nov 3, 2024

omerap12 commented Nov 3, 2024

The issue was resolved by using the AMI ami-0ab46b6e2dbe2a9d9.
We are running Karpenter, and the initial configuration for the node pool was set as follows:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: ml-test
spec:
  amiFamily: AL2023_GPU

The problem was fixed by updating the configuration to specify:

spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - id: ami-0ab46b6e2dbe2a9d9

This explicitly pins the node class to that specific AMI. Since the issue turned out to be related to the Karpenter configuration rather than the VPC CNI itself, I'll go ahead and close it.
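Rather than hard-coding an AMI ID, the EKS-optimized AL2023 AMI IDs can be resolved from their published SSM parameters. A sketch, assuming the documented parameter paths for the standard and NVIDIA AL2023 variants and a configured AWS CLI:

```shell
K8S_VERSION="1.30"
ARCH="x86_64"

if command -v aws >/dev/null 2>&1; then
  # Standard AL2023 AMI (no NVIDIA container runtime baked in):
  aws ssm get-parameter \
    --name "/aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2023/${ARCH}/standard/recommended/image_id" \
    --query 'Parameter.Value' --output text

  # NVIDIA variant (ships the NVIDIA container runtime, for GPU instance types):
  aws ssm get-parameter \
    --name "/aws/service/eks/optimized-ami/${K8S_VERSION}/amazon-linux-2023/${ARCH}/nvidia/recommended/image_id" \
    --query 'Parameter.Value' --output text
fi
```

Resolving the AMI this way keeps the node class on a current image instead of freezing it at one ID.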

@omerap12 omerap12 closed this as completed Nov 3, 2024

github-actions bot commented Nov 3, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
