What happened:
When using a g5.xlarge instance with the AL2023 AMI, the aws-node pods fail to start due to a missing NVIDIA container runtime.
The VPC CNI pods cannot initialize on GPU instances without the proper NVIDIA runtime configuration.
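The sandbox errors below show containerd trying to fork /usr/bin/nvidia-container-runtime for every pod, which suggests containerd's default OCI runtime points at a binary this AMI does not ship. A minimal way to confirm that on the node, assuming SSM or SSH access:

```bash
# Illustrative checks on the affected node (assumes SSM/SSH access):
which nvidia-container-runtime || echo "nvidia-container-runtime missing"

# What containerd thinks its default runtime is, and which binary it points at:
containerd config dump | grep -iE 'default_runtime_name|BinaryName'
```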
$ kubectl describe pod aws-node-xljqg -n kube-system
Name: aws-node-xljqg
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: aws-node
Node: ip-172-25-149-156.ec2.internal/172.25.149.156
Start Time: Sun, 03 Nov 2024 11:32:20 +0200
Labels: app.kubernetes.io/instance=aws-vpc-cni
app.kubernetes.io/name=aws-node
controller-revision-hash=59f8c97cb7
k8s-app=aws-node
pod-template-generation=18
Annotations: <none>
Status: Pending
IP: 172.25.149.156
IPs:
IP: 172.25.149.156
Controlled By: DaemonSet/aws-node
Init Containers:
aws-vpc-cni-init:
Container ID:
Image: 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni-init:v1.18.5-eksbuild.1
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 25m
Environment:
DISABLE_TCP_EARLY_DEMUX: false
ENABLE_IPv6: false
Mounts:
/host/opt/cni/bin from cni-bin-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mt7lp (ro)
Containers:
aws-node:
Container ID:
Image: 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.18.5-eksbuild.1
Image ID:
Port: 61678/TCP
Host Port: 61678/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 25m
Liveness: exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s] delay=60s timeout=10s period=10s #success=1 #failure=3
Readiness: exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s] delay=1s timeout=10s period=10s #success=1 #failure=3
Environment:
ADDITIONAL_ENI_TAGS: {}
ANNOTATE_POD_IP: false
AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER: false
AWS_VPC_CNI_NODE_PORT_SUPPORT: true
AWS_VPC_ENI_MTU: 9001
AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG: false
AWS_VPC_K8S_CNI_EXTERNALSNAT: false
AWS_VPC_K8S_CNI_LOGLEVEL: DEBUG
AWS_VPC_K8S_CNI_LOG_FILE: /host/var/log/aws-routed-eni/ipamd.log
AWS_VPC_K8S_CNI_RANDOMIZESNAT: prng
AWS_VPC_K8S_CNI_VETHPREFIX: eni
AWS_VPC_K8S_PLUGIN_LOG_FILE: /var/log/aws-routed-eni/plugin.log
AWS_VPC_K8S_PLUGIN_LOG_LEVEL: DEBUG
CLUSTER_NAME: undertone-p-us-east-1
DISABLE_INTROSPECTION: false
DISABLE_METRICS: false
DISABLE_NETWORK_RESOURCE_PROVISIONING: false
ENABLE_IPv4: true
ENABLE_IPv6: false
ENABLE_POD_ENI: false
ENABLE_PREFIX_DELEGATION: false
ENABLE_SUBNET_DISCOVERY: true
NETWORK_POLICY_ENFORCING_MODE: standard
VPC_CNI_VERSION: v1.18.5
VPC_ID: vpc-f70c168f
WARM_ENI_TARGET: 1
WARM_PREFIX_TARGET: 1
MY_NODE_NAME: (v1:spec.nodeName)
MY_POD_NAME: aws-node-xljqg (v1:metadata.name)
Mounts:
/host/etc/cni/net.d from cni-net-dir (rw)
/host/opt/cni/bin from cni-bin-dir (rw)
/host/var/log/aws-routed-eni from log-dir (rw)
/run/xtables.lock from xtables-lock (rw)
/var/run/aws-node from run-dir (rw)
/var/run/dockershim.sock from dockershim (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mt7lp (ro)
aws-eks-nodeagent:
Container ID:
Image: 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon/aws-network-policy-agent:v1.1.3-eksbuild.1
Image ID:
Port: <none>
Host Port: <none>
Args:
--enable-ipv6=false
--enable-network-policy=false
--enable-cloudwatch-logs=false
--enable-policy-event-logs=false
--log-file=/var/log/aws-routed-eni/network-policy-agent.log
--metrics-bind-addr=:8162
--health-probe-bind-addr=:8163
--conntrack-cache-cleanup-period=300
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 25m
Environment:
MY_NODE_NAME: (v1:spec.nodeName)
Mounts:
/host/opt/cni/bin from cni-bin-dir (rw)
/sys/fs/bpf from bpf-pin-path (rw)
/var/log/aws-routed-eni from log-dir (rw)
/var/run/aws-node from run-dir (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mt7lp (ro)
Conditions:
Type Status
PodReadyToStartContainers False
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
bpf-pin-path:
Type: HostPath (bare host directory volume)
Path: /sys/fs/bpf
HostPathType:
cni-bin-dir:
Type: HostPath (bare host directory volume)
Path: /opt/cni/bin
HostPathType:
cni-net-dir:
Type: HostPath (bare host directory volume)
Path: /etc/cni/net.d
HostPathType:
dockershim:
Type: HostPath (bare host directory volume)
Path: /var/run/dockershim.sock
HostPathType:
log-dir:
Type: HostPath (bare host directory volume)
Path: /var/log/aws-routed-eni
HostPathType: DirectoryOrCreate
run-dir:
Type: HostPath (bare host directory volume)
Path: /var/run/aws-node
HostPathType: DirectoryOrCreate
xtables-lock:
Type: HostPath (bare host directory volume)
Path: /run/xtables.lock
HostPathType:
kube-api-access-mt7lp:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: op=Exists
node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m32s default-scheduler Successfully assigned kube-system/aws-node-xljqg to ip-172-25-149-156.ec2.internal
Warning FailedCreatePodSandBox 2m32s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/d9bb50134187f1734edb616a475f001e36289e1bac550dfa60cf0f2fd05edb1d/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
Warning FailedCreatePodSandBox 2m20s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/e45c28b7a8c0ec4d65be2a774de49d4812095af80b96b8ceade6e37557bd9ccf/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
Warning FailedCreatePodSandBox 2m8s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/d228a45967c5e6d3393f878a03925448dad399f8dd9e3916023d1607a13a5f4d/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
Warning FailedCreatePodSandBox 115s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/35f93eb4ea975862d534b861723a58fa150f3a9e0a226b390661708653c93b51/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
Warning FailedCreatePodSandBox 104s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/b3352d3f9ff3eed05d414798ddc98abf823792dc88814db27a91feaca0aafde4/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
Warning FailedCreatePodSandBox 89s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/42c14b49a6a1b5573a5be6b01a61c7a2921e43236294911eb58fa49a25ffce64/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
Warning FailedCreatePodSandBox 78s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/6d014b312b6c69c22c2f12918f6719f9e87189aa230c41e3434719ffe20942e4/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
Warning FailedCreatePodSandBox 67s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/0add6d800ff66ac7800dde8246a33d5c62b1b324a21c7741715a52d574d2d894/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
Warning FailedCreatePodSandBox 55s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/54fddb5528d140ca87a8b5a6da4719294fe52979b62b39c6586367bed66ef124/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
Warning FailedCreatePodSandBox 4s (x4 over 40s) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/2c4ad8e61ecead2f106816043ef341d590933cda9bafa4ce49ec382631a6b20d/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown
This is the node (which is not in ready state):
Name: ip-172-25-149-156.ec2.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=g6.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-1
failure-domain.beta.kubernetes.io/zone=us-east-1a
instance-types=ml-test
k8s.io/cloud-provider-aws=92e8930f7fc6a131864b8a2de4a46e59
karpenter.k8s.aws/instance-category=g
karpenter.k8s.aws/instance-cpu=4
karpenter.k8s.aws/instance-encryption-in-transit-supported=true
karpenter.k8s.aws/instance-family=g6
karpenter.k8s.aws/instance-generation=6
karpenter.k8s.aws/instance-gpu-count=1
karpenter.k8s.aws/instance-gpu-manufacturer=nvidia
karpenter.k8s.aws/instance-gpu-memory=22888
karpenter.k8s.aws/instance-gpu-name=l4
karpenter.k8s.aws/instance-hypervisor=nitro
karpenter.k8s.aws/instance-local-nvme=250
karpenter.k8s.aws/instance-memory=16384
karpenter.k8s.aws/instance-size=xlarge
karpenter.sh/capacity-type=on-demand
karpenter.sh/nodepool=ml-test
karpenter.sh/registered=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-172-25-149-156.ec2.internal
kubernetes.io/os=linux
managed-by=karpenter
node.kubernetes.io/instance-type=g6.xlarge
os=al2023
topology.k8s.aws/zone-id=use1-az2
topology.kubernetes.io/region=us-east-1
topology.kubernetes.io/zone=us-east-1a
Annotations: alpha.kubernetes.io/provided-node-ip: 172.25.149.156
karpenter.k8s.aws/ec2nodeclass-hash: 13145685265961450066
karpenter.k8s.aws/ec2nodeclass-hash-version: v1
karpenter.sh/nodepool-hash: 3154188912813901712
karpenter.sh/nodepool-hash-version: v1
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Sun, 03 Nov 2024 11:32:20 +0200
Taints: node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/not-ready:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-172-25-149-156.ec2.internal
AcquireTime: <unset>
RenewTime: Sun, 03 Nov 2024 11:39:28 +0200
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Sun, 03 Nov 2024 11:37:37 +0200 Sun, 03 Nov 2024 11:32:17 +0200 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sun, 03 Nov 2024 11:37:37 +0200 Sun, 03 Nov 2024 11:32:17 +0200 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sun, 03 Nov 2024 11:37:37 +0200 Sun, 03 Nov 2024 11:32:17 +0200 KubeletHasSufficientPID kubelet has sufficient PID available
Ready False Sun, 03 Nov 2024 11:37:37 +0200 Sun, 03 Nov 2024 11:32:17 +0200 KubeletNotReady container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Addresses:
InternalIP: 172.25.149.156
InternalDNS: ip-172-25-149-156.ec2.internal
Hostname: ip-172-25-149-156.ec2.internal
Capacity:
cpu: 4
ephemeral-storage: 104779756Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15727836Ki
pods: 58
Allocatable:
cpu: 3920m
ephemeral-storage: 95491281146
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 14301404Ki
pods: 58
System Info:
Machine ID: ec2975af9486489e594a38045a562fab
System UUID: ec2975af-9486-489e-594a-38045a562fab
Boot ID: 4eda1b61-0b73-481e-b0a7-20a0fa599cec
Kernel Version: 6.1.112-122.189.amzn2023.x86_64
OS Image: Amazon Linux 2023.6.20241010
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.22
Kubelet Version: v1.30.4-eks-a737599
Kube-Proxy Version: v1.30.4-eks-a737599
ProviderID: aws:///us-east-1a/i-055bf023aaa7684f9
Non-terminated Pods: (6 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
elastic-search fluent-bit-h4n7p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7m15s
kube-system aws-node-xljqg 50m (1%) 0 (0%) 0 (0%) 0 (0%) 7m15s
kube-system datadog-z7nk5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 7m15s
kube-system ebs-csi-node-6qxk8 30m (0%) 0 (0%) 120Mi (0%) 768Mi (5%) 7m15s
kube-system kube-proxy-85d4s 100m (2%) 0 (0%) 0 (0%) 0 (0%) 7m15s
kube-system node-local-dns-h75wp 25m (0%) 0 (0%) 50Mi (0%) 0 (0%) 7m15s
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 205m (5%) 0 (0%)
memory 170Mi (1%) 768Mi (5%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 7m19s kubelet Starting kubelet.
Warning InvalidDiskCapacity 7m19s kubelet invalid capacity 0 on image filesystem
Normal NodeHasSufficientMemory 7m19s (x2 over 7m19s) kubelet Node ip-172-25-149-156.ec2.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 7m19s (x2 over 7m19s) kubelet Node ip-172-25-149-156.ec2.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 7m19s (x2 over 7m19s) kubelet Node ip-172-25-149-156.ec2.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 7m19s kubelet Updated Node Allocatable limit across pods
Normal Synced 7m15s cloud-node-controller Node synced successfully
Normal RegisteredNode 7m14s node-controller Node ip-172-25-149-156.ec2.internal event: Registered Node ip-172-25-149-156.ec2.internal in Controller
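The NotReady condition above follows from the same failure: the aws-vpc-cni-init container is what installs the AWS CNI plugin binaries and config onto the host, so while the sandbox cannot start, kubelet never finds a CNI config. A quick check on the node (a sketch, assuming shell access):

```bash
# On the node: the AWS CNI pieces never arrive because aws-vpc-cni-init never ran.
ls -la /opt/cni/bin/     # no aws-cni binary copied in yet
ls -la /etc/cni/net.d/   # no 10-aws.conflist, hence "cni plugin not initialized"
```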
Attach logs
[root@ip-172-25-179-103 bin]# journalctl -u kubelet -f
Nov 03 10:11:44 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:44.999106 2817 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="kube-system/ebs-csi-node-8rq4x" podUID="a6ae0c4d-55f7-4876-bedd-fa2d177c4d5b"
Nov 03 10:11:44 ip-172-25-179-103.ec2.internal kubelet[2817]: I1103 10:11:44.999327 2817 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="kube-system/kube-proxy-gtcqj"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:45.054848 2817 remote_runtime.go:193] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/847d67535ba8738d48944df82127c5dc663e556d0f880ce0e562bc7f6fc9d046/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:45.054893 2817 kuberuntime_sandbox.go:72] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/847d67535ba8738d48944df82127c5dc663e556d0f880ce0e562bc7f6fc9d046/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown" pod="kube-system/kube-proxy-gtcqj"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:45.054913 2817 kuberuntime_manager.go:1170] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/847d67535ba8738d48944df82127c5dc663e556d0f880ce0e562bc7f6fc9d046/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown" pod="kube-system/kube-proxy-gtcqj"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:45.054963 2817 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"kube-proxy-gtcqj_kube-system(3f7ad267-95fd-485c-a4de-599d1e6d1de6)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"kube-proxy-gtcqj_kube-system(3f7ad267-95fd-485c-a4de-599d1e6d1de6)\\\": rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/k8s.io/847d67535ba8738d48944df82127c5dc663e556d0f880ce0e562bc7f6fc9d046/log.json: no such file or directory): fork/exec /usr/bin/nvidia-container-runtime: no such file or directory: unknown\"" pod="kube-system/kube-proxy-gtcqj" podUID="3f7ad267-95fd-485c-a4de-599d1e6d1de6"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: I1103 10:11:45.998128 2817 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="kube-system/datadog-tsxlg"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: I1103 10:11:45.998128 2817 util.go:30] "No sandbox for pod can be found. Need to start a new one" pod="elastic-search/fluent-bit-bl6rn"
Nov 03 10:11:45 ip-172-25-179-103.ec2.internal kubelet[2817]: E1103 10:11:45.998397 2817 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="kube-system/datadog-tsxlg" podUID="e5fef0d8-4b44-40a1-8ef3-a04522471bfe"
What you expected to happen:
VPC CNI pods should be able to run on GPU instances (g5.xlarge) without requiring manual NVIDIA runtime installation, especially since these instances are supported in EKS.
How to reproduce it (as minimally and precisely as possible):
1. Create an EKS cluster (1.30).
2. Configure Karpenter with the AL2023 AMI.
3. Create a NodePool using the g5.16xlarge instance type.
4. Attempt to schedule pods on the new node.
5. Observe the aws-node pod failing to start due to the missing NVIDIA runtime (verification commands are sketched below).
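To observe the failure from the cluster side (standard kubectl; the instance ID below is this report's node, used as an example):

```bash
# The aws-node DaemonSet pods stay Pending on the new GPU node.
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

# The sandbox failures show up in the pod events.
kubectl describe pod -n kube-system -l k8s-app=aws-node | grep -A1 FailedCreatePodSandBox

# Confirm which AMI the instance actually launched with (instance ID taken
# from the node's providerID above).
aws ec2 describe-instances --instance-ids i-055bf023aaa7684f9 \
  --query 'Reservations[].Instances[].ImageId' --output text
```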
Anything else we need to know?:
I'm not sure if this is a bug or a misconfiguration on my side; perhaps g5/g6 nodes need a different AMI?
The issue was resolved by using the AMI ami-0ab46b6e2dbe2a9d9.
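A hypothetical sketch of applying that fix with Karpenter (the EC2NodeClass name ml-test is assumed from the nodepool label above; note that a merge patch replaces the whole amiSelectorTerms list):

```bash
# Hypothetical fix sketch: pin the EC2NodeClass to the AMI that worked.
# "ml-test" is assumed from the karpenter.sh/nodepool label; adjust as needed.
kubectl patch ec2nodeclass ml-test --type merge \
  -p '{"spec":{"amiSelectorTerms":[{"id":"ami-0ab46b6e2dbe2a9d9"}]}}'
```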
We are running Karpenter, and the initial configuration for the node pool was set as follows:
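Since the snippet itself was not captured here, below is a hypothetical reconstruction of such a pairing under the Karpenter v1 API. The names, role, and discovery tags are placeholders; the relevant parts are the AL2023 alias and the GPU instance-family requirement:

```bash
# Hypothetical reconstruction -- names, role, and tags are placeholders.
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: ml-test
spec:
  amiSelectorTerms:
    - alias: al2023@latest          # AL2023 AMI, per the report above
  role: KarpenterNodeRole-placeholder
  subnetSelectorTerms:
    - tags: {karpenter.sh/discovery: placeholder}
  securityGroupSelectorTerms:
    - tags: {karpenter.sh/discovery: placeholder}
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ml-test
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: ml-test
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g5", "g6"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
EOF
```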
Environment:
OS (e.g: cat /etc/os-release):
Kernel (e.g. uname -a): Linux ip-172-25-179-103.ec2.internal 6.1.112-122.189.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Oct 8 17:02:11 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux