Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
Describe the bug
On a freshly installed OpenShift cluster on an IBM Fusion HCI system with two GPU servers, each with 8x L40S GPUs, several validator-related operator pods in the NVIDIA GPU Operator namespace are in CrashLoopBackOff state on those nodes.
To Reproduce
The pods restart every few seconds and enter CrashLoopBackOff.
Expected behavior
The validator pods should run to completion and remain in a steady state.
Environment (please provide the following information):
- GPU Operator Version: 25.3.4
- OS: Red Hat Enterprise Linux CoreOS 9.6.20250925-0 (Plow)
- Kernel Version: 5.14.0-570.49.1.el9_6.x86_64
- Container Runtime Version: 13.0
- OpenShift Distro and Version: OpenShift 4.20.0
Information to attach (optional if deemed irrelevant)
% oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
nvidia-cuda-validator-6zgzf 0/1 Init:CrashLoopBackOff 3 (40s ago) 86s
nvidia-cuda-validator-9dcgj 0/1 Init:CrashLoopBackOff 2 (25s ago) 41s
% oc get daemonset -n cpd-operators
No resources found in cpd-operators namespace.
% oc get daemonset -n openshift-operators
No resources found in openshift-operators namespace.
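To capture why the init container keeps crashing, the failing validator pods themselves can be inspected. A sketch of the commands (pod names are taken from the listing above and will differ on each cluster):

```shell
# Show the init container's last termination state (exit code, reason)
# for one of the crashing validator pods.
oc describe pod nvidia-cuda-validator-6zgzf -n nvidia-gpu-operator
# Wide listing to confirm which GPU nodes host the crashing pods.
oc get pods -n nvidia-gpu-operator -o wide | grep cuda-validator
```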
oc describe pod nvidia-container-toolkit-daemonset-hnxm7 -n nvidia-gpu-operator
Name: nvidia-container-toolkit-daemonset-hnxm7
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-container-toolkit
Node: compute-1-ru25.sde.cloud9.ibm.com/9.47.144.35
Start Time: Wed, 01 Apr 2026 04:32:53 -0400
Labels: app=nvidia-container-toolkit-daemonset
controller-revision-hash=748698f757
pod-template-generation=1
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.131.2.64/23"],"mac_address":"0a:58:0a:83:02:40","gateway_ips":["10.131.2.1"],"routes":[{"dest":"10.128.0.0...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.131.2.64"
],
"mac": "0a:58:0a:83:02:40",
"default": true,
"dns": {}
}]
openshift.io/scc: privileged
security.openshift.io/validated-scc-subject-type: serviceaccount
Status: Running
IP: 10.131.2.64
IPs:
IP: 10.131.2.64
Controlled By: DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
driver-validation:
Container ID: cri-o://e3855094d1c9ce2f237554bc2bc66a52cd2b116230b9222d61520e7e26b82e4b
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:619e7bf48168d76a7e087c3bf6190fda9b10449c9839309fa4f9ed5b8a9e8804
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:619e7bf48168d76a7e087c3bf6190fda9b10449c9839309fa4f9ed5b8a9e8804
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 01 Apr 2026 04:32:59 -0400
Finished: Wed, 01 Apr 2026 04:34:16 -0400
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-dir (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vwglc (ro)
Containers:
nvidia-container-toolkit-ctr:
Container ID: cri-o://c41940a1c37130b25368d7da40006db6f5c00d94aa0930eb2d769869634534b4
Image: nvcr.io/nvidia/k8s/container-toolkit@sha256:51c8f71d3b3c08ae4eb4853697e3f8e6f11e435e666e08210178e6a1faf8028f
Image ID: nvcr.io/nvidia/k8s/container-toolkit@sha256:51c8f71d3b3c08ae4eb4853697e3f8e6f11e435e666e08210178e6a1faf8028f
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
Args:
/bin/entrypoint.sh
State: Running
Started: Wed, 01 Apr 2026 04:34:17 -0400
Ready: True
Restart Count: 0
Environment:
ROOT: /usr/local/nvidia
NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND: management.nvidia.com/gpu
NVIDIA_VISIBLE_DEVICES: void
TOOLKIT_PID_FILE: /run/nvidia/toolkit/toolkit.pid
RUNTIME: crio
RUNTIME_CONFIG: /runtime/config-dir/99-nvidia.conf
CRIO_CONFIG: /runtime/config-dir/99-nvidia.conf
Mounts:
/bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
/driver-root from driver-install-dir (rw)
/host from host-root (ro)
/run/nvidia/toolkit from toolkit-root (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/runtime/config-dir/ from crio-config (rw)
/usr/local/nvidia from toolkit-install-dir (rw)
/usr/share/containers/oci/hooks.d from crio-hooks (rw)
/var/run/cdi from cdi-root (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vwglc (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
nvidia-container-toolkit-entrypoint:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nvidia-container-toolkit-entrypoint
Optional: false
toolkit-root:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/toolkit
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-dir:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
toolkit-install-dir:
Type: HostPath (bare host directory volume)
Path: /usr/local/nvidia
HostPathType:
crio-hooks:
Type: HostPath (bare host directory volume)
Path: /run/containers/oci/hooks.d
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
cdi-root:
Type: HostPath (bare host directory volume)
Path: /var/run/cdi
HostPathType: DirectoryOrCreate
crio-config:
Type: HostPath (bare host directory volume)
Path: /etc/crio/crio.conf.d
HostPathType: DirectoryOrCreate
kube-api-access-vwglc:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.container-toolkit=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events: <none>
Name: nvidia-container-toolkit-daemonset-l6fgw
Namespace: nvidia-gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-container-toolkit
Node: compute-1-ru28.sde.cloud9.ibm.com/9.47.144.36
Start Time: Wed, 01 Apr 2026 04:40:48 -0400
Labels: app=nvidia-container-toolkit-daemonset
controller-revision-hash=748698f757
pod-template-generation=1
Annotations: k8s.ovn.org/pod-networks:
{"default":{"ip_addresses":["10.128.4.47/23"],"mac_address":"0a:58:0a:80:04:2f","gateway_ips":["10.128.4.1"],"routes":[{"dest":"10.128.0.0...
k8s.v1.cni.cncf.io/network-status:
[{
"name": "ovn-kubernetes",
"interface": "eth0",
"ips": [
"10.128.4.47"
],
"mac": "0a:58:0a:80:04:2f",
"default": true,
"dns": {}
}]
openshift.io/scc: privileged
security.openshift.io/validated-scc-subject-type: serviceaccount
Status: Running
IP: 10.128.4.47
IPs:
IP: 10.128.4.47
Controlled By: DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
driver-validation:
Container ID: cri-o://31b4182e7de68552edf0ead344d24753948d731775ce8edd35338474459b7451
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:619e7bf48168d76a7e087c3bf6190fda9b10449c9839309fa4f9ed5b8a9e8804
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:619e7bf48168d76a7e087c3bf6190fda9b10449c9839309fa4f9ed5b8a9e8804
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Terminated
Reason: Completed
Exit Code: 0
Started: Wed, 01 Apr 2026 04:40:50 -0400
Finished: Wed, 01 Apr 2026 04:42:31 -0400
Ready: True
Restart Count: 0
Environment:
WITH_WAIT: true
COMPONENT: driver
OPERATOR_NAMESPACE: nvidia-gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/host-dev-char from host-dev-char (rw)
/run/nvidia/driver from driver-install-dir (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8w6hl (ro)
Containers:
nvidia-container-toolkit-ctr:
Container ID: cri-o://83448b497563ca01e649d5c1a62b0f7ae1db321af1de42dc0e61b1767966bfd6
Image: nvcr.io/nvidia/k8s/container-toolkit@sha256:51c8f71d3b3c08ae4eb4853697e3f8e6f11e435e666e08210178e6a1faf8028f
Image ID: nvcr.io/nvidia/k8s/container-toolkit@sha256:51c8f71d3b3c08ae4eb4853697e3f8e6f11e435e666e08210178e6a1faf8028f
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
Args:
/bin/entrypoint.sh
State: Running
Started: Wed, 01 Apr 2026 04:42:32 -0400
Ready: True
Restart Count: 0
Environment:
ROOT: /usr/local/nvidia
NVIDIA_CONTAINER_RUNTIME_MODES_CDI_DEFAULT_KIND: management.nvidia.com/gpu
NVIDIA_VISIBLE_DEVICES: void
TOOLKIT_PID_FILE: /run/nvidia/toolkit/toolkit.pid
RUNTIME: crio
RUNTIME_CONFIG: /runtime/config-dir/99-nvidia.conf
CRIO_CONFIG: /runtime/config-dir/99-nvidia.conf
Mounts:
/bin/entrypoint.sh from nvidia-container-toolkit-entrypoint (ro,path="entrypoint.sh")
/driver-root from driver-install-dir (rw)
/host from host-root (ro)
/run/nvidia/toolkit from toolkit-root (rw)
/run/nvidia/validations from run-nvidia-validations (rw)
/runtime/config-dir/ from crio-config (rw)
/usr/local/nvidia from toolkit-install-dir (rw)
/usr/share/containers/oci/hooks.d from crio-hooks (rw)
/var/run/cdi from cdi-root (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8w6hl (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
nvidia-container-toolkit-entrypoint:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: nvidia-container-toolkit-entrypoint
Optional: false
toolkit-root:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/toolkit
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
driver-install-dir:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/driver
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
toolkit-install-dir:
Type: HostPath (bare host directory volume)
Path: /usr/local/nvidia
HostPathType:
crio-hooks:
Type: HostPath (bare host directory volume)
Path: /run/containers/oci/hooks.d
HostPathType:
host-dev-char:
Type: HostPath (bare host directory volume)
Path: /dev/char
HostPathType:
cdi-root:
Type: HostPath (bare host directory volume)
Path: /var/run/cdi
HostPathType: DirectoryOrCreate
crio-config:
Type: HostPath (bare host directory volume)
Path: /etc/crio/crio.conf.d
HostPathType: DirectoryOrCreate
kube-api-access-8w6hl:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.container-toolkit=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events: <none>
oc logs nvidia-container-toolkit-daemonset-hnxm7 -c cuda-validation -n nvidia-gpu-operator
oc logs nvidia-container-toolkit-daemonset-l6fgw -c cuda-validation -n nvidia-gpu-operator
error: container cuda-validation is not valid for pod nvidia-container-toolkit-daemonset-hnxm7
error: container cuda-validation is not valid for pod nvidia-container-toolkit-daemonset-l6fgw
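The "is not valid for pod" errors above are expected: the cuda-validation init container belongs to the nvidia-cuda-validator pods, not to the container-toolkit daemonset pods. A corrected sketch of the log commands (pod names from the earlier listing; --previous shows the output of the last crashed attempt):

```shell
oc logs nvidia-cuda-validator-6zgzf -c cuda-validation -n nvidia-gpu-operator --previous
oc logs nvidia-cuda-validator-9dcgj -c cuda-validation -n nvidia-gpu-operator --previous
```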
oc logs nvidia-container-toolkit-daemonset-hnxm7 -n nvidia-gpu-operator
oc logs nvidia-container-toolkit-daemonset-l6fgw -n nvidia-gpu-operator
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
IS_HOST_DRIVER=false
NVIDIA_DRIVER_ROOT=/run/nvidia/driver
DRIVER_ROOT_CTR_PATH=/driver-root
NVIDIA_DEV_ROOT=/run/nvidia/driver
DEV_ROOT_CTR_PATH=/driver-root
time="2026-04-01T08:34:22Z" level=info msg="Parsing arguments"
time="2026-04-01T08:34:22Z" level=info msg="Starting nvidia-toolkit"
time="2026-04-01T08:34:22Z" level=info msg="disabling device node creation since --cdi-enabled=false"
time="2026-04-01T08:34:22Z" level=info msg="Verifying Flags"
time="2026-04-01T08:34:22Z" level=info msg=Initializing
time="2026-04-01T08:34:22Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:34:22Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2026-04-01T08:34:22Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:34:22Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2026-04-01T08:34:22Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:34:22Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
time="2026-04-01T08:34:22Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2026-04-01T08:34:22Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container.so.1' => '/usr/lib64/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/lib64/libnvidia-container.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/lib64/libnvidia-container.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
time="2026-04-01T08:34:22Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
time="2026-04-01T08:34:22Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container-go.so.1' => '/usr/lib64/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/lib64/libnvidia-container-go.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/lib64/libnvidia-container-go.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.cdi' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.cdi' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.legacy' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.legacy' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy'"
time="2026-04-01T08:34:22Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2026-04-01T08:34:22Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'"
time="2026-04-01T08:34:22Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-ctk' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-ctk' to '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-ctk'"
time="2026-04-01T08:34:22Z" level=info msg="Installing executable '/usr/bin/nvidia-cdi-hook' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing '/usr/bin/nvidia-cdi-hook' to '/usr/local/nvidia/toolkit/nvidia-cdi-hook.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-cdi-hook.real'"
time="2026-04-01T08:34:22Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-cdi-hook'"
time="2026-04-01T08:34:22Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-runtime.debug"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-runtime.log-level"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-runtime.mode"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-runtime.modes.cdi.annotation-prefixes"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-runtime.runtimes"
time="2026-04-01T08:34:22Z" level=info msg="Skipping unset option: nvidia-container-cli.debug"
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
[nvidia-container-runtime]
log-level = "info"
mode = "auto"
runtimes = ["/usr/bin/crun", "/usr/bin/runc", "docker-runc", "runc", "crun"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "management.nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime.modes.legacy]
cuda-compat-mode = "ldconfig"
[nvidia-container-runtime-hook]
path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook"
skip-mode-detection = true
[nvidia-ctk]
path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2026-04-01T08:34:22Z" level=info msg="Starting 'setup' for nvidia-toolkit"
time="2026-04-01T08:34:22Z" level=info msg="Installing prestart hook"
time="2026-04-01T08:34:22Z" level=info msg="Waiting for signal"
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
IS_HOST_DRIVER=false
NVIDIA_DRIVER_ROOT=/run/nvidia/driver
DRIVER_ROOT_CTR_PATH=/driver-root
NVIDIA_DEV_ROOT=/run/nvidia/driver
DEV_ROOT_CTR_PATH=/driver-root
time="2026-04-01T08:42:37Z" level=info msg="Parsing arguments"
time="2026-04-01T08:42:37Z" level=info msg="Starting nvidia-toolkit"
time="2026-04-01T08:42:37Z" level=info msg="disabling device node creation since --cdi-enabled=false"
time="2026-04-01T08:42:37Z" level=info msg="Verifying Flags"
time="2026-04-01T08:42:37Z" level=info msg=Initializing
time="2026-04-01T08:42:37Z" level=info msg="Installing NVIDIA container toolkit to '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:42:37Z" level=info msg="Removing existing NVIDIA container toolkit installation"
time="2026-04-01T08:42:37Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:42:37Z" level=info msg="Creating directory '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime'"
time="2026-04-01T08:42:37Z" level=info msg="Installing NVIDIA container library to '/usr/local/nvidia/toolkit'"
time="2026-04-01T08:42:37Z" level=info msg="Finding library libnvidia-container.so.1 (root=)"
time="2026-04-01T08:42:37Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container.so.1'"
time="2026-04-01T08:42:37Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container.so.1' => '/usr/lib64/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/lib64/libnvidia-container.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/lib64/libnvidia-container.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container.so.1' -> 'libnvidia-container.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Finding library libnvidia-container-go.so.1 (root=)"
time="2026-04-01T08:42:37Z" level=info msg="Checking library candidate '/usr/lib64/libnvidia-container-go.so.1'"
time="2026-04-01T08:42:37Z" level=info msg="Resolved link: '/usr/lib64/libnvidia-container-go.so.1' => '/usr/lib64/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/lib64/libnvidia-container-go.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/lib64/libnvidia-container-go.so.1.17.8' to '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/libnvidia-container-go.so.1' -> 'libnvidia-container-go.so.1.17.8'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.cdi' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.cdi' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.cdi'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime.legacy' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime.legacy' to '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime.legacy'"
time="2026-04-01T08:42:37Z" level=info msg="Installing NVIDIA container CLI from '/usr/bin/nvidia-container-cli'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-container-cli' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-container-cli' to '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-cli.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-cli'"
time="2026-04-01T08:42:37Z" level=info msg="Installing NVIDIA container runtime hook from '/usr/bin/nvidia-container-runtime-hook'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-container-runtime-hook' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-container-runtime-hook' to '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-container-runtime-hook'"
time="2026-04-01T08:42:37Z" level=info msg="Creating symlink '/usr/local/nvidia/toolkit/nvidia-container-toolkit' -> 'nvidia-container-runtime-hook'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-ctk' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-ctk' to '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-ctk.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-ctk'"
time="2026-04-01T08:42:37Z" level=info msg="Installing executable '/usr/bin/nvidia-cdi-hook' to /usr/local/nvidia/toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing '/usr/bin/nvidia-cdi-hook' to '/usr/local/nvidia/toolkit/nvidia-cdi-hook.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed '/usr/local/nvidia/toolkit/nvidia-cdi-hook.real'"
time="2026-04-01T08:42:37Z" level=info msg="Installed wrapper '/usr/local/nvidia/toolkit/nvidia-cdi-hook'"
time="2026-04-01T08:42:37Z" level=info msg="Installing NVIDIA container toolkit config '/usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml'"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-runtime.log-level"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-runtime.mode"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-runtime.modes.cdi.annotation-prefixes"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-runtime.runtimes"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-cli.debug"
time="2026-04-01T08:42:37Z" level=info msg="Skipping unset option: nvidia-container-runtime.debug"
Using config:
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false
supported-driver-capabilities = "compat32,compute,display,graphics,ngx,utility,video"
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
[nvidia-container-runtime]
log-level = "info"
mode = "auto"
runtimes = ["/usr/bin/crun", "/usr/bin/runc", "docker-runc", "runc", "crun"]
[nvidia-container-runtime.modes]
[nvidia-container-runtime.modes.cdi]
annotation-prefixes = ["cdi.k8s.io/"]
default-kind = "management.nvidia.com/gpu"
spec-dirs = ["/etc/cdi", "/var/run/cdi"]
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
[nvidia-container-runtime.modes.legacy]
cuda-compat-mode = "ldconfig"
[nvidia-container-runtime-hook]
path = "/usr/local/nvidia/toolkit/nvidia-container-runtime-hook"
skip-mode-detection = true
[nvidia-ctk]
path = "/usr/local/nvidia/toolkit/nvidia-ctk"
time="2026-04-01T08:42:37Z" level=info msg="Starting 'setup' for nvidia-toolkit"
time="2026-04-01T08:42:37Z" level=info msg="Installing prestart hook"
time="2026-04-01T08:42:37Z" level=info msg="Waiting for signal"
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
Attached debug bundle: nvidia-gpu-operator_20260422_1616.tar.gz
NOTE: please refer to the must-gather script for the debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com