[Bug]: nvidia-sandbox-validator pods crash when card doesn't support SR-IOV #2365

@jorti

Description

Describe the bug
I'm configuring NVIDIA GPU Operator 26.3.0 on OpenShift Virtualization 4.21 to run VMs using vGPUs from a Tesla T4 card. The driver version I'm using is 595.58.02.

Tesla cards don't support SR-IOV:

# nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                              : Sun Apr 12 08:03:33 2026
Driver Version                                         : 595.58.02
CUDA Version                                           : Not Found
vGPU Driver Capability
        Heterogenous Multi-vGPU                        : Supported

Attached GPUs                                          : 1
GPU 00000000:41:00.0
    Product Name                                       : Tesla T4
    Product Brand                                      : NVIDIA
    Product Architecture                               : Turing
    Display Mode                                       : Requested functionality has been deprecated
    Display Attached                                   : Yes
    Display Active                                     : Disabled
    Persistence Mode                                   : Enabled
    Addressing Mode                                    : N/A
    vGPU Device Capability
        Fractional Multi-vGPU                          : Supported
        Heterogeneous Time-Slice Profiles              : Supported
        Heterogeneous Time-Slice Sizes                 : Supported
        Homogeneous Placements                         : Not Supported
        MIG Time-Slicing                               : Not Supported
        MIG Time-Slicing Mode                          : Disabled
[...]
    GPU Virtualization Mode
        Virtualization Mode                            : Host VGPU
        Host VGPU Mode                                 : Non SR-IOV   <----
        vGPU Heterogeneous Mode                        : Disabled

However, when querying sriov_totalvfs on the node, I get 16:

# lspci|grep -i nvidia
a1:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

# cat /sys/bus/pci/devices/0000\:a1\:00.0/sriov_numvfs 
0

# cat /sys/bus/pci/devices/0000\:a1\:00.0/sriov_totalvfs 
16

This causes the nvidia-sandbox-validator pod's init container to loop waiting for the VFs, and it crashes after a while:

time="2026-04-06T10:22:16Z" level=info msg="Waiting for VFs to be available..."
2026/04/06 10:22:16 WARNING: unable to detect IOMMU FD for [0000:a1:00.0 open /sys/bus/pci/devices/0000:a1:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
time="2026-04-06T10:22:16Z" level=info msg="Waiting for VFs: 0/16 enabled across 1 GPU(s)"
2026/04/06 10:22:21 WARNING: unable to detect IOMMU FD for [0000:a1:00.0 open /sys/bus/pci/devices/0000:a1:00.0/vfio-dev: no such file or directory]: %!v(MISSING)
time="2026-04-06T10:22:21Z" level=info msg="Waiting for VFs: 0/16 enabled across 1 GPU(s)"

From what I can see, the operator waits for VFs whenever sriov_totalvfs is greater than 0, which is incorrect for this card, since it doesn't support SR-IOV:

if totalExpected == 0 {

https://github.com/NVIDIA/go-nvlib/blob/68058cecb77b8d5f014caec9a8e54e3485000b8e/pkg/nvpci/nvpci.go#L509
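One host-side signal that distinguishes this case is the "Host VGPU Mode" field in the `nvidia-smi -q` output shown above, which reports "Non SR-IOV" for the T4 even though sriov_totalvfs is 16. A hedged sketch of such a check (not the operator's implementation; the function name and parse-the-query-output approach are my assumptions):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hostVGPUModeIsSRIOV scans `nvidia-smi -q` output for the
// "Host VGPU Mode" field and reports whether the GPU runs vGPU in
// SR-IOV mode. It returns false when the field is absent or reads
// "Non SR-IOV", in which case waiting for VFs is pointless.
func hostVGPUModeIsSRIOV(q string) bool {
	sc := bufio.NewScanner(strings.NewReader(q))
	for sc.Scan() {
		if k, v, ok := strings.Cut(sc.Text(), ":"); ok &&
			strings.TrimSpace(k) == "Host VGPU Mode" {
			return strings.TrimSpace(v) != "Non SR-IOV"
		}
	}
	return false
}

func main() {
	// Excerpt matching the T4 output quoted in this report.
	sample := `
    GPU Virtualization Mode
        Virtualization Mode                            : Host VGPU
        Host VGPU Mode                                 : Non SR-IOV
`
	fmt.Println(hostVGPUModeIsSRIOV(sample)) // false
}
```

With a check like this, sriov_totalvfs > 0 alone would not be enough to enter the wait loop.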

To Reproduce

  1. Use OpenShift 4.21 cluster with virtualization operator (KubeVirt)
  2. Add kernel cmdline options amd_iommu=on iommu=pt to nodes
  3. Label nodes with nvidia.com/gpu.workload.config=vm-vgpu
  4. Create driver image for rhel9 with drivers version 595.58.02
  5. Install NVIDIA GPU Operator 26.3.0
  6. Create ClusterPolicy:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  daemonsets:
    updateStrategy: RollingUpdate
  dcgm:
    enabled: true
  dcgmExporter: {}
  devicePlugin: {}
  driver:
    enabled: false
    kernelModuleType: auto
  gfd: {}
  mig:
    strategy: single
  migManager:
    enabled: true
  nodeStatusExporter:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  sandboxDevicePlugin:
    enabled: true
  sandboxWorkloads:
    defaultWorkload: vm-vgpu
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: "true"
  vfioManager:
    enabled: false
  vgpuDeviceManager:
    config:
      default: default
      name: vgpu-devices-config
    enabled: false
  vgpuManager:
    enabled: true
    image: vgpu-manager
    repository: image-registry.openshift-image-registry.svc:5000/nvidia-gpu-operator
    version: 595.58.02

Expected behavior
No pods crashing

Environment (please provide the following information):

  • GPU Operator Version: 26.3.0
  • OS: Red Hat Enterprise Linux CoreOS 9.6.20260324-0
  • Kernel Version: 5.14.0-570.103.1.el9_6.x86_64
  • Container Runtime Version: CRI-O 1.34.6-2.rhaos4.21.gitbca534a.el9
  • Kubernetes Distro and Version: Red Hat OpenShift 4.21

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
$ oc get pods
NAME                                                 READY   STATUS       RESTARTS          AGE
gpu-operator-5455bd7dc6-qhk2l                        1/1     Running      0                 3d21h
nvidia-sandbox-device-plugin-daemonset-mhg8d         0/1     Init:1/2     0                 2d21h
nvidia-sandbox-device-plugin-daemonset-xrbf4         0/1     Init:1/2     0                 2d21h
nvidia-sandbox-validator-mhld9                       0/1     Init:1/3     610 (116s ago)    2d21h
nvidia-sandbox-validator-w8psk                       0/1     Init:Error   608 (5m36s ago)   2d21h
nvidia-vgpu-manager-daemonset-9.6.20260324-0-4f5w7   2/2     Running      0                 2d21h
nvidia-vgpu-manager-daemonset-9.6.20260324-0-slkq4   2/2     Running      0                 2d21h
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
$ oc get ds
NAME                                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                        AGE
gpu-feature-discovery                          0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                     2d21h
nvidia-container-toolkit-daemonset             0         0         0       0            0           nvidia.com/gpu.deploy.container-toolkit=true                                                                         2d21h
nvidia-dcgm                                    0         0         0       0            0           nvidia.com/gpu.deploy.dcgm=true                                                                                      2d21h
nvidia-dcgm-exporter                           0         0         0       0            0           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                             2d21h
nvidia-device-plugin-daemonset                 0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true                                                                             2d21h
nvidia-device-plugin-mps-control-daemon        0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true                                                 2d21h
nvidia-mig-manager                             0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                                                               2d21h
nvidia-node-status-exporter                    0         0         0       0            0           nvidia.com/gpu.deploy.node-status-exporter=true                                                                      2d21h
nvidia-operator-validator                      0         0         0       0            0           nvidia.com/gpu.deploy.operator-validator=true                                                                        2d21h
nvidia-sandbox-device-plugin-daemonset         2         2         0       2            0           nvidia.com/gpu.deploy.sandbox-device-plugin=true                                                                     2d21h
nvidia-sandbox-validator                       2         2         0       2            0           nvidia.com/gpu.deploy.sandbox-validator=true                                                                         2d21h
nvidia-vgpu-manager-daemonset-9.6.20260324-0   2         2         2       2            2           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=9.6.20260324-0,nvidia.com/gpu.deploy.vgpu-manager=true   2d21h
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
$ oc describe pod nvidia-sandbox-validator-mhld9
Name:                 nvidia-sandbox-validator-mhld9
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-sandbox-validator
Node:                 dell-r7525-01.gsslab.brq2.redhat.com/10.37.192.52
Start Time:           Sun, 12 Apr 2026 12:18:18 +0200
Labels:               app=nvidia-sandbox-validator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=7868c9cbcc
                      pod-template-generation=1
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.129.2.150/23"],"mac_address":"0a:58:0a:81:02:96","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.128.0....
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.129.2.150"
                            ],
                            "mac": "0a:58:0a:81:02:96",
                            "default": true,
                            "dns": {}
                        }]
                      openshift.io/scc: nvidia-sandbox-validator
                      security.openshift.io/validated-scc-subject-type: serviceaccount
Status:               Pending
IP:                   10.129.2.150
IPs:
  IP:           10.129.2.150
Controlled By:  DaemonSet/nvidia-sandbox-validator
Init Containers:
  vfio-pci-validation:
    Container ID:  cri-o://82b67f651ac1076a93417cb885b70851e1ca8bb37b8c340c40dd71ce755eeeb8
    Image:         nvcr.io/nvidia/gpu-operator:v26.3.0@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 12 Apr 2026 12:18:19 +0200
      Finished:     Sun, 12 Apr 2026 12:18:19 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:                    true
      COMPONENT:                    vfio-pci
      NODE_NAME:                     (v1:spec.nodeName)
      DEFAULT_GPU_WORKLOAD_CONFIG:  vm-vgpu
    Mounts:
      /host from host-root (ro)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xqks5 (ro)
  vgpu-manager-validation:
    Container ID:  cri-o://895ffde9b861cdc8fad1537a81f299eb2c1212f6699f6d9c563fe70479f556df
    Image:         nvcr.io/nvidia/gpu-operator:v26.3.0@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
    Image ID:      nvcr.io/nvidia/gpu-operator@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Running
      Started:      Wed, 15 Apr 2026 09:33:37 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 15 Apr 2026 09:28:25 +0200
      Finished:     Wed, 15 Apr 2026 09:33:25 +0200
    Ready:          False
    Restart Count:  610
    Environment:
      WITH_WAIT:                    true
      COMPONENT:                    vgpu-manager
      NODE_NAME:                     (v1:spec.nodeName)
      DEFAULT_GPU_WORKLOAD_CONFIG:  vm-vgpu
    Mounts:
      /host from host-root (ro)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xqks5 (ro)
  vgpu-devices-validation:
    Container ID:  
    Image:         nvcr.io/nvidia/gpu-operator:v26.3.0@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:                    true
      COMPONENT:                    vgpu-devices
      NODE_NAME:                     (v1:spec.nodeName)
      DEFAULT_GPU_WORKLOAD_CONFIG:  vm-vgpu
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xqks5 (ro)
Containers:
  nvidia-sandbox-validator:
    Container ID:  
    Image:         nvcr.io/nvidia/gpu-operator:v26.3.0@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; while true; do sleep 86400; done
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xqks5 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 False 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  kube-api-access-xqks5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    Optional:                false
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.sandbox-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                       From     Message
  ----     ------   ----                      ----     -------
  Normal   Created  8m23s (x610 over 2d21h)   kubelet  Created container: vgpu-manager-validation
  Warning  BackOff  3m22s (x5034 over 2d21h)  kubelet  Back-off restarting failed container vgpu-manager-validation in pod nvidia-sandbox-validator-mhld9_nvidia-gpu-operator(6996e347-078c-4060-b52c-6dfa9e26ebef)
  Normal   Pulled   3m11s (x611 over 2d21h)   kubelet  Container image "nvcr.io/nvidia/gpu-operator:v26.3.0@sha256:64ef3dafb9dd28eebe645010e9a5df1efccddea8ad2d61bb6e0a0c4e12eea310" already present on machine
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
    (shared relevant logs above)

  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
    (shared relevant output above)

  • containerd logs journalctl -u containerd > containerd.log

Metadata

Labels: bug (Issue/PR to expose/discuss/fix a bug), needs-triage (issue or PR has not been assigned a priority-px label)
