validator: should validate GPU healthiness by using DCGM #670

@Dentrax

Feature Description

As far as I understand, the validator app currently validates only the driver installation.

It would be great to have additional validation steps for the DCGM installation. These could be enabled with an optional flag. A healthy DCGM is important for exporting metrics via dcgm-exporter.

The idea is similar to this cookbook script: https://github.com/aws/aws-parallelcluster-cookbook/blob/v3.8.0/cookbooks/aws-parallelcluster-slurm/files/default/config_slurm/scripts/health_checks/gpu_health_check.sh (cc @shivamerla for awareness)

GPU Health Check
Check GPU health by running the NVIDIA DCGM diagnostic tool.
If GPU_DEVICE_ORDINAL is set, the diagnostic check targets the GPU listed in the variable.
Prerequisites for the diagnostic check are:

  • node has an NVIDIA GPU
  • DCGM service is running
  • fabric manager service is running (if the node is NVSwitch enabled)
  • persistence mode is enabled for the target GPU
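
One of the prerequisites above, persistence mode, could be checked before running the diagnostic. The following is only a sketch, not the validator's or the cookbook's implementation. It assumes the text comes from `nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader`, which is expected to print lines such as `0, Enabled`. The function name `persistence_by_gpu` is hypothetical, and the DCGM and fabric manager service checks are not covered.

```python
def persistence_by_gpu(csv_output: str) -> dict[int, bool]:
    """Map each GPU index to True when its persistence mode is reported as Enabled.

    Assumes input from:
      nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
    """
    modes = {}
    for line in csv_output.splitlines():
        # Skip blank lines and anything that is not an "index, mode" pair.
        if "," not in line:
            continue
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode == "Enabled"
    return modes


def gpus_missing_persistence(csv_output: str) -> list[int]:
    """Return the GPU indices that would fail the persistence-mode prerequisite."""
    return [idx for idx, enabled in persistence_by_gpu(csv_output).items() if not enabled]
```

A validator step could refuse to start the DCGM diagnostic when `gpus_missing_persistence` returns a non-empty list, and report those indices instead.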
root@nvidia-dcgm-exporter-8kvn6:/# dcgmi diag -i 0 -r 2
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.1.8                                          |
| Driver Version Detected   | 535.154.05                                     |
| GPU Device IDs Detected   | 20b5                                           |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Skip                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
+---------------------------+------------------------------------------------+
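
To turn a report like the one above into a pass/fail decision, the validator would need to read the result column. The following is a minimal sketch that parses the plain-text table. The names `parse_diag` and `is_healthy` are hypothetical, the set of status words (Pass, Fail, Warn, Skip) is an assumption about what `dcgmi diag` prints, and treating only Fail as unhealthy is a policy choice made here for illustration.

```python
import re

# A result row has exactly two cells, e.g. "| Persistence Mode | Pass |".
ROW = re.compile(r"^\|\s*(?P<name>[^|]+?)\s*\|\s*(?P<result>[^|]+?)\s*\|$")

# Assumed status words; rows with any other value (versions, device IDs,
# the column header) are ignored.
STATUSES = {"Pass", "Fail", "Warn", "Skip"}


def parse_diag(output: str) -> dict[str, str]:
    """Map each diagnostic name to the first word of its result."""
    results = {}
    for line in output.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # borders and section separators
        words = match.group("result").split()
        # "Pass - All" is reduced to "Pass".
        if words and words[0] in STATUSES:
            results[match.group("name")] = words[0]
    return results


def is_healthy(results: dict[str, str]) -> bool:
    """Healthy when at least one diagnostic was parsed and none reported Fail."""
    return bool(results) and "Fail" not in results.values()
```

Parsing the human-readable table is fragile across DCGM versions, so a real implementation might prefer a machine-readable output mode if the installed `dcgmi` offers one.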

The diagnostic should be run for each GPU; the GPUs on a node can be listed with nvidia-smi -L.
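
The per-GPU loop could be sketched as follows. This assumes `nvidia-smi -L` prints one line per GPU beginning with `GPU <index>:`, and that GPU_DEVICE_ORDINAL, when set, holds a comma-separated list of GPU indices. The function names are hypothetical; the command mirrors the `dcgmi diag -i 0 -r 2` invocation shown earlier.

```python
import re

# Matches top-level GPU lines; indented MIG device lines are not matched.
GPU_LINE = re.compile(r"^GPU (\d+):")


def gpu_indices(listing: str) -> list[int]:
    """Extract GPU indices from the output of `nvidia-smi -L`."""
    return [int(m.group(1)) for line in listing.splitlines() if (m := GPU_LINE.match(line))]


def target_gpus(listing: str, env: dict[str, str]) -> list[int]:
    """Prefer GPU_DEVICE_ORDINAL when set (assumed comma-separated), else all listed GPUs."""
    ordinal = env.get("GPU_DEVICE_ORDINAL", "").strip()
    if ordinal:
        return [int(part) for part in ordinal.split(",") if part.strip()]
    return gpu_indices(listing)


def diag_command(index: int, level: int = 2) -> list[str]:
    """Build the argument list for one per-GPU diagnostic run."""
    return ["dcgmi", "diag", "-i", str(index), "-r", str(level)]


if __name__ == "__main__":
    # Illustrative listing; the model name and UUIDs are made up.
    listing = "GPU 0: NVIDIA A100 (UUID: GPU-aaaa)\nGPU 1: NVIDIA A100 (UUID: GPU-bbbb)\n"
    for idx in target_gpus(listing, {}):
        print(" ".join(diag_command(idx)))
```

Each command could then be executed with `subprocess.run` and its output checked, so that one failing GPU does not hide the results for the others.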

Labels: feature (issue/PR that proposes a new feature or functionality), lifecycle/frozen
