From 9ff1274fee112432233736c9fcfa435d3fd8ef88 Mon Sep 17 00:00:00 2001
From: peterschmidt85
Date: Tue, 12 Aug 2025 16:11:47 +0200
Subject: [PATCH 1/2] [Changelog] Introducing GPU passive health checks

---
 docs/blog/posts/gpu-health-checks.md | 73 ++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 docs/blog/posts/gpu-health-checks.md

diff --git a/docs/blog/posts/gpu-health-checks.md b/docs/blog/posts/gpu-health-checks.md
new file mode 100644
index 0000000000..10ecfca10e
--- /dev/null
+++ b/docs/blog/posts/gpu-health-checks.md
@@ -0,0 +1,73 @@
+---
+title: Introducing passive GPU health checks
+date: 2025-08-12
+description: "TBA"
+slug: gpu-health-checks
+image: https://dstack.ai/static-assets/static-assets/images/gpu-health-checks.png
+categories:
+  - Changelog
+---
+
+# Introducing passive GPU health checks
+
+In large-scale training, a single bad GPU can derail progress. Sometimes the failure is obvious — jobs crash outright. Other times it’s subtle: correctable memory errors, intermittent instability, or thermal throttling that quietly drags down throughput. In big experiments, these issues can go unnoticed for hours or days, wasting compute and delaying results.
+
+`dstack` already supports GPU telemetry monitoring through NVIDIA DCGM [metrics](../../docs/guides/metrics.md), covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM [background health checks](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks). With these, `dstack` continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.
+
+
+
+
+
+## Why this matters
+
+Multi-GPU and multi-node workloads are only as strong as their weakest component. GPU cloud providers increasingly rely on automated health checks to prevent degraded hardware from reaching customers. Problems can stem from ECC memory errors, faulty PCIe links, overheating, or other hardware-level issues. Some are fatal, others allow the GPU to run but at reduced performance or with higher failure risk.
+
+Passive checks like these run in the background. They collect telemetry, run lightweight internal tests, and compare results to NVIDIA’s known failure patterns — without pausing workloads.
+
+## How it works in dstack
+
+`dstack` automatically queries DCGM for each fleet instance and appends a health status:
+
+* An `idle` status means no issues have been detected.
+* An `idle (warning)` status indicates a non-fatal issue, such as a correctable ECC error. The instance remains usable but should be monitored.
+* An `idle (failure)` status points to a fatal issue, and the instance is automatically excluded from scheduling.
+
+
+
+```shell
+$ dstack fleet
+
+ FLEET     INSTANCE  BACKEND          RESOURCES  STATUS          PRICE   CREATED
+ my-fleet  0         aws (us-east-1)  T4:16GB:1  idle            $0.526  11 mins ago
+           1         aws (us-east-1)  T4:16GB:1  idle (warning)  $0.526  11 mins ago
+           2         aws (us-east-1)  T4:16GB:1  idle (failure)  $0.526  11 mins ago
+```
+
+
+
+A healthy instance is ready for workloads. A warning means you should monitor it closely. A failure removes it from scheduling entirely.
+
+## Passive vs active checks
+
+This release focuses on passive checks using DCGM background health checks. These run continuously and do not interrupt workloads.
+
+For active checks today, you can run [NCCL tests](../../examples/clusters/nccl-tests/index.md) as a [distributed task](../../docs/concepts/tasks.md#distributed-tasks) to verify GPU-to-GPU communication and bandwidth across a fleet. Active tests like these can reveal network or interconnect issues that passive monitoring might miss. More built-in support for active diagnostics is planned.
+
+## Supported backends
+
+Passive GPU health checks work on AWS (except with custom `os_images`), Azure (except A10 GPUs), GCP, OCI, and [SSH fleets](../../docs/concepts/fleets.md#ssh) where DCGM is installed and configured for background checks.
+
+> Fleets created before version 0.19.22 need to be recreated to enable this feature.
+
+## Looking ahead
+
+This update is about visibility: giving engineers real-time insight into GPU health before jobs run. Next comes automation — policies to skip GPUs with warnings, and self-healing workflows that replace unhealthy instances without manual steps.
+
+If you have experience with GPU reliability or ideas for automated recovery, join the conversation on
+[Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}.
+
+!!! info "What's next?"
+    1. Check [Quickstart](../../docs/quickstart.md)
+    2. Explore the [clusters](../../docs/guides/clusters.md) guide
+    3. Learn more about [metrics](../../docs/guides/metrics.md)
+    4. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}

From 3a8cc3fcf9fb9e2e7e60f11c733124d4ee41c66e Mon Sep 17 00:00:00 2001
From: peterschmidt85
Date: Tue, 12 Aug 2025 16:22:05 +0200
Subject: [PATCH 2/2] [Changelog] Introducing GPU passive health checks

Review feedback
---
 docs/blog/posts/gpu-health-checks.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/blog/posts/gpu-health-checks.md b/docs/blog/posts/gpu-health-checks.md
index 10ecfca10e..fdd5b75e62 100644
--- a/docs/blog/posts/gpu-health-checks.md
+++ b/docs/blog/posts/gpu-health-checks.md
@@ -22,7 +22,7 @@ In large-scale training, a single bad GPU can derail progress. Sometimes the fai
 
 Multi-GPU and multi-node workloads are only as strong as their weakest component. GPU cloud providers increasingly rely on automated health checks to prevent degraded hardware from reaching customers. Problems can stem from ECC memory errors, faulty PCIe links, overheating, or other hardware-level issues. Some are fatal, others allow the GPU to run but at reduced performance or with higher failure risk.
 
-Passive checks like these run in the background. They collect telemetry, run lightweight internal tests, and compare results to NVIDIA’s known failure patterns — without pausing workloads.
+Passive checks like these run in the background. They continuously monitor hardware telemetry and system events, evaluating them against NVIDIA’s known failure patterns — all without pausing workloads.
 
 ## How it works in dstack
 