test: GB10 CUDA graph repro harness (T1.3, T1.5, T1.6)#95

Merged
dndungu merged 5 commits into main from wave-2-integration
Apr 16, 2026
@dndungu dndungu commented Apr 16, 2026

Summary

Wave 2 of the GB10 CUDA graph capture hang fix (docs/plan.md E1). Builds out the reproduction harness on top of Wave 1's probe primitives (#94).

  • T1.3: New compute/gpu_engine_gb10_test.go (185 lines) gated behind //go:build dgxgb10. Uploads 50 float32 tensors (incl. a 256×1024 matrix), begins capture, runs one MatMul, ends capture — guarded by context.WithTimeout(30s). Accepts three outcomes: clean capture, ErrCaptureIncompatibleAllocation, or t.Fatal on hang.
  • T1.5: 10 new CPU-mock tests across compute/capture_guard_test.go, compute/gpu_engine_alloc_guard_test.go, internal/cuda/runtime_purego_test.go. Closes coverage gaps:
    • ensureNotCapturing over all three CaptureStatus values (table-driven).
    • ensureNotCapturing probe-error propagation (does NOT masquerade as ErrCaptureIncompatibleAllocation).
    • ensureNotCapturing nil-Ptr branch.
    • allocWeight + uploadBytes each propagate the sentinel and the probe error.
    • ErrCaptureIncompatibleAllocation survives fmt.Errorf("%w", ...) wrapping.
    • cuda.StreamFromPtr(nil).Ptr() round-trip.
    • cuda.StreamCaptureStatus tolerates zero-handle stream when runtime is unavailable.
  • T1.5 production change (1-line indirection): var captureStatusFn = cuda.StreamCaptureStatus in compute/gpu_engine.go so tests can swap the probe. The call site changes from cuda.StreamCaptureStatus(s) to captureStatusFn(s).
  • T1.6: gofmt -s -w on the two E1 files that drifted (compute/gpu_engine_gb10_test.go trailing newline, internal/cuda/purego.go field alignment after cudaStreamGetCaptureInfo was added). golangci-lint delta on ./compute/... ./internal/cuda/... is zero (13 pre-existing issues in unrelated files).

Hardware run (T1.4 — submit the dgxgb10 test via a Spark manifest) stays in Wave 3 and is not part of this PR.

Verification report

  • Merge safety (M0-M5): PASS. wave-2-integration branched from fresh origin/main; merged wave-2-task-T1.3 then wave-2-task-T1.5 via --no-ff. Silent-revert check: every non-context line from each M1 patch is reflected on the integration branch.
  • Build: go build ./... PASS.
  • Test: go test ./compute/... ./internal/cuda/... -race -timeout 120s PASS.
  • Vet: go vet ./... → 28 warnings, identical to origin/main baseline (no delta).
  • Lint: golangci-lint run ./compute/... ./internal/cuda/... → 13 pre-existing issues, 0 in E1 files, 0 new.
  • Format: gofmt -s -l / goimports -l clean across all E1 files after T1.6 sweep.
  • Stub audit: 0 TODO/FIXME/Stub/Mock/Fake/Placeholder/NotImplemented in production diff.
  • Use case coverage: UC-001 + UC-002 (repro) via TestCUDAGraph_MultiTensorUpload_GB10; infrastructure via 10 new CPU-mock tests.

Files touched

  • compute/gpu_engine.go (+3 −1) — one-line captureStatusFn indirection for T1.5 testability
  • compute/capture_guard_test.go (+120) — extended guard coverage
  • compute/gpu_engine_alloc_guard_test.go (+113, new) — allocWeight/uploadBytes propagation tests
  • compute/gpu_engine_gb10_test.go (+184, new, build-tagged) — hardware repro
  • internal/cuda/runtime_purego_test.go (+35) — binding-level gap tests
  • internal/cuda/purego.go (±13, field alignment from T1.1 addition)
  • docs/plan.md — mark T1.3/T1.5/T1.6 complete

Test plan

  • go build ./...
  • go test ./compute/... ./internal/cuda/... -race -timeout 120s
  • go vet ./... (delta vs origin/main = 0)
  • gofmt -s -l / goimports -l on E1 files (clean)
  • golangci-lint run ./compute/... ./internal/cuda/... (0 new findings)
  • CI green on this PR (auto)
  • T1.4 (follow-on PR): submit cuda-graph-gb10-repro.yaml to Spark; attach log evidence

dndungu added 5 commits April 15, 2026 22:08
…agged)

Adds TestCUDAGraph_MultiTensorUpload_GB10 behind //go:build dgxgb10 so
CI never runs it. The Spark DGX pod (T1.4, next wave) will pass the tag
to reproduce the hang on real GB10 hardware.

The test uploads 50 float32 tensors (including a 256x1024 matrix),
begins capture, runs a MatMul inside the capture region, and calls
EndCapture. All three possible outcomes are observable:

- EndCapture returns cleanly: E2 fix is in place (test passes).
- ErrCaptureIncompatibleAllocation bubbles out: T1.2 probe caught the
  unsafe allocation synchronously (test passes).
- Capture body does not complete in 30s: hang is live, test fails via
  context.WithTimeout + t.Fatal.

Only compute/gpu_engine_gb10_test.go is added; no non-test files are
touched.
… sentinel wrapping)

Adds CPU-mock tests that close the Wave 1 gaps on the capture guard without
requiring CUDA hardware:

- ensureNotCapturing over all three CaptureStatus values (table-driven),
  the nil-Ptr branch, and probe-error propagation.
- allocWeight and uploadBytes propagate the ErrCaptureIncompatibleAllocation
  sentinel and the wrapped probe error unchanged.
- ErrCaptureIncompatibleAllocation survives fmt.Errorf %w wrapping.
- cuda.StreamFromPtr(nil).Ptr() round-trips, and StreamCaptureStatus tolerates
  a zero handle when the runtime is unavailable.

To enable probe-error and status-branch tests without CUDA, introduces a
single-line indirection in compute/gpu_engine.go:
  var captureStatusFn = cuda.StreamCaptureStatus
Tests swap it via swapCaptureStatusFn (test-only helper). Zero stub markers
in production code; test fakes confined to *_test.go files.

Verifies: [infrastructure]
- gofmt -s -w on compute/gpu_engine_gb10_test.go (trailing newline)
- gofmt -s -w on internal/cuda/purego.go (field-alignment delta from
  cudaStreamGetCaptureInfo addition in T1.1)
- Mark T1.3/T1.5/T1.6 complete in docs/plan.md

Wave 2 of ztensor E1 closes out: hardware repro test (T1.3),
CPU-mock coverage (T1.5), format/lint sweep (T1.6).

Next: Wave 3 (T1.4 -- Spark submission of the dgxgb10 test for
evidence capture), then Wave 4 (E2 fix work).
@dndungu dndungu merged commit 9bf9723 into main Apr 16, 2026
1 check passed
@dndungu dndungu deleted the wave-2-integration branch April 16, 2026 05:23