feat: WithCapture helper + capture watchdog (T2.1a, T4.1) #96
Merged
…mpling

Add a captureWatchdog goroutine that monitors CUDA graph capture health during stream capture. The watchdog:
- Polls cuda.StreamCaptureStatus every 1 second
- Detects CaptureStatusInvalidated and force-ends capture
- Enforces a 30-second total timeout via context.WithTimeout
- Treats probe stalls (>5s) as hang signals
- Is a no-op when stream is nil (CPU-only builds)
- Cleans up via cancel() when capture completes normally

The watchdog is wired into captureAndRun between StreamBeginCapture and StreamEndCapture. On error, capture falls back to uncaptured execution via the existing failure path.

Tests in capture_watchdog_test.go cover nil-stream no-op, cancel stops goroutine, sentinel error identity, and default timeout value. All tests run without CUDA.
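The polling loop described in this commit can be sketched roughly as follows. This is an illustrative approximation, not the code from graph/cuda_graph.go: the CaptureStatus type, the probe callback, the watchCapture name, and the error messages are all assumptions, and the >5s probe-stall detection is left out for brevity. Only the sentinel names ErrCaptureTimeout and ErrCaptureInvalidated, the nil no-op, and the use of context.WithTimeout come from the description above.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// Sentinel error names are taken from the PR; the messages are placeholders.
var (
	ErrCaptureTimeout     = errors.New("capture watchdog: timeout")
	ErrCaptureInvalidated = errors.New("capture watchdog: capture invalidated")
)

// CaptureStatus is a hypothetical stand-in for the CUDA capture status enum.
type CaptureStatus int

const (
	CaptureStatusNone CaptureStatus = iota
	CaptureStatusActive
	CaptureStatusInvalidated
)

// watchCapture calls probe once per interval until the caller cancels ctx
// (normal completion, returns nil), the total timeout elapses
// (ErrCaptureTimeout), or probe reports an invalidated capture
// (ErrCaptureInvalidated). A nil probe mirrors the nil-stream no-op.
func watchCapture(ctx context.Context, probe func() CaptureStatus, interval, timeout time.Duration) error {
	if probe == nil {
		return nil
	}
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			if errors.Is(ctx.Err(), context.DeadlineExceeded) {
				return ErrCaptureTimeout
			}
			return nil // cancelled by the caller: capture finished normally
		case <-ticker.C:
			if probe() == CaptureStatusInvalidated {
				return ErrCaptureInvalidated
			}
		}
	}
}

// demo exercises the three outcomes with fake probes and short durations.
func demo() (invalidated, timedOut, cancelled error) {
	bad := func() CaptureStatus { return CaptureStatusInvalidated }
	ok := func() CaptureStatus { return CaptureStatusActive }

	invalidated = watchCapture(context.Background(), bad, time.Millisecond, time.Second)
	timedOut = watchCapture(context.Background(), ok, time.Millisecond, 20*time.Millisecond)

	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate capture completing before any problem is seen
	cancelled = watchCapture(ctx, ok, time.Millisecond, time.Second)
	return
}
```

In this sketch the watchdog is a blocking function; a caller that wants the goroutine behaviour described above would start it with `go` and deliver the result over a channel, using a 1-second interval and 30-second timeout.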
…ifecycle

WithCapture(fn) wraps BeginCapture/EndCapture into a single call that ensures the CaptureAwareAllocator is engaged for the duration of fn. Returns the GraphHandle on success so callers can replay the captured graph. fn error takes precedence over EndCapture error; the graph is destroyed on fn failure.

Also introduces test-swappable indirection for StreamBeginCapture, StreamEndCapture, GraphInstantiate, and GraphDestroy, following the existing captureStatusFn pattern, so WithCapture can be unit-tested without real CUDA hardware.
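The error-precedence rule and the swappable-function pattern can be illustrated with a minimal sketch. Everything below is hypothetical: the GraphHandle representation, the beginCaptureFn/endCaptureFn/destroyGraphFn variables and their signatures, and the free function withCapture are assumptions made for illustration, and the allocator engagement and graph instantiation steps are omitted. The real method lives on GPUEngine[T].

```go
package main

import "errors"

// GraphHandle is a hypothetical stand-in for the engine's graph handle type.
type GraphHandle int

// Package-level function variables that tests can swap out, in the spirit of
// the captureStatusFn pattern mentioned above. Names and signatures are
// assumptions, not the PR's actual declarations.
var (
	beginCaptureFn = func() error { return nil }
	endCaptureFn   = func() (GraphHandle, error) { return 1, nil }
	destroyGraphFn = func(GraphHandle) {}
)

// Example errors used to demonstrate precedence.
var (
	errFnFailed  = errors.New("fn failed")
	errEndFailed = errors.New("end capture failed")
)

// withCapture runs fn between begin and end capture. Capture is always
// ended once it has begun. If fn fails, its error wins over any end-capture
// error and a successfully produced graph is destroyed.
func withCapture(fn func() error) (GraphHandle, error) {
	if err := beginCaptureFn(); err != nil {
		return 0, err
	}
	fnErr := fn()
	h, endErr := endCaptureFn()
	if fnErr != nil {
		if endErr == nil {
			destroyGraphFn(h) // do not leak a graph the caller never receives
		}
		return 0, fnErr
	}
	if endErr != nil {
		return 0, endErr
	}
	return h, nil
}
```

A test can replace endCaptureFn with a function returning errEndFailed and confirm that withCapture still returns errFnFailed when fn fails, which is the precedence rule stated in the commit message.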
Summary
Wave 4a of the GB10 CUDA graph capture fix (docs/plan.md E2+E4). Closes #93 partially — adds the two foundational primitives that E2 fix and E4 fallback build on.
- (*GPUEngine[T]).WithCapture(fn func() error) (GraphHandle, error): safe one-call API for entering a CUDA graph capture region through the engine. Correctly engages CaptureAwareAllocator for the duration of fn. Returns a replayable GraphHandle. 6 CPU-mock unit tests (nil stream, fn error propagation, begin/end error propagation, error precedence, valid handle return).
- captureWatchdog in graph/cuda_graph.go: 30s timeout watchdog goroutine that samples StreamCaptureStatus every second during capture. Detects Invalidated status and stalls. Sentinel errors ErrCaptureTimeout and ErrCaptureInvalidated. Wired into captureAndRun. 4 CPU-mock unit tests (nil-stream no-op, cancel stops goroutine, sentinel identity, default timeout).

These unblock Wave 4b (T2.2 capture-aware allocWeight routing, T2.3 workspace pre-allocation) and Wave 5 (T4.2 CaptureSafe helper).

Refs #93.
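For readers wondering how the two sentinel errors might be consumed, here is a minimal sketch of a caller that classifies a capture failure with errors.Is and then falls back to an uncaptured run. The runWithFallback and wrappedTimeout functions, the reason strings, and the error messages are hypothetical; only the sentinel names and the fall-back-on-error behaviour come from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Sentinel names come from the PR; the messages are placeholders.
var (
	ErrCaptureTimeout     = errors.New("capture timeout")
	ErrCaptureInvalidated = errors.New("capture invalidated")
)

// runWithFallback tries the captured path first. On any capture error it
// records why capture failed and runs the uncaptured path instead.
func runWithFallback(captured, uncaptured func() error) (reason string, err error) {
	cerr := captured()
	if cerr == nil {
		return "captured", nil
	}
	switch {
	case errors.Is(cerr, ErrCaptureTimeout):
		reason = "fallback: timeout"
	case errors.Is(cerr, ErrCaptureInvalidated):
		reason = "fallback: invalidated"
	default:
		reason = "fallback: other"
	}
	return reason, uncaptured()
}

// wrappedTimeout simulates a capture path that wraps the sentinel with
// context, which errors.Is still matches.
func wrappedTimeout() error {
	return fmt.Errorf("captureAndRun: %w", ErrCaptureTimeout)
}
```

Matching with errors.Is rather than == keeps the classification working if intermediate layers wrap the sentinel with %w.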
Verification
- go build ./... PASS
- go test ./compute/... ./graph/... -race -timeout 120s PASS (10 new tests total)

Test plan
- go build ./...
- go test ./compute/... ./graph/... -race -timeout 120s
- go vet ./... (delta zero)