feat: add TRT-RTX native CUDA graph support #4187
tp5uiuc wants to merge 1 commit into pytorch:main
Conversation
Thanks for calling this out, Naren; I was unaware. I will align my implementation as you suggest, thanks!
Add cuda_graph_strategy compilation setting and automatic RTX-native
CUDA graph integration for the Python runtime path.
Key changes:
- New cuda_graph_strategy setting ("disabled" / "whole_graph_capture")
on CompilationSettings, mapped to trt.CudaGraphStrategy on
IRuntimeConfig (same pattern as dynamic_shapes_kernel_specialization)
- In SUBGRAPH cudagraph mode on RTX, always use RTX-native CUDA graphs
(manual torch.cuda.CUDAGraph capture is not safe due to lazy kernel
specialization and potential runtime allocation)
- _is_monolithic_capturable() check using context.is_stream_capturable()
and strategy != "lazy" for WHOLE_GRAPH mode safety validation
- _enable_rtx_native_cudagraphs() for runtime context recreation
- _check_monolithic_capturability() in CudaGraphsTorchTensorRTModule
for mixed TRT + PyTorch graph validation
- Comprehensive unit tests covering all code paths
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
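The first bullet above describes mapping a string-valued `cuda_graph_strategy` setting onto `trt.CudaGraphStrategy`. A minimal sketch of that pattern follows; the enum member names and the `to_trt_strategy` helper are assumptions for illustration, not the PR's actual code.

```python
from dataclasses import dataclass


@dataclass
class CompilationSettings:
    # "disabled" or "whole_graph_capture", per the commit message above
    cuda_graph_strategy: str = "disabled"


# Stand-in for trt.CudaGraphStrategy member names (assumed, not verified
# against the TensorRT-RTX bindings)
_STRATEGY_MAP = {
    "disabled": "NONE",
    "whole_graph_capture": "WHOLE_GRAPH_CAPTURE",
}


def to_trt_strategy(settings: CompilationSettings) -> str:
    """Translate the compilation setting into the runtime-config value,
    rejecting unknown strategy names early."""
    try:
        return _STRATEGY_MAP[settings.cuda_graph_strategy]
    except KeyError:
        raise ValueError(
            f"Invalid cuda_graph_strategy: {settings.cuda_graph_strategy!r}; "
            f"expected one of {sorted(_STRATEGY_MAP)}"
        )
```

The same validate-then-map shape is what the commit message attributes to `dynamic_shapes_kernel_specialization_strategy`, so reusing it keeps the two settings symmetrical.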
def set_use_output_allocator(self, enable: bool) -> None:
    self.use_output_allocator_outputs = enable

def _check_monolithic_capturability(self, stream: torch.cuda.Stream) -> None:
TRT-RTX would need to avoid the
"If your input shapes change between requests, the graph is re-recorded for each new shape. "
behavior from torch-TRT here in subgraphs mode. TRT-RTX takes care of re-capturing graphs internally if shapes have changed.
https://docs.pytorch.org/TensorRT/tutorials/runtime_opt/cuda_graphs.html
We should add an explicit test to verify this.
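A sketch of the kind of test the comment asks for, using a hypothetical stand-in class rather than the real runtime: the point is that with TRT-RTX native graphs, a shape change must not trigger a caller-side re-record, because the runtime re-captures internally.

```python
class FakeRtxGraphRunner:
    """Hypothetical stand-in modeling the TRT-RTX behavior described above:
    the runtime re-captures graphs internally when shapes change, so the
    caller records at most once."""

    def __init__(self):
        self.caller_side_records = 0
        self._captured_shape = None

    def run(self, shape):
        if self._captured_shape is None:
            # First call: the only caller-visible capture
            self.caller_side_records += 1
            self._captured_shape = shape
        # Later shape changes are handled inside the runtime; no
        # caller-side re-record happens here.
        return shape


runner = FakeRtxGraphRunner()
runner.run((1, 3, 224, 224))
runner.run((8, 3, 224, 224))  # new shape, still no second record
assert runner.caller_side_records == 1
```

A real test would drive a compiled module with two input shapes and assert on whatever capture counter the implementation exposes; the names here are illustrative only.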
# Check 2: Lazy kernel specialization would invalidate captured graph
if self.settings.dynamic_shapes_kernel_specialization_strategy == "lazy":
    return False
return True
Refactor to use any(conditions) rather than individual checks.
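A minimal sketch of the suggested refactor, with hypothetical parameters standing in for the real object state: collect the disqualifying conditions and fold them with `any()` instead of chaining individual early returns.

```python
def is_monolithic_capturable(strategy: str, stream_capturable: bool) -> bool:
    """Return True only if no condition disqualifies monolithic capture.
    `strategy` and `stream_capturable` are illustrative stand-ins for
    the settings and context checks in the diff above."""
    disqualifiers = (
        strategy == "lazy",      # lazy kernel specialization invalidates capture
        not stream_capturable,   # context reports the stream as non-capturable
    )
    return not any(disqualifiers)
```

Each disqualifier reads as one tuple element, so adding a "Check 3" later is a one-line change rather than another `if`/`return` pair.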
if ENABLED_FEATURES.tensorrt_rtx:
    self._setup_runtime_config()
    self._rtx_native_cudagraphs = (
        ENABLED_FEATURES.tensorrt_rtx
ENABLED_FEATURES.tensorrt_rtx is already true inside this branch, so there is no need to check it again.
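A minimal illustration of the review point, with stub classes standing in for the real ones: inside the branch guarded by `ENABLED_FEATURES.tensorrt_rtx`, the flag is known true, so the assignment only needs the remaining condition(s). `other_condition` is a hypothetical placeholder for whatever else the real expression checks, which is not visible in the excerpt.

```python
class _Features:
    """Stub for the real ENABLED_FEATURES object."""
    tensorrt_rtx = True


ENABLED_FEATURES = _Features()


class Module:
    def __init__(self, other_condition: bool) -> None:
        self._rtx_native_cudagraphs = False
        if ENABLED_FEATURES.tensorrt_rtx:
            # The guard already established tensorrt_rtx is true, so the
            # flag does not appear again in the assignment.
            self._rtx_native_cudagraphs = other_condition
```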
Address PR review comments that asked that the new C++ runtime tests be folded into existing feature-level files rather than shipped as parallel `*_cpp.py` files.

What

- Merge `test_000_runtime_cache_cpp.py` into the existing `test_000_runtime_cache.py`. The file already covered the Python runtime path; two new classes (`TestRuntimeCacheCppPersistence`, `TestCppSerializationIndices`) cover the C++ runtime path via `use_python_runtime=False`, and the serialization-index assertions. Skip on non-RTX builds.
- Fold the C++ runtime cases for dynamic shapes kernel specialization strategy into `test_001_dynamic_shapes_kernel_strategy.py` (introduced upstream in PR pytorch#4184). Two new classes (`TestDynamicShapesKernelStrategyCpp`, `TestDynamicShapesKernelStrategyCppInvalidValue`) exercise lazy/eager/none end-to-end and reject invalid strategy names. The pre-existing Python runtime tests remain untouched.
- Rename `test_000_cuda_graph_strategy.py` to `test_001_cuda_graph_strategy.py` to match the `test_001_*` convention used for L1 RTX-only features. When upstream lands the Python runtime counterpart (PR pytorch#4187), both sets fold into the same file.
- Add model-level tests: `test_runtime_cache_models.py` gains a `TestRuntimeCacheCppModels` class exercising ResNet18 through the C++ runtime with warm-cache roundtrip. `test_dynamic_shapes_kernel_strategy_models.py` gains `TestDynamicShapesKernelStrategyCppModels` covering lazy/eager/none on ResNet18 via the C++ runtime.

Verified

- 35 passed / 3 skipped in the runtime/ tests (merged file plus test_001 strategy files).
- No regression in test_002_cudagraphs_cpp.py (8 passed) or test_005_dynamic_allocation.py (1 passed).

Addresses PR pytorch#4202 review comments asking for test file merges and the addition of model-level runtime_cache_models.py / dynamic_shapes_kernel_strategy_models.py coverage.
Description
Add `cuda_graph_strategy` compilation setting and automatic RTX-native CUDA graph integration for the Python runtime path (`PythonTorchTensorRTModule`).

TensorRT-RTX has native CUDA graph support via `IRuntimeConfig.cuda_graph_strategy`, where the JIT compiler handles capture/replay/invalidation internally. This is superior to manual `torch.cuda.CUDAGraph()` capture on RTX because lazy kernel specialization and potential runtime allocation can cause `cudaStreamBeginCapture` to fail.

Key changes

- New `cuda_graph_strategy` setting on `CompilationSettings` ("disabled" / "whole_graph_capture"), mapped to `trt.CudaGraphStrategy` on `IRuntimeConfig` (same pattern as `dynamic_shapes_kernel_specialization_strategy`)
- SUBGRAPH cudagraph mode (`set_cudagraphs_mode(True)`): on RTX, always use RTX-native CUDA graphs — manual capture is bypassed. If `cuda_graph_strategy` was not explicitly set, the runtime overrides to `whole_graph_capture` and warns.
- WHOLE_GRAPH mode (`enable_cudagraphs()` with mixed TRT + PyTorch ops): validates that all TRT engines are monolithically capturable via `context.is_stream_capturable(stream)` and `strategy != "lazy"`. If capturable, proceeds with outer monolithic capture (RTX-native disabled per-engine); if not capturable, raises `RuntimeError`.
- `_is_monolithic_capturable()` — runtime check combining stream capturability and kernel specialization strategy
- `_enable_rtx_native_cudagraphs()` — recreates the execution context with `WHOLE_GRAPH_CAPTURE`
- `_check_monolithic_capturability()` in `CudaGraphsTorchTensorRTModule` for mixed graph validation

Behavior matrix
Depends on #4180 (runtime cache) and #4184 (dynamic shapes strategy).
Type of change
Checklist: