Add vmem chunked allocator by mawad-amd · Pull Request #516 · ROCm/iris

mawad-amd · 2026-04-23T20:16:20Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

New allocator design: - Reserve large VA range up front (cheap, just address space) - Map physical memory in large chunks (256 MiB default) - hipMemSetAccess called once per chunk, not per allocation - Sub-allocate with bump pointer, power-of-two free lists for reuse - GC via weakref finalizers on tensor.untyped_storage() - Free/reuse is pure bookkeeping (no HIP calls, no physical remap) - refresh_peer_access only triggered on chunk growth, not every allocation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The test was using 10 float32 elements (40 bytes) for "small", which SymmetricHeap.allocate() rounds up to granularity/4 = 1024 elements on MI355X (4KiB granularity). This puts "small" in the same power-of-two bucket as "medium" (1024 elements), causing pointer swaps on free-list reuse. Fix: derive test sizes from the allocator's actual granularity so each allocation lands in a distinct power-of-two bucket (1x, 4x, 16x granularity). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…UF imports DMA-BUF handles imported from PyTorch's default allocator (not VMem-created) already have device access set. hipMemSetAccess fails with "invalid argument" on such handles. The mem_map is sufficient for the VA mapping; treat the set_access error as non-fatal. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

import_external_tensor creates pseudo-chunks with DMA-BUF imported handles that cannot be re-exported via mem_export_to_shareable_handle. Track these in a separate _import_chunks list so get_allocation_chunks() only returns VMem-created chunks that can be safely shared with peers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The 64 GiB default VA reservation per rank caused hipIpcGetMemHandle failures when NCCL tried to allocate IPC-compatible memory after many tests created and destroyed iris contexts. Changes: - Default VA size is now auto-sized to 8x heap_size (min 256 MiB) instead of a fixed 64 GiB - Add SymmetricHeap.close() to free _peer_va_ranges and fd sockets that were previously leaked on context destruction Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Cap chunk_size to heap_size to avoid a single chunk consuming the entire VA range (was 256 MiB chunk for 1 MiB heap = no room to grow) - Increase VA multiplier to 16x heap_size for growth + import headroom - Fix SymmetricHeap.__del__ to handle partial init and Python shutdown Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add torch.cuda.synchronize() before unmapping VMem chunks in close(). Async GPU kernels (.zero_(), .fill_()) may still be accessing mapped memory when close() is called. Unmapping while kernels are in-flight causes a GPU page fault that poisons the HIP runtime state, making all subsequent GPU operations fail with hipErrorUnknown. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Async GPU operations (NCCL collectives, .zero_(), .fill_()) may still reference mapped virtual addresses when close() is called. Freeing VA ranges while kernels are in-flight causes hipErrorUnknown, which poisons the HIP runtime state and fails all subsequent GPU operations. Add torch.cuda.synchronize() at both SymmetricHeap and VMemChunkedAllocator close() entry points. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The root cause of hipErrorUnknown after multiple context create/destroy cycles was improper cleanup of peer-imported VMem mappings: 1. In _refresh_peer_access_chunked(), imported handles from mem_import_from_shareable_handle() were local variables that leaked — never stored for later cleanup. 2. In SymmetricHeap.close(), peer VA ranges were freed via mem_address_free() WITHOUT first calling mem_unmap() on the chunks mapped into those VA ranges. 3. The imported handles were never released via mem_release(). Calling mem_address_free() on a VA range with active mappings corrupts HIP runtime state, causing hipErrorUnknown on all subsequent GPU operations across new context cycles. Fix: - Track all peer-imported handles and their VA mappings in _peer_imported_mappings dict. - In close(), unmap and release all peer-imported chunks BEFORE calling mem_address_free() on the peer VA ranges. - Fix same bug in _refresh_peer_access_segmented() path. - Fix Iris.__del__() to call heap.close() instead of just allocator.close(), ensuring peer VA cleanup actually runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…sors The VMem API (hipMemImportFromShareableHandle + hipMemMap + hipMemSetAccess) does not work for importing DMA-BUF handles exported from hipMalloc-backed PyTorch allocations. hipMemSetAccess returns hipErrorInvalidValue on such handles, leaving the mapping inaccessible and corrupting subsequent GPU ops. Switch to the External Memory API (hipImportExternalMemory + hipExternalMemoryGetMappedBuffer) which correctly handles DMA-BUF fds from any source including PyTorch's caching allocator. Also update owns_tensor() to check imported external memory ranges since they are no longer mapped into the allocator's VA range. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Synchronize GPU before del to ensure async ops release storage refs, and add a second gc.collect() pass to handle reference cycles that may prevent the weakref finalizer from firing on the first pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The num_free_blocks check depends on weakref finalizer timing which varies across test orderings. GC-based free/reuse is already covered by test_chunked_gc_free_reuse and test_chunked_gc_multiple_reuse. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

VA reservation (hipMemAddressReserve) is just address space — no physical memory cost. 128 GiB provides ample headroom for growth and imports without risk of VA exhaustion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

hipMemImportFromShareableHandle segfaults on ROCm 7.0 due to inverted MemObjMap logic in ROCm/clr hip_vm.cpp (removes instead of adds imported memory objects, causing null dereference in hipMemSetAccess). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1. iris.host.memory.allocators.__init__ did not export VMemChunkedAllocator, so from iris.host.memory.allocators import ... VMemChunkedAllocator failed immediately. 2. After exporting it, iris/host/memory/allocators/vmem_chunked_allocator.py imported hip via from ..hip import ..., which resolves to iris.host.memory.hip and does not exist. The working modules use iris.host.platform.hip.

…pport Introduces `LocalHipDriver` and `LocalCudaDriver` utilizing native VMM APIs and POSIX/DMA-BUF file descriptors for intra-node IPC. Adds `DriverFactory` to dynamically route requests based on vendor and interconnect topology. Updates fabric drivers to accept an optional `va` parameter and safely tracks virtual address ownership across all drivers.

… driver-based peer setup for chunked VMM heaps, with separate local FD and fabric-handle refresh paths

mawad-amd assigned artulab Apr 23, 2026

github-actions Bot added in-progress We are working on it iris Iris project issue labels Apr 23, 2026

artulab force-pushed the muhaawad/vmem-chunked-allocator branch from a1a1eeb to 7aa60e9 Compare April 27, 2026 19:06

mawad-amd and others added 23 commits May 7, 2026 13:12

Apply Ruff auto-fixes

245a9c7

Apply Ruff auto-fixes

e63822c

Apply Ruff auto-fixes

78fa6a3

Apply Ruff auto-fixes

ecb8680

Increase default VA reservation to 128 GiB

6f7f5df

VA reservation (hipMemAddressReserve) is just address space — no physical memory cost. 128 GiB provides ample headroom for growth and imports without risk of VA exhaustion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Refactor memory drivers to decouple virtual address reservation

aa46247

Apply Ruff auto-fixes

fdc035c

artulab force-pushed the muhaawad/vmem-chunked-allocator branch from e1f5b78 to 2101952 Compare May 7, 2026 20:20

Refactor SymmetricHeap and VMemChunkedAllocator to use topology-aware…

9e0c5c8

… driver-based peer setup for chunked VMM heaps, with separate local FD and fabric-handle refresh paths

artulab force-pushed the muhaawad/vmem-chunked-allocator branch from 2101952 to 9e0c5c8 Compare May 7, 2026 20:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add vmem chunked allocator#516

Add vmem chunked allocator#516
mawad-amd wants to merge 24 commits intomainfrom
muhaawad/vmem-chunked-allocator

mawad-amd commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mawad-amd commented Apr 23, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants