Skip to content

Add vmem chunked allocator#516

Draft
mawad-amd wants to merge 24 commits intomainfrom
muhaawad/vmem-chunked-allocator
Draft

Add vmem chunked allocator#516
mawad-amd wants to merge 24 commits intomainfrom
muhaawad/vmem-chunked-allocator

Conversation

@mawad-amd
Copy link
Copy Markdown
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@github-actions github-actions Bot added in-progress We are working on it iris Iris project issue labels Apr 23, 2026
@artulab artulab force-pushed the muhaawad/vmem-chunked-allocator branch from a1a1eeb to 7aa60e9 Compare April 27, 2026 19:06
mawad-amd and others added 23 commits May 7, 2026 13:12
New allocator design:
- Reserve large VA range up front (cheap, just address space)
- Map physical memory in large chunks (256 MiB default)
- hipMemSetAccess called once per chunk, not per allocation
- Sub-allocate with bump pointer, power-of-two free lists for reuse
- GC via weakref finalizers on tensor.untyped_storage()
- Free/reuse is pure bookkeeping (no HIP calls, no physical remap)
- refresh_peer_access only triggered on chunk growth, not every allocation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test was using 10 float32 elements (40 bytes) for "small", which
SymmetricHeap.allocate() rounds up to granularity/4 = 1024 elements
on MI355X (4KiB granularity). This puts "small" in the same
power-of-two bucket as "medium" (1024 elements), causing pointer
swaps on free-list reuse.

Fix: derive test sizes from the allocator's actual granularity so
each allocation lands in a distinct power-of-two bucket (1x, 4x, 16x
granularity).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…UF imports

DMA-BUF handles imported from PyTorch's default allocator (not VMem-created)
already have device access set. hipMemSetAccess fails with "invalid argument"
on such handles. The mem_map is sufficient for the VA mapping; treat the
set_access error as non-fatal.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
import_external_tensor creates pseudo-chunks with DMA-BUF imported handles
that cannot be re-exported via mem_export_to_shareable_handle. Track these
in a separate _import_chunks list so get_allocation_chunks() only returns
VMem-created chunks that can be safely shared with peers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 64 GiB default VA reservation per rank caused hipIpcGetMemHandle
failures when NCCL tried to allocate IPC-compatible memory after many
tests created and destroyed iris contexts.

Changes:
- Default VA size is now auto-sized to 8x heap_size (min 256 MiB)
  instead of a fixed 64 GiB
- Add SymmetricHeap.close() to free _peer_va_ranges and fd sockets
  that were previously leaked on context destruction

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Cap chunk_size to heap_size to avoid a single chunk consuming the
  entire VA range (was 256 MiB chunk for 1 MiB heap = no room to grow)
- Increase VA multiplier to 16x heap_size for growth + import headroom
- Fix SymmetricHeap.__del__ to handle partial init and Python shutdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add torch.cuda.synchronize() before unmapping VMem chunks in close().
Async GPU kernels (.zero_(), .fill_()) may still be accessing mapped
memory when close() is called. Unmapping while kernels are in-flight
causes a GPU page fault that poisons the HIP runtime state, making
all subsequent GPU operations fail with hipErrorUnknown.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Async GPU operations (NCCL collectives, .zero_(), .fill_()) may still
reference mapped virtual addresses when close() is called. Freeing VA
ranges while kernels are in-flight causes hipErrorUnknown, which
poisons the HIP runtime state and fails all subsequent GPU operations.

Add torch.cuda.synchronize() at both SymmetricHeap and
VMemChunkedAllocator close() entry points.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The root cause of hipErrorUnknown after multiple context create/destroy
cycles was improper cleanup of peer-imported VMem mappings:

1. In _refresh_peer_access_chunked(), imported handles from
   mem_import_from_shareable_handle() were local variables that leaked
   — never stored for later cleanup.

2. In SymmetricHeap.close(), peer VA ranges were freed via
   mem_address_free() WITHOUT first calling mem_unmap() on the
   chunks mapped into those VA ranges.

3. The imported handles were never released via mem_release().

Calling mem_address_free() on a VA range with active mappings
corrupts HIP runtime state, causing hipErrorUnknown on all
subsequent GPU operations across new context cycles.

Fix:
- Track all peer-imported handles and their VA mappings in
  _peer_imported_mappings dict.
- In close(), unmap and release all peer-imported chunks BEFORE
  calling mem_address_free() on the peer VA ranges.
- Fix same bug in _refresh_peer_access_segmented() path.
- Fix Iris.__del__() to call heap.close() instead of just
  allocator.close(), ensuring peer VA cleanup actually runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sors

The VMem API (hipMemImportFromShareableHandle + hipMemMap + hipMemSetAccess)
does not work for importing DMA-BUF handles exported from hipMalloc-backed
PyTorch allocations. hipMemSetAccess returns hipErrorInvalidValue on such
handles, leaving the mapping inaccessible and corrupting subsequent GPU ops.

Switch to the External Memory API (hipImportExternalMemory +
hipExternalMemoryGetMappedBuffer) which correctly handles DMA-BUF fds
from any source including PyTorch's caching allocator.

Also update owns_tensor() to check imported external memory ranges since
they are no longer mapped into the allocator's VA range.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Synchronize GPU before del to ensure async ops release storage refs,
and add a second gc.collect() pass to handle reference cycles that
may prevent the weakref finalizer from firing on the first pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The num_free_blocks check depends on weakref finalizer timing which
varies across test orderings. GC-based free/reuse is already covered
by test_chunked_gc_free_reuse and test_chunked_gc_multiple_reuse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
VA reservation (hipMemAddressReserve) is just address space — no
physical memory cost. 128 GiB provides ample headroom for growth
and imports without risk of VA exhaustion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hipMemImportFromShareableHandle segfaults on ROCm 7.0 due to inverted
MemObjMap logic in ROCm/clr hip_vm.cpp (removes instead of adds imported
memory objects, causing null dereference in hipMemSetAccess).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
hipMemImportFromShareableHandle segfaults on ROCm 7.0 due to inverted
MemObjMap logic in ROCm/clr hip_vm.cpp (removes instead of adds imported
memory objects, causing null dereference in hipMemSetAccess).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
 1. iris.host.memory.allocators.__init__ did not export VMemChunkedAllocator, so from iris.host.memory.allocators import ...
VMemChunkedAllocator failed immediately.
 2. After exporting it, iris/host/memory/allocators/vmem_chunked_allocator.py imported hip via from ..hip import ..., which resolves to
iris.host.memory.hip and does not exist. The working modules use iris.host.platform.hip.
…pport

Introduces `LocalHipDriver` and `LocalCudaDriver` utilizing native VMM APIs
and POSIX/DMA-BUF file descriptors for intra-node IPC. Adds `DriverFactory`
to dynamically route requests based on vendor and interconnect topology.
Updates fabric drivers to accept an optional `va` parameter and safely tracks
virtual address ownership across all drivers.
@artulab artulab force-pushed the muhaawad/vmem-chunked-allocator branch from e1f5b78 to 2101952 Compare May 7, 2026 20:20
… driver-based peer setup for chunked VMM heaps, with separate local FD and fabric-handle refresh paths
@artulab artulab force-pushed the muhaawad/vmem-chunked-allocator branch from 2101952 to 9e0c5c8 Compare May 7, 2026 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

in-progress We are working on it iris Iris project issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants