Add graph-capturable one-shot all-reduce#531

Draft
mawad-amd wants to merge 7 commits into main from muhaawad/one-shot-vllm
@mawad-amd
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

mawad-amd and others added 6 commits May 4, 2026 22:25
Graph-capturable single-kernel all-reduce with barrier-compute-barrier
semantics for vLLM integration. 1D layout, unmasked fast path, capped
at 16 CTAs with BLOCK_SIZE=2048 for small-message optimization.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
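
To illustrate the commit message above, here is a minimal host-side sketch in plain Python, not the kernel from this PR. `BLOCK_SIZE=2048` and the 16-CTA cap come from the commit message. The helper names `launch_config` and `one_shot_all_reduce` are hypothetical. So is the rule that the unmasked fast path applies only when the element count is a multiple of `BLOCK_SIZE`, and the assumption that each CTA loops over several blocks once the cap is hit. Ranks are simulated as threads and the two device barriers as a `threading.Barrier`.

```python
import threading

BLOCK_SIZE = 2048  # elements per block, from the commit message
MAX_CTAS = 16      # CTA cap, from the commit message


def launch_config(n_elements):
    """Return (num_ctas, use_unmasked_path) for a 1D launch (hypothetical helper)."""
    blocks_needed = -(-n_elements // BLOCK_SIZE)  # ceiling division
    num_ctas = max(1, min(blocks_needed, MAX_CTAS))
    # Assumption: the unmasked fast path is valid only when every block is full.
    return num_ctas, n_elements % BLOCK_SIZE == 0


def one_shot_all_reduce(inputs):
    """Simulate ranks as threads: barrier, sum every peer's input, barrier."""
    world_size = len(inputs)
    n = len(inputs[0])
    outputs = [[0.0] * n for _ in range(world_size)]
    barrier = threading.Barrier(world_size)

    def rank_main(rank):
        barrier.wait()  # barrier 1: every peer's input is ready to be read
        for i in range(n):
            # One-shot: each rank reads all peers directly and reduces locally.
            outputs[rank][i] = sum(inputs[peer][i] for peer in range(world_size))
        barrier.wait()  # barrier 2: no peer reuses its input while others still read

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs
```

Under these assumptions, `launch_config(5000)` yields 3 CTAs on the masked path, and any message larger than 16 full blocks stays at 16 CTAs.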
The per-block barrier variant (formerly one_shot_vllm) is the canonical
one_shot implementation going forward. The old one_shot that required
host-side zero+barrier is preserved as one_shot_legacy.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
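
The commit message does not say how the new `one_shot` avoids the host-side zero+barrier step. One common technique, shown here purely as an assumption, is a barrier built on flags that only increase, so nothing has to be reset between calls. The sketch below simulates that idea with Python threads. `EpochBarrier` and `run_rounds` are hypothetical names and are not part of Iris.

```python
import threading


class EpochBarrier:
    """Reusable flag barrier: flags only grow, so no zeroing is needed between calls."""

    def __init__(self, world_size):
        self.flags = [0] * world_size  # one monotonically increasing flag per rank
        self.cond = threading.Condition()

    def arrive_and_wait(self, rank):
        with self.cond:
            self.flags[rank] += 1
            epoch = self.flags[rank]
            self.cond.notify_all()
            # Wait until every peer has reached at least this epoch.
            self.cond.wait_for(lambda: all(f >= epoch for f in self.flags))


def run_rounds(world_size, rounds):
    """Run several barrier-compute-barrier rounds without resetting any state."""
    barrier = EpochBarrier(world_size)
    slots = [0] * world_size
    seen = [[] for _ in range(world_size)]

    def rank_main(rank):
        for r in range(1, rounds + 1):
            slots[rank] = r
            barrier.arrive_and_wait(rank)  # all writes for round r are visible
            seen[rank].append(list(slots))
            barrier.arrive_and_wait(rank)  # nobody overwrites slots until all have read

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen, barrier.flags
```

Because each wait compares against the caller's own epoch with `>=`, a fast rank that has already entered the next barrier cannot make a slow rank miss its wake-up. That property is what would let such a barrier be replayed, for example inside a captured graph, without a host-side reset.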
@github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels May 5, 2026
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Labels

in-progress (We are working on it), iris (Iris project issue)
