Fix hardcoded lock value in fused GEMM+CCL operations#529

Open
mawad-amd wants to merge 3 commits into main from muhaawad/fix-hardcoded-lock

Conversation

@mawad-amd
Collaborator

Summary

  • Replace hardcoded lock signal value 1 in fused GEMM+CCL producer-consumer signaling with a monotonically increasing call_counter on FusedWorkspace
  • Each call to matmul_all_reduce or matmul_reduce_scatter increments the counter and passes it to both producer (atomic_xchg) and consumer (spin loop) sides
  • Eliminates the need to zero locks between calls for one_shot, two_shot, and reduce_scatter variants
  • spinlock variant is unaffected (uses CAS mutex pattern that self-resets)
  • signal_value parameter defaults to 1 for backward compatibility with existing test kernels
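The counter mechanism described in the bullets above can be sketched in plain Python. This is an illustrative stand-in, not the actual Iris API: the class name `FusedWorkspace` matches the PR, but the `next_signal_value` helper and its exact wrap logic are assumptions (the PR only states that the counter wraps at INT32_MAX because lock tensors are int32):

```python
# Illustrative sketch of the per-workspace call counter described above.
# Names other than FusedWorkspace/call_counter are hypothetical.
INT32_MAX = 0x7FFFFFFF  # lock tensors are int32, so the counter wraps here


class FusedWorkspace:
    def __init__(self):
        self.call_counter = 0

    def next_signal_value(self) -> int:
        # Each fused-op call gets a fresh, non-zero signal value, so a
        # consumer can never mistake a stale lock value for a new signal.
        self.call_counter = self.call_counter % INT32_MAX + 1
        return self.call_counter

    def clear(self):
        # Resetting the workspace also resets the counter.
        self.call_counter = 0


ws = FusedWorkspace()
signals = [ws.next_signal_value() for _ in range(5)]
print(signals)  # counter advances 1..5 across consecutive calls
```

Note that the wrap maps INT32_MAX back to 1, never to 0, so a wrapped signal value still differs from a freshly zeroed lock.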

Test plan

  • 25/25 fused ops tests pass on MI300X (8 GPUs)
  • 54/54 context CCL tests pass on MI300X (8 GPUs)
  • Custom workspace reuse test: 5 consecutive calls without lock zeroing, all correct (max_diff=0.125, call_counter=1→5)

Closes #465

🤖 Generated with Claude Code

mawad-amd and others added 2 commits on May 2, 2026 at 08:14
Replace hardcoded lock signal value `1` with a monotonically increasing
call_counter on FusedWorkspace. Each call to matmul_all_reduce or
matmul_reduce_scatter increments the counter and passes it as the signal
value to both producer (atomic_xchg) and consumer (spin loop) sides.

This eliminates the need to zero locks between calls for one_shot,
two_shot, and reduce_scatter variants, since each call uses a unique
signal value that won't collide with previous calls.

The spinlock variant still uses CAS(0→1)/release(0) mutex semantics and
continues to require zeroed locks.

The signal_value parameter defaults to 1 for backward compatibility with
existing test kernels and examples that zero locks manually.

Closes #465

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
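The producer/consumer handshake described in the commit message above can be mimicked in plain Python. The real code runs inside Triton kernels and uses `atomic_xchg` on a device-side lock tensor; this threading-based sketch is only an analogue, and all names in it are illustrative:

```python
# Minimal Python analogue of the producer/consumer signaling described
# above. A lock-protected integer stands in for one int32 lock slot that
# the Triton kernels would update with atomic_xchg.
import threading

lock_cell = 0                     # one lock slot in the workspace
cell_guard = threading.Lock()     # simulates atomicity of the device op


def producer(signal_value: int):
    # Producer side: atomic_xchg(lock, signal_value) publishes the tile.
    global lock_cell
    with cell_guard:
        lock_cell = signal_value


def consumer(signal_value: int):
    # Consumer side: spin until the lock holds this call's signal value.
    # Because each call uses a unique value, a stale value left by a
    # previous call can never satisfy the wait -- no zeroing needed.
    while True:
        with cell_guard:
            if lock_cell == signal_value:
                return


for signal_value in (1, 2, 3):    # three consecutive fused-op calls
    t = threading.Thread(target=producer, args=(signal_value,))
    t.start()
    consumer(signal_value)        # returns once the producer has signaled
    t.join()
print("lock_cell after three calls:", lock_cell)
```

With the old hardcoded value of 1, the second iteration's consumer would have returned immediately on the stale `1` left by the first call unless the lock was zeroed in between; the per-call value removes that requirement.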
Copilot AI review requested due to automatic review settings May 2, 2026 15:33
@mawad-amd mawad-amd requested review from BKP and neoblizz as code owners May 2, 2026 15:33
@github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels on May 2, 2026
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR replaces the hardcoded lock “ready” value (1) used in fused GEMM+CCL producer/consumer signaling with a monotonically increasing per-workspace call_counter, so locks don’t need to be zeroed between calls for one_shot/two_shot/reduce_scatter.

Changes:

  • Add call_counter to FusedWorkspace and reset it on clear().
  • Pass a per-call signal_value through fused matmul+collective kernels and into Triton context ops.
  • Remove lock zeroing for reduce_scatter and for all_reduce one_shot/two_shot (keep it for spinlock).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
iris/ops/workspace.py Adds call_counter state to track per-call signal values and resets it in clear().
iris/ops/matmul_reduce_scatter.py Uses signal_value for producer atomic_xchg and consumer wait; removes lock zeroing.
iris/ops/matmul_all_reduce.py Uses per-call signal_value for one_shot/two_shot and conditionally zeroes locks only for spinlock.
iris/mem/triton/context.py Extends Triton context collectives to wait on signal_value instead of hardcoded 1.

even_k = K % config.block_size_k == 0

# Increment call counter for producer-consumer signal value.
# Each call uses a unique value so consumers don't see stale signals.
Collaborator Author


Fixed in 1044149 — call_counter now wraps at INT32_MAX (0x7FFFFFFF).

Comment on lines +260 to +262
# Increment call counter for producer-consumer signal value.
workspace.call_counter += 1
signal_value = workspace.call_counter
Collaborator Author


Workspaces are per-process objects — in distributed training, each rank runs its own Python process with its own workspace instance. The call_counter increments identically on all ranks because fused ops are collective (all ranks must call them together). Divergence would indicate a program bug (one rank skipping a collective call), which would deadlock regardless of the signal value.

Comment thread on iris/ops/matmul_all_reduce.py (Outdated)
BLOCK_SIZE_K: tl.constexpr,
EVEN_K: tl.constexpr,
VARIANT: tl.constexpr,
SIGNAL_VALUE=1,
Collaborator Author


Fixed in 1044149 — renamed to lowercase signal_value.

… INT32_MAX

- Rename kernel parameter from SIGNAL_VALUE (constexpr style) to signal_value
  (runtime parameter style) to avoid confusion with compile-time constants
- Wrap call_counter at INT32_MAX since lock tensors are int32

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Labels

in-progress (We are working on it), iris (Iris project issue)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

all_reduce_one_shot / all_reduce_two_shot use hardcoded lock value 1

2 participants