Fix swiglu_decode intermediate comparison to use chained reference by albiol2004 · Pull Request #105 · amd/IRON

albiol2004 · 2026-04-14T11:06:17Z

Aligns swiglu_decode/test.py with swiglu_prefill/test.py by verifying the intermediate buffer against a chained reference built from the observed AIE left_swished and right buffers, instead of against the CPU-computed golden_ref["intermediate"].

The golden-reference path amplifies legitimate, sub-tolerance bf16 drift from upstream stages. For example, for very negative SiLU inputs the AIE LUT rounds the result to 0.0 while the fp32 CPU silu preserves a tiny negative value; multiplying by a large-magnitude right operand then turns that drift into spurious "got 0.0, expected -1.27"-style failures. The AIE kernels themselves are numerically correct: the observed intermediate matches observed_left_swished * observed_right exactly, and the final output already passes at a tighter tolerance than the intermediate stage.
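To make the amplification concrete, here is a minimal sketch of the arithmetic with made-up values. The input -8.0, the right operand 473.0, the assumption that the LUT returns exactly 0.0 for that input, and the `within_tol` helper (an `abs_tol + rel_tol * |expected|` check) are all illustrative assumptions, not taken from the test or the kernels.

```python
import math


def silu(x: float) -> float:
    """Reference SiLU, x * sigmoid(x), computed in Python floats."""
    return x / (1.0 + math.exp(-x))


def within_tol(got: float, expected: float,
               rel_tol: float = 0.04, abs_tol: float = 0.4) -> bool:
    # Hypothetical tolerance check; the helper used by the real test may differ.
    return abs(got - expected) <= abs_tol + rel_tol * abs(expected)


x = -8.0        # hypothetical very negative SiLU input
right = 473.0   # hypothetical large-magnitude right operand

golden_left = silu(x)    # tiny negative value, about -0.0027
observed_left = 0.0      # assumed LUT result for this input

golden_intermediate = golden_left * right      # about -1.27
observed_intermediate = observed_left * right  # exactly 0.0

# The left stage alone is comfortably inside tolerance ...
left_ok = within_tol(observed_left, golden_left)
# ... but the same drift, scaled by `right`, fails against the golden product ...
golden_ok = within_tol(observed_intermediate, golden_intermediate)
# ... while the chained reference built from observed inputs matches.
chained_ok = within_tol(observed_intermediate, observed_left * right)

print(f"left stage within tolerance:      {left_ok}")
print(f"intermediate vs golden reference: {golden_ok}")
print(f"intermediate vs chained reference: {chained_ok}")
```

The multiply itself is exact in this sketch; the only error is the one inherited from the SiLU stage, which is why checking the multiply against its observed inputs isolates it from upstream drift.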

This issue surfaces at rectangular FFN shapes (e.g. embedding_dim=1024, hidden_dim=3584) where the statistics of the SiLU input distribution make near-zero LUT outputs more common than at the previously-tested square 2048² shape. Adds (1024, 3584) to the parametrization so regressions in rectangular decode are caught.

Added

(1024, 3584) parametrization in iron/operators/swiglu_decode/test.py, reflecting Qwen3.5-0.8B FFN dims so rectangular decode is covered alongside the existing square smoke test.

Changed

iron/operators/swiglu_decode/test.py: verify intermediate against a chained reference (observed_left_swished * observed_right) rather than golden_ref["intermediate"], matching the approach already in swiglu_prefill/test.py. Tightens the tolerance to rel_tol=0.04, abs_tol=0.4 accordingly (same values used for the output check and for prefill).
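As a rough sketch of what the chained check amounts to, assuming the buffers are available as flat sequences of floats: the function names, the sample buffers, and the `abs_tol + rel_tol * |expected|` form of the comparison below are hypothetical and are not copied from either test file.

```python
from typing import Sequence


def chained_reference(left_swished: Sequence[float],
                      right: Sequence[float]) -> list[float]:
    """Elementwise product of the observed upstream buffers."""
    return [l * r for l, r in zip(left_swished, right)]


def count_mismatches(observed: Sequence[float], reference: Sequence[float],
                     rel_tol: float = 0.04, abs_tol: float = 0.4) -> int:
    # Assumed tolerance form; the helper in the repository may differ.
    return sum(
        1
        for got, expected in zip(observed, reference)
        if abs(got - expected) > abs_tol + rel_tol * abs(expected)
    )


# Hypothetical buffers standing in for what the device returned.
observed_left_swished = [0.0, 0.73, -0.27]
observed_right = [473.0, 1.5, -2.0]
observed_intermediate = [0.0, 1.095, 0.54]

# Hypothetical CPU golden values; element 0 carries the amplified SiLU drift.
golden_intermediate = [-1.269, 1.1, 0.54]

vs_golden = count_mismatches(observed_intermediate, golden_intermediate)
vs_chained = count_mismatches(
    observed_intermediate,
    chained_reference(observed_left_swished, observed_right),
)
print(f"mismatches vs golden reference:  {vs_golden}")
print(f"mismatches vs chained reference: {vs_chained}")
```

With this structure each stage is judged only on the work it performs: the SiLU stage is still compared against its own golden reference elsewhere in the test, and the multiply is compared against the product of what it actually received.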

Removed

None.

Testing

Verified on NPU2 (Strix, aie2p):

pytest iron/operators/swiglu_decode/test.py -v --iterations 1 : both 2048×2048 and 1024×3584 pass.
pytest iron/operators/ -m "not extensive" --iterations 1 : no regressions.

PR Merge Checklist

The PR is rebased on the latest devel commit and pointing to devel.
Your PR has been reviewed and approved.
All checks are passing.

The decode test verified `intermediate` against golden_ref["intermediate"] (CPU-computed silu(golden_left) * golden_right), while the prefill test uses a chained reference built from the observed AIE left_swished and right buffers. That inconsistency surfaces as spurious failures at rectangular FFN shapes (e.g. embedding=1024, hidden=3584): the AIE SiLU LUT rounds near-zero outputs to exactly 0.0 where fp32 CPU silu keeps a tiny negative value, and the subsequent multiply against a large-magnitude right operand amplifies that sub-tolerance drift into "got 0.0, expected -1.27"-style mismatches. The AIE kernels are numerically correct: the observed intermediate matches observed_left_swished * observed_right exactly, and the final output already passes at a tighter tolerance. Only the verification methodology was wrong. Switch to the prefill-style chained reference and tighten the tolerance to (rel=0.04, abs=0.4), the same values used for the output check and the prefill intermediate. Add (1024, 3584) to the parametrization so rectangular decode is covered in CI.

albiol2004 requested review from andrej, hunhoffe and jgmelber as code owners April 14, 2026 11:06

albiol2004 wants to merge 1 commit into amd:devel from albiol2004:fix-swiglu-decode-rectangular
