Implement MLA decode fwd nh128 a8w8 kernel with FlyDSL. by ruanjm · Pull Request #403 · ROCm/FlyDSL

ruanjm · 2026-04-15T09:38:07Z

As the title says.

Copilot

Pull request overview

Adds a FlyDSL implementation of an MLA decode forward kernel specialized for nhead=128 with FP8 Q/KV inputs and BF16 output, plus a Python launcher and a corresponding correctness/perf test driver.

Changes:

Introduce kn_mla_fwd_decode_m16x8_fp8_fp8 FlyDSL kernel and JIT launcher for the nh=128 FP8/FP8 decode path.
Add a thin Python dispatcher (flydsl_mla_fwd_decode) that flattens inputs/outputs and launches the specialized kernel.
Add a new MLA decode test/benchmark script with a PyTorch reference + aiter metadata/reduce integration.
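As a rough illustration of what a decode reference of this kind checks, here is a minimal, dependency-free sketch of single-token attention for one query head over a latent KV cache. It is not the code in tests/kernels/test_mla_decode.py. The function name, the argument names, and the convention that the value vector is the leading `v_dim` entries of each cached row are assumptions made for this example. FP8 quantization, paging metadata, and the split/reduce step are deliberately left out.

```python
import math


def ref_mla_decode_one_head(q, kv_rows, v_dim, scale):
    """Hypothetical reference: decode attention for one query head.

    q       : list of floats, length d_qk (the single decode-step query)
    kv_rows : list of cached rows, each a list of floats of length d_qk
    v_dim   : number of leading entries of each row reused as the value
              vector (an assumption of this sketch, not taken from the PR)
    scale   : softmax scale applied to the dot products
    """
    # Scaled dot-product score of the query against every cached row.
    scores = [scale * sum(a * b for a, b in zip(q, row)) for row in kv_rows]

    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)

    # Weighted sum of the value slice of each cached row.
    out = [0.0] * v_dim
    for w, row in zip(exps, kv_rows):
        p = w / denom
        for i in range(v_dim):
            out[i] += p * row[i]
    return out
```

A kernel test would typically run something like this per head (usually in a tensor library rather than plain lists) and compare against the kernel output within a tolerance that accounts for the FP8 inputs and BF16 output.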

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 8 comments.

File	Description
tests/kernels/test_mla_decode.py	New MLA decode reference + kernel launch driver (currently not pytest-integrated and has a couple of runtime issues).
kernels/mla_fwd_decode_m16x8_fp8_fp8.py	New FlyDSL kernel implementation + JIT launcher for FP8/FP8 nh=128 decode.
kernels/mla_fwd_decode.py	Public Python launcher/dispatcher for the new specialized kernel.

coderfeli · 2026-04-15T11:22:36Z

Align the code style, types, and IR usage with the other kernels. Try to remove or reduce the use of arith/std_arith and similar low-level native MLIR ops.

ruanjm · 2026-04-16T04:00:56Z

> Align the code style, types, and IR usage with the other kernels. Try to remove or reduce the use of arith/std_arith and similar low-level native MLIR ops.

Done

Implement MLA decode fwd nh128 a8w8 kernel with FlyDSL.

4614f41

Copilot AI review requested due to automatic review settings April 15, 2026 09:38

Copilot started reviewing on behalf of ruanjm April 15, 2026 09:39

Copilot AI reviewed Apr 15, 2026

Fix the issues raised by Copilot.

1f2ee20

ruanjm added 3 commits April 16, 2026 03:12

try to fix ci issue.

249899c

Try to follow other kernel's style.

34db82c

add _get_lds_size_per_cu

dce5654

Try to fix MI355 CI err.

ab46f7f

ruanjm force-pushed the jruan/mla_h128_a8w8 branch from d5b5369 to ab46f7f on April 16, 2026 09:59

ruanjm wants to merge 6 commits into main from jruan/mla_h128_a8w8

3 participants