Refactor stridingMemcpyKernel to be unroll friendly by Artem-B · Pull Request #65 · NVIDIA/nvbandwidth

Artem-B · 2026-04-13T23:41:13Z

A while loop over an incrementing pointer is harder for the compiler to analyze. Switching to directly indexed pointer accesses gives the compiler more flexibility in unrolling the loop.

Also, work around a clang-related issue that results in slow code when aggregates are swapped via a temporary variable. Copying uint4 element-wise avoids the problem and works fine with NVCC.
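To make the two loop shapes concrete, here is a minimal host-side C++ sketch. It is not the actual stridingMemcpyKernel code. `U4` is a hypothetical stand-in for CUDA's uint4, the function names are made up for illustration, and `stride` plays the role of the total thread count, with each "thread" starting at its own offset.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's uint4 (four 32-bit lanes).
struct U4 { unsigned x, y, z, w; };

// Old shape: advance the pointers until an end pointer is reached.
// The trip count is only implicit in the pointer comparison.
// The caller must make sure dst + count * stride and src + count * stride
// are valid pointer values (at most one past the end of their buffers).
inline void copy_pointer_walk(U4* dst, const U4* src,
                              std::size_t count, std::size_t stride) {
    assert(stride > 0);
    U4* end = dst + count * stride;
    while (dst < end) {
        U4 tmp = *src;   // whole-aggregate copy through a temporary
        *dst = tmp;
        src += stride;
        dst += stride;
    }
}

// New shape: a counted loop with direct indexing and field-by-field copies.
// The trip count is explicit and each access is a simple function of i.
inline void copy_indexed(U4* dst, const U4* src,
                         std::size_t count, std::size_t stride) {
    assert(stride > 0);
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t idx = i * stride;
        dst[idx].x = src[idx].x;
        dst[idx].y = src[idx].y;
        dst[idx].z = src[idx].z;
        dst[idx].w = src[idx].w;
    }
}
```

Both functions copy the same elements. The difference is only in how much the loop form tells the compiler up front, which is the point of the refactoring.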

Artem-B · 2026-04-13T23:42:10Z

Comparison of the old and new kernels under clang and NVCC: https://godbolt.org/z/onojWhr7a
The old kernel compiled with clang shows the issue (the local pipe_* variables are not eliminated), but it could be worked around by using the __int128 type for the transfers, by passing -Di128=__int128 to the clang compilation in the Compiler Explorer link above.

Artem-B mentioned this pull request Apr 14, 2026

Refactor stridingMemcpyKernel to be unroll friendly NVIDIA/nvloom#3

Open

Artem-B wants to merge 1 commit into NVIDIA:main from Artem-B:future
