Refactor stridingMemcpyKernel to be unroll friendly #65

Open
Artem-B wants to merge 1 commit into NVIDIA:main from Artem-B:future


@Artem-B Artem-B commented Apr 13, 2026

A while loop over an incrementing pointer is harder for the compiler to analyze. Switching to directly indexed pointer accesses gives the compiler more flexibility when unrolling the loop.
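To illustrate the two loop shapes, here is a minimal host-side C++ sketch, not the actual kernel code from this PR. `U4` is a hypothetical stand-in for CUDA's `uint4`, and `copyStridedPtr` / `copyStridedIdx` are illustrative names. Both functions copy `count` elements spaced `stride` apart and assume the buffers hold at least `count * stride` elements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for CUDA's uint4 (four 32-bit lanes).
struct U4 { unsigned x, y, z, w; };

// Pointer-walking form: the trip count is implicit in the pointer
// comparison, so the compiler has to derive it before it can unroll.
void copyStridedPtr(U4* dst, const U4* src, std::size_t count, std::size_t stride) {
    assert(stride > 0);  // a zero stride would never reach `end`
    U4* end = dst + count * stride;
    while (dst < end) {
        *dst = *src;
        dst += stride;
        src += stride;
    }
}

// Indexed form: the trip count is explicit and every address is
// base + i * stride, which is a simpler pattern to analyze and unroll.
void copyStridedIdx(U4* dst, const U4* src, std::size_t count, std::size_t stride) {
    assert(stride > 0);
    for (std::size_t i = 0; i < count; ++i) {
        dst[i * stride] = src[i * stride];
    }
}
```

The two functions write the same elements; only the loop structure presented to the optimizer differs.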

Also, work around a clang-related issue that results in slow code when aggregates are swapped via a temporary variable. Copying uint4 element-wise avoids the problem and works fine with NVCC.
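The difference between the two copy styles can be sketched in host-side C++ as follows. This is an assumption about the general shape of the change, not the code from this PR: `U4` is a hypothetical stand-in for `uint4`, and both helper names are made up for illustration.

```cpp
// Hypothetical stand-in for CUDA's uint4 (four 32-bit lanes).
struct U4 { unsigned x, y, z, w; };

// Whole-aggregate transfer through a local temporary. This is the
// pattern the PR description reports as slow when built with clang.
inline void copyViaTemp(U4* dst, const U4* src) {
    U4 tmp = *src;
    *dst = tmp;
}

// Element-wise transfer: each 32-bit member is copied individually,
// so no aggregate-typed temporary is involved.
inline void copyElementWise(U4* dst, const U4* src) {
    dst->x = src->x;
    dst->y = src->y;
    dst->z = src->z;
    dst->w = src->w;
}
```

Both helpers produce the same result; any performance difference comes from how a given compiler lowers them.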


Artem-B commented Apr 13, 2026

Comparison of the old and new kernels compiled with clang and nvcc: https://godbolt.org/z/onojWhr7a
With clang, the old kernel shows the issue (the local pipe_* variables are not eliminated). It can be worked around by using the __int128 type for the transfers, which is enabled by passing -Di128=__int128 to the clang compilation in the Compiler Explorer link above.
