
[CI Testing Only] PR #1187 AMD verification: serial test execution (-j 1) #1198

Draft
alsepkow wants to merge 8 commits into llvm:main from alsepkow:pr-1187-testing-serial


@alsepkow
Collaborator

[CI testing only — do not merge]

Fourth verification variant. Tests whether the AMD amdxc64.dll PSO compilation crashes are caused (or amplified) by concurrent PSO compilation across multiple offloader.exe processes contending for the AMD user-mode driver.

What this changes

Adds LIT_OPTS: '-j 1' to the env of the test step in build-and-test-callable.yaml, so lit runs one test at a time instead of its default of cpu_count() parallel workers.

This means: only one offloader.exe / one amdxc64.dll instance is compiling a PSO at any given time, instead of N concurrent processes.
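A rough sketch of how an option string supplied through an environment variable such as LIT_OPTS can end up overriding the worker count. It mimics the general pattern (split the variable with shell rules, append it to the command-line arguments) and is not lit's actual source; the function name `resolve_workers` and the one-worker-per-core fallback are assumptions made for illustration.

```python
import argparse
import os
import shlex


def resolve_workers(argv, environ):
    """Return the worker count after merging CLI args with LIT_OPTS-style env options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-j", "--workers", type=int, default=None)
    # Options from the environment are appended after the CLI arguments,
    # so they win when the same flag appears in both places.
    env_args = shlex.split(environ.get("LIT_OPTS", ""))
    opts, _unknown = parser.parse_known_args(list(argv) + env_args)
    if opts.workers is None:
        # Assumed fallback: one worker per CPU core.
        return os.cpu_count() or 1
    return opts.workers
```

With `{"LIT_OPTS": "-j 1"}` as the environment, the sketch resolves to a single worker even if the command line asked for more, which is the effect this PR relies on.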

Why

test/lit.cfg.py invokes offloader.exe once per .test file. lit's default -j is cpu_count(), so N separate offloader processes run in parallel — all hitting the single AMD driver simultaneously.
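To make the concurrency difference concrete, here is a minimal simulation, assuming each test is a short blocking job. It uses threads as stand-ins for the separate offloader.exe processes (an illustrative simplification, not how lit schedules work), and `peak_concurrency` and `fake_test` are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(num_tests, workers, work_seconds=0.01):
    """Run num_tests stand-in jobs on a pool and return the highest number running at once."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_test(_index):
        # Stand-in for one offloader.exe invocation compiling a PSO.
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(work_seconds)
        with lock:
            state["active"] -= 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Drain the iterator so every job finishes before the peak is read.
        list(pool.map(fake_test, range(num_tests)))
    return state["peak"]
```

With `workers=1` the peak is always 1, so at most one job ever touches the shared resource; with more workers the peak can rise up to the worker count.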

If this is a race / contention bug:

  • Failures should drop dramatically under -j 1
  • Confirms multi-process PSO compilation race in amdxc64.dll

If it's a pure single-process bug:

  • Failures persist at similar rates
  • Bug is in amdxc64.dll PSO compilation regardless of contention
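One way to put a number on "drop dramatically" is a one-sided Fisher exact test on failure counts from a parallel run and a serial run. The sketch below is an assumption about how the comparison could be done, not something this PR implements, and any counts passed to it are hypothetical rather than measured results.

```python
from math import comb


def fisher_one_sided(par_fail, par_pass, ser_fail, ser_pass):
    """P-value for 'parallel runs fail at least this often' under equal true failure rates."""
    total = par_fail + par_pass + ser_fail + ser_pass
    total_fail = par_fail + ser_fail
    par_runs = par_fail + par_pass
    # Sum the hypergeometric tail: par_fail or more failures landing in the parallel group.
    upper = min(total_fail, par_runs)
    numerator = sum(
        comb(total_fail, k) * comb(total - total_fail, par_runs - k)
        for k in range(par_fail, upper + 1)
    )
    return numerator / comb(total, par_runs)
```

A small p-value would be consistent with the contention hypothesis (failures concentrate in the parallel runs), while a large one would mean the serial run did not fail noticeably less often. Neither outcome proves the root cause on its own.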

Companion PRs

Tradeoff

Serial execution will make the test run much slower — possibly 30+ minutes instead of ~12s for the test phase. That's expected.

Branch

alsepkow/offload-test-suite:pr-1187-testing-serial

alsepkow and others added 8 commits May 13, 2026 16:07
Both AMD and NVIDIA DirectX configurations have been stable and have higher pass rates than the existing Tier 1 Intel target. Promote them to Tier 1 so they run on every PR. Qualcomm and the Vulkan IHV configurations remain experimental and continue to require the 'test-all' label.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Match the tier change in docs/CI.md and pr-matrix.yaml so the README status table reflects that these targets now run on every PR.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per Bob's review feedback, switch from listing AMD/NVIDIA D3D12 combinations via 'include' to a cross-product with 'exclude' for the AMD/NVIDIA Vulkan combinations. As future targets get promoted out of experimental, we can simply remove exclusions rather than adding inclusions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Apply the same cross-product + exclude pattern to the experimental Exec-Tests-Extra job for consistency. As targets are promoted out of experimental, exclusions can be added here in lockstep with their removal from the Tier 1 Exec-Tests-Windows job.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
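The cross-product plus exclude pattern described in the two commits above can be sketched as follows. This models the matrix semantics in plain Python (an exclude entry removes every combination whose values match all keys it names); the axis names `sku` and `api` and most of the values are hypothetical, chosen only for illustration.

```python
from itertools import product


def expand_matrix(axes, exclude):
    """Expand a cross-product of axes, dropping combos that match any exclude entry."""
    keys = list(axes)
    combos = [dict(zip(keys, values)) for values in product(*(axes[k] for k in keys))]

    def is_excluded(combo):
        # An exclude entry matches when every key it names has the same value in the combo.
        return any(all(combo.get(k) == v for k, v in entry.items()) for entry in exclude)

    return [c for c in combos if not is_excluded(c)]


# Hypothetical axis values, not taken from pr-matrix.yaml.
axes = {"sku": ["windows-intel", "windows-amd", "windows-nvidia"], "api": ["d3d12", "vulkan"]}
exclude = [{"sku": "windows-amd", "api": "vulkan"}, {"sku": "windows-nvidia", "api": "vulkan"}]
jobs = expand_matrix(axes, exclude)
```

Promoting a target later then amounts to deleting its entry from `exclude`, rather than adding a new explicit combination.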
This reverts commit 1eec3eb.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This change is for the draft AMD-testing PR only and should NOT be merged.
Strips the matrix down to only windows-amd x {check-hlsl-d3d12, check-hlsl-clang-d3d12}
so we can quickly iterate on AMD D3D12 stability investigation without spending
CI on Intel/NVIDIA/MacOS/WARP/Vulkan jobs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds LIT_OPTS=-j 1 env to the Run HLSL Tests step. This forces lit to
run a single test at a time instead of parallelizing across CPU cores.

Goal: test whether the AMD amdxc64.dll PSO compilation crashes are
caused or amplified by concurrent PSO compilation across multiple
offloader.exe processes contending for the AMD user-mode driver.

If failures dramatically reduce or disappear: race condition / contention
in amdxc64.dll. If failures persist at similar rates: bug is purely a
single-process issue in PSO compilation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Combines all 'maximum bug exposure' factors with the serial-execution
isolation: Debug build, debug layer ON, single worker (-j 1).

If this run also passes cleanly, the multi-process race hypothesis is
strongly supported: even the most aggressive bug-amplifying config
cannot reproduce the failure when tests are serialized.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>