[CI Testing Only] PR #1187 AMD verification: serial test execution (-j 1) #1198
Draft
alsepkow wants to merge 8 commits into
Both AMD and NVIDIA DirectX configurations have been stable and have higher pass rates than the existing Tier 1 Intel target. Promote them to Tier 1 so they run on every PR. Qualcomm and the Vulkan IHV configurations remain experimental and continue to require the 'test-all' label.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Match the tier change in docs/CI.md and pr-matrix.yaml so the README status table reflects that these targets now run on every PR.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Per Bob's review feedback, switch from listing AMD/NVIDIA D3D12 combinations via 'include' to a cross-product with 'exclude' for the AMD/NVIDIA Vulkan combinations. As future targets get promoted out of experimental, we can simply remove exclusions rather than adding inclusions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Apply the same cross-product + exclude pattern to the experimental Exec-Tests-Extra job for consistency. As targets are promoted out of experimental, exclusions can be added here in lockstep with their removal from the Tier 1 Exec-Tests-Windows job.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
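
The cross-product + exclude pattern described in the two commits above can be sketched in Python. This is only an illustration of the matrix semantics, not the workflow itself: the axis names ("sku", "test-target") and the values "windows-intel", "windows-nvidia" and "check-hlsl-vk" are hypothetical stand-ins, and the partial-match rule mirrors how GitHub Actions documents 'exclude'.

```python
from itertools import product


def expand_matrix(axes, exclude=()):
    """Return the cross-product of all axes, minus excluded combinations.

    An exclude entry removes every combination whose values match all of the
    keys the entry names, so a partial entry can remove several combinations.
    """
    keys = list(axes)
    combos = [dict(zip(keys, values)) for values in product(*axes.values())]
    return [
        combo
        for combo in combos
        if not any(
            all(combo.get(key) == value for key, value in rule.items())
            for rule in exclude
        )
    ]


# Hypothetical axes and values, loosely modeled on the names in this PR.
axes = {
    "sku": ["windows-intel", "windows-amd", "windows-nvidia"],
    "test-target": ["check-hlsl-d3d12", "check-hlsl-vk"],
}
exclude = [
    {"sku": "windows-amd", "test-target": "check-hlsl-vk"},
    {"sku": "windows-nvidia", "test-target": "check-hlsl-vk"},
]

tier1 = expand_matrix(axes, exclude)

# Promoting a target out of experimental means deleting its exclusion.
tier1_after_promotion = expand_matrix(axes, exclude[1:])
```

With this shape, promoting a configuration is a one-line deletion from the exclude list, which is the maintenance benefit the commit message argues for.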
This reverts commit 1eec3eb.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This change is for the draft AMD-testing PR only and should NOT be merged.
Strips the matrix down to only windows-amd x {check-hlsl-d3d12, check-hlsl-clang-d3d12}
so we can quickly iterate on AMD D3D12 stability investigation without spending
CI on Intel/NVIDIA/MacOS/WARP/Vulkan jobs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds LIT_OPTS=-j 1 env to the Run HLSL Tests step. This forces lit to run a single test at a time instead of parallelizing across CPU cores.

Goal: test whether the AMD amdxc64.dll PSO compilation crashes are caused or amplified by concurrent PSO compilation across multiple offloader.exe processes contending for the AMD user-mode driver.

If failures dramatically reduce or disappear: race condition / contention in amdxc64.dll.
If failures persist at similar rates: bug is purely a single-process issue in PSO compilation.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
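
As a rough illustration of why setting LIT_OPTS is enough to serialize the run, the sketch below models a runner that takes extra options from that environment variable and falls back to the CPU count when no -j is given. It is an approximation written for this discussion, not lit's own code; the function name effective_workers is hypothetical, and letting the environment value override the command line is an assumption of the sketch.

```python
import argparse
import os
import shlex


def effective_workers(env, cli_args=()):
    """Approximate how a -j value from LIT_OPTS becomes the worker count."""
    parser = argparse.ArgumentParser(add_help=False)
    # Default mirrors "one worker per CPU core" when -j is not given.
    parser.add_argument("-j", type=int, dest="workers",
                        default=os.cpu_count() or 1)
    env_args = shlex.split(env.get("LIT_OPTS", ""))
    # Assumption of this sketch: options from the environment are appended
    # after the command line options, so the environment value wins.
    args, _unknown = parser.parse_known_args(list(cli_args) + env_args)
    return args.workers


serial = effective_workers({"LIT_OPTS": "-j 1"})
default = effective_workers({})
```

Here serial is 1, while default depends on the machine the sketch runs on.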
Combines all 'maximum bug exposure' factors with the serial-execution isolation: Debug build, debug layer ON, single worker (-j 1). If this run also passes cleanly, the multi-process race hypothesis is maximally confirmed: even the most aggressive bug-amplifying config cannot reproduce the failure when tests are serialized.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
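
The interpretation rule these last two commits rely on (compare the failure rate of a parallel run with that of a serial run) could be written down as follows. This is a minimal sketch: the function name interpret and the 50% drop threshold are hypothetical choices for illustration, and the PR itself only says "dramatically reduce" versus "similar rates".

```python
def interpret(parallel, serial, drop_threshold=0.5):
    """Classify the bug from two runs.

    parallel and serial are (failures, total_tests) pairs. drop_threshold is
    the fraction by which the failure rate must fall to count as a dramatic
    reduction (an assumed cutoff, not one stated in the PR).
    """
    parallel_rate = parallel[0] / parallel[1]
    serial_rate = serial[0] / serial[1]
    if parallel_rate == 0:
        # Nothing to compare against if the parallel run never failed.
        return "inconclusive: parallel run showed no failures"
    if serial_rate <= parallel_rate * (1 - drop_threshold):
        return "race/contention across processes"
    return "single-process bug"


# Made-up counts purely to exercise the rule; not results from this PR.
print(interpret(parallel=(12, 400), serial=(0, 400)))
print(interpret(parallel=(12, 400), serial=(11, 400)))
```

A single pair of runs is weak evidence for an intermittent crash, so in practice the comparison would be repeated across several CI runs before drawing a conclusion.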
[CI testing only — do not merge]
Fourth verification variant. Tests whether the AMD amdxc64.dll PSO compilation crashes are caused (or amplified) by concurrent PSO compilation across multiple offloader.exe processes contending for the AMD user-mode driver.

What this changes

Adds LIT_OPTS: '-j 1' env to the test step in build-and-test-callable.yaml. lit runs one test at a time instead of using cpu_count() parallel workers (the default).

This means only one offloader.exe, and therefore one amdxc64.dll instance, is compiling a PSO at any given time, instead of N concurrent processes.

Why

test/lit.cfg.py invokes offloader.exe once per .test file. lit's default -j is cpu_count(), so N separate offloader processes run in parallel, all hitting the single AMD driver simultaneously.

If this is a race / contention bug: failures should drop sharply or disappear with -j 1, which points to contention inside amdxc64.dll.

If it's a pure single-process bug: failures should persist at similar rates, because the crash is in amdxc64.dll PSO compilation regardless of contention.

Companion PRs
Tradeoff
Serial execution will make the test run much slower — possibly 30+ minutes instead of ~12s for the test phase. That's expected.
Branch
alsepkow/offload-test-suite:pr-1187-testing-serial
pr-1187-testing (the RWD baseline)
LIT_OPTS: '-j 1' env to the Run HLSL Tests step