# PR: Enable Multi-GPU Execution for SN_SOLVER_HPC_CUDA (Single Workstation) and Fix MPI RK2 Scalar-Flux Sync

## Summary
This PR makes the CUDA HPC SN solver work correctly on a multi-GPU workstation by mapping MPI ranks to GPUs and ensuring angular work is distributed across all ranks.
It also fixes a correctness issue in the MPI + RK2 path that produced incorrect scalar-flux-derived QoIs in multi-rank runs.
Scope: `SN_SOLVER_HPC_CUDA` multi-GPU execution on one node with multiple GPUs (tested on 2 GPUs).

## Motivation
The previous CUDA HPC implementation was effectively single-GPU in practice; the underlying issues are detailed under Root Cause below.
Given the requirement to run on a 2-GPU workstation through the CUDA Singularity environment, these issues prevented reliable multi-GPU behavior and produced incorrect MPI results in RK2.
## Root Cause
### 1) Device selection was pinned

`InitCUDA()` always set `_cudaDeviceId = 0`, so all MPI ranks on the node targeted the same GPU unless externally constrained.
### 2) Angular split logic was not balanced across all ranks

The original partitioning used `_localNSys = _nSys / (_numProcs - 1)`. This split was uneven and did not distribute the angular systems cleanly across all ranks.
### 3) RK2 path missed MPI scalar-flux synchronization

In RK2 mode, `_device->scalarFlux` was updated after `RK2AverageAndScalarFluxKernel`, but it was not globally reduced across ranks before postprocessing/output. `FVMUpdate()` does perform an allreduce, but the RK2 averaging overwrites the scalar flux afterward. Result: rank-local values could leak into global QoIs.
## What Changed
### A) Balanced angular decomposition across all MPI ranks

File: `src/solvers/snsolver_hpc.cu`

Replaced the custom rank split with a standard balanced decomposition (see the sketch after this list):

- `baseChunk = _nSys / numRanks`
- `remainder = _nSys % numRanks`
- each rank gets `baseChunk`, plus one extra system if `rank < remainder`
- `_startSysIdx` is computed with prefix remainder compensation

This ensures all ranks participate and that angular ordinates are distributed deterministically and contiguously.
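For illustration, here is a minimal, self-contained sketch of this kind of balanced block decomposition. The helper name `DecomposeAngularSystems` and the standalone function form are illustrative, not the PR's literal code; the solver computes the equivalent quantities inline.

```cpp
#include <algorithm>
#include <cstddef>

// Contiguous range of angular systems owned by one rank.
struct AngularRange {
    std::size_t startIdx;    // first owned system (_startSysIdx in the solver)
    std::size_t localCount;  // number of owned systems (_localNSys in the solver)
};

// Balanced block decomposition of nSys systems over numRanks ranks:
// every rank gets baseChunk systems, and the first `remainder` ranks get one extra.
AngularRange DecomposeAngularSystems( std::size_t nSys, std::size_t numRanks, std::size_t rank ) {
    const std::size_t baseChunk = nSys / numRanks;
    const std::size_t remainder = nSys % numRanks;

    AngularRange r;
    r.localCount = baseChunk + ( rank < remainder ? 1 : 0 );
    // Prefix remainder compensation: each lower rank that received an extra
    // system shifts this rank's start index by one.
    r.startIdx = rank * baseChunk + std::min( rank, remainder );
    return r;
}
```

With this scheme the per-rank ranges are contiguous, cover `[0, _nSys)` exactly once, and differ in size by at most one system.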
### B) Rank-to-GPU mapping by node-local MPI rank

File: `src/solvers/snsolver_hpc.cu`

`InitCUDA()` now:

- determines the node-local rank via `MPI_Comm_split_type(... MPI_COMM_TYPE_SHARED ...)`, yielding `localRank` and `localSize`
- sets `_cudaDeviceId = localRank % nDevices`

This gives a correct mapping for one-node multi-GPU setups and graceful sharing when ranks exceed the number of visible devices; a minimal sketch follows below.
Added logging on rank 0, including the case `localSize > nDevices`.
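The following is a hedged sketch of this node-local rank-to-device mapping, not the PR's literal code; the helper name `SelectDeviceForLocalRank` is hypothetical, and the rank-0 log line mirrors the message quoted in the validation section below.

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <mpi.h>

// Map each MPI rank to a CUDA device based on its node-local rank
// (the PR implements equivalent logic inside InitCUDA()).
int SelectDeviceForLocalRank( MPI_Comm comm ) {
    // Split the communicator into per-node (shared-memory) groups.
    MPI_Comm nodeComm;
    MPI_Comm_split_type( comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodeComm );

    int localRank = 0, localSize = 1;
    MPI_Comm_rank( nodeComm, &localRank );
    MPI_Comm_size( nodeComm, &localSize );

    int nDevices = 0;
    cudaGetDeviceCount( &nDevices );

    // Round-robin mapping; if localSize > nDevices, ranks share devices gracefully.
    const int deviceId = ( nDevices > 0 ) ? localRank % nDevices : 0;
    cudaSetDevice( deviceId );

    int worldRank = 0;
    MPI_Comm_rank( comm, &worldRank );
    if( worldRank == 0 ) {
        std::printf( "CUDA backend: %d local MPI rank(s), %d visible CUDA device(s).\n", localSize, nDevices );
    }

    MPI_Comm_free( &nodeComm );
    return deviceId;
}
```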
### C) Fix MPI correctness in RK2

File: `src/solvers/snsolver_hpc.cu`

After `RK2AverageAndScalarFluxKernel`, under `IMPORT_MPI`, the scalar flux is now synchronized with an `MPI_Allreduce` (sum) over the `_nCells` entries. This mirrors the existing scalar-flux synchronization pattern in `FVMUpdate()` and restores consistency for postprocessing and output; a sketch of the reduction follows.
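A minimal sketch of this kind of post-RK2 reduction, assuming the rank-local scalar flux is available in a host-side buffer (whether the actual code reduces a host copy or passes the device pointer via CUDA-aware MPI is not specified here); the helper name is hypothetical.

```cpp
#include <mpi.h>
#include <vector>

// Globally sum the rank-local scalar-flux contributions so every rank holds
// the full field before postprocessing/output, mirroring the allreduce in FVMUpdate().
void SyncScalarFluxAfterRK2( std::vector<double>& scalarFlux /* length == _nCells */ ) {
#ifdef IMPORT_MPI
    MPI_Allreduce( MPI_IN_PLACE, scalarFlux.data(), static_cast<int>( scalarFlux.size() ),
                   MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
#endif
}
```

Without this reduction, each rank only carries the flux from its own angular subset, which is exactly the rank-local leakage described in Root Cause 3.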
### D) Build and docs updates for CUDA Singularity workflow

- `tools/singularity/install_kitrt_singularity_cuda.sh`: `BUILD_MPI=OFF` changed to `BUILD_MPI=ON`
- `README.md` (`mpirun -np 2 ...`)
- `CMakeLists.txt` (`BUILD_CUDA_HPC`)
- `include/solvers/snsolver_hpc_cuda.hpp`

## Validation and Test Evidence
All validation was executed inside the CUDA Singularity environment (`singularity exec --nv ...`).

### 1) Regression validation suite (HPC validation tests)
Build: `BUILD_MPI=OFF`, `BUILD_CUDA_HPC=ON`, `BUILD_TESTING=ON`

Run: `./unit_tests "[validation_tests][hpc]"`

Result: `All tests passed (748 assertions in 2 test cases)`

### 2) Multi-GPU execution check (MPI + CUDA)
Build: `BUILD_MPI=ON`, `BUILD_CUDA_HPC=ON`

Run: `mpirun -np 2 ./build_singularity_cuda/KiT-RT tests/input/validation_tests/SN_solver_hpc/lattice_hpc_200_cuda_order2.cfg`

Observed in log: `CUDA backend: 2 local MPI rank(s), 2 visible CUDA device(s).`

This confirms that the node-local rank/device mapping is active and that both GPUs are visible to the run.
### 3) Numerical consistency check: non-MPI CUDA vs MPI(2) CUDA

Compared the output CSV from the non-MPI CUDA run against the 2-rank MPI CUDA run.

Result: `CSV_COMPARE: MATCH`

This is especially important because it verifies the RK2 scalar-flux MPI fix: before the fix, this comparison differed; after the fix, it matches.
## Behavior Notes

## Risk Assessment

Risk level: Low to Medium

Why low:

Why medium:
## Reviewer Focus

Please review:

- the remainder handling in the balanced decomposition (`_nSys % numRanks`)

## Follow-ups (Optional, Not Required for This PR)