
Mpi cuda #64

Merged

ScSteffen merged 2 commits into master from mpi_cuda on Feb 13, 2026

Conversation

@ScSteffen
Contributor

PR: Enable Multi-GPU Execution for SN_SOLVER_HPC_CUDA (Single Workstation) and Fix MPI RK2 Scalar-Flux Sync

Summary

This PR makes the CUDA HPC SN solver work correctly on a multi-GPU workstation by mapping MPI ranks to GPUs and ensuring angular work is distributed across all ranks.
It also fixes a correctness issue in the MPI + RK2 path that produced incorrect scalar-flux-derived QoIs in multi-rank runs.

Scope:

  • SN_SOLVER_HPC_CUDA multi-GPU execution on a single node with multiple GPUs (tested with 2 GPUs).
  • Singularity CUDA workflow updates for MPI + CUDA build/run.
  • No changes to the numerical model beyond correct rank/device mapping and the previously missing MPI scalar-flux synchronization.

Motivation

The previous CUDA HPC implementation was effectively single-GPU in practice:

  • CUDA device selection was hardcoded to GPU 0.
  • Angular partitioning logic left rank utilization asymmetrical in multi-rank runs.
  • In temporal order 2 (RK2), the scalar flux after RK2 averaging was not allreduced before postprocessing/output, so rank 0 could write incomplete quantities in MPI mode.

Given the requirement to run on a 2-GPU workstation through the CUDA Singularity environment, these issues prevented reliable multi-GPU behavior and produced incorrect MPI results in RK2.

Root Cause

1) Device selection was pinned

InitCUDA() always set:

  • _cudaDeviceId = 0

So all MPI ranks on the node targeted the same GPU unless externally constrained.

2) Angular split logic was not balanced across all ranks

The original partitioning used:

  • _localNSys = _nSys / (_numProcs - 1)
  • custom last-rank remainder handling

This split was uneven and did not distribute work cleanly across all ranks (with 2 ranks, for example, _nSys / (_numProcs - 1) already evaluates to the full _nSys).

3) RK2 path missed MPI scalar-flux synchronization

In RK2 mode, after RK2AverageAndScalarFluxKernel, _device->scalarFlux was updated, but not globally reduced across ranks before postprocessing/output.
FVMUpdate() does perform an allreduce, but the RK2 averaging overwrites the scalar flux afterward.
Result: rank-local values could leak into global QoIs.

What Changed

A) Balanced angular decomposition across all MPI ranks

File: src/solvers/snsolver_hpc.cu

Replaced custom rank split with standard balanced decomposition:

  • baseChunk = _nSys / numRanks
  • remainder = _nSys % numRanks
  • each rank gets baseChunk plus one extra if rank < remainder
  • _startSysIdx computed with prefix remainder compensation

This ensures all ranks participate and angular ordinates are distributed deterministically and contiguously.
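
A minimal sketch of this split, using the names from the bullets above; the helper itself is illustrative, not the exact code in src/solvers/snsolver_hpc.cu:

```cpp
#include <algorithm>
#include <utility>

// Balanced contiguous split of nSys ordinates over numRanks ranks:
// ranks [0, remainder) receive one extra ordinate, and the start index
// compensates for every extra ordinate handed to lower ranks.
std::pair<unsigned, unsigned> BalancedAngularChunk( unsigned nSys, unsigned rank, unsigned numRanks ) {
    unsigned baseChunk   = nSys / numRanks;
    unsigned remainder   = nSys % numRanks;
    unsigned localNSys   = baseChunk + ( rank < remainder ? 1u : 0u );
    unsigned startSysIdx = rank * baseChunk + std::min( rank, remainder );
    return { startSysIdx, localNSys };   // corresponds to _startSysIdx and _localNSys
}
```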

B) Rank-to-GPU mapping by node-local MPI rank

File: src/solvers/snsolver_hpc.cu

InitCUDA() now:

  • creates a node-local shared-memory communicator via MPI_Comm_split_type(... MPI_COMM_TYPE_SHARED ...)
  • reads localRank and localSize from that communicator
  • sets the device as _cudaDeviceId = localRank % nDevices

This gives correct mapping for one-node multi-GPU setups and graceful sharing when ranks exceed visible devices.
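
A sketch of that mapping, assuming InitCUDA() follows the shape described above (illustrative only; error handling and guard details may differ in the actual snsolver_hpc.cu):

```cpp
#include <cuda_runtime.h>
#include <mpi.h>

// Pick a CUDA device from the node-local MPI rank (sketch, not the exact InitCUDA()).
int SelectDeviceByLocalRank() {
    MPI_Comm localComm;
    MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &localComm );

    int localRank = 0, localSize = 1;
    MPI_Comm_rank( localComm, &localRank );
    MPI_Comm_size( localComm, &localSize );
    MPI_Comm_free( &localComm );

    int nDevices = 0;
    cudaGetDeviceCount( &nDevices );
    if( nDevices < 1 ) return -1;   // no visible CUDA device on this node

    int cudaDeviceId = localRank % nDevices;   // wrap if more local ranks than GPUs
    cudaSetDevice( cudaDeviceId );
    return cudaDeviceId;
}
```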

Added logging on rank 0:

  • local MPI rank count
  • visible device count
  • warning when localSize > nDevices

C) Fix MPI correctness in RK2

File: src/solvers/snsolver_hpc.cu

After RK2AverageAndScalarFluxKernel, the code (under IMPORT_MPI) now:

  • copies scalar flux device -> host
  • runs MPI_Allreduce (sum) over _nCells
  • copies reduced scalar flux back host -> device

This mirrors the existing scalar-flux synchronization pattern in FVMUpdate() and restores consistency for postprocessing and output.
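
A sketch of that synchronization step, assuming a raw device pointer and a double-precision scalar flux; buffer and function names are illustrative, not the exact code:

```cpp
#include <cuda_runtime.h>
#include <mpi.h>
#include <vector>

// Globally reduce the rank-local scalar flux after RK2 averaging (sketch).
void SyncScalarFluxAfterRK2( double* devScalarFlux, unsigned nCells ) {
    std::vector<double> hostFlux( nCells );

    // device -> host copy of the rank-local scalar flux
    cudaMemcpy( hostFlux.data(), devScalarFlux, nCells * sizeof( double ), cudaMemcpyDeviceToHost );

    // sum the angular contributions of all ranks (each rank owns a subset of ordinates)
    MPI_Allreduce( MPI_IN_PLACE, hostFlux.data(), static_cast<int>( nCells ), MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );

    // host -> device copy of the globally reduced scalar flux
    cudaMemcpy( devScalarFlux, hostFlux.data(), nCells * sizeof( double ), cudaMemcpyHostToDevice );
}
```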

D) Build and docs updates for CUDA Singularity workflow

  • tools/singularity/install_kitrt_singularity_cuda.sh
    • switched CUDA script build from BUILD_MPI=OFF to BUILD_MPI=ON
  • README.md
    • section renamed to reflect multi-GPU-capable CUDA+MPI usage
    • run example updated to mpirun -np 2 ...
  • CMakeLists.txt
    • updated option description for BUILD_CUDA_HPC
  • include/solvers/snsolver_hpc_cuda.hpp
    • comment updated from “single GPU” to generic CUDA backend wording

Validation and Test Evidence

All validation was executed inside the CUDA Singularity environment (singularity exec --nv ...).

1) Regression validation suite (HPC validation tests)

Build:

  • BUILD_MPI=OFF
  • BUILD_CUDA_HPC=ON
  • BUILD_TESTING=ON

Run:

  • ./unit_tests "[validation_tests][hpc]"

Result:

  • All tests passed (748 assertions in 2 test cases)

2) Multi-GPU execution check (MPI + CUDA)

Build:

  • BUILD_MPI=ON
  • BUILD_CUDA_HPC=ON

Run:

  • mpirun -np 2 ./build_singularity_cuda/KiT-RT tests/input/validation_tests/SN_solver_hpc/lattice_hpc_200_cuda_order2.cfg

Observed in log:

  • CUDA backend: 2 local MPI rank(s), 2 visible CUDA device(s).

This confirms node-local rank/device mapping is active and both GPUs are visible to the run.

3) Numerical consistency check: non-MPI CUDA vs MPI(2) CUDA

Compared output CSV from:

  • non-MPI CUDA run
  • MPI(2) CUDA run

Comparison method:

  • ignored the runtime timestamp and wall-time columns
  • diffed all other QoI fields

Result:

  • CSV_COMPARE: MATCH

This is especially important because it verifies the RK2 scalar-flux MPI fix: before the fix, this comparison differed; after the fix, it matches.

Behavior Notes

  • MPI parallelization remains in the angular domain (quadrature ordinates), not spatial domain decomposition.
  • Spatial mesh data remains replicated per rank.
  • Multi-GPU mapping currently targets a single node workstation model (as requested).
  • If local MPI ranks exceed local visible GPUs, rank-to-device assignment wraps with modulo.

Risk Assessment

Risk level: Low to Medium

Why low:

  • changes are isolated to partitioning, device selection, and scalar-flux reduction points
  • no new numerical model, no kernel math changes
  • validation and cross-run consistency checks are clean

Why medium:

  • MPI+CUDA behavior is environment-sensitive (launcher, visibility, affinity)
  • the existing unit-test harness cannot yet run MPI-enabled Catch2 tests directly, since the test main does not initialize MPI

Reviewer Focus

Please review:

  • angular split logic correctness and edge cases (_nSys % numRanks)
  • rank-local GPU mapping behavior in one-node scenarios
  • RK2 scalar-flux synchronization placement and cost
  • whether additional logging or runtime knobs are desired for rank/device diagnostics

Follow-ups (Optional, Not Required for This PR)

  1. Add explicit MPI-enabled regression test target for CUDA path (currently validated by executable runs and CSV comparison).
  2. Add optional config/env control to override default local-rank device mapping (for affinity experiments).
  3. Consider reducing redundant MPI barriers around allreduces if profiling shows measurable overhead.

ScSteffen self-assigned this on Feb 13, 2026
ScSteffen added the enhancement (New feature or request) label on Feb 13, 2026
ScSteffen merged commit 091f8f1 into master on Feb 13, 2026 (1 check passed)
ScSteffen deleted the mpi_cuda branch on Feb 13, 2026 at 21:34