
Mpi cuda #64

Merged

ScSteffen merged 2 commits into master from mpi_cuda on Feb 13, 2026

Conversation

@ScSteffen
Contributor

PR: Enable Multi-GPU Execution for SN_SOLVER_HPC_CUDA (Single Workstation) and Fix MPI RK2 Scalar-Flux Sync

Summary

This PR makes the CUDA HPC SN solver work correctly on a multi-GPU workstation by mapping MPI ranks to GPUs and ensuring angular work is distributed across all ranks.
It also fixes a correctness issue in the MPI + RK2 path that produced incorrect scalar-flux-derived QoIs in multi-rank runs.

Scope:

  • SN_SOLVER_HPC_CUDA multi-GPU execution on a single node with multiple GPUs (tested with 2 GPUs).
  • Singularity CUDA workflow updates for MPI + CUDA build/run.
  • No changes to the numerical model beyond correct rank/device mapping and the previously missing MPI scalar-flux synchronization.

Motivation

The previous CUDA HPC implementation was effectively single-GPU in practice:

  • CUDA device selection was hardcoded to GPU 0.
  • Angular partitioning logic left rank utilization asymmetrical in multi-rank runs.
  • In temporal order 2 (RK2), the scalar flux after RK2 averaging was not allreduced before postprocessing/output, so rank 0 could write incomplete quantities in MPI mode.

Given the requirement to run on a 2-GPU workstation through the CUDA Singularity environment, these issues prevented reliable multi-GPU behavior and produced incorrect MPI results in RK2.

Root Cause

1) Device selection was pinned

InitCUDA() always set:

  • _cudaDeviceId = 0

So all MPI ranks on the node targeted the same GPU unless externally constrained.

2) Angular split logic was not balanced across all ranks

The original partitioning used:

  • _localNSys = _nSys / (_numProcs - 1)
  • custom last-rank remainder handling

This split was uneven and did not distribute work cleanly across all ranks (with 2 ranks, for example, _nSys / (_numProcs - 1) already evaluates to the full _nSys).

3) RK2 path missed MPI scalar-flux synchronization

In RK2 mode, after RK2AverageAndScalarFluxKernel, _device->scalarFlux was updated, but not globally reduced across ranks before postprocessing/output.
FVMUpdate() does perform an allreduce, but the RK2 averaging overwrites the scalar flux afterward.
Result: rank-local values could leak into global QoIs.

What Changed

A) Balanced angular decomposition across all MPI ranks

File: src/solvers/snsolver_hpc.cu

Replaced custom rank split with standard balanced decomposition:

  • baseChunk = _nSys / numRanks
  • remainder = _nSys % numRanks
  • each rank gets baseChunk plus one extra if rank < remainder
  • _startSysIdx computed with prefix remainder compensation

This ensures all ranks participate and angular ordinates are distributed deterministically and contiguously.
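
A minimal sketch of this split, using the names from the bullets above; the helper itself is illustrative, not the exact code in src/solvers/snsolver_hpc.cu:

```cpp
#include <algorithm>
#include <utility>

// Balanced contiguous split of nSys ordinates over numRanks ranks:
// ranks [0, remainder) receive one extra ordinate, and the start index
// compensates for every extra ordinate handed to lower ranks.
std::pair<unsigned, unsigned> BalancedAngularChunk( unsigned nSys, unsigned rank, unsigned numRanks ) {
    unsigned baseChunk   = nSys / numRanks;
    unsigned remainder   = nSys % numRanks;
    unsigned localNSys   = baseChunk + ( rank < remainder ? 1u : 0u );
    unsigned startSysIdx = rank * baseChunk + std::min( rank, remainder );
    return { startSysIdx, localNSys };   // corresponds to _startSysIdx and _localNSys
}
```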

B) Rank-to-GPU mapping by node-local MPI rank

File: src/solvers/snsolver_hpc.cu

InitCUDA() now:

  • creates a node-local shared-memory communicator via MPI_Comm_split_type(... MPI_COMM_TYPE_SHARED ...)
  • reads localRank and localSize from that communicator
  • sets the device as _cudaDeviceId = localRank % nDevices

This gives correct mapping for one-node multi-GPU setups and graceful sharing when ranks exceed visible devices.
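
A sketch of that mapping, assuming InitCUDA() follows the shape described above (illustrative only; error handling and guard details may differ in the actual snsolver_hpc.cu):

```cpp
#include <cuda_runtime.h>
#include <mpi.h>

// Pick a CUDA device from the node-local MPI rank (sketch, not the exact InitCUDA()).
int SelectDeviceByLocalRank() {
    MPI_Comm localComm;
    MPI_Comm_split_type( MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &localComm );

    int localRank = 0, localSize = 1;
    MPI_Comm_rank( localComm, &localRank );
    MPI_Comm_size( localComm, &localSize );
    MPI_Comm_free( &localComm );

    int nDevices = 0;
    cudaGetDeviceCount( &nDevices );
    if( nDevices < 1 ) return -1;   // no visible CUDA device on this node

    int cudaDeviceId = localRank % nDevices;   // wrap if more local ranks than GPUs
    cudaSetDevice( cudaDeviceId );
    return cudaDeviceId;
}
```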

Added logging on rank 0:

  • local MPI rank count
  • visible device count
  • warning when localSize > nDevices

C) Fix MPI correctness in RK2

File: src/solvers/snsolver_hpc.cu

After RK2AverageAndScalarFluxKernel, the code (under IMPORT_MPI) now:

  • copies scalar flux device -> host
  • runs MPI_Allreduce (sum) over _nCells
  • copies reduced scalar flux back host -> device

This mirrors the existing scalar-flux synchronization pattern in FVMUpdate() and restores consistency for postprocessing and output.
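
A sketch of that synchronization step, assuming a raw device pointer and a double-precision scalar flux; buffer and function names are illustrative, not the exact code:

```cpp
#include <cuda_runtime.h>
#include <mpi.h>
#include <vector>

// Globally reduce the rank-local scalar flux after RK2 averaging (sketch).
void SyncScalarFluxAfterRK2( double* devScalarFlux, unsigned nCells ) {
    std::vector<double> hostFlux( nCells );

    // device -> host copy of the rank-local scalar flux
    cudaMemcpy( hostFlux.data(), devScalarFlux, nCells * sizeof( double ), cudaMemcpyDeviceToHost );

    // sum the angular contributions of all ranks (each rank owns a subset of ordinates)
    MPI_Allreduce( MPI_IN_PLACE, hostFlux.data(), static_cast<int>( nCells ), MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );

    // host -> device copy of the globally reduced scalar flux
    cudaMemcpy( devScalarFlux, hostFlux.data(), nCells * sizeof( double ), cudaMemcpyHostToDevice );
}
```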

D) Build and docs updates for CUDA Singularity workflow

  • tools/singularity/install_kitrt_singularity_cuda.sh
    • switched CUDA script build from BUILD_MPI=OFF to BUILD_MPI=ON
  • README.md
    • section renamed to reflect multi-GPU-capable CUDA+MPI usage
    • run example updated to mpirun -np 2 ...
  • CMakeLists.txt
    • updated option description for BUILD_CUDA_HPC
  • include/solvers/snsolver_hpc_cuda.hpp
    • comment updated from “single GPU” to generic CUDA backend wording

Validation and Test Evidence

All validation was executed inside the CUDA Singularity environment (singularity exec --nv ...).

1) Regression validation suite (HPC validation tests)

Build:

  • BUILD_MPI=OFF
  • BUILD_CUDA_HPC=ON
  • BUILD_TESTING=ON

Run:

  • ./unit_tests "[validation_tests][hpc]"

Result:

  • All tests passed (748 assertions in 2 test cases)

2) Multi-GPU execution check (MPI + CUDA)

Build:

  • BUILD_MPI=ON
  • BUILD_CUDA_HPC=ON

Run:

  • mpirun -np 2 ./build_singularity_cuda/KiT-RT tests/input/validation_tests/SN_solver_hpc/lattice_hpc_200_cuda_order2.cfg

Observed in log:

  • CUDA backend: 2 local MPI rank(s), 2 visible CUDA device(s).

This confirms node-local rank/device mapping is active and both GPUs are visible to the run.

3) Numerical consistency check: non-MPI CUDA vs MPI(2) CUDA

Compared output CSV from:

  • non-MPI CUDA run
  • MPI(2) CUDA run

Comparison method:

  • ignored the runtime timestamp and wall-time columns
  • diffed all other QoI fields

Result:

  • CSV_COMPARE: MATCH

This is especially important because it verifies the RK2 scalar-flux MPI fix: before the fix, this comparison differed; after the fix, it matches.

Behavior Notes

  • MPI parallelization remains in the angular domain (quadrature ordinates), not spatial domain decomposition.
  • Spatial mesh data remains replicated per rank.
  • Multi-GPU mapping currently targets a single node workstation model (as requested).
  • If local MPI ranks exceed local visible GPUs, rank-to-device assignment wraps with modulo.

Risk Assessment

Risk level: Low to Medium

Why low:

  • changes are isolated to partitioning, device selection, and scalar-flux reduction points
  • no new numerical model, no kernel math changes
  • validation and cross-run consistency checks are clean

Why medium:

  • MPI+CUDA behavior is environment-sensitive (launcher, visibility, affinity)
  • the existing unit-test harness cannot yet run MPI-enabled Catch2 tests directly, since the test main does not initialize MPI

Reviewer Focus

Please review:

  • angular split logic correctness and edge cases (_nSys % numRanks)
  • rank-local GPU mapping behavior in one-node scenarios
  • RK2 scalar-flux synchronization placement and cost
  • whether additional logging or runtime knobs are desired for rank/device diagnostics

Follow-ups (Optional, Not Required for This PR)

  1. Add explicit MPI-enabled regression test target for CUDA path (currently validated by executable runs and CSV comparison).
  2. Add optional config/env control to override default local-rank device mapping (for affinity experiments).
  3. Consider reducing redundant MPI barriers around allreduces if profiling shows measurable overhead.

ScSteffen self-assigned this on Feb 13, 2026
ScSteffen added the enhancement (New feature or request) label on Feb 13, 2026
ScSteffen merged commit 091f8f1 into master on Feb 13, 2026 (1 check passed)
ScSteffen deleted the mpi_cuda branch on Feb 13, 2026 at 21:34