
Enable user to take ownership of MPI #722

Open
otbrown wants to merge 27 commits into QuEST-Kit:devel from otbrown:mpi-update

Conversation

@otbrown
Collaborator

@otbrown otbrown commented Apr 14, 2026

Closes #712.

  • QuEST no longer uses MPI_COMM_WORLD everywhere. Instead it uses mpiCommQuest (which will very often just be a duplicate of MPI_COMM_WORLD). This aligns with best practice for library developers, as it helps avoid clashes with user MPI calls.
  • Added a new custom environment constructor which lets the user positively declare that they want to take ownership of MPI (see the sketch after this list). QuEST relinquishes responsibility, and will not attempt to initialise or finalise MPI itself. It will still throw if there is an incompatible number of MPI processes.
  • Added some minimal testing... We should probably look at environment testing in more detail at some point! In fairness, I think if the mpiCommQuest stuff was broken essentially any QuEST program would also be broken.
  • Swapped Irecv and Isend in the comms routine, as this was the first thing I was asked about when we encountered issues with slingshot-11. It doesn't make any difference.
  • New Subcommunicator interface lets the user create a custom QuESTEnv on a communicator that is not MPI_COMM_WORLD. This interface is guarded behind compile-time flags as it requires bringing the MPI header into the interface, which means all consumers of libQuEST would need to include MPI as well. This interface is not yet documented or tested. I think this is tolerable in the short term as it should only appeal to advanced users.
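
For context, a minimal sketch of the intended user-owned-MPI flow, assuming the initCustomMpiQuESTEnv(useDistrib, userOwnsMpi, useGpuAccel, useMultithread) signature shown in the diff below; the argument values other than userOwnsMpi are illustrative, not prescribed:

#include <mpi.h>
#include "quest.h"

int main(int argc, char *argv[])
{
  // the user initialises (and later finalises) MPI themselves
  MPI_Init(&argc, &argv);

  // declare that the user owns MPI: QuEST duplicates a communicator internally
  // but will not call MPI_Init or MPI_Finalize itself
  initCustomMpiQuESTEnv(1 /*useDistrib*/, 1 /*userOwnsMpi*/, 0 /*useGpuAccel*/, 1 /*useMultithread*/);

  Qureg qureg = createQureg(20);
  initZeroState(qureg);
  // ... gates, measurements ...
  destroyQureg(qureg);

  finalizeQuESTEnv();  // does not finalise MPI when the user owns it
  MPI_Finalize();
  return 0;
}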

Speaking of tests! Test results on my laptop:

otbrown@oliver-win:~/quantum/quest/source/QuEST/build$ mpirun -n 4 ./tests/tests

QuEST execution environment:
  precision:       2
  multithreaded:   1
  distributed:     1
  GPU-accelerated: 0
  GPU-sharing ok:  0
  cuQuantum:       0
  num nodes:       4

Testing configuration:
  test all deployments:  0
  num qubits in qureg:   6
  max num qubit perms:   10
  max num superop targs: 4
  num mixed-deploy reps: 10

Tested Qureg deployments:
  CPU + OMP + MPI

Randomness seeded to: 2200603699
===============================================================================
All tests passed (293841 assertions in 275 test cases)

and min_example output:

Date:
2026-04-13T18:54:07+01:00

OMP_NUM_THREADS=36
SLURM_NTASKS_PER_NODE=8
SLURM_NNODES=1

QuEST execution environment:
  [precision]
    qreal.................double (8 bytes)
    qcomp.................std::complex<double> (16 bytes)
    qindex................long long int (8 bytes)
    validationEpsilon.....1e-12
  [compilation]
    isMpiCompiled....................1
    isMpiSubCommunicatorCompiled.....0
    isGpuCompiled....................0
    isOmpCompiled....................1
    isCuQuantumCompiled..............0
  [deployment]
    isMpiEnabled............1
    doesUserOwnMpi..........0
    isGpuEnabled............0
    isOmpEnabled............1
    isCuQuantumEnabled......0
    isGpuSharingEnabled.....0
  [cpu]
    numCpuCores.......576 per machine
    numOmpProcs.......36 per machine
    numOmpThrds.......36 per node
    cpuMemory.........754.7 GiB per machine
    cpuMemoryFree.....unknown
  [gpu]
    numGpus...........N/A
    gpuDirect.........N/A
    gpuMemPools.......N/A
    gpuMemory.........N/A
    gpuMemoryFree.....N/A
    gpuCache..........N/A
  [distribution]
    isMpiGpuAware.....0
    numMpiNodes.......8
  [statevector limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............35
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........37
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....58
    maxQubitsForIndOverflow.....63
  [density matrix limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............17
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........20
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....28
    maxQubitsForIndOverflow.....31
  [statevector autodeployment]
    8 qubits......[omp]
    29 qubits.....[omp] [mpi]
  [density matrix autodeployment]
    4 qubits......[omp]
    15 qubits.....[omp] [mpi]

Qureg:
  [deployment]
    isMpiEnabled.....1
    isGpuEnabled.....0
    isOmpEnabled.....1
  [dimension]
    isDensMatr.....0
    numQubits......20
    numCols........N/A
    numAmps........2^20 = 1048576
  [distribution]
    numNodes.....2^3 = 8
    numCols......N/A
    numAmps......2^17 = 131072 per node
  [memory]
    cpuAmps...........2 MiB per node
    gpuAmps...........N/A
    cpuCommBuffer.....2 MiB per node
    gpuCommBuffer.....N/A
    globalTotal.......32 MiB

Qureg (20 qubit statevector, 1048576 qcomps over 8 nodes, 4 MiB per node):
    -0.00062991+0.00025551i   |0⟩        [node 0]
    (8.3557e-5)-0.00063022i   |1⟩
    -0.00053343-0.00011416i   |2⟩
    0.00028553-0.00018421i    |3⟩
    -0.00094126-0.00097974i   |4⟩
    0.00016001+0.00042058i    |5⟩
    0.00073393-0.002013i      |6⟩
    0.00061578+0.00046316i    |7⟩
    0.00048523+0.00068937i    |8⟩
    -0.00081878-0.00060096i   |9⟩
    0.0012373+0.00049652i     |10⟩
    0.0001958+0.00014134i     |11⟩
    0.0013159-0.00040911i     |12⟩
    -0.00061634+0.00094838i   |13⟩
    0.00019392-0.00049259i    |14⟩
    -0.00031263+0.00018272i   |15⟩
                ⋮
    -0.00053361+0.00010033i   |1048560⟩  [node 7]
    0.0017006-0.00053801i     |1048561⟩
    0.0010772-(1.5397e-5)i    |1048562⟩
    -0.00085736-0.00062965i   |1048563⟩
    0.00083238-0.00053071i    |1048564⟩
    -0.00025421+0.00064579i   |1048565⟩
    -0.00069857-0.00043407i   |1048566⟩
    (-9.9973e-5)-0.00094092i  |1048567⟩
    0.00092175-0.00068456i    |1048568⟩
    -0.00026758-0.00027345i   |1048569⟩
    0.00034301+0.00011848i    |1048570⟩
    0.00026923-0.00043647i    |1048571⟩
    -0.00088784-0.00011044i   |1048572⟩
    -0.00037642-0.00087563i   |1048573⟩
    0.00031388+0.00031749i    |1048574⟩
    0.00014548+0.00030294i    |1048575⟩

Total probability: 1

JobID           JobName    Elapsed     MaxRSS
------------ ---------- ---------- ----------
161633       QuEST_min+   00:00:04
161633.batch      batch   00:00:04
161633.exte+     extern   00:00:04
161633.0     min_examp+   00:00:01

There is a caveat here: distributed tests on Cirrus keep failing, but this appears to be true of devel as well as this branch, so I'll do some more digging and raise it as a separate issue once I have a better idea of what is going on.

otbrown and others added 21 commits February 27, 2026 15:53
…for SubComm, because of course it enforces CXX
…ything but it makes MPI library implementers less nervous
@otbrown
Collaborator Author

otbrown commented Apr 16, 2026

On the plus side, a distributed QFT benchmark circuit runs perfectly correctly:

QFT Benchmark on 34 Qubits
Date:
2026-04-14T15:24:08+01:00

OMP_NUM_THREADS=36
SLURM_NTASKS_PER_NODE=8
SLURM_NNODES=1

QuEST QFT Benchmark

QuEST execution environment:
  [precision]
    qreal.................double (8 bytes)
    qcomp.................std::complex<double> (16 bytes)
    qindex................long long int (8 bytes)
    validationEpsilon.....1e-12
  [compilation]
    isMpiCompiled...........1
    isGpuCompiled...........0
    isOmpCompiled...........1
    isCuQuantumCompiled.....0
  [deployment]
    isMpiEnabled............1
    isGpuEnabled............0
    isOmpEnabled............1
    isCuQuantumEnabled......0
    isGpuSharingEnabled.....0
  [cpu]
    numCpuCores.......576 per machine
    numOmpProcs.......36 per machine
    numOmpThrds.......36 per node
    cpuMemory.........1.475 TiB per machine
    cpuMemoryFree.....unknown
  [gpu]
    numGpus...........N/A
    gpuDirect.........N/A
    gpuMemPools.......N/A
    gpuMemory.........N/A
    gpuMemoryFree.....N/A
    gpuCache..........N/A
  [distribution]
    isMpiGpuAware.....0
    numMpiNodes.......8
  [statevector limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............36
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........38
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....58
    maxQubitsForIndOverflow.....63
  [density matrix limits]
    minQubitsForMpi.............3
    maxQubitsForCpu.............18
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........20
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....28
    maxQubitsForIndOverflow.....31
  [statevector autodeployment]
    8 qubits......[omp]
    29 qubits.....[omp] [mpi]
  [density matrix autodeployment]
    4 qubits......[omp]
    15 qubits.....[omp] [mpi]


Performing QFT on 34 qubits.
Time to run QFT Circuit: 420.544s.

Correctness checks:
Statevector norm: 1
First amplitude: (7.62939e-06, 0)
JobID           JobName    Elapsed     MaxRSS
------------ ---------- ---------- ----------
162993       QFT_bench+   00:07:08
162993.batch      batch   00:07:08
162993.exte+     extern   00:07:08
162993.0            qft   00:07:05

@otbrown otbrown requested a review from JPRichings April 16, 2026 11:56
Comment thread quest/include/environment.h Outdated
*/
void initCustomQuESTEnv(int useDistrib, int useGpuAccel, int useMultithread);

void initCustomMpiQuESTEnv(int useDistrib, int userOwnsMpi, int useGpuAccel, int useMultithread);
Contributor

Do we not want a new initialiser, so it's really clear that there is a different setup? Also, this probably breaks compatibility for older software.

Collaborator Author

That is a new initialiser! This is effectively a C interface, so no overloading allowed.

Contributor

Apologies, I had misread this; I thought it was replacing initCustomQuESTEnv, hence the concern. Once again a reading comprehension issue on my part, must do better.

Collaborator Author

The difference is more subtle than I might have liked, but the alternative would be an initialiser that was way too long!

Contributor

initCustomMpiCommQuESTEnv?

@JPRichings
Contributor

Looks good; not run yet.

My only comment is above; it raises a wider point about how we manage change in the programming interface, as we probably want to stabilise as much as possible and only add to existing functionality.

@otbrown
Collaborator Author

otbrown commented Apr 27, 2026

Looks good; not run yet.

My only comment is above; it raises a wider point about how we manage change in the programming interface, as we probably want to stabilise as much as possible and only add to existing functionality.

Agreed. If we're being proper about it, breaking changes to the API require a new major version release, and I don't think we have the time or the inclination currently.

@JPRichings
Contributor

JPRichings commented Apr 28, 2026

Remaining testing required:

  • world comm -1 rank

  • world_comm as subcomm

  • sub comm only 1 rank

  • 50/50 split of world comm

  • Random splitting (4 rank subcomm or similar)

@iarejula-bsc left a comment

Overall, looks good to me, really like the idea! I was wondering if this design would allow running quregs in parallel, each with their own communicator (different envs for each one, maybe?), or if that would be impossible with the current implementation. I think that would be a highly interesting feature. As my team is trying to apply malleability to QuEST, that would be a huge win for us.

A few small comments below, but note this is an untested review.

Comment thread quest/src/comm/comm_config.cpp Outdated


void comm_init() {
void comm_init(int userOwnsMpi) {

Should userOwnsMpi be a bool instead of int? It's only ever used as a boolean. Is there a repo convention for this, or is it to maintain C compatibility?

Collaborator Author

Yes, but the repo convention is to use int for historical (C) reasons.

Member

Oh bool is totally fine internally, and should be used. You'll see it all through the codebase, like here. A previous contention was whether we can expose bool in a C-visible header, as we did here, so that's a license to use it in all user-visible stuff too.

Note initCustomQuESTEnv accepts int flags because they are not bools; they can be -1 to specify "choose this automatically".

Collaborator Author

Ah, thanks Tyson. I will update to bool in that case.

Comment thread quest/src/comm/comm_config.cpp Outdated
MPI_Init(NULL, NULL);

MPI_Init(NULL, NULL);
MPI_Comm_dup(MPI_COMM_WORLD, &mpiCommQuest);

What if we simplify this by passing the communicator directly as a parameter instead of userOwnsMpi? If the user passes a communicator, QuEST uses it; if null/not provided, we initialize MPI and default to MPI_COMM_WORLD. This removes the ambiguity of who owns MPI and makes the API more explicit:

void comm_init(MPI_Comm userComm = MPI_COMM_NULL) {
    if (userComm == MPI_COMM_NULL) {
        MPI_Init(NULL, NULL);
        MPI_Comm_dup(MPI_COMM_WORLD, &mpiCommQuest);
    } else {
        MPI_Comm_dup(userComm, &mpiCommQuest);
    }
}

Collaborator Author

I think this is what I had in mind originally, but if I recall correctly the issue is that it brings MPI_Comm into the public headers, which we need to avoid as it causes headaches for users who are linking a non-MPI application.

I will double-check though!
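
For illustration, a hedged sketch of how the MPI-aware declarations could stay out of reach of non-MPI consumers; the guard name COMPILE_SUBCOMM is borrowed from the discussion further down, and the exact flag name and header layout may differ from what this PR does:

// subcommunicator.h (sketch): mpi.h is only included when the consumer opts in
#ifndef SUBCOMMUNICATOR_H
#define SUBCOMMUNICATOR_H

#if COMPILE_SUBCOMM
#include <mpi.h>

// declared (and compiled) only when the MPI-aware interface is enabled
void initCustomMpiCommQuESTEnv(MPI_Comm userQuestComm, int useGpuAccel, int useMultithread);
#endif

#endif // SUBCOMMUNICATOR_H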

Comment thread quest/src/api/subcommunicator.cpp Outdated
Comment on lines +16 to +20
// set mpiCommQuest to user provided communicator
comm_setMpiComm(userQuestComm);

// initialise QuEST around that communicator
initCustomMpiQuESTEnv(USE_DISTRIB, USER_OWNS_MPI, useGpuAccel, useMultithread);
@iarejula-bsc May 5, 2026

Does this work correctly? initCustomMpiQuESTEnv calls comm_init, where we overwrite mpiCommQuest with MPI_COMM_WORLD, regardless of whether the user has ownership.

void comm_init(int userOwnsMpi) {
    ...
    // QuEST must initialise MPI if the user does not own it
    if (!userOwnsMpi)
        MPI_Init(NULL, NULL);

    MPI_Comm_dup(MPI_COMM_WORLD, &mpiCommQuest);

@iarejula-bsc left a comment

In subcommunicator.cpp:17, userQuestComm appears to be duplicated into mpiCommQuest before the normal environment validation runs. If validation later fails under a throwing or custom error handler, the duplicated communicator may be left in mpiCommQuest with no cleanup, which is an MPI resource leak.
Haven't had time to trace this fully so it may be a non-issue depending on execution order, but probably worth a quick check.
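
A possible shape for the cleanup, sketched under the assumption that validation failures surface as C++ exceptions; comm_freeMpiComm is a hypothetical helper, not existing code:

void initCustomMpiCommQuESTEnv(MPI_Comm userQuestComm, int useGpuAccel, int useMultithread) {

    // duplicates userQuestComm into mpiCommQuest
    comm_setMpiComm(userQuestComm);

    try {
        initCustomMpiQuESTEnv(USE_DISTRIB, USER_OWNS_MPI, useGpuAccel, useMultithread);
    } catch (...) {
        comm_freeMpiComm();  // hypothetical: MPI_Comm_free + reset mpiCommQuest to MPI_COMM_NULL
        throw;
    }
}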


// safely callable before MPI initialisation, but NOT after comm_end()
int isInit;
MPI_Initialized(&isInit);
@iarejula-bsc May 6, 2026

There are now two distinct initialized states that the code conflates into one:

MPI runtime initialized: MPI_Init() has been called
QuEST communicator initialized: mpiCommQuest != MPI_COMM_NULL

comm_isInit() only checks the first, so comm_isInit() returning true is not a safe guard against calling MPI with a null communicator.

Precise crash path:


1. MPI_Init()

   State after:
     MPI runtime:  initialized
     mpiCommQuest: MPI_COMM_NULL
     QuEST env:    not initialized

2. initCustomMpiQuESTEnv(0, 1, 0, 0)
   └─ validateAndInitCustomQuESTEnv(useDistrib=0, userOwnsMpi=1, useGpuAccel=0, useMultithread=0)
      ├─ validate_envNeverInit(...)
      ├─ validate_newEnvDeploymentMode(...)
      ├─ autodep_chooseQuESTEnvDeployment(...)
      ├─ if (useDistrib)
      │    └─ comm_init(userOwnsMpi)          ** skipped: useDistrib == 0 **
      └─ validate_newEnvDistributedBetweenPower2Nodes(...)
         └─ comm_getNumNodes()
            ├─ comm_isInit()
            │    └─ MPI_Initialized(&isInit)  returns true  <-- runtime is up, guard passes
            └─ MPI_Comm_size(mpiCommQuest, &numNodes)       <-- CRASH

   State at crash:
     MPI runtime:  initialized (ok)
     mpiCommQuest: MPI_COMM_NULL (!)
     QuEST env:    not initialized (!)

CRASH: MPI_Comm_size called with MPI_COMM_NULL -> abort

I suggest that comm_isInit() should distinguish between "MPI runtime is initialized" and "QuEST communicator is initialized", or all accessors should guard against mpiCommQuest != MPI_COMM_NULL.
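
One hedged sketch of what that could look like (not the PR's code; the single-node fallback value is an assumption):

bool comm_isMpiRuntimeInit() {
    int isInit = 0;
    MPI_Initialized(&isInit);
    return isInit;
}

bool comm_isInit() {
    // both the MPI runtime and QuEST's own communicator must be ready
    return comm_isMpiRuntimeInit() && (mpiCommQuest != MPI_COMM_NULL);
}

int comm_getNumNodes() {
    // assumption: behave as single-node when QuEST's communicator is not yet set up
    if (!comm_isInit())
        return 1;

    int numNodes = 0;
    MPI_Comm_size(mpiCommQuest, &numNodes);
    return numNodes;
}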

Comment on lines +16 to +20
// set mpiCommQuest to user provided communicator
comm_setMpiComm(userQuestComm);

// initialise QuEST around that communicator
initCustomMpiQuESTEnv(useDistrib, userOwnsMpi, useGpuAccel, useMultithread);

I think there is a problem with the order of operations or you must ensure MPI is initialized: initCustomMpiCommQuESTEnv() duplicates the user communicator before verifying that the MPI runtime exists at all.

In the case where MPI has not been initialized, the guard will be skipped:

void comm_setMpiComm(MPI_Comm newComm) {
    if (mpiCommQuest != MPI_COMM_NULL) // false
        error_commDoubleSetMpiComm();
    MPI_Comm_dup(newComm, &mpiCommQuest);         // <-- ... before any MPI_Init() check
}
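
A possible guard, sketched with a hypothetical error handler error_commMpiNotInit (the existing error_commDoubleSetMpiComm is from the PR):

void comm_setMpiComm(MPI_Comm newComm) {

    // ensure the MPI runtime exists before touching the user's communicator
    int isInit = 0;
    MPI_Initialized(&isInit);
    if (!isInit)
        error_commMpiNotInit();  // hypothetical: the caller must have initialised MPI first

    if (mpiCommQuest != MPI_COMM_NULL)
        error_commDoubleSetMpiComm();

    MPI_Comm_dup(newComm, &mpiCommQuest);
}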

@otbrown
Collaborator Author

otbrown commented May 6, 2026

@iarejula-bsc :

I was wondering if this design would allow running quregs in parallel, each with their own communicator (different envs for each one, maybe?), or if that would be impossible with the current implementation.

We think it should allow that yes -- also useful for our CQ project. @JPRichings will explicitly add it to his testing to be sure.

Thank you for catching some of the initialisation bugs -- I will have a more detailed look and try to fix them later this week! Much appreciated.

@iarejula-bsc

We think it should allow that yes

Nice! I was wondering whether using concurrent quregs at the same time could be an issue outside of MPI. I have two main concerns:

  • In v3, it was possible to inject the env when creating a qureg. Now a singleton pattern is used, so all of them would share the same instance.
  • I'm not sure whether other modules, such as the GPU, rely on global configuration, which could also cause issues (CUDA or whatever).

@JPRichings
Contributor

Note: add single-GPU, multi-Qureg testing with multiple rank-1 communicators, to confirm any implications for global GPU functionality here.

@TysonRayJones
Member

TysonRayJones commented May 9, 2026

I'm happy to review this too, and eager to, since realising my comm software architecture was limiting! It would be good to first see an example of the imagined user flow. Does anyone have an example flow handy? I already (very weakly) lean toward breaking the API of initCustomQuESTEnv over adding initCustomMpiQuESTEnv, but wonder if there are other (even more stateful 😟 ) designs.

Nice! I was wondering whether using concurrent quregs at the same time could be an issue outside of MPI. I have two main concerns:

In v3, it was possible to inject the env when creating a qureg. Now a singleton pattern is used, so all of them would share the same instance.
I'm not sure whether other modules, such as the GPU, rely on global configuration, which could also cause issues (CUDA or whatever).

I think this is achievable internally, though with some substantial refactoring - it does not seem admitted by the current communicator-singleton design of the existing codebase, and of this PR.

I could imagine a "stateful" user flow (to mirror the existing statefulness of initQuESTEnv()) of:

// existing API
initQuESTEnv(); // calls MPI_Init

// new API enabled only with compile-time flag, like COMPILE_SUBCOMM
MPI_Comm questComm = getDefaultQuESTCommunicator();

// user-fiddling with MPI directly (let's say myComm has smaller size than WORLD)
MPI_Comm myComm;
MPI_Comm_split(questComm, ...)

// new API enabled only with compile-time flag, like COMPILE_SUBCOMM
setQuESTCommunicator(myComm);

// existing API
Qureg smallQureg = createQureg(...); // adopts myComm, distributed between e.g. 4 ranks
// smallQureg would NOT have buffers allocated on WORLD nodes outside myComm

setDefaultQuESTCommunicator();
Qureg bigQureg = createQureg(...); // adopts internal comm, distributed between all ranks, e.g. 16

Here, we've left all existing API unchanged, but introduced new functions only exposed in a new header like Oliver's subcommunicator.h.

MPI_Comm getDefaultQuESTCommunicator();
void setQuESTCommunicator(MPI_Comm);
void setDefaultQuESTCommunicator();

We can achieve the above even without attaching MPI types to user-facing structs (which is illegal). One (semi-gross?) way is to attach only a communicator id which indexes a dynamic list (or dictionary?) in comm_config.cpp. Such a change isn't trivial - it will require new validation in functions receiving multiple distributed objects, like calcFidelity, applyFullStateDiagMatr().
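
To make the id idea concrete, a rough sketch of such a registry, assuming a singleton inside comm_config.cpp; none of these names exist in the codebase:

#include <vector>

static std::vector<MPI_Comm> commRegistry;   // e.g. index 0 holds QuEST's default communicator

int comm_registerComm(MPI_Comm comm) {
    MPI_Comm dup;
    MPI_Comm_dup(comm, &dup);                // QuEST keeps its own duplicate
    commRegistry.push_back(dup);
    return (int) commRegistry.size() - 1;    // a Qureg would store only this id
}

MPI_Comm comm_getComm(int commId) {
    return commRegistry.at(commId);
}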

@otbrown ignoring any considerations of internal complexity, does such a design make sense? My rationale is that if we have to sully the file/library separation to enable custom MPI, we should get as much "bang for our buck" and try to enable all things an MPI superuser may wish to do. Would the stateful design above achieve that? The statefulness is just an alternative to adding an MPI_Comm arg to every distributed-object creator of course.

EDIT: noting minor user pitfall:
The proposed stateful setQuESTCommunicator function differs from existing stateful functions (like setValidationEpsilon and setQuESTGpuThreadsPerBlock()) in that it affects only distributed objects created since its call. The existing stateful setters instead affect everything. That doesn't bother me too much though

@JPRichings
Contributor

ARCHER2 GPU environment build issue:

In file included from /work/z19/z19/jpr/Projects/Quantum/Development/mpi-update/mpi-update/quest/src/gpu/gpu_config.cpp:19:
/work/z19/z19/jpr/Projects/Quantum/Development/mpi-update/mpi-update/quest/src/comm/comm_config.hpp:16:12: fatal error: 'mpi.h' file not found
  #include <mpi.h>
           ^~~~~~~
1 error generated when compiling for gfx90a.
make[2]: *** [CMakeFiles/QuEST.dir/build.make:509: CMakeFiles/QuEST.dir/quest/src/gpu/gpu_config.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
In file included from /work/z19/z19/jpr/Projects/Quantum/Development/mpi-update/mpi-update/quest/src/gpu/gpu_subroutines.cpp:58:
In file included from /work/z19/z19/jpr/Projects/Quantum/Development/mpi-update/mpi-update/quest/src/gpu/gpu_thrust.cuh:42:
/work/z19/z19/jpr/Projects/Quantum/Development/mpi-update/mpi-update/quest/src/comm/comm_config.hpp:16:12: fatal error: 'mpi.h' file not found
  #include <mpi.h>

My Cmake-foo is not strong enough to figure out how this has failed on A2 and not in your own testing.

@JPRichings
Contributor

JPRichings commented May 10, 2026

I have my own QFT example working on archer2:

4 node run result:

    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....58
    maxQubitsForIndOverflow.....63
  [density matrix limits]
    minQubitsForMpi.............1
    maxQubitsForCpu.............16
    maxQubitsForGpu.............N/A
    maxQubitsForMpiCpu..........17
    maxQubitsForMpiGpu..........N/A
    maxQubitsForMemOverflow.....28
    maxQubitsForIndOverflow.....31
  [statevector autodeployment]
    8 qubits......[omp]
    27 qubits.....[omp] [mpi]
  [density matrix autodeployment]
    4 qubits......[omp]
    14 qubits.....[omp] [mpi]


Total number of gates: 210
Measured probability amplitude of |0..0> state: 9.53674e-07
Calculated probability amplitude of |0..0>, C0 = 1 / 2^20: 9.53674e-07
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 1 with probability 0.5
Qubit 1 measured in state 1 with probability 0.5
Qubit 2 measured in state 1 with probability 0.5
Qubit 3 measured in state 1 with probability 0.5
Qubit 4 measured in state 0 with probability 0.5
Qubit 5 measured in state 0 with probability 0.5
Qubit 6 measured in state 0 with probability 0.5
Qubit 7 measured in state 0 with probability 0.5
Qubit 8 measured in state 0 with probability 0.5
Qubit 9 measured in state 1 with probability 0.5
Qubit 10 measured in state 1 with probability 0.5
Qubit 11 measured in state 0 with probability 0.5
Qubit 12 measured in state 0 with probability 0.5
Qubit 13 measured in state 0 with probability 0.5
Qubit 14 measured in state 1 with probability 0.5
Qubit 15 measured in state 0 with probability 0.5
Qubit 16 measured in state 0 with probability 0.5
Qubit 17 measured in state 1 with probability 0.5
Qubit 18 measured in state 0 with probability 0.5
Qubit 19 measured in state 0 with probability 0.5

Final state:
|11110000011000100100>
QFT run time: 0.0168758s
Total run time: 0.386883s
size comm world: 4
Process 1 took part to the new communicator broadcast.

Code for future reference / check I've not done a stupid thing:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include "quest.h"
#include <mpi.h>

double calcPhaseShift(const int);
void qftQubit(Qureg, const int, const int);
void qft(Qureg, const int);
void writeState(const int * const, const size_t);
void getMonoTime(struct timespec *);
double getElapsedSeconds(const struct timespec * const,
  const struct timespec * const);

int main (int argc, char *argv[])
{
  int num_qubits;

  MPI_Init(&argc, &argv);

  if (argc < 2) {
    num_qubits = 4;
  } else if (argc > 2) {
    printf("Error: Too many arguments! Usage: ./qft $NUMBER_OF_QUBITS\n");
    return -1;
  } else {
    // argc == 2
    num_qubits = atoi(argv[1]);
    if (num_qubits < 1) {
      printf("Error: num_qubits < 1, you requested %d qubits\n", num_qubits);
      return -1;
    }
  }

  struct timespec run_start, run_stop;
  struct timespec qft_start, qft_stop;
  double run_time, qft_time;

  getMonoTime(&run_start);

  // Borrowing from https://rookiehpc.org/mpi/docs/mpi_comm_create/index.html

  // initialise MPI

  //MPI_Init(&argc, &argv);

  // Extract info about comm world

  int size;
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  printf("size comm world: %d \n", size);

  if(size < 3)
  {
      printf("Please run this application with at least 4 processes.\n");
      MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
  }

  MPI_Group world_group;
  MPI_Comm_group(MPI_COMM_WORLD, &world_group);


  // Keep only the processes 0 and 1 in the new group.
  int ranks[2] = {0, 1};
  MPI_Group new_group;
  MPI_Group_incl(world_group, 2, ranks, &new_group);

  // Create the new communicator from that group of processes.
  MPI_Comm new_communicator;
  MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_communicator);

  // Test new communicator

  int my_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  int value;

  if(new_communicator == MPI_COMM_NULL)
  {
      // I am not part of the new communicator, I can't participate to that broadcast.
      printf("Process %d did not take part to the new communicator broadcast.\n", my_rank);
  }
  else
  {
      // I am part of the new communicator, I can participate to that broadcast.
      MPI_Bcast(&value, 1, MPI_INT, 0, new_communicator);
      printf("Process %d took part to the new communicator broadcast.\n", my_rank);
  }

 if(new_communicator == MPI_COMM_NULL)
  {
    printf("Process %d did not take part in running QueST.\n", my_rank);
  }
  else
  {
  // initialise QuEST with sub communicator

  initCustomMpiCommQuESTEnv(new_communicator, 0, 1);
  // report on QuEST env

  reportQuESTEnv();

  // create quantum register
  Qureg qureg = createQureg(num_qubits);

  // report on qreg

  reportQureg(qureg);

  if (!getQuESTEnv().rank)
    printf("Simulating %d-Qubit QFT\n\n", num_qubits);

  // initialise input register to |0..0>
  initZeroState(qureg);

  // report model
  if (!getQuESTEnv().rank) {
    reportQuregParams(qureg);
    printf("\n");
    reportQuESTEnv();
    printf("\n");
  }

  getMonoTime(&qft_start);

  // apply QFT to input register
  qft(qureg, num_qubits);

  getMonoTime(&qft_stop);

  if (!getQuESTEnv().rank)
    printf("Total number of gates: %d\n", (num_qubits * (num_qubits+1))/2 );

  // results test
  qreal prob_0 = calcProbOfBasisState(qureg, 0);
  if (!getQuESTEnv().rank) {
    printf("Measured probability amplitude of |0..0> state: %g\n", prob_0);
    printf("Calculated probability amplitude of |0..0>, C0 = 1 / 2^%d: %g\n",
      num_qubits, 1.0 / pow(2,num_qubits));

    printf("Measuring final state: (all probabilities should be 0.5)\n");
  }

  int * state = (int *) malloc(num_qubits * sizeof(int));
  qreal * probs = (qreal *) malloc(num_qubits * sizeof(qreal));
  for (int n = 0; n < num_qubits; ++n) {
    state[n] = applyQubitMeasurementAndGetProb(qureg, n, probs+n);
  }
  if (!getQuESTEnv().rank) {
    for (int n = 0; n < num_qubits; ++n)
      printf("Qubit %d measured in state %d with probability %g\n",
        n, state[n], probs[n]);
    printf("\n");
    printf("Final state:\n");
    writeState(state, num_qubits);
  }

  // free resources and stop timer
  free(state);
  free(probs);
  destroyQureg(qureg);

  getMonoTime(&run_stop);

  qft_time = getElapsedSeconds(&qft_start, &qft_stop);
  run_time = getElapsedSeconds(&run_start, &run_stop);

  if (!getQuESTEnv().rank)
    printf("QFT run time: %gs\nTotal run time: %gs\n", qft_time, run_time);

  // finalize QuESTEnv late because we need MPI rank
  finalizeQuESTEnv();
  }

  // the user owns MPI in this example, so finalise it ourselves
  MPI_Finalize();

  return 0;
}

double calcPhaseShift(const int M) {
  return  ( M_PI / pow(2, (M-1)) );
}

void qftQubit(Qureg qureg, const int num_qubits, const int QUBIT_ID) {
  int control_id = 0;
  double angle = 0.0;

  applyHadamard(qureg, QUBIT_ID);
  int m = 2;
  for (int control = QUBIT_ID+1; control < num_qubits; ++control) {
    angle = calcPhaseShift(m++);
    applyTwoQubitPhaseShift(qureg, control, QUBIT_ID, angle);
  }

  return;
}

void qft(Qureg qureg, const int num_qubits) {
  for (int qid = 0; qid < num_qubits; ++qid)
    qftQubit(qureg, num_qubits, qid);
  return;
}

void writeState(const int * const STATE, const size_t num_qubits) {
  printf("|");
  for (size_t n = 0; n < num_qubits; ++n) printf("%d", STATE[n]);
  printf(">\n");
  return;
}

void getMonoTime(struct timespec * time) {
  clock_gettime(CLOCK_MONOTONIC, time);
  return;
}

double getElapsedSeconds(const struct timespec * const start,
const struct timespec * const stop) {
  const unsigned long BILLION = 1000000000UL;
  const unsigned long long TOTAL_NS =
    BILLION * (stop->tv_sec - start->tv_sec)
    + (stop->tv_nsec - start->tv_nsec);

  return (double) TOTAL_NS / BILLION;
}

@JPRichings
Contributor

JPRichings commented May 10, 2026

As noted by ticked items in #722 (comment)

The following tests needed to confirm this pull request is working have been done on ARCHER2 CPU multinode:

  • world comm -1 rank: tested working
  • world_comm as subcomm: tested working
  • sub comm only 1 rank: tested working
  • 50/50 split of world comm: tested working
  • Random splitting (4 rank subcomm or similar): redundant

Outstanding tests and issues:

  • GPU compilation
  • Test multiprocess and Multi-Qureg per GPU does not cause errors

@TysonRayJones
Member

@JPRichings super useful, tyvm! In that example, you call

MPI_Init(&argc, &argv);

explicitly before initialising QuEST. It's my understanding that those args never actually affect the MPI env; MPI just conveniently removes MPI-specific args for us. So is there any utility in letting the user call MPI_Init themselves? In my userflow example above, we would still call MPI_Init inside the traditional initQuESTEnv() (and finalize stays in finalizeQuESTEnv()), which minimises API change (or so I think).

@JPRichings
Contributor

JPRichings commented May 10, 2026

I have called MPI_Init explicitly because, if we were to try to incorporate QuEST into an existing MPI code, we cannot guarantee we control or can change how MPI_Init is called.

I agree that for a pure QuEST application with multiple Quregs, each with their own subcomm, controlling this in QuEST with MPI initialised in initQuESTEnv() is a good approach.

I think we ultimately need both for different use cases, but I am a bit wary of the amount of time we have for additional changes and debugging before our planned release. My concern is that we rush this and wish we had thought about it more carefully.

@TysonRayJones
Member

TysonRayJones commented May 10, 2026

I have called MPI_Init explicitly because, if we were to try to incorporate QuEST into an existing MPI code, we cannot guarantee we control or can change how MPI_Init is called.

What about this: when COMPILE_SUBCOMM (possibly renamed) is set, and ergo the "new MPI-specific functions" are exposed, initQuESTEnv() will check whether MPI is initialised before calling MPI_Init, and simply not call it again if so. We just relax the current validation that it's not already initialised (but we can still easily verify initQuESTEnv() hasn't been called twice through other singleton vars already present in environment.cpp).

(As I mentioned above, this depends upon MPI_Init not receiving any configuration - so there is no concept of "how MPI_Init is called" - but I may be mistaken here)
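
A hedged sketch of that relaxation (not existing code; questOwnsMpi is a hypothetical singleton flag):

static bool questOwnsMpi = false;   // hypothetical flag in environment.cpp

void initQuESTEnv() {

    // only call MPI_Init if the user has not already done so
    int alreadyInit = 0;
    MPI_Initialized(&alreadyInit);
    if (!alreadyInit) {
        MPI_Init(NULL, NULL);
        questOwnsMpi = true;        // QuEST is then responsible for MPI_Finalize
    }

    // ... existing environment setup, duplicating a communicator into mpiCommQuest ...
}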

So it remains valid to call MPI_Init or not before initQuESTEnv(). Then, if I understand right, initCustomMpiQuESTEnv() becomes redundant, given we expose setQuESTCommunicator(MPI_Comm) as above.

I think we ultimately need both for different use cases, but I am a bit wary of the amount of time we have for additional changes and debugging before our planned release. My concern is that we rush this and wish we had thought about it more carefully.

Certainly we shouldn't rush through any drafted designs - we can always expose experimental stuff in non-main branches 🙏

@JPRichings
Contributor

JPRichings commented May 10, 2026

I think I agree with your reasoning, and I hadn't twigged that COMPILE_SUBCOMM would cause initQuESTEnv() not to re-call MPI_Init. The comment around how MPI_Init is called is more about what the program that owns that process also does to create its own communicators before QuEST does its own manipulations. We are not guaranteed to be able to rely on comm world, so maybe we should pass a comm or set a default explicitly, so that we know we have defined QuEST_Comm_world in a way that does not conflict with other communicators from the parent application.

If I have understood correctly, in your proposal QuEST always knows about two communicators, current and default, and this will lead to the developer above keeping track of all the comms created for each Qureg themselves; we might end up needing to provide helpers for this.

I feel like having a default, and then setting the MPI comm for a given Qureg as the Qureg is created, would be easier for the user. I think setting the current comm will get lost by the user, and mistakes will ensue when it was set nine files over and forgotten about. (I don't like how cudaSetDevice works, for example.) I like the idea of tying the MPI comm to a Qureg as this feels cleaner somehow. I feel like I am however suggesting a large breaking change (no investigation done).

I think what this could buy us is the ability to clone a Qureg, or to check (based on the default comm) that Quregs are not overlapping, and I think this helps us with communication between Quregs later down the line, as we know which MPI communicators are involved from the Quregs themselves.

Not sure whether what I am suggesting makes sense; it is personal preference, mainly informed by cudaSetDevice.

Note: thinking about this a little more, we aren't likely to end up with different running environments using different compute targets for each Qureg, so we should separate the QuEST init, which deals with how we are going to simulate, from which bit of the available resource we plan to use for a given Qureg.

I do think that Oliver's current code gets us to stage one and we should maybe focus the next full release on bottoming out this implementation.

@TysonRayJones
Member

TysonRayJones commented May 10, 2026

We are not guaranteed to be able to rely on comm world, so maybe we should pass a comm or set a default explicitly, so that we know we have defined QuEST_Comm_world in a way that does not conflict with other communicators from the parent application.

Yep for sure. Regardless of any of the API changes, we should move internally to a custom communicator.

If I have understood correctly, in your proposal QuEST always knows about two communicators, current and default, and this will lead to the developer above keeping track of all the comms created for each Qureg themselves; we might end up needing to provide helpers for this.

In the design I propose, each Qureg (and FullStateDiagMatr; another distributed object) will have its own attached communicator (although the struct will only have an id for that communicator, stored in a singleton list inside comm_config.cpp). Whenever communication is required, it will involve the communicator of the Qureg or FullStateDiagMatr. Functions accepting multiple of these types would validate that all such objects share a communicator id.

The nuance of my proposal is all in the API. In lieu of creating a new "custom initialiser", or new creators for each distributed object, the use of

setQuESTCommunicator(myComm);

will bind myComm to subsequently created Qureg and FullStateDiagMatr. It just avoids having a new, custom creator (because there was already a "custom creator" - two is awful!) for each distributed type, and for the environment. User flexibility is entirely the same, if I'm not mistaken.

Indeed there can and should exist helper functions like

MPI_Comm getQuregCommunicator(Qureg); // the communicator permanently used by the Qureg
MPI_Comm getCurrentQuESTCommunicator(); // the communicator which subsequent Quregs will bind to

I think what this could buy us is the ability to clone a Qureg, or to check (based on the default comm) that Quregs are not overlapping, and I think this helps us with communication between Quregs later down the line, as we know which MPI communicators are involved from the Quregs themselves.

I think we're imagining the same internal design!

Note: thinking about this a little more, we aren't likely to end up with different running environments using different compute targets for each Qureg, so we should separate the QuEST init, which deals with how we are going to simulate, from which bit of the available resource we plan to use for a given Qureg.

Hmm was "aren't" a critical typo here 😅 ? In any case, each Qureg creation will indeed need to consult its MPI_Comm to see the available resources, if we envision a heterogenous deployment. That shouldn't actually be very hard!

I do think that Oliver's current code gets us to stage one and we should maybe focus the next full release on bottoming out this implementation.

I don't think we should release any API we intend to imminently break or rework entirely. But I suspect we can refine the interface of Oliver's current single-communicator design to be released in v4.3, in a way that doesn't require an API break when we extend it to Qureg-specific communicators. One way is to use

setQuESTCommunicator(MPI_Comm)

after initQuESTEnv(), permitting it to only ever be called once right now (relaxed later for Qureg-specific comms). The initQuESTEnv() call would call MPI_Init itself, but only if the user hadn't called MPI_Init already. And setQuESTCommunicator would just override the communicator singleton in Oliver's current design. Calling setQuESTCommunicator after createQureg or createFullStateDiagMatr would be illegal and undefined behaviour (we need to know comm at creation time).

If/when we later extend to Qureg-specific communicators, we are just relaxing when setQuESTCommunicator can be called; at any point in execution, affecting only Qureg created thereafter.

Note that what we cannot do is create a Qureg then set the user-given communicator. It's "too late" once buffers have been allocated (unless we want to be super inefficient).

@JPRichings
Copy link
Copy Markdown
Contributor

JPRichings commented May 11, 2026

I think we're imagining the same internal design! - 🥳

Hmm was "aren't" a critical typo here 😅 ? - I think what I meant by target here is type of hardware not which specific node. As in if we are running on HPC with GPUs should we run all Qureg in the same mode? i.e. Distributed, GPU aware,... If we end up with a Qureg on CPU and one on GPU for example then I think we need to consider how we are controlling that carefully as I don't think that was the intended change here. Basically I think I want to separate three things:

  1. What modes QuEST is able to use to run a simulation? (Distributed, GPU aware, Multithreaded,...)
  2. What mode is used by a given Qureg or FullStateDiagMatr?
  3. Of the available resources (N nodes of compute of a given type; we could split out GPUs and CPUs on the same node here as different "nodes"), which ones are allocated to a given Qureg or FullStateDiagMatr?

I agree we should not release and then immediately rework features. I was hoping we could find an initial subset that gets us a minimal implementation we can clean up for this release, and then keep exploring this. I was concerned the level of change was increasing.

OK on your point around using setQuESTCommunicator(MPI_Comm). I think I had misunderstood what your example would allow us to do. I thought we would be able to call setQuESTCommunicator multiple times, once before each new createQureg, but from your text in the last comment I now understand we wouldn't be able to in the current proposal.

I think my remaining confusion is: how do you propose picking the communicator set by setDefaultQuESTCommunicator()?

@JPRichings
Contributor

Fixed: #722 (comment). The ARCHER2 GPU compilation environment is always fun.

@otbrown
Collaborator Author

otbrown commented May 11, 2026

@TysonRayJones Apologies, I was only weakly following along over the weekend on my phone! I did have a chat with James today about your suggestions, and the summary of my thoughts is:

  • Conceptually I like the idea of enabling more precise user controlled deployment, and I think this warrants further investigation at a slightly nebulous, post v4.3 time.
  • However, I think we'd be better off letting the user set deployment parameters, and then dealing with the MPI ourselves. The envelope of MPI processes for QuEST would then be fixed at initialisation time only, but we could enable 'sub-deployment' for specific Quregs. This avoids the challenges of even more mpi.h in the public headers, and I suspect makes it easier to avoid orphaned processes and deadlocks.
  • However we end up handling this, it is going to need more time and thought than we have available before my proposed code-freeze date. One of the issues is that Quregs deployed in different comms are (possibly) incompatible with one another, so all multi-Qureg functions would need to either reject them, or otherwise handle the inter-comm communication, which would be another significant rewrite with performance implications.
  • I'm much more relaxed about introducing the new API as it is, and then potentially changing it in the future, as this is explicitly advanced functionality hidden behind an advanced interface. I expect the vast majority of users will continue to want to use all the available MPI ranks! (Or indeed no MPI ranks, and just CPU multithreading or a GPU...)

Every week I say I will get to this soon, and then I get completely derailed 😅 I truly hope to have implemented the additional validation proposed by the end of this week. I've got a couple of lengthy train journeys this week which should help with that.

I must insist however that we push the more significant work proposed above to a separate issue for post v4.3 contemplation!

otbrown and others added 4 commits May 12, 2026 10:57
…o avoid unexpected test failure due to range checking when compiled in Debug configuration
Co-authored-by: iarejula-bsc <inigo.arejula@bsc.es>
