
Fix race in MPSC algo #1812

Draft
ccotter wants to merge 7 commits into NVIDIA:main from ccotter:mpsc-bug

Conversation

@ccotter
Contributor

ccotter commented Feb 4, 2026

Relacy helped identify a race in the existing MPSC algo. I am having a hard time explaining exactly what's going on, but in the newly added unit test (the non-Relacy one), I am able to observe three different odd behaviors:

  • a consumer consuming the same element in an infinite loop, apparently due to the internal next pointers forming some sort of cycle
  • the consumer returning `&__nil_`!
  • the consumer never being able to consume a produced value (the node is lost)

With the non-Relacy unit test, in the existing algo, if I insert a random sleep of 0-10 microseconds in push_back after `__back_` is exchanged, I can observe one of the above behaviors nearly every single time. The most common was the first behavior.
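
To make the reproduction concrete, here is a rough sketch of the push side of this style of queue and where the sleep went. The struct and member names are illustrative, not the actual stdexec identifiers:

```cpp
// Illustrative sketch only: names (back_, next_, nil_) are placeholders,
// not the exact stdexec identifiers.
#include <atomic>
#include <chrono>
#include <random>
#include <thread>

struct node {
  std::atomic<node*> next_{nullptr};
};

struct mpsc_backend {
  node nil_{};                      // sentinel the consumer starts from
  std::atomic<node*> back_{&nil_};  // last node pushed by any producer

  void push_back(node* n) noexcept {
    n->next_.store(nullptr, std::memory_order_relaxed);
    node* prev = back_.exchange(n, std::memory_order_acq_rel);
    // At this point n is reachable through back_, but prev->next_ does not
    // yet point at it. Sleeping 0-10us here widens that window and makes the
    // behaviors listed above reproduce almost every run.
    thread_local std::mt19937 gen{std::random_device{}()};
    std::this_thread::sleep_for(std::chrono::microseconds(
        std::uniform_int_distribution<int>{0, 10}(gen)));
    prev->next_.store(n, std::memory_order_release);
  }
};
```

That window between the exchange on the back pointer and the store to the previous node's next pointer is inherent to this family of algorithms; the consumer has to detect and tolerate a node that is reachable from the back but not yet linked, which is where the nil handling matters.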

The existing algo claims to be based on Dmitry Vyukov's implementation, though one key difference is that the existing one uses an atomic pointer to a Node for the "nil" object, whereas Dmitry's stores an actual Node object embedded in the queue.

I re-implemented the version in stdexec exactly as it appears on Dmitry's website (which I had to dig up on archive.org), and it passes the newly added Relacy tests (which explore many thread interleavings) as well as the non-Relacy unit tests.
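
For reference, this is the shape of the intrusive MPSC node-based queue from the archived page, translated to `std::atomic`. The identifiers and memory orderings below are my own rendering (Dmitry's original is written with plain XCHG and stores), so they may not match the PR exactly:

```cpp
#include <atomic>

struct mpscq_node {
  std::atomic<mpscq_node*> next{nullptr};
};

struct mpscq {
  std::atomic<mpscq_node*> head;  // producers exchange here ("back" in stdexec terms)
  mpscq_node* tail;               // consumer-owned ("front" in stdexec terms)
  mpscq_node stub;                // sentinel embedded in the queue itself

  mpscq() : head(&stub), tail(&stub) {}

  void push(mpscq_node* n) {
    n->next.store(nullptr, std::memory_order_relaxed);
    mpscq_node* prev = head.exchange(n, std::memory_order_acq_rel);
    prev->next.store(n, std::memory_order_release);
  }

  // Single consumer only. Returns nullptr if the queue is empty or a producer
  // is mid-push (the node becomes visible once that push completes).
  mpscq_node* pop() {
    mpscq_node* t = tail;
    mpscq_node* next = t->next.load(std::memory_order_acquire);
    if (t == &stub) {
      if (next == nullptr) return nullptr;
      tail = next;
      t = next;
      next = next->next.load(std::memory_order_acquire);
    }
    if (next != nullptr) {
      tail = next;
      return t;
    }
    mpscq_node* h = head.load(std::memory_order_acquire);
    if (t != h) return nullptr;  // a producer is between its exchange and link
    push(&stub);                 // re-insert the stub so t can be detached safely
    next = t->next.load(std::memory_order_acquire);
    if (next != nullptr) {
      tail = next;
      return t;
    }
    return nullptr;
  }
};
```

Note the structural difference mentioned above: the stub/nil is a Node embedded in the queue itself, and when the consumer catches up it re-pushes the stub rather than tracking nil through a separate atomic pointer.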

I originally tracked down a bug in timed_thread_scheduler.cpp, where sometimes `STDEXEC_ASSERT(op->command_ == command_type::command_type::stop);` failed.

@copy-pr-bot

copy-pr-bot bot commented Feb 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ccotter
Contributor Author

ccotter commented Feb 4, 2026

cc @maikel - can you check this over, since I think you implemented the original version of the MPSC queue?

@ccotter
Contributor Author

ccotter commented Feb 4, 2026

Dmitry's website appears to have stopped working a few months ago, though the original link is on archive.org: https://web.archive.org/web/20221127081044/https://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue (Dmitry is also the author of Relacy).

@ccotter
Contributor Author

ccotter commented Feb 4, 2026

Also of note: I originally noticed the race/bug in timed_thread_scheduler while running the exec unit tests with a version of TSAN I am trying to upstream that fuzzes the scheduler with random sleeps. I think this is another success of that approach: llvm/llvm-project#178836

ccotter marked this pull request as draft February 4, 2026 17:04
@ccotter
Contributor Author

ccotter commented Feb 4, 2026

I caught up with @maikel. I'm converting this to a draft for now, as @maikel will take a closer look to understand whether the existing algo can be fixed. We also suspect my version may be over-specified in its atomics (it appears to be correct, but the memory orderings could likely be more relaxed).

@ccotter
Contributor Author

ccotter commented Feb 4, 2026

@maikel as an experiment to help narrow down the race that leads to the queue state becoming corrupt, I added a mutex in two places: https://gist.github.com/ccotter/37b0d3cb3ad88f8c4fea4b2adbb5dce7 ... Relacy says this queue is correct...

@maikel
Collaborator

maikel commented Feb 4, 2026

Yes, after revisiting the old code I agree that the management of nil is the culprit here. We should simplify the implementation towards Dmitry's original one. The only thing I would check is whether your usage of memory orderings and atomics can be improved.

@ericniebler
Collaborator

/ok to test 0a9df03

