gh-144586: Improve _Py_yield with a lightweight CPU instruction #144587

Open
corona10 wants to merge 5 commits into python:main from corona10:gh-144586

Conversation

@corona10 corona10 commented Feb 8, 2026


corona10 commented Feb 8, 2026

Benchmark on my Mac mini (consistently improved)

baseline

Python: 3.15.0a5+ free-threading build (heads/gh-115697:e682141c495, Feb  8 2026, 16:38:00) [Clang 17.0.0 (clang-1700.6.3.2)]
GIL enabled: False
CPUs: 12

2 threads: 0.0284s  (14,103,668 ops/sec)
4 threads: 0.0728s  (10,988,690 ops/sec)
8 threads: 0.3063s  (5,223,362 ops/sec)

with PR

➜  cpython git:(gh-144586) ✗ ./python.exe bench_mutex_contention.py
Python: 3.15.0a5+ free-threading build (heads/gh-144586:21bd43c7e5e, Feb  8 2026, 16:34:31) [Clang 17.0.0 (clang-1700.6.3.2)]
GIL enabled: False
CPUs: 12

2 threads: 0.0239s  (16,738,824 ops/sec)
4 threads: 0.0559s  (14,300,174 ops/sec)
8 threads: 0.1813s  (8,824,965 ops/sec)

script

import threading
import time
import sys
import os

NUM_THREADS_LIST = [2, 4, 8]
OPS_PER_THREAD = 200_000
ROUNDS = 3


def contention_bench(num_threads, ops):
    # All threads hammer a single shared lock; returns the elapsed time and
    # the final counter value so throughput can be computed.
    lock = threading.Lock()
    total = [0]
    barrier = threading.Barrier(num_threads + 1)

    def worker():
        barrier.wait()
        for _ in range(ops):
            with lock:
                total[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    barrier.wait()  # release all workers simultaneously
    t0 = time.perf_counter()
    for t in threads:
        t.join()
    return time.perf_counter() - t0, total[0]


if __name__ == "__main__":
    print(f"Python: {sys.version}")
    if hasattr(sys, "_is_gil_enabled"):
        print(f"GIL enabled: {sys._is_gil_enabled()}")
    print(f"CPUs: {os.cpu_count()}\n")

    for nt in NUM_THREADS_LIST:
        best = float("inf")
        for _ in range(ROUNDS):
            elapsed, total = contention_bench(nt, OPS_PER_THREAD)
            best = min(best, elapsed)
        print(f"{nt} threads: {best:.4f}s  ({total/best:,.0f} ops/sec)")

@corona10 corona10 removed the skip news label Feb 8, 2026
@corona10 corona10 requested a review from vstinner February 8, 2026 07:43
extern void _Py_yield(void);
// Lightweight CPU pause hint for spin-wait loops (e.g., x86 PAUSE, AArch64 WFE).
// Falls back to sched_yield() on platforms without a known pause instruction.
static inline void
Member Author

I made it static inline because the function call overhead is more expensive than a single instruction.

Contributor

This function is only used in lock.c, so why move it to the header?

Member Author
@corona10 corona10 Feb 8, 2026

Ah yeah, we can move it back to lock.c.

Member Author

Umm, no.

_Py_yield();

Contributor

I see, I was looking at an older checkout of the main branch. Making it static inline looks fine, although I think LTO would have inlined it anyway.

Member Author

> I think LTO would have inlined it anyway.

I think the same way, but I just followed our old convention :)

#elif defined(_M_X64) || defined(_M_IX86)
_mm_pause();
#elif defined(_M_ARM64) || defined(_M_ARM)
__yield();
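
For context, a minimal sketch of what a compiler-dispatched pause hint like this can look like is shown below. The function name _Py_cpu_pause and the exact preprocessor guards are illustrative assumptions rather than the code in this PR: the MSVC branches mirror the intrinsics shown above, the GCC/Clang branches use the usual inline-asm equivalents, and sched_yield() is the fallback mentioned in the header comment.

/* Illustrative sketch only: names and guards are assumptions, not this PR's code. */
#if defined(_MSC_VER)
#  include <intrin.h>   /* _mm_pause() / __yield() */
#else
#  include <sched.h>    /* sched_yield() fallback (POSIX) */
#endif

static inline void
_Py_cpu_pause(void)     /* hypothetical name */
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    _mm_pause();                   /* x86 PAUSE via MSVC intrinsic */
#elif defined(_MSC_VER) && (defined(_M_ARM64) || defined(_M_ARM))
    __yield();                     /* ARM YIELD hint via MSVC intrinsic */
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile ("pause");    /* x86 PAUSE via inline asm */
#elif defined(__GNUC__) && (defined(__aarch64__) || defined(__arm__))
    __asm__ volatile ("yield");    /* ARM YIELD hint via inline asm */
#else
    sched_yield();                 /* no known pause instruction: yield to the OS scheduler */
#endif
}

Being static inline in a header keeps the hint to a single instruction at each call site, which is the point of the change discussed above.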
@corona10 corona10 added the performance (Performance or resource usage) and topic-free-threading labels Feb 8, 2026
Contributor

colesbury commented Feb 8, 2026 via email


corona10 commented Feb 8, 2026

> We also have an unresolved issue where we are only spinning for one
> iteration, but changing that seems to hurt performance.

Just for sharing: my micro-benchmark and the ft-scaling benchmark don't show a negative impact from this change.
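
To make the spin-count trade-off above concrete, here is a hedged sketch of a bounded spin-then-block acquire loop around a toy C11 atomic_flag lock. MAX_SPIN, the toy lock, and the sched_yield() fallback are assumptions for illustration only and are not CPython's _PyMutex code, which parks the thread instead of yielding; the pause call reuses the hypothetical _Py_cpu_pause() sketch from earlier in this thread.

/* Hypothetical illustration of the spin-count trade-off; not CPython's lock code. */
#include <sched.h>        /* sched_yield() stands in for the real park/unpark path */
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_SPIN 1        /* the "one iteration" of spinning mentioned above */

typedef struct { atomic_flag held; } toy_lock_t;   /* toy lock, not _PyMutex */

static inline bool
toy_try_acquire(toy_lock_t *lock)
{
    /* test_and_set returns the previous value: false means we just took the lock */
    return !atomic_flag_test_and_set_explicit(&lock->held, memory_order_acquire);
}

static void
toy_acquire(toy_lock_t *lock)
{
    for (;;) {
        /* Spin a bounded number of times; raising MAX_SPIN burns more CPU
           in the hope of avoiding the much more expensive blocking path. */
        for (int i = 0; i < MAX_SPIN; i++) {
            if (toy_try_acquire(lock)) {
                return;                /* fast path: acquired without blocking */
            }
            _Py_cpu_pause();           /* hypothetical pause hint from the sketch above */
        }
        sched_yield();                 /* toy fallback; the real lock parks the thread here */
    }
}

With MAX_SPIN left at 1 this mirrors the single-iteration behavior quoted above; the benchmark numbers earlier in this thread suggest the pause hint helps even without raising that count.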
