gh-144586: Improve _Py_yield with a lightweight CPU instruction #144587

Open
corona10 wants to merge 5 commits into python:main from corona10:gh-144586

Conversation

@corona10 corona10 commented Feb 8, 2026


corona10 commented Feb 8, 2026

Benchmark on my Mac mini (consistently improved)

baseline

Python: 3.15.0a5+ free-threading build (heads/gh-115697:e682141c495, Feb  8 2026, 16:38:00) [Clang 17.0.0 (clang-1700.6.3.2)]
GIL enabled: False
CPUs: 12

2 threads: 0.0284s  (14,103,668 ops/sec)
4 threads: 0.0728s  (10,988,690 ops/sec)
8 threads: 0.3063s  (5,223,362 ops/sec)

with PR

➜  cpython git:(gh-144586) ✗ ./python.exe bench_mutex_contention.py
Python: 3.15.0a5+ free-threading build (heads/gh-144586:21bd43c7e5e, Feb  8 2026, 16:34:31) [Clang 17.0.0 (clang-1700.6.3.2)]
GIL enabled: False
CPUs: 12

2 threads: 0.0239s  (16,738,824 ops/sec)
4 threads: 0.0559s  (14,300,174 ops/sec)
8 threads: 0.1813s  (8,824,965 ops/sec)

script

import threading
import time
import sys
import os

NUM_THREADS_LIST = [2, 4, 8]
OPS_PER_THREAD = 200_000
ROUNDS = 3


def contention_bench(num_threads, ops):
    # All threads hammer a single shared lock; returns the elapsed time and
    # the final counter value so throughput can be computed.
    lock = threading.Lock()
    total = [0]
    barrier = threading.Barrier(num_threads + 1)

    def worker():
        barrier.wait()
        for _ in range(ops):
            with lock:
                total[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    barrier.wait()  # release all workers simultaneously
    t0 = time.perf_counter()
    for t in threads:
        t.join()
    return time.perf_counter() - t0, total[0]


if __name__ == "__main__":
    print(f"Python: {sys.version}")
    if hasattr(sys, "_is_gil_enabled"):
        print(f"GIL enabled: {sys._is_gil_enabled()}")
    print(f"CPUs: {os.cpu_count()}\n")

    for nt in NUM_THREADS_LIST:
        best = float("inf")
        for _ in range(ROUNDS):
            elapsed, total = contention_bench(nt, OPS_PER_THREAD)
            best = min(best, elapsed)
        print(f"{nt} threads: {best:.4f}s  ({total/best:,.0f} ops/sec)")

@corona10 corona10 removed the skip news label Feb 8, 2026
@corona10 corona10 requested a review from vstinner February 8, 2026 07:43
extern void _Py_yield(void);
// Lightweight CPU pause hint for spin-wait loops (e.g., x86 PAUSE, AArch64 WFE).
// Falls back to sched_yield() on platforms without a known pause instruction.
static inline void
Member Author

I made it static inline because the function call overhead is more expensive than a single instruction.

Contributor

This function is only used in lock.c, so why move it to the header?

Member Author
@corona10 corona10 Feb 8, 2026

Ah yeah, we can move it back to lock.c.

Member Author

Umm, no.

_Py_yield();

Contributor

I see, I was looking at an older checkout of the main branch. Making it static inline looks fine, although I think LTO would have inlined it anyway.

Member Author

> I think LTO would have inlined it anyway.

I think the same way, but I just followed our old convention :)

#elif defined(_M_X64) || defined(_M_IX86)
_mm_pause();
#elif defined(_M_ARM64) || defined(_M_ARM)
__yield();
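
For context, a minimal sketch of what a compiler-dispatched pause hint like this can look like is shown below. The function name _Py_cpu_pause and the exact preprocessor guards are illustrative assumptions rather than the code in this PR: the MSVC branches mirror the intrinsics shown above, the GCC/Clang branches use the usual inline-asm equivalents, and sched_yield() is the fallback mentioned in the header comment.

/* Illustrative sketch only: names and guards are assumptions, not this PR's code. */
#if defined(_MSC_VER)
#  include <intrin.h>   /* _mm_pause() / __yield() */
#else
#  include <sched.h>    /* sched_yield() fallback (POSIX) */
#endif

static inline void
_Py_cpu_pause(void)     /* hypothetical name */
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    _mm_pause();                   /* x86 PAUSE via MSVC intrinsic */
#elif defined(_MSC_VER) && (defined(_M_ARM64) || defined(_M_ARM))
    __yield();                     /* ARM YIELD hint via MSVC intrinsic */
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile ("pause");    /* x86 PAUSE via inline asm */
#elif defined(__GNUC__) && (defined(__aarch64__) || defined(__arm__))
    __asm__ volatile ("yield");    /* ARM YIELD hint via inline asm */
#else
    sched_yield();                 /* no known pause instruction: yield to the OS scheduler */
#endif
}

Being static inline in a header keeps the hint to a single instruction at each call site, which is the point of the change discussed above.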
@corona10 corona10 added the performance (Performance or resource usage) and topic-free-threading labels Feb 8, 2026
Contributor

colesbury commented Feb 8, 2026 via email


corona10 commented Feb 8, 2026

> We also have an unresolved issue where we are only spinning for one
> iteration, but changing that seems to hurt performance.

Just for sharing: my micro-benchmark and the ft-scaling benchmark don't show a negative impact from this change.
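
To make the spin-count trade-off above concrete, here is a hedged sketch of a bounded spin-then-block acquire loop around a toy C11 atomic_flag lock. MAX_SPIN, the toy lock, and the sched_yield() fallback are assumptions for illustration only and are not CPython's _PyMutex code, which parks the thread instead of yielding; the pause call reuses the hypothetical _Py_cpu_pause() sketch from earlier in this thread.

/* Hypothetical illustration of the spin-count trade-off; not CPython's lock code. */
#include <sched.h>        /* sched_yield() stands in for the real park/unpark path */
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_SPIN 1        /* the "one iteration" of spinning mentioned above */

typedef struct { atomic_flag held; } toy_lock_t;   /* toy lock, not _PyMutex */

static inline bool
toy_try_acquire(toy_lock_t *lock)
{
    /* test_and_set returns the previous value: false means we just took the lock */
    return !atomic_flag_test_and_set_explicit(&lock->held, memory_order_acquire);
}

static void
toy_acquire(toy_lock_t *lock)
{
    for (;;) {
        /* Spin a bounded number of times; raising MAX_SPIN burns more CPU
           in the hope of avoiding the much more expensive blocking path. */
        for (int i = 0; i < MAX_SPIN; i++) {
            if (toy_try_acquire(lock)) {
                return;                /* fast path: acquired without blocking */
            }
            _Py_cpu_pause();           /* hypothetical pause hint from the sketch above */
        }
        sched_yield();                 /* toy fallback; the real lock parks the thread here */
    }
}

With MAX_SPIN left at 1 this mirrors the single-iteration behavior quoted above; the benchmark numbers earlier in this thread suggest the pause hint helps even without raising that count.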
