Skip to content

feat(session): multi-process session-server (N workers + sticky-hash router)#31

Draft
rmfan wants to merge 22 commits into
prodfrom
feat/session-server-multiprocess
Draft

feat(session): multi-process session-server (N workers + sticky-hash router)#31
rmfan wants to merge 22 commits into
prodfrom
feat/session-server-multiprocess

Conversation

@rmfan

@rmfan rmfan commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Problem

The session-server (miles/rollout/session/sessions.py) runs as a single uvicorn process per RolloutManager actor. One GIL serialises tokenizer work (merge_tokens, chat-template render), JSON serialisation of multi-MB SGLang responses, and asyncio scheduling for every in-flight request. On a Slurm CPU node a single process saturates near ~50 r/s. Past that, requests queue inside the event loop and tail latency explodes — production run 1709750 showed one session-server PID serving 180k+ events with row-5 p50 = 20 s and p99 = 192 s, while the upstream SGLang generation itself stayed well under a second.

This PR lifts both layers off a single GIL: the session-server backend can be sharded across N processes, and the front-end proxy that fans requests to those backends can itself be sharded across K processes. Session affinity is preserved end-to-end without any shared state between the K proxy workers.

Design

Backend multi-process (--session-server-workers N)

_start_session_server (in miles/ray/rollout.py) spawns N independent SessionServer processes when N > 1. Each backend gets its own port (walked forward from port+1, tracked in a chosen_ports set so sibling spawns can't race onto the same one), its own tokenizer, its own asyncio loop, and a worker_index it stamps onto every session it creates. Children are daemon=True; the parent installs a SIGTERM handler plus atexit hook that does .terminate() → .join(timeout) → .kill() on each child. N = 1 is bit-for-bit unchanged from today.

Router multi-worker (--session-router-workers K)

The user-facing port is held by a small Python uvicorn proxy (session_router.py) that forwards each request to one of the N backends. With K = 1, the original single-process router runs unchanged. A second bench showed this single router process itself caps at ~50 r/s — even with 4 backends behind it that together can serve ~280 r/s when reached directly, the front-end GIL becomes the new bottleneck (see Evidence below).

With K > 1, _start_session_server spawns K independent uvicorn worker processes, each opening its own listening socket on the user-facing port with SO_REUSEPORT set, then handing the bound socket to uvicorn.Server.run(sockets=[sock]). The Linux kernel (≥ 3.9) hash-distributes inbound TCP connections across the K listeners. The router workers share no state and do not coordinate; each is a stateless duplicate of the others. (uvicorn 0.40–0.49 — the versions we ship — has no reuse_port kwarg, so handing it a pre-bound socket is the supported path.)

How stickiness is preserved

Harbor sees one URL. Every session always lands on the same backend, regardless of which router worker accepted the connection. The mechanism:

  1. When a backend handles POST /sessions, SessionRegistry.create_session mints an id of the form w{worker_index}-{uuid4().hex}. The backend's index is encoded in the id itself.
  2. On every subsequent request, any router worker that receives it parses the w<idx>- prefix and forwards to backend idx. The id is the routing key — there is no shared hash function and no routing table for the K routers to keep in sync.
  3. The kernel's SO_REUSEPORT distribution picks a router worker per inbound TCP connection. That choice is independent of session affinity because every router worker applies the same deterministic prefix → backend rule.
  4. Requests with no session_id (POST /sessions, /health) are dispatched round-robin via a per-worker itertools.count(). The chosen backend stamps its own index on the returned id, so all follow-up calls route correctly. Per-worker round-robin counters drift but average out at scale.

Net effect: K routers behave as one logical front-end. Same session_id → same backend, every time, no coordination required.

Supporting changes

  • asyncio.to_thread wrap on the two synchronous tokenizer calls (prepare_pretokenized, update_pretokenized_state) — keeps the event loop responsive during merge/encode. Benefits N=1 deployments too.
  • Streaming request/response pass-through (client.send(stream=True) + StreamingResponse(aiter_raw())) — no router-side buffering or JSON re-encoding.
  • HTTP keepalive enabled in the router → backend httpx pool — eliminates per-request TCP setup at peak.
  • Session-id format switched from a bare uuid4().hex (N=1, back-compat) to w{idx}-{uuid4().hex} (N>1) — prefix-encoded, no shared hash algorithm.
  • Unknown / malformed session_ids fall back to round-robin instead of returning 404 — lets the existing get_or_create_session auto-reseed path actually fire.

Evidence

Headline server-side server_done_p99 (45 s window, SGLang replaced by a constant-0.5 s stub so the table reflects session-server behaviour alone):

in-flight N wc=1 backend wc=4 backends
100 0.55 s 0.55 s
200 5.99 s 6.36 s (stub-queue noise)
400 6.03 s 0.74 s (end-to-end Slurm run)
800 11.63 s 8.17 s

End-to-end on a real Slurm CPU node at N=400 in-flight, going from 1 → 4 backends dropped server_p99 9× (6.74 s → 0.74 s). Throughput at the same point did not scale, which surfaced the second cap.

Router-cap finding (wc=4, N=400):

path throughput
through the single-process router ~51 r/s
bypassing the router via sticky direct-to-backend addressing ~284 r/s

The 5.5× gap is the router's GIL. K > 1 router workers close it; the K-way path is what makes the multi-backend gains reachable end-to-end through the real entry point.

Files touched

  • miles/utils/arguments.py--session-server-workers, --session-router-workers flags.
  • miles/ray/rollout.py _start_session_server — N==1 / K==1 paths unchanged; otherwise spawns N backends and K router workers, registers the SIGTERM reaper before the first .start(), deep-copies per-worker args to avoid shared-ref bugs, and handles port exhaustion explicitly.
  • miles/rollout/session/sessions.pyasyncio.to_thread wrap; worker_index surfaced in /health.
  • miles/rollout/session/linear_trajectory.pySessionRegistry(worker_index, worker_count) + prefix-encoded session_id.
  • miles/rollout/session/session_router.py (new) — proxy with streaming pass-through, prefix-parsing router, round-robin fallback, SO_REUSEPORT socket helper.
  • miles/utils/http_utils.py — httpx keepalive disabled on the loadgen side to spread connections across router workers (kernel hash is by 4-tuple).
  • tests/fast/router/test_session_uuid_routing.py, tests/fast/router/test_multi_worker_startup.py, tests/fast/router/test_router_multi_worker.py — prefix invariants across N ∈ {1, 2, 4, 8, 16}, /health schema + malformed-id fallback, K=1 vs K>1 launch contract and per-worker instance_id isolation.

Known limitations

  • Backend chat-completions still buffers the SGLang response body. The handler does await response.aread() on the upstream body (miles/rollout/session/session_server.py:73) to extract meta_info. The streaming change in this PR covers the wildcard pass-through path only; the chat-completions path remains buffered. Out of scope.
  • Tokenizer memory multiplies by N. Qwen-class tokenizers are approximately a few hundred MB resident per backend; N ≤ 8 is comfortable on current Slurm node memory but should be sized per model.
  • SO_REUSEPORT semantics differ on macOS. K > 1 is intended for Linux production. The K=1 default keeps dev environments on the original code path; the multi-worker tests patch the uvicorn call surface rather than binding real ports.
  • Router needs K ≈ N for full backend saturation (TODO: swap to nginx). Empirically each Python uvicorn router worker caps at ~50–60 r/s. With K = N the router stops being a coalescer and acts as a pure parallelism layer — same process count as if every backend listened on the user port directly. A follow-up should replace the Python router with nginx (template-generated config, started as a child of _start_session_server): per-worker throughput is ~100× higher, K = 1 nginx worker saturates 8+ backends, and the multi-worker router code path can be removed. Not in this PR to keep the change focused on backend multi-process + minimum-viable routing.
  • Stale-update guard previously returned phantom 200s. Follow-up commit (66f27f6) now raises SessionStateConflictError (409) when num_assistant changes during the unlocked proxy phase, instead of silently skipping the state update and returning 200. The old behaviour caused cursor-mismatch crashes in compute_samples_from_openai_records downstream (run 1711903: 24 conflict warnings → 2 cursor-mismatch failures with 88/102-token deltas). Callers must treat 409 from /sessions/{id}/v1/chat/completions as a retryable conflict.

Risk / rollout

Default off (--session-server-workers 1 --session-router-workers 1) → bit-for-bit current behaviour. Recommended initial production config: --session-server-workers 4 --session-router-workers 4 (until the nginx replacement lands, set K ≈ N so the router doesn't become the new cap). Roll forward, no migration; flip per-job after one clean run.

rmfan and others added 5 commits May 29, 2026 14:49
…n workers

A pooled httpx.AsyncClient against a uvicorn --workers N server pins all
requests to the small subset of workers that accept()-won the pooled TCP
connections (uvicorn shares one listen socket across workers; no
SO_REUSEPORT, no work-stealing). Observed in a harbor_server run:
n_workers_active = 2 of 32 for most minutes, with those 2 workers
saturated at their per-process Semaphore cap while the other 30 sat idle.

Setting max_keepalive_connections=0 closes the TCP after each response,
so every /run gets its own accept() race and load spreads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `worker_index` / `worker_count` to `SessionRegistry` and pin
`create_session()` to regenerate uuids until the hash falls in the
current worker's bucket. The hash function (`session_id_bucket`,
md5-of-utf8 truncated mod N) is the load-bearing agreement between
the front-end router and the registry — they must use the same one.

When `worker_count == 1` (default), behavior is identical to before.

Also surface `worker_index`/`worker_count` in `/health` so operators
can correlate logs across workers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A small FastAPI app that:
  * Parses session_id from /sessions/{id}/... URL paths.
  * Picks a backend with session_id_bucket(session_id, N) — the same
    hash SessionRegistry.create_session uses to pin uuids, so a
    session always routes back to its creator.
  * Round-robins stateless paths (POST /sessions, /health, etc.) so
    we don't hot-spot worker 0.
  * Streams body through with no JSON re-work in the hot path.

Picked the Python ASGI option over nginx because nginx is not a
guaranteed presence on the Slurm compute nodes RolloutManager runs on;
this keeps the deployment dependency-free. The router does almost no
per-request CPU work (path-parse + md5 + httpx passthrough), so the
GIL on the router process is not the new bottleneck — all the
tokenizer / TITO / JSON work happens in the N backend workers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `--session-server-workers N` (default 1, opt-in).

When N == 1, `_start_session_server` behaves exactly as before.

When N > 1, it:
  1. Allocates N backend ports starting at session_server_port + 1.
  2. Spawns N SessionServer processes, each with its own
     (worker_index, worker_count, instance_id) and its own tokenizer.
  3. Waits for all N backends to be reachable.
  4. Spawns the SessionRouter front-end on session_server_port.
  5. Registers SIGTERM and atexit reapers so children don't outlive
     the parent (Ray actor shutdown, kill -TERM, etc.).

Memory cost: tokenizer is loaded N times (per-worker process). For
Qwen-class tokenizers this is a few hundred MB each; verify before
flipping the default to something like cpu_count()//2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test_session_uuid_routing.py:
  - session_id_bucket is deterministic and roughly balanced.
  - Every uuid SessionRegistry.create_session(i, N) returns hashes
    to bucket i (load-bearing invariant — if this breaks, sticky
    routing breaks and trials silently fall through the auto-reseed
    path on every turn).
  - SessionRouter.pick_backend agrees with SessionRegistry on the
    same hash.
  - Stateless paths round-robin instead of pinning to worker 0.

test_multi_worker_startup.py:
  - Smoke-test the router end-to-end with 4 fake HTTP backends
    spun up on real ports; verify hash-routed requests land on the
    expected backend.
  - Verify /health on the router is local, not proxied.

Not run: any test that loads a real tokenizer or starts a real Ray
job. Out of scope for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rmfan rmfan requested a review from a team as a code owner June 4, 2026 04:47
@rmfan rmfan marked this pull request as draft June 4, 2026 05:15
rmfan and others added 5 commits June 3, 2026 22:24
…ion_id

Switch from md5(session_id) %% N + UUID rejection-sampling to Stripe-style
prefix encoding: SessionRegistry stamps "w<idx>-" on every multi-worker
session_id, and SessionRouter parses the prefix instead of hashing.

This kills two failure modes the prior design carried:

1. The 100-try regen cap was a hot-path RuntimeError land-mine — at N=8
   it tripped roughly once per ~20 minutes of 800 RPS create_session
   traffic ((7/8)^100 ~= 1.6e-6 per call), bubbling a 500 to the agent
   and killing the rollout.
2. Router and backend had to agree on the hash function "forever"; any
   future drift silently 100%%-misrouted every request.

The new contract is a 4-char str.split — no shared algorithm, no
rejection sampling. Invalid prefixes 404 with a clear error. Stateless
paths (POST /sessions, /health) still round-robin.

Tests:
- test_session_uuid_routing.py: replace TestRouterAgreement hash test
  with parse-correctness invariants (well-formed prefixes parse, bare
  uuids / non-numeric prefixes / out-of-range indices raise).
- test_multi_worker_startup.py: mint test ids with "w<idx>-" prefix
  instead of bare uuid + session_id_bucket lookup.

Design rationale in docs/sticky-session-routing-research.md (Mechanism A).
Addresses smell #1 and audit H1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… B1)

Replace the buffer+re-encode proxy with a true streaming pass-through:

- Request body: await request.body() (full buffer) ->
  content=request.stream() in client.build_request, so the multi-MB
  SGLang request payload never lands in router RAM.
- Response body: await response.aread() + JSONResponse(json.loads(...))
  -> client.send(req, stream=True) + StreamingResponse(aiter_raw()),
  so output_token_logprobs streams chunk-by-chunk and the router
  doesn't burn one GIL on json.loads + json.dumps per response.

The old code held ~800 x ~5MB = ~4GB of decoded Python dicts at the
800-in-flight concurrency this PR is sized for, and spent the router's
GIL on json codecs — the exact bottleneck class the multi-process
layout was supposed to eliminate.

Header handling: factored hop-by-hop header set per RFC 7230 §6.1 into
_HOP_BY_HOP_HEADERS; applied symmetrically to request and response.
content-type passes through unmodified (preserves charset hints that
the old JSONResponse path silently dropped — see audit M4).

Upstream response is closed in a try/finally inside the body generator
so backend connections don't leak if the client disconnects mid-stream.

Addresses audit B1 (and incidentally M4).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reaper was registered after wait_for_server_ready returned for both
backends (~60s) and the router (~30s). Any SIGTERM to the parent during
that ~90s ready-window leaked all N child processes — they survived as
orphans holding their ports, and the next rollout's is_port_available
pre-check fired the "stale session server" RuntimeError telling the
operator to pkill -9 python.

Fix: pass a mutable tracked_processes list into the reaper before
spawning the first child, then append to it as children come up. The
reaper sees an empty list until a child actually starts, so registering
early is safe; once a SIGTERM lands, the list has whatever was already
spawned.

The try/except cleanup path now uses tracked_processes too, so it
covers the router_process if startup fails between its .start() and
its readiness check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chained SIGTERM handler had two real problems:

1. Race with Ray's own SIGTERM handler. Whether ours or Ray's wins
   depends on the install order, which depends on when
   _start_session_server runs relative to Ray actor init — fragile
   coupling we should not be relying on.
2. If _start_session_server is called twice in the same process (rare
   but possible on actor restart), the second prev = getsignal(SIGTERM)
   captures the first call's _handler closure, forming a chain that
   re-terminates already-dead processes and masks any genuine failure
   in the second call.

Fix: rely on atexit + daemon=True only. Daemon children are terminated
automatically by the Python runtime on parent exit; atexit covers the
clean-shutdown path. The SIGTERM-during-startup leak is a real concern
but it's better addressed by the earlier reaper-registration fix (H2)
than by a fragile signal chain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gs (audit H4)

_shallow_copy_args was copy.copy(args) — fine for the current
attribute set (all scalars), but a footgun: any future
session_server_* field that's a list / dict / nested Namespace would
alias the same object across all N worker copies, and mutating it in
one worker (or in this very setup loop) would silently corrupt the
others.

Replace with copy.deepcopy via a renamed _per_worker_args_copy. The
args object is small and parsed once at startup, so the deepcopy cost
is dwarfed by multiprocessing fork overhead.

Also drop the dataclasses.replace / is_dataclass branch at the call
site. argparse.Namespace (what arguments.py actually returns) is not
a dataclass, so that branch was dead code that hid the real copy
mechanism behind a misleading dispatch. _per_worker_args_copy handles
both Namespaces and dataclasses uniformly via deepcopy.

Promotes "import copy" from a function-local import to the module top
(the other in-file copy import at line 1123 is unrelated and kept
local-scoped to avoid widening this diff).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rmfan

rmfan commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

Audit cleanup pass — B1 + H1–H4

Pushed 5 commits addressing the architectural + blocker + high-severity findings from two prior audits:

  • ~/agent-dist/docs/sticky-session-routing-research.md — Mechanism A (Stripe-style prefix-encoded session_id)
  • ~/agent-dist/docs/pr-31-code-smell-audit.md — B1, H1, H2, H3, H4

Commits

SHA Audit item What changed
27851f8c H1 + smell #1 Replace md5+rejection-sampling routing with prefix-encoded w<idx>-<uuid> session ids. Router parses the prefix; no shared hash algorithm, no 100-try regen cap, no RuntimeError land-mine. Tests rewritten as parse-correctness invariants; bad ids → 404.
fa89d49c B1 Router now streams request/response bodies end-to-end. Replaces await request.body() + await response.aread() + json.loads/JSONResponse re-encode with client.send(req, stream=True) + StreamingResponse(aiter_raw()). Hop-by-hop headers factored per RFC 7230 §6.1; upstream response closed in finally to avoid leaks on client disconnect. Incidentally fixes M4 (charset hint preservation).
39efe46a H2 _register_session_server_reaper now runs before the first .start() call, against a mutable tracked_processes list. SIGTERM during the ~90s ready-window no longer orphans backends.
7ed52a71 H3 Deleted the SIGTERM handler chain. Relies on atexit + daemon=True only (per audit recommendation). Avoids the Ray-init-order race and the closure-cycle on double-init.
b0bb0e16 H4 _shallow_copy_args (copy.copy) → _per_worker_args_copy (copy.deepcopy). Drops the dead dataclasses.replace branch — argparse.Namespace is never a dataclass.

Out of scope (follow-up PR)

M1–M6 and L1–L6 from the code-smell audit. Per the task spec, this PR only lands B1 + H1–H4.

Verification

  • python -m py_compile clean on every modified file (linear_trajectory.py, session_router.py, rollout.py, test_session_uuid_routing.py, test_multi_worker_startup.py).
  • Existing tests/fast/router/test_session_uuid_routing.py::TestRouterAgreement rewritten to test the new parse contract; tests/fast/router/test_multi_worker_startup.py now mints w<idx>- test ids.

🤖 Generated with Claude Code

rmfan added 8 commits June 3, 2026 22:38
The router's httpx client previously set ``max_keepalive_connections=0``,
which forced a full TCP handshake + teardown per request. Under sustained
load (PR cites ~800 RPS) this exhausts the router's ephemeral port range
via TIME_WAIT and silently throttles the backend pool — request
throughput plateaus regardless of how many backends are added.

The router explicitly routes by ``session_id`` (Stripe-style prefix),
so the "pinning to one backend" concern that motivates the same setting
in ``init_http_client`` does not apply here. Keepalive is safe and
saves the per-request handshake.

Also bump ``max_connections`` to 4096 to comfortably cover N backends *
hundreds of in-flight requests.

Identified in PR #31 deep review (blocker B1).
OpenAIEndpointTracer.create (openai_endpoint_utils.py:39-43) reads
``session_server_instance_id`` from ``/health`` and stamps it on
trial metadata. In single-worker mode this came from the backend
process; in multi-worker mode the router is the user-facing
session_url, so the field has to come from there too.

Add ``session_server_instance_id`` (sourced from the cluster-facing
``args.session_server_instance_id`` set in rollout.py:1320) to the
router's ``/health`` response. The backend workers' per-worker ids
remain ``<cluster-id>-w<i>`` for log correlation; only the
cluster-facing id is exposed externally.

Identified in PR #31 deep review (high H1).
Two related lifecycle fixes flagged by the PR #31 deep review:

H2 — port allocation race. The previous loop picked a port for worker
i and advanced if taken, but worker i+1's ``is_port_available`` check
ran before worker i's child had bound. Both could target the same
port; one then crashed on bind, manifesting as a slow ``wait_for_server_ready``
timeout. Track ``chosen_ports`` and skip already-handed-out ports.

H3 — atexit-only reaper leaks children on SIGTERM. The previous
cleanup commit (7ed52a7) correctly removed the brittle SIGTERM
handler chain, but the resulting reaper only ran on clean Python
exit. Under Ray actor preemption (SIGTERM, parent stays alive
briefly) the children turned into zombies that held the
session-server port — and the next rollout then trips the "stale
session server" RuntimeError at line 1324. Re-add a SIMPLE SIGTERM
handler that just calls the same reap path (no chain semantics, no
prev-handler capture). Reap now also ``join``s after ``terminate``
and escalates to ``kill`` if needed, so the port is definitely
released.

Also fix a B905 ``zip(..., strict=)`` ruff warning that surfaced once
the file came into scope.
When ``parse_worker_index`` raises (id has no ``w<idx>-`` prefix, or the
parsed index is outside ``[0, worker_count)``), the router used to
return 404. That breaks the registry's auto-reseed safety net during
rolling deploys that shrink ``worker_count``: an in-flight trial still
holding ``w5-<uuid>`` from a previous N=8 deploy gets hard-killed
instead of resuming on a different worker.

Fall back to round-robin instead. The receiving backend's
``get_or_create_session`` reseeds the session cleanly under a fresh
prefix — trial loses state but recovers in-place rather than dying
mid-rollout. This restores the established "router restart" behavior
that single-worker mode has always had.

Update the unit test contract (``test_router_unknown_session_id_falls_back_to_round_robin``)
and add an end-to-end test that a malformed session_id reaches a
backend with status 200.

Identified in PR #31 deep review (medium M).
prepare_pretokenized and update_pretokenized_state are sync calls
inside async handlers that hold the GIL during merge_tokens and
chat-template render. With 300+ in-flight sessions on a single
process this blocks the event loop and inflates server p99.

Microbench (laptop, N=400 concurrent): server_p99 9.9s → 5.8s.

Complements the multi-process scaling already in this PR — the two
benefits stack: multi-process raises the throughput ceiling
(one GIL per process); to_thread keeps each process responsive
below saturation.
@rmfan rmfan changed the base branch from fix/http-client-no-keepalive to prod June 4, 2026 06:54
Unrelated to this PR's changes but required for CI green — `prod`
has pre-existing black/isort violations in 8 files that the
`--all-files` pre-commit hook catches on every PR run.
Comment thread miles/rollout/session/session_router.py Fixed
rmfan and others added 3 commits June 4, 2026 00:44
CodeQL Information-exposure-through-exception finding on PR #31:
the JSONResponse on a backend transport error returned the
exception type and message, which can include internal backend
hostnames, ports, or file paths from urllib3's error chain.

Exception is still logged server-side for debugging; the caller
now gets a generic 'session-router backend transport error'.
Single-process router saturates at ~50 r/s with 4 backends behind it,
whereas backend aggregate capacity (sticky-routed bench) is ~280 r/s.
The router is the new bottleneck.

Add --session-router-workers K (default 1; backward-compatible).
When K>1, _start_session_server spawns K independent uvicorn workers
that bind the same args.session_server_port via SO_REUSEPORT; the
Linux kernel hash-distributes incoming connections across them.

The router state is per-process (rr counter, httpx pool) and routing
decisions are pure functions of the URL prefix, so no cross-worker
coordination is needed.

uvicorn 0.40--0.49 has no Config(reuse_port=...) kwarg, so each
worker opens a SO_REUSEPORT socket and passes it via
Server.run(sockets=[sock]).
…or mismatch

The split-lock chat flow releases session.lock around the SGLang proxy
call (Phase 2) so concurrent same-session writers can overlap at the
backend. When a competing writer commits an assistant turn during that
unlocked window, the in-flight writer hits the
``num_assistant != expected_num_assistant`` guard at sessions.py:271.

Previously the guard silently skipped both ``update_pretokenized_state``
and ``append_record`` while still returning the SGLang body with HTTP
200. The caller (litellm/harbor) treated the 200 as a real turn and
appended the assistant message to its local trajectory. On the next
``compute_samples_from_openai_records`` pass
(openai_endpoint_utils.py:155), the cursor walk asserted
``cursor == len(accumulated_token_ids)`` and crashed with a delta equal
to the dropped turn's token count.

Evidence: run 1711903 — the first prod deploy of PR #31 — produced 24
``state changed during proxy`` warnings and 2 cursor-mismatch failures
with deltas of 88 and 102 tokens. See
``~/run_analysis/1711903/1711903_errors_rca.md``.

Fix: raise the new ``SessionStateConflictError`` (409) instead of
returning a phantom 200. Callers see a clear retryable conflict and do
not record the dropped turn locally, so the trajectory's
accumulated_token_ids and records stay mutually consistent.

The closing-during-proxy branch above keeps its existing skip-200
behavior because deleted sessions have no downstream
``compute_samples_from_openai_records`` consumer.

Tests:
- ``test_same_session_concurrent_requests_reach_backend`` updated to
  expect exactly 1 winner (200) + N-1 conflicts (409) instead of all
  200s.
- New ``TestStateConflictNoCursorMismatch`` asserts (a) status codes
  partition cleanly into 1x200 + Nx409, (b) the session's records list
  length matches the number of accumulated checkpoints, and (c) a
  serial follow-up turn built on the winner's checkpoint succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rmfan added a commit that referenced this pull request Jun 9, 2026
…via cancellation channel

PR #31 split session.lock around the SGLang proxy call to keep DELETE
non-blocking while a chat was in flight. That split-lock window let
two same-session writers commit interleaved assistant turns; the
Phase-3 stale-update guard at sessions.py:241 silently dropped the
second commit via `logger.warning(... skipping state update); return
backend.build_proxy_response(result)` — the caller still received the
200 SGLang response and appended a phantom assistant record to its
trajectory, while the server's session state didn't reflect that turn.
The next chat then hit a cursor mismatch.

Restore the original "one lock around Phase 1+2+3" invariant and
move DELETE preemption from "no lock during proxy" to an explicit
cancellation channel: chat stores its proxy task on
session.current_proxy_task, DELETE cancels it (after setting
session.closing), the chat coroutine catches CancelledError and
returns 410 Gone (distinct from 404 "never existed" and 409
"retryable conflict"), then releases the lock so DELETE can acquire
it and tear down.

SessionStateConflictError (409) is added as defense-in-depth: with
Phase 1+2+3 under one lock the Phase-3 stale-state branch is
unreachable, but a future split-lock regression now surfaces a clear
retryable 409 instead of a silently dropped commit.

Two tests in test_session_race_conditions.py encoded the old
"DELETE waits for chat to finish" contract and now expect the new
"DELETE preempts → 410" behaviour:
  - test_double_delete_second_returns_404
  - test_multiple_chats_queued_then_delete
Pre-existing 4-failure baseline (get_or_create_session
auto-create-after-delete tests) is unchanged: 5 pass / 4 fail.

Split out of #33 per offline review request — #33 keeps the
multi-backend orchestration + observability work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rmfan added a commit that referenced this pull request Jun 9, 2026
…via cancellation channel

PR #31 split session.lock around the SGLang proxy call to keep DELETE
non-blocking while a chat was in flight. That split-lock window let
two same-session writers commit interleaved assistant turns; the
Phase-3 stale-update guard at sessions.py:241 silently dropped the
second commit via `logger.warning(... skipping state update); return
backend.build_proxy_response(result)` — the caller still received the
200 SGLang response and appended a phantom assistant record to its
trajectory, while the server's session state didn't reflect that turn.
The next chat then hit a cursor mismatch.

Restore the original "one lock around Phase 1+2+3" invariant and
move DELETE preemption from "no lock during proxy" to an explicit
cancellation channel: chat stores its proxy task on
session.current_proxy_task, DELETE cancels it (after setting
session.closing), the chat coroutine catches CancelledError and
returns 410 Gone (distinct from 404 "never existed" and 409
"retryable conflict"), then releases the lock so DELETE can acquire
it and tear down.

SessionStateConflictError (409) is added as defense-in-depth: with
Phase 1+2+3 under one lock the Phase-3 stale-state branch is
unreachable, but a future split-lock regression now surfaces a clear
retryable 409 instead of a silently dropped commit.

Two tests in test_session_race_conditions.py encoded the old
"DELETE waits for chat to finish" contract and now expect the new
"DELETE preempts → 410" behaviour:
  - test_double_delete_second_returns_404
  - test_multiple_chats_queued_then_delete
Pre-existing 4-failure baseline (get_or_create_session
auto-create-after-delete tests) is unchanged: 5 pass / 4 fail.

Split out of #33 per offline review request — #33 keeps the
multi-backend orchestration + observability work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants