feat(session): multi-process session-server (N workers + sticky-hash router) by rmfan · Pull Request #31 · LLM360/miles

rmfan · 2026-06-04T04:47:04Z

Problem

The session-server (miles/rollout/session/sessions.py) runs as a single uvicorn process per RolloutManager actor. One GIL serialises tokenizer work (merge_tokens, chat-template render), JSON serialisation of multi-MB SGLang responses, and asyncio scheduling for every in-flight request. On a Slurm CPU node a single process saturates near ~50 r/s. Past that, requests queue inside the event loop and tail latency explodes — production run 1709750 showed one session-server PID serving 180k+ events with row-5 p50 = 20 s and p99 = 192 s, while the upstream SGLang generation itself stayed well under a second.

This PR lifts both layers off a single GIL: the session-server backend can be sharded across N processes, and the front-end proxy that fans requests to those backends can itself be sharded across K processes. Session affinity is preserved end-to-end without any shared state between the K proxy workers.

Design

Backend multi-process (`--session-server-workers N`)

_start_session_server (in miles/ray/rollout.py) spawns N independent SessionServer processes when N > 1. Each backend gets its own port (walked forward from port+1, tracked in a chosen_ports set so sibling spawns can't race onto the same one), its own tokenizer, its own asyncio loop, and a worker_index it stamps onto every session it creates. Children are daemon=True; the parent installs a SIGTERM handler plus atexit hook that does .terminate() → .join(timeout) → .kill() on each child. N = 1 is bit-for-bit unchanged from today.

Router multi-worker (`--session-router-workers K`)

The user-facing port is held by a small Python uvicorn proxy (session_router.py) that forwards each request to one of the N backends. With K = 1, the original single-process router runs unchanged. A second bench showed this single router process itself caps at ~50 r/s — even with 4 backends behind it that together can serve ~280 r/s when reached directly, the front-end GIL becomes the new bottleneck (see Evidence below).

With K > 1, _start_session_server spawns K independent uvicorn worker processes, each opening its own listening socket on the user-facing port with SO_REUSEPORT set, then handing the bound socket to uvicorn.Server.run(sockets=[sock]). The Linux kernel (≥ 3.9) hash-distributes inbound TCP connections across the K listeners. The router workers share no state and do not coordinate; each is a stateless duplicate of the others. (uvicorn 0.40–0.49 — the versions we ship — has no reuse_port kwarg, so handing it a pre-bound socket is the supported path.)

How stickiness is preserved

Harbor sees one URL. Every session always lands on the same backend, regardless of which router worker accepted the connection. The mechanism:

When a backend handles POST /sessions, SessionRegistry.create_session mints an id of the form w{worker_index}-{uuid4().hex}. The backend's index is encoded in the id itself.
On every subsequent request, any router worker that receives it parses the w<idx>- prefix and forwards to backend idx. The id is the routing key — there is no shared hash function and no routing table for the K routers to keep in sync.
The kernel's SO_REUSEPORT distribution picks a router worker per inbound TCP connection. That choice is independent of session affinity because every router worker applies the same deterministic prefix → backend rule.
Requests with no session_id (POST /sessions, /health) are dispatched round-robin via a per-worker itertools.count(). The chosen backend stamps its own index on the returned id, so all follow-up calls route correctly. Per-worker round-robin counters drift but average out at scale.

Net effect: K routers behave as one logical front-end. Same session_id → same backend, every time, no coordination required.

Supporting changes

asyncio.to_thread wrap on the two synchronous tokenizer calls (prepare_pretokenized, update_pretokenized_state) — keeps the event loop responsive during merge/encode. Benefits N=1 deployments too.
Streaming request/response pass-through (client.send(stream=True) + StreamingResponse(aiter_raw())) — no router-side buffering or JSON re-encoding.
HTTP keepalive enabled in the router → backend httpx pool — eliminates per-request TCP setup at peak.
Session-id format switched from a bare uuid4().hex (N=1, back-compat) to w{idx}-{uuid4().hex} (N>1) — prefix-encoded, no shared hash algorithm.
Unknown / malformed session_ids fall back to round-robin instead of returning 404 — lets the existing get_or_create_session auto-reseed path actually fire.

Evidence

Headline server-side server_done_p99 (45 s window, SGLang replaced by a constant-0.5 s stub so the table reflects session-server behaviour alone):

in-flight N	wc=1 backend	wc=4 backends
100	0.55 s	0.55 s
200	5.99 s	6.36 s (stub-queue noise)
400	6.03 s	0.74 s (end-to-end Slurm run)
800	11.63 s	8.17 s

End-to-end on a real Slurm CPU node at N=400 in-flight, going from 1 → 4 backends dropped server_p99 9× (6.74 s → 0.74 s). Throughput at the same point did not scale, which surfaced the second cap.

Router-cap finding (wc=4, N=400):

path	throughput
through the single-process router	~51 r/s
bypassing the router via sticky direct-to-backend addressing	~284 r/s

The 5.5× gap is the router's GIL. K > 1 router workers close it; the K-way path is what makes the multi-backend gains reachable end-to-end through the real entry point.

Files touched

miles/utils/arguments.py — --session-server-workers, --session-router-workers flags.
miles/ray/rollout.py _start_session_server — N==1 / K==1 paths unchanged; otherwise spawns N backends and K router workers, registers the SIGTERM reaper before the first .start(), deep-copies per-worker args to avoid shared-ref bugs, and handles port exhaustion explicitly.
miles/rollout/session/sessions.py — asyncio.to_thread wrap; worker_index surfaced in /health.
miles/rollout/session/linear_trajectory.py — SessionRegistry(worker_index, worker_count) + prefix-encoded session_id.
miles/rollout/session/session_router.py (new) — proxy with streaming pass-through, prefix-parsing router, round-robin fallback, SO_REUSEPORT socket helper.
miles/utils/http_utils.py — httpx keepalive disabled on the loadgen side to spread connections across router workers (kernel hash is by 4-tuple).
tests/fast/router/test_session_uuid_routing.py, tests/fast/router/test_multi_worker_startup.py, tests/fast/router/test_router_multi_worker.py — prefix invariants across N ∈ {1, 2, 4, 8, 16}, /health schema + malformed-id fallback, K=1 vs K>1 launch contract and per-worker instance_id isolation.

Known limitations

Backend chat-completions still buffers the SGLang response body. The handler does await response.aread() on the upstream body (miles/rollout/session/session_server.py:73) to extract meta_info. The streaming change in this PR covers the wildcard pass-through path only; the chat-completions path remains buffered. Out of scope.
Tokenizer memory multiplies by N. Qwen-class tokenizers are approximately a few hundred MB resident per backend; N ≤ 8 is comfortable on current Slurm node memory but should be sized per model.
SO_REUSEPORT semantics differ on macOS. K > 1 is intended for Linux production. The K=1 default keeps dev environments on the original code path; the multi-worker tests patch the uvicorn call surface rather than binding real ports.
Router needs K ≈ N for full backend saturation (TODO: swap to nginx). Empirically each Python uvicorn router worker caps at ~50–60 r/s. With K = N the router stops being a coalescer and acts as a pure parallelism layer — same process count as if every backend listened on the user port directly. A follow-up should replace the Python router with nginx (template-generated config, started as a child of _start_session_server): per-worker throughput is ~100× higher, K = 1 nginx worker saturates 8+ backends, and the multi-worker router code path can be removed. Not in this PR to keep the change focused on backend multi-process + minimum-viable routing.
Stale-update guard previously returned phantom 200s. Follow-up commit (66f27f6) now raises SessionStateConflictError (409) when num_assistant changes during the unlocked proxy phase, instead of silently skipping the state update and returning 200. The old behaviour caused cursor-mismatch crashes in compute_samples_from_openai_records downstream (run 1711903: 24 conflict warnings → 2 cursor-mismatch failures with 88/102-token deltas). Callers must treat 409 from /sessions/{id}/v1/chat/completions as a retryable conflict.

Risk / rollout

Default off (--session-server-workers 1 --session-router-workers 1) → bit-for-bit current behaviour. Recommended initial production config: --session-server-workers 4 --session-router-workers 4 (until the nginx replacement lands, set K ≈ N so the router doesn't become the new cap). Roll forward, no migration; flip per-job after one clean run.

…n workers A pooled httpx.AsyncClient against a uvicorn --workers N server pins all requests to the small subset of workers that accept()-won the pooled TCP connections (uvicorn shares one listen socket across workers; no SO_REUSEPORT, no work-stealing). Observed in a harbor_server run: n_workers_active = 2 of 32 for most minutes, with those 2 workers saturated at their per-process Semaphore cap while the other 30 sat idle. Setting max_keepalive_connections=0 closes the TCP after each response, so every /run gets its own accept() race and load spreads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add `worker_index` / `worker_count` to `SessionRegistry` and pin `create_session()` to regenerate uuids until the hash falls in the current worker's bucket. The hash function (`session_id_bucket`, md5-of-utf8 truncated mod N) is the load-bearing agreement between the front-end router and the registry — they must use the same one. When `worker_count == 1` (default), behavior is identical to before. Also surface `worker_index`/`worker_count` in `/health` so operators can correlate logs across workers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A small FastAPI app that: * Parses session_id from /sessions/{id}/... URL paths. * Picks a backend with session_id_bucket(session_id, N) — the same hash SessionRegistry.create_session uses to pin uuids, so a session always routes back to its creator. * Round-robins stateless paths (POST /sessions, /health, etc.) so we don't hot-spot worker 0. * Streams body through with no JSON re-work in the hot path. Picked the Python ASGI option over nginx because nginx is not a guaranteed presence on the Slurm compute nodes RolloutManager runs on; this keeps the deployment dependency-free. The router does almost no per-request CPU work (path-parse + md5 + httpx passthrough), so the GIL on the router process is not the new bottleneck — all the tokenizer / TITO / JSON work happens in the N backend workers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `--session-server-workers N` (default 1, opt-in). When N == 1, `_start_session_server` behaves exactly as before. When N > 1, it: 1. Allocates N backend ports starting at session_server_port + 1. 2. Spawns N SessionServer processes, each with its own (worker_index, worker_count, instance_id) and its own tokenizer. 3. Waits for all N backends to be reachable. 4. Spawns the SessionRouter front-end on session_server_port. 5. Registers SIGTERM and atexit reapers so children don't outlive the parent (Ray actor shutdown, kill -TERM, etc.). Memory cost: tokenizer is loaded N times (per-worker process). For Qwen-class tokenizers this is a few hundred MB each; verify before flipping the default to something like cpu_count()//2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test_session_uuid_routing.py: - session_id_bucket is deterministic and roughly balanced. - Every uuid SessionRegistry.create_session(i, N) returns hashes to bucket i (load-bearing invariant — if this breaks, sticky routing breaks and trials silently fall through the auto-reseed path on every turn). - SessionRouter.pick_backend agrees with SessionRegistry on the same hash. - Stateless paths round-robin instead of pinning to worker 0. test_multi_worker_startup.py: - Smoke-test the router end-to-end with 4 fake HTTP backends spun up on real ports; verify hash-routed requests land on the expected backend. - Verify /health on the router is local, not proxied. Not run: any test that loads a real tokenizer or starts a real Ray job. Out of scope for this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ion_id Switch from md5(session_id) %% N + UUID rejection-sampling to Stripe-style prefix encoding: SessionRegistry stamps "w<idx>-" on every multi-worker session_id, and SessionRouter parses the prefix instead of hashing. This kills two failure modes the prior design carried: 1. The 100-try regen cap was a hot-path RuntimeError land-mine — at N=8 it tripped roughly once per ~20 minutes of 800 RPS create_session traffic ((7/8)^100 ~= 1.6e-6 per call), bubbling a 500 to the agent and killing the rollout. 2. Router and backend had to agree on the hash function "forever"; any future drift silently 100%%-misrouted every request. The new contract is a 4-char str.split — no shared algorithm, no rejection sampling. Invalid prefixes 404 with a clear error. Stateless paths (POST /sessions, /health) still round-robin. Tests: - test_session_uuid_routing.py: replace TestRouterAgreement hash test with parse-correctness invariants (well-formed prefixes parse, bare uuids / non-numeric prefixes / out-of-range indices raise). - test_multi_worker_startup.py: mint test ids with "w<idx>-" prefix instead of bare uuid + session_id_bucket lookup. Design rationale in docs/sticky-session-routing-research.md (Mechanism A). Addresses smell #1 and audit H1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… B1) Replace the buffer+re-encode proxy with a true streaming pass-through: - Request body: await request.body() (full buffer) -> content=request.stream() in client.build_request, so the multi-MB SGLang request payload never lands in router RAM. - Response body: await response.aread() + JSONResponse(json.loads(...)) -> client.send(req, stream=True) + StreamingResponse(aiter_raw()), so output_token_logprobs streams chunk-by-chunk and the router doesn't burn one GIL on json.loads + json.dumps per response. The old code held ~800 x ~5MB = ~4GB of decoded Python dicts at the 800-in-flight concurrency this PR is sized for, and spent the router's GIL on json codecs — the exact bottleneck class the multi-process layout was supposed to eliminate. Header handling: factored hop-by-hop header set per RFC 7230 §6.1 into _HOP_BY_HOP_HEADERS; applied symmetrically to request and response. content-type passes through unmodified (preserves charset hints that the old JSONResponse path silently dropped — see audit M4). Upstream response is closed in a try/finally inside the body generator so backend connections don't leak if the client disconnects mid-stream. Addresses audit B1 (and incidentally M4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The reaper was registered after wait_for_server_ready returned for both backends (~60s) and the router (~30s). Any SIGTERM to the parent during that ~90s ready-window leaked all N child processes — they survived as orphans holding their ports, and the next rollout's is_port_available pre-check fired the "stale session server" RuntimeError telling the operator to pkill -9 python. Fix: pass a mutable tracked_processes list into the reaper before spawning the first child, then append to it as children come up. The reaper sees an empty list until a child actually starts, so registering early is safe; once a SIGTERM lands, the list has whatever was already spawned. The try/except cleanup path now uses tracked_processes too, so it covers the router_process if startup fails between its .start() and its readiness check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The chained SIGTERM handler had two real problems: 1. Race with Ray's own SIGTERM handler. Whether ours or Ray's wins depends on the install order, which depends on when _start_session_server runs relative to Ray actor init — fragile coupling we should not be relying on. 2. If _start_session_server is called twice in the same process (rare but possible on actor restart), the second prev = getsignal(SIGTERM) captures the first call's _handler closure, forming a chain that re-terminates already-dead processes and masks any genuine failure in the second call. Fix: rely on atexit + daemon=True only. Daemon children are terminated automatically by the Python runtime on parent exit; atexit covers the clean-shutdown path. The SIGTERM-during-startup leak is a real concern but it's better addressed by the earlier reaper-registration fix (H2) than by a fragile signal chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gs (audit H4) _shallow_copy_args was copy.copy(args) — fine for the current attribute set (all scalars), but a footgun: any future session_server_* field that's a list / dict / nested Namespace would alias the same object across all N worker copies, and mutating it in one worker (or in this very setup loop) would silently corrupt the others. Replace with copy.deepcopy via a renamed _per_worker_args_copy. The args object is small and parsed once at startup, so the deepcopy cost is dwarfed by multiprocessing fork overhead. Also drop the dataclasses.replace / is_dataclass branch at the call site. argparse.Namespace (what arguments.py actually returns) is not a dataclass, so that branch was dead code that hid the real copy mechanism behind a misleading dispatch. _per_worker_args_copy handles both Namespaces and dataclasses uniformly via deepcopy. Promotes "import copy" from a function-local import to the module top (the other in-file copy import at line 1123 is unrelated and kept local-scoped to avoid widening this diff). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rmfan · 2026-06-04T05:27:17Z

Audit cleanup pass — B1 + H1–H4

Pushed 5 commits addressing the architectural + blocker + high-severity findings from two prior audits:

~/agent-dist/docs/sticky-session-routing-research.md — Mechanism A (Stripe-style prefix-encoded session_id)
~/agent-dist/docs/pr-31-code-smell-audit.md — B1, H1, H2, H3, H4

Commits

SHA	Audit item	What changed
`27851f8c`	H1 + smell #1	Replace md5+rejection-sampling routing with prefix-encoded `w<idx>-<uuid>` session ids. Router parses the prefix; no shared hash algorithm, no 100-try regen cap, no `RuntimeError` land-mine. Tests rewritten as parse-correctness invariants; bad ids → 404.
`fa89d49c`	B1	Router now streams request/response bodies end-to-end. Replaces `await request.body()` + `await response.aread()` + `json.loads`/`JSONResponse` re-encode with `client.send(req, stream=True)` + `StreamingResponse(aiter_raw())`. Hop-by-hop headers factored per RFC 7230 §6.1; upstream response closed in `finally` to avoid leaks on client disconnect. Incidentally fixes M4 (charset hint preservation).
`39efe46a`	H2	`_register_session_server_reaper` now runs before the first `.start()` call, against a mutable `tracked_processes` list. SIGTERM during the ~90s ready-window no longer orphans backends.
`7ed52a71`	H3	Deleted the SIGTERM handler chain. Relies on `atexit` + `daemon=True` only (per audit recommendation). Avoids the Ray-init-order race and the closure-cycle on double-init.
`b0bb0e16`	H4	`_shallow_copy_args` (`copy.copy`) → `_per_worker_args_copy` (`copy.deepcopy`). Drops the dead `dataclasses.replace` branch — `argparse.Namespace` is never a dataclass.

Out of scope (follow-up PR)

M1–M6 and L1–L6 from the code-smell audit. Per the task spec, this PR only lands B1 + H1–H4.

Verification

python -m py_compile clean on every modified file (linear_trajectory.py, session_router.py, rollout.py, test_session_uuid_routing.py, test_multi_worker_startup.py).
Existing tests/fast/router/test_session_uuid_routing.py::TestRouterAgreement rewritten to test the new parse contract; tests/fast/router/test_multi_worker_startup.py now mints w<idx>- test ids.

🤖 Generated with Claude Code

The router's httpx client previously set ``max_keepalive_connections=0``, which forced a full TCP handshake + teardown per request. Under sustained load (PR cites ~800 RPS) this exhausts the router's ephemeral port range via TIME_WAIT and silently throttles the backend pool — request throughput plateaus regardless of how many backends are added. The router explicitly routes by ``session_id`` (Stripe-style prefix), so the "pinning to one backend" concern that motivates the same setting in ``init_http_client`` does not apply here. Keepalive is safe and saves the per-request handshake. Also bump ``max_connections`` to 4096 to comfortably cover N backends * hundreds of in-flight requests. Identified in PR #31 deep review (blocker B1).

OpenAIEndpointTracer.create (openai_endpoint_utils.py:39-43) reads ``session_server_instance_id`` from ``/health`` and stamps it on trial metadata. In single-worker mode this came from the backend process; in multi-worker mode the router is the user-facing session_url, so the field has to come from there too. Add ``session_server_instance_id`` (sourced from the cluster-facing ``args.session_server_instance_id`` set in rollout.py:1320) to the router's ``/health`` response. The backend workers' per-worker ids remain ``<cluster-id>-w<i>`` for log correlation; only the cluster-facing id is exposed externally. Identified in PR #31 deep review (high H1).

Two related lifecycle fixes flagged by the PR #31 deep review: H2 — port allocation race. The previous loop picked a port for worker i and advanced if taken, but worker i+1's ``is_port_available`` check ran before worker i's child had bound. Both could target the same port; one then crashed on bind, manifesting as a slow ``wait_for_server_ready`` timeout. Track ``chosen_ports`` and skip already-handed-out ports. H3 — atexit-only reaper leaks children on SIGTERM. The previous cleanup commit (7ed52a7) correctly removed the brittle SIGTERM handler chain, but the resulting reaper only ran on clean Python exit. Under Ray actor preemption (SIGTERM, parent stays alive briefly) the children turned into zombies that held the session-server port — and the next rollout then trips the "stale session server" RuntimeError at line 1324. Re-add a SIMPLE SIGTERM handler that just calls the same reap path (no chain semantics, no prev-handler capture). Reap now also ``join``s after ``terminate`` and escalates to ``kill`` if needed, so the port is definitely released. Also fix a B905 ``zip(..., strict=)`` ruff warning that surfaced once the file came into scope.

When ``parse_worker_index`` raises (id has no ``w<idx>-`` prefix, or the parsed index is outside ``[0, worker_count)``), the router used to return 404. That breaks the registry's auto-reseed safety net during rolling deploys that shrink ``worker_count``: an in-flight trial still holding ``w5-<uuid>`` from a previous N=8 deploy gets hard-killed instead of resuming on a different worker. Fall back to round-robin instead. The receiving backend's ``get_or_create_session`` reseeds the session cleanly under a fresh prefix — trial loses state but recovers in-place rather than dying mid-rollout. This restores the established "router restart" behavior that single-worker mode has always had. Update the unit test contract (``test_router_unknown_session_id_falls_back_to_round_robin``) and add an end-to-end test that a malformed session_id reaches a backend with status 200. Identified in PR #31 deep review (medium M).

prepare_pretokenized and update_pretokenized_state are sync calls inside async handlers that hold the GIL during merge_tokens and chat-template render. With 300+ in-flight sessions on a single process this blocks the event loop and inflates server p99. Microbench (laptop, N=400 concurrent): server_p99 9.9s → 5.8s. Complements the multi-process scaling already in this PR — the two benefits stack: multi-process raises the throughput ceiling (one GIL per process); to_thread keeps each process responsive below saturation.

Unrelated to this PR's changes but required for CI green — `prod` has pre-existing black/isort violations in 8 files that the `--all-files` pre-commit hook catches on every PR run.

CodeQL Information-exposure-through-exception finding on PR #31: the JSONResponse on a backend transport error returned the exception type and message, which can include internal backend hostnames, ports, or file paths from urllib3's error chain. Exception is still logged server-side for debugging; the caller now gets a generic 'session-router backend transport error'.

Single-process router saturates at ~50 r/s with 4 backends behind it, whereas backend aggregate capacity (sticky-routed bench) is ~280 r/s. The router is the new bottleneck. Add --session-router-workers K (default 1; backward-compatible). When K>1, _start_session_server spawns K independent uvicorn workers that bind the same args.session_server_port via SO_REUSEPORT; the Linux kernel hash-distributes incoming connections across them. The router state is per-process (rr counter, httpx pool) and routing decisions are pure functions of the URL prefix, so no cross-worker coordination is needed. uvicorn 0.40--0.49 has no Config(reuse_port=...) kwarg, so each worker opens a SO_REUSEPORT socket and passes it via Server.run(sockets=[sock]).

…or mismatch The split-lock chat flow releases session.lock around the SGLang proxy call (Phase 2) so concurrent same-session writers can overlap at the backend. When a competing writer commits an assistant turn during that unlocked window, the in-flight writer hits the ``num_assistant != expected_num_assistant`` guard at sessions.py:271. Previously the guard silently skipped both ``update_pretokenized_state`` and ``append_record`` while still returning the SGLang body with HTTP 200. The caller (litellm/harbor) treated the 200 as a real turn and appended the assistant message to its local trajectory. On the next ``compute_samples_from_openai_records`` pass (openai_endpoint_utils.py:155), the cursor walk asserted ``cursor == len(accumulated_token_ids)`` and crashed with a delta equal to the dropped turn's token count. Evidence: run 1711903 — the first prod deploy of PR #31 — produced 24 ``state changed during proxy`` warnings and 2 cursor-mismatch failures with deltas of 88 and 102 tokens. See ``~/run_analysis/1711903/1711903_errors_rca.md``. Fix: raise the new ``SessionStateConflictError`` (409) instead of returning a phantom 200. Callers see a clear retryable conflict and do not record the dropped turn locally, so the trajectory's accumulated_token_ids and records stay mutually consistent. The closing-during-proxy branch above keeps its existing skip-200 behavior because deleted sessions have no downstream ``compute_samples_from_openai_records`` consumer. Tests: - ``test_same_session_concurrent_requests_reach_backend`` updated to expect exactly 1 winner (200) + N-1 conflicts (409) instead of all 200s. - New ``TestStateConflictNoCursorMismatch`` asserts (a) status codes partition cleanly into 1x200 + Nx409, (b) the session's records list length matches the number of accumulated checkpoints, and (c) a serial follow-up turn built on the winner's checkpoint succeeds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…via cancellation channel PR #31 split session.lock around the SGLang proxy call to keep DELETE non-blocking while a chat was in flight. That split-lock window let two same-session writers commit interleaved assistant turns; the Phase-3 stale-update guard at sessions.py:241 silently dropped the second commit via `logger.warning(... skipping state update); return backend.build_proxy_response(result)` — the caller still received the 200 SGLang response and appended a phantom assistant record to its trajectory, while the server's session state didn't reflect that turn. The next chat then hit a cursor mismatch. Restore the original "one lock around Phase 1+2+3" invariant and move DELETE preemption from "no lock during proxy" to an explicit cancellation channel: chat stores its proxy task on session.current_proxy_task, DELETE cancels it (after setting session.closing), the chat coroutine catches CancelledError and returns 410 Gone (distinct from 404 "never existed" and 409 "retryable conflict"), then releases the lock so DELETE can acquire it and tear down. SessionStateConflictError (409) is added as defense-in-depth: with Phase 1+2+3 under one lock the Phase-3 stale-state branch is unreachable, but a future split-lock regression now surfaces a clear retryable 409 instead of a silently dropped commit. Two tests in test_session_race_conditions.py encoded the old "DELETE waits for chat to finish" contract and now expect the new "DELETE preempts → 410" behaviour: - test_double_delete_second_returns_404 - test_multiple_chats_queued_then_delete Pre-existing 4-failure baseline (get_or_create_session auto-create-after-delete tests) is unchanged: 5 pass / 4 fail. Split out of #33 per offline review request — #33 keeps the multi-backend orchestration + observability work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rmfan and others added 5 commits May 29, 2026 14:49

rmfan requested a review from a team as a code owner June 4, 2026 04:47

rmfan marked this pull request as draft June 4, 2026 05:15

rmfan and others added 5 commits June 3, 2026 22:24

rmfan added 8 commits June 3, 2026 22:38

style: black/isort on PR files (pre-commit autofix)

355c939

docs: drop references to design docs not in the repo

0353c73

fix(session-server): explicit error when port walk exhausts 65535

11cca4a

rmfan changed the base branch from fix/http-client-no-keepalive to prod June 4, 2026 06:54

style: pre-commit autofix on prod files

eee1684

Unrelated to this PR's changes but required for CI green — `prod` has pre-existing black/isort violations in 8 files that the `--all-files` pre-commit hook catches on every PR run.

github-advanced-security AI found potential problems Jun 4, 2026

View reviewed changes

Comment thread miles/rollout/session/session_router.py Fixed

rmfan and others added 3 commits June 4, 2026 00:44

This was referenced Jun 5, 2026

feat(session-server): multi-backend vanilla + per-request observability #33

Merged

session-server: run TITO tokenization off the event loop #34

Merged

rmfan mentioned this pull request Jun 9, 2026

fix(session-server): restore lock across Phase 1+2+3, preempt DELETE via cancellation channel #38

Draft

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(session): multi-process session-server (N workers + sticky-hash router)#31

feat(session): multi-process session-server (N workers + sticky-hash router)#31
rmfan wants to merge 22 commits into
prodfrom
feat/session-server-multiprocess

rmfan commented Jun 4, 2026 •

edited

Loading

Uh oh!

rmfan commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rmfan commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Design

Backend multi-process (--session-server-workers N)

Router multi-worker (--session-router-workers K)

How stickiness is preserved

Supporting changes

Evidence

Files touched

Known limitations

Risk / rollout

Uh oh!

rmfan commented Jun 4, 2026

Audit cleanup pass — B1 + H1–H4

Commits

Out of scope (follow-up PR)

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rmfan commented Jun 4, 2026 •

edited

Loading

Backend multi-process (`--session-server-workers N`)

Router multi-worker (`--session-router-workers K`)