feat(session): multi-process session-server (N workers + sticky-hash router)#31
Draft
rmfan wants to merge 22 commits into
Draft
feat(session): multi-process session-server (N workers + sticky-hash router)#31rmfan wants to merge 22 commits into
rmfan wants to merge 22 commits into
Conversation
…n workers A pooled httpx.AsyncClient against a uvicorn --workers N server pins all requests to the small subset of workers that accept()-won the pooled TCP connections (uvicorn shares one listen socket across workers; no SO_REUSEPORT, no work-stealing). Observed in a harbor_server run: n_workers_active = 2 of 32 for most minutes, with those 2 workers saturated at their per-process Semaphore cap while the other 30 sat idle. Setting max_keepalive_connections=0 closes the TCP after each response, so every /run gets its own accept() race and load spreads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `worker_index` / `worker_count` to `SessionRegistry` and pin `create_session()` to regenerate uuids until the hash falls in the current worker's bucket. The hash function (`session_id_bucket`, md5-of-utf8 truncated mod N) is the load-bearing agreement between the front-end router and the registry — they must use the same one. When `worker_count == 1` (default), behavior is identical to before. Also surface `worker_index`/`worker_count` in `/health` so operators can correlate logs across workers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A small FastAPI app that:
* Parses session_id from /sessions/{id}/... URL paths.
* Picks a backend with session_id_bucket(session_id, N) — the same
hash SessionRegistry.create_session uses to pin uuids, so a
session always routes back to its creator.
* Round-robins stateless paths (POST /sessions, /health, etc.) so
we don't hot-spot worker 0.
* Streams body through with no JSON re-work in the hot path.
Picked the Python ASGI option over nginx because nginx is not a
guaranteed presence on the Slurm compute nodes RolloutManager runs on;
this keeps the deployment dependency-free. The router does almost no
per-request CPU work (path-parse + md5 + httpx passthrough), so the
GIL on the router process is not the new bottleneck — all the
tokenizer / TITO / JSON work happens in the N backend workers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `--session-server-workers N` (default 1, opt-in).
When N == 1, `_start_session_server` behaves exactly as before.
When N > 1, it:
1. Allocates N backend ports starting at session_server_port + 1.
2. Spawns N SessionServer processes, each with its own
(worker_index, worker_count, instance_id) and its own tokenizer.
3. Waits for all N backends to be reachable.
4. Spawns the SessionRouter front-end on session_server_port.
5. Registers SIGTERM and atexit reapers so children don't outlive
the parent (Ray actor shutdown, kill -TERM, etc.).
Memory cost: tokenizer is loaded N times (per-worker process). For
Qwen-class tokenizers this is a few hundred MB each; verify before
flipping the default to something like cpu_count()//2.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
test_session_uuid_routing.py:
- session_id_bucket is deterministic and roughly balanced.
- Every uuid SessionRegistry.create_session(i, N) returns hashes
to bucket i (load-bearing invariant — if this breaks, sticky
routing breaks and trials silently fall through the auto-reseed
path on every turn).
- SessionRouter.pick_backend agrees with SessionRegistry on the
same hash.
- Stateless paths round-robin instead of pinning to worker 0.
test_multi_worker_startup.py:
- Smoke-test the router end-to-end with 4 fake HTTP backends
spun up on real ports; verify hash-routed requests land on the
expected backend.
- Verify /health on the router is local, not proxied.
Not run: any test that loads a real tokenizer or starts a real Ray
job. Out of scope for this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion_id Switch from md5(session_id) %% N + UUID rejection-sampling to Stripe-style prefix encoding: SessionRegistry stamps "w<idx>-" on every multi-worker session_id, and SessionRouter parses the prefix instead of hashing. This kills two failure modes the prior design carried: 1. The 100-try regen cap was a hot-path RuntimeError land-mine — at N=8 it tripped roughly once per ~20 minutes of 800 RPS create_session traffic ((7/8)^100 ~= 1.6e-6 per call), bubbling a 500 to the agent and killing the rollout. 2. Router and backend had to agree on the hash function "forever"; any future drift silently 100%%-misrouted every request. The new contract is a 4-char str.split — no shared algorithm, no rejection sampling. Invalid prefixes 404 with a clear error. Stateless paths (POST /sessions, /health) still round-robin. Tests: - test_session_uuid_routing.py: replace TestRouterAgreement hash test with parse-correctness invariants (well-formed prefixes parse, bare uuids / non-numeric prefixes / out-of-range indices raise). - test_multi_worker_startup.py: mint test ids with "w<idx>-" prefix instead of bare uuid + session_id_bucket lookup. Design rationale in docs/sticky-session-routing-research.md (Mechanism A). Addresses smell #1 and audit H1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… B1) Replace the buffer+re-encode proxy with a true streaming pass-through: - Request body: await request.body() (full buffer) -> content=request.stream() in client.build_request, so the multi-MB SGLang request payload never lands in router RAM. - Response body: await response.aread() + JSONResponse(json.loads(...)) -> client.send(req, stream=True) + StreamingResponse(aiter_raw()), so output_token_logprobs streams chunk-by-chunk and the router doesn't burn one GIL on json.loads + json.dumps per response. The old code held ~800 x ~5MB = ~4GB of decoded Python dicts at the 800-in-flight concurrency this PR is sized for, and spent the router's GIL on json codecs — the exact bottleneck class the multi-process layout was supposed to eliminate. Header handling: factored hop-by-hop header set per RFC 7230 §6.1 into _HOP_BY_HOP_HEADERS; applied symmetrically to request and response. content-type passes through unmodified (preserves charset hints that the old JSONResponse path silently dropped — see audit M4). Upstream response is closed in a try/finally inside the body generator so backend connections don't leak if the client disconnects mid-stream. Addresses audit B1 (and incidentally M4). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reaper was registered after wait_for_server_ready returned for both backends (~60s) and the router (~30s). Any SIGTERM to the parent during that ~90s ready-window leaked all N child processes — they survived as orphans holding their ports, and the next rollout's is_port_available pre-check fired the "stale session server" RuntimeError telling the operator to pkill -9 python. Fix: pass a mutable tracked_processes list into the reaper before spawning the first child, then append to it as children come up. The reaper sees an empty list until a child actually starts, so registering early is safe; once a SIGTERM lands, the list has whatever was already spawned. The try/except cleanup path now uses tracked_processes too, so it covers the router_process if startup fails between its .start() and its readiness check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The chained SIGTERM handler had two real problems: 1. Race with Ray's own SIGTERM handler. Whether ours or Ray's wins depends on the install order, which depends on when _start_session_server runs relative to Ray actor init — fragile coupling we should not be relying on. 2. If _start_session_server is called twice in the same process (rare but possible on actor restart), the second prev = getsignal(SIGTERM) captures the first call's _handler closure, forming a chain that re-terminates already-dead processes and masks any genuine failure in the second call. Fix: rely on atexit + daemon=True only. Daemon children are terminated automatically by the Python runtime on parent exit; atexit covers the clean-shutdown path. The SIGTERM-during-startup leak is a real concern but it's better addressed by the earlier reaper-registration fix (H2) than by a fragile signal chain. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gs (audit H4) _shallow_copy_args was copy.copy(args) — fine for the current attribute set (all scalars), but a footgun: any future session_server_* field that's a list / dict / nested Namespace would alias the same object across all N worker copies, and mutating it in one worker (or in this very setup loop) would silently corrupt the others. Replace with copy.deepcopy via a renamed _per_worker_args_copy. The args object is small and parsed once at startup, so the deepcopy cost is dwarfed by multiprocessing fork overhead. Also drop the dataclasses.replace / is_dataclass branch at the call site. argparse.Namespace (what arguments.py actually returns) is not a dataclass, so that branch was dead code that hid the real copy mechanism behind a misleading dispatch. _per_worker_args_copy handles both Namespaces and dataclasses uniformly via deepcopy. Promotes "import copy" from a function-local import to the module top (the other in-file copy import at line 1123 is unrelated and kept local-scoped to avoid widening this diff). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Author
Audit cleanup pass — B1 + H1–H4Pushed 5 commits addressing the architectural + blocker + high-severity findings from two prior audits:
Commits
Out of scope (follow-up PR)M1–M6 and L1–L6 from the code-smell audit. Per the task spec, this PR only lands B1 + H1–H4. Verification
🤖 Generated with Claude Code |
The router's httpx client previously set ``max_keepalive_connections=0``, which forced a full TCP handshake + teardown per request. Under sustained load (PR cites ~800 RPS) this exhausts the router's ephemeral port range via TIME_WAIT and silently throttles the backend pool — request throughput plateaus regardless of how many backends are added. The router explicitly routes by ``session_id`` (Stripe-style prefix), so the "pinning to one backend" concern that motivates the same setting in ``init_http_client`` does not apply here. Keepalive is safe and saves the per-request handshake. Also bump ``max_connections`` to 4096 to comfortably cover N backends * hundreds of in-flight requests. Identified in PR #31 deep review (blocker B1).
OpenAIEndpointTracer.create (openai_endpoint_utils.py:39-43) reads ``session_server_instance_id`` from ``/health`` and stamps it on trial metadata. In single-worker mode this came from the backend process; in multi-worker mode the router is the user-facing session_url, so the field has to come from there too. Add ``session_server_instance_id`` (sourced from the cluster-facing ``args.session_server_instance_id`` set in rollout.py:1320) to the router's ``/health`` response. The backend workers' per-worker ids remain ``<cluster-id>-w<i>`` for log correlation; only the cluster-facing id is exposed externally. Identified in PR #31 deep review (high H1).
Two related lifecycle fixes flagged by the PR #31 deep review: H2 — port allocation race. The previous loop picked a port for worker i and advanced if taken, but worker i+1's ``is_port_available`` check ran before worker i's child had bound. Both could target the same port; one then crashed on bind, manifesting as a slow ``wait_for_server_ready`` timeout. Track ``chosen_ports`` and skip already-handed-out ports. H3 — atexit-only reaper leaks children on SIGTERM. The previous cleanup commit (7ed52a7) correctly removed the brittle SIGTERM handler chain, but the resulting reaper only ran on clean Python exit. Under Ray actor preemption (SIGTERM, parent stays alive briefly) the children turned into zombies that held the session-server port — and the next rollout then trips the "stale session server" RuntimeError at line 1324. Re-add a SIMPLE SIGTERM handler that just calls the same reap path (no chain semantics, no prev-handler capture). Reap now also ``join``s after ``terminate`` and escalates to ``kill`` if needed, so the port is definitely released. Also fix a B905 ``zip(..., strict=)`` ruff warning that surfaced once the file came into scope.
When ``parse_worker_index`` raises (id has no ``w<idx>-`` prefix, or the parsed index is outside ``[0, worker_count)``), the router used to return 404. That breaks the registry's auto-reseed safety net during rolling deploys that shrink ``worker_count``: an in-flight trial still holding ``w5-<uuid>`` from a previous N=8 deploy gets hard-killed instead of resuming on a different worker. Fall back to round-robin instead. The receiving backend's ``get_or_create_session`` reseeds the session cleanly under a fresh prefix — trial loses state but recovers in-place rather than dying mid-rollout. This restores the established "router restart" behavior that single-worker mode has always had. Update the unit test contract (``test_router_unknown_session_id_falls_back_to_round_robin``) and add an end-to-end test that a malformed session_id reaches a backend with status 200. Identified in PR #31 deep review (medium M).
prepare_pretokenized and update_pretokenized_state are sync calls inside async handlers that hold the GIL during merge_tokens and chat-template render. With 300+ in-flight sessions on a single process this blocks the event loop and inflates server p99. Microbench (laptop, N=400 concurrent): server_p99 9.9s → 5.8s. Complements the multi-process scaling already in this PR — the two benefits stack: multi-process raises the throughput ceiling (one GIL per process); to_thread keeps each process responsive below saturation.
Unrelated to this PR's changes but required for CI green — `prod` has pre-existing black/isort violations in 8 files that the `--all-files` pre-commit hook catches on every PR run.
CodeQL Information-exposure-through-exception finding on PR #31: the JSONResponse on a backend transport error returned the exception type and message, which can include internal backend hostnames, ports, or file paths from urllib3's error chain. Exception is still logged server-side for debugging; the caller now gets a generic 'session-router backend transport error'.
Single-process router saturates at ~50 r/s with 4 backends behind it, whereas backend aggregate capacity (sticky-routed bench) is ~280 r/s. The router is the new bottleneck. Add --session-router-workers K (default 1; backward-compatible). When K>1, _start_session_server spawns K independent uvicorn workers that bind the same args.session_server_port via SO_REUSEPORT; the Linux kernel hash-distributes incoming connections across them. The router state is per-process (rr counter, httpx pool) and routing decisions are pure functions of the URL prefix, so no cross-worker coordination is needed. uvicorn 0.40--0.49 has no Config(reuse_port=...) kwarg, so each worker opens a SO_REUSEPORT socket and passes it via Server.run(sockets=[sock]).
…or mismatch The split-lock chat flow releases session.lock around the SGLang proxy call (Phase 2) so concurrent same-session writers can overlap at the backend. When a competing writer commits an assistant turn during that unlocked window, the in-flight writer hits the ``num_assistant != expected_num_assistant`` guard at sessions.py:271. Previously the guard silently skipped both ``update_pretokenized_state`` and ``append_record`` while still returning the SGLang body with HTTP 200. The caller (litellm/harbor) treated the 200 as a real turn and appended the assistant message to its local trajectory. On the next ``compute_samples_from_openai_records`` pass (openai_endpoint_utils.py:155), the cursor walk asserted ``cursor == len(accumulated_token_ids)`` and crashed with a delta equal to the dropped turn's token count. Evidence: run 1711903 — the first prod deploy of PR #31 — produced 24 ``state changed during proxy`` warnings and 2 cursor-mismatch failures with deltas of 88 and 102 tokens. See ``~/run_analysis/1711903/1711903_errors_rca.md``. Fix: raise the new ``SessionStateConflictError`` (409) instead of returning a phantom 200. Callers see a clear retryable conflict and do not record the dropped turn locally, so the trajectory's accumulated_token_ids and records stay mutually consistent. The closing-during-proxy branch above keeps its existing skip-200 behavior because deleted sessions have no downstream ``compute_samples_from_openai_records`` consumer. Tests: - ``test_same_session_concurrent_requests_reach_backend`` updated to expect exactly 1 winner (200) + N-1 conflicts (409) instead of all 200s. - New ``TestStateConflictNoCursorMismatch`` asserts (a) status codes partition cleanly into 1x200 + Nx409, (b) the session's records list length matches the number of accumulated checkpoints, and (c) a serial follow-up turn built on the winner's checkpoint succeeds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Jun 5, 2026
rmfan
added a commit
that referenced
this pull request
Jun 9, 2026
…via cancellation channel PR #31 split session.lock around the SGLang proxy call to keep DELETE non-blocking while a chat was in flight. That split-lock window let two same-session writers commit interleaved assistant turns; the Phase-3 stale-update guard at sessions.py:241 silently dropped the second commit via `logger.warning(... skipping state update); return backend.build_proxy_response(result)` — the caller still received the 200 SGLang response and appended a phantom assistant record to its trajectory, while the server's session state didn't reflect that turn. The next chat then hit a cursor mismatch. Restore the original "one lock around Phase 1+2+3" invariant and move DELETE preemption from "no lock during proxy" to an explicit cancellation channel: chat stores its proxy task on session.current_proxy_task, DELETE cancels it (after setting session.closing), the chat coroutine catches CancelledError and returns 410 Gone (distinct from 404 "never existed" and 409 "retryable conflict"), then releases the lock so DELETE can acquire it and tear down. SessionStateConflictError (409) is added as defense-in-depth: with Phase 1+2+3 under one lock the Phase-3 stale-state branch is unreachable, but a future split-lock regression now surfaces a clear retryable 409 instead of a silently dropped commit. Two tests in test_session_race_conditions.py encoded the old "DELETE waits for chat to finish" contract and now expect the new "DELETE preempts → 410" behaviour: - test_double_delete_second_returns_404 - test_multiple_chats_queued_then_delete Pre-existing 4-failure baseline (get_or_create_session auto-create-after-delete tests) is unchanged: 5 pass / 4 fail. Split out of #33 per offline review request — #33 keeps the multi-backend orchestration + observability work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Draft
3 tasks
rmfan
added a commit
that referenced
this pull request
Jun 9, 2026
…via cancellation channel PR #31 split session.lock around the SGLang proxy call to keep DELETE non-blocking while a chat was in flight. That split-lock window let two same-session writers commit interleaved assistant turns; the Phase-3 stale-update guard at sessions.py:241 silently dropped the second commit via `logger.warning(... skipping state update); return backend.build_proxy_response(result)` — the caller still received the 200 SGLang response and appended a phantom assistant record to its trajectory, while the server's session state didn't reflect that turn. The next chat then hit a cursor mismatch. Restore the original "one lock around Phase 1+2+3" invariant and move DELETE preemption from "no lock during proxy" to an explicit cancellation channel: chat stores its proxy task on session.current_proxy_task, DELETE cancels it (after setting session.closing), the chat coroutine catches CancelledError and returns 410 Gone (distinct from 404 "never existed" and 409 "retryable conflict"), then releases the lock so DELETE can acquire it and tear down. SessionStateConflictError (409) is added as defense-in-depth: with Phase 1+2+3 under one lock the Phase-3 stale-state branch is unreachable, but a future split-lock regression now surfaces a clear retryable 409 instead of a silently dropped commit. Two tests in test_session_race_conditions.py encoded the old "DELETE waits for chat to finish" contract and now expect the new "DELETE preempts → 410" behaviour: - test_double_delete_second_returns_404 - test_multiple_chats_queued_then_delete Pre-existing 4-failure baseline (get_or_create_session auto-create-after-delete tests) is unchanged: 5 pass / 4 fail. Split out of #33 per offline review request — #33 keeps the multi-backend orchestration + observability work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The session-server (
miles/rollout/session/sessions.py) runs as a single uvicorn process perRolloutManageractor. One GIL serialises tokenizer work (merge_tokens, chat-template render), JSON serialisation of multi-MB SGLang responses, and asyncio scheduling for every in-flight request. On a Slurm CPU node a single process saturates near ~50 r/s. Past that, requests queue inside the event loop and tail latency explodes — production run1709750showed one session-server PID serving 180k+ events with row-5 p50 = 20 s and p99 = 192 s, while the upstream SGLang generation itself stayed well under a second.This PR lifts both layers off a single GIL: the session-server backend can be sharded across N processes, and the front-end proxy that fans requests to those backends can itself be sharded across K processes. Session affinity is preserved end-to-end without any shared state between the K proxy workers.
Design
Backend multi-process (
--session-server-workers N)_start_session_server(inmiles/ray/rollout.py) spawns N independentSessionServerprocesses whenN > 1. Each backend gets its own port (walked forward fromport+1, tracked in achosen_portsset so sibling spawns can't race onto the same one), its own tokenizer, its own asyncio loop, and aworker_indexit stamps onto every session it creates. Children aredaemon=True; the parent installs a SIGTERM handler plusatexithook that does.terminate() → .join(timeout) → .kill()on each child.N = 1is bit-for-bit unchanged from today.Router multi-worker (
--session-router-workers K)The user-facing port is held by a small Python
uvicornproxy (session_router.py) that forwards each request to one of the N backends. WithK = 1, the original single-process router runs unchanged. A second bench showed this single router process itself caps at ~50 r/s — even with 4 backends behind it that together can serve ~280 r/s when reached directly, the front-end GIL becomes the new bottleneck (see Evidence below).With
K > 1,_start_session_serverspawns K independentuvicornworker processes, each opening its own listening socket on the user-facing port withSO_REUSEPORTset, then handing the bound socket touvicorn.Server.run(sockets=[sock]). The Linux kernel (≥ 3.9) hash-distributes inbound TCP connections across the K listeners. The router workers share no state and do not coordinate; each is a stateless duplicate of the others. (uvicorn 0.40–0.49 — the versions we ship — has noreuse_portkwarg, so handing it a pre-bound socket is the supported path.)How stickiness is preserved
Harbor sees one URL. Every session always lands on the same backend, regardless of which router worker accepted the connection. The mechanism:
POST /sessions,SessionRegistry.create_sessionmints an id of the formw{worker_index}-{uuid4().hex}. The backend's index is encoded in the id itself.w<idx>-prefix and forwards to backendidx. The id is the routing key — there is no shared hash function and no routing table for the K routers to keep in sync.SO_REUSEPORTdistribution picks a router worker per inbound TCP connection. That choice is independent of session affinity because every router worker applies the same deterministic prefix → backend rule.POST /sessions,/health) are dispatched round-robin via a per-workeritertools.count(). The chosen backend stamps its own index on the returned id, so all follow-up calls route correctly. Per-worker round-robin counters drift but average out at scale.Net effect: K routers behave as one logical front-end. Same
session_id→ same backend, every time, no coordination required.Supporting changes
asyncio.to_threadwrap on the two synchronous tokenizer calls (prepare_pretokenized,update_pretokenized_state) — keeps the event loop responsive during merge/encode. Benefits N=1 deployments too.client.send(stream=True)+StreamingResponse(aiter_raw())) — no router-side buffering or JSON re-encoding.uuid4().hex(N=1, back-compat) tow{idx}-{uuid4().hex}(N>1) — prefix-encoded, no shared hash algorithm.get_or_create_sessionauto-reseed path actually fire.Evidence
Headline server-side
server_done_p99(45 s window, SGLang replaced by a constant-0.5 s stub so the table reflects session-server behaviour alone):End-to-end on a real Slurm CPU node at N=400 in-flight, going from 1 → 4 backends dropped
server_p999× (6.74 s → 0.74 s). Throughput at the same point did not scale, which surfaced the second cap.Router-cap finding (
wc=4, N=400):The 5.5× gap is the router's GIL. K > 1 router workers close it; the K-way path is what makes the multi-backend gains reachable end-to-end through the real entry point.
Files touched
miles/utils/arguments.py—--session-server-workers,--session-router-workersflags.miles/ray/rollout.py_start_session_server— N==1 / K==1 paths unchanged; otherwise spawns N backends and K router workers, registers the SIGTERM reaper before the first.start(), deep-copies per-worker args to avoid shared-ref bugs, and handles port exhaustion explicitly.miles/rollout/session/sessions.py—asyncio.to_threadwrap;worker_indexsurfaced in/health.miles/rollout/session/linear_trajectory.py—SessionRegistry(worker_index, worker_count)+ prefix-encoded session_id.miles/rollout/session/session_router.py(new) — proxy with streaming pass-through, prefix-parsing router, round-robin fallback,SO_REUSEPORTsocket helper.miles/utils/http_utils.py— httpx keepalive disabled on the loadgen side to spread connections across router workers (kernel hash is by 4-tuple).tests/fast/router/test_session_uuid_routing.py,tests/fast/router/test_multi_worker_startup.py,tests/fast/router/test_router_multi_worker.py— prefix invariants across N ∈ {1, 2, 4, 8, 16},/healthschema + malformed-id fallback, K=1 vs K>1 launch contract and per-workerinstance_idisolation.Known limitations
await response.aread()on the upstream body (miles/rollout/session/session_server.py:73) to extractmeta_info. The streaming change in this PR covers the wildcard pass-through path only; the chat-completions path remains buffered. Out of scope.SO_REUSEPORTsemantics differ on macOS. K > 1 is intended for Linux production. The K=1 default keeps dev environments on the original code path; the multi-worker tests patch the uvicorn call surface rather than binding real ports.nginx(template-generated config, started as a child of_start_session_server): per-worker throughput is ~100× higher, K = 1 nginx worker saturates 8+ backends, and the multi-worker router code path can be removed. Not in this PR to keep the change focused on backend multi-process + minimum-viable routing.SessionStateConflictError(409) whennum_assistantchanges during the unlocked proxy phase, instead of silently skipping the state update and returning 200. The old behaviour caused cursor-mismatch crashes incompute_samples_from_openai_recordsdownstream (run 1711903: 24 conflict warnings → 2 cursor-mismatch failures with 88/102-token deltas). Callers must treat 409 from/sessions/{id}/v1/chat/completionsas a retryable conflict.Risk / rollout
Default off (
--session-server-workers 1 --session-router-workers 1) → bit-for-bit current behaviour. Recommended initial production config:--session-server-workers 4 --session-router-workers 4(until the nginx replacement lands, set K ≈ N so the router doesn't become the new cap). Roll forward, no migration; flip per-job after one clean run.