feat(session-server): multi-backend vanilla + per-request observability#33
Conversation
|
Pushed
Sample ( Tests: Diff stat (~250 LoC, all in two files): Staying as draft. |
|
running end-to-end tests on it now |
Bench results (multibackend stub rig) + logging validation1. Bench resultsRig:
Peak throughput per backend count:
Throughput scales near-linearly with backend count once the loadgen-side concurrency is high enough to keep all backends fed. The drop-off after the peak (e.g. wc=8 falling from 795 -> 444 at N=800) is the stub's Comparison vs PR #31's router rig at wc=4, in-flight=400:
Multibackend sits much closer to the bypass path than the router path. The remaining gap to 284 is plausibly loadgen-side overhead (8 loadgen procs here vs 4 in PR #31) — to verify, re-run with 2. Logging validation (commit
|
| file | lines | chat_start | chat_done | stats | state-changed warnings | proxy_transport_error |
|---|---|---|---|---|---|---|
| wc1 | 124,983 | 16,349 | 16,349 | 10 | 0 | 24 |
| wc2 | 250,551 | 34,654 | 34,654 | 20 | 0 | 31 |
| wc4 | 448,353 | 62,718 | 62,718 | 40 | 0 | 37 |
| wc8 | 763,233 | 107,844 | 107,844 | 80 | 0 | 51 |
chat_done counts are 6-18% above the client-counted totals from summary_multibackend.csv (~13.9k / 31.2k / 58.7k / 102.5k), which matches the expected n_server > n_client gap from mid-window cutoffs.
worker_port populated correctly — distinct ports per file matches backend count:
- wc1:
{8910}(1) - wc2:
{8911, 8912}(2) - wc4:
{8910, 8913, 8914, 8915}(4) - wc8:
{8911, 8912, 8916, 8917, 8918, 8919, 8920, 8921}(8) - Zero occurrences of
worker_port=Noneacross all files.
stats cadence: exactly 10 stats lines per backend per file (10 / 20 / 40 / 80 = N × 10). Matches ~5-min bench duration at 30s interval. rss_mb / vms_mb populated — psutil import did not silently fail.
Sample chat_done (verbatim, wc=8):
2026-06-06 01:31:41,332 INFO miles.rollout.session.sessions [session-server] chat_done worker_port=8912 session_id=0fc8977e42474e60b613dedf48633aca req_id=7bc6c541 lock_wait_ms=0.0 tokenize_in_ms=2.8 proxy_elapsed_ms=485.2 tokenize_out_ms=3.7 total_ms=494.2 inflight_now=60 prompt_tokens=1700 completion_tokens=200
lock_wait_ms=0.0(no contention in steady state — lock-restore working)proxy_elapsed_ms=485.2(≈ stub's 0.5s, as expected)tokenize_in_ms+tokenize_out_ms< 10ms eachinflight_nowinteger
Sample stats (verbatim, wc=4):
2026-06-06 01:29:06,041 INFO miles.rollout.session.session_server [session-server] stats worker_port=8913 reqs_total=15874 reqs_since_last=1911 inflight_now=332 turns_completed=15535 rss_mb=840 vms_mb=17618
- All fields populated. RSS grew from 534 MB at start to 840 MB after ~15k requests — bounded.
No state changed during proxy warnings observed across any file (0 / 0 / 0 / 0). Confirms the lock-restore via cancellation channel closes the race under load (≤ 1600 concurrent, 8 backends, ~108k requests).
No tracebacks or unhandled exceptions. The only WARNING-level lines are proxy_transport_error (24 / 31 / 37 / 51 occurrences) — all error_type=ReadError against the stub at http://127.0.0.1:8801. These appear in clusters during the higher-N rows where the stub's 4-worker pool saturates and connections back up; the enrichment fields (url, elapsed_ms) added by this PR are populated. On those requests chat_done correctly emits prompt_tokens=-1 completion_tokens=-1 sentinels (Phase 3 didn't run). Behavior matches the design.
3. Operational guide
- Rule of thumb: pick
wc(backend count) so thatconcurrency / wc ≈ 50-100. At that ratio the rig sits at peak throughput with sub-second p99 (wc=2 @ 100, wc=4 @ 200, wc=8 @ 400 all land here). - Below the ratio (under-filled): per-backend pool is idle → throughput leaves money on the table (e.g. wc=8 @ 100 only hits 208 r/s).
- Above the ratio (over-filled): per-backend
STUB_WORKERS=4saturates → tail latency blows up andproxy_transport_errorstarts firing (e.g. wc=4 @ 1600: p99=10.5s, 37 ReadErrors in the log). - Caveat — stub vs real SGLang: the stub has constant 0.5s latency. Real SGLang has a heavy-tailed distribution, so production p99 will be larger than these numbers under-state. The relative shape (linear scaling with
wc, the concurrency-per-backend sweet spot) should carry over; the absolute p99 ceilings will not.
Files
- Bench rig:
scripts/orchestrate_multibackend.py,scripts/run_multibackend_bench.sh - Logging surfaces:
miles/rollout/session/sessions.py,miles/rollout/session/session_server.py - Logs validated:
~/multibacked_logs/multibackend/multibackend_wc{1,2,4,8}.log(187 files, ~1.6M lines)
Addresses two review comments on PR #33: 1. @odp comment on session_server.py:130 — psutil is now a hard requirement (added to requirements.txt). Dropped the try/except import and the `_proc is None` fallback branches in _stats_logger_loop; stats heartbeat is single-path with rss_mb / vms_mb always populated. /cc @nightlessbaron 2. @odp comment on session_server.py:182 — every log record in each session-server subprocess now carries pid=<n>, via a one-shot logging.basicConfig(..., format="... pid=%(process)d ...", force=True) at the top of run_session_server. Covers every logger.info in the module (handlers, stats heartbeat, race-detection warning) with no per-call-site edits. force=True overrides any handler config inherited from the parent (e.g. ray-set handlers). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…via cancellation channel PR #31 split session.lock around the SGLang proxy call to keep DELETE non-blocking while a chat was in flight. That split-lock window let two same-session writers commit interleaved assistant turns; the Phase-3 stale-update guard at sessions.py:241 silently dropped the second commit via `logger.warning(... skipping state update); return backend.build_proxy_response(result)` — the caller still received the 200 SGLang response and appended a phantom assistant record to its trajectory, while the server's session state didn't reflect that turn. The next chat then hit a cursor mismatch. Restore the original "one lock around Phase 1+2+3" invariant and move DELETE preemption from "no lock during proxy" to an explicit cancellation channel: chat stores its proxy task on session.current_proxy_task, DELETE cancels it (after setting session.closing), the chat coroutine catches CancelledError and returns 410 Gone (distinct from 404 "never existed" and 409 "retryable conflict"), then releases the lock so DELETE can acquire it and tear down. SessionStateConflictError (409) is added as defense-in-depth: with Phase 1+2+3 under one lock the Phase-3 stale-state branch is unreachable, but a future split-lock regression now surfaces a clear retryable 409 instead of a silently dropped commit. Two tests in test_session_race_conditions.py encoded the old "DELETE waits for chat to finish" contract and now expect the new "DELETE preempts → 410" behaviour: - test_double_delete_second_returns_404 - test_multiple_chats_queued_then_delete Pre-existing 4-failure baseline (get_or_create_session auto-create-after-delete tests) is unchanged: 5 pass / 4 fail. Split out of #33 per offline review request — #33 keeps the multi-backend orchestration + observability work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spawn N independent vanilla session-server processes on the SGLang gateway node; callee-side random pick per session. Adds structured per-request observability (chat_start / chat_done with timing buckets + token counts) and a per-worker stats heartbeat for memory/throughput. This PR is the safe (multi-backend + logging) subset of the original PR #33 design. The lock-restoration + DELETE cancellation channel + SessionStateConflictError 409 surface — which originally shipped in this PR — have been split out to a follow-up so they can be reviewed and landed independently. See the lock-restore PR for that work. Session-server changes: - Spawn N independent backends + callee-side pick (miles/ray/rollout.py, miles/utils/arguments.py) - chat_start / chat_done structured logs with req_id (uuid8) and timing buckets (lock_wait_ms, tokenize_in_ms, proxy_elapsed_ms, tokenize_out_ms, total_ms, inflight_now, prompt_tokens, completion_tokens). One INFO log per request, success or short-circuit. - Per-worker _stats (reqs_total, turns_completed, inflight) + _worker_stats registry keyed by session_server_port. Background _stats_logger_loop in session_server.py emits 30s heartbeat with rss_mb / vms_mb via psutil. - pid=<n> prefix on every record in each subprocess via logging.basicConfig in run_session_server (force=True overrides ray-set handlers). - Enriched state_changed_during_proxy warning with worker_port, inflight_chat_count, caller_request_id (x-request-id), proxy_elapsed_ms — grep-correlatable with Layer-1 client retries. - prepare_pretokenized + update_pretokenized_state run in asyncio.to_thread so the event loop isn't blocked by sync TITO tokenizer calls (independent perf win, ~40% off p99 in microbench). - proxy_transport_error warning includes url + elapsed_ms. Dependencies: - psutil hard-pinned in requirements.txt (consumed by the stats heartbeat; no graceful fallback — we want memory metrics in all environments). Race-tests baseline preserved: 5 pass / 4 fail (same as prod). The 4 failures are the documented get_or_create_session auto-create-after-delete behaviour, unrelated to this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cdee1f0 to
d079b36
Compare
b57a1ce to
e1b93a0
Compare
Apply formatter fixes flagged by CI's pre-commit run. Three files inside this PR's scope had style-fixes (linear_trajectory.py, test_session_race_conditions.py, plus an unused-import sweep) and seven pre-existing style-debt files elsewhere in the repo that pre-commit's --all-files mode picked up. All changes are formatter output, no semantic edits. Same lint-fix bundle applied to PR #33 in commit e1b93a0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spawn N independent vanilla session-server processes on the SGLang gateway node; callee-side random pick per session. Adds structured per-request observability (chat_start / chat_done with timing buckets + token counts) and a per-worker stats heartbeat for memory/throughput. This PR is the safe (multi-backend + logging) subset of the original PR #33 design. The lock-restoration + DELETE cancellation channel + SessionStateConflictError 409 surface — which originally shipped in this PR — have been split out to a follow-up so they can be reviewed and landed independently. See the lock-restore PR for that work. Session-server changes: - Spawn N independent backends + callee-side pick (miles/ray/rollout.py, miles/utils/arguments.py) - chat_start / chat_done structured logs with req_id (uuid8) and timing buckets (lock_wait_ms, tokenize_in_ms, proxy_elapsed_ms, tokenize_out_ms, total_ms, inflight_now, prompt_tokens, completion_tokens). One INFO log per request, success or short-circuit. - Per-worker _stats (reqs_total, turns_completed, inflight) + _worker_stats registry keyed by session_server_port. Background _stats_logger_loop in session_server.py emits 30s heartbeat with rss_mb / vms_mb via psutil. - pid=<n> prefix on every record in each subprocess via logging.basicConfig in run_session_server (force=True overrides ray-set handlers). - Enriched state_changed_during_proxy warning with worker_port, inflight_chat_count, caller_request_id (x-request-id), proxy_elapsed_ms — grep-correlatable with Layer-1 client retries. - prepare_pretokenized + update_pretokenized_state run in asyncio.to_thread so the event loop isn't blocked by sync TITO tokenizer calls (independent perf win, ~40% off p99 in microbench). - proxy_transport_error warning includes url + elapsed_ms. Dependencies: - psutil hard-pinned in requirements.txt (consumed by the stats heartbeat; no graceful fallback — we want memory metrics in all environments). Race-tests baseline preserved: 5 pass / 4 fail (same as prod). The 4 failures are the documented get_or_create_session auto-create-after-delete behaviour, unrelated to this PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply formatter fixes flagged by CI's pre-commit run. Eleven files touched: four of them inside this PR's scope (generate_utils/openai_endpoint_utils.py, session/sessions.py, session/linear_trajectory.py, ray/rollout.py — single-line/spacing fixes only) and seven pre-existing style-debt files elsewhere in the repo that pre-commit's --all-files mode picked up. All changes are formatter output, no semantic edits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e1b93a0 to
577283b
Compare
…via cancellation channel PR #31 split session.lock around the SGLang proxy call to keep DELETE non-blocking while a chat was in flight. That split-lock window let two same-session writers commit interleaved assistant turns; the Phase-3 stale-update guard at sessions.py:241 silently dropped the second commit via `logger.warning(... skipping state update); return backend.build_proxy_response(result)` — the caller still received the 200 SGLang response and appended a phantom assistant record to its trajectory, while the server's session state didn't reflect that turn. The next chat then hit a cursor mismatch. Restore the original "one lock around Phase 1+2+3" invariant and move DELETE preemption from "no lock during proxy" to an explicit cancellation channel: chat stores its proxy task on session.current_proxy_task, DELETE cancels it (after setting session.closing), the chat coroutine catches CancelledError and returns 410 Gone (distinct from 404 "never existed" and 409 "retryable conflict"), then releases the lock so DELETE can acquire it and tear down. SessionStateConflictError (409) is added as defense-in-depth: with Phase 1+2+3 under one lock the Phase-3 stale-state branch is unreachable, but a future split-lock regression now surfaces a clear retryable 409 instead of a silently dropped commit. Two tests in test_session_race_conditions.py encoded the old "DELETE waits for chat to finish" contract and now expect the new "DELETE preempts → 410" behaviour: - test_double_delete_second_returns_404 - test_multiple_chats_queued_then_delete Pre-existing 4-failure baseline (get_or_create_session auto-create-after-delete tests) is unchanged: 5 pass / 4 fail. Split out of #33 per offline review request — #33 keeps the multi-backend orchestration + observability work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply formatter fixes flagged by CI's pre-commit run. Three files inside this PR's scope had style-fixes (linear_trajectory.py, test_session_race_conditions.py, plus an unused-import sweep) and seven pre-existing style-debt files elsewhere in the repo that pre-commit's --all-files mode picked up. All changes are formatter output, no semantic edits. Same lint-fix bundle applied to PR #33 in commit e1b93a0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
N independent vanilla session-server processes on the SGLang gateway node + callee-side random pick per session + structured per-request observability with timing buckets + per-worker stats heartbeat.
Scope-change note: the lock-restoration + DELETE cancellation channel +
SessionStateConflictError409 surface that originally shipped with this PR have been split out to #38 per offline review. This PR is now strictly multi-backend + logging; it does NOT change the lock holding pattern.What's in this PR
Multi-backend orchestration
miles/ray/rollout.py— spawn N session-server subprocesses on the gateway nodemiles/utils/arguments.py—session_server_portarg threadingmiles/rollout/generate_utils/openai_endpoint_utils.py— callee-side random pick per sessionmiles/rollout/session/session_server.py—_stats_logger_loopheartbeat (every 30s; reqs_total / reqs_since_last / inflight_now / turns_completed / rss_mb / vms_mb);pid=<n>prefix on every record in each subprocess vialogging.basicConfig(force=True)Per-request observability (
sessions.py)chat_start/chat_donestructured logs withreq_id(uuid8) and timing buckets:lock_wait_ms,tokenize_in_ms,proxy_elapsed_ms,tokenize_out_ms,total_ms,inflight_now,prompt_tokens,completion_tokens. One INFO log per request, success or short-circuit._stats(reqs_total, turns_completed, inflight) +_worker_statsregistry keyed bysession_server_port.state_changed_during_proxywarning withworker_port,inflight_chat_count,caller_request_id(x-request-id),proxy_elapsed_ms— grep-correlatable with Layer-1 client retries.prepare_pretokenized+update_pretokenized_staterun inasyncio.to_threadso the event loop isn't blocked by sync TITO tokenizer calls (~40% off p99 in microbench).proxy_transport_errorwarning includesurl+elapsed_ms.Dependencies
psutilhard-pinned inrequirements.txt(consumed by the stats heartbeat; no graceful fallback — we want memory metrics in all environments). /cc @nightlessbaronWhat's NOT in this PR
Moved to #38:
session.lockacross Phase 1+2+3 (currently split-lock on prod and unchanged here)session.current_proxy_task.cancel()cancellation channelSessionStateConflictError(409) defense-in-depth for the Phase-3 commit guardcurrent_proxy_taskfield onLinearTrajectoryTest plan
get_or_create_sessionauto-create-after-delete behaviour, unrelated to this PR.Observable behavior changes
None compared to prod's chat-flow contract — same split-lock structure, same DELETE-waits-for-lock semantics. The only new public surface is structured logging.
🤖 Generated with Claude Code