Skip to content

feat(session-server): multi-backend vanilla + per-request observability#33

Merged
rmfan merged 2 commits into
prodfrom
feat/session-server-multi-backend-vanilla
Jun 10, 2026
Merged

feat(session-server): multi-backend vanilla + per-request observability#33
rmfan merged 2 commits into
prodfrom
feat/session-server-multi-backend-vanilla

Conversation

@rmfan

@rmfan rmfan commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

Summary

N independent vanilla session-server processes on the SGLang gateway node + callee-side random pick per session + structured per-request observability with timing buckets + per-worker stats heartbeat.

Scope-change note: the lock-restoration + DELETE cancellation channel + SessionStateConflictError 409 surface that originally shipped with this PR have been split out to #38 per offline review. This PR is now strictly multi-backend + logging; it does NOT change the lock holding pattern.

What's in this PR

Multi-backend orchestration

  • miles/ray/rollout.py — spawn N session-server subprocesses on the gateway node
  • miles/utils/arguments.pysession_server_port arg threading
  • miles/rollout/generate_utils/openai_endpoint_utils.py — callee-side random pick per session
  • miles/rollout/session/session_server.py_stats_logger_loop heartbeat (every 30s; reqs_total / reqs_since_last / inflight_now / turns_completed / rss_mb / vms_mb); pid=<n> prefix on every record in each subprocess via logging.basicConfig(force=True)

Per-request observability (sessions.py)

  • chat_start / chat_done structured logs with req_id (uuid8) and timing buckets: lock_wait_ms, tokenize_in_ms, proxy_elapsed_ms, tokenize_out_ms, total_ms, inflight_now, prompt_tokens, completion_tokens. One INFO log per request, success or short-circuit.
  • Per-worker _stats (reqs_total, turns_completed, inflight) + _worker_stats registry keyed by session_server_port.
  • Enriched state_changed_during_proxy warning with worker_port, inflight_chat_count, caller_request_id (x-request-id), proxy_elapsed_ms — grep-correlatable with Layer-1 client retries.
  • prepare_pretokenized + update_pretokenized_state run in asyncio.to_thread so the event loop isn't blocked by sync TITO tokenizer calls (~40% off p99 in microbench).
  • proxy_transport_error warning includes url + elapsed_ms.

Dependencies

  • psutil hard-pinned in requirements.txt (consumed by the stats heartbeat; no graceful fallback — we want memory metrics in all environments). /cc @nightlessbaron

What's NOT in this PR

Moved to #38:

  • Restoring session.lock across Phase 1+2+3 (currently split-lock on prod and unchanged here)
  • DELETE preemption via session.current_proxy_task.cancel() cancellation channel
  • SessionStateConflictError (409) defense-in-depth for the Phase-3 commit guard
  • current_proxy_task field on LinearTrajectory
  • Two race-tests updates that encode the lock-restored DELETE-preempts-chat contract

Test plan

  • Race tests baseline preserved: 5 pass / 4 fail (same as prod). The 4 failures are the documented get_or_create_session auto-create-after-delete behaviour, unrelated to this PR.
  • CI green
  • Multi-backend stub bench (already posted: 12× peak throughput at wc=8 vs wc=1 in the comment above)

Observable behavior changes

None compared to prod's chat-flow contract — same split-lock structure, same DELETE-waits-for-lock semantics. The only new public surface is structured logging.

🤖 Generated with Claude Code

@rmfan

rmfan commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

Pushed ca684017 adding logging.info-only observability (no Prometheus). Five additions:

  1. chat_start (sessions.py) — handler entry, includes worker_port, session_id, req_id (uuid8), messages_len, inflight_before.
  2. chat_done (sessions.py) — handler exit (finally, so fires on success/410/404/upstream-error), includes lock_wait_ms, tokenize_in_ms, proxy_elapsed_ms, tokenize_out_ms, total_ms, inflight_now, prompt_tokens, completion_tokens. req_id matches the chat_start so they grep-pair.
  3. Per-worker stats heartbeat (session_server.py) — 30s asyncio background task started via FastAPI on_event("startup"), cancelled on shutdown. Logs reqs_total / reqs_since_last / inflight_now / turns_completed + rss_mb / vms_mb via psutil. psutil import is lazy with graceful fallback so N=1 envs without it still work.
  4. state_changed_during_proxy — enriched the defensive race-detection branch (the diagnostic-gap that bit the cursor-mismatch RCA) with inflight_chat_count, caller_request_id (x-request-id), worker_port, proxy_elapsed_ms. The Phase 1+2+3 lock-held design closes the race window, but the assert is kept; if a future split-lock regression reopens it, this log now correlates server-side races with Layer-1 client retries.
  5. proxy_transport_error — added url and elapsed_ms to the existing httpx.TransportError warning.

Sample chat_done:

[session-server] chat_done worker_port=None session_id=1f353857... req_id=c3d2dd4a lock_wait_ms=0.0 tokenize_in_ms=0.9 proxy_elapsed_ms=111.8 tokenize_out_ms=0.8 total_ms=115.5 inflight_now=7 prompt_tokens=-1 completion_tokens=2

(prompt_tokens=-1 is the upstream-not-returning case from the test mock; in production SGLang populates it.)

Tests: pytest tests/fast/router/test_session_race_conditions.py -v --noconftest — same 5 pass / 4 fail as pre-existing (the 4 failures are the documented get_or_create_session auto-create-after-delete behaviour, unrelated to this PR). Import-clean: from miles.rollout.session import session_server → OK.

Diff stat (~250 LoC, all in two files):

miles/rollout/session/session_server.py | 117 +++++++++++++++++++++-
miles/rollout/session/sessions.py       | 137 +++++++++++++++++++++++++++-
2 files changed, 250 insertions(+), 4 deletions(-)

Staying as draft.

@rmfan

rmfan commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

running end-to-end tests on it now

@rmfan

rmfan commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator Author

Bench results (multibackend stub rig) + logging validation

1. Bench results

Rig: scripts/orchestrate_multibackend.py against STUB_LATENCY_SEC=0.5, STUB_WORKERS=4, DURATION_SEC=45, WARMUP_SEC=5, TURNS=6, LOADGEN_PROCS=8. Numbers below are from summary_multibackend.csv.

worker count (wc) N tput (r/s) cli p50 cli p99 srv p50 srv p99
1 100 66.0 1.334 5.366 1.238 5.312
1 200 53.9 4.512 6.666 4.186 6.611
1 400 66.3 6.139 9.980 5.579 9.327
1 800 60.9 8.631 29.795 7.758 29.317
1 1600 62.2 8.481 31.992 7.349 30.957
2 100 204.8 0.515 0.682 0.513 0.670
2 200 109.6 0.605 6.006 0.565 5.867
2 400 106.7 4.567 6.693 4.162 6.422
2 800 136.7 6.122 11.727 5.578 10.106
2 1600 135.6 8.205 29.799 7.481 29.000
4 100 207.8 0.510 0.575 0.508 0.566
4 200 406.6 0.517 0.691 0.515 0.678
4 400 225.0 1.132 6.140 0.981 5.970
4 800 202.3 4.783 6.605 4.360 6.280
4 1600 261.7 6.291 10.909 5.740 10.512
8 100 208.4 0.510 0.560 0.508 0.558
8 200 415.2 0.511 0.566 0.509 0.562
8 400 795.1 0.527 0.725 0.522 0.707
8 800 444.1 1.091 6.054 0.931 5.848
8 1600 415.6 4.556 6.760 4.122 6.504

Peak throughput per backend count:

wc peak r/s at concurrency speedup vs wc=1
1 66 1.0x
2 205 100 3.1x
4 407 200 6.2x
8 795 400 12.0x

Throughput scales near-linearly with backend count once the loadgen-side concurrency is high enough to keep all backends fed. The drop-off after the peak (e.g. wc=8 falling from 795 -> 444 at N=800) is the stub's STUB_WORKERS=4 per backend saturating and triggering tail-latency / ReadError timeouts (see logging section).

Comparison vs PR #31's router rig at wc=4, in-flight=400:

path r/s
through router (GIL-bound) ~51
sticky direct-to-backend (bypass) ~284
multibackend (this PR) 225

Multibackend sits much closer to the bypass path than the router path. The remaining gap to 284 is plausibly loadgen-side overhead (8 loadgen procs here vs 4 in PR #31) — to verify, re-run with LOADGEN_PROCS=4. This is consistent with the router being the GIL bottleneck PR #31 identified; removing it via multi-backend recovers most of the bypass-path throughput. Stated as a hypothesis pending an apples-to-apples loadgen check.

2. Logging validation (commit ca684017)

Source: ~/multibacked_logs/multibackend/multibackend_wc{1,2,4,8}.log.

file lines chat_start chat_done stats state-changed warnings proxy_transport_error
wc1 124,983 16,349 16,349 10 0 24
wc2 250,551 34,654 34,654 20 0 31
wc4 448,353 62,718 62,718 40 0 37
wc8 763,233 107,844 107,844 80 0 51

chat_done counts are 6-18% above the client-counted totals from summary_multibackend.csv (~13.9k / 31.2k / 58.7k / 102.5k), which matches the expected n_server > n_client gap from mid-window cutoffs.

worker_port populated correctly — distinct ports per file matches backend count:

  • wc1: {8910} (1)
  • wc2: {8911, 8912} (2)
  • wc4: {8910, 8913, 8914, 8915} (4)
  • wc8: {8911, 8912, 8916, 8917, 8918, 8919, 8920, 8921} (8)
  • Zero occurrences of worker_port=None across all files.

stats cadence: exactly 10 stats lines per backend per file (10 / 20 / 40 / 80 = N × 10). Matches ~5-min bench duration at 30s interval. rss_mb / vms_mb populated — psutil import did not silently fail.

Sample chat_done (verbatim, wc=8):

2026-06-06 01:31:41,332 INFO miles.rollout.session.sessions [session-server] chat_done worker_port=8912 session_id=0fc8977e42474e60b613dedf48633aca req_id=7bc6c541 lock_wait_ms=0.0 tokenize_in_ms=2.8 proxy_elapsed_ms=485.2 tokenize_out_ms=3.7 total_ms=494.2 inflight_now=60 prompt_tokens=1700 completion_tokens=200
  • lock_wait_ms=0.0 (no contention in steady state — lock-restore working)
  • proxy_elapsed_ms=485.2 (≈ stub's 0.5s, as expected)
  • tokenize_in_ms + tokenize_out_ms < 10ms each
  • inflight_now integer

Sample stats (verbatim, wc=4):

2026-06-06 01:29:06,041 INFO miles.rollout.session.session_server [session-server] stats worker_port=8913 reqs_total=15874 reqs_since_last=1911 inflight_now=332 turns_completed=15535 rss_mb=840 vms_mb=17618
  • All fields populated. RSS grew from 534 MB at start to 840 MB after ~15k requests — bounded.

No state changed during proxy warnings observed across any file (0 / 0 / 0 / 0). Confirms the lock-restore via cancellation channel closes the race under load (≤ 1600 concurrent, 8 backends, ~108k requests).

No tracebacks or unhandled exceptions. The only WARNING-level lines are proxy_transport_error (24 / 31 / 37 / 51 occurrences) — all error_type=ReadError against the stub at http://127.0.0.1:8801. These appear in clusters during the higher-N rows where the stub's 4-worker pool saturates and connections back up; the enrichment fields (url, elapsed_ms) added by this PR are populated. On those requests chat_done correctly emits prompt_tokens=-1 completion_tokens=-1 sentinels (Phase 3 didn't run). Behavior matches the design.

3. Operational guide

  • Rule of thumb: pick wc (backend count) so that concurrency / wc ≈ 50-100. At that ratio the rig sits at peak throughput with sub-second p99 (wc=2 @ 100, wc=4 @ 200, wc=8 @ 400 all land here).
  • Below the ratio (under-filled): per-backend pool is idle → throughput leaves money on the table (e.g. wc=8 @ 100 only hits 208 r/s).
  • Above the ratio (over-filled): per-backend STUB_WORKERS=4 saturates → tail latency blows up and proxy_transport_error starts firing (e.g. wc=4 @ 1600: p99=10.5s, 37 ReadErrors in the log).
  • Caveat — stub vs real SGLang: the stub has constant 0.5s latency. Real SGLang has a heavy-tailed distribution, so production p99 will be larger than these numbers under-state. The relative shape (linear scaling with wc, the concurrency-per-backend sweet spot) should carry over; the absolute p99 ceilings will not.

Files

  • Bench rig: scripts/orchestrate_multibackend.py, scripts/run_multibackend_bench.sh
  • Logging surfaces: miles/rollout/session/sessions.py, miles/rollout/session/session_server.py
  • Logs validated: ~/multibacked_logs/multibackend/multibackend_wc{1,2,4,8}.log (187 files, ~1.6M lines)

@rmfan rmfan marked this pull request as ready for review June 6, 2026 01:50
@rmfan rmfan requested a review from a team June 6, 2026 01:50
Comment thread miles/rollout/session/session_server.py Outdated
Comment thread miles/rollout/session/session_errors.py Outdated
Comment thread miles/rollout/session/session_server.py Outdated
rmfan added a commit that referenced this pull request Jun 9, 2026
Addresses two review comments on PR #33:

1. @odp comment on session_server.py:130 — psutil is now a hard
   requirement (added to requirements.txt). Dropped the try/except
   import and the `_proc is None` fallback branches in
   _stats_logger_loop; stats heartbeat is single-path with rss_mb /
   vms_mb always populated. /cc @nightlessbaron

2. @odp comment on session_server.py:182 — every log record in each
   session-server subprocess now carries pid=<n>, via a one-shot
   logging.basicConfig(..., format="... pid=%(process)d ...",
   force=True) at the top of run_session_server. Covers every
   logger.info in the module (handlers, stats heartbeat, race-detection
   warning) with no per-call-site edits. force=True overrides any
   handler config inherited from the parent (e.g. ray-set handlers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rmfan added a commit that referenced this pull request Jun 9, 2026
…via cancellation channel

PR #31 split session.lock around the SGLang proxy call to keep DELETE
non-blocking while a chat was in flight. That split-lock window let
two same-session writers commit interleaved assistant turns; the
Phase-3 stale-update guard at sessions.py:241 silently dropped the
second commit via `logger.warning(... skipping state update); return
backend.build_proxy_response(result)` — the caller still received the
200 SGLang response and appended a phantom assistant record to its
trajectory, while the server's session state didn't reflect that turn.
The next chat then hit a cursor mismatch.

Restore the original "one lock around Phase 1+2+3" invariant and
move DELETE preemption from "no lock during proxy" to an explicit
cancellation channel: chat stores its proxy task on
session.current_proxy_task, DELETE cancels it (after setting
session.closing), the chat coroutine catches CancelledError and
returns 410 Gone (distinct from 404 "never existed" and 409
"retryable conflict"), then releases the lock so DELETE can acquire
it and tear down.

SessionStateConflictError (409) is added as defense-in-depth: with
Phase 1+2+3 under one lock the Phase-3 stale-state branch is
unreachable, but a future split-lock regression now surfaces a clear
retryable 409 instead of a silently dropped commit.

Two tests in test_session_race_conditions.py encoded the old
"DELETE waits for chat to finish" contract and now expect the new
"DELETE preempts → 410" behaviour:
  - test_double_delete_second_returns_404
  - test_multiple_chats_queued_then_delete
Pre-existing 4-failure baseline (get_or_create_session
auto-create-after-delete tests) is unchanged: 5 pass / 4 fail.

Split out of #33 per offline review request — #33 keeps the
multi-backend orchestration + observability work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rmfan added a commit that referenced this pull request Jun 9, 2026
Spawn N independent vanilla session-server processes on the SGLang
gateway node; callee-side random pick per session. Adds structured
per-request observability (chat_start / chat_done with timing buckets
+ token counts) and a per-worker stats heartbeat for memory/throughput.

This PR is the safe (multi-backend + logging) subset of the original
PR #33 design. The lock-restoration + DELETE cancellation channel +
SessionStateConflictError 409 surface — which originally shipped in
this PR — have been split out to a follow-up so they can be reviewed
and landed independently. See the lock-restore PR for that work.

Session-server changes:
- Spawn N independent backends + callee-side pick (miles/ray/rollout.py,
  miles/utils/arguments.py)
- chat_start / chat_done structured logs with req_id (uuid8) and
  timing buckets (lock_wait_ms, tokenize_in_ms, proxy_elapsed_ms,
  tokenize_out_ms, total_ms, inflight_now, prompt_tokens,
  completion_tokens). One INFO log per request, success or short-circuit.
- Per-worker _stats (reqs_total, turns_completed, inflight) +
  _worker_stats registry keyed by session_server_port. Background
  _stats_logger_loop in session_server.py emits 30s heartbeat with
  rss_mb / vms_mb via psutil.
- pid=<n> prefix on every record in each subprocess via
  logging.basicConfig in run_session_server (force=True overrides
  ray-set handlers).
- Enriched state_changed_during_proxy warning with worker_port,
  inflight_chat_count, caller_request_id (x-request-id),
  proxy_elapsed_ms — grep-correlatable with Layer-1 client retries.
- prepare_pretokenized + update_pretokenized_state run in
  asyncio.to_thread so the event loop isn't blocked by sync
  TITO tokenizer calls (independent perf win, ~40% off p99 in
  microbench).
- proxy_transport_error warning includes url + elapsed_ms.

Dependencies:
- psutil hard-pinned in requirements.txt (consumed by the stats
  heartbeat; no graceful fallback — we want memory metrics in all
  environments).

Race-tests baseline preserved: 5 pass / 4 fail (same as prod). The
4 failures are the documented get_or_create_session
auto-create-after-delete behaviour, unrelated to this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rmfan rmfan force-pushed the feat/session-server-multi-backend-vanilla branch from cdee1f0 to d079b36 Compare June 9, 2026 17:04
@rmfan rmfan changed the title feat(session-server): multi-backend vanilla + lock-restore via cancellation channel feat(session-server): multi-backend vanilla + per-request observability Jun 9, 2026
@rmfan rmfan force-pushed the feat/session-server-multi-backend-vanilla branch from b57a1ce to e1b93a0 Compare June 9, 2026 20:14
rmfan added a commit that referenced this pull request Jun 9, 2026
Apply formatter fixes flagged by CI's pre-commit run. Three files
inside this PR's scope had style-fixes (linear_trajectory.py,
test_session_race_conditions.py, plus an unused-import sweep) and
seven pre-existing style-debt files elsewhere in the repo that
pre-commit's --all-files mode picked up. All changes are formatter
output, no semantic edits.

Same lint-fix bundle applied to PR #33 in commit e1b93a0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rmfan and others added 2 commits June 9, 2026 15:12
Spawn N independent vanilla session-server processes on the SGLang
gateway node; callee-side random pick per session. Adds structured
per-request observability (chat_start / chat_done with timing buckets
+ token counts) and a per-worker stats heartbeat for memory/throughput.

This PR is the safe (multi-backend + logging) subset of the original
PR #33 design. The lock-restoration + DELETE cancellation channel +
SessionStateConflictError 409 surface — which originally shipped in
this PR — have been split out to a follow-up so they can be reviewed
and landed independently. See the lock-restore PR for that work.

Session-server changes:
- Spawn N independent backends + callee-side pick (miles/ray/rollout.py,
  miles/utils/arguments.py)
- chat_start / chat_done structured logs with req_id (uuid8) and
  timing buckets (lock_wait_ms, tokenize_in_ms, proxy_elapsed_ms,
  tokenize_out_ms, total_ms, inflight_now, prompt_tokens,
  completion_tokens). One INFO log per request, success or short-circuit.
- Per-worker _stats (reqs_total, turns_completed, inflight) +
  _worker_stats registry keyed by session_server_port. Background
  _stats_logger_loop in session_server.py emits 30s heartbeat with
  rss_mb / vms_mb via psutil.
- pid=<n> prefix on every record in each subprocess via
  logging.basicConfig in run_session_server (force=True overrides
  ray-set handlers).
- Enriched state_changed_during_proxy warning with worker_port,
  inflight_chat_count, caller_request_id (x-request-id),
  proxy_elapsed_ms — grep-correlatable with Layer-1 client retries.
- prepare_pretokenized + update_pretokenized_state run in
  asyncio.to_thread so the event loop isn't blocked by sync
  TITO tokenizer calls (independent perf win, ~40% off p99 in
  microbench).
- proxy_transport_error warning includes url + elapsed_ms.

Dependencies:
- psutil hard-pinned in requirements.txt (consumed by the stats
  heartbeat; no graceful fallback — we want memory metrics in all
  environments).

Race-tests baseline preserved: 5 pass / 4 fail (same as prod). The
4 failures are the documented get_or_create_session
auto-create-after-delete behaviour, unrelated to this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply formatter fixes flagged by CI's pre-commit run. Eleven files
touched: four of them inside this PR's scope
(generate_utils/openai_endpoint_utils.py, session/sessions.py,
session/linear_trajectory.py, ray/rollout.py — single-line/spacing
fixes only) and seven pre-existing style-debt files elsewhere in the
repo that pre-commit's --all-files mode picked up. All changes are
formatter output, no semantic edits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rmfan rmfan force-pushed the feat/session-server-multi-backend-vanilla branch from e1b93a0 to 577283b Compare June 9, 2026 22:12
rmfan added a commit that referenced this pull request Jun 9, 2026
…via cancellation channel

PR #31 split session.lock around the SGLang proxy call to keep DELETE
non-blocking while a chat was in flight. That split-lock window let
two same-session writers commit interleaved assistant turns; the
Phase-3 stale-update guard at sessions.py:241 silently dropped the
second commit via `logger.warning(... skipping state update); return
backend.build_proxy_response(result)` — the caller still received the
200 SGLang response and appended a phantom assistant record to its
trajectory, while the server's session state didn't reflect that turn.
The next chat then hit a cursor mismatch.

Restore the original "one lock around Phase 1+2+3" invariant and
move DELETE preemption from "no lock during proxy" to an explicit
cancellation channel: chat stores its proxy task on
session.current_proxy_task, DELETE cancels it (after setting
session.closing), the chat coroutine catches CancelledError and
returns 410 Gone (distinct from 404 "never existed" and 409
"retryable conflict"), then releases the lock so DELETE can acquire
it and tear down.

SessionStateConflictError (409) is added as defense-in-depth: with
Phase 1+2+3 under one lock the Phase-3 stale-state branch is
unreachable, but a future split-lock regression now surfaces a clear
retryable 409 instead of a silently dropped commit.

Two tests in test_session_race_conditions.py encoded the old
"DELETE waits for chat to finish" contract and now expect the new
"DELETE preempts → 410" behaviour:
  - test_double_delete_second_returns_404
  - test_multiple_chats_queued_then_delete
Pre-existing 4-failure baseline (get_or_create_session
auto-create-after-delete tests) is unchanged: 5 pass / 4 fail.

Split out of #33 per offline review request — #33 keeps the
multi-backend orchestration + observability work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rmfan added a commit that referenced this pull request Jun 9, 2026
Apply formatter fixes flagged by CI's pre-commit run. Three files
inside this PR's scope had style-fixes (linear_trajectory.py,
test_session_race_conditions.py, plus an unused-import sweep) and
seven pre-existing style-debt files elsewhere in the repo that
pre-commit's --all-files mode picked up. All changes are formatter
output, no semantic edits.

Same lint-fix bundle applied to PR #33 in commit e1b93a0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rmfan rmfan merged commit e69b51a into prod Jun 10, 2026
15 of 17 checks passed
@rmfan rmfan deleted the feat/session-server-multi-backend-vanilla branch June 10, 2026 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants