Skip to content

fix(session-server): restore lock across Phase 1+2+3, preempt DELETE via cancellation channel#38

Draft
rmfan wants to merge 2 commits into
prodfrom
feat/session-server-lock-restore
Draft

fix(session-server): restore lock across Phase 1+2+3, preempt DELETE via cancellation channel#38
rmfan wants to merge 2 commits into
prodfrom
feat/session-server-lock-restore

Conversation

@rmfan

@rmfan rmfan commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Summary

Split out of #33 per offline review request: this PR contains ONLY the lock-restoration + DELETE-preemption cancellation channel + SessionStateConflictError 409 surface. #33 keeps multi-backend orchestration and logging/observability.

Why

PR #31 split session.lock around the SGLang proxy call to keep DELETE non-blocking while a chat was in flight. That split-lock window has a documented race:

  1. Two same-session writers can both reach Phase 3 (the post-proxy commit).
  2. The Phase-3 stale-state guard at sessions.py:241 on prod does:
    if session.num_assistant != expected_num_assistant:
        logger.warning(... "skipping state update")
        return backend.build_proxy_response(result)
    i.e. silently drops the second commit and still returns the 200 SGLang response to its caller.
  3. The caller treats the 200 as success and appends the assistant message to its trajectory.
  4. The next chat from that caller arrives with a trajectory the server hasn't seen → cursor mismatch.

Multi-backend (#33) doesn't fix this — it inherits the race per-backend. Restoring the lock around Phase 2 closes the race; the cancellation-channel preserves the non-blocking DELETE behaviour that the split-lock design was originally trying to achieve.

What changed

sessions.pydelete_session

After setting session.closing = True, cancel session.current_proxy_task if non-None. The in-flight chat coroutine catches CancelledError and returns 410 Gone, then releases the lock so DELETE's await session.lock.acquire() proceeds.

sessions.pychat_completions

Phase 1+2+3 all run inside one async with session.lock: block. Phase 2 wraps the proxy call (and the rollback-retry path) in asyncio.create_task(...) stored on session.current_proxy_task. try/except asyncio.CancelledError → JSONResponse(410) / finally: session.current_proxy_task = None.

sessions.py — Phase-3 state-changed guard

Now raise SessionStateConflictError(...) (→ 409) instead of logger.warning + silent return of build_proxy_response(result). Unreachable under the lock-restored flow; defense-in-depth against a future split-lock regression.

linear_trajectory.py

Adds current_proxy_task: asyncio.Task | None = None to LinearTrajectory.

session_errors.py

Adds SessionStateConflictError(SessionError) with status_code = 409.

Test updates

Two race tests encoded the old "DELETE waits for chat to finish" contract:

  • test_double_delete_second_returns_404 — chat now returns 410 (was 200) when DELETE preempts it
  • test_multiple_chats_queued_then_delete — in-flight chat at DELETE time now returns 410; queued chats still 404. Status-code set widened to {200, 404, 410}; at-least-one-in-flight assertion widened to accept 410.

Test plan

  • Race tests baseline preserved: 5 pass / 4 fail (same as prod). The 4 failures are the documented get_or_create_session auto-create-after-delete behaviour, unrelated to this PR.
    • Before this PR (prod): 4 fail / 5 pass — test_delete_can_proceed_while_chat_is_mid_proxy, test_chat_during_delete_returns_404, test_chat_after_delete_returns_404, test_rapid_create_chat_delete_cycles.
    • After this PR (with test updates): same 4 fail / 5 pass.
  • CI green
  • Manual: drive a DELETE while a chat is in flight; confirm chat returns 410 and DELETE returns 204 within milliseconds.

Observable behavior changes

  • A chat in flight when its session is DELETEd now returns 410 Gone (was: 200 with the upstream SGLang response). Distinct from 404 ("never existed") and 409 ("retryable conflict"). Harbor / litellm should treat 410 as definite session-closed.
  • A future split-lock regression now surfaces 409 with SessionStateConflictError instead of silently corrupting trajectory state.

Relationship to #33

#33 keeps a copy of these lock-restoration changes today; once this PR lands on prod, #33 will rebase and shed the duplicated changes. The two PRs are not blocked on each other — #33 can land first too, in which case this PR's rebase is a no-op.

🤖 Generated with Claude Code

rmfan and others added 2 commits June 9, 2026 15:19
…via cancellation channel

PR #31 split session.lock around the SGLang proxy call to keep DELETE
non-blocking while a chat was in flight. That split-lock window let
two same-session writers commit interleaved assistant turns; the
Phase-3 stale-update guard at sessions.py:241 silently dropped the
second commit via `logger.warning(... skipping state update); return
backend.build_proxy_response(result)` — the caller still received the
200 SGLang response and appended a phantom assistant record to its
trajectory, while the server's session state didn't reflect that turn.
The next chat then hit a cursor mismatch.

Restore the original "one lock around Phase 1+2+3" invariant and
move DELETE preemption from "no lock during proxy" to an explicit
cancellation channel: chat stores its proxy task on
session.current_proxy_task, DELETE cancels it (after setting
session.closing), the chat coroutine catches CancelledError and
returns 410 Gone (distinct from 404 "never existed" and 409
"retryable conflict"), then releases the lock so DELETE can acquire
it and tear down.

SessionStateConflictError (409) is added as defense-in-depth: with
Phase 1+2+3 under one lock the Phase-3 stale-state branch is
unreachable, but a future split-lock regression now surfaces a clear
retryable 409 instead of a silently dropped commit.

Two tests in test_session_race_conditions.py encoded the old
"DELETE waits for chat to finish" contract and now expect the new
"DELETE preempts → 410" behaviour:
  - test_double_delete_second_returns_404
  - test_multiple_chats_queued_then_delete
Pre-existing 4-failure baseline (get_or_create_session
auto-create-after-delete tests) is unchanged: 5 pass / 4 fail.

Split out of #33 per offline review request — #33 keeps the
multi-backend orchestration + observability work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply formatter fixes flagged by CI's pre-commit run. Three files
inside this PR's scope had style-fixes (linear_trajectory.py,
test_session_race_conditions.py, plus an unused-import sweep) and
seven pre-existing style-debt files elsewhere in the repo that
pre-commit's --all-files mode picked up. All changes are formatter
output, no semantic edits.

Same lint-fix bundle applied to PR #33 in commit e1b93a0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rmfan rmfan force-pushed the feat/session-server-lock-restore branch from 1e6c34b to a114e9d Compare June 9, 2026 22:20
# --- Phase 3: update state (still under the same lock) ---
# Defensive assert: with the lock held through Phase 1+2+3, no
# other writer can mutate num_assistant. If this fires, some
# future change has reintroduced a split-lock window — surface

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raising SessionStateConflictError here makes one think that this can still happen, which is impossible because of the new locking protocol. IMO this can be removed safely to avoid confusion, and only the lock related change should stay.

@odp

odp commented Jun 10, 2026

Copy link
Copy Markdown

PR #31 split session.lock around the SGLang proxy call to keep DELETE non-blocking while a chat was in flight. That split-lock window has a documented race

have we observed this in one of our runs? If yes, please document.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants