fix(session-server): restore lock across Phase 1+2+3, preempt DELETE via cancellation channel#38
Draft
rmfan wants to merge 2 commits into
Draft
fix(session-server): restore lock across Phase 1+2+3, preempt DELETE via cancellation channel#38rmfan wants to merge 2 commits into
rmfan wants to merge 2 commits into
Conversation
3 tasks
…via cancellation channel PR #31 split session.lock around the SGLang proxy call to keep DELETE non-blocking while a chat was in flight. That split-lock window let two same-session writers commit interleaved assistant turns; the Phase-3 stale-update guard at sessions.py:241 silently dropped the second commit via `logger.warning(... skipping state update); return backend.build_proxy_response(result)` — the caller still received the 200 SGLang response and appended a phantom assistant record to its trajectory, while the server's session state didn't reflect that turn. The next chat then hit a cursor mismatch. Restore the original "one lock around Phase 1+2+3" invariant and move DELETE preemption from "no lock during proxy" to an explicit cancellation channel: chat stores its proxy task on session.current_proxy_task, DELETE cancels it (after setting session.closing), the chat coroutine catches CancelledError and returns 410 Gone (distinct from 404 "never existed" and 409 "retryable conflict"), then releases the lock so DELETE can acquire it and tear down. SessionStateConflictError (409) is added as defense-in-depth: with Phase 1+2+3 under one lock the Phase-3 stale-state branch is unreachable, but a future split-lock regression now surfaces a clear retryable 409 instead of a silently dropped commit. Two tests in test_session_race_conditions.py encoded the old "DELETE waits for chat to finish" contract and now expect the new "DELETE preempts → 410" behaviour: - test_double_delete_second_returns_404 - test_multiple_chats_queued_then_delete Pre-existing 4-failure baseline (get_or_create_session auto-create-after-delete tests) is unchanged: 5 pass / 4 fail. Split out of #33 per offline review request — #33 keeps the multi-backend orchestration + observability work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply formatter fixes flagged by CI's pre-commit run. Three files inside this PR's scope had style-fixes (linear_trajectory.py, test_session_race_conditions.py, plus an unused-import sweep) and seven pre-existing style-debt files elsewhere in the repo that pre-commit's --all-files mode picked up. All changes are formatter output, no semantic edits. Same lint-fix bundle applied to PR #33 in commit e1b93a0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1e6c34b to
a114e9d
Compare
odp
requested changes
Jun 10, 2026
| # --- Phase 3: update state (still under the same lock) --- | ||
| # Defensive assert: with the lock held through Phase 1+2+3, no | ||
| # other writer can mutate num_assistant. If this fires, some | ||
| # future change has reintroduced a split-lock window — surface |
There was a problem hiding this comment.
Raising SessionStateConflictError here makes one think that this can still happen, which is impossible because of the new locking protocol. IMO this can be removed safely to avoid confusion, and only the lock related change should stay.
have we observed this in one of our runs? If yes, please document. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Split out of #33 per offline review request: this PR contains ONLY the lock-restoration + DELETE-preemption cancellation channel + SessionStateConflictError 409 surface. #33 keeps multi-backend orchestration and logging/observability.
Why
PR #31 split
session.lockaround the SGLang proxy call to keep DELETE non-blocking while a chat was in flight. That split-lock window has a documented race:sessions.py:241on prod does:Multi-backend (#33) doesn't fix this — it inherits the race per-backend. Restoring the lock around Phase 2 closes the race; the cancellation-channel preserves the non-blocking DELETE behaviour that the split-lock design was originally trying to achieve.
What changed
sessions.py—delete_sessionAfter setting
session.closing = True, cancelsession.current_proxy_taskif non-None. The in-flight chat coroutine catchesCancelledErrorand returns 410 Gone, then releases the lock so DELETE'sawait session.lock.acquire()proceeds.sessions.py—chat_completionsPhase 1+2+3 all run inside one
async with session.lock:block. Phase 2 wraps the proxy call (and the rollback-retry path) inasyncio.create_task(...)stored onsession.current_proxy_task.try/except asyncio.CancelledError → JSONResponse(410)/finally: session.current_proxy_task = None.sessions.py— Phase-3 state-changed guardNow
raise SessionStateConflictError(...)(→ 409) instead oflogger.warning + silent return of build_proxy_response(result). Unreachable under the lock-restored flow; defense-in-depth against a future split-lock regression.linear_trajectory.pyAdds
current_proxy_task: asyncio.Task | None = NonetoLinearTrajectory.session_errors.pyAdds
SessionStateConflictError(SessionError)withstatus_code = 409.Test updates
Two race tests encoded the old "DELETE waits for chat to finish" contract:
test_double_delete_second_returns_404— chat now returns 410 (was 200) when DELETE preempts ittest_multiple_chats_queued_then_delete— in-flight chat at DELETE time now returns 410; queued chats still 404. Status-code set widened to{200, 404, 410}; at-least-one-in-flight assertion widened to accept 410.Test plan
get_or_create_sessionauto-create-after-delete behaviour, unrelated to this PR.test_delete_can_proceed_while_chat_is_mid_proxy,test_chat_during_delete_returns_404,test_chat_after_delete_returns_404,test_rapid_create_chat_delete_cycles.Observable behavior changes
SessionStateConflictErrorinstead of silently corrupting trajectory state.Relationship to #33
#33 keeps a copy of these lock-restoration changes today; once this PR lands on
prod, #33 will rebase and shed the duplicated changes. The two PRs are not blocked on each other — #33 can land first too, in which case this PR's rebase is a no-op.🤖 Generated with Claude Code