Skip to content

session-server: keep restart auto-create, but make delete stick#35

Open
DavidBellamy wants to merge 1 commit into
prodfrom
session-delete-tombstone
Open

session-server: keep restart auto-create, but make delete stick#35
DavidBellamy wants to merge 1 commit into
prodfrom
session-delete-tombstone

Conversation

@DavidBellamy

Copy link
Copy Markdown
Collaborator

What

PR radixark#888 ("auto-create sessions on first access for restart tolerance") made the session server auto-create a session whenever it gets a request for an id it does not have. That gives useful router-restart tolerance, but it also resurrects explicitly deleted sessions: a chat or GET for a deleted id silently creates a fresh one, so DELETE no longer holds. It also left several session tests failing, since they assert the old "deleted -> 404" contract.

This keeps the restart tolerance but restores delete semantics by tombstoning explicitly deleted ids:

  • remove_session records the id in a bounded set (the most recent 10k deletes).
  • get_or_create_session and the GET /sessions/{id} route return 404 for a tombstoned (deleted) id, instead of recreating it or returning empty.
  • An id the server has never seen still auto-creates / returns empty, so restart tolerance is preserved.

Tests

  • The 4 race tests (test_delete_can_proceed_while_chat_is_mid_proxy, test_chat_during_delete_returns_404, test_chat_after_delete_returns_404, test_rapid_create_chat_delete_cycles) now pass unchanged.
  • test_get_session_not_found becomes test_get_unknown_session_returns_empty: asserts the intended empty-200 for a never-seen id.
  • Adds test_get_deleted_session_returns_404 for the tombstone behavior.

Verified in the miles container: full session suite (test_sessions.py, test_session_pretokenized_e2e.py, test_session_race_conditions.py) is green (56 passed), with the race file green on a repeat run.

@DavidBellamy DavidBellamy requested a review from a team June 6, 2026 05:11
PR radixark#888 added auto-create-on-missing for router-restart tolerance, but it
also resurrected explicitly deleted sessions: a chat or GET for a deleted id
silently created a fresh one, so "delete" no longer held. That left four
session tests asserting the old "deleted -> 404" contract failing.

Tombstone explicitly deleted ids (bounded to the most recent 10k): a deleted
id now returns 404 on both the chat and GET paths, while an id the server has
never seen still auto-creates, so restart tolerance is preserved.

Update test_get_session_not_found -> test_get_unknown_session_returns_empty
(asserts the intended empty-200 for a never-seen id) and add
test_get_deleted_session_returns_404 for the tombstone behavior.
@DavidBellamy DavidBellamy force-pushed the session-delete-tombstone branch from d8ee8e4 to f2ebcd4 Compare June 6, 2026 05:20
@flukeskywalker

Copy link
Copy Markdown

Looks good! A couple of questions:

  1. What are the situations in our setup when a chat or GET request for a deleted id might arrive? What happens currently in those situations after the session gets resurrected?
  2. Depending on 1, under for which rollout batch sizes would 10k be too low as the number of recently deleted IDs to track?

@flukeskywalker

Copy link
Copy Markdown

@nightlessbaron I think we should merge this, but want to make sure you saw this on the off-chance that this has implications for the request abort issues you were noticing.

@DavidBellamy

Copy link
Copy Markdown
Collaborator Author

Looks good! A couple of questions:

  1. What are the situations in our setup when a chat or GET request for a deleted id might arrive? What happens currently in those situations after the session gets resurrected?
  2. Depending on 1, under for which rollout batch sizes would 10k be too low as the number of recently deleted IDs to track?
  1. It's never on purpose. Each conversation gets a fresh, unique ID, so a request for a deleted one is always a leftover retry. The most common cause: the agent retries a message after the conversation was already wrapped up and deleted. Before this fix, that retry quietly created a ghost empty conversation that just sat around wasting space for 2 hours. This fix makes it cleanly fail instead.
  2. The list only needs to remember an ID long enough for stray retries to stop coming (a couple minutes). 10k is plenty unless you're running so many conversations at once that you delete more than 10,000 within that short window, roughly when batch size times samples-per-prompt gets above ~10k. Will we hit this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants