session-server: run TITO tokenization off the event loop#34

Merged
DavidBellamy merged 2 commits into prod from session-server-offload-tokenization on Jun 9, 2026

DavidBellamy commented Jun 6, 2026

Collaborator

What

The standalone session server runs as a single uvicorn worker (one asyncio event loop). On the chat-completions hot path, the two TITO tokenization calls, prepare_pretokenized and update_pretokenized_state, run synchronous CPU work directly on that loop. While one trajectory tokenizes, every other in-flight request on the server is blocked, including cheap calls like /health and create_session. With many concurrent agentic trajectories funneling through the one process, this serializes tokenization across every turn and surfaces as rising latency that can look like a downstream (sandbox/router) problem.

This wraps both calls in asyncio.to_thread, so tokenization runs on a worker thread and the event loop stays responsive. Fast (Rust) tokenizers release the GIL during encoding, so concurrent tokenizations also run in parallel across cores rather than serializing on one.
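As a minimal sketch of the difference (not the session server's code), the snippet below times a cheap coroutine while a slow synchronous call is in flight, once inline and once through asyncio.to_thread. The names tokenize_blocking, health, and measure_health_latency are hypothetical stand-ins, and time.sleep stands in for tokenization work that releases the GIL.

```python
import asyncio
import time


def tokenize_blocking(text: str) -> list[int]:
    """Hypothetical stand-in for synchronous CPU tokenization."""
    time.sleep(0.2)  # simulated work; like a fast tokenizer, sleep releases the GIL
    return [ord(c) for c in text]


async def health() -> str:
    """Hypothetical cheap endpoint."""
    return "ok"


async def measure_health_latency(offload: bool) -> float:
    """Start one tokenization, then time how long a cheap call takes to complete."""

    async def handler() -> list[int]:
        if offload:
            # Worker thread does the work; the event loop keeps running.
            return await asyncio.to_thread(tokenize_blocking, "hello")
        # Inline call: the event loop is stuck here until it returns.
        return tokenize_blocking("hello")

    start = time.perf_counter()
    task = asyncio.create_task(handler())
    await asyncio.sleep(0)  # yield once so the handler starts running
    await health()
    latency = time.perf_counter() - start
    await task
    return latency


if __name__ == "__main__":
    print("inline   :", asyncio.run(measure_health_latency(offload=False)))
    print("to_thread:", asyncio.run(measure_health_latency(offload=True)))
```

In the inline case the cheap call cannot finish until the simulated tokenization returns; with to_thread it completes almost immediately.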

Correctness

  • Both calls stay inside async with session.lock, held across the await, so each session still mutates its own trajectory single-threaded and in order.
  • The shared tito_tokenizer is read-only (encode / apply_chat_template), never reconfigured, so concurrent use across sessions is safe.
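The ordering argument can be illustrated with a small sketch, assuming a per-session asyncio.Lock as described above. Session, slow_append, and handle_turn are hypothetical names, not the server's own. Earlier turns are made deliberately slower, so without the lock they would finish last.

```python
import asyncio
import time


class Session:
    """Hypothetical per-session state: a lock plus an append-only trajectory."""

    def __init__(self) -> None:
        self.lock = asyncio.Lock()
        self.trajectory: list[int] = []


def slow_append(trajectory: list[int], turn: int, delay: float) -> None:
    """Stand-in for a tokenization step that mutates session state."""
    time.sleep(delay)
    trajectory.append(turn)


async def handle_turn(session: Session, turn: int, delay: float) -> None:
    # The lock is held across the await, so only one worker thread
    # touches this session's trajectory at a time.
    async with session.lock:
        await asyncio.to_thread(slow_append, session.trajectory, turn, delay)


async def run_turns() -> list[int]:
    session = Session()
    # Turn 0 is the slowest and turn 2 the fastest.
    await asyncio.gather(
        *(handle_turn(session, t, 0.03 * (3 - t)) for t in range(3))
    )
    return session.trajectory


if __name__ == "__main__":
    print(asyncio.run(run_turns()))
```

Because asyncio.Lock wakes waiters in the order they started waiting, the turns are applied in arrival order even though each one runs on a worker thread.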

Test

Adds TestTokenizationOffload::test_slow_tokenization_does_not_block_event_loop: patches prepare_pretokenized to sleep, fires a chat completion, and asserts /health returns well within the tokenization sleep window. It fails on the original code (event loop blocked) and passes with this change.
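The shape of such a test can be sketched without the real server. This is an illustrative stand-in, not the test added in this PR: FakeTokenizer, chat_completion, health, and check_health_not_blocked are hypothetical, and the real test goes through the HTTP endpoints rather than calling coroutines directly.

```python
import asyncio
import time
from unittest import mock


class FakeTokenizer:
    """Hypothetical stand-in for the object that owns prepare_pretokenized."""

    def prepare_pretokenized(self, messages):
        return [len(m) for m in messages]


TOKENIZER = FakeTokenizer()


async def chat_completion(messages):
    # Mirrors the change under test: tokenization runs on a worker thread.
    return await asyncio.to_thread(TOKENIZER.prepare_pretokenized, messages)


async def health():
    return {"status": "ok"}


async def check_health_not_blocked(sleep_s: float = 0.3) -> float:
    """Return how long health() took while a slow tokenization was in flight."""

    def slow(messages):
        time.sleep(sleep_s)
        return []

    with mock.patch.object(TOKENIZER, "prepare_pretokenized", side_effect=slow):
        start = time.perf_counter()
        completion = asyncio.create_task(chat_completion(["hi"]))
        await asyncio.sleep(0)  # let the completion start tokenizing
        await health()
        elapsed = time.perf_counter() - start
        await completion
    return elapsed


if __name__ == "__main__":
    elapsed = asyncio.run(check_health_not_blocked())
    assert elapsed < 0.15, "health was blocked by tokenization"
```

If chat_completion called the tokenizer inline instead of through asyncio.to_thread, elapsed would be at least sleep_s and the assertion would fail.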

Verified on-cluster against this prod tip: across the full session suite (test_sessions.py, test_session_pretokenized_e2e.py, test_session_race_conditions.py), run with and without this change, the only test whose result changes is the new one (red to green). Every other session test behaves identically with and without the change.

@DavidBellamy DavidBellamy requested a review from a team June 6, 2026 01:16
Comment thread .github/workflows/docker-build.yml Fixed
The session server runs as a single uvicorn worker (one asyncio event
loop). prepare_pretokenized and update_pretokenized_state do synchronous
CPU tokenization inline on that loop, so one trajectory's tokenization
blocks every other in-flight request. Wrap both calls in
asyncio.to_thread so tokenization runs in a worker thread and the event
loop stays responsive; fast tokenizers release the GIL, so concurrent
tokenizations also run in parallel.

Per-session state stays correct: both calls remain under session.lock
(held across the await), and the shared tito_tokenizer is read-only.

Adds a regression test asserting /health stays responsive while a slow
tokenization is in flight.
@DavidBellamy DavidBellamy force-pushed the session-server-offload-tokenization branch from 5d1385a to 006a203 Compare June 6, 2026 04:20
@flukeskywalker

@DavidBellamy what happens if the request is cancelled while the background thread is working? (This question was motivated by a discussion with codex)

DavidBellamy commented Jun 8, 2026

Collaborator Author

> @DavidBellamy what happens if the request is cancelled while the background thread is working? (This question was motivated by a discussion with codex)

I looked into this. No bug should happen as-is. When litellm cancels a request (e.g. timeout) it hangs up. So even though miles session server can't cancel a running thread, that just means that if litellm hangs up while a session is mid-tokenization (on this thread), the tokenization step will just finish. I don't think that's an issue?
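The general asyncio behavior referred to here can be shown with a small sketch. It is a standalone illustration, not session server code, and cancel_mid_tokenization and tokenize are hypothetical names. Cancelling a task that awaits asyncio.to_thread raises CancelledError in the awaiting coroutine, while the worker thread itself runs to completion.

```python
import asyncio
import threading
import time


async def cancel_mid_tokenization() -> tuple[bool, bool]:
    """Cancel a task awaiting to_thread; report (task_cancelled, thread_finished)."""
    finished = threading.Event()

    def tokenize() -> None:
        time.sleep(0.1)  # simulated tokenization work
        finished.set()

    task = asyncio.create_task(asyncio.to_thread(tokenize))
    await asyncio.sleep(0.02)  # give the worker thread time to start
    task.cancel()
    try:
        await task
        cancelled = False
    except asyncio.CancelledError:
        cancelled = True

    # The thread cannot be interrupted, so wait for it to finish on its own.
    await asyncio.sleep(0.2)
    return cancelled, finished.is_set()


if __name__ == "__main__":
    print(asyncio.run(cancel_mid_tokenization()))
```

In this sketch the awaiting coroutine unwinds as soon as it is cancelled, so any async with block around the await would exit while the thread is still working.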

rmfan commented Jun 9, 2026

Collaborator

We actually ran experiments last week with this implemented (as part of larger changes to the multi-session server). See #31. That PR has since been superseded by #33, which doesn't have this multi-threading.

@DavidBellamy

Collaborator Author

> We actually ran experiments last week with this implemented (as part of larger changes to the multi-session server). See #31. That PR has since been superseded by #33, which doesn't have this multi-threading.

I see the same code changes on #31. I guess that validates the changes on this PR. I also saw that #33 does not have these changes. I think it is cleanest to keep them as separate PRs. It keeps the scope of this PR tighter and easier to review.

@flukeskywalker

> > @DavidBellamy what happens if the request is cancelled while the background thread is working? (This question was motivated by a discussion with codex)
>
> I looked into this. No bug should happen as-is. When litellm cancels a request (e.g. timeout) it hangs up. So even though miles session server can't cancel a running thread, that just means that if litellm hangs up while a session is mid-tokenization (on this thread), the tokenization step will just finish. I don't think that's an issue?

Yes if litellm cancels, it should be okay. But what if miles tries to abort a trajectory because it is part of oversampling and it already has enough samples for the batch?

@DavidBellamy DavidBellamy merged commit 94b06f9 into prod Jun 9, 2026
15 of 17 checks passed
@DavidBellamy DavidBellamy deleted the session-server-offload-tokenization branch June 9, 2026 22:04
@DavidBellamy

Collaborator Author

> > > @DavidBellamy what happens if the request is cancelled while the background thread is working? (This question was motivated by a discussion with codex)
> >
> > I looked into this. No bug should happen as-is. When litellm cancels a request (e.g. timeout) it hangs up. So even though miles session server can't cancel a running thread, that just means that if litellm hangs up while a session is mid-tokenization (on this thread), the tokenization step will just finish. I don't think that's an issue?
>
> Yes if litellm cancels, it should be okay. But what if miles tries to abort a trajectory because it is part of oversampling and it already has enough samples for the batch?

Seems bug-free to me. Nothing cancels the miles session server mid-tokenization besides forced shutdown (SIGTERM), and in that case any race is moot because the process is dying.

Agent timeout/litellm hangup, trajectory timeout (which cancels the agent task), agent crash, and network reset all end the same way: the TCP connection to the miles session server is dropped, because that's how uvicorn behaves. So no race is possible.

This is how miles kills sessions deliberately (part of its oversampling logic):

  1. The miles rollout driver loop calls abort() which posts /abort_request directly to the sglang workers (bypasses miles session server) and waits for in-flight sessions to drain:
    abort_tasks = [post(f"{url}/abort_request", {"abort_all": True}) for url in urls]
  2. sglang returns the completion to the miles session server (via smg) with finish_reason="abort" and HTTP 200.
  3. The miles session server skips the if HTTP != 200 block.
  4. It records the partial turn into the session's history.
  5. It returns the reply to the agent unchanged.

Because of (1), the tokenization step (which this PR moves to a side thread/off the event loop) is never cancelled partway through by miles.
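A rough sketch of the response handling described in the steps above, under the assumption that an aborted completion arrives as HTTP 200 with finish_reason="abort". UpstreamResponse, Session, and handle_upstream are hypothetical names for illustration, and the response layout is assumed rather than taken from the miles code.

```python
from dataclasses import dataclass, field


@dataclass
class UpstreamResponse:
    """Hypothetical container for what the session server gets back upstream."""
    status: int
    body: dict


@dataclass
class Session:
    """Hypothetical session holding the recorded turns."""
    history: list[dict] = field(default_factory=list)


def handle_upstream(session: Session, resp: UpstreamResponse) -> dict:
    if resp.status != 200:
        # Error path. An abort does not take it, because it arrives as HTTP 200.
        return {"error": resp.body}
    # An aborted completion looks like any other: record the (partial) turn...
    session.history.append(resp.body["choices"][0])
    # ...and hand the reply back to the agent unchanged.
    return resp.body


if __name__ == "__main__":
    session = Session()
    aborted = UpstreamResponse(
        200,
        {"choices": [{"finish_reason": "abort", "message": {"content": "par"}}]},
    )
    reply = handle_upstream(session, aborted)
    print(reply is aborted.body, session.history[0]["finish_reason"])
```

Nothing in this path cancels the request handler, which is the point being made: the abort reaches the session server as an ordinary response rather than as a cancellation.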

@flukeskywalker

@DavidBellamy okay I see, this is fine!
