session-server: run TITO tokenization off the event loop #34
The session server runs as a single uvicorn worker (one asyncio event loop). prepare_pretokenized and update_pretokenized_state do synchronous CPU tokenization inline on that loop, so one trajectory's tokenization blocks every other in-flight request. Wrap both calls in asyncio.to_thread so tokenization runs in a worker thread and the event loop stays responsive; fast tokenizers release the GIL, so concurrent tokenizations also run in parallel. Per-session state stays correct: both calls remain under session.lock (held across the await), and the shared tito_tokenizer is read-only. Adds a regression test asserting /health stays responsive while a slow tokenization is in flight.
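The locking argument above can be illustrated with a minimal, self-contained sketch. It is not the session server's code: `Session`, `tokenize_blocking`, and `append_turn` are hypothetical stand-ins, and `time.sleep` substitutes for the real CPU tokenization. It only shows that holding an `asyncio.Lock` across an `await asyncio.to_thread(...)` keeps one session's updates in order while the blocking work runs off the event loop.

```python
import asyncio
import time


class Session:
    """Hypothetical per-session state: a lock plus an accumulated token list."""

    def __init__(self) -> None:
        self.lock = asyncio.Lock()
        self.tokens: list[int] = []


def tokenize_blocking(text: str) -> list[int]:
    """Stand-in for synchronous tokenization (assumption: real work is CPU-bound)."""
    time.sleep(0.05)
    return [ord(ch) for ch in text]


async def append_turn(session: Session, text: str) -> None:
    # The lock is held across the await, so turns for one session are
    # applied one at a time and in the order they acquired the lock.
    async with session.lock:
        new_tokens = await asyncio.to_thread(tokenize_blocking, text)
        session.tokens.extend(new_tokens)


async def main() -> list[int]:
    session = Session()
    # Both turns are submitted concurrently; the lock serializes them.
    await asyncio.gather(append_turn(session, "ab"), append_turn(session, "cd"))
    return session.tokens
```

Running `asyncio.run(main())` returns `[97, 98, 99, 100]`: the second turn waits on the lock until the first has finished extending the token list.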
@DavidBellamy what happens if the request is cancelled while the background thread is working? (This question was motivated by a discussion with Codex.)
I looked into this. No bug should occur as-is. When litellm cancels a request (e.g. on timeout), it hangs up. The miles session server can't cancel a running thread, so if litellm hangs up while a session is mid-tokenization on that thread, the tokenization step simply runs to completion. I don't think that's an issue?
I see the same code changes on #31. I guess that validates the changes on this PR. I also saw that #33 does not have these changes. I think it is cleanest to keep them as separate PRs. It keeps the scope of this PR tighter and easier to review.
Yes, if litellm cancels, it should be okay. But what if miles tries to abort a trajectory because it is part of oversampling and it already has enough samples for the batch?
Seems bug-free to me. Nothing cancels the miles session server mid-tokenization besides forced shutdown (SIGTERM), and in that case any race is moot because the process is dying. Agent timeout/litellm hangup, trajectory timeout (cancels the agent task), agent crash, network reset -> the TCP connection to the miles session server is dropped because that's how uvicorn behaves. So no race is possible. This is how miles kills sessions deliberately (part of its oversampling logic):
Because of (1), the tokenization step (which this PR moves to a side thread/off the event loop) is never cancelled partway through by miles.
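For reference, the general asyncio behavior behind this exchange can be reproduced in isolation. The sketch below is not taken from the session server; `slow_tokenize` and `cancel_while_thread_runs` are hypothetical names and `time.sleep` stands in for tokenization. It shows that cancelling a task that is awaiting `asyncio.to_thread` raises `CancelledError` in the task, while the function already running in the worker thread still runs to completion.

```python
import asyncio
import threading
import time


def slow_tokenize(done: threading.Event) -> None:
    """Stand-in for a tokenization call running in a worker thread."""
    time.sleep(0.5)
    done.set()  # marks that the thread ran to completion


async def cancel_while_thread_runs() -> tuple[bool, bool, bool]:
    done = threading.Event()
    task = asyncio.create_task(asyncio.to_thread(slow_tokenize, done))
    await asyncio.sleep(0.05)  # give the worker thread time to start

    task.cancel()
    try:
        await task
        cancelled = False
    except asyncio.CancelledError:
        cancelled = True

    finished_at_cancel = done.is_set()
    # Wait (off the loop) for the worker thread to finish its work.
    finished_eventually = await asyncio.to_thread(done.wait, 5.0)
    return cancelled, finished_at_cancel, finished_eventually
```

Under these assumptions the awaiting task sees the cancellation almost immediately, and the thread finishes on its own schedule afterwards. Whether that matters for a given server depends on whether anything actually cancels the handler mid-call, which is the point argued in the comments above.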
@DavidBellamy okay I see, this is fine!
What
The standalone session server runs as a single uvicorn worker (one asyncio event loop). On the chat-completions hot path, the two TITO tokenization calls, prepare_pretokenized and update_pretokenized_state, run synchronous CPU work directly on that loop. While one trajectory tokenizes, every other in-flight request on the server is blocked, including cheap calls like /health and create_session. With many concurrent agentic trajectories funneling through the one process, this serializes tokenization across every turn and surfaces as rising latency that can look like a downstream (sandbox/router) problem.
This wraps both calls in asyncio.to_thread, so tokenization runs on a worker thread and the event loop stays responsive. Fast (Rust) tokenizers release the GIL during encoding, so concurrent tokenizations also run in parallel across cores rather than serializing on one.
Correctness
Both calls remain under async with session.lock, held across the await, so each session still mutates its own trajectory single-threaded and in order.
The shared tito_tokenizer is read-only (encode / apply_chat_template), never reconfigured, so concurrent use across sessions is safe.
Test
Adds TestTokenizationOffload::test_slow_tokenization_does_not_block_event_loop: it patches prepare_pretokenized to sleep, fires a chat completion, and asserts /health returns well within the tokenization sleep window. It fails on the original code (event loop blocked) and passes with this change.
Verified on-cluster against this prod tip: across the full session suite (test_sessions.py, test_session_pretokenized_e2e.py, test_session_race_conditions.py), run with and without this change, the only test whose result changes is the new one (red to green). Every other session test behaves identically with and without the change.
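The idea behind the regression test can be sketched without the server. This is an illustration, not the test added in this PR: `slow_tokenize` and `health_latency` are hypothetical, a 0.3 s `time.sleep` stands in for the patched tokenization, and a 10 ms `asyncio.sleep` stands in for a cheap `/health` request. It measures how long the cheap call takes while the slow call is in flight, with and without `asyncio.to_thread`.

```python
import asyncio
import time


def slow_tokenize() -> None:
    """Stand-in for a slow, synchronous tokenization call."""
    time.sleep(0.3)


async def health_latency(offload: bool) -> float:
    """Return how long a cheap 10 ms coroutine takes while tokenization runs."""

    async def handler() -> None:
        if offload:
            # Offloaded: the event loop is free while the thread sleeps.
            await asyncio.to_thread(slow_tokenize)
        else:
            # Inline: the event loop is blocked for the full duration.
            slow_tokenize()

    start = time.perf_counter()
    task = asyncio.create_task(handler())
    await asyncio.sleep(0.01)  # stand-in for a cheap /health request
    elapsed = time.perf_counter() - start
    await task
    return elapsed
```

With `offload=False` the cheap call cannot complete until the inline sleep finishes, so the measured latency is at least the tokenization time. With `offload=True` it should complete in roughly its own 10 ms plus thread start-up overhead.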