session-server: run TITO tokenization off the event loop by DavidBellamy · Pull Request #34 · LLM360/miles

DavidBellamy · 2026-06-06T01:16:36Z

What

The standalone session server runs as a single uvicorn worker (one asyncio event loop). On the chat-completions hot path, the two TITO tokenization calls, prepare_pretokenized and update_pretokenized_state, run synchronous CPU work directly on that loop. While one trajectory tokenizes, every other in-flight request on the server is blocked, including cheap calls like /health and create_session. With many concurrent agentic trajectories funneling through the one process, this serializes tokenization across every turn and surfaces as rising latency that can look like a downstream (sandbox/router) problem.

This wraps both calls in asyncio.to_thread, so tokenization runs on a worker thread and the event loop stays responsive. Fast (Rust) tokenizers release the GIL during encoding, so concurrent tokenizations also run in parallel across cores rather than serializing on one.
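
A minimal sketch of the effect described above, not the PR's actual code: `tokenize_blocking`, `tokenize_inline`, `tokenize_offloaded`, and `timed_batch` are hypothetical names, and `time.sleep` stands in for the tokenizer because it also releases the GIL. Only `asyncio.to_thread` is taken from the PR.

```python
import asyncio
import time


def tokenize_blocking(text: str) -> list[int]:
    """Hypothetical stand-in for a synchronous tokenizer call."""
    time.sleep(0.2)  # simulated work; like a fast tokenizer, this releases the GIL
    return [ord(ch) for ch in text]


async def tokenize_inline(text: str) -> list[int]:
    # Runs directly on the event loop: nothing else can run until it returns.
    return tokenize_blocking(text)


async def tokenize_offloaded(text: str) -> list[int]:
    # Runs on a worker thread: the event loop is free while the thread works.
    return await asyncio.to_thread(tokenize_blocking, text)


async def timed_batch(tokenize, texts: list[str]) -> float:
    """Return the wall-clock seconds needed to tokenize all texts concurrently."""
    start = time.perf_counter()
    await asyncio.gather(*(tokenize(t) for t in texts))
    return time.perf_counter() - start
```

With four inputs, the inline variant should take roughly four times the per-call delay because the calls serialize on the loop, while the offloaded variant should take roughly one delay, for example `asyncio.run(timed_batch(tokenize_offloaded, ["a", "b", "c", "d"]))`.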

Correctness

Both calls stay inside async with session.lock, held across the await, so each session still mutates its own trajectory single-threaded and in order.
The shared tito_tokenizer is read-only (encode / apply_chat_template), never reconfigured, so concurrent use across sessions is safe.
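
The ordering argument can be illustrated with a small sketch. It assumes a hypothetical `Session` class and `tokenize_turn` function (the real session object and tokenizer calls differ); the only parts taken from the PR are the per-session lock and `asyncio.to_thread`.

```python
import asyncio
import time
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical per-session state: a lock plus a token history."""
    lock: asyncio.Lock = field(default_factory=asyncio.Lock)
    tokens: list[int] = field(default_factory=list)


def tokenize_turn(history: list[int], turn: int) -> list[int]:
    """Stand-in for synchronous tokenization; earlier turns are slower on purpose."""
    time.sleep(0.05 * (3 - turn))
    return history + [turn]


async def handle_turn(session: Session, turn: int) -> None:
    # The lock is held across the await, so a later turn for the same session
    # cannot read or write `tokens` until this turn has stored its result.
    async with session.lock:
        session.tokens = await asyncio.to_thread(tokenize_turn, session.tokens, turn)


async def run_session() -> list[int]:
    session = Session()  # created inside the running loop
    await asyncio.gather(*(handle_turn(session, t) for t in range(3)))
    return session.tokens
```

Without the lock, all three turns would read the same empty history and the slowest one would overwrite the others. With it, `asyncio.Lock` wakes waiters in the order they started waiting, so the turns apply in order even though the work happens on worker threads.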

Test

Adds TestTokenizationOffload::test_slow_tokenization_does_not_block_event_loop: patches prepare_pretokenized to sleep, fires a chat completion, and asserts /health returns well within the tokenization sleep window. It fails on the original code (event loop blocked) and passes with this change.
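
The shape of such a test can be sketched without the real server. Everything here is an assumption for illustration: `server`, `chat_completion`, and `health` are hypothetical stand-ins, and the actual test goes through the HTTP endpoints rather than calling coroutines directly.

```python
import asyncio
import time
from types import SimpleNamespace
from unittest.mock import patch

# Hypothetical stand-in for the server module that owns the tokenization call.
server = SimpleNamespace(prepare_pretokenized=lambda messages: [1, 2, 3])


async def chat_completion(messages):
    # Mirrors the change under test: tokenization runs on a worker thread.
    return await asyncio.to_thread(server.prepare_pretokenized, messages)


async def health():
    return {"status": "ok"}


def slow_tokenize(messages):
    time.sleep(0.5)  # the "tokenization sleep window"
    return [0]


async def health_latency_during_tokenization() -> float:
    """Fire a completion with slow tokenization and time a health check."""
    with patch.object(server, "prepare_pretokenized", side_effect=slow_tokenize):
        start = time.perf_counter()
        completion = asyncio.create_task(chat_completion(["hi"]))
        await asyncio.sleep(0.05)  # yield so the completion starts tokenizing
        status = await health()
        elapsed = time.perf_counter() - start
        await completion
    assert status == {"status": "ok"}
    return elapsed
```

The timer starts before the completion is fired, so if `chat_completion` called the tokenizer inline, the first yield would not return until the sleep finished and the measured latency would exceed the sleep window.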

Verified on-cluster against this prod tip: across the full session suite (test_sessions.py, test_session_pretokenized_e2e.py, test_session_race_conditions.py), run with and without this change, the only test whose result changes is the new one (red to green). Every other session test behaves identically with and without the change.

The session server runs as a single uvicorn worker (one asyncio event loop). prepare_pretokenized and update_pretokenized_state do synchronous CPU tokenization inline on that loop, so one trajectory's tokenization blocks every other in-flight request. Wrap both calls in asyncio.to_thread so tokenization runs in a worker thread and the event loop stays responsive; fast tokenizers release the GIL, so concurrent tokenizations also run in parallel. Per-session state stays correct: both calls remain under session.lock (held across the await), and the shared tito_tokenizer is read-only. Adds a regression test asserting /health stays responsive while a slow tokenization is in flight.

flukeskywalker · 2026-06-08T06:51:07Z

@DavidBellamy what happens if the request is cancelled while the background thread is working? (This question was motivated by a discussion with codex)

DavidBellamy · 2026-06-08T18:18:13Z

> @DavidBellamy what happens if the request is cancelled while the background thread is working? (This question was motivated by a discussion with codex)

I looked into this. No bug should happen as-is. When litellm cancels a request (e.g. timeout) it hangs up. So even though miles session server can't cancel a running thread, that just means that if litellm hangs up while a session is mid-tokenization (on this thread), the tokenization step will just finish. I don't think that's an issue?
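
The "thread just finishes" behaviour can be checked in isolation. This is a sketch under stated assumptions: `cancel_mid_tokenization` is a hypothetical helper and `time.sleep` stands in for tokenization; it says nothing about how litellm or the session server handle the hangup itself.

```python
import asyncio
import threading
import time


async def cancel_mid_tokenization() -> tuple[bool, bool]:
    """Cancel a task awaiting asyncio.to_thread and report what happened.

    Returns (task_was_cancelled, thread_still_finished).
    """
    finished = threading.Event()

    def tokenize_blocking() -> None:
        time.sleep(0.2)  # stand-in for tokenization work
        finished.set()

    task = asyncio.create_task(asyncio.to_thread(tokenize_blocking))
    await asyncio.sleep(0.05)  # let the worker thread start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass

    # A running thread cannot be interrupted, so wait for it off the loop.
    thread_finished = await asyncio.to_thread(finished.wait, 1.0)
    return task.cancelled(), thread_finished
```

The awaiting task is cancelled right away, but the function already running on the worker thread keeps going to completion.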

rmfan · 2026-06-09T00:30:07Z

We actually ran experiments last week with this implemented (as part of larger changes to the multi-session server). See #31. That PR has since been superseded by #33, which doesn't have this multi-threading.

DavidBellamy · 2026-06-09T00:36:56Z

> We actually ran experiments last week with this implemented (as part of larger changes to the multi-session server). See #31. That PR has since been superseded by #33, which doesn't have this multi-threading.

I see the same code changes on #31. I guess that validates the changes on this PR. I also saw that #33 does not have these changes. I think it is cleanest to keep them as separate PRs. It keeps the scope of this PR tighter and easier to review.

flukeskywalker · 2026-06-09T07:01:06Z

>> @DavidBellamy what happens if the request is cancelled while the background thread is working? (This question was motivated by a discussion with codex)

> I looked into this. No bug should happen as-is. When litellm cancels a request (e.g. timeout) it hangs up. So even though miles session server can't cancel a running thread, that just means that if litellm hangs up while a session is mid-tokenization (on this thread), the tokenization step will just finish. I don't think that's an issue?

Yes if litellm cancels, it should be okay. But what if miles tries to abort a trajectory because it is part of oversampling and it already has enough samples for the batch?

DavidBellamy · 2026-06-09T23:19:15Z

>>> @DavidBellamy what happens if the request is cancelled while the background thread is working? (This question was motivated by a discussion with codex)

>> I looked into this. No bug should happen as-is. When litellm cancels a request (e.g. timeout) it hangs up. So even though miles session server can't cancel a running thread, that just means that if litellm hangs up while a session is mid-tokenization (on this thread), the tokenization step will just finish. I don't think that's an issue?

> Yes if litellm cancels, it should be okay. But what if miles tries to abort a trajectory because it is part of oversampling and it already has enough samples for the batch?

Seems bug-free to me. Nothing cancels the miles session server mid-tokenization except a forced shutdown (SIGTERM), and in that case any race is moot because the process is dying.

Agent timeout/litellm hangup, trajectory timeout (cancels the agent task), agent crash, network reset -> the TCP connection to the miles session server is dropped because that's how uvicorn behaves. So no race is possible.

This is how miles kills sessions deliberately (part of its oversampling logic):

1. The miles rollout driver loop calls abort() which posts /abort_request directly to the sglang workers (bypasses miles session server) and waits for in-flight sessions to drain:

   miles/miles/rollout/sglang_rollout.py, line 348 in 94b06f9

       abort_tasks = [post(f"{url}/abort_request", {"abort_all": True}) for url in urls]

2. sglang returns the completion to the miles session server (via smg) with finish_reason="abort" and HTTP 200.

   Proof: abort leaves status code undefined -> FINISH_ABORT() serializes 'abort' type and None status_code -> tokenizer manager falls through to yield out when status_code is None -> OpenAIServingChat returns normal response -> status code is 200.

3. Then the miles session server:

   - Skips the if HTTP != 200 block.
   - Records the partial turn into the session's history.
   - Returns the reply to the agent unchanged.

Because of (1), the tokenization step (which this PR moves to a side thread/off the event loop) is never cancelled partway through by miles.
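
The session-server side of that flow can be sketched as follows. `SessionHistory` and `handle_upstream_response` are hypothetical names, the response shape follows the OpenAI chat-completions format, and the error branch is an assumption; the real handler is not shown in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class SessionHistory:
    """Hypothetical container for the turns recorded in one session."""
    turns: list[dict] = field(default_factory=list)


def handle_upstream_response(history: SessionHistory, status_code: int, body: dict):
    """Record a completion returned by the upstream and pass it through."""
    if status_code != 200:
        # Assumed error path: nothing is recorded and the error is passed on.
        return body, status_code
    # An aborted completion arrives with HTTP 200 and finish_reason="abort",
    # so it skips the block above and is recorded like any other turn.
    history.turns.append(body["choices"][0])
    return body, 200
```

Under these assumptions, an aborted generation looks like an ordinary successful response to the session server, so the partial turn is recorded and the reply is returned unchanged.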

flukeskywalker · 2026-06-09T23:39:45Z

@DavidBellamy okay I see, this is fine!

DavidBellamy requested a review from a team June 6, 2026 01:16

github-advanced-security AI found potential problems Jun 6, 2026

Comment thread .github/workflows/docker-build.yml Fixed

DavidBellamy force-pushed the session-server-offload-tokenization branch from 5d1385a to 006a203 Compare June 6, 2026 04:20

rmfan approved these changes Jun 9, 2026

Merge branch 'prod' into session-server-offload-tokenization

d066ac4

DavidBellamy merged commit 94b06f9 into prod Jun 9, 2026
15 of 17 checks passed

DavidBellamy deleted the session-server-offload-tokenization branch June 9, 2026 22:04

DavidBellamy merged 2 commits into prod from session-server-offload-tokenization
