
[tinker][SkyRL] Add sample and renderer API to new inference client#1287

Open
nithinvc wants to merge 38 commits into NovaSky-AI:main from nithinvc:nithinc/inference-server-sample

Conversation


@nithinvc nithinvc commented Mar 6, 2026

Summary

This PR adds the sample() and render_chat_completion() methods to RemoteInferenceClient and wires the new HTTP-based inference pathway into SkyRLTrainBackend, gated behind the _SKYRL_USE_NEW_INFERENCE env var.

Issues addressed: #1286 #1288

Changes

  • RemoteInferenceClient — Added sample() (fires num_samples parallel _generate_single calls and aggregates into a single InferenceEngineOutput) and render_chat_completion() (calls /v1/chat/completions/render to tokenize chat messages without generating). Tested against vllm 0.16.0.
  • SkyRLTrainBackend — Added _create_remote_inference_client() with the same 4-way branching logic as main_base.py (external proxy + servers, proxy only, servers only, fully internal ServerGroup + InferenceRouter). _ensure_inference_engines() now branches on _SKYRL_USE_NEW_INFERENCE to use the new HTTP client path.
  • test_engine_generation.py — Removed the guard on the sample API test. The new sample API passes.
  • test_save_weights_for_sampler.py — Fixed GPU test to pass tokenizer to run_inference and added gpu_memory_utilization=0.5 for colocated placement to avoid OOM when running on L4 GPUs.
  • CPU tests — Added mock /v1/chat/completions/render endpoint and test_render_chat_completion to test_remote_inference_client.py.
  • GPU tests — Added test_client_render_chat_completion to test_new_inference_generation.py.
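The aggregation pattern described for sample() above can be sketched as follows. This is a minimal illustration, not the repo's actual RemoteInferenceClient code; the function names, field names, and dict shape are assumptions standing in for _generate_single and InferenceEngineOutput.

```python
import asyncio
from typing import Any, Dict, List

async def _generate_single(prompt_ids: List[int], params: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for one HTTP generation request; a real client would POST to the
    # inference server here and parse its response.
    await asyncio.sleep(0)
    return {"response": "ok", "response_ids": prompt_ids + [42], "stop_reason": "stop"}

async def sample(prompt_ids: List[int], num_samples: int, params: Dict[str, Any]) -> Dict[str, Any]:
    # Fire num_samples generations concurrently and aggregate them into a
    # single output dict, mirroring the "parallel _generate_single calls" idea.
    results = await asyncio.gather(
        *(_generate_single(prompt_ids, params) for _ in range(num_samples))
    )
    return {
        "responses": [r["response"] for r in results],
        "response_ids": [r["response_ids"] for r in results],
        "stop_reasons": [r["stop_reason"] for r in results],
    }

out = asyncio.run(sample([1, 2, 3], num_samples=4, params={"temperature": 1.0}))
```

asyncio.gather preserves input order, so sample i in the output corresponds to the i-th request.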

Testing

 TINKER_API_KEY=tml-dummy uv run --with tinker --with datasets --with torch \
     python -m tinker_cookbook.recipes.rl_loop \
     base_url=http://localhost:8000 \
     model_name="Qwen/Qwen3-0.6B" \
     lora_rank=0
  • CPU unit tests pass for:
    • test_remote_inference_client.py - no regressions + render API tests
    • test_engine_generation.py - Sample API test
  • GPU unit tests pass for:
    • test_save_weights_for_sampler.py - GPU inference + weight syncing tests

Limitations

  • LoRA is not supported. lora_rank=0 is required. LoRA support will be added once we move to the native VLLM weight sync API.


gemini-code-assist[bot]

This comment was marked as resolved.

@nithinvc nithinvc marked this pull request as draft March 6, 2026 02:28
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.


@nithinvc nithinvc marked this pull request as ready for review March 6, 2026 02:37
gemini-code-assist[bot]

This comment was marked as resolved.

@nithinvc nithinvc marked this pull request as draft March 6, 2026 18:09
@nithinvc nithinvc changed the title [tinker] Add sample API to new inference client [tinker] Add sample and renderer API to new inference client Mar 6, 2026
return s.getsockname()[1]


def find_and_reserve_port(start_port: int) -> Tuple[int, socket.socket]:
Collaborator

@pcmoritz pcmoritz Mar 9, 2026


I think this would be best done in a separate PR (separate PRs are very useful, e.g. if one of the changes needs to be reverted, and they also make reviewing easier).

The same goes for the tests associated with this change, of course.

Author


Sounds good! I can split this into two: one for sample/render and one for the port collision fix.
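The port-collision change under discussion can be illustrated with a minimal sketch. This is an assumption about what find_and_reserve_port does (the diff only shows its signature): walk upward from start_port, and return the bound socket along with the port so the port stays reserved until the caller releases it.

```python
import socket
from typing import Tuple

def find_and_reserve_port(start_port: int, max_tries: int = 100) -> Tuple[int, socket.socket]:
    # Try binding each candidate port in turn; keeping the bound socket open
    # "reserves" the port so a concurrent process can't grab it before the
    # real server binds.
    for port in range(start_port, start_port + max_tries):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
            return port, s
        except OSError:
            s.close()  # port in use; try the next one
    raise RuntimeError(f"no free port in [{start_port}, {start_port + max_tries})")

port, sock = find_and_reserve_port(20000)
sock.close()  # release only once the real server is ready to bind
```

Returning the socket (rather than just the port number) is what closes the race window between "find a free port" and "start the server on it".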

session_id: Optional[Union[str, int]] = None,
) -> Dict[str, Any]:
"""
Render chat messages into a tokenized prompt via /v1/chat/completions/render.
Collaborator


This is very nice! I hadn't known about this endpoint; it looks very useful. For client-side rendering, I have found https://github.com/thinking-machines-lab/tinker-cookbook/tree/main/tinker_cookbook/renderers useful, but you are right to use the vllm endpoint for this PR, it will make it much easier to keep everything consistent :)

Author


I saw this earlier! It's very useful. I think it might make sense to model the training-side changes in a similar way, so there's little drift between the client renderer and the training backend.
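For concreteness, here is a sketch of the request and response shapes involved when talking to the render endpoint. The exact field layout of the response is an assumption and should be checked against your vLLM version; only the endpoint path and the "chat-completions body, no generation" behavior come from the discussion above.

```python
import json
from typing import Any, Dict, List

RENDER_PATH = "/v1/chat/completions/render"  # vLLM endpoint: tokenize chat messages without generating

def build_render_request(model: str, messages: List[Dict[str, str]]) -> Dict[str, Any]:
    # Same body shape as /v1/chat/completions; the server applies the chat
    # template and tokenizes, but generates no tokens.
    return {"model": model, "messages": messages}

def parse_render_response(body: str) -> List[int]:
    # Illustrative parse: pull prompt token ids out of the JSON response.
    # The field name "prompt_token_ids" is an assumption for this sketch.
    data = json.loads(body)
    return data["prompt_token_ids"]

req = build_render_request("Qwen/Qwen3-0.6B", [{"role": "user", "content": "Hello"}])
ids = parse_render_response('{"prompt": "<user>Hello</user>", "prompt_token_ids": [1, 15, 9]}')
```

Keeping tokenization on the server side is what makes the client renderer and the training backend agree by construction.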


@pytest.mark.asyncio
@pytest.mark.skipif(not _SKYRL_USE_NEW_INFERENCE, reason="Render API only supported with new inference client")
async def test_render_chat_completion(self, client):
Collaborator


For this we should also test if the result is actually correct / is what we expect.

Author


Added a check for the prompt and prompt token ids returned by the mock.

client = vllm_server.client
messages = [{"role": "user", "content": "Hello"}]
result = asyncio.run(client.render_chat_completion(messages=messages))
# vLLM returns [conversation, engine_prompts]
Collaborator


Again it would be good to have stricter checks here

Author


Added a local tokenizer check for the prompt ids.
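The "local tokenizer check" mentioned above can be sketched like this: apply the chat template locally and compare token ids against what the server returned. The helper names and response field names are assumptions for illustration, and the FakeTokenizer exists only so the sketch runs without downloading a model; a real test would use the HF tokenizer for the served model.

```python
from typing import Any, Dict, List

def check_prompt_ids(result: Dict[str, Any], tokenizer: Any, messages: List[Dict[str, str]]) -> None:
    # Cross-check: the server-rendered token ids should match a local
    # application of the same chat template.
    expected = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)
    assert result["prompt_token_ids"] == expected, (result["prompt_token_ids"], expected)

class FakeTokenizer:
    # Deterministic stand-in for a HF tokenizer's apply_chat_template.
    def apply_chat_template(self, messages, add_generation_prompt=True, tokenize=True):
        text = "".join(f"<{m['role']}>{m['content']}" for m in messages)
        return [ord(c) % 251 for c in text]

msgs = [{"role": "user", "content": "Hello"}]
tok = FakeTokenizer()
check_prompt_ids({"prompt_token_ids": tok.apply_chat_template(msgs)}, tok, msgs)
```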

@pcmoritz
Collaborator

It looks good to me overall, maybe @kouroshHakha has some more feedback :)

One thing that seems a little strange is that we need all of {sample, render_chat_completion, chat_completion, completion, tokenize, detokenize}. It seems to me we would always go through sample, and since we pretty much always want token in token out, it seems less useful to have endpoints that do both tokenization and processing like completion and chat_completion.

cc @CharlieFRuan since he has been thinking a lot about token in token out :)

@nithinvc nithinvc marked this pull request as ready for review March 10, 2026 02:08
@nithinvc nithinvc changed the title [tinker] Add sample and renderer API to new inference client [tinker][SkyRL] Add sample and renderer API to new inference client Mar 10, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces sample() and render_chat_completion() methods to the RemoteInferenceClient and integrates the new HTTP-based inference pathway into SkyRLTrainBackend, controlled by the _SKYRL_USE_NEW_INFERENCE environment variable. The changes are accompanied by new tests and updates to existing ones. My review found a couple of areas for improvement: an unused parameter in the new sample method and a configuration issue in a new GPU test that prevents it from testing the intended scenario.

nithinvc and others added 2 commits March 9, 2026 19:15
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Collaborator

@kouroshHakha kouroshHakha left a comment


Just one big comment:

Comment on lines +296 to +297
async def sample(
self,
Collaborator


I would make the interface of this consistent with how chat/completion are implemented. i.e. request_body goes in and then it is parsed out. The request_body should have the same API as the tinker sample api.

Re lora_id / model_id I think we should do this:

InferenceClient has a field called model_name that should be used for the apis. I looked at sample API spec from tinker and they don't need to define that there because the client carries the model name information. We should do a similar thing here.
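The request_body-style interface suggested above might look like the following sketch. The tinker-style field names ("prompt", "token_ids", "num_samples", "sampling_params") are assumptions for illustration; note that no model name appears in the request, since the client carries it.

```python
from typing import Any, Dict

class SampleRequestError(ValueError):
    """Raised when a sample request body is missing a required field."""

def parse_sample_request(request_body: Dict[str, Any]) -> Dict[str, Any]:
    # Accept a tinker-style sample request body and pull out what the client
    # needs. The model name is deliberately absent: the client's model_name
    # field supplies it, matching the suggestion above.
    try:
        return {
            "prompt_token_ids": request_body["prompt"]["token_ids"],
            "num_samples": request_body.get("num_samples", 1),
            "sampling_params": request_body.get("sampling_params", {}),
        }
    except KeyError as e:
        raise SampleRequestError(f"missing field: {e}") from e

parsed = parse_sample_request(
    {"prompt": {"token_ids": [1, 2, 3]}, "num_samples": 4,
     "sampling_params": {"temperature": 1.0, "max_tokens": 64}}
)
```

Parsing the body in one place keeps sample() consistent with how chat_completion and completion already take a request payload in and parse it out.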

Comment on lines +51 to +52
@pytest.fixture(scope="class")
def ray_env_with_new_inference():
Collaborator


don't we already have a conftest for this?

Author


Yes, in fact this test was redundant with some of the existing tests, so I removed it.

@nithinvc nithinvc marked this pull request as draft March 11, 2026 01:31
@nithinvc nithinvc marked this pull request as ready for review March 12, 2026 21:12
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new HTTP-based inference pathway by adding sample() and render_chat_completion() methods to RemoteInferenceClient and integrating it into SkyRLTrainBackend. The changes are gated by the _SKYRL_USE_NEW_INFERENCE environment variable, which is a good practice for introducing significant new functionality. The implementation is well-structured, and the addition of comprehensive unit and integration tests for the new APIs is commendable. I've identified a couple of areas for improvement in the RemoteInferenceClient, including a potential runtime type error and a best-practice violation regarding function side effects.

@nithinvc
Author

@kouroshHakha @pcmoritz I modified the sample method to follow the request_payload, similar to the completions method. The input request is expected to be the args to a tinker sampling client and the output is a dict version of SampleResponse. Could you guys take a look? Thank you!

devin-ai-integration[bot]

This comment was marked as resolved.

nithinvc and others added 2 commits March 12, 2026 14:21
…client.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.


Comment on lines +249 to +253
return RemoteInferenceClient(
proxy_url=proxy_url,
server_urls=server_urls,
model_name=self._cfg.trainer.policy.model.path,
)
Contributor


🔴 _create_remote_inference_client ignores served_model_name, causing request rejection when configured

_create_remote_inference_client always uses self._cfg.trainer.policy.model.path as model_name, but when served_model_name is configured in the inference engine config, the vLLM server only accepts that name (not the model path). This causes all data plane requests (sample, generate, chat_completion, etc.) to fail with a "model not found" error.

The old InferenceEngineClient correctly handles this at skyrl/backends/skyrl_train/inference_engines/inference_engine_client.py:68-70, and the test utility at tests/backends/skyrl_train/gpu/utils.py:512 also correctly uses served_model_name if served_model_name else cfg.trainer.policy.model.path. The production code here omits this logic.

Note: main_base.py:377 has the same pre-existing issue, which this code mirrors — but it should be fixed here nonetheless.

Suggested change
return RemoteInferenceClient(
proxy_url=proxy_url,
server_urls=server_urls,
model_name=self._cfg.trainer.policy.model.path,
)
ie_served_name = self._cfg.generator.inference_engine.served_model_name
return RemoteInferenceClient(
proxy_url=proxy_url,
server_urls=server_urls,
model_name=ie_served_name if ie_served_name else self._cfg.trainer.policy.model.path,
)
