
[tinker][SkyRL] Add sample and renderer API to new inference client#1287

Open
nithinvc wants to merge 38 commits into NovaSky-AI:main from nithinvc:nithinc/inference-server-sample

Conversation


@nithinvc nithinvc commented Mar 6, 2026

Summary

This PR adds the sample() and render_chat_completion() methods to RemoteInferenceClient and wires the new HTTP-based inference pathway into SkyRLTrainBackend, gated behind the _SKYRL_USE_NEW_INFERENCE env var.

Issues addressed: #1286 #1288

Changes

  • RemoteInferenceClient — Added sample() (fires num_samples parallel _generate_single calls and aggregates into a single InferenceEngineOutput) and render_chat_completion() (calls /v1/chat/completions/render to tokenize chat messages without generating). Tested against vllm 0.16.0.
  • SkyRLTrainBackend — Added _create_remote_inference_client() with the same 4-way branching logic as main_base.py (external proxy + servers, proxy only, servers only, fully internal ServerGroup + InferenceRouter). _ensure_inference_engines() now branches on _SKYRL_USE_NEW_INFERENCE to use the new HTTP client path.
  • test_engine_generation.py — Removed the guard on the sample API test. The new sample API passes.
  • test_save_weights_for_sampler.py — Fixed GPU test to pass tokenizer to run_inference and added gpu_memory_utilization=0.5 for colocated placement to avoid OOM when running on L4 GPUs.
  • CPU tests — Added mock /v1/chat/completions/render endpoint and test_render_chat_completion to test_remote_inference_client.py.
  • GPU tests — Added test_client_render_chat_completion to test_new_inference_generation.py.
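The aggregation pattern described for sample() above can be sketched as follows. This is a minimal illustration, not the repo's actual RemoteInferenceClient code; the function names, field names, and dict shape are assumptions standing in for _generate_single and InferenceEngineOutput.

```python
import asyncio
from typing import Any, Dict, List

async def _generate_single(prompt_ids: List[int], params: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for one HTTP generation request; a real client would POST to the
    # inference server here and parse its response.
    await asyncio.sleep(0)
    return {"response": "ok", "response_ids": prompt_ids + [42], "stop_reason": "stop"}

async def sample(prompt_ids: List[int], num_samples: int, params: Dict[str, Any]) -> Dict[str, Any]:
    # Fire num_samples generations concurrently and aggregate them into a
    # single output dict, mirroring the "parallel _generate_single calls" idea.
    results = await asyncio.gather(
        *(_generate_single(prompt_ids, params) for _ in range(num_samples))
    )
    return {
        "responses": [r["response"] for r in results],
        "response_ids": [r["response_ids"] for r in results],
        "stop_reasons": [r["stop_reason"] for r in results],
    }

out = asyncio.run(sample([1, 2, 3], num_samples=4, params={"temperature": 1.0}))
```

asyncio.gather preserves input order, so sample i in the output corresponds to the i-th request.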

Testing

 TINKER_API_KEY=tml-dummy uv run --with tinker --with datasets --with torch \
     python -m tinker_cookbook.recipes.rl_loop \
     base_url=http://localhost:8000 \
     model_name="Qwen/Qwen3-0.6B" \
     lora_rank=0
  • CPU unit tests pass for:
    • test_remote_inference_client.py - no regressions + render API tests
    • test_engine_generation.py - Sample API test
  • GPU unit tests pass for:
    • test_save_weights_for_sampler.py - GPU inference + weight syncing tests

Limitations

  • LoRA is not supported. lora_rank=0 is required. LoRA support will be added once we move to the native VLLM weight sync API.


gemini-code-assist[bot]

This comment was marked as resolved.

@nithinvc nithinvc marked this pull request as draft March 6, 2026 02:28
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.


@nithinvc nithinvc marked this pull request as ready for review March 6, 2026 02:37
gemini-code-assist[bot]

This comment was marked as resolved.

@nithinvc nithinvc marked this pull request as draft March 6, 2026 18:09
@nithinvc nithinvc changed the title [tinker] Add sample API to new inference client [tinker] Add sample and renderer API to new inference client Mar 6, 2026
return s.getsockname()[1]


def find_and_reserve_port(start_port: int) -> Tuple[int, socket.socket]:
Collaborator

@pcmoritz pcmoritz Mar 9, 2026


I think this would be best done in a separate PR (separate PRs are very useful, e.g. if one of the changes needs to be reverted, and they also make reviewing easier).

The same goes for the tests associated with this change, of course.

Author


Sounds good! I can split this into two: one for sample/render and one for the port collision fix.
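The port-collision change under discussion can be illustrated with a minimal sketch. This is an assumption about what find_and_reserve_port does (the diff only shows its signature): walk upward from start_port, and return the bound socket along with the port so the port stays reserved until the caller releases it.

```python
import socket
from typing import Tuple

def find_and_reserve_port(start_port: int, max_tries: int = 100) -> Tuple[int, socket.socket]:
    # Try binding each candidate port in turn; keeping the bound socket open
    # "reserves" the port so a concurrent process can't grab it before the
    # real server binds.
    for port in range(start_port, start_port + max_tries):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("127.0.0.1", port))
            return port, s
        except OSError:
            s.close()  # port in use; try the next one
    raise RuntimeError(f"no free port in [{start_port}, {start_port + max_tries})")

port, sock = find_and_reserve_port(20000)
sock.close()  # release only once the real server is ready to bind
```

Returning the socket (rather than just the port number) is what closes the race window between "find a free port" and "start the server on it".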

session_id: Optional[Union[str, int]] = None,
) -> Dict[str, Any]:
"""
Render chat messages into a tokenized prompt via /v1/chat/completions/render.
Collaborator


This is very nice! I hadn't known about this endpoint; it looks very useful. For client-side rendering, I have found https://github.com/thinking-machines-lab/tinker-cookbook/tree/main/tinker_cookbook/renderers useful, but you are right to use the vllm endpoint for this PR, it will make it much easier to keep everything consistent :)

Author


I saw this earlier! It's very useful. I think it might make sense to model the training-side changes in a similar way, so there's little drift between the client renderer and the training backend.
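For concreteness, here is a sketch of the request and response shapes involved when talking to the render endpoint. The exact field layout of the response is an assumption and should be checked against your vLLM version; only the endpoint path and the "chat-completions body, no generation" behavior come from the discussion above.

```python
import json
from typing import Any, Dict, List

RENDER_PATH = "/v1/chat/completions/render"  # vLLM endpoint: tokenize chat messages without generating

def build_render_request(model: str, messages: List[Dict[str, str]]) -> Dict[str, Any]:
    # Same body shape as /v1/chat/completions; the server applies the chat
    # template and tokenizes, but generates no tokens.
    return {"model": model, "messages": messages}

def parse_render_response(body: str) -> List[int]:
    # Illustrative parse: pull prompt token ids out of the JSON response.
    # The field name "prompt_token_ids" is an assumption for this sketch.
    data = json.loads(body)
    return data["prompt_token_ids"]

req = build_render_request("Qwen/Qwen3-0.6B", [{"role": "user", "content": "Hello"}])
ids = parse_render_response('{"prompt": "<user>Hello</user>", "prompt_token_ids": [1, 15, 9]}')
```

Keeping tokenization on the server side is what makes the client renderer and the training backend agree by construction.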


@pytest.mark.asyncio
@pytest.mark.skipif(not _SKYRL_USE_NEW_INFERENCE, reason="Render API only supported with new inference client")
async def test_render_chat_completion(self, client):
Collaborator


For this we should also test if the result is actually correct / is what we expect.

Author


Added a check for the prompt and prompt token ids returned by the mock.

client = vllm_server.client
messages = [{"role": "user", "content": "Hello"}]
result = asyncio.run(client.render_chat_completion(messages=messages))
# vLLM returns [conversation, engine_prompts]
Collaborator


Again it would be good to have stricter checks here

Author


Added a local tokenizer check for the prompt ids.
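The "local tokenizer check" mentioned above can be sketched like this: apply the chat template locally and compare token ids against what the server returned. The helper names and response field names are assumptions for illustration, and the FakeTokenizer exists only so the sketch runs without downloading a model; a real test would use the HF tokenizer for the served model.

```python
from typing import Any, Dict, List

def check_prompt_ids(result: Dict[str, Any], tokenizer: Any, messages: List[Dict[str, str]]) -> None:
    # Cross-check: the server-rendered token ids should match a local
    # application of the same chat template.
    expected = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)
    assert result["prompt_token_ids"] == expected, (result["prompt_token_ids"], expected)

class FakeTokenizer:
    # Deterministic stand-in for a HF tokenizer's apply_chat_template.
    def apply_chat_template(self, messages, add_generation_prompt=True, tokenize=True):
        text = "".join(f"<{m['role']}>{m['content']}" for m in messages)
        return [ord(c) % 251 for c in text]

msgs = [{"role": "user", "content": "Hello"}]
tok = FakeTokenizer()
check_prompt_ids({"prompt_token_ids": tok.apply_chat_template(msgs)}, tok, msgs)
```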

@pcmoritz
Collaborator

It looks good to me overall, maybe @kouroshHakha has some more feedback :)

One thing that seems a little strange is that we need all of {sample, render_chat_completion, chat_completion, completion, tokenize, detokenize}. It seems to me we would always go through sample, and since we pretty much always want token in token out, it seems less useful to have endpoints that do both tokenization and processing like completion and chat_completion.

cc @CharlieFRuan since he has been thinking a lot about token in token out :)

@nithinvc nithinvc marked this pull request as ready for review March 10, 2026 02:08
@nithinvc nithinvc changed the title [tinker] Add sample and renderer API to new inference client [tinker][SkyRL] Add sample and renderer API to new inference client Mar 10, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces sample() and render_chat_completion() methods to the RemoteInferenceClient and integrates the new HTTP-based inference pathway into SkyRLTrainBackend, controlled by the _SKYRL_USE_NEW_INFERENCE environment variable. The changes are accompanied by new tests and updates to existing ones. My review found a couple of areas for improvement: an unused parameter in the new sample method and a configuration issue in a new GPU test that prevents it from testing the intended scenario.

nithinvc and others added 2 commits March 9, 2026 19:15
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Collaborator

@kouroshHakha kouroshHakha left a comment


Just one big comment:

Comment on lines +296 to +297
async def sample(
self,
Collaborator


I would make the interface of this consistent with how chat/completion are implemented. i.e. request_body goes in and then it is parsed out. The request_body should have the same API as the tinker sample api.

Re lora_id / model_id I think we should do this:

InferenceClient has a field called model_name that should be used for the apis. I looked at sample API spec from tinker and they don't need to define that there because the client carries the model name information. We should do a similar thing here.
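The request_body-style interface suggested above might look like the following sketch. The tinker-style field names ("prompt", "token_ids", "num_samples", "sampling_params") are assumptions for illustration; note that no model name appears in the request, since the client carries it.

```python
from typing import Any, Dict

class SampleRequestError(ValueError):
    """Raised when a sample request body is missing a required field."""

def parse_sample_request(request_body: Dict[str, Any]) -> Dict[str, Any]:
    # Accept a tinker-style sample request body and pull out what the client
    # needs. The model name is deliberately absent: the client's model_name
    # field supplies it, matching the suggestion above.
    try:
        return {
            "prompt_token_ids": request_body["prompt"]["token_ids"],
            "num_samples": request_body.get("num_samples", 1),
            "sampling_params": request_body.get("sampling_params", {}),
        }
    except KeyError as e:
        raise SampleRequestError(f"missing field: {e}") from e

parsed = parse_sample_request(
    {"prompt": {"token_ids": [1, 2, 3]}, "num_samples": 4,
     "sampling_params": {"temperature": 1.0, "max_tokens": 64}}
)
```

Parsing the body in one place keeps sample() consistent with how chat_completion and completion already take a request payload in and parse it out.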

Comment on lines +51 to +52
@pytest.fixture(scope="class")
def ray_env_with_new_inference():
Collaborator


don't we already have a conftest for this?

Author


Yes, in fact this test was redundant with some of the existing tests, so I removed it.

@nithinvc nithinvc marked this pull request as draft March 11, 2026 01:31
@nithinvc nithinvc marked this pull request as ready for review March 12, 2026 21:12
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new HTTP-based inference pathway by adding sample() and render_chat_completion() methods to RemoteInferenceClient and integrating it into SkyRLTrainBackend. The changes are gated by the _SKYRL_USE_NEW_INFERENCE environment variable, which is a good practice for introducing significant new functionality. The implementation is well-structured, and the addition of comprehensive unit and integration tests for the new APIs is commendable. I've identified a couple of areas for improvement in the RemoteInferenceClient, including a potential runtime type error and a best-practice violation regarding function side effects.

@nithinvc
Author

@kouroshHakha @pcmoritz I modified the sample method to follow the request_payload, similar to the completions method. The input request is expected to be the args to a tinker sampling client and the output is a dict version of SampleResponse. Could you guys take a look? Thank you!

devin-ai-integration[bot]

This comment was marked as resolved.

nithinvc and others added 2 commits March 12, 2026 14:21
…client.py

Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 1 new potential issue.


Comment on lines +249 to +253
return RemoteInferenceClient(
proxy_url=proxy_url,
server_urls=server_urls,
model_name=self._cfg.trainer.policy.model.path,
)
Contributor


🔴 _create_remote_inference_client ignores served_model_name, causing request rejection when configured

_create_remote_inference_client always uses self._cfg.trainer.policy.model.path as model_name, but when served_model_name is configured in the inference engine config, the vLLM server only accepts that name (not the model path). This causes all data plane requests (sample, generate, chat_completion, etc.) to fail with a "model not found" error.

The old InferenceEngineClient correctly handles this at skyrl/backends/skyrl_train/inference_engines/inference_engine_client.py:68-70, and the test utility at tests/backends/skyrl_train/gpu/utils.py:512 also correctly uses served_model_name if served_model_name else cfg.trainer.policy.model.path. The production code here omits this logic.

Note: main_base.py:377 has the same pre-existing issue, which this code mirrors — but it should be fixed here nonetheless.

Suggested change
return RemoteInferenceClient(
proxy_url=proxy_url,
server_urls=server_urls,
model_name=self._cfg.trainer.policy.model.path,
)
ie_served_name = self._cfg.generator.inference_engine.served_model_name
return RemoteInferenceClient(
proxy_url=proxy_url,
server_urls=server_urls,
model_name=ie_served_name if ie_served_name else self._cfg.trainer.policy.model.path,
)
