
[pull] master from ray-project:master#4059

Merged
pull[bot] merged 19 commits into miqdigital:master from ray-project:master
Apr 17, 2026
Conversation


@pull pull bot commented Apr 17, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

bveeramani and others added 19 commits April 16, 2026 19:35
## Description
The "Registered dataset logger for dataset ..." message is
unconditionally emitted at INFO level on every dataset creation and
isn't actionable to users. This demotes it to DEBUG to reduce log noise.

## Related issues
None.

## Additional information
The concurrent-dataset warning (when a new dataset registers while
another is already active) is kept at INFO because it's less frequent
and more useful for debugging performance issues.
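The demote-to-DEBUG change described above boils down to a standard logging pattern; a minimal sketch (function name, message text, and the `active` bookkeeping are illustrative, not Ray Data's actual code):

```python
import logging

logger = logging.getLogger("ray.data")


def register_dataset_logger(dataset_id: str, active: set) -> None:
    # Routine registration: DEBUG, so it no longer spams default logs.
    logger.debug("Registered dataset logger for dataset %s", dataset_id)
    # Concurrent-dataset case: rarer and useful for perf debugging, so INFO.
    if active:
        logger.info(
            "Dataset %s created while %d other dataset(s) are active",
            dataset_id, len(active),
        )
    active.add(dataset_id)
```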

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Looking at the comment for `worker_oom_score_adjustment` in
ray_config_def.h:

/// A value to add to workers' OOM score adjustment, so that the OS
/// prioritizes killing these over the raylet. 0 or positive values only
/// (negative values require sudo permissions).

However, the code doesn't actually add this value to the worker's current
oom_score_adj; it simply sets it as the adjustment. This PR updates the
logic to match the documented behaviour.

## Related issues
When the raylet process has an oom_score_adj of -999, the worker
processes' oom_score_adj can currently only be set to 0, but it should be
possible to set it to -998.
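The "add to the current value" semantics can be sketched as below; the helper name is hypothetical, but the clamp range matches the valid Linux oom_score_adj interval of [-1000, 1000]:

```python
def adjusted_oom_score(current: int, adjustment: int) -> int:
    """Add `adjustment` to the current oom_score_adj, clamped to the
    kernel's valid [-1000, 1000] range (illustrative helper)."""
    return max(-1000, min(1000, current + adjustment))
```

With this, a raylet at -999 plus a worker adjustment of 1 yields -998 for the worker, as the comment in ray_config_def.h intends.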

---------

Signed-off-by: peterjc123 <peterghost86@gmail.com>
… item being marked (#62588)

When `MarkFootprintAsBusy` was called for `NODE_WORKERS`, it was
clearing the saved idle time for all WorkFootprint entries, including
`PULLING_TASK_ARGUMENTS`. This meant that if `PULLING_TASK_ARGUMENTS`
resolved after `NODE_WORKERS` had already become idle, it had no saved
time to restore and reset to `clock_.Now()`, bumping the node's
effective idle start forward and reducing `idle_duration_ms`.

The fix scopes the saved-state clear to only the footprint being marked
busy, so `PULLING_TASK_ARGUMENTS` can still restore its saved time
regardless of what `NODE_WORKERS` does. Fixes #62430
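The scoping fix reduces to clearing one map entry instead of the whole map; a minimal sketch with assumed data-structure names:

```python
# Saved idle-start times per WorkFootprint (illustrative shape).
saved_idle_start = {
    "NODE_WORKERS": 100.0,
    "PULLING_TASK_ARGUMENTS": 50.0,
}


def mark_footprint_busy(footprint: str) -> None:
    # Before the fix: saved_idle_start.clear()  # wiped every entry
    # After the fix: clear only the footprint being marked busy.
    saved_idle_start.pop(footprint, None)


mark_footprint_busy("NODE_WORKERS")
# PULLING_TASK_ARGUMENTS keeps its saved time and can restore it later.
```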

---------

Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com>
Co-authored-by: Sampan S Nayak <sampansnayak2@gmail.com>
…rePolicy (#62633)

## Description

`can_add_input` and `max_task_output_bytes_to_read` in
`DownstreamCapacityBackpressurePolicy` provide no visibility into when
backpressure decisions change. This adds debug logging on state
transitions in `_should_apply_backpressure` to help with debugging
without spamming logs.

Logs are only emitted when the backpressure state changes for an
operator (not every call), following the existing pattern from
`ConcurrencyCapBackpressurePolicy`. Each log includes: op name, old/new
state, queue ratio, queue size bytes, downstream capacity bytes, and
utilized budget fraction.
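The log-only-on-transition pattern can be sketched as follows (names and message format are illustrative, not the policy's exact code):

```python
# Last observed backpressure decision per operator.
_last_state = {}


def record_backpressure(op_name: str, applied: bool) -> list:
    """Return a log line only when the decision flips for this operator."""
    logs = []
    if _last_state.get(op_name) != applied:
        logs.append(f"{op_name}: backpressure {'on' if applied else 'off'}")
        _last_state[op_name] = applied
    return logs
```

Repeated calls with the same state emit nothing, which is what keeps the logs from being spammed on every scheduling step.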

## Related issues

N/A

## Additional information

N/A

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
building py314 mac wheels

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Summary

Add `serve.get_deployment_actor_context()` so deployment-scoped actors
can read their deployment metadata from runtime context instead of
receiving implicit constructor kwargs.

## Changes

- add `DeploymentActorContext` and export
`serve.get_deployment_actor_context()`
- inject internal deployment actor metadata into `runtime_env.env_vars`
at actor startup
- hydrate deployment actor context lazily from those reserved env vars
inside the actor process
- drop the old constructor-injection path and restore `capacity_queue`
examples/tests to explicit `deployment_id_*` wiring

## Testing

- `python -m pytest python/ray/serve/tests/unit/test_deployment_state.py
-k "injects_internal_deployment_context_env_vars or
preserves_user_env_vars_when_injecting_internal_context" -v`
- `python -m pytest python/ray/serve/tests/test_deployment_actors.py -k
"get_deployment_actor_context or
get_deployment_actor_outside_replica_raises" -v`
- `SKIP=shellcheck pre-commit run --files
doc/source/serve/doc_code/capacity_queue_request_router_app.py
python/ray/serve/__init__.py
python/ray/serve/_private/deployment_state.py python/ray/serve/api.py
python/ray/serve/context.py
python/ray/serve/tests/test_capacity_queue_router.py
python/ray/serve/tests/test_deployment_actors.py
python/ray/serve/tests/unit/test_deployment_state.py`

Fix #62397

---------

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
Signed-off-by: Ziy <tanziyi0925@gmail.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
…classes (#62661)

## Summary

Fixes flaky `test_deployment_actors` failures caused by
deployment-scoped actors being created with `max_concurrency=15,000` — a
value designed for the async Serve controller but incorrectly applied to
all deployment actors, including synchronous user-provided classes.

For synchronous Ray actors, `max_concurrency=N` creates **N OS threads**
via `BoundedExecutor` in the C++ runtime (`thread_pool.cc`). With
`N=15,000`, the worker process attempts 15,000 `pthread_create` calls on
startup, which under CI CPU contention causes either crashes or
multi-minute stalls.

### Root cause analysis

Investigation of two distinct postmerge build failures:

**Build 17042**
([logs](https://buildkite.com/ray-project/postmerge/builds/17042#019d8f88-649e-4020-b8d7-108fb250bc2a)):
- `SharedCounter` worker crashed during thread pool initialization:
  ```
  (SharedCounter pid=709094) thread_pool.cc:54: Check failed: !timed_out
  Failed to initialize threads in 10000 milliseconds
  ```
- The `BoundedExecutor` constructor couldn't create 15,000 threads
within its 10-second timeout, crashing the deployment actor worker
process.
- This caused the Serve controller's GCS calls to start timing out,
eventually leading to node death via missed heartbeats.

**Build 17033**
([logs](https://buildkite.com/ray-project/postmerge/builds/17033#019d8edb-2bbb-4959-bfd1-71e7e539bdf3)):
- Raylet, GCS, and dashboard were all healthy — no infrastructure-level
failure.
- `SharedCounter` deployment actor (with `num_cpus=0.01`) was submitted
to a worker but never completed initialization. GCS showed
`actor_to_create_callbacks_: 1` persistently across all debug dumps —
the actor creation task never finished.
- The worker was alive but stuck creating 15,000 threads with minimal
CPU allocation. The controller's `check_deployment_actors_ready()`
polled every 100ms returning `(False, None)` — no timeout, no error
escalation.
- `serve.run()` → `_wait_for_application_running(timeout_s=-1)` waited
indefinitely until pytest's 180s SIGALRM killed the test.

### Why 15,000 threads?

The code path (`deployment_state.py:189`):
```python
actor_options.setdefault("max_concurrency", CONTROLLER_MAX_CONCURRENCY)  # 15,000
```

For **async** actors (like `ServeController`), `max_concurrency` limits
concurrent coroutines via `FiberState` — lightweight, no thread
creation. Safe at any value.

For **synchronous** actors (like `SharedCounter`), Ray takes the
`BoundedExecutor` path (`task_receiver.cc:102-105`), which creates
exactly `max_concurrency` OS threads:
```cpp
// thread_pool.cc:37
for (int i = 0; i < max_concurrency; i++) {
    threads_.emplace_back(...);
}
```

The decision is binary based on `has_async_methods(actor_class)` — one
async method makes the whole actor use fibers; all-sync uses threads.

### Fix

Detect whether the deployment actor class is async or sync before
setting `max_concurrency`:
- **Async classes**: keep `max_concurrency=15,000` (coroutines via
`FiberState`, safe)
- **Sync classes**: cap at `max_concurrency=100` (OS threads via
`BoundedExecutor`, bounded)
- **User-specified** `max_concurrency` in `actor_options`: always
respected
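The detect-then-cap logic above can be sketched as follows; the constant values come from this PR, while the helper signature and `has_async_methods` reimplementation are illustrative:

```python
import inspect

ASYNC_MAX_CONCURRENCY = 15_000  # coroutines via FiberState: cheap
SYNC_MAX_CONCURRENCY = 100      # OS threads via BoundedExecutor: bounded


def has_async_methods(cls) -> bool:
    # One async method makes the whole actor use fibers (binary decision).
    return any(
        inspect.iscoroutinefunction(m)
        for _, m in inspect.getmembers(cls, inspect.isfunction)
    )


def default_max_concurrency(cls, actor_options: dict) -> int:
    if "max_concurrency" in actor_options:
        return actor_options["max_concurrency"]  # user value always wins
    if has_async_methods(cls):
        return ASYNC_MAX_CONCURRENCY
    return SYNC_MAX_CONCURRENCY
```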

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Migrate core CI Docker image from install-dependencies.sh + pip install
  to lock-file-based installs via `uv pip install --no-deps`, using the
  raydepsets framework
- Add ci_core.depsets.yaml defining three depsets: CPU-only
  (core_ci_depset), GPU (core_gpu_ci_depset), and CUDA 13.0 GPU
  (core_cu130_ci_depset) for Python 3.10 and 3.12
- Update core.build.Dockerfile and core.build.wanda.yaml to install from
  the generated lock file instead of resolving dependencies at image
  build time
- Create python/requirements/ml/py313/core-requirements.txt with upgraded
  mlflow>=3.0.0 for py313-targeted depsets, keeping the original
  core-requirements.txt aligned with master
- Bump mlflow from 2.22.0 to >=3.0.0 in py313/test-requirements.txt and
  update requirements_compiled_py3.13.txt accordingly


Postmerge build:
https://buildkite.com/ray-project/postmerge/builds/17058

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Summary

This PR fixes pip jax dependency issues in our dreambooth release test.
Original error:

```

Collecting jax==0.4.13 (from -r dreambooth/requirements.txt (line 7))
--
Downloading jax-0.4.13.tar.gz (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 59.5 MB/s eta 0:00:00
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'done'
Preparing metadata (pyproject.toml): started
Preparing metadata (pyproject.toml): finished with status 'done'
ERROR: Ignored the following yanked versions: 0.4.32
ERROR: Ignored the following versions that require a different python version: 0.11.0 Requires-Python >=3.11; 0.11.1 Requires-Python >=3.11; 0.11.2 Requires-Python >=3.11; 0.12.0 Requires-Python >=3.11; 0.12.1 Requires-Python >=3.11; 0.12.2 Requires-Python >=3.11; 0.12.3 Requires-Python >=3.11; 0.12.4 Requires-Python >=3.11; 0.12.5 Requires-Python >=3.11; 0.12.6 Requires-Python >=3.11; 0.7.0 Requires-Python >=3.11; 0.7.1 Requires-Python >=3.11; 0.7.2 Requires-Python >=3.11; 0.8.0 Requires-Python >=3.11; 0.8.1 Requires-Python >=3.11; 0.8.2 Requires-Python >=3.11; 0.8.3 Requires-Python >=3.11; 0.9.0 Requires-Python >=3.11; 0.9.0.1 Requires-Python >=3.11; 0.9.1 Requires-Python >=3.11; 0.9.2 Requires-Python >=3.11
ERROR: Could not find a version that satisfies the requirement jaxlib==0.4.13 (from versions: 0.4.17, 0.4.18, 0.4.19, 0.4.20, 0.4.21, 0.4.22, 0.4.23, 0.4.24, 0.4.25, 0.4.26, 0.4.27, 0.4.28, 0.4.29, 0.4.30, 0.4.31, 0.4.33, 0.4.34, 0.4.35, 0.4.36, 0.4.38, 0.5.0, 0.5.1, 0.5.3, 0.6.0, 0.6.1, 0.6.2)
ERROR: No matching distribution found for jaxlib==0.4.13
Subprocess return code: 1
```

I did the following:
1) Created new python3.9 conda env
2) Ran `pip-compile
release/air_examples/dreambooth/dreambooth/requirements.txt
--output-file=requirements.test.txt`. This indeed failed with
`pip._internal.exceptions.DistributionNotFound: No matching distribution
found for jaxlib==0.4.13`
3) Relaxed the `jax`, `jaxlib` and `flax` requirements in
requirements.txt
4) Ran `pip-compile
release/air_examples/dreambooth/dreambooth/requirements.txt
--output-file=requirements.test.txt` again. Here is the result:
https://gist.github.com/TimothySeah/5a5e345d581d0b7f97162a4e3a494edb
5) Copied the `jax`, `jaxlib`, and `flax` versions back to
`requirements.txt`

But then that resulted in some dependency issues still:
https://buildkite.com/ray-project/release/builds/88314#019d7a74-9760-4595-b080-72d0797c2f4b
i.e. `ImportError: cannot import name 'cached_download' from
'huggingface_hub'`

I repeated the process above but unpinned all the deps this time, which
produced this file:
https://gist.github.com/TimothySeah/58f928b7d1d611b5ff88d383934e2b5f. I
then updated requirements.txt to match. But that was even worse:
https://buildkite.com/ray-project/release/builds/88650#019d8961-6343-420f-a3cc-dbb33786b18f.

I went back to the first change but only downgraded huggingface_hub to a
version that contains `cached_download`. Now there's a version mismatch
between diffusers and jax:
https://buildkite.com/ray-project/release/builds/88671#019d899f-1699-4e39-8b10-9b38effda088

Taking a step back, I see that the `from versions` log mentions jax
0.4.17 onwards, implying that jax 0.4.13 is no longer supported on my
python/cuda setup. Therefore I tried changing it to jax 0.4.17, which
produced this requirements file:
https://gist.github.com/TimothySeah/465b0b09d46531216fbd9162a183738e.
Back to the cached_download error:
https://buildkite.com/ray-project/release/builds/88884#019d8e89-b998-48b1-8d8d-7761de4ca484.

Now I'm trying an older `huggingface_hub` that supports
`cached_download`.



# Testing

Release tests with `name:air_example_dreambooth_finetuning.*`

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
- Remove hf_transfer dependency from all notebooks, train configs, CI
  build scripts, and BYOD scripts
- Add new run_as_job.ipynb notebook with step-by-step guide for running
  LlamaFactory as an Anyscale job
- Add job-configs/job.yaml for standalone job submission
- Simplify file-copy steps in SFT and CPT notebooks
- Replace inline YAML/JSON config blocks in notebook markdown with cat
  code cells

---------

Signed-off-by: as-jding <jding@anyscale.com>
Signed-off-by: Jason Ding <jding@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…62687)

The four :crane: publish: steps in build.rayci.yml run inside the forge
docker image (via the default docker plugin), but the dependency on the
forge wanda step was implicit — not declared via depends_on. In the
default unpruned pipeline this works because rayci includes every step
regardless, but under RAYCI_SELECT pruning (where only the selected step
and its transitive upstream deps are emitted), the forge step gets
filtered out and the publish step crashes with "manifest for
citemp:{build_id}-forge not found".

Making the dep explicit lets rayci's upstream dep walker include forge
in the pruned pipeline whenever any publish step is selected.

Signed-off-by: andrew <andrew@anyscale.com>
bumping min python ver in pyproject.toml

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…corn versions (#62529)

## Description
This change reverts the [PR](#57555) that shipped as part of Ray 2.54. I
[did](#57555 (comment)) my best to test that PR, but it apparently
introduces issues.

## Related issues
Reopens #55776. Relates to kuberay
[issue](ray-project/kuberay#4494).

## Additional information

Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
…amination and state loss for skipped deployments (#62484)

## Description
Changes to fix #62482
- autoscaling_policy.py:
  a) Copies the user state before merging.
  b) Calls `_merge_user_state_with_internal_state` for skipped
  deployments instead of returning the raw internal state.
- test_autoscaling_policy.py : Unit test which verifies two deployments
with different internal state don't cross contaminate when the user
policy returns a shared dict.
- test_application_state.py : Unit test which verifies user state
persists across control loop iterations for a deployment the policy
skips.
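Fix (a) boils down to copying before merging so that a shared dict returned by a user policy can't be mutated across deployments; a minimal sketch with an assumed helper name:

```python
import copy


def merge_user_state_with_internal_state(user_state: dict,
                                         internal_state: dict) -> dict:
    """Deep-copy the (possibly shared) user-policy dict before merging
    per-deployment internal state into it (illustrative helper)."""
    merged = copy.deepcopy(user_state)
    merged.update(internal_state)
    return merged


shared = {"target": 3}  # same dict returned for every deployment
d1 = merge_user_state_with_internal_state(shared, {"delay": 1})
d2 = merge_user_state_with_internal_state(shared, {"delay": 9})
# Each deployment gets its own merged state; the shared dict is untouched.
```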

## Related issues
> Fixes: #62482
> Related to: #58857

---------

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Co-authored-by: abrar <abrar@anyscale.com>
…nto a single helper (#62452)

## Description
This PR refactors duplicated internal policy state extraction (delay
counter and timestamp keys) into `_extract_internal_policy_state` in
autoscaling_policy.py. Follow up to #58857.



### Related Issues
> Related to  #58857

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Co-authored-by: harshit-anyscale <harshit@anyscale.com>
## Why

Fixes #62358.

For non-gRPC Serve requests, `Replica.handle_request*` currently does:

```python
raise self._maybe_wrap_grpc_exception(e, request_metadata) from e
```

When `_maybe_wrap_grpc_exception()` returns the original exception
unchanged, this degenerates into `raise e from e`, creating a
self-referential `__cause__` chain (`e.__cause__ is e`).

That breaks error-reporting tools that walk `__cause__` recursively and
can also confuse debugging and logging paths for normal HTTP and
handle-based requests.

## Root cause

This regression came from `#60482` (`221a19395a`), which unified
exception handling across the `handle_request*` paths so gRPC requests
could preserve user-set status codes.

That behavior is correct when a new gRPC wrapper exception is created.
It is wrong for non-gRPC requests, because
`_maybe_wrap_grpc_exception()` just returns `e`, so the code effectively
executes `raise e from e`.

## Fix

Introduce a small `_raise_user_exception()` helper in `replica.py`:

- if `_maybe_wrap_grpc_exception()` returns the original exception,
raise it directly
- if it returns a wrapped gRPC exception, raise the wrapped exception
`from e`

This preserves the intended gRPC chaining behavior while eliminating the
circular `__cause__` chain on non-gRPC paths.

## Testing

```bash
source /home/simple/github/ray/.venv/bin/activate
PYTHONPATH=/home/simple/github/ray/.worktrees/fix-issue-62358-replica-exception-chaining/python \
python -m pytest python/ray/serve/tests/unit/test_replica_exception_chaining.py -v
```

---------

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
## Description
The PR adds a new Prometheus histogram metric
`serve_objref_resolution_latency_ms` that tracks how long the Serve
router spends resolving upstream DeploymentResponse arguments before a
request enters the routing queue.
This gives visibility into resolution wait time that was previously
hidden as part of `fulfillment_time_ms`.
## Related issues
Fixes #62286 

## Additional information
Used the following reproduction script that compares two cases: one
passing an unresolved DeploymentResponse as an argument and another
passing a plain dict, to isolate the resolution overhead.
```python
import asyncio
import time

from ray import serve


@serve.deployment(num_replicas=1)
class SlowDependency:
    async def __call__(self, request=None):
        await asyncio.sleep(2)
        return {"source": "upstream", "ts": time.time()}

@serve.deployment(num_replicas=1)
class DownstreamUnresolved:
    async def __call__(self, upstream_result):
        return {
            "downstream_received": upstream_result,
            "processed_at": time.time(),
        }


@serve.deployment(num_replicas=1)
class PipelineUnresolved:
    def __init__(self, upstream_handle, downstream_handle):
        self.upstream = upstream_handle
        self.downstream = downstream_handle

    async def __call__(self, request):
        upstream_resp = self.upstream.remote()
        return await self.downstream.remote(upstream_resp)

@serve.deployment(num_replicas=1)
class DownstreamPreresolved:
    async def __call__(self, upstream_result):
        return {
            "downstream_received": upstream_result,
            "processed_at": time.time(),
        }


@serve.deployment(num_replicas=1)
class PipelinePreresolved:
    def __init__(self, upstream_handle, downstream_handle):
        self.upstream = upstream_handle
        self.downstream = downstream_handle

    async def __call__(self, request):
        upstream_result = await self.upstream.remote()
        return await self.downstream.remote(upstream_result)

up_a = SlowDependency.bind()
down_a = DownstreamUnresolved.bind()
pipe_a = PipelineUnresolved.bind(up_a, down_a)
serve.run(pipe_a, name="pipeline_unresolved", route_prefix="/pipeline-unresolved")

up_b = SlowDependency.bind()
down_b = DownstreamPreresolved.bind()
pipe_b = PipelinePreresolved.bind(up_b, down_b)
serve.run(pipe_b, name="pipeline_preresolved", route_prefix="/pipeline-preresolved")

```

 
<img width="690" height="66" alt="image"
src="https://github.com/user-attachments/assets/e9e9b05c-2b9b-420c-914f-de1420410ad3"
/>

---------

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Co-authored-by: harshit-anyscale <harshit@anyscale.com>
removing python/requirements/ml/py313/dl-cpu-requirements.txt from
docgpu depset due to conflicts mentioned
[here](#62595)

postmerge run: https://buildkite.com/ray-project/postmerge/builds/17056

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
@pull pull bot locked and limited conversation to collaborators Apr 17, 2026
@pull pull bot added the ⤵️ pull label Apr 17, 2026
@pull pull bot merged commit e14b30f into miqdigital:master Apr 17, 2026