[pull] master from ray-project:master#4059
Merged
pull[bot] merged 19 commits into miqdigital:master from ray-project:master on Apr 17, 2026
Conversation
## Description

The "Registered dataset logger for dataset ..." message is unconditionally emitted at INFO level on every dataset creation and isn't actionable for users. This demotes it to DEBUG to reduce log noise.

## Related issues

None.

## Additional information

The concurrent-dataset warning (when a new dataset registers while another is already active) is kept at INFO because it's less frequent and more useful for debugging performance issues.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
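The INFO/DEBUG split described above can be sketched as follows. This is a minimal illustration, not Ray Data's actual internals; the function name and the `active_datasets` bookkeeping are assumptions for the example.

```python
import logging

logger = logging.getLogger("ray.data")


def register_dataset_logger(dataset_id: str, active_datasets: set) -> None:
    # Routine registration message goes to DEBUG: it fires on every
    # dataset creation and is not actionable for users.
    logger.debug("Registered dataset logger for dataset %s", dataset_id)
    # The concurrent-dataset warning stays at INFO: it is rarer and
    # useful when debugging performance issues.
    if active_datasets:
        logger.info(
            "Dataset %s registered while %d other dataset(s) are active",
            dataset_id,
            len(active_datasets),
        )
    active_datasets.add(dataset_id)
```

With default logger configuration (INFO level), only the concurrent-dataset message is visible; users who want the registration message can opt in by enabling DEBUG.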
Looking at the comment for `worker_oom_score_adjustment` in `ray_config_def.h`, it says: "A value to add to workers' OOM score adjustment, so that the OS prioritizes killing these over the raylet. 0 or positive values only (negative values require sudo permissions)." But the code doesn't actually add the value to the current score; it just sets it as the OOM score adjustment outright. I updated the logic to correctly reflect the documented behavior. ## Related issues When the raylet process has an `oom_score_adj` of -999, the `oom_score_adj` of the worker processes can currently only be set to 0, but it should be possible to set it to -998. --------- Signed-off-by: peterjc123 <peterghost86@gmail.com>
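The intended additive semantics can be sketched as follows. This is an illustration of the documented behavior, not the actual raylet code; note the Linux kernel clamps `oom_score_adj` to the range [-1000, 1000].

```python
def worker_oom_score_adj(raylet_oom_score_adj: int, adjustment: int) -> int:
    """Add the configured adjustment to the raylet's current OOM score
    adjustment, rather than using the adjustment as an absolute value.

    The result is clamped to the kernel's valid oom_score_adj range.
    """
    return max(-1000, min(1000, raylet_oom_score_adj + adjustment))
```

Under these semantics, a raylet running at -999 with a worker adjustment of 1 yields -998 for workers, matching the scenario in the related issue.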
… item being marked (#62588) When `MarkFootprintAsBusy` was called for `NODE_WORKERS`, it was clearing the saved idle time for all WorkFootprint entries, including `PULLING_TASK_ARGUMENTS`. This meant that if `PULLING_TASK_ARGUMENTS` resolved after `NODE_WORKERS` had already become idle, it had no saved time to restore and reset to `clock_.Now()`, bumping the node's effective idle start forward and reducing `idle_duration_ms`. The fix scopes the saved-state clear to only the footprint being marked busy, so `PULLING_TASK_ARGUMENTS` can still restore its saved time regardless of what `NODE_WORKERS` does. Fixes #62430 --------- Signed-off-by: Samaresh Kumar Singh <ssam3003@gmail.com> Co-authored-by: Sampan S Nayak <sampansnayak2@gmail.com>
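The scoping fix above can be sketched with a minimal tracker. The class and method names here are illustrative, not Ray's actual `WorkFootprint` bookkeeping; the key point is that `mark_busy` pops only one entry instead of clearing the whole map.

```python
class FootprintIdleTracker:
    """Minimal sketch: saved idle-start times keyed by footprint."""

    def __init__(self):
        self._saved_idle_start = {}  # footprint name -> saved idle-start time

    def mark_busy(self, footprint: str) -> None:
        # Before the fix this was effectively self._saved_idle_start.clear(),
        # which also wiped PULLING_TASK_ARGUMENTS' saved time when
        # NODE_WORKERS became busy.
        self._saved_idle_start.pop(footprint, None)

    def mark_idle(self, footprint: str, now: float) -> float:
        # Restore the saved idle start if one exists; otherwise start from now.
        self._saved_idle_start.setdefault(footprint, now)
        return self._saved_idle_start[footprint]
```

With the scoped clear, `PULLING_TASK_ARGUMENTS` keeps its earlier idle-start time even after `NODE_WORKERS` cycles through busy, so the node's `idle_duration_ms` is not artificially reduced.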
…rePolicy (#62633)

## Description

`can_add_input` and `max_task_output_bytes_to_read` in `DownstreamCapacityBackpressurePolicy` provide no visibility into when backpressure decisions change. This adds debug logging on state transitions in `_should_apply_backpressure` to help with debugging without spamming logs.

Logs are only emitted when the backpressure state changes for an operator (not on every call), following the existing pattern from `ConcurrencyCapBackpressurePolicy`. Each log includes: op name, old/new state, queue ratio, queue size bytes, downstream capacity bytes, and utilized budget fraction.

## Related issues

N/A

## Additional information

N/A

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
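The log-on-transition pattern can be sketched as follows. The class name and log fields here are illustrative (a subset of what the PR describes), not the actual policy code.

```python
import logging

logger = logging.getLogger(__name__)


class BackpressureTransitionLogger:
    """Emit a debug log only when an operator's backpressure decision flips."""

    def __init__(self):
        self._last_state = {}  # op name -> last backpressure decision (bool)

    def record(self, op_name: str, backpressured: bool, queue_ratio: float) -> bool:
        """Return True if this call logged a transition."""
        changed = self._last_state.get(op_name) != backpressured
        if changed:
            logger.debug(
                "Backpressure for %s: %s -> %s (queue ratio %.2f)",
                op_name,
                self._last_state.get(op_name),
                backpressured,
                queue_ratio,
            )
            self._last_state[op_name] = backpressured
        return changed
```

Because the last decision is cached per operator, steady-state calls are silent, which is what keeps the logs from being spammy on hot paths.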
building py314 mac wheels Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Summary

Add `serve.get_deployment_actor_context()` so deployment-scoped actors can read their deployment metadata from runtime context instead of receiving implicit constructor kwargs.

## Changes

- add `DeploymentActorContext` and export `serve.get_deployment_actor_context()`
- inject internal deployment actor metadata into `runtime_env.env_vars` at actor startup
- hydrate deployment actor context lazily from those reserved env vars inside the actor process
- drop the old constructor-injection path and restore `capacity_queue` examples/tests to explicit `deployment_id_*` wiring

## Testing

- `python -m pytest python/ray/serve/tests/unit/test_deployment_state.py -k "injects_internal_deployment_context_env_vars or preserves_user_env_vars_when_injecting_internal_context" -v`
- `python -m pytest python/ray/serve/tests/test_deployment_actors.py -k "get_deployment_actor_context or get_deployment_actor_outside_replica_raises" -v`
- `SKIP=shellcheck pre-commit run --files doc/source/serve/doc_code/capacity_queue_request_router_app.py python/ray/serve/__init__.py python/ray/serve/_private/deployment_state.py python/ray/serve/api.py python/ray/serve/context.py python/ray/serve/tests/test_capacity_queue_router.py python/ray/serve/tests/test_deployment_actors.py python/ray/serve/tests/unit/test_deployment_state.py`

Fix #62397

---------

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
Signed-off-by: Ziy <tanziyi0925@gmail.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
…classes (#62661)

## Summary

Fixes flaky `test_deployment_actors` failures caused by deployment-scoped actors being created with `max_concurrency=15,000` — a value designed for the async Serve controller but incorrectly applied to all deployment actors, including synchronous user-provided classes.

For synchronous Ray actors, `max_concurrency=N` creates **N OS threads** via `BoundedExecutor` in the C++ runtime (`thread_pool.cc`). With `N=15,000`, the worker process attempts 15,000 `pthread_create` calls on startup, which under CI CPU contention causes either crashes or multi-minute stalls.

### Root cause analysis

Investigation of two distinct postmerge build failures:

**Build 17042** ([logs](https://buildkite.com/ray-project/postmerge/builds/17042#019d8f88-649e-4020-b8d7-108fb250bc2a)):

- `SharedCounter` worker crashed during thread pool initialization:
  ```
  (SharedCounter pid=709094) thread_pool.cc:54: Check failed: !timed_out Failed to initialize threads in 10000 milliseconds
  ```
- The `BoundedExecutor` constructor couldn't create 15,000 threads within its 10-second timeout, crashing the deployment actor worker process.
- This caused the Serve controller's GCS calls to start timing out, eventually leading to node death via missed heartbeats.

**Build 17033** ([logs](https://buildkite.com/ray-project/postmerge/builds/17033#019d8edb-2bbb-4959-bfd1-71e7e539bdf3)):

- Raylet, GCS, and dashboard were all healthy — no infrastructure-level failure.
- `SharedCounter` deployment actor (with `num_cpus=0.01`) was submitted to a worker but never completed initialization. GCS showed `actor_to_create_callbacks_: 1` persistently across all debug dumps — the actor creation task never finished.
- The worker was alive but stuck creating 15,000 threads with minimal CPU allocation. The controller's `check_deployment_actors_ready()` polled every 100ms returning `(False, None)` — no timeout, no error escalation.
- `serve.run()` → `_wait_for_application_running(timeout_s=-1)` waited indefinitely until pytest's 180s SIGALRM killed the test.

### Why 15,000 threads?

The code path (`deployment_state.py:189`):

```python
actor_options.setdefault("max_concurrency", CONTROLLER_MAX_CONCURRENCY)  # 15,000
```

For **async** actors (like `ServeController`), `max_concurrency` limits concurrent coroutines via `FiberState` — lightweight, no thread creation. Safe at any value.

For **synchronous** actors (like `SharedCounter`), Ray takes the `BoundedExecutor` path (`task_receiver.cc:102-105`), which creates exactly `max_concurrency` OS threads:

```cpp
// thread_pool.cc:37
for (int i = 0; i < max_concurrency; i++) {
  threads_.emplace_back(...);
}
```

The decision is binary based on `has_async_methods(actor_class)` — one async method makes the whole actor use fibers; all-sync uses threads.

### Fix

Detect whether the deployment actor class is async or sync before setting `max_concurrency`:

- **Async classes**: keep `max_concurrency=15,000` (coroutines via `FiberState`, safe)
- **Sync classes**: cap at `max_concurrency=100` (OS threads via `BoundedExecutor`, bounded)
- **User-specified** `max_concurrency` in `actor_options`: always respected

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: harshit <harshit@anyscale.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
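The selection logic in the fix can be sketched as follows. The 15,000 and 100 values and the `has_async_methods` check come from the PR description; `pick_max_concurrency` is an illustrative wrapper, not the actual `deployment_state.py` code.

```python
import inspect

CONTROLLER_MAX_CONCURRENCY = 15_000  # async controller default (per the PR)
SYNC_ACTOR_MAX_CONCURRENCY = 100     # cap for synchronous classes (per the PR)


def has_async_methods(actor_class) -> bool:
    """Ray's binary check: one coroutine method makes the whole actor async."""
    return any(
        inspect.iscoroutinefunction(method)
        for _, method in inspect.getmembers(actor_class, inspect.isfunction)
    )


def pick_max_concurrency(actor_class, actor_options: dict) -> int:
    # A user-specified value is always respected.
    if "max_concurrency" in actor_options:
        return actor_options["max_concurrency"]
    # Async classes multiplex coroutines on fibers (cheap at any value);
    # sync classes get real OS threads, so the cap must be bounded.
    if has_async_methods(actor_class):
        return CONTROLLER_MAX_CONCURRENCY
    return SYNC_ACTOR_MAX_CONCURRENCY
```

This mirrors the binary fiber-vs-thread decision described above: the expensive resource (an OS thread per unit of concurrency) only exists on the all-sync path, so that is the only path that needs the cap.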
- Migrate core CI Docker image from install-dependencies.sh + pip install to lock-file-based installs via `uv pip install --no-deps`, using the raydepsets framework
- Add ci_core.depsets.yaml defining three depsets: CPU-only (core_ci_depset), GPU (core_gpu_ci_depset), and CUDA 13.0 GPU (core_cu130_ci_depset) for Python 3.10 and 3.12
- Update core.build.Dockerfile and core.build.wanda.yaml to install from the generated lock file instead of resolving dependencies at image build time
- Create python/requirements/ml/py313/core-requirements.txt with upgraded mlflow>=3.0.0 for py313-targeted depsets, keeping the original core-requirements.txt aligned with master
- Bump mlflow from 2.22.0 to >=3.0.0 in py313/test-requirements.txt and update requirements_compiled_py3.13.txt accordingly

Postmerge build: https://buildkite.com/ray-project/postmerge/builds/17058

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Summary

This PR fixes pip jax dependency issues in our dreambooth release test. Original error:

```
Collecting jax==0.4.13 (from -r dreambooth/requirements.txt (line 7))
  Downloading jax-0.4.13.tar.gz (1.3 MB)
     1.3/1.3 MB 59.5 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
ERROR: Ignored the following yanked versions: 0.4.32
ERROR: Ignored the following versions that require a different python version: 0.11.0 Requires-Python >=3.11; 0.11.1 Requires-Python >=3.11; 0.11.2 Requires-Python >=3.11; 0.12.0 Requires-Python >=3.11; 0.12.1 Requires-Python >=3.11; 0.12.2 Requires-Python >=3.11; 0.12.3 Requires-Python >=3.11; 0.12.4 Requires-Python >=3.11; 0.12.5 Requires-Python >=3.11; 0.12.6 Requires-Python >=3.11; 0.7.0 Requires-Python >=3.11; 0.7.1 Requires-Python >=3.11; 0.7.2 Requires-Python >=3.11; 0.8.0 Requires-Python >=3.11; 0.8.1 Requires-Python >=3.11; 0.8.2 Requires-Python >=3.11; 0.8.3 Requires-Python >=3.11; 0.9.0 Requires-Python >=3.11; 0.9.0.1 Requires-Python >=3.11; 0.9.1 Requires-Python >=3.11; 0.9.2 Requires-Python >=3.11
ERROR: Could not find a version that satisfies the requirement jaxlib==0.4.13 (from versions: 0.4.17, 0.4.18, 0.4.19, 0.4.20, 0.4.21, 0.4.22, 0.4.23, 0.4.24, 0.4.25, 0.4.26, 0.4.27, 0.4.28, 0.4.29, 0.4.30, 0.4.31, 0.4.33, 0.4.34, 0.4.35, 0.4.36, 0.4.38, 0.5.0, 0.5.1, 0.5.3, 0.6.0, 0.6.1, 0.6.2)
ERROR: No matching distribution found for jaxlib==0.4.13
Subprocess return code: 1
```

I did the following:

1) Created a new python3.9 conda env
2) Ran `pip-compile release/air_examples/dreambooth/dreambooth/requirements.txt --output-file=requirements.test.txt`. This indeed failed with `pip._internal.exceptions.DistributionNotFound: No matching distribution found for jaxlib==0.4.13`
3) Relaxed the `jax`, `jaxlib`, and `flax` requirements in requirements.txt
4) Ran `pip-compile release/air_examples/dreambooth/dreambooth/requirements.txt --output-file=requirements.test.txt` again. Here is the result: https://gist.github.com/TimothySeah/5a5e345d581d0b7f97162a4e3a494edb
5) Copied the `jax`, `jaxlib`, and `flax` versions back to `requirements.txt`

But that still resulted in some dependency issues: https://buildkite.com/ray-project/release/builds/88314#019d7a74-9760-4595-b080-72d0797c2f4b, i.e. `ImportError: cannot import name 'cached_download' from 'huggingface_hub'`.

I repeated the process above but unpinned all the deps this time, which produced this file: https://gist.github.com/TimothySeah/58f928b7d1d611b5ff88d383934e2b5f. I then updated requirements.txt to match. But that was even worse: https://buildkite.com/ray-project/release/builds/88650#019d8961-6343-420f-a3cc-dbb33786b18f.

I went back to the first change but only downgraded huggingface_hub to a version that contains `cached_download`. Now there's a version mismatch between diffusers and jax: https://buildkite.com/ray-project/release/builds/88671#019d899f-1699-4e39-8b10-9b38effda088

Taking a step back, I see that the `from versions` log mentions jax 0.4.17 onwards, implying that jax 0.4.13 is no longer supported on my python/cuda setup. Therefore I tried changing it to jax 0.4.17, which produced this requirements file: https://gist.github.com/TimothySeah/465b0b09d46531216fbd9162a183738e. Back to the cached_download error: https://buildkite.com/ray-project/release/builds/88884#019d8e89-b998-48b1-8d8d-7761de4ca484. Now I'm trying an older `huggingface_hub` that supports `cached_download`.

# Testing

Release tests with `name:air_example_dreambooth_finetuning.*`

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
- Remove hf_transfer dependency from all notebooks, train configs, CI build scripts, and BYOD scripts
- Add new run_as_job.ipynb notebook with a step-by-step guide for running LlamaFactory as an Anyscale job
- Add job-configs/job.yaml for standalone job submission
- Simplify file-copy steps in the SFT and CPT notebooks
- Replace inline YAML/JSON config blocks in notebook markdown with `cat` code cells

---------

Signed-off-by: as-jding <jding@anyscale.com>
Signed-off-by: Jason Ding <jding@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…62687) The four :crane: publish: steps in build.rayci.yml run inside the forge docker image (via the default docker plugin), but the dependency on the forge wanda step was implicit — not declared via depends_on. In the default unpruned pipeline this works because rayci includes every step regardless, but under RAYCI_SELECT pruning (where only the selected step and its transitive upstream deps are emitted), the forge step gets filtered out and the publish step crashes with "manifest for citemp:{build_id}-forge not found". Making the dep explicit lets rayci's upstream dep walker include forge in the pruned pipeline whenever any publish step is selected. Signed-off-by: andrew <andrew@anyscale.com>
bumping min python ver in pyproject.toml Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…corn versions (#62529)

## Description

This change reverts the [PR](#57555), which is part of Ray 2.54. I [did](#57555 (comment)) my best to test that PR, but it apparently introduces issues.

## Related issues

Reopens #55776. Relates to kuberay [issue](ray-project/kuberay#4494).

## Additional information

Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
…amination and state loss for skipped deployments (#62484)

## Description

Changes to fix #62482:

- autoscaling_policy.py: (a) copies the user state before merging; (b) calls `_merge_user_state_with_internal_state` for skipped deployments instead of returning raw internal state.
- test_autoscaling_policy.py: unit test which verifies two deployments with different internal state don't cross-contaminate when the user policy returns a shared dict.
- test_application_state.py: unit test which verifies user state persists across control loop iterations for a deployment the policy skips.

## Related issues

> Fixes: #62482
> Related to: #58857

---------

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Co-authored-by: abrar <abrar@anyscale.com>
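The copy-before-merge fix can be sketched as follows. This is a minimal stand-in for the real `_merge_user_state_with_internal_state`, with hypothetical internal key names; the essential point is the deep copy, which prevents a shared dict returned by a user policy from being mutated per-deployment.

```python
import copy


def merge_user_state_with_internal_state(user_state: dict, internal_state: dict) -> dict:
    """Merge internal bookkeeping into user policy state without mutating it.

    Deep-copying first means two deployments whose policy returned the
    *same* dict object don't see each other's internal keys.
    """
    merged = copy.deepcopy(user_state)
    merged.update(internal_state)
    return merged
```

Without the copy, the first deployment's internal keys would leak into the shared dict, and the second deployment would merge on top of them, which is exactly the cross-contamination the unit test checks for.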
…nto a single helper (#62452)

## Description

This PR refactors duplicated internal policy state extraction (delay counter and timestamp keys) into `_extract_internal_policy_state` in autoscaling_policy.py. Follow-up to #58857.

### Related Issues

> Related to #58857

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Co-authored-by: harshit-anyscale <harshit@anyscale.com>
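A helper of this shape might look like the sketch below. The key names are hypothetical stand-ins for the delay counter and timestamp keys mentioned above, not Serve's actual identifiers.

```python
# Hypothetical internal key names, for illustration only.
_INTERNAL_KEYS = ("_scale_delay_counter", "_last_decision_timestamp")


def extract_internal_policy_state(policy_state: dict) -> dict:
    """Pull the internal bookkeeping keys out of a mixed policy-state dict,
    leaving only user-visible state behind."""
    return {
        key: policy_state.pop(key)
        for key in _INTERNAL_KEYS
        if key in policy_state
    }
```

Centralizing the extraction in one helper means every call site agrees on which keys are internal, which is the point of the refactor.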
## Why

Fixes #62358. For non-gRPC Serve requests, `Replica.handle_request*` currently does:

```python
raise self._maybe_wrap_grpc_exception(e, request_metadata) from e
```

When `_maybe_wrap_grpc_exception()` returns the original exception unchanged, this degenerates into `raise e from e`, creating a self-referential `__cause__` chain (`e.__cause__ is e`). That breaks error-reporting tools that walk `__cause__` recursively and can also confuse debugging and logging paths for normal HTTP and handle-based requests.

## Root cause

This regression came from #60482 (`221a19395a`), which unified exception handling across the `handle_request*` paths so gRPC requests could preserve user-set status codes. That behavior is correct when a new gRPC wrapper exception is created. It is wrong for non-gRPC requests, because `_maybe_wrap_grpc_exception()` just returns `e`, so the code effectively executes `raise e from e`.

## Fix

Introduce a small `_raise_user_exception()` helper in `replica.py`:

- if `_maybe_wrap_grpc_exception()` returns the original exception, raise it directly
- if it returns a wrapped gRPC exception, raise the wrapped exception `from e`

This preserves the intended gRPC chaining behavior while eliminating the circular `__cause__` chain on non-gRPC paths.

## Testing

```bash
source /home/simple/github/ray/.venv/bin/activate
PYTHONPATH=/home/simple/github/ray/.worktrees/fix-issue-62358-replica-exception-chaining/python \
  python -m pytest python/ray/serve/tests/unit/test_replica_exception_chaining.py -v
```

---------

Signed-off-by: Ziy1-Tan <ajb459684460@gmail.com>
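The helper's identity check can be sketched as below. `GrpcWrapper` and the simplified `_maybe_wrap_grpc_exception` are stand-ins for illustration; the real Serve code wraps based on request metadata, not a boolean.

```python
class GrpcWrapper(Exception):
    """Stand-in for Serve's gRPC wrapper exception."""

    def __init__(self, cause: BaseException):
        super().__init__(str(cause))


def _maybe_wrap_grpc_exception(e: BaseException, is_grpc: bool) -> BaseException:
    # Simplified: wrap only for gRPC requests, else return e unchanged.
    return GrpcWrapper(e) if is_grpc else e


def raise_user_exception(e: BaseException, is_grpc: bool):
    wrapped = _maybe_wrap_grpc_exception(e, is_grpc)
    if wrapped is e:
        # Non-gRPC path: re-raise directly. Using `raise e from e` here
        # would set e.__cause__ to e itself, a self-referential chain.
        raise e
    # gRPC path: chain the wrapper back to the original user exception.
    raise wrapped from e
```

The `wrapped is e` identity check is what distinguishes "nothing was wrapped" from "a new wrapper exists", so chaining only happens when there is genuinely a second exception to chain.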
## Description

The PR adds a new Prometheus histogram metric `serve_objref_resolution_latency_ms` that tracks how long the Serve router spends resolving upstream DeploymentResponse arguments before a request enters the routing queue. This gives visibility into resolution wait time that was previously hidden as part of `fulfillment_time_ms`.

## Related issues

Fixes #62286

## Additional information

Used the following reproduction script that compares two cases: one passing an unresolved DeploymentResponse as an argument and another passing a plain dict, to isolate the resolution overhead.

```python
@serve.deployment(num_replicas=1)
class SlowDependency:
    async def __call__(self, request=None):
        await asyncio.sleep(2)
        return {"source": "upstream", "ts": time.time()}


@serve.deployment(num_replicas=1)
class DownstreamUnresolved:
    async def __call__(self, upstream_result):
        return {
            "downstream_received": upstream_result,
            "processed_at": time.time(),
        }


@serve.deployment(num_replicas=1)
class PipelineUnresolved:
    def __init__(self, upstream_handle, downstream_handle):
        self.upstream = upstream_handle
        self.downstream = downstream_handle

    async def __call__(self, request):
        upstream_resp = self.upstream.remote()
        return await self.downstream.remote(upstream_resp)


@serve.deployment(num_replicas=1)
class DownstreamPreresolved:
    async def __call__(self, upstream_result):
        return {
            "downstream_received": upstream_result,
            "processed_at": time.time(),
        }


@serve.deployment(num_replicas=1)
class PipelinePreresolved:
    def __init__(self, upstream_handle, downstream_handle):
        self.upstream = upstream_handle
        self.downstream = downstream_handle

    async def __call__(self, request):
        upstream_result = await self.upstream.remote()
        return await self.downstream.remote(upstream_result)


up_a = SlowDependency.bind()
down_a = DownstreamUnresolved.bind()
pipe_a = PipelineUnresolved.bind(up_a, down_a)
serve.run(pipe_a, name="pipeline_unresolved", route_prefix="/pipeline-unresolved")

up_b = SlowDependency.bind()
down_b = DownstreamPreresolved.bind()
pipe_b = PipelinePreresolved.bind(up_b, down_b)
serve.run(pipe_b, name="pipeline_preresolved", route_prefix="/pipeline-preresolved")
```

<img width="690" height="66" alt="image" src="https://github.com/user-attachments/assets/e9e9b05c-2b9b-420c-914f-de1420410ad3" />

---------

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Co-authored-by: harshit-anyscale <harshit@anyscale.com>
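The measurement itself can be sketched as follows. The metric name comes from the PR; `HistogramStub` and `resolve_args_with_timing` are illustrative stand-ins for Ray's metrics API and the router's resolution step.

```python
import time


class HistogramStub:
    """Stand-in for the Prometheus histogram metric; records observations."""

    def __init__(self, name: str):
        self.name = name
        self.observations = []

    def observe(self, value_ms: float) -> None:
        self.observations.append(value_ms)


serve_objref_resolution_latency_ms = HistogramStub("serve_objref_resolution_latency_ms")


def resolve_args_with_timing(resolve_fn, *args):
    """Time only the argument-resolution step, before the request
    enters the routing queue, and record it in the histogram."""
    start = time.perf_counter()
    resolved = resolve_fn(*args)
    serve_objref_resolution_latency_ms.observe((time.perf_counter() - start) * 1000)
    return resolved
```

Keeping the timer scoped to the resolution call is what separates this metric from `fulfillment_time_ms`, which also includes queueing and scheduling time.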
removing python/requirements/ml/py313/dl-cpu-requirements.txt from docgpu depset due to conflicts mentioned [here](#62595) postmerge run: https://buildkite.com/ray-project/postmerge/builds/17056 Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Created by
pull[bot] (v2.0.0-alpha.4)