
Check is_ready flag in /healthz endpoint#2255

Open
ecarrara wants to merge 2 commits into main from worktree-erle+healthz-is-ready-flag

Conversation

Contributor

@ecarrara ecarrara commented Apr 21, 2026

What does this PR do?

Extends /healthz to also check the is_ready flag (same one used by /readiness). Returns 503 with reason: "not_ready" if model initialization hasn't completed, otherwise proceeds with the existing CUDA health check.
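The decision order described above can be sketched as plain functions. This is a hypothetical sketch, not the actual FastAPI handler in http_api.py; the function and parameter names are illustrative:

```python
# Minimal sketch of the extended /healthz decision logic from this PR.
# `is_ready` is the same flag /readiness reads; `cuda_ok` stands in for
# the existing CUDA health check. Names are illustrative, not real code.

def healthz(is_ready: bool, cuda_ok: bool) -> tuple[int, dict]:
    """Return (http_status, response_body) for the /healthz endpoint."""
    if not is_ready:
        # Model initialization has not completed yet (new in this PR).
        return 503, {"status": "unhealthy", "reason": "not_ready"}
    if not cuda_ok:
        # Existing behavior: fail on CUDA health check errors.
        return 503, {"status": "unhealthy", "reason": "cuda_error"}
    return 200, {"status": "healthy"}
```

The key point is that the is_ready check runs first, so a 503 with `reason: "not_ready"` is returned before the CUDA check is ever reached.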

Type of Change

  • New feature (non-breaking change that adds functionality)

Testing

  • I have tested this change locally
  • I have added/updated tests for this change

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

Collaborator

hansent commented Apr 21, 2026

Reviewed this together with roboflow/async-serverless#260. The core change is fine in isolation but I'm flagging a regression that only shows up when you combine it with the pod-health-monitor sidecar in the consumer chart.

Main concern: restart loop on inference-container-only crashes

async-serverless/k8s/charts/consumer-inference/templates/deployment.yaml runs a busybox sidecar that polls http://localhost:9001/healthz every checkInterval seconds and exits with status 1 (forcing a full pod restart) after maxFailures consecutive failures. The startup grace period is anchored to the sidecar's start time, not the inference container's, so when only the inference container restarts, the grace period has long since elapsed.

  • Before this PR: /healthz did only a CUDA check, returned 200 throughout preload → sidecar stays happy, inference recovers in place.
  • After this PR: /healthz returns 503 while is_ready=False, i.e. for the entire re-preload window → sidecar counts failures → force-restarts the pod.
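To make the failure window concrete, here is a toy simulation of the sidecar's consecutive-failure counting (the real sidecar is a busybox shell loop; this just models it, and the defaults mirror the CRU-PROD values quoted below):

```python
# Toy model of the pod-health-monitor sidecar: during re-preload every
# /healthz poll returns 503, so each check is one more consecutive
# failure. Defaults mirror CRU-PROD (3 x 15s). Illustrative only.

def sidecar_restarts(preload_seconds: float,
                     max_failures: int = 3,
                     check_interval: int = 15) -> bool:
    """True if the sidecar would exit 1 (full pod restart) before the
    inference container finishes re-preloading."""
    failures = 0
    elapsed = 0.0
    while failures < max_failures:
        elapsed += check_interval
        if elapsed >= preload_seconds:
            return False  # preload finished before the failure budget ran out
        failures += 1  # poll hit a 503 -> one more consecutive failure
    return True
```

With a 60s preload against the 3 × 15s budget, the sidecar force-restarts the pod; a 30s preload squeaks through.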

Configured thresholds vs. typical GPU preload:

| Env | maxFailures × checkInterval | Typical preload |
| --- | --- | --- |
| CRU-STG | 3 × 20s = 60s | often > 60s |
| CRU-PROD | 3 × 15s = 45s | often > 45s |
| GKE-STG/PROD | 3 × 20–30s = 60–90s | often > 90s |

Because preload frequently exceeds those windows, we'd be promoting every inference crash into a full pod restart, and — if preload consistently exceeds the threshold — potentially into a restart loop (pod starts, sidecar grace period covers first preload, inference crashes again after N minutes, pod restarts, …).

Suggested mitigations (pick one)

  1. Split the endpoints: keep /healthz as a liveness-only check (CUDA / process health) and keep the is_ready gate in /readiness. This PR is effectively folding readiness into liveness, which is what the sidecar reads.
  2. Point the sidecar at a new endpoint (e.g. /healthz-live) that does only the CUDA check. Coordinated change in async-serverless.
  3. Bump maxFailures × checkInterval past 95p preload time in all env values files, and document that any inference crash = full pod restart.
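Mitigation 1 amounts to keeping the two probes' logic disjoint. A hedged sketch, with handler names that are illustrative rather than taken from the codebase:

```python
# Sketch of mitigation 1: /healthz stays liveness-only, /readiness keeps
# the is_ready gate. Plain functions returning HTTP status codes; names
# are illustrative, not the actual http_api.py handlers.

def liveness(cuda_ok: bool) -> int:
    """/healthz: process/CUDA health only -- what the sidecar polls.
    Stays 200 during re-preload, so in-place recovery is possible."""
    return 200 if cuda_ok else 503

def readiness(is_ready: bool, cuda_ok: bool) -> int:
    """/readiness: gates traffic until model preload completes."""
    return 200 if (is_ready and cuda_ok) else 503
```

During a container-only restart this yields liveness 200 / readiness 503, which is exactly the "alive but re-preloading" state the sidecar currently cannot distinguish.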

Second-order observation (not blocking)

is_ready is one-way (http_api.py:2141, 2149, 2225) — set False once, set True once, never flipped back. Combined with startupProbe: /readiness (from #260), liveness is already suppressed until is_ready=True, so /healthz checking is_ready has no steady-state effect and no effect on fresh pod startup. Its only observable impact is in transient windows (container-only restart, re-preload) — which is exactly the scenario above.

Also worth noting: is_ready is set to True at line 2225 even when state.initialization_errors is non-empty — partial preload failures don't block readiness. Existing behavior, not this PR, but it means /healthz won't surface those either.
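For the partial-preload note above, a guard along these lines would keep readiness from flipping on when some models failed to load. This is a hypothetical variant, not a change proposed in this PR; the names are illustrative:

```python
# Hypothetical readiness guard: only report ready when preload finished
# AND no initialization errors were recorded. Current behavior (per the
# note above) ignores the error list. Names are illustrative.

def compute_is_ready(preload_done: bool, initialization_errors: list) -> bool:
    return preload_done and not initialization_errors
```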

Collaborator

@PawelPeczek-Roboflow PawelPeczek-Roboflow left a comment


Rejecting until it's clarified how this could go wrong with k8s terminating pods that are marked unhealthy while still pre-loading.
