
Check is_ready flag in /healthz endpoint#2255

Open
ecarrara wants to merge 2 commits into main from worktree-erle+healthz-is-ready-flag

Conversation

Contributor

@ecarrara ecarrara commented Apr 21, 2026

What does this PR do?

Extends /healthz to also check the is_ready flag (same one used by /readiness). Returns 503 with reason: "not_ready" if model initialization hasn't completed, otherwise proceeds with the existing CUDA health check.
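The decision order described above can be sketched as plain functions. This is a hypothetical sketch, not the actual FastAPI handler in http_api.py; the function and parameter names are illustrative:

```python
# Minimal sketch of the extended /healthz decision logic from this PR.
# `is_ready` is the same flag /readiness reads; `cuda_ok` stands in for
# the existing CUDA health check. Names are illustrative, not real code.

def healthz(is_ready: bool, cuda_ok: bool) -> tuple[int, dict]:
    """Return (http_status, response_body) for the /healthz endpoint."""
    if not is_ready:
        # Model initialization has not completed yet (new in this PR).
        return 503, {"status": "unhealthy", "reason": "not_ready"}
    if not cuda_ok:
        # Existing behavior: fail on CUDA health check errors.
        return 503, {"status": "unhealthy", "reason": "cuda_error"}
    return 200, {"status": "healthy"}
```

The key point is that the is_ready check runs first, so a 503 with `reason: "not_ready"` is returned before the CUDA check is ever reached.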

Type of Change

  • New feature (non-breaking change that adds functionality)

Testing

  • I have tested this change locally
  • I have added/updated tests for this change

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code where necessary, particularly in hard-to-understand areas
  • My changes generate no new warnings or errors
  • I have updated the documentation accordingly (if applicable)

Collaborator

hansent commented Apr 21, 2026

Reviewed this together with roboflow/async-serverless#260. The core change is fine in isolation but I'm flagging a regression that only shows up when you combine it with the pod-health-monitor sidecar in the consumer chart.

Main concern: restart loop on inference-container-only crashes

async-serverless/k8s/charts/consumer-inference/templates/deployment.yaml runs a busybox sidecar that polls http://localhost:9001/healthz every checkInterval seconds and exits with status 1 (forcing a full pod restart) after maxFailures consecutive failures. The startup grace period is anchored to the sidecar's start time, not the inference container's, so when only the inference container restarts, the grace period has long since elapsed.

  • Before this PR: /healthz did only a CUDA check, returned 200 throughout preload → sidecar stays happy, inference recovers in place.
  • After this PR: /healthz returns 503 while is_ready=False, i.e. for the entire re-preload window → sidecar counts failures → force-restarts the pod.
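To make the failure window concrete, here is a toy simulation of the sidecar's consecutive-failure counting (the real sidecar is a busybox shell loop; this just models it, and the defaults mirror the CRU-PROD values quoted below):

```python
# Toy model of the pod-health-monitor sidecar: during re-preload every
# /healthz poll returns 503, so each check is one more consecutive
# failure. Defaults mirror CRU-PROD (3 x 15s). Illustrative only.

def sidecar_restarts(preload_seconds: float,
                     max_failures: int = 3,
                     check_interval: int = 15) -> bool:
    """True if the sidecar would exit 1 (full pod restart) before the
    inference container finishes re-preloading."""
    failures = 0
    elapsed = 0.0
    while failures < max_failures:
        elapsed += check_interval
        if elapsed >= preload_seconds:
            return False  # preload finished before the failure budget ran out
        failures += 1  # poll hit a 503 -> one more consecutive failure
    return True
```

With a 60s preload against the 3 × 15s budget, the sidecar force-restarts the pod; a 30s preload squeaks through.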

Configured thresholds vs. typical GPU preload:

| Env | maxFailures × checkInterval | Typical preload |
| --- | --- | --- |
| CRU-STG | 3 × 20s = 60s | often > 60s |
| CRU-PROD | 3 × 15s = 45s | often > 45s |
| GKE-STG/PROD | 3 × 20–30s = 60–90s | often > 90s |

Because preload frequently exceeds those windows, we'd be promoting every inference crash into a full pod restart, and — if preload consistently exceeds the threshold — potentially into a restart loop (pod starts, sidecar grace period covers first preload, inference crashes again after N minutes, pod restarts, …).

Suggested mitigations (pick one)

  1. Split the endpoints: keep /healthz as a liveness-only check (CUDA / process health) and keep the is_ready gate in /readiness. This PR is effectively folding readiness into liveness, which is what the sidecar reads.
  2. Point the sidecar at a new endpoint (e.g. /healthz-live) that does only the CUDA check. Coordinated change in async-serverless.
  3. Bump maxFailures × checkInterval past 95p preload time in all env values files, and document that any inference crash = full pod restart.
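Mitigation 1 amounts to keeping the two probes' logic disjoint. A hedged sketch, with handler names that are illustrative rather than taken from the codebase:

```python
# Sketch of mitigation 1: /healthz stays liveness-only, /readiness keeps
# the is_ready gate. Plain functions returning HTTP status codes; names
# are illustrative, not the actual http_api.py handlers.

def liveness(cuda_ok: bool) -> int:
    """/healthz: process/CUDA health only -- what the sidecar polls.
    Stays 200 during re-preload, so in-place recovery is possible."""
    return 200 if cuda_ok else 503

def readiness(is_ready: bool, cuda_ok: bool) -> int:
    """/readiness: gates traffic until model preload completes."""
    return 200 if (is_ready and cuda_ok) else 503
```

During a container-only restart this yields liveness 200 / readiness 503, which is exactly the "alive but re-preloading" state the sidecar currently cannot distinguish.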

Second-order observation (not blocking)

is_ready is one-way (http_api.py:2141, 2149, 2225) — set False once, set True once, never flipped back. Combined with startupProbe: /readiness (from #260), liveness is already suppressed until is_ready=True, so /healthz checking is_ready has no steady-state effect and no effect on fresh pod startup. Its only observable impact is in transient windows (container-only restart, re-preload) — which is exactly the scenario above.

Also worth noting: is_ready is set to True at line 2225 even when state.initialization_errors is non-empty — partial preload failures don't block readiness. Existing behavior, not this PR, but it means /healthz won't surface those either.
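For the partial-preload note above, a guard along these lines would keep readiness from flipping on when some models failed to load. This is a hypothetical variant, not a change proposed in this PR; the names are illustrative:

```python
# Hypothetical readiness guard: only report ready when preload finished
# AND no initialization errors were recorded. Current behavior (per the
# note above) ignores the error list. Names are illustrative.

def compute_is_ready(preload_done: bool, initialization_errors: list) -> bool:
    return preload_done and not initialization_errors
```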

Collaborator

@PawelPeczek-Roboflow PawelPeczek-Roboflow left a comment


Rejecting until it's clarified how this could go wrong with k8s terminating pods that are marked unhealthy while still pre-loading.
