fix(gateway): make readiness health checks dependency-aware by alangou · Pull Request #1328 · NVIDIA/OpenShell

alangou · 2026-05-12T15:18:42Z

Summary

This PR makes gateway readiness signals dependency-aware instead of always healthy, while keeping liveness intentionally lightweight.
It adds a configurable database readiness timeout, wires it end-to-end (CLI/env -> server config -> health router), and aligns Helm defaults so application timeout stays below Kubernetes probe timeout.
It also extends readiness coverage with Docker e2e and updates the Rust test task so openshell-server test-only coverage runs with test-support.

Related Issue

closes OS-156
Parent initiative: OSGH-111 (Runtime Reliability)

Changes

Added dependency-aware health behavior in openshell-server:
- /healthz remains liveness-only (200 when process is responsive)
- /readyz and /health perform DB connectivity checks and return 503 when unavailable
- readiness payload includes structured dependency details (checks.database.status, latency, error)
Added bounded timeout handling for DB readiness checks.
Added configurable readiness timeout:
- CLI flag: --readiness-db-timeout-secs
- Env var: OPENSHELL_READINESS_DB_TIMEOUT_SECS
- Core/server config wiring and runtime validation (> 0)
Added persistence connectivity helpers (ping / close) for both SQLite and Postgres stores.
Added Prometheus readiness metrics:
- openshell_server_readiness_database_healthy gauge (1 healthy, 0 unhealthy)
- openshell_server_readiness_database_probe_duration_seconds histogram labeled by outcome (success, db_error, timeout)
- health tests validate metric emission and /metrics exposure
Scoped pool teardown APIs to test support only:
- Store::close and backend close methods are behind #[cfg(any(test, feature = "test-support"))]
Updated existing integration tests that instantiate health_router so they provide a real Store.
Added readiness integration test coverage:
- crates/openshell-server/tests/health_endpoint_integration.rs
Added Docker e2e coverage for readiness:
- e2e/rust/tests/readyz_health.rs
- e2e test target in e2e/rust/Cargo.toml
- Docker e2e wrapper exposes a dedicated health port for /readyz probing
Helm chart updates for timeout customization and safe defaults:
- server.readinessDbTimeoutSecs passed to gateway args
- probes.readiness.timeoutSeconds default set to 2s
- app default timeout set to 1s so app timeout < probe timeout
Test task update for test-support coverage by default in Rust test lane:
- tasks/test.toml runs workspace tests excluding openshell-server, then runs openshell-server with --features test-support
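
As an illustration of the bounded readiness probe and the outcome labels listed above, here is a minimal sketch. It is not the PR's implementation: the PR's `run_database_probe` is async, whereas this version swaps in a plain worker thread and `recv_timeout` from the standard library so it stays dependency-free. `ProbeOutcome`, `as_label`, and `run_bounded_probe` are hypothetical names chosen for the example.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Outcome values mirroring the histogram's `outcome` label
/// (success, db_error, timeout). The enum itself is hypothetical.
#[derive(Debug, PartialEq)]
enum ProbeOutcome {
    Success,
    DbError,
    Timeout,
}

impl ProbeOutcome {
    /// Label string that a metrics layer could attach to an observation.
    fn as_label(&self) -> &'static str {
        match self {
            ProbeOutcome::Success => "success",
            ProbeOutcome::DbError => "db_error",
            ProbeOutcome::Timeout => "timeout",
        }
    }
}

/// Hypothetical helper: run `probe` on a worker thread and wait at most
/// `timeout` for its result. Returns the outcome and the time spent waiting.
fn run_bounded_probe<F>(probe: F, timeout: Duration) -> (ProbeOutcome, Duration)
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let started = Instant::now();
    thread::spawn(move || {
        // After a timeout the receiver is gone, so a failed send is ignored.
        let _ = tx.send(probe());
    });
    let outcome = match rx.recv_timeout(timeout) {
        Ok(Ok(())) => ProbeOutcome::Success,
        Ok(Err(_)) => ProbeOutcome::DbError,
        Err(mpsc::RecvTimeoutError::Timeout) => ProbeOutcome::Timeout,
        // The worker dropped the sender without replying (e.g. it panicked).
        Err(mpsc::RecvTimeoutError::Disconnected) => ProbeOutcome::DbError,
    };
    (outcome, started.elapsed())
}
```

A readiness handler built on this shape would map `Success` to a 200 response and the other two outcomes to 503, and record the elapsed time under the matching label.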

Why `/healthz` still returns 200 when DB is down

/healthz is kept as a pure liveness probe by design. If liveness depended on DB, transient DB outages could trigger unnecessary pod restarts and CrashLoop behavior without fixing the dependency outage. Readiness (/readyz) is the dependency-aware signal used to remove unhealthy pods from traffic.
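
The split between the two signals can be sketched as a pair of status-code functions. This is only an illustration under assumed names (`DatabaseStatus`, `healthz_status`, `readyz_status`); the real handlers return structured payloads and live in the health router.

```rust
/// Hypothetical result of the database dependency check.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DatabaseStatus {
    Healthy,
    Unavailable,
}

/// Liveness: the process answered, so report 200 regardless of the database.
fn healthz_status(_db: DatabaseStatus) -> u16 {
    200
}

/// Readiness: report 503 while the database is unavailable so the pod is
/// removed from traffic without being restarted.
fn readyz_status(db: DatabaseStatus) -> u16 {
    match db {
        DatabaseStatus::Healthy => 200,
        DatabaseStatus::Unavailable => 503,
    }
}
```

During a database outage this yields 200 from /healthz and 503 from /readyz, which is the combination that keeps the pod alive but out of rotation.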

Why `close` is test-only

close is used to simulate dependency outages in tests. Exposing it in runtime code would make it possible to tear down an active pool under live traffic. Until a dedicated graceful-shutdown flow exists, keeping it behind test support prevents accidental production use.
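
For readers unfamiliar with this kind of gating, the sketch below shows the general pattern with a stand-in type. The `Store` here is hypothetical (a flag models the pool being open) and is not the crate's real store; only the `#[cfg(any(test, feature = "test-support"))]` attribute is taken from the PR.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the real store: a flag models an open pool.
pub struct Store {
    open: AtomicBool,
}

impl Store {
    pub fn new() -> Self {
        Store { open: AtomicBool::new(true) }
    }

    /// Connectivity check, compiled into every build.
    pub fn ping(&self) -> Result<(), String> {
        if self.open.load(Ordering::SeqCst) {
            Ok(())
        } else {
            Err("pool closed".to_string())
        }
    }

    /// Test-only: this method does not exist in a normal release build, so
    /// runtime code that calls it fails to compile instead of closing the pool.
    #[cfg(any(test, feature = "test-support"))]
    pub fn close(&self) {
        self.open.store(false, Ordering::SeqCst);
    }
}
```

With this pattern, tests that need to simulate an outage are built with the `test-support` feature enabled, which is what the updated test task does for openshell-server.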

Testing

mise run pre-commit passes
Unit tests added/updated
E2E tests added/updated (if applicable)

Validation run:

mise run e2e
mise run ci

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

TaylorMutch

A couple of focused comments on the readiness changes.

TaylorMutch · 2026-05-12T20:22:50Z

+}

-    (StatusCode::OK, Json(response))
+async fn run_database_probe<F>(probe: F, timeout: Duration) -> DependencyCheck


Worth emitting a Prometheus signal from this code path. The warn! is grep-friendly but doesn't give us an alertable series, and now that /readyz drives traffic routing, "DB unreachable for N minutes" should be a first-class metric rather than something inferred from log volume.

A gauge would be the minimum useful thing — e.g. gateway_readiness_database_healthy (0/1) updated in each of the three match arms. A _seconds histogram of latency_ms is a natural follow-up but not blocking.

alangou · 2026-05-13

Indeed, I have added both metrics (gauge and histogram). The PR description has also been updated to explain the implementation details further.

TaylorMutch · 2026-05-12T20:22:50Z

+    }
+
+    /// Close the underlying connection pool.
+    pub async fn close(&self) {


Store::close is publicly exposed but the only non-test caller would be a shutdown path that doesn't exist yet — the integration tests use it to simulate a DB outage. Calling this from production code by accident would tear down the pool under live traffic.

Either gate it (#[cfg(any(test, feature = "test-support"))]) or mark #[doc(hidden)] and add a // test-only: do not call from runtime code comment. The #[cfg] option is stricter and prevents accidental release use; #[doc(hidden)] is friendlier if you anticipate adding a real shutdown caller soon.

alangou · 2026-05-13

The goal was indeed to have this function ready for when a shutdown flow is implemented. The function is now gated behind #[cfg(any(test, feature = "test-support"))], and I added a short comment in the code explaining why it is gated.

The PR description has been updated to reflect this change.

Emit Prometheus readiness metrics for database probes (healthy gauge and outcome-labeled latency histogram) with coverage in health HTTP tests.
Restrict Store::close behind test support cfg to prevent accidental runtime pool shutdown under live traffic.

Signed-off-by: Adrien Langou <alangou@nvidia.com>

alangou · 2026-05-13T13:42:31Z

@TaylorMutch I removed the hardcoded value for the database timeout. Some documentation and deployment resources have been updated to reflect that change (Helm chart, docs).

alangou requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 12, 2026 15:18

TaylorMutch reviewed May 12, 2026

alangou force-pushed the alangou/os-156-update-gateway-health-check-to-account-for-database branch 3 times, most recently from 01afe38 to b52f7d7 Compare May 13, 2026 10:10

alangou force-pushed the alangou/os-156-update-gateway-health-check-to-account-for-database branch from b52f7d7 to 2cb34c9 Compare May 13, 2026 12:24

test(e2e): add simple e2e test with docker to test /readyz

05b6998

Signed-off-by: Adrien Langou <alangou@nvidia.com>

alangou force-pushed the alangou/os-156-update-gateway-health-check-to-account-for-database branch from 2cb34c9 to 05b6998 Compare May 13, 2026 12:44

alangou requested a review from TaylorMutch May 13, 2026 13:42

alangou wants to merge 2 commits into NVIDIA:main from alangou:alangou/os-156-update-gateway-health-check-to-account-for-database

2 participants
