fix(gateway): make readiness health checks dependency-aware#1328
fix(gateway): make readiness health checks dependency-aware#1328alangou wants to merge 2 commits into
Conversation
TaylorMutch
left a comment
There was a problem hiding this comment.
A couple of focused comments on the readiness changes.
| } | ||
|
|
||
| (StatusCode::OK, Json(response)) | ||
| async fn run_database_probe<F>(probe: F, timeout: Duration) -> DependencyCheck |
There was a problem hiding this comment.
Worth emitting a Prometheus signal from this code path. The warn! is grep-friendly but doesn't give us an alertable series, and now that /readyz drives traffic routing, "DB unreachable for N minutes" should be a first-class metric rather than something inferred from log volume.
A gauge would be the minimum useful thing — e.g. gateway_readiness_database_healthy (0/1) updated in each of the three match arms. A _seconds histogram of latency_ms is a natural follow-up but not blocking.
There was a problem hiding this comment.
Indeed, I have added both metrics (gauge + histogram). The description of the PR has been updated as well to explain a bit more the implementation details
| } | ||
|
|
||
| /// Close the underlying connection pool. | ||
| pub async fn close(&self) { |
There was a problem hiding this comment.
Store::close is publicly exposed but the only non-test caller would be a shutdown path that doesn't exist yet — the integration tests use it to simulate a DB outage. Calling this from production code by accident would tear down the pool under live traffic.
Either gate it (#[cfg(any(test, feature = "test-support"))]) or mark #[doc(hidden)] and add a // test-only: do not call from runtime code comment. The #[cfg] option is stricter and prevents accidental release use; #[doc(hidden)] is friendlier if you anticipate adding a real shutdown caller soon.
There was a problem hiding this comment.
The goal was indeed to have this function ready when a shutdown flow would be implemented. The function is now gated behind #[cfg(any(test, feature = "test-support"))]. I added a quick comment in the code to explain why the code is gated.
PR description has been updated to reflect this change
01afe38 to
b52f7d7
Compare
Emit Prometheus readiness metrics for database probes (healthy gauge and outcome-labeled latency histogram) with coverage in health HTTP tests. Restrict Store::close behind test support cfg to prevent accidental runtime pool shutdown under live traffic. Signed-off-by: Adrien Langou <alangou@nvidia.com>
b52f7d7 to
2cb34c9
Compare
Signed-off-by: Adrien Langou <alangou@nvidia.com>
2cb34c9 to
05b6998
Compare
|
@TaylorMutch I removed the hardcoded value for the database timeout some documentation and deployments resources has been updated to reflect that change (helm chart, doc) |
Summary
This PR makes gateway readiness signals dependency-aware instead of always healthy, while keeping liveness intentionally lightweight.
It adds a configurable database readiness timeout, wires it end-to-end (CLI/env -> server config -> health router), and aligns Helm defaults so application timeout stays below Kubernetes probe timeout.
It also extends readiness coverage with Docker e2e and updates the Rust test task so
openshell-servertest-only coverage runs withtest-support.Related Issue
Changes
openshell-server:/healthzremains liveness-only (200when process is responsive)/readyzand/healthperform DB connectivity checks and return503when unavailablechecks.database.status, latency, error)--readiness-db-timeout-secsOPENSHELL_READINESS_DB_TIMEOUT_SECS> 0)ping/close) for both SQLite and Postgres stores.openshell_server_readiness_database_healthygauge (1healthy,0unhealthy)openshell_server_readiness_database_probe_duration_secondshistogram labeled byoutcome(success,db_error,timeout)/metricsexposureStore::closeand backendclosemethods are behind#[cfg(any(test, feature = "test-support"))]health_routerso they provide a realStore.crates/openshell-server/tests/health_endpoint_integration.rse2e/rust/tests/readyz_health.rse2e/rust/Cargo.toml/readyzprobingserver.readinessDbTimeoutSecspassed to gateway argsprobes.readiness.timeoutSecondsdefault set to 2stest-supportcoverage by default in Rust test lane:tasks/test.tomlruns workspace tests excludingopenshell-server, then runsopenshell-serverwith--features test-supportWhy
/healthzstill returns 200 when DB is down/healthzis kept as a pure liveness probe by design. If liveness depended on DB, transient DB outages could trigger unnecessary pod restarts and CrashLoop behavior without fixing the dependency outage. Readiness (/readyz) is the dependency-aware signal used to remove unhealthy pods from traffic.Why
closeis test-onlycloseis used to simulate dependency outages in tests. Exposing it in runtime code would make it possible to tear down an active pool under live traffic. Until a dedicated graceful-shutdown flow exists, keeping it behind test support prevents accidental production use.Testing
mise run pre-commitpassesValidation run:
mise run e2emise run ciChecklist