feat(observability): OpenTelemetry golden-signals middleware #998
knutties wants to merge 40 commits into
Conversation
Walkthrough

Adds complete OpenTelemetry observability to the Superposition HTTP API, including Prometheus metrics exposition, health endpoints, and saturation monitoring for the database/Redis/Tokio runtime, all configurable via environment variables with optional OTLP push integration.
Force-pushed 4951507 to 726f0c8.
Pull request overview
This PR adds a new observability subsystem to the Superposition Actix API that exports Google SRE “golden signal” metrics (latency/traffic/errors/saturation) via OpenTelemetry, exposing Prometheus scrape output on a dedicated port and optionally pushing metrics via OTLP. It wires the subsystem into the superposition binary (middleware + /metrics server + health endpoints) and documents the design/rollout.
Changes:
- Introduces `service_utils::observability` (config, OTel MeterProvider init, Actix metrics middleware, saturation collectors, metrics server, health endpoints) plus integration tests and a cardinality budget test.
- Wires observability into the main API binary: `/healthz`, `/livez`, `/readyz`, auth bypass list updates, a metrics server spawned alongside the main server, and middleware added to the app.
- Adds workspace-wide build support for `tokio_unstable` and introduces new OTel/Prometheus dependencies with lockfile updates and supporting docs.
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 5 comments.
Summary per file
| File | Description |
|---|---|
| README.md | Documents /metrics endpoint and health probes; notes tokio_unstable build flag. |
| docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md | Design spec for the golden-signals OTel/Prometheus approach. |
| docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md | Step-by-step implementation plan and test strategy. |
| crates/superposition/src/main.rs | Initializes observability, registers saturation observers, spawns metrics server, adds middleware + health scope, and try_join!s servers. |
| crates/superposition/src/app_state.rs | Extends auth exclusion list to always include /healthz, /livez, and /readyz. |
| crates/superposition/Cargo.toml | Adds direct deps needed for new wiring (futures-util, opentelemetry). |
| crates/service_utils/src/lib.rs | Exposes new observability module and adds placeholder crate uses to satisfy unused_crate_dependencies. |
| crates/service_utils/Cargo.toml | Enables new observability dependencies (opentelemetry*, prometheus, tokio-metrics, humantime, etc.). |
| crates/service_utils/src/observability.rs | Observability public API + OTel MeterProvider init (Prometheus exporter + optional OTLP). |
| crates/service_utils/src/observability/config.rs | Env-driven configuration parsing + tests. |
| crates/service_utils/src/observability/meters.rs | Defines HTTP metric instruments (histogram/counter/updowncounter). |
| crates/service_utils/src/observability/middleware.rs | Actix middleware that records per-request HTTP metrics with bounded route cardinality. |
| crates/service_utils/src/observability/metrics_server.rs | Dedicated Actix server exposing /metrics from the Prometheus registry. |
| crates/service_utils/src/observability/health.rs | Adds /healthz, /livez, /readyz handlers and exposes path list for auth bypass. |
| crates/service_utils/src/observability/saturation.rs | Wires saturation collectors (DB, Redis, Tokio runtime). |
| crates/service_utils/src/observability/saturation/db_pool.rs | r2d2 pool saturation observable gauges. |
| crates/service_utils/src/observability/saturation/redis_pool.rs | fred-based Redis “commands in flight” gauge and tolerant stats abstraction. |
| crates/service_utils/src/observability/saturation/tokio_runtime.rs | cfg(tokio_unstable) Tokio runtime saturation gauges via tokio-metrics. |
| crates/service_utils/tests/observability_integration.rs | End-to-end scrape assertions + cardinality regression test. |
| Cargo.toml | Adds workspace deps for OTel/Prometheus; adds unexpected_cfgs allowance for tokio_unstable. |
| Cargo.lock | Locks new dependency graph (notably OTel + Prometheus + tokio-metrics). |
| .cargo/config.toml | Enables --cfg tokio_unstable workspace-wide. |
| .gitignore | Ensures .cargo/config.toml remains tracked despite .cargo ignore rule. |
```rust
    .with_resource(Resource::new(resource_attrs));

if let Some(endpoint) = &cfg.otlp_endpoint {
    builder = with_otlp_reader(builder, endpoint, cfg.collect_interval)?;
```

and inside `with_otlp_reader`:

```rust
// Protocol: this binary is compiled with the `http-proto` feature only.
// `OTEL_EXPORTER_OTLP_PROTOCOL=grpc` is NOT supported in v1; use the
// default `http/protobuf` transport.
//
// Headers: the opentelemetry-otlp 0.27 HTTP exporter reads
// `OTEL_EXPORTER_OTLP_HEADERS` (and `OTEL_EXPORTER_OTLP_METRICS_HEADERS`)
// automatically during `build()` — no explicit wiring needed here.
use opentelemetry_otlp::{MetricExporter, WithExportConfig};
use opentelemetry_sdk::metrics::PeriodicReader;
use opentelemetry_sdk::runtime;

let exporter = MetricExporter::builder()
    .with_http()
    .with_endpoint(endpoint.to_owned())
    .build()
```
```rust
fn with_env<F: FnOnce()>(vars: &[(&str, Option<&str>)], f: F) {
    use std::sync::Mutex;
    static LOCK: Mutex<()> = Mutex::new(());
    let _guard = LOCK.lock().unwrap();
    let prev: Vec<_> =
        vars.iter().map(|(k, _)| (k.to_string(), std::env::var(k).ok())).collect();
    for (k, v) in vars {
        match v {
            // SAFETY: single-threaded test execution guaranteed by the
            // Mutex above and --test-threads=1 at the call site.
            Some(v) => unsafe { std::env::set_var(k, v) },
            None => unsafe { std::env::remove_var(k) },
        }
    }
    f();
    for (k, v) in prev {
        match v {
            Some(v) => unsafe { std::env::set_var(&k, &v) },
            None => unsafe { std::env::remove_var(&k) },
        }
    }
}
```
```rust
// Re-exported so T4+ modules can use them without triggering
// the `unused_crate_dependencies` lint while no source yet imports them.
use humantime as _;
use opentelemetry as _;
use opentelemetry_otlp as _;
use opentelemetry_prometheus as _;
use opentelemetry_sdk as _;
use opentelemetry_semantic_conventions as _;
```
```rust
// Rather than killing the binary we log a warning and serve traffic without metrics.
let observability = if obs_cfg.enabled {
    match Observability::init(obs_cfg.clone()) {
        Ok(o) => Some(o),
        Err(e) => {
            tracing::warn!(
                error = %e,
                "observability init failed; metrics disabled for this instance"
            );
            None
```
Datron left a comment
Overall review
- Some crates are imported but never actually used, and can be removed:

```toml
opentelemetry-otlp = { version = "0.27", default-features = false, features = ["metrics", "http-proto", "reqwest-client"] }
opentelemetry-semantic-conventions = { version = "0.27" }
```

- We could use https://crates.io/crates/opentelemetry-instrumentation-actix-web, which comes from open-telemetry, so that we can directly monitor requests, latency, and errors.
- Simplify the observability module; there are too many files with only ~30 lines. Could we group everything into fewer files and use internal modules if we want namespacing?
- Is it possible to add some CPU and memory monitoring on the box with the sysinfo crate, or will we leave that to a deployment runtime like ECS?
```rust
    let max_pool_name = KeyValue::new("pool.name", pool_name);
    meter
        .u64_observable_gauge("db.client.connections.max")
        .with_description("Configured maximum size of the DB connection pool.")
        .with_callback(move |observer| {
            observer.observe(pool_for_max.max_size() as u64, &[max_pool_name.clone()]);
        })
        .build();
}
```
Should we just read the env DB_MAX_CONNECTION_POOL_SIZE and publish that? Getting the size from the pool may not be accurate for larger pools.
```rust
#[cfg(test)]
mod tests {
    // Constructing a real r2d2 pool requires a database. We assert the function
    // signature compiles and that calling `register` does not panic with a
    // synthetic in-memory pool; this is exercised via the integration test in
    // Task 19 instead.
}
```
Should we just remove this?
```rust
fn used_connections(&self) -> Option<u64> {
    // TODO(observability): fred 9.2.1 does not expose active/used
    // connection counts. The internal `Counters.in_flight` field tracks
    // per-connection in-flight commands, but it is private to
    // `fred::protocol::connection`. Return None until exposed publicly.
    None
}
```
You can get the number of used clients via:

```rust
let connected_count = pool_clone.clients().iter().filter(|c| c.is_connected()).count();
```
```rust
fn idle_connections(&self) -> Option<u64> {
    // TODO(observability): fred 9.2.1 does not expose idle-connection
    // counts via any public API. The pool's `RedisPoolInner.clients` are
    // all async and there is no `is_idle()` method on `RedisClient`.
    // Return None until fred provides such an API.
    None
}
```
```rust
//! in atomics that observable-gauge callbacks read.

#[cfg(not(tokio_unstable))]
#[allow(dead_code)] // called from saturation.rs only under cfg(tokio_unstable)
pub fn spawn(_meter: &opentelemetry::metrics::Meter, _interval: std::time::Duration) {}
```

Do we need this if the cfg is at the top?
```rust
#[cfg(tokio_unstable)]
mod inner {
    use std::sync::Arc;
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::Duration;

    use opentelemetry::metrics::Meter;
    use tokio_metrics::RuntimeMonitor;

    #[derive(Default)]
    struct Snapshot {
        workers: AtomicU64,
        global_queue_depth: AtomicU64,
```
This can be simplified a bit; just a function that reads metrics:

```rust
// 1. Global Queue Depth: tasks waiting to be picked up by an Actix worker
let h_clone = handle.clone();
meter.u64_observable_gauge("tokio.runtime.global_queue_depth")
    .with_description("Tasks stuck in the global queue waiting for a worker thread")
    .with_callback(move |observer| {
        let metrics = h_clone.metrics();
        observer.observe(metrics.global_queue_depth() as u64, &[]);
    })
    .init();

// 2. Active Tasks: total tasks currently alive in the runtime
let h_clone2 = handle.clone();
meter.u64_observable_gauge("tokio.runtime.active_tasks")
    .with_description("Total number of alive tasks (running or idle)")
    .with_callback(move |observer| {
        let metrics = h_clone2.metrics();
        observer.observe(metrics.active_tasks_count() as u64, &[]);
    })
    .init();

// 3. Worker Count: how many OS threads Tokio is using for workers
let h_clone3 = handle.clone();
meter.u64_observable_gauge("tokio.runtime.workers_count")
    .with_description("Number of worker threads")
    .with_callback(move |observer| {
        let metrics = h_clone3.metrics();
        observer.observe(metrics.num_workers() as u64, &[]);
    })
    .init();
```
```rust
}

#[cfg(test)]
mod tests {
```
Don't know if we need tests
```rust
use std::{net::IpAddr, str::FromStr, time::Duration};

#[derive(Debug, Clone)]
pub struct ObservabilityConfig {
```
Should this be moved to app state?
Design for adding an Actix middleware that emits Google SRE golden signals (latency, traffic, errors, saturation) via OpenTelemetry, with Prometheus scrape on a separate metrics port and optional OTLP push. Reference TSDB is VictoriaMetrics; design is TSDB-agnostic. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
22-task implementation plan derived from the 2026-05-10 spec, broken down into TDD-style steps with full code and exact commands. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds opentelemetry, opentelemetry_sdk, opentelemetry-prometheus, opentelemetry-otlp, opentelemetry-semantic-conventions, prometheus, tokio-metrics, and humantime as workspace dependencies. Enabled in service_utils in a follow-up commit.
Required by tokio-metrics for runtime instrumentation introduced in the upcoming observability subsystem. tokio_unstable only adds APIs; no behavioural change for existing code. Also un-ignores .cargo/config.toml in .gitignore so the workspace build config can be tracked; the rest of .cargo/ remains ignored.
Wires the eight new workspace deps (opentelemetry, opentelemetry_sdk, opentelemetry-otlp, opentelemetry-prometheus, opentelemetry-semantic-conventions, prometheus, tokio-metrics, humantime) into crates/service_utils/Cargo.toml. The crate has #![deny(unused_crate_dependencies)], so temporary `use X as _;` stubs are added to lib.rs to satisfy the lint until T4+ modules consume the crates directly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the empty module structure that subsequent commits flesh out:
- observability.rs: public surface, Observability handle, errors
- observability/{config,meters,middleware,metrics_server,health,saturation}.rs: stubs
No behaviour change. Observability::init() body is unimplemented!()
until Task 7.
Reads SUPERPOSITION_METRICS_* and OTEL_* env vars into a typed ObservabilityConfig. Defaults: enabled, port 9091, bind 0.0.0.0, both org/workspace labels on, 10s collect interval. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per OTel semconv: collapse unknown methods to _OTHER to bound the cardinality of the http.request.method label.
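The collapsing rule above can be sketched as follows. This is a hypothetical helper, not the actual source: the well-known method set follows RFC 9110, and the real middleware may structure this differently.

```rust
// Collapse any method outside the well-known RFC 9110 set to "_OTHER",
// so arbitrary client-supplied verbs cannot mint unbounded label values.
fn normalize_method(method: &str) -> &str {
    match method {
        "GET" | "HEAD" | "POST" | "PUT" | "DELETE" | "CONNECT" | "OPTIONS"
        | "TRACE" | "PATCH" => method,
        _ => "_OTHER",
    }
}

fn main() {
    assert_eq!(normalize_method("GET"), "GET");
    // An attacker-chosen verb collapses into a single series.
    assert_eq!(normalize_method("SPAM"), "_OTHER");
    println!("ok");
}
```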
Builds an SdkMeterProvider wired to an opentelemetry-prometheus exporter that writes into a per-process prometheus::Registry. OTLP push exporter is plumbed but only activates when OTEL_EXPORTER_OTLP_ENDPOINT is set. Also adds Shutdown(String) error variant and updates shutdown() to use it; defines HttpMeters struct with request_duration histogram, busy_duration counter, and active_requests up-down-counter. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drop-based decrement ensures the gauge stays balanced even when a handler panics or the future is cancelled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps every request with timing + active_requests gauge + busy_duration counter. Uses match_pattern() to template routes, OrgWorkspaceMiddleware extensions for tenant labels, and an InFlightGuard for panic safety. Hoists all middleware.rs use statements to the file top (cleanup carried forward from T8-T10 reviews).
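The Drop-based decrement can be illustrated with a minimal sketch. The `InFlightGuard` name comes from the commit messages above, but the body here is an assumption: a plain shared atomic stands in for the OTel up-down-counter.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicI64, Ordering};

// Increments the in-flight gauge on construction and decrements it on Drop.
struct InFlightGuard {
    gauge: Arc<AtomicI64>,
}

impl InFlightGuard {
    fn new(gauge: Arc<AtomicI64>) -> Self {
        gauge.fetch_add(1, Ordering::Relaxed);
        InFlightGuard { gauge }
    }
}

impl Drop for InFlightGuard {
    // Drop runs on normal completion, panic unwind, and future cancellation
    // alike, so the gauge always returns to its prior value.
    fn drop(&mut self) {
        self.gauge.fetch_sub(1, Ordering::Relaxed);
    }
}

fn main() {
    let gauge = Arc::new(AtomicI64::new(0));
    {
        let _guard = InFlightGuard::new(gauge.clone());
        assert_eq!(gauge.load(Ordering::Relaxed), 1);
    } // guard dropped here, even if the scope had panicked
    assert_eq!(gauge.load(Ordering::Relaxed), 0);
    println!("balanced");
}
```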
T14: ObservableGauge callbacks read r2d2::Pool::state() at scrape time.
Emits db.client.connections.usage{state} and db.client.connections.max.
T15: Redis pool gauges via fred metrics with a tolerant RedisStats
trait. FredPoolStats wraps RedisPool and exposes commands_in_flight via
MetricsInterface (summed across pool clients). idle/used connection
counts are not exposed by fred 9.2.1's non-blocking pool and return
None; the corresponding observations are skipped rather than emitted as
zero, so no dead series.
Also adds workspace check-cfg lint config for cfg(tokio_unstable) (new
warning surfaced once cfg-gated code appeared in saturation.rs) and
exports RedisHandle/RedisStats/FredPoolStats from saturation and
observability modules so T18 can populate SaturationDeps.
Note: T14 and T15 are bundled here due to a reflog mishap during review
that lost T15's separate commit; the contents of both tasks are
verified present and reviewed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
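The "skip rather than emit" behavior described above can be sketched like this. `PoolStats`, `FakeStats`, and `observe` are hypothetical stand-ins for the tolerant RedisStats trait and the observable-gauge callbacks; the real code records into OTel instruments rather than a Vec.

```rust
// Stats the backend cannot report return None, and the callback records
// nothing for them, so no dead zero-valued series appear in the scrape.
trait PoolStats {
    fn in_flight(&self) -> Option<u64>;
    fn idle(&self) -> Option<u64>;
}

struct FakeStats;
impl PoolStats for FakeStats {
    fn in_flight(&self) -> Option<u64> { Some(3) }
    fn idle(&self) -> Option<u64> { None } // not exposed by this backend
}

// Stand-in for an observable-gauge callback: only Some values are observed.
fn observe(stats: &dyn PoolStats, out: &mut Vec<(&'static str, u64)>) {
    if let Some(v) = stats.in_flight() {
        out.push(("redis.commands.in_flight", v));
    }
    if let Some(v) = stats.idle() {
        out.push(("redis.connections.idle", v));
    }
}

fn main() {
    let mut series = Vec::new();
    observe(&FakeStats, &mut series);
    // Only one series: the idle count was None and is simply skipped.
    assert_eq!(series, vec![("redis.commands.in_flight", 3)]);
    println!("{} series emitted", series.len());
}
```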
Background sampler updates atomic snapshots; observable gauges read from the snapshot. Gated on cfg(tokio_unstable); compiles to a no-op when the flag is disabled.
Adds the observability health paths to tenant_middleware_exclusion_list so probes do not trigger auth flow. Operators no longer need to remember to put these in TENANT_MIDDLEWARE_EXCLUSION_LIST.
- Init Observability early (Prometheus exporter + optional OTLP push)
- Spawn metrics server on SUPERPOSITION_METRICS_PORT
- Register DB/Redis/Tokio saturation observers
- Wrap App with MetricsMiddleware (gated by SUPERPOSITION_METRICS_ENABLED)
- Mount /healthz /livez /readyz on the main app (auth bypass via T17)
- try_join! both servers so the process exits if either dies

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wraps a small App with MetricsMiddleware, issues requests of various shapes (200, 201, 500, 404), and asserts on the parsed Prometheus exposition: per-route counts, the 5xx series, the __not_found__ sentinel, busy_duration > 0, and active_requests returning to 0.
Asserts that a 3-route × 1-method × 1-status scenario produces no more than 200 series, catching accidental high-cardinality labels in review.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
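A minimal version of such a budget check might look like this; `series_count` is a hypothetical helper that treats each non-comment, non-blank line of a Prometheus text exposition as one sample series.

```rust
// Count sample lines (one series each) in a Prometheus text exposition,
// ignoring `# HELP` / `# TYPE` comment lines and blanks.
fn series_count(exposition: &str) -> usize {
    exposition
        .lines()
        .filter(|l| !l.trim().is_empty() && !l.starts_with('#'))
        .count()
}

fn main() {
    let scrape = "\
# HELP http_server_request_duration_seconds ...
# TYPE http_server_request_duration_seconds histogram
http_server_request_duration_seconds_bucket{route=\"/a\",le=\"0.1\"} 4
http_server_request_duration_seconds_sum{route=\"/a\"} 0.2
http_server_request_duration_seconds_count{route=\"/a\"} 4
";
    let n = series_count(scrape);
    // The regression test asserts a hard cap so accidental high-cardinality
    // labels fail review instead of shipping.
    assert!(n <= 200, "cardinality budget exceeded: {n} series");
    println!("{n} series");
}
```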
`Observability::init` errors (e.g. OTLP endpoint unreachable) no longer panic the binary at startup. Instead they emit a `tracing::warn!` and leave observability as `None`, so the service continues to handle traffic without metrics. `obs_enabled` is now derived from `observability.is_some()` after the init attempt, ensuring the `Condition` gate mirrors the actual init outcome. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… production wiring (review fixes 2 + 6)
Adds two cardinality-control improvements to http.route tracking:
- New `ROUTE_STATIC` sentinel ("__static__") collapses static-asset patterns
(/pkg/*, /assets/*, /favicon*) so they don't inflate the cardinality budget.
- `is_static_pattern(pattern)` is the shared helper used by both
`extract_route` (ServiceRequest phase) and the new
`extract_route_from_response` (ServiceResponse phase).
- Production middleware now calls `extract_route_from_response` instead of
inlining `match_pattern().unwrap_or(ROUTE_NOT_FOUND)`, so static-prefix
collapsing is active for all real traffic.
- Unit test `extract_route_helper_handles_static_paths` verifies the helper.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
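A sketch of the static-collapse helper, assuming the prefixes listed in the commit message above; the actual implementation may match patterns differently.

```rust
const ROUTE_STATIC: &str = "__static__";

// Hypothetical version of the shared helper: collapse static-asset route
// patterns into one sentinel label so they do not inflate cardinality.
fn is_static_pattern(pattern: &str) -> bool {
    pattern.starts_with("/pkg/")
        || pattern.starts_with("/assets/")
        || pattern.starts_with("/favicon")
}

fn label_for(pattern: &str) -> &str {
    if is_static_pattern(pattern) { ROUTE_STATIC } else { pattern }
}

fn main() {
    assert_eq!(label_for("/pkg/app.wasm"), "__static__");
    assert_eq!(label_for("/favicon.ico"), "__static__");
    // Templated API routes keep their own series.
    assert_eq!(label_for("/contexts/{id}"), "/contexts/{id}");
    println!("ok");
}
```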
… var handling (review fixes 4 + 5)
Fix 5 — OTEL_RESOURCE_ATTRIBUTES merge (spec §8.5):
Parse the standard OTel env var ("k1=v1,k2=v2,...") in `Observability::init`
and append each key=value pair to resource_attrs, enabling operators to inject
arbitrary resource dimensions without code changes.
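The merge step might look roughly like this. This is a simplified sketch with a hypothetical helper name; note it does not percent-decode keys or values, and malformed pairs are silently skipped.

```rust
// Parse the OTEL_RESOURCE_ATTRIBUTES format "k1=v1,k2=v2,..." into pairs
// that can be appended to the resource attribute list.
fn parse_resource_attrs(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| {
            let (k, v) = pair.split_once('=')?;
            let (k, v) = (k.trim(), v.trim());
            if k.is_empty() || v.is_empty() {
                None // skip malformed pairs instead of failing init
            } else {
                Some((k.to_string(), v.to_string()))
            }
        })
        .collect()
}

fn main() {
    let attrs = parse_resource_attrs("env=prod,region=us-east-1,broken");
    assert_eq!(
        attrs,
        vec![
            ("env".to_string(), "prod".to_string()),
            ("region".to_string(), "us-east-1".to_string()),
        ]
    );
    println!("{} attrs", attrs.len());
}
```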
Fix 4 — OTLP env var documentation:
The opentelemetry-otlp 0.27 HTTP exporter already reads
OTEL_EXPORTER_OTLP_HEADERS / OTEL_EXPORTER_OTLP_METRICS_HEADERS automatically
during build(), so no explicit wiring is needed. gRPC protocol is not supported
in v1 (requires grpc-tonic feature not in this workspace); document this clearly
in `with_otlp_reader`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spec doc reference removed for now; the env-var table and inline notes in this section are sufficient as the operator-facing surface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Force-pushed 7bf6bad to 1bca586.
Actionable comments posted: 6
🧹 Nitpick comments (8)
crates/service_utils/src/observability/meters.rs (1)
34-37: 💤 Low value. Add a unit to http.server.active_requests for semconv compliance and consistency.

The other two metrics in HttpMeters (request_duration and busy_duration) both specify units via .with_unit(). The OpenTelemetry semantic conventions specify {request} as the unit for http.server.active_requests, so adding it keeps this metric consistent with the file's pattern and fully semconv-compliant.

♻️ Suggested change

```diff
 let active_requests = meter
     .i64_up_down_counter("http.server.active_requests")
+    .with_unit("{request}")
     .with_description("Number of HTTP server requests currently in flight.")
     .build();
```

crates/service_utils/src/observability/middleware.rs (3)
215-241: 💤 Low value. Minor: unnecessary borrow of &res when res is already a reference.

In the Ok(res) arm, res: &ServiceResponse<B> because the outer match is on &result. extract_route_from_response(&res) passes &&ServiceResponse<B>, which compiles via deref coercion but is unidiomatic.

Small style nit:

```diff
- let route = extract_route_from_response(&res);
+ let route = extract_route_from_response(res);
```
312-327: 💤 Low value. Test name promises more than it asserts.

matched_route_returns_pattern only asserts a 200 status; it does not exercise extract_route / extract_route_from_response. The inline comment acknowledges this, but the test name will mislead future readers. Consider renaming to matched_route_setup_smoke, or wiring an actual middleware that records and reads back the captured pattern so the integration test in observability_integration.rs isn't the only behavioral coverage for the response-phase extraction.
242-262: ⚖️ Poor tradeoff. Error path conflates handler errors with real 404s under __not_found__.

When service.call(req) returns Err, the request has been moved and match_pattern() is unreachable, so the route attribute is hard-coded to ROUTE_NOT_FOUND. The acknowledged tradeoff is documented inline, but the metric consequence is that a 500 from a real route like /contexts/{id} ends up in the same time-series as a true 404, which makes per-route error-rate alerting unreliable for the error-as-Err path (handler errors converted by Actix into responses, vs. handlers that return Ok(response_with_500)).

Worth at least noting in the rollout dashboard docs, or, if practical, capturing the matched pattern in request extensions during routing (e.g., from an inner middleware closer to the resource) and reading it back here. Not a blocker.

crates/superposition/src/main.rs (1)
306-313: 💤 Low value. try_join! couples the main API lifetime to the metrics server lifetime.

If the metrics server task ever returns an error (e.g., the underlying listener gets closed, or the bound port is reclaimed), try_join! short-circuits and the main HTTP server is aborted mid-flight. The pattern is acceptable if you treat the metrics endpoint as a hard SLO, but it's worth being explicit: the PR objectives describe "bad OTLP config degrades to disabled with warnings", and this coupling is the opposite stance for the scrape endpoint.

Consider tokio::select! on the main server with a tokio::spawn-detached metrics task, or accept this and document it in the rollout notes.

crates/service_utils/tests/observability_integration.rs (1)
85-91: 💤 Low value. Value parsing assumes no per-sample timestamps.

rsplit_once(' ').unwrap().1 takes the last whitespace-separated token. The Prometheus exposition format permits an optional trailing timestamp (metric{…} value [timestamp]); if the exporter ever emits timestamps, this parses the timestamp as the value and the assertions silently mis-fire. The current opentelemetry-prometheus 0.x exporter does not emit timestamps, so the test passes today; flagging as a brittle assumption to revisit if the exporter is upgraded.

A more robust parse splits on whitespace and asserts exactly two tokens, or extracts the second token of a known-shape split:

```diff
-let busy_value: f64 = busy.rsplit_once(' ').unwrap().1.trim().parse().unwrap();
+let mut parts = busy.split_whitespace();
+let _series = parts.next().unwrap();
+let busy_value: f64 = parts.next().unwrap().parse().unwrap();
```

Also applies to: 118-118, 127-127

crates/service_utils/src/observability/health.rs (1)
33-36: ⚡ Quick win. readyz currently signals ready even when dependencies are down.

The placeholder is acknowledged with a comment, so this is informational. In Kubernetes, readyz returning 200 while DB/Redis are unhealthy means traffic continues to be routed to a pod that cannot serve requests, defeating the purpose of a readiness probe. Track this as a follow-up to wire DB pool ping / Redis ping into readyz before relying on it for traffic gating.

Want me to open a follow-up issue tracking the readyz dependency checks (DB pool get w/ short timeout, Redis ping)?

crates/service_utils/src/observability.rs (1)
147-155: 💤 Low value. #[cfg(test)] stub silently drops OTLP wiring.

Test builds ignore _endpoint entirely, so any unit test that sets otlp_endpoint = Some(...) exercises the prometheus-only path. That's intentional per the comment, but it means OTLP-specific behavior (endpoint resolution, header merging, period validation) has zero unit coverage. Either acknowledge in the comment that OTLP coverage relies on integration/manual tests, or add a token assertion (e.g., that the function is reachable) so future readers don't expect symmetric behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/service_utils/src/observability.rs`:
- Around line 85-96: The OTEL_RESOURCE_ATTRIBUTES parsing loop does not
percent-decode keys/values (violating the OTel/W3C baggage format); update the
loop that iterates extra.split(',') and uses pair.split_once('=') so that both k
and v are percent-decoded (use the existing urlencoding crate) before trimming
and pushing into resource_attrs via KeyValue::new; ensure decoding errors are
handled (skip or log the offending pair) and decode both the key and value
consistently.
In `@crates/service_utils/src/observability/saturation/tokio_runtime.rs`:
- Around line 50-58: Replace the configured interval with the actual elapsed
duration when computing the per-worker busy ratio: use m.elapsed (not interval)
as the time base while preserving the per-worker semantics by multiplying by
(m.workers_count as f64). In other words, compute total =
m.elapsed.as_secs_f64() * (m.workers_count as f64).max(1.0) and keep storing the
scaled value into snap_for_task.busy_ratio_milli via
snap_for_task.busy_ratio_milli.store((ratio * 1000.0) as u64,
Ordering::Relaxed), so you avoid under-reporting when sleep returns late but
retain the per-worker metric semantics rather than switching to the aggregate
m.busy_ratio().
In `@crates/superposition/src/main.rs`:
- Around line 192-199: The metrics bind construction currently formats a string
and parses to SocketAddr which fails for IPv6 (e.g., "::1:9091"); change the
code in the block that sets metrics_server_handle (references: observability,
obs_cfg, metrics_server_handle, spawn_metrics_server) to construct the address
with SocketAddr::new(std::net::IpAddr::from_str(&obs_cfg.bind)? or obs_cfg.bind
parsed to IpAddr, and obs_cfg.port as u16) instead of string parsing, and if
spawn_metrics_server returns an Err treat it like Observability::init failures
(log a warning and set metrics_server_handle to None) so a bind error doesn’t
panic the process. Ensure the fix parses obs_cfg.bind to IpAddr (handling
parsing errors gracefully) and uses SocketAddr::new to support IPv6 properly.
In `@docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md`:
- Line 18: Replace the machine-specific absolute path string
"/Users/natarajankannan/src/superposition/" on line containing "All paths in
this plan are absolute under the repo root" with a repo-root-relative notation
(e.g., "repo-root-relative paths" or a placeholder like "<repo-root>/...") so
the document uses portable, non-machine-specific paths; update the sentence in
the same paragraph to instruct contributors to use repo-root-relative or
environment-variable-based placeholders instead of local absolute filesystem
paths.
In `@docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md`:
- Line 69: Two fenced code blocks in the doc (the block showing "[request on
:8080] ..." and the block listing "crates/service_utils/src/...") are unlabeled
and trigger markdownlint MD040; update each opening triple-backtick to include
the language identifier text (i.e., change ``` to ```text) so both code fences
become labeled, leaving their closing backticks unchanged.
In `@README.md`:
- Around line 157-159: The markdown code fence containing the curl example is
missing a language identifier, triggering markdownlint MD040; update the fenced
block that wraps the curl invocation (the triple-backtick block around "curl
http://localhost:9091/metrics") to include the bash language identifier
(```bash) so the snippet is syntax-highlighted and the lint rule passes.
---
Nitpick comments:
In `@crates/service_utils/src/observability.rs`:
- Around line 147-155: The #[cfg(test)] stub for with_otlp_reader silently
ignores the _endpoint and thus gives no unit coverage for OTLP-specific behavior
(endpoint resolution, header merging, period validation); either update the
comment above with_otlp_reader to explicitly state that OTLP behavior is covered
only by integration/manual tests (and mention otlp_endpoint/ObservabilityError),
or make the test-only stub assert it's been reached (e.g., a token/debug
assertion referencing the _endpoint) so future readers know the function is
intentionally non-functional in unit tests.
In `@crates/service_utils/src/observability/health.rs`:
- Around line 33-36: The readyz handler (function readyz) currently always
returns OK and must be updated to probe critical dependencies before signaling
readiness: add checks for the DB connection pool (attempt a quick get or simple
ping with a short timeout) and Redis (PING) and only return HttpResponse::Ok
when those succeed; on any failure return an appropriate non-200 response (e.g.,
503) and log the underlying error so Kubernetes stops routing traffic to
unhealthy pods. Use the existing livez behavior as a template if present and
ensure checks are fast and non-blocking (async) to avoid delaying readiness
responses.
In `@crates/service_utils/src/observability/meters.rs`:
- Around line 34-37: The metric "http.server.active_requests" (built via
meter.i64_up_down_counter in HttpMeters as the active_requests variable) is
missing a unit; update the builder chain for active_requests to include
.with_unit("{request}") before .build() so it matches the pattern used by
request_duration and busy_duration and complies with the OpenTelemetry semconv
for this metric.
In `@crates/service_utils/src/observability/middleware.rs`:
- Around line 215-241: In the Ok(res) match arm you’re passing a double
reference to extract_route_from_response by calling
extract_route_from_response(&res) even though res is already a
&ServiceResponse<B>; change the call to extract_route_from_response(res) (and
any similar places) to avoid the unnecessary borrow and keep types idiomatic
(locate this in the Ok(res) branch where extract_route_from_response,
build_attributes, meters.request_duration.record, and meters.busy_duration.add
are used).
- Around line 312-327: Rename or update the test matched_route_returns_pattern
so its name matches behavior or it actually asserts route extraction: either
rename to matched_route_setup_smoke (to reflect it only asserts StatusCode::OK)
or modify the test to install the middleware that calls extract_route /
extract_route_from_response and assert the captured pattern is "/contexts/{id}"
(use the test harness App::new().wrap(your_observability_middleware) and make
the middleware expose the matched pattern for assertion); reference the test
function matched_route_returns_pattern and the extraction helpers extract_route
and extract_route_from_response when changing the test so behavioral coverage
isn't only in observability_integration.rs.
- Around line 242-262: The error-path currently hardcodes ROUTE_NOT_FOUND
because match_pattern() is unavailable after service.call(req) moves the
request; change this by having the routing layer store the matched pattern into
the request extensions (e.g., under a key like "matched_pattern") when matching
a route, and then in this middleware (where service.call(req) and the Err branch
live) try to read that extension and use the stored pattern instead of
ROUTE_NOT_FOUND when calling build_attributes; update the Err block to fallback
to ROUTE_NOT_FOUND only if the extension is missing, and keep using
service.call, match_pattern(), build_attributes, and meters.* as the integration
points.
In `@crates/service_utils/tests/observability_integration.rs`:
- Around line 85-91: The current parsing of ping_count_line uses rsplit_once('
').unwrap().1 which will pick up an optional Prometheus timestamp as the value;
change parsing to split_whitespace (or split_ascii_whitespace), collect tokens,
assert tokens.len() >= 2 and use tokens[ tokens.len() - 2 ] (the value) or
preferably assert exactly 2 tokens when you expect no timestamps, then parse
that token into ping_count; apply the same change to the other occurrences
mentioned (the lines around ping_count_line and the similar parses at the other
two locations) so tests fail loudly if format differs instead of silently
parsing a timestamp as the value.
In `@crates/superposition/src/main.rs`:
- Around line 306-313: The current use of futures_util::try_join! with
metrics_server_handle couples the lifetime of main_server to metrics errors;
change to spawn the metrics task (metrics_server_handle) with tokio::spawn so it
runs detached and use tokio::select! (or await main_server directly) to ensure
the main_server (the API HTTP server) is not aborted if the metrics task returns
an error — instead log/handle errors from the spawned metrics task (e.g., attach
a task that .awaits and logs its error) while allowing main_server.await to
continue independently.
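The Prometheus-parsing nitpick for `observability_integration.rs` can be sketched without any test harness. `parse_sample_value` below is a hypothetical helper illustrating the token-based parsing the comment asks for, not code from the PR:

```rust
// Hypothetical helper showing the robust way to pull a sample value out of a
// Prometheus text-exposition line. Lines are either "<name> <value>" or
// "<name> <value> <timestamp_ms>"; taking the last whitespace token (as
// rsplit_once(' ') does) silently returns the timestamp when one is present.
fn parse_sample_value(line: &str) -> Option<f64> {
    let tokens: Vec<&str> = line.split_whitespace().collect();
    match tokens.len() {
        2 | 3 => tokens[1].parse().ok(), // value is always the second token
        _ => None,                       // reject unexpected shapes instead of guessing
    }
}

fn main() {
    assert_eq!(parse_sample_value("superposition_ping_total 42"), Some(42.0));
    // With a timestamp, rsplit_once would have parsed 1700000000000 instead.
    assert_eq!(
        parse_sample_value("superposition_ping_total 42 1700000000000"),
        Some(42.0)
    );
    assert_eq!(parse_sample_value("malformed"), None);
    println!("ok");
}
```

Returning `None` on any other token count makes a format drift fail the test instead of silently parsing the wrong field.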
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 692299ed-71fc-42c3-9a01-9fbf6addfee3
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (22)
- .cargo/config.toml
- .gitignore
- Cargo.toml
- README.md
- crates/service_utils/Cargo.toml
- crates/service_utils/src/lib.rs
- crates/service_utils/src/observability.rs
- crates/service_utils/src/observability/config.rs
- crates/service_utils/src/observability/health.rs
- crates/service_utils/src/observability/meters.rs
- crates/service_utils/src/observability/metrics_server.rs
- crates/service_utils/src/observability/middleware.rs
- crates/service_utils/src/observability/saturation.rs
- crates/service_utils/src/observability/saturation/db_pool.rs
- crates/service_utils/src/observability/saturation/redis_pool.rs
- crates/service_utils/src/observability/saturation/tokio_runtime.rs
- crates/service_utils/tests/observability_integration.rs
- crates/superposition/Cargo.toml
- crates/superposition/src/app_state.rs
- crates/superposition/src/main.rs
- docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md
- docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md
```rust
// §8.5 — merge OTEL_RESOURCE_ATTRIBUTES ("k1=v1,k2=v2,...") if set.
if let Ok(extra) = std::env::var("OTEL_RESOURCE_ATTRIBUTES") {
    for pair in extra.split(',') {
        if let Some((k, v)) = pair.split_once('=') {
            let k = k.trim().to_owned();
            let v = v.trim().to_owned();
            if !k.is_empty() {
                resource_attrs.push(KeyValue::new(k, v));
            }
        }
    }
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify percent-encoding crate availability and current OTEL_RESOURCE_ATTRIBUTES handling.
rg -n 'percent-encoding' --type=toml
fd -e toml -x rg -nH 'percent-encoding' {} \;
🏁 Script executed:
cd crates/service_utils && cat Cargo.toml | grep -A 50 "^\[dependencies\]"
🌐 Web query:
OTel environment variable spec OTEL_RESOURCE_ATTRIBUTES W3C Baggage percent-encoding encoding
💡 Result:
OTEL_RESOURCE_ATTRIBUTES encoding/percent-encoding rules (grounded in the current OpenTelemetry Resource SDK spec and the W3C Baggage spec):

1) How OTEL_RESOURCE_ATTRIBUTES is formatted
   - OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs (key1=value1,key2=value2). [1]

2) What "percent-encoding" is required for in OTEL_RESOURCE_ATTRIBUTES
   - In OTEL_RESOURCE_ATTRIBUTES, the characters ',' and '=' in both keys and values MUST be percent-encoded. [1]
   - Other characters MAY be percent-encoded (e.g., values outside the ANSI character set). [1]

3) Relation to W3C Baggage and W3C baggage value percent-encoding
   - OpenTelemetry states OTEL_RESOURCE_ATTRIBUTES is represented using a format matching W3C Baggage, with the specific OTEL_RESOURCE_ATTRIBUTES rules above for separating key/value pairs via comma and '=' and requiring percent-encoding of ',' and '='. [1]
   - In W3C Baggage, the baggage entry value is a UTF-8 string, where any characters outside the baggage-octet range MUST be percent-encoded. [2]
   - W3C Baggage also specifies that the percent code point (U+0025, i.e. '%') MUST be percent-encoded. [2]
   - W3C Baggage defines percent-encoding per RFC3986 Section 2.1. [2]

4) How percent-encoding is interpreted (decoding expectations)
   - W3C Baggage: when decoding, percent-encoded octet sequences that do not match the UTF-8 encoding scheme are replaced with U+FFFD. [2]
   - OpenTelemetry resource SDK: during decoding, if an error occurs (e.g., failure during decoding), the entire environment variable value SHOULD be discarded and an error SHOULD be reported. [1]

5) Practical implications / example
   - Because ',' and '=' MUST be percent-encoded in OTEL_RESOURCE_ATTRIBUTES, values containing commas or equals signs must be written with percent escapes, so they don't get misinterpreted as list separators or key/value separators.

Sources (most directly relevant):
- OpenTelemetry Resource SDK spec: specification/resource/sdk.md (OTEL_RESOURCE_ATTRIBUTES format + required percent-encoding + error handling). [1]
- W3C Baggage spec (percent-encoding rules for baggage values, including baggage-octet range, UTF-8, and '%' encoding). [2]
Citations:
- 1: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md
- 2: https://www.w3.org/TR/baggage/
🏁 Script executed:
# Search for tests of OTEL_RESOURCE_ATTRIBUTES parsing
rg -n 'OTEL_RESOURCE_ATTRIBUTES' --type rust
# Check urlencoding crate capabilities
rg -n 'urlencoding' crates/service_utils/
🌐 Web query:
urlencoding rust crate decode percent-decoding function API
💡 Result:
In the Rust crate urlencoding, percent-decoding is done via these public APIs:

1) urlencoding::decode
   - Signature/behavior: decodes a percent-encoded string assuming UTF-8, and returns a Result (the docs show `decode(...).expect("UTF-8")`/`unwrap` usage). [1]
   - Example usage: `use urlencoding::decode; let decoded = decode("%F0%9F%91%BE%20Exterminate%21").expect("UTF-8"); // decoded == "👾 Exterminate!"` [1]

2) urlencoding::decode_binary
   - Signature/behavior: decodes percent-encoded input as raw bytes (binary data), allowing arbitrary bytes and invalid UTF-8; you then convert as needed (e.g., `String::from_utf8_lossy`). [1]
   - Example usage: `use urlencoding::decode_binary; let binary = decode_binary(b"%F1%F2%F3%C0%C1%C2"); let decoded = String::from_utf8_lossy(&binary);` [1]

How percent-decoding works conceptually in this ecosystem: a commonly used lower-level percent-decoder in Rust is the percent_encoding crate's percent_decode function, whose API takes bytes and percent-decodes any "%" followed by two hex digits, returning an iterator-like type with decode_utf8/decode_utf8_lossy. [2] Signature (from docs): `pub fn percent_decode(input: &[u8]) -> PercentDecode<'_>`. [2]

So, for "urlencoding crate decode percent-decoding function API": use urlencoding::decode (UTF-8 Result) or urlencoding::decode_binary (raw bytes) depending on whether you need to handle invalid UTF-8.
Citations:
- 1: https://docs.rs/urlencoding/latest/urlencoding/
- 2: https://docs.rs/percent-encoding/*/percent_encoding/fn.percent_decode.html
🏁 Script executed:
# Check for existing tests of OTEL_RESOURCE_ATTRIBUTES
rg -A 10 -B 5 'OTEL_RESOURCE_ATTRIBUTES' crates/service_utils/src/observability.rs | head -40
# Check how urlencoding::decode is used elsewhere
rg -B 2 -A 3 'urlencoding::decode' crates/
🏁 Script executed:
# Look for tests in observability.rs
rg -n '#\[test\]|#\[cfg\(test\)\]' crates/service_utils/src/observability.rs
# Look for integration tests that might test observability init
fd -e rs 'test' crates/service_utils/src/ | head -10
# Check if there are any tests for observability module
fd observability crates/service_utils/
🏁 Script executed:
# Check the integration test file
cat crates/service_utils/tests/observability_integration.rs
# Also check lines around 147, 157, 176, 186 in observability.rs for test details
sed -n '145,190p' crates/service_utils/src/observability.rs
OTEL_RESOURCE_ATTRIBUTES values are not URL-decoded, violating the OTel specification.
Per the OpenTelemetry resource SDK specification, OTEL_RESOURCE_ATTRIBUTES follows the W3C Baggage format and requires percent-decoding of both keys and values (per RFC 3986). The current implementation (lines 85–96) trims whitespace but does not percent-decode, so an operator setting e.g. OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod%20us-east-1 ends up with the literal string prod%20us-east-1 instead of prod us-east-1.
The urlencoding crate is already in the project's dependencies. A straightforward fix:
Suggested fix
```diff
 if let Some((k, v)) = pair.split_once('=') {
     let k = k.trim().to_owned();
-    let v = v.trim().to_owned();
+    let v = urlencoding::decode(v.trim())
+        .unwrap_or_else(|_| v.trim().into())
+        .into_owned();
     if !k.is_empty() {
         resource_attrs.push(KeyValue::new(k, v));
     }
 }
```

Also consider decoding keys the same way, as the OTel spec requires percent-encoding for both.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/service_utils/src/observability.rs` around lines 85 - 96, The
OTEL_RESOURCE_ATTRIBUTES parsing loop does not percent-decode keys/values
(violating the OTel/W3C baggage format); update the loop that iterates
extra.split(',') and uses pair.split_once('=') so that both k and v are
percent-decoded (use the existing urlencoding crate) before trimming and pushing
into resource_attrs via KeyValue::new; ensure decoding errors are handled (skip
or log the offending pair) and decode both the key and value consistently.
```rust
// Busy fraction = total worker-seconds busy / total
// worker-seconds available over the sample interval.
let busy = m.total_busy_duration.as_secs_f64();
let total =
    interval.as_secs_f64() * (m.workers_count as f64).max(1.0);
let ratio = (busy / total).clamp(0.0, 1.0);
snap_for_task
    .busy_ratio_milli
    .store((ratio * 1000.0) as u64, Ordering::Relaxed);
```
🧩 Analysis chain
🌐 Web query:
In tokio-metrics 0.3, does `RuntimeMetrics` expose a `busy_ratio()` method and an `elapsed` field, and is `busy_ratio()` computed as total_busy_duration / (elapsed * workers_count)?
💡 Result:
Yes—at least in the docs for tokio-metrics RuntimeMetrics, there is a busy_ratio method and an elapsed field, but the busy_ratio computation formula you proposed is not supported by tokio-metrics 0.3 docs.

1) RuntimeMetrics has an elapsed field
   - RuntimeMetrics (tokio-metrics 0.3.x) includes a public field `elapsed: Duration` [1].

2) RuntimeMetrics has a busy_ratio method
   - The RuntimeMetrics API also includes a `pub fn busy_ratio(&self) -> f64` method [2].

3) busy_ratio computation: not total_busy_duration / (elapsed * workers_count)
   - The docs for busy_ratio in the RuntimeMetrics API state that busy_ratio "Returns the ratio of RuntimeMetrics::total_busy_duration to the RuntimeMetrics::elapsed." [2]
   - That description implies busy_ratio = total_busy_duration / elapsed (i.e., no multiplication by workers_count in the denominator) [2]. I found no doc evidence that busy_ratio multiplies elapsed by workers_count in the formula.

Conclusion: RuntimeMetrics exposes busy_ratio and elapsed, but busy_ratio is documented as total_busy_duration / elapsed, not total_busy_duration / (elapsed * workers_count).
Citations:
- 1: https://docs.rs/tokio-metrics/0.3.1/tokio_metrics/struct.RuntimeMetrics.html
- 2: https://docs.rs/tokio-metrics/latest/tokio_metrics/struct.RuntimeMetrics.html
🏁 Script executed:
cat -n crates/service_utils/src/observability/saturation/tokio_runtime.rs | head -80
🏁 Script executed:
cat -n crates/service_utils/src/observability/saturation/tokio_runtime.rs | sed -n '80,100p'
Use actual elapsed duration instead of configured interval when computing busy ratio, but preserve per-worker semantics.
The concern about using the configured interval is valid—under scheduler pressure, tokio::time::sleep(interval) may return late, causing the denominator to be inaccurate and under-reporting busy ratio when it matters most. However, the suggested diff has a hidden semantic issue:
tokio_metrics::RuntimeMetrics::busy_ratio() computes total_busy_duration / elapsed (aggregate ratio across all workers), whereas the current code computes total_busy_duration / (elapsed * workers_count) (per-worker average). The metric is named runtime.tokio.workers.busy_ratio, suggesting per-worker semantics should be preserved.
The correct fix: replace the configured interval with m.elapsed, but keep the workers_count multiplier:
```diff
-let busy = m.total_busy_duration.as_secs_f64();
-let total =
-    interval.as_secs_f64() * (m.workers_count as f64).max(1.0);
+let busy = m.total_busy_duration.as_secs_f64();
+let total =
+    m.elapsed.as_secs_f64() * (m.workers_count as f64).max(1.0);
 let ratio = (busy / total).clamp(0.0, 1.0);
```

Alternatively, if you intentionally want to switch to aggregate ratio, use `m.busy_ratio()` directly but document why the metric semantics changed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/service_utils/src/observability/saturation/tokio_runtime.rs` around
lines 50 - 58, Replace the configured interval with the actual elapsed duration
when computing the per-worker busy ratio: use m.elapsed (not interval) as the
time base while preserving the per-worker semantics by multiplying by
(m.workers_count as f64). In other words, compute total =
m.elapsed.as_secs_f64() * (m.workers_count as f64).max(1.0) and keep storing the
scaled value into snap_for_task.busy_ratio_milli via
snap_for_task.busy_ratio_milli.store((ratio * 1000.0) as u64,
Ordering::Relaxed), so you avoid under-reporting when sleep returns late but
retain the per-worker metric semantics rather than switching to the aggregate
m.busy_ratio().
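The difference between the configured interval and the measured window is easy to check with plain `Duration` arithmetic. The sketch below assumes nothing from tokio-metrics; `per_worker_busy_ratio` is a hypothetical helper mirroring the formula the fix prescribes (measured elapsed in the denominator, `workers_count` multiplier retained):

```rust
use std::time::Duration;

// Per-worker busy ratio: busy worker-seconds over the worker-seconds actually
// available in the sampled window. Using the measured elapsed time (rather
// than the configured interval) keeps the denominator honest when the
// sampling sleep wakes up late.
fn per_worker_busy_ratio(total_busy: Duration, elapsed: Duration, workers: u32) -> f64 {
    let total = elapsed.as_secs_f64() * (workers as f64).max(1.0);
    (total_busy.as_secs_f64() / total).clamp(0.0, 1.0)
}

fn main() {
    // 4 workers over a 10 s window with 12 worker-seconds busy -> 0.3.
    let on_time = per_worker_busy_ratio(Duration::from_secs(12), Duration::from_secs(10), 4);
    // If the sleep returns 2 s late, dividing by the configured 10 s would
    // misstate the ratio; the measured 12 s window gives the true value.
    let late = per_worker_busy_ratio(Duration::from_secs(12), Duration::from_secs(12), 4);
    println!("{on_time:.3} {late:.3}"); // prints 0.300 0.250
}
```

The `.max(1.0)` guard and `.clamp(0.0, 1.0)` carry over unchanged from the code under review.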
```rust
let metrics_server_handle = if let Some(obs) = observability.as_ref() {
    let bind: std::net::SocketAddr = format!("{}:{}", obs_cfg.bind, obs_cfg.port)
        .parse()
        .expect("invalid metrics bind addr");
    Some(spawn_metrics_server(obs.registry(), bind)?)
} else {
    None
};
```
Metrics bind construction is broken for IPv6 addresses.
format!("{}:{}", obs_cfg.bind, obs_cfg.port).parse::<SocketAddr>() works for IPv4, but if an operator configures SUPERPOSITION_METRICS_BIND=:: or ::1, the resulting string ::1:9091 is not a valid SocketAddr representation — IPv6 socket strings require bracketed host ([::1]:9091). This will trigger the .expect("invalid metrics bind addr") panic at startup with no warning path, contradicting the graceful-degradation philosophy applied to Observability::init above.
Use SocketAddr::new to build the address directly from the typed IpAddr and u16 — no string round-trip, no IPv6 special-casing.
🛠️ Proposed fix
```diff
-let metrics_server_handle = if let Some(obs) = observability.as_ref() {
-    let bind: std::net::SocketAddr = format!("{}:{}", obs_cfg.bind, obs_cfg.port)
-        .parse()
-        .expect("invalid metrics bind addr");
-    Some(spawn_metrics_server(obs.registry(), bind)?)
-} else {
-    None
-};
+let metrics_server_handle = if let Some(obs) = observability.as_ref() {
+    let bind = std::net::SocketAddr::new(obs_cfg.bind, obs_cfg.port);
+    match spawn_metrics_server(obs.registry(), bind) {
+        Ok(h) => Some(h),
+        Err(e) => {
+            tracing::warn!(error = %e, %bind, "metrics server failed to start; continuing without scrape endpoint");
+            None
+        }
+    }
+} else {
+    None
+};
```

Two changes:

1. `SocketAddr::new(IpAddr, u16)` — IPv4/IPv6 correct, no parsing.
2. Treat metrics-server bind failure the same way an `Observability::init` failure is treated (warn + disable), so a transient port conflict on `:9091` doesn't take down the API.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/superposition/src/main.rs` around lines 192 - 199, The metrics bind
construction currently formats a string and parses to SocketAddr which fails for
IPv6 (e.g., "::1:9091"); change the code in the block that sets
metrics_server_handle (references: observability, obs_cfg,
metrics_server_handle, spawn_metrics_server) to construct the address with
SocketAddr::new(std::net::IpAddr::from_str(&obs_cfg.bind)? or obs_cfg.bind
parsed to IpAddr, and obs_cfg.port as u16) instead of string parsing, and if
spawn_metrics_server returns an Err treat it like Observability::init failures
(log a warning and set metrics_server_handle to None) so a bind error doesn’t
panic the process. Ensure the fix parses obs_cfg.bind to IpAddr (handling
parsing errors gracefully) and uses SocketAddr::new to support IPv6 properly.
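The IPv4-only failure mode is easy to demonstrate with std alone. This snippet assumes nothing about the PR's config types; `metrics_bind` is a hypothetical helper that just exercises `SocketAddr` string parsing versus `SocketAddr::new`:

```rust
use std::net::{IpAddr, SocketAddr};
use std::str::FromStr;

// Build a bind address from a typed IpAddr instead of formatting and
// re-parsing a "host:port" string, which breaks for IPv6 because the
// textual form requires brackets: "[::1]:9091".
fn metrics_bind(ip: &str, port: u16) -> Option<SocketAddr> {
    Some(SocketAddr::new(IpAddr::from_str(ip).ok()?, port))
}

fn main() {
    // The string round-trip works for IPv4 but not for bare IPv6:
    assert!("127.0.0.1:9091".parse::<SocketAddr>().is_ok());
    assert!("::1:9091".parse::<SocketAddr>().is_err()); // would hit the expect()
    // SocketAddr::new sidesteps string formatting entirely:
    let bind = metrics_bind("::1", 9091).unwrap();
    println!("{bind}"); // prints [::1]:9091
}
```

Returning `Option` also gives the caller a natural place to log a warning and disable the metrics server instead of panicking on a bad bind address.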
- The OpenTelemetry Rust SDK has had API churn between minor versions. The exact import paths and builder method names below are written against `opentelemetry` 0.27. If you pin a different version in Task 1, expect to adjust 1–3 import paths or method names per call site. The plan uses the **stable** APIs only (no unstable/preview features).
- After Task 1 (deps), run `cargo check -p service_utils` after every code-touching task to catch wiring issues early — even on tasks that don't add tests yet.
- File commit boundary: each task ends with one commit. If a step within a task fails, fix and continue within the same task before committing.
- All paths in this plan are absolute under the repo root: `/Users/natarajankannan/src/superposition/`.
Avoid machine-specific absolute paths in the plan.
Line 18 hardcodes a local filesystem path, which makes the instructions non-portable for other contributors and CI shells. Prefer repo-root-relative paths throughout.
Suggested doc edit

```diff
-- All paths in this plan are absolute under the repo root: `/Users/natarajankannan/src/superposition/`.
+- All paths in this plan are relative to the repository root.
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md` at line
18, Replace the machine-specific absolute path string
"/Users/natarajankannan/src/superposition/" on line containing "All paths in
this plan are absolute under the repo root" with a repo-root-relative notation
(e.g., "repo-root-relative paths" or a placeholder like "<repo-root>/...") so
the document uses portable, non-machine-specific paths; update the sentence in
the same paragraph to instruct contributors to use repo-root-relative or
environment-variable-based placeholders instead of local absolute filesystem
paths.
````text
### 5.1 Data flow

```
````
Add language identifiers to fenced code blocks (MD040).
Lines 69 and 108 use unlabeled fenced blocks, which fails markdownlint. Add a language tag (for these blocks, text fits).
Suggested doc edit

````diff
-```
+```text
 [request on :8080]
 ...
 ```
-```
+```text
 crates/service_utils/src/
 ...
 ```
````

Also applies to: 108-108
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 69-69: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md`
at line 69, Two fenced code blocks in the doc (the block showing "[request on
:8080] ..." and the block listing "crates/service_utils/src/...") are unlabeled
and trigger markdownlint MD040; update each opening triple-backtick to include
the language identifier text (i.e., change ``` to ```text) so both code fences
become labeled, leaving their closing backticks unchanged.
````text
```
curl http://localhost:9091/metrics
```
````
Add a language identifier to the code fence.
markdownlint (MD040) flags this fence. Use bash so the curl snippet gets syntax highlighting and the lint passes.
📝 Suggested change

````diff
-```
-curl http://localhost:9091/metrics
-```
+```bash
+curl http://localhost:9091/metrics
+```
````
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 157-157: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` around lines 157 - 159, The markdown code fence containing the
curl example is missing a language identifier, triggering markdownlint MD040;
update the fenced block that wraps the curl invocation (the triple-backtick
block around "curl http://localhost:9091/metrics") to include the bash language
identifier (```bash) so the snippet is syntax-highlighted and the lint rule
passes.
… (items 1 & 3) Item 1: Wrap OTLP reader init in a match so failure logs a warning and continues with Prometheus-only metrics instead of propagating Err and disabling the /metrics scrape endpoint entirely. Item 3: Warn at startup when OTEL_EXPORTER_OTLP_PROTOCOL is set to a value other than 'http/protobuf' (e.g. grpc), since only HTTP/proto is supported in v1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The crate was only listed to satisfy the unused_crate_dependencies lint via a use-as-_ in lib.rs; actual metric attribute keys are hardcoded strings. Remove it from both workspace and service_utils Cargo.toml, and remove all the use-as-_ shims from lib.rs (the real crates are now imported directly in the observability sub-modules). Also restructure Observability::init to clone the Resource so the Prom-only fallback path can reuse it after the builder is consumed by with_otlp_reader. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Item 4: Drop the empty #[cfg(test)] mod tests {} placeholder in db_pool.rs;
real coverage lives in tests/observability_integration.rs.
Items 5+6: Remove the #[cfg(not(tokio_unstable))] spawn stub from
tokio_runtime.rs. The call site in saturation.rs is itself gated on
#[cfg(tokio_unstable)], so no symbol resolution is needed when the flag
is off. Add a scoped cfg(not(tokio_unstable)) use-as-_ in lib.rs to
keep the unused_crate_dependencies lint green on both build paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…item 7) Replace the idle_connections/used_connections trait methods (which always returned None because fred 9.2.1 has no such public API) with a single connected_connections() method using ClientLike::is_connected() to count active pool members. Replace the redis.client.connections.usage gauge (which never emitted data) with redis.client.connections.connected. Keep the existing redis.client.commands.in_flight gauge unchanged. Remove the #[allow(dead_code)] on FredPoolStats since both getters are now meaningful. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…from_env (item 8)

Introduce from_source<F>(get: F) that accepts a closure returning Option<String> per key. from_env() delegates to it using std::env::var. Tests rewritten to use from_source with a HashMap-backed closure — no unsafe set_var/remove_var, no Mutex, no --test-threads=1 requirement. Three tests pass in parallel with zero process-global state mutation.

Also add cfg(test) use-as-_ for opentelemetry_otlp in lib.rs to keep the unused_crate_dependencies lint clean when the not(test) OTLP reader function is compiled out during test runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
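The `from_source` pattern above can be sketched as follows. Field names and defaults here are illustrative (taken from the env-var table in the PR description); the real struct has more fields.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct MetricsConfig {
    enabled: bool,
    port: u16,
}

impl MetricsConfig {
    // Parsing is driven by a caller-supplied lookup closure, so tests never
    // mutate process-global env state.
    fn from_source<F: Fn(&str) -> Option<String>>(get: F) -> Result<Self, String> {
        let enabled = match get("SUPERPOSITION_METRICS_ENABLED") {
            None => true, // default
            Some(v) => v
                .parse()
                .map_err(|e| format!("SUPERPOSITION_METRICS_ENABLED: {e}"))?,
        };
        let port = match get("SUPERPOSITION_METRICS_PORT") {
            None => 9091, // default
            Some(v) => v
                .parse()
                .map_err(|e| format!("SUPERPOSITION_METRICS_PORT: {e}"))?,
        };
        Ok(Self { enabled, port })
    }

    // Production entry point delegates to the same parser over real env vars.
    fn from_env() -> Result<Self, String> {
        Self::from_source(|k| std::env::var(k).ok())
    }
}

fn main() {
    // HashMap-backed closure: no set_var/remove_var, safe to run in parallel.
    let vars: HashMap<&str, &str> = [("SUPERPOSITION_METRICS_PORT", "9999")].into();
    let cfg = MetricsConfig::from_source(|k| vars.get(k).map(|v| v.to_string())).unwrap();
    assert_eq!(cfg, MetricsConfig { enabled: true, port: 9999 });
    let _ = MetricsConfig::from_env();
}
```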
`health_endpoints()` previously returned `web::scope("")` with the three
health routes inside it. An empty-prefix Actix scope matches every
request path and returns 404 for any path that does not match one of
the routes inside the scope — so `/admin/organisations` and every other
non-health URL was being swallowed by the health scope and returning
404 before reaching Leptos or the main API scope registered after it.
Replaces the helper with `configure_health_endpoints(&mut ServiceConfig)`
that registers the three routes directly at App root via `.configure`,
so unmatched paths fall through to subsequent services as expected.
Adds a regression test `does_not_shadow_other_routes` that wires both
the health endpoints and an unrelated `/other` route into the same App
and verifies both reach their handlers.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tic-conventions The Cargo.lock entry was missed in e98e978 when the dependency was dropped from Cargo.toml. CI's uniffi bindings job regenerates the lock during build and the resulting diff was failing the post-build `git diff --exit-code` check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`make fmt` output on files modified in this branch. All changes are purely stylistic (edition-2024 formatting per .rustfmt.toml): import re-sorting in `use` blocks, trailing commas, and line wrapping within max_width=90. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five clippy errors caught by `cargo clippy -- -Dwarnings` on rustc 1.90
(CI toolchain):
- middleware.rs:222: drop needless borrow on `res` already bound from
`&result` destructuring (needless_borrow).
- middleware.rs:317: bare `HttpResponse::Ok()` returns the awaitable
HttpResponseBuilder; calling `.finish()` produces an HttpResponse and
clears async_yields_async in the test handler.
- db_pool.rs:44 + redis_pool.rs:{73,87}: use `std::slice::from_ref(&kv)`
for the single-attribute slice instead of `&[kv.clone()]`
(cloned_ref_to_slice_refs).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
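The `from_ref` fix in the list above can be illustrated with plain tuples standing in for OpenTelemetry `KeyValue` attributes:

```rust
fn main() {
    // A single attribute; cloning it into a temporary array (`&[kv.clone()]`)
    // would allocate two fresh Strings per observation.
    let kv = ("http.route".to_string(), "/metrics".to_string());

    // std::slice::from_ref borrows the existing value as a one-element slice,
    // which is what clippy's cloned_ref_to_slice_refs lint suggests.
    let attrs: &[(String, String)] = std::slice::from_ref(&kv);

    assert_eq!(attrs.len(), 1);
    assert_eq!(attrs[0].1, "/metrics");
}
```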
Same fix as e9e4fea — five more bare `HttpResponse::*()` calls in test handlers were returning the awaitable HttpResponseBuilder. Appending `.finish()` materialises an HttpResponse and clears the `clippy::async_yields_async` lint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…urce attrs, accurate busy_ratio

Three correctness fixes flagged by CodeRabbit:

- main.rs: build the metrics SocketAddr from the parsed IpAddr in config (SocketAddr::new) instead of via string formatting + parse, which is ambiguous for IPv6 addresses like "::1:9091" (needs brackets). Demote spawn_metrics_server errors from a process-killing `?` to a warn-and-continue, matching the OTLP "best-effort" stance.
- observability.rs: percent-decode keys and values from OTEL_RESOURCE_ATTRIBUTES per the W3C baggage format the OTel spec references. Without this, values containing %2C / %3D were stored verbatim instead of as "," / "=".
- tokio_runtime.rs: compute the busy_ratio denominator from RuntimeMetrics::elapsed rather than the configured sampler interval, so the ratio stays accurate when the sampler tick runs late under load.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
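The IPv6 ambiguity in the first fix is easy to demonstrate with the standard library: a string-formatted "::1:9091" is not a valid socket-address literal (IPv6 needs brackets), while constructing from an already-parsed `IpAddr` sidesteps parsing entirely.

```rust
use std::net::{IpAddr, SocketAddr};
use std::str::FromStr;

fn main() {
    let bind: IpAddr = IpAddr::from_str("::1").unwrap();

    // String formatting yields "::1:9091", which fails to parse:
    // an IPv6 socket address literal must be written "[::1]:9091".
    assert!(format!("{}:{}", bind, 9091).parse::<SocketAddr>().is_err());

    // SocketAddr::new takes the parsed IpAddr directly; no re-parsing,
    // and Display adds the brackets for you.
    let addr = SocketAddr::new(bind, 9091);
    assert_eq!(addr.to_string(), "[::1]:9091");
}
```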
Bundle of low-risk cleanups from CodeRabbit review:
- README.md, design spec, plan: add bash/text language tags to fenced
code blocks (markdownlint MD040).
- Plan doc: replace local absolute path with <repo-root>/ placeholder.
- meters.rs: add `.with_unit("{request}")` to active_requests for OTel
semconv parity with the duration metrics.
- middleware.rs: rename `matched_route_returns_pattern` to
`matched_route_setup_smoke` — the test only asserts status, the route
extraction itself is covered in observability_integration.rs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously `try_join!(main_server, metrics_handle)` aborted the main HTTP server if the metrics server task ever returned an error (port reclaimed, listener closed, transient bind issue). That contradicts the "metrics are best-effort" stance applied elsewhere (OTLP failures log a warning; bind failure was already demoted to warn-and-continue in the previous commit).

Spawn the metrics server as a detached tokio task that logs on error, and let the main API server's lifetime drive process exit.

Drop the `futures-util` crate dependency from superposition since `try_join!` was its only use site. Add an explicit `tokio` dep (previously transitive via futures-util/actix-web; now direct).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…elper

`parse_bool` and the open-coded `IpAddr::from_str`/`u16::parse`/`humantime::parse_duration` sites each named the env key twice — once for the lookup and once for the error label. Six fields × two mentions = twelve places where the key string had to stay in sync.

Replace with one generic `parse_or_default<T: FromStr>` helper that takes the key once. `bool`/`IpAddr`/`u16`/`humantime::Duration` all implement `FromStr`, so they go through the same path; the helper folds the lookup, the default, and the error label into one place.

The error message format changes uniformly to "<KEY>: <parse-error>" (the only existing assertion just checks for the key name as a substring, so it still holds).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
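A minimal sketch of such a helper, assuming the signature described above (the real implementation may differ in error type and closure plumbing):

```rust
use std::fmt::Display;
use std::str::FromStr;

// The env key is named exactly once; any FromStr type (bool, u16, IpAddr,
// humantime::Duration, ...) flows through the same lookup/default/error path.
fn parse_or_default<T: FromStr>(
    get: impl Fn(&str) -> Option<String>,
    key: &str,
    default: T,
) -> Result<T, String>
where
    T::Err: Display,
{
    match get(key) {
        None => Ok(default),
        Some(raw) => raw.parse::<T>().map_err(|e| format!("{key}: {e}")),
    }
}

fn main() {
    // Unset key falls back to the default.
    assert_eq!(
        parse_or_default(|_| None, "SUPERPOSITION_METRICS_PORT", 9091u16).unwrap(),
        9091
    );
    // Malformed value surfaces the key in the error label.
    let err = parse_or_default(|_| Some("ten".to_string()), "SUPERPOSITION_METRICS_PORT", 9091u16)
        .unwrap_err();
    assert!(err.contains("SUPERPOSITION_METRICS_PORT"));
}
```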
Summary
Adds an Actix middleware and supporting subsystem that emits Google SRE golden signals (latency, traffic, errors, saturation) for every HTTP route on the main API, exposed via Prometheus scrape on a dedicated port (default `:9091`) and optional OTLP push. Built on the OpenTelemetry Rust SDK, so users can scrape into Prometheus / VictoriaMetrics / Mimir or push to any OTel-native backend without code changes.

- `docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md`
- `docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md`

**What ships**
**HTTP golden signals** (`MetricsMiddleware`, automatically wraps every route)

- `http.server.request.duration` — histogram (s), 9 explicit buckets `[0.005..10]`, labels: `http.request.method`, `http.route`, `http.response.status_code`, `sp.org_id`, `sp.workspace_id`. Traffic and error rate are derived from `_count` in PromQL.
- `http.server.busy.duration` — counter (s). `rate(...)` gives time-averaged request concurrency (insensitive to scrape aliasing, unlike `active_requests`).
- `http.server.active_requests` — UpDownCounter. Semconv standard, included for completeness.

Both tenant labels are env-disable-able for very-high-cardinality deployments.
**Saturation signals**

- `db.client.connections.{usage,max}` — passive r2d2 callbacks, no instrumentation at `pool.get()` call sites.
- `redis.client.{connections.connected,commands.in_flight}` — via a tolerant `RedisStats` trait over fred's `MetricsInterface`.
- `runtime.tokio.{workers,workers.busy_ratio,global_queue.depth}` — `cfg(tokio_unstable)`-gated. A background sampler updates atomics; observable gauges read them at scrape time.

**Endpoints**
- `GET /metrics` on `SUPERPOSITION_METRICS_PORT` (default `9091`) — Prometheus text exposition.
- `GET /healthz`, `/livez`, `/readyz` on the main port `:8080`. Auth bypass via `tenant_middleware_exclusion_list`.

**Resilience**

`__not_found__` and `__static__` route sentinels bound `http.route` cardinality.

**Configuration**
Standard OTel env vars are honored (`OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_EXPORTER_OTLP_HEADERS`, `OTEL_RESOURCE_ATTRIBUTES`). Plus:

| Variable | Default | Notes |
|---|---|---|
| `SUPERPOSITION_METRICS_ENABLED` | `true` | |
| `SUPERPOSITION_METRICS_PORT` | `9091` | |
| `SUPERPOSITION_METRICS_BIND` | `0.0.0.0` | |
| `SUPERPOSITION_METRICS_LABEL_ORG` | `true` | Toggles the `sp.org_id` label. |
| `SUPERPOSITION_METRICS_LABEL_WORKSPACE` | `true` | Toggles the `sp.workspace_id` label. |
| `SUPERPOSITION_METRICS_COLLECT_INTERVAL` | `10s` | |

**Build notes**
`.cargo/config.toml` enables `--cfg tokio_unstable` workspace-wide so `tokio-metrics` can collect runtime saturation. This flag only adds APIs; no behavioural change for existing code. Building outside cargo (custom IDE invocations) needs the same flag — otherwise the `runtime.tokio.*` metrics simply won't appear.
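As a sketch, a local run wiring the configuration above together might look like this (endpoint value is a placeholder, not part of the PR):

```bash
# Values are illustrative; defaults are shown in the table above.
export SUPERPOSITION_METRICS_ENABLED=true
export SUPERPOSITION_METRICS_PORT=9091
export SUPERPOSITION_METRICS_BIND=0.0.0.0
export SUPERPOSITION_METRICS_COLLECT_INTERVAL=10s
export OTEL_SERVICE_NAME=superposition
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318  # optional OTLP push

# tokio_unstable is already set workspace-wide via .cargo/config.toml;
# only custom build invocations outside cargo need it explicitly.
RUSTFLAGS="--cfg tokio_unstable" cargo run -p superposition

# Then scrape the exposition endpoint:
curl -s localhost:9091/metrics | head
```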
Test Plan

- Integration tests assert the `__not_found__` sentinel, `busy_duration > 0`, and that `active_requests` returns to 0.
- Builds verified with `RUSTFLAGS="--cfg tokio_unstable"` (on) and `RUSTFLAGS=""` (off → no-op stub).
Phased per spec §13:
Future work (explicitly out of scope, tracked in spec §14)
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
- Prometheus metrics exposition on a dedicated port (default `9091`) via `GET /metrics`.
- Health endpoints: `/healthz`, `/livez`, `/readyz`.

Chores