feat(observability): OpenTelemetry golden-signals middleware #998
knutties wants to merge 40 commits into
Conversation
Walkthrough

Adds complete OpenTelemetry observability to the Superposition HTTP API, including Prometheus metrics exposition, health endpoints, and saturation monitoring for the database/Redis/Tokio runtime, all configurable via environment variables with optional OTLP push integration.
Force-pushed 4951507 to 726f0c8.
Pull request overview
This PR adds a new observability subsystem to the Superposition Actix API that exports Google SRE “golden signal” metrics (latency/traffic/errors/saturation) via OpenTelemetry, exposing Prometheus scrape output on a dedicated port and optionally pushing metrics via OTLP. It wires the subsystem into the superposition binary (middleware + /metrics server + health endpoints) and documents the design/rollout.
Changes:
- Introduces `service_utils::observability` (config, OTel MeterProvider init, Actix metrics middleware, saturation collectors, metrics server, health endpoints) plus integration tests and a cardinality budget test.
- Wires observability into the main API binary: `/healthz`, `/livez`, `/readyz`, auth bypass list updates, a metrics server spawned alongside the main server, and middleware added to the app.
- Adds workspace-wide build support for `tokio_unstable` and introduces new OTel/Prometheus dependencies with lockfile updates and supporting docs.
Reviewed changes
Copilot reviewed 21 out of 23 changed files in this pull request and generated 5 comments.
Summary per file
| File | Description |
|---|---|
| README.md | Documents /metrics endpoint and health probes; notes tokio_unstable build flag. |
| docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md | Design spec for the golden-signals OTel/Prometheus approach. |
| docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md | Step-by-step implementation plan and test strategy. |
| crates/superposition/src/main.rs | Initializes observability, registers saturation observers, spawns metrics server, adds middleware + health scope, and try_join!s servers. |
| crates/superposition/src/app_state.rs | Extends auth exclusion list to always include /healthz, /livez, and /readyz. |
| crates/superposition/Cargo.toml | Adds direct deps needed for new wiring (futures-util, opentelemetry). |
| crates/service_utils/src/lib.rs | Exposes new observability module and adds placeholder crate uses to satisfy unused_crate_dependencies. |
| crates/service_utils/Cargo.toml | Enables new observability dependencies (opentelemetry*, prometheus, tokio-metrics, humantime, etc.). |
| crates/service_utils/src/observability.rs | Observability public API + OTel MeterProvider init (Prometheus exporter + optional OTLP). |
| crates/service_utils/src/observability/config.rs | Env-driven configuration parsing + tests. |
| crates/service_utils/src/observability/meters.rs | Defines HTTP metric instruments (histogram/counter/updowncounter). |
| crates/service_utils/src/observability/middleware.rs | Actix middleware that records per-request HTTP metrics with bounded route cardinality. |
| crates/service_utils/src/observability/metrics_server.rs | Dedicated Actix server exposing /metrics from the Prometheus registry. |
| crates/service_utils/src/observability/health.rs | Adds /healthz, /livez, /readyz handlers and exposes path list for auth bypass. |
| crates/service_utils/src/observability/saturation.rs | Wires saturation collectors (DB, Redis, Tokio runtime). |
| crates/service_utils/src/observability/saturation/db_pool.rs | r2d2 pool saturation observable gauges. |
| crates/service_utils/src/observability/saturation/redis_pool.rs | fred-based Redis “commands in flight” gauge and tolerant stats abstraction. |
| crates/service_utils/src/observability/saturation/tokio_runtime.rs | cfg(tokio_unstable) Tokio runtime saturation gauges via tokio-metrics. |
| crates/service_utils/tests/observability_integration.rs | End-to-end scrape assertions + cardinality regression test. |
| Cargo.toml | Adds workspace deps for OTel/Prometheus; adds unexpected_cfgs allowance for tokio_unstable. |
| Cargo.lock | Locks new dependency graph (notably OTel + Prometheus + tokio-metrics). |
| .cargo/config.toml | Enables --cfg tokio_unstable workspace-wide. |
| .gitignore | Ensures .cargo/config.toml remains tracked despite .cargo ignore rule. |
```rust
    .with_resource(Resource::new(resource_attrs));

if let Some(endpoint) = &cfg.otlp_endpoint {
    builder = with_otlp_reader(builder, endpoint, cfg.collect_interval)?;
```

and inside `with_otlp_reader`:

```rust
// Protocol: this binary is compiled with the `http-proto` feature only.
// `OTEL_EXPORTER_OTLP_PROTOCOL=grpc` is NOT supported in v1; use the
// default `http/protobuf` transport.
//
// Headers: the opentelemetry-otlp 0.27 HTTP exporter reads
// `OTEL_EXPORTER_OTLP_HEADERS` (and `OTEL_EXPORTER_OTLP_METRICS_HEADERS`)
// automatically during `build()` — no explicit wiring needed here.
use opentelemetry_otlp::{MetricExporter, WithExportConfig};
use opentelemetry_sdk::metrics::PeriodicReader;
use opentelemetry_sdk::runtime;

let exporter = MetricExporter::builder()
    .with_http()
    .with_endpoint(endpoint.to_owned())
    .build()
```
```rust
fn with_env<F: FnOnce()>(vars: &[(&str, Option<&str>)], f: F) {
    use std::sync::Mutex;
    static LOCK: Mutex<()> = Mutex::new(());
    let _guard = LOCK.lock().unwrap();
    let prev: Vec<_> =
        vars.iter().map(|(k, _)| (k.to_string(), std::env::var(k).ok())).collect();
    for (k, v) in vars {
        match v {
            // SAFETY: single-threaded test execution guaranteed by the
            // Mutex above and --test-threads=1 at the call site.
            Some(v) => unsafe { std::env::set_var(k, v) },
            None => unsafe { std::env::remove_var(k) },
        }
    }
    f();
    for (k, v) in prev {
        match v {
            Some(v) => unsafe { std::env::set_var(&k, &v) },
            None => unsafe { std::env::remove_var(&k) },
        }
    }
}
```
```rust
// Re-exported so T4+ modules can use them without triggering
// the `unused_crate_dependencies` lint while no source yet imports them.
use humantime as _;
use opentelemetry as _;
use opentelemetry_otlp as _;
use opentelemetry_prometheus as _;
use opentelemetry_sdk as _;
use opentelemetry_semantic_conventions as _;
```
```rust
// Rather than killing the binary we log a warning and serve traffic without metrics.
let observability = if obs_cfg.enabled {
    match Observability::init(obs_cfg.clone()) {
        Ok(o) => Some(o),
        Err(e) => {
            tracing::warn!(
                error = %e,
                "observability init failed; metrics disabled for this instance"
            );
            None
```
Datron left a comment
Overall review
- Some crates are imported but never actually used, and can be removed:

```toml
opentelemetry-otlp = { version = "0.27", default-features = false, features = ["metrics", "http-proto", "reqwest-client"] }
opentelemetry-semantic-conventions = { version = "0.27" }
```

- We could use https://crates.io/crates/opentelemetry-instrumentation-actix-web, which comes from open-telemetry, so that we can directly monitor requests, latency, and errors.
- Simplify the observability module; there are too many files with only ~30 lines. Could we group everything into fewer files and use internal modules if we want namespacing?
- Is it possible to add some CPU and memory monitoring on the box with the sysinfo crate, or will we leave that to a deployment runtime like ECS?
```rust
    let max_pool_name = KeyValue::new("pool.name", pool_name);
    meter
        .u64_observable_gauge("db.client.connections.max")
        .with_description("Configured maximum size of the DB connection pool.")
        .with_callback(move |observer| {
            observer.observe(pool_for_max.max_size() as u64, &[max_pool_name.clone()]);
        })
        .build();
}
```
Should we just read the env DB_MAX_CONNECTION_POOL_SIZE and publish that? Getting the size from the pool may not be accurate for larger pools.
```rust
#[cfg(test)]
mod tests {
    // Constructing a real r2d2 pool requires a database. We assert the function
    // signature compiles and that calling `register` does not panic with a
    // synthetic in-memory pool; this is exercised via the integration test in
    // Task 19 instead.
}
```
Should we just remove this?
```rust
fn used_connections(&self) -> Option<u64> {
    // TODO(observability): fred 9.2.1 does not expose active/used
    // connection counts. The internal `Counters.in_flight` field tracks
    // per-connection in-flight commands, but it is private to
    // `fred::protocol::connection`. Return None until exposed publicly.
    None
}
```
You can get the number of used clients via:

```rust
let connected_count = pool_clone.clients().iter().filter(|c| c.is_connected()).count();
```
```rust
fn idle_connections(&self) -> Option<u64> {
    // TODO(observability): fred 9.2.1 does not expose idle-connection
    // counts via any public API. The pool's `RedisPoolInner.clients` are
    // all async and there is no `is_idle()` method on `RedisClient`.
    // Return None until fred provides such an API.
    None
}
```
```rust
//! in atomics that observable-gauge callbacks read.

#[cfg(not(tokio_unstable))]
#[allow(dead_code)] // called from saturation.rs only under cfg(tokio_unstable)
pub fn spawn(_meter: &opentelemetry::metrics::Meter, _interval: std::time::Duration) {}
```

Do we need this if the cfg is at the top?
```rust
#[cfg(tokio_unstable)]
mod inner {
    use std::sync::Arc;
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::Duration;

    use opentelemetry::metrics::Meter;
    use tokio_metrics::RuntimeMonitor;

    #[derive(Default)]
    struct Snapshot {
        workers: AtomicU64,
        global_queue_depth: AtomicU64,
```
This can be simplified a bit; just a function that reads metrics:

```rust
// 1. Global Queue Depth: tasks waiting to be picked up by an Actix worker
let h_clone = handle.clone();
meter.u64_observable_gauge("tokio.runtime.global_queue_depth")
    .with_description("Tasks stuck in the global queue waiting for a worker thread")
    .with_callback(move |observer| {
        let metrics = h_clone.metrics();
        observer.observe(metrics.global_queue_depth() as u64, &[]);
    })
    .init();

// 2. Active Tasks: total tasks currently alive in the runtime
let h_clone2 = handle.clone();
meter.u64_observable_gauge("tokio.runtime.active_tasks")
    .with_description("Total number of alive tasks (running or idle)")
    .with_callback(move |observer| {
        let metrics = h_clone2.metrics();
        observer.observe(metrics.active_tasks_count() as u64, &[]);
    })
    .init();

// 3. Worker Count: how many OS threads Tokio is using for workers
let h_clone3 = handle.clone();
meter.u64_observable_gauge("tokio.runtime.workers_count")
    .with_description("Number of worker threads")
    .with_callback(move |observer| {
        let metrics = h_clone3.metrics();
        observer.observe(metrics.num_workers() as u64, &[]);
    })
    .init();
```
```rust
}

#[cfg(test)]
mod tests {
```
Don't know if we need tests
```rust
use std::{net::IpAddr, str::FromStr, time::Duration};

#[derive(Debug, Clone)]
pub struct ObservabilityConfig {
```
Should this be moved to app state?
Design for adding an Actix middleware that emits Google SRE golden signals (latency, traffic, errors, saturation) via OpenTelemetry, with Prometheus scrape on a separate metrics port and optional OTLP push. Reference TSDB is VictoriaMetrics; design is TSDB-agnostic. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
22-task implementation plan derived from the 2026-05-10 spec, broken down into TDD-style steps with full code and exact commands. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds opentelemetry, opentelemetry_sdk, opentelemetry-prometheus, opentelemetry-otlp, opentelemetry-semantic-conventions, prometheus, tokio-metrics, and humantime as workspace dependencies. Enabled in service_utils in a follow-up commit.
Required by tokio-metrics for runtime instrumentation introduced in the upcoming observability subsystem. tokio_unstable only adds APIs; no behavioural change for existing code. Also un-ignores .cargo/config.toml in .gitignore so the workspace build config can be tracked; the rest of .cargo/ remains ignored.
Wires the eight new workspace deps (opentelemetry, opentelemetry_sdk, opentelemetry-otlp, opentelemetry-prometheus, opentelemetry-semantic-conventions, prometheus, tokio-metrics, humantime) into crates/service_utils/Cargo.toml. The crate has #![deny(unused_crate_dependencies)], so temporary `use X as _;` stubs are added to lib.rs to satisfy the lint until T4+ modules consume the crates directly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the empty module structure that subsequent commits flesh out:
- observability.rs: public surface, Observability handle, errors
- observability/{config,meters,middleware,metrics_server,health,saturation}.rs: stubs
No behaviour change. Observability::init() body is unimplemented!()
until Task 7.
Reads SUPERPOSITION_METRICS_* and OTEL_* env vars into a typed ObservabilityConfig. Defaults: enabled, port 9091, bind 0.0.0.0, both org/workspace labels on, 10s collect interval. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per OTel semconv: collapse unknown methods to _OTHER to bound the cardinality of the http.request.method label.
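The collapsing rule above can be sketched as follows. This is a hypothetical helper, not the actual source: the well-known method set follows RFC 9110, and the real middleware may structure this differently.

```rust
// Collapse any method outside the well-known RFC 9110 set to "_OTHER",
// so arbitrary client-supplied verbs cannot mint unbounded label values.
fn normalize_method(method: &str) -> &str {
    match method {
        "GET" | "HEAD" | "POST" | "PUT" | "DELETE" | "CONNECT" | "OPTIONS"
        | "TRACE" | "PATCH" => method,
        _ => "_OTHER",
    }
}

fn main() {
    assert_eq!(normalize_method("GET"), "GET");
    // An attacker-chosen verb collapses into a single series.
    assert_eq!(normalize_method("SPAM"), "_OTHER");
    println!("ok");
}
```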
Builds an SdkMeterProvider wired to an opentelemetry-prometheus exporter that writes into a per-process prometheus::Registry. OTLP push exporter is plumbed but only activates when OTEL_EXPORTER_OTLP_ENDPOINT is set. Also adds Shutdown(String) error variant and updates shutdown() to use it; defines HttpMeters struct with request_duration histogram, busy_duration counter, and active_requests up-down-counter. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drop-based decrement ensures the gauge stays balanced even when a handler panics or the future is cancelled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps every request with timing + active_requests gauge + busy_duration counter. Uses match_pattern() to template routes, OrgWorkspaceMiddleware extensions for tenant labels, and an InFlightGuard for panic safety. Hoists all middleware.rs use statements to the file top (cleanup carried forward from T8-T10 reviews).
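The Drop-based decrement can be illustrated with a minimal sketch. The `InFlightGuard` name comes from the commit messages above, but the body here is an assumption: a plain shared atomic stands in for the OTel up-down-counter.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicI64, Ordering};

// Increments the in-flight gauge on construction and decrements it on Drop.
struct InFlightGuard {
    gauge: Arc<AtomicI64>,
}

impl InFlightGuard {
    fn new(gauge: Arc<AtomicI64>) -> Self {
        gauge.fetch_add(1, Ordering::Relaxed);
        InFlightGuard { gauge }
    }
}

impl Drop for InFlightGuard {
    // Drop runs on normal completion, panic unwind, and future cancellation
    // alike, so the gauge always returns to its prior value.
    fn drop(&mut self) {
        self.gauge.fetch_sub(1, Ordering::Relaxed);
    }
}

fn main() {
    let gauge = Arc::new(AtomicI64::new(0));
    {
        let _guard = InFlightGuard::new(gauge.clone());
        assert_eq!(gauge.load(Ordering::Relaxed), 1);
    } // guard dropped here, even if the scope had panicked
    assert_eq!(gauge.load(Ordering::Relaxed), 0);
    println!("balanced");
}
```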
T14: ObservableGauge callbacks read r2d2::Pool::state() at scrape time.
Emits db.client.connections.usage{state} and db.client.connections.max.
T15: Redis pool gauges via fred metrics with a tolerant RedisStats
trait. FredPoolStats wraps RedisPool and exposes commands_in_flight via
MetricsInterface (summed across pool clients). idle/used connection
counts are not exposed by fred 9.2.1's non-blocking pool and return
None; the corresponding observations are skipped rather than emitted as
zero, so no dead series.
Also adds workspace check-cfg lint config for cfg(tokio_unstable) (new
warning surfaced once cfg-gated code appeared in saturation.rs) and
exports RedisHandle/RedisStats/FredPoolStats from saturation and
observability modules so T18 can populate SaturationDeps.
Note: T14 and T15 are bundled here due to a reflog mishap during review
that lost T15's separate commit; the contents of both tasks are
verified present and reviewed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
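The "skip rather than emit" behavior described above can be sketched like this. `PoolStats`, `FakeStats`, and `observe` are hypothetical stand-ins for the tolerant RedisStats trait and the observable-gauge callbacks; the real code records into OTel instruments rather than a Vec.

```rust
// Stats the backend cannot report return None, and the callback records
// nothing for them, so no dead zero-valued series appear in the scrape.
trait PoolStats {
    fn in_flight(&self) -> Option<u64>;
    fn idle(&self) -> Option<u64>;
}

struct FakeStats;
impl PoolStats for FakeStats {
    fn in_flight(&self) -> Option<u64> { Some(3) }
    fn idle(&self) -> Option<u64> { None } // not exposed by this backend
}

// Stand-in for an observable-gauge callback: only Some values are observed.
fn observe(stats: &dyn PoolStats, out: &mut Vec<(&'static str, u64)>) {
    if let Some(v) = stats.in_flight() {
        out.push(("redis.commands.in_flight", v));
    }
    if let Some(v) = stats.idle() {
        out.push(("redis.connections.idle", v));
    }
}

fn main() {
    let mut series = Vec::new();
    observe(&FakeStats, &mut series);
    // Only one series: the idle count was None and is simply skipped.
    assert_eq!(series, vec![("redis.commands.in_flight", 3)]);
    println!("{} series emitted", series.len());
}
```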
Background sampler updates atomic snapshots; observable gauges read from the snapshot. Gated on cfg(tokio_unstable); compiles to a no-op when the flag is disabled.
Adds the observability health paths to tenant_middleware_exclusion_list so probes do not trigger auth flow. Operators no longer need to remember to put these in TENANT_MIDDLEWARE_EXCLUSION_LIST.
- Init Observability early (Prometheus exporter + optional OTLP push)
- Spawn metrics server on SUPERPOSITION_METRICS_PORT
- Register DB/Redis/Tokio saturation observers
- Wrap App with MetricsMiddleware (gated by SUPERPOSITION_METRICS_ENABLED)
- Mount /healthz /livez /readyz on the main app (auth bypass via T17)
- try_join! both servers so the process exits if either dies

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wraps a small App with MetricsMiddleware, issues requests of various shapes (200, 201, 500, 404), and asserts on the parsed Prometheus exposition: per-route counts, the 5xx series, the __not_found__ sentinel, busy_duration > 0, and active_requests returning to 0.
Asserts that a 3-route × 1-method × 1-status scenario produces no more than 200 series, catching accidental high-cardinality labels in review.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
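A minimal version of such a budget check might look like this; `series_count` is a hypothetical helper that treats each non-comment, non-blank line of a Prometheus text exposition as one sample series.

```rust
// Count sample lines (one series each) in a Prometheus text exposition,
// ignoring `# HELP` / `# TYPE` comment lines and blanks.
fn series_count(exposition: &str) -> usize {
    exposition
        .lines()
        .filter(|l| !l.trim().is_empty() && !l.starts_with('#'))
        .count()
}

fn main() {
    let scrape = "\
# HELP http_server_request_duration_seconds ...
# TYPE http_server_request_duration_seconds histogram
http_server_request_duration_seconds_bucket{route=\"/a\",le=\"0.1\"} 4
http_server_request_duration_seconds_sum{route=\"/a\"} 0.2
http_server_request_duration_seconds_count{route=\"/a\"} 4
";
    let n = series_count(scrape);
    // The regression test asserts a hard cap so accidental high-cardinality
    // labels fail review instead of shipping.
    assert!(n <= 200, "cardinality budget exceeded: {n} series");
    println!("{n} series");
}
```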
`Observability::init` errors (e.g. OTLP endpoint unreachable) no longer panic the binary at startup. Instead they emit a `tracing::warn!` and leave observability as `None`, so the service continues to handle traffic without metrics. `obs_enabled` is now derived from `observability.is_some()` after the init attempt, ensuring the `Condition` gate mirrors the actual init outcome. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… production wiring (review fixes 2 + 6)
Adds two cardinality-control improvements to http.route tracking:
- New `ROUTE_STATIC` sentinel ("__static__") collapses static-asset patterns
(/pkg/*, /assets/*, /favicon*) so they don't inflate the cardinality budget.
- `is_static_pattern(pattern)` is the shared helper used by both
`extract_route` (ServiceRequest phase) and the new
`extract_route_from_response` (ServiceResponse phase).
- Production middleware now calls `extract_route_from_response` instead of
inlining `match_pattern().unwrap_or(ROUTE_NOT_FOUND)`, so static-prefix
collapsing is active for all real traffic.
- Unit test `extract_route_helper_handles_static_paths` verifies the helper.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
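A sketch of the static-collapse helper, assuming the prefixes listed in the commit message above; the actual implementation may match patterns differently.

```rust
const ROUTE_STATIC: &str = "__static__";

// Hypothetical version of the shared helper: collapse static-asset route
// patterns into one sentinel label so they do not inflate cardinality.
fn is_static_pattern(pattern: &str) -> bool {
    pattern.starts_with("/pkg/")
        || pattern.starts_with("/assets/")
        || pattern.starts_with("/favicon")
}

fn label_for(pattern: &str) -> &str {
    if is_static_pattern(pattern) { ROUTE_STATIC } else { pattern }
}

fn main() {
    assert_eq!(label_for("/pkg/app.wasm"), "__static__");
    assert_eq!(label_for("/favicon.ico"), "__static__");
    // Templated API routes keep their own series.
    assert_eq!(label_for("/contexts/{id}"), "/contexts/{id}");
    println!("ok");
}
```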
… var handling (review fixes 4 + 5)
Fix 5 — OTEL_RESOURCE_ATTRIBUTES merge (spec §8.5):
Parse the standard OTel env var ("k1=v1,k2=v2,...") in `Observability::init`
and append each key=value pair to resource_attrs, enabling operators to inject
arbitrary resource dimensions without code changes.
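The merge step might look roughly like this. This is a simplified sketch with a hypothetical helper name; note it does not percent-decode keys or values, and malformed pairs are silently skipped.

```rust
// Parse the OTEL_RESOURCE_ATTRIBUTES format "k1=v1,k2=v2,..." into pairs
// that can be appended to the resource attribute list.
fn parse_resource_attrs(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| {
            let (k, v) = pair.split_once('=')?;
            let (k, v) = (k.trim(), v.trim());
            if k.is_empty() || v.is_empty() {
                None // skip malformed pairs instead of failing init
            } else {
                Some((k.to_string(), v.to_string()))
            }
        })
        .collect()
}

fn main() {
    let attrs = parse_resource_attrs("env=prod,region=us-east-1,broken");
    assert_eq!(
        attrs,
        vec![
            ("env".to_string(), "prod".to_string()),
            ("region".to_string(), "us-east-1".to_string()),
        ]
    );
    println!("{} attrs", attrs.len());
}
```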
Fix 4 — OTLP env var documentation:
The opentelemetry-otlp 0.27 HTTP exporter already reads
OTEL_EXPORTER_OTLP_HEADERS / OTEL_EXPORTER_OTLP_METRICS_HEADERS automatically
during build(), so no explicit wiring is needed. gRPC protocol is not supported
in v1 (requires grpc-tonic feature not in this workspace); document this clearly
in `with_otlp_reader`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spec doc reference removed for now; the env-var table and inline notes in this section are sufficient as the operator-facing surface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Force-pushed 7bf6bad to 1bca586.
Actionable comments posted: 6
🧹 Nitpick comments (8)
crates/service_utils/src/observability/meters.rs (1)
34-37: 💤 Low value. Add a unit to http.server.active_requests for semconv compliance and consistency.

The other two metrics in HttpMeters (request_duration and busy_duration) both specify units via .with_unit(). The OpenTelemetry semantic conventions specify {request} as the unit for http.server.active_requests, so adding it keeps this metric consistent with the file's pattern and fully semconv-compliant.

♻️ Suggested change

```diff
 let active_requests = meter
     .i64_up_down_counter("http.server.active_requests")
+    .with_unit("{request}")
     .with_description("Number of HTTP server requests currently in flight.")
     .build();
```

crates/service_utils/src/observability/middleware.rs (3)
215-241: 💤 Low value. Minor: unnecessary borrow of &res when res is already a reference.

In the Ok(res) arm, res: &ServiceResponse<B> because the outer match is on &result. extract_route_from_response(&res) passes &&ServiceResponse<B>, which compiles via deref coercion but is unidiomatic.

Small style nit:

```diff
- let route = extract_route_from_response(&res);
+ let route = extract_route_from_response(res);
```
312-327: 💤 Low value. Test name promises more than it asserts.

matched_route_returns_pattern only asserts a 200 status; it does not exercise extract_route / extract_route_from_response. The inline comment acknowledges this, but the test name will mislead future readers. Consider renaming to matched_route_setup_smoke, or wiring an actual middleware that records and reads back the captured pattern so the integration test in observability_integration.rs isn't the only behavioral coverage for the response-phase extraction.
242-262: ⚖️ Poor tradeoff. Error path conflates handler errors with real 404s under __not_found__.

When service.call(req) returns Err, the request has been moved and match_pattern() is unreachable, so the route attribute is hard-coded to ROUTE_NOT_FOUND. The acknowledged tradeoff is documented inline, but the metric consequence is that a 500 from a real route like /contexts/{id} ends up in the same time-series as a true 404, which makes per-route error-rate alerting unreliable for the error-as-Err path (handler errors converted by Actix into responses, vs. handlers that return Ok(response_with_500)).

Worth at least noting in the rollout dashboard docs, or, if practical, capturing the matched pattern in request extensions during routing (e.g., from an inner middleware closer to the resource) and reading it back here. Not a blocker.

crates/superposition/src/main.rs (1)
306-313: 💤 Low value. try_join! couples the main API lifetime to the metrics server lifetime.

If the metrics server task ever returns an error (e.g., the underlying listener gets closed, or the bound port is reclaimed), try_join! short-circuits and the main HTTP server is aborted mid-flight. The pattern is acceptable if you treat the metrics endpoint as a hard SLO, but it's worth being explicit: the PR objectives describe "bad OTLP config degrades to disabled with warnings", and this coupling is the opposite stance for the scrape endpoint.

Consider tokio::select! on the main server with a tokio::spawn-detached metrics task, or accept this and document it in the rollout notes.

crates/service_utils/tests/observability_integration.rs (1)
85-91: 💤 Low value. Value parsing assumes no per-sample timestamps.

rsplit_once(' ').unwrap().1 takes the last whitespace-separated token. The Prometheus exposition format permits an optional trailing timestamp (metric{…} value [timestamp]); if the exporter ever emits timestamps, this parses the timestamp as the value and the assertions silently mis-fire. The current opentelemetry-prometheus 0.x exporter does not emit timestamps, so the test passes today; flagging as a brittle assumption to revisit if the exporter is upgraded.

A more robust parse splits on whitespace and asserts exactly two tokens, or extracts the second token of a known-shape split:

```diff
-let busy_value: f64 = busy.rsplit_once(' ').unwrap().1.trim().parse().unwrap();
+let mut parts = busy.split_whitespace();
+let _series = parts.next().unwrap();
+let busy_value: f64 = parts.next().unwrap().parse().unwrap();
```

Also applies to: 118-118, 127-127

crates/service_utils/src/observability/health.rs (1)
33-36: ⚡ Quick win. readyz currently signals ready even when dependencies are down.

The placeholder is acknowledged with a comment, so this is informational. In Kubernetes, readyz returning 200 while DB/Redis are unhealthy means traffic continues to be routed to a pod that cannot serve requests, defeating the purpose of a readiness probe. Track this as a follow-up to wire DB pool ping / Redis ping into readyz before relying on it for traffic gating.

Want me to open a follow-up issue tracking the readyz dependency checks (DB pool get w/ short timeout, Redis ping)?

crates/service_utils/src/observability.rs (1)
147-155: 💤 Low value. #[cfg(test)] stub silently drops OTLP wiring.

Test builds ignore _endpoint entirely, so any unit test that sets otlp_endpoint = Some(...) exercises the prometheus-only path. That's intentional per the comment, but it means OTLP-specific behavior (endpoint resolution, header merging, period validation) has zero unit coverage. Either acknowledge in the comment that OTLP coverage relies on integration/manual tests, or add a token assertion (e.g., that the function is reachable) so future readers don't expect symmetric behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/service_utils/src/observability.rs`:
- Around line 85-96: The OTEL_RESOURCE_ATTRIBUTES parsing loop does not
percent-decode keys/values (violating the OTel/W3C baggage format); update the
loop that iterates extra.split(',') and uses pair.split_once('=') so that both k
and v are percent-decoded (use the existing urlencoding crate) before trimming
and pushing into resource_attrs via KeyValue::new; ensure decoding errors are
handled (skip or log the offending pair) and decode both the key and value
consistently.
In `@crates/service_utils/src/observability/saturation/tokio_runtime.rs`:
- Around line 50-58: Replace the configured interval with the actual elapsed
duration when computing the per-worker busy ratio: use m.elapsed (not interval)
as the time base while preserving the per-worker semantics by multiplying by
(m.workers_count as f64). In other words, compute total =
m.elapsed.as_secs_f64() * (m.workers_count as f64).max(1.0) and keep storing the
scaled value into snap_for_task.busy_ratio_milli via
snap_for_task.busy_ratio_milli.store((ratio * 1000.0) as u64,
Ordering::Relaxed), so you avoid under-reporting when sleep returns late but
retain the per-worker metric semantics rather than switching to the aggregate
m.busy_ratio().
In `@crates/superposition/src/main.rs`:
- Around line 192-199: The metrics bind construction currently formats a string
and parses to SocketAddr which fails for IPv6 (e.g., "::1:9091"); change the
code in the block that sets metrics_server_handle (references: observability,
obs_cfg, metrics_server_handle, spawn_metrics_server) to construct the address
with SocketAddr::new(std::net::IpAddr::from_str(&obs_cfg.bind)? or obs_cfg.bind
parsed to IpAddr, and obs_cfg.port as u16) instead of string parsing, and if
spawn_metrics_server returns an Err treat it like Observability::init failures
(log a warning and set metrics_server_handle to None) so a bind error doesn’t
panic the process. Ensure the fix parses obs_cfg.bind to IpAddr (handling
parsing errors gracefully) and uses SocketAddr::new to support IPv6 properly.
In `@docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md`:
- Line 18: Replace the machine-specific absolute path string
"/Users/natarajankannan/src/superposition/" on line containing "All paths in
this plan are absolute under the repo root" with a repo-root-relative notation
(e.g., "repo-root-relative paths" or a placeholder like "<repo-root>/...") so
the document uses portable, non-machine-specific paths; update the sentence in
the same paragraph to instruct contributors to use repo-root-relative or
environment-variable-based placeholders instead of local absolute filesystem
paths.
In `@docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md`:
- Line 69: Two fenced code blocks in the doc (the block showing "[request on
:8080] ..." and the block listing "crates/service_utils/src/...") are unlabeled
and trigger markdownlint MD040; update each opening triple-backtick to include
the language identifier text (i.e., change ``` to ```text) so both code fences
become labeled, leaving their closing backticks unchanged.
In `@README.md`:
- Around line 157-159: The markdown code fence containing the curl example is
missing a language identifier, triggering markdownlint MD040; update the fenced
block that wraps the curl invocation (the triple-backtick block around "curl
http://localhost:9091/metrics") to include the bash language identifier
(```bash) so the snippet is syntax-highlighted and the lint rule passes.
---
Nitpick comments:
In `@crates/service_utils/src/observability.rs`:
- Around line 147-155: The #[cfg(test)] stub for with_otlp_reader silently
ignores the _endpoint and thus gives no unit coverage for OTLP-specific behavior
(endpoint resolution, header merging, period validation); either update the
comment above with_otlp_reader to explicitly state that OTLP behavior is covered
only by integration/manual tests (and mention otlp_endpoint/ObservabilityError),
or make the test-only stub assert it's been reached (e.g., a token/debug
assertion referencing the _endpoint) so future readers know the function is
intentionally non-functional in unit tests.
In `@crates/service_utils/src/observability/health.rs`:
- Around line 33-36: The readyz handler (function readyz) currently always
returns OK and must be updated to probe critical dependencies before signaling
readiness: add checks for the DB connection pool (attempt a quick get or simple
ping with a short timeout) and Redis (PING) and only return HttpResponse::Ok
when those succeed; on any failure return an appropriate non-200 response (e.g.,
503) and log the underlying error so Kubernetes stops routing traffic to
unhealthy pods. Use the existing livez behavior as a template if present and
ensure checks are fast and non-blocking (async) to avoid delaying readiness
responses.
In `@crates/service_utils/src/observability/meters.rs`:
- Around line 34-37: The metric "http.server.active_requests" (built via
meter.i64_up_down_counter in HttpMeters as the active_requests variable) is
missing a unit; update the builder chain for active_requests to include
.with_unit("{request}") before .build() so it matches the pattern used by
request_duration and busy_duration and complies with the OpenTelemetry semconv
for this metric.
In `@crates/service_utils/src/observability/middleware.rs`:
- Around line 215-241: In the Ok(res) match arm you’re passing a double
reference to extract_route_from_response by calling
extract_route_from_response(&res) even though res is already a
&ServiceResponse<B>; change the call to extract_route_from_response(res) (and
any similar places) to avoid the unnecessary borrow and keep types idiomatic
(locate this in the Ok(res) branch where extract_route_from_response,
build_attributes, meters.request_duration.record, and meters.busy_duration.add
are used).
- Around line 312-327: Rename or update the test matched_route_returns_pattern
so its name matches behavior or it actually asserts route extraction: either
rename to matched_route_setup_smoke (to reflect it only asserts StatusCode::OK)
or modify the test to install the middleware that calls extract_route /
extract_route_from_response and assert the captured pattern is "/contexts/{id}"
(use the test harness App::new().wrap(your_observability_middleware) and make
the middleware expose the matched pattern for assertion); reference the test
function matched_route_returns_pattern and the extraction helpers extract_route
and extract_route_from_response when changing the test so behavioral coverage
isn't only in observability_integration.rs.
- Around line 242-262: The error-path currently hardcodes ROUTE_NOT_FOUND
because match_pattern() is unavailable after service.call(req) moves the
request; change this by having the routing layer store the matched pattern into
the request extensions (e.g., under a key like "matched_pattern") when matching
a route, and then in this middleware (where service.call(req) and the Err branch
live) try to read that extension and use the stored pattern instead of
ROUTE_NOT_FOUND when calling build_attributes; update the Err block to fallback
to ROUTE_NOT_FOUND only if the extension is missing, and keep using
service.call, match_pattern(), build_attributes, and meters.* as the integration
points.
In `@crates/service_utils/tests/observability_integration.rs`:
- Around line 85-91: The current parsing of ping_count_line uses rsplit_once('
').unwrap().1 which will pick up an optional Prometheus timestamp as the value;
change parsing to split_whitespace (or split_ascii_whitespace), collect tokens,
assert tokens.len() >= 2 and use tokens[ tokens.len() - 2 ] (the value) or
preferably assert exactly 2 tokens when you expect no timestamps, then parse
that token into ping_count; apply the same change to the other occurrences
mentioned (the lines around ping_count_line and the similar parses at the other
two locations) so tests fail loudly if format differs instead of silently
parsing a timestamp as the value.
In `@crates/superposition/src/main.rs`:
- Around line 306-313: The current use of futures_util::try_join! with
metrics_server_handle couples the lifetime of main_server to metrics errors;
change to spawn the metrics task (metrics_server_handle) with tokio::spawn so it
runs detached and use tokio::select! (or await main_server directly) to ensure
the main_server (the API HTTP server) is not aborted if the metrics task returns
an error — instead log/handle errors from the spawned metrics task (e.g., attach
a task that .awaits and logs its error) while allowing main_server.await to
continue independently.
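The Prometheus-parsing nitpick for `observability_integration.rs` can be sketched without any test harness. `parse_sample_value` below is a hypothetical helper illustrating the token-based parsing the comment asks for, not code from the PR:

```rust
// Hypothetical helper showing the robust way to pull a sample value out of a
// Prometheus text-exposition line. Lines are either "<name> <value>" or
// "<name> <value> <timestamp_ms>"; taking the last whitespace token (as
// rsplit_once(' ') does) silently returns the timestamp when one is present.
fn parse_sample_value(line: &str) -> Option<f64> {
    let tokens: Vec<&str> = line.split_whitespace().collect();
    match tokens.len() {
        2 | 3 => tokens[1].parse().ok(), // value is always the second token
        _ => None,                       // reject unexpected shapes instead of guessing
    }
}

fn main() {
    assert_eq!(parse_sample_value("superposition_ping_total 42"), Some(42.0));
    // With a timestamp, rsplit_once would have parsed 1700000000000 instead.
    assert_eq!(
        parse_sample_value("superposition_ping_total 42 1700000000000"),
        Some(42.0)
    );
    assert_eq!(parse_sample_value("malformed"), None);
    println!("ok");
}
```

Returning `None` on any other token count makes a format drift fail the test instead of silently parsing the wrong field.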
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 692299ed-71fc-42c3-9a01-9fbf6addfee3
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (22)
- .cargo/config.toml
- .gitignore
- Cargo.toml
- README.md
- crates/service_utils/Cargo.toml
- crates/service_utils/src/lib.rs
- crates/service_utils/src/observability.rs
- crates/service_utils/src/observability/config.rs
- crates/service_utils/src/observability/health.rs
- crates/service_utils/src/observability/meters.rs
- crates/service_utils/src/observability/metrics_server.rs
- crates/service_utils/src/observability/middleware.rs
- crates/service_utils/src/observability/saturation.rs
- crates/service_utils/src/observability/saturation/db_pool.rs
- crates/service_utils/src/observability/saturation/redis_pool.rs
- crates/service_utils/src/observability/saturation/tokio_runtime.rs
- crates/service_utils/tests/observability_integration.rs
- crates/superposition/Cargo.toml
- crates/superposition/src/app_state.rs
- crates/superposition/src/main.rs
- docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md
- docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md
```rust
// §8.5 — merge OTEL_RESOURCE_ATTRIBUTES ("k1=v1,k2=v2,...") if set.
if let Ok(extra) = std::env::var("OTEL_RESOURCE_ATTRIBUTES") {
    for pair in extra.split(',') {
        if let Some((k, v)) = pair.split_once('=') {
            let k = k.trim().to_owned();
            let v = v.trim().to_owned();
            if !k.is_empty() {
                resource_attrs.push(KeyValue::new(k, v));
            }
        }
    }
}
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify percent-encoding crate availability and current OTEL_RESOURCE_ATTRIBUTES handling.
rg -n 'percent-encoding' --type=toml
fd -e toml -x rg -nH 'percent-encoding' {} \;
🏁 Script executed:
cd crates/service_utils && cat Cargo.toml | grep -A 50 "^\[dependencies\]"
🌐 Web query:
OTel environment variable spec OTEL_RESOURCE_ATTRIBUTES W3C Baggage percent-encoding encoding
💡 Result:
OTEL_RESOURCE_ATTRIBUTES encoding/percent-encoding rules (grounded in the current OpenTelemetry Resource SDK spec and the W3C Baggage spec):

1) How OTEL_RESOURCE_ATTRIBUTES is formatted
   - OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs (key1=value1,key2=value2). [1]

2) What "percent-encoding" is required for in OTEL_RESOURCE_ATTRIBUTES
   - In OTEL_RESOURCE_ATTRIBUTES, the characters ',' and '=' in both keys and values MUST be percent-encoded. [1]
   - Other characters MAY be percent-encoded (e.g., values outside the ANSI character set). [1]

3) Relation to W3C Baggage and W3C baggage value percent-encoding
   - OpenTelemetry states OTEL_RESOURCE_ATTRIBUTES is represented using a format matching W3C Baggage, with the specific OTEL_RESOURCE_ATTRIBUTES rules above for separating key/value pairs via comma and '=' and requiring percent-encoding of ',' and '='. [1]
   - In W3C Baggage, the baggage entry value is a UTF-8 string, where any characters outside the baggage-octet range MUST be percent-encoded. [2]
   - W3C Baggage also specifies that the percent code point (U+0025, i.e. '%') MUST be percent-encoded. [2]
   - W3C Baggage defines percent-encoding per RFC3986 Section 2.1. [2]

4) How percent-encoding is interpreted (decoding expectations)
   - W3C Baggage: when decoding, percent-encoded octet sequences that do not match the UTF-8 encoding scheme are replaced with U+FFFD. [2]
   - OpenTelemetry resource SDK: during decoding, if an error occurs (e.g., failure during decoding), the entire environment variable value SHOULD be discarded and an error SHOULD be reported. [1]

5) Practical implications / example
   - Because ',' and '=' MUST be percent-encoded in OTEL_RESOURCE_ATTRIBUTES, values containing commas or equals signs must be written with percent escapes, so they don't get misinterpreted as list separators or key/value separators.

Sources (most directly relevant):
- OpenTelemetry Resource SDK spec: specification/resource/sdk.md (OTEL_RESOURCE_ATTRIBUTES format + required percent-encoding + error handling). [1]
- W3C Baggage spec (percent-encoding rules for baggage values, including baggage-octet range, UTF-8, and '%' encoding). [2]
Citations:
- 1: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/resource/sdk.md
- 2: https://www.w3.org/TR/baggage/
🏁 Script executed:
# Search for tests of OTEL_RESOURCE_ATTRIBUTES parsing
rg -n 'OTEL_RESOURCE_ATTRIBUTES' --type rust
# Check urlencoding crate capabilities
rg -n 'urlencoding' crates/service_utils/
🌐 Web query:
urlencoding rust crate decode percent-decoding function API
💡 Result:
In the Rust crate urlencoding, percent-decoding is done via these public APIs:

1) urlencoding::decode
   - Signature/behavior: decodes a percent-encoded string assuming UTF-8, and returns a Result (the docs show `decode(...).expect("UTF-8")`/`unwrap` usage). [1]
   - Example usage: `use urlencoding::decode; let decoded = decode("%F0%9F%91%BE%20Exterminate%21").expect("UTF-8"); // decoded == "👾 Exterminate!"` [1]

2) urlencoding::decode_binary
   - Signature/behavior: decodes percent-encoded input as raw bytes (binary data), allowing arbitrary bytes and invalid UTF-8; you then convert as needed (e.g., `String::from_utf8_lossy`). [1]
   - Example usage: `use urlencoding::decode_binary; let binary = decode_binary(b"%F1%F2%F3%C0%C1%C2"); let decoded = String::from_utf8_lossy(&binary);` [1]

How percent-decoding works conceptually in this ecosystem: a commonly used lower-level percent-decoder in Rust is the percent_encoding crate's percent_decode function, whose API takes bytes and percent-decodes any "%" followed by two hex digits, returning an iterator-like type with decode_utf8/decode_utf8_lossy. [2] Signature (from docs): `pub fn percent_decode(input: &[u8]) -> PercentDecode<'_>`. [2]

So, for "urlencoding crate decode percent-decoding function API": use urlencoding::decode (UTF-8 Result) or urlencoding::decode_binary (raw bytes) depending on whether you need to handle invalid UTF-8.
Citations:
- 1: https://docs.rs/urlencoding/latest/urlencoding/
- 2: https://docs.rs/percent-encoding/*/percent_encoding/fn.percent_decode.html
🏁 Script executed:
# Check for existing tests of OTEL_RESOURCE_ATTRIBUTES
rg -A 10 -B 5 'OTEL_RESOURCE_ATTRIBUTES' crates/service_utils/src/observability.rs | head -40
# Check how urlencoding::decode is used elsewhere
rg -B 2 -A 3 'urlencoding::decode' crates/
🏁 Script executed:
# Look for tests in observability.rs
rg -n '#\[test\]|#\[cfg\(test\)\]' crates/service_utils/src/observability.rs
# Look for integration tests that might test observability init
fd -e rs 'test' crates/service_utils/src/ | head -10
# Check if there are any tests for observability module
fd observability crates/service_utils/
🏁 Script executed:
# Check the integration test file
cat crates/service_utils/tests/observability_integration.rs
# Also check lines around 147, 157, 176, 186 in observability.rs for test details
sed -n '145,190p' crates/service_utils/src/observability.rs
OTEL_RESOURCE_ATTRIBUTES values are not URL-decoded, violating the OTel specification.
Per the OpenTelemetry resource SDK specification, OTEL_RESOURCE_ATTRIBUTES follows the W3C Baggage format and requires percent-decoding of both keys and values (per RFC 3986). The current implementation (lines 85–96) trims whitespace but does not percent-decode, so an operator setting e.g. OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod%20us-east-1 ends up with the literal string prod%20us-east-1 instead of prod us-east-1.
The urlencoding crate is already in the project's dependencies. A straightforward fix:
Suggested fix
```diff
 if let Some((k, v)) = pair.split_once('=') {
     let k = k.trim().to_owned();
-    let v = v.trim().to_owned();
+    let v = urlencoding::decode(v.trim())
+        .unwrap_or_else(|_| v.trim().into())
+        .into_owned();
     if !k.is_empty() {
         resource_attrs.push(KeyValue::new(k, v));
     }
 }
```

Also consider decoding keys the same way, as the OTel spec requires percent-encoding for both.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/service_utils/src/observability.rs` around lines 85 - 96, The
OTEL_RESOURCE_ATTRIBUTES parsing loop does not percent-decode keys/values
(violating the OTel/W3C baggage format); update the loop that iterates
extra.split(',') and uses pair.split_once('=') so that both k and v are
percent-decoded (use the existing urlencoding crate) before trimming and pushing
into resource_attrs via KeyValue::new; ensure decoding errors are handled (skip
or log the offending pair) and decode both the key and value consistently.
```rust
// Busy fraction = total worker-seconds busy / total
// worker-seconds available over the sample interval.
let busy = m.total_busy_duration.as_secs_f64();
let total =
    interval.as_secs_f64() * (m.workers_count as f64).max(1.0);
let ratio = (busy / total).clamp(0.0, 1.0);
snap_for_task
    .busy_ratio_milli
    .store((ratio * 1000.0) as u64, Ordering::Relaxed);
```
🧩 Analysis chain
🌐 Web query:
In tokio-metrics 0.3, does `RuntimeMetrics` expose a `busy_ratio()` method and an `elapsed` field, and is `busy_ratio()` computed as total_busy_duration / (elapsed * workers_count)?
💡 Result:
Yes—at least in the docs for tokio-metrics RuntimeMetrics, there is a busy_ratio method and an elapsed field, but the busy_ratio computation formula you proposed is not supported by tokio-metrics 0.3 docs.

1) RuntimeMetrics has an elapsed field
   - RuntimeMetrics (tokio-metrics 0.3.x) includes a public field `elapsed: Duration` [1].

2) RuntimeMetrics has a busy_ratio method
   - The RuntimeMetrics API also includes a `pub fn busy_ratio(&self) -> f64` method [2].

3) busy_ratio computation: not total_busy_duration / (elapsed * workers_count)
   - The docs for busy_ratio in the RuntimeMetrics API state that busy_ratio "Returns the ratio of RuntimeMetrics::total_busy_duration to the RuntimeMetrics::elapsed." [2]
   - That description implies busy_ratio = total_busy_duration / elapsed (i.e., no multiplication by workers_count in the denominator) [2]. I found no doc evidence that busy_ratio multiplies elapsed by workers_count in the formula.

Conclusion: RuntimeMetrics exposes busy_ratio and elapsed, but busy_ratio is documented as total_busy_duration / elapsed, not total_busy_duration / (elapsed * workers_count).
Citations:
- 1: https://docs.rs/tokio-metrics/0.3.1/tokio_metrics/struct.RuntimeMetrics.html
- 2: https://docs.rs/tokio-metrics/latest/tokio_metrics/struct.RuntimeMetrics.html
🏁 Script executed:
cat -n crates/service_utils/src/observability/saturation/tokio_runtime.rs | head -80
🏁 Script executed:
cat -n crates/service_utils/src/observability/saturation/tokio_runtime.rs | sed -n '80,100p'
Use actual elapsed duration instead of configured interval when computing busy ratio, but preserve per-worker semantics.
The concern about using the configured interval is valid—under scheduler pressure, tokio::time::sleep(interval) may return late, causing the denominator to be inaccurate and under-reporting busy ratio when it matters most. However, the suggested diff has a hidden semantic issue:
tokio_metrics::RuntimeMetrics::busy_ratio() computes total_busy_duration / elapsed (aggregate ratio across all workers), whereas the current code computes total_busy_duration / (elapsed * workers_count) (per-worker average). The metric is named runtime.tokio.workers.busy_ratio, suggesting per-worker semantics should be preserved.
The correct fix: replace the configured interval with m.elapsed, but keep the workers_count multiplier:
```diff
-let busy = m.total_busy_duration.as_secs_f64();
-let total =
-    interval.as_secs_f64() * (m.workers_count as f64).max(1.0);
+let busy = m.total_busy_duration.as_secs_f64();
+let total =
+    m.elapsed.as_secs_f64() * (m.workers_count as f64).max(1.0);
 let ratio = (busy / total).clamp(0.0, 1.0);
```

Alternatively, if you intentionally want to switch to aggregate ratio, use `m.busy_ratio()` directly but document why the metric semantics changed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/service_utils/src/observability/saturation/tokio_runtime.rs` around
lines 50 - 58, Replace the configured interval with the actual elapsed duration
when computing the per-worker busy ratio: use m.elapsed (not interval) as the
time base while preserving the per-worker semantics by multiplying by
(m.workers_count as f64). In other words, compute total =
m.elapsed.as_secs_f64() * (m.workers_count as f64).max(1.0) and keep storing the
scaled value into snap_for_task.busy_ratio_milli via
snap_for_task.busy_ratio_milli.store((ratio * 1000.0) as u64,
Ordering::Relaxed), so you avoid under-reporting when sleep returns late but
retain the per-worker metric semantics rather than switching to the aggregate
m.busy_ratio().
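The difference between the configured interval and the measured window is easy to check with plain `Duration` arithmetic. The sketch below assumes nothing from tokio-metrics; `per_worker_busy_ratio` is a hypothetical helper mirroring the formula the fix prescribes (measured elapsed in the denominator, `workers_count` multiplier retained):

```rust
use std::time::Duration;

// Per-worker busy ratio: busy worker-seconds over the worker-seconds actually
// available in the sampled window. Using the measured elapsed time (rather
// than the configured interval) keeps the denominator honest when the
// sampling sleep wakes up late.
fn per_worker_busy_ratio(total_busy: Duration, elapsed: Duration, workers: u32) -> f64 {
    let total = elapsed.as_secs_f64() * (workers as f64).max(1.0);
    (total_busy.as_secs_f64() / total).clamp(0.0, 1.0)
}

fn main() {
    // 4 workers over a 10 s window with 12 worker-seconds busy -> 0.3.
    let on_time = per_worker_busy_ratio(Duration::from_secs(12), Duration::from_secs(10), 4);
    // If the sleep returns 2 s late, dividing by the configured 10 s would
    // misstate the ratio; the measured 12 s window gives the true value.
    let late = per_worker_busy_ratio(Duration::from_secs(12), Duration::from_secs(12), 4);
    println!("{on_time:.3} {late:.3}"); // prints 0.300 0.250
}
```

The `.max(1.0)` guard and `.clamp(0.0, 1.0)` carry over unchanged from the code under review.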
```rust
let metrics_server_handle = if let Some(obs) = observability.as_ref() {
    let bind: std::net::SocketAddr = format!("{}:{}", obs_cfg.bind, obs_cfg.port)
        .parse()
        .expect("invalid metrics bind addr");
    Some(spawn_metrics_server(obs.registry(), bind)?)
} else {
    None
};
```
Metrics bind construction is broken for IPv6 addresses.
format!("{}:{}", obs_cfg.bind, obs_cfg.port).parse::<SocketAddr>() works for IPv4, but if an operator configures SUPERPOSITION_METRICS_BIND=:: or ::1, the resulting string ::1:9091 is not a valid SocketAddr representation — IPv6 socket strings require bracketed host ([::1]:9091). This will trigger the .expect("invalid metrics bind addr") panic at startup with no warning path, contradicting the graceful-degradation philosophy applied to Observability::init above.
Use SocketAddr::new to build the address directly from the typed IpAddr and u16 — no string round-trip, no IPv6 special-casing.
🛠️ Proposed fix
```diff
-let metrics_server_handle = if let Some(obs) = observability.as_ref() {
-    let bind: std::net::SocketAddr = format!("{}:{}", obs_cfg.bind, obs_cfg.port)
-        .parse()
-        .expect("invalid metrics bind addr");
-    Some(spawn_metrics_server(obs.registry(), bind)?)
-} else {
-    None
-};
+let metrics_server_handle = if let Some(obs) = observability.as_ref() {
+    let bind = std::net::SocketAddr::new(obs_cfg.bind, obs_cfg.port);
+    match spawn_metrics_server(obs.registry(), bind) {
+        Ok(h) => Some(h),
+        Err(e) => {
+            tracing::warn!(error = %e, %bind, "metrics server failed to start; continuing without scrape endpoint");
+            None
+        }
+    }
+} else {
+    None
+};
```

Two changes:

1. `SocketAddr::new(IpAddr, u16)` — IPv4/IPv6 correct, no parsing.
2. Treat metrics-server bind failure the same way an `Observability::init` failure is treated (warn + disable), so a transient port conflict on `:9091` doesn't take down the API.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/superposition/src/main.rs` around lines 192 - 199, The metrics bind
construction currently formats a string and parses to SocketAddr which fails for
IPv6 (e.g., "::1:9091"); change the code in the block that sets
metrics_server_handle (references: observability, obs_cfg,
metrics_server_handle, spawn_metrics_server) to construct the address with
SocketAddr::new(std::net::IpAddr::from_str(&obs_cfg.bind)? or obs_cfg.bind
parsed to IpAddr, and obs_cfg.port as u16) instead of string parsing, and if
spawn_metrics_server returns an Err treat it like Observability::init failures
(log a warning and set metrics_server_handle to None) so a bind error doesn’t
panic the process. Ensure the fix parses obs_cfg.bind to IpAddr (handling
parsing errors gracefully) and uses SocketAddr::new to support IPv6 properly.
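The IPv4-only failure mode is easy to demonstrate with std alone. This snippet assumes nothing about the PR's config types; `metrics_bind` is a hypothetical helper that just exercises `SocketAddr` string parsing versus `SocketAddr::new`:

```rust
use std::net::{IpAddr, SocketAddr};
use std::str::FromStr;

// Build a bind address from a typed IpAddr instead of formatting and
// re-parsing a "host:port" string, which breaks for IPv6 because the
// textual form requires brackets: "[::1]:9091".
fn metrics_bind(ip: &str, port: u16) -> Option<SocketAddr> {
    Some(SocketAddr::new(IpAddr::from_str(ip).ok()?, port))
}

fn main() {
    // The string round-trip works for IPv4 but not for bare IPv6:
    assert!("127.0.0.1:9091".parse::<SocketAddr>().is_ok());
    assert!("::1:9091".parse::<SocketAddr>().is_err()); // would hit the expect()
    // SocketAddr::new sidesteps string formatting entirely:
    let bind = metrics_bind("::1", 9091).unwrap();
    println!("{bind}"); // prints [::1]:9091
}
```

Returning `Option` also gives the caller a natural place to log a warning and disable the metrics server instead of panicking on a bad bind address.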
- The OpenTelemetry Rust SDK has had API churn between minor versions. The exact import paths and builder method names below are written against `opentelemetry` 0.27. If you pin a different version in Task 1, expect to adjust 1–3 import paths or method names per call site. The plan uses the **stable** APIs only (no unstable/preview features).
- After Task 1 (deps), run `cargo check -p service_utils` after every code-touching task to catch wiring issues early — even on tasks that don't add tests yet.
- File commit boundary: each task ends with one commit. If a step within a task fails, fix and continue within the same task before committing.
- All paths in this plan are absolute under the repo root: `/Users/natarajankannan/src/superposition/`.
Avoid machine-specific absolute paths in the plan.
Line 18 hardcodes a local filesystem path, which makes the instructions non-portable for other contributors and CI shells. Prefer repo-root-relative paths throughout.
Suggested doc edit

```diff
-- All paths in this plan are absolute under the repo root: `/Users/natarajankannan/src/superposition/`.
+- All paths in this plan are relative to the repository root.
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md` at line
18, Replace the machine-specific absolute path string
"/Users/natarajankannan/src/superposition/" on line containing "All paths in
this plan are absolute under the repo root" with a repo-root-relative notation
(e.g., "repo-root-relative paths" or a placeholder like "<repo-root>/...") so
the document uses portable, non-machine-specific paths; update the sentence in
the same paragraph to instruct contributors to use repo-root-relative or
environment-variable-based placeholders instead of local absolute filesystem
paths.
````text
### 5.1 Data flow

```
````
Add language identifiers to fenced code blocks (MD040).
Lines 69 and 108 use unlabeled fenced blocks, which fails markdownlint. Add a language tag (for these blocks, text fits).
Suggested doc edit

````diff
-```
+```text
 [request on :8080]
 ...
 ```
-```
+```text
 crates/service_utils/src/
 ...
 ```
````

Also applies to: 108-108
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 69-69: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md`
at line 69, Two fenced code blocks in the doc (the block showing "[request on
:8080] ..." and the block listing "crates/service_utils/src/...") are unlabeled
and trigger markdownlint MD040; update each opening triple-backtick to include
the language identifier text (i.e., change ``` to ```text) so both code fences
become labeled, leaving their closing backticks unchanged.
````text
```
curl http://localhost:9091/metrics
```
````
Add a language identifier to the code fence.
markdownlint (MD040) flags this fence. Use bash so the curl snippet gets syntax highlighting and the lint passes.
📝 Suggested change

````diff
-```
-curl http://localhost:9091/metrics
-```
+```bash
+curl http://localhost:9091/metrics
+```
````
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 157-157: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@README.md` around lines 157 - 159, The markdown code fence containing the
curl example is missing a language identifier, triggering markdownlint MD040;
update the fenced block that wraps the curl invocation (the triple-backtick
block around "curl http://localhost:9091/metrics") to include the bash language
identifier (```bash) so the snippet is syntax-highlighted and the lint rule
passes.
… (items 1 & 3) Item 1: Wrap OTLP reader init in a match so failure logs a warning and continues with Prometheus-only metrics instead of propagating Err and disabling the /metrics scrape endpoint entirely. Item 3: Warn at startup when OTEL_EXPORTER_OTLP_PROTOCOL is set to a value other than 'http/protobuf' (e.g. grpc), since only HTTP/proto is supported in v1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The crate was only listed to satisfy the unused_crate_dependencies lint via a use-as-_ in lib.rs; actual metric attribute keys are hardcoded strings. Remove it from both workspace and service_utils Cargo.toml, and remove all the use-as-_ shims from lib.rs (the real crates are now imported directly in the observability sub-modules). Also restructure Observability::init to clone the Resource so the Prom-only fallback path can reuse it after the builder is consumed by with_otlp_reader. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Item 4: Drop the empty #[cfg(test)] mod tests {} placeholder in db_pool.rs;
real coverage lives in tests/observability_integration.rs.
Items 5+6: Remove the #[cfg(not(tokio_unstable))] spawn stub from
tokio_runtime.rs. The call site in saturation.rs is itself gated on
#[cfg(tokio_unstable)], so no symbol resolution is needed when the flag
is off. Add a scoped cfg(not(tokio_unstable)) use-as-_ in lib.rs to
keep the unused_crate_dependencies lint green on both build paths.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…item 7) Replace the idle_connections/used_connections trait methods (which always returned None because fred 9.2.1 has no such public API) with a single connected_connections() method using ClientLike::is_connected() to count active pool members. Replace the redis.client.connections.usage gauge (which never emitted data) with redis.client.connections.connected. Keep the existing redis.client.commands.in_flight gauge unchanged. Remove the #[allow(dead_code)] on FredPoolStats since both getters are now meaningful. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…from_env (item 8)

Introduce from_source<F>(get: F) that accepts a closure returning Option<String> per key. from_env() delegates to it using std::env::var. Tests rewritten to use from_source with a HashMap-backed closure — no unsafe set_var/remove_var, no Mutex, no --test-threads=1 requirement. Three tests pass in parallel with zero process-global state mutation.

Also add cfg(test) use-as-_ for opentelemetry_otlp in lib.rs to keep the unused_crate_dependencies lint clean when the not(test) OTLP reader function is compiled out during test runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
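The `from_source` pattern above can be sketched as follows. Field names and defaults here are illustrative (taken from the env-var table in the PR description); the real struct has more fields.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct MetricsConfig {
    enabled: bool,
    port: u16,
}

impl MetricsConfig {
    // Parsing is driven by a caller-supplied lookup closure, so tests never
    // mutate process-global env state.
    fn from_source<F: Fn(&str) -> Option<String>>(get: F) -> Result<Self, String> {
        let enabled = match get("SUPERPOSITION_METRICS_ENABLED") {
            None => true, // default
            Some(v) => v
                .parse()
                .map_err(|e| format!("SUPERPOSITION_METRICS_ENABLED: {e}"))?,
        };
        let port = match get("SUPERPOSITION_METRICS_PORT") {
            None => 9091, // default
            Some(v) => v
                .parse()
                .map_err(|e| format!("SUPERPOSITION_METRICS_PORT: {e}"))?,
        };
        Ok(Self { enabled, port })
    }

    // Production entry point delegates to the same parser over real env vars.
    fn from_env() -> Result<Self, String> {
        Self::from_source(|k| std::env::var(k).ok())
    }
}

fn main() {
    // HashMap-backed closure: no set_var/remove_var, safe to run in parallel.
    let vars: HashMap<&str, &str> = [("SUPERPOSITION_METRICS_PORT", "9999")].into();
    let cfg = MetricsConfig::from_source(|k| vars.get(k).map(|v| v.to_string())).unwrap();
    assert_eq!(cfg, MetricsConfig { enabled: true, port: 9999 });
    let _ = MetricsConfig::from_env();
}
```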
`health_endpoints()` previously returned `web::scope("")` with the three
health routes inside it. An empty-prefix Actix scope matches every
request path and returns 404 for any path that does not match one of
the routes inside the scope — so `/admin/organisations` and every other
non-health URL was being swallowed by the health scope and returning
404 before reaching Leptos or the main API scope registered after it.
Replaces the helper with `configure_health_endpoints(&mut ServiceConfig)`
that registers the three routes directly at App root via `.configure`,
so unmatched paths fall through to subsequent services as expected.
Adds a regression test `does_not_shadow_other_routes` that wires both
the health endpoints and an unrelated `/other` route into the same App
and verifies both reach their handlers.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tic-conventions The Cargo.lock entry was missed in e98e978 when the dependency was dropped from Cargo.toml. CI's uniffi bindings job regenerates the lock during build and the resulting diff was failing the post-build `git diff --exit-code` check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`make fmt` output on files modified in this branch. All changes are purely stylistic (edition-2024 formatting per .rustfmt.toml): import re-sorting in `use` blocks, trailing commas, and line wrapping within max_width=90. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five clippy errors caught by `cargo clippy -- -Dwarnings` on rustc 1.90
(CI toolchain):
- middleware.rs:222: drop needless borrow on `res` already bound from
`&result` destructuring (needless_borrow).
- middleware.rs:317: bare `HttpResponse::Ok()` returns the awaitable
HttpResponseBuilder; calling `.finish()` produces an HttpResponse and
clears async_yields_async in the test handler.
- db_pool.rs:44 + redis_pool.rs:{73,87}: use `std::slice::from_ref(&kv)`
for the single-attribute slice instead of `&[kv.clone()]`
(cloned_ref_to_slice_refs).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
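The `from_ref` fix in the list above can be illustrated with plain tuples standing in for OpenTelemetry `KeyValue` attributes:

```rust
fn main() {
    // A single attribute; cloning it into a temporary array (`&[kv.clone()]`)
    // would allocate two fresh Strings per observation.
    let kv = ("http.route".to_string(), "/metrics".to_string());

    // std::slice::from_ref borrows the existing value as a one-element slice,
    // which is what clippy's cloned_ref_to_slice_refs lint suggests.
    let attrs: &[(String, String)] = std::slice::from_ref(&kv);

    assert_eq!(attrs.len(), 1);
    assert_eq!(attrs[0].1, "/metrics");
}
```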
Same fix as e9e4fea — five more bare `HttpResponse::*()` calls in test handlers were returning the awaitable HttpResponseBuilder. Appending `.finish()` materialises an HttpResponse and clears the `clippy::async_yields_async` lint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…urce attrs, accurate busy_ratio

Three correctness fixes flagged by CodeRabbit:

- main.rs: build the metrics SocketAddr from the parsed IpAddr in config (SocketAddr::new) instead of via string formatting + parse, which is ambiguous for IPv6 addresses like "::1:9091" (needs brackets). Demote spawn_metrics_server errors from a process-killing `?` to a warn-and-continue, matching the OTLP "best-effort" stance.
- observability.rs: percent-decode keys and values from OTEL_RESOURCE_ATTRIBUTES per the W3C baggage format the OTel spec references. Without this, values containing %2C / %3D were stored verbatim instead of as "," / "=".
- tokio_runtime.rs: compute the busy_ratio denominator from RuntimeMetrics::elapsed rather than the configured sampler interval, so the ratio stays accurate when the sampler tick runs late under load.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
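The IPv6 ambiguity in the first fix is easy to demonstrate with the standard library: a string-formatted "::1:9091" is not a valid socket-address literal (IPv6 needs brackets), while constructing from an already-parsed `IpAddr` sidesteps parsing entirely.

```rust
use std::net::{IpAddr, SocketAddr};
use std::str::FromStr;

fn main() {
    let bind: IpAddr = IpAddr::from_str("::1").unwrap();

    // String formatting yields "::1:9091", which fails to parse:
    // an IPv6 socket address literal must be written "[::1]:9091".
    assert!(format!("{}:{}", bind, 9091).parse::<SocketAddr>().is_err());

    // SocketAddr::new takes the parsed IpAddr directly; no re-parsing,
    // and Display adds the brackets for you.
    let addr = SocketAddr::new(bind, 9091);
    assert_eq!(addr.to_string(), "[::1]:9091");
}
```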
Bundle of low-risk cleanups from CodeRabbit review:
- README.md, design spec, plan: add bash/text language tags to fenced
code blocks (markdownlint MD040).
- Plan doc: replace local absolute path with <repo-root>/ placeholder.
- meters.rs: add `.with_unit("{request}")` to active_requests for OTel
semconv parity with the duration metrics.
- middleware.rs: rename `matched_route_returns_pattern` to
`matched_route_setup_smoke` — the test only asserts status, the route
extraction itself is covered in observability_integration.rs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously `try_join!(main_server, metrics_handle)` aborted the main HTTP server if the metrics server task ever returned an error (port reclaimed, listener closed, transient bind issue). That contradicts the "metrics are best-effort" stance applied elsewhere (OTLP failures log a warning; bind failure was already demoted to warn-and-continue in the previous commit).

Spawn the metrics server as a detached tokio task that logs on error, and let the main API server's lifetime drive process exit.

Drop the `futures-util` crate dependency from superposition since `try_join!` was its only use site. Add an explicit `tokio` dep (previously transitive via futures-util/actix-web; now direct).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…elper

`parse_bool` and the open-coded `IpAddr::from_str`/`u16::parse`/`humantime::parse_duration` sites each named the env key twice — once for the lookup and once for the error label. Six fields × two mentions = twelve places where the key string had to stay in sync.

Replace with one generic `parse_or_default<T: FromStr>` helper that takes the key once. `bool`/`IpAddr`/`u16`/`humantime::Duration` all implement `FromStr`, so they go through the same path; the helper folds the lookup, the default, and the error label into one place.

The error message format changes uniformly to "<KEY>: <parse-error>" (the only existing assertion just checks for the key name as a substring, so it still holds).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
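A minimal sketch of such a helper, assuming the signature described above (the real implementation may differ in error type and closure plumbing):

```rust
use std::fmt::Display;
use std::str::FromStr;

// The env key is named exactly once; any FromStr type (bool, u16, IpAddr,
// humantime::Duration, ...) flows through the same lookup/default/error path.
fn parse_or_default<T: FromStr>(
    get: impl Fn(&str) -> Option<String>,
    key: &str,
    default: T,
) -> Result<T, String>
where
    T::Err: Display,
{
    match get(key) {
        None => Ok(default),
        Some(raw) => raw.parse::<T>().map_err(|e| format!("{key}: {e}")),
    }
}

fn main() {
    // Unset key falls back to the default.
    assert_eq!(
        parse_or_default(|_| None, "SUPERPOSITION_METRICS_PORT", 9091u16).unwrap(),
        9091
    );
    // Malformed value surfaces the key in the error label.
    let err = parse_or_default(|_| Some("ten".to_string()), "SUPERPOSITION_METRICS_PORT", 9091u16)
        .unwrap_err();
    assert!(err.contains("SUPERPOSITION_METRICS_PORT"));
}
```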
Summary
Adds an Actix middleware and supporting subsystem that emits Google SRE golden signals (latency, traffic, errors, saturation) for every HTTP route on the main API, exposed via Prometheus scrape on a dedicated port (default `:9091`) and optional OTLP push. Built on the OpenTelemetry Rust SDK, so users can scrape into Prometheus / VictoriaMetrics / Mimir or push to any OTel-native backend without code changes.

- `docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md`
- `docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md`

**What ships**
**HTTP golden signals** (`MetricsMiddleware`, automatically wraps every route)

- `http.server.request.duration` — histogram (s), 9 explicit buckets `[0.005..10]`, labels: `http.request.method`, `http.route`, `http.response.status_code`, `sp.org_id`, `sp.workspace_id`. Traffic and error rate are derived from `_count` in PromQL.
- `http.server.busy.duration` — counter (s). `rate(...)` gives time-averaged request concurrency (insensitive to scrape aliasing, unlike `active_requests`).
- `http.server.active_requests` — UpDownCounter. Semconv standard, included for completeness.

Both tenant labels are env-disable-able for very-high-cardinality deployments.
**Saturation signals**

- `db.client.connections.{usage,max}` — passive r2d2 callbacks, no instrumentation at `pool.get()` call sites.
- `redis.client.{connections.connected,commands.in_flight}` — via a tolerant `RedisStats` trait over fred's `MetricsInterface`.
- `runtime.tokio.{workers,workers.busy_ratio,global_queue.depth}` — `cfg(tokio_unstable)`-gated. A background sampler updates atomics; observable gauges read them at scrape time.

**Endpoints**
- `GET /metrics` on `SUPERPOSITION_METRICS_PORT` (default `9091`) — Prometheus text exposition.
- `GET /healthz`, `/livez`, `/readyz` on the main port `:8080`. Auth bypass via `tenant_middleware_exclusion_list`.

**Resilience**

`__not_found__` and `__static__` route sentinels bound `http.route` cardinality.

**Configuration**
Standard OTel env vars are honored (`OTEL_SERVICE_NAME`, `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_EXPORTER_OTLP_HEADERS`, `OTEL_RESOURCE_ATTRIBUTES`). Plus:

| Variable | Default | Notes |
|---|---|---|
| `SUPERPOSITION_METRICS_ENABLED` | `true` | |
| `SUPERPOSITION_METRICS_PORT` | `9091` | |
| `SUPERPOSITION_METRICS_BIND` | `0.0.0.0` | |
| `SUPERPOSITION_METRICS_LABEL_ORG` | `true` | Toggles the `sp.org_id` label. |
| `SUPERPOSITION_METRICS_LABEL_WORKSPACE` | `true` | Toggles the `sp.workspace_id` label. |
| `SUPERPOSITION_METRICS_COLLECT_INTERVAL` | `10s` | |

**Build notes**
`.cargo/config.toml` enables `--cfg tokio_unstable` workspace-wide so `tokio-metrics` can collect runtime saturation. This flag only adds APIs; no behavioural change for existing code. Building outside cargo (custom IDE invocations) needs the same flag — otherwise the `runtime.tokio.*` metrics simply won't appear.
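As a sketch, a local run wiring the configuration above together might look like this (endpoint value is a placeholder, not part of the PR):

```bash
# Values are illustrative; defaults are shown in the table above.
export SUPERPOSITION_METRICS_ENABLED=true
export SUPERPOSITION_METRICS_PORT=9091
export SUPERPOSITION_METRICS_BIND=0.0.0.0
export SUPERPOSITION_METRICS_COLLECT_INTERVAL=10s
export OTEL_SERVICE_NAME=superposition
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318  # optional OTLP push

# tokio_unstable is already set workspace-wide via .cargo/config.toml;
# only custom build invocations outside cargo need it explicitly.
RUSTFLAGS="--cfg tokio_unstable" cargo run -p superposition

# Then scrape the exposition endpoint:
curl -s localhost:9091/metrics | head
```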
Test Plan

- Integration tests assert the `__not_found__` sentinel, `busy_duration > 0`, and that `active_requests` returns to 0.
- Builds verified with `RUSTFLAGS="--cfg tokio_unstable"` (on) and `RUSTFLAGS=""` (off → no-op stub).
Phased per spec §13:
Future work (explicitly out of scope, tracked in spec §14)
🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
- Prometheus metrics exposition on a dedicated port (default `9091`) via `GET /metrics`.
- Health endpoints: `/healthz`, `/livez`, `/readyz`.

Chores