feat(observability): OpenTelemetry golden-signals middleware #998

Open

knutties wants to merge 40 commits into main from feat/otel-golden-signals-middleware

Conversation

@knutties (Collaborator) commented May 11, 2026

Summary

Adds an Actix middleware and supporting subsystem that emits Google SRE golden signals (latency, traffic, errors, saturation) for every HTTP route on the main API, exposed via Prometheus scrape on a dedicated port (default :9091) and optional OTLP push. Built on the OpenTelemetry Rust SDK so users can scrape into Prometheus / VictoriaMetrics / Mimir or push to any OTel-native backend without code changes.

What ships

HTTP golden signals (MetricsMiddleware, automatically wraps every route)

  • http.server.request.duration — histogram (s), 9 explicit buckets [0.005..10], labels: http.request.method, http.route, http.response.status_code, sp.org_id, sp.workspace_id. Traffic + error rate are derived from _count in PromQL.
  • http.server.busy.duration — counter (s). rate(...) gives time-averaged request concurrency (insensitive to scrape aliasing, unlike active_requests).
  • http.server.active_requests — UpDownCounter. Semconv standard, included for completeness.

* Both tenant labels are env-disable-able for very-high-cardinality deployments.
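As a sketch, the derivations above can be written in PromQL. Exposition names here are assumptions: the exact series names depend on the Prometheus exporter's normalization (dots become underscores, unit suffixes are appended), so `http.server.request.duration` in seconds is assumed to surface as `http_server_request_duration_seconds_*`:

```promql
# Traffic: requests per second, per route
sum by (http_route) (rate(http_server_request_duration_seconds_count[5m]))

# Errors: share of 5xx responses
  sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
/ sum(rate(http_server_request_duration_seconds_count[5m]))

# Latency: p95 from the explicit buckets
histogram_quantile(0.95,
  sum by (le, http_route) (rate(http_server_request_duration_seconds_bucket[5m])))

# Saturation (HTTP side): time-averaged request concurrency from busy.duration
rate(http_server_busy_duration_seconds_total[5m])
```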

Saturation signals

  • db.client.connections.{usage,max} — passive r2d2 callbacks, no instrumentation at pool.get() call sites.
  • redis.client.{connections.usage,commands.in_flight} — via a tolerant RedisStats trait over fred's MetricsInterface.
  • runtime.tokio.{workers,workers.busy_ratio,global_queue.depth} — cfg(tokio_unstable)-gated. Background sampler updates atomics; observable gauges read them at scrape time.

Endpoints

  • GET /metrics on SUPERPOSITION_METRICS_PORT (default 9091) — Prometheus text exposition.
  • GET /healthz, /livez, /readyz on main port :8080. Auth bypass via tenant_middleware_exclusion_list.

Resilience

  • Bad OTLP config logs and degrades to disabled, doesn't crash the binary.
  • `/healthz` etc. instrumented (we want to observe probe latency) but exempt from auth_n/auth_z.
  • __not_found__ and __static__ route sentinels bound http.route cardinality.

Configuration

Standard OTel env vars are honored (OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_HEADERS, OTEL_RESOURCE_ATTRIBUTES). Plus:

| Var | Default | Purpose |
| --- | --- | --- |
| `SUPERPOSITION_METRICS_ENABLED` | `true` | Master switch. |
| `SUPERPOSITION_METRICS_PORT` | `9091` | Scrape port. |
| `SUPERPOSITION_METRICS_BIND` | `0.0.0.0` | Bind address. |
| `SUPERPOSITION_METRICS_LABEL_ORG` | `true` | Include `sp.org_id`. |
| `SUPERPOSITION_METRICS_LABEL_WORKSPACE` | `true` | Include `sp.workspace_id`. |
| `SUPERPOSITION_METRICS_COLLECT_INTERVAL` | `10s` | Tokio runtime poll interval. |

Build notes

`.cargo/config.toml` enables `--cfg tokio_unstable` workspace-wide so `tokio-metrics` can collect runtime saturation. This flag only adds APIs; no behavioural change for existing code. Building outside cargo (custom IDE invocations) needs the same flag — otherwise the `runtime.tokio.*` metrics simply won't appear.
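For reference, the flag described above is typically set like this (a sketch; the repo's actual `.cargo/config.toml` may contain more than shown):

```toml
# .cargo/config.toml — picked up by every cargo invocation in the workspace
[build]
rustflags = ["--cfg", "tokio_unstable"]
```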

Test Plan

  • `cargo test --workspace` — all green
  • 17 unit tests in `service_utils::observability` (config, middleware, meters, health, metrics_server)
  • 2 integration tests in `crates/service_utils/tests/observability_integration.rs`:
    • `metrics_appear_after_requests` — wraps an Actix App, sends 200/201/500/404, asserts on scraped Prometheus exposition (per-route counts, 5xx series, __not_found__ sentinel, busy_duration > 0, active_requests returns to 0).
    • `cardinality_stays_within_budget` — fails CI if a 3-route × 1-method scenario exceeds 200 series. Current count: 41.
  • Both cfg branches compile cleanly: default (tokio_unstable on) and RUSTFLAGS="" (off → no-op stub).
  • `cargo check -p superposition` — clean wiring.
  • Live deployment smoke (deferred to staging rollout per spec §13).

Rollout

Phased per spec §13:

  1. Land with code default `SUPERPOSITION_METRICS_ENABLED=true` but disabled in prod env files.
  2. Enable in staging, watch VM ingest 48h.
  3. Enable in prod with `SUPERPOSITION_METRICS_LABEL_WORKSPACE=false`.
  4. Flip workspace label on after a week of clean VM headroom.

Future work (explicitly out of scope, tracked in spec §14)

  • Trace correlation via exemplars (requires `tracing-opentelemetry` bridge).
  • Per-tenant separate histogram if global cardinality budget tightens.
  • Grafana dashboards JSON + VM/Prometheus alert rules (separate PR).
  • Typed pool wrapper to unlock `db.client.connection.wait.duration` + `pending_requests`.
  • gRPC OTLP transport (currently `http/protobuf` only).

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Prometheus metrics exposure on configurable port (default 9091) via GET /metrics
    • Added health check endpoints: /healthz, /livez, /readyz
    • Added HTTP request metrics (latency, active requests, saturation)
    • Added database and Redis pool saturation monitoring
    • Added optional OTLP metrics export support
  • Chores

    • Added observability dependencies and configuration system with environment variable support


@knutties knutties requested a review from a team as a code owner May 11, 2026 08:20
semanticdiff-com Bot commented May 11, 2026

Review changes with SemanticDiff

Changed Files
File Status
  crates/experimentation_platform/src/api/experiments/helpers.rs  100% smaller
  crates/experimentation_platform/src/api/experiments/handlers.rs  68% smaller
  crates/superposition/src/app_state.rs  63% smaller
  crates/experimentation_platform/src/api/experiment_groups/handlers.rs  59% smaller
  crates/superposition_types/src/database/models.rs  34% smaller
  crates/superposition/src/main.rs  4% smaller
  .cargo/config.toml Unsupported file format
  .gitignore Unsupported file format
  Cargo.lock Unsupported file format
  Cargo.toml Unsupported file format
  README.md Unsupported file format
  crates/service_utils/Cargo.toml Unsupported file format
  crates/service_utils/src/lib.rs  0% smaller
  crates/service_utils/src/observability.rs  0% smaller
  crates/service_utils/src/observability/config.rs Unsupported file format
  crates/service_utils/src/observability/health.rs  0% smaller
  crates/service_utils/src/observability/meters.rs  0% smaller
  crates/service_utils/src/observability/metrics_server.rs  0% smaller
  crates/service_utils/src/observability/middleware.rs  0% smaller
  crates/service_utils/src/observability/saturation.rs  0% smaller
  crates/service_utils/src/observability/saturation/db_pool.rs  0% smaller
  crates/service_utils/src/observability/saturation/redis_pool.rs  0% smaller
  crates/service_utils/src/observability/saturation/tokio_runtime.rs  0% smaller
  crates/service_utils/tests/observability_integration.rs  0% smaller
  crates/superposition/Cargo.toml Unsupported file format
  docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md Unsupported file format
  docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md Unsupported file format

Copilot AI review requested due to automatic review settings May 11, 2026 08:20
coderabbitai Bot commented May 11, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Walkthrough

Adds complete OpenTelemetry observability to the Superposition HTTP API, including Prometheus metrics exposition, health endpoints, saturation monitoring for database/Redis/Tokio runtime, all configurable via environment variables with optional OTLP push integration.

Changes

OpenTelemetry Observability Implementation

  • Build Configuration & Dependencies (`.cargo/config.toml`, `.gitignore`, `Cargo.toml`, `crates/service_utils/Cargo.toml`, `crates/superposition/Cargo.toml`) — Workspace dependencies added for opentelemetry, prometheus, tokio-metrics, humantime, thiserror; `.cargo/config.toml` sets `--cfg tokio_unstable`; build lints configure check-cfg for the flag.
  • Module Exports & Crate Setup (`crates/service_utils/src/lib.rs`) — New `pub mod observability` exported; placeholder `use <crate> as _` imports for dependencies to avoid unused-crate-dependencies warnings.
  • Configuration Loading from Environment (`crates/service_utils/src/observability/config.rs`) — ObservabilityConfig struct parses env vars (enabled/bind/port/labels/interval/service identity/OTLP endpoint); LabelConfig controls org/workspace label inclusion; unit tests verify defaults and error handling.
  • Observability Core & MeterProvider (`crates/service_utils/src/observability.rs`) — Observability struct initializes an OpenTelemetry SdkMeterProvider with the Prometheus exporter, an optional OTLP HTTP exporter, and resource attributes merged from env; ObservabilityError covers init/shutdown failures.
  • Health Probe Endpoints (`crates/service_utils/src/observability/health.rs`) — Three GET endpoints (/healthz, /livez, /readyz) return HTTP 200 with an "ok" body; health_endpoint_paths() returns a static list for auth exclusion configuration.
  • HTTP Metric Instruments (`crates/service_utils/src/observability/meters.rs`) — HttpMeters struct holds the request duration histogram, busy duration counter, and active-requests up/down counter; new() constructs the instruments from a provided Meter.
  • HTTP Metrics Middleware (`crates/service_utils/src/observability/middleware.rs`) — Actix middleware normalizes HTTP methods, extracts route patterns with sentinels for not-found/static routes, builds attributes with optional org/workspace labels, records latency/busy/active metrics, and implements panic-safe in-flight tracking via InFlightGuard.
  • Saturation Collection Infrastructure (`crates/service_utils/src/observability/saturation.rs`) — SaturationDeps holds optional DB/Redis handles plus the Tokio collection interval; register_observers() conditionally registers saturation collectors and spawns the Tokio runtime monitor.
  • Database Pool Saturation (`crates/service_utils/src/observability/saturation/db_pool.rs`) — Observable gauges for r2d2 PostgreSQL pool idle/used connections and max pool size via OpenTelemetry callbacks; the DbPoolHandle type alias wraps the pool.
  • Redis Pool Saturation (`crates/service_utils/src/observability/saturation/redis_pool.rs`) — RedisStats trait abstracts optional metrics; a FredPoolStats impl for fred::RedisPool sums commands in flight; observable gauges for pool saturation with state labels.
  • Tokio Runtime Saturation (`crates/service_utils/src/observability/saturation/tokio_runtime.rs`) — cfg(tokio_unstable)-gated background task polls tokio_metrics::RuntimeMonitor intervals; registers observable gauges for worker count, queue depth, and busy ratio via atomic snapshots.
  • Metrics Exposition Server (`crates/service_utils/src/observability/metrics_server.rs`) — Standalone Actix server on SUPERPOSITION_METRICS_PORT exposing a GET /metrics endpoint; TextEncoder serializes the Prometheus registry to text format.
  • Application Startup & Server Integration (`crates/superposition/src/main.rs`) — Loads observability config (fail fast on invalid); inits Observability and continues on failure; registers saturation observers for DB/Redis; spawns the metrics server; mounts health endpoints; conditionally wraps the app with MetricsMiddleware; runs the main and metrics servers concurrently via futures_util::try_join.
  • Auth Middleware Exclusion List (`crates/superposition/src/app_state.rs`) — The tenant middleware exclusion list unconditionally includes the observability health endpoint paths.
  • Integration & Unit Tests (`crates/service_utils/tests/observability_integration.rs`) — End-to-end tests validate metrics recording across routes/status codes, Prometheus exposition format, and the cardinality budget (<= 200 series) via repeated requests.
  • User-Facing Documentation (`README.md`) — Describes the Prometheus metrics port (default 9091), the health endpoints, and the purpose of the tokio_unstable flag.
  • Design Specification & Implementation Plan (`docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md`, `docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md`) — Complete architecture, metric schemas, environment variables, configuration, testing strategy, rollout phases, and a 22-task implementation plan.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

P0

Suggested reviewers

  • mahatoankitkumar
  • sauraww
  • Datron

Poem

🐰 Metrics bloom like garden clocks,
Prometheus scrapes from observation blocks,
Health checks hop through /healthz gates,
Saturation gauges seal their fates,
Observability's golden path awaits! 📊✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 78.13%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4)

  • Description Check — ✅ Passed (check skipped: CodeRabbit's high-level summary is enabled).
  • Title check — ✅ Passed: the PR title 'feat(observability): OpenTelemetry golden-signals middleware' accurately and clearly describes the primary change.
  • Linked Issues check — ✅ Passed (check skipped: no linked issues were found for this pull request).
  • Out of Scope Changes check — ✅ Passed (check skipped: no linked issues were found for this pull request).



@knutties knutties force-pushed the feat/otel-golden-signals-middleware branch from 4951507 to 726f0c8 Compare May 11, 2026 08:22
Copilot AI left a comment

Pull request overview

This PR adds a new observability subsystem to the Superposition Actix API that exports Google SRE “golden signal” metrics (latency/traffic/errors/saturation) via OpenTelemetry, exposing Prometheus scrape output on a dedicated port and optionally pushing metrics via OTLP. It wires the subsystem into the superposition binary (middleware + /metrics server + health endpoints) and documents the design/rollout.

Changes:

  • Introduces service_utils::observability (config, OTel MeterProvider init, Actix metrics middleware, saturation collectors, metrics server, health endpoints) plus integration tests and a cardinality budget test.
  • Wires observability into the main API binary: /healthz|/livez|/readyz, auth bypass list updates, metrics server spawned alongside the main server, and middleware added to the app.
  • Adds workspace-wide build support for tokio_unstable and introduces new OTel/Prometheus dependencies with lockfile updates and supporting docs.

Reviewed changes

Copilot reviewed 21 out of 23 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
README.md Documents /metrics endpoint and health probes; notes tokio_unstable build flag.
docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md Design spec for the golden-signals OTel/Prometheus approach.
docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md Step-by-step implementation plan and test strategy.
crates/superposition/src/main.rs Initializes observability, registers saturation observers, spawns metrics server, adds middleware + health scope, and try_join!s servers.
  crates/superposition/src/app_state.rs Extends auth exclusion list to always include `/healthz`, `/livez`, `/readyz`.
crates/superposition/Cargo.toml Adds direct deps needed for new wiring (futures-util, opentelemetry).
crates/service_utils/src/lib.rs Exposes new observability module and adds placeholder crate uses to satisfy unused_crate_dependencies.
crates/service_utils/Cargo.toml Enables new observability dependencies (opentelemetry*, prometheus, tokio-metrics, humantime, etc.).
crates/service_utils/src/observability.rs Observability public API + OTel MeterProvider init (Prometheus exporter + optional OTLP).
crates/service_utils/src/observability/config.rs Env-driven configuration parsing + tests.
crates/service_utils/src/observability/meters.rs Defines HTTP metric instruments (histogram/counter/updowncounter).
crates/service_utils/src/observability/middleware.rs Actix middleware that records per-request HTTP metrics with bounded route cardinality.
crates/service_utils/src/observability/metrics_server.rs Dedicated Actix server exposing /metrics from the Prometheus registry.
crates/service_utils/src/observability/health.rs Adds /healthz, /livez, /readyz handlers and exposes path list for auth bypass.
crates/service_utils/src/observability/saturation.rs Wires saturation collectors (DB, Redis, Tokio runtime).
crates/service_utils/src/observability/saturation/db_pool.rs r2d2 pool saturation observable gauges.
crates/service_utils/src/observability/saturation/redis_pool.rs fred-based Redis “commands in flight” gauge and tolerant stats abstraction.
crates/service_utils/src/observability/saturation/tokio_runtime.rs cfg(tokio_unstable) Tokio runtime saturation gauges via tokio-metrics.
crates/service_utils/tests/observability_integration.rs End-to-end scrape assertions + cardinality regression test.
Cargo.toml Adds workspace deps for OTel/Prometheus; adds unexpected_cfgs allowance for tokio_unstable.
Cargo.lock Locks new dependency graph (notably OTel + Prometheus + tokio-metrics).
.cargo/config.toml Enables --cfg tokio_unstable workspace-wide.
.gitignore Ensures .cargo/config.toml remains tracked despite .cargo ignore rule.


```rust
        .with_resource(Resource::new(resource_attrs));

    if let Some(endpoint) = &cfg.otlp_endpoint {
        builder = with_otlp_reader(builder, endpoint, cfg.collect_interval)?;
```

Comment on lines +123 to +137

```rust
// Protocol: this binary is compiled with the `http-proto` feature only.
// `OTEL_EXPORTER_OTLP_PROTOCOL=grpc` is NOT supported in v1; use the
// default `http/protobuf` transport.
//
// Headers: the opentelemetry-otlp 0.27 HTTP exporter reads
// `OTEL_EXPORTER_OTLP_HEADERS` (and `OTEL_EXPORTER_OTLP_METRICS_HEADERS`)
// automatically during `build()` — no explicit wiring needed here.
use opentelemetry_otlp::{MetricExporter, WithExportConfig};
use opentelemetry_sdk::metrics::PeriodicReader;
use opentelemetry_sdk::runtime;

let exporter = MetricExporter::builder()
    .with_http()
    .with_endpoint(endpoint.to_owned())
    .build()
```
Comment on lines +94 to +114

```rust
fn with_env<F: FnOnce()>(vars: &[(&str, Option<&str>)], f: F) {
    use std::sync::Mutex;
    static LOCK: Mutex<()> = Mutex::new(());
    let _guard = LOCK.lock().unwrap();
    let prev: Vec<_> =
        vars.iter().map(|(k, _)| (k.to_string(), std::env::var(k).ok())).collect();
    for (k, v) in vars {
        match v {
            // SAFETY: single-threaded test execution guaranteed by the
            // Mutex above and --test-threads=1 at the call site.
            Some(v) => unsafe { std::env::set_var(k, v) },
            None => unsafe { std::env::remove_var(k) },
        }
    }
    f();
    for (k, v) in prev {
        match v {
            Some(v) => unsafe { std::env::set_var(&k, &v) },
            None => unsafe { std::env::remove_var(&k) },
        }
    }
}
```
Comment thread crates/service_utils/src/lib.rs Outdated
Comment on lines +2 to +9

```rust
// Re-exported so T4+ modules can use them without triggering
// the `unused_crate_dependencies` lint while no source yet imports them.
use humantime as _;
use opentelemetry as _;
use opentelemetry_otlp as _;
use opentelemetry_prometheus as _;
use opentelemetry_sdk as _;
use opentelemetry_semantic_conventions as _;
```
Comment on lines +89 to +98

```rust
// Rather than killing the binary we log a warning and serve traffic without metrics.
let observability = if obs_cfg.enabled {
    match Observability::init(obs_cfg.clone()) {
        Ok(o) => Some(o),
        Err(e) => {
            tracing::warn!(
                error = %e,
                "observability init failed; metrics disabled for this instance"
            );
            None
```
@Datron (Collaborator) left a comment

Overall review

  1. Some crates are imported but not used and can be removed:
     opentelemetry-otlp = { version = "0.27", default-features = false, features = ["metrics", "http-proto", "reqwest-client"] }
     opentelemetry-semantic-conventions = { version = "0.27" }
  2. We could use https://crates.io/crates/opentelemetry-instrumentation-actix-web, which comes from open-telemetry, to directly monitor requests, latency, and errors.
  3. Simplify the observability module; there are too many files with just 30 lines. Could we group everything into fewer files and use internal modules if we want namespacing?
  4. Is it possible to add some CPU and memory monitoring on the box with the sysinfo crate, or will we leave that to a deployment runtime like ECS?

Comment on lines +46 to +54

```rust
let max_pool_name = KeyValue::new("pool.name", pool_name);
meter
    .u64_observable_gauge("db.client.connections.max")
    .with_description("Configured maximum size of the DB connection pool.")
    .with_callback(move |observer| {
        observer.observe(pool_for_max.max_size() as u64, &[max_pool_name.clone()]);
    })
    .build();
}
```

Should we just read the env var DB_MAX_CONNECTION_POOL_SIZE and publish that? Getting the size of the pool array may not be accurate for larger pools.

Comment on lines +56 to +62
#[cfg(test)]
mod tests {
// Constructing a real r2d2 pool requires a database. We assert the function
// signature compiles and that calling `register` does not panic with a
// synthetic in-memory pool; this is exercised via the integration test in
// Task 19 instead.
}

Should we just remove this?

Comment on lines +68 to +74

```rust
fn used_connections(&self) -> Option<u64> {
    // TODO(observability): fred 9.2.1 does not expose active/used
    // connection counts. The internal `Counters.in_flight` field tracks
    // per-connection in-flight commands, but it is private to
    // `fred::protocol::connection`. Return None until exposed publicly.
    None
}
```

You can get the number of used clients via

```rust
let connected_count = pool_clone.clients().iter().filter(|c| c.is_connected()).count();
```

Comment on lines +60 to +66

```rust
fn idle_connections(&self) -> Option<u64> {
    // TODO(observability): fred 9.2.1 does not expose idle-connection
    // counts via any public API. The pool's `RedisPoolInner.clients` are
    // all async and there is no `is_idle()` method on `RedisClient`.
    // Return None until fred provides such an API.
    None
}
```

We can just remove this

```rust
//! in atomics that observable-gauge callbacks read.

#[cfg(not(tokio_unstable))]
#[allow(dead_code)] // called from saturation.rs only under cfg(tokio_unstable)
```

Do we need this if cfg is at the top?


```rust
#[cfg(not(tokio_unstable))]
#[allow(dead_code)] // called from saturation.rs only under cfg(tokio_unstable)
pub fn spawn(_meter: &opentelemetry::metrics::Meter, _interval: std::time::Duration) {}
```

dummy function

Comment on lines +12 to +24

```rust
#[cfg(tokio_unstable)]
mod inner {
    use std::sync::Arc;
    use std::sync::atomic::{AtomicU64, Ordering};
    use std::time::Duration;

    use opentelemetry::metrics::Meter;
    use tokio_metrics::RuntimeMonitor;

    #[derive(Default)]
    struct Snapshot {
        workers: AtomicU64,
        global_queue_depth: AtomicU64,
```

This can be simplified a bit, just a function that reads metrics

```rust
// 1. Global Queue Depth: Tasks waiting to be picked up by an Actix worker
let h_clone = handle.clone();
meter.u64_observable_gauge("tokio.runtime.global_queue_depth")
    .with_description("Tasks stuck in the global queue waiting for a worker thread")
    .with_callback(move |observer| {
        let metrics = h_clone.metrics();
        observer.observe(metrics.global_queue_depth() as u64, &[]);
    })
    .init();

// 2. Active Tasks: Total tasks currently alive in the runtime
let h_clone2 = handle.clone();
meter.u64_observable_gauge("tokio.runtime.active_tasks")
    .with_description("Total number of alive tasks (running or idle)")
    .with_callback(move |observer| {
        let metrics = h_clone2.metrics();
        observer.observe(metrics.active_tasks_count() as u64, &[]);
    })
    .init();

// 3. Worker Count: How many OS threads Tokio is using for workers
let h_clone3 = handle.clone();
meter.u64_observable_gauge("tokio.runtime.workers_count")
    .with_description("Number of worker threads")
    .with_callback(move |observer| {
        let metrics = h_clone3.metrics();
        observer.observe(metrics.num_workers() as u64, &[]);
    })
    .init();
}
```

```rust
#[cfg(test)]
mod tests {
```

Don't know if we need tests

```rust
use std::{net::IpAddr, str::FromStr, time::Duration};

#[derive(Debug, Clone)]
pub struct ObservabilityConfig {
```

Should this be moved to app state?

knutties and others added 22 commits May 11, 2026 18:58
Design for adding an Actix middleware that emits Google SRE golden
signals (latency, traffic, errors, saturation) via OpenTelemetry, with
Prometheus scrape on a separate metrics port and optional OTLP push.
Reference TSDB is VictoriaMetrics; design is TSDB-agnostic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
22-task implementation plan derived from the 2026-05-10 spec, broken
down into TDD-style steps with full code and exact commands.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds opentelemetry, opentelemetry_sdk, opentelemetry-prometheus,
opentelemetry-otlp, opentelemetry-semantic-conventions, prometheus,
tokio-metrics, and humantime as workspace dependencies. Enabled in
service_utils in a follow-up commit.
Required by tokio-metrics for runtime instrumentation introduced in
the upcoming observability subsystem. tokio_unstable only adds APIs;
no behavioural change for existing code.

Also un-ignores .cargo/config.toml in .gitignore so the workspace
build config can be tracked; the rest of .cargo/ remains ignored.
Wires the eight new workspace deps (opentelemetry, opentelemetry_sdk,
opentelemetry-otlp, opentelemetry-prometheus,
opentelemetry-semantic-conventions, prometheus, tokio-metrics, humantime)
into crates/service_utils/Cargo.toml.

The crate has #![deny(unused_crate_dependencies)], so temporary
`use X as _;` stubs are added to lib.rs to satisfy the lint until T4+
modules consume the crates directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the empty module structure that subsequent commits flesh out:
- observability.rs: public surface, Observability handle, errors
- observability/{config,meters,middleware,metrics_server,health,saturation}.rs: stubs

No behaviour change. Observability::init() body is unimplemented!()
until Task 7.
Reads SUPERPOSITION_METRICS_* and OTEL_* env vars into a typed
ObservabilityConfig. Defaults: enabled, port 9091, bind 0.0.0.0,
both org/workspace labels on, 10s collect interval.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Per OTel semconv: collapse unknown methods to _OTHER to bound the
cardinality of the http.request.method label.
Builds an SdkMeterProvider wired to an opentelemetry-prometheus
exporter that writes into a per-process prometheus::Registry. OTLP
push exporter is plumbed but only activates when
OTEL_EXPORTER_OTLP_ENDPOINT is set.

Also adds Shutdown(String) error variant and updates shutdown() to use
it; defines HttpMeters struct with request_duration histogram,
busy_duration counter, and active_requests up-down-counter.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drop-based decrement ensures the gauge stays balanced even when a
handler panics or the future is cancelled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps every request with timing + active_requests gauge + busy_duration
counter. Uses match_pattern() to template routes, OrgWorkspaceMiddleware
extensions for tenant labels, and an InFlightGuard for panic safety.

Hoists all middleware.rs use statements to the file top (cleanup carried
forward from T8-T10 reviews).
T14: ObservableGauge callbacks read r2d2::Pool::state() at scrape time.
Emits db.client.connections.usage{state} and db.client.connections.max.

T15: Redis pool gauges via fred metrics with a tolerant RedisStats
trait. FredPoolStats wraps RedisPool and exposes commands_in_flight via
MetricsInterface (summed across pool clients). idle/used connection
counts are not exposed by fred 9.2.1's non-blocking pool and return
None; the corresponding observations are skipped rather than emitted as
zero, so no dead series.

Also adds workspace check-cfg lint config for cfg(tokio_unstable) (new
warning surfaced once cfg-gated code appeared in saturation.rs) and
exports RedisHandle/RedisStats/FredPoolStats from saturation and
observability modules so T18 can populate SaturationDeps.

Note: T14 and T15 are bundled here due to a reflog mishap during review
that lost T15's separate commit; the contents of both tasks are
verified present and reviewed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Background sampler updates atomic snapshots; observable gauges read
from the snapshot. Gated on cfg(tokio_unstable); compiles to a no-op
when the flag is disabled.
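The sampler/snapshot split can be sketched std-only. The real implementation reads tokio's `RuntimeMetrics` under `cfg(tokio_unstable)`; the struct and field names below are hypothetical, and the ratio uses the actual elapsed window per the later review fix (not the configured interval).

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Illustrative snapshot: the background sampler writes, scrape-time
// gauge callbacks only read. Field name is hypothetical.
#[derive(Default)]
struct RuntimeSnapshot {
    busy_ratio_milli: AtomicU64,
}

// Per-worker busy ratio over the *actual* elapsed window, scaled to
// milli-units so it fits in an atomic integer.
fn busy_ratio_milli(busy: Duration, elapsed: Duration, workers: usize) -> u64 {
    let total = elapsed.as_secs_f64() * (workers as f64).max(1.0);
    if total <= 0.0 {
        return 0;
    }
    ((busy.as_secs_f64() / total) * 1000.0).round() as u64
}

fn main() {
    let snap = RuntimeSnapshot::default();
    // Sampler side: 4 workers, 2s of summed busy time over a 1s window.
    let v = busy_ratio_milli(Duration::from_secs(2), Duration::from_secs(1), 4);
    snap.busy_ratio_milli.store(v, Ordering::Relaxed);
    // Scrape side: the observable gauge just loads the atomic.
    println!("busy_ratio_milli = {}", snap.busy_ratio_milli.load(Ordering::Relaxed));
}
```

Keeping the scrape path to a single atomic load means a slow or stalled sampler can never block the metrics endpoint.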
Adds the observability health paths to tenant_middleware_exclusion_list
so probes do not trigger auth flow. Operators no longer need to remember
to put these in TENANT_MIDDLEWARE_EXCLUSION_LIST.
- Init Observability early (Prometheus exporter + optional OTLP push)
- Spawn metrics server on SUPERPOSITION_METRICS_PORT
- Register DB/Redis/Tokio saturation observers
- Wrap App with MetricsMiddleware (gated by SUPERPOSITION_METRICS_ENABLED)
- Mount /healthz /livez /readyz on the main app (auth bypass via T17)
- try_join! both servers so the process exits if either dies

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wraps a small App with MetricsMiddleware, issues requests of various
shapes (200, 201, 500, 404), and asserts on the parsed Prometheus
exposition: per-route counts, the 5xx series, the __not_found__
sentinel, busy_duration > 0, and active_requests returning to 0.
Asserts that a 3-route × 1-method × 1-status scenario produces no more
than 200 series, catching accidental high-cardinality labels in review.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
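The 200-series ceiling is generous; a back-of-envelope count for the histogram alone (assuming the Prometheus text format's usual expansion: one line per explicit bucket, one `+Inf` bucket, plus `_sum` and `_count` per label combination) shows why:

```rust
// Series emitted by one histogram metric in the Prometheus text format
// for a given number of distinct label combinations.
fn histogram_series(label_combos: usize, explicit_buckets: usize) -> usize {
    label_combos * (explicit_buckets + 1 + 2) // buckets + {+Inf} + _sum + _count
}

fn main() {
    // 3 routes x 1 method x 1 status, 9 explicit buckets.
    let combos = 3 * 1 * 1;
    println!("{}", histogram_series(combos, 9));
}
```

Three label combinations with 9 explicit buckets yield 36 histogram series; even with the counter and gauge series added, the scenario sits well under 200 unless an unexpected high-cardinality label sneaks in.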
knutties and others added 4 commits May 11, 2026 18:59
`Observability::init` errors (e.g. OTLP endpoint unreachable) no longer
panic the binary at startup.  Instead they emit a `tracing::warn!` and
leave observability as `None`, so the service continues to handle traffic
without metrics.  `obs_enabled` is now derived from `observability.is_some()`
after the init attempt, ensuring the `Condition` gate mirrors the actual
init outcome.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… production wiring (review fixes 2 + 6)

Adds two cardinality-control improvements to http.route tracking:

- New `ROUTE_STATIC` sentinel ("__static__") collapses static-asset patterns
  (/pkg/*, /assets/*, /favicon*) so they don't inflate the cardinality budget.
- `is_static_pattern(pattern)` is the shared helper used by both
  `extract_route` (ServiceRequest phase) and the new
  `extract_route_from_response` (ServiceResponse phase).
- Production middleware now calls `extract_route_from_response` instead of
  inlining `match_pattern().unwrap_or(ROUTE_NOT_FOUND)`, so static-prefix
  collapsing is active for all real traffic.
- Unit test `extract_route_helper_handles_static_paths` verifies the helper.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
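The route-collapsing helpers above can be sketched std-only; the prefix list mirrors the PR description (`/pkg/*`, `/assets/*`, `/favicon*`), and `pattern` stands in for Actix's `match_pattern()` result:

```rust
const ROUTE_STATIC: &str = "__static__";
const ROUTE_NOT_FOUND: &str = "__not_found__";

// Shared predicate used by both the request-phase and response-phase
// extractors.
fn is_static_pattern(pattern: &str) -> bool {
    pattern.starts_with("/pkg/")
        || pattern.starts_with("/assets/")
        || pattern.starts_with("/favicon")
}

// Collapse static-asset patterns to one sentinel and unmatched
// requests to another, so http.route stays low-cardinality.
fn extract_route<'a>(pattern: Option<&'a str>) -> &'a str {
    match pattern {
        Some(p) if is_static_pattern(p) => ROUTE_STATIC,
        Some(p) => p,
        None => ROUTE_NOT_FOUND,
    }
}

fn main() {
    println!("{}", extract_route(Some("/pkg/app_bg.wasm")));
    println!("{}", extract_route(Some("/contexts/{id}")));
    println!("{}", extract_route(None));
}
```

Routing through one shared predicate is what keeps the request-phase and response-phase extractors from drifting apart.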
… var handling (review fixes 4 + 5)

Fix 5 — OTEL_RESOURCE_ATTRIBUTES merge (spec §8.5):
  Parse the standard OTel env var ("k1=v1,k2=v2,...") in `Observability::init`
  and append each key=value pair to resource_attrs, enabling operators to inject
  arbitrary resource dimensions without code changes.

Fix 4 — OTLP env var documentation:
  The opentelemetry-otlp 0.27 HTTP exporter already reads
  OTEL_EXPORTER_OTLP_HEADERS / OTEL_EXPORTER_OTLP_METRICS_HEADERS automatically
  during build(), so no explicit wiring is needed.  gRPC protocol is not supported
  in v1 (requires grpc-tonic feature not in this workspace); document this clearly
  in `with_otlp_reader`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
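The Fix 5 parsing can be sketched std-only, folding in the percent-decoding the review below calls for. The PR can lean on the `urlencoding` crate instead; the hand-rolled decoder here is purely illustrative and only handles `%HH` with two hex digits:

```rust
// Minimal percent-decoder for illustration.
fn percent_decode(s: &str) -> String {
    let b = s.as_bytes();
    let mut out = Vec::with_capacity(b.len());
    let mut i = 0;
    while i < b.len() {
        if b[i] == b'%'
            && i + 2 < b.len()
            && b[i + 1].is_ascii_hexdigit()
            && b[i + 2].is_ascii_hexdigit()
        {
            out.push(u8::from_str_radix(&s[i + 1..i + 3], 16).unwrap());
            i += 3;
        } else {
            out.push(b[i]);
            i += 1;
        }
    }
    String::from_utf8_lossy(&out).into_owned()
}

// Parse "k1=v1,k2=v2,..." with percent-decoding of keys and values,
// skipping malformed pairs and empty keys.
fn parse_resource_attributes(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| pair.split_once('='))
        .map(|(k, v)| (percent_decode(k.trim()), percent_decode(v.trim())))
        .filter(|(k, _)| !k.is_empty())
        .collect()
}

fn main() {
    let attrs =
        parse_resource_attributes("deployment.environment=prod%20us-east-1,team=platform");
    for (k, v) in &attrs {
        println!("{k} = {v}");
    }
}
```

Decoding both keys and values matches the OTel requirement that `,` and `=` be percent-encoded in this variable.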
Spec doc reference removed for now; the env-var table and inline notes
in this section are sufficient as the operator-facing surface.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@knutties knutties force-pushed the feat/otel-golden-signals-middleware branch from 7bf6bad to 1bca586 Compare May 11, 2026 13:37
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (8)
crates/service_utils/src/observability/meters.rs (1)

34-37: 💤 Low value

Add unit specification to http.server.active_requests for semconv compliance and consistency.

The other two metrics in HttpMeters (request_duration and busy_duration) both specify units via .with_unit(). The OpenTelemetry semantic conventions specify {request} as the unit for http.server.active_requests, so adding it keeps this metric consistent with the file's pattern and fully semconv-compliant.

♻️ Suggested change
         let active_requests = meter
             .i64_up_down_counter("http.server.active_requests")
+            .with_unit("{request}")
             .with_description("Number of HTTP server requests currently in flight.")
             .build();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/service_utils/src/observability/meters.rs` around lines 34 - 37, The
metric "http.server.active_requests" (built via meter.i64_up_down_counter in
HttpMeters as the active_requests variable) is missing a unit; update the
builder chain for active_requests to include .with_unit("{request}") before
.build() so it matches the pattern used by request_duration and busy_duration
and complies with the OpenTelemetry semconv for this metric.
crates/service_utils/src/observability/middleware.rs (3)

215-241: 💤 Low value

Minor: unnecessary borrow of &res when res is already a reference.

In the Ok(res) arm, res: &ServiceResponse<B> because the outer match is on &result. extract_route_from_response(&res) passes &&ServiceResponse<B>, which compiles via deref coercion but is unidiomatic.

Small style nit
-                    let route = extract_route_from_response(&res);
+                    let route = extract_route_from_response(res);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/service_utils/src/observability/middleware.rs` around lines 215 - 241,
In the Ok(res) match arm you’re passing a double reference to
extract_route_from_response by calling extract_route_from_response(&res) even
though res is already a &ServiceResponse<B>; change the call to
extract_route_from_response(res) (and any similar places) to avoid the
unnecessary borrow and keep types idiomatic (locate this in the Ok(res) branch
where extract_route_from_response, build_attributes,
meters.request_duration.record, and meters.busy_duration.add are used).

312-327: 💤 Low value

Test name promises more than it asserts.

matched_route_returns_pattern only asserts a 200 status — it does not exercise extract_route / extract_route_from_response. The inline comment acknowledges this, but the test name will mislead future readers; consider renaming to matched_route_setup_smoke or wiring an actual middleware that records and reads back the captured pattern so the integration test at observability_integration.rs isn't the only behavioral coverage for the response-phase extraction.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/service_utils/src/observability/middleware.rs` around lines 312 - 327,
Rename or update the test matched_route_returns_pattern so its name matches
behavior or it actually asserts route extraction: either rename to
matched_route_setup_smoke (to reflect it only asserts StatusCode::OK) or modify
the test to install the middleware that calls extract_route /
extract_route_from_response and assert the captured pattern is "/contexts/{id}"
(use the test harness App::new().wrap(your_observability_middleware) and make
the middleware expose the matched pattern for assertion); reference the test
function matched_route_returns_pattern and the extraction helpers extract_route
and extract_route_from_response when changing the test so behavioral coverage
isn't only in observability_integration.rs.

242-262: ⚖️ Poor tradeoff

Error path conflates handler errors with real 404s under __not_found__.

When service.call(req) returns Err, the request has been moved and match_pattern() is unreachable, so the route attribute is hard-coded to ROUTE_NOT_FOUND. The acknowledged tradeoff is documented inline, but the metric consequence is that a 500 from a real route like /contexts/{id} ends up in the same time-series as a true 404, which makes per-route error-rate alerting unreliable for the error-as-Err path (handler errors converted by Actix into responses, vs. handlers that return Ok(response_with_500)).

Worth at least noting in the rollout dashboard docs, or — if practical — capturing the matched pattern in request extensions during routing (e.g., from an inner middleware closer to the resource) and reading it back here. Not a blocker.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/service_utils/src/observability/middleware.rs` around lines 242 - 262,
The error-path currently hardcodes ROUTE_NOT_FOUND because match_pattern() is
unavailable after service.call(req) moves the request; change this by having the
routing layer store the matched pattern into the request extensions (e.g., under
a key like "matched_pattern") when matching a route, and then in this middleware
(where service.call(req) and the Err branch live) try to read that extension and
use the stored pattern instead of ROUTE_NOT_FOUND when calling build_attributes;
update the Err block to fallback to ROUTE_NOT_FOUND only if the extension is
missing, and keep using service.call, match_pattern(), build_attributes, and
meters.* as the integration points.
crates/superposition/src/main.rs (1)

306-313: 💤 Low value

try_join! couples main API lifetime to metrics server lifetime.

If the metrics server task ever returns an error (e.g., the underlying listener gets closed, or the bound port is reclaimed), try_join! short-circuits and the main HTTP server is aborted mid-flight. The pattern is acceptable if you treat the metrics endpoint as a hard SLO, but it's worth being explicit — the PR objectives describe "bad OTLP config degrades to disabled with warnings", and this coupling is the opposite stance for the scrape endpoint.

Consider tokio::select! on the main server with a tokio::spawn-detached metrics task, or accept this and document it in the rollout notes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/superposition/src/main.rs` around lines 306 - 313, The current use of
futures_util::try_join! with metrics_server_handle couples the lifetime of
main_server to metrics errors; change to spawn the metrics task
(metrics_server_handle) with tokio::spawn so it runs detached and use
tokio::select! (or await main_server directly) to ensure the main_server (the
API HTTP server) is not aborted if the metrics task returns an error — instead
log/handle errors from the spawned metrics task (e.g., attach a task that
.awaits and logs its error) while allowing main_server.await to continue
independently.
crates/service_utils/tests/observability_integration.rs (1)

85-91: 💤 Low value

Value parsing assumes no per-sample timestamps.

rsplit_once(' ').unwrap().1 takes the last whitespace-separated token. The Prometheus exposition format permits an optional trailing timestamp (metric{…} value [timestamp]); if the exporter ever emits timestamps, this parses the timestamp as the value and the assertions silently mis-fire. The current opentelemetry-prometheus 0.x exporter does not emit timestamps, so the test passes today — flagging as a brittle assumption to revisit if the exporter is upgraded.

A more robust parse splits on whitespace and asserts exactly two tokens, or extracts the second token of a known-shape split:

-let busy_value: f64 = busy.rsplit_once(' ').unwrap().1.trim().parse().unwrap();
+let mut parts = busy.split_whitespace();
+let _series = parts.next().unwrap();
+let busy_value: f64 = parts.next().unwrap().parse().unwrap();

Also applies to: 118-118, 127-127
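A timestamp-tolerant variant of the suggested parse could look like the sketch below (assumptions: splitting after the last `}` to sidestep spaces inside label values, and rejecting anything beyond one optional integer timestamp so format drift fails loudly):

```rust
// Parse the value from a Prometheus text-exposition sample line:
//   metric{labels} value [timestamp_ms]
fn parse_sample_value(line: &str) -> Option<f64> {
    // The value/timestamp tail contains no braces, so the last '}'
    // closes the label set even if label values contain spaces.
    let rest = match line.rfind('}') {
        Some(i) => &line[i + 1..],
        None => line.split_once(' ')?.1,
    };
    let mut tokens = rest.split_whitespace();
    let value: f64 = tokens.next()?.parse().ok()?;
    match tokens.next() {
        None => Some(value),
        Some(ts) if ts.parse::<i64>().is_ok() && tokens.next().is_none() => Some(value),
        _ => None, // unexpected trailing tokens: fail loudly
    }
}

fn main() {
    let with_ts = r#"http_server_busy_duration_seconds_total{http_route="/ping"} 0.42 1767000000000"#;
    println!("{:?}", parse_sample_value(with_ts));
    println!("{:?}", parse_sample_value("up 1"));
}
```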

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/service_utils/tests/observability_integration.rs` around lines 85 -
91, The current parsing of ping_count_line uses rsplit_once(' ').unwrap().1
which will pick up an optional Prometheus timestamp as the value; change parsing
to split_whitespace (or split_ascii_whitespace), collect tokens, assert
tokens.len() >= 2 and use tokens[ tokens.len() - 2 ] (the value) or preferably
assert exactly 2 tokens when you expect no timestamps, then parse that token
into ping_count; apply the same change to the other occurrences mentioned (the
lines around ping_count_line and the similar parses at the other two locations)
so tests fail loudly if format differs instead of silently parsing a timestamp
as the value.
crates/service_utils/src/observability/health.rs (1)

33-36: ⚡ Quick win

readyz currently signals ready even when dependencies are down.

The placeholder is acknowledged with a comment, so this is informational. In Kubernetes, readyz returning 200 while DB/Redis are unhealthy means traffic continues to be routed to a pod that cannot serve requests — defeating the purpose of a readiness probe. Track this as a follow-up to wire DB pool ping / Redis ping into readyz before relying on it for traffic gating.

Want me to open a follow-up issue tracking the readyz dependency checks (DB pool get w/ short timeout, Redis ping)?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/service_utils/src/observability/health.rs` around lines 33 - 36, The
readyz handler (function readyz) currently always returns OK and must be updated
to probe critical dependencies before signaling readiness: add checks for the DB
connection pool (attempt a quick get or simple ping with a short timeout) and
Redis (PING) and only return HttpResponse::Ok when those succeed; on any failure
return an appropriate non-200 response (e.g., 503) and log the underlying error
so Kubernetes stops routing traffic to unhealthy pods. Use the existing livez
behavior as a template if present and ensure checks are fast and non-blocking
(async) to avoid delaying readiness responses.
crates/service_utils/src/observability.rs (1)

147-155: 💤 Low value

#[cfg(test)] stub silently drops OTLP wiring.

Test builds ignore _endpoint entirely, so any unit test that sets otlp_endpoint = Some(...) exercises the prometheus-only path. That's intentional per the comment, but it means OTLP-specific behavior (endpoint resolution, header merging, period validation) has zero unit coverage. Either acknowledge in the comment that OTLP coverage relies on integration/manual tests, or add a token assertion (e.g., that the function is reachable) so future readers don't expect symmetric behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/service_utils/src/observability.rs` around lines 147 - 155, The
#[cfg(test)] stub for with_otlp_reader silently ignores the _endpoint and thus
gives no unit coverage for OTLP-specific behavior (endpoint resolution, header
merging, period validation); either update the comment above with_otlp_reader to
explicitly state that OTLP behavior is covered only by integration/manual tests
(and mention otlp_endpoint/ObservabilityError), or make the test-only stub
assert it's been reached (e.g., a token/debug assertion referencing the
_endpoint) so future readers know the function is intentionally non-functional
in unit tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/service_utils/src/observability.rs`:
- Around line 85-96: The OTEL_RESOURCE_ATTRIBUTES parsing loop does not
percent-decode keys/values (violating the OTel/W3C baggage format); update the
loop that iterates extra.split(',') and uses pair.split_once('=') so that both k
and v are percent-decoded (use the existing urlencoding crate) before trimming
and pushing into resource_attrs via KeyValue::new; ensure decoding errors are
handled (skip or log the offending pair) and decode both the key and value
consistently.

In `@crates/service_utils/src/observability/saturation/tokio_runtime.rs`:
- Around line 50-58: Replace the configured interval with the actual elapsed
duration when computing the per-worker busy ratio: use m.elapsed (not interval)
as the time base while preserving the per-worker semantics by multiplying by
(m.workers_count as f64). In other words, compute total =
m.elapsed.as_secs_f64() * (m.workers_count as f64).max(1.0) and keep storing the
scaled value into snap_for_task.busy_ratio_milli via
snap_for_task.busy_ratio_milli.store((ratio * 1000.0) as u64,
Ordering::Relaxed), so you avoid under-reporting when sleep returns late but
retain the per-worker metric semantics rather than switching to the aggregate
m.busy_ratio().

In `@crates/superposition/src/main.rs`:
- Around line 192-199: The metrics bind construction currently formats a string
and parses to SocketAddr which fails for IPv6 (e.g., "::1:9091"); change the
code in the block that sets metrics_server_handle (references: observability,
obs_cfg, metrics_server_handle, spawn_metrics_server) to construct the address
with SocketAddr::new(std::net::IpAddr::from_str(&obs_cfg.bind)? or obs_cfg.bind
parsed to IpAddr, and obs_cfg.port as u16) instead of string parsing, and if
spawn_metrics_server returns an Err treat it like Observability::init failures
(log a warning and set metrics_server_handle to None) so a bind error doesn’t
panic the process. Ensure the fix parses obs_cfg.bind to IpAddr (handling
parsing errors gracefully) and uses SocketAddr::new to support IPv6 properly.
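The shape of that fix can be sketched std-only (`metrics_bind_addr` is a hypothetical helper name; the config field names come from the comment):

```rust
use std::net::{IpAddr, SocketAddr};

// Build the metrics bind address from separate host + port config
// instead of string-formatting "host:port", which breaks for IPv6.
fn metrics_bind_addr(bind: &str, port: u16) -> Option<SocketAddr> {
    let ip: IpAddr = bind.parse().ok()?;
    Some(SocketAddr::new(ip, port))
}

fn main() {
    // Naive string formatting fails for bare IPv6 hosts:
    assert!("::1:9091".parse::<SocketAddr>().is_err());
    // SocketAddr::new handles it, and Display adds the brackets.
    let addr = metrics_bind_addr("::1", 9091).unwrap();
    println!("{addr}");
}
```

Returning `Option` (or a `Result`) lets the caller degrade to "metrics disabled, warn" on a bad bind string, mirroring how `Observability::init` failures are handled.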

In `@docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md`:
- Line 18: Replace the machine-specific absolute path string
"/Users/natarajankannan/src/superposition/" on line containing "All paths in
this plan are absolute under the repo root" with a repo-root-relative notation
(e.g., "repo-root-relative paths" or a placeholder like "<repo-root>/...") so
the document uses portable, non-machine-specific paths; update the sentence in
the same paragraph to instruct contributors to use repo-root-relative or
environment-variable-based placeholders instead of local absolute filesystem
paths.

In `@docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md`:
- Line 69: Two fenced code blocks in the doc (the block showing "[request on
:8080] ..." and the block listing "crates/service_utils/src/...") are unlabeled
and trigger markdownlint MD040; update each opening triple-backtick to include
the language identifier text (i.e., change ``` to ```text) so both code fences
become labeled, leaving their closing backticks unchanged.

In `@README.md`:
- Around line 157-159: The markdown code fence containing the curl example is
missing a language identifier, triggering markdownlint MD040; update the fenced
block that wraps the curl invocation (the triple-backtick block around "curl
http://localhost:9091/metrics") to include the bash language identifier
(```bash) so the snippet is syntax-highlighted and the lint rule passes.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 692299ed-71fc-42c3-9a01-9fbf6addfee3

📥 Commits

Reviewing files that changed from the base of the PR and between 83dac07 and 1bca586.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (22)
  • .cargo/config.toml
  • .gitignore
  • Cargo.toml
  • README.md
  • crates/service_utils/Cargo.toml
  • crates/service_utils/src/lib.rs
  • crates/service_utils/src/observability.rs
  • crates/service_utils/src/observability/config.rs
  • crates/service_utils/src/observability/health.rs
  • crates/service_utils/src/observability/meters.rs
  • crates/service_utils/src/observability/metrics_server.rs
  • crates/service_utils/src/observability/middleware.rs
  • crates/service_utils/src/observability/saturation.rs
  • crates/service_utils/src/observability/saturation/db_pool.rs
  • crates/service_utils/src/observability/saturation/redis_pool.rs
  • crates/service_utils/src/observability/saturation/tokio_runtime.rs
  • crates/service_utils/tests/observability_integration.rs
  • crates/superposition/Cargo.toml
  • crates/superposition/src/app_state.rs
  • crates/superposition/src/main.rs
  • docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md
  • docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md

Comment on lines +85 to +96
// §8.5 — merge OTEL_RESOURCE_ATTRIBUTES ("k1=v1,k2=v2,...") if set.
if let Ok(extra) = std::env::var("OTEL_RESOURCE_ATTRIBUTES") {
for pair in extra.split(',') {
if let Some((k, v)) = pair.split_once('=') {
let k = k.trim().to_owned();
let v = v.trim().to_owned();
if !k.is_empty() {
resource_attrs.push(KeyValue::new(k, v));
}
}
}
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify percent-encoding crate availability and current OTEL_RESOURCE_ATTRIBUTES handling.
rg -n 'percent-encoding' --type=toml
fd -e toml -x rg -nH 'percent-encoding' {} \;

Repository: juspay/superposition

Length of output: 2517


🏁 Script executed:

cd crates/service_utils && cat Cargo.toml | grep -A 50 "^\[dependencies\]"

Repository: juspay/superposition

Length of output: 1677


🌐 Web query:

OTel environment variable spec OTEL_RESOURCE_ATTRIBUTES W3C Baggage percent-encoding encoding

💡 Result:

OTEL_RESOURCE_ATTRIBUTES encoding/percent-encoding rules (grounded in the current OpenTelemetry Resource SDK spec and the W3C Baggage spec):

1) How OTEL_RESOURCE_ATTRIBUTES is formatted
   - OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs (key1=value1,key2=value2). [1]
2) What percent-encoding is required for in OTEL_RESOURCE_ATTRIBUTES
   - The characters ',' and '=' in both keys and values MUST be percent-encoded. [1]
   - Other characters MAY be percent-encoded (e.g., values outside the ANSI character set). [1]
3) Relation to W3C Baggage
   - OpenTelemetry states OTEL_RESOURCE_ATTRIBUTES is represented using a format matching W3C Baggage, with the specific rules above for separating key/value pairs via comma and '=' and requiring percent-encoding of ',' and '='. [1]
   - In W3C Baggage, the baggage entry value is a UTF-8 string; any characters outside the baggage-octet range MUST be percent-encoded. [2]
   - W3C Baggage also specifies that the percent code point (U+0025, i.e. '%') MUST be percent-encoded, with percent-encoding defined per RFC 3986 Section 2.1. [2]
4) How percent-encoding is interpreted (decoding expectations)
   - W3C Baggage: when decoding, percent-encoded octet sequences that do not match the UTF-8 encoding scheme are replaced with U+FFFD. [2]
   - OpenTelemetry resource SDK: if an error occurs during decoding, the entire environment variable value SHOULD be discarded and an error SHOULD be reported. [1]
5) Practical implications / example
   - Because ',' and '=' MUST be percent-encoded in OTEL_RESOURCE_ATTRIBUTES, values containing commas or equals signs must be written with percent escapes so they are not misinterpreted as list separators or key/value separators.

Sources (most directly relevant): [1] OpenTelemetry Resource SDK spec, specification/resource/sdk.md (OTEL_RESOURCE_ATTRIBUTES format, required percent-encoding, error handling); [2] W3C Baggage spec (percent-encoding rules for baggage values, including baggage-octet range, UTF-8, and '%' encoding).


🏁 Script executed:

# Search for tests of OTEL_RESOURCE_ATTRIBUTES parsing
rg -n 'OTEL_RESOURCE_ATTRIBUTES' --type rust

# Check urlencoding crate capabilities
rg -n 'urlencoding' crates/service_utils/

Repository: juspay/superposition

Length of output: 424


🌐 Web query:

urlencoding rust crate decode percent-decoding function API

💡 Result:

In the Rust crate urlencoding, percent-decoding is done via these public APIs:

1) urlencoding::decode — decodes a percent-encoded string assuming UTF-8 and returns a Result (the docs show decode(...).expect("UTF-8")/unwrap usage). [1]
   Example: use urlencoding::decode; let decoded = decode("%F0%9F%91%BE%20Exterminate%21").expect("UTF-8"); // decoded == "👾 Exterminate!"
2) urlencoding::decode_binary — decodes percent-encoded input as raw bytes (binary data), allowing arbitrary bytes and invalid UTF-8; convert as needed (e.g., String::from_utf8_lossy). [1]
   Example: use urlencoding::decode_binary; let binary = decode_binary(b"%F1%F2%F3%C0%C1%C2"); let decoded = String::from_utf8_lossy(&binary);

A commonly used lower-level percent-decoder in the Rust ecosystem is the percent_encoding crate's percent_decode function, which takes bytes and decodes any "%" followed by two hex digits, returning an iterator-like type with decode_utf8/decode_utf8_lossy. Signature (from docs): pub fn percent_decode(input: &[u8]) -> PercentDecode<'_>. [2]

So, for percent-decoding: use urlencoding::decode (UTF-8 Result) or urlencoding::decode_binary (raw bytes) depending on whether invalid UTF-8 must be handled. [1]


🏁 Script executed:

# Check for existing tests of OTEL_RESOURCE_ATTRIBUTES
rg -A 10 -B 5 'OTEL_RESOURCE_ATTRIBUTES' crates/service_utils/src/observability.rs | head -40

# Check how urlencoding::decode is used elsewhere
rg -B 2 -A 3 'urlencoding::decode' crates/

Repository: juspay/superposition

Length of output: 749


🏁 Script executed:

# Look for tests in observability.rs
rg -n '#\[test\]|#\[cfg\(test\)\]' crates/service_utils/src/observability.rs

# Look for integration tests that might test observability init
fd -e rs 'test' crates/service_utils/src/ | head -10

# Check if there are any tests for observability module
fd observability crates/service_utils/

Repository: juspay/superposition

Length of output: 268


🏁 Script executed:

# Check the integration test file
cat crates/service_utils/tests/observability_integration.rs

# Also check lines around 147, 157, 176, 186 in observability.rs for test details
sed -n '145,190p' crates/service_utils/src/observability.rs

Repository: juspay/superposition

Length of output: 7630


OTEL_RESOURCE_ATTRIBUTES values are not URL-decoded, violating the OTel specification.

Per the OpenTelemetry resource SDK specification, OTEL_RESOURCE_ATTRIBUTES follows the W3C Baggage format and requires percent-decoding of both keys and values (per RFC 3986). The current implementation (lines 85–96) trims whitespace but does not percent-decode, so an operator setting e.g. OTEL_RESOURCE_ATTRIBUTES=deployment.environment=prod%20us-east-1 ends up with the literal string prod%20us-east-1 instead of prod us-east-1.

The urlencoding crate is already in the project's dependencies. A straightforward fix:

Suggested fix
                 if let Some((k, v)) = pair.split_once('=') {
                     let k = k.trim().to_owned();
-                    let v = v.trim().to_owned();
+                    let v = urlencoding::decode(v.trim())
+                        .unwrap_or_else(|_| v.trim().into())
+                        .into_owned();
                     if !k.is_empty() {
                         resource_attrs.push(KeyValue::new(k, v));
                     }
                 }

Also consider decoding keys the same way, as the OTel spec requires percent-encoding for both.
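To make the decoding behavior concrete, here is a minimal std-only sketch of percent-decoding applied to both keys and values of an `OTEL_RESOURCE_ATTRIBUTES`-style string. The hand-rolled `percent_decode` stands in for `urlencoding::decode` purely so the example runs without the crate; the function and variable names are illustrative, not the PR's actual code.

```rust
// Sketch only: a tiny percent-decoder standing in for `urlencoding::decode`.
fn percent_decode(s: &str) -> String {
    let bytes = s.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // A '%' followed by two hex digits decodes to one byte.
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let Ok(b) = u8::from_str_radix(&s[i + 1..i + 3], 16) {
                out.push(b);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}

/// Parse `k=v,k=v` pairs, percent-decoding both sides (per W3C Baggage).
fn parse_resource_attrs(raw: &str) -> Vec<(String, String)> {
    raw.split(',')
        .filter_map(|pair| pair.split_once('='))
        .map(|(k, v)| (percent_decode(k.trim()), percent_decode(v.trim())))
        .filter(|(k, _)| !k.is_empty())
        .collect()
}

fn main() {
    let attrs =
        parse_resource_attrs("deployment.environment=prod%20us-east-1,team=core");
    // The %20 is decoded, matching what an operator would expect.
    assert_eq!(attrs[0].1, "prod us-east-1");
    assert_eq!(attrs[1], ("team".to_string(), "core".to_string()));
    println!("{attrs:?}");
}
```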

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/service_utils/src/observability.rs` around lines 85 - 96, The
OTEL_RESOURCE_ATTRIBUTES parsing loop does not percent-decode keys/values
(violating the OTel/W3C baggage format); update the loop that iterates
extra.split(',') and uses pair.split_once('=') so that both k and v are
percent-decoded (use the existing urlencoding crate) before trimming and pushing
into resource_attrs via KeyValue::new; ensure decoding errors are handled (skip
or log the offending pair) and decode both the key and value consistently.

Comment on lines +50 to +58
// Busy fraction = total worker-seconds busy / total
// worker-seconds available over the sample interval.
let busy = m.total_busy_duration.as_secs_f64();
let total =
interval.as_secs_f64() * (m.workers_count as f64).max(1.0);
let ratio = (busy / total).clamp(0.0, 1.0);
snap_for_task
.busy_ratio_milli
.store((ratio * 1000.0) as u64, Ordering::Relaxed);

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In tokio-metrics 0.3, does RuntimeMetrics expose a `busy_ratio()` method and an `elapsed` field, and is `busy_ratio()` computed as total_busy_duration / (elapsed * workers_count)?

💡 Result:

Yes — the tokio-metrics 0.3 docs show that RuntimeMetrics has a public `elapsed: Duration` field and a `pub fn busy_ratio(&self) -> f64` method. However, the proposed formula is not what the docs describe: `busy_ratio` "Returns the ratio of RuntimeMetrics::total_busy_duration to the RuntimeMetrics::elapsed", i.e. `total_busy_duration / elapsed`, with no multiplication by `workers_count` in the denominator. Conclusion: RuntimeMetrics exposes both `busy_ratio` and `elapsed`, but `busy_ratio` is documented as `total_busy_duration / elapsed`, not `total_busy_duration / (elapsed * workers_count)`.


🏁 Script executed:

cat -n crates/service_utils/src/observability/saturation/tokio_runtime.rs | head -80

Repository: juspay/superposition

Length of output: 3858


🏁 Script executed:

cat -n crates/service_utils/src/observability/saturation/tokio_runtime.rs | sed -n '80,100p'

Repository: juspay/superposition

Length of output: 547


Use actual elapsed duration instead of configured interval when computing busy ratio, but preserve per-worker semantics.

The concern about using the configured interval is valid—under scheduler pressure, tokio::time::sleep(interval) may return late, causing the denominator to be inaccurate and under-reporting busy ratio when it matters most. However, the suggested diff has a hidden semantic issue:

tokio_metrics::RuntimeMetrics::busy_ratio() computes total_busy_duration / elapsed (aggregate ratio across all workers), whereas the current code computes total_busy_duration / (elapsed * workers_count) (per-worker average). The metric is named runtime.tokio.workers.busy_ratio, suggesting per-worker semantics should be preserved.

The correct fix: replace the configured interval with m.elapsed, but keep the workers_count multiplier:

-                    let busy = m.total_busy_duration.as_secs_f64();
-                    let total =
-                        interval.as_secs_f64() * (m.workers_count as f64).max(1.0);
+                    let busy = m.total_busy_duration.as_secs_f64();
+                    let total =
+                        m.elapsed.as_secs_f64() * (m.workers_count as f64).max(1.0);
                    let ratio = (busy / total).clamp(0.0, 1.0);

Alternatively, if you intentionally want to switch to aggregate ratio, use m.busy_ratio() directly but document why the metric semantics changed.
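The distinction between the two denominators can be sketched with made-up numbers; `total_busy`, `elapsed`, and `workers` stand in for the corresponding tokio-metrics fields, and the function name is illustrative:

```rust
use std::time::Duration;

/// Per-worker busy ratio: busy worker-seconds divided by available
/// worker-seconds (elapsed wall time × worker count), as in the fix above.
fn per_worker_busy_ratio(total_busy: Duration, elapsed: Duration, workers: usize) -> f64 {
    let total = elapsed.as_secs_f64() * (workers as f64).max(1.0);
    (total_busy.as_secs_f64() / total).clamp(0.0, 1.0)
}

fn main() {
    // 4 workers, a sampler tick that actually took 12 s (late under load),
    // 24 worker-seconds of busy time → each worker was ~50% busy.
    let ratio = per_worker_busy_ratio(Duration::from_secs(24), Duration::from_secs(12), 4);
    assert!((ratio - 0.5).abs() < 1e-9);

    // Using the *configured* 10 s interval instead would over-report: 24/(10*4) = 0.6,
    // and the aggregate tokio-metrics busy_ratio() would be 24/12 = 2.0 (clamped).
    println!("per-worker busy ratio = {ratio}");
}
```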

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/service_utils/src/observability/saturation/tokio_runtime.rs` around
lines 50 - 58, Replace the configured interval with the actual elapsed duration
when computing the per-worker busy ratio: use m.elapsed (not interval) as the
time base while preserving the per-worker semantics by multiplying by
(m.workers_count as f64). In other words, compute total =
m.elapsed.as_secs_f64() * (m.workers_count as f64).max(1.0) and keep storing the
scaled value into snap_for_task.busy_ratio_milli via
snap_for_task.busy_ratio_milli.store((ratio * 1000.0) as u64,
Ordering::Relaxed), so you avoid under-reporting when sleep returns late but
retain the per-worker metric semantics rather than switching to the aggregate
m.busy_ratio().

Comment on lines +192 to +199
let metrics_server_handle = if let Some(obs) = observability.as_ref() {
let bind: std::net::SocketAddr = format!("{}:{}", obs_cfg.bind, obs_cfg.port)
.parse()
.expect("invalid metrics bind addr");
Some(spawn_metrics_server(obs.registry(), bind)?)
} else {
None
};

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Metrics bind construction is broken for IPv6 addresses.

format!("{}:{}", obs_cfg.bind, obs_cfg.port).parse::<SocketAddr>() works for IPv4, but if an operator configures SUPERPOSITION_METRICS_BIND=:: or ::1, the resulting string ::1:9091 is not a valid SocketAddr representation — IPv6 socket strings require bracketed host ([::1]:9091). This will trigger the .expect("invalid metrics bind addr") panic at startup with no warning path, contradicting the graceful-degradation philosophy applied to Observability::init above.

Use SocketAddr::new to build the address directly from the typed IpAddr and u16 — no string round-trip, no IPv6 special-casing.

🛠️ Proposed fix
-    let metrics_server_handle = if let Some(obs) = observability.as_ref() {
-        let bind: std::net::SocketAddr = format!("{}:{}", obs_cfg.bind, obs_cfg.port)
-            .parse()
-            .expect("invalid metrics bind addr");
-        Some(spawn_metrics_server(obs.registry(), bind)?)
-    } else {
-        None
-    };
+    let metrics_server_handle = if let Some(obs) = observability.as_ref() {
+        let bind = std::net::SocketAddr::new(obs_cfg.bind, obs_cfg.port);
+        match spawn_metrics_server(obs.registry(), bind) {
+            Ok(h) => Some(h),
+            Err(e) => {
+                tracing::warn!(error = %e, %bind, "metrics server failed to start; continuing without scrape endpoint");
+                None
+            }
+        }
+    } else {
+        None
+    };

Two changes:

  1. SocketAddr::new(IpAddr, u16) — IPv4/IPv6 correct, no parsing.
  2. Treat metrics-server bind failure the same way Observability::init failure is treated (warn + disable), so a transient port conflict on :9091 doesn't take down the API.
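The IPv6 failure mode and the typed construction can be demonstrated with the standard library alone (the port number is the PR's default; everything else is illustrative):

```rust
use std::net::{IpAddr, SocketAddr};

fn main() {
    // The string round-trip is ambiguous for IPv6: "::1:9091" is NOT a
    // valid SocketAddr — IPv6 socket strings need brackets, "[::1]:9091".
    assert!("::1:9091".parse::<SocketAddr>().is_err());
    assert!("[::1]:9091".parse::<SocketAddr>().is_ok());

    // Building from typed parts sidesteps the problem for both families.
    let ip: IpAddr = "::1".parse().expect("valid IPv6 literal");
    let bind = SocketAddr::new(ip, 9091);
    assert_eq!(bind.to_string(), "[::1]:9091");

    let ip4: IpAddr = "0.0.0.0".parse().expect("valid IPv4 literal");
    assert_eq!(SocketAddr::new(ip4, 9091).to_string(), "0.0.0.0:9091");
}
```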
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/superposition/src/main.rs` around lines 192 - 199, The metrics bind
construction currently formats a string and parses to SocketAddr which fails for
IPv6 (e.g., "::1:9091"); change the code in the block that sets
metrics_server_handle (references: observability, obs_cfg,
metrics_server_handle, spawn_metrics_server) to construct the address with
SocketAddr::new(std::net::IpAddr::from_str(&obs_cfg.bind)? or obs_cfg.bind
parsed to IpAddr, and obs_cfg.port as u16) instead of string parsing, and if
spawn_metrics_server returns an Err treat it like Observability::init failures
(log a warning and set metrics_server_handle to None) so a bind error doesn’t
panic the process. Ensure the fix parses obs_cfg.bind to IpAddr (handling
parsing errors gracefully) and uses SocketAddr::new to support IPv6 properly.

- The OpenTelemetry Rust SDK has had API churn between minor versions. The exact import paths and builder method names below are written against `opentelemetry` 0.27. If you pin a different version in Task 1, expect to adjust 1–3 import paths or method names per call site. The plan uses the **stable** APIs only (no unstable/preview features).
- After Task 1 (deps), run `cargo check -p service_utils` after every code-touching task to catch wiring issues early — even on tasks that don't add tests yet.
- File commit boundary: each task ends with one commit. If a step within a task fails, fix and continue within the same task before committing.
- All paths in this plan are absolute under the repo root: `/Users/natarajankannan/src/superposition/`.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid machine-specific absolute paths in the plan.

Line 18 hardcodes a local filesystem path, which makes the instructions non-portable for other contributors and CI shells. Prefer repo-root-relative paths throughout.

Suggested doc edit
-- All paths in this plan are absolute under the repo root: `/Users/natarajankannan/src/superposition/`.
+- All paths in this plan are relative to the repository root.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-10-otel-golden-signals-middleware.md` at line
18, Replace the machine-specific absolute path string
"/Users/natarajankannan/src/superposition/" on line containing "All paths in
this plan are absolute under the repo root" with a repo-root-relative notation
(e.g., "repo-root-relative paths" or a placeholder like "<repo-root>/...") so
the document uses portable, non-machine-specific paths; update the sentence in
the same paragraph to instruct contributors to use repo-root-relative or
environment-variable-based placeholders instead of local absolute filesystem
paths.


### 5.1 Data flow

```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to fenced code blocks (MD040).

Lines 69 and 108 use unlabeled fenced blocks, which fails markdownlint. Add a language tag (for these blocks, text fits).

Suggested doc edit
-```
+```text
 [request on :8080]
 ...
-```
+```

-```
+```text
 crates/service_utils/src/
 ...
-```
+```

Also applies to: 108-108

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 69-69: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-10-otel-golden-signals-middleware-design.md`
at line 69, Two fenced code blocks in the doc (the block showing "[request on
:8080] ..." and the block listing "crates/service_utils/src/...") are unlabeled
and trigger markdownlint MD040; update each opening triple-backtick to include
the language identifier text (i.e., change ``` to ```text) so both code fences
become labeled, leaving their closing backticks unchanged.

Comment thread README.md Outdated
Comment on lines +157 to +159
```
curl http://localhost:9091/metrics
```

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add a language identifier to the code fence.

markdownlint (MD040) flags this fence. Use bash so the curl snippet gets syntax highlighting and the lint passes.

📝 Suggested change
-```
-curl http://localhost:9091/metrics
-```
+```bash
+curl http://localhost:9091/metrics
+```
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 157-157: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@README.md` around lines 157 - 159, The markdown code fence containing the
curl example is missing a language identifier, triggering markdownlint MD040;
update the fenced block that wraps the curl invocation (the triple-backtick
block around "curl http://localhost:9091/metrics") to include the bash language
identifier (```bash) so the snippet is syntax-highlighted and the lint rule
passes.

knutties and others added 14 commits May 11, 2026 19:20
… (items 1 & 3)

Item 1: Wrap OTLP reader init in a match so failure logs a warning and
continues with Prometheus-only metrics instead of propagating Err and
disabling the /metrics scrape endpoint entirely.

Item 3: Warn at startup when OTEL_EXPORTER_OTLP_PROTOCOL is set to a
value other than 'http/protobuf' (e.g. grpc), since only HTTP/proto is
supported in v1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The crate was only listed to satisfy the unused_crate_dependencies lint
via a use-as-_ in lib.rs; actual metric attribute keys are hardcoded
strings.  Remove it from both workspace and service_utils Cargo.toml,
and remove all the use-as-_ shims from lib.rs (the real crates are now
imported directly in the observability sub-modules).

Also restructure Observability::init to clone the Resource so the
Prom-only fallback path can reuse it after the builder is consumed by
with_otlp_reader.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Item 4: Drop the empty #[cfg(test)] mod tests {} placeholder in db_pool.rs;
real coverage lives in tests/observability_integration.rs.

Items 5+6: Remove the #[cfg(not(tokio_unstable))] spawn stub from
tokio_runtime.rs.  The call site in saturation.rs is itself gated on
#[cfg(tokio_unstable)], so no symbol resolution is needed when the flag
is off.  Add a scoped cfg(not(tokio_unstable)) use-as-_ in lib.rs to
keep the unused_crate_dependencies lint green on both build paths.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…item 7)

Replace the idle_connections/used_connections trait methods (which always
returned None because fred 9.2.1 has no such public API) with a single
connected_connections() method using ClientLike::is_connected() to count
active pool members.

Replace the redis.client.connections.usage gauge (which never emitted
data) with redis.client.connections.connected.  Keep the existing
redis.client.commands.in_flight gauge unchanged.

Remove the #[allow(dead_code)] on FredPoolStats since both getters are
now meaningful.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…from_env (item 8)

Introduce from_source<F>(get: F) that accepts a closure returning
Option<String> per key.  from_env() delegates to it using std::env::var.

Tests rewritten to use from_source with a HashMap-backed closure —
no unsafe set_var/remove_var, no Mutex, no --test-threads=1 requirement.
Three tests pass in parallel with zero process-global state mutation.

Also add cfg(test) use-as-_ for opentelemetry_otlp in lib.rs to keep
the unused_crate_dependencies lint clean when the not(test) OTLP reader
function is compiled out during test runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
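The injection pattern this commit describes can be sketched as follows — a config constructor that takes a closure for env lookup, so tests use a `HashMap` instead of mutating process-global environment variables. The struct, field, and default are illustrative stand-ins, not the PR's actual types:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct ObsConfig {
    port: u16,
}

impl ObsConfig {
    /// Generic constructor: `get` returns the raw value for a key, if set.
    fn from_source<F: Fn(&str) -> Option<String>>(get: F) -> Self {
        let port = get("SUPERPOSITION_METRICS_PORT")
            .and_then(|v| v.parse().ok())
            .unwrap_or(9091); // PR's documented default port
        ObsConfig { port }
    }

    /// Production path delegates to the same code using real env vars.
    fn from_env() -> Self {
        Self::from_source(|k| std::env::var(k).ok())
    }
}

fn main() {
    // Tests back the source with a map — no set_var/remove_var, no
    // serialization across test threads.
    let vars: HashMap<&str, &str> = [("SUPERPOSITION_METRICS_PORT", "9200")].into();
    let cfg = ObsConfig::from_source(|k| vars.get(k).map(|v| v.to_string()));
    assert_eq!(cfg.port, 9200);
    println!("{cfg:?}");
}
```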
`health_endpoints()` previously returned `web::scope("")` with the three
health routes inside it. An empty-prefix Actix scope matches every
request path and returns 404 for any path that does not match one of
the routes inside the scope — so `/admin/organisations` and every other
non-health URL was being swallowed by the health scope and returning
404 before reaching Leptos or the main API scope registered after it.

Replaces the helper with `configure_health_endpoints(&mut ServiceConfig)`
that registers the three routes directly at App root via `.configure`,
so unmatched paths fall through to subsequent services as expected.

Adds a regression test `does_not_shadow_other_routes` that wires both
the health endpoints and an unrelated `/other` route into the same App
and verifies both reach their handlers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tic-conventions

The Cargo.lock entry was missed in e98e978 when the dependency was
dropped from Cargo.toml. CI's uniffi bindings job regenerates the lock
during build and the resulting diff was failing the post-build
`git diff --exit-code` check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`make fmt` output on files modified in this branch. All changes are
purely stylistic (edition-2024 formatting per .rustfmt.toml):
import re-sorting in `use` blocks, trailing commas, and line wrapping
within max_width=90.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five clippy errors caught by `cargo clippy -- -Dwarnings` on rustc 1.90
(CI toolchain):

- middleware.rs:222: drop needless borrow on `res` already bound from
  `&result` destructuring (needless_borrow).
- middleware.rs:317: bare `HttpResponse::Ok()` returns the awaitable
  HttpResponseBuilder; calling `.finish()` produces an HttpResponse and
  clears async_yields_async in the test handler.
- db_pool.rs:44 + redis_pool.rs:{73,87}: use `std::slice::from_ref(&kv)`
  for the single-attribute slice instead of `&[kv.clone()]`
  (cloned_ref_to_slice_refs).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same fix as e9e4fea — five more bare `HttpResponse::*()` calls in
test handlers were returning the awaitable HttpResponseBuilder.
Appending `.finish()` materialises an HttpResponse and clears the
`clippy::async_yields_async` lint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…urce attrs, accurate busy_ratio

Three correctness fixes flagged by CodeRabbit:

- main.rs: build the metrics SocketAddr from the parsed IpAddr in config
  (SocketAddr::new) instead of via string formatting + parse, which is
  ambiguous for IPv6 addresses like "::1:9091" (needs brackets).
  Demote spawn_metrics_server errors from a process-killing `?` to a
  warn-and-continue, matching the OTLP "best-effort" stance.
- observability.rs: percent-decode keys and values from
  OTEL_RESOURCE_ATTRIBUTES per the W3C baggage format the OTel spec
  references. Without this, values containing %2C / %3D were stored
  verbatim instead of as "," / "=".
- tokio_runtime.rs: compute the busy_ratio denominator from
  RuntimeMetrics::elapsed rather than the configured sampler interval,
  so the ratio stays accurate when the sampler tick runs late under
  load.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bundle of low-risk cleanups from CodeRabbit review:

- README.md, design spec, plan: add bash/text language tags to fenced
  code blocks (markdownlint MD040).
- Plan doc: replace local absolute path with <repo-root>/ placeholder.
- meters.rs: add `.with_unit("{request}")` to active_requests for OTel
  semconv parity with the duration metrics.
- middleware.rs: rename `matched_route_returns_pattern` to
  `matched_route_setup_smoke` — the test only asserts status, the route
  extraction itself is covered in observability_integration.rs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously `try_join!(main_server, metrics_handle)` aborted the main
HTTP server if the metrics server task ever returned an error (port
reclaimed, listener closed, transient bind issue). That contradicts
the "metrics are best-effort" stance applied elsewhere (OTLP failures
log a warning, bind failure already demoted to warn-and-continue in
the previous commit).

Spawn the metrics server as a detached tokio task that logs on error
and let the main API server's lifetime drive process exit.

Drop the `futures-util` crate dependency from superposition since
`try_join!` was its only use site. Add explicit `tokio` dep (was
previously transitive via futures-util/actix-web; now direct).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…elper

`parse_bool` and the open-coded `IpAddr::from_str`/`u16::parse`/
`humantime::parse_duration` sites each named the env key twice — once
for the lookup and once for the error label. Six fields × two
mentions = twelve places where the key string had to stay in sync.

Replace with one generic `parse_or_default<T: FromStr>` helper that
takes the key once. `bool`/`IpAddr`/`u16`/`humantime::Duration` all
implement `FromStr` so they go through the same path; the helper
folds the lookup, the default, and the error label into one place.

Error message format changes uniformly to "<KEY>: <parse-error>"
(the only existing assertion just checks for the key name as a
substring, so it still holds).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
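The helper this commit describes can be sketched as below: one generic function names the env key once, folding lookup, default, and the `"<KEY>: <parse-error>"` label into one place. The exact signature in the PR may differ; the env-source closure and key names here are illustrative:

```rust
use std::fmt::Display;
use std::net::IpAddr;
use std::str::FromStr;

/// Look up `key`, parse it as `T`, fall back to `default` when unset,
/// and label parse errors with the key — which is named exactly once.
fn parse_or_default<T: FromStr>(
    get: impl Fn(&str) -> Option<String>,
    key: &str,
    default: T,
) -> Result<T, String>
where
    T::Err: Display,
{
    match get(key) {
        None => Ok(default),
        Some(raw) => raw.parse::<T>().map_err(|e| format!("{key}: {e}")),
    }
}

fn main() {
    // bool, IpAddr, u16 (and humantime's Duration) all implement FromStr,
    // so they share this one code path.
    let get = |k: &str| (k == "METRICS_BIND").then(|| "::1".to_string());
    let fallback: IpAddr = "0.0.0.0".parse().unwrap();
    let bind = parse_or_default(get, "METRICS_BIND", fallback).unwrap();
    assert_eq!(bind.to_string(), "::1");

    // A bad value yields the uniform "<KEY>: <parse-error>" message.
    let port: Result<u16, String> = parse_or_default(|_| Some("nope".into()), "METRICS_PORT", 9091);
    assert!(port.unwrap_err().starts_with("METRICS_PORT:"));
}
```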