feat(observability): add gateway OTLP traces and initial Kube monitoring surface #1270

Open: TaylorMutch wants to merge 4 commits into main from tmutch/otel-metrics-traces

@TaylorMutch
Collaborator

Summary

Adds opt-in OpenTelemetry trace export to the gateway and a Prometheus ServiceMonitor to the Helm chart. Both surfaces are independent from the existing /metrics endpoint and the OCSF sandbox log fan-out, default off, and configured via standard OTEL_* env vars or chart values.

Changes

Gateway (crates/openshell-server)

  • Pin OTel 0.29 / tracing-opentelemetry 0.30 (the latest set compatible with the workspace's tonic 0.12 + prost 0.13).
  • TracingLogBus::install_subscriber now optionally appends a tracing-opentelemetry layer when an OTLP endpoint is configured. The existing tower_http::trace::TraceLayer per-request span automatically becomes the OTLP root — no #[instrument] rewrites required.
  • New OtlpTracingConfig::resolve honors the precedence OTEL_EXPORTER_OTLP_TRACES_ENDPOINT > OTEL_EXPORTER_OTLP_ENDPOINT > --otlp-endpoint.
  • Sampler reads OTEL_TRACES_SAMPLER / OTEL_TRACES_SAMPLER_ARG; default parent_based_traceidratio(1.0).
  • New shutdown() flushes the BatchSpanProcessor from the gateway shutdown path on SIGTERM.
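
As a rough illustration of the endpoint precedence listed above, the resolution could be sketched as follows. This is not the PR's code: `OtlpEndpoint`, `resolve_endpoint`, and `resolve_endpoint_from_env` are hypothetical names, and treating blank values as unset is an assumption.

```rust
/// Hypothetical stand-in for the endpoint part of the PR's `OtlpTracingConfig`.
#[derive(Debug, PartialEq)]
struct OtlpEndpoint(String);

/// Returns the first non-blank candidate, checked in precedence order:
/// traces-specific env var, then generic env var, then the CLI flag.
/// Assumption: a blank value is treated the same as an unset one.
fn resolve_endpoint(
    traces_env: Option<&str>,
    generic_env: Option<&str>,
    cli_flag: Option<&str>,
) -> Option<OtlpEndpoint> {
    [traces_env, generic_env, cli_flag]
        .iter()
        .copied()
        .flatten()
        .map(|s| s.trim())
        .find(|s| !s.is_empty())
        .map(|s| OtlpEndpoint(s.to_owned()))
}

/// Convenience wrapper that reads the two standard OTEL_* variables
/// from the process environment before falling back to the flag.
fn resolve_endpoint_from_env(cli_flag: Option<&str>) -> Option<OtlpEndpoint> {
    let traces = std::env::var("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT").ok();
    let generic = std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT").ok();
    resolve_endpoint(traces.as_deref(), generic.as_deref(), cli_flag)
}
```

Under this sketch, passing only `--otlp-endpoint` uses the flag value, while setting `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` overrides both other sources. Returning `None` corresponds to tracing staying off, which matches the opt-in default.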

Helm chart

  • New monitoring.serviceMonitor.* and monitoring.tracing.* blocks in values.yaml (off by default).
  • New templates/servicemonitor.yaml (gated, scrapes the existing named metrics port).
  • StatefulSet projects OTEL_* env vars when tracing is enabled, including merged OTEL_RESOURCE_ATTRIBUTES.
  • New ci/values-monitoring.yaml overlay and commented-in kube-prometheus-stack + jaeger Helm releases in skaffold.yaml.
  • New Monitoring section in deploy/helm/openshell/README.md.
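
The chart performs the OTEL_RESOURCE_ATTRIBUTES merge in its templates, but the intended semantics can be sketched in Rust for clarity. This is a minimal sketch under assumptions: `parse_attrs` and `merge_resource_attributes` are hypothetical helpers, and the rule that user-supplied keys win over chart defaults is assumed rather than taken from the PR.

```rust
use std::collections::BTreeMap;

/// Parses the comma-separated `key=value` format used by
/// OTEL_RESOURCE_ATTRIBUTES. Pairs without `=` or with an empty key are skipped.
fn parse_attrs(raw: &str) -> BTreeMap<String, String> {
    raw.split(',')
        .filter_map(|pair| {
            let mut parts = pair.splitn(2, '=');
            let key = parts.next()?.trim();
            let value = parts.next()?.trim();
            if key.is_empty() {
                None
            } else {
                Some((key.to_owned(), value.to_owned()))
            }
        })
        .collect()
}

/// Merges two attribute strings into one.
/// Assumption: on a key collision the user-supplied value replaces the default.
fn merge_resource_attributes(chart_defaults: &str, user_supplied: &str) -> String {
    let mut merged = parse_attrs(chart_defaults);
    merged.extend(parse_attrs(user_supplied));
    // BTreeMap iterates in key order, so the output is deterministic.
    merged
        .iter()
        .map(|(k, v)| format!("{}={}", k, v))
        .collect::<Vec<_>>()
        .join(",")
}
```

A deterministic key order keeps the rendered env var stable between Helm upgrades, which avoids spurious pod restarts when nothing has changed.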

Tooling

  • New tasks/observability.toml exposing observability:k8s:setup, observability:k8s:teardown, and observability:port-forward.
  • New scripts under tasks/scripts/ mirroring the existing keycloak-k8s-setup.sh shape: install slim kube-prometheus-stack + Jaeger all-in-one, idempotent re-runs.

Docs / agent skills

  • New docs/kubernetes/monitoring.mdx (operator + local-dev guide).
  • Cross-links from docs/observability/overview.mdx and a new "Observability surface" subsection in architecture/gateway.md.
  • helm-dev-environment and debug-openshell-cluster skills updated.

Testing

  • mise run pre-commit passes (lint, format, license headers, clippy, helm-lint matrix, full workspace tests).
  • Unit tests added for OtlpTracingConfig::resolve and sampler_from_env.
  • End-to-end on local k3d: created cluster, ran observability:k8s:setup, deployed gateway with ci/values-monitoring.yaml, drove 5 ListSandboxes + 3 Health gRPC calls. Verified:
    • Prometheus target up{job="openshell"} == 1; openshell_server_grpc_requests_total totals match driven traffic (8).
    • Jaeger registers openshell-gateway service; 8 request spans with correct method, path, request_id attributes; resource attributes include service.namespace=openshell, service.version=0.0.0, deployment.environment=dev, telemetry.sdk.version=0.29.0.
  • No new e2e runtime test in CI for OTLP — unit tests + manual validation are sufficient for v1; standing up Jaeger in CI is disproportionate.
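
For reviewers who want a picture of what the sampler unit tests exercise, the mapping from OTEL_TRACES_SAMPLER / OTEL_TRACES_SAMPLER_ARG to a sampler might look roughly like this. It is a sketch, not the PR's `sampler_from_env`: `SamplerChoice` and `sampler_from_values` are hypothetical, the sampler names are the ones defined by the OpenTelemetry specification, and clamping the ratio to [0, 1] is an assumption.

```rust
/// Hypothetical, SDK-independent description of the selected sampler.
#[derive(Debug, PartialEq)]
enum SamplerChoice {
    AlwaysOn,
    AlwaysOff,
    TraceIdRatio(f64),
    ParentBased(Box<SamplerChoice>),
}

/// Maps the values of OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG
/// to a sampler choice. Taking the values as parameters (instead of
/// reading the environment here) keeps the function easy to unit test.
fn sampler_from_values(name: Option<&str>, arg: Option<&str>) -> SamplerChoice {
    // Assumption: a missing, non-numeric, or non-finite ratio falls back
    // to 1.0, and out-of-range ratios are clamped to [0, 1].
    let ratio = arg
        .and_then(|a| a.trim().parse::<f64>().ok())
        .filter(|r| r.is_finite())
        .map(|r| r.max(0.0).min(1.0))
        .unwrap_or(1.0);

    match name.map(|n| n.trim()) {
        Some("always_on") => SamplerChoice::AlwaysOn,
        Some("always_off") => SamplerChoice::AlwaysOff,
        Some("traceidratio") => SamplerChoice::TraceIdRatio(ratio),
        Some("parentbased_always_on") => {
            SamplerChoice::ParentBased(Box::new(SamplerChoice::AlwaysOn))
        }
        Some("parentbased_always_off") => {
            SamplerChoice::ParentBased(Box::new(SamplerChoice::AlwaysOff))
        }
        Some("parentbased_traceidratio") => {
            SamplerChoice::ParentBased(Box::new(SamplerChoice::TraceIdRatio(ratio)))
        }
        // Unset or unrecognized name: the default stated in the PR
        // description, a parent-based ratio sampler at 1.0.
        _ => SamplerChoice::ParentBased(Box::new(SamplerChoice::TraceIdRatio(1.0))),
    }
}
```

Cases worth covering in such tests include the unset default, an explicit ratio, an out-of-range ratio, and an unrecognized sampler name.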

Out of scope (follow-ups)

  • OTLP log export (Loki / Collector logs receiver). OCSF JSONL remains the canonical log story.
  • In-process OTLP metrics push exporter — Prometheus pull is sufficient.
  • HTTP/protobuf OTLP transport — gateway currently only supports gRPC; chart accepts protocol: grpc.
  • Pre-built Grafana dashboards as ConfigMaps.
  • Per-handler #[tracing::instrument] annotations on gRPC handlers.

Checklist

@copy-pr-bot

copy-pr-bot Bot commented May 8, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@TaylorMutch TaylorMutch force-pushed the tmutch/otel-metrics-traces branch from a551804 to c6463bf on May 8, 2026 23:14
@github-actions

github-actions Bot commented May 8, 2026

Adds opt-in OpenTelemetry trace export and a Prometheus ServiceMonitor to
the gateway Helm chart. The exporter and chart toggles are independent
from the existing /metrics surface and the OCSF sandbox log fan-out.

- Gateway: append a tracing-opentelemetry layer to TracingLogBus when an
  OTLP/gRPC endpoint is configured; flush spans on shutdown. CLI gains
  --otlp-endpoint; standard OTEL_* env vars drive sampling and resource
  attributes.
- Helm: monitoring.serviceMonitor.* renders a Prometheus-Operator
  ServiceMonitor; monitoring.tracing.* projects OTEL_* env vars onto the
  gateway container. Both default off.
- Tooling: observability:k8s:{setup,teardown,port-forward} mise tasks
  install kube-prometheus-stack + Jaeger all-in-one for local dev.
- Docs: new docs/kubernetes/monitoring.mdx; cross-links from observability
  overview and architecture/gateway.md; helm-dev-environment and
  debug-openshell-cluster skills updated.
…files

The kube-prometheus-stack and Jaeger releases were configured via long
chains of `--set` flags, which obscure the configuration and make the
script hard to extend. Extract them into two checked-in values files
the setup script consumes via `--values`.

- tasks/scripts/observability-prometheus-values.yaml — slim chart config
  plus Grafana auto-provisioning of a Jaeger datasource (stable uid so
  dashboards can reference it).
- tasks/scripts/observability-jaeger-values.yaml — all-in-one Jaeger.
- PROMSTACK_VALUES and JAEGER_VALUES env vars allow pointing at custom
  files for local experimentation.
@TaylorMutch TaylorMutch force-pushed the tmutch/otel-metrics-traces branch from c6463bf to 7d4c3d5 on May 12, 2026 20:53
Operator-facing /docs pages shouldn't surface mise tasks. Trim the
`Local development` section out of docs/kubernetes/monitoring.mdx and
move it into deploy/helm/openshell/README.md alongside the Monitoring
opt-in block.

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
`tasks/scripts/` is for shell scripts, not third-party Helm values. The
kube-prometheus-stack and Jaeger values files belong with other K8s
deployment artifacts.

- Move observability-{prometheus,jaeger}-values.yaml to deploy/kube/observability/
  and drop the `observability-` prefix (parent dir already scopes them).
- Update observability-k8s-setup.sh to resolve them via a REPO_ROOT-anchored
  VALUES_DIR instead of SCRIPT_DIR. PROMSTACK_VALUES / JAEGER_VALUES
  env-var overrides continue to work.

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
@TaylorMutch TaylorMutch changed the title from "feat(observability): add gateway OTLP traces and Helm monitoring surface" to "feat(observability): add gateway OTLP traces and initial Kube monitoring surface" on May 12, 2026
@TaylorMutch TaylorMutch marked this pull request as ready for review May 12, 2026 21:24