
chart: port Cortex + Alertmanager + alerting from docker-compose #243

Open
lezzago wants to merge 1 commit into opensearch-project:main from lezzago:port-cortex-helm

lezzago commented May 12, 2026

Summary

Mirrors the cortex-metrics docker-compose branch (PR #226) into the Helm chart and terraform/aws/ overlay. Swaps vanilla Prometheus for Cortex to unlock the full Prometheus HTTP API surface (query + Ruler + Alertmanager) and ships pre-canned alerting end-to-end.

Chart changes

  • Chart.yaml: drop the prometheus subchart dep; bump version 0.1.0 → 0.2.0.
  • New native templates for Cortex + Alertmanager (cortex-*.yaml, alertmanager-*.yaml). Cortex Service name is kept as <release>-prometheus-server so existing references in otel-collector, data-prepper, and init-dashboards templates keep working unchanged.
  • Cortex ConfigMap templates the ruler's alertmanager_url at render time so it resolves against the K8s Service (<release>-alertmanager) instead of the compose-only alertmanager hostname.
  • First-boot cleanup shim from docker-compose (removes stale vanilla-Prometheus TSDB artifacts) ported as a Cortex pod initContainer.
  • Alertmanager Secret renders alertmanager.yml with OPENSEARCH_USER/PASSWORD and opensearch:9200 → opensearch-cluster-master:9200 via sprig replace. tpl is avoided because the file embeds alertmanager's own Go-template tokens that must NOT be evaluated by Helm.
  • New Jobs (post-install/post-upgrade hooks): cortex-rules-init, stack-monitors-init, otel-demo-monitors-init (the last conditional on otel-demo).
  • otel-collector-configmap: add prometheus/self + prometheus/envoy receivers; replace otlphttp/prometheus exporter with prometheusremotewrite/cortex (endpoint /api/v1/push, resource_to_telemetry_conversion: true).
  • data-prepper-pipeline-secret: split service-map-pipeline; new service-metrics-cortex-pipeline strips per-event randomKey UUID before remote-write; endpoint suffix /api/v1/write → /api/v1/push.
  • OpenSearch Dashboards config: observability.alertManager.enabled: true (requires OSD 3.7.0 — point OS + OSD images at the opensearchstaging Docker Hub org until 3.7.0 ships to opensearchproject).
  • init-dashboards Job: add ALERTMANAGER_HOST / ALERTMANAGER_PORT env so the init script wires alertmanager.uri on the Prometheus datasource.
  • files/init-opensearch-dashboards.py: 3-way merge of compose tip (+272 L) with chart-specific deltas.
  • otel-demo values: expose envoy admin listener (port 10000) via components.frontend-proxy.ports so the OTel Collector's prometheus/envoy scrape target resolves; the upstream subchart only exposes app port 8080.
  • values.yaml: remove prometheus: block; add cortex: and alertmanager: blocks.
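
The Alertmanager Secret bullet above depends on literal string substitution rather than template evaluation. A minimal Python sketch of that idea follows; it is not the chart's actual template. The sample fragment, the assumption that `OPENSEARCH_USER` and `OPENSEARCH_PASSWORD` appear as literal placeholders, and the `render_alertmanager_config` helper are all hypothetical, chosen only to show that plain replacement leaves Alertmanager's own `{{ ... }}` tokens untouched.

```python
# Hypothetical fragment for illustration; NOT the real alertmanager.yml from this PR.
SAMPLE_FRAGMENT = """\
url: http://opensearch:9200/alertmanager-alerts/_doc
username: OPENSEARCH_USER
password: OPENSEARCH_PASSWORD
title: '{{ .CommonLabels.alertname }}'
"""


def render_alertmanager_config(text, user, password,
                               host="opensearch-cluster-master:9200"):
    """Substitute known literals only; nothing in the text is evaluated as a template."""
    replacements = {
        "OPENSEARCH_USER": user,          # assumed placeholder name
        "OPENSEARCH_PASSWORD": password,  # assumed placeholder name
        "opensearch:9200": host,          # compose hostname -> K8s Service name
    }
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


rendered = render_alertmanager_config(SAMPLE_FRAGMENT, "admin", "example-password")
print(rendered)
```

Because only the three named literals are rewritten, the `{{ .CommonLabels.alertname }}` token reaches Alertmanager as written, which is the property a template-evaluating pass would not guarantee.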

Terraform (AWS EKS overlay)

  • values-eks.yaml: replace prometheus.server block with cortex + alertmanager sizing (gp2 PVCs, 2Gi/500m → 4Gi/1000m for Cortex, 128Mi for Alertmanager).
  • main.tf: bump EKS node group max_size from node_count + 1 → node_count + 2 to accommodate Alertmanager + rules-init pods.
  • observability-stack.tf: helm_release timeout 900s → 1800s. Cold-start EKS + image pulls + init jobs routinely exceed 15 min.

Test plan

  • `helm lint` clean
  • `helm unittest` — 54/54 tests green across 10 suites (4 new suites: `cortex`, `alertmanager`, `cortex_rules_configmap`, `otel_collector_configmap`)
  • `helm template` renders cleanly with both default values and `terraform/aws/values-eks.yaml`
  • End-to-end deploy to a fresh EKS cluster: pods healthy, Cortex ruler → Alertmanager → OpenSearch webhook flows (confirmed `alertmanager-alerts` index populated), 4 rule groups loaded, `up{job="otel-collector"} == 1` and `up{job="envoy-frontend-proxy"} == 1`, 1,961 envoy series in Cortex, 232K traces and 68K logs indexed in OpenSearch, Observability → Alerts UI visible in OSD
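
The `up` checks in the end-to-end item could be automated against the Prometheus-compatible instant-query endpoint (`GET /api/v1/query`). The sketch below is one possible shape, not part of this PR: the helper names, the `cortex.example` base URL, and the sample response are hypothetical. Cortex serves this API under a configurable prefix, so `base_url` is assumed to already include it. The code only builds the URL and parses a response body; it makes no network call.

```python
import json
from urllib.parse import urlencode


def build_instant_query_url(base_url, promql):
    """Build a Prometheus-compatible instant-query URL."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def targets_up(response_body):
    """Map each `job` label to True/False from an instant-query response for `up`."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    status = {}
    for series in payload["data"]["result"]:
        job = series["metric"].get("job", "")
        _timestamp, value = series["value"]  # value arrives as a string
        status[job] = float(value) == 1.0
    return status


# Hypothetical response in the standard instant-vector format.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "otel-collector"},
             "value": [1778600000, "1"]},
            {"metric": {"__name__": "up", "job": "envoy-frontend-proxy"},
             "value": [1778600000, "1"]},
        ],
    },
})

print(build_instant_query_url("http://cortex.example/prometheus", "up"))
print(targets_up(SAMPLE_RESPONSE))
```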

Known limitation (pre-existing, not introduced here)

Data-prepper's `prometheus` sink periodically throws `400 duplicate sample for timestamp` from Cortex on multi-language ingress operations. This exists identically in the compose branch (PR #226) and is an `otel_apm_service_map` processor limitation, not something the chart port introduced.
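
For context on the error text: Prometheus-style storage rejects a sample when the same series already holds a different value at the same timestamp. The sketch below is a hypothetical debugging aid for spotting such collisions in a batch before remote-write. The sample tuple shape, the metric and label names, and the function name are assumptions, and it does not claim to reproduce the actual cause inside `otel_apm_service_map`.

```python
from collections import defaultdict


def find_conflicting_samples(samples):
    """Return {(series_key, timestamp_ms): sorted distinct values} for colliding samples.

    Each sample is a tuple (labels: dict[str, str], timestamp_ms: int, value: float).
    Only keys that carry more than one distinct value are reported.
    """
    seen = defaultdict(set)
    for labels, timestamp_ms, value in samples:
        series_key = tuple(sorted(labels.items()))  # label order must not matter
        seen[(series_key, timestamp_ms)].add(value)
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# Hypothetical batch: the first two samples share labels and timestamp but differ in value.
batch = [
    ({"__name__": "request_count", "service": "frontend"}, 1000, 3.0),
    ({"service": "frontend", "__name__": "request_count"}, 1000, 5.0),
    ({"__name__": "request_count", "service": "cart"}, 1000, 2.0),
    ({"__name__": "request_count", "service": "cart"}, 1000, 2.0),
]

for (series_key, timestamp_ms), values in find_conflicting_samples(batch).items():
    print(dict(series_key), timestamp_ms, values)
```

Exact repeats (same series, timestamp, and value) are deliberately not reported, since only differing values at one timestamp match the error described above.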

🤖 Generated with Claude Code

Mirrors the cortex-metrics docker-compose branch (PR opensearch-project#226) into the Helm
chart and terraform/aws/ overlay. Swaps vanilla Prometheus for Cortex to
unlock the full Prometheus HTTP API surface (query + Ruler + Alertmanager)
and ships pre-canned alerting end-to-end.

Chart
-----
- Chart.yaml: drop the prometheus subchart dep; bump version 0.1.0 -> 0.2.0;
  keep appVersion at 3.6.0.
- New native templates for Cortex + Alertmanager:
    cortex-{deployment,service,pvc,configmap}.yaml
    alertmanager-{deployment,service,pvc,secret}.yaml
  Cortex Service name is kept as <release>-prometheus-server so existing
  references in otel-collector, data-prepper, and init-dashboards templates
  keep working unchanged.
- Cortex ConfigMap templates the ruler's alertmanager_url at render time so
  it resolves against the K8s Service (<release>-alertmanager) instead of
  the compose-only 'alertmanager' hostname.
- First-boot cleanup shim from docker-compose (removes stale vanilla-
  Prometheus TSDB artifacts) ported as a Cortex pod initContainer.
- Alertmanager Secret renders alertmanager.yml with OPENSEARCH_USER/PASSWORD
  and opensearch:9200 -> opensearch-cluster-master:9200 via sprig replace.
  tpl is avoided because the file embeds alertmanager's own Go-template
  tokens that must NOT be evaluated by Helm.
- New Jobs (post-install/post-upgrade hooks):
    cortex-rules-init (loads rules via Cortex Ruler API; stack namespace
      always, otel-demo namespace conditional on otel-demo.enabled)
    stack-monitors-init, otel-demo-monitors-init (create OpenSearch
      alerting monitors for cluster + otel-demo services)
- otel-collector-configmap: add prometheus/self + prometheus/envoy receivers;
  replace otlphttp/prometheus exporter with prometheusremotewrite/cortex
  (endpoint /api/v1/push, resource_to_telemetry_conversion: true).
- data-prepper-pipeline-secret: split service-map-pipeline; new
  service-metrics-cortex-pipeline strips per-event randomKey UUID before
  remote-write; endpoint suffix /api/v1/write -> /api/v1/push.
- opensearch-dashboards config: observability.alertManager.enabled: true
  (requires OpenSearch Dashboards 3.7.0; point OS + OSD images at the
  opensearchstaging Docker Hub org until 3.7.0 ships to opensearchproject).
- init-dashboards Job: add ALERTMANAGER_HOST / ALERTMANAGER_PORT env so
  the init script wires alertmanager.uri on the Prometheus datasource.
- files/init-opensearch-dashboards.py: 3-way merge of compose tip (+272 L)
  with chart-specific deltas (BASE_URL/PROMETHEUS_HOST env overrides,
  set_default_workspace, broken-datasource detection, anonymous-role
  handling, dashboard-k8s-cluster-health load, password-print removal).
- opentelemetry-demo values: expose envoy admin listener (port 10000) via
  components.frontend-proxy.ports so the OTel Collector's prometheus/envoy
  scrape target resolves; the upstream subchart only exposes app port 8080.
- values.yaml: remove prometheus: block; add cortex: and alertmanager:
  blocks; add observability.alertManager.enabled to OSD config.

Tests
-----
- New suites: cortex_test.yaml, alertmanager_test.yaml,
  cortex_rules_configmap_test.yaml, otel_collector_configmap_test.yaml.
- All 54/54 tests green; helm lint clean; helm template renders both with
  default values and with terraform/aws/values-eks.yaml.

Terraform (AWS EKS overlay)
---------------------------
- values-eks.yaml: replace prometheus.server block with cortex + alertmanager
  sizing (gp2 PVCs, 2Gi/500m -> 4Gi/1000m for Cortex, 128Mi for Alertmanager).
- main.tf: bump EKS node group max_size from node_count+1 -> node_count+2
  to accommodate Alertmanager + rules-init pods.
- observability-stack.tf: helm_release timeout 900s -> 1800s. Cold-start
  EKS + image pulls + init jobs routinely exceed 15 min.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
lezzago force-pushed the port-cortex-helm branch from 36d376d to 7b69b75 May 12, 2026 18:04
lezzago marked this pull request as ready for review May 12, 2026 21:31