chart: port Cortex + Alertmanager + alerting from docker-compose #243
Open
lezzago wants to merge 1 commit into
Mirrors the cortex-metrics docker-compose branch (PR opensearch-project#226) into the Helm chart and terraform/aws/ overlay. Swaps vanilla Prometheus for Cortex to unlock the full Prometheus HTTP API surface (query + Ruler + Alertmanager) and ships pre-canned alerting end-to-end.

Chart
-----
- Chart.yaml: drop the prometheus subchart dep; bump version 0.1.0 -> 0.2.0; keep appVersion at 3.6.0.
- New native templates for Cortex + Alertmanager:
    cortex-{deployment,service,pvc,configmap}.yaml
    alertmanager-{deployment,service,pvc,secret}.yaml
  Cortex Service name is kept as <release>-prometheus-server so existing references in otel-collector, data-prepper, and init-dashboards templates keep working unchanged.
- Cortex ConfigMap templates the ruler's alertmanager_url at render time so it resolves against the K8s Service (<release>-alertmanager) instead of the compose-only 'alertmanager' hostname.
- First-boot cleanup shim from docker-compose (removes stale vanilla-Prometheus TSDB artifacts) ported as a Cortex pod initContainer.
- Alertmanager Secret renders alertmanager.yml with OPENSEARCH_USER/PASSWORD and opensearch:9200 -> opensearch-cluster-master:9200 via sprig replace. tpl is avoided because the file embeds alertmanager's own Go-template tokens that must NOT be evaluated by Helm.
- New Jobs (post-install/post-upgrade hooks):
    cortex-rules-init (loads rules via Cortex Ruler API; stack namespace always, otel-demo namespace conditional on otel-demo.enabled)
    stack-monitors-init, otel-demo-monitors-init (create OpenSearch alerting monitors for cluster + otel-demo services)
- otel-collector-configmap: add prometheus/self + prometheus/envoy receivers; replace otlphttp/prometheus exporter with prometheusremotewrite/cortex (endpoint /api/v1/push, resource_to_telemetry_conversion: true).
- data-prepper-pipeline-secret: split service-map-pipeline; new service-metrics-cortex-pipeline strips per-event randomKey UUID before remote-write; endpoint suffix /api/v1/write -> /api/v1/push.
- opensearch-dashboards config: observability.alertManager.enabled: true (requires OpenSearch Dashboards 3.7.0; point OS + OSD images at the opensearchstaging Docker Hub org until 3.7.0 ships to opensearchproject).
- init-dashboards Job: add ALERTMANAGER_HOST / ALERTMANAGER_PORT env so the init script wires alertmanager.uri on the Prometheus datasource.
- files/init-opensearch-dashboards.py: 3-way merge of compose tip (+272 L) with chart-specific deltas (BASE_URL/PROMETHEUS_HOST env overrides, set_default_workspace, broken-datasource detection, anonymous-role handling, dashboard-k8s-cluster-health load, password-print removal).
- opentelemetry-demo values: expose envoy admin listener (port 10000) via components.frontend-proxy.ports so the OTel Collector's prometheus/envoy scrape target resolves; the upstream subchart only exposes app port 8080.
- values.yaml: remove prometheus: block; add cortex: and alertmanager: blocks; add observability.alertManager.enabled to OSD config.

Tests
-----
- New suites: cortex_test.yaml, alertmanager_test.yaml, cortex_rules_configmap_test.yaml, otel_collector_configmap_test.yaml.
- All 54/54 tests green; helm lint clean; helm template renders both with default values and with terraform/aws/values-eks.yaml.

Terraform (AWS EKS overlay)
---------------------------
- values-eks.yaml: replace prometheus.server block with cortex + alertmanager sizing (gp2 PVCs, 2Gi/500m -> 4Gi/1000m for Cortex, 128Mi for Alertmanager).
- main.tf: bump EKS node group max_size from node_count+1 -> node_count+2 to accommodate Alertmanager + rules-init pods.
- observability-stack.tf: helm_release timeout 900s -> 1800s. Cold-start EKS + image pulls + init jobs routinely exceed 15 min.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
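
The Alertmanager Secret item above relies on literal string substitution rather than template evaluation. The following is a minimal Python sketch of that idea, not the chart's actual template: the sample text, the function name `render_alertmanager_config`, and the assumption that `OPENSEARCH_USER` and `OPENSEARCH_PASSWORD` appear as literal placeholder strings in the file are all hypothetical. It shows that plain replacement rewrites only the named substrings and leaves Alertmanager's `{{ ... }}` tokens untouched.

```python
# Hypothetical fragment standing in for alertmanager.yml; it is not the real
# file and does not claim to follow Alertmanager's configuration schema.
SAMPLE = (
    "url: http://opensearch:9200/alerts/_doc\n"
    "username: OPENSEARCH_USER\n"
    "password: OPENSEARCH_PASSWORD\n"
    "title: '{{ .CommonLabels.alertname }}'\n"
)


def render_alertmanager_config(raw: str, user: str, password: str) -> str:
    """Apply literal substitutions only; nothing in `raw` is evaluated."""
    rendered = raw.replace("opensearch:9200", "opensearch-cluster-master:9200")
    # Assumption: the credentials are plain placeholder strings in the file.
    rendered = rendered.replace("OPENSEARCH_USER", user)
    rendered = rendered.replace("OPENSEARCH_PASSWORD", password)
    return rendered


if __name__ == "__main__":
    print(render_alertmanager_config(SAMPLE, "admin", "example-password"))
```

Because the substitution never parses the file, the `{{ .CommonLabels.alertname }}` token survives for Alertmanager to evaluate at notification time, which is the property the commit message gives as the reason for avoiding `tpl`.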
Summary
Mirrors the cortex-metrics docker-compose branch (PR #226) into the Helm chart and `terraform/aws/` overlay. Swaps vanilla Prometheus for Cortex to unlock the full Prometheus HTTP API surface (query + Ruler + Alertmanager) and ships pre-canned alerting end-to-end.

Chart changes
- `Chart.yaml`: drop the `prometheus` subchart dep; bump version 0.1.0 → 0.2.0.
- New native templates for Cortex + Alertmanager (`cortex-*.yaml`, `alertmanager-*.yaml`). Cortex `Service` name is kept as `<release>-prometheus-server` so existing references in otel-collector, data-prepper, and init-dashboards templates keep working unchanged.
- Cortex ConfigMap templates the ruler's `alertmanager_url` at render time so it resolves against the K8s Service (`<release>-alertmanager`) instead of the compose-only `alertmanager` hostname.
- Alertmanager `Secret` renders `alertmanager.yml` with `OPENSEARCH_USER/PASSWORD` and `opensearch:9200` → `opensearch-cluster-master:9200` via sprig `replace`. `tpl` is avoided because the file embeds alertmanager's own Go-template tokens that must NOT be evaluated by Helm.
- New Jobs (post-install/post-upgrade hooks): `cortex-rules-init`, `stack-monitors-init`, `otel-demo-monitors-init` (the last conditional on otel-demo).
- `otel-collector-configmap`: add `prometheus/self` + `prometheus/envoy` receivers; replace `otlphttp/prometheus` exporter with `prometheusremotewrite/cortex` (endpoint `/api/v1/push`, `resource_to_telemetry_conversion: true`).
- `data-prepper-pipeline-secret`: split `service-map-pipeline`; new `service-metrics-cortex-pipeline` strips per-event `randomKey` UUID before remote-write; endpoint `/api/v1/write` → `/api/v1/push`.
- opensearch-dashboards config: `observability.alertManager.enabled: true` (requires OSD 3.7.0; point OS + OSD images at the `opensearchstaging` Docker Hub org until 3.7.0 ships to `opensearchproject`).
- `init-dashboards` Job: add `ALERTMANAGER_HOST` / `ALERTMANAGER_PORT` env so the init script wires `alertmanager.uri` on the Prometheus datasource.
- `files/init-opensearch-dashboards.py`: 3-way merge of compose tip (+272 L) with chart-specific deltas.
- opentelemetry-demo values: expose envoy admin listener (port 10000) via `components.frontend-proxy.ports` so the OTel Collector's `prometheus/envoy` scrape target resolves; the upstream subchart only exposes app port 8080.
- `values.yaml`: remove `prometheus:` block; add `cortex:` and `alertmanager:` blocks.

Terraform (AWS EKS overlay)
- `values-eks.yaml`: replace `prometheus.server` block with `cortex` + `alertmanager` sizing (gp2 PVCs, 2Gi/500m → 4Gi/1000m for Cortex, 128Mi for Alertmanager).
- `main.tf`: bump EKS node group `max_size` from `node_count + 1` → `node_count + 2` to accommodate Alertmanager + rules-init pods.
- `observability-stack.tf`: `helm_release` timeout 900s → 1800s. Cold-start EKS + image pulls + init jobs routinely exceed 15 min.

Test plan
Known limitation (pre-existing, not introduced here)
Data-prepper's `prometheus` sink periodically receives `400 duplicate sample for timestamp` from Cortex on multi-language ingress operations. The same error occurs in the compose branch (PR #226); it is an `otel_apm_service_map` processor limitation, not something the chart port introduced.
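
For reviewers unfamiliar with the error, here is a minimal Python sketch of the condition it describes: Prometheus-compatible storage rejects a sample when the same series (identical label set) already has a different value at the same timestamp. The tuple layout and the helper name `find_conflicting_samples` are assumptions for illustration and are not part of Data Prepper or Cortex.

```python
from collections import defaultdict


def find_conflicting_samples(samples):
    """Return (series, timestamp_ms) keys that carry more than one distinct value.

    `samples` is an iterable of (labels: dict, timestamp_ms: int, value: float).
    This layout is hypothetical and used only for illustration.
    """
    seen = defaultdict(set)
    for labels, timestamp_ms, value in samples:
        # A series is identified by its full label set, independent of order.
        series = tuple(sorted(labels.items()))
        seen[(series, timestamp_ms)].add(value)
    return [key for key, values in seen.items() if len(values) > 1]


if __name__ == "__main__":
    batch = [
        ({"service": "frontend", "operation": "GET /"}, 1000, 1.0),
        ({"operation": "GET /", "service": "frontend"}, 1000, 2.0),  # same series, new value
        ({"service": "cart", "operation": "GET /"}, 1000, 1.0),
    ]
    for series, ts in find_conflicting_samples(batch):
        print(f"conflict at {ts} ms for series {dict(series)}")
```

A check like this could be run over a captured remote-write batch to see which label sets collide; it does not fix the processor behaviour described above.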
🤖 Generated with Claude Code