Architecting a Production-Ready IDP: Argo CD, Crossplane & OPA in Practice
Reference architecture for KubeCon EU 2026 — Platform Engineering Zero Day.
Portal-first IDPs fail at scale. Build the automation, guardrails, and GitOps pipelines first. The portal is optional.
 Developer           GitOps           Admission         Platform API          Cloud
┌──────────┐      ┌──────────┐      ┌───────────┐      ┌─────────────┐      ┌───────┐
│  9-line  │─git─▶│  ArgoCD  │─sync▶│  Kyverno  │─ok──▶│ Crossplane  │─────▶│  AWS  │
│  claim   │ push │  v3 App  │      │    CEL    │      │  v2 XRD +   │      │  GCP  │
│          │      │   Sets   │      │ Policies  │      │  Pipeline   │      │ Azure │
└──────────┘      └──────────┘      └───────────┘      │ Compositions│      └───────┘
                                          │            └─────────────┘
                                          │                   │
                                    ┌─────▼─────┐      ┌──────▼──────┐
                                    │  Policy   │      │   Shadow    │
                                    │ Promotion │      │   Metrics   │
                                    │ dev→prod  │      │  (runtime)  │
                                    └───────────┘      └─────────────┘
A developer writes a 9-line claim. ArgoCD syncs it. Kyverno validates against CEL policies (region restrictions, size caps, naming, HA, backups). Crossplane provisions the right cloud resources. Shadow Metrics evaluate whether the valid configuration is actually correct for the workload. No portal needed.
- 7 XRDs — Database, Cache, Message Queue, Object Storage, CDN, DNS, Namespace
- 21 Compositions — 7 resource types × 3 clouds (AWS, GCP, Azure)
- 6 Kyverno Policies — Region enforcement (PCI-DSS), size caps, labels, naming, backup retention, HA
- Policy Promotion Pipeline — dev (Audit) → staging (mixed) → production (Enforce)
- 12 Teams, 109 Claims — Realistic "100+ service environment"
- Shadow Metrics — CRD + 4 rules closing the semantic gap between "valid" and "correct"
- Full Observability — OpenTelemetry, Prometheus, OpenCost, 5 Grafana dashboards
- Composition Drift Detection — CronJob + Prometheus alerting
- External Secrets Operator — ClusterSecretStores for AWS/GCP/Azure
- One-Command Bootstrap — ./bootstrap/install.sh --provider aws
See docs/architecture.md for the full architecture breakdown.
# Clone the repo
git clone git@github.com:peopleforrester/backend-first-idp.git
cd backend-first-idp
# Preview what would be installed (dry run)
./bootstrap/install.sh --provider aws --dry-run
# Bootstrap the full platform
./bootstrap/install.sh --provider aws # or gcp, or azure
# Lightweight install (skip observability stack)
./bootstrap/install.sh --provider aws --skip-observability
# Submit a database claim
kubectl apply -f golden-path/examples/claim-database.yaml
# Watch it provision
kubectl get databaseinstanceclaim -w
# Try a full service (DB + cache + queue + storage)
kubectl apply -f golden-path/examples/claim-full-service.yaml
# See what happens when you break the rules
kubectl apply -f golden-path/examples/claim-database-WILL-FAIL.yaml
backend-first-idp/
├── platform-api/
│   ├── xrds/                   # 7 CompositeResourceDefinitions
│   ├── compositions/           # 21 cloud-specific implementations
│   │   ├── aws/                # RDS, ElastiCache, SQS, S3, CloudFront, Route53, NS
│   │   ├── gcp/                # Cloud SQL, Memorystore, Pub/Sub, GCS, Cloud CDN, DNS, NS
│   │   └── azure/              # FlexibleServer, Redis Cache, Service Bus, Blob, Front Door, DNS, NS
│   ├── shadow-metrics/         # ShadowMetricRule CRD + 4 runtime validation rules
│   └── drift-detection/        # Composition drift CronJob + Prometheus alerting
├── policies/
│   └── kyverno/
│       ├── cluster-policies/   # 6 CEL policies (region, size, labels, naming, backup, HA)
│       ├── policy-exceptions/  # Platform team overrides
│       ├── policy-tests/       # Pass/fail test resources per policy
│       └── promotion/          # dev → staging → production overlays
├── gitops/
│   ├── argocd/                 # 4 ApplicationSets + team RBAC
│   └── kustomize/              # Base + cloud overlays + environment overlays
├── golden-path/
│   ├── examples/               # 6 claim examples (working, failing, shadow metric)
│   └── templates/              # Service scaffold with all 7 resource types
├── teams/                      # 12 teams, 109 claims (generated)
├── secrets/
│   └── eso/                    # ClusterSecretStores + ExternalSecret templates
├── observability/
│   ├── opentelemetry/          # Collector agents + gateway + instrumentation
│   ├── prometheus/             # Rules, ServiceMonitors, Helm values
│   ├── opencost/               # Cost allocation by team label
│   └── grafana/dashboards/     # 5 JSON dashboards
├── bootstrap/
│   ├── install.sh              # 12-step one-command setup
│   └── providers/              # AWS (IRSA), GCP (WI), Azure (OIDC)
├── scripts/                    # Team claim generator
├── tests/                      # 10 test suites
├── docs/                       # Architecture, Shadow Metrics, Kyverno, drift detection
└── DEMO.md                     # On-stage walkthrough (6 beats, ~10 min)
The same claim works across all three clouds:
| Claim region | AWS | GCP | Azure |
|---|---|---|---|
| eu-west-1 | eu-west-1 | europe-west1 | westeurope |
| eu-central-1 | eu-central-1 | europe-west3 | germanywestcentral |
| us-east-1 | us-east-1 | us-east1 | eastus |
| us-west-2 | us-west-2 | us-west1 | westus2 |
Abstract sizes map to provider-specific database instance classes:
| Size | AWS (DB) | GCP (DB) | Azure (DB) |
|---|---|---|---|
| small | db.t4g.medium | db-custom-2-4096 | B_Standard_B2s |
| medium | db.t4g.large | db-custom-4-8192 | GP_Standard_D2ds_v4 |
| large | db.r6g.xlarge | db-custom-8-32768 | GP_Standard_D4ds_v4 |
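The two mapping tables above can be read as a pair of lookups keyed by cloud. The sketch below encodes them as shell functions to show how one claim resolves per provider. The names `resolve_region` and `resolve_db_class` are hypothetical helpers for illustration only; in the repository this translation presumably happens inside the Compositions.

```bash
# resolve_region CLOUD CLAIM_REGION
# Hypothetical helper: prints the provider region for a claim region,
# using the region table above. Returns 1 for unknown combinations.
resolve_region() {
  case "$1:$2" in
    aws:eu-west-1|aws:eu-central-1|aws:us-east-1|aws:us-west-2)
      printf '%s\n' "$2" ;;                 # AWS uses the claim value as-is
    gcp:eu-west-1)      echo "europe-west1" ;;
    gcp:eu-central-1)   echo "europe-west3" ;;
    gcp:us-east-1)      echo "us-east1" ;;
    gcp:us-west-2)      echo "us-west1" ;;
    azure:eu-west-1)    echo "westeurope" ;;
    azure:eu-central-1) echo "germanywestcentral" ;;
    azure:us-east-1)    echo "eastus" ;;
    azure:us-west-2)    echo "westus2" ;;
    *) echo "unsupported cloud/region: $1/$2" >&2; return 1 ;;
  esac
}

# resolve_db_class CLOUD SIZE
# Hypothetical helper: prints the database instance class for a claim
# size, using the size table above. Returns 1 for unknown combinations.
resolve_db_class() {
  case "$1:$2" in
    aws:small)    echo "db.t4g.medium" ;;
    aws:medium)   echo "db.t4g.large" ;;
    aws:large)    echo "db.r6g.xlarge" ;;
    gcp:small)    echo "db-custom-2-4096" ;;
    gcp:medium)   echo "db-custom-4-8192" ;;
    gcp:large)    echo "db-custom-8-32768" ;;
    azure:small)  echo "B_Standard_B2s" ;;
    azure:medium) echo "GP_Standard_D2ds_v4" ;;
    azure:large)  echo "GP_Standard_D4ds_v4" ;;
    *) echo "unsupported cloud/size: $1/$2" >&2; return 1 ;;
  esac
}

# Show how the same claim (region eu-west-1, size small) resolves per cloud.
for cloud in aws gcp azure; do
  printf '%-6s %-14s %s\n' "$cloud" \
    "$(resolve_region "$cloud" eu-west-1)" \
    "$(resolve_db_class "$cloud" small)"
done
```

Keeping the lookup keyed on `cloud:value` makes an unsupported combination fail loudly instead of silently falling back to a default.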
What developers interact with — a single claim:
apiVersion: platform.kubecon.io/v1alpha1
kind: DatabaseInstanceClaim
metadata:
  name: checkout-db
  namespace: checkout
spec:
  size: small
  region: eu-west-1
  team: checkout
Nine lines. That's the entire developer interface. Everything else is platform.
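Because the interface is only five values wide, a claim can be generated from a template. The sketch below is one way to do that in shell; `render_claim` is a hypothetical helper, not something the repository is known to ship, and it emits only the fields shown in the example claim.

```bash
# render_claim NAME NAMESPACE SIZE REGION TEAM
# Hypothetical helper: prints a DatabaseInstanceClaim with the same nine
# lines as the example above, substituting the five developer inputs.
render_claim() {
  cat <<EOF
apiVersion: platform.kubecon.io/v1alpha1
kind: DatabaseInstanceClaim
metadata:
  name: $1
  namespace: $2
spec:
  size: $3
  region: $4
  team: $5
EOF
}

# Render the checkout example and confirm it is still nine lines long.
render_claim checkout-db checkout small eu-west-1 checkout | wc -l
```

On a cluster where the platform is installed, the rendered output could be piped to `kubectl apply --dry-run=server -f -`, which should exercise admission without creating anything.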
6 Kyverno CEL policies enforce team-level guardrails at admission time:
| Policy | What It Enforces | Severity |
|---|---|---|
| Region enforcement | checkout/payments EU-only (PCI-DSS) | High |
| Size caps | checkout/analytics capped at medium | Medium |
| Required labels | All claims must have team label | Medium |
| Naming conventions | Names must start with team name | Low |
| Backup retention | Prod DBs need ≥7 day backups | High |
| HA enforcement | Prod DBs must enable HA | High |
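Three of the rules in the table depend only on fields present in the claim, so they can be approximated locally before pushing. The sketch below is a hypothetical pre-flight check, not the Kyverno policies themselves. It assumes that "EU-only" means a region starting with `eu-` and that "capped at medium" means `large` is rejected.

```bash
# check_claim TEAM NAME SIZE REGION
# Hypothetical local pre-flight check mirroring the naming, region, and
# size-cap rules. It does not replace admission-time enforcement.
check_claim() {
  team=$1 name=$2 size=$3 region=$4
  status=0

  # Naming conventions: names must start with the team name.
  case "$name" in
    "$team"*) ;;
    *) echo "naming: '$name' must start with '$team'" >&2; status=1 ;;
  esac

  # Region enforcement: checkout and payments are EU-only (PCI-DSS).
  case "$team" in
    checkout|payments)
      case "$region" in
        eu-*) ;;
        *) echo "region: team '$team' is restricted to EU regions" >&2; status=1 ;;
      esac ;;
  esac

  # Size caps: checkout and analytics are capped at medium.
  case "$team" in
    checkout|analytics)
      if [ "$size" = "large" ]; then
        echo "size: team '$team' is capped at medium" >&2
        status=1
      fi ;;
  esac

  return "$status"
}

# The example claim passes; a large us-east-1 claim for checkout does not.
check_claim checkout checkout-db small eu-west-1 && echo "ok"
check_claim checkout checkout-db large us-east-1 || echo "rejected"
```

The function reports every violation rather than stopping at the first, so a developer sees all problems in one pass.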
Policies are promoted through environments: dev (Audit) → staging (mixed) → production (Enforce).
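The promotion path amounts to choosing an action per environment. The sketch below shows one possible reading of it. The source says only that staging is "mixed"; enforcing High-severity policies there while auditing the rest is an assumption made for illustration, and `validation_action` is a hypothetical helper.

```bash
# validation_action ENVIRONMENT SEVERITY
# Hypothetical helper: prints the action a policy would take in a given
# environment. dev and production follow the promotion line above; the
# staging split by severity is an ASSUMPTION, not documented behaviour.
validation_action() {
  case "$1" in
    dev)        echo "Audit" ;;
    production) echo "Enforce" ;;
    staging)
      case "$2" in
        High) echo "Enforce" ;;   # assumed: promote strict rules first
        *)    echo "Audit" ;;
      esac ;;
    *) echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Print the action for a High and a Low severity policy in each environment.
for env in dev staging production; do
  printf '%-11s High=%-8s Low=%s\n' "$env" \
    "$(validation_action "$env" High)" \
    "$(validation_action "$env" Low)"
done
```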
Policies validate what is allowed. Shadow Metrics validate what makes sense.
A claim that passes every Kyverno policy can still be wrong for the workload. Shadow Metrics evaluate runtime data (traffic volume, latency percentiles, utilization) and annotate claims with risk signals:
| Rule | What It Checks | Signal |
|---|---|---|
| Database sizing | Is the DB size right for traffic? | Request rate |
| Region latency | Is the region optimal for users? | P95 latency |
| Cost efficiency | Is the resource over-provisioned? | CPU utilization |
| HA requirement | Should HA be enabled for this SLO? | Error rate |
See docs/shadow-metrics.md and platform-api/shadow-metrics/README.md.
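To make the idea concrete, the sketch below turns one runtime signal into a risk annotation, loosely following the cost-efficiency rule. The 20% threshold, the function name, and the annotation key `shadow-metrics.example/cost-efficiency` are all hypothetical; the actual rules are defined by the ShadowMetricRule resources and may work differently.

```bash
# cost_efficiency_signal CPU_PERCENT
# Hypothetical rule: flags a resource as over-provisioned when its average
# CPU utilization (an integer percentage) is below an assumed 20% threshold.
cost_efficiency_signal() {
  if [ "$1" -lt 20 ]; then
    echo "over-provisioned"
  else
    echo "ok"
  fi
}

signal=$(cost_efficiency_signal 12)

# Print (rather than run) the annotation command this signal would lead to.
# The annotation key is a placeholder, not one the platform is known to use.
echo kubectl annotate databaseinstanceclaim checkout-db -n checkout \
  "shadow-metrics.example/cost-efficiency=$signal" --overwrite
```

The claim itself stays valid throughout; the annotation only records that the runtime data disagrees with the requested size.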
| Component | Version | Notes |
|---|---|---|
| Crossplane | v2.2.0 | Namespaced XRs, Pipeline mode only |
| ArgoCD | v3.3.6 | Server-side apply, fine-grained RBAC |
| Kyverno | chart 3.7.1 | CEL policies promoted to v1 |
| External Secrets Operator | chart 2.2.0 | IRSA/WI/OIDC auth |
| OpenTelemetry Operator | latest | v1beta1 Collector CRs |
| kube-prometheus-stack | chart 72.3.0 | Prometheus + Grafana |
| OpenCost | chart 1.46.0 | Team-label cost allocation |
| Upbound Providers | v1.17.0 | AWS, GCP, Azure |
make test # Run all 10 test suites
make test-yaml # YAML lint (208 files)
make test-shell # Shellcheck (13 scripts)
make test-kyverno # Kyverno CLI policy tests (25 assertions)
make test-xrd # XRD schema validation (150 assertions)
make test-compositions # Composition structure (231 assertions)
make test-golden-path # Golden path examples (18 assertions)
make test-observability # OTel/Prometheus/Grafana (15 assertions)
make test-eso # External Secrets (12 assertions)
make test-scale # 100+ claims validation (5 assertions)
make test-structure # File tree completeness
- Read the golden path: golden-path/examples/claim-database.yaml — 9 lines
- Read the XRD: platform-api/xrds/database-instance.yaml — the API contract
- See three clouds: Compare platform-api/compositions/aws/database.yaml vs gcp/ vs azure/
- Test a policy violation: kyverno apply policies/kyverno/cluster-policies/region-enforcement.yaml --resource golden-path/examples/claim-database-WILL-FAIL.yaml
- See the semantic gap: Read the annotation on policies/kyverno/cluster-policies/size-caps.yaml
- Explore Shadow Metrics: platform-api/shadow-metrics/README.md
- Bootstrap on a kind cluster: ./bootstrap/install.sh --provider aws --dry-run
See DEMO.md for the 6-beat on-stage walkthrough (~10 minutes).
Apache 2.0 — see LICENSE.