
feat: self-hosted LLM platform on EKS (Phases 1-7-stub)#1434

Merged
Smana merged 26 commits into main from wip/self-hosted-llm-platform-draft on May 14, 2026

Conversation

@Smana Smana (Owner) commented Apr 30, 2026

Summary

Self-hosted LLM platform on EKS with cascade routing, scale-to-zero,
guardrails, and nightly evals — landed across 7 phases on this branch
plus SPEC-001 production-realistic autoscaling (folded in
2026-05-07).

Posture: end-to-end validated against a live cluster on
2026-05-07. Inference, scaling, observability, and routing all verified.

What's in the box

  • Phase 1 4016b226 — gpu-l4 Karpenter NodePool + Bottlerocket Accelerated EC2NodeClass
  • Phase 2 c60df37c/8b8c76b4/19c6b108 — KEDA + HTTP add-on (later replaced by SPEC-001), vLLM Production Stack, vLLM Semantic Router (Iris)
  • Phase 3 5d50245d — InferenceService Crossplane composition (KCL): vLLM Deployment + Service + ServiceAccount + KEDA ScaledObject (prometheus triggers per SPEC-001) + optional HTTPRoute + default-deny CiliumNetworkPolicy + read-only EPI for S3 weights + ExternalSecrets + VMServiceScrape + per-model VMRule + idempotent preload Job (CL-3 rec A)
  • Phase 4 e1bd5548 — xplane-llm-models S3 bucket + writable preload IAM
  • Phase 5 0d569425 — 4 model claims (Qwen2.5-Coder-7B / Qwen3-8B / Qwen2.5-Coder-1.5B-Base FIM / LlamaGuard 3-1B), Hybrid routing (CL-1 C), public HTTPRoute llm.${private_domain_name}
  • Phase 6 ba8f2bab — OpenWebUI App XR claim → chat.${private_domain_name}
  • Phase 7-stub 63dc9f2d — Promptfoo nightly CronJob (CL-4 A) + platform VMRules + ADR-0003 (vLLM PS over KServe + llm-d). Comprehensive Grafana dashboard shipped 2026-05-07.
  • SPEC-001 — Replace KEDA HTTP add-on with KEDA ScaledObject (prometheus triggers on leading vLLM saturation metrics: running/max-num-seqs + gpu_cache_usage_perc). All models default min=1. Direct AI Gateway → vLLM (no proxy hop). See docs/specs/0001-llm-platform-prometheus-autoscaling/.
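
For skimmers, a minimal sketch of the ScaledObject shape the composition renders under SPEC-001. This is illustrative only: the model name, the max-num-seqs denominator, and the thresholds are assumptions, not the exact rendered output (the composition's main.k is authoritative).

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: xplane-qwen3-8b
  namespace: llm
spec:
  scaleTargetRef:
    name: xplane-qwen3-8b   # the vLLM Deployment
  minReplicaCount: 1        # SPEC-001 default; sidesteps the prometheus scale-from-zero deadlock
  maxReplicaCount: 2        # illustrative cap
  cooldownPeriod: 300       # damped scale-down (CL-5)
  triggers:
    # Leading signal 1: hottest-replica running/max-num-seqs ratio (32 assumed here).
    - type: prometheus
      metadata:
        serverAddress: http://vmsingle-victoria-metrics-k8s-stack.observability.svc.cluster.local:8428
        query: max(vllm:num_requests_running{model="xplane-qwen3-8b"}) / 32
        threshold: "0.7"    # matches the 0.94 > 0.7 firing seen in the e2e run
    # Leading signal 2: KV-cache utilisation.
    - type: prometheus
      metadata:
        serverAddress: http://vmsingle-victoria-metrics-k8s-stack.observability.svc.cluster.local:8428
        query: max(vllm:gpu_cache_usage_perc{model="xplane-qwen3-8b"})
        threshold: "0.9"    # assumed threshold
```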

CL decisions ratified

| CL | Decision | Where wired |
|---|---|---|
| CL-1 | C — Hybrid routing | `router.mode: hybrid` |
| CL-2 | A — LlamaGuard category route (real post-filter middleware deferred upstream) | LlamaGuard direct claim |
| CL-3 | A — composition-rendered preload Job | InferenceService `main.k` |
| CL-4 | A — Nightly Promptfoo (`0 2 * * *` Europe/Paris) | `tooling/base/promptfoo/cronjob.yaml` |
| CL-5 | A — App stays CPU-only, separate XRD | InferenceService is its own composition |
| CL-6 | A — `nvidia.com/gpu: 4` cap | `gpu-l4-nodepool.yaml` |
| CL-7 | Resolved — comprehensive LLM dashboard shipped 2026-05-07 (23 panels under the apps/llm folder) | `apps/base/ai/llm/grafana-dashboard.yaml` |
| CL-8 | A — S3 + EPI (rustfs reassessment) | Phase 4 manifests; rustfs deferred |
| SPEC-001 | Drop KEDA HTTP add-on; KEDA prometheus on leading signals | `infrastructure/base/crossplane/configuration/kcl/inference-service/main.k` (composition v0.6.0, LoRA adapters) |

End-to-end validation (2026-05-07)

| Test | Result |
|---|---|
| FIM `/v1/completions` via Tailscale → AI Gateway → vLLM | ✅ HTTP 200 in 0.18s |
| qwen-coder `/v1/completions` (warm) | ✅ HTTP 200 in 0.29s |
| qwen3-8b `/v1/chat/completions` (warm) | ✅ HTTP 200 in 0.27s |
| llamaguard3-1b `/v1/completions` (warm) | ✅ HTTP 200 in 0.15s |
| OpenWebUI chat round-trip (qwen3-8b, 238KB response) | ✅ HTTP 200 |
| Scale-up 1→2 under 30 concurrent qwen-coder requests | ✅ at T+75s; leading running-ratio trigger fired (0.94 > 0.7), `num_requests_waiting` stayed 0 |
| Scale-down 2→1 after `cooldownPeriod=300s` of inactivity | ✅ at exactly T+5min |
| Cache-util trigger wiring (single 16k-context request) | ✅ KEDA polls the metric; threshold not breached on L4 (expected) |
| VictoriaMetrics scraping vLLM `/metrics` | ✅ verified via MCP queries (`vllm:num_requests_running`, `gpu_cache_usage_perc`) |
| Promptfoo redirect to AI Gateway | ⏳ re-running |

Notable design choices

  • InferenceService kind, no X prefix — matches App / SQLInstance / EPI repo convention
  • Direct Bucket manifests instead of App XR s3Bucket.enabled
  • Per-claim read EPI + shared writable EPI
  • SPEC-001 — Leading-indicator scaling — running-vs-batch ratio + KV cache (not request-rate) catches saturation BEFORE the queue forms
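
To make the claim surface concrete, a hypothetical minimal InferenceService claim. Field names (`model`, `scaling`) are inferred from this description; treat the schema as approximate and defer to the XRD.

```yaml
apiVersion: cloud.ogenki.io/v1alpha1
kind: InferenceService        # namespaced claim, no X prefix (repo convention)
metadata:
  name: xplane-qwen3-8b
  namespace: llm
spec:
  model: Qwen/Qwen3-8B        # HF repo id assumed
  scaling:
    minReplicas: 1            # SPEC-001 default
    maxReplicas: 2
```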

Test plan

  • kcl fmt clean
  • kcl test 22/22 PASS (inference-service)
  • kustomize build apps/mycluster-0 + tooling/mycluster-0
  • kubeconform on rendered manifests
  • trivy 0 misconfigurations
  • ./scripts/validate-kcl-compositions.sh stages 1-2 pass
  • Live e2e validation (2026-05-07 cluster): T027–T030 all 4 models, scale-up + scale-down via SPEC-001 triggers, OpenWebUI round-trip, vmagent metrics flow
  • T034 OpenWebUI UAT — chat.priv.cloud.ogenki.io reachable, qwen3-8b chat works
  • Grafana dashboards — comprehensive 23-panel LLM Platform dashboard (CL-7 unblocked)
  • Post-merge: T020 aws iam simulate-principal-policy against the live EPI roles
  • Post-merge: T040–T041 regression-inject + cost-panel cross-check
  • Post-merge: strip the -pr1434 suffix from composition source URLs (crossplane-inference-service:0.6.0-pr1434 → 0.6.0, crossplane-app:0.1.10-pr1434 → 0.1.10) once CI publishes clean tags

Deferred (with clear triggers)

  • Real LlamaGuard post-filter middleware — needs upstream SR feature or app-layer middleware
  • vLLM Semantic Router cascade for /v1/chat/completions — SR currently exposes only the classifier endpoint on :8080. Promptfoo redirected to AI Gateway directly until MoM cascade lands.
  • Shared kcl/_lib/ module — refactor touches App too
  • Shared S3 read EPI — needs EPI XRD enhancement for multi-SA bindings
  • Two-layer GPU model cache (EBS snapshot + hostPath shared across pods)
  • LlamaGuard post-filter sampling by risk class

Comment thread tooling/base/promptfoo/cronjob.yaml Fixed (3 resolved threads)
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch from 9beba10 to 9f455df on April 30, 2026 19:16
Comment on lines +129 to +141
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-service:aggregate-to-crossplane
  labels:
    rbac.crossplane.io/aggregate-to-crossplane: "true"
rules:
  - apiGroups: ["keda.sh"]
    resources: ["scaledobjects", "scaledobjects/status"]
    verbs: ["*"]
  - apiGroups: ["batch"]
    resources: ["jobs", "jobs/status"]
    verbs: ["*"]
```
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch 2 times, most recently from 692e301 to f2dd9ac on May 2, 2026 05:25
Comment thread opentofu/llm-platform/filesystem.tf Fixed
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch 3 times, most recently from 27c8841 to 33aaf65 on May 4, 2026 19:53
Smana added a commit that referenced this pull request May 5, 2026
1. infrastructure/base/crossplane/providers/additional-rbac.yaml:138-141
   (CKV_K8S_49 — wildcard verbs on the new
   inference-service:aggregate-to-crossplane ClusterRole). Replace
   `verbs: ["*"]` with the explicit 7-verb list (get, list, watch,
   create, update, patch, delete). Functionally equivalent for
   Crossplane SA; satisfies least-privilege. Pre-existing wildcards
   on the older ClusterRoles in this file aren't in the PR diff so
   they weren't flagged — keeping them as-is to avoid unrelated churn.

2. opentofu/llm-platform/filesystem.tf:1 (CKV2_AWS_5 — SG not attached).
   False positive: the SG IS attached via
   aws_s3files_mount_target.az.security_groups (line ~47), but
   Checkov doesn't grok the newer aws_s3files_mount_target resource.
   Suppressed with `# checkov:skip=CKV2_AWS_5:<reason>` inside the
   resource block.

3-5. tooling/base/promptfoo/cronjob.yaml:9
   - CKV_K8S_43 (image not digest-pinned): pinned 0.106.0 to
     sha256:e10e5e2d0ae9a73ec10883672448506c0bf11db443fcab1afb5f461968a5616e
     (verified via skopeo).
   - CKV_K8S_40 (high UID): bumped runAsUser+fsGroup 1001 → 10001
     to avoid host-system UID collision. Promptfoo doesn't share
     volumes with other workloads (ConfigMap + emptyDir only),
     so the UID change is contained.
   - CKV_K8S_15 (imagePullPolicy): IfNotPresent → Always.

Verified locally with `checkov 3.2.517`: cronjob 86/0, rbac new
ClusterRole PASSED, filesystem SG SKIPPED with reason.
Comment on lines +1 to +8
resource "aws_security_group" "mount_targets" {
# checkov:skip=CKV2_AWS_5:SG is attached via aws_s3files_mount_target.az.security_groups (line ~47). Checkov doesn't recognize the newer aws_s3files_mount_target resource, so it emits a false positive — the SG is not orphaned.
name = "${var.filesystem_name}-mount-targets"
description = "Allow NFS (2049/TCP) from EKS worker nodes to S3 Files mount targets."
vpc_id = data.terraform_remote_state.network.outputs.vpc_id

tags = merge(var.tags, { Name = "${var.filesystem_name}-mount-targets" })
}
Smana added a commit that referenced this pull request May 5, 2026
Brainstorm output for fixing task #78 root cause. The Envoy ext_proc
+ cilium-envoy approach is structurally blocked by:
  1. SR v0.2.0 hard-coding clearRouteCache=false in
     buildRequestBodyContinueResponse — defeats Envoy's body-callback
     header-mutation re-routing.
  2. cilium-envoy's slim build (no envoy.filters.http.lua) — kills
     the standard "Lua filter calls clearRouteCache after ext_proc"
     workaround. Verified empirically: listener rejected with
     "Didn't find a registered implementation".
  3. cilium.l7policy filter on upstream filter chains — denies
     traffic to per-model EDS clusters with 403 even from
     CNP-allowed sources.

The design replaces the entire CEC + ext_proc chain with a small
custom HTTP proxy (~250 LOC Go) deployed in the llm namespace. The
proxy reads the body's model field directly and:
  - For client-deterministic (model: xplane-*): fast path, forward
    to that Service. No SR roundtrip.
  - For SR-classified (model: MoM): call SR's HTTP classify API,
    rewrite body.model, forward. Same UX as the broken ext_proc
    path but actually works.

Both OpenCode subagent dispatch (per-agent model assignment) AND
OpenWebUI MoM auto-routing flow through the same proxy. Single
provider URL stays for all clients — no client-side changes needed.

Spec sections cover goal/SC, architecture, component design,
streaming behavior, deployment plan, phased rollout (P0-P7), risks
(SSE, single-point-of-failure, SR endpoint contract), and explicit
out-of-scope (no auth/cache/circuit-breaking — the proxy is a thin
forwarder, not a control plane).

Implementation plan ships separately. Targets follow-on PR after
#1434 merges.
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch from d979049 to a371010 on May 6, 2026 05:07
Smana added a commit that referenced this pull request May 6, 2026
Pivot from "drop-in replacement" framing to "foundation, not replacement"
after honest evaluation of open-weights model quality vs frontier APIs in
2026 and verification that L40S (g6e) is not offered in eu-west-3.

Architecture trim:
- Drop InferencePool + EPP per model (zero value at min=0/max=1)
- Cancel the Go llm-router-proxy bandaid (was working around ext_proc bugs)
- Drop CEC + ext_proc body-rewrite path entirely
- Drop Phi-4-mini claim (redundant with Qwen3-8B; KEDA prom can't scale-from-0)
- Wire KEDA HTTP add-on universally for scale-from-zero on the model layer
- Iris becomes a sidecar HTTP classifier (no ext_proc); AI Gateway calls it
  for `model: MoM` requests, sets x-ai-eg-model header, routes natively

Cost: ~$1.3k/mo -> ~$220-250/mo idle (1× L4 spot for FIM only). ~80% cut.

Composition: InferenceService KCL bumps to v0.4.0 — gates the EPP rendering
behind an opt-in `spec.routing.endpointPicker.enabled` flag (default false)
so multi-replica serving can re-introduce EPP without rewriting claims.

Future-upgrade paths captured separately in docs/llm-platform-future-paths.md
(Qwen3-Coder-30B-A3B variants on L4 AWQ-4bit, L40S in eu-central-1, TP=4 on
g6.12xlarge, EPP re-introduction, claude-bridge relay).

Supersedes 2026-05-04-coding-llm-fleet-design.md ("drop-in replacement"
framing) and explicitly cancels 2026-05-05-llm-router-proxy-{design,plan}.md.
Smana added a commit that referenced this pull request May 6, 2026
39 tasks across 7 phases (~6 commits worth of work) trimming PR #1434
to the foundation-showcase shape per the 2026-05-06 design doc.

Phases:
1. KCL composition v0.4.0 — KEDA prom ScaledObject -> KEDA HTTP add-on
   HTTPScaledObject; drop EPP from default CNP ingress; add 2 kcl tests
2. AIGatewayRoute rewrite — backendRef -> keda-add-ons-http-interceptor-
   proxy with URLRewrite Host filter; ReferenceGrant in keda namespace
3. Iris ext_proc removal — drop EnvoyExtensionPolicy; classifier stays
   as HTTP sidecar
4. Subtractive cleanup — drop apps/base/ai/llm/inference-pools/, the
   phi4-mini claim, and the cancelled router-proxy spec docs
5. Model claim sanity check — qwen-coder + qwen3-8b drop to min=0
6. Documentation reframe — README LLM section, coding-clients.md, new
   docs/llm-platform-future-paths.md
7. Final validation — full kustomize build + kubeconform + trivy + kcl
   test pass; PR #1434 description rewrite via gh

Each task has concrete code, exact commands, expected outputs.
Self-review confirms spec coverage (SC-1 through SC-8), no
placeholders, type/name consistency across tasks.
Smana added a commit that referenced this pull request May 6, 2026
Replace KEDA prometheus scaler (`ScaledObject`) with KEDA HTTP add-on
(`HTTPScaledObject`) for the `minReplicas==0` branch. The prometheus
scaler deadlocks at min=0 (no pod -> no `vllm:num_requests_waiting`
metric -> no scale signal); the HTTP add-on queues the first request
on the keda-http-interceptor and signals scale-up directly.

Drop the EPP / InferencePool allow rule from `_defaultIngress` —
PR #1434's foundation-showcase trim removes the InferencePool layer
at `min=0/max=1` (it adds no value with one pod per model). Add an
allow rule for the KEDA HTTP add-on interceptor in the `keda`
namespace.

Tests added (main_test.k):
- `test_http_scaled_object_when_min_zero`
- `test_no_epp_in_default_ingress`
- `test_keda_scale_to_zero` (semantics refreshed)

Refs design doc docs/superpowers/specs/2026-05-06-oss-llm-foundation-showcase-design.md.
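
For context, the HTTPScaledObject this commit introduced looked roughly like the sketch below (host and names illustrative, following the v0.8-era add-on schema). SPEC-001 later removed this mechanism wholesale in favour of prometheus triggers with min=1.

```yaml
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: xplane-qwen-coder
  namespace: llm
spec:
  hosts:
    - xplane-qwen-coder.llm.svc.cluster.local   # assumed internal host
  scaleTargetRef:
    name: xplane-qwen-coder
    kind: Deployment
    apiVersion: apps/v1
    service: xplane-qwen-coder
    port: 8000
  replicas:
    min: 0   # the interceptor queues the first request and wakes the Deployment
    max: 1
```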
Smana added a commit that referenced this pull request May 6, 2026
Subtractive trim of PR #1434 per the foundation-showcase design.

Removed:
- apps/base/ai/llm/inference-pools/ (5× InferencePool + EPP HelmReleases,
  CNPs, kustomization) — InferencePool/EPP add no value at min=0/max=1.
  Re-introduction path documented in docs/llm-platform-future-paths.md.
- apps/base/ai/llm/phi4-mini.yaml — redundant with Qwen3-8B (KEDA prom
  scale-from-zero deadlock made Phi-4-mini unreachable anyway).
- docs/superpowers/specs/2026-05-05-llm-router-proxy-{design,plan}.md —
  the Go router-proxy was a bandaid for ext_proc bugs. AI Gateway native
  routing + Iris HTTP sidecar (Phase 3) makes the proxy unnecessary.

Updated to drop dangling phi4-mini references:
- infrastructure/base/vllm-semantic-router/helmrelease.yaml — drop the
  phi4-mini vllm_endpoint and its model_config entry; default_model
  switched to xplane-qwen3-8b (was xplane-phi4-mini, the now-deleted
  small-general claim).
- infrastructure/base/crossplane/configuration/examples/inferenceservice-basic.yaml —
  example name swapped from xplane-phi4-mini to xplane-qwen3-8b-basic.
- apps/base/openwebui/app.yaml — comment refreshed to describe the new
  AIGatewayRoute → keda-http-interceptor path (was: SR ext_proc +
  InferencePool/EPP).

Net: 2,805 deletions, 18 insertions.
Smana added a commit that referenced this pull request May 7, 2026
3-artifact SDD spec for replacing the KEDA HTTP add-on with KEDA
prometheus-trigger ScaledObjects on leading vLLM saturation signals
(num_requests_running / max-num-seqs ratio, gpu_cache_usage_perc).

Drops the proxy hop from the data path, defaults minReplicas=1, and
makes scaling decisions react before users feel degradation rather
than after queue depth fires (lagging signal).

Folds into PR #1434. Brainstorming captured in clarifications.md
(CL-1 lagging→leading, CL-2 min=1, CL-3 Knative deferred, CL-4 vLLM
Production Stack deferred, CL-5 cooldown 300s rationale).
Comment on lines +9 to +136
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: promptfoo
  namespace: promptfoo
  labels:
    app.kubernetes.io/name: promptfoo
    app.kubernetes.io/part-of: ai
spec:
  schedule: "0 2 * * *"
  timeZone: "Europe/Paris"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  startingDeadlineSeconds: 3600
  jobTemplate:
    spec:
      backoffLimit: 1
      # 7 days — long enough for `failedJobsHistoryLimit: 5` to retain
      # 5 daily failures for inspection. A shorter TTL (e.g. 24h)
      # would delete failed Jobs before the history window populated
      # and silently defeat `failedJobsHistoryLimit`.
      ttlSecondsAfterFinished: 604800
      template:
        metadata:
          labels:
            app.kubernetes.io/name: promptfoo
            app.kubernetes.io/part-of: ai
        spec:
          restartPolicy: Never
          automountServiceAccountToken: false
          securityContext:
            seccompProfile: { type: RuntimeDefault }
            runAsNonRoot: true
            runAsUser: 10001
            fsGroup: 10001
          volumes:
            - name: suite
              configMap: { name: promptfoo-eval-suite }
            - name: workdir
              emptyDir: { sizeLimit: 256Mi }
          containers:
            - name: promptfoo
              image: ghcr.io/promptfoo/promptfoo:0.106.0@sha256:e10e5e2d0ae9a73ec10883672448506c0bf11db443fcab1afb5f461968a5616e
              imagePullPolicy: Always
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -euo pipefail

                  cd /work
                  cp /suite/promptfooconfig.yaml ./promptfooconfig.yaml

                  START=$(date +%s)
                  promptfoo eval --no-progress-bar --max-concurrency 4 --output /work/results.json || true
                  END=$(date +%s)
                  DURATION=$((END - START))

                  # Per-category pass rates from JSON results. The promptfoo
                  # image is node-based (no jq) so we parse with `node -e`
                  # instead of jq. Promptfoo's JSON schema places per-test
                  # metadata at one of several paths depending on version;
                  # the chained `||` alternation tries the documented paths
                  # and falls back to "unknown" so a schema drift doesn't
                  # silently drop metrics.
                  # `$${...}` escapes — Flux postBuild substitution would
                  # otherwise consume the JS template literal placeholders.
                  node -e '
                  const r = require("/work/results.json");
                  const tests = (r.results && r.results.results) || [];
                  const g = {};
                  for (const t of tests) {
                    const c = (t.testCase && t.testCase.metadata && t.testCase.metadata.category)
                      || (t.testCase && t.testCase.vars && t.testCase.vars.category)
                      || (t.vars && t.vars.category)
                      || "unknown";
                    if (!g[c]) g[c] = {total: 0, failed: 0};
                    g[c].total++;
                    if (!t.success) g[c].failed++;
                  }
                  const out = [];
                  for (const [c, x] of Object.entries(g)) {
                    out.push(`promptfoo_test_total{category="$${c}"} $${x.total}`);
                    out.push(`promptfoo_test_failed{category="$${c}"} $${x.failed}`);
                    out.push(`promptfoo_test_pass_rate{category="$${c}"} $${(x.total - x.failed) / x.total}`);
                  }
                  console.log(out.join("\n"));
                  ' > /work/metrics.prom

                  # Total run duration (overall health gauge).
                  # `$${VAR}` escapes — Flux postBuild envsubst would
                  # otherwise consume these bash vars as Flux substitutions.
                  echo "promptfoo_run_duration_seconds $${DURATION}" >> /work/metrics.prom
                  echo "promptfoo_run_timestamp_seconds $${END}" >> /work/metrics.prom

                  echo "=== metrics.prom ==="
                  cat /work/metrics.prom
                  echo "==="

                  # Push to VictoriaMetrics.
                  curl --fail --silent --show-error \
                    -X POST \
                    -H 'Content-Type: text/plain' \
                    --data-binary @/work/metrics.prom \
                    'http://vmsingle-victoria-metrics-k8s-stack.observability.svc.cluster.local:8428/api/v1/import/prometheus'

                  echo "Pushed metrics to VictoriaMetrics."
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
                runAsNonRoot: true
                capabilities: { drop: ["ALL"] }
                seccompProfile: { type: RuntimeDefault }
              resources:
                requests: { cpu: "200m", memory: "512Mi" }
                limits: { cpu: "1", memory: "1Gi" }
              env:
                - name: HOME
                  value: /work
              # OPENAI_API_KEY for the AI Gateway SecurityPolicy (B-1).
              # Promptfoo's openai provider auto-detects this env var
              # and sends `Authorization: Bearer <value>`.
              envFrom:
                - secretRef:
                    name: promptfoo-llm-api-key
              volumeMounts:
                - { name: suite, mountPath: /suite, readOnly: true }
                - { name: workdir, mountPath: /work }
```
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch from 91e8b4a to 679d31c on May 9, 2026 10:23
Smana added a commit that referenced this pull request May 9, 2026
Self-review pass on PR #1434:

- KEDA `keda-metrics-server` egress: 8429 → 8428. vmsingle's chart-default
  service port is 8428 (matches httproute-vmsingle.yaml backendRef and the
  composition `_DEFAULTS.prometheus_server_address`). 8429 is vmagent.
  Cilium would have silently dropped the trigger query under default-deny.
  (Corrected rule sketched below.)
- Add `keda-operator` egress to vmsingle:8428 for activation polling so
  future `scaling.minReplicas: 0` claims (XRD-supported demo override) can
  actually wake. Inert for the default min=1 fleet.
- promptfoo cronjob: emit `promptfoo_test_schema_unknown_total` counter so
  upstream JSON-schema rotations surface in metrics instead of being
  hidden under the `category="unknown"` bucket.
- inference-service main.k: document `max(num_requests_running)` aggregation
  intent — hottest-replica saturation, not fleet average. cooldownPeriod
  dampens scale-down noise.

Validation:
- kcl fmt + kcl test . -Y settings-example.yaml: 24/24 PASS
- kubeconform on edited YAMLs: 0 errors
- `.github/workflows/crossplane-modules.yml`: rewrite kcl.mod's version
  to the PR-suffixed publish version before `kcl mod push` (the push
  command ignores the OCI tag in the URL and uses kcl.mod's `version`
  field as the actual published tag). Mirror the same suffix in the
  composition-source audit step so PR runs don't fail on the bare
  kcl.mod version. Lowercase the GHCR repo owner. Drive Dockerfile
  GO_VERSION from go.mod.
- `.pre-commit-config.yaml`: exclude the vendored Envoy Gateway CRDs
  from `check-added-large-files` — the schema is ~2 MiB total; the chart
  can't be installed via HelmRelease (1 MiB Helm-release Secret cap), so
  the rendered CRDs are committed.
- `.trivyignore.yaml`: skip CKV2_AWS_5 false-positive on the S3 Files
  mount-target SG (Checkov doesn't recognize the newer
  `aws_s3files_mount_target` resource yet).
- `.gitignore`: ignore `*.tfplan` / `out.tfplan` and the local
  `.claude/scheduled_tasks.lock` (transient state, blocks Terramate).
- `.secrets.baseline`: refresh after adding LLM-platform manifests with
  pragma-allowlisted ESO `secretKey` references.
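
Roughly what the corrected KEDA egress looks like as a CiliumNetworkPolicy fragment. Selectors and the policy name are illustrative; the real rule lives in the KEDA base's network-policy.yaml.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: keda-operator
  namespace: keda
spec:
  endpointSelector:
    matchLabels:
      app: keda-operator                     # assumed label
  egress:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: observability
            app.kubernetes.io/name: vmsingle # assumed label
      toPorts:
        - ports:
            - port: "8428"                   # vmsingle chart-default port; 8429 is vmagent
              protocol: TCP
```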
Smana added 24 commits May 9, 2026 19:11
…ing decisions

Establishes the spec-driven-development tooling that this PR ships under:

- `.claude/rules/` — path-scoped rules auto-loaded by the editor when
  touching Crossplane KCL, OpenTofu, observability, network-policies,
  or spec artifacts. Captures the repeat traps the LLM-platform
  first-deploy session surfaced (DNS L7 inspection, link-local
  entities, post-creation dict mutation, container-vs-pod
  securityContext split).
- `.claude/skills/validate/references/cross-artifact-rules.md` — V2
  validation rules referenced by the `/validate` skill.
- `CLAUDE.md` — project-level Claude Code guidance updated for the LLM
  platform opt-in gate, KEDA prometheus autoscaling, and rule
  cross-references.
- `docs/decisions/` — ADR-0003 (vLLM Production Stack vs KServe) and
  ADR-0004 (Amazon S3 Files for model weights storage), the two new
  cross-cutting decisions; index README links them.
- `docs/specs/README.md` — V2 plan.md validation-path rule.
- `docs/superpowers/{specs,plans}/` — design+plan pairs feeding the
  in-tree specs (coding-LLM fleet, AI gateway redesign, foundation
  showcase, paths 7+8 LoRA + per-tenant FinOps).
…architecture diagram

- SPEC-001 (`docs/specs/0001-llm-platform-prometheus-autoscaling/`):
  switch InferenceService autoscaling from KEDA HTTP add-on (proxy in
  data path, lagging request-count trigger) to KEDA prometheus on
  leading vLLM signals (running/max-num-seqs ratio + KV-cache util).
  `min=1` default eliminates the scale-from-zero deadlock with prometheus
  triggers. Spec artifacts include `spec.md` (WHAT), `plan.md` (HOW + 21
  tasks + 4-persona review checklist), and the append-only
  `clarifications.md` (CL-1..CL-7 — including the post-validation port
  fix `8429 → 8428` and the e2e validation walkthrough).
- `docs/llm-platform-future-paths.md` — paths 1–8 future-paths doc
  (LoRA serving, per-tenant FinOps, GPU node bin-packing, etc.) with
  paths 7+8 marked as the next slice.
- `docs/architecture/` — `llm-platform.drawio` source-of-truth diagram
  + README walkthrough of the request flow (Tailscale Gateway →
  SecurityPolicy → AIGatewayRoute → AIServiceBackend → vLLM Service).
- `crds-envoy-gateway.yaml` — Envoy Gateway 1.7.0 CRDs rendered as a
  single release-asset file (Backend, BackendTLSPolicy,
  ClientTrafficPolicy, EnvoyExtensionPolicy, EnvoyPatchPolicy,
  EnvoyProxy, HTTPRouteFilter, SecurityPolicy, BackendTrafficPolicy,
  AIGatewayRoute, AIServiceBackend). Vendored because the chart can't
  ship them via HelmRelease — Helm release Secret cap is 1 MiB; the
  rendered schemas total ~2 MiB.
- `kustomization-inference-extension.yaml` — Gateway API Inference
  Extension v1.0.0 CRDs (InferencePool / InferenceModel) sourced
  upstream via Flux Kustomization, kept for forward compatibility even
  though the AI gateway redesign no longer routes through them.
- `crds/base/kustomization.yaml` — wires both into the cluster CRD bundle.
…-in script gate

- `opentofu/config.tm.hcl`: bump cilium / karpenter / flux versions for
  May 2026.
- `opentofu/workflows.tm.hcl`: introduce the `--no-tags=opt-in` /
  `--tags=opt-in` filter convention so opt-in stacks (currently
  `llm-platform`) are skipped by default and require an explicit
  invocation to deploy / preview / destroy.
- `opentofu/eks/{init,configure}/workflows.tm.hcl`: refine the
  two-stage bootstrap orchestration scripts.
- `opentofu/eks/init/kubernetes.tf` + `helm_values/cilium.yaml`: wire
  cilium configuration tweaks needed for the LLM platform's data plane
  (CEC support — `envoyConfig.enabled: true`).
- `opentofu/eks/{init,configure}/variables.tf`: expose the variables
  the newer Cilium / Flux versions need.
…n stack)

New Terramate stack tagged `opt-in` so it's skipped by default — must
be explicitly enabled with `TM_LLM_PLATFORM_ENABLED=true` (mirrors the
Flux umbrella's `spec.suspend: true` gate). The stack provisions the
AWS resources every InferenceService claim depends on:

- `aws_s3files_file_system.models` — S3-backed POSIX filesystem for
  model weights (NFSv4 over an underlying S3 bucket; the bucket survives
  filesystem recreation, so re-bootstrap reuses already-cached weights).
- `aws_s3files_mount_target.az` — one mount target per private subnet
  / AZ. Pods land on the same-AZ mount target (cross-AZ NFS works but
  adds latency + transfer cost).
- `aws_s3files_access_point.shared` — single `/models` access point
  with posix uid:gid 1001:1001; per-claim subPath isolation handled at
  mount time.
- `aws_iam_role.s3files_service` — the S3 Files service role (allows
  `s3files.amazonaws.com` to read/write the underlying S3 bucket).
- `aws_iam_role` for the EFS CSI driver — bound via EKS Pod Identity
  to the controller + node SAs, granting AmazonEFSCSIDriverPolicy +
  AmazonS3FilesCSIDriverPolicy.
- Output `volume_handle` (`s3files:<fs>::<ap>`) — copied into
  `apps/base/ai/llm/models-pvc.yaml` to bind the in-cluster PV to the
  filesystem.

The IAM and access-point roles deliberately live in OpenTofu (durable),
while the S3 Files filesystem can be torn down + recreated cheaply.
The Flux side at `clusters/mycluster-0-llm-platform/` is suspended by
default — both gates must release for an end-to-end deploy.
…espaces

- `flux/sources/`: pin the Helm/OCI repositories the LLM platform draws
  from — Envoy Gateway, Envoy AI Gateway (controller + CRDs), KEDA,
  AWS EFS CSI driver, Iris vllm-semantic-router, Gateway API Inference
  Extension, InferencePool. `ocirepo-karpenter.yaml` bumped alongside
  the platform-wide karpenter version pin in `opentofu/config.tm.hcl`.
- `namespaces/base/`: `llm`, `envoy-gateway-system`, and
  `envoy-ai-gateway-system` namespaces created early so default-deny
  CiliumNetworkPolicies and ExternalSecrets can reference them before
  any HelmRelease lands. Wired into `namespaces/base/kustomization.yaml`.
… PSS-compatible securityContext

Pinned to chart 4.1.0 (driver 3.1.0) — first line that supports
S3 Files access points (volumeHandle `s3files:<fs>::<ap>`). Renovate
will surface chart-version bumps as PRs so changes are reviewed before
they touch a CSI driver mounting model weights.

Resource sizing tuned for the LLM platform's load profile:
- Controller: 50m/256Mi requests, 512Mi limit.
- Node: 100m/512Mi requests, 1Gi limit. 256Mi was OOM-killed under
  parallel preload Jobs all calling NodeStage/NodePublish on the
  same shared S3 Files mount; OOM left the kernel mount alive but
  broke the chart's nfs4 watchdog (new mounts then returned EACCES).

Pod-level `securityContext.seccompProfile.type: RuntimeDefault` only —
the chart's `efs-plugin` container needs `privileged: true` for kernel
mounts (incompatible with `allowPrivilegeEscalation: false` and
pod-level `runAsNonRoot: true`). The chart's defaults already lock
down the support containers (csi-provisioner, liveness-probe).

`storageclass.yaml`: hardened `storageClasses: []` since static PVs
(per-InferenceService) are the model — no dynamic provisioning here.
KEDA 2.18.0 deployed in the new `keda` namespace, configured for
restricted PSS:
- `helmrelease.yaml` — operator + admission-webhooks + metrics-apiserver
  with explicit per-component securityContext (each block fully restated
  per the upstream chart's deep-merge semantics).
- `network-policy.yaml` — default-deny on `keda-operator`,
  `keda-operator-metrics-apiserver`, and `keda-admission-webhooks`.
  Egress to vmsingle:8428 (prometheus trigger queries),
  kube-apiserver, and DNS. Ingress only from kube-apiserver
  (admission webhooks + external metrics API).
- `additional-rbac.yaml` (Crossplane providers) — aggregate ClusterRole
  granting Crossplane SA `keda.sh/scaledobjects` patch + delete verbs so
  the InferenceService composition can render ScaledObject managed
  resources.
- `activation-policy.yaml` — installs the KEDA CRDs the composition
  references.
- `karpenter-nodepools-gpu/`: dedicated NodePool + EC2NodeClass for
  NVIDIA L4 instances (`g6.xlarge` / `g6.2xlarge` — on-demand for the
  base capacity, spot for burst), labeled `gpu=l4` and tainted
  `nvidia.com/gpu=Exists:NoSchedule` so only LLM workloads schedule.
  AMI: Bottlerocket NVIDIA — exposes the GPU natively, no
  device-plugin DaemonSet required. NodePool `nodes` cap = 4 (decision
  CL-6 in SPEC-001's clarifications log). Sketched below.
- `karpenter-nodepools/`: bump default NodePool / EC2NodeClass to keep
  in lockstep with the GPU pool's API version + LLM-platform-friendly
  taints.
- `runtimeclass-nvidia/`: RuntimeClass `nvidia` referenced by the
  InferenceService composition's vLLM Deployment so containers wire
  through the NVIDIA container runtime.
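
A rough sketch of the gpu-l4 NodePool described above. Requirement keys, the GPU-quantity limit standing in for the node cap (one L4 per g6.xlarge/2xlarge), and the EC2NodeClass name are illustrative; gpu-l4-nodepool.yaml is authoritative.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-l4
spec:
  limits:
    nvidia.com/gpu: 4             # CL-6 cap (1 GPU per node on g6.xlarge/2xlarge)
  template:
    metadata:
      labels:
        gpu: l4
    spec:
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule      # only LLM workloads tolerate this
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g6.xlarge", "g6.2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-l4              # Bottlerocket NVIDIA AMI family
```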
… CNPs

- `helmrelease.yaml` — Envoy Gateway 1.7.0 controller (provides the
  `GatewayClass` Envoy AI Gateway consumes). Restricted-PSS-compatible
  per-component securityContext blocks. Watches the cluster for
  `Gateway` / `HTTPRoute` / `SecurityPolicy` resources targeting its
  GatewayClass.
- `network-policy.yaml` — default-deny CiliumNetworkPolicy for the
  controller and the data-plane proxy spawned per Gateway. Allows xDS
  from data-plane proxy back to controller (ports 18000-18002),
  ingress from kubelet for probes, ingress from in-cluster apps to
  the data-plane proxy on :8080, and egress to the API server.
…e-fronted HTTPRoute

The OpenAI-compatible LLM ingress: clients hit
`https://llm.priv.cloud.ogenki.io/v1/...` over Tailscale, the
SecurityPolicy authenticates the request, the AIGatewayRoute body
parser sets `x-ai-eg-model` from the request body, and the route
dispatches to the matching vLLM Service.

- `helmrelease.yaml` + `helmrelease-crds.yaml` — Envoy AI Gateway
  controller (v0.5.0) — direct routing per AIServiceBackend, no proxy
  hop, no in-data-plane interceptor.
- `gatewayclass.yaml` + `gateway.yaml` + `envoyproxy.yaml` — single
  shared `ai-gateway` Gateway, ClientTrafficPolicy + EnvoyProxy spec
  scoped to the LLM listener.
- `httproute.yaml` — public HTTPRoute on the Tailscale-general Gateway
  pointing `llm.priv.cloud.ogenki.io` at the AI Gateway data-plane
  Service.
- `security-policy.yaml` + `api-keys-externalsecret.yaml` —
  `apiKeyAuth` SecurityPolicy comparing the `Authorization` header
  against the values in the `ai-gateway-api-keys` Secret. Envoy
  Gateway strips the `Bearer ` scheme before comparison, so the
  ESO-rendered Secret stores the raw API key (not Bearer-prefixed).
  Source of truth: AWS Secrets Manager `platform/llm/api-keys` (a
  JSON object keyed by client identity — `openwebui_apikey`,
  `promptfoo_apikey`); seeded out-of-band so the keys survive
  cluster recreation.
- `network-policy.yaml` — default-deny on both the controller and the
  data-plane proxy. Egress to vLLM Services in `llm/`, the semantic
  router in `llm/`, kube-apiserver, and DNS. Ingress from Cilium
  Gateway, in-cluster apps, kubelet probes.
- `gapi/platform-tailscale-general-gateway.yaml` — extend the Tailscale
  general Gateway with the `llm` HTTPRoute listener.
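
Putting the pieces together, a dispatch rule on the parsed model name looks roughly like this. Field names follow upstream Envoy AI Gateway examples (recent releases use parentRefs; older ones used spec.targetRefs), and the backend/gateway names are assumptions.

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-models
  namespace: llm
spec:
  schema:
    name: OpenAI                    # OpenAI-compatible body parsing
  parentRefs:
    - name: ai-gateway
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model   # set by the body parser from the JSON "model" field
              value: xplane-qwen3-8b
      backendRefs:
        - name: xplane-qwen3-8b     # AIServiceBackend wrapping the vLLM Service
```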
The `MoM` (mixture-of-models) virtual model dispatcher. When clients
send `model: MoM`, the AIGatewayRoute extension calls SR's HTTP
classifier (`POST /api/v1/classify/intent` on :8080); SR returns the
chosen `xplane-<name>` model id; the AI Gateway body parser rewrites
`x-ai-eg-model` and dispatches.

- `helmrelease.yaml` — vllm-semantic-router 0.0.x with signal-fusion
  routing (keyword + context-length signals). PII classifier disabled
  (upstream chart bug); semantic_cache disabled (poisons on failed
  upstreams). Memory bumped to 4Gi (OOMKilled at 512Mi).
- `network-policy.yaml` — default-deny CNP. Egress to vLLM Services,
  HuggingFace (one-shot model download for the BERT classifier
  cache; FQDN-allowlist with full subdomain depth), DNS via L7-aware
  kube-dns rule, vmagent for metrics. Ingress from envoy-gateway-system
  (the AIGatewayRoute extension's HTTP classifier client) and from
  the promptfoo namespace for evals.
New `cloud.ogenki.io/v1alpha1 InferenceService` XR + KCL composition
that templates a single vLLM model claim into 9 managed resources:

- vLLM `Deployment` on the GPU NodePool with the model-name + spec
  baked into args (`--model`, `--enable-tool-call`, `--tool-call-parser
  hermes`, `--enable-lora` + `--lora-modules` when `loraAdapters` is
  non-empty), running with restricted PSS securityContext
  (runAsUser=1000 to match the vLLM image's /etc/passwd).
- `Service` (ClusterIP, port 8000) — the OpenAI-compatible target the
  AIGatewayRoute dispatches to.
- KEDA `ScaledObject` with prometheus triggers on leading vLLM signals
  (running/max-num-seqs ratio + KV-cache util, queries against
  vmsingle:8428). `min=1` default eliminates the prometheus-trigger
  scale-from-zero deadlock; cooldownPeriod=300s for damped scale-down.
  Per SPEC-001.
- Two `CiliumNetworkPolicy` resources (vLLM ingress from the AI
  Gateway data plane only; preload Job egress to HuggingFace).
- ServiceMonitor + VMRule for the vLLM metrics scrape.
- Preload `Job` (one-shot) that downloads the model + LoRA adapters
  from HuggingFace into `/models/<name>/` and `/models/loras/<name>/`
  on the shared S3 Files mount; uses a marker file to skip re-download
  on bootstrap. Fast-path guards against partial xet-cache downloads.
- ExternalSecret pulling the HF token from AWS SM.

Composition includes:
- `main_test.k` — unit tests asserting resource counts + naming +
  security context + LoRA conditional emission + preload skip-marker.
- `README.md` + `settings-example.yaml` + `examples/` (basic and
  complete claims).
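
As a sketch of how the LoRA path surfaces in a claim (extending the minimal claim shown earlier; the loraAdapters field shape is an assumption based on this description):

```yaml
apiVersion: cloud.ogenki.io/v1alpha1
kind: InferenceService
metadata:
  name: xplane-qwen-coder
  namespace: llm
spec:
  model: Qwen/Qwen2.5-Coder-7B-Instruct   # HF repo id assumed
  loraAdapters:                           # non-empty list gates --enable-lora + --lora-modules
    - name: xplane-qwen-coder-sql-dpo
    - name: xplane-qwen-coder-securecode
  scaling:
    minReplicas: 1
    maxReplicas: 2
```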
….10)

Two new fields needed by OpenWebUI's claim:

- `deploymentStrategy` — `RollingUpdate` (default) or `Recreate`. The
  composition emits ONLY the matching strategy block (KCL inline
  conditional, no post-creation dict mutation per function-kcl issue
  #285). Recreate is required by OpenWebUI because its data PVC is RWO
  on the default gp3 StorageClass — RollingUpdate's maxSurge would
  spawn the new pod before the old one releases the volume,
  triggering `Multi-Attach error for volume`.
- `extraVolumes` + `extraVolumeMounts` — pass-through for arbitrary
  Volume / VolumeMount entries. OpenWebUI uses them to mount its
  `openwebui-data` PVC at `/app/backend/data` so the SQLite DB +
  uploaded files survive pod restarts.
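
Roughly how OpenWebUI's claim consumes the two new fields; the placement is a sketch against the XRD described above, not the exact schema.

```yaml
apiVersion: cloud.ogenki.io/v1alpha1
kind: App
metadata:
  name: xplane-openwebui
  namespace: apps
spec:
  deploymentStrategy: Recreate      # RWO PVC: RollingUpdate would hit Multi-Attach
  extraVolumes:
    - name: openwebui-data
      persistentVolumeClaim:
        claimName: openwebui-data
  extraVolumeMounts:
    - name: openwebui-data
      mountPath: /app/backend/data  # SQLite DB + uploads survive restarts
```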
`xplane-llm-models-preload` EPI with a writable IAM policy scoped to
the underlying S3 bucket of the S3 Files filesystem (read+write —
the preload Job needs to write model weights). Bound to the
`xplane-llm-models-preload` ServiceAccount in the `llm` namespace
which the InferenceService composition's preload Job uses.

`epis/kustomization.yaml` references `epis-llm` so the new EPI lands
on top of the existing platform EPI bundle.
`clusters/mycluster-0/security/eks-pod-identities.yaml` overlays the
EPI namespace into the cluster's overlay.
Two unrelated teardown-safety fixes that the LLM-platform branch
surfaced (the multiple destroy/recreate cycles exercised both):

1. `managementPolicies` without `Delete` on three stateful Buckets —
   `cnpg-backups`, `openbao-snapshot`, `xplane-harbor`-bound bucket
   (orphaning on cluster destroy preserves the data + finalizers don't
   hang). Crossplane v2 namespaced MRs do not expose
   `spec.deletionPolicy`; `managementPolicies` is the v2 mechanism.
   Plus the existing platform principle of no DeleteBucket IAM grants.

2. `security/base/zitadel/sqlinstance.yaml` — frozen-dated-snapshot
   recovery pattern so a cluster rebuild can re-bootstrap Zitadel from
   the prior snapshot (the bootstrap field is immutable post-create).

3. `scripts/eks-prepare-destroy.sh` — pre-clean Envoy AI Gateway +
   InferencePool + KEDA CRDs, drop kyverno + cilium-operator validating
   webhooks early, and unblock teardown on degraded clusters where
   admission can race the destroy ordering.

4. `scripts/terramate-destroy-confirm.sh` — single y/N prompt at the
   start of `terramate script run --reverse destroy` so the operator
   confirms once instead of per-stack.
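
For reference, the orphan-on-destroy pattern on one of those Buckets, as a hedged sketch (the namespaced v2 group shown is an example). The load-bearing part is that managementPolicies omits Delete.

```yaml
apiVersion: s3.aws.m.upbound.io/v1beta1   # namespaced Crossplane v2 MR group (example)
kind: Bucket
metadata:
  name: cnpg-backups
  namespace: infrastructure               # assumed namespace
spec:
  # No "Delete": cluster teardown orphans the bucket, so the data survives
  # and finalizers can't hang the destroy.
  managementPolicies: ["Create", "Observe", "Update", "LateInitialize"]
  forProvider:
    region: eu-west-3
```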
…y routes + LLM SLO rules

The 4 base InferenceService claims + 2 LoRA adapters + supporting
infra:

- `models-pvc.yaml` — static PV+PVC binding the cluster to the S3
  Files filesystem provisioned in opentofu/llm-platform. The
  `volumeHandle` (`s3files:<fs>::<ap>`) is updated manually after
  every `tofu apply` (header comment calls out the sync; see the PV
  sketch after this list).
- `s3-bucket.yaml` — the underlying S3 bucket Crossplane manages
  alongside the filesystem (deletion-protected via
  `managementPolicies` without Delete; bucket survives filesystem
  recreation so model weights are reused).
- `qwen-coder.yaml`, `qwen-coder-fim.yaml`, `qwen3-8b.yaml`,
  `llamaguard3-1b.yaml` — InferenceService claims. `xplane-qwen-coder`
  enables LoRA with two adapters (`xplane-qwen-coder-sql-dpo`,
  `xplane-qwen-coder-securecode`).
- `ai-gateway-routes/route.yaml` — AIGatewayRoute matching `model:
  xplane-<name>` headers (incl. LoRA adapter model names which route
  to the qwen-coder backend).
- `hf-token-externalsecret.yaml` — HuggingFace token for the preload
  Job, sourced from AWS SM.
- `preload-serviceaccount.yaml` — SA bound by the
  `xplane-llm-models-preload` EPI.
- `grafana-folder.yaml` + `grafana-dashboard.yaml` — co-located LLM
  platform dashboard (23 panels: per-model TTFT, request rate, error
  rate, GPU util, KEDA scale events, vLLM cache util, etc.).
- `vmrule-llm-slo.yaml` — 3 SLO alerts (TTFT p95, error rate, request
  saturation).
- `apps/llm/kustomization.yaml` — overlay-only Kustomization (gated by
  the LLM umbrella; not wired into `apps/mycluster-0/` which would
  bypass the suspend gate).
- `apps/mycluster-0/kustomization.yaml` — references OpenWebUI which
  is not LLM-gated (frontend-only; works whether the LLM stack is
  resumed or not, just shows no models).
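
The PV sketch referenced in the models-pvc item above. The filesystem and access-point IDs are placeholders; the volumeHandle format (s3files:<fs>::<ap>) is the piece copied from the tofu output.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-models
spec:
  capacity:
    storage: 500Gi                # illustrative size
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""            # static binding, no dynamic provisioning
  csi:
    driver: efs.csi.aws.com
    volumeHandle: "s3files:fs-0123456789abcdef0::fsap-0123456789abcdef0"  # placeholder IDs
```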
…+ LLM API key

`xplane-openwebui` App XR: a single-replica OpenWebUI v0.5.20 in the
`apps` namespace, fronted by `chat.priv.cloud.ogenki.io` over
Tailscale. Talks OpenAI-compatible HTTP to the AI Gateway data plane.

- `app.yaml` — App claim. Strategy=Recreate (RWO PVC; RollingUpdate
  multi-attach error). `securityContext.readOnlyRootFilesystem: false`
  required (writes to `/app/backend/data`). Mounts `openwebui-data`
  PVC at `/app/backend/data` so the SQLite admin DB + chat history +
  uploaded files survive restarts. `automountServiceAccountToken: false`.
  Env vars: `OPENAI_API_BASE_URL` → AI Gateway data plane,
  `OPENAI_API_KEY` from the ESO-rendered `openwebui-llm-api-key`
  Secret, OAuth (Zitadel) creds, etc.
- `pvc.yaml` — `openwebui-data` 5Gi gp3 PVC.
- `externalsecret-llm-api-key.yaml` — pulls the raw `openwebui_apikey`
  from AWS SM `platform/llm/api-keys`. The OpenAI client inside
  OpenWebUI prepends `Bearer ` to this value before sending the
  Authorization header.
- `externalsecret-oauth-zitadel.yaml` — OIDC client_id +
  client_secret from Zitadel for OpenWebUI's "Sign in with Zitadel".
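
A hedged sketch of the API-key ExternalSecret: the store name and target secretKey are assumptions, while the remote key and property follow the text above.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: openwebui-llm-api-key
  namespace: apps
spec:
  secretStoreRef:
    name: clustersecretstore         # assumed store name
    kind: ClusterSecretStore
  target:
    name: openwebui-llm-api-key
  data:
    - secretKey: OPENAI_API_KEY      # consumed via env by OpenWebUI
      remoteRef:
        key: platform/llm/api-keys
        property: openwebui_apikey   # raw key; the client prepends "Bearer "
```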
…for LLM platform

- `vmrules/ai.yaml` — alert rules for the AI namespace
  (vllm-semantic-router availability, classifier latency, AI Gateway
  data-plane availability).
- `vmrules/kustomization.yaml` — wires `ai.yaml` into the cluster
  VMRule bundle.
- `vmservicecrapes/vllm-semantic-router.yaml` — VMServiceScrape for
  the semantic router's :8080/metrics. Wired via
  `vmservicecrapes/kustomization.yaml`.
- `loggen/helmrelease.yaml` — postRenderer to strip pod-level
  container security fields the upstream chart emits incorrectly
  (chart segments by component but doesn't deep-merge — the fix
  matches the path-scoped rule in `.claude/rules/spec-constitution.md`
  about replace-not-merge securityContext semantics).
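
A sketch of the semantic-router scrape for orientation; the label selector and port name are assumptions.

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: vllm-semantic-router
  namespace: observability
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm-semantic-router   # assumed service label
  namespaceSelector:
    matchNames: ["llm"]
  endpoints:
    - port: metrics               # assumed port name for :8080/metrics
      path: /metrics
```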
…-tenant FinOps

Nightly Promptfoo evaluation suite that exercises every model
(including LoRA adapters) against the AI Gateway and emits results
as Prometheus metrics for SLO tracking.

- `namespace.yaml` — `promptfoo` namespace.
- `cronjob.yaml` — fires at 02:00 Europe/Paris. Runs Promptfoo against
  the AI Gateway with `xplane-qwen3-8b` (default model) + targeted
  probes of `xplane-qwen-coder-fim`, `xplane-qwen-coder-sql-dpo`,
  `xplane-qwen-coder-securecode`. Node-based JSON-to-Prometheus
  parser (replaced jq for portability). Pushes to vmsingle's
  `/api/v1/import/prometheus`. Tracks `promptfoo_test_schema_unknown_total`
  to surface fixture drift instead of silently absorbing it. All
  Flux postBuild substitution markers escaped (`$${VAR}`) so the
  bash + JS template literals survive postBuild.
- `eval-suite-configmap.yaml` — test cases pinned via ConfigMap.
- `externalsecret-api-key.yaml` — `promptfoo_apikey` from AWS SM
  `platform/llm/api-keys`. The eval container prepends `Bearer ` to
  this raw value before sending the Authorization header.
- `network-policy.yaml` — default-deny + egress to AI Gateway data
  plane (envoy-gateway-system :8080), vllm-semantic-router (for
  classifier probes), vmsingle (push), DNS, kubelet ingress for
  probes.
- `kustomization.yaml` — wires the lot.
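
To give a feel for the eval suite's shape, a hypothetical slice of promptfooconfig.yaml. The provider id format and assertion types follow promptfoo's config conventions; the endpoint and the category label come from this PR, everything else is illustrative.

```yaml
prompts:
  - "{{prompt}}"
providers:
  - id: openai:chat:xplane-qwen3-8b       # default model probed via the gateway
    config:
      apiBaseUrl: https://llm.priv.cloud.ogenki.io/v1
      # OPENAI_API_KEY comes from the promptfoo-llm-api-key Secret (envFrom)
tests:
  - description: basic chat sanity
    metadata:
      category: general                   # category label feeds the per-category metrics
    vars:
      prompt: "Reply with the single word: pong"
    assert:
      - type: icontains
        value: pong
```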
Add the LLM Platform group to the homepage portal with a single
chatbot link (chat.priv.cloud.ogenki.io). Internal API surface +
Grafana dashboards + Promptfoo eval results live one click deeper
under the existing Observability + Apps groups.

`tooling/mycluster-0/kustomization.yaml` wires the homepage update
into the cluster overlay.
…uster wiring

The Flux gate that pairs with `opentofu/llm-platform`'s opt-in
Terramate gate. Both must be released for an end-to-end deploy.

- `clusters/mycluster-0/llm-platform.yaml` — umbrella Flux
  Kustomization with `spec.suspend: true` (default). Points at
  `clusters/mycluster-0-llm-platform/`. Manual `flux resume
  kustomization llm-platform -n flux-system` releases the gate.
  Sketched after this list.
- `clusters/mycluster-0-llm-platform/` — sibling directory (NOT
  under `clusters/mycluster-0/`) so `flux-system`'s recursive sync
  doesn't auto-discover the children and bypass the umbrella's
  suspend. Contains 8 child Flux Kustomizations:
    - infrastructure-vllm-semantic-router
    - infrastructure-runtimeclass-nvidia
    - infrastructure-gpu-nodepools (Karpenter NodePool)
    - infrastructure-envoy-gateway
    - infrastructure-envoy-ai-gateway
    - apps-llm (InferenceService claims + OpenWebUI route)
    - security-llm-epi (writable EKS Pod Identity)
    - tooling-promptfoo (nightly evals, gated under the same umbrella)
- `clusters/mycluster-0-llm-platform/README.md` — operator runbook:
  enable/suspend/teardown procedures + the AWS SM `platform/llm/api-keys`
  bootstrap (kept outside OpenTofu so the keys survive cluster
  recreation).
- `clusters/mycluster-0/infrastructure/infrastructure.yaml` — wires
  KEDA + EFS + Envoy Gateway controller into the platform-wide
  infrastructure Kustomization (these are needed even without the LLM
  gate released).
- `clusters/mycluster-0/security/eks-pod-identities.yaml` — wire the
  `epis-llm` overlay (writable preload EPI lives there).
- `infrastructure/mycluster-0/kustomization.yaml` — references the
  new base directories (aws-efs-csi-driver, keda).
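
The umbrella gate itself, sketched as a Flux Kustomization (interval and prune are illustrative; suspend: true and the sibling path are the parts described above).

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: llm-platform
  namespace: flux-system
spec:
  suspend: true                   # gate closed by default; `flux resume` releases it
  interval: 10m
  path: ./clusters/mycluster-0-llm-platform
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```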
…ce composition

Extend the KCL validator to also lint+test the new
`infrastructure/base/crossplane/configuration/kcl/inference-service/`
composition (4-stage: kcl fmt, syntax, render, security checks).
…master-plan drafts

- `README.md` — new "LLM Platform" section in the project overview.
  Briefly describes the OpenAI-compatible API (Bearer-token auth, 4
  base models + 2 LoRA adapters), OpenWebUI for chat, OpenCode +
  Continue for IDE.
- `docs/ai.md` — narrative architecture doc: routing modes
  (client-deterministic vs SR cascade), latency budget, 9-component
  diagram, security model (apiKeyAuth + ForwardClientIDHeader +
  sanitize=true), observability surface, request-flow walkthrough.
  The `Bearer-prefix` description matches the actual ESO template
  (raw-key Secret; Envoy strips the scheme before comparison).
- `docs/coding-clients.md` — copy-paste configs for OpenCode + Continue
  + curl + verification recipes; lists the model fleet.
- `docs/technology-choices.md` — KEDA added to the technology stack
  table (autoscaling layer).
- `docs/plans/self-hosted-llm-platform/` — parked exploration drafts
  (the original doc challenged + master plan + spec/plan/clarifications
  drafts); kept for context but superseded by `docs/specs/0001-*`.
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch from b57b77a to 4a31a33 on May 9, 2026 17:20
CKV_K8S_49 on inference-service:aggregate-to-crossplane: wildcard verbs
match the established pattern across additional-rbac.yaml; narrowing
would break composition reconciliation.

CKV_K8S_35 on promptfoo CronJob: openai provider auto-discovers
OPENAI_API_KEY from env; mounting as a file would require an entrypoint
wrapper and regress readOnlyRootFilesystem. Secret is single-key and
short-lived (CronJob, ttl=7d).
@Smana Smana merged commit 376ad20 into main May 14, 2026
14 checks passed
@Smana Smana deleted the wip/self-hosted-llm-platform-draft branch May 14, 2026 06:57