feat: self-hosted LLM platform on EKS (Phases 1-7-stub) #1434
Merged
Comment on lines +129 to +141

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-service:aggregate-to-crossplane
  labels:
    rbac.crossplane.io/aggregate-to-crossplane: "true"
rules:
  - apiGroups: ["keda.sh"]
    resources: ["scaledobjects", "scaledobjects/status"]
    verbs: ["*"]
  - apiGroups: ["batch"]
    resources: ["jobs", "jobs/status"]
    verbs: ["*"]
```
Smana added a commit that referenced this pull request on May 5, 2026
1. infrastructure/base/crossplane/providers/additional-rbac.yaml:138-141
(CKV_K8S_49 — wildcard verbs on the new
inference-service:aggregate-to-crossplane ClusterRole). Replace
`verbs: ["*"]` with the explicit 7-verb list (get, list, watch,
create, update, patch, delete); see the sketch after this list.
Functionally equivalent for the Crossplane SA; satisfies
least-privilege. Pre-existing wildcards on the older ClusterRoles
in this file aren't in the PR diff, so they weren't flagged —
keeping them as-is to avoid unrelated churn.
2. opentofu/llm-platform/filesystem.tf:1 (CKV2_AWS_5 — SG not attached).
False positive: the SG IS attached via
aws_s3files_mount_target.az.security_groups (line ~47), but
Checkov doesn't grok the newer aws_s3files_mount_target resource.
Suppressed with `# checkov:skip=CKV2_AWS_5:<reason>` inside the
resource block.
3-5. tooling/base/promptfoo/cronjob.yaml:9
- CKV_K8S_43 (image not digest-pinned): pinned 0.106.0 to
sha256:e10e5e2d0ae9a73ec10883672448506c0bf11db443fcab1afb5f461968a5616e
(verified via skopeo).
- CKV_K8S_40 (high UID): bumped runAsUser+fsGroup 1001 → 10001
to avoid host-system UID collision. Promptfoo doesn't share
volumes with other workloads (ConfigMap + emptyDir only),
so the UID change is contained.
- CKV_K8S_15 (imagePullPolicy): IfNotPresent → Always.
Verified locally with `checkov 3.2.517`: cronjob 86/0, rbac new
ClusterRole PASSED, filesystem SG SKIPPED with reason.
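
A minimal sketch of item 1's narrowed rule (apiGroups and resources copied from the flagged ClusterRole above; only the verbs change):

```yaml
# Sketch: same rule with the wildcard replaced by the explicit 7-verb list.
rules:
  - apiGroups: ["keda.sh"]
    resources: ["scaledobjects", "scaledobjects/status"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["batch"]
    resources: ["jobs", "jobs/status"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```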
Comment on lines +1 to +8

```hcl
resource "aws_security_group" "mount_targets" {
  # checkov:skip=CKV2_AWS_5:SG is attached via aws_s3files_mount_target.az.security_groups (line ~47). Checkov doesn't recognize the newer aws_s3files_mount_target resource, so it emits a false positive — the SG is not orphaned.
  name        = "${var.filesystem_name}-mount-targets"
  description = "Allow NFS (2049/TCP) from EKS worker nodes to S3 Files mount targets."
  vpc_id      = data.terraform_remote_state.network.outputs.vpc_id

  tags = merge(var.tags, { Name = "${var.filesystem_name}-mount-targets" })
}
```
Smana added a commit that referenced this pull request on May 5, 2026
Brainstorm output for fixing task #78 root cause. The Envoy ext_proc + cilium-envoy approach is structurally blocked by:
1. SR v0.2.0 hard-coding clearRouteCache=false in buildRequestBodyContinueResponse — defeats Envoy's body-callback header-mutation re-routing.
2. cilium-envoy's slim build (no envoy.filters.http.lua) — kills the standard "Lua filter calls clearRouteCache after ext_proc" workaround. Verified empirically: listener rejected with "Didn't find a registered implementation".
3. cilium.l7policy filter on upstream filter chains — denies traffic to per-model EDS clusters with 403 even from CNP-allowed sources.
The design replaces the entire CEC + ext_proc chain with a small custom HTTP proxy (~250 LOC Go) deployed in the llm namespace. The proxy reads the body's model field directly and:
- For client-deterministic (model: xplane-*): fast path, forward to that Service. No SR roundtrip.
- For SR-classified (model: MoM): call SR's HTTP classify API, rewrite body.model, forward.
Same UX as the broken ext_proc path but actually works. Both OpenCode subagent dispatch (per-agent model assignment) AND OpenWebUI MoM auto-routing flow through the same proxy. A single provider URL stays for all clients — no client-side changes needed.
Spec sections cover goal/SC, architecture, component design, streaming behavior, deployment plan, phased rollout (P0-P7), risks (SSE, single point of failure, SR endpoint contract), and explicit out-of-scope (no auth/cache/circuit-breaking — the proxy is a thin forwarder, not a control plane). Implementation plan ships separately. Targets a follow-on PR after #1434 merges.
Smana added a commit that referenced this pull request on May 6, 2026
Pivot from "drop-in replacement" framing to "foundation, not replacement"
after honest evaluation of open-weights model quality vs frontier APIs in
2026 and verification that L40S (g6e) is not offered in eu-west-3.
Architecture trim:
- Drop InferencePool + EPP per model (zero value at min=0/max=1)
- Cancel the Go llm-router-proxy bandaid (was working around ext_proc bugs)
- Drop CEC + ext_proc body-rewrite path entirely
- Drop Phi-4-mini claim (redundant with Qwen3-8B; KEDA prom can't scale-from-0)
- Wire KEDA HTTP add-on universally for scale-from-zero on the model layer
- Iris becomes a sidecar HTTP classifier (no ext_proc); AI Gateway calls it
for `model: MoM` requests, sets x-ai-eg-model header, routes natively
Cost: ~$1.3k/mo -> ~$220-250/mo idle (1× L4 spot for FIM only). ~80% cut.
Composition: InferenceService KCL bumps to v0.4.0 — gates the EPP rendering
behind an opt-in `spec.routing.endpointPicker.enabled` flag (default false;
see the sketch below) so multi-replica serving can re-introduce EPP without
rewriting claims.
Future-upgrade paths captured separately in docs/llm-platform-future-paths.md
(Qwen3-Coder-30B-A3B variants on L4 AWQ-4bit, L40S in eu-central-1, TP=4 on
g6.12xlarge, EPP re-introduction, claude-bridge relay).
Supersedes 2026-05-04-coding-llm-fleet-design.md ("drop-in replacement"
framing) and explicitly cancels 2026-05-05-llm-router-proxy-{design,plan}.md.
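
A sketch of how the v0.4.0 opt-in would read on a claim, assuming the field path from the commit message (the surrounding claim fields are illustrative, not the actual XRD schema):

```yaml
# Hedged sketch of the v0.4.0 opt-in on an InferenceService claim.
apiVersion: cloud.ogenki.io/v1alpha1
kind: InferenceService
metadata:
  name: xplane-qwen3-8b
  namespace: llm
spec:
  routing:
    endpointPicker:
      enabled: true   # default false; re-enables EPP rendering for multi-replica serving
```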
Smana added a commit that referenced this pull request on May 6, 2026
39 tasks across 7 phases (~6 commits worth of work) trimming PR #1434 to the foundation-showcase shape per the 2026-05-06 design doc. Phases:
1. KCL composition v0.4.0 — KEDA prom ScaledObject -> KEDA HTTP add-on HTTPScaledObject; drop EPP from default CNP ingress; add 2 kcl tests
2. AIGatewayRoute rewrite — backendRef -> keda-add-ons-http-interceptor-proxy with URLRewrite Host filter; ReferenceGrant in keda namespace
3. Iris ext_proc removal — drop EnvoyExtensionPolicy; classifier stays as HTTP sidecar
4. Subtractive cleanup — drop apps/base/ai/llm/inference-pools/, the phi4-mini claim, and the cancelled router-proxy spec docs
5. Model claim sanity check — qwen-coder + qwen3-8b drop to min=0
6. Documentation reframe — README LLM section, coding-clients.md, new docs/llm-platform-future-paths.md
7. Final validation — full kustomize build + kubeconform + trivy + kcl test pass; PR #1434 description rewrite via gh
Each task has concrete code, exact commands, expected outputs. Self-review confirms spec coverage (SC-1 through SC-8), no placeholders, type/name consistency across tasks.
Smana added a commit that referenced this pull request on May 6, 2026
Replace KEDA prometheus scaler (`ScaledObject`) with KEDA HTTP add-on (`HTTPScaledObject`) for the `minReplicas==0` branch. The prometheus scaler deadlocks at min=0 (no pod -> no `vllm:num_requests_waiting` metric -> no scale signal); the HTTP add-on queues the first request on the keda-http-interceptor and signals scale-up directly.
Drop the EPP / InferencePool allow rule from `_defaultIngress` — PR #1434's foundation-showcase trim removes the InferencePool layer at `min=0/max=1` (it adds no value with one pod per model). Add an allow rule for the KEDA HTTP add-on interceptor in the `keda` namespace.
Tests added (main_test.k):
- `test_http_scaled_object_when_min_zero`
- `test_no_epp_in_default_ingress`
- `test_keda_scale_to_zero` (semantics refreshed)
Refs design doc docs/superpowers/specs/2026-05-06-oss-llm-foundation-showcase-design.md.
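
A hedged sketch of the HTTPScaledObject the min==0 branch would render, following the upstream KEDA HTTP add-on example schema (names and port taken from the qwen3-8b claim; exact fields depend on the installed add-on version):

```yaml
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: xplane-qwen3-8b
  namespace: llm
spec:
  hosts:
    - xplane-qwen3-8b.llm.svc.cluster.local
  scaleTargetRef:
    name: xplane-qwen3-8b       # vLLM Deployment
    kind: Deployment
    apiVersion: apps/v1
    service: xplane-qwen3-8b    # ClusterIP Service the interceptor forwards to
    port: 8000
  replicas:
    min: 0    # interceptor queues the first request and signals scale-up
    max: 1
```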
Smana added a commit that referenced this pull request on May 6, 2026
Subtractive trim of PR #1434 per the foundation-showcase design. Removed:
- apps/base/ai/llm/inference-pools/ (5× InferencePool + EPP HelmReleases, CNPs, kustomization) — InferencePool/EPP add no value at min=0/max=1. Re-introduction path documented in docs/llm-platform-future-paths.md.
- apps/base/ai/llm/phi4-mini.yaml — redundant with Qwen3-8B (the KEDA prom scale-from-zero deadlock made Phi-4-mini unreachable anyway).
- docs/superpowers/specs/2026-05-05-llm-router-proxy-{design,plan}.md — the Go router-proxy was a bandaid for ext_proc bugs. AI Gateway native routing + Iris HTTP sidecar (Phase 3) makes the proxy unnecessary.
Updated to drop dangling phi4-mini references:
- infrastructure/base/vllm-semantic-router/helmrelease.yaml — drop the phi4-mini vllm_endpoint and its model_config entry; default_model switched to xplane-qwen3-8b (was xplane-phi4-mini, the now-deleted small-general claim).
- infrastructure/base/crossplane/configuration/examples/inferenceservice-basic.yaml — example name swapped from xplane-phi4-mini to xplane-qwen3-8b-basic.
- apps/base/openwebui/app.yaml — comment refreshed to describe the new AIGatewayRoute → keda-http-interceptor path (was: SR ext_proc + InferencePool/EPP).
Net: 2,805 deletions, 18 insertions.
Smana added a commit that referenced this pull request on May 7, 2026
3-artifact SDD spec for replacing the KEDA HTTP add-on with KEDA prometheus-trigger ScaledObjects on leading vLLM saturation signals (num_requests_running / max-num-seqs ratio, gpu_cache_usage_perc). Drops the proxy hop from the data path, defaults minReplicas=1, and makes scaling decisions react before users feel degradation rather than after queue depth fires (lagging signal). Folds into PR #1434. Brainstorming captured in clarifications.md (CL-1 lagging→leading, CL-2 min=1, CL-3 Knative deferred, CL-4 vLLM Production Stack deferred, CL-5 cooldown 300s rationale).
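
A hedged sketch of the leading-signal trigger described above (the query shape, threshold, and the max-num-seqs value of 32 are illustrative; the composition renders the real values from main.k):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: xplane-qwen3-8b
  namespace: llm
spec:
  scaleTargetRef:
    name: xplane-qwen3-8b
  minReplicaCount: 1     # CL-2: min=1 avoids the prometheus scale-from-zero deadlock
  maxReplicaCount: 4
  cooldownPeriod: 300    # CL-5: damped scale-down
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://vmsingle-victoria-metrics-k8s-stack.observability.svc.cluster.local:8428
        # Hottest-replica saturation: running requests vs --max-num-seqs.
        query: max(vllm:num_requests_running{model="xplane-qwen3-8b"}) / 32
        threshold: "0.8"
```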
Comment on lines +9 to +136

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: promptfoo
  namespace: promptfoo
  labels:
    app.kubernetes.io/name: promptfoo
    app.kubernetes.io/part-of: ai
spec:
  schedule: "0 2 * * *"
  timeZone: "Europe/Paris"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  startingDeadlineSeconds: 3600
  jobTemplate:
    spec:
      backoffLimit: 1
      # 7 days — long enough for `failedJobsHistoryLimit: 5` to retain
      # 5 daily failures for inspection. A shorter TTL (e.g. 24h)
      # would delete failed Jobs before the history window populated
      # and silently defeat `failedJobsHistoryLimit`.
      ttlSecondsAfterFinished: 604800
      template:
        metadata:
          labels:
            app.kubernetes.io/name: promptfoo
            app.kubernetes.io/part-of: ai
        spec:
          restartPolicy: Never
          automountServiceAccountToken: false
          securityContext:
            seccompProfile: { type: RuntimeDefault }
            runAsNonRoot: true
            runAsUser: 10001
            fsGroup: 10001
          volumes:
            - name: suite
              configMap: { name: promptfoo-eval-suite }
            - name: workdir
              emptyDir: { sizeLimit: 256Mi }
          containers:
            - name: promptfoo
              image: ghcr.io/promptfoo/promptfoo:0.106.0@sha256:e10e5e2d0ae9a73ec10883672448506c0bf11db443fcab1afb5f461968a5616e
              imagePullPolicy: Always
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -euo pipefail

                  cd /work
                  cp /suite/promptfooconfig.yaml ./promptfooconfig.yaml

                  START=$(date +%s)
                  promptfoo eval --no-progress-bar --max-concurrency 4 --output /work/results.json || true
                  END=$(date +%s)
                  DURATION=$((END - START))

                  # Per-category pass rates from JSON results. The promptfoo
                  # image is node-based (no jq) so we parse with `node -e`
                  # instead of jq. Promptfoo's JSON schema places per-test
                  # metadata at one of several paths depending on version;
                  # the chained alternation tries the documented paths and
                  # falls back to "unknown" so a schema drift doesn't
                  # silently drop metrics.
                  # `$${...}` escapes — Flux postBuild substitution would
                  # otherwise consume the JS template literal placeholders.
                  node -e '
                    const r = require("/work/results.json");
                    const tests = (r.results && r.results.results) || [];
                    const g = {};
                    for (const t of tests) {
                      const c = (t.testCase && t.testCase.metadata && t.testCase.metadata.category)
                        || (t.testCase && t.testCase.vars && t.testCase.vars.category)
                        || (t.vars && t.vars.category)
                        || "unknown";
                      if (!g[c]) g[c] = {total: 0, failed: 0};
                      g[c].total++;
                      if (!t.success) g[c].failed++;
                    }
                    const out = [];
                    for (const [c, x] of Object.entries(g)) {
                      out.push(`promptfoo_test_total{category="$${c}"} $${x.total}`);
                      out.push(`promptfoo_test_failed{category="$${c}"} $${x.failed}`);
                      out.push(`promptfoo_test_pass_rate{category="$${c}"} $${(x.total - x.failed) / x.total}`);
                    }
                    console.log(out.join("\n"));
                  ' > /work/metrics.prom

                  # Total run duration (overall health gauge).
                  # `$${VAR}` escapes — Flux postBuild envsubst would
                  # otherwise consume these bash vars as Flux substitutions.
                  echo "promptfoo_run_duration_seconds $${DURATION}" >> /work/metrics.prom
                  echo "promptfoo_run_timestamp_seconds $${END}" >> /work/metrics.prom

                  echo "=== metrics.prom ==="
                  cat /work/metrics.prom
                  echo "==="

                  # Push to VictoriaMetrics.
                  curl --fail --silent --show-error \
                    -X POST \
                    -H 'Content-Type: text/plain' \
                    --data-binary @/work/metrics.prom \
                    'http://vmsingle-victoria-metrics-k8s-stack.observability.svc.cluster.local:8428/api/v1/import/prometheus'

                  echo "Pushed metrics to VictoriaMetrics."
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
                runAsNonRoot: true
                capabilities: { drop: ["ALL"] }
                seccompProfile: { type: RuntimeDefault }
              resources:
                requests: { cpu: "200m", memory: "512Mi" }
                limits: { cpu: "1", memory: "1Gi" }
              env:
                - name: HOME
                  value: /work
              # OPENAI_API_KEY for the AI Gateway SecurityPolicy (B-1).
              # Promptfoo's openai provider auto-detects this env var
              # and sends `Authorization: Bearer <value>`.
              envFrom:
                - secretRef:
                    name: promptfoo-llm-api-key
              volumeMounts:
                - { name: suite, mountPath: /suite, readOnly: true }
                - { name: workdir, mountPath: /work }
```
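
For reference, the pushed metrics.prom body would look roughly like this (category names come from the eval suite; all values illustrative):

```text
promptfoo_test_total{category="coding"} 12
promptfoo_test_failed{category="coding"} 1
promptfoo_test_pass_rate{category="coding"} 0.9166666666666666
promptfoo_run_duration_seconds 418
promptfoo_run_timestamp_seconds 1746583200
```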
Smana added a commit that referenced this pull request on May 9, 2026
Self-review pass on PR #1434:
- KEDA `keda-metrics-server` egress: 8429 → 8428 (see the sketch below). vmsingle's chart-default service port is 8428 (matches httproute-vmsingle.yaml backendRef and the composition `_DEFAULTS.prometheus_server_address`). 8429 is vmagent. Cilium would have silently dropped the trigger query under default-deny.
- Add `keda-operator` egress to vmsingle:8428 for activation polling so future `scaling.minReplicas: 0` claims (XRD-supported demo override) can actually wake. Inert for the default min=1 fleet.
- promptfoo cronjob: emit a `promptfoo_test_schema_unknown_total` counter so upstream JSON-schema rotations surface in metrics instead of being hidden under the `category="unknown"` bucket.
- inference-service main.k: document the `max(num_requests_running)` aggregation intent — hottest-replica saturation, not fleet average. cooldownPeriod dampens scale-down noise.
Validation:
- kcl fmt + kcl test . -Y settings-example.yaml: 24/24 PASS
- kubeconform on edited YAMLs: 0 errors
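
A hedged sketch of the corrected egress rule as a CiliumNetworkPolicy fragment (the vmsingle pod labels are assumptions; only the port fix comes from the commit):

```yaml
egress:
  - toEndpoints:
      - matchLabels:
          k8s:io.kubernetes.pod.namespace: observability
          app.kubernetes.io/name: vmsingle   # label assumed; adjust to the chart's actual labels
    toPorts:
      - ports:
          - port: "8428"    # vmsingle; 8429 is vmagent
            protocol: TCP
```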
- `.github/workflows/crossplane-modules.yml`: rewrite kcl.mod's version to the PR-suffixed publish version before `kcl mod push` (the push command ignores the OCI tag in the URL and uses kcl.mod's `version` field as the actual published tag). Mirror the same suffix in the composition-source audit step so PR runs don't fail on the bare kcl.mod version. Lowercase the GHCR repo owner. Drive Dockerfile GO_VERSION from go.mod.
- `.pre-commit-config.yaml`: exclude the vendored Envoy Gateway CRDs from `check-added-large-files` — the schema is ~2 MiB total; the chart can't be installed via HelmRelease (1 MiB Helm-release Secret cap), so the rendered CRDs are committed.
- `.trivyignore.yaml`: skip the CKV2_AWS_5 false positive on the S3 Files mount-target SG (Checkov doesn't recognize the newer `aws_s3files_mount_target` resource yet).
- `.gitignore`: ignore `*.tfplan` / `out.tfplan` and the local `.claude/scheduled_tasks.lock` (transient state, blocks Terramate).
- `.secrets.baseline`: refresh after adding LLM-platform manifests with pragma-allowlisted ESO `secretKey` references.
…ing decisions
Establishes the spec-driven-development tooling that this PR ships under:
- `.claude/rules/` — path-scoped rules auto-loaded by the editor when
touching Crossplane KCL, OpenTofu, observability, network-policies,
or spec artifacts. Captures the repeat traps the LLM-platform
first-deploy session surfaced (DNS L7 inspection, link-local
entities, post-creation dict mutation, container-vs-pod
securityContext split).
- `.claude/skills/validate/references/cross-artifact-rules.md` — V2
validation rules referenced by the `/validate` skill.
- `CLAUDE.md` — project-level Claude Code guidance updated for the LLM
platform opt-in gate, KEDA prometheus autoscaling, and rule
cross-references.
- `docs/decisions/` — ADR-0003 (vLLM Production Stack vs KServe) and
ADR-0004 (Amazon S3 Files for model weights storage), the two new
cross-cutting decisions; index README links them.
- `docs/specs/README.md` — V2 plan.md validation-path rule.
- `docs/superpowers/{specs,plans}/` — design+plan pairs feeding the
in-tree specs (coding-LLM fleet, AI gateway redesign, foundation
showcase, paths 7+8 LoRA + per-tenant FinOps).
…architecture diagram
- SPEC-001 (`docs/specs/0001-llm-platform-prometheus-autoscaling/`): switch InferenceService autoscaling from KEDA HTTP add-on (proxy in data path, lagging request-count trigger) to KEDA prometheus on leading vLLM signals (running/max-num-seqs ratio + KV-cache util). The `min=1` default eliminates the scale-from-zero deadlock with prometheus triggers. Spec artifacts include `spec.md` (WHAT), `plan.md` (HOW + 21 tasks + 4-persona review checklist), and the append-only `clarifications.md` (CL-1..CL-7 — including the post-validation port fix `8429 → 8428` and the e2e validation walkthrough).
- `docs/llm-platform-future-paths.md` — paths 1–8 future-paths doc (LoRA serving, per-tenant FinOps, GPU node bin-packing, etc.) with paths 7+8 marked as the next slice.
- `docs/architecture/` — `llm-platform.drawio` source-of-truth diagram + README walkthrough of the request flow (Tailscale Gateway → SecurityPolicy → AIGatewayRoute → AIServiceBackend → vLLM Service).
- `crds-envoy-gateway.yaml` — Envoy Gateway 1.7.0 CRDs rendered as a single release-asset file (Backend, BackendTLSPolicy, ClientTrafficPolicy, EnvoyExtensionPolicy, EnvoyPatchPolicy, EnvoyProxy, HTTPRouteFilter, SecurityPolicy, BackendTrafficPolicy, AIGatewayRoute, AIServiceBackend). Vendored because the chart can't ship them via HelmRelease — the Helm release Secret cap is 1 MiB; the rendered schemas total ~2 MiB.
- `kustomization-inference-extension.yaml` — Gateway API Inference Extension v1.0.0 CRDs (InferencePool / InferenceModel) sourced upstream via Flux Kustomization, kept for forward compatibility even though the AI gateway redesign no longer routes through them.
- `crds/base/kustomization.yaml` — wires both into the cluster CRD bundle.
…-in script gate
- `opentofu/config.tm.hcl`: bump cilium / karpenter / flux versions for
May 2026.
- `opentofu/workflows.tm.hcl`: introduce the `--no-tags=opt-in` /
`--tags=opt-in` filter convention so opt-in stacks (currently
`llm-platform`) are skipped by default and require an explicit
invocation to deploy / preview / destroy.
- `opentofu/eks/{init,configure}/workflows.tm.hcl`: refine the
two-stage bootstrap orchestration scripts.
- `opentofu/eks/init/kubernetes.tf` + `helm_values/cilium.yaml`: wire
cilium configuration tweaks needed for the LLM platform's data plane
(CEC support — `envoyConfig.enabled: true`).
- `opentofu/eks/{init,configure}/variables.tf`: expose the variables
the newer Cilium / Flux versions need.
…n stack)
New Terramate stack tagged `opt-in` so it's skipped by default — it must be explicitly enabled with `TM_LLM_PLATFORM_ENABLED=true` (mirrors the Flux umbrella's `spec.suspend: true` gate). The stack provisions the AWS resources every InferenceService claim depends on:
- `aws_s3files_file_system.models` — S3-backed POSIX filesystem for model weights (NFSv4 over an underlying S3 bucket; the bucket survives filesystem recreation, so re-bootstrap reuses already-cached weights).
- `aws_s3files_mount_target.az` — one mount target per private subnet / AZ. Pods land on the same-AZ mount target (cross-AZ NFS works but adds latency + transfer cost).
- `aws_s3files_access_point.shared` — single `/models` access point with posix uid:gid 1001:1001; per-claim subPath isolation handled at mount time.
- `aws_iam_role.s3files_service` — the S3 Files service role (allows `s3files.amazonaws.com` to read/write the underlying S3 bucket).
- `aws_iam_role` for the EFS CSI driver — bound via EKS Pod Identity to the controller + node SAs, granting AmazonEFSCSIDriverPolicy + AmazonS3FilesCSIDriverPolicy.
- Output `volume_handle` (`s3files:<fs>::<ap>`) — copied into `apps/base/ai/llm/models-pvc.yaml` to bind the in-cluster PV to the filesystem (see the sketch below).
The IAM and access-point roles deliberately live in OpenTofu (durable), while the S3 Files filesystem can be torn down + recreated cheaply. The Flux side at `clusters/mycluster-0-llm-platform/` is suspended by default — both gates must release for an end-to-end deploy.
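
A hedged sketch of the static PV consuming the `volume_handle` output (driver name assumes the AWS EFS CSI driver fronts S3 Files mounts, per the chart-4.1.0 commit below; capacity illustrative):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-models
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: efs.csi.aws.com
    volumeHandle: s3files:<fs>::<ap>   # copied from the tofu output after every apply
```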
…espaces
- `flux/sources/`: pin the Helm/OCI repositories the LLM platform draws from — Envoy Gateway, Envoy AI Gateway (controller + CRDs), KEDA, AWS EFS CSI driver, Iris vllm-semantic-router, Gateway API Inference Extension, InferencePool. `ocirepo-karpenter.yaml` bumped alongside the platform-wide karpenter version pin in `opentofu/config.tm.hcl`.
- `namespaces/base/`: `llm`, `envoy-gateway-system`, and `envoy-ai-gateway-system` namespaces created early so default-deny CiliumNetworkPolicies and ExternalSecrets can reference them before any HelmRelease lands. Wired into `namespaces/base/kustomization.yaml`.
… PSS-compatible securityContext
Pinned to chart 4.1.0 (driver 3.1.0) — the first release line that supports S3 Files access points (volumeHandle `s3files:<fs>::<ap>`). Renovate will surface chart-version bumps as PRs so changes are reviewed before they touch a CSI driver mounting model weights.
Resource sizing tuned for the LLM platform's load profile:
- Controller: 50m/256Mi requests, 512Mi limit.
- Node: 100m/512Mi requests, 1Gi limit. 256Mi was OOM-killed under parallel preload Jobs all calling NodeStage/NodePublish on the same shared S3 Files mount; the OOM left the kernel mount alive but broke the chart's nfs4 watchdog (new mounts then returned EACCES).
Pod-level `securityContext.seccompProfile.type: RuntimeDefault` only — the chart's `efs-plugin` container needs `privileged: true` for kernel mounts (incompatible with `allowPrivilegeEscalation: false` and pod-level `runAsNonRoot: true`). The chart's defaults already lock down the support containers (csi-provisioner, liveness-probe).
`storageclass.yaml`: hardened to `storageClasses: []` since static PVs (per-InferenceService) are the model — no dynamic provisioning here.
KEDA 2.18.0 deployed in the new `keda` namespace, configured for restricted PSS:
- `helmrelease.yaml` — operator + admission-webhooks + metrics-apiserver with explicit per-component securityContext (each block fully restated per the upstream chart's deep-merge semantics — see the sketch below).
- `network-policy.yaml` — default-deny on `keda-operator`, `keda-operator-metrics-apiserver`, and `keda-admission-webhooks`. Egress to vmsingle:8428 (prometheus trigger queries), kube-apiserver, and DNS. Ingress only from kube-apiserver (admission webhooks + external metrics API).
- `additional-rbac.yaml` (Crossplane providers) — aggregate ClusterRole granting the Crossplane SA `keda.sh/scaledobjects` patch + delete verbs so the InferenceService composition can render ScaledObject managed resources.
- `activation-policy.yaml` — installs the KEDA CRDs the composition references.
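
A hedged sketch of one fully-restated per-component block (value paths assumed from the upstream KEDA chart layout; the point is that every field must be restated, since the chart replaces rather than deep-merges these blocks):

```yaml
securityContext:
  operator:
    runAsNonRoot: true
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true
    capabilities:
      drop: ["ALL"]
    seccompProfile:
      type: RuntimeDefault
```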
- `karpenter-nodepools-gpu/`: dedicated NodePool + EC2NodeClass for NVIDIA L4 instances (`g6.xlarge` / `g6.2xlarge` — on-demand for the base capacity, spot for burst), labeled `gpu=l4` and tainted `nvidia.com/gpu=Exists:NoSchedule` so only LLM workloads schedule. AMI: Bottlerocket NVIDIA — exposes the GPU natively, no device-plugin DaemonSet required. NodePool `nodes` cap = 4 (decision CL-6 in SPEC-001's clarifications log).
- `karpenter-nodepools/`: bump the default NodePool / EC2NodeClass to keep in lockstep with the GPU pool's API version + LLM-platform-friendly taints.
- `runtimeclass-nvidia/`: RuntimeClass `nvidia` referenced by the InferenceService composition's vLLM Deployment so containers wire through the NVIDIA container runtime.
… CNPs
- `helmrelease.yaml` — Envoy Gateway 1.7.0 controller (provides the `GatewayClass` Envoy AI Gateway consumes). Restricted-PSS-compatible per-component securityContext blocks. Watches the cluster for `Gateway` / `HTTPRoute` / `SecurityPolicy` resources targeting its GatewayClass.
- `network-policy.yaml` — default-deny CiliumNetworkPolicy for the controller and the data-plane proxy spawned per Gateway. Allows xDS from the data-plane proxy back to the controller (ports 18000-18002), ingress from kubelet for probes, ingress from in-cluster apps to the data-plane proxy on :8080, and egress to the API server.
…e-fronted HTTPRoute
The OpenAI-compatible LLM ingress: clients hit `https://llm.priv.cloud.ogenki.io/v1/...` over Tailscale, the SecurityPolicy authenticates the request, the AIGatewayRoute body parser sets `x-ai-eg-model` from the request body, and the route dispatches to the matching vLLM Service.
- `helmrelease.yaml` + `helmrelease-crds.yaml` — Envoy AI Gateway controller (v0.5.0) — direct routing per AIServiceBackend, no proxy hop, no in-data-plane interceptor.
- `gatewayclass.yaml` + `gateway.yaml` + `envoyproxy.yaml` — single shared `ai-gateway` Gateway, ClientTrafficPolicy + EnvoyProxy spec scoped to the LLM listener.
- `httproute.yaml` — public HTTPRoute on the Tailscale-general Gateway pointing `llm.priv.cloud.ogenki.io` at the AI Gateway data-plane Service.
- `security-policy.yaml` + `api-keys-externalsecret.yaml` — `apiKeyAuth` SecurityPolicy comparing the `Authorization` header against the values in the `ai-gateway-api-keys` Secret. Envoy Gateway strips the `Bearer ` scheme before comparison, so the ESO-rendered Secret stores the raw API key (not Bearer-prefixed; see the sketch after this list). Source of truth: AWS Secrets Manager `platform/llm/api-keys` (a JSON object keyed by client identity — `openwebui_apikey`, `promptfoo_apikey`); seeded out-of-band so the keys survive cluster recreation.
- `network-policy.yaml` — default-deny on both the controller and the data-plane proxy. Egress to vLLM Services in `llm/`, the semantic router in `llm/`, kube-apiserver, and DNS. Ingress from the Cilium Gateway, in-cluster apps, and kubelet probes.
- `gapi/platform-tailscale-general-gateway.yaml` — extends the Tailscale general Gateway with the `llm` HTTPRoute listener.
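
A hedged sketch of the raw-key ExternalSecret (store name and refresh interval are assumptions; the raw-value detail is the commit's):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ai-gateway-api-keys
  namespace: envoy-gateway-system
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: clustersecretstore    # assumed store name
    kind: ClusterSecretStore
  target:
    name: ai-gateway-api-keys
  data:
    - secretKey: openwebui_apikey    # raw value, not "Bearer <key>"; Envoy strips the scheme
      remoteRef:
        key: platform/llm/api-keys
        property: openwebui_apikey
    - secretKey: promptfoo_apikey
      remoteRef:
        key: platform/llm/api-keys
        property: promptfoo_apikey
```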
The `MoM` (mixture-of-models) virtual model dispatcher. When clients send `model: MoM`, the AIGatewayRoute extension calls SR's HTTP classifier (`POST /api/v1/classify/intent` on :8080); SR returns the chosen `xplane-<name>` model id; the AI Gateway body parser rewrites `x-ai-eg-model` and dispatches.
- `helmrelease.yaml` — vllm-semantic-router 0.0.x with signal-fusion routing (keyword + context-length signals). PII classifier disabled (upstream chart bug); semantic_cache disabled (poisons on failed upstreams). Memory bumped to 4Gi (OOMKilled at 512Mi).
- `network-policy.yaml` — default-deny CNP. Egress to vLLM Services, HuggingFace (one-shot model download for the BERT classifier cache; FQDN allowlist with full subdomain depth), DNS via an L7-aware kube-dns rule, and vmagent for metrics. Ingress from envoy-gateway-system (the AIGatewayRoute extension's HTTP classifier client) and from the promptfoo namespace for evals.
New `cloud.ogenki.io/v1alpha1 InferenceService` XR + KCL composition that templates a single vLLM model claim into 9 managed resources:
- vLLM `Deployment` on the GPU NodePool with the model name + spec baked into args (`--model`, `--enable-tool-call`, `--tool-call-parser hermes`, `--enable-lora` + `--lora-modules` when `loraAdapters` is non-empty), running with a restricted-PSS securityContext (runAsUser=1000 to match the vLLM image's /etc/passwd).
- `Service` (ClusterIP, port 8000) — the OpenAI-compatible target the AIGatewayRoute dispatches to.
- KEDA `ScaledObject` with prometheus triggers on leading vLLM signals (running/max-num-seqs ratio + KV-cache util, queries against vmsingle:8428). `min=1` default eliminates the prometheus-trigger scale-from-zero deadlock; cooldownPeriod=300s for damped scale-down. Per SPEC-001.
- Two `CiliumNetworkPolicy` resources (vLLM ingress from the AI Gateway data plane only; preload Job egress to HuggingFace).
- ServiceMonitor + VMRule for the vLLM metrics scrape.
- Preload `Job` (one-shot) that downloads the model + LoRA adapters from HuggingFace into `/models/<name>/` and `/models/loras/<name>/` on the shared S3 Files mount; uses a marker file to skip re-download on bootstrap. Fast-path guards against partial xet-cache downloads.
- ExternalSecret pulling the HF token from AWS SM.
The composition includes:
- `main_test.k` — unit tests asserting resource counts + naming + security context + LoRA conditional emission + preload skip-marker.
- `README.md` + `settings-example.yaml` + `examples/` (basic and complete claims; a hedged sketch of a basic claim follows).
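
A hedged sketch of a basic claim (the `scaling` field path matches the self-review note above; the `model` field name is an assumption):

```yaml
apiVersion: cloud.ogenki.io/v1alpha1
kind: InferenceService
metadata:
  name: xplane-qwen3-8b-basic
  namespace: llm
spec:
  model: Qwen/Qwen3-8B    # HuggingFace id the preload Job downloads (field name assumed)
  scaling:
    minReplicas: 1        # SPEC-001 default; 0 stays XRD-supported for demos
    maxReplicas: 4
```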
….10)
Two new fields needed by OpenWebUI's claim:
- `deploymentStrategy` — `RollingUpdate` (default) or `Recreate`. The composition emits ONLY the matching strategy block (KCL inline conditional, no post-creation dict mutation per function-kcl issue #285). Recreate is required by OpenWebUI because its data PVC is RWO on the default gp3 StorageClass — RollingUpdate's maxSurge would spawn the new pod before the old one releases the volume, triggering `Multi-Attach error for volume`.
- `extraVolumes` + `extraVolumeMounts` — pass-through for arbitrary Volume / VolumeMount entries. OpenWebUI uses them to mount its `openwebui-data` PVC at `/app/backend/data` so the SQLite DB + uploaded files survive pod restarts.
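
A hedged sketch of the two new fields on OpenWebUI's App claim (field names from the commit message; nesting directly under spec is assumed):

```yaml
spec:
  deploymentStrategy: Recreate    # RWO PVC: avoids the Multi-Attach error on rollout
  extraVolumes:
    - name: openwebui-data
      persistentVolumeClaim:
        claimName: openwebui-data
  extraVolumeMounts:
    - name: openwebui-data
      mountPath: /app/backend/data
```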
`xplane-llm-models-preload` EPI with a writable IAM policy scoped to the underlying S3 bucket of the S3 Files filesystem (read+write — the preload Job needs to write model weights). Bound to the `xplane-llm-models-preload` ServiceAccount in the `llm` namespace which the InferenceService composition's preload Job uses. `epis/kustomization.yaml` references `epis-llm` so the new EPI lands on top of the existing platform EPI bundle. `clusters/mycluster-0/security/eks-pod-identities.yaml` overlays the EPI namespace into the cluster's overlay.
Teardown-safety fixes that the LLM-platform branch surfaced (the multiple destroy/recreate cycles exercised them):
1. `managementPolicies` without `Delete` on three stateful Buckets — `cnpg-backups`, `openbao-snapshot`, and the `xplane-harbor`-bound bucket (orphaning on cluster destroy preserves the data and finalizers don't hang). Crossplane v2 namespaced MRs do not expose `spec.deletionPolicy`; `managementPolicies` is the v2 mechanism. Plus the existing platform principle of no DeleteBucket IAM grants.
2. `security/base/zitadel/sqlinstance.yaml` — frozen-dated-snapshot recovery pattern so a cluster rebuild can re-bootstrap Zitadel from the prior snapshot (the bootstrap field is immutable post-create).
3. `scripts/eks-prepare-destroy.sh` — pre-clean Envoy AI Gateway + InferencePool + KEDA CRDs, drop kyverno + cilium-operator validating webhooks early, and unblock teardown on degraded clusters where admission can race the destroy ordering.
4. `scripts/terramate-destroy-confirm.sh` — single y/N prompt at the start of `terramate script run --reverse destroy` so the operator confirms once instead of per-stack.
…y routes + LLM SLO rules
The 4 base InferenceService claims + 2 LoRA adapters + supporting infra:
- `models-pvc.yaml` — static PV+PVC binding the cluster to the S3 Files filesystem provisioned in opentofu/llm-platform. The `volumeHandle` (`s3files:<fs>::<ap>`) is updated manually after every `tofu apply` (a header comment calls out the sync).
- `s3-bucket.yaml` — the underlying S3 bucket Crossplane manages alongside the filesystem (deletion-protected via `managementPolicies` without Delete; the bucket survives filesystem recreation so model weights are reused).
- `qwen-coder.yaml`, `qwen-coder-fim.yaml`, `qwen3-8b.yaml`, `llamaguard3-1b.yaml` — InferenceService claims. `xplane-qwen-coder` enables LoRA with two adapters (`xplane-qwen-coder-sql-dpo`, `xplane-qwen-coder-securecode`).
- `ai-gateway-routes/route.yaml` — AIGatewayRoute matching `model: xplane-<name>` headers (incl. LoRA adapter model names, which route to the qwen-coder backend; see the sketch after this list).
- `hf-token-externalsecret.yaml` — HuggingFace token for the preload Job, sourced from AWS SM.
- `preload-serviceaccount.yaml` — SA bound by the `xplane-llm-models-preload` EPI.
- `grafana-folder.yaml` + `grafana-dashboard.yaml` — co-located LLM platform dashboard (23 panels: per-model TTFT, request rate, error rate, GPU util, KEDA scale events, vLLM cache util, etc.).
- `vmrule-llm-slo.yaml` — 3 SLO alerts (TTFT p95, error rate, request saturation).
- `apps/llm/kustomization.yaml` — overlay-only Kustomization (gated by the LLM umbrella; not wired into `apps/mycluster-0/`, which would bypass the suspend gate).
- `apps/mycluster-0/kustomization.yaml` — references OpenWebUI, which is not LLM-gated (frontend-only; works whether the LLM stack is resumed or not, just shows no models).
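
A hedged sketch of one AIGatewayRoute match, following the upstream Envoy AI Gateway basic example (backend and Gateway names assumed from this PR's manifests; newer API versions rename targetRefs to parentRefs):

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm
  namespace: llm
spec:
  schema:
    name: OpenAI
  targetRefs:
    - name: ai-gateway
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: xplane-qwen-coder-sql-dpo   # LoRA adapter model id
      backendRefs:
        - name: xplane-qwen-coder                # adapter routes to the base-model backend
```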
…+ LLM API key
`xplane-openwebui` App XR: a single-replica OpenWebUI v0.5.20 in the `apps` namespace, fronted by `chat.priv.cloud.ogenki.io` over Tailscale. Talks OpenAI-compatible HTTP to the AI Gateway data plane.
- `app.yaml` — App claim. Strategy=Recreate (RWO PVC; RollingUpdate triggers a multi-attach error). `securityContext.readOnlyRootFilesystem: false` required (writes to `/app/backend/data`). Mounts the `openwebui-data` PVC at `/app/backend/data` so the SQLite admin DB + chat history + uploaded files survive restarts. `automountServiceAccountToken: false`. Env vars: `OPENAI_API_BASE_URL` → AI Gateway data plane, `OPENAI_API_KEY` from the ESO-rendered `openwebui-llm-api-key` Secret, OAuth (Zitadel) creds, etc.
- `pvc.yaml` — `openwebui-data` 5Gi gp3 PVC.
- `externalsecret-llm-api-key.yaml` — pulls the raw `openwebui_apikey` from AWS SM `platform/llm/api-keys`. The OpenAI client inside OpenWebUI prepends `Bearer ` to this value before sending the Authorization header.
- `externalsecret-oauth-zitadel.yaml` — OIDC client_id + client_secret from Zitadel for OpenWebUI's "Sign in with Zitadel".
…for LLM platform
- `vmrules/ai.yaml` — alert rules for the AI namespace (vllm-semantic-router availability, classifier latency, AI Gateway data-plane availability).
- `vmrules/kustomization.yaml` — wires `ai.yaml` into the cluster VMRule bundle.
- `vmservicecrapes/vllm-semantic-router.yaml` — VMServiceScrape for the semantic router's :8080/metrics. Wired via `vmservicecrapes/kustomization.yaml`.
- `loggen/helmrelease.yaml` — postRenderer to strip pod-level container security fields the upstream chart emits incorrectly (the chart segments by component but doesn't deep-merge — the fix matches the path-scoped rule in `.claude/rules/spec-constitution.md` about replace-not-merge securityContext semantics).
…-tenant FinOps
Nightly Promptfoo evaluation suite that exercises every model
(including LoRA adapters) against the AI Gateway and emits results
as Prometheus metrics for SLO tracking.
- `namespace.yaml` — `promptfoo` namespace.
- `cronjob.yaml` — fires at 02:00 Europe/Paris. Runs Promptfoo against
the AI Gateway with `xplane-qwen3-8b` (default model) + targeted
probes of `xplane-qwen-coder-fim`, `xplane-qwen-coder-sql-dpo`,
`xplane-qwen-coder-securecode`. Node-based JSON-to-Prometheus
parser (replaced jq for portability). Pushes to vmsingle's
`/api/v1/import/prometheus`. Tracks `promptfoo_test_schema_unknown_total`
to surface fixture drift instead of silently absorbing it. All
Flux postBuild substitution markers escaped (`$${VAR}`) so the
bash + JS template literals survive postBuild.
- `eval-suite-configmap.yaml` — test cases pinned via ConfigMap.
- `externalsecret-api-key.yaml` — `promptfoo_apikey` from AWS SM
`platform/llm/api-keys`. The eval container prepends `Bearer ` to
this raw value before sending the Authorization header.
- `network-policy.yaml` — default-deny + egress to AI Gateway data
plane (envoy-gateway-system :8080), vllm-semantic-router (for
classifier probes), vmsingle (push), DNS, kubelet ingress for
probes.
- `kustomization.yaml` — wires the lot.
Add the LLM Platform group to the homepage portal with a single chatbot link (chat.priv.cloud.ogenki.io). Internal API surface + Grafana dashboards + Promptfoo eval results live one click deeper under the existing Observability + Apps groups. `tooling/mycluster-0/kustomization.yaml` wires the homepage update into the cluster overlay.
…uster wiring
The Flux gate that pairs with `opentofu/llm-platform`'s opt-in
Terramate gate. Both must be released for an end-to-end deploy.
- `clusters/mycluster-0/llm-platform.yaml` — umbrella Flux
  Kustomization with `spec.suspend: true` (default; see the sketch
  after this list). Points at `clusters/mycluster-0-llm-platform/`.
  Manual `flux resume kustomization llm-platform -n flux-system`
  releases the gate.
- `clusters/mycluster-0-llm-platform/` — sibling directory (NOT
under `clusters/mycluster-0/`) so `flux-system`'s recursive sync
doesn't auto-discover the children and bypass the umbrella's
suspend. Contains 8 child Flux Kustomizations:
- infrastructure-vllm-semantic-router
- infrastructure-runtimeclass-nvidia
- infrastructure-gpu-nodepools (Karpenter NodePool)
- infrastructure-envoy-gateway
- infrastructure-envoy-ai-gateway
- apps-llm (InferenceService claims + OpenWebUI route)
- security-llm-epi (writable EKS Pod Identity)
- tooling-promptfoo (nightly evals, gated under the same umbrella)
- `clusters/mycluster-0-llm-platform/README.md` — operator runbook:
enable/suspend/teardown procedures + the AWS SM `platform/llm/api-keys`
bootstrap (kept outside OpenTofu so the keys survive cluster
recreation).
- `clusters/mycluster-0/infrastructure/infrastructure.yaml` — wires
KEDA + EFS + Envoy Gateway controller into the platform-wide
infrastructure Kustomization (these are needed even without the LLM
gate released).
- `clusters/mycluster-0/security/eks-pod-identities.yaml` — wire the
`epis-llm` overlay (writable preload EPI lives there).
- `infrastructure/mycluster-0/kustomization.yaml` — references the
new base directories (aws-efs-csi-driver, keda).
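
A hedged sketch of the umbrella gate (paths and names from the commit message; interval and prune settings are assumptions):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: llm-platform
  namespace: flux-system
spec:
  suspend: true    # default-off; `flux resume kustomization llm-platform -n flux-system` releases it
  interval: 10m
  path: ./clusters/mycluster-0-llm-platform
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```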
…ce composition
Extend the KCL validator to also lint+test the new `infrastructure/base/crossplane/configuration/kcl/inference-service/` composition (4 stages: kcl fmt, syntax, render, security checks).
…master-plan drafts
- `README.md` — new "LLM Platform" section in the project overview. Briefly describes the OpenAI-compatible API (Bearer-token auth, 4 base models + 2 LoRA adapters), OpenWebUI for chat, OpenCode + Continue for IDE.
- `docs/ai.md` — narrative architecture doc: routing modes (client-deterministic vs SR cascade), latency budget, 9-component diagram, security model (apiKeyAuth + ForwardClientIDHeader + sanitize=true), observability surface, request-flow walkthrough. The Bearer-prefix description matches the actual ESO template (raw-key Secret; Envoy strips the scheme before comparison).
- `docs/coding-clients.md` — copy-paste configs for OpenCode + Continue + curl + verification recipes; lists the model fleet.
- `docs/technology-choices.md` — KEDA added to the technology stack table (autoscaling layer).
- `docs/plans/self-hosted-llm-platform/` — parked exploration drafts (the original doc challenged + master plan + spec/plan/clarifications drafts); kept for context but superseded by `docs/specs/0001-*`.
- CKV_K8S_49 on inference-service:aggregate-to-crossplane: wildcard verbs match the established pattern across additional-rbac.yaml; narrowing would break composition reconciliation.
- CKV_K8S_35 on the promptfoo CronJob: the openai provider auto-discovers OPENAI_API_KEY from env; mounting it as a file would require an entrypoint wrapper and regress readOnlyRootFilesystem. The Secret is single-key and short-lived (CronJob, ttl=7d).
Summary
Self-hosted LLM platform on EKS with cascade routing, scale-to-zero,
guardrails, and nightly evals — landed across 7 phases on this branch
plus SPEC-001 production-realistic autoscaling (folded in
2026-05-07).
Posture: end-to-end validated against a live cluster on
2026-05-07. Inference, scaling, observability, routing all verified.
What's in the box
- `4016b226` — gpu-l4 Karpenter NodePool + Bottlerocket Accelerated EC2NodeClass
- `c60df37c` / `8b8c76b4` / `19c6b108` — KEDA + HTTP add-on (later replaced by SPEC-001), vLLM Production Stack, vLLM Semantic Router (Iris)
- `5d50245d` — `InferenceService` Crossplane composition (KCL): vLLM Deployment + Service + ServiceAccount + KEDA `ScaledObject` (prometheus triggers per SPEC-001) + optional HTTPRoute + default-deny CiliumNetworkPolicy + read-only EPI for S3 weights + ExternalSecrets + VMServiceScrape + per-model VMRule + idempotent preload Job (CL-3 rec A)
- `e1bd5548` — `xplane-llm-models` S3 bucket + writable preload IAM
- `0d569425` — 4 model claims (Qwen2.5-Coder-7B / Qwen3-8B / Qwen2.5-Coder-1.5B-Base FIM / LlamaGuard 3-1B), Hybrid routing (CL-1 C), public HTTPRoute `llm.${private_domain_name}`
- `ba8f2bab` — OpenWebUI App XR claim → `chat.${private_domain_name}`
- `63dc9f2d` — Promptfoo nightly CronJob (CL-4 A) + platform VMRules + ADR-0003 (vLLM PS over KServe + llm-d). Comprehensive Grafana dashboard shipped 2026-05-07.
- SPEC-001 — KEDA `ScaledObject` (prometheus triggers on leading vLLM saturation metrics: `running/max-num-seqs` + `gpu_cache_usage_perc`). All models default `min=1`. Direct AI Gateway → vLLM (no proxy hop). See `docs/specs/0001-llm-platform-prometheus-autoscaling/`.

CL decisions ratified
- `router.mode: hybrid` — `main.k`
- Nightly eval schedule (`0 2 * * *` Europe/Paris) — `tooling/base/promptfoo/cronjob.yaml`
- `nvidia.com/gpu: 4` cap — `gpu-l4-nodepool.yaml`
- LLM-gated claims — `apps/llm` folder
- Grafana dashboard — `apps/base/ai/llm/grafana-dashboard.yaml`
- Composition v0.6.0 (LoRA adapters) — `infrastructure/base/crossplane/configuration/kcl/inference-service/main.k`

End-to-end validation (2026-05-07)
- `/v1/completions` via Tailscale → AI Gateway → vLLM
- `/v1/completions` (warm)
- `/v1/chat/completions` (warm)
- `/v1/completions` (warm)
- `num_requests_waiting` stayed 0
- scale-down after `cooldownPeriod=300s` of inactivity
- `/metrics`

Notable design choices
- `InferenceService` kind, no X prefix — matches the App / SQLInstance / EPI repo convention
- `Bucket` manifests instead of the App XR's `s3Bucket.enabled`

Test plan
- `kcl test` — 22/22 PASS (inference-service)
- `./scripts/validate-kcl-compositions.sh` — stages 1-2 pass
- `aws iam simulate-principal-policy` against the live EPI roles
- Drop the `-pr1434` suffix from composition source URLs (`crossplane-inference-service:0.6.0-pr1434` → `0.6.0`, `crossplane-app:0.1.10-pr1434` → `0.1.10`) once CI publishes clean tags

Deferred (with clear triggers)
- `/v1/chat/completions` — SR currently exposes only the classifier endpoint on :8080; Promptfoo redirected to the AI Gateway directly until the MoM cascade lands.
- `kcl/_lib/` module — the refactor touches App too