
feat: self-hosted LLM platform on EKS (Phases 1-7-stub)#1434

Merged
Smana merged 26 commits into main from wip/self-hosted-llm-platform-draft on May 14, 2026

Conversation

@Smana Smana (Owner) commented Apr 30, 2026

Summary

Self-hosted LLM platform on EKS with cascade routing, scale-to-zero,
guardrails, and nightly evals — landed across 7 phases on this branch
plus SPEC-001 production-realistic autoscaling (folded in
2026-05-07).

Posture: end-to-end validated against a live cluster on
2026-05-07. Inference, scaling, observability, and routing all verified.

What's in the box

  • Phase 1 4016b226 — gpu-l4 Karpenter NodePool + Bottlerocket Accelerated EC2NodeClass
  • Phase 2 c60df37c/8b8c76b4/19c6b108 — KEDA + HTTP add-on (later replaced by SPEC-001), vLLM Production Stack, vLLM Semantic Router (Iris)
  • Phase 3 5d50245d — InferenceService Crossplane composition (KCL): vLLM Deployment + Service + ServiceAccount + KEDA ScaledObject (prometheus triggers per SPEC-001) + optional HTTPRoute + default-deny CiliumNetworkPolicy + read-only EPI for S3 weights + ExternalSecrets + VMServiceScrape + per-model VMRule + idempotent preload Job (CL-3 rec A)
  • Phase 4 e1bd5548 — xplane-llm-models S3 bucket + writable preload IAM
  • Phase 5 0d569425 — 4 model claims (Qwen2.5-Coder-7B / Qwen3-8B / Qwen2.5-Coder-1.5B-Base FIM / LlamaGuard 3-1B), Hybrid routing (CL-1 C), public HTTPRoute llm.${private_domain_name}
  • Phase 6 ba8f2bab — OpenWebUI App XR claim → chat.${private_domain_name}
  • Phase 7-stub 63dc9f2d — Promptfoo nightly CronJob (CL-4 A) + platform VMRules + ADR-0003 (vLLM PS over KServe + llm-d). Comprehensive Grafana dashboard shipped 2026-05-07.
  • SPEC-001 — Replace KEDA HTTP add-on with KEDA ScaledObject (prometheus triggers on leading vLLM saturation metrics: running/max-num-seqs + gpu_cache_usage_perc). All models default min=1. Direct AI Gateway → vLLM (no proxy hop). See docs/specs/0001-llm-platform-prometheus-autoscaling/.
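
For skimmers, a minimal sketch of the ScaledObject shape the composition renders under SPEC-001. This is illustrative only: the model name, the max-num-seqs denominator, and the thresholds are assumptions, not the exact rendered output (the composition's main.k is authoritative).

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: xplane-qwen3-8b
  namespace: llm
spec:
  scaleTargetRef:
    name: xplane-qwen3-8b   # the vLLM Deployment
  minReplicaCount: 1        # SPEC-001 default; sidesteps the prometheus scale-from-zero deadlock
  maxReplicaCount: 2        # illustrative cap
  cooldownPeriod: 300       # damped scale-down (CL-5)
  triggers:
    # Leading signal 1: hottest-replica running/max-num-seqs ratio (32 assumed here).
    - type: prometheus
      metadata:
        serverAddress: http://vmsingle-victoria-metrics-k8s-stack.observability.svc.cluster.local:8428
        query: max(vllm:num_requests_running{model="xplane-qwen3-8b"}) / 32
        threshold: "0.7"    # matches the 0.94 > 0.7 firing seen in the e2e run
    # Leading signal 2: KV-cache utilisation.
    - type: prometheus
      metadata:
        serverAddress: http://vmsingle-victoria-metrics-k8s-stack.observability.svc.cluster.local:8428
        query: max(vllm:gpu_cache_usage_perc{model="xplane-qwen3-8b"})
        threshold: "0.9"    # assumed threshold
```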

CL decisions ratified

| CL | Decision | Where wired |
|---|---|---|
| CL-1 | C — Hybrid routing | `router.mode: hybrid` |
| CL-2 | A — LlamaGuard category route (real post-filter middleware deferred upstream) | LlamaGuard direct claim |
| CL-3 | A — composition-rendered preload Job | InferenceService `main.k` |
| CL-4 | A — Nightly Promptfoo (`0 2 * * *` Europe/Paris) | `tooling/base/promptfoo/cronjob.yaml` |
| CL-5 | A — App stays CPU-only, separate XRD | InferenceService is its own composition |
| CL-6 | A — `nvidia.com/gpu: 4` cap | `gpu-l4-nodepool.yaml` |
| CL-7 | Resolved — comprehensive LLM dashboard shipped 2026-05-07 (23 panels under the apps/llm folder) | `apps/base/ai/llm/grafana-dashboard.yaml` |
| CL-8 | A — S3 + EPI (rustfs reassessment) | Phase 4 manifests; rustfs deferred |
| SPEC-001 | Drop KEDA HTTP add-on; KEDA prometheus on leading signals | `infrastructure/base/crossplane/configuration/kcl/inference-service/main.k` (composition v0.6.0, LoRA adapters) |

End-to-end validation (2026-05-07)

| Test | Result |
|---|---|
| FIM `/v1/completions` via Tailscale → AI Gateway → vLLM | ✅ HTTP 200 in 0.18s |
| qwen-coder `/v1/completions` (warm) | ✅ HTTP 200 in 0.29s |
| qwen3-8b `/v1/chat/completions` (warm) | ✅ HTTP 200 in 0.27s |
| llamaguard3-1b `/v1/completions` (warm) | ✅ HTTP 200 in 0.15s |
| OpenWebUI chat round-trip (qwen3-8b, 238KB response) | ✅ HTTP 200 |
| Scale-up 1→2 under 30 concurrent qwen-coder requests | ✅ at T+75s; leading running-ratio trigger fired (0.94 > 0.7), `num_requests_waiting` stayed 0 |
| Scale-down 2→1 after `cooldownPeriod=300s` of inactivity | ✅ at exactly T+5min |
| Cache-util trigger wiring (single 16k-context request) | ✅ KEDA polls the metric; threshold not breached on L4 (expected) |
| VictoriaMetrics scraping vLLM `/metrics` | ✅ verified via MCP queries (`vllm:num_requests_running`, `gpu_cache_usage_perc`) |
| Promptfoo redirect to AI Gateway | ⏳ re-running |

Notable design choices

  • InferenceService kind, no X prefix — matches App / SQLInstance / EPI repo convention
  • Direct Bucket manifests instead of App XR s3Bucket.enabled
  • Per-claim read EPI + shared writable EPI
  • SPEC-001 — Leading-indicator scaling — running-vs-batch ratio + KV cache (not request-rate) catches saturation BEFORE the queue forms
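
To make the claim surface concrete, a hypothetical minimal InferenceService claim. Field names (`model`, `scaling`) are inferred from this description; treat the schema as approximate and defer to the XRD.

```yaml
apiVersion: cloud.ogenki.io/v1alpha1
kind: InferenceService        # namespaced claim, no X prefix (repo convention)
metadata:
  name: xplane-qwen3-8b
  namespace: llm
spec:
  model: Qwen/Qwen3-8B        # HF repo id assumed
  scaling:
    minReplicas: 1            # SPEC-001 default
    maxReplicas: 2
```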

Test plan

  • kcl fmt clean
  • kcl test 22/22 PASS (inference-service)
  • kustomize build apps/mycluster-0 + tooling/mycluster-0
  • kubeconform on rendered manifests
  • trivy 0 misconfigurations
  • ./scripts/validate-kcl-compositions.sh stages 1-2 pass
  • Live e2e validation (2026-05-07 cluster): T027–T030 all 4 models, scale-up + scale-down via SPEC-001 triggers, OpenWebUI round-trip, vmagent metrics flow
  • T034 OpenWebUI UAT — chat.priv.cloud.ogenki.io reachable, qwen3-8b chat works
  • Grafana dashboards — comprehensive 23-panel LLM Platform dashboard (CL-7 unblocked)
  • Post-merge: T020 aws iam simulate-principal-policy against the live EPI roles
  • Post-merge: T040–T041 regression-inject + cost-panel cross-check
  • Post-merge: strip the -pr1434 suffix from composition source URLs (crossplane-inference-service:0.6.0-pr1434 → 0.6.0, crossplane-app:0.1.10-pr1434 → 0.1.10) once CI publishes clean tags

Deferred (with clear triggers)

  • Real LlamaGuard post-filter middleware — needs upstream SR feature or app-layer middleware
  • vLLM Semantic Router cascade for /v1/chat/completions — SR currently exposes only the classifier endpoint on :8080. Promptfoo redirected to AI Gateway directly until MoM cascade lands.
  • Shared kcl/_lib/ module — refactor touches App too
  • Shared S3 read EPI — needs EPI XRD enhancement for multi-SA bindings
  • Two-layer GPU model cache (EBS snapshot + hostPath shared across pods)
  • LlamaGuard post-filter sampling by risk class

Comment thread tooling/base/promptfoo/cronjob.yaml Fixed (3 resolved threads)
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch from 9beba10 to 9f455df on April 30, 2026 19:16
Comment on lines +129 to +141
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-service:aggregate-to-crossplane
  labels:
    rbac.crossplane.io/aggregate-to-crossplane: "true"
rules:
  - apiGroups: ["keda.sh"]
    resources: ["scaledobjects", "scaledobjects/status"]
    verbs: ["*"]
  - apiGroups: ["batch"]
    resources: ["jobs", "jobs/status"]
    verbs: ["*"]
```
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch 2 times, most recently from 692e301 to f2dd9ac on May 2, 2026 05:25
Comment thread opentofu/llm-platform/filesystem.tf Fixed
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch 3 times, most recently from 27c8841 to 33aaf65 on May 4, 2026 19:53
Smana added a commit that referenced this pull request May 5, 2026
1. infrastructure/base/crossplane/providers/additional-rbac.yaml:138-141
   (CKV_K8S_49 — wildcard verbs on the new
   inference-service:aggregate-to-crossplane ClusterRole). Replace
   `verbs: ["*"]` with the explicit 7-verb list (get, list, watch,
   create, update, patch, delete). Functionally equivalent for
   Crossplane SA; satisfies least-privilege. Pre-existing wildcards
   on the older ClusterRoles in this file aren't in the PR diff so
   they weren't flagged — keeping them as-is to avoid unrelated churn.

2. opentofu/llm-platform/filesystem.tf:1 (CKV2_AWS_5 — SG not attached).
   False positive: the SG IS attached via
   aws_s3files_mount_target.az.security_groups (line ~47), but
   Checkov doesn't grok the newer aws_s3files_mount_target resource.
   Suppressed with `# checkov:skip=CKV2_AWS_5:<reason>` inside the
   resource block.

3-5. tooling/base/promptfoo/cronjob.yaml:9
   - CKV_K8S_43 (image not digest-pinned): pinned 0.106.0 to
     sha256:e10e5e2d0ae9a73ec10883672448506c0bf11db443fcab1afb5f461968a5616e
     (verified via skopeo).
   - CKV_K8S_40 (high UID): bumped runAsUser+fsGroup 1001 → 10001
     to avoid host-system UID collision. Promptfoo doesn't share
     volumes with other workloads (ConfigMap + emptyDir only),
     so the UID change is contained.
   - CKV_K8S_15 (imagePullPolicy): IfNotPresent → Always.

Verified locally with `checkov 3.2.517`: cronjob 86/0, rbac new
ClusterRole PASSED, filesystem SG SKIPPED with reason.
Comment on lines +1 to +8
resource "aws_security_group" "mount_targets" {
# checkov:skip=CKV2_AWS_5:SG is attached via aws_s3files_mount_target.az.security_groups (line ~47). Checkov doesn't recognize the newer aws_s3files_mount_target resource, so it emits a false positive — the SG is not orphaned.
name = "${var.filesystem_name}-mount-targets"
description = "Allow NFS (2049/TCP) from EKS worker nodes to S3 Files mount targets."
vpc_id = data.terraform_remote_state.network.outputs.vpc_id

tags = merge(var.tags, { Name = "${var.filesystem_name}-mount-targets" })
}
Smana added a commit that referenced this pull request May 5, 2026
Brainstorm output for fixing task #78 root cause. The Envoy ext_proc
+ cilium-envoy approach is structurally blocked by:
  1. SR v0.2.0 hard-coding clearRouteCache=false in
     buildRequestBodyContinueResponse — defeats Envoy's body-callback
     header-mutation re-routing.
  2. cilium-envoy's slim build (no envoy.filters.http.lua) — kills
     the standard "Lua filter calls clearRouteCache after ext_proc"
     workaround. Verified empirically: listener rejected with
     "Didn't find a registered implementation".
  3. cilium.l7policy filter on upstream filter chains — denies
     traffic to per-model EDS clusters with 403 even from
     CNP-allowed sources.

The design replaces the entire CEC + ext_proc chain with a small
custom HTTP proxy (~250 LOC Go) deployed in the llm namespace. The
proxy reads the body's model field directly and:
  - For client-deterministic (model: xplane-*): fast path, forward
    to that Service. No SR roundtrip.
  - For SR-classified (model: MoM): call SR's HTTP classify API,
    rewrite body.model, forward. Same UX as the broken ext_proc
    path but actually works.

Both OpenCode subagent dispatch (per-agent model assignment) AND
OpenWebUI MoM auto-routing flow through the same proxy. Single
provider URL stays for all clients — no client-side changes needed.

Spec sections cover goal/SC, architecture, component design,
streaming behavior, deployment plan, phased rollout (P0-P7), risks
(SSE, single-point-of-failure, SR endpoint contract), and explicit
out-of-scope (no auth/cache/circuit-breaking — the proxy is a thin
forwarder, not a control plane).

Implementation plan ships separately. Targets follow-on PR after
#1434 merges.
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch from d979049 to a371010 on May 6, 2026 05:07
Smana added a commit that referenced this pull request May 6, 2026
Pivot from "drop-in replacement" framing to "foundation, not replacement"
after honest evaluation of open-weights model quality vs frontier APIs in
2026 and verification that L40S (g6e) is not offered in eu-west-3.

Architecture trim:
- Drop InferencePool + EPP per model (zero value at min=0/max=1)
- Cancel the Go llm-router-proxy bandaid (was working around ext_proc bugs)
- Drop CEC + ext_proc body-rewrite path entirely
- Drop Phi-4-mini claim (redundant with Qwen3-8B; KEDA prom can't scale-from-0)
- Wire KEDA HTTP add-on universally for scale-from-zero on the model layer
- Iris becomes a sidecar HTTP classifier (no ext_proc); AI Gateway calls it
  for `model: MoM` requests, sets x-ai-eg-model header, routes natively

Cost: ~$1.3k/mo -> ~$220-250/mo idle (1× L4 spot for FIM only). ~80% cut.

Composition: InferenceService KCL bumps to v0.4.0 — gates the EPP rendering
behind an opt-in `spec.routing.endpointPicker.enabled` flag (default false)
so multi-replica serving can re-introduce EPP without rewriting claims.

Future-upgrade paths captured separately in docs/llm-platform-future-paths.md
(Qwen3-Coder-30B-A3B variants on L4 AWQ-4bit, L40S in eu-central-1, TP=4 on
g6.12xlarge, EPP re-introduction, claude-bridge relay).

Supersedes 2026-05-04-coding-llm-fleet-design.md ("drop-in replacement"
framing) and explicitly cancels 2026-05-05-llm-router-proxy-{design,plan}.md.
Smana added a commit that referenced this pull request May 6, 2026
39 tasks across 7 phases (~6 commits worth of work) trimming PR #1434
to the foundation-showcase shape per the 2026-05-06 design doc.

Phases:
1. KCL composition v0.4.0 — KEDA prom ScaledObject -> KEDA HTTP add-on
   HTTPScaledObject; drop EPP from default CNP ingress; add 2 kcl tests
2. AIGatewayRoute rewrite — backendRef -> keda-add-ons-http-interceptor-
   proxy with URLRewrite Host filter; ReferenceGrant in keda namespace
3. Iris ext_proc removal — drop EnvoyExtensionPolicy; classifier stays
   as HTTP sidecar
4. Subtractive cleanup — drop apps/base/ai/llm/inference-pools/, the
   phi4-mini claim, and the cancelled router-proxy spec docs
5. Model claim sanity check — qwen-coder + qwen3-8b drop to min=0
6. Documentation reframe — README LLM section, coding-clients.md, new
   docs/llm-platform-future-paths.md
7. Final validation — full kustomize build + kubeconform + trivy + kcl
   test pass; PR #1434 description rewrite via gh

Each task has concrete code, exact commands, expected outputs.
Self-review confirms spec coverage (SC-1 through SC-8), no
placeholders, type/name consistency across tasks.
Smana added a commit that referenced this pull request May 6, 2026
Replace KEDA prometheus scaler (`ScaledObject`) with KEDA HTTP add-on
(`HTTPScaledObject`) for the `minReplicas==0` branch. The prometheus
scaler deadlocks at min=0 (no pod -> no `vllm:num_requests_waiting`
metric -> no scale signal); the HTTP add-on queues the first request
on the keda-http-interceptor and signals scale-up directly.

Drop the EPP / InferencePool allow rule from `_defaultIngress` —
PR #1434's foundation-showcase trim removes the InferencePool layer
at `min=0/max=1` (it adds no value with one pod per model). Add an
allow rule for the KEDA HTTP add-on interceptor in the `keda`
namespace.

Tests added (main_test.k):
- `test_http_scaled_object_when_min_zero`
- `test_no_epp_in_default_ingress`
- `test_keda_scale_to_zero` (semantics refreshed)

Refs design doc docs/superpowers/specs/2026-05-06-oss-llm-foundation-showcase-design.md.
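
For context, the HTTPScaledObject this commit introduced looked roughly like the sketch below (host and names illustrative, following the v0.8-era add-on schema). SPEC-001 later removed this mechanism wholesale in favour of prometheus triggers with min=1.

```yaml
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: xplane-qwen-coder
  namespace: llm
spec:
  hosts:
    - xplane-qwen-coder.llm.svc.cluster.local   # assumed internal host
  scaleTargetRef:
    name: xplane-qwen-coder
    kind: Deployment
    apiVersion: apps/v1
    service: xplane-qwen-coder
    port: 8000
  replicas:
    min: 0   # the interceptor queues the first request and wakes the Deployment
    max: 1
```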
Smana added a commit that referenced this pull request May 6, 2026
Subtractive trim of PR #1434 per the foundation-showcase design.

Removed:
- apps/base/ai/llm/inference-pools/ (5× InferencePool + EPP HelmReleases,
  CNPs, kustomization) — InferencePool/EPP add no value at min=0/max=1.
  Re-introduction path documented in docs/llm-platform-future-paths.md.
- apps/base/ai/llm/phi4-mini.yaml — redundant with Qwen3-8B (KEDA prom
  scale-from-zero deadlock made Phi-4-mini unreachable anyway).
- docs/superpowers/specs/2026-05-05-llm-router-proxy-{design,plan}.md —
  the Go router-proxy was a bandaid for ext_proc bugs. AI Gateway native
  routing + Iris HTTP sidecar (Phase 3) makes the proxy unnecessary.

Updated to drop dangling phi4-mini references:
- infrastructure/base/vllm-semantic-router/helmrelease.yaml — drop the
  phi4-mini vllm_endpoint and its model_config entry; default_model
  switched to xplane-qwen3-8b (was xplane-phi4-mini, the now-deleted
  small-general claim).
- infrastructure/base/crossplane/configuration/examples/inferenceservice-basic.yaml —
  example name swapped from xplane-phi4-mini to xplane-qwen3-8b-basic.
- apps/base/openwebui/app.yaml — comment refreshed to describe the new
  AIGatewayRoute → keda-http-interceptor path (was: SR ext_proc +
  InferencePool/EPP).

Net: 2,805 deletions, 18 insertions.
Smana added a commit that referenced this pull request May 7, 2026
3-artifact SDD spec for replacing the KEDA HTTP add-on with KEDA
prometheus-trigger ScaledObjects on leading vLLM saturation signals
(num_requests_running / max-num-seqs ratio, gpu_cache_usage_perc).

Drops the proxy hop from the data path, defaults minReplicas=1, and
makes scaling decisions react before users feel degradation rather
than after queue depth fires (lagging signal).

Folds into PR #1434. Brainstorming captured in clarifications.md
(CL-1 lagging→leading, CL-2 min=1, CL-3 Knative deferred, CL-4 vLLM
Production Stack deferred, CL-5 cooldown 300s rationale).
Comment on lines +9 to +136
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: promptfoo
  namespace: promptfoo
  labels:
    app.kubernetes.io/name: promptfoo
    app.kubernetes.io/part-of: ai
spec:
  schedule: "0 2 * * *"
  timeZone: "Europe/Paris"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  startingDeadlineSeconds: 3600
  jobTemplate:
    spec:
      backoffLimit: 1
      # 7 days — long enough for `failedJobsHistoryLimit: 5` to retain
      # 5 daily failures for inspection. A shorter TTL (e.g. 24h)
      # would delete failed Jobs before the history window populated
      # and silently defeat `failedJobsHistoryLimit`.
      ttlSecondsAfterFinished: 604800
      template:
        metadata:
          labels:
            app.kubernetes.io/name: promptfoo
            app.kubernetes.io/part-of: ai
        spec:
          restartPolicy: Never
          automountServiceAccountToken: false
          securityContext:
            seccompProfile: { type: RuntimeDefault }
            runAsNonRoot: true
            runAsUser: 10001
            fsGroup: 10001
          volumes:
            - name: suite
              configMap: { name: promptfoo-eval-suite }
            - name: workdir
              emptyDir: { sizeLimit: 256Mi }
          containers:
            - name: promptfoo
              image: ghcr.io/promptfoo/promptfoo:0.106.0@sha256:e10e5e2d0ae9a73ec10883672448506c0bf11db443fcab1afb5f461968a5616e
              imagePullPolicy: Always
              command: ["/bin/sh", "-c"]
              args:
                - |
                  set -euo pipefail

                  cd /work
                  cp /suite/promptfooconfig.yaml ./promptfooconfig.yaml

                  START=$(date +%s)
                  promptfoo eval --no-progress-bar --max-concurrency 4 --output /work/results.json || true
                  END=$(date +%s)
                  DURATION=$((END - START))

                  # Per-category pass rates from JSON results. The promptfoo
                  # image is node-based (no jq) so we parse with `node -e`
                  # instead of jq. Promptfoo's JSON schema places per-test
                  # metadata at one of several paths depending on version;
                  # the chained `||` alternation tries the documented paths
                  # and falls back to "unknown" so a schema drift doesn't
                  # silently drop metrics.
                  # `$${...}` escapes — Flux postBuild substitution would
                  # otherwise consume the JS template literal placeholders.
                  node -e '
                  const r = require("/work/results.json");
                  const tests = (r.results && r.results.results) || [];
                  const g = {};
                  for (const t of tests) {
                    const c = (t.testCase && t.testCase.metadata && t.testCase.metadata.category)
                      || (t.testCase && t.testCase.vars && t.testCase.vars.category)
                      || (t.vars && t.vars.category)
                      || "unknown";
                    if (!g[c]) g[c] = {total: 0, failed: 0};
                    g[c].total++;
                    if (!t.success) g[c].failed++;
                  }
                  const out = [];
                  for (const [c, x] of Object.entries(g)) {
                    out.push(`promptfoo_test_total{category="$${c}"} $${x.total}`);
                    out.push(`promptfoo_test_failed{category="$${c}"} $${x.failed}`);
                    out.push(`promptfoo_test_pass_rate{category="$${c}"} $${(x.total - x.failed) / x.total}`);
                  }
                  console.log(out.join("\n"));
                  ' > /work/metrics.prom

                  # Total run duration (overall health gauge).
                  # `$${VAR}` escapes — Flux postBuild envsubst would
                  # otherwise consume these bash vars as Flux substitutions.
                  echo "promptfoo_run_duration_seconds $${DURATION}" >> /work/metrics.prom
                  echo "promptfoo_run_timestamp_seconds $${END}" >> /work/metrics.prom

                  echo "=== metrics.prom ==="
                  cat /work/metrics.prom
                  echo "==="

                  # Push to VictoriaMetrics.
                  curl --fail --silent --show-error \
                    -X POST \
                    -H 'Content-Type: text/plain' \
                    --data-binary @/work/metrics.prom \
                    'http://vmsingle-victoria-metrics-k8s-stack.observability.svc.cluster.local:8428/api/v1/import/prometheus'

                  echo "Pushed metrics to VictoriaMetrics."
              securityContext:
                allowPrivilegeEscalation: false
                readOnlyRootFilesystem: true
                runAsNonRoot: true
                capabilities: { drop: ["ALL"] }
                seccompProfile: { type: RuntimeDefault }
              resources:
                requests: { cpu: "200m", memory: "512Mi" }
                limits: { cpu: "1", memory: "1Gi" }
              env:
                - name: HOME
                  value: /work
              # OPENAI_API_KEY for the AI Gateway SecurityPolicy (B-1).
              # Promptfoo's openai provider auto-detects this env var
              # and sends `Authorization: Bearer <value>`.
              envFrom:
                - secretRef:
                    name: promptfoo-llm-api-key
              volumeMounts:
                - { name: suite, mountPath: /suite, readOnly: true }
                - { name: workdir, mountPath: /work }
```
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch from 91e8b4a to 679d31c on May 9, 2026 10:23
Smana added a commit that referenced this pull request May 9, 2026
Self-review pass on PR #1434:

- KEDA `keda-metrics-server` egress: 8429 → 8428. vmsingle's chart-default
  service port is 8428 (matches httproute-vmsingle.yaml backendRef and the
  composition `_DEFAULTS.prometheus_server_address`). 8429 is vmagent.
  Cilium would have silently dropped the trigger query under default-deny.
  (Corrected rule sketched below.)
- Add `keda-operator` egress to vmsingle:8428 for activation polling so
  future `scaling.minReplicas: 0` claims (XRD-supported demo override) can
  actually wake. Inert for the default min=1 fleet.
- promptfoo cronjob: emit `promptfoo_test_schema_unknown_total` counter so
  upstream JSON-schema rotations surface in metrics instead of being
  hidden under the `category="unknown"` bucket.
- inference-service main.k: document `max(num_requests_running)` aggregation
  intent — hottest-replica saturation, not fleet average. cooldownPeriod
  dampens scale-down noise.

Validation:
- kcl fmt + kcl test . -Y settings-example.yaml: 24/24 PASS
- kubeconform on edited YAMLs: 0 errors
- `.github/workflows/crossplane-modules.yml`: rewrite kcl.mod's version
  to the PR-suffixed publish version before `kcl mod push` (the push
  command ignores the OCI tag in the URL and uses kcl.mod's `version`
  field as the actual published tag). Mirror the same suffix in the
  composition-source audit step so PR runs don't fail on the bare
  kcl.mod version. Lowercase the GHCR repo owner. Drive Dockerfile
  GO_VERSION from go.mod.
- `.pre-commit-config.yaml`: exclude the vendored Envoy Gateway CRDs
  from `check-added-large-files` — the schema is ~2 MiB total; the chart
  can't be installed via HelmRelease (1 MiB Helm-release Secret cap), so
  the rendered CRDs are committed.
- `.trivyignore.yaml`: skip CKV2_AWS_5 false-positive on the S3 Files
  mount-target SG (Checkov doesn't recognize the newer
  `aws_s3files_mount_target` resource yet).
- `.gitignore`: ignore `*.tfplan` / `out.tfplan` and the local
  `.claude/scheduled_tasks.lock` (transient state, blocks Terramate).
- `.secrets.baseline`: refresh after adding LLM-platform manifests with
  pragma-allowlisted ESO `secretKey` references.
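
Roughly what the corrected KEDA egress looks like as a CiliumNetworkPolicy fragment. Selectors and the policy name are illustrative; the real rule lives in the KEDA base's network-policy.yaml.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: keda-operator
  namespace: keda
spec:
  endpointSelector:
    matchLabels:
      app: keda-operator                     # assumed label
  egress:
    - toEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: observability
            app.kubernetes.io/name: vmsingle # assumed label
      toPorts:
        - ports:
            - port: "8428"                   # vmsingle chart-default port; 8429 is vmagent
              protocol: TCP
```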
Smana added 24 commits May 9, 2026 19:11
…ing decisions

Establishes the spec-driven-development tooling that this PR ships under:

- `.claude/rules/` — path-scoped rules auto-loaded by the editor when
  touching Crossplane KCL, OpenTofu, observability, network-policies,
  or spec artifacts. Captures the repeat traps the LLM-platform
  first-deploy session surfaced (DNS L7 inspection, link-local
  entities, post-creation dict mutation, container-vs-pod
  securityContext split).
- `.claude/skills/validate/references/cross-artifact-rules.md` — V2
  validation rules referenced by the `/validate` skill.
- `CLAUDE.md` — project-level Claude Code guidance updated for the LLM
  platform opt-in gate, KEDA prometheus autoscaling, and rule
  cross-references.
- `docs/decisions/` — ADR-0003 (vLLM Production Stack vs KServe) and
  ADR-0004 (Amazon S3 Files for model weights storage), the two new
  cross-cutting decisions; index README links them.
- `docs/specs/README.md` — V2 plan.md validation-path rule.
- `docs/superpowers/{specs,plans}/` — design+plan pairs feeding the
  in-tree specs (coding-LLM fleet, AI gateway redesign, foundation
  showcase, paths 7+8 LoRA + per-tenant FinOps).
…architecture diagram

- SPEC-001 (`docs/specs/0001-llm-platform-prometheus-autoscaling/`):
  switch InferenceService autoscaling from KEDA HTTP add-on (proxy in
  data path, lagging request-count trigger) to KEDA prometheus on
  leading vLLM signals (running/max-num-seqs ratio + KV-cache util).
  `min=1` default eliminates the scale-from-zero deadlock with prometheus
  triggers. Spec artifacts include `spec.md` (WHAT), `plan.md` (HOW + 21
  tasks + 4-persona review checklist), and the append-only
  `clarifications.md` (CL-1..CL-7 — including the post-validation port
  fix `8429 → 8428` and the e2e validation walkthrough).
- `docs/llm-platform-future-paths.md` — paths 1–8 future-paths doc
  (LoRA serving, per-tenant FinOps, GPU node bin-packing, etc.) with
  paths 7+8 marked as the next slice.
- `docs/architecture/` — `llm-platform.drawio` source-of-truth diagram
  + README walkthrough of the request flow (Tailscale Gateway →
  SecurityPolicy → AIGatewayRoute → AIServiceBackend → vLLM Service).
- `crds-envoy-gateway.yaml` — Envoy Gateway 1.7.0 CRDs rendered as a
  single release-asset file (Backend, BackendTLSPolicy,
  ClientTrafficPolicy, EnvoyExtensionPolicy, EnvoyPatchPolicy,
  EnvoyProxy, HTTPRouteFilter, SecurityPolicy, BackendTrafficPolicy,
  AIGatewayRoute, AIServiceBackend). Vendored because the chart can't
  ship them via HelmRelease — Helm release Secret cap is 1 MiB; the
  rendered schemas total ~2 MiB.
- `kustomization-inference-extension.yaml` — Gateway API Inference
  Extension v1.0.0 CRDs (InferencePool / InferenceModel) sourced
  upstream via Flux Kustomization, kept for forward compatibility even
  though the AI gateway redesign no longer routes through them.
- `crds/base/kustomization.yaml` — wires both into the cluster CRD bundle.
…-in script gate

- `opentofu/config.tm.hcl`: bump cilium / karpenter / flux versions for
  May 2026.
- `opentofu/workflows.tm.hcl`: introduce the `--no-tags=opt-in` /
  `--tags=opt-in` filter convention so opt-in stacks (currently
  `llm-platform`) are skipped by default and require an explicit
  invocation to deploy / preview / destroy.
- `opentofu/eks/{init,configure}/workflows.tm.hcl`: refine the
  two-stage bootstrap orchestration scripts.
- `opentofu/eks/init/kubernetes.tf` + `helm_values/cilium.yaml`: wire
  cilium configuration tweaks needed for the LLM platform's data plane
  (CEC support — `envoyConfig.enabled: true`).
- `opentofu/eks/{init,configure}/variables.tf`: expose the variables
  the newer Cilium / Flux versions need.
…n stack)

New Terramate stack tagged `opt-in` so it's skipped by default — must
be explicitly enabled with `TM_LLM_PLATFORM_ENABLED=true` (mirrors the
Flux umbrella's `spec.suspend: true` gate). The stack provisions the
AWS resources every InferenceService claim depends on:

- `aws_s3files_file_system.models` — S3-backed POSIX filesystem for
  model weights (NFSv4 over an underlying S3 bucket; the bucket survives
  filesystem recreation, so re-bootstrap reuses already-cached weights).
- `aws_s3files_mount_target.az` — one mount target per private subnet
  / AZ. Pods land on the same-AZ mount target (cross-AZ NFS works but
  adds latency + transfer cost).
- `aws_s3files_access_point.shared` — single `/models` access point
  with posix uid:gid 1001:1001; per-claim subPath isolation handled at
  mount time.
- `aws_iam_role.s3files_service` — the S3 Files service role (allows
  `s3files.amazonaws.com` to read/write the underlying S3 bucket).
- `aws_iam_role` for the EFS CSI driver — bound via EKS Pod Identity
  to the controller + node SAs, granting AmazonEFSCSIDriverPolicy +
  AmazonS3FilesCSIDriverPolicy.
- Output `volume_handle` (`s3files:<fs>::<ap>`) — copied into
  `apps/base/ai/llm/models-pvc.yaml` to bind the in-cluster PV to the
  filesystem.

The IAM and access-point roles deliberately live in OpenTofu (durable),
while the S3 Files filesystem can be torn down + recreated cheaply.
The Flux side at `clusters/mycluster-0-llm-platform/` is suspended by
default — both gates must release for an end-to-end deploy.
…espaces

- `flux/sources/`: pin the Helm/OCI repositories the LLM platform draws
  from — Envoy Gateway, Envoy AI Gateway (controller + CRDs), KEDA,
  AWS EFS CSI driver, Iris vllm-semantic-router, Gateway API Inference
  Extension, InferencePool. `ocirepo-karpenter.yaml` bumped alongside
  the platform-wide karpenter version pin in `opentofu/config.tm.hcl`.
- `namespaces/base/`: `llm`, `envoy-gateway-system`, and
  `envoy-ai-gateway-system` namespaces created early so default-deny
  CiliumNetworkPolicies and ExternalSecrets can reference them before
  any HelmRelease lands. Wired into `namespaces/base/kustomization.yaml`.
… PSS-compatible securityContext

Pinned to chart 4.1.0 (driver 3.1.0) — first line that supports
S3 Files access points (volumeHandle `s3files:<fs>::<ap>`). Renovate
will surface chart-version bumps as PRs so changes are reviewed before
they touch a CSI driver mounting model weights.

Resource sizing tuned for the LLM platform's load profile:
- Controller: 50m/256Mi requests, 512Mi limit.
- Node: 100m/512Mi requests, 1Gi limit. 256Mi was OOM-killed under
  parallel preload Jobs all calling NodeStage/NodePublish on the
  same shared S3 Files mount; OOM left the kernel mount alive but
  broke the chart's nfs4 watchdog (new mounts then returned EACCES).

Pod-level `securityContext.seccompProfile.type: RuntimeDefault` only —
the chart's `efs-plugin` container needs `privileged: true` for kernel
mounts (incompatible with `allowPrivilegeEscalation: false` and
pod-level `runAsNonRoot: true`). The chart's defaults already lock
down the support containers (csi-provisioner, liveness-probe).

`storageclass.yaml`: hardened `storageClasses: []` since static PVs
(per-InferenceService) are the model — no dynamic provisioning here.
KEDA 2.18.0 deployed in the new `keda` namespace, configured for
restricted PSS:
- `helmrelease.yaml` — operator + admission-webhooks + metrics-apiserver
  with explicit per-component securityContext (each block fully restated
  per the upstream chart's deep-merge semantics).
- `network-policy.yaml` — default-deny on `keda-operator`,
  `keda-operator-metrics-apiserver`, and `keda-admission-webhooks`.
  Egress to vmsingle:8428 (prometheus trigger queries),
  kube-apiserver, and DNS. Ingress only from kube-apiserver
  (admission webhooks + external metrics API).
- `additional-rbac.yaml` (Crossplane providers) — aggregate ClusterRole
  granting Crossplane SA `keda.sh/scaledobjects` patch + delete verbs so
  the InferenceService composition can render ScaledObject managed
  resources.
- `activation-policy.yaml` — installs the KEDA CRDs the composition
  references.
- `karpenter-nodepools-gpu/`: dedicated NodePool + EC2NodeClass for
  NVIDIA L4 instances (`g6.xlarge` / `g6.2xlarge` — on-demand for the
  base capacity, spot for burst), labeled `gpu=l4` and tainted
  `nvidia.com/gpu=Exists:NoSchedule` so only LLM workloads schedule.
  AMI: Bottlerocket NVIDIA — exposes the GPU natively, no
  device-plugin DaemonSet required. NodePool `nodes` cap = 4 (decision
  CL-6 in SPEC-001's clarifications log). Sketched below.
- `karpenter-nodepools/`: bump default NodePool / EC2NodeClass to keep
  in lockstep with the GPU pool's API version + LLM-platform-friendly
  taints.
- `runtimeclass-nvidia/`: RuntimeClass `nvidia` referenced by the
  InferenceService composition's vLLM Deployment so containers wire
  through the NVIDIA container runtime.
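
A rough sketch of the gpu-l4 NodePool described above. Requirement keys, the GPU-quantity limit standing in for the node cap (one L4 per g6.xlarge/2xlarge), and the EC2NodeClass name are illustrative; gpu-l4-nodepool.yaml is authoritative.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-l4
spec:
  limits:
    nvidia.com/gpu: 4             # CL-6 cap (1 GPU per node on g6.xlarge/2xlarge)
  template:
    metadata:
      labels:
        gpu: l4
    spec:
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule      # only LLM workloads tolerate this
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g6.xlarge", "g6.2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-l4              # Bottlerocket NVIDIA AMI family
```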
… CNPs

- `helmrelease.yaml` — Envoy Gateway 1.7.0 controller (provides the
  `GatewayClass` Envoy AI Gateway consumes). Restricted-PSS-compatible
  per-component securityContext blocks. Watches the cluster for
  `Gateway` / `HTTPRoute` / `SecurityPolicy` resources targeting its
  GatewayClass.
- `network-policy.yaml` — default-deny CiliumNetworkPolicy for the
  controller and the data-plane proxy spawned per Gateway. Allows xDS
  from data-plane proxy back to controller (ports 18000-18002),
  ingress from kubelet for probes, ingress from in-cluster apps to
  the data-plane proxy on :8080, and egress to the API server.
…e-fronted HTTPRoute

The OpenAI-compatible LLM ingress: clients hit
`https://llm.priv.cloud.ogenki.io/v1/...` over Tailscale, the
SecurityPolicy authenticates the request, the AIGatewayRoute body
parser sets `x-ai-eg-model` from the request body, and the route
dispatches to the matching vLLM Service.

- `helmrelease.yaml` + `helmrelease-crds.yaml` — Envoy AI Gateway
  controller (v0.5.0) — direct routing per AIServiceBackend, no proxy
  hop, no in-data-plane interceptor.
- `gatewayclass.yaml` + `gateway.yaml` + `envoyproxy.yaml` — single
  shared `ai-gateway` Gateway, ClientTrafficPolicy + EnvoyProxy spec
  scoped to the LLM listener.
- `httproute.yaml` — public HTTPRoute on the Tailscale-general Gateway
  pointing `llm.priv.cloud.ogenki.io` at the AI Gateway data-plane
  Service.
- `security-policy.yaml` + `api-keys-externalsecret.yaml` —
  `apiKeyAuth` SecurityPolicy comparing the `Authorization` header
  against the values in the `ai-gateway-api-keys` Secret. Envoy
  Gateway strips the `Bearer ` scheme before comparison, so the
  ESO-rendered Secret stores the raw API key (not Bearer-prefixed).
  Source of truth: AWS Secrets Manager `platform/llm/api-keys` (a
  JSON object keyed by client identity — `openwebui_apikey`,
  `promptfoo_apikey`); seeded out-of-band so the keys survive
  cluster recreation.
- `network-policy.yaml` — default-deny on both the controller and the
  data-plane proxy. Egress to vLLM Services in `llm/`, the semantic
  router in `llm/`, kube-apiserver, and DNS. Ingress from Cilium
  Gateway, in-cluster apps, kubelet probes.
- `gapi/platform-tailscale-general-gateway.yaml` — extend the Tailscale
  general Gateway with the `llm` HTTPRoute listener.
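
Putting the pieces together, a dispatch rule on the parsed model name looks roughly like this. Field names follow upstream Envoy AI Gateway examples (recent releases use parentRefs; older ones used spec.targetRefs), and the backend/gateway names are assumptions.

```yaml
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-models
  namespace: llm
spec:
  schema:
    name: OpenAI                    # OpenAI-compatible body parsing
  parentRefs:
    - name: ai-gateway
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model   # set by the body parser from the JSON "model" field
              value: xplane-qwen3-8b
      backendRefs:
        - name: xplane-qwen3-8b     # AIServiceBackend wrapping the vLLM Service
```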
The `MoM` (mixture-of-models) virtual model dispatcher. When clients
send `model: MoM`, the AIGatewayRoute extension calls SR's HTTP
classifier (`POST /api/v1/classify/intent` on :8080); SR returns the
chosen `xplane-<name>` model id; the AI Gateway body parser rewrites
`x-ai-eg-model` and dispatches.

- `helmrelease.yaml` — vllm-semantic-router 0.0.x with signal-fusion
  routing (keyword + context-length signals). PII classifier disabled
  (upstream chart bug); semantic_cache disabled (poisons on failed
  upstreams). Memory bumped to 4Gi (OOMKilled at 512Mi).
- `network-policy.yaml` — default-deny CNP. Egress to vLLM Services,
  HuggingFace (one-shot model download for the BERT classifier
  cache; FQDN-allowlist with full subdomain depth), DNS via L7-aware
  kube-dns rule, vmagent for metrics. Ingress from envoy-gateway-system
  (the AIGatewayRoute extension's HTTP classifier client) and from
  the promptfoo namespace for evals.
New `cloud.ogenki.io/v1alpha1 InferenceService` XR + KCL composition
that templates a single vLLM model claim into 9 managed resources:

- vLLM `Deployment` on the GPU NodePool with the model-name + spec
  baked into args (`--model`, `--enable-tool-call`, `--tool-call-parser
  hermes`, `--enable-lora` + `--lora-modules` when `loraAdapters` is
  non-empty), running with restricted PSS securityContext
  (runAsUser=1000 to match the vLLM image's /etc/passwd).
- `Service` (ClusterIP, port 8000) — the OpenAI-compatible target the
  AIGatewayRoute dispatches to.
- KEDA `ScaledObject` with prometheus triggers on leading vLLM signals
  (running/max-num-seqs ratio + KV-cache util, queries against
  vmsingle:8428). `min=1` default eliminates the prometheus-trigger
  scale-from-zero deadlock; cooldownPeriod=300s for damped scale-down.
  Per SPEC-001.
- Two `CiliumNetworkPolicy` resources (vLLM ingress from the AI
  Gateway data plane only; preload Job egress to HuggingFace).
- ServiceMonitor + VMRule for the vLLM metrics scrape.
- Preload `Job` (one-shot) that downloads the model + LoRA adapters
  from HuggingFace into `/models/<name>/` and `/models/loras/<name>/`
  on the shared S3 Files mount; uses a marker file to skip re-download
  on bootstrap. Fast-path guards against partial xet-cache downloads.
- ExternalSecret pulling the HF token from AWS SM.

Composition includes:
- `main_test.k` — unit tests asserting resource counts + naming +
  security context + LoRA conditional emission + preload skip-marker.
- `README.md` + `settings-example.yaml` + `examples/` (basic and
  complete claims).
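
As a sketch of how the LoRA path surfaces in a claim (extending the minimal claim shown earlier; the loraAdapters field shape is an assumption based on this description):

```yaml
apiVersion: cloud.ogenki.io/v1alpha1
kind: InferenceService
metadata:
  name: xplane-qwen-coder
  namespace: llm
spec:
  model: Qwen/Qwen2.5-Coder-7B-Instruct   # HF repo id assumed
  loraAdapters:                           # non-empty list gates --enable-lora + --lora-modules
    - name: xplane-qwen-coder-sql-dpo
    - name: xplane-qwen-coder-securecode
  scaling:
    minReplicas: 1
    maxReplicas: 2
```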
….10)

Two new fields needed by OpenWebUI's claim:

- `deploymentStrategy` — `RollingUpdate` (default) or `Recreate`. The
  composition emits ONLY the matching strategy block (KCL inline
  conditional, no post-creation dict mutation per function-kcl issue
  #285). Recreate is required by OpenWebUI because its data PVC is RWO
  on the default gp3 StorageClass — RollingUpdate's maxSurge would
  spawn the new pod before the old one releases the volume,
  triggering `Multi-Attach error for volume`.
- `extraVolumes` + `extraVolumeMounts` — pass-through for arbitrary
  Volume / VolumeMount entries. OpenWebUI uses them to mount its
  `openwebui-data` PVC at `/app/backend/data` so the SQLite DB +
  uploaded files survive pod restarts.
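
Roughly how OpenWebUI's claim consumes the two new fields; the placement is a sketch against the XRD described above, not the exact schema.

```yaml
apiVersion: cloud.ogenki.io/v1alpha1
kind: App
metadata:
  name: xplane-openwebui
  namespace: apps
spec:
  deploymentStrategy: Recreate      # RWO PVC: RollingUpdate would hit Multi-Attach
  extraVolumes:
    - name: openwebui-data
      persistentVolumeClaim:
        claimName: openwebui-data
  extraVolumeMounts:
    - name: openwebui-data
      mountPath: /app/backend/data  # SQLite DB + uploads survive restarts
```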
`xplane-llm-models-preload` EPI with a writable IAM policy scoped to
the underlying S3 bucket of the S3 Files filesystem (read+write —
the preload Job needs to write model weights). Bound to the
`xplane-llm-models-preload` ServiceAccount in the `llm` namespace
which the InferenceService composition's preload Job uses.

`epis/kustomization.yaml` references `epis-llm` so the new EPI lands
on top of the existing platform EPI bundle.
`clusters/mycluster-0/security/eks-pod-identities.yaml` overlays the
EPI namespace into the cluster's overlay.
Two unrelated teardown-safety fixes that the LLM-platform branch
surfaced (the multiple destroy/recreate cycles exercised both):

1. `managementPolicies` without `Delete` on three stateful Buckets —
   `cnpg-backups`, `openbao-snapshot`, `xplane-harbor`-bound bucket
   (orphaning on cluster destroy preserves the data + finalizers don't
   hang). Crossplane v2 namespaced MRs do not expose
   `spec.deletionPolicy`; `managementPolicies` is the v2 mechanism.
   Plus the existing platform principle of no DeleteBucket IAM grants.

2. `security/base/zitadel/sqlinstance.yaml` — frozen-dated-snapshot
   recovery pattern so a cluster rebuild can re-bootstrap Zitadel from
   the prior snapshot (the bootstrap field is immutable post-create).

3. `scripts/eks-prepare-destroy.sh` — pre-clean Envoy AI Gateway +
   InferencePool + KEDA CRDs, drop kyverno + cilium-operator validating
   webhooks early, and unblock teardown on degraded clusters where
   admission can race the destroy ordering.

4. `scripts/terramate-destroy-confirm.sh` — single y/N prompt at the
   start of `terramate script run --reverse destroy` so the operator
   confirms once instead of per-stack.
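
For reference, the orphan-on-destroy pattern on one of those Buckets, as a hedged sketch (the namespaced v2 group shown is an example). The load-bearing part is that managementPolicies omits Delete.

```yaml
apiVersion: s3.aws.m.upbound.io/v1beta1   # namespaced Crossplane v2 MR group (example)
kind: Bucket
metadata:
  name: cnpg-backups
  namespace: infrastructure               # assumed namespace
spec:
  # No "Delete": cluster teardown orphans the bucket, so the data survives
  # and finalizers can't hang the destroy.
  managementPolicies: ["Create", "Observe", "Update", "LateInitialize"]
  forProvider:
    region: eu-west-3
```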
…y routes + LLM SLO rules

The 4 base InferenceService claims + 2 LoRA adapters + supporting
infra:

- `models-pvc.yaml` — static PV+PVC binding the cluster to the S3
  Files filesystem provisioned in opentofu/llm-platform. The
  `volumeHandle` (`s3files:<fs>::<ap>`) is updated manually after
  every `tofu apply` (header comment calls out the sync; see the PV
  sketch after this list).
- `s3-bucket.yaml` — the underlying S3 bucket Crossplane manages
  alongside the filesystem (deletion-protected via
  `managementPolicies` without Delete; bucket survives filesystem
  recreation so model weights are reused).
- `qwen-coder.yaml`, `qwen-coder-fim.yaml`, `qwen3-8b.yaml`,
  `llamaguard3-1b.yaml` — InferenceService claims. `xplane-qwen-coder`
  enables LoRA with two adapters (`xplane-qwen-coder-sql-dpo`,
  `xplane-qwen-coder-securecode`).
- `ai-gateway-routes/route.yaml` — AIGatewayRoute matching `model:
  xplane-<name>` headers (incl. LoRA adapter model names which route
  to the qwen-coder backend).
- `hf-token-externalsecret.yaml` — HuggingFace token for the preload
  Job, sourced from AWS SM.
- `preload-serviceaccount.yaml` — SA bound by the
  `xplane-llm-models-preload` EPI.
- `grafana-folder.yaml` + `grafana-dashboard.yaml` — co-located LLM
  platform dashboard (23 panels: per-model TTFT, request rate, error
  rate, GPU util, KEDA scale events, vLLM cache util, etc.).
- `vmrule-llm-slo.yaml` — 3 SLO alerts (TTFT p95, error rate, request
  saturation).
- `apps/llm/kustomization.yaml` — overlay-only Kustomization (gated by
  the LLM umbrella; not wired into `apps/mycluster-0/` which would
  bypass the suspend gate).
- `apps/mycluster-0/kustomization.yaml` — references OpenWebUI which
  is not LLM-gated (frontend-only; works whether the LLM stack is
  resumed or not, just shows no models).
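
The PV sketch referenced in the models-pvc item above. The filesystem and access-point IDs are placeholders; the volumeHandle format (s3files:<fs>::<ap>) is the piece copied from the tofu output.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-models
spec:
  capacity:
    storage: 500Gi                # illustrative size
  accessModes: ["ReadWriteMany"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""            # static binding, no dynamic provisioning
  csi:
    driver: efs.csi.aws.com
    volumeHandle: "s3files:fs-0123456789abcdef0::fsap-0123456789abcdef0"  # placeholder IDs
```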
…+ LLM API key

`xplane-openwebui` App XR: a single-replica OpenWebUI v0.5.20 in the
`apps` namespace, fronted by `chat.priv.cloud.ogenki.io` over
Tailscale. Talks OpenAI-compatible HTTP to the AI Gateway data plane.

- `app.yaml` — App claim. Strategy=Recreate (RWO PVC; RollingUpdate
  multi-attach error). `securityContext.readOnlyRootFilesystem: false`
  required (writes to `/app/backend/data`). Mounts `openwebui-data`
  PVC at `/app/backend/data` so the SQLite admin DB + chat history +
  uploaded files survive restarts. `automountServiceAccountToken: false`.
  Env vars: `OPENAI_API_BASE_URL` → AI Gateway data plane,
  `OPENAI_API_KEY` from the ESO-rendered `openwebui-llm-api-key`
  Secret, OAuth (Zitadel) creds, etc.
- `pvc.yaml` — `openwebui-data` 5Gi gp3 PVC.
- `externalsecret-llm-api-key.yaml` — pulls the raw `openwebui_apikey`
  from AWS SM `platform/llm/api-keys`. The OpenAI client inside
  OpenWebUI prepends `Bearer ` to this value before sending the
  Authorization header.
- `externalsecret-oauth-zitadel.yaml` — OIDC client_id +
  client_secret from Zitadel for OpenWebUI's "Sign in with Zitadel".
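
A hedged sketch of the API-key ExternalSecret: the store name and target secretKey are assumptions, while the remote key and property follow the text above.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: openwebui-llm-api-key
  namespace: apps
spec:
  secretStoreRef:
    name: clustersecretstore         # assumed store name
    kind: ClusterSecretStore
  target:
    name: openwebui-llm-api-key
  data:
    - secretKey: OPENAI_API_KEY      # consumed via env by OpenWebUI
      remoteRef:
        key: platform/llm/api-keys
        property: openwebui_apikey   # raw key; the client prepends "Bearer "
```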
…for LLM platform

- `vmrules/ai.yaml` — alert rules for the AI namespace
  (vllm-semantic-router availability, classifier latency, AI Gateway
  data-plane availability).
- `vmrules/kustomization.yaml` — wires `ai.yaml` into the cluster
  VMRule bundle.
- `vmservicecrapes/vllm-semantic-router.yaml` — VMServiceScrape for
  the semantic router's :8080/metrics. Wired via
  `vmservicecrapes/kustomization.yaml`.
- `loggen/helmrelease.yaml` — postRenderer to strip pod-level
  container security fields the upstream chart emits incorrectly
  (chart segments by component but doesn't deep-merge — the fix
  matches the path-scoped rule in `.claude/rules/spec-constitution.md`
  about replace-not-merge securityContext semantics).
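
A sketch of the semantic-router scrape for orientation; the label selector and port name are assumptions.

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: vllm-semantic-router
  namespace: observability
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: vllm-semantic-router   # assumed service label
  namespaceSelector:
    matchNames: ["llm"]
  endpoints:
    - port: metrics               # assumed port name for :8080/metrics
      path: /metrics
```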
…-tenant FinOps

Nightly Promptfoo evaluation suite that exercises every model
(including LoRA adapters) against the AI Gateway and emits results
as Prometheus metrics for SLO tracking.

- `namespace.yaml` — `promptfoo` namespace.
- `cronjob.yaml` — fires at 02:00 Europe/Paris. Runs Promptfoo against
  the AI Gateway with `xplane-qwen3-8b` (default model) + targeted
  probes of `xplane-qwen-coder-fim`, `xplane-qwen-coder-sql-dpo`,
  `xplane-qwen-coder-securecode`. Node-based JSON-to-Prometheus
  parser (replaced jq for portability). Pushes to vmsingle's
  `/api/v1/import/prometheus`. Tracks `promptfoo_test_schema_unknown_total`
  to surface fixture drift instead of silently absorbing it. All
  Flux postBuild substitution markers escaped (`$${VAR}`) so the
  bash + JS template literals survive postBuild.
- `eval-suite-configmap.yaml` — test cases pinned via ConfigMap.
- `externalsecret-api-key.yaml` — `promptfoo_apikey` from AWS SM
  `platform/llm/api-keys`. The eval container prepends `Bearer ` to
  this raw value before sending the Authorization header.
- `network-policy.yaml` — default-deny + egress to AI Gateway data
  plane (envoy-gateway-system :8080), vllm-semantic-router (for
  classifier probes), vmsingle (push), DNS, kubelet ingress for
  probes.
- `kustomization.yaml` — wires the lot.
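
To give a feel for the eval suite's shape, a hypothetical slice of promptfooconfig.yaml. The provider id format and assertion types follow promptfoo's config conventions; the endpoint and the category label come from this PR, everything else is illustrative.

```yaml
prompts:
  - "{{prompt}}"
providers:
  - id: openai:chat:xplane-qwen3-8b       # default model probed via the gateway
    config:
      apiBaseUrl: https://llm.priv.cloud.ogenki.io/v1
      # OPENAI_API_KEY comes from the promptfoo-llm-api-key Secret (envFrom)
tests:
  - description: basic chat sanity
    metadata:
      category: general                   # category label feeds the per-category metrics
    vars:
      prompt: "Reply with the single word: pong"
    assert:
      - type: icontains
        value: pong
```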
Add the LLM Platform group to the homepage portal with a single
chatbot link (chat.priv.cloud.ogenki.io). Internal API surface +
Grafana dashboards + Promptfoo eval results live one click deeper
under the existing Observability + Apps groups.

`tooling/mycluster-0/kustomization.yaml` wires the homepage update
into the cluster overlay.
…uster wiring

The Flux gate that pairs with `opentofu/llm-platform`'s opt-in
Terramate gate. Both must be released for an end-to-end deploy.

- `clusters/mycluster-0/llm-platform.yaml` — umbrella Flux
  Kustomization with `spec.suspend: true` (default). Points at
  `clusters/mycluster-0-llm-platform/`. Manual `flux resume
  kustomization llm-platform -n flux-system` releases the gate.
  Sketched after this list.
- `clusters/mycluster-0-llm-platform/` — sibling directory (NOT
  under `clusters/mycluster-0/`) so `flux-system`'s recursive sync
  doesn't auto-discover the children and bypass the umbrella's
  suspend. Contains 8 child Flux Kustomizations:
    - infrastructure-vllm-semantic-router
    - infrastructure-runtimeclass-nvidia
    - infrastructure-gpu-nodepools (Karpenter NodePool)
    - infrastructure-envoy-gateway
    - infrastructure-envoy-ai-gateway
    - apps-llm (InferenceService claims + OpenWebUI route)
    - security-llm-epi (writable EKS Pod Identity)
    - tooling-promptfoo (nightly evals, gated under the same umbrella)
- `clusters/mycluster-0-llm-platform/README.md` — operator runbook:
  enable/suspend/teardown procedures + the AWS SM `platform/llm/api-keys`
  bootstrap (kept outside OpenTofu so the keys survive cluster
  recreation).
- `clusters/mycluster-0/infrastructure/infrastructure.yaml` — wires
  KEDA + EFS + Envoy Gateway controller into the platform-wide
  infrastructure Kustomization (these are needed even without the LLM
  gate released).
- `clusters/mycluster-0/security/eks-pod-identities.yaml` — wire the
  `epis-llm` overlay (writable preload EPI lives there).
- `infrastructure/mycluster-0/kustomization.yaml` — references the
  new base directories (aws-efs-csi-driver, keda).
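
The umbrella gate itself, sketched as a Flux Kustomization (interval and prune are illustrative; suspend: true and the sibling path are the parts described above).

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: llm-platform
  namespace: flux-system
spec:
  suspend: true                   # gate closed by default; `flux resume` releases it
  interval: 10m
  path: ./clusters/mycluster-0-llm-platform
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
```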
…ce composition

Extend the KCL validator to also lint+test the new
`infrastructure/base/crossplane/configuration/kcl/inference-service/`
composition (4-stage: kcl fmt, syntax, render, security checks).
…master-plan drafts

- `README.md` — new "LLM Platform" section in the project overview.
  Briefly describes the OpenAI-compatible API (Bearer-token auth, 4
  base models + 2 LoRA adapters), OpenWebUI for chat, OpenCode +
  Continue for IDE.
- `docs/ai.md` — narrative architecture doc: routing modes
  (client-deterministic vs SR cascade), latency budget, 9-component
  diagram, security model (apiKeyAuth + ForwardClientIDHeader +
  sanitize=true), observability surface, request-flow walkthrough.
  The `Bearer-prefix` description matches the actual ESO template
  (raw-key Secret; Envoy strips the scheme before comparison).
- `docs/coding-clients.md` — copy-paste configs for OpenCode + Continue
  + curl + verification recipes; lists the model fleet.
- `docs/technology-choices.md` — KEDA added to the technology stack
  table (autoscaling layer).
- `docs/plans/self-hosted-llm-platform/` — parked exploration drafts
  (the original doc challenged + master plan + spec/plan/clarifications
  drafts); kept for context but superseded by `docs/specs/0001-*`.
@Smana Smana force-pushed the wip/self-hosted-llm-platform-draft branch from b57b77a to 4a31a33 on May 9, 2026 17:20
CKV_K8S_49 on inference-service:aggregate-to-crossplane: wildcard verbs
match the established pattern across additional-rbac.yaml; narrowing
would break composition reconciliation.

CKV_K8S_35 on promptfoo CronJob: openai provider auto-discovers
OPENAI_API_KEY from env; mounting as a file would require an entrypoint
wrapper and regress readOnlyRootFilesystem. Secret is single-key and
short-lived (CronJob, ttl=7d).
@Smana Smana merged commit 376ad20 into main May 14, 2026
14 checks passed
@Smana Smana deleted the wip/self-hosted-llm-platform-draft branch May 14, 2026 06:57