
Self hosted stack first version #90

Merged
Smana merged 12 commits into main from feat/llm-self-hosted-stack
May 13, 2026
Conversation

Owner

@Smana Smana commented May 13, 2026

No description provided.

Smana added 12 commits May 9, 2026 22:46
Demonstration of a complete self-hosted LLM platform on Kubernetes:
- vLLM Production Stack with FP8 quantization on NVIDIA L4 (Karpenter spot)
- Crossplane abstraction (InferenceService claim) generating the full Helm/KEDA/Gateway/NetworkPolicy bundle from a single YAML
- Dual routing: explicit (Envoy AI Gateway) + semantic (Iris/vllm-semantic-router via MoM virtual model)
- Anticipatory KEDA autoscaling on vLLM saturation + KV cache pressure metrics
- Amazon S3 Files (GA April 2026) for shared model weights storage
- Three clients covered: OpenWebUI, Continue (VSCode FIM), OpenCode CLI
- Honest cost & quality assessment vs Claude Sonnet/Opus 4.6/4.7
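For illustration, the "single YAML" driving the whole Helm/KEDA/Gateway/NetworkPolicy bundle could look like the sketch below. The field names and API group are hypothetical, not the actual InferenceService schema from the repository:

```yaml
# Hypothetical InferenceService claim -- field names and apiVersion are
# illustrative, not the actual cloud-native-ref schema.
apiVersion: cloud.example.org/v1alpha1
kind: InferenceService
metadata:
  name: qwen-coder
spec:
  model: Qwen/Qwen2.5-Coder-7B-Instruct
  quantization: fp8          # FP8 on NVIDIA L4
  gpu:
    nodePool: gpu-l4         # Karpenter spot NodePool
  autoscaling:
    minReplicas: 1
    maxReplicas: 4
```

The Crossplane composition would expand a claim of this shape into the full set of Kubernetes resources described above.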
Apply Hugo/Markdown best practices to make the article more scannable:
- Collapse 3 long YAML/JSON blocks behind <details> (Crossplane claim, Iris config, OpenCode config)
- Promote the memory-bandwidth insight into an info notice (visual emphasis)
- Promote the FIM Base-vs-Instruct explanation into a tip notice
- Add lead paragraphs to the "modèles" (models) and "plateforme d'inférence" (inference platform) H2 sections (smoother H2→H3 transitions)
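Collapsing a long block behind `<details>` in Hugo-rendered Markdown follows the usual pattern; this is a generic sketch, not the article's actual content:

```markdown
<details>
<summary>Full Crossplane claim (YAML)</summary>

...the long YAML block goes here; Hugo renders the Markdown
inside the details element, collapsed by default...

</details>
```

The summary line stays visible, so readers can scan past the block unless they want the detail.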
…tice

- Rewrite architecture.drawio using official mxgraph.aws4 and
  mxgraph.kubernetes.icon shapes, aligned with the actual cloud-native-ref
  setup (4 InferenceService claims, Crossplane KCL XR, KEDA, Karpenter
  gpu-l4 NodePool, Promptfoo, ESO, S3 Files, Tailscale, Flux).
- Add architecture-simple.drawio: condensed view (~8 blocks) intended
  for inclusion in the article body.
- Reframe the open-weight notice around the OSAID definition (OSI 2024)
  to acknowledge that "open source" is widely used loosely for permissive
  licenses (Apache 2.0, MIT) covering Mistral, Qwen, DeepSeek.
- All diagram labels in English.
Hugo's bundled AVIF encoder silently produces empty 0-byte output for
several PNGs in this repo; browsers pick the AVIF source from <picture>
and render nothing instead of falling back. Removing AVIF entirely is
the simplest fix — WebP covers modern browsers, PNG is the universal
fallback.

Also bumps the srcset widths from 600/900 to 1200/2400 (retina @1200
CSS px) and the WebP quality from q80 to q90 so diagrams with fine
text stay sharp on the blog. Wraps WebP processing in `try` so any
future encoder failure logs a warning instead of breaking the build.
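The `try` wrapping described above could look roughly like this (Hugo 0.141+ provides `try`, which returns a value with `.Err` and `.Value`; the variable names and partial context here are illustrative, not the repo's actual template):

```go-html-template
{{/* Sketch only: $img and the processing spec are illustrative. */}}
{{ $webp := try ($img.Process "resize 1200x webp q90") }}
{{ with $webp.Err }}
  {{ warnf "WebP encoding failed for %s: %s" $img.RelPermalink . }}
{{ else }}
  <source srcset="{{ $webp.Value.RelPermalink }}" type="image/webp">
{{ end }}
```

On encoder failure the build logs a warning and the `<picture>` element simply omits the WebP source, leaving the PNG fallback intact.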
Hugo 0.156+ deprecated several Site APIs that the Clarity theme still
uses, triggering WARN logs on every build. Overrides the affected
theme templates locally:

- rss.xml: .Site.LanguageCode → site.Language.Locale, .Site.Author.*
  → site.Params.author.* (matches our params.toml schema)
- partials/func/getDefaultLanguage.html: site.Languages → hugo.Sites,
  site.LanguageCode → site.Language.Locale
- partials/header.html, nav.html, follow.html: .Site.Data → hugo.Data
- partials/header.html, i18nlist.html: .Language.LanguageName →
  .Language.Label
- config/_default/languages.toml: LanguageName → label (the
  user-facing config rename)

Also drops stale generated SASS cache files; Hugo regenerates them
on next build.
Replaces the ASCII art "vue d'ensemble" (overview) diagram in the self-hosted
LLM stack article with a properly rendered draw.io diagram exported as PNG.

The new diagram emphasizes the architectural point — "you can run
multiple model pods through one gateway" — over the specific lineup,
showing abstract model-X / model-Y pods inside the vLLM Production
Stack frame plus an ellipsis suggesting more. Tailscale, Iris
(semantic router), control plane, and S3 layers stay explicit.

Arrows are routed orthogonally so client → gateway flows go straight
down through the Tailscale band instead of cutting diagonally across
the EKS title.

The .drawio source is kept next to the PNG for future edits; export
with `drawio --export --format png --width 2400 --border 20`.
Commit 9b23ea4 attempted the 0.156 API migration but used fields that
don't exist on the Language type (Locale, Label). Hugo 0.139 was still
pinned, masking the breakage locally.

- bump Hugo to 0.156.0 (CI + new .mise.toml project pin)
- site.Language.Locale → site.Language.LanguageCode
  (rss.xml, getDefaultLanguage.html)
- .Language.Label → .Language.Params.label (header.html, i18nlist.html)
- move label under [en.params] / [fr.params] so it surfaces in
  .Language.Params
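The resulting `languages.toml` layout would look roughly like this (keys abridged; values are illustrative, not the repo's actual config):

```toml
# config/_default/languages.toml -- abridged sketch
[en]
languageCode = "en-us"
weight = 1
[en.params]
label = "English"      # surfaces as .Language.Params.label

[fr]
languageCode = "fr"
weight = 2
[fr.params]
label = "Français"
```

Moving `label` under `[*.params]` is what makes it reachable from templates as `.Language.Params.label`.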
Replaces the single logo.{png,webp} with logo-light.webp and
logo-dark.webp, served via a custom logo.html partial that switches
on prefers-color-scheme. params.toml points to the light variant as
the default <img src>; dark mode users get the dark variant through
the <picture> <source>.
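The partial described above amounts to a standard `prefers-color-scheme` switch; a minimal sketch, with paths and alt text illustrative:

```html
<!-- Sketch of a logo.html partial; paths and alt text are illustrative -->
<picture>
  <source srcset="/images/logo-dark.webp"
          media="(prefers-color-scheme: dark)">
  <img src="/images/logo-light.webp" alt="Site logo">
</picture>
```

Browsers that match the dark-mode media query pick the `<source>`; everything else falls back to the light `<img src>`.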
- Reorganize content bottom-up: Storage/Models -> Inference -> Access,
  with a transverse section on the InferenceService Crossplane abstraction
- Add KCL composition tip (versioning, OCI packaging, kcl test in CI)
- Rework vLLM section as definition + feature bullets, with continuous
  batching/paged attention and FP8 quantization notices verified against
  the official vLLM docs
- Simplify Autoscaling: drop the scale-to-zero detour, focus on KEDA's
  value (scale on any signal) + Karpenter for node elasticity
- New Envoy AI Gateway section explaining the project, its features
  (OpenAI-compat, token rate-limiting, multi-provider) and brief auth note
- Drop Iris routing table, Helm config and curl debug snippet;
  keep a synthetic description of what Iris is and what it offers
- Open-weight footnote points to the OSI definition; vLLM throughput
  footnote cites the SOSP'23 PagedAttention paper (Kwon et al.)
- Cross-link mise-en-abyme from ai-coding-tips to this article
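The KEDA "scale on any signal" value mentioned above could be sketched as a ScaledObject with a Prometheus trigger on vLLM's KV-cache metric (the metric name `kv_cache_usage_perc` appears in the later FR review; the prefix, threshold, target name, and Prometheus address here are hypothetical):

```yaml
# Illustrative KEDA ScaledObject scaling vLLM on KV-cache pressure;
# names, threshold, and addresses are hypothetical.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-kv-cache
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: avg(vllm:kv_cache_usage_perc)
        threshold: "0.8"
```

KEDA drives the pod count from the query result, while Karpenter handles the matching node elasticity.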
…ients and conclusion

- title: replace the "ses propres modèles" ("your own models") framing with
  "poser les fondations d'une plateforme open-weight évolutive" ("laying the
  foundations of a scalable open-weight platform") — drops the false
  implication of custom models and the Claude Code-alternative overclaim
- summary: align with the new positioning (InferenceService, autoscaling,
  GitOps, future-proof)
- clients section: collapse ~135 lines of YAML/config detail into ~30
  lines focused on the use cases (chat web, IDE autocomplete, agent
  CLI). Side-by-side autoplaying screencasts at 2× via a small script
  tag. Frame OpenWebUI/OpenCode as "alternatives" not "equivalents".
- supervision section: drop SLO sub-section. Add a usage/FinOps angle
  built on Envoy AI Gateway's OTel Gen AI metrics with per-tenant/user
  labels via metricsRequestHeaderAttributes. Expand Promptfoo's value
  proposition (orthogonal to Prometheus) with the assertion taxonomy
  (deterministic vs model-assisted).
- conclusion: merge "Bilan honnête" ("honest assessment") and "Dernières
  remarques" ("final remarks") into a single section. Drop the detailed
  cost table and the DeepSeek V4 deep dive. Add a short geostrategic note
  (the Chinese push on open-weight models, SWE-bench numbers vs Opus 4.7).
  Tease a future article on OpenCode.
- replace "Petite mise en abyme" ("a little mise en abyme") with "Ironie de
  l'histoire" ("irony of history") and soften the claim to "avec l'aide de
  Claude Code" ("with the help of Claude Code").
…stack

FR review: fix vLLM metric name (kv_cache_usage_perc), reorder overview to
match section layout, deduplicate Iris description, gloss FIM on first use,
tighten Couche 2 intro, plus typos and French typography fixes.

EN: full translation of the article, including assets.
The architecture diagram is shipped as architecture-vllm.png; the drawio
sources weren't referenced by the article and aren't needed in the repo.
@Smana Smana merged commit 229fbfa into main May 13, 2026
1 check passed
@Smana Smana deleted the feat/llm-self-hosted-stack branch May 13, 2026 13:51
