
Self hosted stack first version #90

Merged
Smana merged 12 commits into main from feat/llm-self-hosted-stack
May 13, 2026
Conversation

Owner

@Smana Smana commented May 13, 2026

No description provided.

Smana added 12 commits May 9, 2026 22:46
Demonstration of a complete self-hosted LLM platform on Kubernetes:
- vLLM Production Stack with FP8 quantization on NVIDIA L4 (Karpenter spot)
- Crossplane abstraction (InferenceService claim) generating the full Helm/KEDA/Gateway/NetworkPolicy bundle from a single YAML
- Dual routing: explicit (Envoy AI Gateway) + semantic (Iris/vllm-semantic-router via MoM virtual model)
- Anticipatory KEDA autoscaling on vLLM saturation + KV cache pressure metrics
- Amazon S3 Files (GA April 2026) for shared model weights storage
- Three clients covered: OpenWebUI, Continue (VSCode FIM), OpenCode CLI
- Honest cost & quality assessment vs Claude Sonnet/Opus 4.6/4.7
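For illustration, the "single YAML" driving the whole Helm/KEDA/Gateway/NetworkPolicy bundle could look like the sketch below. The field names and API group are hypothetical, not the actual InferenceService schema from the repository:

```yaml
# Hypothetical InferenceService claim -- field names and apiVersion are
# illustrative, not the actual cloud-native-ref schema.
apiVersion: cloud.example.org/v1alpha1
kind: InferenceService
metadata:
  name: qwen-coder
spec:
  model: Qwen/Qwen2.5-Coder-7B-Instruct
  quantization: fp8          # FP8 on NVIDIA L4
  gpu:
    nodePool: gpu-l4         # Karpenter spot NodePool
  autoscaling:
    minReplicas: 1
    maxReplicas: 4
```

The Crossplane composition would expand a claim of this shape into the full set of Kubernetes resources described above.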
Apply Hugo/Markdown best practices to make the article more scannable:
- Collapse 3 long YAML/JSON blocks behind <details> (Crossplane claim, Iris config, OpenCode config)
- Promote the memory-bandwidth insight into an info notice (visual emphasis)
- Promote the FIM Base-vs-Instruct explanation into a tip notice
- Add lead paragraphs to the "modèles" (models) and "plateforme d'inférence" (inference platform) H2 sections (smoother H2→H3 transitions)
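Collapsing a long block behind `<details>` in Hugo-rendered Markdown follows the usual pattern; this is a generic sketch, not the article's actual content:

```markdown
<details>
<summary>Full Crossplane claim (YAML)</summary>

...the long YAML block goes here; Hugo renders the Markdown
inside the details element, collapsed by default...

</details>
```

The summary line stays visible, so readers can scan past the block unless they want the detail.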
…tice

- Rewrite architecture.drawio using official mxgraph.aws4 and
  mxgraph.kubernetes.icon shapes, aligned with the actual cloud-native-ref
  setup (4 InferenceService claims, Crossplane KCL XR, KEDA, Karpenter
  gpu-l4 NodePool, Promptfoo, ESO, S3 Files, Tailscale, Flux).
- Add architecture-simple.drawio: condensed view (~8 blocks) intended
  for inclusion in the article body.
- Reframe the open-weight notice around the OSAID definition (OSI 2024)
  to acknowledge that "open source" is widely used loosely for permissive
  licenses (Apache 2.0, MIT) covering Mistral, Qwen, DeepSeek.
- All diagram labels in English.
Hugo's bundled AVIF encoder silently produces empty 0-byte output for
several PNGs in this repo; browsers pick the AVIF source from <picture>
and render nothing instead of falling back. Removing AVIF entirely is
the simplest fix — WebP covers modern browsers, PNG is the universal
fallback.

Also bumps the srcset widths from 600/900 to 1200/2400 (retina @1200
CSS px) and the WebP quality from q80 to q90 so diagrams with fine
text stay sharp on the blog. Wraps WebP processing in `try` so any
future encoder failure logs a warning instead of breaking the build.
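The `try` wrapping described above could look roughly like this (Hugo 0.141+ provides `try`, which returns a value with `.Err` and `.Value`; the variable names and partial context here are illustrative, not the repo's actual template):

```go-html-template
{{/* Sketch only: $img and the processing spec are illustrative. */}}
{{ $webp := try ($img.Process "resize 1200x webp q90") }}
{{ with $webp.Err }}
  {{ warnf "WebP encoding failed for %s: %s" $img.RelPermalink . }}
{{ else }}
  <source srcset="{{ $webp.Value.RelPermalink }}" type="image/webp">
{{ end }}
```

On encoder failure the build logs a warning and the `<picture>` element simply omits the WebP source, leaving the PNG fallback intact.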
Hugo 0.156+ deprecated several Site APIs that the Clarity theme still
uses, triggering WARN logs on every build. Overrides the affected
theme templates locally:

- rss.xml: .Site.LanguageCode → site.Language.Locale, .Site.Author.*
  → site.Params.author.* (matches our params.toml schema)
- partials/func/getDefaultLanguage.html: site.Languages → hugo.Sites,
  site.LanguageCode → site.Language.Locale
- partials/header.html, nav.html, follow.html: .Site.Data → hugo.Data
- partials/header.html, i18nlist.html: .Language.LanguageName →
  .Language.Label
- config/_default/languages.toml: LanguageName → label (the
  user-facing config rename)

Also drops stale generated SASS cache files; Hugo regenerates them
on next build.
Replaces the ASCII art "vue d'ensemble" (overview) diagram in the self-hosted
LLM stack article with a properly rendered draw.io diagram exported as PNG.

The new diagram emphasizes the architectural point — "you can run
multiple model pods through one gateway" — over the specific lineup,
showing abstract model-X / model-Y pods inside the vLLM Production
Stack frame plus an ellipsis suggesting more. Tailscale, Iris
(semantic router), control plane, and S3 layers stay explicit.

Arrows are routed orthogonally so client → gateway flows go straight
down through the Tailscale band instead of cutting diagonally across
the EKS title.

The .drawio source is kept next to the PNG for future edits; export
with `drawio --export --format png --width 2400 --border 20`.
Commit 9b23ea4 attempted the 0.156 API migration but used fields that
don't exist on the Language type (Locale, Label). Hugo 0.139 was still
pinned, masking the breakage locally.

- bump Hugo to 0.156.0 (CI + new .mise.toml project pin)
- site.Language.Locale → site.Language.LanguageCode
  (rss.xml, getDefaultLanguage.html)
- .Language.Label → .Language.Params.label (header.html, i18nlist.html)
- move label under [en.params] / [fr.params] so it surfaces in
  .Language.Params
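The resulting `languages.toml` layout would look roughly like this (keys abridged; values are illustrative, not the repo's actual config):

```toml
# config/_default/languages.toml -- abridged sketch
[en]
languageCode = "en-us"
weight = 1
[en.params]
label = "English"      # surfaces as .Language.Params.label

[fr]
languageCode = "fr"
weight = 2
[fr.params]
label = "Français"
```

Moving `label` under `[*.params]` is what makes it reachable from templates as `.Language.Params.label`.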
Replaces the single logo.{png,webp} with logo-light.webp and
logo-dark.webp, served via a custom logo.html partial that switches
on prefers-color-scheme. params.toml points to the light variant as
the default <img src>; dark mode users get the dark variant through
the <picture> <source>.
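The partial described above amounts to a standard `prefers-color-scheme` switch; a minimal sketch, with paths and alt text illustrative:

```html
<!-- Sketch of a logo.html partial; paths and alt text are illustrative -->
<picture>
  <source srcset="/images/logo-dark.webp"
          media="(prefers-color-scheme: dark)">
  <img src="/images/logo-light.webp" alt="Site logo">
</picture>
```

Browsers that match the dark-mode media query pick the `<source>`; everything else falls back to the light `<img src>`.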
- Reorganize content bottom-up: Storage/Models -> Inference -> Access,
  with a transverse section on the InferenceService Crossplane abstraction
- Add KCL composition tip (versioning, OCI packaging, kcl test in CI)
- Rework vLLM section as definition + feature bullets, with continuous
  batching/paged attention and FP8 quantization notices verified against
  the official vLLM docs
- Simplify Autoscaling: drop the scale-to-zero detour, focus on KEDA's
  value (scale on any signal) + Karpenter for node elasticity
- New Envoy AI Gateway section explaining the project, its features
  (OpenAI-compat, token rate-limiting, multi-provider) and brief auth note
- Drop Iris routing table, Helm config and curl debug snippet;
  keep a synthetic description of what Iris is and what it offers
- Open-weight footnote points to the OSI definition; vLLM throughput
  footnote cites the SOSP'23 PagedAttention paper (Kwon et al.)
- Cross-link mise-en-abyme from ai-coding-tips to this article
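The KEDA "scale on any signal" value mentioned above could be sketched as a ScaledObject with a Prometheus trigger on vLLM's KV-cache metric (the metric name `kv_cache_usage_perc` appears in the later FR review; the prefix, threshold, target name, and Prometheus address here are hypothetical):

```yaml
# Illustrative KEDA ScaledObject scaling vLLM on KV-cache pressure;
# names, threshold, and addresses are hypothetical.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-kv-cache
spec:
  scaleTargetRef:
    name: vllm-deployment
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: avg(vllm:kv_cache_usage_perc)
        threshold: "0.8"
```

KEDA drives the pod count from the query result, while Karpenter handles the matching node elasticity.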
…ients and conclusion

- title: replace the "ses propres modèles" ("your own models") framing with
  "poser les fondations d'une plateforme open-weight évolutive" ("laying the
  foundations of a scalable open-weight platform") — drops the false
  implication of custom models and the Claude Code-alternative overclaim
- summary: align with the new positioning (InferenceService, autoscaling,
  GitOps, future-proof)
- clients section: collapse ~135 lines of YAML/config detail into ~30
  lines focused on the use cases (chat web, IDE autocomplete, agent
  CLI). Side-by-side autoplaying screencasts at 2× via a small script
  tag. Frame OpenWebUI/OpenCode as "alternatives" not "equivalents".
- supervision section: drop SLO sub-section. Add a usage/FinOps angle
  built on Envoy AI Gateway's OTel Gen AI metrics with per-tenant/user
  labels via metricsRequestHeaderAttributes. Expand Promptfoo's value
  proposition (orthogonal to Prometheus) with the assertion taxonomy
  (deterministic vs model-assisted).
- conclusion: merge "Bilan honnête" ("honest assessment") and "Dernières
  remarques" ("final remarks") into a single section. Drop the detailed
  cost table and the DeepSeek V4 deep dive. Add a short geostrategic note
  (the Chinese push on open-weight models, SWE-bench numbers vs Opus 4.7).
  Tease a future article on OpenCode.
- replace "Petite mise en abyme" ("a little mise en abyme") with "Ironie de
  l'histoire" ("irony of history") and soften the claim to "avec l'aide de
  Claude Code" ("with the help of Claude Code").
…stack

FR review: fix vLLM metric name (kv_cache_usage_perc), reorder overview to
match section layout, deduplicate Iris description, gloss FIM on first use,
tighten Couche 2 intro, plus typos and French typography fixes.

EN: full translation of the article, including assets.
The architecture diagram is shipped as architecture-vllm.png; the drawio
sources weren't referenced by the article and aren't needed in the repo.
@Smana Smana merged commit 229fbfa into main May 13, 2026
1 check passed
@Smana Smana deleted the feat/llm-self-hosted-stack branch May 13, 2026 13:51
