Self-hosted stack first version #90
Merged
Conversation
Demonstration of a complete self-hosted LLM platform on Kubernetes:
- vLLM Production Stack with FP8 quantization on NVIDIA L4 (Karpenter spot)
- Crossplane abstraction (InferenceService claim) generating the full Helm/KEDA/Gateway/NetworkPolicy bundle from a single YAML
- Dual routing: explicit (Envoy AI Gateway) and semantic (Iris/vllm-semantic-router via a MoM virtual model)
- Anticipatory KEDA autoscaling on vLLM saturation and KV-cache pressure metrics
- Amazon S3 Files (GA April 2026) for shared model-weight storage
- Three clients covered: OpenWebUI, Continue (VS Code FIM), OpenCode CLI
- Honest cost and quality assessment vs Claude Sonnet/Opus 4.6/4.7
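To give a sense of the abstraction, the single-YAML claim could look roughly like the sketch below. Every field name here is illustrative: the real apiVersion, kind, and schema come from the Crossplane XRD in cloud-native-ref and may differ.

```yaml
# Hypothetical InferenceService claim -- field names are illustrative, not the
# repo's actual schema. The Crossplane composition expands this one resource
# into the Helm release, KEDA ScaledObject, Gateway route, and NetworkPolicy.
apiVersion: example.org/v1alpha1
kind: InferenceService
metadata:
  name: coder-model
spec:
  model: Qwen/Qwen2.5-Coder-7B-Instruct   # HF id; weights served from S3
  quantization: fp8                        # fits the L4's 24 GB of VRAM
  nodePool: gpu-l4                         # Karpenter spot NodePool
  autoscaling:
    minReplicas: 1
    maxReplicas: 3
```

The point is the contract, not the exact keys: one claim in, a full inference bundle out.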
Apply Hugo/Markdown best practices to make the article more scannable:
- Collapse three long YAML/JSON blocks behind `<details>` (Crossplane claim, Iris config, OpenCode config)
- Promote the memory-bandwidth insight into an info notice (visual emphasis)
- Promote the FIM Base-vs-Instruct explanation into a tip notice
- Add lead paragraphs to the "modèles" and "plateforme d'inférence" H2 sections for smoother H2→H3 transitions
…tice
- Rewrite architecture.drawio using official mxgraph.aws4 and mxgraph.kubernetes.icon shapes, aligned with the actual cloud-native-ref reality (4 InferenceService claims, Crossplane KCL XR, KEDA, Karpenter gpu-l4 NodePool, Promptfoo, ESO, S3 Files, Tailscale, Flux)
- Add architecture-simple.drawio: a condensed view (~8 blocks) intended for inclusion in the article body
- Reframe the open-weight notice around the OSAID definition (OSI, 2024), acknowledging that "open source" is widely used loosely for permissive licenses (Apache 2.0, MIT) covering Mistral, Qwen, and DeepSeek
- All diagram labels in English
Hugo's bundled AVIF encoder silently produces empty 0-byte output for several PNGs in this repo; browsers pick the AVIF source from `<picture>` and render nothing instead of falling back. Removing AVIF entirely is the simplest fix: WebP covers modern browsers and PNG remains the universal fallback. Also bumps the srcset widths from 600/900 to 1200/2400 (retina at 1200 CSS px) and the WebP quality from q80 to q90 so diagrams with fine text stay sharp on the blog. Wraps WebP processing in `try` so any future encoder failure logs a warning instead of breaking the build.
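The guarded WebP step might look like this sketch (asset path and width are placeholders, and `try` requires Hugo 0.141+):

```go-html-template
{{/* Sketch only: $img and the resize spec are illustrative values. */}}
{{ $img := resources.Get "images/architecture-vllm.png" }}
{{ with try ($img.Resize "1200x webp q90") }}
  {{ with .Err }}
    {{/* Encoder failed: warn and fall through to the plain PNG <img>. */}}
    {{ warnf "WebP conversion failed for %s: %s" $img.Name . }}
  {{ else }}
    <source srcset="{{ .Value.RelPermalink }} 1200w" type="image/webp">
  {{ end }}
{{ end }}
```

With `try`, a failing encoder degrades the page to its PNG fallback instead of aborting the build.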
Hugo 0.156+ deprecated several Site APIs that the Clarity theme still uses, triggering WARN logs on every build. Overrides the affected theme templates locally:
- rss.xml: .Site.LanguageCode → site.Language.Locale, .Site.Author.* → site.Params.author.* (matches our params.toml schema)
- partials/func/getDefaultLanguage.html: site.Languages → hugo.Sites, site.LanguageCode → site.Language.Locale
- partials/header.html, nav.html, follow.html: .Site.Data → hugo.Data
- partials/header.html, i18nlist.html: .Language.LanguageName → .Language.Label
- config/_default/languages.toml: LanguageName → label (the user-facing config rename)

Also drops stale generated SASS cache files; Hugo regenerates them on the next build.
Replaces the ASCII art "vue d'ensemble" diagram in the self-hosted LLM stack article with a properly rendered draw.io diagram exported as PNG. The new diagram emphasizes the architectural point — "you can run multiple model pods through one gateway" — over the specific lineup, showing abstract model-X / model-Y pods inside the vLLM Production Stack frame plus an ellipsis suggesting more. Tailscale, Iris (semantic router), control plane, and S3 layers stay explicit. Arrows are routed orthogonally so client → gateway flows go straight down through the Tailscale band instead of cutting diagonally across the EKS title. The .drawio source is kept next to the PNG for future edits; export with `drawio --export --format png --width 2400 --border 20`.
Commit 9b23ea4 attempted the 0.156 API migration but used fields that don't exist on the Language type (Locale, Label). Hugo 0.139 was still pinned, masking the breakage locally.
- bump Hugo to 0.156.0 (CI + new .mise.toml project pin)
- site.Language.Locale → site.Language.LanguageCode (rss.xml, getDefaultLanguage.html)
- .Language.Label → .Language.Params.label (header.html, i18nlist.html)
- move label under [en.params] / [fr.params] so it surfaces in .Language.Params
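Concretely, the label move lands in config/_default/languages.toml along these lines (keys other than label are illustrative, not copied from the repo):

```toml
[en]
languageCode = "en"
weight = 1
[en.params]
label = "English"    # read via .Language.Params.label in the templates

[fr]
languageCode = "fr"
weight = 2
[fr.params]
label = "Français"
```

Keys under `[xx.params]` are the supported way to attach arbitrary per-language values that templates can read through `.Language.Params`.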
Replaces the single logo.{png,webp} with logo-light.webp and
logo-dark.webp, served via a custom logo.html partial that switches
on prefers-color-scheme. params.toml points to the light variant as
the default <img src>; dark mode users get the dark variant through
the <picture> <source>.
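A minimal sketch of such a logo.html partial (asset paths and the alt fallback are assumptions, not the repo's actual values):

```go-html-template
{{/* Sketch only: paths and params are illustrative. */}}
{{ $light := resources.Get "images/logo-light.webp" }}
{{ $dark  := resources.Get "images/logo-dark.webp" }}
<picture>
  {{/* Dark-mode users match this <source>; everyone else falls through to <img>. */}}
  <source srcset="{{ $dark.RelPermalink }}" media="(prefers-color-scheme: dark)">
  <img src="{{ $light.RelPermalink }}" alt="{{ site.Title }}">
</picture>
```

Because the `media` query is evaluated by the browser, the switch needs no JavaScript and follows live OS theme changes.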
- Reorganize content bottom-up: Storage/Models → Inference → Access, with a transverse section on the InferenceService Crossplane abstraction
- Add KCL composition tip (versioning, OCI packaging, kcl test in CI)
- Rework the vLLM section as a definition plus feature bullets, with continuous-batching/paged-attention and FP8-quantization notices verified against the official vLLM docs
- Simplify Autoscaling: drop the scale-to-zero detour, focus on KEDA's value (scale on any signal) plus Karpenter for node elasticity
- New Envoy AI Gateway section explaining the project, its features (OpenAI compatibility, token rate-limiting, multi-provider) and a brief auth note
- Drop the Iris routing table, Helm config, and curl debug snippet; keep a synthetic description of what Iris is and what it offers
- Open-weight footnote points to the OSI definition; vLLM throughput footnote cites the SOSP'23 PagedAttention paper (Kwon et al.)
- Cross-link the mise-en-abyme from ai-coding-tips to this article
…ients and conclusion
- title: replace the "ses propres modèles" framing with "poser les fondations d'une plateforme open-weight évolutive", dropping the false implications of custom models and the Claude Code alternative overclaim
- summary: align with the new positioning (InferenceService, autoscaling, GitOps, future-proof)
- clients section: collapse ~135 lines of YAML/config detail into ~30 lines focused on the use cases (web chat, IDE autocomplete, agent CLI). Side-by-side autoplaying screencasts at 2× via a small script tag. Frame OpenWebUI/OpenCode as "alternatives", not "equivalents"
- supervision section: drop the SLO sub-section. Add a usage/FinOps angle built on Envoy AI Gateway's OTel Gen AI metrics with per-tenant/user labels via metricsRequestHeaderAttributes. Expand Promptfoo's value proposition (orthogonal to Prometheus) with the assertion taxonomy (deterministic vs model-assisted)
- conclusion: merge "Bilan honnête" and "Dernières remarques" into a single section. Drop the detailed cost table and the DeepSeek V4 deep dive. Add a short geostrategic note (the Chinese push on open-weight, SWE-bench numbers vs Opus 4.7). Tease a future article on OpenCode
- replace "Petite mise en abyme" with "Ironie de l'histoire" and soften it to "avec l'aide de Claude Code"
…stack FR review: fix the vLLM metric name (kv_cache_usage_perc), reorder the overview to match the section layout, deduplicate the Iris description, gloss FIM on first use, tighten the "Couche 2" intro, plus typos and French typography fixes. EN: full translation of the article, including assets.
The architecture diagram is shipped as architecture-vllm.png; the drawio sources weren't referenced by the article and aren't needed in the repo.