
pekkah/SharpInference


SharpInference

A high-performance LLM inference engine and image generation pipeline written in C# 14 / .NET 10. Runs GGUF models on CPU (AVX2/AVX-512 SIMD) and GPU (Vulkan compute shaders or CUDA cuBLAS). Includes an OpenAI- and Anthropic-compatible API server and native pipelines for Z-Image-Turbo and FLUX.1 image generation.

Status: spike. A quick experiment to see how LLM tooling can be built from scratch in .NET. Things may be broken or not work as advertised. No warranty — see LICENSE.

Prerequisites

Required

  • .NET 10 SDK
  • x86-64 CPU with AVX2 support (Haswell / Zen 1 or newer)

Optional native dependencies

| Feature | Dependency | Notes |
|---|---|---|
| Faster batched GEMM (CPU) | OpenBLAS | Place `libopenblas.dll` in `tools/openblas/` or on the system PATH. Auto-detected at startup; silently skipped if absent. |
| GPU inference (Vulkan) | Vulkan-capable GPU + drivers | Works on AMD/Intel/NVIDIA. No extra install on Windows beyond up-to-date GPU drivers. The `VULKAN_SDK` env var is used for shader recompilation only. |
| GPU inference (CUDA) | CUDA Toolkit 11.x | Requires `cublas64_11.dll` and `cudart64_110.dll` on PATH (CUDA 11 runtime). The NVRTC resolver additionally tries `nvrtc64_120_0.dll` (CUDA 12.x), then `nvrtc64_112_0.dll`, then `nvrtc64_11*.dll`. NVIDIA GPU only. Used for image generation pipelines. |
| Image upscaling (RRDBNet) | CUDA (above) | Real-ESRGAN ×2/×4 upscaler. Falls back to bicubic if CUDA is unavailable. |

Getting Models

All models use the GGUF format and are downloaded from Hugging Face.

Text generation models

The fastest way to download is with the Hugging Face CLI:

pip install huggingface_hub
mkdir -p models

# SmolLM2 1.7B — fast, low memory, great for testing (~1 GB)
huggingface-cli download bartowski/SmolLM2-1.7B-Instruct-GGUF \
  SmolLM2-1.7B-Instruct-Q4_K_M.gguf --local-dir models

# Qwen3 8B — general purpose, fits in 6 GB VRAM (~5 GB)
huggingface-cli download Qwen/Qwen3-8B-GGUF \
  Qwen3-8B-Q4_K_M.gguf --local-dir models

# Qwen3-Coder 30B-A3B — MoE coding model, ~20 t/s CPU (~17 GB)
huggingface-cli download unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF \
  Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --local-dir models

# Llama 4 Scout 109B-16E — MoE, ~5 t/s CPU on DDR4-3200 (~61 GB, 2 shards)
huggingface-cli download unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF \
  --include "Q4_K_M/*" --local-dir models

Image generation models (Z-Image-Turbo)

# DiT model (choose one quant)
huggingface-cli download jayn7/Z-Image-Turbo-GGUF \
  z_image_turbo-Q5_K_M.gguf --local-dir models        # 5.5 GB, best quality
  # z_image_turbo-Q4_K_M.gguf --local-dir models      # 4.5 GB, slightly faster

# VAE + tokenizer (from the original Tongyi-MAI repo)
huggingface-cli download Tongyi-MAI/Z-Image-Turbo \
  --include "vae/*" "tokenizer/*" --local-dir models/z-image-turbo

# Text encoder — uncensored Qwen3-4B fine-tune (~2.9 GB)
huggingface-cli download BennyDaBall/Qwen3-4b-Z-Image-Turbo-AbliteratedV1 \
  Z-Image-AbliteratedV1.Q5_K_M.gguf --local-dir models

Image generation models (FLUX.1)

# FLUX.1-schnell GGUF (~7–9 GB depending on quant)
huggingface-cli download city96/FLUX.1-schnell-gguf \
  flux1-schnell-Q4_K_S.gguf --local-dir models

# VAE + CLIP-L + T5-XXL encoders
huggingface-cli download comfyanonymous/flux_text_encoders \
  ae.safetensors clip_l.safetensors t5xxl_fp16.safetensors --local-dir models/flux

Quick Start

# Build in release mode
dotnet build -c Release

# Single-turn inference (CPU)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  -p "What is 2+2?" --temp 0

# Single-turn inference (GPU — all layers in VRAM)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  -p "What is 2+2?" --temp 0 -g -1

# Interactive chat session
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf

# MoE coding model (~20 t/s CPU)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  -p "Implement a binary search tree in C#" --temp 0

# Speculative decoding (draft + target model, ~2× faster at temp 0)
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-8B-Q4_K_M.gguf \
  --draft-model models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  -p "Write a quicksort in Python" --temp 0

# Start API server
SHARPI_MODEL=models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  dotnet run --project src/SharpInference.Server -c Release
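Conceptually, speculative decoding at temperature 0 works like this: the cheap draft model proposes a few tokens autoregressively, the target model verifies all of them in a single batched forward pass, and the longest agreeing prefix is accepted plus one corrected token from the target. A minimal sketch with toy stand-in functions (not the engine's API):

```python
def speculative_step(target_greedy, draft_greedy, prompt, lookahead=4):
    """One speculative decoding step at temperature 0 (greedy).

    target_greedy / draft_greedy map a token list to the next greedy token
    (toy stand-ins for the real models). Returns the tokens accepted this step.
    """
    # 1. Draft model proposes `lookahead` tokens autoregressively (cheap).
    draft, ctx = [], list(prompt)
    for _ in range(lookahead):
        t = draft_greedy(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Target model scores prompt + draft in ONE batched forward pass:
    #    target_next[i] is the target's greedy choice after prompt + draft[:i].
    target_next = [target_greedy(list(prompt) + draft[:i])
                   for i in range(lookahead + 1)]

    # 3. Accept the longest prefix where draft and target agree, then take
    #    one "free" token from the target at the first disagreement.
    accepted = []
    for i, d in enumerate(draft):
        if d == target_next[i]:
            accepted.append(d)
        else:
            accepted.append(target_next[i])   # target's correction
            return accepted
    accepted.append(target_next[lookahead])   # all drafts accepted: bonus token
    return accepted
```

In the engine this is what `--draft-model` and `--spec-lookahead` control; the speedup depends on how often the draft agrees with the target.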

CLI Reference

Until a NuGet package is published, run the CLI via `dotnet run`:

dotnet run --project src/SharpInference.Cli -c Release -- [COMMAND] [OPTIONS]

Text inference (default command)

Flag names are intentionally compatible with llama.cpp / llama-cli.

| Flag | Default | Description |
|---|---|---|
| `-m, --model` | auto-detect | Path to GGUF model file |
| `-p, --prompt` | | Input prompt; omit to enter interactive chat |
| `-n, --n-predict` | 512 | Maximum tokens to generate |
| `-c, --ctx-size` | model default | Context / max sequence length (0 = model default) |
| `--temp` | 0.7 | Sampling temperature (0 = greedy / deterministic) |
| `--top-k` | 40 | Top-k sampling (0 = disabled) |
| `--top-p` | 0.95 | Top-p nucleus sampling |
| `--min-p` | 0.05 | Min-p sampling |
| `--rep-penalty` | 1.1 | Repetition penalty (1.0 = disabled) |
| `-s, --seed` | -1 | RNG seed (-1 = random) |
| `-g, --n-gpu-layers` | 0 | Layers to offload to GPU (0 = CPU only, -1 = all) |
| `--tq` | off | Enable TurboQuant KV-cache compression (3-bit, ~5× less VRAM) |
| `--single-turn` | off | Generate one response and exit (non-interactive) |
| `--system-prompt` | | System prompt prepended to conversation |
| `--no-display-prompt` | off | Suppress echoing the prompt |
| `--verbose-prompt` | off | Print token IDs before generating |
| `--draft-model` | | Path to draft model for speculative decoding (requires `--temp 0`) |
| `--spec-lookahead` | 4 | Draft tokens per speculative step |
| `--min-batch-blas` | 16 | Minimum batch size to use OpenBLAS SGEMM (also: `SHARPI_MIN_BATCH_BLAS` env var) |
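The sampling flags compose a llama.cpp-style chain over the raw logits. As an illustration of how the knobs interact, here is a simplified pure-Python sketch in one possible order (not the engine's actual code; repetition penalty omitted):

```python
import math, random

def sample(logits, temp=0.7, top_k=40, top_p=0.95, min_p=0.05, rng=random):
    """Illustrative sampler over a list of raw logits (one float per token id)."""
    if temp == 0:                       # greedy / deterministic
        return max(range(len(logits)), key=lambda i: logits[i])

    cand = sorted(range(len(logits)), key=lambda i: -logits[i])
    if top_k > 0:
        cand = cand[:top_k]             # keep only the k highest logits

    # softmax over the surviving candidates
    m = max(logits[i] for i in cand)
    probs = [math.exp(logits[i] - m) for i in cand]
    z = sum(probs); probs = [p / z for p in probs]

    if min_p > 0:                       # drop tokens far below the best one
        keep = [j for j, p in enumerate(probs) if p >= min_p * probs[0]]
        cand = [cand[j] for j in keep]; probs = [probs[j] for j in keep]

    cum, cut = 0.0, len(cand)           # top-p: smallest prefix with mass >= p
    for j, p in enumerate(probs):
        cum += p
        if cum >= top_p:
            cut = j + 1; break
    cand, probs = cand[:cut], probs[:cut]

    # re-apply temperature (p ** (1/T) is softmax(logit/T) up to scale) and draw
    w = [p ** (1.0 / temp) for p in probs]
    z = sum(w); r = rng.random() * z
    for i, wi in zip(cand, w):
        r -= wi
        if r <= 0:
            return i
    return cand[-1]
```

With `--temp 0` everything after the greedy argmax is skipped, which is why that setting is deterministic.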

image — image generation

Supports two native pipelines: Z-Image-Turbo (auto-detected from model filename) and FLUX.1.

Z-Image-Turbo example

dotnet run --project src/SharpInference.Cli -c Release -- image \
  -m models/z_image_turbo-Q5_K_M.gguf \
  --vae models/z-image-turbo/vae \
  --qwen-encoder models/Z-Image-AbliteratedV1.Q5_K_M.gguf \
  --qwen-tokenizer models/z-image-turbo/tokenizer/tokenizer.json \
  -p "a serene mountain lake at sunrise" \
  -W 1024 -H 1024 --steps 4 -o landscape.png -v

FLUX.1-schnell example

dotnet run --project src/SharpInference.Cli -c Release -- image \
  -m models/flux1-schnell-Q4_K_S.gguf \
  --vae models/flux/ae.safetensors \
  --clip-l models/flux/clip_l.safetensors \
  --clip-tokenizer models/flux/tokenizer_clip.json \
  --t5xxl models/flux/t5xxl_fp16.safetensors \
  --t5-tokenizer models/flux/tokenizer_t5.json \
  -p "a cinematic photograph of a mountain lake" \
  -W 512 -H 512 --steps 4 -o out.png

All image options

| Flag | Default | Description |
|---|---|---|
| `-m, --model` | | Diffusion model GGUF |
| `-p, --prompt` | | Text prompt describing the image |
| `--negative-prompt` | | What to avoid in the image |
| `--vae` | | VAE safetensors file or `vae/` directory |
| `--qwen-encoder` | | (Z-Image) Qwen3-4B GGUF text encoder |
| `--qwen-tokenizer` | | (Z-Image) Qwen3 `tokenizer.json` |
| `--clip-l` | | (FLUX) CLIP-L encoder safetensors |
| `--clip-tokenizer` | | (FLUX) CLIP `tokenizer.json` |
| `--t5xxl` | | (FLUX) T5-XXL encoder safetensors |
| `--t5-tokenizer` | | (FLUX) T5 `tokenizer.json` |
| `-W, --width` | 512 | Output width in pixels (must be divisible by 16) |
| `-H, --height` | 512 | Output height in pixels (must be divisible by 16) |
| `--steps` | 4 | Denoising steps (4 optimal for Z-Image-Turbo/FLUX schnell) |
| `--cfg-scale` | auto | Guidance scale (not used for distilled models) |
| `-s, --seed` | -1 | RNG seed (-1 = random) |
| `-g, --n-gpu-layers` | -1 | GPU accel: -1 = auto (CUDA→Vulkan→CPU), 0 = CPU only |
| `--backend` | auto | Force backend: `auto`, `cuda`, `vulkan`, `cpu` |
| `--upscaler` | | Path to ESRGAN/Real-ESRGAN weights (`.safetensors`) for ×2/×4 upscale |
| `--upscale-blend` | 1.0 | Blend factor for upscaling (1.0 = sharpest, lower = softer) |
| `-o, --output` | output.png | Output PNG path |
| `-v, --verbose` | off | Show per-step timing |
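Both pipelines are few-step distilled flow models, which is why `--steps 4` suffices. The shape of the denoising loop can be sketched as a generic Euler integrator from noise (t = 1) toward the image (t = 0); this is illustrative only, and the actual schedules and velocity parameterisations are the pipelines' own:

```python
def euler_sample(velocity, latent, steps=4):
    """Generic Euler sampler for a rectified-flow model (illustrative sketch).

    velocity(x, t) predicts the flow at time t; `latent` starts as Gaussian
    noise at t = 1.0 and is integrated down to t = 0.0 in `steps` steps.
    """
    ts = [1.0 - i / steps for i in range(steps + 1)]   # 1.0 -> 0.0
    x = list(latent)
    for t_cur, t_next in zip(ts, ts[1:]):
        dt = t_next - t_cur                 # negative: moving toward t = 0
        v = velocity(x, t_cur)              # one DiT forward pass per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Each step costs one full DiT forward pass, so 4-step distilled models are roughly an order of magnitude cheaper than 25-50 step samplers.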

Z-Image-Turbo GPU acceleration timing

Benchmarked on AMD Zen 4 + RTX 4070 Ti:

| Stage | First run | Subsequent runs |
|---|---|---|
| Text encoder (Qwen3-4B, cuBLAS bf16) | ~90 s (weights cached in VRAM) | ~0 s (prompt cache) |
| DiT denoising, 4 steps (cuBLAS bf16) | ~4 s | ~4 s |
| VAE decoder (cuBLAS fp32 im2col) | ~23 s (weights cached in VRAM) | ~2 s |
| Total | ~117 s | ~30 s |

list-metadata — inspect a GGUF file

dotnet run --project src/SharpInference.Cli -c Release -- \
  list-metadata -m models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf

Prints all GGUF metadata key/value pairs in a table (architecture, context length, rope settings, tokenizer vocab, etc.).
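The fixed part of a GGUF header is easy to inspect by hand. A sketch following the public GGUF spec: 4-byte magic `GGUF`, uint32 version, uint64 tensor count, uint64 metadata key/value count, all little-endian:

```python
import io
import struct

def read_gguf_header(f):
    """Read the fixed GGUF header from a binary stream (little-endian):
    4-byte magic b"GGUF", uint32 version, uint64 tensor count,
    uint64 metadata key/value count. The variable-length metadata
    key/value pairs follow immediately after."""
    magic = f.read(4)
    if magic != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensor_count": n_tensors,
            "metadata_kv_count": n_kv}
```

Usage: `read_gguf_header(open("models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf", "rb"))`; any binary stream (including `io.BytesIO`) works.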

API Server

Note: The ASP.NET host hasn't been exercised end-to-end — it builds and the endpoint handlers have unit tests, but running against real clients has not been validated. Expect it to need fixes.

Starts an HTTP server compatible with OpenAI and Anthropic clients. Defaults to http://localhost:5000.

# Start (CPU)
SHARPI_MODEL=models/SmolLM2-1.7B-Instruct-Q4_K_M.gguf \
  dotnet run --project src/SharpInference.Server -c Release

# OpenAI chat completions (streaming)
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"smollm2","messages":[{"role":"user","content":"Hello"}],"stream":true}'

# Anthropic messages (non-streaming)
curl http://localhost:5000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"smollm2","messages":[{"role":"user","content":"Hello"}],"max_tokens":256}'

# OpenAI Responses API
curl http://localhost:5000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model":"smollm2","input":"Hello"}'

# List loaded model
curl http://localhost:5000/v1/models

# Health check
curl http://localhost:5000/health

# Prometheus metrics
curl http://localhost:5000/metrics
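Because the endpoints follow the OpenAI wire format, any OpenAI-compatible client pointed at the base URL should work. A dependency-free Python sketch that builds the same request as the curl example above (`chat_request` is an illustrative helper, not part of this repo; the server must be running before you send it):

```python
import json
from urllib import request

def chat_request(base_url, model, content, stream=False, max_tokens=256):
    """Build an OpenAI-style chat.completions request for the local server.

    Returns a urllib Request; send it with `request.urlopen(req)` once the
    server is up (default http://localhost:5000)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "stream": stream,
        "max_tokens": max_tokens,
    }).encode()
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```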

Server environment variables

| Variable | Default | Description |
|---|---|---|
| `SHARPI_MODEL` | model.gguf | Path to GGUF model file |
| `SHARPI_MAX_BATCH` | 1 | Enable continuous batching for N concurrent users (> 1 activates `ContinuousBatchingEngine`) |
| `SHARPI_MIN_BATCH_BLAS` | 16 | Minimum batch size to use OpenBLAS SGEMM in `MatMulBatched` |
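Continuous batching admits new requests into free batch slots as soon as any sequence finishes, instead of making the whole batch drain first. A toy scheduler sketch (illustrative only, not `ContinuousBatchingEngine`'s code):

```python
from collections import deque

def continuous_batching(step_fn, requests, max_batch=4):
    """Toy continuous-batching loop.

    step_fn(seq) decodes one token for `seq` and returns True when that
    sequence is finished. Finished slots are refilled from the queue
    between decode steps, keeping the batch full under load."""
    pending = deque(requests)
    active, finished = [], []
    while pending or active:
        # admit queued sequences into free batch slots
        while pending and len(active) < max_batch:
            active.append(pending.popleft())
        # one decode step over the whole batch
        done = [seq for seq in active if step_fn(seq)]
        for seq in done:
            active.remove(seq)
            finished.append(seq)
    return finished
```

Short requests therefore never wait behind long ones for longer than a single decode step, which is the main latency win over static batching.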

Supported & Tested Models

Text generation

| Model | HuggingFace repo | Architecture | Quant | File size | Notes |
|---|---|---|---|---|---|
| SmolLM2 1.7B Instruct | HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF | llama | Q4_K_M | ~1 GB | Fast, low RAM, great for testing |
| Qwen3 8B | Qwen/Qwen3-8B-GGUF | qwen3 | Q4_K_M | ~5 GB | General purpose; fits in 6 GB VRAM |
| Qwen3-Coder 30B-A3B Instruct | Qwen/Qwen3-Coder-30B-A3B-Instruct-GGUF | qwen3moe | Q4_K_M | ~17 GB | MoE, 128 experts / 8 active, ~20 t/s CPU |
| Llama 4 Scout 109B-16E Instruct | unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF | llama4 | Q4_K_M | ~61 GB (2 shards) | MoE, 16 experts, ~5 t/s on DDR4-3200 |

Any GGUF model with architecture llama, llama4, qwen3, or qwen3moe should work.

Image generation

| Model | HuggingFace repo | Quant | File size | Notes |
|---|---|---|---|---|
| Z-Image-Turbo DiT | jayn7/Z-Image-Turbo-GGUF | Q5_K_M | 5.5 GB | Best quality; also Q4_K_M (4.5 GB) |
| Z-Image-Turbo text encoder | BennyDaBall/Qwen3-4b-Z-Image-Turbo-AbliteratedV1 | Q5_K_M | 2.9 GB | Uncensored fine-tune of Qwen3-4B |
| FLUX.1-schnell | city96/FLUX.1-schnell-gguf | Q4_K_S | ~7 GB | 4-step distilled; VAE+encoders from comfyanonymous/flux_text_encoders |

Performance

Benchmarked on AMD Zen 4 (12c/24t, DDR4-3200) + RTX 4070 Ti:

| Model | Backend | Decode (t/s) | Notes |
|---|---|---|---|
| SmolLM2 1.7B Q4_K_M | GPU (Vulkan) | 131.3 | Multi-row compute shaders + subgroupAdd |
| SmolLM2 1.7B Q4_K_M | CPU (AVX2) | 48.6 | Fused dequant-matvec, multi-threaded |
| Qwen3 8B Q4_K_M | GPU (Vulkan) | 43.5 | Full VRAM fit |
| Qwen3 8B Q4_K_M | CPU | 13.5 | 1.23× llama.cpp |
| Llama 4 Scout Q4_K_M | CPU | 5.3 | Bandwidth-limited on 65 GB DDR4 |
| Qwen3-Coder 30B-A3B Q4_K_M | CPU | 20.8 | MoE: only 8/128 experts active per token |
| llama.cpp SmolLM2 1.7B | CPU (reference) | 45.1 | Same hardware |
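CPU decode for large models is memory-bandwidth-bound: each generated token must stream the active weights from RAM once, so tokens/s is roughly bandwidth divided by active bytes per token. For MoE models only the selected experts are read, which is why Qwen3-Coder 30B-A3B decodes faster than the dense 8B despite a far larger file. A back-of-envelope sketch (the bandwidth and byte figures below are round-number assumptions, not measurements):

```python
def decode_tps_estimate(active_bytes_per_token, bandwidth_bytes_per_s):
    """Rough decode-speed estimate for a memory-bound model: every token
    streams the active weights from RAM once."""
    return bandwidth_bytes_per_s / active_bytes_per_token

ddr4_3200_dual = 51.2e9   # assumed theoretical dual-channel DDR4-3200 bandwidth
scout_active = 10e9       # Llama 4 Scout: ~17B active params at Q4 ~ 10 GB/token (rough)
print(round(decode_tps_estimate(scout_active, ddr4_3200_dual), 1))  # ~5.1
```

The estimate lands in the same range as the ~5.3 t/s measured for Llama 4 Scout above, consistent with the "bandwidth-limited" note.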

Build & Test

dotnet build              # Debug build
dotnet build -c Release   # Release (IlcOptimizationPreference=Speed)
dotnet test               # Run all tests (207 tests across 5 projects)

# NativeAOT single-binary publish
dotnet publish src/SharpInference.Cli    -c Release -r win-x64
dotnet publish src/SharpInference.Server -c Release -r win-x64

# Benchmarks (requires benchmark models to be present)
dotnet run --project benchmarks/SharpInference.Bench -c Release -- --filter '*'

Helper Scripts

The scripts/ directory contains optional helpers for development and validation. The PowerShell scripts target Windows; the Python scripts require llama-cpp-python.

| Script | Purpose |
|---|---|
| `download-model.ps1` | Downloads GGUF models into `models/` from Hugging Face. Accepts `-Model <name>` for any of `smollm2`, `qwen3-8b`, `llama31-70b`, `qwen3-coder-30b-a3b`, `llama4-scout`, `z-image-turbo`, `z-image-turbo-q8`, `realesrgan-x4`. Skips files already present. |
| `setup-openblas.ps1` | Downloads OpenBLAS (default 0.3.28) and installs `libopenblas.dll` into `tools/openblas/` for the optional CPU GEMM acceleration path. |
| `setup-llamacpp.ps1` | Downloads prebuilt llama.cpp binaries into `tools/llama.cpp/`. Variants: `cpu` (default), `vulkan`, `cuda-12.4`, `cuda-13.1`. Used as the reference implementation for forward-pass validation. |
| `generate-reference-logits.ps1` | Runs llama.cpp with `--logits-all` on a fixed prompt and writes reference logits to `tests/reference-data/` for comparison against the SharpInference forward pass. Requires `setup-llamacpp.ps1` and `download-model.ps1 -Model smollm2` to have been run first. |
| `compare_tokens.py` | Python helper that tokenizes a chat prompt with llama-cpp-python and prints top-5 logits at each step. Used to debug divergence against Llama 4 Scout. |
| `extract_reference.py` | Python helper that prints model metadata (n_vocab, n_ctx_train, n_embd) and token IDs for prompt fragments. Useful when investigating tokenizer disagreements. |

Typical first-time setup on Windows:

# From repo root
.\scripts\setup-openblas.ps1                  # optional, enables OpenBLAS GEMM
.\scripts\download-model.ps1 -Model smollm2   # fetch a small test model
.\scripts\setup-llamacpp.ps1                  # optional, for reference validation
.\scripts\generate-reference-logits.ps1       # optional, regenerates tests/reference-data/

Projects

| Project | Description |
|---|---|
| SharpInference.Core | GGUF parser, BPE tokenizer, tensor types, model graph |
| SharpInference.Cpu | CPU backend: AVX2/AVX-512 SIMD, Q4_K_M dequantization, optional OpenBLAS GEMM |
| SharpInference.Vulkan | GPU backend: Vulkan compute shaders via Vortice.Vulkan |
| SharpInference.Cuda | GPU backend: CUDA cuBLAS P/Invoke, NVRTC custom kernels (im2col, element-wise ops) |
| SharpInference.Engine | Forward pass (CPU/GPU/Hybrid), paged KV cache, sampling, speculative decoding |
| SharpInference.Diffusion | Z-Image-Turbo + FLUX.1 pipeline: DiT, VAE decoder, Qwen3 + CLIP-L + T5-XXL encoders |
| SharpInference.TurboQuant | KV-cache compression using 3-bit Lloyd-Max codebooks |
| SharpInference.Pipeline | 3-tier memory hierarchy (VRAM → RAM → NVMe), SLRU expert cache, async prefetcher |
| SharpInference.Cli | CLI tool (`sharpi-cli`) with NativeAOT support |
| SharpInference.Server | OpenAI + Anthropic + Responses API server with NativeAOT support |
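The 3-bit codebooks used by SharpInference.TurboQuant can be illustrated with a plain one-dimensional Lloyd-Max trainer (equivalent to 1-D k-means): pick 2³ = 8 levels that minimise mean squared error, then store each KV value as the 3-bit index of its nearest level. This sketch is illustrative, not the library's implementation:

```python
def lloyd_max(samples, bits=3, iters=20):
    """Train a 1-D Lloyd-Max quantizer: 2**bits codebook levels that
    minimise MSE on `samples` (one-dimensional k-means)."""
    levels = 2 ** bits
    lo, hi = min(samples), max(samples)
    # initialise the codebook uniformly over the sample range
    code = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(levels)]
        for x in samples:                       # nearest-level assignment
            j = min(range(levels), key=lambda j: (x - code[j]) ** 2)
            buckets[j].append(x)
        code = [sum(b) / len(b) if b else c     # centroid update
                for b, c in zip(buckets, code)]
    return code

def quantize(x, code):
    """Encode a value as the index (3 bits for an 8-level codebook)
    of its nearest codebook entry."""
    return min(range(len(code)), key=lambda j: (x - code[j]) ** 2)
```

Storing 3-bit indices plus a small codebook in place of 16-bit values is where the roughly 5× KV-cache VRAM saving quoted for `--tq` comes from.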

Architecture

See docs/SharpInference-Design.md.

License

Released under the MIT License.
