(Experimental) A high-throughput and memory-efficient inference and serving engine for LLMs optimized for GB10 homelabs (Python; updated Apr 15, 2026)
Optimized vLLM deployment for NVIDIA Blackwell (RTX 5090) on Linux Kernel 6.14. Resolves SM_120 kernel incompatibilities, P2P deadlocks, and memory fragmentation for high-performance LLM inference.
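Entries like the one above revolve around building PyTorch/vLLM CUDA extensions for Blackwell's new compute capability 12.0. As a general pattern for source builds (not this repo's documented procedure — the exact steps may differ), the target architecture is typically pinned via PyTorch's `TORCH_CUDA_ARCH_LIST` environment variable before compiling:

```shell
# Pin PyTorch extension builds to Blackwell's compute capability 12.0
# (sm_120). Requires a CUDA toolkit recent enough to know this arch
# (CUDA 12.8+). General pattern only; not taken from the repo above.
export TORCH_CUDA_ARCH_LIST="12.0"
```

Without this, a prebuilt wheel or a default arch list that predates sm_120 can produce the "no kernel image is available for execution on the device" class of failures these repos describe.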
Rust-native MoE inference runtime with custom CUDA kernels for Blackwell GPUs. Includes DFlash speculative decoding, multi-tier Engram memory, and entropy-adaptive routing. Targets Qwen3.5-35B-A3B on a single RTX 5060 Ti 16GB.
Pre-built onnxruntime-gpu 1.24.1 with Blackwell sm_120 CUDA kernels (RTX 5090/5080/5070)
llama.cpp fork with additional state-of-the-art (SOTA) quantization types and improved performance
Complete installation guide for ComfyUI-Hunyuan3DWrapper on NVIDIA Blackwell GPUs (RTX 5070 Ti, 5080, 5090). Covers manual compilation of custom_rasterizer for the sm_120 / compute_120 architecture.
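The recurring theme across these projects is generating the right `nvcc -gencode` flags so kernels embed SASS for sm_120 (and PTX for forward compatibility). A minimal sketch of that flag construction — `gencode_flags` is a hypothetical helper for illustration, not part of any repo listed here:

```python
# Sketch: derive nvcc -gencode flags for a set of (major, minor)
# compute capabilities, including Blackwell's 12.0 (sm_120, CUDA 12.8+).
# gencode_flags() is a hypothetical helper, not from any repo above.

def gencode_flags(capabilities):
    """Return nvcc -gencode flags for the given (major, minor) pairs."""
    flags = []
    for major, minor in sorted(capabilities):
        arch = f"{major}{minor}"
        # Embed real machine code (SASS) for each exact architecture...
        flags.append(f"-gencode=arch=compute_{arch},code=sm_{arch}")
    # ...and PTX for the newest one, so future GPUs can JIT-compile it.
    major, minor = max(capabilities)
    arch = f"{major}{minor}"
    flags.append(f"-gencode=arch=compute_{arch},code=compute_{arch}")
    return flags

# Consumer Blackwell (RTX 5070/5080/5090) reports capability 12.0:
for flag in gencode_flags([(8, 9), (12, 0)]):
    print(flag)
```

A binary built this way runs natively on both Ada (sm_89) and Blackwell (sm_120), which is exactly the gap the prebuilt-wheel and fork projects above exist to close.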