Compiles HuggingFace transformer models into optimised native Metal inference binaries. No runtime framework, no Python — just a compiled binary that runs your model at near-hardware-limit speed on Apple Silicon, using 25% less GPU power and 1.7x better energy efficiency than mlx-lm. See the full resource & power report.
TinyLlama 1.1B on Apple M1 Pro (16 GB, 200 GB/s):

    UNC Q4_0    ████████████████████████████████████████████████████████████  152.0 tok/s
    mlx-lm Q4   ████████████████████████████████████████████                  112.7 tok/s
    UNC Q8_0    ███████████████████████████████                                76.6 tok/s
    UNC F16     ███████████████████                                            47.9 tok/s

Qwen3-4B on Apple M1 Pro (Q4_0):

    mlx-lm Q4   ████████████████████████████████████████████████████           49.2 tok/s
    UNC Q4_0    ██████████████████████████████████████████                     38.7 tok/s
| Metric | UNC Metal | mlx-lm Q4 |
|---|---|---|
| Throughput | 152 tok/s | 113 tok/s |
| GPU power (decode) | 11.3W | 14.1W |
| Energy per token | 74 mJ | 125 mJ |
| Tokens per watt-hour | 12,800 | 8,000 |
| CPU instructions (200 tok) | 5.3B | 31.4B |
| Peak memory | 4.2 GB | 0.9 GB |
UNC is 1.35x faster while using 25% less GPU power, resulting in 1.7x better energy efficiency. The compiled approach eliminates the Python runtime and framework dispatch overhead entirely — roughly 5.9x fewer CPU instructions means less heat, less power, and more headroom for the GPU. See unc_resource_usage.md for the full methodology and traces.
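The energy-per-token figures follow directly from the power and throughput rows of the table above; a quick sanity check (numbers copied from the table):

```rust
fn main() {
    // GPU power (W) divided by decode throughput (tok/s) gives joules per token.
    let (unc_w, unc_tps) = (11.3_f64, 152.0_f64);
    let (mlx_w, mlx_tps) = (14.1_f64, 112.7_f64);

    let unc_mj = unc_w / unc_tps * 1000.0; // millijoules per token
    let mlx_mj = mlx_w / mlx_tps * 1000.0;

    // Prints "74 mJ vs 125 mJ -> 1.7x"
    println!("{unc_mj:.0} mJ vs {mlx_mj:.0} mJ -> {:.1}x", mlx_mj / unc_mj);
}
```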
    HuggingFace model
           |
     [ Frontend ]   Parse config.json + safetensors
           |
     [ IR Graph ]   Hardware-agnostic tensor graph
           |
     [ Compiler ]   Fusion, quantization, memory planning
           |
           +---------------+---------------+---------------+
           |               |               |               |
      [ Metal ]        [ CUDA ]        [ ROCm ]        [ WASM ]
      Obj-C + Metal    PTX kernels     HIP kernels     WebGPU shaders
      shaders          (planned)       (planned)       (planned)
           |
      Native binary
      Mach-O (AOT) or
      .unc bundle (JIT)
IR: Hardware-agnostic typed tensor graph with BatchMatMul, QuantizedMatVec, RMSNorm, LayerNorm, QKNorm, RoPE, SDPA, SwiGLU, KVCacheAppend, Gather, etc. The IR is target-independent — the same graph can be lowered to Metal (current), CUDA, ROCm, WASM, or CPU-only backends with acceleration providers like Intel oneDNN.
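To make the idea concrete, here is a hypothetical sketch of what a typed tensor graph of this kind can look like. All type and field names below are illustrative only; the real definitions live in src/ir/ and will differ.

```rust
// Illustrative only — not the actual unc IR. Nodes are ops, and each
// op refers to earlier values by index (a simple SSA-style value id).
#[derive(Debug, Clone, Copy, PartialEq)]
#[allow(dead_code, non_camel_case_types)]
enum DType { F16, Q4_0, Q8_0 }

#[derive(Debug)]
#[allow(dead_code)]
enum Op {
    RmsNorm { input: usize, eps: f32 },
    QuantizedMatVec { weight: usize, input: usize },
    Rope { input: usize, theta: f32 },
    Swiglu { gate: usize, up: usize },
}

#[derive(Debug)]
struct Node { op: Op, out_dtype: DType }

#[derive(Default, Debug)]
struct Graph { nodes: Vec<Node> }

impl Graph {
    // Append a node and return its value id, so later ops can reference it.
    fn push(&mut self, op: Op, out_dtype: DType) -> usize {
        self.nodes.push(Node { op, out_dtype });
        self.nodes.len() - 1
    }
}

fn main() {
    let mut g = Graph::default();
    let normed = g.push(Op::RmsNorm { input: 0, eps: 1e-5 }, DType::F16);
    let _proj = g.push(Op::QuantizedMatVec { weight: 1, input: normed }, DType::F16);
    println!("{} nodes", g.nodes.len()); // 2 nodes
}
```

Because the graph carries no backend-specific information, the same structure can be lowered to whichever kernel set a target provides.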
Compiler passes: Weight binding, dead code elimination, QKV fusion, Gate+Up fusion, SwiGLU fusion, Add+RMSNorm fusion, RoPE+KV fusion, PSQ pipeline, dual-path (GEMM/GEMV), kernel matching, barrier analysis, memory planning with buffer aliasing.
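Fusion passes of the kind listed above are typically simple peephole rewrites over the op sequence. A minimal sketch in the spirit of Add+RMSNorm fusion (op names here are illustrative, not the actual unc pass):

```rust
// Hypothetical peephole pass: replace an Add immediately followed by an
// RmsNorm with a single fused op, saving one kernel launch and one
// round-trip of the activation tensor through memory.
#[derive(Debug, Clone, PartialEq)]
enum Op { Add, RmsNorm, AddRmsNorm, MatVec }

fn fuse_add_rmsnorm(ops: Vec<Op>) -> Vec<Op> {
    let mut out = Vec::with_capacity(ops.len());
    let mut i = 0;
    while i < ops.len() {
        if ops[i] == Op::Add && ops.get(i + 1) == Some(&Op::RmsNorm) {
            out.push(Op::AddRmsNorm); // one fused kernel instead of two
            i += 2;
        } else {
            out.push(ops[i].clone());
            i += 1;
        }
    }
    out
}

fn main() {
    let fused = fuse_add_rmsnorm(vec![Op::MatVec, Op::Add, Op::RmsNorm]);
    println!("{fused:?}"); // [MatVec, AddRmsNorm]
}
```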
Output modes:

| Mode | Output | Use case |
|---|---|---|
| JIT (default) | `.unc` bundle — JIT-compiled via clang at first run, cached thereafter | Development, iteration |
| AOT (`--binary`) | Standalone Mach-O with embedded weights — zero dependencies | Deployment, distribution |
    # Prerequisites: Rust toolchain, Xcode Command Line Tools (macOS)
    curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

    # Clone and build
    git clone <repo-url> && cd unc
    cargo build --release

    # JIT bundle (default)
    unc compile --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant q4-0 -o ./tinyllama
    unc compile --model Qwen/Qwen3-4B --quant q4-0 -o ./qwen3

    # AOT standalone binary (single Mach-O, zero dependencies)
    unc compile --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant q4-0 --binary -o ./tinyllama

    # Run a compiled bundle
    unc run ./tinyllama.unc --prompt "The history of" --max-tokens 200

| Flag | Precision | Size (1.1B) | Speed |
|---|---|---|---|
| `f16` | 16-bit float | 2.2 GB | 47.9 tok/s |
| `q8-0` | 8-bit | 1.1 GB | 76.6 tok/s |
| `q4-0` | 4-bit | 0.6 GB | 152.0 tok/s |
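The sizes in the table roughly follow from bytes-per-parameter arithmetic, assuming unc uses GGML-style Q4_0/Q8_0 block layouts (an assumption suggested by the flag names, not confirmed by the source):

```rust
fn main() {
    // Assumed GGML-style blocks: q4_0 packs 32 weights into 16 bytes plus a
    // 2-byte f16 scale (18 B/block); q8_0 uses 32 + 2 = 34 B/block.
    let params = 1.1e9_f64; // TinyLlama 1.1B
    for (flag, bytes_per_param) in [("f16", 2.0), ("q8-0", 34.0 / 32.0), ("q4-0", 18.0 / 32.0)] {
        println!("{flag}: {:.2} GB", params * bytes_per_param / 1e9);
    }
}
```

Actual bundle sizes differ slightly from this estimate because some tensors (embeddings, norms) typically stay in higher precision and the bundle carries metadata.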
    unc list-architectures

Supported architectures: LLaMA, Mistral, Qwen, Phi, Gemma.
    src/
      frontend/        HuggingFace config parsing, model templates
      ir/              Typed tensor IR (ops, graph, types)
      compile/         Optimization passes, memory planner
      kernel/          Kernel registry, Metal kernel definitions
      emit/            Metal orchestrator codegen, AOT binary emission
      runtime/         JIT compilation, weight loading, tokenizer
      target/          Apple Silicon target detection
      unc_format/      .unc bundle serialization
    kernel_sources/
      metal/
        unc_kernels/   Custom Metal shaders (fused GEMV, SDPA, RoPE, RMSNorm, etc.)
        upstream_mlx/  MLX reference kernels (QMV, sdpa_vector headers)
MIT — see LICENSE.