
☄️ Quick.AI

The fastest way to run an LLM on the device in your hand.

Production-grade causal-LM inference on top of NNTrainer
Qwen 3, GPT-OSS, Gemma 3, Llama and more, with MoE on phones via on-the-fly expert streaming.


Quick start · Demos · Models · Android · Quantization · Architecture


Quick.AI in three numbers

| Peak RAM | Library size | Network use |
| --- | --- | --- |
| 16.5 GB → 1.3 GB | ~13 MB | 0 bytes |
| Qwen3-MoE 30B with FSU | single core inference library | sent over the network at runtime |

Why Quick.AI?

MoE that fits in your pocket

Run 30B-parameter Mixture-of-Experts models in ~1.3 GB of RAM with Flash Storage Utilization (FSU) — experts stream in from disk only when their tokens fire.
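The sketch below illustrates the streaming idea only: keep every expert in one flat file on flash, read an expert in when the router selects it, and evict the least recently used. The class, file layout, and cache policy are invented for this README; they are not Quick.AI's actual FSU implementation.

```cpp
// Illustrative expert-streaming cache (invented for this README, not
// Quick.AI's real FSU code). Experts live in one flat file on flash; only
// router-selected experts are resident in RAM, bounded by `max_resident`.
#include <cstdint>
#include <cstdio>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

class ExpertCache {
public:
  ExpertCache(const char *path, size_t expert_bytes, size_t max_resident)
      : file_(std::fopen(path, "rb")), expert_bytes_(expert_bytes),
        max_resident_(max_resident) {}
  ~ExpertCache() { if (file_) std::fclose(file_); }

  // Returns expert weights, reading from flash on a miss. The reference
  // stays valid until a later get() evicts this expert.
  const std::vector<uint8_t> &get(int expert_id) {
    if (auto it = map_.find(expert_id); it != map_.end()) {
      lru_.splice(lru_.begin(), lru_, it->second.second); // mark most recent
      return it->second.first;
    }
    if (map_.size() >= max_resident_) { // evict least recently used expert
      map_.erase(lru_.back());
      lru_.pop_back();
    }
    std::vector<uint8_t> buf(expert_bytes_);
    std::fseek(file_, (long)expert_id * (long)expert_bytes_, SEEK_SET);
    std::fread(buf.data(), 1, expert_bytes_, file_);
    lru_.push_front(expert_id);
    auto [it, inserted] = map_.emplace(
        expert_id, std::make_pair(std::move(buf), lru_.begin()));
    (void)inserted;
    return it->second.first;
  }

private:
  std::FILE *file_;
  size_t expert_bytes_, max_resident_;
  std::list<int> lru_; // most recently used expert id at the front
  std::unordered_map<
      int, std::pair<std::vector<uint8_t>, std::list<int>::iterator>> map_;
};
```

Resident expert memory is then capped at max_resident × expert_bytes no matter how many experts the checkpoint holds, which is the shape of the 16.5 GB → 1.3 GB cut shown below.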

Tuned for the metal

Hand-written kernels for Armv8.2-A (FP16, dotprod, i8mm) and AVX2 on x86_64. Multi-threaded with OpenMP, NEON-vectorized hot paths.
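For flavor, here is a minimal, generic use of the dotprod extension such kernels rely on (build with -march=armv8.2-a+dotprod). It is a textbook illustration, not a kernel taken from this repo.

```cpp
// Generic ARMv8.2 dotprod illustration (not a Quick.AI kernel).
// Build with: -march=armv8.2-a+dotprod
#include <arm_neon.h>
#include <cstdint>

// Dot product of two int8 vectors; vdotq_s32 does 16 multiply-accumulates
// per instruction into four int32 lanes.
int32_t dot_i8(const int8_t *a, const int8_t *b, int n) {
  int32x4_t acc = vdupq_n_s32(0);
  int i = 0;
  for (; i + 16 <= n; i += 16)
    acc = vdotq_s32(acc, vld1q_s8(a + i), vld1q_s8(b + i));
  int32_t sum = vaddvq_s32(acc); // horizontal add of the four lanes
  for (; i < n; ++i)             // scalar tail
    sum += (int32_t)a[i] * b[i];
  return sum;
}
```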

Offline by design

Weights, prompts, and activations stay on the device. No telemetry, no Python runtime at inference time.

Pluggable layers

Each transformer building block (RMSNorm, SwiGLU, QKV, MHA core, tied embeddings…) ships as an independently loadable .so — drop in your own without recompiling the world.
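Loading such a block would follow the usual POSIX plugin pattern; here is a generic sketch in which the library name and the create_layer entry point are placeholders, not Quick.AI's real symbols. Check the plugin headers for the actual interface.

```cpp
// Generic plugin-loading sketch; "libcustom_rmsnorm_layer.so" and
// "create_layer" are placeholders, not Quick.AI's real entry points.
// Link with -ldl.
#include <dlfcn.h>
#include <cstdio>

int main() {
  void *handle = dlopen("./libcustom_rmsnorm_layer.so", RTLD_NOW | RTLD_LOCAL);
  if (!handle) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  using create_fn = void *(*)();
  auto create_layer =
      reinterpret_cast<create_fn>(dlsym(handle, "create_layer"));
  if (!create_layer) {
    std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
    dlclose(handle);
    return 1;
  }
  void *layer = create_layer(); // hand the instance to the layer registry
  (void)layer;
  dlclose(handle);
  return 0;
}
```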

Embed anywhere

Native C and C++ APIs plus a clean Android JNI build. Same source tree builds for desktop, server, and mobile.

Zero‑install quantizer

quick_dot_ai_quantize shrinks an FP32 checkpoint to Q4_0 / Q4_K / Q6_K / FP16 in one command.


See it in action

MoE inference on a phone

[Demo videos: GPT-OSS 20B · Qwen3-MoE 30B-A3B]

FSU: the same model, the same machine, a 12× memory cut

[Side-by-side demo: loading the whole Qwen3-30B-A3B model (memory: 16.5 GB) vs. Quick.AI loading experts on the fly with FSU (memory: 1.3 GB)]

Supported models

| Family | Variants | Notes |
| --- | --- | --- |
| Llama | 1B / 3B / 7B-class | reference architecture |
| Qwen 2 | 0.5B – 7B | causal LM |
| Qwen 3 | 0.6B · 1.7B · 4B · 8B · 14B · 32B | HF: Qwen3-4B |
| Qwen 3-MoE | 30B-A3B | HF: Qwen3-30B-A3B · FSU |
| GPT-OSS | MoE 20B · 120B | HF: gpt-oss-20b · FSU |
| Gemma 3 | all causal variants | + sentence-embedding head |

Bring your own: subclass the causal-LM template under models/<your_family>/ and the factory wires it in. See the model author guide.
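Purely as an illustration of that flow, here is a sketch with invented names; the real base class, hooks, and registration live in the causal-LM template and factory.h.

```cpp
// Invented illustration of adding a model family. The real base class and
// registration are defined by Quick.AI's causal-LM template and factory.h.
#include <memory>

// Placeholder stand-in for the repo's causal-LM template interface.
class CausalLM {
public:
  virtual ~CausalLM() = default;
  virtual void buildGraph() = 0; // assemble RMSNorm / QKV / MHA / SwiGLU blocks
};

class MyFamilyLM : public CausalLM {
public:
  void buildGraph() override {
    // Wire up the family-specific layer stack here.
  }
};

// The factory would map your family name (from config.json) to this creator.
std::unique_ptr<CausalLM> createMyFamilyLM() {
  return std::make_unique<MyFamilyLM>();
}
```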


Quick start

```bash
# 1 · Clone (with submodules — NNTrainer rides along)
git clone --recursive https://github.com/nntrainer/Quick.AI.git
cd Quick.AI

# 2 · System deps (Ubuntu 22.04 / 24.04)
sudo apt-get install -y libopenblas-dev libflatbuffers-dev flatbuffers-compiler \
                        build-essential pkg-config
pip install meson ninja

# 3 · Build (~1 min on a modern laptop)
meson setup build -Denable-fp16=true -Dthread-backend=omp -Domp-num-threads=4
ninja -C build

# 4 · Generate
export OMP_NUM_THREADS=4 OMP_WAIT_POLICY=active OMP_PROC_BIND=true OMP_PLACES=cores
./build/quick_dot_ai_run ./res/qwen3/qwen3-4b/
```

Model layout — drop a model into res/<name>/ containing config.json, generation_config.json, tokenizer.json, tokenizer_config.json, vocab.json, nntr_config.json, and the NNTrainer .bin weight file referenced from nntr_config.json.
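A small standalone checker for that layout, written for this README rather than taken from the repo (the .bin weight file is named inside nntr_config.json, so it is not checked directly here):

```cpp
// Sanity-check a res/<name>/ model directory for the files listed above.
// Helper written for this README, not part of the repo.
#include <array>
#include <cstdio>
#include <filesystem>

bool check_model_dir(const std::filesystem::path &dir) {
  constexpr std::array required = {
      "config.json",           "generation_config.json", "tokenizer.json",
      "tokenizer_config.json", "vocab.json",             "nntr_config.json"};
  bool ok = true;
  for (const char *name : required)
    if (!std::filesystem::exists(dir / name)) {
      std::fprintf(stderr, "missing %s\n", name);
      ok = false;
    }
  return ok;
}

int main(int argc, char **argv) {
  return (argc > 1 && check_model_dir(argv[1])) ? 0 : 1;
}
```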


Android build


Prerequisites: Android NDK r21d+, CMake, Rust (for tokenizers-cpp), adb.

```bash
export ANDROID_NDK=/path/to/android-ndk
./build_android.sh        # libquick_dot_ai_core.so · quick_dot_ai · quick_dot_ai_quantize
./build_api_lib.sh        # (optional) libquick_dot_ai_api.so
./build_test_app.sh       # (optional) quick_dot_ai_test_api
./install_android.sh      # adb push to /data/local/tmp/quick_dot_ai/
```

| Script | Output(s) | Depends on |
| --- | --- | --- |
| build_android.sh | libquick_dot_ai_core.so, quick_dot_ai, quick_dot_ai_quantize | NDK + Rust |
| build_api_lib.sh | libquick_dot_ai_api.so | core lib |
| build_test_app.sh | quick_dot_ai_test_api | core + api lib |
| install_android.sh | /data/local/tmp/quick_dot_ai/* | adb device |

Run on the phone:

```bash
adb shell /data/local/tmp/quick_dot_ai/run_causallm.sh <model_path>
adb shell /data/local/tmp/quick_dot_ai/run_quantize.sh <model_path>
adb shell /data/local/tmp/quick_dot_ai/run_test_api.sh <model_name> "<prompt>"
```

All artifacts land under jni/libs/arm64-v8a/.


Quantization

```bash
# Default: FC → Q4_0, embedding → FP32
./build/quick_dot_ai_quantize /path/to/qwen3-4b

# Mix dtypes per layer family
./build/quick_dot_ai_quantize /path/to/qwen3-4b \
    --fc_dtype Q4_0 --embd_dtype Q6_K --lmhead_dtype FP16

# Write into a separate output directory
./build/quick_dot_ai_quantize /path/to/qwen3-4b -o /out/qwen3-4b-q40
```

| dtype | bits | typical use |
| --- | --- | --- |
| FP32 | 32 | embedding, LM head (default) |
| FP16 | 16 | LM head when memory matters |
| Q4_0 | 4 | FC layers (default), fastest path |
| Q4_K | 4 | FC layers, K-quant accuracy |
| Q6_K | 6 | embedding when 4-bit hurts quality |

Q4_0 is ISA-specific — an x86-quantized Q4_0 binary is not byte-compatible with ARM. Quantize on the same architecture you serve from.
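The Q4_0 / Q4_K / Q6_K names follow the ggml quantization formats, in which a Q4_0 block packs 32 weights behind one shared scale. The sketch below shows that canonical block layout as a point of reference; it is an assumption that Quick.AI's on-disk format matches it in detail.

```cpp
// Reference sketch of a ggml-style Q4_0 block: 32 weights share one scale
// and are stored as 4-bit codes. Quick.AI's exact on-disk layout may differ,
// and the packed byte order is what makes the format ISA-specific.
#include <algorithm>
#include <cmath>
#include <cstdint>

struct BlockQ4_0 {
  float d;        // shared scale (the real format stores this as FP16)
  uint8_t qs[16]; // 32 codes, two per byte; weight ≈ (code - 8) * d
};

BlockQ4_0 quantize_block(const float x[32]) {
  // Choose the scale so the largest-magnitude weight maps to an extreme code.
  float maxv = 0.f;
  for (int i = 0; i < 32; ++i)
    if (std::fabs(x[i]) > std::fabs(maxv)) maxv = x[i];
  BlockQ4_0 b{};
  b.d = maxv / -8.f;
  const float id = (b.d != 0.f) ? 1.f / b.d : 0.f;
  for (int i = 0; i < 16; ++i) {
    int lo = std::clamp((int)(x[i] * id + 8.5f), 0, 15);
    int hi = std::clamp((int)(x[i + 16] * id + 8.5f), 0, 15);
    b.qs[i] = (uint8_t)(lo | (hi << 4));
  }
  return b;
}
```

With an FP16 scale, each block costs 18 bytes per 32 weights, i.e. 4.5 bits per weight; that shared-scale structure is what the Q4_0 matmul kernels exploit.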

After quantization, point quick_dot_ai_run at the quantized directory (or mv nntr_config_quantized.json nntr_config.json in place and rerun).


Continuous integration

Every PR is gated by:

| Check | What it does |
| --- | --- |
| Linux build | Meson + Ninja on Ubuntu 22.04 & 24.04 |
| Android build | arm64-v8a, NDK r26d, Rust aarch64-linux-android |
| C++ format | clang-format 14 against .clang-format |
| CodeQL | security & quality static analysis |

Workflows live under .github/workflows/.



Contributing

We love PRs. Before opening one:

  1. meson setup build && ninja -C build — the same command CI runs.
  2. clang-format -i on any changed C/C++ files (config in .clang-format).
  3. Adding a new model family? Drop it under models/<your_family>/, wire it into models/meson.build, and register it in factory.h.

License

Quick.AI is released under the Apache License 2.0. NNTrainer, bundled as a submodule, is also Apache-2.0.

Citation

If Quick.AI is useful for your research, please cite the NNTrainer paper it builds on:

```bibtex
@inproceedings{10.1145/3639477.3639716,
  author    = {Moon, Jijoong and Lee, Hyeonseok and Chu, Jiho and Park, Donghak and Hong, Seungbaek and Seo, Hyungjun and Jeong, Donghyeon and Kong, Sungsik and Ham, Myungjoo},
  title     = {A New Frontier of AI: On-Device AI Training and Personalization},
  booktitle = {Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice},
  series    = {ICSE-SEIP '24},
  year      = {2024},
  pages     = {323--333},
  doi       = {10.1145/3639477.3639716}
}
```

Built on top of NNTrainer.
