Run local GGUF models from the terminal with llama.cpp
llama.cpp is a local LLM runtime. The pieces that matter here:

- `llama-cli` runs prompts directly in the terminal
- `llama-server` exposes a local OpenAI-compatible API
- GGUF is the model file format llama.cpp loads
This makes llama.cpp a practical way to chat with models locally, test different model sizes, and connect local models to tools like OpenCode.
Install llama.cpp with Homebrew.
```bash
brew install llama.cpp
```

Check that the main binaries are available.
```bash
llama-cli --help
llama-server --help
```

For most llama.cpp users, Hugging Face is the main place to find GGUF models, and it is where much of the community publishes them.
The simplest way to get started is to let llama.cpp download a compatible model directly from a Hugging Face repo.
```bash
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
```

Run a one-off prompt:
```bash
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -p "Explain recursion in simple terms."
```

llama.cpp expects models in GGUF format. The `-hf <user>/<model>[:quant]` flag downloads a compatible model directly.
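You can also pin a specific quantization in the same flag. `Q8_0` below is only an example value; check the repo's file list on Hugging Face for the quants it actually publishes:

```bash
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF:Q8_0 -p "Explain recursion in simple terms."
```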
Models downloaded with -hf are typically cached under ~/.cache/huggingface/hub/.
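To see what is already cached before deleting anything, list the cache directory (the path may differ on your setup):

```bash
ls ~/.cache/huggingface/hub/
```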
For the ggml-org/gemma-3-1b-it-GGUF example above, remove the cached model with:
```bash
rm -rf ~/.cache/huggingface/hub/models--ggml-org--gemma-3-1b-it-GGUF
```

This repo includes a small wrapper script that makes llama-server the default out-of-the-box path:
For zsh, add an alias to ~/.zshrc that points to this script:
```bash
# Add this line to ~/.zshrc, then replace [path-to-your-local-developer-tools-repo] with your local clone path.
alias run-llama-server='[path-to-your-local-developer-tools-repo]/llama-cpp/run-llama-server.sh'
source ~/.zshrc
```

Then start the launcher with:
```bash
run-llama-server
```

llama-server is an OpenAI-compatible local HTTP server. After launch, use:
- Browser UI: `http://127.0.0.1:8080`
- API endpoint: `http://127.0.0.1:8080/v1/chat/completions`
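As a quick smoke test of the running server, send a request to the chat completions endpoint with curl. The `model` value below is a placeholder; llama-server generally answers with whichever model it was launched with, regardless of what you put there:

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ]
  }'
```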
Optional arguments:
```bash
run-llama-server --port 8080
run-llama-server --port 8080 --ctx-size 8192
```

What it does:
- Lists cached `llama.cpp` models
- Lets you choose one from a numbered menu
- Starts `llama-server` with `--offline`
The launcher uses --offline, so it only starts models already present in the local cache. If the model you want is not installed yet, download it first with llama-cli -hf ... or llama-server -hf ....
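For a sense of how such a launcher works, here is a minimal sketch in the same spirit. This is illustrative only, not the repo's run-llama-server.sh, and the cache path is an assumption; point it at wherever your GGUF files actually live:

```bash
#!/usr/bin/env bash
# Illustrative launcher sketch -- not the actual run-llama-server.sh from this repo.
set -euo pipefail
shopt -s globstar nullglob

PORT="${PORT:-8080}"

# Collect cached GGUF files; the path is an assumption, adjust for your cache layout.
models=("$HOME"/.cache/huggingface/hub/**/*.gguf)
if [ "${#models[@]}" -eq 0 ]; then
  echo "No cached GGUF models found; download one first with: llama-cli -hf <user>/<model>" >&2
  exit 1
fi

# Numbered menu; exec replaces this shell with llama-server on the chosen file.
select model in "${models[@]}"; do
  [ -n "${model}" ] && exec llama-server -m "$model" --offline --port "$PORT"
done
```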
For predictable results, install and run the exact `repo:quant` value you want instead of leaving the quant implicit.
If you want to skip the launcher, you can still start the server manually with an exact cached model:
```bash
llama-server -hf ggml-org/gemma-4-31B-it-GGUF:Q4_K_M --offline --port 8080
llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M --offline --port 8080
```

For tools like OpenCode, llama-server is usually the right entrypoint. Coding tools typically send more text than normal chat: system prompts, tool schemas, diffs, and file contents. If prompts start failing or feel cramped, try a larger context.
For the simplest first run, omit --ctx-size and let the model use its default context. If memory or performance becomes a problem, set it explicitly later to cap memory use.
If you need advanced tuning for a single local coding session, run llama-server directly. The wrapper in this repo only exposes --port and --ctx-size.
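A direct invocation might combine a pinned quant with a few common tuning flags. The values below are arbitrary starting points, and flag availability can vary by build; check `llama-server --help` for your version:

```bash
llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M --offline --port 8080 \
  --ctx-size 16384 --threads 8 --n-gpu-layers 99
```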
These are useful starting points for local testing:
| Model | Good For | Example |
|---|---|---|
| `ggml-org/gemma-3-1b-it-GGUF` | Fast local testing and basic prompting | `llama-cli -hf ggml-org/gemma-3-1b-it-GGUF` |
| `ggml-org/gemma-4-31B-it-GGUF` | Larger Gemma 4 instruct option from the ggml-org GGUF publisher | `llama-cli -hf ggml-org/gemma-4-31B-it-GGUF:Q4_K_M` |
| `unsloth/Qwen3.6-27B-GGUF` | Strong all-around Qwen 3.6 for coding and tool use | `llama-cli -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M` |
| `unsloth/Qwen3.6-35B-A3B-GGUF` | MoE Qwen 3.6 variant for agentic coding with lower active params | `llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M` |
If you have more hardware headroom, try larger quants such as Q5_K_M, Q6_K, or Q8_0 when the repo provides them.
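For instance, if the repo publishes a `Q8_0` file (not all repos do; check its file list first):

```bash
llama-cli -hf unsloth/Qwen3.6-27B-GGUF:Q8_0
```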
If you want help understanding model names, quant choices, context size, and common llama-server tuning flags, see Hugging Face And Tuning.
llama.cpp supports Metal on Apple Silicon, which makes it a strong fit for modern Macs.