llama.cpp

Run local GGUF models from the terminal with llama.cpp

What llama.cpp Is

llama.cpp is a local LLM runtime.

  • llama-cli runs prompts directly in the terminal
  • llama-server exposes a local OpenAI-compatible API
  • GGUF is the model file format llama.cpp loads

This makes llama.cpp a practical way to chat with models locally, test different model sizes, and connect local models to tools like OpenCode.

Install

Install llama.cpp with Homebrew.

brew install llama.cpp

Verify The Binaries

Check that the main binaries are available.

llama-cli --help
llama-server --help
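
If your build includes the standard --version flag, you can also confirm exactly which build you have:

llama-cli --version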

Get A GGUF Model From Hugging Face

Hugging Face is the main place to find GGUF models; it is where most of the community publishes them.

The simplest way to get started is to let llama.cpp download a compatible model directly from a Hugging Face repo.

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

Run a one-off prompt:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -p "Explain recursion in simple terms."

llama.cpp expects models in GGUF format. The -hf <user>/<model>[:quant] flag downloads a compatible model directly.
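
For example, to pin a specific quantization instead of leaving it implicit (assuming the repo publishes a Q4_K_M file):

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF:Q4_K_M -p "Explain recursion in simple terms."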

Remove A Downloaded Model

Models downloaded with -hf are cached in llama.cpp's own cache directory: by default ~/Library/Caches/llama.cpp on macOS and ~/.cache/llama.cpp on Linux, with the LLAMA_CACHE environment variable overriding both.
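
To see what is currently cached before deleting anything (macOS path shown; use ~/.cache/llama.cpp on Linux):

ls ~/Library/Caches/llama.cpp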

For the ggml-org/gemma-3-1b-it-GGUF example above, remove the cached files with a glob; downloaded filenames are derived from the repo name, so on macOS:

rm -f ~/Library/Caches/llama.cpp/ggml-org_gemma-3-1b-it-GGUF*

Run The Local Server

This repo includes a small wrapper script, run-llama-server.sh, that makes launching llama-server the out-of-the-box default.

For zsh, add an alias to ~/.zshrc that points to this script:

# Add this line to ~/.zshrc, then replace [path-to-your-local-developer-tools-repo] with your local clone path.
alias run-llama-server='[path-to-your-local-developer-tools-repo]/llama-cpp/run-llama-server.sh'

source ~/.zshrc

Then start the launcher with:

run-llama-server

llama-server is an OpenAI-compatible local HTTP server. After launch, use:

  • Browser UI: http://127.0.0.1:8080
  • API endpoint: http://127.0.0.1:8080/v1/chat/completions
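
A quick way to smoke-test the API endpoint from another terminal (the model value is just a placeholder; llama-server answers with whichever model it loaded):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'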

Optional arguments:

run-llama-server --port 8080
run-llama-server --port 8080 --ctx-size 8192

What it does:

  • Lists cached llama.cpp models
  • Lets you choose one from a numbered menu
  • Starts llama-server with --offline

The launcher uses --offline, so it only starts models already present in the local cache. If the model you want is not installed yet, download it first with llama-cli -hf ... or llama-server -hf ....
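
As a rough sketch only (run-llama-server.sh in this repo is the source of truth), a launcher with that behavior could look like the following, assuming llama.cpp's default macOS cache path:

#!/usr/bin/env bash
# Illustrative sketch of a cached-model picker; not the repo's actual script.
# Assumes llama.cpp's default cache directory (override with LLAMA_CACHE).
cache="${LLAMA_CACHE:-$HOME/Library/Caches/llama.cpp}"

# Present cached GGUF files as a numbered menu.
select model in "$cache"/*.gguf; do
  [ -n "$model" ] && break
done

# Start the server offline so nothing is downloaded implicitly.
exec llama-server -m "$model" --offline --port "${PORT:-8080}"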

For predictable results, install and run the full repo:quant value you want instead of leaving the quant implicit.

Run Manually

If you want to skip the launcher, you can still start the server manually with an exact cached model:

llama-server -hf ggml-org/gemma-4-31B-it-GGUF:Q4_K_M --offline --port 8080
llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M --offline --port 8080

OpenCode

For tools like OpenCode, llama-server is usually the right entrypoint. Coding tools usually send more text than normal chat, including system prompts, tool schemas, diffs, and file contents. If prompts start failing or feel cramped, try a larger context.

If you want the simplest first try, omit --ctx-size and let the model use its default context. If memory or performance becomes a problem, add it later to cap memory use.

If you need advanced tuning for a single local coding session, run llama-server directly. The wrapper in this repo only exposes --port and --ctx-size.
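
For example, a manual launch for a coding session with a larger context might look like this (the model and context size are illustrative):

llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M --offline --port 8080 --ctx-size 16384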

Models To Try

These are useful starting points for local testing:

  • ggml-org/gemma-3-1b-it-GGUF
    Good for: fast local testing and basic prompting
    Example: llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
  • ggml-org/gemma-4-31B-it-GGUF
    Good for: larger Gemma 4 instruct option from the ggml-org GGUF publisher
    Example: llama-cli -hf ggml-org/gemma-4-31B-it-GGUF:Q4_K_M
  • unsloth/Qwen3.6-27B-GGUF
    Good for: strong all-around Qwen 3.6 for coding and tool use
    Example: llama-cli -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M
  • unsloth/Qwen3.6-35B-A3B-GGUF
    Good for: MoE Qwen 3.6 variant for agentic coding with lower active params
    Example: llama-cli -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M

If you have more hardware headroom, try larger quants such as Q5_K_M, Q6_K, or Q8_0 when the repo provides them.
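
For example, assuming the repo publishes a Q8_0 file:

llama-cli -hf unsloth/Qwen3.6-27B-GGUF:Q8_0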

Learn More

If you want help understanding model names, quant choices, context size, and common llama-server tuning flags, see Hugging Face And Tuning.

Apple Silicon Note

llama.cpp supports Metal on Apple Silicon, which makes it a strong fit for modern Macs.
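
Depending on the build, Metal offload may already be on by default; if you want to request full GPU offload explicitly, -ngl (short for --n-gpu-layers) is the relevant flag:

llama-cli -hf ggml-org/gemma-3-1b-it-GGUF -ngl 99 -p "Hello"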

Official References