llama.cpp
ActiveDescription
llama.cpp is a lightweight C/C++ inference engine that runs a wide range of open-source large language models efficiently on consumer hardware.
Key Features
- Ultra-lightweight inference — pure C/C++ with zero dependencies, runs quantized LLMs on CPU
- GGUF format — unified quantized model format that is cross-platform and supports partial loading
- Hardware acceleration — Apple Silicon Metal, NVIDIA CUDA, AMD ROCm, Vulkan and OpenCL backends
- Many architectures — Llama, Qwen, Mistral, Gemma, DeepSeek, Phi and more work out of the box
- Server ready — built-in llama-server exposes an OpenAI-compatible HTTP API
- Multi-language bindings — Python, Rust and Go bindings via llama-cpp-python and friends
Use Cases
Tags
Categories
Quick Start
# Clone and build
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build && cmake --build build --config Release -j
# Download a GGUF model
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF \
qwen2.5-1.5b-instruct-q4_k_m.gguf --local-dir .
# Run interactively in the terminal
./build/bin/llama-cli -m qwen2.5-1.5b-instruct-q4_k_m.gguf \
-p "Hi, please introduce yourself." -n 256
# Or start an OpenAI-compatible server
./build/bin/llama-server -m qwen2.5-1.5b-instruct-q4_k_m.gguf --port 8080