llama.cpp

Active
GitHub C++ MIT

Description

llama.cpp is a lightweight C/C++ inference engine that runs a wide range of open-source large language models efficiently on consumer hardware.

Key Features

  • Ultra-lightweight inference — pure C/C++ with zero dependencies, runs quantized LLMs on CPU
  • GGUF format — unified quantized model format that is cross-platform and supports partial loading
  • Hardware acceleration — Apple Silicon Metal, NVIDIA CUDA, AMD ROCm, Vulkan and OpenCL backends
  • Many architectures — Llama, Qwen, Mistral, Gemma, DeepSeek, Phi and more work out of the box
  • Server ready — built-in llama-server exposes an OpenAI-compatible HTTP API
  • Multi-language bindings — Python, Rust and Go bindings via llama-cpp-python and friends

Use Cases

💡 Running quantized LLMs locally on laptops or edge devices without a GPU
💡 Providing a zero-cost local inference backend for AI agents
💡 Running Llama 3 / Qwen and other open models with Metal acceleration on Apple Silicon
💡 Exposing GGUF models as an OpenAI-compatible API via llama-server
💡 Embedding lightweight local inference inside RAG systems to cut costs

Quick Start

# Clone and build
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Download a GGUF model
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  qwen2.5-1.5b-instruct-q4_k_m.gguf --local-dir .

# Run interactively in the terminal
./build/bin/llama-cli -m qwen2.5-1.5b-instruct-q4_k_m.gguf \
  -p "Hi, please introduce yourself." -n 256

# Or start an OpenAI-compatible server
./build/bin/llama-server -m qwen2.5-1.5b-instruct-q4_k_m.gguf --port 8080

Related Projects

Related Articles