llama.cpp

Active

Description

llama.cpp is a lightweight C/C++ inference engine that runs a wide range of open-source large language models efficiently on consumer hardware.

Key Features

Ultra-lightweight inference — pure C/C++ with zero dependencies, runs quantized LLMs on CPU
GGUF format — unified quantized model format that is cross-platform and supports partial loading
Hardware acceleration — Apple Silicon Metal, NVIDIA CUDA, AMD ROCm, Vulkan and OpenCL backends
Many architectures — Llama, Qwen, Mistral, Gemma, DeepSeek, Phi and more work out of the box
Server ready — built-in llama-server exposes an OpenAI-compatible HTTP API
Multi-language bindings — Python, Rust and Go bindings via llama-cpp-python and friends

Use Cases

💡 Running quantized LLMs locally on laptops or edge devices without a GPU

💡 Providing a zero-cost local inference backend for AI agents

💡 Running Llama 3 / Qwen and other open models with Metal acceleration on Apple Silicon

💡 Exposing GGUF models as an OpenAI-compatible API via llama-server

💡 Embedding lightweight local inference inside RAG systems to cut costs

Quick Start

# Clone and build
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build && cmake --build build --config Release -j

# Download a GGUF model
huggingface-cli download Qwen/Qwen2.5-1.5B-Instruct-GGUF \
  qwen2.5-1.5b-instruct-q4_k_m.gguf --local-dir .

# Run interactively in the terminal
./build/bin/llama-cli -m qwen2.5-1.5b-instruct-q4_k_m.gguf \
  -p "Hi, please introduce yourself." -n 256

# Or start an OpenAI-compatible server
./build/bin/llama-server -m qwen2.5-1.5b-instruct-q4_k_m.gguf --port 8080

Visit GitHub Visit Website View Docs

llama.cpp

Description

Key Features

Use Cases

Tags

Categories

Quick Start

Related Projects

rocketride-server

airbyte

Crawlee

WrenAI

Related Articles

Agent Small-Model Finetuning and Edge Inference