Braintrust

Active
GitHub TypeScript MIT

Description

Braintrust is an evaluation and observability platform for AI applications, providing experiment tracking, scoring, prompt management, and production monitoring for LLM-powered systems.

Key Features

  • Experiment tracking and comparison: record LLM inputs, outputs, parameters, and results so that versions can be compared
  • Automated and human scoring: supports LLM-as-judge, manual labeling, and custom evaluators
  • Dataset management with versioning and reuse
  • Prompt management with version control and A/B experimentation
  • Production monitoring: track latency, error rates, and quality metrics of live LLM calls
  • SDK ecosystem: Python/JS/TS SDKs with integrations for LangChain, LlamaIndex, and the Vercel AI SDK
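
To make the production monitoring feature above concrete, here is a minimal sketch of the kind of aggregation it implies: p95 latency, error rate, and mean quality over a window of logged calls. The CallRecord schema and the summarize function are hypothetical names invented for illustration and are not part of the Braintrust SDK; the platform computes its own metrics server-side.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class CallRecord:
    """One logged LLM call (hypothetical schema, not Braintrust's)."""
    latency_ms: float
    ok: bool        # False if the call raised or returned an error
    quality: float  # score in [0, 1] from an automated or human scorer


def summarize(records: list[CallRecord]) -> dict[str, float]:
    """Aggregate latency, error rate, and mean quality over a window."""
    if not records:
        raise ValueError("no records to summarize")
    latencies = sorted(r.latency_ms for r in records)
    # Nearest-rank 95th percentile: smallest value with >= 95% of calls at or below it.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "p95_latency_ms": latencies[p95_index],
        "error_rate": sum(not r.ok for r in records) / len(records),
        "mean_quality": mean(r.quality for r in records),
    }
```

A dashboard or alert rule would typically evaluate a summary like this per time window and per prompt version, so that a quality drop can be traced to a specific change.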

Use Cases

💡 Run prompt regression tests in CI to compare output quality between versions
💡 Bulk-evaluate RAG retrieval and generation with LLM-as-judge scoring
💡 Manage prompt template versions and run A/B experiments to find the best prompt
💡 Monitor latency, error rate, and quality of production LLM calls
💡 Centralize test datasets and evaluators for cross-team reuse
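
The CI regression use case above comes down to comparing per-scorer averages between a baseline run and a candidate run, then failing the build when quality drops. The sketch below shows that comparison in plain Python. The function name regression_gate, the max_drop threshold, and the dict-of-score-lists input shape are assumptions made for illustration; in practice the scores would be exported from the two experiments being compared.

```python
from statistics import mean


def regression_gate(
    baseline: dict[str, list[float]],
    candidate: dict[str, list[float]],
    max_drop: float = 0.02,
) -> list[str]:
    """Return the names of scorers whose mean fell by more than max_drop.

    Assumes every baseline scorer has at least one score.
    """
    failures = []
    for scorer, base_scores in baseline.items():
        cand_scores = candidate.get(scorer)
        if not cand_scores:
            # A scorer missing from the candidate run counts as a regression.
            failures.append(scorer)
            continue
        if mean(base_scores) - mean(cand_scores) > max_drop:
            failures.append(scorer)
    return failures
```

A CI job could call this after both runs finish and exit with a non-zero status when the returned list is non-empty, which blocks the merge until the regression is reviewed.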

Quick Start

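# Install the SDK (shell)
pip install braintrust

# Define an evaluation (Python). The [...] placeholders are left for you to
# fill in, and openai_call stands for your own model-call function.
from braintrust import Eval
Eval("my-eval", data=lambda: [...], task=lambda x: openai_call(x), scores=[...])
# Or stream production traces via the Braintrust proxy.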
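
To show how the data, task, and scores arguments in the snippet above fit together, here is a small local stand-in that runs without the SDK or an API key. It is a sketch under stated assumptions: run_local_eval and exact_match are hypothetical names, the {"input": ..., "expected": ...} case shape and the (input, output, expected) scorer signature are assumed for illustration, and none of this is the Braintrust implementation.

```python
from statistics import mean
from typing import Any, Callable

Case = dict[str, Any]  # assumed shape: {"input": ..., "expected": ...}
Scorer = Callable[[Any, Any, Any], float]  # (input, output, expected) -> score in [0, 1]


def exact_match(input: Any, output: Any, expected: Any) -> float:
    """Custom evaluator: 1.0 when the output equals the expected value."""
    return 1.0 if output == expected else 0.0


def run_local_eval(
    data: Callable[[], list[Case]],
    task: Callable[[Any], Any],
    scores: list[Scorer],
) -> dict[str, float]:
    """Hypothetical local stand-in for an eval harness; not the Braintrust SDK."""
    per_scorer: dict[str, list[float]] = {s.__name__: [] for s in scores}
    for case in data():
        output = task(case["input"])  # run the system under test
        for s in scores:
            per_scorer[s.__name__].append(
                s(case["input"], output, case.get("expected"))
            )
    # Report the mean score per evaluator, skipping evaluators with no cases.
    return {name: mean(vals) for name, vals in per_scorer.items() if vals}


summary = run_local_eval(
    data=lambda: [
        {"input": "Foo", "expected": "Hi Foo"},
        {"input": "Bar", "expected": "Hello Bar"},
    ],
    task=lambda x: "Hi " + x,  # stands in for the model call
    scores=[exact_match],
)
print(summary)  # {'exact_match': 0.5}
```

Only the first case matches its expected output, so the mean is 0.5. Swapping the lambda task for a real model call and the scorer for an LLM-as-judge function keeps the same three-part structure.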

Related Projects