Weave

Active

GitHub · Python · Apache-2.0

Description

A toolkit by Weights & Biases for developing AI-powered applications, providing LLM call tracing, evaluation experiment management, and versioning from prototype to production.

Related Projects

SwanLab

4.0k · Python

Active

An open-source tool with a modern design for tracking and visualizing AI training. It supports PyTorch, Transformers, and more, and can monitor and evaluate AI agent training processes.

python · observability · evaluation +2

Argilla

5.0k · Python

Active

Argilla is a collaboration platform for AI engineers and domain experts to build high-quality datasets, collect human feedback, and evaluate models.

evaluation · data-processing · llm +2

OpenInference

1.0k · Python

Active

OpenTelemetry instrumentation for AI observability, providing standardized tracing, metrics collection, and span definitions for LLM inference processes to help developers monitor and debug AI agent systems.

observability · python · llm +2

Hugging Face Evaluate

2.5k · Python

Active

A library by Hugging Face for easily evaluating machine learning models and datasets, providing a wide range of metrics and evaluation methods.

evaluation · llm · python +2

Related Articles

Agent Evaluation · LLM Evaluation · Automated Testing

Agent Evaluation and Testing: From Vibe Checks to End-to-End Pipelines

Most teams evaluate agents by spot-checking a few examples. Real evaluation needs layered metrics, datasets that do not go stale, and judges that push back. This article provides runnable code patterns and a practical decision framework.
