Arthur Bench
An open-source evaluation tool for generative AI applications, helping teams build test suites, compare model outputs, and track quality changes over time.
Agenta is an open-source LLMOps platform providing a prompt playground, prompt management, LLM evaluation, and LLM observability in one place.
An open-source evaluation and testing library for LLM agents, providing automated model scanning, bias detection, performance benchmarking, and compliance checks.
End-to-end, code-first tutorials for building production-grade GenAI agents, from prototype to enterprise deployment.
An automatic prompt-optimization framework from Salesforce AI Research that uses LLMs to search for and refine prompts, improving model performance.