Giskard

Active

GitHub · Python · Apache-2.0

Description

An open-source evaluation and testing library for LLM agents, providing automated model scanning, bias detection, performance benchmarking, and compliance checks.

Related Projects

Agentic Radar

974 · Python

Stale

A security scanner for LLM agentic workflows. Automatically detects security vulnerabilities, prompt injection risks, and permission violations in agent pipelines before deployment.

security · agent · python +2

Purple Llama

4.2k · Python

Active

Meta's set of tools for assessing and improving the safety of large language models, including safety benchmarks, prompt injection detection, and output auditing.

security · evaluation · python +2

Agent Governance Toolkit

3.8k · Python

Active

Microsoft's AI Agent Governance Toolkit, providing policy enforcement, zero-trust identity, execution sandboxing, and reliability engineering for autonomous AI agents. Covers all 10 items in the OWASP Agentic Top 10.

security · evaluation · python +2

Promptfoo

21.8k · TypeScript

Active

Test and evaluate LLM prompts, agents, and RAG pipelines. Built-in red teaming and security evaluation for reliable AI applications.

testing · evaluation · red-teaming +2

Agent Evaluation · LLM Evaluation · Automated Testing

Agent Evaluation and Testing: From Vibe Checks to End-to-End Pipelines

Most teams evaluate agents by checking a few examples. Real evaluation needs layered metrics, non-rotting datasets, and judges that push back. This article provides runnable code patterns and a practical decision framework.
