Overview
Agent Bench measures AI coding agents on real enterprise development tasks. Bring your agent, point it at a benchmark, and get graded results. The filesystem is the contract: any CLI tool that reads INSTRUCTION.md and modifies the workspace can participate, with no SDK integration required.
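To make the filesystem contract concrete, here is a minimal sketch of an "agent" that satisfies it: read the task from INSTRUCTION.md in the workspace, then modify the workspace. The instruction handling and the RESULT.txt file name are illustrative stand-ins, not part of Agent Bench's spec.

```python
from pathlib import Path
import tempfile

def run_agent(workspace: Path) -> None:
    """Honor the filesystem contract: read INSTRUCTION.md, modify the workspace."""
    instruction = (workspace / "INSTRUCTION.md").read_text()
    # A real agent would plan and edit code; this toy one just records
    # that it handled the task by writing a file back into the workspace.
    (workspace / "RESULT.txt").write_text("Done: " + instruction)

# Demo: fake a workspace the way a harness might set one up.
ws = Path(tempfile.mkdtemp())
(ws / "INSTRUCTION.md").write_text("Create hello.txt")
run_agent(ws)
print((ws / "RESULT.txt").read_text())
```

Because the contract is just "read a file, write files", the same shape works for a shell script, a compiled binary, or a full coding agent.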
How It Works
Architecture
Benchmark Catalog
YAML-defined benchmarks with tasks, setup scripts, and difficulty levels
Agent Invocation
Your agent runs as a subprocess in an isolated workspace; any executable works
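The subprocess model can be sketched as follows. The function name, timeout, and workspace layout here are assumptions for illustration; only "agent as subprocess, isolated workspace as its working directory" comes from the description above.

```python
import subprocess, sys, tempfile
from pathlib import Path

def invoke_agent(agent_cmd: list[str], workspace: Path, timeout_s: int = 600) -> int:
    """Run any executable as the agent, with the isolated workspace as cwd.
    The agent finds INSTRUCTION.md there and is free to modify files."""
    proc = subprocess.run(agent_cmd, cwd=workspace, timeout=timeout_s,
                          capture_output=True, text=True)
    return proc.returncode

# Demo with a throwaway workspace and a one-liner "agent".
ws = Path(tempfile.mkdtemp())
(ws / "INSTRUCTION.md").write_text("Create hello.txt containing 'hi'.")
code = invoke_agent(
    [sys.executable, "-c", "open('hello.txt', 'w').write('hi')"], ws)
print(code, (ws / "hello.txt").read_text())
```

Running with `cwd=workspace` is what keeps the contract purely filesystem-based: the harness never needs to know what language or framework the agent is built with.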
Jury System
Cascaded tiers of judges from Agent Judge, both deterministic and LLM-based
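One way a cascade like this can work is cheap deterministic checks first, escalating to an expensive LLM judge only when they don't settle the verdict. The judge names and `None`-means-defer convention below are illustrative, not Agent Judge's actual API.

```python
from typing import Callable, Optional

# A judge returns True/False, or None to defer to the next tier.
Judge = Callable[[dict], Optional[bool]]

def required_file_judge(trial: dict) -> Optional[bool]:
    # Deterministic tier: fail fast on a missing artifact,
    # otherwise defer (None) to more expensive judges.
    return False if "hello.txt" not in trial["files"] else None

def llm_judge(trial: dict) -> Optional[bool]:
    # Placeholder for an LLM-based tier scoring the agent's work.
    return trial.get("llm_verdict", False)

def grade(trial: dict, cascade: list[Judge]) -> bool:
    for judge in cascade:
        verdict = judge(trial)
        if verdict is not None:   # first definitive verdict wins
            return verdict
    return False                  # nothing decided: fail

print(grade({"files": ["hello.txt"], "llm_verdict": True},
            [required_file_judge, llm_judge]))   # escalates to the LLM tier
print(grade({"files": []}, [required_file_judge, llm_judge]))  # fails fast
```

The cascade keeps grading cheap in the common case: most failing trials never reach the LLM tier at all.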
Result Model
A TrialResult per task, a BenchmarkResult aggregate, pass@k scoring, and resume support
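For reference, pass@k is commonly computed with the standard unbiased estimator (from n trials of which c passed); assuming Agent Bench follows this usual definition:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: probability that at least one
    of k samples drawn from n trials (c of them passing) succeeds.
    pass@k = 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill all k samples: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 3, 1), 6))  # pass@1 is just the pass rate: 0.3
```

Note that pass@1 reduces to the plain pass rate c/n, while larger k rewards agents that succeed at least occasionally across repeated trials.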
Quick Start
Benchmarks
| Benchmark | Difficulty | What it measures |
|---|---|---|
| hello-world | Easy | File creation, basic agent infrastructure |
| code-coverage | Medium | JUnit test generation, coverage uplift on Spring projects |
Documentation
Getting Started
Test your agent in 5 minutes
Agent Configuration
How to configure any CLI tool as a benchmark agent
Jury System
Cascaded tiers, built-in judges, custom judge types
CLI Reference
All commands: run, resume, compare, list, grade
Role in the Lab
- Uses Agent Judge for YAML-configured jury grading
- Validated against Code Coverage v2 experiment results
- Agent-agnostic: Claude Code, Gemini, Amazon Q, or any CLI tool
Source
GitHub
Source code: two modules, 211 tests