
Overview

Agent Bench measures AI coding agents on real enterprise development tasks. Bring your agent, point it at a benchmark, and get graded results. The filesystem is the contract: any CLI tool that reads INSTRUCTION.md and modifies the workspace can participate, with no SDK integration required.
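To make the filesystem contract concrete, here is a minimal sketch of the core of a toy agent in Python. It assumes the harness starts the agent with the workspace as its working directory; that detail, and the output file name AGENT_NOTES.md, are illustrative assumptions rather than anything Agent Bench prescribes.

```python
"""Toy agent core: read INSTRUCTION.md from the workspace, then modify the workspace."""
import sys
from pathlib import Path


def run_agent(workspace: Path) -> int:
    """Return a process exit code: 0 on success, 1 if INSTRUCTION.md is missing."""
    instruction = workspace / "INSTRUCTION.md"
    if not instruction.is_file():
        print(f"no INSTRUCTION.md in {workspace}", file=sys.stderr)
        return 1
    text = instruction.read_text(encoding="utf-8")
    # Hypothetical "work": a real agent would act on the instructions here.
    # This toy only records that it read them (AGENT_NOTES.md is an invented name).
    (workspace / "AGENT_NOTES.md").write_text(
        f"Read {len(text.splitlines())} line(s) of instructions.\n",
        encoding="utf-8",
    )
    return 0

# When saved as a script, the entry point would be:
#     sys.exit(run_agent(Path.cwd()))
```

Any executable with the same shape (read the instruction file, change files in place, exit) would fit the contract equally well, whatever language it is written in.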

How It Works

1. Configure your agent

Write a 2-line YAML file with your agent’s command and timeout.

2. Run a benchmark

bench run sets up the workspace, invokes your agent, and grades the result.

3. Review results

JSON results with accuracy, pass@k, per-tier jury scores, cost, and duration.
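For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator commonly used in code-generation evaluation: given n trials of a task of which c passed, it estimates the probability that at least one of k randomly chosen trials passes. This page does not say how Agent Bench computes the metric, so treat this as an illustration of the standard formula, not as the tool's implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n trials with c passes: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed, which yields 1.0:
    # every possible sample of k trials then contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimator reduces to the plain pass rate c / n, which is why pass@1 and accuracy often coincide when every task is run the same number of times.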

Architecture

Benchmark Catalog

YAML-defined benchmarks with tasks, setup scripts, and difficulty levels

Agent Invocation

Your agent runs as a subprocess in an isolated workspace; any executable works

Jury System

Cascaded tiers of judges from Agent Judge, both deterministic and LLM-based

Result Model

TrialResult per task, BenchmarkResult aggregate, pass@k, resume support
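One way to picture the cascaded jury is the sketch below. The cascade rule is an assumption made for illustration: tiers run in order, a judge may abstain by returning None, and the first tier with at least one vote decides the outcome. The names Judge, Verdict, run_cascade, and file_exists are hypothetical and are not taken from Agent Judge.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Sequence

# A judge inspects the workspace and returns True (pass), False (fail), or None (abstain).
Judge = Callable[[Path], Optional[bool]]


@dataclass
class Verdict:
    passed: bool
    tier: int  # index of the tier that decided, or -1 if every tier abstained


def file_exists(name: str) -> Judge:
    """Deterministic judge: fail fast if the file is missing, otherwise defer."""
    def judge(workspace: Path) -> Optional[bool]:
        return None if (workspace / name).is_file() else False
    return judge


def run_cascade(tiers: Sequence[Sequence[Judge]], workspace: Path) -> Verdict:
    """Run tiers in order; the first tier that casts any vote decides (assumed rule)."""
    for index, judges in enumerate(tiers):
        votes = [v for v in (judge(workspace) for judge in judges) if v is not None]
        if votes:
            # Within the deciding tier, every vote must be a pass.
            return Verdict(passed=all(votes), tier=index)
    return Verdict(passed=False, tier=-1)
```

Under this assumed rule, cheap deterministic checks sit in the early tiers and reject obvious failures, so that more expensive LLM-based judges only run on workspaces that survive them.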

Quick Start

# agents/my-agent.yaml
command: claude --print --dangerously-skip-permissions "Read INSTRUCTION.md and follow the instructions."
timeout: PT10M

# Build Agent Bench from source
git clone https://github.com/spring-ai-community/agent-bench.git
cd agent-bench
./mvnw clean install -DskipTests

# Run hello-world benchmark with your agent
./mvnw exec:java -pl agent-bench-core \
  -Dexec.args="run --benchmark hello-world --agent agents/my-agent.yaml"

Benchmarks

Benchmark      Difficulty  What it measures
hello-world    Easy        File creation, basic agent infrastructure
code-coverage  Medium      JUnit test generation, coverage uplift on Spring projects

Documentation

Getting Started

Test your agent in 5 minutes

Agent Configuration

How to configure any CLI tool as a benchmark agent

Jury System

Cascaded tiers, built-in judges, custom judge types

CLI Reference

All commands: run, resume, compare, list, grade

Role in the Lab

  • Uses Agent Judge for YAML-configured jury grading
  • Validated against Code Coverage v2 experiment results
  • Agent-agnostic: Claude Code, Gemini, Amazon Q, or any CLI tool

Source

GitHub

Source code: two modules, 211 tests