What You’ll Do
Run the hello-world benchmark with your own agent.
You’ll write a 2-line agent config, run the benchmark, and see a graded result.
Prerequisites
Java 17+
An AI coding agent with a CLI (Claude Code, Gemini CLI, or any executable)
Step 1: Clone and Build
git clone https://github.com/spring-ai-community/agent-bench.git
cd agent-bench
./mvnw clean install -DskipTests
Step 2: Configure Your Agent
Create a YAML file that tells the benchmark how to invoke your agent.
The only requirements: a command that runs in a directory, and a timeout.
# agents/my-agent.yaml
command: claude --print --dangerously-skip-permissions "Read INSTRUCTION.md and follow the instructions precisely."
timeout: PT5M
Your agent receives:
A workspace directory as its working directory
An INSTRUCTION.md file describing the task
Your agent’s job: read the instruction, modify the workspace, and exit.
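That contract can be sketched end to end in plain shell. Everything here beyond INSTRUCTION.md (the file name hello.txt, the stand-in "agent" body) is illustrative, not part of the benchmark:

```shell
# Simulate one benchmark round-trip: prepare a workspace containing an
# INSTRUCTION.md, run a stand-in "agent" with that directory as its working
# directory, then inspect what it left behind (which is what gets graded).
workspace=$(mktemp -d)
echo "Create hello.txt containing: Hello World!" > "$workspace/INSTRUCTION.md"

# Stand-in agent: read the instruction, modify the workspace, exit.
# A real run would invoke your CLI tool here instead.
(
  cd "$workspace"
  cat INSTRUCTION.md >/dev/null        # read the task
  printf 'Hello World!\n' > hello.txt  # act on it
)

cat "$workspace/hello.txt"
```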
claude -p (print mode) cannot write files; it only outputs text.
Use claude --print --dangerously-skip-permissions for agents that need to create or modify files.
Other agents
Any CLI tool works. Here are examples for different agents:
# Gemini CLI
command: gemini -p "Read INSTRUCTION.md and follow the instructions."
timeout: PT5M

# Custom script
command: ./my-agent.sh
timeout: PT10M

# Python agent
command: python3 agent.py --instruction INSTRUCTION.md
timeout: PT15M
The benchmark doesn’t care what your agent is, only that it reads the instruction and writes to the workspace.
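For the custom-script case, any executable honoring that contract will do. A minimal, hypothetical my-agent.sh, with a placeholder where your real tool would go, and a simulated invocation the way the benchmark would run it:

```shell
# Work in a scratch directory so we don't touch the real repo.
cd "$(mktemp -d)"

# Write a minimal agent script. The benchmark runs it inside the prepared
# workspace, so INSTRUCTION.md is in the current directory.
cat > my-agent.sh <<'EOF'
#!/bin/sh
set -e
task=$(cat INSTRUCTION.md)
# Placeholder: swap this line for a call to your real tool (claude, gemini, ...).
printf '%s\n' "$task" > agent-output.txt
EOF
chmod +x my-agent.sh

# Simulate one invocation: prepare an instruction, run the agent, inspect.
echo "Say hello" > INSTRUCTION.md
./my-agent.sh
cat agent-output.txt
```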
Step 3: Run the Benchmark
./mvnw exec:java -pl agent-bench-core \
  -Dexec.args="run --benchmark hello-world --agent agents/my-agent.yaml"
You’ll see output like:
Running: hello-world
Workspace prepared at: runs/<uuid>/tasks/hello-world/workspace
hello-world: RESOLVED
Benchmark: hello-world
Accuracy: 100.0% (1/1)
Duration: PT1M7.24S
Results: runs/<uuid>/result.json
Step 4: Review Results
The benchmark writes structured results to runs/<uuid>/:
runs/<uuid>/
result.json # Aggregate: accuracy, pass@k, agent name
run-metadata.json # Provenance: timestamps, commit hash
bench.lock # Config snapshot (for resume)
tasks/
hello-world/
result.json # Per-task: resolved, scores, failure mode
workspace/ # Agent's working directory (preserved)
result.json contains the full evaluation:
{
  "benchmarkName": "hello-world",
  "agentName": "claude --print ...",
  "accuracy": 1.0,
  "trials": [{
    "taskId": "hello-world",
    "resolved": true,
    "failureMode": "NONE",
    "scores": { "reasoning": "Content matches expected: Hello World!" }
  }]
}
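Because the results are plain JSON, they are easy to script against (for example, as a CI gate). A sketch assuming python3 is available; the sample file below just mirrors the shape shown above:

```shell
# Work in a scratch directory and write a sample result.json that mirrors
# the structure above, then extract the field a CI gate would check.
cd "$(mktemp -d)"
cat > result.json <<'EOF'
{
  "benchmarkName": "hello-world",
  "accuracy": 1.0,
  "trials": [{ "taskId": "hello-world", "resolved": true, "failureMode": "NONE" }]
}
EOF

accuracy=$(python3 -c "import json; print(json.load(open('result.json'))['accuracy'])")
echo "accuracy=$accuracy"
```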
Step 5: Try Code Coverage
The code-coverage benchmark is a real-world task: write JUnit tests for Spring Petclinic to maximize coverage.
This requires the full judge stack, so use the agent-bench-agents module:
# Build spring-ai-agents dependency (needed for LLM judge)
cd ../spring-ai-agents && ./mvnw clean install -DskipTests && cd ../agent-bench
# Run code-coverage benchmark (45+ minutes)
./mvnw exec:java -pl agent-bench-agents \
  -Dexec.args="run --benchmark code-coverage --agent agents/my-agent.yaml"
The code-coverage jury evaluates in 4 tiers:
T0 Build: Does ./mvnw test pass?
T1 Coverage Preservation: No regressions from baseline?
T2 Coverage Improvement: Above 50% instruction coverage?
T3 Test Quality: LLM judge scores practice adherence (test slices, assertions, patterns)
If a tier fails, higher tiers are not evaluated.
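That stop-on-failure policy can be sketched as follows. The tier names come from the list above; the pass/fail verdicts are made up for illustration, with T2 failing:

```shell
# Tiers run in order; the first failure stops evaluation of the rest.
ran=""
for tier in "T0:pass" "T1:pass" "T2:fail" "T3:pass"; do
  name=${tier%%:*}
  verdict=${tier##*:}
  ran="$ran$name "
  if [ "$verdict" = "fail" ]; then
    echo "$name failed; remaining tiers skipped"
    break
  fi
done
echo "evaluated: $ran"
```

With these illustrative verdicts, only T0 through T2 run; T3 is never reached.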
What’s Next
Agent Configuration: advanced agent config (timeouts, environment, multiple agents)
CLI Reference: all commands (run, resume, compare, list, grade)
Jury System: how benchmarks are graded (tiers, policies, custom judges)