What You’ll Do

Run the hello-world benchmark with your own agent. You’ll write a 2-line agent config, run the benchmark, and see a graded result.

Prerequisites

  • Java 17+
  • An AI coding agent with a CLI (Claude Code, Gemini CLI, or any executable)

Step 1: Clone and Build

git clone https://github.com/spring-ai-community/agent-bench.git
cd agent-bench
./mvnw clean install -DskipTests

Step 2: Configure Your Agent

Create a YAML file that tells the benchmark how to invoke your agent. The only requirements: a command to run in the workspace directory, and a timeout (an ISO-8601 duration; PT5M is five minutes).
# agents/my-agent.yaml
command: claude --print --dangerously-skip-permissions "Read INSTRUCTION.md and follow the instructions precisely."
timeout: PT5M
Your agent receives:
  • A workspace directory as its working directory
  • An INSTRUCTION.md file describing the task
Your agent’s job: read the instruction, modify the workspace, and exit.
claude -p (print mode) cannot write files; it only outputs text. Use claude --print --dangerously-skip-permissions for agents that need to create or modify files.

Other agents

Any CLI tool works. Here are examples for different agents:
# Gemini CLI
command: gemini -p "Read INSTRUCTION.md and follow the instructions."
timeout: PT5M
# Custom script
command: ./my-agent.sh
timeout: PT10M
# Python agent
command: python3 agent.py --instruction INSTRUCTION.md
timeout: PT15M
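The ./my-agent.sh entry above can be any executable that honors the contract from Step 2. A minimal sketch of that contract (the workspace setup and instruction text here are illustrative stand-ins, not a real benchmark task):

```shell
# Stand-in for the workspace the benchmark would prepare (illustrative only).
workspace=$(mktemp -d)
echo "Create hello.txt containing: Hello World!" > "$workspace/INSTRUCTION.md"

# The agent contract: start in the workspace, read INSTRUCTION.md,
# modify files as instructed, then exit.
cd "$workspace"
cat INSTRUCTION.md
echo "Hello World!" > hello.txt
```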
The benchmark doesn’t care what your agent is, only that it reads the instruction and writes to the workspace.

Step 3: Run the Benchmark

./mvnw exec:java -pl agent-bench-core \
  -Dexec.args="run --benchmark hello-world --agent agents/my-agent.yaml"
You’ll see output like:
Running: hello-world
Workspace prepared at: runs/<uuid>/tasks/hello-world/workspace
  hello-world: RESOLVED

Benchmark: hello-world
Accuracy: 100.0% (1/1)
Duration: PT1M7.24S
Results: runs/<uuid>/result.json
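Each run gets its own UUID directory under runs/. One way to grab the most recent run’s summary from the shell (the demo directory below is a stand-in so the snippet is self-contained):

```shell
# Stand-in run directory so the snippet runs anywhere (a real run creates this).
mkdir -p runs/demo-run
echo '{"accuracy": 1.0}' > runs/demo-run/result.json

# Pick the most recently modified run directory under runs/
latest=$(ls -td runs/*/ | head -n 1)
echo "latest run: $latest"
cat "${latest}result.json"
```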

Step 4: Review Results

The benchmark writes structured results to runs/<uuid>/:
runs/<uuid>/
  result.json           # Aggregate: accuracy, pass@k, agent name
  run-metadata.json     # Provenance: timestamps, commit hash
  bench.lock            # Config snapshot (for resume)
  tasks/
    hello-world/
      result.json       # Per-task: resolved, scores, failure mode
      workspace/        # Agent's working directory (preserved)
result.json contains the full evaluation:
{
  "benchmarkName": "hello-world",
  "agentName": "claude --print ...",
  "accuracy": 1.0,
  "trials": [{
    "taskId": "hello-world",
    "resolved": true,
    "failureMode": "NONE",
    "scores": { "reasoning": "Content matches expected: Hello World!" }
  }]
}
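For scripting, the fields shown above can be pulled out with any JSON tool; a sketch using python3’s standard library (the sample file is fabricated here to mirror the example, so the snippet is self-contained):

```shell
# Fabricate a result.json matching the example above (a real run writes this file).
RESULT=$(mktemp)
cat > "$RESULT" <<'EOF'
{"benchmarkName": "hello-world", "accuracy": 1.0,
 "trials": [{"taskId": "hello-world", "resolved": true, "failureMode": "NONE"}]}
EOF

# Extract the headline accuracy with python3 (no extra dependencies).
accuracy=$(python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))["accuracy"])' "$RESULT")
echo "accuracy=$accuracy"
```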

Step 5: Try Code Coverage

The code-coverage benchmark is a real-world task: write JUnit tests for Spring Petclinic to maximize coverage. This requires the full judge stack, so use the agent-bench-agents module:
# Build spring-ai-agents dependency (needed for LLM judge)
cd ../spring-ai-agents && ./mvnw clean install -DskipTests && cd ../agent-bench

# Run code-coverage benchmark (45+ minutes)
./mvnw exec:java -pl agent-bench-agents \
  -Dexec.args="run --benchmark code-coverage --agent agents/my-agent.yaml"
The code-coverage jury evaluates in 4 tiers:
  1. T0 Build: Does ./mvnw test pass?
  2. T1 Coverage Preservation: No regressions from baseline?
  3. T2 Coverage Improvement: Above 50% instruction coverage?
  4. T3 Test Quality: LLM judge scores practice adherence (test slices, assertions, patterns)
If a tier fails, subsequent tiers are not evaluated.
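The short-circuit policy amounts to a loop that stops at the first failing tier. A sketch (the pass/fail stand-ins are illustrative; the real checks are the build, coverage comparisons, and the LLM judge):

```shell
# Tier checks are stand-ins: T2 is forced to fail to show the short-circuit.
tiers="T0-build:true T1-preservation:true T2-improvement:false T3-quality:true"
verdict=PASS
for t in $tiers; do
  name=${t%%:*}    # tier label
  check=${t##*:}   # stand-in check command (true/false)
  if $check; then
    echo "$name: pass"
  else
    echo "$name: fail"
    verdict=FAIL
    break          # remaining tiers are skipped
  fi
done
echo "verdict=$verdict"
```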

What’s Next

Agent Configuration

Advanced agent config: timeouts, environment, multiple agents

CLI Reference

All commands: run, resume, compare, list, grade

Jury System

How benchmarks are graded: tiers, policies, custom judges