Agent Bench


This project has moved from the spring-ai-community GitHub organization to markpollack. New releases are published under the Maven groupId io.github.markpollack, and Java packages now use the io.github.markpollack namespace. If you previously used org.springaicommunity, update your dependency coordinates and imports to io.github.markpollack.

Overview

Agent Bench measures AI coding agents on real enterprise development tasks. Bring your agent, point it at a benchmark, and get graded results. The filesystem is the contract: any CLI tool that reads INSTRUCTION.md and modifies the workspace can participate, with no SDK integration required.
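To make the filesystem contract concrete, here is a minimal sketch of a toy agent in Python. It assumes the harness places INSTRUCTION.md at the root of the workspace directory passed to the function; the function name run_agent and the output file agent-notes.txt are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def run_agent(workspace: Path) -> Path:
    """Toy agent: read INSTRUCTION.md, then modify the workspace.

    A real agent would act on the instructions; this one only records
    how many lines it received.
    """
    instruction = (workspace / "INSTRUCTION.md").read_text(encoding="utf-8")
    # Hypothetical output file; a real task defines what must be produced.
    out = workspace / "agent-notes.txt"
    line_count = len(instruction.splitlines())
    out.write_text(
        f"Received {line_count} line(s) of instructions.\n", encoding="utf-8"
    )
    return out
```

Wrapped in a small script that calls run_agent on the current directory, something like this could serve as an agent executable, since nothing beyond reading and writing files is required.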

How It Works

Configure your agent

Write a two-line YAML file containing your agent’s command and timeout.

Run a benchmark

bench run sets up the workspace, invokes your agent, and grades the result.

Review results

JSON results with accuracy, pass@k, per-tier jury scores, cost, and duration.
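The results include pass@k. This page does not say how Agent Bench computes it, so the following is only an illustration of one widely used definition: with n trials of a task, of which c passed, pass@k = 1 - C(n-c, k) / C(n, k), the probability that at least one of k trials drawn without replacement passes. The function name pass_at_k is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials drawn from n passes, given c passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k trials failed, so the result is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 5 trials and 2 passes, this formula gives pass@1 = 0.4 and pass@3 = 0.9.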

Architecture

Benchmark Catalog

YAML-defined benchmarks with tasks, setup scripts, and difficulty levels

Agent Invocation

Your agent runs as a subprocess in an isolated workspace; any executable works

Jury System

Cascaded tiers of judges from Agent Judge, both deterministic and LLM-based

Result Model

TrialResult per task, BenchmarkResult aggregate, pass@k, resume support
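The idea of cascaded judge tiers can be sketched as follows. This is not Agent Judge's API: the Verdict type, the judge functions, and the policy that a failing tier stops the cascade are all assumptions made for illustration. The intent is that cheap deterministic checks run first and a costlier judge (for example, an LLM-based one) runs only if they pass.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Sequence


@dataclass
class Verdict:
    judge: str
    passed: bool


# A judge inspects the workspace and returns a verdict (hypothetical shape).
Judge = Callable[[Path], Verdict]


def file_exists_judge(name: str) -> Judge:
    """Deterministic judge: passes if the named file exists in the workspace."""
    def judge(workspace: Path) -> Verdict:
        return Verdict(f"file-exists:{name}", (workspace / name).is_file())
    return judge


def run_cascade(tiers: Sequence[Sequence[Judge]], workspace: Path) -> List[Verdict]:
    """Run tiers in order; stop after the first tier in which any judge fails."""
    verdicts: List[Verdict] = []
    for tier in tiers:
        tier_verdicts = [judge(workspace) for judge in tier]
        verdicts.extend(tier_verdicts)
        if not all(v.passed for v in tier_verdicts):
            break  # assumed policy: later (costlier) tiers are skipped
    return verdicts
```

Putting the deterministic tier first means an expensive judge is never consulted for a workspace that already fails a basic check.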

Quick Start

# agents/my-agent.yaml
command: claude --print --dangerously-skip-permissions "Read INSTRUCTION.md and follow the instructions."
timeout: PT10M

git clone https://github.com/markpollack/agent-bench.git
cd agent-bench
./mvnw clean install -DskipTests

# Run hello-world benchmark with your agent
./mvnw exec:java -pl agent-bench-core \
  -Dexec.args="run --benchmark hello-world --agent agents/my-agent.yaml"
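The command and timeout fields in the agent file map naturally onto a subprocess call. The sketch below is not Agent Bench's implementation; it only illustrates how a harness could run an agent, assuming that timeout is an ISO-8601 duration such as PT10M and that the command runs with the workspace as its working directory. parse_duration and invoke_agent are hypothetical helpers, and the parser handles only the hours/minutes/seconds form.

```python
import re
import shlex
import subprocess
from pathlib import Path

# Matches simple ISO-8601 durations such as PT10M, PT1H30M, or PT45S.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(text: str) -> int:
    """Convert a simple ISO-8601 duration (PT#H#M#S) to seconds."""
    match = _DURATION.match(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def invoke_agent(command: str, timeout: str, workspace: Path) -> int:
    """Run the agent command inside the workspace and return its exit code.

    Raises subprocess.TimeoutExpired if the agent exceeds the timeout.
    """
    completed = subprocess.run(
        shlex.split(command),  # split the command line like a POSIX shell
        cwd=workspace,         # assumption: the agent starts in the workspace
        timeout=parse_duration(timeout),
    )
    return completed.returncode
```

Under these assumptions, the PT10M in the agent file above corresponds to a 600-second limit on the agent process.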

Benchmarks

Benchmark	Difficulty	What it measures
hello-world	Easy	File creation, basic agent infrastructure
code-coverage	Medium	JUnit test generation, coverage uplift on Spring projects

Documentation

Getting Started

Test your agent in 5 minutes

Agent Configuration

How to configure any CLI tool as a benchmark agent

Jury System

Cascaded tiers, built-in judges, custom judge types

CLI Reference

All commands: run, resume, compare, list, grade

Role in the Lab

Uses Agent Judge for YAML-configured jury grading
Validated against Code Coverage v2 experiment results
Agent-agnostic: Claude Code, Gemini, Amazon Q, or any CLI tool

Source

GitHub

Source code: two modules, 211 tests
