This project has moved from the spring-ai-community GitHub organization to markpollack. New releases are published under the Maven groupId io.github.markpollack, and Java packages now use the io.github.markpollack namespace. If you previously used org.springaicommunity, update your dependency coordinates and imports to the current values shown below.

Overview

Agent Bench measures AI coding agents on real enterprise development tasks. Bring your agent, point it at a benchmark, and get graded results. The filesystem is the contract: any CLI tool that reads INSTRUCTION.md and modifies the workspace can participate, with no SDK integration required.
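To make the filesystem contract concrete, here is a minimal sketch of a stand-in agent. It assumes the harness starts the agent with the workspace as its working directory and that INSTRUCTION.md sits at the workspace root; the output file name AGENT_NOTES.md is hypothetical and chosen only for illustration.

```python
from pathlib import Path


def run_agent(workspace: Path) -> Path:
    """Hypothetical stand-in agent: read the instruction, then modify the workspace."""
    # The only input is the instruction file (assumed to be at the workspace root).
    instruction = (workspace / "INSTRUCTION.md").read_text(encoding="utf-8")

    # A real agent would plan and edit code here. This stand-in only records
    # what it was asked to do, which is enough to show the contract.
    output = workspace / "AGENT_NOTES.md"  # hypothetical output file name
    output.write_text(f"Instruction received:\n\n{instruction}", encoding="utf-8")
    return output
```

Calling `run_agent(Path.cwd())` from a script entry point would turn this into a CLI tool that a benchmark could invoke. Nothing in it depends on an Agent Bench library.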

How It Works

1. Configure your agent

   Write a 2-line YAML file with your agent’s command and timeout.

2. Run a benchmark

   bench run sets up the workspace, invokes your agent, and grades the result.

3. Review results

   JSON results with accuracy, pass@k, per-tier jury scores, cost, and duration.
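For readers unfamiliar with pass@k, the sketch below computes the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), where n is the number of trials for a task and c the number that passed. This page does not say which estimator Agent Bench uses, so treat it as an illustration of the metric, not the tool's implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled trials passes.

    n: total trials run for a task, c: trials that passed, k: sample size.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-sized samples made up only of failing trials.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 the estimate reduces to the plain pass rate c / n.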

Architecture

Benchmark Catalog

YAML-defined benchmarks with tasks, setup scripts, and difficulty levels

Agent Invocation

Your agent runs as a subprocess in an isolated workspace; any executable works

Jury System

Cascaded tiers of judges from Agent Judge, both deterministic and LLM-based

Result Model

TrialResult per task, BenchmarkResult aggregate, pass@k, resume support
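The cascaded jury idea can be sketched as follows. This is not Agent Judge's API: the `Tier` type, the `threshold` field, the stop-on-first-failing-tier policy, and both example judges are assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical judge signature: inspect the workspace, return a score in [0, 1].
Judge = Callable[[Path], float]


@dataclass
class Tier:
    name: str
    judges: List[Judge]
    threshold: float = 1.0  # assumed: minimum average score to reach the next tier


def file_exists(relative_path: str) -> Judge:
    """Deterministic judge: 1.0 if the file exists in the workspace, else 0.0."""
    def judge(workspace: Path) -> float:
        return 1.0 if (workspace / relative_path).is_file() else 0.0
    return judge


def stub_llm_judge(workspace: Path) -> float:
    """Placeholder for an LLM-based judge; returns a fixed score, calls no model."""
    return 0.8


def run_cascade(workspace: Path, tiers: List[Tier]) -> Dict[str, float]:
    """Run tiers in order and stop after the first tier scoring below its threshold."""
    scores: Dict[str, float] = {}
    for tier in tiers:
        results = [judge(workspace) for judge in tier.judges]
        score = sum(results) / len(results) if results else 0.0
        scores[tier.name] = score
        if score < tier.threshold:
            break  # cheaper tiers gate the more expensive ones
    return scores
```

Putting deterministic checks in the first tier means a workspace that fails them never reaches the slower, costlier LLM-based tier.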

Quick Start

# agents/my-agent.yaml
command: claude --print --dangerously-skip-permissions "Read INSTRUCTION.md and follow the instructions."
timeout: PT10M

# Build Agent Bench from source
git clone https://github.com/markpollack/agent-bench.git
cd agent-bench
./mvnw clean install -DskipTests

# Run hello-world benchmark with your agent
./mvnw exec:java -pl agent-bench-core \
  -Dexec.args="run --benchmark hello-world --agent agents/my-agent.yaml"
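The agent file pairs a command with a timeout. PT10M is an ISO-8601 duration meaning ten minutes. The sketch below shows how a harness might use those two fields: parse the duration, create a fresh workspace, and run the command as a subprocess. The function names, the temporary-directory workspace, and the restricted duration grammar (hours, minutes, and seconds only) are assumptions; Agent Bench's actual invocation code is not shown on this page.

```python
import re
import shlex
import subprocess
import tempfile
from pathlib import Path
from typing import Tuple

# Accepts a subset of ISO-8601 durations: PT[nH][nM][nS], e.g. PT10M or PT1H30M.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_timeout(value: str) -> int:
    """Convert a duration such as 'PT10M' into seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def invoke_agent(command: str, timeout: str, instruction: str) -> Tuple[int, Path]:
    """Run the agent command in a fresh workspace that contains INSTRUCTION.md."""
    workspace = Path(tempfile.mkdtemp(prefix="bench-"))  # hypothetical isolation
    (workspace / "INSTRUCTION.md").write_text(instruction, encoding="utf-8")
    # subprocess.run raises subprocess.TimeoutExpired if the agent overruns.
    completed = subprocess.run(
        shlex.split(command),
        cwd=workspace,
        timeout=parse_timeout(timeout),
    )
    return completed.returncode, workspace
```

After the call returns, the workspace directory holds whatever the agent changed, which is what a grading step would inspect.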

Benchmarks

| Benchmark | Difficulty | What it measures |
| --- | --- | --- |
| hello-world | Easy | File creation, basic agent infrastructure |
| code-coverage | Medium | JUnit test generation, coverage uplift on Spring projects |

Documentation

Getting Started

Test your agent in 5 minutes

Agent Configuration

How to configure any CLI tool as a benchmark agent

Jury System

Cascaded tiers, built-in judges, custom judge types

CLI Reference

All commands: run, resume, compare, list, grade

Role in the Lab

  • Uses Agent Judge for YAML-configured jury grading
  • Validated against Code Coverage v2 experiment results
  • Agent-agnostic: Claude Code, Gemini, Amazon Q, or any CLI tool

Source

GitHub

Source code: two modules, 211 tests