This project has moved from the
spring-ai-community GitHub organization to
markpollack. New releases are published under the Maven groupId
io.github.markpollack, and Java packages now use the io.github.markpollack
namespace. If you previously used org.springaicommunity, update your
dependency coordinates and imports to the current values shown below.
Overview
Agent Bench measures AI coding agents on real enterprise development tasks. Bring your agent, point it at a benchmark, and get graded results. The filesystem is the contract: any CLI tool that reads INSTRUCTION.md and modifies the workspace can participate, with no SDK integration required.
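To make the filesystem contract concrete, here is a minimal sketch of a toy agent in Python. It only illustrates the idea of reading INSTRUCTION.md and changing the workspace. The function name `run_agent`, the output file `agent-notes.txt`, and the way the workspace is prepared in the demo are all assumptions for illustration, not part of Agent Bench.

```python
import tempfile
from pathlib import Path


def run_agent(workspace: Path) -> Path:
    """Read the task from INSTRUCTION.md and modify the workspace in response."""
    instruction = (workspace / "INSTRUCTION.md").read_text(encoding="utf-8")
    # A real agent would plan and edit code here; this stub only records the task.
    output = workspace / "agent-notes.txt"  # hypothetical output file
    output.write_text(f"Task received:\n{instruction}", encoding="utf-8")
    return output


# Demo: simulate a workspace roughly the way a harness might prepare one (assumption).
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "INSTRUCTION.md").write_text("Create a greeting file.\n", encoding="utf-8")
    created = run_agent(ws)
    print(created.name)
    print(created.read_text(encoding="utf-8"))
```

Because the only shared surface is the directory, the same shape works for a shell script, a compiled binary, or a wrapper around a hosted model.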
How It Works
Architecture
Benchmark Catalog
YAML-defined benchmarks with tasks, setup scripts, and difficulty levels
Agent Invocation
Your agent runs as a subprocess in an isolated workspace --- any executable works
Jury System
Cascaded tiers of judges from Agent Judge --- deterministic and LLM-based
Result Model
TrialResult per task, BenchmarkResult aggregate, pass@k, resume support
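The result model mentions pass@k without defining it. A common definition is the unbiased estimator used in code-generation evaluation: with n trials of a task, c of which passed, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes that definition. Whether Agent Bench computes it exactly this way is not stated in the source, and the trial outcomes are made-up illustration data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled trials passes,
    given c passing trials out of n total."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k trials failed, every size-k sample contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-task trial outcomes (True = the trial passed grading).
trials = {
    "hello-world": [True, True, False, True],
    "code-coverage": [False, True, False, False],
}
for task, outcomes in trials.items():
    n, c = len(outcomes), sum(outcomes)
    print(task, pass_at_k(n, c, 1), pass_at_k(n, c, 2))
```

With these sample outcomes, hello-world gets pass@1 = 0.75 and pass@2 = 1.0, while code-coverage gets pass@1 = 0.25 and pass@2 = 0.5.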
Quick Start
Benchmarks
| Benchmark | Difficulty | What it measures |
|---|---|---|
| hello-world | Easy | File creation, basic agent infrastructure |
| code-coverage | Medium | JUnit test generation, coverage uplift on Spring projects |
Documentation
Getting Started
Test your agent in 5 minutes
Agent Configuration
How to configure any CLI tool as a benchmark agent
Jury System
Cascaded tiers, built-in judges, custom judge types
CLI Reference
All commands: run, resume, compare, list, grade
Role in the Lab
- Uses Agent Judge for YAML-configured jury grading
- Validated against Code Coverage v2 experiment results
- Agent-agnostic: Claude Code, Gemini, Amazon Q, or any CLI tool
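The cascaded jury idea (cheap deterministic checks first, LLM-based judges only when needed) can be sketched as follows. This is a conceptual illustration only: the `Judge` type, `file_exists_judge`, `stub_llm_judge`, and the stop-at-first-verdict policy are hypothetical and do not reflect Agent Judge's actual API or YAML schema.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional

# A judge returns True (pass), False (fail), or None (abstain, defer to the next tier).
Judge = Callable[[Path], Optional[bool]]


def file_exists_judge(name: str) -> Judge:
    """Deterministic judge: fails fast when an expected file is missing."""
    def judge(workspace: Path) -> Optional[bool]:
        return None if (workspace / name).exists() else False
    return judge


def stub_llm_judge(workspace: Path) -> Optional[bool]:
    """Stand-in for an LLM-based judge; a real one would call a model."""
    return True


def run_cascade(tiers: list[list[Judge]], workspace: Path) -> bool:
    """Run tiers in order and stop at the first definite verdict."""
    for tier in tiers:
        verdicts = [judge(workspace) for judge in tier]
        if False in verdicts:
            return False  # any failure in a tier ends the cascade
        if verdicts and all(v is True for v in verdicts):
            return True  # the tier passed unanimously
    return False  # no tier reached a verdict


tiers = [[file_exists_judge("hello.txt")], [stub_llm_judge]]
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    print(run_cascade(tiers, ws))  # expected file missing: the first tier fails
    (ws / "hello.txt").write_text("Hello\n", encoding="utf-8")
    print(run_cascade(tiers, ws))  # first tier abstains, the stub judge passes
```

Ordering tiers this way means the expensive judge runs only for workspaces that survive the cheap checks.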
Source
GitHub
Source code --- two modules, 211 tests