Running Commands
Agent Bench runs via the Maven exec plugin.
Use agent-bench-agents when running benchmarks with LLM judges (like code-coverage).
Use agent-bench-core for simple benchmarks (like hello-world) or when Spring AI Agents dependencies are unavailable.
Commands
list
List available benchmarks.
tasks
List tasks in a benchmark.
| Flag | Required | Description |
|---|---|---|
| --benchmark | Yes | Benchmark name |
run
Run a benchmark end-to-end: set up workspace, invoke agent, grade result.
| Flag | Required | Description |
|---|---|---|
| --benchmark | Yes | Benchmark name |
| --agent | No | Path to agent config YAML. If omitted, workspace is prepared for manual grading. |
| --task | No | Run only this specific task ID |
| --difficulty | No | Filter tasks by difficulty: easy, medium, or hard |
Output is written to runs/<uuid>/ with result.json, run-metadata.json, and per-task directories.
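
The flag rules for run can be sketched as a small helper that assembles the argument list. This is only an illustration: build_run_args is a hypothetical name, not part of Agent Bench, and how the resulting arguments are handed to the Maven exec plugin is not covered here.

```python
from typing import List, Optional

# Difficulty values accepted by --difficulty, as listed in the table above.
DIFFICULTIES = ("easy", "medium", "hard")


def build_run_args(
    benchmark: str,
    agent: Optional[str] = None,
    task: Optional[str] = None,
    difficulty: Optional[str] = None,
) -> List[str]:
    """Assemble the arguments for the run command (hypothetical helper)."""
    # --benchmark is the only required flag.
    if not benchmark:
        raise ValueError("--benchmark is required")
    if difficulty is not None and difficulty not in DIFFICULTIES:
        raise ValueError(f"--difficulty must be one of {DIFFICULTIES}")

    args = ["run", "--benchmark", benchmark]
    # Optional flags are appended only when a value was supplied.
    if agent is not None:
        args += ["--agent", agent]
    if task is not None:
        args += ["--task", task]
    if difficulty is not None:
        args += ["--difficulty", difficulty]
    return args
```

For example, build_run_args("hello-world", difficulty="easy") returns ["run", "--benchmark", "hello-world", "--difficulty", "easy"]. Leaving agent unset mirrors the case where the workspace is prepared for manual grading.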
resume
Resume an interrupted run. Skips tasks that already have a result.json.
| Flag | Required | Description |
|---|---|---|
| --run-id | Yes | UUID of the run to resume |
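
The skip rule can be illustrated with a short sketch. It assumes each task has its own directory under the run directory, named after the task ID, with result.json written directly inside it. That layout and the pending_tasks helper are assumptions for illustration, not the tool's actual implementation.

```python
from pathlib import Path
from typing import Iterable, List


def pending_tasks(run_dir: Path, task_ids: Iterable[str]) -> List[str]:
    """Return the task IDs that still need to run (hypothetical helper).

    A task counts as finished when <run_dir>/<task_id>/result.json exists;
    this directory layout is an assumption.
    """
    return [
        task_id
        for task_id in task_ids
        if not (run_dir / task_id / "result.json").is_file()
    ]
```

Tasks whose directory exists but has no result.json, and tasks that never started, are both reported as pending, so an interrupted task would be run again from the start under this sketch.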
compare
Compare results across multiple runs.
| Flag | Required | Description |
|---|---|---|
| --runs | Yes | Two or more run directory paths |
provide
Set up a workspace for manual agent invocation (without running an agent).
| Flag | Required | Description |
|---|---|---|
| --benchmark | Yes | Benchmark name |
| --task | No | Task ID (if benchmark has multiple tasks) |
| --workspace | Yes | Directory to set up |
grade
Grade an existing workspace without running an agent.
| Flag | Required | Description |
|---|---|---|
| --benchmark | Yes | Benchmark name |
| --task | No | Task ID |
| --workspace | Yes | Directory to evaluate |
Output Structure
Failure Modes
When a trial fails, the failureMode field classifies why:
| Mode | Meaning |
|---|---|
| NONE | Task resolved successfully |
| AGENT_TIMEOUT | Agent process timed out |
| AGENT_ERROR | Agent process exited with non-zero code |
| CONTEXT_LENGTH_EXCEEDED | Agent hit LLM context length limit |
| SETUP_ERROR | Setup script failed before agent ran |
| GRADE_ERROR | Judge/grading itself failed |
| BUILD_FAILURE | Maven build failed during grading |
| TEST_FAILURE | Tests failed during grading |
| UNKNOWN | Unclassified failure |
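
When post-processing results, the modes in the table can be mirrored as an enum and tallied. The sketch below assumes each trial has been loaded as a dictionary carrying a failureMode string; the dictionary shape, the FailureMode class, and tally_failure_modes are illustrative assumptions rather than Agent Bench's API. Counting missing or unrecognized values as UNKNOWN is a choice made for this sketch.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Mapping


class FailureMode(Enum):
    """Failure modes as listed in the table above."""
    NONE = "NONE"
    AGENT_TIMEOUT = "AGENT_TIMEOUT"
    AGENT_ERROR = "AGENT_ERROR"
    CONTEXT_LENGTH_EXCEEDED = "CONTEXT_LENGTH_EXCEEDED"
    SETUP_ERROR = "SETUP_ERROR"
    GRADE_ERROR = "GRADE_ERROR"
    BUILD_FAILURE = "BUILD_FAILURE"
    TEST_FAILURE = "TEST_FAILURE"
    UNKNOWN = "UNKNOWN"


def tally_failure_modes(trials: Iterable[Mapping[str, object]]) -> Counter:
    """Count trials per failure mode (hypothetical helper)."""
    counts: Counter = Counter()
    for trial in trials:
        try:
            mode = FailureMode(trial.get("failureMode"))
        except ValueError:
            # Missing or unrecognized values fall back to UNKNOWN (sketch choice).
            mode = FailureMode.UNKNOWN
        counts[mode] += 1
    return counts
```

A tally like this makes it easier to separate agent-side failures (AGENT_TIMEOUT, AGENT_ERROR, CONTEXT_LENGTH_EXCEEDED) from harness-side ones (SETUP_ERROR, GRADE_ERROR) when reviewing a run.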