Running Commands

Agent Bench runs via the Maven exec plugin:
# Core module (T3 LLM judge abstains)
./mvnw exec:java -pl agent-bench-core -Dexec.args="<command> [flags]"

# Agents module (full judge stack including LLM-based T3)
./mvnw exec:java -pl agent-bench-agents -Dexec.args="<command> [flags]"
Use agent-bench-agents when running benchmarks that need LLM judges (such as code-coverage). Use agent-bench-core for simple benchmarks (such as hello-world), or when the Spring AI Agents dependencies are unavailable.

Commands

list

List available benchmarks.
./mvnw exec:java -pl agent-bench-core -Dexec.args="list"

tasks

List tasks in a benchmark.
./mvnw exec:java -pl agent-bench-core \
  -Dexec.args="tasks --benchmark hello-world"
Flag          Required   Description
-----------   --------   --------------
--benchmark   Yes        Benchmark name

run

Run a benchmark end-to-end: set up workspace, invoke agent, grade result.
./mvnw exec:java -pl agent-bench-agents \
  -Dexec.args="run --benchmark code-coverage --agent agents/claude-code.yaml"
Flag           Required   Description
------------   --------   -----------------------------------------------
--benchmark    Yes        Benchmark name
--agent        No         Path to agent config YAML. If omitted, the
                          workspace is prepared for manual grading.
--task         No         Run only this specific task ID
--difficulty   No         Filter tasks by difficulty: easy, medium, or hard
Output: Creates runs/<uuid>/ with result.json, run-metadata.json, and per-task directories.

resume

Resume an interrupted run. Skips tasks that already have a result.json.
./mvnw exec:java -pl agent-bench-agents \
  -Dexec.args="resume --run-id 386aba4a-4285-45d2-bbf4-a159c00c3f3b"
Flag       Required   Description
--------   --------   --------------------------
--run-id   Yes        UUID of the run to resume

compare

Compare results across multiple runs.
./mvnw exec:java -pl agent-bench-agents \
  -Dexec.args="compare --runs runs/<uuid1> runs/<uuid2>"
Flag     Required   Description
------   --------   ---------------------------------
--runs   Yes        Two or more run directory paths
Output: Table comparing agent, accuracy, pass@k, cost, duration, and trial count.
Agent                Accuracy   Pass@1   Cost       Duration   Trials
--------------------------------------------------------------------------
claude-code          100.0%     100.0%   $0.0500    1m7s       1
gemini               0.0%       0.0%     -          2m3s       1

provide

Set up a workspace for manual agent invocation (without running an agent).
./mvnw exec:java -pl agent-bench-core \
  -Dexec.args="provide --benchmark hello-world --task hello-world --workspace /tmp/ws"
Flag          Required   Description
-----------   --------   ------------------------------------------
--benchmark   Yes        Benchmark name
--task        No         Task ID (if the benchmark has multiple tasks)
--workspace   Yes        Directory to set up

grade

Grade an existing workspace without running an agent.
./mvnw exec:java -pl agent-bench-core \
  -Dexec.args="grade --benchmark hello-world --task hello-world --workspace /tmp/ws"
Flag          Required   Description
-----------   --------   ----------------------
--benchmark   Yes        Benchmark name
--task        No         Task ID
--workspace   Yes        Directory to evaluate

Output Structure

runs/<uuid>/
  result.json           # BenchmarkResult: accuracy, pass@k, trials
  run-metadata.json     # RunMetadata: agent, timestamps, commit hash
  bench.lock            # Config snapshot (used by resume)
  tasks/
    <task-id>/
      result.json       # TrialResult: resolved, scores, failure mode
      workspace/        # Agent's working directory (preserved)
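For a quick sanity check on a finished run, the top-level result.json can be inspected with jq. The "accuracy" key name here is an assumption based on the description above; adjust it if your schema differs:

```shell
# Print the run's overall accuracy and list its task directories.
# Substitute a real run ID for <uuid>; the .accuracy key is assumed
# from the result.json description above.
jq '.accuracy' runs/<uuid>/result.json
ls runs/<uuid>/tasks
```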

Failure Modes

When a trial fails, the failureMode field classifies why:
Mode                      Meaning
-----------------------   ------------------------------------------
NONE                      Task resolved successfully
AGENT_TIMEOUT             Agent process timed out
AGENT_ERROR               Agent process exited with a non-zero code
CONTEXT_LENGTH_EXCEEDED   Agent hit the LLM context length limit
SETUP_ERROR               Setup script failed before the agent ran
GRADE_ERROR               Judge/grading itself failed
BUILD_FAILURE             Maven build failed during grading
TEST_FAILURE              Tests failed during grading
UNKNOWN                   Unclassified failure
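Since each per-task result.json records its failureMode, a one-liner can summarize how a run failed across all tasks:

```shell
# Count failure modes across all tasks in a run, most frequent first.
# Substitute a real run ID for <uuid>.
jq -r '.failureMode' runs/<uuid>/tasks/*/result.json | sort | uniq -c | sort -rn
```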