## Overview
Every benchmark defines a jury: a cascade of judge tiers that evaluate the agent's workspace. Judges come from the Agent Judge project; benchmarks wire them together in `benchmark.yaml`.
## Cascaded Tiers
Tiers run in order. Each tier has a policy that determines whether evaluation continues:

| Policy | Behavior |
|---|---|
| `REJECT_ON_ANY_FAIL` | If any check fails, stop. Lower tiers are not evaluated. |
| `ACCEPT_ON_ALL_PASS` | If all checks pass, continue to the next tier. |
| `FINAL_TIER` | Last tier. Its result is the overall verdict. |
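As a sketch, a cascade in `benchmark.yaml` might look like the following. The field names (`jury`, `tiers`, `policy`, `judges`, `type`) are illustrative assumptions, not the confirmed schema:

```yaml
jury:
  tiers:
    - policy: REJECT_ON_ANY_FAIL   # gate: if the build breaks, stop here
      judges:
        - type: maven-build
    - policy: FINAL_TIER           # this tier's result is the verdict
      judges:
        - type: coverage-improvement
```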
## Built-in Judge Types
These are registered in `JudgeFactory` and available in any benchmark:
| Type | Module | What it checks |
|---|---|---|
| `file-exists` | agent-judge-core | A specific file exists in the workspace |
| `file-content` | agent-judge-core | File content matches expected (exact or contains) |
| `maven-build` | agent-judge-exec | `./mvnw <goals>` exits successfully |
| `coverage-preservation` | agent-judge-exec | JaCoCo coverage >= baseline |
| `coverage-improvement` | agent-judge-exec | JaCoCo coverage >= threshold |
| `test-quality-llm` | agent-bench-agents | LLM evaluates test practice adherence |
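A tier's judge list might configure these types as sketched below. The per-judge keys (`path`, `mode`, `expected`, `goals`) are hypothetical; consult the actual judge configuration schema:

```yaml
judges:
  - type: file-exists
    path: src/main/java/App.java     # file the agent must create
  - type: file-content
    path: pom.xml
    mode: contains                   # or: exact
    expected: jacoco-maven-plugin
  - type: maven-build
    goals: [clean, verify]           # runs ./mvnw clean verify
```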
Of these, `file-exists`, `file-content`, `maven-build`, and the coverage judges come from the core and exec modules; `test-quality-llm` is provided by the agent-bench-agents module (which has the Claude SDK dependency).
## Writing a Custom Judge
Judges implement the `Judge` interface from agent-judge-core:
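The exact interface shape lives in agent-judge-core and is not reproduced here. As a minimal sketch, assuming a single `evaluate` method that takes the workspace path and returns a pass/fail verdict (the `Judge`, `Verdict`, and method names below are assumptions), a custom judge could look like:

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Assumed shape of the agent-judge-core interface; check the real module.
interface Judge {
    Verdict evaluate(Path workspace);
}

// Assumed result type: pass/fail plus a human-readable message.
record Verdict(boolean passed, String message) {}

// Example custom judge: passes if the workspace contains a README.md.
class ReadmeExistsJudge implements Judge {
    @Override
    public Verdict evaluate(Path workspace) {
        Path readme = workspace.resolve("README.md");
        if (Files.exists(readme)) {
            return new Verdict(true, "README.md found");
        }
        return new Verdict(false, "README.md missing");
    }
}

public class Main {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempDirectory("ws");
        Judge judge = new ReadmeExistsJudge();
        // Empty workspace: the check fails.
        System.out.println(judge.evaluate(tmp).passed()); // false
        Files.writeString(tmp.resolve("README.md"), "# hello");
        System.out.println(judge.evaluate(tmp).passed()); // true
    }
}
```

Keeping the judge stateless (all inputs come from the workspace path) matches how the built-in judges are described: each check is a pure function of the agent's output.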
Then register the new judge type in `JudgeFactory`:
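The real registration API is not shown in this page. As an illustrative stand-in (a plain name-to-constructor map, which is what such a factory typically boils down to; all types and method names here are assumptions), registration might look like:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Stand-ins for the real types; the actual agent-judge-core API may differ.
interface Judge {}
class ReadmeExistsJudge implements Judge {}

// Factory sketch: maps a `type` string from benchmark.yaml to a constructor.
class JudgeFactory {
    private final Map<String, Supplier<Judge>> registry = new HashMap<>();

    public void register(String type, Supplier<Judge> ctor) {
        registry.put(type, ctor);
    }

    public Judge create(String type) {
        Supplier<Judge> ctor = registry.get(type);
        if (ctor == null) {
            throw new IllegalArgumentException("Unknown judge type: " + type);
        }
        return ctor.get();
    }
}

public class Main {
    public static void main(String[] args) {
        JudgeFactory factory = new JudgeFactory();
        // The type string is what a benchmark.yaml entry would reference.
        factory.register("readme-exists", ReadmeExistsJudge::new);
        Judge j = factory.create("readme-exists");
        System.out.println(j.getClass().getSimpleName()); // ReadmeExistsJudge
    }
}
```

Once registered, the new type string can be used in any benchmark's tier configuration alongside the built-in types.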