Overview

Every benchmark defines a jury: a cascade of judge tiers that evaluate the agent's workspace. Judges come from the Agent Judge project; benchmarks wire them together in benchmark.yaml.

Cascaded Tiers

Tiers run in order. Each tier has a policy that determines whether evaluation continues:
| Policy | Behavior |
| --- | --- |
| `REJECT_ON_ANY_FAIL` | If any check fails, stop. Lower tiers are not evaluated. |
| `ACCEPT_ON_ALL_PASS` | If all checks pass, continue. |
| `FINAL_TIER` | Last tier. Its result is the overall verdict. |
This is how the code-coverage benchmark grades:
jury:
  tiers:
    - name: build
      policy: REJECT_ON_ANY_FAIL
      checks:
        - type: maven-build
          goals: [clean, test]
    - name: coverage-preservation
      policy: REJECT_ON_ANY_FAIL
      checks:
        - type: coverage-preservation
    - name: coverage-improvement
      policy: ACCEPT_ON_ALL_PASS
      checks:
        - type: coverage-improvement
          min: 50.0
    - name: test-quality
      policy: FINAL_TIER
      checks:
        - type: test-quality-llm
          prompt: prompts/judge-practice-adherence.txt
          model: claude-sonnet-4-6
If the build tier fails, coverage is never measured; if coverage regresses in the coverage-preservation tier, improvement is never checked. This prevents misleading scores from broken code.
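The short-circuit behavior can be sketched in plain Java. This is an illustrative model, not the real agent-judge API: `Tier`, `Policy`, and `evaluate` are hypothetical names, and since the docs only specify the passing case for `ACCEPT_ON_ALL_PASS`, this sketch simply continues to the next tier either way.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch of cascaded tier evaluation; names are assumptions,
// not the actual agent-judge types.
public class CascadeSketch {
    enum Policy { REJECT_ON_ANY_FAIL, ACCEPT_ON_ALL_PASS, FINAL_TIER }

    record Tier(String name, Policy policy, List<Supplier<Boolean>> checks) {
        boolean allPass() { return checks.stream().allMatch(Supplier::get); }
    }

    static String evaluate(List<Tier> tiers) {
        for (Tier tier : tiers) {
            boolean pass = tier.allPass();
            if (tier.policy() == Policy.FINAL_TIER) {
                // Last tier: its result is the overall verdict.
                return (pass ? "ACCEPTED" : "REJECTED") + " at " + tier.name();
            }
            if (tier.policy() == Policy.REJECT_ON_ANY_FAIL && !pass) {
                // Stop immediately; lower tiers are never evaluated.
                return "REJECTED at " + tier.name();
            }
            // Otherwise fall through to the next tier.
        }
        return "ACCEPTED"; // no FINAL_TIER configured
    }

    public static void main(String[] args) {
        List<Tier> tiers = List.of(
            new Tier("build", Policy.REJECT_ON_ANY_FAIL, List.of(() -> true)),
            new Tier("coverage-preservation", Policy.REJECT_ON_ANY_FAIL, List.of(() -> false)),
            new Tier("test-quality", Policy.FINAL_TIER, List.of(() -> true))
        );
        // build passes, coverage-preservation fails, so the cascade stops there
        System.out.println(evaluate(tiers)); // REJECTED at coverage-preservation
    }
}
```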

Built-in Judge Types

These are registered in JudgeFactory and available in any benchmark:
| Type | Module | What it checks |
| --- | --- | --- |
| `file-exists` | agent-judge-core | A specific file exists in the workspace |
| `file-content` | agent-judge-core | File content matches expected (exact or contains) |
| `maven-build` | agent-judge-exec | `./mvnw <goals>` exits successfully |
| `coverage-preservation` | agent-judge-exec | JaCoCo coverage >= baseline |
| `coverage-improvement` | agent-judge-exec | JaCoCo coverage >= threshold |
| `test-quality-llm` | agent-bench-agents | LLM evaluates test practice adherence |

file-exists

- type: file-exists
  path: hello.txt

file-content

- type: file-content
  path: hello.txt
  expected: "Hello World!"
  match: EXACT    # or CONTAINS

maven-build

- type: maven-build
  goals: [clean, test]
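
coverage-preservation

As in the jury example above, this check takes no options; it compares JaCoCo coverage against the baseline:

```yaml
- type: coverage-preservation
```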

coverage-improvement

- type: coverage-improvement
  min: 50.0       # Minimum instruction coverage percentage

test-quality-llm

- type: test-quality-llm
  prompt: prompts/judge-rubric.txt   # Path relative to benchmark directory
  model: claude-sonnet-4-6
The LLM judge reads the prompt file, evaluates the workspace, and returns structured scores. It must be run via the agent-bench-agents module, which provides the Claude SDK dependency.

Writing a Custom Judge

Judges implement the Judge interface from agent-judge-core:
import java.nio.file.Path;

// Judge, Judgment, JudgmentContext, and JudgmentStatus come from agent-judge-core.
public class MyJudge implements Judge {
    @Override
    public Judgment judge(JudgmentContext context) {
        Path workspace = context.workspace();
        // Inspect the workspace...
        return Judgment.builder()
            .status(JudgmentStatus.PASS)
            .reasoning("Looks good")
            .build();
    }
}
Register it in JudgeFactory:
factory.register("my-check", config -> new MyJudge());
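The registry pattern behind this call can be sketched in a few lines of self-contained Java. Everything here is illustrative (the real JudgeFactory's config type and method signatures may differ): a map from type name to a builder function that receives the check's config and produces a judge.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy sketch of a type-keyed judge registry, analogous to JudgeFactory.
// Names and the Map<String, String> config shape are assumptions.
public class RegistrySketch {
    interface Judge { boolean judge(String workspace); }

    private final Map<String, Function<Map<String, String>, Judge>> builders = new HashMap<>();

    void register(String type, Function<Map<String, String>, Judge> builder) {
        builders.put(type, builder);
    }

    Judge create(String type, Map<String, String> config) {
        var builder = builders.get(type);
        if (builder == null) throw new IllegalArgumentException("Unknown judge type: " + type);
        return builder.apply(config);
    }

    public static void main(String[] args) {
        var registry = new RegistrySketch();
        // A toy judge that passes when the "workspace" string contains a marker.
        registry.register("contains", cfg -> ws -> ws.contains(cfg.get("needle")));
        Judge j = registry.create("contains", Map.of("needle", "hello"));
        System.out.println(j.judge("hello world")); // true
    }
}
```

The builder function is what lets a single registration line serve many check configurations: each `checks:` entry in benchmark.yaml can carry its own options, and the factory hands them to the builder at construction time.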

Benchmark YAML Schema

schema: bench.benchmark.v1
name: my-benchmark
version: "1.0"
description: "What this benchmark measures"
default-timeout: PT10M

jury:
  tiers:
    - name: tier-name
      policy: REJECT_ON_ANY_FAIL | ACCEPT_ON_ALL_PASS | FINAL_TIER
      checks:
        - type: <judge-type>
          # ... judge-specific config

Task YAML Schema

Each task within a benchmark:
schema: bench.task.v1
id: my-task
difficulty: easy | medium | hard
instruction: |
  What the agent should do.
timeout: PT10M        # Optional, overrides benchmark default
metadata:             # Optional, passed to judges
  key: value
setup:                # Optional, scripts run before agent
  - "command 1"
  - "command 2"
post:                 # Optional, scripts run after agent, before grading
  - "command 3"
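Putting the schema together, a hypothetical task file might look like this (the id, instruction, metadata, and commands are purely illustrative):

```yaml
schema: bench.task.v1
id: cover-user-service
difficulty: medium
instruction: |
  Raise test coverage of the UserService class without breaking the build.
timeout: PT15M
metadata:
  target-class: UserService
setup:
  - "./mvnw clean"
```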