Overview

Every benchmark defines a jury: a cascade of judge tiers that evaluate the agent's workspace. Judges come from the Agent Judge project; benchmarks wire them together in benchmark.yaml.

Cascaded Tiers

Tiers run in order. Each tier has a policy that determines whether evaluation continues:
| Policy | Behavior |
| --- | --- |
| `REJECT_ON_ANY_FAIL` | If any check fails, stop. Lower tiers are not evaluated. |
| `ACCEPT_ON_ALL_PASS` | If all checks pass, continue. |
| `FINAL_TIER` | Last tier. Its result is the overall verdict. |
This is how the code-coverage benchmark grades:
jury:
  tiers:
    - name: build
      policy: REJECT_ON_ANY_FAIL
      checks:
        - type: maven-build
          goals: [clean, test]
    - name: coverage-preservation
      policy: REJECT_ON_ANY_FAIL
      checks:
        - type: coverage-preservation
    - name: coverage-improvement
      policy: ACCEPT_ON_ALL_PASS
      checks:
        - type: coverage-improvement
          min: 50.0
    - name: test-quality
      policy: FINAL_TIER
      checks:
        - type: test-quality-llm
          prompt: prompts/judge-practice-adherence.txt
          model: claude-sonnet-4-6
If the build tier fails, coverage is never measured; if coverage regresses in the coverage-preservation tier, improvement is never checked. This prevents misleading scores from broken code.
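The short-circuit behavior can be sketched in plain Java. This is an illustrative model, not the real agent-judge API: `Tier`, `Policy`, and `evaluate` are hypothetical names, and since the docs only specify the passing case for `ACCEPT_ON_ALL_PASS`, this sketch simply continues to the next tier either way.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative sketch of cascaded tier evaluation; names are assumptions,
// not the actual agent-judge types.
public class CascadeSketch {
    enum Policy { REJECT_ON_ANY_FAIL, ACCEPT_ON_ALL_PASS, FINAL_TIER }

    record Tier(String name, Policy policy, List<Supplier<Boolean>> checks) {
        boolean allPass() { return checks.stream().allMatch(Supplier::get); }
    }

    static String evaluate(List<Tier> tiers) {
        for (Tier tier : tiers) {
            boolean pass = tier.allPass();
            if (tier.policy() == Policy.FINAL_TIER) {
                // Last tier: its result is the overall verdict.
                return (pass ? "ACCEPTED" : "REJECTED") + " at " + tier.name();
            }
            if (tier.policy() == Policy.REJECT_ON_ANY_FAIL && !pass) {
                // Stop immediately; lower tiers are never evaluated.
                return "REJECTED at " + tier.name();
            }
            // Otherwise fall through to the next tier.
        }
        return "ACCEPTED"; // no FINAL_TIER configured
    }

    public static void main(String[] args) {
        List<Tier> tiers = List.of(
            new Tier("build", Policy.REJECT_ON_ANY_FAIL, List.of(() -> true)),
            new Tier("coverage-preservation", Policy.REJECT_ON_ANY_FAIL, List.of(() -> false)),
            new Tier("test-quality", Policy.FINAL_TIER, List.of(() -> true))
        );
        // build passes, coverage-preservation fails, so the cascade stops there
        System.out.println(evaluate(tiers)); // REJECTED at coverage-preservation
    }
}
```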

Built-in Judge Types

These are registered in JudgeFactory and available in any benchmark:
| Type | Module | What it checks |
| --- | --- | --- |
| `file-exists` | agent-judge-core | A specific file exists in the workspace |
| `file-content` | agent-judge-core | File content matches expected (exact or contains) |
| `maven-build` | agent-judge-exec | `./mvnw <goals>` exits successfully |
| `coverage-preservation` | agent-judge-exec | JaCoCo coverage >= baseline |
| `coverage-improvement` | agent-judge-exec | JaCoCo coverage >= threshold |
| `test-quality-llm` | agent-bench-agents | LLM evaluates test practice adherence |

file-exists

- type: file-exists
  path: hello.txt

file-content

- type: file-content
  path: hello.txt
  expected: "Hello World!"
  match: EXACT    # or CONTAINS

maven-build

- type: maven-build
  goals: [clean, test]
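
coverage-preservation

As in the jury example above, this check takes no options; it compares JaCoCo coverage against the baseline:

```yaml
- type: coverage-preservation
```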

coverage-improvement

- type: coverage-improvement
  min: 50.0       # Minimum instruction coverage percentage

test-quality-llm

- type: test-quality-llm
  prompt: prompts/judge-rubric.txt   # Path relative to benchmark directory
  model: claude-sonnet-4-6
The LLM judge reads the prompt file, evaluates the workspace, and returns structured scores. It must be run via the agent-bench-agents module, which provides the Claude SDK dependency.

Writing a Custom Judge

Judges implement the Judge interface from agent-judge-core:
import java.nio.file.Path;

// Judge, Judgment, JudgmentContext, and JudgmentStatus come from agent-judge-core.
public class MyJudge implements Judge {
    @Override
    public Judgment judge(JudgmentContext context) {
        Path workspace = context.workspace();
        // Inspect the workspace...
        return Judgment.builder()
            .status(JudgmentStatus.PASS)
            .reasoning("Looks good")
            .build();
    }
}
Register it in JudgeFactory:
factory.register("my-check", config -> new MyJudge());
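The registry pattern behind this call can be sketched in a few lines of self-contained Java. Everything here is illustrative (the real JudgeFactory's config type and method signatures may differ): a map from type name to a builder function that receives the check's config and produces a judge.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy sketch of a type-keyed judge registry, analogous to JudgeFactory.
// Names and the Map<String, String> config shape are assumptions.
public class RegistrySketch {
    interface Judge { boolean judge(String workspace); }

    private final Map<String, Function<Map<String, String>, Judge>> builders = new HashMap<>();

    void register(String type, Function<Map<String, String>, Judge> builder) {
        builders.put(type, builder);
    }

    Judge create(String type, Map<String, String> config) {
        var builder = builders.get(type);
        if (builder == null) throw new IllegalArgumentException("Unknown judge type: " + type);
        return builder.apply(config);
    }

    public static void main(String[] args) {
        var registry = new RegistrySketch();
        // A toy judge that passes when the "workspace" string contains a marker.
        registry.register("contains", cfg -> ws -> ws.contains(cfg.get("needle")));
        Judge j = registry.create("contains", Map.of("needle", "hello"));
        System.out.println(j.judge("hello world")); // true
    }
}
```

The builder function is what lets a single registration line serve many check configurations: each `checks:` entry in benchmark.yaml can carry its own options, and the factory hands them to the builder at construction time.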

Benchmark YAML Schema

schema: bench.benchmark.v1
name: my-benchmark
version: "1.0"
description: "What this benchmark measures"
default-timeout: PT10M

jury:
  tiers:
    - name: tier-name
      policy: REJECT_ON_ANY_FAIL | ACCEPT_ON_ALL_PASS | FINAL_TIER
      checks:
        - type: <judge-type>
          # ... judge-specific config

Task YAML Schema

Each task within a benchmark:
schema: bench.task.v1
id: my-task
difficulty: easy | medium | hard
instruction: |
  What the agent should do.
timeout: PT10M        # Optional, overrides benchmark default
metadata:             # Optional, passed to judges
  key: value
setup:                # Optional, scripts run before agent
  - "command 1"
  - "command 2"
post:                 # Optional, scripts run after agent, before grading
  - "command 3"
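Putting the schema together, a hypothetical task file might look like this (the id, instruction, metadata, and commands are purely illustrative):

```yaml
schema: bench.task.v1
id: cover-user-service
difficulty: medium
instruction: |
  Raise test coverage of the UserService class without breaking the build.
timeout: PT15M
metadata:
  target-class: UserService
setup:
  - "./mvnw clean"
```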