
Why a Jury?

A single judge gives you a single score. A jury gives you diagnostic information — when something fails, you know where in the stack it failed and why. The experiment driver uses a cascaded jury with three tiers. Each tier is more expensive than the last, and only fires if cheaper tiers don’t already have a verdict.

The Three Tiers

Tier 1: Deterministic

Zero-cost, instant, binary. Checks facts that are unambiguously right or wrong.

Examples: Does the project compile? Does java -version report the right version? Are all javax.* imports replaced with jakarta.*?

Cost: Free (no LLM calls)
Tier 2: Structural

Compares the agent’s output against the reference implementation at a structural level — AST diffs, import sets, annotation changes, POM dependency trees.

Examples: Are the same imports present? Do method signatures match? Are the right dependencies in the POM?

Cost: Free (structural comparison, no LLM)
Tier 3: Semantic

LLM-powered evaluation for questions that can’t be answered structurally. Uses criteria extracted from the execution plan to judge whether the agent’s approach was sound.

Examples: Is the error handling strategy appropriate? Does the migration preserve business logic semantics?

Cost: LLM tokens per item
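A Tier 1 check is just a deterministic predicate over the workspace. As an illustration of the javax → jakarta example above, here is a standalone sketch of such a check on a source string (the class and method names are illustrative, not part of the framework's Judge API):

```java
import java.util.regex.Pattern;

public class JakartaImportCheck {

    // Matches any remaining javax.* import at the start of a line.
    private static final Pattern JAVAX_IMPORT =
        Pattern.compile("^import\\s+javax\\.", Pattern.MULTILINE);

    /** Returns true if the source contains no remaining javax.* imports. */
    public static boolean migrated(String source) {
        return !JAVAX_IMPORT.matcher(source).find();
    }

    public static void main(String[] args) {
        String before = "import javax.servlet.http.HttpServlet;\n";
        String after  = "import jakarta.servlet.http.HttpServlet;\n";
        System.out.println(migrated(before)); // false: a javax import remains
        System.out.println(migrated(after));  // true: fully migrated
    }
}
```

In a real judge this predicate would run over every .java file in the workspace, but the core of a Tier 1 check stays this small: no model calls, no reference comparison, just a yes/no fact.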

Wiring a Simple Jury

Start with a single Tier 1 judge:
Jury jury = SimpleJury.builder()
    .judge(new BuildSuccessJudge(), 1.0)
    .votingStrategy(new MajorityVotingStrategy())
    .build();

Wiring a Multi-Tier Jury

Add judges from each tier with weights:
// Tier 1: deterministic
Judge buildJudge = new BuildSuccessJudge();
Judge versionJudge = new ClassVersionJudge(61);  // Java 17 = class version 61

// Tier 2: structural
Judge importJudge = new ImportDiffJudge();
Judge pomJudge = new MavenPomDiffJudge();

// Tier 3: semantic (LLM-powered)
Judge semanticJudge = new SemanticDiffJudge(chatModel, criteriaExtractor);

Jury jury = SimpleJury.builder()
    .judge(buildJudge, 1.0)         // Must compile
    .judge(versionJudge, 0.8)       // Right Java version
    .judge(importJudge, 0.6)        // Correct imports
    .judge(pomJudge, 0.6)           // Correct dependencies
    .judge(semanticJudge, 0.4)      // Semantically sound
    .votingStrategy(new MajorityVotingStrategy())
    .build();
Weights determine influence on the final verdict, not ordering. The cascade is implicit in judge cost — cheap judges run first.
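One simple reading of weighted majority voting is: scale each judge's PASS/FAIL by its weight and compare the totals. The sketch below shows that interpretation with the weights from the example above; the record and method names are illustrative, not the framework's actual voting API:

```java
import java.util.List;

public class WeightedMajority {

    /** One judge's verdict with its configured weight (illustrative type). */
    record WeightedVerdict(boolean pass, double weight) {}

    /** PASS wins if the passing weight strictly exceeds the failing weight. */
    static boolean verdict(List<WeightedVerdict> votes) {
        double pass = 0, fail = 0;
        for (WeightedVerdict v : votes) {
            if (v.pass()) pass += v.weight(); else fail += v.weight();
        }
        return pass > fail;
    }

    public static void main(String[] args) {
        // Mirrors the weights above: build=1.0, version=0.8, imports=0.6,
        // pom=0.6, semantic=0.4. Build, version, and semantic passing
        // (1.0 + 0.8 + 0.4 = 2.2) outweighs the two failing diffs (1.2).
        List<WeightedVerdict> votes = List.of(
            new WeightedVerdict(true, 1.0),
            new WeightedVerdict(true, 0.8),
            new WeightedVerdict(false, 0.6),
            new WeightedVerdict(false, 0.6),
            new WeightedVerdict(true, 0.4));
        System.out.println(verdict(votes)); // prints true
    }
}
```

Note how the weighting encodes the priority ordering: a failing build (weight 1.0) can only be outvoted by a broad consensus among the cheaper checks.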

Writing a Custom Judge

Implement Judge and JudgeWithMetadata:
public class BuildSuccessJudge implements Judge, JudgeWithMetadata {

    @Override
    public Judgment judge(JudgmentContext context) {
        Path workspace = context.workspacePath();

        try {
            // Run the build in the agent's workspace
            ProcessBuilder pb = new ProcessBuilder("./mvnw", "compile");
            pb.directory(workspace.toFile());
            Process p = pb.start();
            int exitCode = p.waitFor();

            boolean success = exitCode == 0;

            return Judgment.builder()
                .score(new BooleanScore(success))
                .status(success ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
                .reasoning(success
                    ? "Build succeeded"
                    : "Build failed with exit code " + exitCode)
                .build();
        } catch (IOException e) {
            // The build couldn't be launched at all — that's ERROR, not FAIL
            return error("Build could not be started: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return error("Build was interrupted");
        }
    }

    private Judgment error(String reason) {
        return Judgment.builder()
            .score(new BooleanScore(false))
            .status(JudgmentStatus.ERROR)
            .reasoning(reason)
            .build();
    }

    @Override
    public JudgeMetadata metadata() {
        return new JudgeMetadata(
            "build_success",
            "Verifies the project compiles after agent modifications",
            JudgeType.DETERMINISTIC);
    }
}
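The ClassVersionJudge wired in earlier takes a class-file major version (61 for Java 17), which lives in bytes 6–7 of the class-file header, after the 4-byte magic number and 2-byte minor version. A standalone sketch of that core check, without the framework types:

```java
public class ClassVersionCheck {

    /** Reads the major version from a class-file header (bytes 6-7, big-endian). */
    static int majorVersion(byte[] classFile) {
        if (classFile.length < 8
                || classFile[0] != (byte) 0xCA || classFile[1] != (byte) 0xFE
                || classFile[2] != (byte) 0xBA || classFile[3] != (byte) 0xBE) {
            throw new IllegalArgumentException("not a class file");
        }
        return ((classFile[6] & 0xFF) << 8) | (classFile[7] & 0xFF);
    }

    public static void main(String[] args) {
        // Synthetic header: magic CAFEBABE, minor 0, major 61 (Java 17).
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 61};
        System.out.println(majorVersion(header)); // prints 61
    }
}
```

A real judge would apply this to every .class under target/classes; the point is that "right Java version" reduces to a byte comparison, which is why it belongs in Tier 1.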

Judge interface

| Method | Returns | Description |
| --- | --- | --- |
| judge(JudgmentContext) | Judgment | Evaluate the agent’s output |

JudgmentContext provides

| Field | Type | Description |
| --- | --- | --- |
| workspacePath() | Path | Agent’s modified workspace |
| referencePath() | Path | Reference implementation |
| itemMetadata() | Map | Item metadata (id, slug, tags) |

Judgment fields

| Field | Type | Description |
| --- | --- | --- |
| score | Score | BooleanScore or NumericScore |
| status | JudgmentStatus | PASS, FAIL, or ERROR |
| reasoning | String | Human-readable explanation |

Diagnostic Feedback

After jury evaluation, the DiagnosticAnalyzer classifies failures into 8 gap categories:
| Gap | Where it failed |
| --- | --- |
| Knowledge | Missing or incorrect KB entry |
| Analysis | Pre-analysis missed a pattern |
| Planning | Agent planned the wrong approach |
| Execution | Agent deviated from its own plan |
| Tool | Tool limitation or misconfiguration |
| Prompt | Ambiguous or misleading task prompt |
| Evaluation | Judge produced a false positive/negative |
| Environment | External factor (timeout, network, disk) |
This classification feeds the Forge pipeline — knowledge gaps become new KB entries, tool gaps become new deterministic tools, and the flywheel turns.
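In code, the eight categories from the table reduce to a simple enum. This is a sketch, not the framework's actual type, with the comments taken from the table above:

```java
public enum GapCategory {
    KNOWLEDGE,    // missing or incorrect KB entry
    ANALYSIS,     // pre-analysis missed a pattern
    PLANNING,     // agent planned the wrong approach
    EXECUTION,    // agent deviated from its own plan
    TOOL,         // tool limitation or misconfiguration
    PROMPT,       // ambiguous or misleading task prompt
    EVALUATION,   // judge false positive/negative
    ENVIRONMENT   // external factor (timeout, network, disk)
}
```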
