Why a Jury?
A single judge gives you a single score. A jury gives you diagnostic information — when something fails, you know where in the stack it failed and why. The experiment driver uses a cascaded jury with three tiers. Each tier is more expensive than the last, and only fires if cheaper tiers don’t already have a verdict.
The Three Tiers
Tier 1: Deterministic
Zero-cost, instant, binary. Checks facts that are unambiguously right or wrong.
Examples: Does the project compile? Does java -version report the right version? Are all javax.* imports replaced with jakarta.*?
Cost: Free (no LLM calls)
Tier 2: Structural
Compares the agent’s output against the reference implementation at a structural level — AST diffs, import sets, annotation changes, POM dependency trees.
Examples: Are the same imports present? Do method signatures match? Are the right dependencies in the POM?
Cost: Free (structural comparison, no LLM)
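To make the structural tier concrete, here is a minimal sketch of one such check: extracting the import set from two Java sources and diffing them. `ImportSetDiff` and its methods are invented for this example; the framework’s actual structural comparators are not shown here.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of a Tier 2 structural check: compare the import sets of the
// agent's output and the reference implementation. Hypothetical code,
// not the framework's actual comparator.
public class ImportSetDiff {
    private static final Pattern IMPORT =
        Pattern.compile("^import\\s+([\\w.*]+);", Pattern.MULTILINE);

    // Collect every imported name from a Java source string.
    static Set<String> imports(String source) {
        Set<String> result = new TreeSet<>();
        Matcher m = IMPORT.matcher(source);
        while (m.find()) result.add(m.group(1));
        return result;
    }

    // Imports present in the reference but missing from the agent's output.
    static Set<String> missing(String agentSource, String referenceSource) {
        Set<String> diff = new TreeSet<>(imports(referenceSource));
        diff.removeAll(imports(agentSource));
        return diff;
    }

    public static void main(String[] args) {
        String agent = "import jakarta.servlet.http.HttpServlet;\nclass A {}";
        String reference = "import jakarta.servlet.http.HttpServlet;\n"
            + "import jakarta.inject.Inject;\nclass A {}";
        System.out.println(missing(agent, reference)); // [jakarta.inject.Inject]
    }
}
```

The same shape extends to the other structural comparisons (annotations, POM dependencies): parse both sides into sets or trees, then diff.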
Tier 3: Semantic
LLM-powered evaluation for questions that can’t be answered structurally. Uses criteria extracted from the execution plan to judge whether the agent’s approach was sound.
Examples: Is the error handling strategy appropriate? Does the migration preserve business logic semantics?
Cost: LLM tokens per item
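The cascade across these three tiers can be sketched as follows. Every name here (`CascadeSketch`, `Tier`, `Verdict`) is hypothetical and only illustrates the short-circuit behavior — cheaper tiers run first, and a tier that reaches a verdict stops the cascade — not the driver’s actual API.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the cascaded jury: cheaper tiers run first and
// short-circuit the cascade as soon as one of them reaches a verdict.
public class CascadeSketch {
    enum Verdict { PASS, FAIL }

    // A tier inspects the workspace and may or may not reach a verdict.
    record Tier(String name, Function<String, Optional<Verdict>> evaluate) {}

    static Verdict judge(List<Tier> tiers, String workspace) {
        for (Tier tier : tiers) {
            Optional<Verdict> verdict = tier.evaluate().apply(workspace);
            if (verdict.isPresent()) {
                return verdict.get(); // a cheaper tier decided; skip the rest
            }
        }
        return Verdict.FAIL; // no tier reached a verdict
    }

    public static void main(String[] args) {
        List<Tier> tiers = List.of(
            // Tier 1: deterministic — fails fast if imports were not migrated
            new Tier("deterministic", ws -> ws.contains("javax.")
                ? Optional.of(Verdict.FAIL) : Optional.empty()),
            // Tier 2: structural — placeholder that defers to the next tier
            new Tier("structural", ws -> Optional.empty()),
            // Tier 3: semantic — most expensive, only fires if still needed
            new Tier("semantic", ws -> Optional.of(Verdict.PASS))
        );
        System.out.println(judge(tiers, "import javax.servlet.*;"));   // FAIL
        System.out.println(judge(tiers, "import jakarta.servlet.*;")); // PASS
    }
}
```

The expensive semantic tier only pays its token cost for items the free tiers could not decide.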
Wiring a Simple Jury
Start with a single Tier 1 judge:
Wiring a Multi-Tier Jury
Add judges from each tier with weights:
Writing a Custom Judge
Implement Judge and JudgeWithMetadata:
Judge interface
| Method | Returns | Description |
|---|---|---|
| judge(JudgmentContext) | Judgment | Evaluate the agent’s output |
JudgmentContext provides
| Field | Type | Description |
|---|---|---|
| workspacePath() | Path | Agent’s modified workspace |
| referencePath() | Path | Reference implementation |
| itemMetadata() | Map | Item metadata (id, slug, tags) |
Judgment fields
| Field | Type | Description |
|---|---|---|
| score | Score | BooleanScore or NumericScore |
| status | JudgmentStatus | PASS, FAIL, or ERROR |
| reasoning | String | Human-readable explanation |
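Putting the three tables together, a custom deterministic judge might look like the sketch below. The framework types (Judge, JudgmentContext, Judgment, BooleanScore, JudgmentStatus) are re-declared inline so the example compiles on its own; in real use they come from the framework, and their actual signatures (including the Map element types) may differ.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.stream.Stream;

// Stand-ins for the framework types described in the tables above.
interface Score {}
record BooleanScore(boolean value) implements Score {}
enum JudgmentStatus { PASS, FAIL, ERROR }
record Judgment(Score score, JudgmentStatus status, String reasoning) {}
record JudgmentContext(Path workspacePath, Path referencePath,
                       Map<String, Object> itemMetadata) {}
interface Judge { Judgment judge(JudgmentContext context); }

// Deterministic Tier 1 check: no javax.* imports may remain in the workspace.
class NoJavaxImportsJudge implements Judge {
    @Override
    public Judgment judge(JudgmentContext context) {
        try (Stream<Path> files = Files.walk(context.workspacePath())) {
            boolean hasJavax = files
                .filter(p -> p.toString().endsWith(".java"))
                .anyMatch(p -> {
                    try {
                        return Files.readString(p).contains("import javax.");
                    } catch (Exception e) {
                        return false; // unreadable file: skip rather than crash
                    }
                });
            return hasJavax
                ? new Judgment(new BooleanScore(false), JudgmentStatus.FAIL,
                      "javax.* imports still present")
                : new Judgment(new BooleanScore(true), JudgmentStatus.PASS,
                      "all imports migrated to jakarta.*");
        } catch (Exception e) {
            // Infrastructure failure, not an agent failure: report ERROR.
            return new Judgment(new BooleanScore(false), JudgmentStatus.ERROR,
                e.getMessage());
        }
    }

    public static void main(String[] args) throws Exception {
        Path ws = Files.createTempDirectory("ws");
        Files.writeString(ws.resolve("A.java"),
            "import javax.servlet.http.HttpServlet;\nclass A {}");
        Judgment j = new NoJavaxImportsJudge()
            .judge(new JudgmentContext(ws, ws, Map.of()));
        System.out.println(j.status()); // FAIL
    }
}
```

Note the ERROR path: a judge that cannot run reports ERROR rather than FAIL, so infrastructure problems are not scored against the agent.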
Diagnostic Feedback
After jury evaluation, the DiagnosticAnalyzer classifies failures into eight gap categories:
| Gap | Where it failed |
|---|---|
| Knowledge | Missing or incorrect KB entry |
| Analysis | Pre-analysis missed a pattern |
| Planning | Agent planned the wrong approach |
| Execution | Agent deviated from its own plan |
| Tool | Tool limitation or misconfiguration |
| Prompt | Ambiguous or misleading task prompt |
| Evaluation | Judge produced a false positive/negative |
| Environment | External factor (timeout, network, disk) |
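As a sketch, the taxonomy above could be represented as an enum. This only restates the table as data; the real DiagnosticAnalyzer’s types and classification logic are not shown here.

```java
// The eight gap categories from the table above, as a hypothetical enum.
enum Gap {
    KNOWLEDGE("Missing or incorrect KB entry"),
    ANALYSIS("Pre-analysis missed a pattern"),
    PLANNING("Agent planned the wrong approach"),
    EXECUTION("Agent deviated from its own plan"),
    TOOL("Tool limitation or misconfiguration"),
    PROMPT("Ambiguous or misleading task prompt"),
    EVALUATION("Judge produced a false positive/negative"),
    ENVIRONMENT("External factor (timeout, network, disk)");

    final String whereItFailed;
    Gap(String whereItFailed) { this.whereItFailed = whereItFailed; }
}

class GapDemo {
    public static void main(String[] args) {
        for (Gap g : Gap.values()) {
            System.out.println(g + ": " + g.whereItFailed);
        }
    }
}
```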
Related
Four-Tier Jury Methodology
The evaluation framework behind experiment scoring
Creating Experiments
Dataset design, variant ladders, configuration