The Problem
LLM-as-judge is expensive and non-deterministic. Running GPT-4 evaluation on every agent output costs $0.50-2.00 per assessment and produces variable results.

The Solution

A cascaded jury that filters work products through cheap, deterministic checks before reaching expensive LLM evaluation. Only work products that pass all lower tiers advance.

The Four Tiers
T0: Deterministic
Checks that require no execution — regex matching, file existence, syntax validation, compilation checks. Examples:

- Does the generated test file exist?
- Does it compile?
- Does it contain at least one `@Test` annotation?
- Are import statements valid?
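A T0 gate can be sketched in a few lines. This is a minimal illustration, not the Agent Judge implementation: the function name `t0_deterministic` and the exact regexes are assumptions, and the compilation check is omitted here (it would shell out to `javac`).

```python
import re
from pathlib import Path

def t0_deterministic(test_file: str) -> bool:
    """T0: cheap, execution-free checks on a generated Java test file.
    Illustrative only -- names and regexes are not the Agent Judge API."""
    path = Path(test_file)
    if not path.is_file():                        # file existence
        return False
    source = path.read_text(encoding="utf-8")
    if not re.search(r"@Test\b", source):         # at least one @Test annotation
        return False
    # Imports must look like valid dotted package names (syntactic check only;
    # a compile check via javac would catch the rest).
    imports = re.findall(r"^import\s+(?:static\s+)?([\w.*]+)\s*;",
                         source, flags=re.MULTILINE)
    return all(re.fullmatch(r"\w+(\.\w+)*(\.\*)?", imp) for imp in imports)
```

Because nothing executes, this tier costs microseconds per output and can reject malformed work products before any build is attempted.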
T1: Command
Checks that run shell commands and inspect exit codes or output patterns. Examples:

- Does `mvn test` pass?
- Does the coverage report show improvement?
- Does `checkstyle` pass?
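A T1 gate reduces to running each command and checking its exit status. A sketch, with the command list supplied by the caller rather than hard-coded (the name `t1_command` is ours, not Agent Judge's):

```python
import subprocess

def t1_command(commands: list[list[str]], cwd: str = ".") -> bool:
    """T1: run each command and gate on its exit code.
    The caller supplies the commands, e.g. mvn test or checkstyle."""
    for cmd in commands:
        result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
        if result.returncode != 0:   # nonzero exit -> reject, skip later tiers
            return False
    return True
```

For the Maven examples above, a caller might pass `[["mvn", "test"], ["mvn", "checkstyle:check"]]`; output-pattern checks (e.g. parsing a coverage report) would inspect `result.stdout` instead of only the exit code.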
T2: Golden Test
Compares agent output against known-good reference outputs using structural similarity. Examples:

- Does the generated test cover the same methods as the reference test?
- Is the assertion strategy consistent with project conventions?
- Does the test structure match golden examples?
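One way to sketch the method-coverage comparison is Jaccard similarity over the set of method names each test calls. Everything here is an assumption: the regex is a crude stand-in for AST parsing, and the 0.7 threshold is a guess, not a tuned value from the lab.

```python
import re

def covered_methods(java_test_source: str) -> set[str]:
    """Rough extraction of methods a test exercises: names in `.name(` calls.
    A real implementation would parse the AST; regex is a stand-in."""
    return set(re.findall(r"\.(\w+)\s*\(", java_test_source))

def t2_golden(candidate: str, golden: str, threshold: float = 0.7) -> bool:
    """T2: pass if the candidate calls roughly the same methods as the
    golden reference (Jaccard similarity). Threshold is illustrative."""
    a, b = covered_methods(candidate), covered_methods(golden)
    if not b:
        return not a
    jaccard = len(a & b) / len(a | b)
    return jaccard >= threshold
```

Checks on assertion strategy or test structure would compare other extracted features (assertion kinds, setup/teardown shape) the same way: extract, compare against the golden example, gate on a similarity score.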
T3: LLM Assessment
Semantic evaluation by a language model — reserved for cases that pass all lower tiers. Examples:

- Is the test meaningful (not just asserting `true`)?
- Does it test edge cases?
- Is the test maintainable?
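Chaining the four tiers is then a cheapest-first, short-circuiting loop: an output advances only while every check passes, so the T3 callable (which wraps the expensive LLM call) runs last and only for survivors. The `cascade` helper and its return convention are illustrative, not Agent Judge's API.

```python
def cascade(output, tiers):
    """Run checks cheapest-first; an output advances only while tiers pass.
    Returns the index of the first failing tier, or None if all pass.
    `tiers` would typically be [t0, t1, t2, t3_llm], ordered by cost."""
    for i, check in enumerate(tiers):
        if not check(output):
            return i   # rejected here; later, more expensive tiers never run
    return None
```

The short-circuit is the whole point of the design: a failure at T0 costs a regex match, not an LLM call.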
Cascade Economics
By filtering at each tier, typically only 30-40% of outputs reach T3. This reduces evaluation cost by 60-80% while maintaining quality, because outputs that fail T0-T2 would fail T3 anyway. For example, at $1.00 per T3 assessment, a 35% pass-through rate cuts the average LLM cost to $0.35 per output — a 65% reduction, with the cheaper tiers contributing negligible overhead.

Implementation
The four-tier jury is implemented in Agent Judge and used across all lab experiments.

Applied In
- Code Coverage v1 — Full T0-T3 cascade on 9 variants
- Code Coverage v2 — Refined scoring, T3=0.933 for forge variant
- Issue Classification — Adapted for SWE-bench task evaluation