Judges Are Like Unit Tests for Agents
The core analogy: just as JUnit gives you assertEquals and AssertJ gives you assertThat, Agent Judge gives you FileExistsJudge, BuildSuccessJudge, and CorrectnessJudge.
You wouldn’t ship application code without tests or assertions.
Agents need the same discipline — automated, repeatable evaluation that runs after every execution and catches regressions before they reach users.
This framing drives several design decisions:
- Judges should be cheap to write (one functional interface, one method)
- Judges should be cheap to run (deterministic judges cost nothing)
- Judges should compose (juries aggregate judges like test suites aggregate tests)
- Results should be actionable (reasoning and checks, not just pass/fail)
Zero-Dependency Core
agent-judge-core has no external dependencies. Not Spring, not Spring AI, not Jackson — nothing.
This means you can evaluate agent output in:
- A plain Java application
- A JUnit test
- A CLI tool
- A Spring Boot service
- A serverless function
The module layering adds dependencies only when needed:
agent-judge-core (zero deps)
agent-judge-ai-core (zero deps)
↓
agent-judge-exec (+ agent-sandbox)
agent-judge-file (+ JavaParser, Maven Model)
agent-judge-llm (+ Spring AI ChatClient, SpringAiJudgeModel)
agent-judge-rag (+ agent-judge-llm)
agent-judge-spring-ai (+ Spring AI Model, provided)
agent-judge-langchain4j (+ LangChain4j, provided)
agent-judge-koog (+ Koog Agents, provided)
agent-judge-agent-client (+ AgentClient, AgentClientJudgeModel, provided)
If all you need is file checks and boolean logic, you pay for nothing you don’t use.
Framework bridge modules use provided-scope dependencies — they assume you already have the framework on your classpath.
Functional Interface Discipline
Judge is a @FunctionalInterface with a single method and no default methods:
@FunctionalInterface
public interface Judge {
    Judgment judge(JudgmentContext context);
}
This is deliberate. A single-method interface means:
- Lambdas work: ctx -> Judgment.pass("ok")
- Method references work: this::evaluateBuild
- Composition uses the Judges utility class, not interface default methods
Metadata is handled through composition (NamedJudge wraps any judge) and the JudgeWithMetadata marker interface, not through method defaults on Judge itself.
This avoids the combinatorial explosion of default method interactions and keeps the core contract minimal.
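The three consequences above can be sketched in a few lines. This is a minimal illustration using simplified stand-in types, not the library's actual definitions: the record components, the NamedJudge behavior (prefixing the reasoning with a name), and the JudgeSketch helper are all assumptions made for the example.

```java
import java.util.List;

// Simplified stand-ins for illustration; the real Agent Judge types carry more fields.
record JudgmentContext(String goal, String output) {}

record Judgment(boolean passed, String reasoning) {
    static Judgment pass(String reasoning) { return new Judgment(true, reasoning); }
    static Judgment fail(String reasoning) { return new Judgment(false, reasoning); }
}

@FunctionalInterface
interface Judge {
    Judgment judge(JudgmentContext context);
}

// Hypothetical wrapper: adds a name by composition instead of a default method on Judge.
record NamedJudge(String name, Judge delegate) implements Judge {
    @Override
    public Judgment judge(JudgmentContext context) {
        Judgment inner = delegate.judge(context);
        return new Judgment(inner.passed(), name + ": " + inner.reasoning());
    }
}

final class JudgeSketch {
    private JudgeSketch() {}

    // A plain method with the same shape as Judge.judge, usable as a method reference.
    static Judgment nonEmptyOutput(JudgmentContext ctx) {
        return ctx.output().isBlank()
                ? Judgment.fail("output is empty")
                : Judgment.pass("output is present");
    }

    // Runs a lambda judge, a method-reference judge, and a wrapped judge on one context.
    static List<Judgment> runAll(JudgmentContext ctx) {
        Judge lambda = c -> Judgment.pass("ok");
        Judge methodRef = JudgeSketch::nonEmptyOutput;
        Judge named = new NamedJudge("non-empty", methodRef);
        return List.of(lambda.judge(ctx), methodRef.judge(ctx), named.judge(ctx));
    }
}
```

Because the wrapper is itself a Judge, naming stacks with any other composition without the Judge interface growing new methods.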
Sealed Score Hierarchy
Score is a sealed interface with three implementations:
public sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore
Sealed types give you compile-time exhaustiveness: a switch expression over Score with no default branch fails to compile if you miss a case.
This matters when aggregating heterogeneous scores in a jury.
The three types cover the evaluation spectrum:
| Type | Use | Example |
|---|---|---|
| BooleanScore | Binary pass/fail | Did it compile? Does the file exist? |
| NumericalScore | Continuous scoring with bounds | Code quality 7.5/10, coverage 85% |
| CategoricalScore | Discrete categories | EXCELLENT / GOOD / FAIR / POOR |
The Scores utility handles cross-type normalization so a jury with mixed score types can still aggregate cleanly.
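To make the exhaustiveness and normalization ideas concrete, here is a minimal sketch for Java 21 or later. The record components and the normalization rule (map every score onto 0.0 to 1.0, then average) are assumptions chosen for illustration; the actual Scores utility may normalize differently.

```java
import java.util.List;

// Simplified stand-ins; component names are assumptions, not the library's definitions.
sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore {}
record BooleanScore(boolean value) implements Score {}
record NumericalScore(double value, double min, double max) implements Score {}
// Categories are listed from worst to best so that position implies rank.
record CategoricalScore(String category, List<String> worstToBest) implements Score {}

final class ScoreSketch {
    private ScoreSketch() {}

    // Maps a score onto [0.0, 1.0], assuming the value lies within its bounds.
    // There is no default branch, so adding a fourth Score type breaks
    // compilation here until the new case is handled.
    static double normalize(Score score) {
        return switch (score) {
            case BooleanScore b -> b.value() ? 1.0 : 0.0;
            case NumericalScore n -> {
                if (n.max() <= n.min()) {
                    throw new IllegalArgumentException("max must exceed min");
                }
                yield (n.value() - n.min()) / (n.max() - n.min());
            }
            case CategoricalScore c -> {
                int index = c.worstToBest().indexOf(c.category());
                if (index < 0 || c.worstToBest().size() < 2) {
                    throw new IllegalArgumentException("unknown category or too few categories");
                }
                yield (double) index / (c.worstToBest().size() - 1);
            }
        };
    }

    // Averages normalized scores so mixed types can share one aggregate.
    static double average(List<Score> scores) {
        return scores.stream().mapToDouble(ScoreSketch::normalize).average().orElse(0.0);
    }
}
```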
Cascaded Cost Model
Not all evaluation costs the same. A typical cascade orders checks from cheapest and most decisive to most expensive:
| Category | Judge type | Cost | Latency | Example |
|---|---|---|---|---|
| Deterministic | File checks, content match | Free | Microseconds | FileExistsJudge, FileContentJudge |
| File comparison | AST diff, POM comparison | Free | Milliseconds | JavaSemanticJudge, MavenSemanticJudge |
| Command | Build, test execution | Compute | Seconds-minutes | BuildSuccessJudge, CommandJudge |
| LLM | Semantic correctness, RAG | Tokens | Seconds | CorrectnessJudge, FaithfulnessJudge |
The CascadedJury codifies this: fail fast on cheap checks, escalate only when necessary.
Tier 1 (REJECT_ON_ANY_FAIL) → fails? → STOP: verdict is FAIL
↓ passes
Tier 2 (ACCEPT_ON_ALL_PASS) → all pass? → STOP: verdict is PASS
↓ mixed
Tier 3 (FINAL_TIER) → always runs → verdict from LLM assessment
If the project doesn’t compile, there’s no point running an LLM judge to evaluate code quality.
The cascaded pattern saves both time and money.
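The tier flow above can be sketched as a loop that stops at the first tier able to decide. This is an illustration only, not the CascadedJury implementation: judges are reduced to predicates over the agent's output text, the Tier record and the marker strings are invented for the example, and a counter stands in for the cost of an LLM call.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

enum Verdict { PASS, FAIL }
enum TierPolicy { REJECT_ON_ANY_FAIL, ACCEPT_ON_ALL_PASS, FINAL_TIER }

// Judges are reduced to predicates over the agent's output text for brevity.
record Tier(TierPolicy policy, List<Predicate<String>> judges) {}

final class CascadeSketch {
    private CascadeSketch() {}

    static Verdict evaluate(List<Tier> tiers, String output) {
        for (Tier tier : tiers) {
            boolean allPass = tier.judges().stream().allMatch(j -> j.test(output));
            switch (tier.policy()) {
                case REJECT_ON_ANY_FAIL -> { if (!allPass) return Verdict.FAIL; }
                case ACCEPT_ON_ALL_PASS -> { if (allPass) return Verdict.PASS; }
                case FINAL_TIER -> { return allPass ? Verdict.PASS : Verdict.FAIL; }
            }
        }
        return Verdict.FAIL; // conservative fallback when no tier decides
    }

    // Builds a three-tier cascade; llmCalls counts how often the costly tier runs.
    static List<Tier> threeTiers(AtomicInteger llmCalls) {
        Predicate<String> buildPassed = out -> out.contains("BUILD SUCCESS");
        Predicate<String> matchesReference = out -> out.contains("matches reference");
        Predicate<String> simulatedLlm = out -> {
            llmCalls.incrementAndGet(); // stands in for a paid model call
            return out.contains("semantically equivalent");
        };
        return List.of(
                new Tier(TierPolicy.REJECT_ON_ANY_FAIL, List.of(buildPassed)),
                new Tier(TierPolicy.ACCEPT_ON_ALL_PASS, List.of(matchesReference)),
                new Tier(TierPolicy.FINAL_TIER, List.of(simulatedLlm)));
    }
}
```

With this setup, an output that fails the build check never increments the counter, and neither does one that the deterministic comparison already accepts; only the ambiguous middle reaches the simulated LLM tier.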
Frameworks Are Vertical, Evaluation Is Horizontal
Agent runtimes are vertical stacks — Spring AI, LangChain4j, Koog, and CLI-delegated agents (via AgentClient) each provide their own execution model, memory, tool calling, and observability.
Evaluation cuts across all of them.
Agent runtimes / frameworks
Spring AI    LangChain4j    Koog    AgentClient (CLI agents)
    |             |          |           |
    v             v          v           v
-----------------------------------------------------------------
                           Agent Judge
                   horizontal evaluation layer
-----------------------------------------------------------------
Build success, file checks, AST comparison, coverage,
tool-use metadata checks, RAG faithfulness, hallucination,
LLM-as-judge, juries, cascaded juries
A FaithfulnessJudge doesn’t care whether the answer came from a Spring AI ChatClient, a LangChain4j AiService, a Koog agent, or Claude Code via AgentClient.
It evaluates the (question, context, answer) triple the same way.
This is the core architectural bet: evaluation is framework-neutral, and the bridge layer is thin.
Adapter Module Architecture
Each framework bridge module follows the same pattern:
- Provided-scope dependency on the framework: the bridge doesn’t pull the framework into your classpath; you already have it.
- JudgmentContextBuilder: a static utility that converts framework-specific output (ChatResponse, Result<T>, AIAgent, AgentClientResponse) into a JudgmentContext.
- Evaluator: a static convenience class with Judge/Jury overloads, including variants that accept extra metadata, combining execution + context building + evaluation into a one-liner.
- Metadata key conventions: public constants where available (SpringAiMetadataKeys, AgentClientMetadataKeys), and documented metadata keys for values such as model name, finish reason, token usage, sources, or agent ID.
The bridge code is typically 30-100 lines. It maps framework-specific response metadata (token usage, finish reason, model name) into JudgmentContext.metadata() where judges can optionally inspect it.
This keeps the core zero-dependency, keeps bridges thin, and means new framework support is a single module addition — not a core change.
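The shape of such a bridge can be sketched against a made-up framework. Everything prefixed with Fake below is hypothetical, as are the metadata key names and the simplified core records; the sketch only shows how a context builder, key constants, and a one-line evaluator fit together.

```java
import java.util.Map;

// Hypothetical framework response; stands in for a type such as ChatResponse.
record FakeChatResponse(String text, String model, String finishReason, int totalTokens) {}

// Simplified stand-ins for the core types.
record JudgmentContext(String goal, String output, Map<String, Object> metadata) {}
record Judgment(boolean passed, String reasoning) {}

@FunctionalInterface
interface Judge {
    Judgment judge(JudgmentContext context);
}

// Metadata keys as constants (names invented for this sketch).
final class FakeMetadataKeys {
    static final String MODEL = "model";
    static final String FINISH_REASON = "finishReason";
    static final String TOTAL_TOKENS = "totalTokens";
    private FakeMetadataKeys() {}
}

// Converts the framework-specific response into a framework-neutral context.
final class FakeContextBuilder {
    private FakeContextBuilder() {}

    static JudgmentContext from(String goal, FakeChatResponse response) {
        // Map.of rejects nulls, so this assumes the response fields are populated.
        return new JudgmentContext(goal, response.text(), Map.of(
                FakeMetadataKeys.MODEL, response.model(),
                FakeMetadataKeys.FINISH_REASON, response.finishReason(),
                FakeMetadataKeys.TOTAL_TOKENS, response.totalTokens()));
    }
}

// One-liner convenience: build the context, then run the judge.
final class FakeEvaluator {
    private FakeEvaluator() {}

    static Judgment evaluate(Judge judge, String goal, FakeChatResponse response) {
        return judge.judge(FakeContextBuilder.from(goal, response));
    }
}

final class BridgeSketch {
    private BridgeSketch() {}

    // A judge that optionally inspects metadata; it never sees FakeChatResponse.
    static Judge withinTokenBudget(int maxTokens) {
        return ctx -> {
            Object used = ctx.metadata().get(FakeMetadataKeys.TOTAL_TOKENS);
            boolean ok = used instanceof Integer n && n <= maxTokens;
            return new Judgment(ok, "tokens used: " + used + ", budget: " + maxTokens);
        };
    }
}
```

The judge depends only on the context and a key name, so supporting another framework means writing another builder rather than another judge.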
Composition Over Inheritance for AI Judges
The original LLMJudge uses the template method pattern — you subclass it, override buildPrompt() and parseResponse(), and the base class handles the LLM call via Spring AI’s ChatClient. This works, but it tangles three concerns into one class hierarchy:
- Prompt rendering: how context becomes a prompt string
- Model invocation: which AI backend to call
- Response classification: how to turn the model’s text into a Judgment
Subclassing couples you to Spring AI and makes it hard to swap one concern without touching the others. Testing requires a real ChatClient.Builder or mocking Spring AI internals.
The agent-judge-ai-core module separates these into composable parts:
JudgePromptTemplate → JudgeModel → JudgmentClassifier → Judgment
(render) (invoke) (classify)
ModelBackedJudge wires the three parts together via a builder. Each part is independently testable and replaceable:
JudgePromptTemplate loads templates from classpath, file, or string. Renders {{variable}} placeholders from JudgmentContext. Validates required variables at build time.
JudgeModel is a @FunctionalInterface — any (JudgeModelRequest → JudgeModelResponse) lambda works. SpringAiJudgeModel (in agent-judge-llm) delegates to Spring AI’s ChatClient. AgentClientJudgeModel (in agent-judge-agent-client) invokes a CLI agent that can use tools and inspect files — enabling agentic judges.
JudgmentClassifier maps text to a Judgment. LabelJudgmentClassifier.passFail() handles the common binary case. Custom classifiers handle structured or multi-label responses.
Like agent-judge-core, the agent-judge-ai-core module has zero external dependencies. The actual AI backend arrives through a JudgeModel implementation from a bridge module. This preserves the zero-dep principle while giving AI judges first-class infrastructure.
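The render, invoke, classify pipeline can be sketched without any AI dependency. The Sketch-prefixed types below are deliberately named differently from the library's JudgePromptTemplate, JudgeModel, and JudgmentClassifier because their signatures are assumptions; the point is only that a stub lambda can replace the model in a test.

```java
import java.util.Locale;
import java.util.Map;

// Simplified stand-ins for the core types.
record JudgmentContext(String goal, String output) {}
record Judgment(boolean passed, String reasoning) {}

@FunctionalInterface
interface Judge {
    Judgment judge(JudgmentContext context);
}

// Render: fills {{variable}} placeholders from a map of values.
record SketchTemplate(String template) {
    String render(Map<String, String> variables) {
        String prompt = template;
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            prompt = prompt.replace("{{" + entry.getKey() + "}}", entry.getValue());
        }
        return prompt;
    }
}

// Invoke: any prompt-to-text function can act as the model.
@FunctionalInterface
interface SketchModel {
    String complete(String prompt);
}

// Classify: turns the model's raw text into a Judgment.
@FunctionalInterface
interface SketchClassifier {
    Judgment classify(String modelText);
}

// Wires the three parts together; each one can be swapped independently.
record SketchJudge(SketchTemplate template, SketchModel model, SketchClassifier classifier)
        implements Judge {
    @Override
    public Judgment judge(JudgmentContext ctx) {
        String prompt = template.render(Map.of("goal", ctx.goal(), "output", ctx.output()));
        return classifier.classify(model.complete(prompt));
    }
}

final class PipelineSketch {
    private PipelineSketch() {}

    // Binary classifier: passes when the response starts with the label PASS.
    static SketchClassifier passFail() {
        return text -> {
            String trimmed = text.strip();
            boolean passed = trimmed.toUpperCase(Locale.ROOT).startsWith("PASS");
            return new Judgment(passed, trimmed);
        };
    }
}
```

In a unit test the model can be a lambda such as `prompt -> "PASS: looks fine"`, which is the testability gain the composable design aims for.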
When to use which:
| Approach | Use when |
|---|---|
| ModelBackedJudge | Default for AI judges. Composable, testable, framework-neutral. |
| LLMJudge subclass | You need Spring AI-specific features, complex prompt logic, or custom response parsing that doesn’t fit a JudgmentClassifier. |
Best-of-Breed Evaluation Patterns
Agent Judge borrows from patterns that have emerged across modern evaluation systems, including Python-first eval frameworks, SaaS evaluation platforms, and JVM projects such as Dokimos.
The goal is not to clone any one framework. It is to bring the strongest ideas into a JVM-native, framework-neutral library:
| Pattern | Why it matters in Agent Judge |
|---|---|
| Evaluators as small composable functions | Keeps judges easy to write, test, and reuse |
| Structured verdicts, not raw booleans | Gives humans and automation enough information to act |
| Mixed deterministic and LLM-based checks | Lets cheap checks catch obvious failures before expensive semantic checks |
| Aggregation / voting | Supports juries instead of one fragile evaluator |
| Dataset and per-execution evaluation as separate concerns | Keeps Agent Judge focused on “did this execution work?” while leaving bulk experiment orchestration to other tools |
| Metadata-rich evaluation context | Lets judges reason about status, token usage, sources, tool calls, and workspace state without coupling to a framework |
| Cost-aware escalation | Avoids paying for LLM judging when deterministic checks already reject the run |
Agent Judge is intentionally JVM-native rather than a thin wrapper around a Python eval stack: it understands workspaces, Maven builds, Java source structure, typed records, sealed scores, and Java framework integration points.
Judge vs Journal
Agent Judge draws a sharp boundary between inputs to judges and narrative trace.
Judges legitimately reason about:
- Token usage, tool executions, retrieved sources — these are structured outputs that affect verdict logic
- Finish reason, execution status, timing — these determine whether evaluation is even meaningful
Judges do NOT consume:
- Intermediate responses, full conversation history, private reasoning traces, or step-by-step narrative logs — this is cognitive observability, not evaluation input
The second category belongs to a separate concern (agent-journal) that captures the narrative of how an agent arrived at its answer.
Mixing trace data into evaluation context would couple judges to specific agent architectures and make the JudgmentContext contract framework-specific — exactly what the horizontal layer avoids.
Agent-Agnostic by Design
JudgmentContext doesn’t import any agent framework.
It describes what happened (goal, workspace, status, timing) without coupling to how it happened.
The workspace-centric pattern:
- An agent modifies a directory
- A judge inspects the directory
- The judge doesn’t know or care which agent made the changes
This decoupling means the same jury works with Claude Code, Gemini CLI, a custom Python agent, or a human developer.
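A minimal sketch of the workspace-centric pattern, using only java.nio.file. The WorkspaceContext record and the fileExists helper are hypothetical simplifications (the library's FileExistsJudge may differ); the demo writes a file itself to stand in for whatever agent or person produced the workspace.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified stand-ins; the real context carries more than a goal and a path.
record Judgment(boolean passed, String reasoning) {}
record WorkspaceContext(String goal, Path workspace) {}

@FunctionalInterface
interface Judge {
    Judgment judge(WorkspaceContext context);
}

final class WorkspaceSketch {
    private WorkspaceSketch() {}

    // Hypothetical file-existence judge: it only looks at the directory, so it
    // cannot tell (and does not need to know) who wrote the file.
    static Judge fileExists(String relativePath) {
        return ctx -> {
            Path target = ctx.workspace().resolve(relativePath);
            return Files.isRegularFile(target)
                    ? new Judgment(true, relativePath + " exists")
                    : new Judgment(false, relativePath + " is missing");
        };
    }

    // Creates a temporary workspace containing README.md, then judges it.
    static Judgment demo(String expectedFile) {
        try {
            Path workspace = Files.createTempDirectory("agent-workspace");
            // Stands in for the agent's work; a human could have written this too.
            Files.writeString(workspace.resolve("README.md"), "# Hello\n");
            return fileExists(expectedFile)
                    .judge(new WorkspaceContext("create a README", workspace));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```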
Immutable Records
Most evaluation data types are Java records: Judgment, Verdict, Check, JudgmentContext, and JudgeMetadata. Score is a sealed interface with record implementations (BooleanScore, NumericalScore, CategoricalScore).
Records are:
- Immutable: no accidental mutation between judges
- Value-based: equality by content, not identity
- Pattern-matchable: if (score instanceof NumericalScore(var v, var min, var max))
- Easy to serialize: record components map cleanly to JSON/logging formats without requiring serialization dependencies in core
Combined with sealed types, this gives you a type-safe, exhaustive, immutable evaluation data model.
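The first two record properties can be illustrated with a small sketch. The Check and Judgment shapes below are simplified assumptions, not the library's definitions. One detail worth noting: a record only makes its fields final, so a component holding a collection still needs a defensive copy to be unmodifiable, which the compact constructor here provides.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins; the real records have more components.
record Check(String name, boolean passed) {}

record Judgment(boolean passed, String reasoning, List<Check> checks) {
    // Compact constructor: List.copyOf detaches the record from the caller's
    // list and makes the stored list unmodifiable.
    Judgment {
        checks = List.copyOf(checks);
    }
}

final class RecordSketch {
    private RecordSketch() {}

    // Two separately built records with the same content are equal but not identical.
    static boolean equalByContent() {
        Judgment a = new Judgment(true, "ok", List.of(new Check("compiles", true)));
        Judgment b = new Judgment(true, "ok", List.of(new Check("compiles", true)));
        return a.equals(b) && a != b && a.hashCode() == b.hashCode();
    }

    // Mutating the caller's list afterwards does not leak into the judgment.
    static boolean survivesSourceMutation() {
        List<Check> source = new ArrayList<>();
        source.add(new Check("compiles", true));
        Judgment judgment = new Judgment(true, "ok", source);
        source.add(new Check("tests", false));
        return judgment.checks().size() == 1;
    }
}
```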