Design Philosophy

Judges Are Like Unit Tests for Agents

The core analogy: just as JUnit gives you assertEquals and AssertJ gives you assertThat, Agent Judge gives you FileExistsJudge, BuildSuccessJudge, and CorrectnessJudge. You wouldn’t ship application code without tests or assertions. Agents need the same discipline — automated, repeatable evaluation that runs after every execution and catches regressions before they reach users. This framing drives several design decisions:

Judges should be cheap to write (one functional interface, one method)
Judges should be cheap to run (deterministic judges cost nothing)
Judges should compose (juries aggregate judges like test suites aggregate tests)
Results should be actionable (reasoning and checks, not just pass/fail)

Zero-Dependency Core

agent-judge-core has no external dependencies. Not Spring, not Spring AI, not Jackson — nothing. This means you can evaluate agent output in:

A plain Java application
A JUnit test
A CLI tool
A Spring Boot service
A serverless function

The module layering adds dependencies only when needed:

agent-judge-core              (zero deps)
agent-judge-ai-core           (zero deps)
    ↓
agent-judge-exec              (+ agent-sandbox)
agent-judge-file              (+ JavaParser, Maven Model)
agent-judge-llm               (+ Spring AI ChatClient, SpringAiJudgeModel)
agent-judge-rag               (+ agent-judge-llm)
agent-judge-spring-ai         (+ Spring AI Model, provided)
agent-judge-langchain4j       (+ LangChain4j, provided)
agent-judge-koog              (+ Koog Agents, provided)
agent-judge-agent-client      (+ AgentClient, AgentClientJudgeModel, provided)

If all you need is file checks and boolean logic, you pay for nothing you don’t use. Framework bridge modules use provided-scope dependencies — they assume you already have the framework on your classpath.

Functional Interface Discipline

Judge is a @FunctionalInterface with a single method and no default methods:

@FunctionalInterface
public interface Judge {
    Judgment judge(JudgmentContext context);
}

This is deliberate. A single-method interface means:

Lambdas work: ctx -> Judgment.pass("ok")
Method references work: this::evaluateBuild
Composition uses the Judges utility class, not interface default methods

Metadata is handled through composition (NamedJudge wraps any judge) and the JudgeWithMetadata marker interface, not through method defaults on Judge itself. This avoids the combinatorial explosion of default method interactions and keeps the core contract minimal.
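The three bullet points above can be tried out in a few lines. This is a minimal sketch, not the library's code: JudgmentContext, Judgment (including its fail factory) and the allOf combinator are hypothetical stand-ins, and the real Judges utility may expose different methods.

```java
import java.util.List;

final class JudgeLambdaSketch {

    // Hypothetical stand-ins; the real JudgmentContext and Judgment carry more fields.
    record JudgmentContext(String goal, boolean buildSucceeded) {}

    record Judgment(boolean passed, String reasoning) {
        static Judgment pass(String reasoning) { return new Judgment(true, reasoning); }
        static Judgment fail(String reasoning) { return new Judgment(false, reasoning); }
    }

    // Same shape as the interface shown above: one abstract method, no defaults.
    @FunctionalInterface
    interface Judge {
        Judgment judge(JudgmentContext context);
    }

    // Hypothetical combinator standing in for the Judges utility class.
    static Judge allOf(List<Judge> judges) {
        return ctx -> {
            for (Judge judge : judges) {
                Judgment result = judge.judge(ctx);
                if (!result.passed()) {
                    return result; // surface the first failure and its reasoning
                }
            }
            return Judgment.pass("all " + judges.size() + " judges passed");
        };
    }

    // Target of the method reference below.
    static Judgment evaluateBuild(JudgmentContext ctx) {
        return ctx.buildSucceeded()
                ? Judgment.pass("build succeeded")
                : Judgment.fail("build failed");
    }

    public static void main(String[] args) {
        Judge hasGoal = ctx -> ctx.goal().isBlank()
                ? Judgment.fail("goal is empty")
                : Judgment.pass("goal present");
        Judge build = JudgeLambdaSketch::evaluateBuild;
        Judge both = allOf(List.of(hasGoal, build));

        System.out.println(both.judge(new JudgmentContext("add a README", true)));
        System.out.println(both.judge(new JudgmentContext("add a README", false)));
    }
}
```

Because composition lives in a static helper, the interface itself never grows, and any lambda remains a valid judge.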

Sealed Score Hierarchy

Score is a sealed interface with three implementations:

public sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore

Sealed types give you compile-time exhaustiveness: a pattern-matching switch over Score with no default branch fails to compile if you miss a case. This matters when aggregating heterogeneous scores in a jury. The three types cover the evaluation spectrum:

Type	Use	Example
`BooleanScore`	Binary pass/fail	Did it compile? Does the file exist?
`NumericalScore`	Continuous scoring with bounds	Code quality 7.5/10, coverage 85%
`CategoricalScore`	Discrete categories	EXCELLENT / GOOD / FAIR / POOR

The Scores utility handles cross-type normalization so a jury with mixed score types can still aggregate cleanly.
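To make the exhaustiveness and normalization ideas concrete, here is a sketch that assumes Java 21 for pattern-matching switch. The record components, the validation rules and the mapping onto the range 0 to 1 are all assumptions chosen for illustration; the real Scores utility may normalize differently.

```java
import java.util.List;

final class ScoreSwitchSketch {

    // Stand-in shapes; component names are assumptions, not the library's declarations.
    sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore {}

    record BooleanScore(boolean value) implements Score {}

    record NumericalScore(double value, double min, double max) implements Score {
        NumericalScore {
            if (max <= min) throw new IllegalArgumentException("max must exceed min");
        }
    }

    record CategoricalScore(String value, List<String> worstToBest) implements Score {
        CategoricalScore {
            if (worstToBest.size() < 2 || !worstToBest.contains(value)) {
                throw new IllegalArgumentException("value must be one of at least two categories");
            }
        }
    }

    // No default branch: a fourth permitted type would make this switch a compile error.
    static double normalize(Score score) {
        return switch (score) {
            case BooleanScore b -> b.value() ? 1.0 : 0.0;
            case NumericalScore n -> (n.value() - n.min()) / (n.max() - n.min());
            case CategoricalScore c ->
                    (double) c.worstToBest().indexOf(c.value()) / (c.worstToBest().size() - 1);
        };
    }

    // Assumed aggregation: the plain mean of the normalized scores.
    static double mean(List<Score> scores) {
        return scores.stream().mapToDouble(ScoreSwitchSketch::normalize).average().orElse(0.0);
    }

    public static void main(String[] args) {
        List<String> scale = List.of("POOR", "FAIR", "GOOD", "EXCELLENT");
        List<Score> scores = List.of(
                new BooleanScore(true),
                new NumericalScore(7.5, 0, 10),
                new CategoricalScore("GOOD", scale));
        System.out.println(mean(scores));
    }
}
```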

Cascaded Cost Model

Not all evaluation costs the same. A typical cascade orders checks from cheapest and most decisive to most expensive:

Category	Judge type	Cost	Latency	Example
Deterministic	File checks, content match	Free	Microseconds	`FileExistsJudge`, `FileContentJudge`
File comparison	AST diff, POM comparison	Free	Milliseconds	`JavaSemanticJudge`, `MavenSemanticJudge`
Command	Build, test execution	Compute	Seconds to minutes	`BuildSuccessJudge`, `CommandJudge`
LLM	Semantic correctness, RAG	Tokens	Seconds	`CorrectnessJudge`, `FaithfulnessJudge`

The CascadedJury codifies this: fail fast on cheap checks, escalate only when necessary.

Tier 1 (REJECT_ON_ANY_FAIL)  →  fails?  →  STOP: verdict is FAIL
         ↓ passes
Tier 2 (ACCEPT_ON_ALL_PASS)  →  all pass?  →  STOP: verdict is PASS
         ↓ mixed
Tier 3 (FINAL_TIER)          →  always runs  →  verdict from LLM assessment

If the project doesn’t compile, there’s no point running an LLM judge to evaluate code quality. The cascaded pattern saves both time and money.
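The tier diagram above can be read as a short loop. The sketch below is one assumed interpretation of the three policies, not the CascadedJury implementation: Run, SimpleJudge and Tier are hypothetical types, and a counter stands in for the token cost of the LLM tier.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class CascadeSketch {

    enum Verdict { PASS, FAIL }

    // Policy names follow the diagram above; the behavior is an assumed reading of it.
    enum Policy { REJECT_ON_ANY_FAIL, ACCEPT_ON_ALL_PASS, FINAL_TIER }

    // Hypothetical run summary and a boolean-only judge, to keep the sketch short.
    record Run(boolean compiles, boolean testsPass) {}

    @FunctionalInterface
    interface SimpleJudge { boolean passes(Run run); }

    record Tier(Policy policy, List<SimpleJudge> judges) {}

    static Verdict evaluate(List<Tier> tiers, Run run) {
        for (Tier tier : tiers) {
            // Judges in later tiers are never invoked once an earlier tier decides.
            boolean allPass = tier.judges().stream().allMatch(judge -> judge.passes(run));
            switch (tier.policy()) {
                case REJECT_ON_ANY_FAIL -> { if (!allPass) return Verdict.FAIL; }
                case ACCEPT_ON_ALL_PASS -> { if (allPass) return Verdict.PASS; }
                case FINAL_TIER -> { return allPass ? Verdict.PASS : Verdict.FAIL; }
            }
        }
        throw new IllegalStateException("cascade must end with a FINAL_TIER");
    }

    // Builds a three-tier cascade; llmCalls counts how often the costly tier runs.
    static List<Tier> tiers(AtomicInteger llmCalls) {
        SimpleJudge compiles = Run::compiles;
        SimpleJudge testsPass = Run::testsPass;
        SimpleJudge llmStub = run -> { llmCalls.incrementAndGet(); return true; };
        return List.of(
                new Tier(Policy.REJECT_ON_ANY_FAIL, List.of(compiles)),
                new Tier(Policy.ACCEPT_ON_ALL_PASS, List.of(testsPass)),
                new Tier(Policy.FINAL_TIER, List.of(llmStub)));
    }

    public static void main(String[] args) {
        AtomicInteger llmCalls = new AtomicInteger();
        List<Tier> tiers = tiers(llmCalls);
        System.out.println(evaluate(tiers, new Run(false, false)) + ", LLM calls: " + llmCalls.get());
        System.out.println(evaluate(tiers, new Run(true, true)) + ", LLM calls: " + llmCalls.get());
        System.out.println(evaluate(tiers, new Run(true, false)) + ", LLM calls: " + llmCalls.get());
    }
}
```

In this sketch only the third run, which compiles but has failing tests, reaches the stubbed LLM tier.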

Frameworks Are Vertical, Evaluation Is Horizontal

Agent runtimes are vertical stacks — Spring AI, LangChain4j, Koog, and CLI-delegated agents (via AgentClient) each provide their own execution model, memory, tool calling, and observability. Evaluation cuts across all of them.

            Agent runtimes / frameworks
    Spring AI     LangChain4j     Koog     AgentClient (CLI agents)
        |              |           |              |
        v              v           v              v
-----------------------------------------------------------------
                     Agent Judge
             horizontal evaluation layer
-----------------------------------------------------------------
   Build success, file checks, AST comparison, coverage,
   tool-use metadata checks, RAG faithfulness, hallucination,
   LLM-as-judge, juries, cascaded juries

A FaithfulnessJudge doesn’t care whether the answer came from a Spring AI ChatClient, a LangChain4j AiService, a Koog agent, or Claude Code via AgentClient. It evaluates the (question, context, answer) triple the same way. This is the core architectural bet: evaluation is framework-neutral, and the bridge layer is thin.

Adapter Module Architecture

Each framework bridge module follows the same pattern:

Provided-scope dependency on the framework — the bridge doesn’t pull the framework into your classpath; you already have it.
JudgmentContextBuilder — a static utility that converts framework-specific output (ChatResponse, Result<T>, AIAgent, AgentClientResponse) into a JudgmentContext.
Evaluator — a static convenience class with Judge/Jury overloads, including variants that accept extra metadata, combining execution + context building + evaluation into a one-liner.
Metadata key conventions — public constants where available (SpringAiMetadataKeys, AgentClientMetadataKeys), and documented metadata keys for values such as model name, finish reason, token usage, sources, or agent ID.

The bridge code is typically 30-100 lines. It maps framework-specific response metadata (token usage, finish reason, model name) into JudgmentContext.metadata() where judges can optionally inspect it. This keeps the core zero-dependency, keeps bridges thin, and means new framework support is a single module addition — not a core change.
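As an illustration of how thin such a bridge can be, the sketch below maps an invented framework response into a simplified context. FrameworkResponse, the metadata key strings and this two-field JudgmentContext are hypothetical; real bridges adapt types such as ChatResponse and publish their own key constants.

```java
import java.util.Map;

final class BridgeSketch {

    // Hypothetical framework response; stands in for a real framework type.
    record FrameworkResponse(String text, String model, String finishReason, int totalTokens) {}

    // Simplified stand-in for JudgmentContext: the output plus free-form metadata.
    record JudgmentContext(String output, Map<String, Object> metadata) {
        JudgmentContext {
            metadata = Map.copyOf(metadata); // defensive copy keeps the context immutable
        }
    }

    // Hypothetical key names, invented for this sketch.
    static final String MODEL = "example.model";
    static final String FINISH_REASON = "example.finishReason";
    static final String TOTAL_TOKENS = "example.totalTokens";

    // The whole bridge: copy the answer text and lift response details into metadata.
    static JudgmentContext fromResponse(FrameworkResponse response) {
        Map<String, Object> metadata = Map.of(
                MODEL, response.model(),
                FINISH_REASON, response.finishReason(),
                TOTAL_TOKENS, response.totalTokens());
        return new JudgmentContext(response.text(), metadata);
    }

    // Judge-side check that reads metadata and never sees the framework type.
    static boolean withinTokenBudget(JudgmentContext ctx, int budget) {
        Object tokens = ctx.metadata().get(TOTAL_TOKENS);
        return tokens instanceof Integer used && used <= budget;
    }

    public static void main(String[] args) {
        JudgmentContext ctx = fromResponse(new FrameworkResponse("Hello", "some-model", "stop", 120));
        System.out.println(ctx.metadata().get(MODEL));
        System.out.println(withinTokenBudget(ctx, 200));
    }
}
```

The check returns false when the key is absent, which matches the idea that judges inspect metadata optionally rather than depending on it.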

Composition Over Inheritance for AI Judges

The original LLMJudge uses the template method pattern — you subclass it, override buildPrompt() and parseResponse(), and the base class handles the LLM call via Spring AI’s ChatClient. This works, but it tangles three concerns into one class hierarchy:

Prompt rendering — how context becomes a prompt string
Model invocation — which AI backend to call
Response classification — how to turn the model’s text into a Judgment

Subclassing couples you to Spring AI and makes it hard to swap one concern without touching the others. Testing requires a real ChatClient.Builder or mocking Spring AI internals. The agent-judge-ai-core module separates these into composable parts:

JudgePromptTemplate  →  JudgeModel  →  JudgmentClassifier  →  Judgment
   (render)             (invoke)         (classify)

ModelBackedJudge wires the three parts together via a builder. Each part is independently testable and replaceable:

JudgePromptTemplate loads templates from classpath, file, or string. Renders {{variable}} placeholders from JudgmentContext. Validates required variables at build time.
JudgeModel is a @FunctionalInterface — any (JudgeModelRequest → JudgeModelResponse) lambda works. SpringAiJudgeModel (in agent-judge-llm) delegates to Spring AI’s ChatClient. AgentClientJudgeModel (in agent-judge-agent-client) invokes a CLI agent that can use tools and inspect files — enabling agentic judges.
JudgmentClassifier maps text to a Judgment. LabelJudgmentClassifier.passFail() handles the common binary case. Custom classifiers handle structured or multi-label responses.
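The render, invoke, classify pipeline can be sketched with three single-method interfaces. PromptTemplate, Model and Classifier below are hypothetical stand-ins for JudgePromptTemplate, JudgeModel and JudgmentClassifier, and the placeholder check runs at render time as a simplification of the build-time validation described above. The point of the sketch is that a lambda can replace the model in tests.

```java
import java.util.Locale;
import java.util.Map;

final class ModelBackedSketch {

    // Hypothetical single-method stand-ins for the three parts.
    @FunctionalInterface interface PromptTemplate { String render(Map<String, String> variables); }
    @FunctionalInterface interface Model { String complete(String prompt); }
    @FunctionalInterface interface Classifier { boolean passed(String modelText); }

    // Replaces {{name}} placeholders and rejects output that still contains "{{".
    static PromptTemplate templateOf(String text) {
        return variables -> {
            String rendered = text;
            for (Map.Entry<String, String> entry : variables.entrySet()) {
                rendered = rendered.replace("{{" + entry.getKey() + "}}", entry.getValue());
            }
            if (rendered.contains("{{")) {
                throw new IllegalArgumentException("unresolved placeholder in: " + rendered);
            }
            return rendered;
        };
    }

    // Binary classifier: a reply that starts with PASS (any case) counts as a pass.
    static Classifier passFail() {
        return text -> text.strip().toUpperCase(Locale.ROOT).startsWith("PASS");
    }

    // Wires the three parts together in the order shown in the diagram above.
    static boolean judge(PromptTemplate template, Model model, Classifier classifier,
                         Map<String, String> variables) {
        String prompt = template.render(variables);
        String reply = model.complete(prompt);
        return classifier.passed(reply);
    }

    public static void main(String[] args) {
        PromptTemplate prompt = templateOf(
                "Question: {{question}}\nAnswer: {{answer}}\nReply PASS or FAIL.");
        // Fake model for tests: no network call, no framework on the classpath.
        Model fakeModel = text -> text.contains("Answer: Paris") ? "PASS" : "FAIL: wrong city";

        System.out.println(judge(prompt, fakeModel, passFail(),
                Map.of("question", "Capital of France?", "answer", "Paris")));
        System.out.println(judge(prompt, fakeModel, passFail(),
                Map.of("question", "Capital of France?", "answer", "Lyon")));
    }
}
```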

Like agent-judge-core, the agent-judge-ai-core module has zero external dependencies. The actual AI backend arrives through a JudgeModel implementation from a bridge module. This preserves the zero-dependency principle while giving AI judges first-class infrastructure.

When to use which:

Approach	Use when
`ModelBackedJudge`	Default for AI judges. Composable, testable, framework-neutral.
`LLMJudge` subclass	You need Spring AI-specific features, complex prompt logic, or custom response parsing that doesn’t fit a `JudgmentClassifier`.

Best-of-Breed Evaluation Patterns

Agent Judge borrows from patterns that have emerged across modern evaluation systems, including Python-first eval frameworks, SaaS evaluation platforms, and JVM projects such as Dokimos. The goal is not to clone any one framework. It is to bring the strongest ideas into a JVM-native, framework-neutral library:

Pattern	Why it matters in Agent Judge
Evaluators as small composable functions	Keeps judges easy to write, test, and reuse
Structured verdicts, not raw booleans	Gives humans and automation enough information to act
Mixed deterministic and LLM-based checks	Lets cheap checks catch obvious failures before expensive semantic checks
Aggregation / voting	Supports juries instead of one fragile evaluator
Dataset and per-execution evaluation as separate concerns	Keeps Agent Judge focused on “did this execution work?” while leaving bulk experiment orchestration to other tools
Metadata-rich evaluation context	Lets judges reason about status, token usage, sources, tool calls, and workspace state without coupling to a framework
Cost-aware escalation	Avoids paying for LLM judging when deterministic checks already reject the run

Agent Judge is intentionally JVM-native rather than a thin wrapper around a Python eval stack: it understands workspaces, Maven builds, Java source structure, typed records, sealed scores, and Java framework integration points.

Judge vs Journal

Agent Judge draws a sharp boundary between inputs to judges and narrative trace. Judges legitimately reason about:

Token usage, tool executions, retrieved sources — these are structured outputs that affect verdict logic
Finish reason, execution status, timing — these determine whether evaluation is even meaningful

Judges do NOT consume:

Intermediate responses, full conversation history, private reasoning traces, or step-by-step narrative logs — this is cognitive observability, not evaluation input

The second category belongs to a separate concern (agent-journal) that captures the narrative of how an agent arrived at its answer. Mixing trace data into evaluation context would couple judges to specific agent architectures and make the JudgmentContext contract framework-specific — exactly what the horizontal layer avoids.

Agent-Agnostic by Design

JudgmentContext doesn’t import any agent framework. It describes what happened (goal, workspace, status, timing) without coupling to how it happened. The workspace-centric pattern:

An agent modifies a directory
A judge inspects the directory
The judge doesn’t know or care which agent made the changes

This decoupling means the same jury works with Claude Code, Gemini CLI, a custom Python agent, or a human developer.
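The workspace-centric pattern needs nothing beyond java.nio.file. The sketch below is a hypothetical file-existence judge in the spirit of FileExistsJudge, not its implementation; WorkspaceJudge and this two-field Judgment are stand-ins, and a temporary directory plays the role of the agent's workspace.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class WorkspaceJudgeSketch {

    // Hypothetical stand-ins for the library's Judgment and Judge types.
    record Judgment(boolean passed, String reasoning) {}

    @FunctionalInterface
    interface WorkspaceJudge { Judgment judge(Path workspace); }

    // Passes when the relative path exists as a regular file inside the workspace.
    static WorkspaceJudge fileExists(String relativePath) {
        return workspace -> Files.isRegularFile(workspace.resolve(relativePath))
                ? new Judgment(true, relativePath + " exists")
                : new Judgment(false, relativePath + " is missing");
    }

    // Judges the same directory before and after something writes README.md into it.
    static List<Judgment> runDemo() {
        try {
            Path workspace = Files.createTempDirectory("agent-run");
            WorkspaceJudge readme = fileExists("README.md");
            Judgment before = readme.judge(workspace);

            // Stand-in for the agent's work; any tool or person could write this file.
            Path file = Files.writeString(workspace.resolve("README.md"), "# Demo\n");
            Judgment after = readme.judge(workspace);

            // Clean up the temporary workspace.
            Files.delete(file);
            Files.delete(workspace);
            return List.of(before, after);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        runDemo().forEach(System.out::println);
    }
}
```

The judge receives only a path, so nothing in it can depend on which agent produced the files.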

Immutable Records

Most evaluation data types are Java records: Judgment, Verdict, Check, JudgmentContext, and JudgeMetadata. Score is a sealed interface with record implementations (BooleanScore, NumericalScore, CategoricalScore). Records are:

Immutable — no accidental mutation between judges
Value-based — equality by content, not identity
Pattern-matchable — if (score instanceof NumericalScore(var v, var min, var max))
Easy to serialize — record components map cleanly to JSON/logging formats without requiring serialization dependencies in core

Combined with sealed types, this gives you a type-safe, exhaustive, immutable evaluation data model.
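The record properties listed above are easy to check directly. This sketch assumes Java 21 for record patterns, and its Judgment shape (score, reasoning, checks as strings) is hypothetical and much smaller than the real record.

```java
import java.util.ArrayList;
import java.util.List;

final class RecordModelSketch {

    // Stand-in score types; only two are needed for this illustration.
    sealed interface Score permits BooleanScore, NumericalScore {}
    record BooleanScore(boolean value) implements Score {}
    record NumericalScore(double value, double min, double max) implements Score {}

    // Hypothetical shape; the defensive copy keeps the list component immutable too.
    record Judgment(Score score, String reasoning, List<String> checks) {
        Judgment {
            checks = List.copyOf(checks);
        }
    }

    static String describe(Score score) {
        // Record patterns destructure the components in place.
        if (score instanceof NumericalScore(var v, var min, var max)) {
            return v + " in [" + min + ", " + max + "]";
        }
        if (score instanceof BooleanScore(var value)) {
            return value ? "pass" : "fail";
        }
        // The sealed hierarchy permits no other implementations.
        throw new IllegalStateException("unexpected score type: " + score);
    }

    public static void main(String[] args) {
        // Value-based equality: same components, equal records.
        System.out.println(new NumericalScore(7.5, 0, 10).equals(new NumericalScore(7.5, 0, 10)));

        // Mutating the caller's list afterwards does not affect the judgment.
        List<String> checks = new ArrayList<>(List.of("compiles"));
        Judgment judgment = new Judgment(new BooleanScore(true), "looks fine", checks);
        checks.add("added later");
        System.out.println(judgment.checks());

        System.out.println(describe(judgment.score()));
    }
}
```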
