Judges Are Like Unit Tests for Agents

The core analogy: just as JUnit gives you assertEquals and AssertJ gives you assertThat, Agent Judge gives you FileExistsJudge, BuildSuccessJudge, and CorrectnessJudge. You wouldn’t ship application code without tests or assertions. Agents need the same discipline — automated, repeatable evaluation that runs after every execution and catches regressions before they reach users. This framing drives several design decisions:
  • Judges should be cheap to write (one functional interface, one method)
  • Judges should be cheap to run (deterministic judges cost nothing)
  • Judges should compose (juries aggregate judges like test suites aggregate tests)
  • Results should be actionable (reasoning and checks, not just pass/fail)

Zero-Dependency Core

agent-judge-core has no external dependencies. Not Spring, not Spring AI, not Jackson — nothing. This means you can evaluate agent output in:
  • A plain Java application
  • A JUnit test
  • A CLI tool
  • A Spring Boot service
  • A serverless function
The module layering adds dependencies only when needed:
agent-judge-core              (zero deps)
agent-judge-ai-core           (zero deps)

agent-judge-exec              (+ agent-sandbox)
agent-judge-file              (+ JavaParser, Maven Model)
agent-judge-llm               (+ Spring AI ChatClient, SpringAiJudgeModel)
agent-judge-rag               (+ agent-judge-llm)
agent-judge-spring-ai         (+ Spring AI Model, provided)
agent-judge-langchain4j       (+ LangChain4j, provided)
agent-judge-koog              (+ Koog Agents, provided)
agent-judge-agent-client      (+ AgentClient, AgentClientJudgeModel, provided)
If all you need is file checks and boolean logic, you pay for nothing you don’t use. Framework bridge modules use provided-scope dependencies — they assume you already have the framework on your classpath.

Functional Interface Discipline

Judge is a @FunctionalInterface with a single method and no default methods:
@FunctionalInterface
public interface Judge {
    Judgment judge(JudgmentContext context);
}
This is deliberate. A single-method interface means:
  • Lambdas work: ctx -> Judgment.pass("ok")
  • Method references work: this::evaluateBuild
  • Composition uses the Judges utility class, not interface default methods
Metadata is handled through composition (NamedJudge wraps any judge) and the JudgeWithMetadata marker interface, not through method defaults on Judge itself. This avoids the combinatorial explosion of default method interactions and keeps the core contract minimal.
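A minimal sketch of what the single-method contract buys you. The `JudgmentContext` and `Judgment` records here are simplified stand-ins so the example compiles on its own, and `allOf` is a hypothetical helper illustrating utility-class composition; it is not taken from the real `Judges` API.

```java
import java.util.List;

final class JudgeLambdaSketch {

    // Simplified stand-ins (assumptions); the real types carry more fields.
    record JudgmentContext(String goal, String output) {}

    record Judgment(boolean passed, String reasoning) {
        static Judgment pass(String reasoning) { return new Judgment(true, reasoning); }
        static Judgment fail(String reasoning) { return new Judgment(false, reasoning); }
    }

    @FunctionalInterface
    interface Judge {
        Judgment judge(JudgmentContext context);
    }

    // Hypothetical composition helper: lives outside the interface,
    // so Judge itself needs no default methods.
    static Judge allOf(List<Judge> judges) {
        return ctx -> {
            for (Judge judge : judges) {
                Judgment result = judge.judge(ctx);
                if (!result.passed()) {
                    return result; // the first failure decides the outcome
                }
            }
            return Judgment.pass("all judges passed");
        };
    }

    // Any method with a matching shape can be used as a method reference.
    static Judgment nonEmptyOutput(JudgmentContext ctx) {
        return ctx.output().isBlank()
                ? Judgment.fail("output is empty")
                : Judgment.pass("output is non-empty");
    }

    static Judgment evaluate(String output) {
        Judge lambda = ctx -> Judgment.pass("ok");
        Judge methodRef = JudgeLambdaSketch::nonEmptyOutput;
        return allOf(List.of(lambda, methodRef))
                .judge(new JudgmentContext("demo goal", output));
    }

    public static void main(String[] args) {
        System.out.println(evaluate("hello"));
        System.out.println(evaluate(""));
    }
}
```

Because the composed result is itself a `Judge`, it can be passed anywhere a lambda or method reference would be accepted.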

Sealed Score Hierarchy

Score is a sealed interface with three implementations:
public sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore
Sealed types give you compile-time exhaustiveness: a pattern-matching switch over Score with no default branch fails to compile if you miss a case. This matters when aggregating heterogeneous scores in a jury. The three types cover the evaluation spectrum:
| Type | Use | Example |
| --- | --- | --- |
| BooleanScore | Binary pass/fail | Did it compile? Does the file exist? |
| NumericalScore | Continuous scoring with bounds | Code quality 7.5/10, coverage 85% |
| CategoricalScore | Discrete categories | EXCELLENT / GOOD / FAIR / POOR |
The Scores utility handles cross-type normalization so a jury with mixed score types can still aggregate cleanly.
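To make the exhaustiveness and normalization ideas concrete, here is a sketch that assumes Java 21 or later. The record components for `BooleanScore` and `CategoricalScore`, the worst-to-best `scale` list, and the `normalize`/`mean` helpers are all assumptions for illustration; the real `Scores` utility may normalize differently.

```java
import java.util.List;

final class ScoreSketch {

    sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore {}

    record BooleanScore(boolean value) implements Score {}
    record NumericalScore(double value, double min, double max) implements Score {}
    // Assumption: the scale lists categories from worst to best.
    record CategoricalScore(String value, List<String> scale) implements Score {}

    // No default branch: removing any case below is a compile error.
    // Degenerate numerical ranges (min == max) are not handled in this sketch.
    static double normalize(Score score) {
        return switch (score) {
            case BooleanScore b -> b.value() ? 1.0 : 0.0;
            case NumericalScore n -> (n.value() - n.min()) / (n.max() - n.min());
            case CategoricalScore c -> {
                int index = c.scale().indexOf(c.value());
                if (index < 0 || c.scale().size() < 2) {
                    throw new IllegalArgumentException("unusable scale for " + c.value());
                }
                yield (double) index / (c.scale().size() - 1);
            }
        };
    }

    // Mixed score types aggregate once they share the [0, 1] range.
    static double mean(List<Score> scores) {
        return scores.stream().mapToDouble(ScoreSketch::normalize).average().orElse(0.0);
    }

    public static void main(String[] args) {
        List<String> scale = List.of("POOR", "FAIR", "GOOD", "EXCELLENT");
        List<Score> mixed = List.of(
                new BooleanScore(true),
                new NumericalScore(7.5, 0, 10),
                new CategoricalScore("EXCELLENT", scale));
        System.out.println(mean(mixed));
    }
}
```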

Cascaded Cost Model

Not all evaluation costs the same. A typical cascade orders checks from cheapest and most decisive to most expensive:
| Category | Judge type | Cost | Latency | Example |
| --- | --- | --- | --- | --- |
| Deterministic | File checks, content match | Free | Microseconds | FileExistsJudge, FileContentJudge |
| File comparison | AST diff, POM comparison | Free | Milliseconds | JavaSemanticJudge, MavenSemanticJudge |
| Command | Build, test execution | Compute | Seconds-minutes | BuildSuccessJudge, CommandJudge |
| LLM | Semantic correctness, RAG | Tokens | Seconds | CorrectnessJudge, FaithfulnessJudge |
The CascadedJury codifies this: fail fast on cheap checks, escalate only when necessary.
Tier 1 (REJECT_ON_ANY_FAIL)  →  fails?  →  STOP: verdict is FAIL
         ↓ passes
Tier 2 (ACCEPT_ON_ALL_PASS)  →  all pass?  →  STOP: verdict is PASS
         ↓ mixed
Tier 3 (FINAL_TIER)          →  always runs  →  verdict from LLM assessment
If the project doesn’t compile, there’s no point running an LLM judge to evaluate code quality. The cascaded pattern saves both time and money.
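The three-tier flow above can be sketched with plain `BooleanSupplier` judges. This is not the `CascadedJury` API: the `evaluate` signature, the `Trace` helper, and the choice to escalate whenever tier 2 is not unanimous are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

final class CascadeSketch {

    enum Verdict { PASS, FAIL }

    // Records which judges actually ran, to show what the cascade skipped.
    static final class Trace {
        final List<String> ran = new ArrayList<>();

        BooleanSupplier judge(String name, boolean outcome) {
            return () -> {
                ran.add(name);
                return outcome;
            };
        }
    }

    static Verdict evaluate(List<BooleanSupplier> tier1,
                            List<BooleanSupplier> tier2,
                            BooleanSupplier finalTier) {
        // Tier 1 (reject on any fail): stop at the first cheap failure.
        for (BooleanSupplier judge : tier1) {
            if (!judge.getAsBoolean()) {
                return Verdict.FAIL;
            }
        }
        // Tier 2 (accept on all pass): run every judge, accept only if unanimous.
        boolean allPassed = true;
        for (BooleanSupplier judge : tier2) {
            if (!judge.getAsBoolean()) {
                allPassed = false;
            }
        }
        if (allPassed) {
            return Verdict.PASS;
        }
        // Tier 3 (final tier): the expensive judge decides.
        return finalTier.getAsBoolean() ? Verdict.PASS : Verdict.FAIL;
    }

    public static void main(String[] args) {
        Trace trace = new Trace();
        Verdict verdict = evaluate(
                List.of(trace.judge("build", false)),
                List.of(trace.judge("tests", true)),
                trace.judge("llm", true));
        System.out.println(verdict + " after running " + trace.ran);
    }
}
```

In the `main` example the build judge fails, so neither the test judge nor the stand-in LLM judge is ever invoked.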

Frameworks Are Vertical, Evaluation Is Horizontal

Agent runtimes are vertical stacks — Spring AI, LangChain4j, Koog, and CLI-delegated agents (via AgentClient) each provide their own execution model, memory, tool calling, and observability. Evaluation cuts across all of them.
            Agent runtimes / frameworks
    Spring AI     LangChain4j     Koog     AgentClient (CLI agents)
        |              |           |              |
        v              v           v              v
-----------------------------------------------------------------
                     Agent Judge
             horizontal evaluation layer
-----------------------------------------------------------------
   Build success, file checks, AST comparison, coverage,
   tool-use metadata checks, RAG faithfulness, hallucination,
   LLM-as-judge, juries, cascaded juries
A FaithfulnessJudge doesn’t care whether the answer came from a Spring AI ChatClient, a LangChain4j AiService, a Koog agent, or Claude Code via AgentClient. It evaluates the (question, context, answer) triple the same way. This is the core architectural bet: evaluation is framework-neutral, and the bridge layer is thin.

Adapter Module Architecture

Each framework bridge module follows the same pattern:
  1. Provided-scope dependency on the framework — the bridge doesn’t pull the framework into your classpath; you already have it.
  2. JudgmentContextBuilder — a static utility that converts framework-specific output (ChatResponse, Result<T>, AIAgent, AgentClientResponse) into a JudgmentContext.
  3. Evaluator — a static convenience class with Judge/Jury overloads, including variants that accept extra metadata, combining execution + context building + evaluation into a one-liner.
  4. Metadata key conventions — public constants where available (SpringAiMetadataKeys, AgentClientMetadataKeys), and documented metadata keys for values such as model name, finish reason, token usage, sources, or agent ID.
The bridge code is typically 30-100 lines. It maps framework-specific response metadata (token usage, finish reason, model name) into JudgmentContext.metadata() where judges can optionally inspect it. This keeps the core zero-dependency, keeps bridges thin, and means new framework support is a single module addition — not a core change.
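As a rough illustration of the bridge pattern, the sketch below maps a made-up framework response into a context and evaluates it in one call. `FakeChatResponse`, `ContextBuilder`, `Evaluator`, the key names in `MetadataKeys`, and the simplified `JudgmentContext` record are all hypothetical; real bridge modules work against the framework's own response types and the real core records.

```java
import java.util.Map;

final class BridgeSketch {

    // Hypothetical framework response type; a real bridge would accept
    // the framework's own class here.
    record FakeChatResponse(String text, String model, String finishReason, int totalTokens) {}

    // Simplified stand-ins for the core types (assumptions).
    record JudgmentContext(String goal, String output, Map<String, Object> metadata) {}
    record Judgment(boolean passed, String reasoning) {}

    @FunctionalInterface
    interface Judge {
        Judgment judge(JudgmentContext context);
    }

    // Public constants keep metadata keys discoverable; names are illustrative.
    static final class MetadataKeys {
        static final String MODEL = "model";
        static final String FINISH_REASON = "finishReason";
        static final String TOTAL_TOKENS = "totalTokens";

        private MetadataKeys() {}
    }

    // Static utility: framework-specific output in, JudgmentContext out.
    static final class ContextBuilder {
        static JudgmentContext from(String goal, FakeChatResponse response) {
            return new JudgmentContext(goal, response.text(), Map.of(
                    MetadataKeys.MODEL, response.model(),
                    MetadataKeys.FINISH_REASON, response.finishReason(),
                    MetadataKeys.TOTAL_TOKENS, response.totalTokens()));
        }

        private ContextBuilder() {}
    }

    // Convenience entry point: context building plus evaluation in one call.
    static final class Evaluator {
        static Judgment evaluate(String goal, FakeChatResponse response, Judge judge) {
            return judge.judge(ContextBuilder.from(goal, response));
        }

        private Evaluator() {}
    }

    public static void main(String[] args) {
        FakeChatResponse response = new FakeChatResponse("4", "demo-model", "stop", 42);
        Judge exactMatch = ctx -> new Judgment(ctx.output().equals("4"), "exact match on output");
        System.out.println(Evaluator.evaluate("What is 2 + 2?", response, exactMatch));
    }
}
```

Nothing in the judge refers to `FakeChatResponse`, which is the property that keeps the core free of framework dependencies.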

Composition Over Inheritance for AI Judges

The original LLMJudge uses the template method pattern — you subclass it, override buildPrompt() and parseResponse(), and the base class handles the LLM call via Spring AI’s ChatClient. This works, but it tangles three concerns into one class hierarchy:
  1. Prompt rendering — how context becomes a prompt string
  2. Model invocation — which AI backend to call
  3. Response classification — how to turn the model’s text into a Judgment
Subclassing couples you to Spring AI and makes it hard to swap one concern without touching the others. Testing requires a real ChatClient.Builder or mocking Spring AI internals. The agent-judge-ai-core module separates these into composable parts:
JudgePromptTemplate  →  JudgeModel  →  JudgmentClassifier  →  Judgment
   (render)             (invoke)         (classify)
ModelBackedJudge wires the three parts together via a builder. Each part is independently testable and replaceable:
  • JudgePromptTemplate loads templates from classpath, file, or string. Renders {{variable}} placeholders from JudgmentContext. Validates required variables at build time.
  • JudgeModel is a @FunctionalInterface — any (JudgeModelRequest → JudgeModelResponse) lambda works. SpringAiJudgeModel (in agent-judge-llm) delegates to Spring AI’s ChatClient. AgentClientJudgeModel (in agent-judge-agent-client) invokes a CLI agent that can use tools and inspect files — enabling agentic judges.
  • JudgmentClassifier maps text to a Judgment. LabelJudgmentClassifier.passFail() handles the common binary case. Custom classifiers handle structured or multi-label responses.
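The render, invoke, classify pipeline can be sketched with three small functional pieces and a stubbed model, which is what makes it testable without a real backend. Everything here is a stand-in: the `invoke` and `classify` method names, the `render` helper (which validates at render time rather than build time), and the `passFail` classifier are assumptions, not the actual `ModelBackedJudge` builder API.

```java
import java.util.Locale;
import java.util.Map;

final class ModelBackedSketch {

    // Simplified stand-ins (assumptions) for the ai-core types.
    record JudgmentContext(Map<String, String> variables) {}
    record Judgment(boolean passed, String reasoning) {}
    record JudgeModelRequest(String prompt) {}
    record JudgeModelResponse(String text) {}

    @FunctionalInterface interface Judge { Judgment judge(JudgmentContext context); }
    @FunctionalInterface interface JudgeModel { JudgeModelResponse invoke(JudgeModelRequest request); }
    @FunctionalInterface interface JudgmentClassifier { Judgment classify(String modelText); }

    // Render: substitute {{name}} placeholders and reject anything left unresolved.
    static String render(String template, Map<String, String> variables) {
        String rendered = template;
        for (Map.Entry<String, String> entry : variables.entrySet()) {
            rendered = rendered.replace("{{" + entry.getKey() + "}}", entry.getValue());
        }
        if (rendered.contains("{{")) {
            throw new IllegalArgumentException("unresolved placeholder in: " + rendered);
        }
        return rendered;
    }

    // Classify: a response starting with PASS is a pass, anything else a fail.
    static JudgmentClassifier passFail() {
        return text -> {
            String trimmed = text.strip();
            return new Judgment(trimmed.toUpperCase(Locale.ROOT).startsWith("PASS"), trimmed);
        };
    }

    // Wire the three parts into a Judge; each part can be swapped independently.
    static Judge modelBacked(String template, JudgeModel model, JudgmentClassifier classifier) {
        return ctx -> {
            String prompt = render(template, ctx.variables());
            JudgeModelResponse response = model.invoke(new JudgeModelRequest(prompt));
            return classifier.classify(response.text());
        };
    }

    static Judgment demo(String answer) {
        // Stub model: no network call, so the judge is unit-testable.
        JudgeModel stub = request -> new JudgeModelResponse(
                request.prompt().endsWith("Answer: 4") ? "PASS: matches" : "FAIL: mismatch");
        Judge judge = modelBacked("Question: {{question}}\nAnswer: {{answer}}", stub, passFail());
        return judge.judge(new JudgmentContext(Map.of("question", "2 + 2?", "answer", answer)));
    }

    public static void main(String[] args) {
        System.out.println(demo("4"));
        System.out.println(demo("5"));
    }
}
```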
Like agent-judge-core, the agent-judge-ai-core module has zero external dependencies. The actual AI backend arrives through a JudgeModel implementation from a bridge module. This preserves the zero-dep principle while giving AI judges first-class infrastructure. When to use which:
| Approach | Use when |
| --- | --- |
| ModelBackedJudge | Default for AI judges. Composable, testable, framework-neutral. |
| LLMJudge subclass | You need Spring AI-specific features, complex prompt logic, or custom response parsing that doesn’t fit a JudgmentClassifier. |

Best-of-Breed Evaluation Patterns

Agent Judge borrows from patterns that have emerged across modern evaluation systems, including Python-first eval frameworks, SaaS evaluation platforms, and JVM projects such as Dokimos. The goal is not to clone any one framework. It is to bring the strongest ideas into a JVM-native, framework-neutral library:
| Pattern | Why it matters in Agent Judge |
| --- | --- |
| Evaluators as small composable functions | Keeps judges easy to write, test, and reuse |
| Structured verdicts, not raw booleans | Gives humans and automation enough information to act |
| Mixed deterministic and LLM-based checks | Lets cheap checks catch obvious failures before expensive semantic checks |
| Aggregation / voting | Supports juries instead of one fragile evaluator |
| Dataset and per-execution evaluation as separate concerns | Keeps Agent Judge focused on “did this execution work?” while leaving bulk experiment orchestration to other tools |
| Metadata-rich evaluation context | Lets judges reason about status, token usage, sources, tool calls, and workspace state without coupling to a framework |
| Cost-aware escalation | Avoids paying for LLM judging when deterministic checks already reject the run |
Agent Judge is intentionally JVM-native rather than a thin wrapper around a Python eval stack: it understands workspaces, Maven builds, Java source structure, typed records, sealed scores, and Java framework integration points.

Judge vs Journal

Agent Judge draws a sharp boundary between inputs to judges and narrative trace. Judges legitimately reason about:
  • Token usage, tool executions, retrieved sources — these are structured outputs that affect verdict logic
  • Finish reason, execution status, timing — these determine whether evaluation is even meaningful
Judges do NOT consume:
  • Intermediate responses, full conversation history, private reasoning traces, or step-by-step narrative logs — this is cognitive observability, not evaluation input
The second category belongs to a separate concern (agent-journal) that captures the narrative of how an agent arrived at its answer. Mixing trace data into evaluation context would couple judges to specific agent architectures and make the JudgmentContext contract framework-specific — exactly what the horizontal layer avoids.
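A judge that stays on the "inputs" side of this boundary might look like the sketch below: it reads only structured metadata (finish reason and token count) and never touches conversation history. The key names `finishReason` and `totalTokens`, the `"stop"` value, and the simplified records are assumptions for illustration.

```java
import java.util.Map;

final class MetadataJudgeSketch {

    // Simplified stand-ins (assumptions) for the core types.
    record JudgmentContext(String output, Map<String, Object> metadata) {}
    record Judgment(boolean passed, String reasoning) {}

    @FunctionalInterface
    interface Judge {
        Judgment judge(JudgmentContext context);
    }

    // Fails runs that did not finish cleanly or that exceeded a token budget.
    static Judge tokenBudget(int maxTokens) {
        return ctx -> {
            Object finish = ctx.metadata().get("finishReason");
            if (!"stop".equals(finish)) {
                // Evaluating a truncated run further would not be meaningful.
                return new Judgment(false, "run did not finish cleanly: " + finish);
            }
            Object tokens = ctx.metadata().get("totalTokens");
            if (!(tokens instanceof Integer used)) {
                return new Judgment(false, "token usage missing from metadata");
            }
            return used <= maxTokens
                    ? new Judgment(true, used + " tokens within budget of " + maxTokens)
                    : new Judgment(false, used + " tokens exceeds budget of " + maxTokens);
        };
    }

    public static void main(String[] args) {
        Judge judge = tokenBudget(1000);
        Map<String, Object> metadata = Map.of("finishReason", "stop", "totalTokens", 640);
        System.out.println(judge.judge(new JudgmentContext("answer text", metadata)));
    }
}
```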

Agent-Agnostic by Design

JudgmentContext doesn’t import any agent framework. It describes what happened (goal, workspace, status, timing) without coupling to how it happened. The workspace-centric pattern:
  1. An agent modifies a directory
  2. A judge inspects the directory
  3. The judge doesn’t know or care which agent made the changes
This decoupling means the same jury works with Claude Code, Gemini CLI, a custom Python agent, or a human developer.
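The workspace-centric pattern can be sketched with nothing but `java.nio.file`. The `fileExists` factory below is a simplified stand-in in the spirit of FileExistsJudge, not its actual implementation, and the two-field `JudgmentContext` record is an assumption.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class WorkspaceJudgeSketch {

    // Simplified stand-ins (assumptions) for the core types.
    record JudgmentContext(String goal, Path workspace) {}
    record Judgment(boolean passed, String reasoning) {}

    @FunctionalInterface
    interface Judge {
        Judgment judge(JudgmentContext context);
    }

    // The judge only inspects the directory; it has no idea who changed it.
    static Judge fileExists(String relativePath) {
        return ctx -> {
            Path target = ctx.workspace().resolve(relativePath);
            return Files.isRegularFile(target)
                    ? new Judgment(true, "found " + relativePath)
                    : new Judgment(false, "missing " + relativePath);
        };
    }

    // Returns true when the judge fails before the change and passes after it.
    static boolean demo() {
        try {
            Path workspace = Files.createTempDirectory("workspace");
            Judge judge = fileExists("README.md");
            JudgmentContext ctx = new JudgmentContext("add a README", workspace);

            boolean before = judge.judge(ctx).passed();
            // Stand-in for "an agent modifies a directory": could be any tool or a human.
            Files.writeString(workspace.resolve("README.md"), "# demo\n");
            boolean after = judge.judge(ctx).passed();

            return !before && after;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```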

Immutable Records

Most evaluation data types are Java records: Judgment, Verdict, Check, JudgmentContext, and JudgeMetadata. Score is a sealed interface with record implementations (BooleanScore, NumericalScore, CategoricalScore). Records are:
  • Immutable — no accidental mutation between judges
  • Value-based — equality by content, not identity
  • Pattern-matchable: if (score instanceof NumericalScore(var v, var min, var max))
  • Easy to serialize — record components map cleanly to JSON/logging formats without requiring serialization dependencies in core
Combined with sealed types, this gives you a type-safe, exhaustive, immutable evaluation data model.
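A short sketch of these record properties, assuming Java 21 or later for the record pattern. The component lists for `Check` and `Judgment` are simplified assumptions, and the defensive copy in the compact constructor is one common way to keep a record with a list component immutable; the real types may differ.

```java
import java.util.ArrayList;
import java.util.List;

final class RecordSketch {

    // Simplified stand-ins (assumptions); the real records have more components.
    record Check(String name, boolean passed) {}

    record Judgment(boolean passed, String reasoning, List<Check> checks) {
        Judgment {
            // Defensive copy: later changes to the caller's list cannot leak in.
            checks = List.copyOf(checks);
        }
    }

    record NumericalScore(double value, double min, double max) {}

    // Record pattern: deconstructs the components in the instanceof test.
    static String describe(Object score) {
        if (score instanceof NumericalScore(var v, var min, var max)) {
            return v + " in [" + min + ", " + max + "]";
        }
        return "unsupported score";
    }

    // Returns true if the judgment is unaffected by mutating the source list.
    static boolean survivesMutation() {
        List<Check> mutable = new ArrayList<>(List.of(new Check("compiles", true)));
        Judgment judgment = new Judgment(true, "ok", mutable);
        mutable.clear();
        return judgment.checks().size() == 1;
    }

    public static void main(String[] args) {
        // Value-based equality: same components, equal records.
        System.out.println(new Check("compiles", true).equals(new Check("compiles", true)));
        System.out.println(describe(new NumericalScore(7.5, 0, 10)));
        System.out.println(survivesMutation());
    }
}
```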