This page covers judge-family modules only. Framework bridges (agent-judge-spring-ai, agent-judge-langchain4j, agent-judge-koog, agent-judge-agent-client) are documented in the API Reference.
agent-judge-core
Zero external dependencies. These judges work in any Java project.
FileExistsJudge
Verifies that a file exists in the workspace.
Judge judge = new FileExistsJudge("src/main/java/App.java");
Judgment result = judge.judge(context);
| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Constructor | new FileExistsJudge(String filePath) |
The file path is resolved relative to context.workspace().
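As a rough illustration of what "resolved relative to the workspace" means, the sketch below uses only `java.nio.file`. The class and method names are hypothetical, and counting only regular files (not directories) as existing is an assumption rather than documented behavior of FileExistsJudge.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of agent-judge-core.
final class WorkspaceFileCheck {

    // Resolves filePath against the workspace root and checks for a regular file there.
    // Assumption: a directory at that path does not count as an existing file.
    static boolean existsInWorkspace(Path workspace, String filePath) {
        return Files.isRegularFile(workspace.resolve(filePath).normalize());
    }

    public static void main(String[] args) throws IOException {
        Path workspace = Files.createTempDirectory("workspace");
        Path app = workspace.resolve("src/main/java/App.java");
        Files.createDirectories(app.getParent());
        Files.createFile(app);

        System.out.println(existsInWorkspace(workspace, "src/main/java/App.java"));
        System.out.println(existsInWorkspace(workspace, "src/main/java/Missing.java"));
    }
}
```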
FileContentJudge
Verifies that a file’s content matches expected criteria.
Supports three matching modes:
// Exact match
Judge exact = new FileContentJudge("config.json", expectedContent);
// Contains substring
Judge contains = new FileContentJudge("output.log", "BUILD SUCCESS", FileContentJudge.MatchMode.CONTAINS);
// Regex pattern
Judge regex = new FileContentJudge("version.txt", "\\d+\\.\\d+\\.\\d+", FileContentJudge.MatchMode.REGEX);
| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Default mode | MatchMode.EXACT (when 2-arg constructor used) |
| Checks | file_exists, file_readable, content_match |
Produces three granular checks, so on failure you can distinguish “file not found” from “file found but content wrong.”
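The three matching modes can be pictured with plain JDK calls. This is a minimal sketch with hypothetical names (`MatchModeSketch`, `contentMatches`). In particular, treating REGEX as "pattern found anywhere in the content" is an assumption; the judge may instead require the whole file to match.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the matching step; not the library's code.
final class MatchModeSketch {

    enum Mode { EXACT, CONTAINS, REGEX }

    static boolean contentMatches(String content, String expected, Mode mode) {
        switch (mode) {
            case EXACT:
                return content.equals(expected);
            case CONTAINS:
                return content.contains(expected);
            case REGEX:
                // Assumption: the pattern may match anywhere in the content.
                return Pattern.compile(expected).matcher(content).find();
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    public static void main(String[] args) {
        System.out.println(contentMatches("[INFO] BUILD SUCCESS", "BUILD SUCCESS", Mode.CONTAINS));
        System.out.println(contentMatches("v1.4.2", "\\d+\\.\\d+\\.\\d+", Mode.REGEX));
        System.out.println(contentMatches("1.4.2\n", "1.4.2", Mode.EXACT));
    }
}
```

In this sketch an exact comparison is sensitive to a trailing newline. Whether FileContentJudge trims content before comparing is not stated here, so CONTAINS or REGEX may be the safer choice for generated files.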
SupersetDiffJudge
Verifies that the workspace files are a superset of the expected files — the agent added content without removing existing files.
Judge judge = new SupersetDiffJudge();
| Property | Value |
|---|---|
| Score type | NumericalScore (proportion of files matched) |
| Judge type | DETERMINISTIC |
| Constructor | new SupersetDiffJudge() or new SupersetDiffJudge(Set.of(".mvn/", "mvnw")) |
Reads the reference directory from context.metadata().get("expectedDir") (a Path or String).
Abstains if the key is missing. Extra files in the workspace are allowed — this is superset semantics, not exact match.
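Superset semantics and the proportional score can be sketched over two sets of relative paths. The names below are hypothetical, and the sketch only checks file presence; the judge itself may also compare file contents.

```java
import java.util.Set;

// Hypothetical illustration of superset scoring; not the library's code.
final class SupersetSketch {

    // Fraction of expected relative paths that also exist in the workspace.
    // Extra workspace paths are ignored, which is what makes this a superset check.
    static double proportionMatched(Set<String> expected, Set<String> workspace) {
        if (expected.isEmpty()) {
            return 1.0; // Assumption: nothing expected means nothing can be missing.
        }
        long present = expected.stream().filter(workspace::contains).count();
        return (double) present / expected.size();
    }

    public static void main(String[] args) {
        Set<String> expected = Set.of("pom.xml", "src/App.java");
        System.out.println(proportionMatched(expected, Set.of("pom.xml", "src/App.java", "README.md")));
        System.out.println(proportionMatched(expected, Set.of("pom.xml")));
    }
}
```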
agent-judge-exec
Requires the agent-judge-exec dependency. Executes real processes in the workspace.
BuildSuccessJudge
Runs a Maven or Gradle build and checks the exit code.
// Maven — auto-detects ./mvnw wrapper
Judge maven = BuildSuccessJudge.maven("clean", "compile");
// Gradle — auto-detects ./gradlew wrapper
Judge gradle = BuildSuccessJudge.gradle("build", "test");
// Custom command string
Judge custom = new BuildSuccessJudge("make all");
| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Default timeout | 10 minutes |
| Wrapper detection | Checks for mvnw/gradlew in workspace, falls back to system mvn/gradle |
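Wrapper detection as described in the table amounts to a file check with a fallback. The sketch below is an assumption-laden illustration with hypothetical names: it looks only in the workspace root and ignores Windows wrappers such as `mvnw.cmd`.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of wrapper detection; not the library's code.
final class WrapperDetectionSketch {

    // Prefers a wrapper script in the workspace root, otherwise the tool on the PATH.
    static String executable(Path workspace, String wrapperName, String systemCommand) {
        return Files.isRegularFile(workspace.resolve(wrapperName))
                ? "./" + wrapperName
                : systemCommand;
    }

    public static void main(String[] args) throws Exception {
        Path workspace = Files.createTempDirectory("workspace");
        System.out.println(executable(workspace, "mvnw", "mvn"));

        Files.createFile(workspace.resolve("gradlew"));
        System.out.println(executable(workspace, "gradlew", "gradle"));
    }
}
```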
CommandJudge
Executes an arbitrary shell command and verifies the exit code.
// Default: expect exit code 0, 2 minute timeout
Judge simple = new CommandJudge("ls README.md");
// Custom exit code and timeout
Judge custom = new CommandJudge("grep -c TODO src/App.java", 0, Duration.ofSeconds(30));
// With custom sandbox factory (e.g., Docker-based execution)
Judge sandboxed = new CommandJudge("mvn test", 0, Duration.ofMinutes(5), sandboxFactory);
| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Default exit code | 0 |
| Default timeout | 2 minutes |
| Metadata keys | command, exitCode, output, duration |
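The exit-code and timeout behavior can be approximated with `ProcessBuilder`. This is a sketch under several assumptions: it shells out through `sh -c` (so it needs a POSIX shell), discards output instead of capturing it as the `output` metadata key suggests, and uses a hypothetical `TIMED_OUT` sentinel. None of the names come from agent-judge-exec.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of an exit-code check with a timeout; not the library's code.
final class CommandCheckSketch {

    static final int TIMED_OUT = Integer.MIN_VALUE; // sentinel chosen for this sketch

    // Runs the command in workDir and returns its exit code, or TIMED_OUT.
    static int exitCodeOf(String command, Path workDir, Duration timeout)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder("sh", "-c", command)
                .directory(workDir.toFile())
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoids blocking on a full pipe
                .start();
        if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
            process.destroyForcibly();
            return TIMED_OUT;
        }
        return process.exitValue();
    }

    static boolean passes(String command, Path workDir, int expectedExitCode, Duration timeout)
            throws IOException, InterruptedException {
        return exitCodeOf(command, workDir, timeout) == expectedExitCode;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Path.of(System.getProperty("java.io.tmpdir"));
        System.out.println(passes("true", dir, 0, Duration.ofSeconds(10)));
        System.out.println(exitCodeOf("exit 3", dir, Duration.ofSeconds(10)));
    }
}
```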
ClassVersionJudge
Validates Java class file bytecode version.
// Java 17 = class version 61
Judge judge = new ClassVersionJudge(61);
| Property | Value |
|---|---|
| Score type | BooleanScore |
| Common versions | Java 8=52, Java 11=55, Java 17=61, Java 21=65 |
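The version numbers in the table come from the class file header: a 4-byte magic number `0xCAFEBABE`, then a 2-byte minor version and a 2-byte major version, all big-endian. The sketch below reads that header; the class and method names are hypothetical, and how ClassVersionJudge locates class files or compares versions is not shown.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of reading a class file header; not the library's code.
final class ClassVersionSketch {

    // Returns the major_version field from the first 8 bytes of a class file.
    static int majorVersion(byte[] classFile) {
        if (classFile.length < 8) {
            throw new IllegalArgumentException("Too short to be a class file");
        }
        ByteBuffer header = ByteBuffer.wrap(classFile); // big-endian by default
        if (header.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("Missing 0xCAFEBABE magic number");
        }
        header.getShort();                 // minor_version, skipped
        return header.getShort() & 0xFFFF; // major_version as an unsigned value
    }

    // Major version for a Java feature release, e.g. 17 maps to 61.
    static int majorVersionFor(int javaRelease) {
        return javaRelease + 44;
    }

    public static void main(String[] args) {
        byte[] java17Header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 61};
        System.out.println(majorVersion(java17Header) == majorVersionFor(17));
    }
}
```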
CoveragePreservationJudge
Parses a JaCoCo XML report and checks that line coverage hasn’t dropped by more than a threshold compared to a baseline.
// Allow up to a 5 percentage-point coverage drop (default threshold)
Judge judge = new CoveragePreservationJudge();
// Custom threshold
Judge strict = new CoveragePreservationJudge(2.0);
| Property | Value |
|---|---|
| Score type | BooleanScore |
| Input | JaCoCo XML report at target/site/jacoco/jacoco.xml |
| Baseline | context.metadata().get("baselineCoverage") — a Double (line coverage %) or CoverageMetrics |
| Threshold | Maximum allowed percentage-point drop (default 5.0) |
Abstains if baselineCoverage is missing from metadata. Fails if no JaCoCo report is found.
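The threshold is a difference in percentage points, not a relative change. A minimal sketch of that arithmetic follows, with hypothetical names. Treating a drop exactly equal to the threshold as passing, and reporting 0% when no lines are counted, are both assumptions.

```java
// Hypothetical illustration of the preservation check; not the library's code.
final class CoveragePreservationSketch {

    // Line coverage in percent from JaCoCo-style covered/missed line counts.
    static double lineCoveragePct(long covered, long missed) {
        long total = covered + missed;
        return total == 0 ? 0.0 : 100.0 * covered / total; // assumption: no lines means 0%
    }

    // True when coverage dropped by at most maxDropPp percentage points.
    static boolean preserved(double baselinePct, double currentPct, double maxDropPp) {
        return baselinePct - currentPct <= maxDropPp;
    }

    public static void main(String[] args) {
        double current = lineCoveragePct(76, 24);
        System.out.println(preserved(80.0, current, 5.0)); // 4pp drop
        System.out.println(preserved(80.0, 74.0, 5.0));    // 6pp drop
    }
}
```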
CoverageImprovementJudge
Measures coverage improvement as a continuous score, normalized to [0, 1].
// Score normalized against max improvement of 50 percentage points (default)
Judge judge = new CoverageImprovementJudge();
// Custom normalization ceiling: 20pp improvement = score 1.0
Judge custom = new CoverageImprovementJudge(20.0);
// With minimum coverage floor: fail if below 60% regardless of improvement
Judge withFloor = new CoverageImprovementJudge(20.0, 60.0);
| Property | Value |
|---|---|
| Score type | NumericalScore (0.0 to 1.0) |
| Input | JaCoCo XML report |
| Baseline | context.metadata().get("baselineCoverage") — same as CoveragePreservationJudge |
| Minimum floor | Optional — fails if current coverage is below the floor, regardless of improvement |
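One way to read the normalization is as a linear scale clamped to [0, 1]: the gain in percentage points divided by the ceiling. The sketch below assumes exactly that, and that a coverage drop scores 0.0; the library's actual formula may differ, and all names are hypothetical.

```java
// Hypothetical illustration of score normalization; not the library's code.
final class CoverageImprovementSketch {

    // Assumption: linear scaling of the percentage-point gain, clamped to [0, 1].
    static double score(double baselinePct, double currentPct, double maxImprovementPp) {
        double gain = currentPct - baselinePct;
        return Math.max(0.0, Math.min(1.0, gain / maxImprovementPp));
    }

    // The optional floor is checked independently of the improvement score.
    static boolean meetsFloor(double currentPct, double minCoveragePct) {
        return currentPct >= minCoveragePct;
    }

    public static void main(String[] args) {
        System.out.println(score(40.0, 50.0, 20.0)); // 10pp gain against a 20pp ceiling
        System.out.println(score(40.0, 70.0, 20.0)); // gain above the ceiling is capped
        System.out.println(meetsFloor(50.0, 60.0));
    }
}
```

With a floor configured, a run can show a healthy improvement and still fail: going from 40% to 50% is a 10pp gain, but it stays below a 60% floor.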
agent-judge-file
Requires the agent-judge-file dependency. Compares agent output files against reference implementations using structural and semantic comparison.
FileComparisonJudge
Composite judge that dispatches to the appropriate comparator based on file type.
Judge judge = new FileComparisonJudge();
| File type | Dispatches to |
|---|---|
| pom.xml | MavenSemanticJudge |
| *.xml | XmlSemanticJudge |
| *.java | JavaSemanticJudge |
| Everything else | TextFileJudge |
Reads the reference directory from context.metadata().get("expectedDir") (a Path) and compares each file against the workspace.
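The dispatch order matters because `pom.xml` also matches `*.xml`. A sketch of the selection logic, returning judge names as strings so it stays self-contained, could look like the following. Matching on the file name alone, so that a nested `module/pom.xml` also goes to the Maven comparator, is an assumption.

```java
import java.nio.file.Path;

// Hypothetical illustration of comparator dispatch; not the library's code.
final class ComparatorDispatchSketch {

    // Picks a comparator by file name; the most specific rule is checked first.
    static String comparatorFor(Path file) {
        String name = file.getFileName().toString();
        if (name.equals("pom.xml")) {
            return "MavenSemanticJudge";
        }
        if (name.endsWith(".xml")) {
            return "XmlSemanticJudge";
        }
        if (name.endsWith(".java")) {
            return "JavaSemanticJudge";
        }
        return "TextFileJudge";
    }

    public static void main(String[] args) {
        System.out.println(comparatorFor(Path.of("module/pom.xml")));
        System.out.println(comparatorFor(Path.of("src/main/resources/logback.xml")));
        System.out.println(comparatorFor(Path.of("README.md")));
    }
}
```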
JavaSemanticJudge
AST-based Java file comparison using JavaParser.
Ignores whitespace, comments, and import ordering — compares structure, not formatting.
Judge judge = new JavaSemanticJudge();
MavenSemanticJudge
Semantic comparison of Maven POM files.
Compares dependency lists, plugin configurations, and properties without requiring identical XML formatting.
Judge judge = new MavenSemanticJudge();
XmlSemanticJudge
Structure-aware XML comparison.
Normalizes whitespace and attribute ordering before comparison.
Judge judge = new XmlSemanticJudge();
TextFileJudge
Plain text comparison with whitespace normalization.
Judge judge = new TextFileJudge();
agent-judge-llm
Requires the agent-judge-llm dependency and Spring AI on the classpath.
CorrectnessJudge
Uses an LLM to evaluate whether the agent accomplished its goal.
CorrectnessJudge judge = new CorrectnessJudge(chatClientBuilder);
Judgment result = judge.judge(context);
| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Input | context.goal() + context.agentOutput() |
| Cost | LLM tokens per evaluation |
Sends the goal and agent output to the LLM and asks for a YES/NO determination with reasoning.
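Turning a free-text YES/NO answer into a boolean is the fragile part of this kind of judge. The sketch below shows one defensive way to do it, assuming the verdict is the first word of the response. That response format and every name in the block are assumptions, not the parsing CorrectnessJudge actually performs.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of verdict parsing; not the library's code.
final class VerdictParserSketch {

    // Matches a leading YES or NO as a whole word, ignoring case and leading whitespace.
    private static final Pattern VERDICT =
            Pattern.compile("^\\s*(YES|NO)\\b", Pattern.CASE_INSENSITIVE);

    // Empty when no clear verdict is found, so the caller can decide how to handle it.
    static Optional<Boolean> parse(String response) {
        if (response == null) {
            return Optional.empty();
        }
        Matcher matcher = VERDICT.matcher(response);
        if (!matcher.find()) {
            return Optional.empty();
        }
        return Optional.of(matcher.group(1).equalsIgnoreCase("YES"));
    }

    public static void main(String[] args) {
        System.out.println(parse("YES. The build file was updated as requested."));
        System.out.println(parse("no, the test class is missing"));
        System.out.println(parse("Not sure, the output is ambiguous"));
    }
}
```

The word boundary keeps "Not sure" from being read as NO, which a plain prefix check would get wrong.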
LLMJudge (Abstract Base)
Template method base class for building custom LLM judges.
Subclass and implement two methods:
public class QualityJudge extends LLMJudge {
    public QualityJudge(ChatClient.Builder chatClientBuilder) {
        super("quality", "Rates code quality 0-10", chatClientBuilder);
    }
    @Override
    protected String buildPrompt(JudgmentContext context) {
        return "Rate this code quality 0-10:\n" +
                context.agentOutput().orElse("");
    }
    @Override
    protected Judgment parseResponse(String response, JudgmentContext context) {
        // extractScore is not shown here; it is assumed to be a helper that pulls the numeric rating out of the response
        double score = extractScore(response);
        return Judgment.builder()
                .score(new NumericalScore(score, 0, 10))
                .status(score >= 7 ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
                .reasoning(response)
                .build();
    }
}
| Method | Purpose |
|---|---|
| buildPrompt(JudgmentContext) | Construct the evaluation prompt |
| parseResponse(String, JudgmentContext) | Parse LLM response into a Judgment |
The base class handles LLM invocation — you focus on prompt design and response parsing.
See Writing Custom Judges for a complete walkthrough.
agent-judge-rag
Requires the agent-judge-rag dependency. LLM-powered judges for evaluating retrieval-augmented generation pipelines.
All RAG judges use the RagContext metadata convention:
| Metadata key | Description | Fallback |
|---|---|---|
| rag.question | The user’s question | context.goal() |
| rag.context | Retrieved context (String or List) | langchain4j.sources |
| rag.answer | The generated answer | context.agentOutput() |
RAG judges return ABSTAIN when required metadata is missing rather than producing misleading verdicts.
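The key-then-fallback lookup in the table can be sketched with a plain `Map`. Everything here is illustrative: `RagInputSketch` and `resolve` are hypothetical names, joining a List context with newlines is an assumption, and an empty `Optional` stands in for the case where a judge would abstain.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical illustration of the metadata fallback; not the library's code.
final class RagInputSketch {

    // Uses the metadata value when present and non-blank, otherwise the fallback.
    // An empty result corresponds to "required input missing".
    static Optional<String> resolve(Map<String, Object> metadata, String key, String fallback) {
        String text = asText(metadata.get(key));
        if (text.isBlank()) {
            text = fallback == null ? "" : fallback;
        }
        return text.isBlank() ? Optional.empty() : Optional.of(text);
    }

    // Assumption: list-valued context is joined with newlines.
    private static String asText(Object value) {
        if (value == null) {
            return "";
        }
        if (value instanceof List) {
            return ((List<?>) value).stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining("\n"));
        }
        return value.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = Map.of("rag.context", List.of("Doc A", "Doc B"));
        System.out.println(resolve(metadata, "rag.question", "What does the report say?"));
        System.out.println(resolve(metadata, "rag.context", null));
        System.out.println(resolve(metadata, "rag.answer", null));
    }
}
```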
FaithfulnessJudge
Evaluates whether every claim in the answer is grounded in the provided context.
An answer that is factually correct but not supported by the given context is considered unfaithful.
Judge judge = new FaithfulnessJudge(chatClientBuilder);
Judgment result = judge.judge(context);
| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Requires | rag.context + rag.answer (or fallbacks) |
| ABSTAIN when | Context or answer is empty, or LLM response unparseable |
ContextualRelevanceJudge
Evaluates whether the retrieved context is relevant to the question.
A natural first-tier judge in a CascadedJury — if the context is irrelevant, evaluating faithfulness or hallucination is meaningless.
Judge judge = new ContextualRelevanceJudge(chatClientBuilder);
Judgment result = judge.judge(context);
| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Requires | rag.context (or fallback) |
| ABSTAIN when | Context is empty, or LLM response unparseable |
HallucinationJudge
Detects specific claims in the answer that are not supported by the context.
Unlike FaithfulnessJudge, which asks “is the answer grounded?”, this judge asks “what specifically was made up?” with per-claim analysis.
The most expensive RAG judge — a natural final-tier judge in a CascadedJury.
Judge judge = new HallucinationJudge(chatClientBuilder);
Judgment result = judge.judge(context);
| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Requires | rag.context + rag.answer (or fallbacks) |
| ABSTAIN when | Context or answer is empty, or LLM response unparseable |
RAG Jury Example
Compose the three RAG judges into a cascaded jury for cost-efficient evaluation:
SimpleJury relevanceTier = SimpleJury.builder()
    .judge(new ContextualRelevanceJudge(chatClientBuilder))
    .votingStrategy(new MajorityVotingStrategy())
    .build();
SimpleJury faithfulnessTier = SimpleJury.builder()
    .judge(new FaithfulnessJudge(chatClientBuilder))
    .judge(new HallucinationJudge(chatClientBuilder))
    .votingStrategy(new ConsensusStrategy())
    .build();
CascadedJury ragJury = CascadedJury.builder()
    .tier("relevance", relevanceTier, TierPolicy.REJECT_ON_ANY_FAIL)
    .tier("grounding", faithfulnessTier, TierPolicy.FINAL_TIER)
    .build();
Verdict verdict = ragJury.vote(context);
If the context isn’t relevant, the grounding tier never runs — saving tokens.
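The token saving comes from short-circuit evaluation: later tiers are never invoked once an earlier one rejects. The sketch below reduces each tier to a `BooleanSupplier` to show only that control flow. It is a simplification with hypothetical names and does not model TierPolicy or voting strategies.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical illustration of cascaded short-circuiting; not the library's code.
final class CascadeSketch {

    // Evaluates tiers in order and stops at the first one that fails.
    static boolean evaluate(List<BooleanSupplier> tiers) {
        for (BooleanSupplier tier : tiers) {
            if (!tier.getAsBoolean()) {
                return false; // later (more expensive) tiers are skipped
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BooleanSupplier relevance = () -> {
            System.out.println("relevance tier ran");
            return false;
        };
        BooleanSupplier grounding = () -> {
            System.out.println("grounding tier ran");
            return true;
        };
        System.out.println(evaluate(List.of(relevance, grounding)));
    }
}
```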