This page covers judge-family modules only. Framework bridges (agent-judge-spring-ai, agent-judge-langchain4j, agent-judge-koog, agent-judge-agent-client) are documented in the API Reference.

agent-judge-core

Zero external dependencies. These judges work in any Java project.

FileExistsJudge

Verifies that a file exists in the workspace.

| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Constructor | new FileExistsJudge(String filePath) |

The file path is resolved relative to context.workspace().

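
As a rough illustration of the check described here, the sketch below resolves a relative path against a workspace directory and tests for a regular file. The class and method names (FileExistsSketch, existsInWorkspace) are hypothetical and are not part of agent-judge-core; the library's own implementation may differ.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: a stand-in for the kind of check FileExistsJudge performs.
final class FileExistsSketch {

    // Resolves filePath against the workspace root and reports whether a regular file is there.
    static boolean existsInWorkspace(Path workspace, String filePath) {
        Path candidate = workspace.resolve(filePath).normalize();
        return Files.isRegularFile(candidate);
    }

    public static void main(String[] args) {
        Path workspace = Path.of(".").toAbsolutePath();
        System.out.println("pom.xml present: " + existsInWorkspace(workspace, "pom.xml"));
    }
}
```
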
FileContentJudge
Verifies that a file’s content matches expected criteria. Supports three matching modes.

| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Default mode | MatchMode.EXACT (when 2-arg constructor used) |
| Checks | file_exists, file_readable, content_match |
SupersetDiffJudge
Verifies that the workspace files are a superset of the expected files: the agent added content without removing existing files.

| Property | Value |
|---|---|
| Score type | NumericalScore (proportion of files matched) |
| Judge type | DETERMINISTIC |
| Constructor | new SupersetDiffJudge() or new SupersetDiffJudge(Set.of(".mvn/", "mvnw")) |

The expected directory is read from context.metadata().get("expectedDir") (a Path or String). The judge abstains if the key is missing. Extra files in the workspace are allowed: this is superset semantics, not an exact match.

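
The superset idea and the "proportion of files matched" score can be sketched as follows. This is a simplified model that works on sets of relative path strings rather than real directories, and it assumes the Set passed to the constructor is a list of path prefixes to ignore, which the text above does not confirm. SupersetSketch and proportionMatched are hypothetical names.

```java
import java.util.Set;

// Illustrative only: models superset semantics on relative path strings.
final class SupersetSketch {

    // Fraction of expected paths that also appear in the workspace.
    // Paths starting with an ignored prefix are skipped (an assumption about the constructor argument).
    // Extra workspace files never lower the score.
    static double proportionMatched(Set<String> expected, Set<String> workspace, Set<String> ignoredPrefixes) {
        int considered = 0;
        int matched = 0;
        for (String path : expected) {
            boolean ignored = ignoredPrefixes.stream().anyMatch(path::startsWith);
            if (ignored) {
                continue;
            }
            considered++;
            if (workspace.contains(path)) {
                matched++;
            }
        }
        return considered == 0 ? 1.0 : (double) matched / considered;
    }

    public static void main(String[] args) {
        Set<String> expected = Set.of("pom.xml", "src/App.java", "mvnw");
        Set<String> workspace = Set.of("pom.xml", "src/App.java", "src/Extra.java");
        // "mvnw" is ignored, the other two expected files are present, so this prints 1.0.
        System.out.println(proportionMatched(expected, workspace, Set.of(".mvn/", "mvnw")));
    }
}
```
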
agent-judge-exec
Requires the agent-judge-exec dependency. Executes real processes in the workspace.
BuildSuccessJudge
Runs a Maven or Gradle build and checks the exit code.

| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Default timeout | 10 minutes |
| Wrapper detection | Checks for mvnw/gradlew in workspace, falls back to system mvn/gradle |
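
The wrapper-detection row can be pictured with a small sketch. It assumes the presence of pom.xml selects Maven and anything else selects Gradle; the table only states that a wrapper script is preferred over the system tool, so that selection rule is a guess. WrapperDetectionSketch and buildExecutable are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: prefers a project-local wrapper, falls back to the system tool.
final class WrapperDetectionSketch {

    static String buildExecutable(Path workspace) {
        // Assumption: a pom.xml means Maven; otherwise treat the project as Gradle.
        boolean maven = Files.exists(workspace.resolve("pom.xml"));
        String wrapper = maven ? "mvnw" : "gradlew";
        String systemTool = maven ? "mvn" : "gradle";
        return Files.exists(workspace.resolve(wrapper)) ? "./" + wrapper : systemTool;
    }

    public static void main(String[] args) {
        System.out.println(buildExecutable(Path.of(".")));
    }
}
```
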
CommandJudge
Executes an arbitrary shell command and verifies the exit code.

| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Default exit code | 0 |
| Default timeout | 2 minutes |
| Metadata keys | command, exitCode, output, duration |
ClassVersionJudge
Validates the bytecode version of Java class files.

| Property | Value |
|---|---|
| Score type | BooleanScore |
| Common versions | Java 8=52, Java 11=55, Java 17=61, Java 21=65 |
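
For readers unfamiliar with where these numbers come from, the sketch below reads the major version from a class file header (a 4-byte magic number, a 2-byte minor version, then a 2-byte major version). It is an independent illustration, not the judge's implementation; ClassVersionSketch and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;

// Illustrative only: extracts the major version from the first 8 bytes of a class file.
final class ClassVersionSketch {

    static int majorVersion(byte[] classBytes) {
        ByteBuffer header = ByteBuffer.wrap(classBytes); // big-endian by default, like class files
        if (header.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("Not a class file: bad magic number");
        }
        header.getShort();                 // minor version, unused here
        return header.getShort() & 0xFFFF; // major version as an unsigned 16-bit value
    }

    // True when the class file targets the given major version or an older one.
    static boolean targetsAtMost(byte[] classBytes, int maxMajorVersion) {
        return majorVersion(classBytes) <= maxMajorVersion;
    }

    public static void main(String[] args) {
        byte[] java17Header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 61};
        System.out.println(majorVersion(java17Header));      // 61, i.e. Java 17
        System.out.println(targetsAtMost(java17Header, 55)); // false: newer than Java 11
    }
}
```
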
CoveragePreservationJudge
Parses a JaCoCo XML report and checks that line coverage has not dropped by more than a threshold relative to a baseline.

| Property | Value |
|---|---|
| Score type | BooleanScore |
| Input | JaCoCo XML report at target/site/jacoco/jacoco.xml |
| Baseline | context.metadata().get("baselineCoverage") — a Double (line coverage %) or CoverageMetrics |
| Threshold | Maximum allowed percentage-point drop (default 5.0) |

Abstains if baselineCoverage is missing from metadata. Fails if no JaCoCo report is found.

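
The threshold arithmetic can be sketched as below. The example assumes the judge reads the report-level LINE counter of the JaCoCo XML (the counter element with missed and covered attributes directly under report); which counter the library actually reads is not stated here. CoveragePreservationSketch, SAMPLE_REPORT, and the method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Illustrative only: line coverage from a JaCoCo XML report, compared against a baseline.
final class CoveragePreservationSketch {

    // A tiny stand-in report used for demonstration.
    static final String SAMPLE_REPORT =
            "<report name=\"demo\"><counter type=\"LINE\" missed=\"20\" covered=\"80\"/></report>";

    // Line coverage in percent, taken from the report-level LINE counter (an assumption).
    static double lineCoveragePercent(String jacocoXml) {
        Element report;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real JaCoCo reports declare a DTD; do not try to fetch it.
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            report = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(jacocoXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse JaCoCo XML", e);
        }
        for (Node node = report.getFirstChild(); node != null; node = node.getNextSibling()) {
            if (node instanceof Element) {
                Element element = (Element) node;
                if ("counter".equals(element.getTagName()) && "LINE".equals(element.getAttribute("type"))) {
                    long missed = Long.parseLong(element.getAttribute("missed"));
                    long covered = Long.parseLong(element.getAttribute("covered"));
                    long total = missed + covered;
                    return total == 0 ? 0.0 : 100.0 * covered / total;
                }
            }
        }
        throw new IllegalArgumentException("No report-level LINE counter found");
    }

    // Passes when the drop from baseline, in percentage points, stays within the threshold.
    static boolean preserved(double baselinePercent, double currentPercent, double maxDropPoints) {
        return baselinePercent - currentPercent <= maxDropPoints;
    }

    public static void main(String[] args) {
        double current = lineCoveragePercent(SAMPLE_REPORT); // 80 of 100 lines covered
        System.out.println(preserved(84.0, current, 5.0));   // true: a 4-point drop
        System.out.println(preserved(86.0, current, 5.0));   // false: a 6-point drop
    }
}
```
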
CoverageImprovementJudge
Measures coverage improvement as a continuous score, normalized to [0, 1].

| Property | Value |
|---|---|
| Score type | NumericalScore (0.0 to 1.0) |
| Input | JaCoCo XML report |
| Baseline | context.metadata().get("baselineCoverage") — same as CoveragePreservationJudge |
| Minimum floor | Optional — fails if current coverage is below the floor, regardless of improvement |
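
The exact normalization is not spelled out above. One plausible reading, shown purely as an assumption, divides the gain over baseline by the remaining headroom to 100% and clamps the result to [0, 1], scoring 0 when coverage falls below the floor. CoverageImprovementSketch and score are hypothetical names, and the library may use a different formula.

```java
// Illustrative only: one possible way to normalize a coverage gain into [0, 1].
final class CoverageImprovementSketch {

    // Assumed formula: (current - baseline) / (100 - baseline), clamped to [0, 1].
    // A current value below minimumFloor scores 0 regardless of improvement.
    static double score(double baselinePercent, double currentPercent, double minimumFloor) {
        if (currentPercent < minimumFloor) {
            return 0.0;
        }
        double headroom = 100.0 - baselinePercent;
        if (headroom <= 0.0) {
            // Baseline already at 100%: nothing left to gain, so holding steady counts as full marks.
            return currentPercent >= baselinePercent ? 1.0 : 0.0;
        }
        double gain = (currentPercent - baselinePercent) / headroom;
        return Math.max(0.0, Math.min(1.0, gain));
    }

    public static void main(String[] args) {
        System.out.println(score(50.0, 75.0, 0.0));  // 0.5: half of the available headroom gained
        System.out.println(score(50.0, 75.0, 80.0)); // 0.0: below the minimum floor
    }
}
```
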
agent-judge-file
Requires the agent-judge-file dependency. Compares agent output files against reference implementations using structural/semantic comparison.
FileComparisonJudge
Composite judge that dispatches to the appropriate comparator based on file type.

| File type | Dispatches to |
|---|---|
| pom.xml | MavenSemanticJudge |
| *.xml | XmlSemanticJudge |
| *.java | JavaSemanticJudge |
| Everything else | TextFileJudge |

Reads the expected directory from context.metadata().get("expectedDir") (a Path) and compares each file against the workspace.

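
The dispatch table can be read as an ordered set of rules, where the pom.xml rule has to be checked before the generic *.xml rule. The sketch below returns the judge name as a string purely for illustration; ComparatorDispatchSketch and comparatorFor are hypothetical names, and the real judge presumably delegates to comparator instances rather than names.

```java
import java.nio.file.Path;

// Illustrative only: maps a file to the name of the comparator the table assigns it.
final class ComparatorDispatchSketch {

    static String comparatorFor(Path file) {
        String name = file.getFileName().toString();
        if (name.equals("pom.xml")) {
            return "MavenSemanticJudge"; // checked first so it is not caught by the *.xml rule
        }
        if (name.endsWith(".xml")) {
            return "XmlSemanticJudge";
        }
        if (name.endsWith(".java")) {
            return "JavaSemanticJudge";
        }
        return "TextFileJudge";
    }

    public static void main(String[] args) {
        System.out.println(comparatorFor(Path.of("module/pom.xml")));       // MavenSemanticJudge
        System.out.println(comparatorFor(Path.of("src/main/App.java")));    // JavaSemanticJudge
        System.out.println(comparatorFor(Path.of("README.md")));            // TextFileJudge
    }
}
```
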
JavaSemanticJudge
AST-based Java file comparison using JavaParser. Ignores whitespace, comments, and import ordering; compares structure, not formatting.

MavenSemanticJudge

Semantic comparison of Maven POM files. Compares dependency lists, plugin configurations, and properties without requiring identical XML formatting.

XmlSemanticJudge

Structure-aware XML comparison. Normalizes whitespace and attribute ordering before comparison.

TextFileJudge

Plain text comparison with whitespace normalization.

agent-judge-llm
Requires the agent-judge-llm dependency and Spring AI on the classpath.
CorrectnessJudge
Uses an LLM to evaluate whether the agent accomplished its goal.

| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Input | context.goal() + context.agentOutput() |
| Cost | LLM tokens per evaluation |
LLMJudge (Abstract Base)
Template method base class for building custom LLM judges. Subclass it and implement two methods:

| Method | Purpose |
|---|---|
| buildPrompt(JudgmentContext) | Construct the evaluation prompt |
| parseResponse(String, JudgmentContext) | Parse the LLM response into a Judgment |
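
The template method shape can be sketched without the library on the classpath. The types below (SketchContext, SketchLlmJudge, YesNoJudge) are hypothetical stand-ins for JudgmentContext, LLMJudge, and a custom subclass; the model is a plain Function so the example runs without an LLM, and the verdict is reduced to a boolean rather than a Judgment.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical stand-in for JudgmentContext, reduced to the two fields this sketch needs.
final class SketchContext {
    final String goal;
    final String agentOutput;

    SketchContext(String goal, String agentOutput) {
        this.goal = goal;
        this.agentOutput = agentOutput;
    }
}

// Hypothetical stand-in for the abstract base: the flow is fixed, the two steps are pluggable.
abstract class SketchLlmJudge {
    private final Function<String, String> model; // stand-in for a real LLM call

    SketchLlmJudge(Function<String, String> model) {
        this.model = model;
    }

    // Template method: build the prompt, call the model, parse the response.
    final boolean judge(SketchContext context) {
        String prompt = buildPrompt(context);
        String response = model.apply(prompt);
        return parseResponse(response, context);
    }

    protected abstract String buildPrompt(SketchContext context);

    protected abstract boolean parseResponse(String response, SketchContext context);
}

// A custom judge only supplies the two steps.
final class YesNoJudge extends SketchLlmJudge {

    YesNoJudge(Function<String, String> model) {
        super(model);
    }

    @Override
    protected String buildPrompt(SketchContext context) {
        return "Goal: " + context.goal + "\nOutput: " + context.agentOutput
                + "\nAnswer YES or NO: was the goal met?";
    }

    @Override
    protected boolean parseResponse(String response, SketchContext context) {
        return response.trim().toUpperCase(Locale.ROOT).startsWith("YES");
    }
}
```

With a canned model such as `prompt -> "YES"`, `new YesNoJudge(model).judge(context)` returns true, which makes the subclass easy to unit test before wiring in a real client.
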
agent-judge-rag
Requires the agent-judge-rag dependency. LLM-powered judges for evaluating retrieval-augmented generation pipelines.
All RAG judges use the RagContext metadata convention:
| Metadata key | Description | Fallback |
|---|---|---|
| rag.question | The user’s question | context.goal() |
| rag.context | Retrieved context (String or List) | langchain4j.sources |
| rag.answer | The generated answer | context.agentOutput() |

RAG judges ABSTAIN when required metadata is missing rather than producing misleading verdicts.

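
The key-then-fallback lookup and the abstain rule can be sketched as follows. The metadata is modelled as a plain Map, a List context is joined with newlines, and blank values count as missing; those details, along with the names RagInputsSketch, resolve, and faithfulnessPrecheck, are assumptions made for illustration rather than the library's behavior.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Illustrative only: resolves RAG inputs from metadata with fallbacks.
final class RagInputsSketch {

    // Converts a String or List value to text; null and blank values become empty.
    static Optional<String> asText(Object value) {
        if (value instanceof List) {
            value = ((List<?>) value).stream().map(String::valueOf).collect(Collectors.joining("\n"));
        }
        if (value == null) {
            return Optional.empty();
        }
        String text = value.toString().trim();
        return text.isEmpty() ? Optional.empty() : Optional.of(text);
    }

    // Uses the primary metadata key when present, otherwise the fallback value.
    static Optional<String> resolve(Map<String, Object> metadata, String key, Object fallback) {
        Optional<String> primary = asText(metadata.get(key));
        return primary.isPresent() ? primary : asText(fallback);
    }

    // Returns "ABSTAIN" when context or answer is missing, otherwise "EVALUATE".
    static String faithfulnessPrecheck(Map<String, Object> metadata, String agentOutput) {
        Optional<String> context = resolve(metadata, "rag.context", metadata.get("langchain4j.sources"));
        Optional<String> answer = resolve(metadata, "rag.answer", agentOutput);
        return context.isPresent() && answer.isPresent() ? "EVALUATE" : "ABSTAIN";
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = Map.of("langchain4j.sources", List.of("doc one", "doc two"));
        System.out.println(faithfulnessPrecheck(metadata, "an answer")); // EVALUATE
        System.out.println(faithfulnessPrecheck(Map.of(), "an answer")); // ABSTAIN
    }
}
```
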
FaithfulnessJudge
Evaluates whether every claim in the answer is grounded in the provided context. An answer that is factually correct but not supported by the given context is considered unfaithful.

| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Requires | rag.context + rag.answer (or fallbacks) |
| ABSTAIN when | Context or answer is empty, or LLM response unparseable |
ContextualRelevanceJudge
Evaluates whether the retrieved context is relevant to the question. A natural first-tier judge in a CascadedJury: if the context is irrelevant, evaluating faithfulness or hallucination is meaningless.

| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Requires | rag.context (or fallback) |
| ABSTAIN when | Context is empty, or LLM response unparseable |
HallucinationJudge
Detects specific claims in the answer that are not supported by the context. Unlike FaithfulnessJudge, which asks “is the answer grounded?”, this judge asks “what specifically was made up?” with per-claim analysis.
The most expensive RAG judge — a natural final-tier judge in a CascadedJury.
| Property | Value |
|---|---|
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Requires | rag.context + rag.answer (or fallbacks) |
| ABSTAIN when | Context or answer is empty, or LLM response unparseable |
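
The tiering idea mentioned for ContextualRelevanceJudge and HallucinationJudge can be sketched as an ordered list of checks that stops at the first failure, so costlier tiers run only when cheaper ones pass. This models the general cascade pattern only; CascadedJury's actual API and stopping rules are not described on this page, and CascadeSketch and run are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: a generic cascade, not the CascadedJury implementation.
final class CascadeSketch {

    // Runs tiers in order and stops after the first failing tier.
    // The returned list holds one verdict per tier that actually ran.
    static <T> List<Boolean> run(List<Predicate<T>> tiers, T input) {
        List<Boolean> verdicts = new ArrayList<>();
        for (Predicate<T> tier : tiers) {
            boolean passed = tier.test(input);
            verdicts.add(passed);
            if (!passed) {
                break; // skip the remaining, more expensive tiers
            }
        }
        return verdicts;
    }

    public static void main(String[] args) {
        List<Predicate<String>> tiers = List.of(
                answer -> !answer.isBlank(),     // cheap stand-in for a relevance check
                answer -> answer.length() < 200  // stand-in for a costlier check
        );
        System.out.println(run(tiers, "short answer")); // [true, true]
        System.out.println(run(tiers, " "));            // [false]
    }
}
```
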