Built-in Judges

This page covers judge-family modules only. Framework bridges (agent-judge-spring-ai, agent-judge-langchain4j, agent-judge-koog, agent-judge-agent-client) are documented in the API Reference.

agent-judge-core

Zero external dependencies. These judges work in any Java project.

FileExistsJudge

Verifies that a file exists in the workspace.

Judge judge = new FileExistsJudge("src/main/java/App.java");
Judgment result = judge.judge(context);

Property	Value
Score type	`BooleanScore`
Judge type	`DETERMINISTIC`
Constructor	`new FileExistsJudge(String filePath)`

The file path is resolved relative to context.workspace().
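
As a rough illustration of the resolution rule above, the check amounts to resolving the relative path against the workspace root and testing for existence. The sketch below uses only java.nio.file; the class and method names are hypothetical and this is not the library's implementation.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of agent-judge-core.
final class WorkspaceFileCheck {

    // Resolves filePath against the workspace root and reports whether it exists.
    static boolean existsInWorkspace(Path workspace, String filePath) {
        Path resolved = workspace.resolve(filePath).normalize();
        return Files.exists(resolved);
    }

    public static void main(String[] args) {
        Path workspace = Path.of(".").toAbsolutePath();
        System.out.println(existsInWorkspace(workspace, "src/main/java/App.java"));
    }
}
```

Note that Path.resolve returns its argument unchanged when that argument is absolute, so a relative path is what keeps the check inside the workspace.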

FileContentJudge

Verifies that a file’s content matches expected criteria. Supports three matching modes:

// Exact match
Judge exact = new FileContentJudge("config.json", expectedContent);

// Contains substring
Judge contains = new FileContentJudge("output.log", "BUILD SUCCESS", FileContentJudge.MatchMode.CONTAINS);

// Regex pattern
Judge regex = new FileContentJudge("version.txt", "\\d+\\.\\d+\\.\\d+", FileContentJudge.MatchMode.REGEX);

Property	Value
Score type	`BooleanScore`
Judge type	`DETERMINISTIC`
Default mode	`MatchMode.EXACT` (when the two-argument constructor is used)
Checks	`file_exists`, `file_readable`, `content_match`

Produces three granular checks, so on failure you can distinguish “file not found” from “file found but content wrong.”
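
To make the three matching modes concrete, here is a minimal standalone sketch of how such a comparison could work. The enum mirrors the mode names above, but the class is hypothetical, and whether the library's REGEX mode must match the whole file or may match anywhere is an assumption (this sketch matches anywhere).

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the three modes; not the library's code.
final class ContentMatchSketch {

    enum MatchMode { EXACT, CONTAINS, REGEX }

    static boolean matches(String content, String expected, MatchMode mode) {
        switch (mode) {
            case EXACT:
                return content.equals(expected);
            case CONTAINS:
                return content.contains(expected);
            case REGEX:
                // Assumption: the pattern may match anywhere in the content.
                return Pattern.compile(expected).matcher(content).find();
            default:
                throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }
}
```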

SupersetDiffJudge

Verifies that the workspace files are a superset of the expected files — the agent added content without removing existing files.

Judge judge = new SupersetDiffJudge();

Property	Value
Score type	`NumericalScore` (proportion of files matched)
Judge type	`DETERMINISTIC`
Constructor	`new SupersetDiffJudge()` or `new SupersetDiffJudge(Set.of(".mvn/", "mvnw"))`

Reads the reference directory from context.metadata().get("expectedDir") (a Path or String). Abstains if the key is missing. Extra files in the workspace are allowed — this is superset semantics, not exact match.
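
The superset rule can be sketched with plain sets of relative paths. This assumes "matched" means "present under the same relative path"; the source does not say whether file contents are also compared, and the class below is hypothetical.

```java
import java.util.Set;

// Hypothetical illustration of superset scoring over relative paths.
final class SupersetSketch {

    // Fraction of expected paths that also exist in the workspace.
    // Extra workspace files never lower the score.
    static double proportionMatched(Set<String> expected, Set<String> workspace) {
        if (expected.isEmpty()) {
            return 1.0; // Assumption: nothing expected counts as fully matched.
        }
        long present = expected.stream().filter(workspace::contains).count();
        return (double) present / expected.size();
    }
}
```

With expected files {pom.xml, App.java} and a workspace that also contains README.md, the proportion is 1.0 because the extra file is ignored.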

agent-judge-exec

Requires the agent-judge-exec dependency. Judges in this module execute real processes in the workspace.

BuildSuccessJudge

Runs a Maven or Gradle build and checks the exit code.

// Maven — auto-detects ./mvnw wrapper
Judge maven = BuildSuccessJudge.maven("clean", "compile");

// Gradle — auto-detects ./gradlew wrapper
Judge gradle = BuildSuccessJudge.gradle("build", "test");

// Custom command string
Judge custom = new BuildSuccessJudge("make all");

Property	Value
Score type	`BooleanScore`
Judge type	`DETERMINISTIC`
Default timeout	10 minutes
Wrapper detection	Checks for `mvnw`/`gradlew` in workspace, falls back to system `mvn`/`gradle`
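
Wrapper detection as described in the table can be pictured as a simple file check with a fallback. The sketch below is an assumption about the mechanics (a regular-file test in the workspace root) and uses hypothetical names; the library may check differently.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of wrapper-first executable selection.
final class WrapperDetectionSketch {

    // Prefers the Maven wrapper in the workspace, else the system "mvn".
    static String mavenExecutable(Path workspace) {
        return Files.isRegularFile(workspace.resolve("mvnw")) ? "./mvnw" : "mvn";
    }

    // Prefers the Gradle wrapper in the workspace, else the system "gradle".
    static String gradleExecutable(Path workspace) {
        return Files.isRegularFile(workspace.resolve("gradlew")) ? "./gradlew" : "gradle";
    }
}
```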

CommandJudge

Executes an arbitrary shell command and verifies the exit code.

// Default: expect exit code 0, 2-minute timeout
Judge simple = new CommandJudge("ls README.md");

// Custom exit code and timeout
Judge custom = new CommandJudge("grep -c TODO src/App.java", 0, Duration.ofSeconds(30));

// With custom sandbox factory (e.g., Docker-based execution)
Judge sandboxed = new CommandJudge("mvn test", 0, Duration.ofMinutes(5), sandboxFactory);

Property	Value
Score type	`BooleanScore`
Judge type	`DETERMINISTIC`
Default exit code	`0`
Default timeout	2 minutes
Metadata keys	`command`, `exitCode`, `output`, `duration`
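
For readers curious about what an exit-code check with a timeout involves, the following sketch runs a command through a POSIX shell with ProcessBuilder. It is an illustration only: the use of "sh -c", the discarded output, and the class name are assumptions, and the real judge also records output and duration as metadata.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.time.Duration;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; assumes a POSIX "sh" is available.
final class CommandCheckSketch {

    // True only if the command finishes within the timeout with the expected exit code.
    static boolean exitsWith(String command, int expectedExitCode, Duration timeout) {
        ProcessBuilder builder = new ProcessBuilder("sh", "-c", command)
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD); // avoid blocking on a full pipe
        try {
            Process process = builder.start();
            if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                process.destroyForcibly(); // timed out
                return false;
            }
            return process.exitValue() == expectedExitCode;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```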

ClassVersionJudge

Validates Java class file bytecode version.

// Java 17 = class version 61
Judge judge = new ClassVersionJudge(61);

Property	Value
Score type	`BooleanScore`
Common versions	Java 8=52, Java 11=55, Java 17=61, Java 21=65
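
The version numbers in the table come from the class file header, which starts with a 4-byte magic number (0xCAFEBABE), a 2-byte minor version, and a 2-byte major version. A minimal sketch of reading the major version follows; the class name is hypothetical and this is not the judge's own code.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of reading the class file header.
final class ClassVersionSketch {

    // Header layout: u4 magic (0xCAFEBABE), u2 minor_version, u2 major_version.
    static int majorVersion(byte[] classFileBytes) {
        if (classFileBytes.length < 8) {
            throw new IllegalArgumentException("Too short to be a class file");
        }
        ByteBuffer buffer = ByteBuffer.wrap(classFileBytes); // big-endian by default
        if (buffer.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("Missing 0xCAFEBABE magic number");
        }
        buffer.getShort(); // skip minor_version
        return buffer.getShort() & 0xFFFF;
    }
}
```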

CoveragePreservationJudge

Parses a JaCoCo XML report and checks that line coverage has not dropped by more than a threshold compared to a baseline.

// Allow up to a 5 percentage-point coverage drop (default threshold)
Judge judge = new CoveragePreservationJudge();

// Custom threshold
Judge strict = new CoveragePreservationJudge(2.0);

Property	Value
Score type	`BooleanScore`
Input	JaCoCo XML report at `target/site/jacoco/jacoco.xml`
Baseline	`context.metadata().get("baselineCoverage")` — a `Double` (line coverage %) or `CoverageMetrics`
Threshold	Maximum allowed percentage-point drop (default 5.0)

Abstains if baselineCoverage is missing from metadata. Fails if no JaCoCo report is found.
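
The arithmetic behind the check can be sketched as follows. The sketch reads the report-level LINE counter from JaCoCo XML and compares the drop in percentage points against the threshold. The class is hypothetical, and treating a drop exactly equal to the threshold as a pass is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the library's parser.
final class CoveragePreservationSketch {

    // Line coverage in percent, from the <counter type="LINE"> directly under <report>.
    static double lineCoveragePercent(String jacocoXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // JaCoCo reports declare a DOCTYPE; do not try to fetch report.dtd.
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Element report = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(jacocoXml)))
                    .getDocumentElement();
            for (Node n = report.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n instanceof Element && "counter".equals(n.getNodeName())
                        && "LINE".equals(((Element) n).getAttribute("type"))) {
                    Element counter = (Element) n;
                    double missed = Double.parseDouble(counter.getAttribute("missed"));
                    double covered = Double.parseDouble(counter.getAttribute("covered"));
                    return 100.0 * covered / (missed + covered);
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse JaCoCo report", e);
        }
        throw new IllegalArgumentException("No report-level LINE counter found");
    }

    // Assumption: a drop exactly equal to the threshold still passes.
    static boolean preserved(double baselinePercent, double currentPercent, double maxDropPoints) {
        return baselinePercent - currentPercent <= maxDropPoints;
    }
}
```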

CoverageImprovementJudge

Measures coverage improvement as a continuous score, normalized to [0, 1].

// Score normalized against max improvement of 50 percentage points (default)
Judge judge = new CoverageImprovementJudge();

// Custom normalization ceiling: 20pp improvement = score 1.0
Judge custom = new CoverageImprovementJudge(20.0);

// With minimum coverage floor: fail if below 60% regardless of improvement
Judge withFloor = new CoverageImprovementJudge(20.0, 60.0);

Property	Value
Score type	`NumericalScore` (0.0 to 1.0)
Input	JaCoCo XML report
Baseline	`context.metadata().get("baselineCoverage")` — same as `CoveragePreservationJudge`
Minimum floor	Optional — fails if current coverage is below the floor, regardless of improvement
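
One plausible reading of the normalization is: divide the improvement in percentage points by the ceiling and clamp the result to [0, 1]. The sketch below follows that reading. The class is hypothetical, and reporting a floor violation as a score of 0 is an assumption, since the source only says the judge fails in that case.

```java
// Hypothetical illustration of the normalization; not the library's code.
final class CoverageImprovementSketch {

    // minCoverage may be null when no floor is configured.
    static double score(double baseline, double current, double maxImprovement, Double minCoverage) {
        if (minCoverage != null && current < minCoverage) {
            return 0.0; // Assumption: a floor violation is reported as score 0.
        }
        double improvement = current - baseline;
        return Math.max(0.0, Math.min(1.0, improvement / maxImprovement));
    }
}
```

For example, with a ceiling of 20 points, moving from 50% to 60% coverage scores 0.5, and any improvement of 20 points or more scores 1.0.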

agent-judge-file

Requires the agent-judge-file dependency. Compares agent output files against reference implementations using structural and semantic comparison.

FileComparisonJudge

Composite judge that dispatches to the appropriate comparator based on file type.

Judge judge = new FileComparisonJudge();

File type	Dispatches to
`pom.xml`	`MavenSemanticJudge`
`*.xml`	`XmlSemanticJudge`
`*.java`	`JavaSemanticJudge`
Everything else	`TextFileJudge`

Reads the reference directory from context.metadata().get("expectedDir") (a Path) and compares each file against the workspace.
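
The dispatch table can be read as an ordered series of file-name checks, where pom.xml has to be tested before the general *.xml rule. The sketch below returns the judge name as a string purely for illustration; the class is hypothetical.

```java
// Hypothetical illustration of the dispatch order in the table above.
final class ComparatorDispatchSketch {

    static String comparatorFor(String relativePath) {
        String fileName = relativePath.substring(relativePath.lastIndexOf('/') + 1);
        if (fileName.equals("pom.xml")) return "MavenSemanticJudge"; // must precede *.xml
        if (fileName.endsWith(".xml")) return "XmlSemanticJudge";
        if (fileName.endsWith(".java")) return "JavaSemanticJudge";
        return "TextFileJudge";
    }
}
```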

JavaSemanticJudge

AST-based Java file comparison using JavaParser. Ignores whitespace, comments, and import ordering — compares structure, not formatting.

Judge judge = new JavaSemanticJudge();

MavenSemanticJudge

Semantic comparison of Maven POM files. Compares dependency lists, plugin configurations, and properties without requiring identical XML formatting.

Judge judge = new MavenSemanticJudge();

XmlSemanticJudge

Structure-aware XML comparison. Normalizes whitespace and attribute ordering before comparison.

Judge judge = new XmlSemanticJudge();
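
To show what ignoring whitespace and attribute order can mean in practice, here is a sketch built on the JDK's DOM API. It drops whitespace-only text nodes and relies on Node.isEqualNode, which compares attributes without regard to order. This is one possible approach with hypothetical names, not necessarily how XmlSemanticJudge is implemented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration using only the JDK's DOM API.
final class XmlEquivalenceSketch {

    static boolean equivalent(String leftXml, String rightXml) {
        return parse(leftXml).isEqualNode(parse(rightXml));
    }

    private static Node parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setIgnoringComments(true);
            factory.setCoalescing(true); // merge CDATA into adjacent text
            Node root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            stripBlankText(root);
            return root;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Removes whitespace-only text nodes, such as indentation between elements.
    private static void stripBlankText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripBlankText(child);
            }
            child = next;
        }
    }
}
```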

TextFileJudge

Plain text comparison with whitespace normalization.

Judge judge = new TextFileJudge();

agent-judge-llm

Requires the agent-judge-llm dependency and Spring AI on the classpath.

CorrectnessJudge

Uses an LLM to evaluate whether the agent accomplished its goal.

CorrectnessJudge judge = new CorrectnessJudge(chatClientBuilder);
Judgment result = judge.judge(context);

Property	Value
Score type	`BooleanScore`
Judge type	`LLM_POWERED`
Input	`context.goal()` + `context.agentOutput()`
Cost	LLM tokens per evaluation

Sends the goal and agent output to the LLM and asks for a YES/NO determination with reasoning.
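
The exact prompt and response format are not shown here, so the following is only a sketch of how a YES/NO reply could be turned into a verdict. It assumes the reply begins with the verdict word, and it returns an empty result for anything else so the caller can decide how to handle an unparseable response. All names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; the library's actual parsing is not documented here.
final class YesNoVerdictSketch {

    // Matches a leading YES or NO as a whole word, ignoring case.
    private static final Pattern VERDICT =
            Pattern.compile("^\\s*(YES|NO)\\b", Pattern.CASE_INSENSITIVE);

    // True for YES, false for NO, empty when the reply starts with neither.
    static Optional<Boolean> parse(String response) {
        if (response == null) {
            return Optional.empty();
        }
        Matcher matcher = VERDICT.matcher(response);
        if (!matcher.find()) {
            return Optional.empty();
        }
        return Optional.of(matcher.group(1).equalsIgnoreCase("YES"));
    }
}
```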

LLMJudge (Abstract Base)

Template method base class for building custom LLM judges. Subclass and implement two methods:

public class QualityJudge extends LLMJudge {

    public QualityJudge(ChatClient.Builder chatClientBuilder) {
        super("quality", "Rates code quality 0-10", chatClientBuilder);
    }

    @Override
    protected String buildPrompt(JudgmentContext context) {
        return "Rate this code quality 0-10:\n" +
            context.agentOutput().orElse("");
    }

    @Override
    protected Judgment parseResponse(String response, JudgmentContext context) {
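        // extractScore is not shown in this snippet: it stands for a helper
        // that pulls the numeric 0-10 rating out of the response text.
        double score = extractScore(response);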
        return Judgment.builder()
            .score(new NumericalScore(score, 0, 10))
            .status(score >= 7 ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
            .reasoning(response)
            .build();
    }
}

Method	Purpose
`buildPrompt(JudgmentContext)`	Construct the evaluation prompt
`parseResponse(String, JudgmentContext)`	Parse LLM response into a Judgment

The base class handles LLM invocation — you focus on prompt design and response parsing. See Writing Custom Judges for a complete walkthrough.
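
The QualityJudge example calls extractScore without showing its body. One way such a helper might look is sketched below: it takes the first number in the response and clamps it to the 0-10 range. The class name and the parsing strategy are assumptions, and a production judge would likely want a stricter response format.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not provided by the library as far as this page shows.
final class ScoreExtractionSketch {

    private static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");

    // Returns the first number in the response, clamped to [0, 10].
    static double extractScore(String response) {
        Matcher matcher = NUMBER.matcher(response);
        if (!matcher.find()) {
            throw new IllegalArgumentException("No numeric score in response: " + response);
        }
        double value = Double.parseDouble(matcher.group());
        return Math.max(0.0, Math.min(10.0, value));
    }
}
```

Because the first number wins, a reply such as "Score: 8/10" yields 8.0 and not 10.0, which is why the prompt should ask the model to state the rating before anything else.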

agent-judge-rag

Requires the agent-judge-rag dependency. LLM-powered judges for evaluating retrieval-augmented generation pipelines. All RAG judges use the RagContext metadata convention:

Metadata key	Description	Fallback
`rag.question`	The user’s question	`context.goal()`
`rag.context`	Retrieved context (String or List)	`langchain4j.sources`
`rag.answer`	The generated answer	`context.agentOutput()`

RAG judges return ABSTAIN when required metadata is missing rather than producing misleading verdicts.
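
The key-plus-fallback convention for the retrieved context can be sketched with a plain map. The class below is hypothetical; in particular, joining list entries with newlines and treating blank text as missing are assumptions about how a String-or-List value might be normalized.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical illustration of the metadata lookup; not the library's RagContext.
final class RagMetadataSketch {

    // Looks up rag.context first, then falls back to langchain4j.sources.
    // An empty result corresponds to the case where a judge would abstain.
    static Optional<String> resolveContext(Map<String, Object> metadata) {
        Object value = metadata.containsKey("rag.context")
                ? metadata.get("rag.context")
                : metadata.get("langchain4j.sources");
        String text;
        if (value instanceof List<?>) {
            // Assumption: list entries are joined with newlines.
            text = ((List<?>) value).stream().map(String::valueOf).collect(Collectors.joining("\n"));
        } else {
            text = value == null ? "" : value.toString();
        }
        return text.isBlank() ? Optional.empty() : Optional.of(text);
    }
}
```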

FaithfulnessJudge

Evaluates whether every claim in the answer is grounded in the provided context. An answer that is factually correct but not supported by the given context is considered unfaithful.

Judge judge = new FaithfulnessJudge(chatClientBuilder);
Judgment result = judge.judge(context);

Property	Value
Score type	`BooleanScore`
Judge type	`LLM_POWERED`
Requires	`rag.context` + `rag.answer` (or fallbacks)
ABSTAIN when	Context or answer is empty, or LLM response unparseable

ContextualRelevanceJudge

Evaluates whether the retrieved context is relevant to the question. A natural first-tier judge in a CascadedJury — if the context is irrelevant, evaluating faithfulness or hallucination is meaningless.

Judge judge = new ContextualRelevanceJudge(chatClientBuilder);
Judgment result = judge.judge(context);

Property	Value
Score type	`BooleanScore`
Judge type	`LLM_POWERED`
Requires	`rag.context` (or fallback)
ABSTAIN when	Context is empty, or LLM response unparseable

HallucinationJudge

Detects specific claims in the answer that are not supported by the context. Unlike FaithfulnessJudge, which asks “is the answer grounded?”, this judge asks “what specifically was made up?” and answers with per-claim analysis. It is the most expensive RAG judge, which makes it a natural final-tier judge in a CascadedJury.

Judge judge = new HallucinationJudge(chatClientBuilder);
Judgment result = judge.judge(context);

Property	Value
Score type	`BooleanScore`
Judge type	`LLM_POWERED`
Requires	`rag.context` + `rag.answer` (or fallbacks)
ABSTAIN when	Context or answer is empty, or LLM response unparseable

RAG Jury Example

Compose the three RAG judges into a cascaded jury for cost-efficient evaluation:

SimpleJury relevanceTier = SimpleJury.builder()
    .judge(new ContextualRelevanceJudge(chatClientBuilder))
    .votingStrategy(new MajorityVotingStrategy())
    .build();

SimpleJury faithfulnessTier = SimpleJury.builder()
    .judge(new FaithfulnessJudge(chatClientBuilder))
    .judge(new HallucinationJudge(chatClientBuilder))
    .votingStrategy(new ConsensusStrategy())
    .build();

CascadedJury ragJury = CascadedJury.builder()
    .tier("relevance", relevanceTier, TierPolicy.REJECT_ON_ANY_FAIL)
    .tier("grounding", faithfulnessTier, TierPolicy.FINAL_TIER)
    .build();

Verdict verdict = ragJury.vote(context);

If the context isn’t relevant, the grounding tier never runs — saving tokens.
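
The short-circuit behaviour is the whole point of cascading, so here is a library-independent sketch of it. Tiers run in order and evaluation stops at the first failure, which means later (more expensive) tiers are never invoked. The class is hypothetical and models only a reject-on-fail policy, not the full CascadedJury API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical illustration of tiered short-circuit evaluation.
final class CascadeSketch {

    private final Map<String, BooleanSupplier> tiers = new LinkedHashMap<>();

    CascadeSketch tier(String name, BooleanSupplier passes) {
        tiers.put(name, passes);
        return this;
    }

    // Runs tiers in insertion order and stops at the first one that fails.
    // Returns the name of the failing tier, or null when every tier passes.
    String firstFailingTier() {
        for (Map.Entry<String, BooleanSupplier> tier : tiers.entrySet()) {
            if (!tier.getValue().getAsBoolean()) {
                return tier.getKey();
            }
        }
        return null;
    }
}
```

If the relevance supplier returns false, the grounding supplier is never called, which is where the token savings come from.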
