

This page covers judge-family modules only. Framework bridges (agent-judge-spring-ai, agent-judge-langchain4j, agent-judge-koog, agent-judge-agent-client) are documented in the API Reference.

agent-judge-core

Zero external dependencies. These judges work in any Java project.

FileExistsJudge

Verifies that a file exists in the workspace.
Judge judge = new FileExistsJudge("src/main/java/App.java");
Judgment result = judge.judge(context);

| Property | Value |
| --- | --- |
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Constructor | new FileExistsJudge(String filePath) |

The file path is resolved relative to context.workspace().

FileContentJudge

Verifies that a file’s content matches expected criteria. Supports three matching modes:
// Exact match
Judge exact = new FileContentJudge("config.json", expectedContent);

// Contains substring
Judge contains = new FileContentJudge("output.log", "BUILD SUCCESS", FileContentJudge.MatchMode.CONTAINS);

// Regex pattern
Judge regex = new FileContentJudge("version.txt", "\\d+\\.\\d+\\.\\d+", FileContentJudge.MatchMode.REGEX);

| Property | Value |
| --- | --- |
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Default mode | MatchMode.EXACT (when 2-arg constructor used) |
| Checks | file_exists, file_readable, content_match |

Produces three granular checks, so on failure you can distinguish “file not found” from “file found but content wrong.”
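
To make the three checks and the three matching modes concrete, here is a minimal JDK-only sketch. ContentCheckSketch and its methods are hypothetical stand-ins, not the library's code; in particular, treating REGEX as a partial match (find rather than full match) is an assumption.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the library's implementation.
final class ContentCheckSketch {
    enum MatchMode { EXACT, CONTAINS, REGEX }

    // Pure matching step: compares already-loaded content against the expectation.
    static boolean matches(String actual, String expected, MatchMode mode) {
        switch (mode) {
            case EXACT:    return actual.equals(expected);
            case CONTAINS: return actual.contains(expected);
            case REGEX:    return Pattern.compile(expected).matcher(actual).find(); // assumption: partial match
            default:       throw new IllegalArgumentException("Unknown mode: " + mode);
        }
    }

    // Returns the name of the first failing check, or "pass" if all three succeed.
    static String firstFailure(Path workspace, String relativePath, String expected, MatchMode mode) {
        Path file = workspace.resolve(relativePath);
        if (!Files.isRegularFile(file)) return "file_exists";
        String content;
        try {
            content = Files.readString(file);
        } catch (IOException e) {
            return "file_readable";
        }
        return matches(content, expected, mode) ? "pass" : "content_match";
    }
}
```

Reporting the first failing check by name is what lets a caller tell a missing file apart from a file with the wrong content.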

SupersetDiffJudge

Verifies that the workspace files are a superset of the expected files — the agent added content without removing existing files.
Judge judge = new SupersetDiffJudge();

| Property | Value |
| --- | --- |
| Score type | NumericalScore (proportion of files matched) |
| Judge type | DETERMINISTIC |
| Constructor | new SupersetDiffJudge() or new SupersetDiffJudge(Set.of(".mvn/", "mvnw")) |

Reads the reference directory from context.metadata().get("expectedDir") (a Path or String). Abstains if the key is missing. Extra files in the workspace are allowed — this is superset semantics, not exact match.
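
The superset idea can be sketched as a pure function over two sets of relative paths. SupersetSketch is a hypothetical name, and this version only checks path presence; whether the library also compares file contents is not stated here, so treat that as an open assumption.

```java
import java.util.Set;

// Hypothetical sketch of superset scoring; not the library's implementation.
final class SupersetSketch {
    // Fraction of expected relative paths that are also present in the workspace.
    // Extra workspace files never lower the score.
    static double proportionMatched(Set<String> expected, Set<String> workspace) {
        if (expected.isEmpty()) return 1.0; // assumption: nothing expected means nothing missing
        long present = expected.stream().filter(workspace::contains).count();
        return (double) present / expected.size();
    }
}
```

With four expected files of which two are present, the score is 0.5 no matter how many additional files the agent created.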

agent-judge-exec

Requires the agent-judge-exec dependency. Executes real processes in the workspace.

BuildSuccessJudge

Runs a Maven or Gradle build and checks the exit code.
// Maven — auto-detects ./mvnw wrapper
Judge maven = BuildSuccessJudge.maven("clean", "compile");

// Gradle — auto-detects ./gradlew wrapper
Judge gradle = BuildSuccessJudge.gradle("build", "test");

// Custom command string
Judge custom = new BuildSuccessJudge("make all");

| Property | Value |
| --- | --- |
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Default timeout | 10 minutes |
| Wrapper detection | Checks for mvnw/gradlew in the workspace; falls back to system mvn/gradle |
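
Wrapper detection as described in the table could look roughly like the sketch below. WrapperDetectionSketch and mavenCommand are hypothetical names, and the exact lookup the library performs (for example, executable-bit checks or Windows mvnw.cmd handling) is not covered here.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the library's implementation.
final class WrapperDetectionSketch {
    // Builds the Maven command line, preferring ./mvnw when the workspace contains it.
    static List<String> mavenCommand(Path workspace, String... goals) {
        List<String> command = new ArrayList<>();
        command.add(Files.exists(workspace.resolve("mvnw")) ? "./mvnw" : "mvn");
        command.addAll(List.of(goals));
        return command;
    }
}
```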

CommandJudge

Executes an arbitrary shell command and verifies the exit code.
// Default: expect exit code 0, 2 minute timeout
Judge simple = new CommandJudge("ls README.md");

// Custom exit code and timeout
Judge custom = new CommandJudge("grep -c TODO src/App.java", 0, Duration.ofSeconds(30));

// With custom sandbox factory (e.g., Docker-based execution)
Judge sandboxed = new CommandJudge("mvn test", 0, Duration.ofMinutes(5), sandboxFactory);

| Property | Value |
| --- | --- |
| Score type | BooleanScore |
| Judge type | DETERMINISTIC |
| Default exit code | 0 |
| Default timeout | 2 minutes |
| Metadata keys | command, exitCode, output, duration |
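
The exit-code-plus-timeout behavior can be approximated with the JDK's ProcessBuilder. This is only a sketch under stated assumptions: CommandCheckSketch is a hypothetical name, it assumes a POSIX shell is available as sh, and it discards output rather than capturing it into metadata as the real judge does.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; assumes a POSIX shell at "sh". Not the library's implementation.
final class CommandCheckSketch {
    // Returns true only if the command finishes within the timeout with the expected exit code.
    static boolean passes(Path workDir, String command, int expectedExit, Duration timeout) {
        ProcessBuilder builder = new ProcessBuilder("sh", "-c", command)
                .directory(workDir.toFile())
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD); // avoid blocking on a full pipe
        try {
            Process process = builder.start();
            if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                process.destroyForcibly(); // timed out: treat as failure
                return false;
            }
            return process.exitValue() == expectedExit;
        } catch (IOException e) {
            return false; // command could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        }
    }
}
```

Treating a timeout as a failure rather than an error keeps the result boolean, which matches the BooleanScore in the table.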

ClassVersionJudge

Validates Java class file bytecode version.
// Java 17 = class version 61
Judge judge = new ClassVersionJudge(61);

| Property | Value |
| --- | --- |
| Score type | BooleanScore |
| Common versions | Java 8=52, Java 11=55, Java 17=61, Java 21=65 |
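
The version numbers in the table come from the class file header, which stores the major version in bytes 6 and 7. The sketch below reads it with the JDK only; ClassVersionSketch is a hypothetical helper, and how ClassVersionJudge locates and reads class files is not shown here.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not the library's implementation.
final class ClassVersionSketch {
    // Reads the major version from the first 8 bytes of a .class file:
    // u4 magic (0xCAFEBABE), u2 minor_version, u2 major_version, all big-endian.
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) throw new IllegalArgumentException("Too short for a class file");
        ByteBuffer buffer = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buffer.getInt() != 0xCAFEBABE) throw new IllegalArgumentException("Bad magic number");
        buffer.getShort();                  // skip minor_version
        return buffer.getShort() & 0xFFFF;  // major_version as an unsigned value
    }

    // For Java 5 and later, Java release N corresponds to major version N + 44.
    static int javaRelease(int majorVersion) {
        return majorVersion - 44;
    }
}
```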

CoveragePreservationJudge

Parses a JaCoCo XML report and checks that line coverage has not dropped by more than a threshold relative to a baseline.
// Allow up to a 5 percentage-point coverage drop (default threshold)
Judge judge = new CoveragePreservationJudge();

// Custom threshold
Judge strict = new CoveragePreservationJudge(2.0);

| Property | Value |
| --- | --- |
| Score type | BooleanScore |
| Input | JaCoCo XML report at target/site/jacoco/jacoco.xml |
| Baseline | context.metadata().get("baselineCoverage"): a Double (line coverage %) or CoverageMetrics |
| Threshold | Maximum allowed percentage-point drop (default 5.0) |

Abstains if baselineCoverage is missing from metadata. Fails if no JaCoCo report is found.
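
The pass, fail, and abstain rules above reduce to a small decision function. The sketch below uses hypothetical names (CoveragePreservationSketch, Outcome) and takes already-parsed percentages; report parsing is left out.

```java
import java.util.Optional;

// Hypothetical sketch of the decision rule; not the library's implementation.
final class CoveragePreservationSketch {
    enum Outcome { PASS, FAIL, ABSTAIN }

    // baseline: value stored under "baselineCoverage"; current: line coverage parsed from the
    // JaCoCo report (empty when no report was found). Both are percentages in [0, 100].
    static Outcome decide(Optional<Double> baseline, Optional<Double> current, double maxDropPoints) {
        if (baseline.isEmpty()) return Outcome.ABSTAIN; // nothing to compare against
        if (current.isEmpty()) return Outcome.FAIL;     // no report found
        double drop = baseline.get() - current.get();   // positive when coverage decreased
        return drop <= maxDropPoints ? Outcome.PASS : Outcome.FAIL;
    }
}
```

Because the threshold is in percentage points, a baseline of 80% with the default threshold tolerates anything down to 75%.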

CoverageImprovementJudge

Measures coverage improvement as a continuous score, normalized to [0, 1].
// Score normalized against max improvement of 50 percentage points (default)
Judge judge = new CoverageImprovementJudge();

// Custom normalization ceiling: 20pp improvement = score 1.0
Judge custom = new CoverageImprovementJudge(20.0);

// With minimum coverage floor: fail if below 60% regardless of improvement
Judge withFloor = new CoverageImprovementJudge(20.0, 60.0);

| Property | Value |
| --- | --- |
| Score type | NumericalScore (0.0 to 1.0) |
| Input | JaCoCo XML report |
| Baseline | context.metadata().get("baselineCoverage") (same as CoveragePreservationJudge) |
| Minimum floor | Optional; fails if current coverage is below the floor, regardless of improvement |
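
One plausible reading of the normalization is linear scaling against the ceiling, clamped to [0, 1]. The sketch below implements that reading; the linear formula, the score of 0.0 for runs below the floor, and the name CoverageImprovementSketch are all assumptions rather than confirmed library behavior.

```java
// Hypothetical sketch of the scoring formula; not the library's implementation.
final class CoverageImprovementSketch {
    // Linear normalization (an assumption): improvement / ceiling, clamped to [0, 1].
    // A run whose current coverage is below the floor scores 0.0 in this sketch.
    static double score(double baseline, double current, double ceilingPoints, double minimumFloor) {
        if (current < minimumFloor) return 0.0;
        double improvement = current - baseline; // in percentage points
        return Math.max(0.0, Math.min(1.0, improvement / ceilingPoints));
    }
}
```

Under these assumptions, moving from 50% to 60% with a ceiling of 20 points scores 0.5, and any regression scores 0.0.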

agent-judge-file

Requires the agent-judge-file dependency. Compares agent output files against reference implementations using structural/semantic comparison.

FileComparisonJudge

Composite judge that dispatches to the appropriate comparator based on file type.
Judge judge = new FileComparisonJudge();

| File type | Dispatches to |
| --- | --- |
| pom.xml | MavenSemanticJudge |
| *.xml | XmlSemanticJudge |
| *.java | JavaSemanticJudge |
| Everything else | TextFileJudge |

Reads the reference directory from context.metadata().get("expectedDir") (a Path) and compares each file against the workspace.
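
The dispatch table can be read as an ordered rule list in which the more specific pom.xml rule is checked before the generic *.xml rule. The sketch below returns judge names as strings purely for illustration; ComparatorDispatchSketch is hypothetical, and the real judge dispatches to judge instances.

```java
// Hypothetical sketch of the dispatch order; not the library's implementation.
final class ComparatorDispatchSketch {
    // Returns the judge name from the dispatch table for a given file name.
    // pom.xml is tested before the generic *.xml rule so the more specific match wins.
    static String judgeFor(String fileName) {
        if (fileName.equals("pom.xml")) return "MavenSemanticJudge";
        if (fileName.endsWith(".xml")) return "XmlSemanticJudge";
        if (fileName.endsWith(".java")) return "JavaSemanticJudge";
        return "TextFileJudge";
    }
}
```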

JavaSemanticJudge

AST-based Java file comparison using JavaParser. Ignores whitespace, comments, and import ordering — compares structure, not formatting.
Judge judge = new JavaSemanticJudge();

MavenSemanticJudge

Semantic comparison of Maven POM files. Compares dependency lists, plugin configurations, and properties without requiring identical XML formatting.
Judge judge = new MavenSemanticJudge();

XmlSemanticJudge

Structure-aware XML comparison. Normalizes whitespace and attribute ordering before comparison.
Judge judge = new XmlSemanticJudge();
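
To illustrate what comparing XML after normalizing whitespace and attribute order means, the sketch below uses the JDK's DOM API: it drops whitespace-only text nodes and relies on Node.isEqualNode, which compares attributes by name rather than by position. XmlCompareSketch is a hypothetical name and this is not necessarily how XmlSemanticJudge is implemented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch; not the library's implementation.
final class XmlCompareSketch {
    // True when both documents have the same elements, attributes, and non-blank text.
    static boolean structurallyEqual(String leftXml, String rightXml) {
        Document left = parse(leftXml);
        Document right = parse(rightXml);
        return left.getDocumentElement().isEqualNode(right.getDocumentElement());
    }

    private static Document parse(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            document.normalize(); // merge adjacent text nodes
            stripBlankText(document.getDocumentElement());
            return document;
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    // Removes whitespace-only text nodes (indentation, line breaks) recursively.
    private static void stripBlankText(Node node) {
        NodeList children = node.getChildNodes();
        for (int i = children.getLength() - 1; i >= 0; i--) { // backwards: the list is live
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE && child.getTextContent().isBlank()) {
                node.removeChild(child);
            } else {
                stripBlankText(child);
            }
        }
    }
}
```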

TextFileJudge

Plain text comparison with whitespace normalization.
Judge judge = new TextFileJudge();

agent-judge-llm

Requires the agent-judge-llm dependency and Spring AI on the classpath.

CorrectnessJudge

Uses an LLM to evaluate whether the agent accomplished its goal.
CorrectnessJudge judge = new CorrectnessJudge(chatClientBuilder);
Judgment result = judge.judge(context);

| Property | Value |
| --- | --- |
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Input | context.goal() + context.agentOutput() |
| Cost | LLM tokens per evaluation |

Sends the goal and agent output to the LLM and asks for a YES/NO determination with reasoning.
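
Turning a free-text YES/NO reply into a verdict needs a parsing rule. The sketch below shows one conservative option, assuming the verdict appears at the start of the reply; VerdictParseSketch is hypothetical, and the prompt format and parsing used by CorrectnessJudge are not documented here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the library's implementation.
final class VerdictParseSketch {
    // Matches a leading YES or NO as a whole word, ignoring case and leading whitespace.
    private static final Pattern VERDICT =
            Pattern.compile("^\\s*(YES|NO)\\b", Pattern.CASE_INSENSITIVE);

    // Returns true for YES, false for NO, and empty when the reply starts with neither.
    static Optional<Boolean> parse(String response) {
        Matcher matcher = VERDICT.matcher(response);
        if (!matcher.find()) return Optional.empty();
        return Optional.of(matcher.group(1).equalsIgnoreCase("YES"));
    }
}
```

An empty result is a natural place to abstain instead of guessing at what the model meant.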

LLMJudge (Abstract Base)

Template method base class for building custom LLM judges. Subclass and implement two methods:
public class QualityJudge extends LLMJudge {

    public QualityJudge(ChatClient.Builder chatClientBuilder) {
        super("quality", "Rates code quality 0-10", chatClientBuilder);
    }

    @Override
    protected String buildPrompt(JudgmentContext context) {
        return "Rate this code quality 0-10:\n" +
            context.agentOutput().orElse("");
    }

    @Override
    protected Judgment parseResponse(String response, JudgmentContext context) {
        double score = extractScore(response);
        return Judgment.builder()
            .score(new NumericalScore(score, 0, 10))
            .status(score >= 7 ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
            .reasoning(response)
            .build();
    }
}

| Method | Purpose |
| --- | --- |
| buildPrompt(JudgmentContext) | Construct the evaluation prompt |
| parseResponse(String, JudgmentContext) | Parse LLM response into a Judgment |

The base class handles LLM invocation — you focus on prompt design and response parsing. See Writing Custom Judges for a complete walkthrough.
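
The QualityJudge example calls extractScore without defining it. One possible helper is sketched below as a standalone class; the name ScoreExtractionSketch, the choice of the first number in the reply, and the fallback to 0.0 are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the QualityJudge example; not part of the library.
final class ScoreExtractionSketch {
    // An integer or decimal number such as "8" or "7.5".
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:\\.\\d+)?");

    // Returns the first number in the reply, clamped to the 0-10 scale.
    static double extractScore(String response) {
        Matcher matcher = NUMBER.matcher(response);
        if (!matcher.find()) return 0.0; // assumption: no number is treated as the lowest score
        double value = Double.parseDouble(matcher.group());
        return Math.max(0.0, Math.min(10.0, value));
    }
}
```

Taking the first number is fragile if the model echoes the scale before answering (for example "On a 0-10 scale, 7"), so a prompt that asks for the score alone on the first line pairs well with this kind of parser.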

agent-judge-rag

Requires the agent-judge-rag dependency. LLM-powered judges for evaluating retrieval-augmented generation pipelines. All RAG judges use the RagContext metadata convention:

| Metadata key | Description | Fallback |
| --- | --- | --- |
| rag.question | The user’s question | context.goal() |
| rag.context | Retrieved context (String or List) | langchain4j.sources |
| rag.answer | The generated answer | context.agentOutput() |

RAG judges return ABSTAIN when required metadata is missing rather than producing misleading verdicts.
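
The key-then-fallback lookup can be sketched as a small resolver. RagMetadataSketch and resolve are hypothetical names; joining list entries with blank lines and treating blank strings as missing are assumptions, not documented behavior.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical sketch of the lookup convention; not the library's implementation.
final class RagMetadataSketch {
    // Looks up a text value under the primary key; uses the fallback when it is absent or blank.
    // An empty result is the signal to abstain.
    static Optional<String> resolve(Map<String, Object> metadata, String key, Optional<String> fallback) {
        Object value = metadata.get(key);
        String text = null;
        if (value instanceof String) {
            text = (String) value;
        } else if (value instanceof List) {
            // assumption: list entries are joined with blank lines between them
            text = ((List<?>) value).stream().map(String::valueOf).collect(Collectors.joining("\n\n"));
        }
        if (text != null && !text.isBlank()) return Optional.of(text);
        return fallback.filter(s -> !s.isBlank());
    }
}
```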

FaithfulnessJudge

Evaluates whether every claim in the answer is grounded in the provided context. An answer that is factually correct but not supported by the given context is considered unfaithful.
Judge judge = new FaithfulnessJudge(chatClientBuilder);
Judgment result = judge.judge(context);

| Property | Value |
| --- | --- |
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Requires | rag.context + rag.answer (or fallbacks) |
| ABSTAIN when | Context or answer is empty, or LLM response unparseable |

ContextualRelevanceJudge

Evaluates whether the retrieved context is relevant to the question. A natural first-tier judge in a CascadedJury — if the context is irrelevant, evaluating faithfulness or hallucination is meaningless.
Judge judge = new ContextualRelevanceJudge(chatClientBuilder);
Judgment result = judge.judge(context);

| Property | Value |
| --- | --- |
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Requires | rag.context (or fallback) |
| ABSTAIN when | Context is empty, or LLM response unparseable |

HallucinationJudge

Detects specific claims in the answer that are not supported by the context. Unlike FaithfulnessJudge, which asks “is the answer grounded?”, this judge asks “what specifically was made up?” and analyzes the answer claim by claim. It is the most expensive RAG judge, which makes it a natural final-tier judge in a CascadedJury.
Judge judge = new HallucinationJudge(chatClientBuilder);
Judgment result = judge.judge(context);

| Property | Value |
| --- | --- |
| Score type | BooleanScore |
| Judge type | LLM_POWERED |
| Requires | rag.context + rag.answer (or fallbacks) |
| ABSTAIN when | Context or answer is empty, or LLM response unparseable |

RAG Jury Example

Compose the three RAG judges into a cascaded jury for cost-efficient evaluation:
SimpleJury relevanceTier = SimpleJury.builder()
    .judge(new ContextualRelevanceJudge(chatClientBuilder))
    .votingStrategy(new MajorityVotingStrategy())
    .build();

SimpleJury faithfulnessTier = SimpleJury.builder()
    .judge(new FaithfulnessJudge(chatClientBuilder))
    .judge(new HallucinationJudge(chatClientBuilder))
    .votingStrategy(new ConsensusStrategy())
    .build();

CascadedJury ragJury = CascadedJury.builder()
    .tier("relevance", relevanceTier, TierPolicy.REJECT_ON_ANY_FAIL)
    .tier("grounding", faithfulnessTier, TierPolicy.FINAL_TIER)
    .build();

Verdict verdict = ragJury.vote(context);
If the context isn’t relevant, the grounding tier never runs — saving tokens.
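
The token saving comes from short-circuit evaluation across tiers. The JDK-only sketch below models each tier as a BooleanSupplier and records which tiers ran; CascadeSketch is a hypothetical stand-in and does not reproduce CascadedJury, TierPolicy, or the voting strategies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for illustration; not the library's CascadedJury.
final class CascadeSketch {
    // Names of the tiers that actually ran during the last evaluation, in order.
    final List<String> executed = new ArrayList<>();

    // Runs tiers in the map's iteration order and stops at the first one that fails.
    boolean evaluate(Map<String, BooleanSupplier> tiers) {
        executed.clear();
        for (Map.Entry<String, BooleanSupplier> tier : tiers.entrySet()) {
            executed.add(tier.getKey());
            if (!tier.getValue().getAsBoolean()) return false; // later tiers are skipped
        }
        return true;
    }
}
```

Passing a LinkedHashMap keeps the tiers in insertion order, so a cheap relevance tier placed first can reject a sample before the expensive grounding tier is ever invoked.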