What You’ll Build

A progression of custom judges: a lambda check, a reusable deterministic judge, a composed AI judge, an LLM judge with full control, and a RAG-specific judge. You’ll wire them into a jury alongside built-in judges.

This tutorial builds on Build an Evaluation Pipeline. You should be comfortable with JudgmentContext, Judgment, and SimpleJury before starting.

Step 1: Lambda Judge

The simplest custom judge is a lambda that takes only a few lines:
import io.github.markpollack.judge.Judge;
import io.github.markpollack.judge.result.Judgment;

import java.nio.file.Files;
import java.nio.file.Path;

Judge timestampCheck = context -> {
    Path logFile = context.workspace().resolve("agent.log");
    if (Files.exists(logFile)) {
        return Judgment.pass("Agent log exists");
    }
    return Judgment.fail("No agent log found");
};
Judge is a @FunctionalInterface with a single method: judge(JudgmentContext) -> Judgment. Any lambda or method reference that matches this signature is a judge. Lambda judges work well for one-off checks. For production judges, extend a base class to get metadata and structured checks.
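Because only the functional shape matters, a method reference can stand in for a lambda. The sketch below illustrates this with simplified stand-in types so that it runs without the library: MiniJudge, MiniJudgment, and the use of a plain Path in place of JudgmentContext are assumptions made for illustration, not the library's API.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified stand-ins for illustration only.
// The real Judge takes a JudgmentContext and returns a Judgment.
record MiniJudgment(boolean passed, String reasoning) {}

@FunctionalInterface
interface MiniJudge {
    MiniJudgment judge(Path workspace);
}

class LogChecks {
    // A plain static method whose signature matches MiniJudge.judge(Path)
    static MiniJudgment agentLogExists(Path workspace) {
        boolean exists = Files.exists(workspace.resolve("agent.log"));
        return new MiniJudgment(exists,
            exists ? "Agent log exists" : "No agent log found");
    }

    // The method reference is assigned directly to the functional interface
    static final MiniJudge LOG_CHECK = LogChecks::agentLogExists;
}
```

Keeping checks as static methods like this makes them easy to unit test and to reuse from several juries.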

Step 2: Extend DeterministicJudge

For a judge you’ll reuse, extend DeterministicJudge. It implements JudgeWithMetadata so infrastructure code (logging, metrics, verdict reporting) can discover the judge’s name and type.
import io.github.markpollack.judge.DeterministicJudge;
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.result.Check;
import io.github.markpollack.judge.result.Judgment;
import io.github.markpollack.judge.result.JudgmentStatus;
import io.github.markpollack.judge.score.BooleanScore;

import java.nio.file.Files;
import java.nio.file.Path;

public class AnnotationJudge extends DeterministicJudge {

    private final String annotation;
    private final String filePath;

    public AnnotationJudge(String filePath, String annotation) {
        super("annotation-check",
            String.format("Verifies %s contains @%s", filePath, annotation));
        this.filePath = filePath;
        this.annotation = annotation;
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        Path file = context.workspace().resolve(filePath);

        if (!Files.exists(file)) {
            return Judgment.builder()
                .score(new BooleanScore(false))
                .status(JudgmentStatus.FAIL)
                .reasoning("File not found: " + filePath)
                .check(Check.fail("file_exists", "File not found"))
                .build();
        }

        try {
            String content = Files.readString(file);
            boolean found = content.contains("@" + annotation);

            return Judgment.builder()
                .score(new BooleanScore(found))
                .status(found ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
                .reasoning(found
                    ? "@" + annotation + " found in " + filePath
                    : "@" + annotation + " missing from " + filePath)
                .check(Check.pass("file_exists", "File found"))
                .check(found
                    ? Check.pass("annotation_present", "@" + annotation + " present")
                    : Check.fail("annotation_present", "@" + annotation + " not found"))
                .build();
        }
        catch (Exception e) {
            return Judgment.error("Failed to read file: " + e.getMessage(), e);
        }
    }
}
Key patterns:
  • Constructor calls super(name, description) — this becomes metadata() for logging and verdicts
  • Checks provide granular sub-assertions — on failure, you can distinguish “file not found” from “file found but annotation missing”
  • Judgment.error() handles unexpected failures without crashing the jury
Usage:
Judge judge = new AnnotationJudge(
    "src/main/java/com/example/HelloController.java",
    "RestController");

Judgment result = judge.judge(context);
// metadata().name() -> "annotation-check"
// metadata().type() -> DETERMINISTIC
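One caveat with the contains() check in AnnotationJudge: a plain substring match also succeeds for longer annotations that share the prefix, for example @RestControllerAdvice when looking for @RestController. A stricter match is sketched below. AnnotationMatch and hasAnnotation are hypothetical names, and the sketch still does not exclude matches inside comments or string literals.

```java
import java.util.regex.Pattern;

class AnnotationMatch {
    // Matches "@Name" only when it is not followed by another word character,
    // so "@RestController" is not found inside "@RestControllerAdvice".
    static boolean hasAnnotation(String source, String annotation) {
        Pattern pattern = Pattern.compile("@" + Pattern.quote(annotation) + "\\b");
        return pattern.matcher(source).find();
    }
}
```

Inside judge(), the line that computes found could call a helper like this instead of content.contains("@" + annotation).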

Step 3: Add Metadata to Lambda Judges

If you prefer lambdas but still want metadata, use Judges.named():
import io.github.markpollack.judge.Judges;
import io.github.markpollack.judge.JudgeType;
import io.github.markpollack.judge.JudgeWithMetadata;

Judge wrapped = Judges.named(
    timestampCheck,
    "timestamp-check",
    "Verifies agent log exists",
    JudgeType.DETERMINISTIC);

// Now infrastructure can discover the name
if (wrapped instanceof JudgeWithMetadata jwm) {
    System.out.println(jwm.metadata().name()); // "timestamp-check"
}
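Conceptually, a wrapper like Judges.named() is a decorator: it delegates the call unchanged and attaches metadata alongside it. The sketch below shows that idea with hypothetical stand-in types (Meta, NamedCheck, Named); it is an assumption about the general pattern, not the library's implementation.

```java
import java.util.function.Function;

// Hypothetical stand-ins, not the library's types.
record Meta(String name, String description) {}

interface NamedCheck<C, R> extends Function<C, R> {
    Meta metadata();
}

class Named {
    // Wraps any function and attaches metadata without changing its behavior
    static <C, R> NamedCheck<C, R> of(Function<C, R> delegate,
                                      String name, String description) {
        Meta meta = new Meta(name, description);
        return new NamedCheck<>() {
            @Override
            public R apply(C context) {
                return delegate.apply(context);
            }

            @Override
            public Meta metadata() {
                return meta;
            }
        };
    }
}
```

The wrapped value still satisfies the original functional interface, which is why the instanceof check above is enough for infrastructure code to discover the name.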
When to use what:
| Approach | Metadata | Checks | Best for |
| --- | --- | --- | --- |
| Lambda | No (unless wrapped) | No | One-off, inline checks |
| Judges.named(lambda) | Yes | No | Named lambdas in juries |
| DeterministicJudge subclass | Yes | Yes | Reusable, production judges |

Step 4: Build an AI Judge with ModelBackedJudge

For semantic criteria that can’t be checked from files or commands, you need an AI-backed judge. ModelBackedJudge composes three parts — a prompt template, a model backend, and a response classifier — into a judge without subclassing.
import io.github.markpollack.judge.ai.ModelBackedJudge;
import io.github.markpollack.judge.ai.JudgePromptTemplate;
import io.github.markpollack.judge.ai.JudgmentClassifiers;
import io.github.markpollack.judge.ai.JudgeModel;

JudgePromptTemplate template = JudgePromptTemplate.fromString(
    "relevance-check",
    """
    You are evaluating whether an AI agent accomplished its goal.

    Goal: {{goal}}
    Agent output: {{output}}

    Did the agent accomplish the goal? Answer exactly PASS or FAIL.
    """);

JudgeModel model = springAiJudgeModel;  // or agentClientJudgeModel

ModelBackedJudge judge = ModelBackedJudge.builder()
    .name("goal-completion")
    .description("Evaluates whether the agent accomplished its goal")
    .promptTemplate(template)
    .judgmentClassifier(JudgmentClassifiers.passFail("PASS", "FAIL"))
    .model(model)
    .build();

Judgment result = judge.judge(context);
The three components are independently swappable:
| Component | Role | Examples |
| --- | --- | --- |
| JudgePromptTemplate | Renders {{variable}} placeholders from JudgmentContext | fromString(id, text), fromClasspath(path) |
| JudgeModel | Invokes the AI backend | SpringAiJudgeModel (agent-judge-llm), AgentClientJudgeModel (agent-judge-agent-client) |
| JudgmentClassifier | Maps the model’s text response to a Judgment | JudgmentClassifiers.passFail(...), custom lambda |
Available template variables: {{goal}}, {{output}}, {{workspace}}, {{status}}, {{metadata.*}}.
ModelBackedJudge lives in the agent-judge-ai-core module, which has zero external dependencies. The actual AI backend arrives through a JudgeModel implementation from a bridge module (agent-judge-llm or agent-judge-agent-client).
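To make the placeholder mechanism concrete, here is a minimal sketch of {{variable}} substitution. MiniTemplate is a hypothetical class, and leaving unknown placeholders untouched is an assumption; the library's renderer may resolve {{metadata.*}} or handle missing variables differently.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniTemplate {
    // Matches {{name}} where name consists of word characters and dots
    private static final Pattern PLACEHOLDER =
        Pattern.compile("\\{\\{([\\w.]+)\\}\\}");

    // Replaces each {{name}} with its value; unknown names are left as-is
    static String render(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = values.getOrDefault(matcher.group(1), matcher.group(0));
            // quoteReplacement keeps '$' and '\' in values literal
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Quoting the replacement matters for agent output, which can easily contain characters such as $ that appendReplacement would otherwise interpret as group references.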
You can also load templates from the classpath for reuse across judges:
JudgePromptTemplate template = JudgePromptTemplate.fromClasspath(
    "judges/goal-completion.txt");
When to use ModelBackedJudge vs. LLMJudge:
| Approach | Best for | Trade-off |
| --- | --- | --- |
| ModelBackedJudge | Most AI judges: composable, testable, framework-neutral | Classifier must handle parsing |
| LLMJudge subclass | Custom parsing logic that doesn’t fit a classifier pattern | Requires subclassing, coupled to Spring AI |
Use ModelBackedJudge as the default. Reach for LLMJudge when you need full control over prompt construction or response parsing.
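The classifier is where response parsing lives, so it helps to see what one might do. The exact JudgmentClassifier signature is not shown in this tutorial, so the sketch below uses a plain Function<String, Outcome> with a hypothetical Outcome enum. It treats the first token of the response as the verdict and abstains on anything else.

```java
import java.util.Locale;
import java.util.function.Function;

// Hypothetical stand-in for the judgment status
enum Outcome { PASS, FAIL, ABSTAIN }

class PassFailClassifier {
    // Classifies on the first whitespace-delimited token, ignoring
    // case and punctuation; anything unrecognized becomes ABSTAIN.
    static final Function<String, Outcome> CLASSIFY = response -> {
        if (response == null || response.isBlank()) {
            return Outcome.ABSTAIN;
        }
        String first = response.strip().split("\\s+")[0]
            .replaceAll("[^A-Za-z]", "")
            .toUpperCase(Locale.ROOT);
        return switch (first) {
            case "PASS" -> Outcome.PASS;
            case "FAIL" -> Outcome.FAIL;
            default -> Outcome.ABSTAIN;
        };
    };
}
```

Because the classifier is a pure function of the response text, it can be unit tested with canned strings and no model call.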

Step 5: Write a Custom LLM Judge

For full control over prompt construction and response parsing, extend LLMJudge. It uses the template method pattern — you implement two methods:
  1. buildPrompt() — construct the evaluation prompt from the context
  2. parseResponse() — parse the LLM’s response into a Judgment
The base class handles the LLM call.
import io.github.markpollack.judge.llm.LLMJudge;
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.result.Judgment;
import io.github.markpollack.judge.result.JudgmentStatus;
import io.github.markpollack.judge.score.NumericalScore;
import org.springframework.ai.chat.client.ChatClient;

public class CodeQualityJudge extends LLMJudge {

    public CodeQualityJudge(ChatClient.Builder chatClientBuilder) {
        super("code-quality", "Rates code quality 0-10", chatClientBuilder);
    }

    @Override
    protected String buildPrompt(JudgmentContext context) {
        return """
            You are a senior Java developer reviewing code.

            The agent was asked to: %s

            The agent produced the following output:
            %s

            Rate the code quality on a scale of 0-10. Consider:
            - Readability and naming conventions
            - Error handling
            - Adherence to the stated goal
            - Use of appropriate patterns

            Respond with exactly one line in this format:
            SCORE: <number> REASON: <brief explanation>
            """.formatted(
                context.goal(),
                context.agentOutput().orElse("(no output captured)"));
    }

    @Override
    protected Judgment parseResponse(String response, JudgmentContext context) {
        try {
            String scorePart = response.substring(
                response.indexOf("SCORE:") + 6,
                response.indexOf("REASON:")).trim();
            String reasonPart = response.substring(
                response.indexOf("REASON:") + 7).trim();

            double score = Double.parseDouble(scorePart);

            return Judgment.builder()
                .score(new NumericalScore(score, 0, 10))
                .status(score >= 7.0 ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
                .reasoning(reasonPart)
                .metadata("raw_score", score)
                .build();
        }
        catch (Exception e) {
            return Judgment.error("Failed to parse LLM response: " + response, e);
        }
    }
}
Usage:
Judge qualityJudge = new CodeQualityJudge(chatClientBuilder);
Judgment result = qualityJudge.judge(context);

// result.score() -> NumericalScore[value=7.5, min=0.0, max=10.0]
// result.score().normalized() -> 0.75
// result.status() -> PASS (score >= 7.0)
LLM judges require the agent-judge-llm module and Spring AI. Always handle parse failures gracefully — Judgment.error() prevents one broken response from crashing the jury.
Prefer deterministic judges when the criterion can be checked from files, commands, or structured metadata. Use LLM judges when the criterion is semantic and cannot be expressed reliably as code.
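The indexOf-based parsing in CodeQualityJudge relies on the catch block for every malformed response. An alternative is to validate the format explicitly with a regular expression, as sketched below. ScoreLine and Parsed are hypothetical names, and rejecting scores outside 0 to 10 is an assumption about the desired behavior.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScoreLine {
    // SCORE: <number> REASON: <text>, tolerant of case and extra whitespace
    private static final Pattern LINE = Pattern.compile(
        "(?i)SCORE:\\s*(\\d+(?:\\.\\d+)?)\\s+REASON:\\s*(.+)");

    record Parsed(double score, String reason) {}

    // Returns empty when the line is missing, malformed, or out of range
    static Optional<Parsed> parse(String response) {
        if (response == null) {
            return Optional.empty();
        }
        Matcher matcher = LINE.matcher(response);
        if (!matcher.find()) {
            return Optional.empty();
        }
        double score = Double.parseDouble(matcher.group(1));
        if (score < 0 || score > 10) {
            return Optional.empty();
        }
        return Optional.of(new Parsed(score, matcher.group(2).strip()));
    }
}
```

In parseResponse(), an empty result could map to Judgment.error() or Judgment.abstain(), depending on whether an unparseable response should count against the agent.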

Step 6: Wire Custom Judges into a Jury

Combine your custom judges with built-in ones:
import io.github.markpollack.judge.Judges;
import io.github.markpollack.judge.fs.FileExistsJudge;
import io.github.markpollack.judge.exec.BuildSuccessJudge;
import io.github.markpollack.judge.jury.SimpleJury;
import io.github.markpollack.judge.jury.Verdict;
import io.github.markpollack.judge.jury.WeightedAverageStrategy;

SimpleJury jury = SimpleJury.builder()
    // Built-in deterministic
    .judge(Judges.named(
        new FileExistsJudge("src/main/java/com/example/HelloController.java"),
        "file-exists", "Controller created"), 1.0)
    .judge(Judges.named(
        BuildSuccessJudge.maven("compile"),
        "build", "Project compiles"), 2.0)

    // Custom deterministic
    .judge(new AnnotationJudge(
        "src/main/java/com/example/HelloController.java",
        "RestController"), 1.0)

    // Custom LLM
    .judge(Judges.named(
        new CodeQualityJudge(chatClientBuilder),
        "quality", "Code quality assessment"), 1.5)

    .votingStrategy(new WeightedAverageStrategy())
    .parallel(true)
    .build();

Verdict verdict = jury.vote(context);

verdict.individualByName().forEach((name, judgment) ->
    System.out.printf("%-20s %s  %s%n",
        name, judgment.status(), judgment.reasoning()));
The WeightedAverageStrategy normalizes all scores to [0, 1] and computes a weighted average. Build success (weight 2.0) matters more than file existence (weight 1.0), and quality (weight 1.5) falls in between.
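The arithmetic can be checked by hand. Assume, for illustration, that the three deterministic judges pass (normalized score 1.0) and the quality judge returns 7.5 out of 10 (normalized 0.75). The weighted average is then (1.0·1.0 + 2.0·1.0 + 1.0·1.0 + 1.5·0.75) / 5.5 = 5.125 / 5.5 ≈ 0.93. The sketch below reproduces only this calculation; how the real strategy treats abstentions or errors, and what threshold it applies, is not covered here.

```java
class WeightedAverageSketch {
    // Sum of (weight * normalized score) divided by the sum of weights
    static double weightedAverage(double[] scores, double[] weights) {
        if (scores.length != weights.length || scores.length == 0) {
            throw new IllegalArgumentException(
                "scores and weights must be non-empty and equal in length");
        }
        double weightedSum = 0.0;
        double weightTotal = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weightedSum += scores[i] * weights[i];
            weightTotal += weights[i];
        }
        return weightedSum / weightTotal;
    }
}
```

With the same inputs but a failing build (score 0.0 at weight 2.0), the result drops to 3.125 / 5.5 ≈ 0.57, which shows how much the heavier weight moves the aggregate.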

Step 7: Write a Custom RAG Judge

RAG judges follow the same LLMJudge pattern but use the RagContext helper to extract the (question, context, answer) triple from metadata. Here’s a custom judge that evaluates answer completeness — did the answer address all parts of the question?
import io.github.markpollack.judge.llm.LLMJudge;
import io.github.markpollack.judge.rag.RagContext;
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.result.Judgment;
import io.github.markpollack.judge.result.JudgmentStatus;
import io.github.markpollack.judge.score.BooleanScore;
import org.springframework.ai.chat.client.ChatClient;

import java.util.Optional;
import java.util.regex.Pattern;

public class CompletenessJudge extends LLMJudge {

    private static final Pattern ANSWER_PATTERN =
        Pattern.compile("(?mi)^\\s*Answer:\\s*(YES|NO)");

    public CompletenessJudge(ChatClient.Builder chatClientBuilder) {
        super("Completeness", "Evaluates whether the answer addresses all parts of the question",
              chatClientBuilder);
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        Optional<String> answer = RagContext.answer(context);
        if (answer.isEmpty()) {
            return Judgment.abstain("No answer provided");
        }
        return super.judge(context);
    }

    @Override
    protected String buildPrompt(JudgmentContext context) {
        return String.format("""
            Begin your response with the line "Answer: YES" or "Answer: NO".

            Question: %s

            Answer: %s

            Does the answer address all parts of the question completely?
            Answer YES if complete, NO if any part is missing.

            Format: Answer: [YES or NO]
            Reasoning: [explanation]
            """, RagContext.question(context),
                 RagContext.answer(context).orElse(""));
    }

    @Override
    protected Judgment parseResponse(String response, JudgmentContext context) {
        var matcher = ANSWER_PATTERN.matcher(response);
        if (!matcher.find()) {
            return Judgment.abstain("Could not parse LLM response");
        }
        boolean pass = "YES".equalsIgnoreCase(matcher.group(1));
        return Judgment.builder()
            .score(new BooleanScore(pass))
            .status(pass ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
            .reasoning(response)
            .build();
    }
}
The key patterns for RAG judges:
  • Use RagContext.question(), RagContext.context(), RagContext.answer() to extract the triple
  • Return ABSTAIN when required inputs are missing — this prevents misleading verdicts
  • Use the (?mi)^\s*Answer:\s*(YES|NO) regex pattern for reliable LLM response parsing
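It is worth seeing what that pattern does and does not accept. The sketch below isolates the same regular expression in a hypothetical YesNoParser class so it can be exercised without an LLM call.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YesNoParser {
    // (?m) lets ^ match at the start of any line; (?i) ignores case
    private static final Pattern ANSWER_PATTERN =
        Pattern.compile("(?mi)^\\s*Answer:\\s*(YES|NO)");

    // Empty when no line starts with "Answer:" followed by YES or NO
    static Optional<Boolean> parse(String response) {
        Matcher matcher = ANSWER_PATTERN.matcher(response);
        if (!matcher.find()) {
            return Optional.empty();
        }
        return Optional.of("YES".equalsIgnoreCase(matcher.group(1)));
    }
}
```

The pattern accepts an answer line anywhere in the response, with leading whitespace and in any case, but ignores "answer: yes" in the middle of a sentence. One caveat: because there is no word boundary after the group, a line such as "Answer: NOT sure" is read as NO. Appending \b to the pattern would be one way to tighten this.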
Supply RAG metadata when building the context:
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.context.ExecutionStatus;
import io.github.markpollack.judge.rag.RagContext;

import java.time.Duration;
import java.time.Instant;

JudgmentContext context = JudgmentContext.builder()
    .goal("What are the benefits of Spring Boot?")
    .status(ExecutionStatus.SUCCESS)
    .startedAt(Instant.now())
    .executionTime(Duration.ofSeconds(2))
    .metadata(RagContext.QUESTION_KEY, "What are the benefits of Spring Boot?")
    .metadata(RagContext.CONTEXT_KEY, "Spring Boot provides auto-configuration...")
    .metadata(RagContext.ANSWER_KEY, "Spring Boot simplifies configuration.")
    .build();
RagContext.question() falls back to context.goal() and RagContext.answer() falls back to context.agentOutput(), but the explicit metadata keys are the recommended convention. LangChain4j’s sources are also available as a fallback for RagContext.context().
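The fallback rule described above amounts to "explicit metadata first, then the general-purpose field". A minimal sketch of that resolution order follows. RagFallback, the key strings used in the usage, and the choice to treat blank values as missing are all assumptions for illustration; the real RagContext may differ in each of these details.

```java
import java.util.Map;
import java.util.Optional;

class RagFallback {
    // Prefer the explicit metadata entry; otherwise use the fallback value
    static Optional<String> resolve(Map<String, Object> metadata,
                                    String key,
                                    Optional<String> fallback) {
        Object value = metadata.get(key);
        if (value instanceof String s && !s.isBlank()) {
            return Optional.of(s);
        }
        return fallback;
    }
}
```

For example, resolve(metadata, "question", Optional.of(goal)) would return the metadata entry when it is present and the goal otherwise, which mirrors the behavior described for RagContext.question().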

What You Built

| Judge | Approach | Score | What it checks |
| --- | --- | --- | --- |
| timestampCheck | Lambda | Boolean | Agent log exists |
| AnnotationJudge | DeterministicJudge | Boolean | File contains @RestController |
| Goal completion | ModelBackedJudge | Boolean | Agent accomplished its goal |
| CodeQualityJudge | LLMJudge | Numerical (0-10) | Code quality assessment |
| CompletenessJudge | LLMJudge (RAG) | Boolean | Answer addresses all parts of the question |
The progression: lambda (quick checks) -> DeterministicJudge (production checks with metadata and sub-assertions) -> ModelBackedJudge (composed AI judge, no subclassing) -> LLMJudge (full control over prompt and parsing) -> RAG judge (LLMJudge with the RagContext metadata convention).

What’s Next

  • Built-in Judges: catalog of built-in judges. See what’s already available before writing custom judges.
  • Jury System: CascadedJury for tiered evaluation and voting strategy configuration.