Writing Custom Judges

What You’ll Build

A progression of custom judges: a lambda check, a reusable deterministic judge, a composed AI judge, an LLM judge with full control, and a RAG-specific judge. You’ll wire them into a jury alongside built-in judges. This tutorial builds on Build an Evaluation Pipeline. You should be comfortable with JudgmentContext, Judgment, and SimpleJury before starting.

Step 1: Lambda Judge

The simplest custom judge is a lambda of just a few lines:

import io.github.markpollack.judge.Judge;
import io.github.markpollack.judge.result.Judgment;

import java.nio.file.Files;
import java.nio.file.Path;

Judge timestampCheck = context -> {
    Path logFile = context.workspace().resolve("agent.log");
    if (Files.exists(logFile)) {
        return Judgment.pass("Agent log exists");
    }
    return Judgment.fail("No agent log found");
};

Judge is a @FunctionalInterface with a single method: judge(JudgmentContext) -> Judgment. Any lambda or method reference that matches this signature is a judge. Lambda judges work well for one-off checks. For production judges, extend a base class to get metadata and structured checks.
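
To make the "any lambda or method reference" point concrete, here is a minimal, self-contained sketch. MiniJudge, MiniContext, and MiniJudgment are hypothetical stand-ins invented for this example so it runs without the library; they are not the real Judge, JudgmentContext, or Judgment types.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Stand-in types for illustration only; they are NOT the library's classes.
record MiniContext(Path workspace) {}
record MiniJudgment(boolean pass, String reasoning) {}

@FunctionalInterface
interface MiniJudge {
    MiniJudgment judge(MiniContext context);
}

class LambdaJudgeSketch {

    // A static method whose shape matches MiniJudge.judge, so it can be used as a method reference.
    static MiniJudgment agentLogExists(MiniContext context) {
        Path logFile = context.workspace().resolve("agent.log");
        return Files.exists(logFile)
            ? new MiniJudgment(true, "Agent log exists")
            : new MiniJudgment(false, "No agent log found");
    }

    // Runs the check against a fresh temporary workspace, optionally creating the log first.
    static MiniJudgment runInTempWorkspace(boolean createLog) {
        try {
            Path workspace = Files.createTempDirectory("judge-sketch");
            if (createLog) {
                Files.writeString(workspace.resolve("agent.log"), "started\n");
            }
            MiniJudge judge = LambdaJudgeSketch::agentLogExists; // method reference used as a judge
            return judge.judge(new MiniContext(workspace));
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Running a judge against a temporary directory like this is also a cheap way to unit test a lambda check before adding it to a jury.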

Step 2: Extend DeterministicJudge

For a judge you’ll reuse, extend DeterministicJudge. It implements JudgeWithMetadata so infrastructure code (logging, metrics, verdict reporting) can discover the judge’s name and type.

import io.github.markpollack.judge.DeterministicJudge;
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.result.Check;
import io.github.markpollack.judge.result.Judgment;
import io.github.markpollack.judge.result.JudgmentStatus;
import io.github.markpollack.judge.score.BooleanScore;

import java.nio.file.Files;
import java.nio.file.Path;

public class AnnotationJudge extends DeterministicJudge {

    private final String annotation;
    private final String filePath;

    public AnnotationJudge(String filePath, String annotation) {
        super("annotation-check",
            String.format("Verifies %s contains @%s", filePath, annotation));
        this.filePath = filePath;
        this.annotation = annotation;
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        Path file = context.workspace().resolve(filePath);

        if (!Files.exists(file)) {
            return Judgment.builder()
                .score(new BooleanScore(false))
                .status(JudgmentStatus.FAIL)
                .reasoning("File not found: " + filePath)
                .check(Check.fail("file_exists", "File not found"))
                .build();
        }

        try {
            String content = Files.readString(file);
            boolean found = content.contains("@" + annotation);

            return Judgment.builder()
                .score(new BooleanScore(found))
                .status(found ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
                .reasoning(found
                    ? "@" + annotation + " found in " + filePath
                    : "@" + annotation + " missing from " + filePath)
                .check(Check.pass("file_exists", "File found"))
                .check(found
                    ? Check.pass("annotation_present", "@" + annotation + " present")
                    : Check.fail("annotation_present", "@" + annotation + " not found"))
                .build();
        }
        catch (Exception e) {
            return Judgment.error("Failed to read file: " + e.getMessage(), e);
        }
    }
}

Key patterns:

- Constructor calls super(name, description); this becomes metadata() for logging and verdicts
- Checks provide granular sub-assertions; on failure, you can distinguish “file not found” from “file found but annotation missing”
- Judgment.error() handles unexpected failures without crashing the jury

Usage:

Judge judge = new AnnotationJudge(
    "src/main/java/com/example/HelloController.java",
    "RestController");

Judgment result = judge.judge(context);
// metadata().name() -> "annotation-check"
// metadata().type() -> DETERMINISTIC
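
One caveat about the plain content.contains("@" + annotation) test in AnnotationJudge: it also matches longer names such as @RestControllerAdvice. If that matters for your check, a word-boundary regex is one possible tightening. The sketch below is an assumption-labeled variant, not part of the library, and AnnotationMatchSketch is a hypothetical helper name.

```java
import java.util.regex.Pattern;

class AnnotationMatchSketch {

    // Returns true when "@<annotation>" appears as a whole identifier,
    // so "RestController" does not match "@RestControllerAdvice".
    static boolean hasAnnotation(String source, String annotation) {
        // Pattern.quote guards against regex metacharacters in the annotation name.
        Pattern pattern = Pattern.compile("@" + Pattern.quote(annotation) + "\\b");
        return pattern.matcher(source).find();
    }
}
```

This is still a text match: it will find the annotation inside comments or string literals, and it will not find a fully qualified usage such as @org.springframework.web.bind.annotation.RestController when asked for RestController.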

Step 3: Add Metadata to Lambda Judges

If you prefer lambdas but still want metadata, use Judges.named():

import io.github.markpollack.judge.Judges;
import io.github.markpollack.judge.JudgeType;
import io.github.markpollack.judge.JudgeWithMetadata;

Judge wrapped = Judges.named(
    timestampCheck,
    "timestamp-check",
    "Verifies agent log exists",
    JudgeType.DETERMINISTIC);

// Now infrastructure can discover the name
if (wrapped instanceof JudgeWithMetadata jwm) {
    System.out.println(jwm.metadata().name()); // "timestamp-check"
}

When to use what:

Approach	Metadata	Checks	Best for
Lambda	No (unless wrapped)	No	One-off, inline checks
`Judges.named(lambda)`	Yes	No	Named lambdas in juries
`DeterministicJudge` subclass	Yes	Yes	Reusable, production judges
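
If you are curious how a wrapper can attach metadata to a lambda, the usual technique is a small decorator. The sketch below shows that idea under stated assumptions: SketchJudge, SketchJudgeWithMetadata, and SketchMetadata are hypothetical stand-ins, and nothing here claims to be how Judges.named() is actually implemented.

```java
// Stand-in types for illustration only; they are NOT the library's classes.
record SketchMetadata(String name, String description) {}

interface SketchJudge {
    String judge(String input); // simplified: returns "PASS" or "FAIL"
}

interface SketchJudgeWithMetadata extends SketchJudge {
    SketchMetadata metadata();
}

class NamedJudgeSketch {

    // Wraps any judge and attaches metadata, delegating the verdict unchanged.
    static SketchJudgeWithMetadata named(SketchJudge delegate, String name, String description) {
        SketchMetadata metadata = new SketchMetadata(name, description);
        return new SketchJudgeWithMetadata() {
            @Override
            public String judge(String input) {
                return delegate.judge(input);
            }

            @Override
            public SketchMetadata metadata() {
                return metadata;
            }
        };
    }
}
```

The wrapped judge behaves exactly like the original lambda; only the extra metadata() accessor is new, which is why an instanceof check is enough for infrastructure code to discover it.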

Step 4: Build an AI Judge with ModelBackedJudge

For semantic criteria that can’t be checked from files or commands, you need an AI-backed judge. ModelBackedJudge composes three parts — a prompt template, a model backend, and a response classifier — into a judge without subclassing.

import io.github.markpollack.judge.ai.ModelBackedJudge;
import io.github.markpollack.judge.ai.JudgePromptTemplate;
import io.github.markpollack.judge.ai.JudgmentClassifiers;
import io.github.markpollack.judge.ai.JudgeModel;

JudgePromptTemplate template = JudgePromptTemplate.fromString(
    "relevance-check",
    """
    You are evaluating whether an AI agent accomplished its goal.

    Goal: {{goal}}
    Agent output: {{output}}

    Did the agent accomplish the goal? Answer exactly PASS or FAIL.
    """);

// Any existing JudgeModel instance works here, e.g. a SpringAiJudgeModel or an AgentClientJudgeModel
JudgeModel model = springAiJudgeModel;

ModelBackedJudge judge = ModelBackedJudge.builder()
    .name("goal-completion")
    .description("Evaluates whether the agent accomplished its goal")
    .promptTemplate(template)
    .judgmentClassifier(JudgmentClassifiers.passFail("PASS", "FAIL"))
    .model(model)
    .build();

Judgment result = judge.judge(context);

The three components are independently swappable:

Component	Role	Examples
`JudgePromptTemplate`	Renders `{{variable}}` placeholders from `JudgmentContext`	`fromString(id, text)`, `fromClasspath(path)`
`JudgeModel`	Invokes the AI backend	`SpringAiJudgeModel` (agent-judge-llm), `AgentClientJudgeModel` (agent-judge-agent-client)
`JudgmentClassifier`	Maps the model’s text response to a `Judgment`	`JudgmentClassifiers.passFail(...)`, custom lambda

Available template variables: {{goal}}, {{output}}, {{workspace}}, {{status}}, {{metadata.*}}.
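
To illustrate what rendering {{variable}} placeholders involves, here is a minimal sketch of a placeholder renderer. It is not the library's implementation: PlaceholderSketch is a hypothetical class, and leaving unknown placeholders untouched is an assumption made for this example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderSketch {

    // Matches {{name}} or {{ metadata.key }}; the captured group is the variable name.
    private static final Pattern PLACEHOLDER =
        Pattern.compile("\\{\\{\\s*([\\w.]+)\\s*\\}\\}");

    // Replaces each known placeholder with its value; unknown names are left as-is.
    static String render(String template, Map<String, String> variables) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = variables.getOrDefault(matcher.group(1), matcher.group());
            // quoteReplacement keeps "$" and "\" in values from being treated as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Quoting the replacement matters in practice: agent output often contains characters such as $ that would otherwise break the substitution.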

ModelBackedJudge lives in the agent-judge-ai-core module, which has zero external dependencies. The actual AI backend arrives through a JudgeModel implementation from a bridge module (agent-judge-llm or agent-judge-agent-client).

You can also load templates from the classpath for reuse across judges:

JudgePromptTemplate template = JudgePromptTemplate.fromClasspath(
    "judges/goal-completion.txt");

When to use ModelBackedJudge vs. LLMJudge:

Approach	Best for	Trade-off
`ModelBackedJudge`	Most AI judges — composable, testable, framework-neutral	Classifier must handle parsing
`LLMJudge` subclass	Custom parsing logic that doesn’t fit a classifier pattern	Requires subclassing, coupled to Spring AI

Use ModelBackedJudge as the default. Reach for LLMJudge when you need full control over prompt construction or response parsing.
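
The "classifier must handle parsing" trade-off is easier to judge with an example. The sketch below shows one way a custom pass/fail classifier could interpret a model reply, returning an empty result when the reply is ambiguous. PassFailClassifierSketch is a hypothetical helper; it does not use the real JudgmentClassifier interface, whose exact signature is not shown in this tutorial.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PassFailClassifierSketch {

    // Accepts a verdict only at the very start of the reply, ignoring leading
    // whitespace or punctuation such as markdown asterisks.
    private static final Pattern VERDICT =
        Pattern.compile("^\\W*(PASS|FAIL)\\b", Pattern.CASE_INSENSITIVE);

    // Returns true for PASS, false for FAIL, and empty when no clear verdict leads the reply.
    static Optional<Boolean> classify(String response) {
        if (response == null) {
            return Optional.empty();
        }
        Matcher matcher = VERDICT.matcher(response);
        if (!matcher.find()) {
            return Optional.empty();
        }
        return Optional.of(matcher.group(1).equalsIgnoreCase("PASS"));
    }
}
```

An empty result is a natural place to abstain or report an error rather than guess, which keeps a chatty model reply from turning into a false PASS.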

Step 5: Write a Custom LLM Judge

For full control over prompt construction and response parsing, extend LLMJudge. It uses the template method pattern — you implement two methods:

- buildPrompt(): construct the evaluation prompt from the context
- parseResponse(): parse the LLM’s response into a Judgment

The base class handles the LLM call.
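
If the template method pattern is unfamiliar, the following self-contained sketch shows its shape. TemplateJudgeSketch and YesNoJudgeSketch are hypothetical stand-ins, and the Function<String, String> parameter stands in for the LLM call; none of this is the real LLMJudge code.

```java
import java.util.function.Function;

// Stand-in base class for illustration only; NOT the library's LLMJudge.
abstract class TemplateJudgeSketch {

    private final Function<String, String> model; // stands in for the LLM call

    TemplateJudgeSketch(Function<String, String> model) {
        this.model = model;
    }

    // The template method: a fixed sequence with two steps supplied by subclasses.
    boolean judge(String goal) {
        String prompt = buildPrompt(goal);
        String response = model.apply(prompt);
        return parseResponse(response);
    }

    protected abstract String buildPrompt(String goal);

    protected abstract boolean parseResponse(String response);
}

class YesNoJudgeSketch extends TemplateJudgeSketch {

    YesNoJudgeSketch(Function<String, String> model) {
        super(model);
    }

    @Override
    protected String buildPrompt(String goal) {
        return "Was this goal met? " + goal + " Answer YES or NO.";
    }

    @Override
    protected boolean parseResponse(String response) {
        return response.strip().equalsIgnoreCase("YES");
    }
}
```

Because the model is just a function here, a canned reply such as prompt -> "YES" is enough to exercise the subclass without any network call.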

import io.github.markpollack.judge.llm.LLMJudge;
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.result.Judgment;
import io.github.markpollack.judge.result.JudgmentStatus;
import io.github.markpollack.judge.score.NumericalScore;
import org.springframework.ai.chat.client.ChatClient;

public class CodeQualityJudge extends LLMJudge {

    public CodeQualityJudge(ChatClient.Builder chatClientBuilder) {
        super("code-quality", "Rates code quality 0-10", chatClientBuilder);
    }

    @Override
    protected String buildPrompt(JudgmentContext context) {
        return """
            You are a senior Java developer reviewing code.

            The agent was asked to: %s

            The agent produced the following output:
            %s

            Rate the code quality on a scale of 0-10. Consider:
            - Readability and naming conventions
            - Error handling
            - Adherence to the stated goal
            - Use of appropriate patterns

            Respond with exactly one line in this format:
            SCORE: <number> REASON: <brief explanation>
            """.formatted(
                context.goal(),
                context.agentOutput().orElse("(no output captured)"));
    }

    @Override
    protected Judgment parseResponse(String response, JudgmentContext context) {
        try {
            String scorePart = response.substring(
                response.indexOf("SCORE:") + 6,
                response.indexOf("REASON:")).trim();
            String reasonPart = response.substring(
                response.indexOf("REASON:") + 7).trim();

            double score = Double.parseDouble(scorePart);

            return Judgment.builder()
                .score(new NumericalScore(score, 0, 10))
                .status(score >= 7.0 ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
                .reasoning(reasonPart)
                .metadata("raw_score", score)
                .build();
        }
        catch (Exception e) {
            return Judgment.error("Failed to parse LLM response: " + response, e);
        }
    }
}

Usage:

// chatClientBuilder is a Spring AI ChatClient.Builder, typically injected by Spring
Judge qualityJudge = new CodeQualityJudge(chatClientBuilder);
Judgment result = qualityJudge.judge(context);

// result.score() -> NumericalScore[value=7.5, min=0.0, max=10.0]
// result.score().normalized() -> 0.75
// result.status() -> PASS (score >= 7.0)

LLM judges require the agent-judge-llm module and Spring AI. Always handle parse failures gracefully — Judgment.error() prevents one broken response from crashing the jury.
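
As an illustration of graceful parse handling, the sketch below parses the SCORE/REASON line with a regex instead of substring offsets and rejects out-of-range values. ScoreParseSketch is a hypothetical helper written for this example; inside a real judge you would map the empty case to Judgment.error() as shown above.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScoreParseSketch {

    // Matches "SCORE: 7.5 REASON: ..." with flexible spacing, line breaks, and case.
    private static final Pattern LINE = Pattern.compile(
        "SCORE:\\s*(\\d+(?:\\.\\d+)?)\\s*REASON:\\s*(.*)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the score when it parses and lies in [0, 10]; otherwise empty.
    static OptionalDouble parseScore(String response) {
        if (response == null) {
            return OptionalDouble.empty();
        }
        Matcher matcher = LINE.matcher(response);
        if (!matcher.find()) {
            return OptionalDouble.empty();
        }
        double score = Double.parseDouble(matcher.group(1));
        return (score >= 0 && score <= 10)
            ? OptionalDouble.of(score)
            : OptionalDouble.empty();
    }
}
```

The range check is worth keeping: models occasionally answer with values like 11 or 85, and treating those as parse failures is safer than clamping them silently.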

Prefer deterministic judges when the criterion can be checked from files, commands, or structured metadata. Use LLM judges when the criterion is semantic and cannot be expressed reliably as code.

Step 6: Wire Custom Judges into a Jury

Combine your custom judges with built-in ones:

import io.github.markpollack.judge.Judges;
import io.github.markpollack.judge.fs.FileExistsJudge;
import io.github.markpollack.judge.exec.BuildSuccessJudge;
import io.github.markpollack.judge.jury.SimpleJury;
import io.github.markpollack.judge.jury.Verdict;
import io.github.markpollack.judge.jury.WeightedAverageStrategy;

SimpleJury jury = SimpleJury.builder()
    // Built-in deterministic
    .judge(Judges.named(
        new FileExistsJudge("src/main/java/com/example/HelloController.java"),
        "file-exists", "Controller created"), 1.0)
    .judge(Judges.named(
        BuildSuccessJudge.maven("compile"),
        "build", "Project compiles"), 2.0)

    // Custom deterministic
    .judge(new AnnotationJudge(
        "src/main/java/com/example/HelloController.java",
        "RestController"), 1.0)

    // Custom LLM
    .judge(Judges.named(
        new CodeQualityJudge(chatClientBuilder),
        "quality", "Code quality assessment"), 1.5)

    .votingStrategy(new WeightedAverageStrategy())
    .parallel(true)
    .build();

Verdict verdict = jury.vote(context);

verdict.individualByName().forEach((name, judgment) ->
    System.out.printf("%-20s %s  %s%n",
        name, judgment.status(), judgment.reasoning()));

The WeightedAverageStrategy normalizes all scores to [0, 1] and computes a weighted average. Build success (weight 2.0) matters more than file existence and the annotation check (weight 1.0 each), and quality (weight 1.5) falls in between.
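
To see what the weights do, here is the arithmetic as a small sketch. It assumes the strategy computes sum(weight × score) / sum(weight) over normalized scores, which is the standard weighted average; the exact behavior of WeightedAverageStrategy (for example, how abstentions or errors are treated) is not covered here. WeightedAverageSketch and the sample scores are hypothetical.

```java
class WeightedAverageSketch {

    // Computes sum(weight_i * score_i) / sum(weight_i) for scores already in [0, 1].
    static double weightedAverage(double[] scores, double[] weights) {
        if (scores.length != weights.length || scores.length == 0) {
            throw new IllegalArgumentException(
                "scores and weights must be non-empty and the same length");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weightedSum += scores[i] * weights[i];
            totalWeight += weights[i];
        }
        return weightedSum / totalWeight;
    }
}
```

Suppose the file exists (1.0), the build passes (1.0), the annotation is missing (0.0), and quality scores 7.5 out of 10 (0.75). With weights 1.0, 2.0, 1.0, and 1.5 the result is (1.0 + 2.0 + 0.0 + 1.125) / 5.5 = 0.75. If the build failed instead, the same inputs would give 2.125 / 5.5, roughly 0.39, which shows how much the heavier weight moves the outcome.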

Step 7: Write a Custom RAG Judge

RAG judges follow the same LLMJudge pattern but use the RagContext helper to extract the (question, context, answer) triple from metadata. Here’s a custom judge that evaluates answer completeness — did the answer address all parts of the question?

import io.github.markpollack.judge.llm.LLMJudge;
import io.github.markpollack.judge.rag.RagContext;
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.result.Judgment;
import io.github.markpollack.judge.result.JudgmentStatus;
import io.github.markpollack.judge.score.BooleanScore;
import org.springframework.ai.chat.client.ChatClient;

import java.util.Optional;
import java.util.regex.Pattern;

public class CompletenessJudge extends LLMJudge {

    private static final Pattern ANSWER_PATTERN =
        Pattern.compile("(?mi)^\\s*Answer:\\s*(YES|NO)");

    public CompletenessJudge(ChatClient.Builder chatClientBuilder) {
        super("Completeness", "Evaluates whether the answer addresses all parts of the question",
              chatClientBuilder);
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        Optional<String> answer = RagContext.answer(context);
        if (answer.isEmpty()) {
            return Judgment.abstain("No answer provided");
        }
        return super.judge(context);
    }

    @Override
    protected String buildPrompt(JudgmentContext context) {
        return String.format("""
            Begin your response with the line "Answer: YES" or "Answer: NO".

            Question: %s

            Answer: %s

            Does the answer address all parts of the question completely?
            Answer YES if complete, NO if any part is missing.

            Format: Answer: [YES or NO]
            Reasoning: [explanation]
            """, RagContext.question(context),
                 RagContext.answer(context).orElse(""));
    }

    @Override
    protected Judgment parseResponse(String response, JudgmentContext context) {
        var matcher = ANSWER_PATTERN.matcher(response);
        if (!matcher.find()) {
            return Judgment.abstain("Could not parse LLM response");
        }
        boolean pass = "YES".equalsIgnoreCase(matcher.group(1));
        return Judgment.builder()
            .score(new BooleanScore(pass))
            .status(pass ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
            .reasoning(response)
            .build();
    }
}

The key patterns for RAG judges:

- Use RagContext.question(), RagContext.context(), RagContext.answer() to extract the triple
- Return ABSTAIN when required inputs are missing; this prevents misleading verdicts
- Use the (?mi)^\s*Answer:\s*(YES|NO) regex pattern for reliable LLM response parsing
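
The regex in the last point can be tried in isolation. The sketch below reuses the same pattern as CompletenessJudge in a hypothetical AnswerPatternSketch helper, using only java.util.regex, so you can see which replies it accepts.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnswerPatternSketch {

    // Same pattern as CompletenessJudge: (?m) lets ^ match at every line start,
    // (?i) ignores case.
    private static final Pattern ANSWER_PATTERN =
        Pattern.compile("(?mi)^\\s*Answer:\\s*(YES|NO)");

    // Returns true for YES, false for NO, and empty when no "Answer:" line is found.
    static Optional<Boolean> parse(String response) {
        Matcher matcher = ANSWER_PATTERN.matcher(response);
        if (!matcher.find()) {
            return Optional.empty();
        }
        return Optional.of("YES".equalsIgnoreCase(matcher.group(1)));
    }
}
```

The multiline flag is what lets the verdict appear after a preamble line, and the case-insensitive flag tolerates "answer: no". Note that the pattern has no trailing word boundary, so a reply such as "Answer: Not sure" is read as NO; appending \b to the pattern is one way to tighten that if your model produces such replies.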

Supply RAG metadata when building the context:

import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.context.ExecutionStatus;
import io.github.markpollack.judge.rag.RagContext;

import java.time.Duration;
import java.time.Instant;

JudgmentContext context = JudgmentContext.builder()
    .goal("What are the benefits of Spring Boot?")
    .status(ExecutionStatus.SUCCESS)
    .startedAt(Instant.now())
    .executionTime(Duration.ofSeconds(2))
    .metadata(RagContext.QUESTION_KEY, "What are the benefits of Spring Boot?")
    .metadata(RagContext.CONTEXT_KEY, "Spring Boot provides auto-configuration...")
    .metadata(RagContext.ANSWER_KEY, "Spring Boot simplifies configuration.")
    .build();

RagContext.question() falls back to context.goal() and RagContext.answer() falls back to context.agentOutput(), but the explicit metadata keys are the recommended convention. LangChain4j’s sources are also available as a fallback for RagContext.context().
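
The fallback described above amounts to "explicit metadata first, then the general field". A minimal sketch of that lookup order follows; RagFallbackSketch and the "rag.question" key string are hypothetical, and the real constant is RagContext.QUESTION_KEY, whose value is not shown in this tutorial.

```java
import java.util.Map;
import java.util.Optional;

class RagFallbackSketch {

    // Hypothetical key name used only in this sketch.
    static final String QUESTION_KEY = "rag.question";

    // Prefers the explicit metadata entry and falls back to the goal when it is absent.
    static String question(Map<String, Object> metadata, String goal) {
        return Optional.ofNullable(metadata.get(QUESTION_KEY))
            .map(Object::toString)
            .orElse(goal);
    }
}
```

Setting the explicit key keeps the question stable even when the goal text carries extra instructions for the agent that are not part of the user's question.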

What You Built

Judge	Approach	Score	What it checks
`timestampCheck`	Lambda	Boolean	Agent log exists
`AnnotationJudge`	`DeterministicJudge`	Boolean	File contains `@RestController`
Goal completion	`ModelBackedJudge`	Boolean	Agent accomplished its goal
`CodeQualityJudge`	`LLMJudge`	Numerical (0-10)	Code quality assessment
`CompletenessJudge`	`LLMJudge` (RAG)	Boolean	Answer addresses all parts of the question

The progression: lambda (quick checks) -> DeterministicJudge (production checks with metadata and sub-assertions) -> ModelBackedJudge (composed AI judge, no subclassing) -> LLMJudge (full control over prompt and parsing) -> RAG judge (LLMJudge with the RagContext metadata convention).

What’s Next

Built-in Judges: a catalog of built-in judges. See what’s already available before writing custom judges.

Jury System: CascadedJury for tiered evaluation, and voting strategy configuration.
