A progression of custom judges: a lambda check, a reusable deterministic judge, a composed AI judge, an LLM judge with full control, and a RAG-specific judge.
You’ll wire them into a jury alongside built-in judges.
This tutorial builds on Build an Evaluation Pipeline.
You should be comfortable with JudgmentContext, Judgment, and SimpleJury before starting.
Judge is a @FunctionalInterface with a single method: judge(JudgmentContext) -> Judgment.
Any lambda or method reference that matches this signature is a judge.
Lambda judges work well for one-off checks.
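As a rough illustration of the single-method contract, the sketch below writes one judge as a lambda and one as a method reference. The `JudgmentContext`, `Judgment`, and `Judge` types here are simplified, hypothetical stand-ins so that the example compiles on its own; the library's real types carry more fields and are built differently.

```java
import java.util.Map;

// Hypothetical stand-ins so the sketch compiles without the library on the classpath.
// The real JudgmentContext and Judgment carry more fields than shown here.
record JudgmentContext(String goal, String agentOutput, Map<String, Object> metadata) {}

record Judgment(boolean pass, String reasoning) {}

@FunctionalInterface
interface Judge {
    Judgment judge(JudgmentContext context);
}

class LambdaJudgeSketch {

    // A one-off check written as a lambda: pass when the agent produced any output.
    static final Judge NON_EMPTY_OUTPUT = context -> {
        String output = context.agentOutput();
        boolean pass = output != null && !output.isBlank();
        return new Judgment(pass, pass ? "Output is present" : "Output is empty");
    };

    // A static method with a matching signature can be used as a method reference.
    static Judgment mentionsGoal(JudgmentContext context) {
        String output = context.agentOutput();
        boolean pass = output != null
                && output.toLowerCase().contains(context.goal().toLowerCase());
        return new Judgment(pass, pass ? "Output mentions the goal" : "Output does not mention the goal");
    }

    public static void main(String[] args) {
        JudgmentContext context = new JudgmentContext("hello", "Hello, world", Map.of());
        Judge byReference = LambdaJudgeSketch::mentionsGoal;
        System.out.println(NON_EMPTY_OUTPUT.judge(context));
        System.out.println(byReference.judge(context));
    }
}
```

With the real library, the lambda body would return a `Judgment` built through the library's own builder rather than the stand-in record used here.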
For production judges, extend a base class to get metadata and structured checks.
For a judge you’ll reuse, extend DeterministicJudge.
It implements JudgeWithMetadata so infrastructure code (logging, metrics, verdict reporting) can discover the judge’s name and type.
For semantic criteria that can’t be checked from files or commands, you need an AI-backed judge.
ModelBackedJudge composes three parts — a prompt template, a model backend, and a response classifier — into a judge without subclassing.
import io.github.markpollack.judge.ai.ModelBackedJudge;
import io.github.markpollack.judge.ai.JudgePromptTemplate;
import io.github.markpollack.judge.ai.JudgmentClassifiers;
import io.github.markpollack.judge.ai.JudgeModel;

JudgePromptTemplate template = JudgePromptTemplate.fromString(
        "relevance-check",
        """
        You are evaluating whether an AI agent accomplished its goal.

        Goal: {{goal}}
        Agent output: {{output}}

        Did the agent accomplish the goal? Answer exactly PASS or FAIL.
        """);

JudgeModel model = springAiJudgeModel; // or agentClientJudgeModel

ModelBackedJudge judge = ModelBackedJudge.builder()
        .name("goal-completion")
        .description("Evaluates whether the agent accomplished its goal")
        .promptTemplate(template)
        .judgmentClassifier(JudgmentClassifiers.passFail("PASS", "FAIL"))
        .model(model)
        .build();

Judgment result = judge.judge(context);
The three components are independently swappable:
Component | Role | Examples
JudgePromptTemplate | Renders {{variable}} placeholders from JudgmentContext |
Available template variables: {{goal}}, {{output}}, {{workspace}}, {{status}}, {{metadata.*}}.
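To make the placeholder mechanism concrete, here is a minimal sketch of how `{{variable}}` substitution could work, including dotted names such as `{{metadata.ticket}}`. The `PlaceholderRenderSketch` class and its `render` method are hypothetical and written only for illustration; they are not the library's `JudgePromptTemplate` implementation, and the choice to leave unknown placeholders untouched is an assumption.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderRenderSketch {

    // Matches {{name}} and dotted names such as {{metadata.ticket}}.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\{\\{\\s*([\\w.]+)\\s*\\}\\}");

    // Replaces each known placeholder with its value.
    // Assumption: unknown placeholders are left in place rather than removed.
    static String render(String template, Map<String, String> variables) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder rendered = new StringBuilder();
        while (matcher.find()) {
            String value = variables.getOrDefault(matcher.group(1), matcher.group());
            // quoteReplacement keeps '$' and '\' in values from being read as group references.
            matcher.appendReplacement(rendered, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(rendered);
        return rendered.toString();
    }

    public static void main(String[] args) {
        String template = "Goal: {{goal}}\nAgent output: {{output}}\nTicket: {{metadata.ticket}}";
        Map<String, String> variables = Map.of(
                "goal", "Add a health endpoint",
                "output", "Created HealthController.java",
                "metadata.ticket", "DEMO-1");
        System.out.println(render(template, variables));
    }
}
```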
ModelBackedJudge lives in the agent-judge-ai-core module, which has zero external dependencies.
The actual AI backend arrives through a JudgeModel implementation from a bridge module (agent-judge-llm or agent-judge-agent-client).
You can also load templates from the classpath for reuse across judges:
LLM judges require the agent-judge-llm module and Spring AI.
Always handle parse failures gracefully — Judgment.error() prevents one broken response from crashing the jury.
Prefer deterministic judges when the criterion can be checked from files, commands, or structured metadata. Use LLM judges when the criterion is semantic and cannot be expressed reliably as code.
The WeightedAverageStrategy normalizes all scores to [0, 1] and computes a weighted average.
Build success (weight 2.0) matters more than file existence (weight 1.0), and quality (weight 1.5) falls in between.
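The arithmetic behind a weighted average can be sketched as follows, using the three weights above. The `WeightedAverageSketch` class, the judge labels, and the sample scores are hypothetical; the sketch only shows the formula sum(weight × score) / sum(weight) and does not reproduce how `WeightedAverageStrategy` treats abstentions or errors.

```java
import java.util.List;

class WeightedAverageSketch {

    // One judge's contribution: a score already normalized to [0, 1] and its weight.
    record WeightedScore(String judge, double score, double weight) {}

    // Weighted average: sum(weight * score) / sum(weight).
    static double weightedAverage(List<WeightedScore> scores) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (WeightedScore s : scores) {
            weightedSum += s.weight() * s.score();
            totalWeight += s.weight();
        }
        if (totalWeight <= 0.0) {
            throw new IllegalArgumentException("Total weight must be positive");
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        List<WeightedScore> scores = List.of(
                new WeightedScore("build", 1.0, 2.0),        // build succeeded
                new WeightedScore("file-exists", 0.0, 1.0),  // expected file missing
                new WeightedScore("quality", 0.5, 1.5));     // middling quality score
        // (2.0 * 1.0 + 1.0 * 0.0 + 1.5 * 0.5) / (2.0 + 1.0 + 1.5) = 2.75 / 4.5
        System.out.println(weightedAverage(scores));
    }
}
```

In this example a passing build outweighs the missing file: the combined score is 2.75 / 4.5, roughly 0.61, whereas an unweighted mean of the same three scores would be 0.5.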
RAG judges follow the same LLMJudge pattern but use the RagContext helper to extract the (question, context, answer) triple from metadata.
Here’s a custom judge that evaluates answer completeness — did the answer address all parts of the question?
import io.github.markpollack.judge.llm.LLMJudge;
import io.github.markpollack.judge.rag.RagContext;
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.result.Judgment;
import io.github.markpollack.judge.result.JudgmentStatus;
import io.github.markpollack.judge.score.BooleanScore;
import org.springframework.ai.chat.client.ChatClient;
import java.util.Optional;
import java.util.regex.Pattern;

public class CompletenessJudge extends LLMJudge {

    private static final Pattern ANSWER_PATTERN =
            Pattern.compile("(?mi)^\\s*Answer:\\s*(YES|NO)");

    public CompletenessJudge(ChatClient.Builder chatClientBuilder) {
        super("Completeness",
                "Evaluates whether the answer addresses all parts of the question",
                chatClientBuilder);
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        Optional<String> answer = RagContext.answer(context);
        if (answer.isEmpty()) {
            return Judgment.abstain("No answer provided");
        }
        return super.judge(context);
    }

    @Override
    protected String buildPrompt(JudgmentContext context) {
        return String.format("""
                Begin your response with the line "Answer: YES" or "Answer: NO".

                Question: %s
                Answer: %s

                Does the answer address all parts of the question completely?
                Answer YES if complete, NO if any part is missing.

                Format:
                Answer: [YES or NO]
                Reasoning: [explanation]
                """,
                RagContext.question(context),
                RagContext.answer(context).orElse(""));
    }

    @Override
    protected Judgment parseResponse(String response, JudgmentContext context) {
        var matcher = ANSWER_PATTERN.matcher(response);
        if (!matcher.find()) {
            return Judgment.abstain("Could not parse LLM response");
        }
        boolean pass = "YES".equalsIgnoreCase(matcher.group(1));
        return Judgment.builder()
                .score(new BooleanScore(pass))
                .status(pass ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
                .reasoning(response)
                .build();
    }
}
The key patterns for RAG judges:
Use RagContext.question(), RagContext.context(), RagContext.answer() to extract the triple
Return ABSTAIN when required inputs are missing — this prevents misleading verdicts
Use the (?mi)^\s*Answer:\s*(YES|NO) regex pattern for reliable LLM response parsing
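The parsing pattern can be exercised on its own with only the JDK. The sketch below reuses the same regular expression and returns an empty `Optional` when no verdict line is found, which corresponds to the case where a judge would abstain. The `AnswerPatternSketch` class and `parseVerdict` method are hypothetical names used for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnswerPatternSketch {

    // (?m) lets ^ match at the start of every line, so the verdict may follow a preamble.
    // (?i) makes the match case-insensitive.
    private static final Pattern ANSWER_PATTERN =
            Pattern.compile("(?mi)^\\s*Answer:\\s*(YES|NO)");

    // Returns the verdict, or empty when no "Answer:" line is present.
    static Optional<Boolean> parseVerdict(String response) {
        Matcher matcher = ANSWER_PATTERN.matcher(response);
        if (!matcher.find()) {
            return Optional.empty();
        }
        return Optional.of("YES".equalsIgnoreCase(matcher.group(1)));
    }

    public static void main(String[] args) {
        System.out.println(parseVerdict("Answer: YES\nReasoning: covers both parts"));
        System.out.println(parseVerdict("Sure, here is my verdict.\n  answer: no\nReasoning: part two is missing"));
        System.out.println(parseVerdict("The answer looks complete to me."));
    }
}
```

One caveat worth keeping in mind: the pattern has no trailing word boundary, so a line such as "Answer: NOT SURE" would still be read as NO. Appending `\b` after the group is one way to tighten it if that matters for your prompts.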
Supply RAG metadata when building the context:
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.context.ExecutionStatus;
import io.github.markpollack.judge.rag.RagContext;
import java.time.Duration;
import java.time.Instant;

JudgmentContext context = JudgmentContext.builder()
        .goal("What are the benefits of Spring Boot?")
        .status(ExecutionStatus.SUCCESS)
        .startedAt(Instant.now())
        .executionTime(Duration.ofSeconds(2))
        .metadata(RagContext.QUESTION_KEY, "What are the benefits of Spring Boot?")
        .metadata(RagContext.CONTEXT_KEY, "Spring Boot provides auto-configuration...")
        .metadata(RagContext.ANSWER_KEY, "Spring Boot simplifies configuration.")
        .build();
RagContext.question() falls back to context.goal() and RagContext.answer() falls back to context.agentOutput(), but the explicit metadata keys are the recommended convention. LangChain4j’s sources are also available as a fallback for RagContext.context().
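The lookup order described above (explicit metadata first, then the fallback field) can be sketched with plain `Optional` chaining. Everything in this block is a hypothetical stand-in: the key strings, the `RagFallbackSketch` class, and the method signatures are assumptions made for illustration, and the real `RagContext` may differ in return types and key names.

```java
import java.util.Map;
import java.util.Optional;

class RagFallbackSketch {

    // Hypothetical key strings; real code would use RagContext.QUESTION_KEY and RagContext.ANSWER_KEY.
    static final String QUESTION_KEY = "question";
    static final String ANSWER_KEY = "answer";

    // Explicit metadata wins; the goal is consulted only when the key is absent.
    static Optional<String> question(Map<String, Object> metadata, String goal) {
        return lookup(metadata, QUESTION_KEY).or(() -> Optional.ofNullable(goal));
    }

    // Explicit metadata wins; the agent output is consulted only when the key is absent.
    static Optional<String> answer(Map<String, Object> metadata, String agentOutput) {
        return lookup(metadata, ANSWER_KEY).or(() -> Optional.ofNullable(agentOutput));
    }

    private static Optional<String> lookup(Map<String, Object> metadata, String key) {
        return Optional.ofNullable(metadata.get(key)).map(Object::toString);
    }

    public static void main(String[] args) {
        Map<String, Object> explicit = Map.of(QUESTION_KEY, "What are the benefits of Spring Boot?");
        System.out.println(question(explicit, "some broader goal"));
        System.out.println(question(Map.of(), "some broader goal"));
        System.out.println(answer(Map.of(), null));
    }
}
```

Setting the keys explicitly keeps a judge's inputs independent of how the goal or agent output happens to be phrased, which is presumably why the explicit convention is recommended.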
The progression: lambda (quick checks) -> DeterministicJudge (production checks with metadata and sub-assertions) -> ModelBackedJudge (composed AI judge, no subclassing) -> LLMJudge (full control over prompt and parsing) -> RAG judge (LLMJudge with the RagContext metadata convention).