This tutorial builds an evaluation pipeline that verifies an AI agent’s modifications to a Maven project.
By the end you’ll have three judges — file existence, build success, and content validation — combined into a jury with majority voting.
Every evaluation starts with a JudgmentContext — it describes what the agent was asked to do and where it worked.
```java
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.context.ExecutionStatus;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

Path workspace = Path.of("/path/to/agent/workspace");

JudgmentContext context = JudgmentContext.builder()
    .goal("Add a REST controller with a /hello endpoint")
    .workspace(workspace)
    .status(ExecutionStatus.SUCCESS)
    .startedAt(Instant.now())
    .executionTime(Duration.ofMinutes(2))
    .build();
```
The context is immutable and shared across all judges.
It carries the agent’s goal, workspace path, execution status, and timing — everything a judge needs to evaluate the result without knowing which agent produced it.
BuildSuccessJudge.maven() auto-detects the ./mvnw wrapper in the workspace directory.
It runs the specified goals and checks the exit code — zero means pass.
Build judges execute real processes.
The default timeout is 10 minutes.
Make sure the workspace has a valid Maven project before running.
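To make the "run a process, check the exit code, respect a timeout" idea concrete, here is a minimal sketch written against the JDK only. It is not the library's implementation: the class `ExitCodeCheck` and its `passes` method are hypothetical names, and the demo launches the running JVM's own `java` binary instead of Maven so it works without a project on disk.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ExitCodeCheck {

    /**
     * Hypothetical helper: runs a command in workDir and returns true only if
     * it exits with code 0 before the timeout. Start failures count as a fail.
     */
    static boolean passes(List<String> command, Path workDir, Duration timeout) {
        try {
            Process process = new ProcessBuilder(command)
                    .directory(workDir.toFile())
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                    .start();
            if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                process.destroyForcibly(); // timed out: kill the process and fail
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            return false; // command could not be started
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Use the current JVM's launcher so the demo needs no Maven project.
        String javaBin = ProcessHandle.current().info().command().orElse("java");
        Path here = Path.of(".");
        System.out.println(passes(List.of(javaBin, "-version"), here, Duration.ofSeconds(30)));
        System.out.println(passes(List.of(javaBin, "--definitely-not-a-flag"), here, Duration.ofSeconds(30)));
    }
}
```

In a real workspace the command would be the Maven wrapper plus the goals, and the timeout would be the judge's configured limit.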
Judges.and() short-circuits — if the file doesn’t exist, the build never runs.
This is useful when one check is a precondition for another.

Other composition operators:

```java
import io.github.markpollack.judge.fs.FileContentJudge;

Judge contentJudge = new FileContentJudge(
    "src/main/java/com/example/HelloController.java",
    "@RestController",
    FileContentJudge.MatchMode.CONTAINS);

// OR: pass if either judge passes
Judge fallback = Judges.or(fileJudge, buildJudge);

// All must pass (variadic AND)
Judge all = Judges.allOf(fileJudge, buildJudge, contentJudge);

// Any can pass (variadic OR)
Judge any = Judges.anyOf(fileJudge, buildJudge, contentJudge);
```
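The short-circuit behaviour can be illustrated without the library. The sketch below uses plain `BooleanSupplier` values as hypothetical stand-ins for judges and records which checks actually ran. Only the short-circuiting of `Judges.and()` is stated in this tutorial; the early exit shown for OR is an assumption made for symmetry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

public class ShortCircuitDemo {

    /** Hypothetical stand-in for a judge: a named check that logs when it runs. */
    static BooleanSupplier check(String name, boolean result, List<String> log) {
        return () -> {
            log.add(name);
            return result;
        };
    }

    /** AND: stops at the first failing check, so later checks never run. */
    static boolean allOf(BooleanSupplier... checks) {
        for (BooleanSupplier c : checks) {
            if (!c.getAsBoolean()) {
                return false;
            }
        }
        return true;
    }

    /** OR: stops at the first passing check (assumed behaviour, see lead-in). */
    static boolean anyOf(BooleanSupplier... checks) {
        for (BooleanSupplier c : checks) {
            if (c.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean verdict = allOf(
                check("file-exists", false, log),
                check("build-success", true, log));
        // The build check is skipped because the file check already failed.
        System.out.println(verdict + " ran=" + log); // prints "false ran=[file-exists]"
    }
}
```

Ordering matters with this design: putting the cheapest precondition first means an expensive build is skipped whenever the precondition fails.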
Deterministic judges handle objective criteria — did it compile, does the file exist?
For subjective evaluation — is the code well-structured, does it follow conventions? — add an LLM judge.
```java
import io.github.markpollack.judge.llm.CorrectnessJudge;
import org.springframework.ai.chat.client.ChatClient;

// Requires agent-judge-llm + Spring AI dependency
ChatClient.Builder chatClientBuilder = /* configured Spring AI ChatClient.Builder */;
CorrectnessJudge llmJudge = new CorrectnessJudge(chatClientBuilder);

SimpleJury jury = SimpleJury.builder()
    .judge(fileExists, 1.0)
    .judge(buildSucceeds, 2.0)
    .judge(contentValid, 1.0)
    .judge(Judges.named(llmJudge, "correctness", "LLM evaluates goal completion"), 1.5)
    .votingStrategy(new MajorityVotingStrategy())
    .parallel(true)
    .build();
```
The CorrectnessJudge sends the goal and agent output to an LLM and asks whether the agent accomplished its task.
It costs tokens — but combined with free deterministic judges, you get both speed and depth.
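The jury above attaches a weight to each judge. This tutorial does not spell out how `MajorityVotingStrategy` uses those weights, so the following is only one plausible reading, sketched with hypothetical types: the jury passes when the passing judges hold strictly more than half of the total weight.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedMajorityDemo {

    /** Hypothetical vote: whether the judge passed, and its weight in the jury. */
    record Vote(boolean passed, double weight) {}

    /** Assumed rule: pass when passing weight is strictly more than half the total. */
    static boolean weightedMajority(Map<String, Vote> votes) {
        double total = 0;
        double passing = 0;
        for (Vote v : votes.values()) {
            total += v.weight();
            if (v.passed()) {
                passing += v.weight();
            }
        }
        return passing > total / 2.0;
    }

    public static void main(String[] args) {
        Map<String, Vote> votes = new LinkedHashMap<>();
        votes.put("file-exists", new Vote(true, 1.0));
        votes.put("build-success", new Vote(false, 2.0));
        votes.put("content-valid", new Vote(true, 1.0));
        votes.put("correctness", new Vote(true, 1.5));
        // Passing weight is 3.5 out of 5.5, which exceeds the 2.75 threshold.
        System.out.println(weightedMajority(votes)); // prints "true"
    }
}
```

Under this reading a failed build alone does not sink the verdict, which is worth knowing when choosing weights. Check the library's voting strategy documentation for the actual rule.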
LLM judges require the agent-judge-llm module, Spring AI on the classpath, and a valid API key.
They are significantly slower and more expensive than deterministic judges.
Use them for criteria that can’t be checked structurally.
CorrectnessJudge extends LLMJudge, which uses Spring AI directly.
For a framework-neutral alternative, ModelBackedJudge from agent-judge-ai-core composes a prompt template, model backend, and classifier without subclassing.
See Writing Custom Judges — ModelBackedJudge for details.
You started with a single file-existence check and built up to a weighted jury with four judges spanning three cost tiers:
| Judge | Type | Cost | What it checks |
| --- | --- | --- | --- |
| FileExistsJudge | Deterministic | Free | File was created |
| BuildSuccessJudge | Command | Compute | Project compiles |
| FileContentJudge | Deterministic | Free | File contains expected content |
| CorrectnessJudge | LLM | Tokens | Agent achieved the goal |
This is the core pattern: use cheap judges to catch obvious failures, and reserve expensive judges for semantic confirmation.
In production, formalize this with a CascadedJury that runs cheap tiers first and stops early when they already have a verdict.
The tutorial uses SimpleJury with parallel execution for readability. Production pipelines often use CascadedJury to avoid running LLM judges when deterministic checks already fail.
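The cascade idea, cheap tiers first with an early stop, can be sketched in a few lines. This is not `CascadedJury` itself: `Tier` and `evaluate` are hypothetical names, and the sketch only stops early on failure, which is one simple interpretation of "already have a verdict".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

public class CascadeDemo {

    /** Hypothetical tier: a label plus the checks that run at that cost level. */
    record Tier(String name, List<BooleanSupplier> checks) {}

    /** Runs tiers in order and stops at the first tier containing a failing check. */
    static String evaluate(List<Tier> tiers, List<String> executed) {
        for (Tier tier : tiers) {
            executed.add(tier.name());
            for (BooleanSupplier check : tier.checks()) {
                if (!check.getAsBoolean()) {
                    return "FAIL at " + tier.name(); // later, costlier tiers never run
                }
            }
        }
        return "PASS";
    }

    public static void main(String[] args) {
        List<String> executed = new ArrayList<>();
        List<Tier> tiers = List.of(
                new Tier("deterministic", List.<BooleanSupplier>of(() -> true, () -> false)),
                new Tier("llm", List.<BooleanSupplier>of(() -> true)));
        // Prints "FAIL at deterministic executed=[deterministic]": the LLM tier is skipped.
        System.out.println(evaluate(tiers, executed) + " executed=" + executed);
    }
}
```

The saving comes from never reaching the token-billed tier when a free check has already failed.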
The tutorial above builds JudgmentContext manually.
When you’re evaluating output from a specific framework, the bridge modules do this for you.

Here’s the same evaluation pipeline, but with the context built automatically from a Koog agent:

```java
import io.github.markpollack.judge.Judge;
import io.github.markpollack.judge.Judges;
import io.github.markpollack.judge.fs.FileExistsJudge;
import io.github.markpollack.judge.exec.BuildSuccessJudge;
import io.github.markpollack.judge.koog.KoogEvaluator;
import io.github.markpollack.judge.jury.SimpleJury;
import io.github.markpollack.judge.jury.MajorityVotingStrategy;
import io.github.markpollack.judge.jury.Verdict;
import ai.koog.agents.core.agent.AIAgent;

AIAgent<String, String> agent = /* configured Koog agent */;

SimpleJury jury = SimpleJury.builder()
    .judge(Judges.named(
        new FileExistsJudge("src/main/java/com/example/HelloController.java"),
        "file-exists", "Controller created"), 1.0)
    .judge(Judges.named(
        BuildSuccessJudge.maven("clean", "compile"),
        "build-success", "Project compiles"), 2.0)
    .votingStrategy(new MajorityVotingStrategy())
    .parallel(true)
    .build();

// KoogEvaluator runs the agent and evaluates its output in one call
Verdict verdict = KoogEvaluator.evaluate(
    agent, "Add a REST controller with a /hello endpoint", jury);

System.out.println(verdict.aggregated().status());
```
Bridge evaluators either adapt an existing framework response or wrap a framework call/supplier that produces one. The judges and jury don’t change — swap KoogEvaluator for SpringAiEvaluator, LangChain4jEvaluator, or AgentClientEvaluator.