Tutorial: Build an Evaluation Pipeline

What You’ll Build

An evaluation pipeline that verifies an AI agent’s modifications to a Maven project. By the end you’ll have three judges — file existence, build success, and content validation — combined into a jury with majority voting.

Prerequisites

Java 21+
A Maven project directory to evaluate (any Spring Boot starter project works)
agent-judge-core and agent-judge-exec on the classpath (see setup)
Optional: agent-judge-llm for Step 6
Optional: agent-judge-koog for the bridge example

Step 1: Create the Evaluation Context

Every evaluation starts with a JudgmentContext — it describes what the agent was asked to do and where it worked.

import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.context.ExecutionStatus;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

Path workspace = Path.of("/path/to/agent/workspace");

JudgmentContext context = JudgmentContext.builder()
    .goal("Add a REST controller with a /hello endpoint")
    .workspace(workspace)
    .status(ExecutionStatus.SUCCESS)
    .startedAt(Instant.now())
    .executionTime(Duration.ofMinutes(2))
    .build();

The context is immutable and shared across all judges. It carries the agent’s goal, workspace path, execution status, and timing — everything a judge needs to evaluate the result without knowing which agent produced it.

Step 2: Evaluate with a Single Judge

Start with the simplest possible check — does a file exist?

import io.github.markpollack.judge.fs.FileExistsJudge;
import io.github.markpollack.judge.result.Judgment;
import io.github.markpollack.judge.Judge;

Judge fileJudge = new FileExistsJudge("src/main/java/com/example/HelloController.java");
Judgment result = fileJudge.judge(context);

System.out.println("Status:    " + result.status());
System.out.println("Score:     " + result.score());
System.out.println("Reasoning: " + result.reasoning());

Output when the file exists:

Status:    PASS
Score:     BooleanScore[value=true]
Reasoning: File src/main/java/com/example/HelloController.java exists
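The path handed to FileExistsJudge is relative, so it presumably resolves against the workspace carried by the context. Here is a minimal sketch of that kind of check using only java.nio, under that assumption. FileCheckSketch and fileExists are hypothetical names for illustration, not part of agent-judge.

```java
import java.nio.file.Files;
import java.nio.file.Path;

class FileCheckSketch {
    // Resolve the relative path against the workspace, then test for a regular file.
    // Assumption: this mirrors how a file-existence judge locates its target.
    static boolean fileExists(Path workspace, String relativePath) {
        Path target = workspace.resolve(relativePath).normalize();
        return Files.isRegularFile(target);
    }

    public static void main(String[] args) {
        Path workspace = Path.of("/path/to/agent/workspace");
        System.out.println(fileExists(workspace,
                "src/main/java/com/example/HelloController.java"));
    }
}
```

Keeping the path relative is what lets the same judge instance be reused across different agent workspaces.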

Every Judgment contains:

score — BooleanScore, NumericalScore, or CategoricalScore
status — PASS, FAIL, ABSTAIN, or ERROR
reasoning — human-readable explanation
checks — granular sub-assertions (useful for complex judges)
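Because a score can be one of three kinds, code that reports on judgments usually has to branch on the kind. The sketch below shows one way to do that with a sealed hierarchy. The Demo types are hypothetical stand-ins, and the min/max fields on the numerical score are an assumption; check the library's own score classes before relying on any of these shapes.

```java
class ScoreSketch {
    // Hypothetical stand-ins for the three score kinds; not the library's own types.
    sealed interface DemoScore permits DemoBoolean, DemoNumerical, DemoCategorical {}
    record DemoBoolean(boolean value) implements DemoScore {}
    record DemoNumerical(double value, double min, double max) implements DemoScore {}
    record DemoCategorical(String value) implements DemoScore {}

    // Render any score kind as a short label for a report line.
    static String describe(DemoScore score) {
        if (score instanceof DemoBoolean b) {
            return b.value() ? "pass" : "fail";
        }
        if (score instanceof DemoNumerical n) {
            // Normalize to a percentage of the score's range.
            long percent = Math.round(100.0 * (n.value() - n.min()) / (n.max() - n.min()));
            return percent + "%";
        }
        if (score instanceof DemoCategorical c) {
            return "category " + c.value();
        }
        throw new IllegalArgumentException("unknown score kind");
    }

    public static void main(String[] args) {
        System.out.println(describe(new DemoBoolean(true)));              // prints: pass
        System.out.println(describe(new DemoNumerical(7.5, 0.0, 10.0)));  // prints: 75%
        System.out.println(describe(new DemoCategorical("good")));        // prints: category good
    }
}
```

On Java 21 the same branching can be written as a switch with type patterns, which the compiler can check for exhaustiveness.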

Step 3: Add a Build Judge

File existence is necessary but not sufficient — the code also needs to compile. Add a command judge that runs the Maven build:

import io.github.markpollack.judge.exec.BuildSuccessJudge;

Judge buildJudge = BuildSuccessJudge.maven("clean", "compile");
Judgment buildResult = buildJudge.judge(context);

System.out.println("Build: " + buildResult.status());
System.out.println(buildResult.reasoning());

BuildSuccessJudge.maven() auto-detects the ./mvnw wrapper in the workspace directory. It runs the specified goals and checks the exit code — zero means pass.

Build judges execute real processes. The default timeout is 10 minutes. Make sure the workspace has a valid Maven project before running.
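To make the mechanics concrete, here is a rough sketch of what a build check of this kind involves: pick the wrapper if the workspace has one, run the goals as a child process, enforce a timeout, and pass only on exit code zero. It uses only the JDK. BuildCheckSketch and its methods are hypothetical, and the real BuildSuccessJudge may differ in its detection and timeout handling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

class BuildCheckSketch {
    // Prefer the project's wrapper script when present; otherwise fall back to mvn on the PATH.
    // Unix-only simplification: Windows projects ship mvnw.cmd instead.
    static String mavenExecutable(Path workspace) {
        Path wrapper = workspace.resolve("mvnw");
        return Files.isRegularFile(wrapper) ? wrapper.toAbsolutePath().toString() : "mvn";
    }

    // Run the goals inside the workspace; pass only if the process exits with 0 in time.
    static boolean buildPasses(Path workspace, Duration timeout, String... goals)
            throws IOException, InterruptedException {
        List<String> command = new ArrayList<>();
        command.add(mavenExecutable(workspace));
        command.addAll(List.of(goals));
        Process process = new ProcessBuilder(command)
                .directory(workspace.toFile())
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                .start();
        if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
            process.destroyForcibly(); // timed out: count it as a failed build
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) throws Exception {
        Path workspace = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println("Would run: " + mavenExecutable(workspace) + " clean compile");
        // Uncomment to run a real build with a 10-minute limit:
        // System.out.println(buildPasses(workspace, Duration.ofMinutes(10), "clean", "compile"));
    }
}
```

The timeout matters because a hung build would otherwise stall every judge queued behind it.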

Step 4: Compose with Judges.and()

Before reaching for a jury, you can compose judges with simple boolean logic:

import io.github.markpollack.judge.Judges;

Judge combined = Judges.and(fileJudge, buildJudge);
Judgment combinedResult = combined.judge(context);

System.out.println("Combined: " + combinedResult.status());

Judges.and() short-circuits — if the file doesn’t exist, the build never runs. This is useful when one check is a precondition for another. Other composition operators:

import io.github.markpollack.judge.fs.FileContentJudge;

Judge contentJudge = new FileContentJudge(
    "src/main/java/com/example/HelloController.java",
    "@RestController", FileContentJudge.MatchMode.CONTAINS);

// OR: pass if either judge passes
Judge fallback = Judges.or(fileJudge, buildJudge);

// All must pass (variadic AND)
Judge all = Judges.allOf(fileJudge, buildJudge, contentJudge);

// Any can pass (variadic OR)
Judge any = Judges.anyOf(fileJudge, buildJudge, contentJudge);

Step 5: Build a Jury

When you need more than boolean composition — weighted scoring, named results, parallel execution — use a SimpleJury:

import io.github.markpollack.judge.jury.SimpleJury;
import io.github.markpollack.judge.jury.MajorityVotingStrategy;
import io.github.markpollack.judge.jury.Verdict;
import io.github.markpollack.judge.fs.FileContentJudge;

// Three named judges with weights
Judge fileExists = Judges.named(
    new FileExistsJudge("src/main/java/com/example/HelloController.java"),
    "file-exists", "Controller file created");

Judge buildSucceeds = Judges.named(
    BuildSuccessJudge.maven("clean", "compile"),
    "build-success", "Project compiles");

Judge contentValid = Judges.named(
    new FileContentJudge("src/main/java/com/example/HelloController.java",
        "@RestController", FileContentJudge.MatchMode.CONTAINS),
    "has-annotation", "Uses @RestController");

SimpleJury jury = SimpleJury.builder()
    .judge(fileExists, 1.0)
    .judge(buildSucceeds, 2.0)    // Build is weighted 2x
    .judge(contentValid, 1.0)
    .votingStrategy(new MajorityVotingStrategy())
    .parallel(true)
    .build();

Verdict verdict = jury.vote(context);

Now inspect the verdict:

// Aggregated result
System.out.println("Overall: " + verdict.aggregated().status());
System.out.println("Reason:  " + verdict.aggregated().reasoning());

// Individual results by judge name
verdict.individualByName().forEach((name, judgment) ->
    System.out.printf("  %-15s %s  %s%n",
        name, judgment.status(), judgment.reasoning())
);

// Weights used
System.out.println("Weights: " + verdict.weights());

Example output:

Overall: PASS
Reason:  Majority passed (3/3)
  file-exists     PASS  File src/main/java/com/example/HelloController.java exists
  build-success   PASS  Build succeeded (exit code 0)
  has-annotation  PASS  File contains "@RestController"
Weights: {file-exists=1.0, build-success=2.0, has-annotation=1.0}

Judges.named() wraps any judge with a name and description. Without it, judges get auto-generated names, making the verdict harder to read.
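This tutorial does not spell out how MajorityVotingStrategy treats the weights, so the sketch below shows two plausible readings side by side: a plain head count, and a variant where passing weight must exceed half the total weight. Both methods and the tie-breaking rule (a tie fails) are assumptions for illustration, not the library's implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class VotingSketch {
    // Plain majority: strictly more than half of the judges passed.
    static boolean majority(Map<String, Boolean> votes) {
        long passed = votes.values().stream().filter(Boolean::booleanValue).count();
        return passed * 2 > votes.size();
    }

    // Weighted variant: passing weight must exceed half of the total weight.
    // Judges without an explicit weight count as 1.0.
    static boolean weightedMajority(Map<String, Boolean> votes, Map<String, Double> weights) {
        double total = 0.0;
        double passing = 0.0;
        for (Map.Entry<String, Boolean> vote : votes.entrySet()) {
            double weight = weights.getOrDefault(vote.getKey(), 1.0);
            total += weight;
            if (vote.getValue()) {
                passing += weight;
            }
        }
        return passing * 2 > total;
    }

    public static void main(String[] args) {
        Map<String, Boolean> votes = new LinkedHashMap<>();
        votes.put("file-exists", true);
        votes.put("build-success", false); // the 2x-weighted judge fails
        votes.put("has-annotation", true);
        Map<String, Double> weights =
                Map.of("file-exists", 1.0, "build-success", 2.0, "has-annotation", 1.0);

        System.out.println("plain=" + majority(votes)
                + " weighted=" + weightedMajority(votes, weights));
        // prints: plain=true weighted=false
    }
}
```

The example is chosen so the two readings disagree: two of three judges pass, but the failing build carries half the total weight. If that distinction matters for your pipeline, confirm the strategy's behavior against the Jury System documentation.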

Step 6: Add an LLM Judge (Optional)

Deterministic judges handle objective criteria: did it compile, does the file exist? For subjective criteria (is the code well structured, does it follow conventions?), add an LLM judge.

import io.github.markpollack.judge.llm.CorrectnessJudge;
import org.springframework.ai.chat.client.ChatClient;

// Requires agent-judge-llm + Spring AI dependency
ChatClient.Builder chatClientBuilder = /* configured Spring AI ChatClient.Builder */;
CorrectnessJudge llmJudge = new CorrectnessJudge(chatClientBuilder);

SimpleJury juryWithLlm = SimpleJury.builder()
    .judge(fileExists, 1.0)
    .judge(buildSucceeds, 2.0)
    .judge(contentValid, 1.0)
    .judge(Judges.named(llmJudge, "correctness", "LLM evaluates goal completion"), 1.5)
    .votingStrategy(new MajorityVotingStrategy())
    .parallel(true)
    .build();

The CorrectnessJudge sends the goal and agent output to an LLM and asks whether the agent accomplished its task. It costs tokens — but combined with free deterministic judges, you get both speed and depth.

LLM judges require the agent-judge-llm module, Spring AI on the classpath, and a valid API key. They are significantly slower and more expensive than deterministic judges. Use them for criteria that can’t be checked structurally.

CorrectnessJudge extends LLMJudge, which uses Spring AI directly. For a framework-neutral alternative, ModelBackedJudge from agent-judge-ai-core composes a prompt template, model backend, and classifier without subclassing. See Writing Custom Judges — ModelBackedJudge for details.

What You Built

You started with a single file-existence check and built up to a weighted jury with four judges spanning three cost tiers:

Judge	Type	Cost	What it checks
FileExistsJudge	Deterministic	Free	File was created
BuildSuccessJudge	Command	Compute	Project compiles
FileContentJudge	Deterministic	Free	File contains expected content
CorrectnessJudge	LLM	Tokens	Agent achieved the goal

This is the core pattern: use cheap judges to catch obvious failures, and reserve expensive judges for semantic confirmation. In production, formalize this with a CascadedJury that runs cheap tiers first and stops early when they already have a verdict.

This tutorial uses SimpleJury with parallel execution for readability, which means the LLM judge runs even when a deterministic check has already failed. Production pipelines often use CascadedJury to skip that cost.
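The cascading idea is easy to see in plain Java. In the sketch below, each tier either returns a verdict or defers to the next, more expensive tier, and a counter shows that the simulated LLM tier never runs once the cheap tier has failed. CascadeSketch and Tier are hypothetical names, and the real CascadedJury is configured through its own API, which this sketch does not attempt to reproduce.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

class CascadeSketch {
    // A tier either decides (present) or defers to the next tier (empty).
    interface Tier {
        Optional<Boolean> evaluate();
    }

    // Run tiers in order and stop at the first one that reaches a verdict.
    static boolean cascade(boolean fallback, Tier... tiers) {
        for (Tier tier : tiers) {
            Optional<Boolean> verdict = tier.evaluate();
            if (verdict.isPresent()) {
                return verdict.get();
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        AtomicInteger llmCalls = new AtomicInteger();
        boolean fileExists = false; // pretend the deterministic check failed

        // Cheap tier: a failure is final, a pass defers to the next tier.
        Tier deterministic = () -> fileExists ? Optional.empty() : Optional.of(false);
        // Expensive tier: stands in for an LLM call.
        Tier llm = () -> { llmCalls.incrementAndGet(); return Optional.of(true); };

        boolean passed = cascade(false, deterministic, llm);
        System.out.println(passed + " llmCalls=" + llmCalls.get()); // prints: false llmCalls=0
    }
}
```

Letting a cheap pass defer rather than decide is what keeps the expensive tier in the loop for semantic confirmation while still sparing it on obvious failures.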

Bonus: Evaluate a Framework Agent

The tutorial above builds JudgmentContext manually. When you’re evaluating output from a specific framework, the bridge modules do this for you. Here’s the same evaluation pipeline, but with the context built automatically from a Koog agent:

import io.github.markpollack.judge.Judge;
import io.github.markpollack.judge.Judges;
import io.github.markpollack.judge.fs.FileExistsJudge;
import io.github.markpollack.judge.exec.BuildSuccessJudge;
import io.github.markpollack.judge.koog.KoogEvaluator;
import io.github.markpollack.judge.jury.SimpleJury;
import io.github.markpollack.judge.jury.MajorityVotingStrategy;
import io.github.markpollack.judge.jury.Verdict;
import ai.koog.agents.core.agent.AIAgent;

AIAgent<String, String> agent = /* configured Koog agent */;

SimpleJury jury = SimpleJury.builder()
    .judge(Judges.named(
        new FileExistsJudge("src/main/java/com/example/HelloController.java"),
        "file-exists", "Controller created"), 1.0)
    .judge(Judges.named(
        BuildSuccessJudge.maven("clean", "compile"),
        "build-success", "Project compiles"), 2.0)
    .votingStrategy(new MajorityVotingStrategy())
    .parallel(true)
    .build();

// KoogEvaluator runs the agent and evaluates its output in one call
Verdict verdict = KoogEvaluator.evaluate(
    agent, "Add a REST controller with a /hello endpoint", jury);

System.out.println(verdict.aggregated().status());

Bridge evaluators either adapt an existing framework response or wrap a framework call (or supplier) that produces one. The judges and jury don’t change. To target a different framework, swap KoogEvaluator for SpringAiEvaluator, LangChain4jEvaluator, or AgentClientEvaluator.

Runnable Code

Every step in this tutorial has a corresponding runnable module in the agent-judge-tutorial repository. Clone it and run any module with ./mvnw exec:java -pl module-NN-name.

What’s Next

Writing Custom Judges

Jury System