Getting Started with Agent Judge

What is Agent Judge?

Agent Judge is an evaluation framework for AI agent output. It provides deterministic rules, command execution checks, file comparison, RAG evaluation, and LLM-powered assessment, all of which compose into juries with configurable voting strategies. Think of judges as unit tests for your agent: executable checks that decide whether an agent’s output satisfies a goal. You wouldn’t ship application code without tests or assertions, and agents need the same discipline.

The core module has zero external dependencies. Framework bridge modules let you evaluate output from Spring AI, LangChain4j, Koog, and CLI-delegated agents (via AgentClient), and the same judges and juries work across all of them.

License

Agent Judge is licensed under BSL 1.1. Internal enterprise use is welcome. Commercial redistribution requires a separate agreement — see the LICENSE file for details.

Prerequisites

Java 21+
Maven 3.9+ (or Gradle 8+)
For LLM judges: Spring AI and an API key (Anthropic, OpenAI, etc.)

Add the Dependency

Start with the core module, then add only the modules you need:

<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>agent-judge-core</artifactId>
    <version>0.11.0</version>
</dependency>

Optional modules — add as needed:

| Module | Artifact | What it adds |
| --- | --- | --- |
| Exec | `agent-judge-exec` | Build, shell, and coverage judges |
| File | `agent-judge-file` | AST, POM, XML, and text comparison |
| LLM | `agent-judge-llm` | LLM-powered judges (requires Spring AI) |
| RAG | `agent-judge-rag` | Faithfulness, hallucination, relevance |
| Spring AI bridge | `agent-judge-spring-ai` | Evaluates `ChatResponse` output |
| LangChain4j bridge | `agent-judge-langchain4j` | Evaluates `Result<T>` output |
| Koog bridge | `agent-judge-koog` | Evaluates `AIAgent` output |
| AgentClient bridge | `agent-judge-agent-client` | Evaluates CLI-agent output |

All modules share the same groupId (io.github.markpollack) and version.

Your First Judge

Check whether a file exists in an agent’s workspace:

import io.github.markpollack.judge.Judge;
import io.github.markpollack.judge.fs.FileExistsJudge;
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.context.ExecutionStatus;
import io.github.markpollack.judge.result.Judgment;

import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

Judge judge = new FileExistsJudge("README.md");

JudgmentContext context = JudgmentContext.builder()
    .goal("Create a README")
    .workspace(Path.of("/my/project"))
    .status(ExecutionStatus.SUCCESS)
    .startedAt(Instant.now())
    .executionTime(Duration.ofSeconds(30))
    .build();

Judgment result = judge.judge(context);

System.out.println(result.status());    // PASS or FAIL
System.out.println(result.reasoning()); // "File README.md exists" or "File README.md not found"

Every judge takes a JudgmentContext (what the agent was asked to do and where it worked) and returns a Judgment (score, status, reasoning, and granular checks).
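
To make that contract concrete, here is a minimal, self-contained sketch of the idea using only the JDK. The types Status, Context, Result, and Check are hypothetical stand-ins invented for illustration. They are not Agent Judge classes, and the library's real Judge, JudgmentContext, and Judgment carry more information (score, granular checks, execution metadata).

```java
import java.nio.file.Files;
import java.nio.file.Path;

class JudgeContractSketch {

    // Hypothetical stand-ins for illustration only; not Agent Judge types.
    enum Status { PASS, FAIL }
    record Context(String goal, Path workspace) {}
    record Result(Status status, String reasoning) {}

    // A check is a function from "what the agent did" to "how it went".
    @FunctionalInterface
    interface Check {
        Result judge(Context context);
    }

    // Builds a check that passes when the given path exists under the workspace.
    static Check fileExists(String relativePath) {
        return context -> {
            Path target = context.workspace().resolve(relativePath);
            return Files.exists(target)
                    ? new Result(Status.PASS, "File " + relativePath + " exists")
                    : new Result(Status.FAIL, "File " + relativePath + " not found");
        };
    }

    public static void main(String[] args) {
        Context context = new Context("Create a README", Path.of("/my/project"));
        Result result = fileExists("README.md").judge(context);
        System.out.println(result.status() + ": " + result.reasoning());
    }
}
```

The point of the sketch is the shape: a check is a plain function over a context, which is what lets independent checks be combined later.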

Add a Build Judge

Verify the project still compiles after the agent modified it:

import io.github.markpollack.judge.exec.BuildSuccessJudge;

Judge buildJudge = BuildSuccessJudge.maven("clean", "compile");
Judgment result = buildJudge.judge(context);

System.out.println(result.status());    // PASS if exit code 0
System.out.println(result.reasoning()); // Build output summary

BuildSuccessJudge.maven() auto-detects the ./mvnw wrapper. Use BuildSuccessJudge.gradle() for Gradle projects.

Command judges require the agent-judge-exec module. They run real processes in the workspace directory.
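
Conceptually, a command check of this kind starts a process in the workspace directory and inspects its exit code. The JDK-only sketch below shows that idea and nothing more. CommandCheckSketch and exitsCleanly are hypothetical names, and this is not how BuildSuccessJudge is implemented. A real judge would also need to capture build output for its reasoning and bound the run time.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.List;

class CommandCheckSketch {

    // Hypothetical helper: runs the command in the workspace and
    // returns true only when the process exits with code 0.
    static boolean exitsCleanly(Path workspace, List<String> command) {
        try {
            Process process = new ProcessBuilder(command)
                    .directory(workspace.toFile())
                    // Merge stderr into stdout, then discard both so the
                    // child cannot block on a full output pipe.
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            return process.waitFor() == 0;
        } catch (IOException e) {
            // The command could not be started (for example, not on the PATH).
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Path workspace = Path.of("").toAbsolutePath();
        // Use the running JVM's own launcher so the example needs no build tool.
        String java = Path.of(System.getProperty("java.home"), "bin", "java").toString();
        System.out.println(exitsCleanly(workspace, List.of(java, "-version")));
    }
}
```

Because such checks execute real commands, run them only against workspaces and command lines you trust.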

Combine into a Jury

Run multiple judges together and aggregate results with a voting strategy:

import io.github.markpollack.judge.jury.SimpleJury;
import io.github.markpollack.judge.jury.MajorityVotingStrategy;
import io.github.markpollack.judge.jury.Verdict;

SimpleJury jury = SimpleJury.builder()
    .judge(new FileExistsJudge("README.md"))
    .judge(BuildSuccessJudge.maven("compile"))
    .judge(new FileExistsJudge("src/main/java"))
    .votingStrategy(new MajorityVotingStrategy())
    .parallel(true)
    .build();

Verdict verdict = jury.vote(context);

System.out.println(verdict.aggregated().status()); // PASS (majority wins)

// Inspect individual results
verdict.individualByName().forEach((name, judgment) ->
    System.out.println(name + " -> " + judgment.status())
);
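
As a rough illustration of what majority voting means, the sketch below aggregates named pass/fail votes using only the JDK. Status and aggregate are hypothetical stand-ins, not the library's MajorityVotingStrategy. In particular, treating a tie as FAIL is an assumption of this sketch, so consult the Jury System page for the actual tie-handling rules.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MajorityVoteSketch {

    // Hypothetical stand-in for a judgment status; not the library's type.
    enum Status { PASS, FAIL }

    // Returns PASS when strictly more than half of the votes are PASS.
    // Ties and empty juries fail; that choice is an assumption of this sketch.
    static Status aggregate(Map<String, Status> votes) {
        long passes = votes.values().stream()
                .filter(status -> status == Status.PASS)
                .count();
        return passes * 2 > votes.size() ? Status.PASS : Status.FAIL;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the judges in insertion order when printing.
        Map<String, Status> votes = new LinkedHashMap<>();
        votes.put("readme-exists", Status.PASS);
        votes.put("build-succeeds", Status.FAIL);
        votes.put("src-exists", Status.PASS);

        votes.forEach((name, status) -> System.out.println(name + " -> " + status));
        System.out.println("aggregated -> " + aggregate(votes));
    }
}
```

With two of three votes passing, the aggregate is PASS even though one individual check failed, which is why inspecting the individual results remains useful.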

Evaluate Framework Output

Framework bridge modules convert agent output into a JudgmentContext automatically, so the same judges and juries work regardless of which framework produced the output.

| Runtime | Input type | Bridge |
| --- | --- | --- |
| Spring AI | `ChatResponse` | `SpringAiEvaluator` |
| LangChain4j | `Result<T>` | `LangChain4jEvaluator` |
| Koog | `AIAgent` | `KoogEvaluator` |
| AgentClient | `AgentClientResponse` | `AgentClientEvaluator` |

Bridge modules do not pull in framework runtimes transitively. Your build must declare the Spring AI, LangChain4j, Koog, or AgentClient dependency that your application already uses.

Spring AI

import io.github.markpollack.judge.springai.SpringAiEvaluator;

Judgment result = SpringAiEvaluator.evaluate(
    "Summarize the document",
    () -> chatClient.prompt().user(prompt).call().chatResponse(),
    new FileExistsJudge("summary.md"));

LangChain4j

import io.github.markpollack.judge.langchain4j.LangChain4jEvaluator;

Judgment result = LangChain4jEvaluator.evaluate(
    "Summarize the document",
    goal -> assistant.chat(goal),
    new FileExistsJudge("summary.md"));

Koog

import io.github.markpollack.judge.koog.KoogEvaluator;

Judgment result = KoogEvaluator.evaluate(
    agent, "Summarize the document",
    new FileExistsJudge("summary.md"));

AgentClient (CLI agents)

import io.github.markpollack.judge.agentclient.AgentClientEvaluator;

Judgment result = AgentClientEvaluator.evaluate(
    "Fix the build", workspace,
    () -> agentClient.run("Fix the build"),
    BuildSuccessJudge.maven("compile"));

Each evaluator has a jury overload — pass a Jury instead of a Judge to get a Verdict.

Evaluate RAG Pipelines

The RAG module provides LLM-powered judges for retrieval-augmented generation:

import io.github.markpollack.judge.rag.FaithfulnessJudge;
import io.github.markpollack.judge.rag.RagContext;

JudgmentContext context = JudgmentContext.builder()
    .goal("What is Spring Boot?")
    .status(ExecutionStatus.SUCCESS)
    .startedAt(Instant.now())
    .executionTime(Duration.ofSeconds(2))
    .metadata(RagContext.QUESTION_KEY, "What is Spring Boot?")
    .metadata(RagContext.CONTEXT_KEY, "Spring Boot is a framework that simplifies...")
    .metadata(RagContext.ANSWER_KEY, "Spring Boot simplifies application development.")
    .build();

FaithfulnessJudge judge = new FaithfulnessJudge(chatClientBuilder);
Judgment result = judge.judge(context);

Three RAG judges are available: FaithfulnessJudge, ContextualRelevanceJudge, and HallucinationJudge. See Built-in Judges for details.

What’s Next

Tutorial: Build an Evaluation Pipeline

Step-by-step guide from single judge to multi-judge jury

Built-in Judges

Catalog of built-in judges across all modules

Jury System

SimpleJury, CascadedJury, voting strategies, and composition

Writing Custom Judges

Lambda judges, DeterministicJudge, LLMJudge template method
