What is Agent Judge?

Agent Judge is an evaluation framework for AI agent output. It provides deterministic rules, command execution checks, file comparison, RAG evaluation, and LLM-powered assessment, all of which compose into juries with configurable voting strategies.

Think of judges as unit tests for your agent: executable checks that decide whether an agent’s output satisfies a goal. You wouldn’t ship application code without tests or assertions, and agents need the same discipline.

The core module has zero external dependencies. Framework bridge modules let you evaluate output from Spring AI, LangChain4j, Koog, and CLI-delegated agents (via AgentClient), and the same judges and juries work across all of them.

License

Agent Judge is licensed under BSL 1.1. Internal enterprise use is welcome. Commercial redistribution requires a separate agreement — see the LICENSE file for details.

Prerequisites

  • Java 21+
  • Maven 3.9+ (or Gradle 8+)
  • For LLM judges: Spring AI and an API key (Anthropic, OpenAI, etc.)

Add the Dependency

Start with the core module, then add only the modules you need:
<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>agent-judge-core</artifactId>
    <version>0.11.0</version>
</dependency>
Optional modules (add as needed):

| Module | Artifact | What it adds |
| --- | --- | --- |
| Exec | agent-judge-exec | Build, shell, and coverage judges |
| File | agent-judge-file | AST, POM, XML, and text comparison |
| LLM | agent-judge-llm | LLM-powered judges (requires Spring AI) |
| RAG | agent-judge-rag | Faithfulness, hallucination, relevance |
| Spring AI bridge | agent-judge-spring-ai | Evaluates ChatResponse output |
| LangChain4j bridge | agent-judge-langchain4j | Evaluates Result<T> output |
| Koog bridge | agent-judge-koog | Evaluates AIAgent output |
| AgentClient bridge | agent-judge-agent-client | Evaluates CLI-agent output |

All modules share the same groupId (io.github.markpollack) and version.

Your First Judge

Check whether a file exists in an agent’s workspace:
import io.github.markpollack.judge.Judge;
import io.github.markpollack.judge.fs.FileExistsJudge;
import io.github.markpollack.judge.context.JudgmentContext;
import io.github.markpollack.judge.context.ExecutionStatus;
import io.github.markpollack.judge.result.Judgment;

import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

Judge judge = new FileExistsJudge("README.md");

JudgmentContext context = JudgmentContext.builder()
    .goal("Create a README")
    .workspace(Path.of("/my/project"))
    .status(ExecutionStatus.SUCCESS)
    .startedAt(Instant.now())
    .executionTime(Duration.ofSeconds(30))
    .build();

Judgment result = judge.judge(context);

System.out.println(result.status());    // PASS or FAIL
System.out.println(result.reasoning()); // "File README.md exists" or "File README.md not found"
Every judge takes a JudgmentContext (what the agent was asked to do and where it worked) and returns a Judgment (score, status, reasoning, and granular checks).
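
To make that contract concrete, here is a minimal, self-contained sketch of the judge shape in plain Java. The types below (JudgeShapeSketch, Context, Outcome, Check) are hypothetical stand-ins invented for this illustration and are not Agent Judge classes; the real JudgmentContext and Judgment carry more information, such as score, execution status, timing, and granular checks.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-ins for illustration only; these are not Agent Judge types. */
final class JudgeShapeSketch {

    enum Status { PASS, FAIL }

    /** What the agent was asked to do and where it worked. */
    record Context(String goal, Path workspace) {}

    /** Outcome of a single check. */
    record Outcome(Status status, String reasoning) {}

    /** A judge is, in essence, a function from context to outcome. */
    @FunctionalInterface
    interface Check {
        Outcome judge(Context context);
    }

    /** Builds a check that passes when the relative path exists inside the workspace. */
    static Check fileExists(String relativePath) {
        return context -> {
            boolean found = Files.exists(context.workspace().resolve(relativePath));
            return new Outcome(
                found ? Status.PASS : Status.FAIL,
                "File " + relativePath + (found ? " exists" : " not found"));
        };
    }

    public static void main(String[] args) {
        Context context = new Context("Create a README", Path.of("."));
        Outcome outcome = fileExists("README.md").judge(context);
        System.out.println(outcome.status() + ": " + outcome.reasoning());
    }
}
```

Because a check is only a function of the context, it can be written as a lambda and reused unchanged across different agent runs.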

Add a Build Judge

Verify the project still compiles after the agent modified it:
import io.github.markpollack.judge.exec.BuildSuccessJudge;

Judge buildJudge = BuildSuccessJudge.maven("clean", "compile");
Judgment result = buildJudge.judge(context);

System.out.println(result.status());    // PASS if exit code 0
System.out.println(result.reasoning()); // Build output summary
BuildSuccessJudge.maven() auto-detects the ./mvnw wrapper. Use BuildSuccessJudge.gradle() for Gradle projects.

Note: Command judges require the agent-judge-exec module. They run real processes in the workspace directory.
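
For readers curious what a command-based check involves, the sketch below shows one plausible way to pick the Maven wrapper when it exists, run the build inside the workspace, and map the exit code to pass or fail. It is an illustration under stated assumptions, not the BuildSuccessJudge implementation: BuildCheckSketch and all of its methods are hypothetical names, and the real judge may detect the wrapper or capture build output differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a command-based check; not the library's implementation. */
final class BuildCheckSketch {

    /** Prefers the workspace's Maven wrapper when present, else falls back to mvn on the PATH. */
    static String mavenExecutable(Path workspace) {
        Path wrapper = workspace.resolve("mvnw");
        // An absolute path avoids platform differences in how relative executables resolve.
        return Files.isRegularFile(wrapper) ? wrapper.toAbsolutePath().toString() : "mvn";
    }

    /** Assembles the full command line, for example [mvn, clean, compile]. */
    static List<String> command(Path workspace, String... goals) {
        List<String> command = new ArrayList<>();
        command.add(mavenExecutable(workspace));
        command.addAll(List.of(goals));
        return command;
    }

    /** A build passes only when the process exits with code 0. */
    static boolean passed(int exitCode) {
        return exitCode == 0;
    }

    /** Runs the build as a real process in the workspace directory and reports pass or fail. */
    static boolean run(Path workspace, String... goals) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(workspace, goals))
            .directory(workspace.toFile())
            .inheritIO() // stream build output to the current console
            .start();
        return passed(process.waitFor());
    }

    public static void main(String[] args) {
        // Only prints the command that would run; call run(...) to execute it.
        System.out.println(command(Path.of("."), "clean", "compile"));
    }
}
```

Since checks like this launch real processes, point them only at workspaces you are comfortable building.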

Combine into a Jury

Run multiple judges together and aggregate results with a voting strategy:
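import io.github.markpollack.judge.exec.BuildSuccessJudge;
import io.github.markpollack.judge.fs.FileExistsJudge;
import io.github.markpollack.judge.jury.SimpleJury;
import io.github.markpollack.judge.jury.MajorityVotingStrategy;
import io.github.markpollack.judge.jury.Verdict;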

SimpleJury jury = SimpleJury.builder()
    .judge(new FileExistsJudge("README.md"))
    .judge(BuildSuccessJudge.maven("compile"))
    .judge(new FileExistsJudge("src/main/java"))
    .votingStrategy(new MajorityVotingStrategy())
    .parallel(true)
    .build();

Verdict verdict = jury.vote(context);

System.out.println(verdict.aggregated().status()); // PASS when a majority of judges pass

// Inspect individual results
verdict.individualByName().forEach((name, judgment) ->
    System.out.println(name + " -> " + judgment.status())
);
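
The aggregation step can be pictured with a small, self-contained sketch of majority voting over named results. This is an illustration, not the MajorityVotingStrategy source: MajorityVoteSketch is a hypothetical class, and treating a tie (or an empty jury) as FAIL is an assumption made here, so check the library documentation for its actual tie-breaking behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of majority voting; tie handling here is an assumption. */
final class MajorityVoteSketch {

    enum Status { PASS, FAIL }

    /** PASS when strictly more than half of the judges passed; ties and empty input yield FAIL. */
    static Status aggregate(Map<String, Status> individual) {
        long passes = individual.values().stream()
            .filter(status -> status == Status.PASS)
            .count();
        return passes * 2 > individual.size() ? Status.PASS : Status.FAIL;
    }

    public static void main(String[] args) {
        Map<String, Status> individual = new LinkedHashMap<>();
        individual.put("readme-exists", Status.PASS);
        individual.put("build-compiles", Status.FAIL);
        individual.put("sources-exist", Status.PASS);

        // Two of three judges passed, so the aggregated status is PASS.
        System.out.println(aggregate(individual));
        individual.forEach((name, status) -> System.out.println(name + " -> " + status));
    }
}
```

Keeping the individual results alongside the aggregate makes it easy to see which judge dissented when a verdict is unexpected.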

Evaluate Framework Output

Framework bridge modules convert agent output into JudgmentContext automatically. The same judges and juries work regardless of which framework produced the output.

| Runtime | Input type | Bridge |
| --- | --- | --- |
| Spring AI | ChatResponse | SpringAiEvaluator |
| LangChain4j | Result<T> | LangChain4jEvaluator |
| Koog | AIAgent | KoogEvaluator |
| AgentClient | AgentClientResponse | AgentClientEvaluator |

Bridge modules do not pull in framework runtimes transitively, so your build must declare the Spring AI, LangChain4j, Koog, or AgentClient dependency itself (typically the one your application already uses).

Spring AI

import io.github.markpollack.judge.springai.SpringAiEvaluator;

Judgment result = SpringAiEvaluator.evaluate(
    "Summarize the document",
    () -> chatClient.prompt().user(prompt).call().chatResponse(),
    new FileExistsJudge("summary.md"));

LangChain4j

import io.github.markpollack.judge.langchain4j.LangChain4jEvaluator;

Judgment result = LangChain4jEvaluator.evaluate(
    "Summarize the document",
    goal -> assistant.chat(goal),
    new FileExistsJudge("summary.md"));

Koog

import io.github.markpollack.judge.koog.KoogEvaluator;

Judgment result = KoogEvaluator.evaluate(
    agent, "Summarize the document",
    new FileExistsJudge("summary.md"));

AgentClient (CLI agents)

import io.github.markpollack.judge.agentclient.AgentClientEvaluator;

Judgment result = AgentClientEvaluator.evaluate(
    "Fix the build", workspace,
    () -> agentClient.run("Fix the build"),
    BuildSuccessJudge.maven("compile"));
Each evaluator also has a jury overload: pass a Jury instead of a Judge and it returns a Verdict rather than a Judgment.
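
The evaluators above share a common shape: take a goal, invoke the agent, capture what happened, and hand that to a judge. The sketch below illustrates the pattern in plain Java. Everything in it (EvaluatorShapeSketch, RunContext, Outcome, nonBlankOutput) is hypothetical and invented for illustration; how the actual bridges build a JudgmentContext, and how they treat agent failures, is not shown in this page and may differ.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical sketch of the evaluator pattern; not the bridge modules' implementation. */
final class EvaluatorShapeSketch {

    enum Status { PASS, FAIL }

    /** Facts captured about one agent run. */
    record RunContext(String goal, String output, Instant startedAt,
                      Duration executionTime, boolean succeeded) {}

    record Outcome(Status status, String reasoning) {}

    /** Invokes the agent, records timing and success, then hands the context to the judge. */
    static Outcome evaluate(String goal, Supplier<String> agentCall,
                            Function<RunContext, Outcome> judge) {
        Instant startedAt = Instant.now();
        String output = null;
        boolean succeeded = true;
        try {
            output = agentCall.get();
        } catch (RuntimeException e) {
            succeeded = false; // assumption: a failed call is still judged, with no output
        }
        Duration elapsed = Duration.between(startedAt, Instant.now());
        return judge.apply(new RunContext(goal, output, startedAt, elapsed, succeeded));
    }

    /** Example judge: passes when the run succeeded and produced non-blank output. */
    static Outcome nonBlankOutput(RunContext context) {
        boolean ok = context.succeeded()
            && context.output() != null
            && !context.output().isBlank();
        return new Outcome(ok ? Status.PASS : Status.FAIL,
            ok ? "Agent produced output" : "Agent failed or produced no output");
    }

    public static void main(String[] args) {
        Outcome outcome = evaluate("Summarize the document",
            () -> "A short summary.", EvaluatorShapeSketch::nonBlankOutput);
        System.out.println(outcome.status() + ": " + outcome.reasoning());
    }
}
```

Passing the agent call as a supplier lets the evaluator own the timing and error handling, which is why the same judge can be reused across runtimes.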

Evaluate RAG Pipelines

The RAG module provides LLM-powered judges for retrieval-augmented generation:
import io.github.markpollack.judge.rag.FaithfulnessJudge;
import io.github.markpollack.judge.rag.RagContext;

JudgmentContext context = JudgmentContext.builder()
    .goal("What is Spring Boot?")
    .status(ExecutionStatus.SUCCESS)
    .startedAt(Instant.now())
    .executionTime(Duration.ofSeconds(2))
    .metadata(RagContext.QUESTION_KEY, "What is Spring Boot?")
    .metadata(RagContext.CONTEXT_KEY, "Spring Boot is a framework that simplifies...")
    .metadata(RagContext.ANSWER_KEY, "Spring Boot simplifies application development.")
    .build();

FaithfulnessJudge judge = new FaithfulnessJudge(chatClientBuilder);
Judgment result = judge.judge(context);
Three RAG judges are available: FaithfulnessJudge, ContextualRelevanceJudge, and HallucinationJudge. See Built-in Judges for details.

What’s Next

  • Tutorial: Build an Evaluation Pipeline. Step-by-step guide from single judge to multi-judge jury.
  • Built-in Judges. Catalog of built-in judges across all modules.
  • Jury System. SimpleJury, CascadedJury, voting strategies, and composition.
  • Writing Custom Judges. Lambda judges, DeterministicJudge, and the LLMJudge template method.