API Reference - Pollack AI Lab

Packages

Package	Contains
`io.github.markpollack.judge`	Core judge interfaces and utilities
`io.github.markpollack.judge.context`	`JudgmentContext`, `ExecutionStatus`
`io.github.markpollack.judge.result`	`Judgment`, `JudgmentStatus`, `Check`
`io.github.markpollack.judge.score`	`Score`, `BooleanScore`, `NumericalScore`, `CategoricalScore`
`io.github.markpollack.judge.jury`	`Jury`, `Verdict`, voting strategies
`io.github.markpollack.judge.springai`	Spring AI bridge
`io.github.markpollack.judge.langchain4j`	LangChain4j bridge
`io.github.markpollack.judge.koog`	Koog bridge
`io.github.markpollack.judge.agentclient`	AgentClient bridge
`io.github.markpollack.judge.ai`	AI-backed judge infrastructure
`io.github.markpollack.judge.rag`	RAG judges and `RagContext`

Core Types

Judge

The fundamental evaluation interface. A functional interface for lambda and method reference support.

@FunctionalInterface
public interface Judge {
    Judgment judge(JudgmentContext context);
}

Use directly as a lambda, or extend DeterministicJudge / LLMJudge for metadata support.
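
As a rough illustration of the lambda form, the sketch below defines a judge inline. The nested `Judgment`, `JudgmentContext`, and `Judge` types are simplified local stand-ins that only mirror the shapes documented on this page, so the example compiles on its own; with the library on the classpath you would import the real types instead. The `statusFor` helper and the "mentions a controller" rule are hypothetical.

```java
import java.util.Optional;

final class LambdaJudgeSketch {
    // Simplified stand-ins (assumption): the real types live in io.github.markpollack.judge.*
    enum JudgmentStatus { PASS, FAIL, ABSTAIN, ERROR }

    record Judgment(JudgmentStatus status, String reasoning) {
        static Judgment pass(String reasoning) { return new Judgment(JudgmentStatus.PASS, reasoning); }
        static Judgment fail(String reasoning) { return new Judgment(JudgmentStatus.FAIL, reasoning); }
        static Judgment abstain(String reasoning) { return new Judgment(JudgmentStatus.ABSTAIN, reasoning); }
    }

    record JudgmentContext(String goal, Optional<String> agentOutput) {}

    @FunctionalInterface
    interface Judge { Judgment judge(JudgmentContext context); }

    // A judge written as a lambda: passes when the agent output mentions a controller,
    // fails when it does not, and abstains when there is no output to inspect.
    static final Judge MENTIONS_CONTROLLER = ctx -> ctx.agentOutput()
            .map(out -> out.contains("Controller")
                    ? Judgment.pass("Output mentions a controller")
                    : Judgment.fail("Output does not mention a controller"))
            .orElseGet(() -> Judgment.abstain("No agent output to inspect"));

    // Hypothetical helper: runs the judge on the given output and returns the status name.
    static String statusFor(String agentOutput) {
        JudgmentContext ctx = new JudgmentContext("Add REST endpoint", Optional.ofNullable(agentOutput));
        return MENTIONS_CONTROLLER.judge(ctx).status().name();
    }

    public static void main(String[] args) {
        System.out.println(statusFor("Created HelloController.java"));
    }
}
```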

AsyncJudge

Asynchronous variant for non-blocking evaluation:

public interface AsyncJudge {
    CompletableFuture<Judgment> judgeAsync(JudgmentContext context);
}
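
One way to obtain an `AsyncJudge` from an existing synchronous judge is to run it on an executor with `CompletableFuture.supplyAsync`. The sketch below shows that idea only; the `async` adapter is hypothetical (this page does not say whether the library ships one), and the nested types are minimal stand-ins for the real interfaces.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class AsyncAdapterSketch {
    // Minimal stand-ins (assumption) for the library types of the same names.
    record Judgment(boolean passed, String reasoning) {}
    record JudgmentContext(String goal) {}

    @FunctionalInterface
    interface Judge { Judgment judge(JudgmentContext context); }

    @FunctionalInterface
    interface AsyncJudge { CompletableFuture<Judgment> judgeAsync(JudgmentContext context); }

    // Hypothetical adapter: evaluates the blocking judge on the supplied executor.
    static AsyncJudge async(Judge judge, Executor executor) {
        return ctx -> CompletableFuture.supplyAsync(() -> judge.judge(ctx), executor);
    }

    // Runs a trivial judge through the adapter and waits for the result.
    static boolean demo(String goal) {
        Judge goalPresent = ctx -> new Judgment(!ctx.goal().isBlank(), "Goal must not be blank");
        // Runnable::run executes on the calling thread; pass a real pool in production code.
        AsyncJudge asyncJudge = async(goalPresent, Runnable::run);
        return asyncJudge.judgeAsync(new JudgmentContext(goal)).join().passed();
    }

    public static void main(String[] args) {
        System.out.println(demo("Add REST endpoint"));
    }
}
```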

ReactiveJudge

Reactive variant for Spring WebFlux / Project Reactor:

public interface ReactiveJudge {
    Mono<Judgment> judge(JudgmentContext context);
}

DeterministicJudge

Abstract base class for rule-based judges. Provides JudgeWithMetadata support:

public abstract class DeterministicJudge implements JudgeWithMetadata {
    protected DeterministicJudge(String name, String description)
    public JudgeMetadata metadata()
}

JudgeWithMetadata extends Judge, so DeterministicJudge is also a Judge.

NamedJudge

Composition wrapper that attaches metadata to any judge (including lambdas):

Judge named = Judges.named(myLambda, "check-name", "description");
Judge typed = Judges.named(myLambda, "check-name", "description", JudgeType.DETERMINISTIC);

JudgeWithMetadata

Interface for judges that expose identity:

public interface JudgeWithMetadata extends Judge {
    JudgeMetadata metadata();
}

Infrastructure code can use instanceof JudgeWithMetadata for discovery:

if (judge instanceof JudgeWithMetadata jwm) {
    log.info("Running: {}", jwm.metadata().name());
}

JudgeMetadata

Identity record:

public record JudgeMetadata(String name, String description, JudgeType type)

JudgeType

public enum JudgeType {
    DETERMINISTIC,
    LLM_POWERED,
    HYBRID,
    AGENT
}

Context

JudgmentContext

All evaluation inputs in one immutable record:

public record JudgmentContext(
    String goal,
    Path workspace,
    Duration executionTime,
    Instant startedAt,
    Optional<String> agentOutput,
    ExecutionStatus status,
    Optional<Throwable> error,
    Map<String, Object> metadata
)

Builder methods:

Method	Description	Required
`goal(String)`	The agent’s task description	Yes
`workspace(Path)`	Directory the agent modified	Yes
`status(ExecutionStatus)`	Agent execution outcome	Yes
`startedAt(Instant)`	When execution began	Yes
`executionTime(Duration)`	How long execution took	Yes
`agentOutput(String)`	Text output from the agent	No
`error(Throwable)`	Exception if execution failed	No
`metadata(String, Object)`	Arbitrary key-value pairs	No
`metadata(Map<String, Object>)`	Bulk metadata	No

JudgmentContext context = JudgmentContext.builder()
    .goal("Add REST endpoint")
    .workspace(Path.of("/project"))
    .status(ExecutionStatus.SUCCESS)
    .startedAt(Instant.now())
    .executionTime(Duration.ofMinutes(2))
    .agentOutput("Created HelloController.java")
    .metadata("expectedDir", Path.of("/reference"))
    .build();

ExecutionStatus

public enum ExecutionStatus {
    SUCCESS, FAILED, TIMEOUT, CANCELLED, REFUSED, UNKNOWN
}

Value	Meaning
`SUCCESS`	Agent completed normally
`FAILED`	Agent threw an exception or returned an error
`TIMEOUT`	Execution exceeded time limit
`CANCELLED`	Execution was cancelled
`REFUSED`	Model declined the request (content filter)
`UNKNOWN`	Status could not be determined

Results

Judgment

Immutable evaluation result:

public record Judgment(
    Score score,
    JudgmentStatus status,
    String reasoning,
    List<Check> checks,
    Map<String, Object> metadata
)

Static factory methods:

Judgment.pass("Build succeeded")
Judgment.fail("File not found")
Judgment.abstain("Not enough information")
Judgment.error("Timeout", exception)

Builder:

Judgment.builder()
    .score(new BooleanScore(true))
    .status(JudgmentStatus.PASS)
    .reasoning("All checks passed")
    .check(Check.pass("compile", "Compilation succeeded"))
    .check(Check.pass("tests", "All tests passed"))
    .metadata("duration", Duration.ofSeconds(45))
    .build();

Utility methods:

Method	Returns	Description
`pass()`	`boolean`	`true` if status == PASS
`elapsed()`	`Duration`	Elapsed time from metadata
`error()`	`Throwable`	Error from metadata

JudgmentStatus

public enum JudgmentStatus {
    PASS, FAIL, ABSTAIN, ERROR
}

Check

Granular sub-assertion within a judgment:

public record Check(String name, boolean passed, String message)

Factory methods:

Check.pass("file_exists")
Check.pass("file_exists", "File found at expected path")
Check.fail("content_match", "Expected header not found")

Score Types

Score is a sealed interface with three permitted implementations:

BooleanScore

Simple pass/fail:

public record BooleanScore(boolean value) implements Score

new BooleanScore(true)   // pass
new BooleanScore(false)  // fail

NumericalScore

Continuous scoring with bounds:

public record NumericalScore(double value, double min, double max) implements Score

NumericalScore score = new NumericalScore(85.0, 0.0, 100.0);
double normalized = score.normalized();  // 0.85 (scaled to [0, 1])

// Convenience factories
NumericalScore zeroToOne = NumericalScore.normalized(0.85);
NumericalScore zeroToTen = NumericalScore.outOfTen(7.5);
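
The `0.85` result above is consistent with plain min-max scaling, `(value - min) / (max - min)`. The sketch below reproduces that arithmetic with a local stand-in record; the bounds validation in the constructor is an assumption, and the factory ranges ([0, 1] and [0, 10]) are inferred from the variable names in the example rather than confirmed by the source.

```java
final class NormalizationSketch {
    // Local stand-in (assumption) for the library's NumericalScore record.
    record NumericalScore(double value, double min, double max) {
        NumericalScore {
            // Assumed validation: the real record may or may not enforce these checks.
            if (max <= min) {
                throw new IllegalArgumentException("max must be greater than min");
            }
            if (value < min || value > max) {
                throw new IllegalArgumentException("value must lie within [min, max]");
            }
        }

        // Min-max scaling onto [0, 1].
        double normalized() {
            return (value - min) / (max - min);
        }

        // Assumed ranges for the convenience factories.
        static NumericalScore normalized(double value) { return new NumericalScore(value, 0.0, 1.0); }
        static NumericalScore outOfTen(double value) { return new NumericalScore(value, 0.0, 10.0); }
    }

    public static void main(String[] args) {
        System.out.println(new NumericalScore(85.0, 0.0, 100.0).normalized());
        System.out.println(NumericalScore.outOfTen(7.5).normalized());
    }
}
```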

CategoricalScore

Discrete categories from a fixed set:

public record CategoricalScore(String value, List<String> allowedValues) implements Score

CategoricalScore score = new CategoricalScore(
    "GOOD", List.of("EXCELLENT", "GOOD", "FAIR", "POOR"));

Scores Utility

Convert between score types for heterogeneous aggregation:

// Numerical and Boolean scores normalize directly
// Categorical scores require a mapping from category names to numeric values
double normalized = Scores.toNormalized(anyScore, categoryMap);
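
To make the conversion concrete, here is a minimal sketch of how such a normalization could work across the three score types, followed by a simple mean. It is not the library's implementation: the sealed hierarchy is a local stand-in, mapping `true`/`false` to 1.0/0.0 and throwing on an unmapped category are assumptions, and `mean` is a hypothetical helper.

```java
import java.util.List;
import java.util.Map;

final class ScoresSketch {
    // Local stand-ins (assumption) mirroring the score types documented above.
    sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore {}
    record BooleanScore(boolean value) implements Score {}
    record NumericalScore(double value, double min, double max) implements Score {}
    record CategoricalScore(String value, List<String> allowedValues) implements Score {}

    // Maps any score onto [0, 1]; categories need an explicit numeric mapping.
    static double toNormalized(Score score, Map<String, Double> categoryMap) {
        if (score instanceof BooleanScore b) {
            return b.value() ? 1.0 : 0.0;
        }
        if (score instanceof NumericalScore n) {
            return (n.value() - n.min()) / (n.max() - n.min());
        }
        if (score instanceof CategoricalScore c) {
            Double mapped = categoryMap.get(c.value());
            if (mapped == null) {
                throw new IllegalArgumentException("No numeric mapping for category: " + c.value());
            }
            return mapped;
        }
        throw new IllegalStateException("Unknown score type: " + score);
    }

    // Hypothetical helper: unweighted mean of the normalized values.
    static double mean(List<Score> scores, Map<String, Double> categoryMap) {
        return scores.stream()
                .mapToDouble(s -> toNormalized(s, categoryMap))
                .average()
                .orElseThrow(() -> new IllegalArgumentException("No scores to aggregate"));
    }

    public static void main(String[] args) {
        Map<String, Double> categories =
                Map.of("EXCELLENT", 1.0, "GOOD", 0.75, "FAIR", 0.5, "POOR", 0.0);
        List<Score> scores = List.of(
                new BooleanScore(true),
                new NumericalScore(5.0, 0.0, 10.0),
                new CategoricalScore("GOOD", List.of("EXCELLENT", "GOOD", "FAIR", "POOR")));
        System.out.println(mean(scores, categories));
    }
}
```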

Composition

Judges Utility

Static methods for creating and composing judges:

Method	Description
`named(Judge, String)`	Wrap with a name
`named(Judge, String, String)`	Wrap with name and description
`named(Judge, String, String, JudgeType)`	Wrap with full metadata
`alwaysPass(String)`	Test judge that always passes
`alwaysFail(String)`	Test judge that always fails
`tryMetadata(Judge)`	Extract metadata if available (`Optional<JudgeMetadata>`)
`and(Judge, Judge)`	Short-circuit AND
`or(Judge, Judge)`	Short-circuit OR
`allOf(Judge...)`	All must pass (variadic AND)
`anyOf(Judge...)`	Any can pass (variadic OR)
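
The short-circuit behaviour of `and` / `or` can be pictured with the sketch below: the second judge is only consulted when the first result does not already decide the outcome. This is an illustration under simplifying assumptions, not the library's code. The context is reduced to a `String`, `Judgment` is a two-field stand-in, and the way the real combinators merge reasoning, handle `ABSTAIN`, or whether `allOf` stops early is not documented here.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class CompositionSketch {
    // Simplified stand-ins (assumption): context reduced to the agent output string.
    record Judgment(boolean passed, String reasoning) {}

    @FunctionalInterface
    interface Judge { Judgment judge(String agentOutput); }

    // Short-circuit AND: the second judge runs only if the first passes.
    static Judge and(Judge first, Judge second) {
        return out -> {
            Judgment a = first.judge(out);
            return a.passed() ? second.judge(out) : a;
        };
    }

    // Short-circuit OR: the second judge runs only if the first fails.
    static Judge or(Judge first, Judge second) {
        return out -> {
            Judgment a = first.judge(out);
            return a.passed() ? a : second.judge(out);
        };
    }

    // Variadic AND; stopping at the first failure is an assumption of this sketch.
    static Judge allOf(Judge... judges) {
        return out -> {
            for (Judge judge : judges) {
                Judgment result = judge.judge(out);
                if (!result.passed()) {
                    return result;
                }
            }
            return new Judgment(true, "All " + judges.length + " judges passed");
        };
    }

    // Counts how often the second judge is invoked when the first one fails.
    static int secondJudgeCallsWhenFirstFails() {
        AtomicInteger calls = new AtomicInteger();
        Judge failing = out -> new Judgment(false, "always fails");
        Judge counting = out -> {
            calls.incrementAndGet();
            return new Judgment(true, "ok");
        };
        and(failing, counting).judge("anything");
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println(secondJudgeCallsWhenFirstFails());
    }
}
```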

AI-Core Types

Framework-neutral infrastructure for AI-backed judges. Located in the agent-judge-ai-core module (zero external dependencies).

ModelBackedJudge

Composed AI-backed judge built via builder pattern. Pipeline: render prompt → invoke model → classify response → produce Judgment. No subclassing needed.

ModelBackedJudge judge = ModelBackedJudge.builder()
    .model(judgeModel)
    .template(promptTemplate)
    .classifier(LabelJudgmentClassifier.passFail())
    .build();

Judgment judgment = judge.judge(context);

Builder Method	Required	Description
`model(JudgeModel)`	Yes	AI backend to invoke
`template(JudgePromptTemplate)`	Yes	Prompt template with `{{variable}}` placeholders
`classifier(JudgmentClassifier)`	Yes	Maps model response to `Judgment`

JudgeModel

Functional interface for AI model invocation. Framework-specific implementations live in bridge modules.

@FunctionalInterface
public interface JudgeModel {
    JudgeModelResponse call(JudgeModelRequest request);
}

Implementation	Module	Backend
`SpringAiJudgeModel`	`agent-judge-llm`	Spring AI `ChatClient`
`AgentClientJudgeModel`	`agent-judge-agent-client`	CLI agent via AgentClient

JudgePromptTemplate

Loads, validates, and renders prompt templates with {{variable}} placeholders extracted from JudgmentContext.

JudgePromptTemplate template = JudgePromptTemplate.builder()
    .source(TextSource.classpath("/prompts/correctness.txt"))
    .renderer(new SimpleJudgeTemplateRenderer())
    .missingVariablePolicy(MissingVariablePolicy.STRICT)
    .build();

Builder Method	Default	Description
`source(TextSource)`	Required	Template text source (classpath, file, or string)
`renderer(JudgeTemplateRenderer)`	`SimpleJudgeTemplateRenderer`	Pluggable template engine
`missingVariablePolicy(MissingVariablePolicy)`	`STRICT`	`STRICT`, `EMPTY_STRING`, or `LEAVE_PLACEHOLDER`

Available variables from JudgmentContext: {{goal}}, {{output}}, {{workspace}}, {{status}}, {{metadata.*}}.

JudgeTemplateRenderer

Pluggable template engine interface:

public interface JudgeTemplateRenderer {
    String render(String template, Map<String, String> variables);
}

Default implementation SimpleJudgeTemplateRenderer performs {{variable}} substitution.
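
For intuition, `{{variable}}` substitution with a missing-variable policy can be sketched with a regular expression as below. This is not the source of `SimpleJudgeTemplateRenderer`: the placeholder grammar (letters, digits, underscores, dots, no inner whitespace), the exception type thrown under `STRICT`, and the three-argument `render` signature are all assumptions made for the example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateRenderSketch {
    enum MissingVariablePolicy { STRICT, EMPTY_STRING, LEAVE_PLACEHOLDER }

    // Assumed placeholder grammar: {{name}} or {{metadata.key}}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{([\\w.]+)\\}\\}");

    static String render(String template, Map<String, String> variables, MissingVariablePolicy policy) {
        return PLACEHOLDER.matcher(template).replaceAll(match -> {
            String name = match.group(1);
            String value = variables.get(name);
            if (value == null) {
                switch (policy) {
                    case STRICT -> throw new IllegalArgumentException("Missing template variable: " + name);
                    case EMPTY_STRING -> value = "";
                    case LEAVE_PLACEHOLDER -> value = match.group();
                }
            }
            // Escape '$' and '\' so the value is inserted literally.
            return Matcher.quoteReplacement(value);
        });
    }

    public static void main(String[] args) {
        String template = "Goal: {{goal}}\nOutput: {{output}}";
        Map<String, String> variables = Map.of("goal", "Add REST endpoint");
        System.out.println(render(template, variables, MissingVariablePolicy.LEAVE_PLACEHOLDER));
    }
}
```

Escaping the replacement matters here because agent output frequently contains `$` or backslashes, which `replaceAll` would otherwise interpret as group references.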

JudgmentClassifier

Functional interface that maps a model response to a Judgment:

@FunctionalInterface
public interface JudgmentClassifier {
    Judgment classify(JudgeModelResponse response);
}

LabelJudgmentClassifier

Exact normalized label matching with builder pattern:

// Built-in: maps "PASS"/"FAIL" labels to Judgment
JudgmentClassifier classifier = LabelJudgmentClassifier.passFail();

// Custom labels via builder
JudgmentClassifier classifier = LabelJudgmentClassifier.builder()
    .passLabel("CORRECT")
    .failLabel("INCORRECT")
    .build();
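
The page does not spell out what "normalized" means for label matching. One plausible reading, sketched below, is that the response text is trimmed and case-folded before an exact comparison against the configured labels. Both that normalization and the choice to return `ABSTAIN` for an unrecognized label are assumptions of this example, and `LabelClassifier` is a hypothetical stand-in rather than the library class.

```java
import java.util.Locale;

final class LabelClassifierSketch {
    enum JudgmentStatus { PASS, FAIL, ABSTAIN, ERROR }

    // Assumed normalization: strip surrounding whitespace and upper-case.
    static String normalize(String text) {
        return text == null ? "" : text.strip().toUpperCase(Locale.ROOT);
    }

    // Hypothetical stand-in for a label-based classifier.
    record LabelClassifier(String passLabel, String failLabel) {
        JudgmentStatus classify(String responseText) {
            String label = normalize(responseText);
            if (label.equals(normalize(passLabel))) {
                return JudgmentStatus.PASS;
            }
            if (label.equals(normalize(failLabel))) {
                return JudgmentStatus.FAIL;
            }
            // Assumption: anything else is treated as "cannot decide".
            return JudgmentStatus.ABSTAIN;
        }
    }

    public static void main(String[] args) {
        LabelClassifier classifier = new LabelClassifier("CORRECT", "INCORRECT");
        System.out.println(classifier.classify("  correct\n"));
        System.out.println(classifier.classify("The answer is correct"));
    }
}
```

Because the match is exact, a response such as "The answer is correct" is not classified as a pass in this sketch; the prompt has to instruct the model to reply with the bare label.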

Supporting Records

// Model request — messages with options and metadata
public record JudgeModelRequest(
    List<JudgeMessage> messages,
    Map<String, Object> options,
    Map<String, Object> metadata
)
// Factory: JudgeModelRequest.user(prompt)

// Model response — text, model identity, usage, metadata
public record JudgeModelResponse(
    String text,
    String model,
    Usage usage,
    Map<String, Object> metadata
)

// Message with role
public record JudgeMessage(JudgeMessageRole role, String content)
public enum JudgeMessageRole { SYSTEM, USER, ASSISTANT }

// Token usage
public record Usage(int inputTokens, int outputTokens, int totalTokens,
                    double estimatedCostUsd)

Jury System

Jury Interface

public interface Jury {
    Verdict vote(JudgmentContext context);
}

SimpleJury

Flat multi-judge aggregation. See Jury System for full usage. Builder:

Method	Description
`.judge(Judge)`	Add with weight 1.0
`.judge(Judge, double)`	Add with custom weight
`.votingStrategy(VotingStrategy)`	Required
`.parallel(boolean)`	Default `true`
`.executor(Executor)`	Custom thread pool

CascadedJury

Sequential tiered evaluation. See Jury System for full usage. Builder:

Method	Description
`.tier(String, Jury, TierPolicy)`	Add a named tier
`.build()`	Validates last tier is `FINAL_TIER`

Verdict

public record Verdict(
    Judgment aggregated,
    List<Judgment> individual,
    Map<String, Judgment> individualByName,
    Map<String, Double> weights,
    List<Verdict> subVerdicts
)

Field	Description
`aggregated`	The voting strategy’s aggregated result
`individual`	All individual judge results (ordered)
`individualByName`	Results keyed by judge name
`weights`	Weight assigned to each judge
`subVerdicts`	Per-tier verdicts (CascadedJury only)

VotingStrategy

public interface VotingStrategy {
    Judgment aggregate(List<Judgment> judgments, Map<String, Double> weights);
    String getName();
}

Implementations:

Class	Constructor
`MajorityVotingStrategy`	`()` or `(TiePolicy, ErrorPolicy)`
`ConsensusStrategy`	`()`
`AverageVotingStrategy`	`()`
`WeightedAverageStrategy`	`()`
`MedianVotingStrategy`	`()`

TierPolicy

public enum TierPolicy {
    REJECT_ON_ANY_FAIL,   // Stop on failure
    ACCEPT_ON_ALL_PASS,   // Stop on full pass
    FINAL_TIER            // Always produces verdict (required for last tier)
}
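
The control flow these policies imply can be sketched as a loop over tiers that exits as soon as a tier's policy is satisfied. The code below is a simplified model, not `CascadedJury` itself: judges are reduced to `Predicate<String>`, a tier "passes" only when all of its judges pass, and the `Tier`, `Outcome`, and `demo` names are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;

final class CascadeSketch {
    enum TierPolicy { REJECT_ON_ANY_FAIL, ACCEPT_ON_ALL_PASS, FINAL_TIER }

    // Hypothetical simplifications: a judge is a predicate over the agent output.
    record Tier(String name, List<Predicate<String>> judges, TierPolicy policy) {}
    record Outcome(boolean passed, String decidedBy) {}

    static Outcome evaluate(List<Tier> tiers, String agentOutput) {
        if (tiers.isEmpty() || tiers.get(tiers.size() - 1).policy() != TierPolicy.FINAL_TIER) {
            throw new IllegalArgumentException("Last tier must use FINAL_TIER");
        }
        for (Tier tier : tiers) {
            boolean allPass = tier.judges().stream().allMatch(j -> j.test(agentOutput));
            switch (tier.policy()) {
                // Stop with a failure as soon as any judge in the tier fails.
                case REJECT_ON_ANY_FAIL -> { if (!allPass) return new Outcome(false, tier.name()); }
                // Stop with a pass as soon as every judge in the tier passes.
                case ACCEPT_ON_ALL_PASS -> { if (allPass) return new Outcome(true, tier.name()); }
                // The final tier always produces the outcome.
                case FINAL_TIER -> { return new Outcome(allPass, tier.name()); }
            }
        }
        throw new IllegalStateException("Unreachable: FINAL_TIER always returns");
    }

    // Three illustrative tiers: cheap sanity check, fast accept, final decision.
    static Outcome demo(String agentOutput) {
        List<Tier> tiers = List.of(
                new Tier("sanity", List.of(s -> !s.isBlank()), TierPolicy.REJECT_ON_ANY_FAIL),
                new Tier("fast", List.of(s -> s.contains("BUILD SUCCESS")), TierPolicy.ACCEPT_ON_ALL_PASS),
                new Tier("final", List.of(s -> s.contains("Controller")), TierPolicy.FINAL_TIER));
        return evaluate(tiers, agentOutput);
    }

    public static void main(String[] args) {
        System.out.println(demo("BUILD SUCCESS"));
    }
}
```

Ordering tiers from cheapest to most expensive lets inexpensive deterministic checks settle most cases before any costly judge runs.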

TiePolicy

public enum TiePolicy {
    PASS,     // Optimistic
    FAIL,     // Pessimistic (default)
    ABSTAIN   // Neutral
}

ErrorPolicy

public enum ErrorPolicy {
    TREAT_AS_FAIL,     // Default
    TREAT_AS_ABSTAIN,
    IGNORE
}
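
To show how `TiePolicy` and `ErrorPolicy` might interact in a majority vote, here is an unweighted sketch over bare statuses. It is an assumption-laden model rather than `MajorityVotingStrategy`: abstentions are not counted, and in this simplified form `TREAT_AS_ABSTAIN` and `IGNORE` behave identically because neither changes the pass/fail tallies.

```java
import java.util.List;

final class MajorityVoteSketch {
    enum JudgmentStatus { PASS, FAIL, ABSTAIN, ERROR }
    enum TiePolicy { PASS, FAIL, ABSTAIN }
    enum ErrorPolicy { TREAT_AS_FAIL, TREAT_AS_ABSTAIN, IGNORE }

    static JudgmentStatus majority(List<JudgmentStatus> votes, TiePolicy tie, ErrorPolicy onError) {
        int pass = 0;
        int fail = 0;
        for (JudgmentStatus vote : votes) {
            switch (vote) {
                case PASS -> pass++;
                case FAIL -> fail++;
                // Errors count against the result only under TREAT_AS_FAIL.
                case ERROR -> { if (onError == ErrorPolicy.TREAT_AS_FAIL) fail++; }
                // Abstentions do not affect the tally in this sketch.
                case ABSTAIN -> { }
            }
        }
        if (pass > fail) {
            return JudgmentStatus.PASS;
        }
        if (fail > pass) {
            return JudgmentStatus.FAIL;
        }
        // Equal counts: defer to the tie policy.
        return switch (tie) {
            case PASS -> JudgmentStatus.PASS;
            case FAIL -> JudgmentStatus.FAIL;
            case ABSTAIN -> JudgmentStatus.ABSTAIN;
        };
    }

    public static void main(String[] args) {
        List<JudgmentStatus> votes =
                List.of(JudgmentStatus.PASS, JudgmentStatus.FAIL, JudgmentStatus.ERROR);
        System.out.println(majority(votes, TiePolicy.FAIL, ErrorPolicy.TREAT_AS_FAIL));
        System.out.println(majority(votes, TiePolicy.ABSTAIN, ErrorPolicy.IGNORE));
    }
}
```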

Juries Utility

// Quick jury from judges
Jury jury = Juries.fromJudges(strategy, judge1, judge2, judge3);

// Combine two juries into a meta-jury
Jury meta = Juries.combine(jury1, jury2, strategy);

// Multiple juries
Jury all = Juries.allOf(strategy, jury1, jury2, jury3);

Framework Bridge Evaluators

Each framework bridge provides an Evaluator (one-liner convenience) and a JudgmentContextBuilder (full control). All evaluators follow the same four-method pattern: evaluate with a Judge or a Jury, each with or without extra metadata.

Runtime	Input type	Evaluator	Context builder
Spring AI	`ChatResponse`	`SpringAiEvaluator`	`SpringAiJudgmentContextBuilder`
LangChain4j	`Result<T>`	`LangChain4jEvaluator`	`LangChain4jJudgmentContextBuilder`
Koog	`AIAgent`	`KoogEvaluator`	`KoogJudgmentContextBuilder`
AgentClient	`AgentClientResponse`	`AgentClientEvaluator`	`AgentClientJudgmentContextBuilder`

Bridge modules declare framework dependencies with provided scope. Your application must already include the corresponding framework/runtime dependency.

SpringAiEvaluator

Bridges Spring AI ChatResponse output to agent-judge evaluation. Uses Supplier<ChatResponse> because Spring AI ChatClient calls don’t take the goal as an argument at call time.

// One-liner with a judge
Judgment result = SpringAiEvaluator.evaluate(
    "Summarize the document",
    () -> chatClient.prompt().user(prompt).call().chatResponse(),
    myJudge);

// One-liner with a jury
Verdict verdict = SpringAiEvaluator.evaluate(
    "Summarize the document",
    () -> chatClient.prompt().user(prompt).call().chatResponse(),
    myJury);

Metadata extracted (constants in SpringAiMetadataKeys):

Key	Source
`springai.responseId`	`ChatResponse.getMetadata().getId()`
`springai.model`	`ChatResponse.getMetadata().getModel()`
`springai.finishReason`	Generation finish reason
`springai.usage.promptTokens`	Prompt token count
`springai.usage.completionTokens`	Completion token count
`springai.usage.totalTokens`	Total token count
`springai.hasToolCalls`	Whether tool calls were made
`springai.toolCalls`	Best-effort tool-call requests (not a full execution trace)

Finish reason mapping: stop → SUCCESS, tool_calls → SUCCESS, length → SUCCESS (indicates truncation; judges may choose to abstain), content_filter → REFUSED, null → UNKNOWN
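
Written out as code, the mapping above amounts to a small lookup. The sketch below is only a restatement of that table: the case-insensitive comparison and the `UNKNOWN` fallback for finish reasons not listed are assumptions, and `toStatus` is a hypothetical name.

```java
import java.util.Locale;

final class FinishReasonSketch {
    enum ExecutionStatus { SUCCESS, FAILED, TIMEOUT, CANCELLED, REFUSED, UNKNOWN }

    // Hypothetical helper restating the documented finish-reason mapping.
    static ExecutionStatus toStatus(String finishReason) {
        if (finishReason == null) {
            return ExecutionStatus.UNKNOWN;
        }
        return switch (finishReason.toLowerCase(Locale.ROOT)) {
            // "length" still maps to SUCCESS; it signals truncation, so a judge may abstain.
            case "stop", "tool_calls", "length" -> ExecutionStatus.SUCCESS;
            case "content_filter" -> ExecutionStatus.REFUSED;
            // Assumption: unlisted reasons are treated as UNKNOWN.
            default -> ExecutionStatus.UNKNOWN;
        };
    }

    public static void main(String[] args) {
        System.out.println(toStatus("stop"));
        System.out.println(toStatus("content_filter"));
        System.out.println(toStatus(null));
    }
}
```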

LangChain4jEvaluator

Bridges LangChain4j Result<T> to agent-judge evaluation. Uses Function<String, Result<T>> because LangChain4j AiServices are dynamic proxies — there’s no common agent interface.

Judgment result = LangChain4jEvaluator.evaluate(
    "Summarize the document",
    goal -> assistant.chat(goal),
    myJudge);

Metadata extracted:

Key	Source
`langchain4j.tokenUsage`	`Result.tokenUsage()`
`langchain4j.toolExecutions`	`Result.toolExecutions()`
`langchain4j.sources`	`Result.sources()` (also used as RAG context fallback)
`langchain4j.finishReason`	`Result.finishReason().name()`

Finish reason mapping: STOP/TOOL_EXECUTION → SUCCESS, LENGTH → SUCCESS (indicates truncation; judges may choose to abstain), CONTENT_FILTER → REFUSED, OTHER → UNKNOWN

KoogEvaluator

Bridges JetBrains Koog AIAgent to agent-judge evaluation. Calls agent.run(input) directly — Koog’s native Java API is synchronous from the caller’s perspective.

Judgment result = KoogEvaluator.evaluate(agent, "Summarize the document", myJudge);

Metadata extracted:

Key	Source
`koog.agentId`	`agent.getId()`

AgentClientEvaluator

Bridges CLI-delegated agents (Claude Code, Codex, Gemini CLI, Amazon Q, etc.) via AgentClient. Uses Supplier<AgentClientResponse> to keep process execution inside AgentClient.

Judgment result = AgentClientEvaluator.evaluate(
    "Fix the build", workspace,
    () -> agentClient.run("Fix the build"),
    myJudge);

Metadata extracted (constants in AgentClientMetadataKeys):

Key	Source
`agentclient.model`	`response.getMetadata().getModel()`
`agentclient.sessionId`	`response.getMetadata().getSessionId()`
`agentclient.finishReason`	`response.getMetadata().getFinishReason()`

AgentClientJudgmentContextBuilder also maps result text to agentOutput, success/failure to ExecutionStatus, workspace to JudgmentContext.workspace, and metadata duration to executionTime.

JudgmentContextBuilder (All Bridges)

For full control, use the JudgmentContextBuilder directly:

// Build context from a pre-existing response
JudgmentContext ctx = SpringAiJudgmentContextBuilder.from(
    chatResponse, "goal", startedAt, duration);

// Or execute and capture in one step
JudgmentContext ctx = SpringAiJudgmentContextBuilder.execute(
    "goal", () -> chatClient.prompt().user(prompt).call().chatResponse());

Each bridge’s builder follows the same two-entry-point pattern: from() for pre-existing responses, execute() for wrapping the call. Both have overloads accepting Map<String, Object> extraMetadata for attaching run IDs, experiment tags, etc.

RAG Evaluation

RagContext

Static helper for extracting RAG metadata from a JudgmentContext:

String question = RagContext.question(context);       // rag.question or goal
Optional<String> ctx = RagContext.context(context);   // rag.context or langchain4j.sources
Optional<String> answer = RagContext.answer(context); // rag.answer or agentOutput

Metadata key constants:

Constant	Value	Fallback
`RagContext.QUESTION_KEY`	`rag.question`	`context.goal()`
`RagContext.CONTEXT_KEY`	`rag.context`	`langchain4j.sources`
`RagContext.ANSWER_KEY`	`rag.answer`	`context.agentOutput()`

The context() method handles both String and List<?> values — lists are joined with newlines.
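
The fallback and list-joining behaviour described above can be approximated as follows. This is a sketch operating on a plain metadata map, not the `RagContext` source; converting list elements with `String.valueOf` and returning an empty `Optional` for any other value type are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

final class RagContextSketch {
    static final String CONTEXT_KEY = "rag.context";
    static final String FALLBACK_KEY = "langchain4j.sources";

    // Looks up rag.context first, then falls back to langchain4j.sources.
    static Optional<String> context(Map<String, ?> metadata) {
        Object raw = metadata.containsKey(CONTEXT_KEY)
                ? metadata.get(CONTEXT_KEY)
                : metadata.get(FALLBACK_KEY);
        if (raw instanceof String text) {
            return Optional.of(text);
        }
        if (raw instanceof List<?> items) {
            // Lists are joined with newlines, one entry per line.
            return Optional.of(items.stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining("\n")));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(context(Map.of("rag.context", List.of("doc A", "doc B"))));
        System.out.println(context(Map.of("langchain4j.sources", "retrieved passage")));
        System.out.println(context(Map.of()));
    }
}
```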

RAG Judges

All three RAG judges extend LLMJudge and return ABSTAIN when required metadata is missing:

Judge	Evaluates	Requires
`FaithfulnessJudge`	Is the answer grounded in the context?	context + answer
`ContextualRelevanceJudge`	Is the context relevant to the question?	context
`HallucinationJudge`	Does the answer contain unsupported claims?	context + answer

See Built-in Judges for usage examples.

Module Coordinates

Judge families:

Module	Artifact	Key Dependencies
Core	`io.github.markpollack:agent-judge-core`	None (zero deps)
AI Core	`io.github.markpollack:agent-judge-ai-core`	None (zero deps)
Exec	`io.github.markpollack:agent-judge-exec`	agent-sandbox
File	`io.github.markpollack:agent-judge-file`	JavaParser, Maven Model
LLM	`io.github.markpollack:agent-judge-llm`	Spring AI ChatClient, `SpringAiJudgeModel`
RAG	`io.github.markpollack:agent-judge-rag`	agent-judge-llm

Framework bridges:

Module	Artifact	Key Dependencies (provided)
Spring AI	`io.github.markpollack:agent-judge-spring-ai`	Spring AI Model
LangChain4j	`io.github.markpollack:agent-judge-langchain4j`	LangChain4j
Koog	`io.github.markpollack:agent-judge-koog`	Koog Agents
AgentClient	`io.github.markpollack:agent-judge-agent-client`	AgentClient, `AgentClientJudgeModel`

Add modules with explicit versions:

<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>agent-judge-core</artifactId>
    <version>0.13.0</version>
</dependency>
