Packages

| Package | Contains |
| --- | --- |
| io.github.markpollack.judge | Core judge interfaces and utilities |
| io.github.markpollack.judge.context | JudgmentContext, ExecutionStatus |
| io.github.markpollack.judge.result | Judgment, JudgmentStatus, Check |
| io.github.markpollack.judge.score | Score, BooleanScore, NumericalScore, CategoricalScore |
| io.github.markpollack.judge.jury | Jury, Verdict, voting strategies |
| io.github.markpollack.judge.springai | Spring AI bridge |
| io.github.markpollack.judge.langchain4j | LangChain4j bridge |
| io.github.markpollack.judge.koog | Koog bridge |
| io.github.markpollack.judge.agentclient | AgentClient bridge |
| io.github.markpollack.judge.ai | AI-backed judge infrastructure |
| io.github.markpollack.judge.rag | RAG judges and RagContext |

Core Types

Judge

The fundamental evaluation interface. It is a functional interface, so a judge can be written as a lambda or a method reference.
@FunctionalInterface
public interface Judge {
    Judgment judge(JudgmentContext context);
}
Use directly as a lambda, or extend DeterministicJudge / LLMJudge for metadata support.
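As a minimal sketch of the lambda style, the block below declares simplified local stand-ins (SketchContext, SketchJudgment, SketchJudge) so that it compiles without the library. These stand-ins are assumptions that mirror only part of the documented signatures, and NON_EMPTY_OUTPUT is a hypothetical rule chosen for illustration.

```java
import java.util.Optional;

// Simplified stand-ins so the sketch compiles on its own; the real records carry more fields.
record SketchContext(String goal, Optional<String> agentOutput) {}

record SketchJudgment(boolean passed, String reasoning) {
    static SketchJudgment pass(String reasoning) { return new SketchJudgment(true, reasoning); }
    static SketchJudgment fail(String reasoning) { return new SketchJudgment(false, reasoning); }
}

@FunctionalInterface
interface SketchJudge {
    SketchJudgment judge(SketchContext context);
}

class LambdaJudgeSketch {
    // Hypothetical rule: pass when the agent produced any non-blank output.
    static final SketchJudge NON_EMPTY_OUTPUT = ctx ->
        ctx.agentOutput().filter(out -> !out.isBlank()).isPresent()
            ? SketchJudgment.pass("Agent produced output")
            : SketchJudgment.fail("Agent output was empty");
}
```

With the real library, the same lambda shape would be assigned to a Judge and would return Judgment.pass(...) or Judgment.fail(...).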

AsyncJudge

Asynchronous variant for non-blocking evaluation:
public interface AsyncJudge {
    CompletableFuture<Judgment> judgeAsync(JudgmentContext context);
}

ReactiveJudge

Reactive variant for Spring WebFlux / Project Reactor:
public interface ReactiveJudge {
    Mono<Judgment> judge(JudgmentContext context);
}

DeterministicJudge

Abstract base class for rule-based judges. Provides JudgeWithMetadata support:
public abstract class DeterministicJudge implements JudgeWithMetadata {
    protected DeterministicJudge(String name, String description)
    public JudgeMetadata metadata()
}
JudgeWithMetadata extends Judge, so DeterministicJudge is also a Judge.

NamedJudge

Composition wrapper that attaches metadata to any judge (including lambdas):
Judge named = Judges.named(myLambda, "check-name", "description");
Judge typed = Judges.named(myLambda, "check-name", "description", JudgeType.DETERMINISTIC);

JudgeWithMetadata

Interface for judges that expose identity metadata:
public interface JudgeWithMetadata extends Judge {
    JudgeMetadata metadata();
}
Infrastructure code can use instanceof JudgeWithMetadata for discovery:
if (judge instanceof JudgeWithMetadata jwm) {
    log.info("Running: {}", jwm.metadata().name());
}

JudgeMetadata

Identity record:
public record JudgeMetadata(String name, String description, JudgeType type)

JudgeType

public enum JudgeType {
    DETERMINISTIC,
    LLM_POWERED,
    HYBRID,
    AGENT
}

Context

JudgmentContext

All evaluation inputs in one immutable record:
public record JudgmentContext(
    String goal,
    Path workspace,
    Duration executionTime,
    Instant startedAt,
    Optional<String> agentOutput,
    ExecutionStatus status,
    Optional<Throwable> error,
    Map<String, Object> metadata
)
Builder methods:
| Method | Description | Required |
| --- | --- | --- |
| goal(String) | The agent’s task description | Yes |
| workspace(Path) | Directory the agent modified | Yes |
| status(ExecutionStatus) | Agent execution outcome | Yes |
| startedAt(Instant) | When execution began | Yes |
| executionTime(Duration) | How long execution took | Yes |
| agentOutput(String) | Text output from the agent | No |
| error(Throwable) | Exception if execution failed | No |
| metadata(String, Object) | Arbitrary key-value pairs | No |
| metadata(Map<String, Object>) | Bulk metadata | No |
JudgmentContext context = JudgmentContext.builder()
    .goal("Add REST endpoint")
    .workspace(Path.of("/project"))
    .status(ExecutionStatus.SUCCESS)
    .startedAt(Instant.now())
    .executionTime(Duration.ofMinutes(2))
    .agentOutput("Created HelloController.java")
    .metadata("expectedDir", Path.of("/reference"))
    .build();

ExecutionStatus

public enum ExecutionStatus {
    SUCCESS, FAILED, TIMEOUT, CANCELLED, REFUSED, UNKNOWN
}
| Value | Meaning |
| --- | --- |
| SUCCESS | Agent completed normally |
| FAILED | Agent threw an exception or returned an error |
| TIMEOUT | Execution exceeded time limit |
| CANCELLED | Execution was cancelled |
| REFUSED | Model declined the request (content filter) |
| UNKNOWN | Status could not be determined |

Results

Judgment

Immutable evaluation result:
public record Judgment(
    Score score,
    JudgmentStatus status,
    String reasoning,
    List<Check> checks,
    Map<String, Object> metadata
)
Static factory methods:
Judgment.pass("Build succeeded")
Judgment.fail("File not found")
Judgment.abstain("Not enough information")
Judgment.error("Timeout", exception)
Builder:
Judgment.builder()
    .score(new BooleanScore(true))
    .status(JudgmentStatus.PASS)
    .reasoning("All checks passed")
    .check(Check.pass("compile", "Compilation succeeded"))
    .check(Check.pass("tests", "All tests passed"))
    .metadata("duration", Duration.ofSeconds(45))
    .build();
Utility methods:
| Method | Returns | Description |
| --- | --- | --- |
| pass() | boolean | true if status == PASS |
| elapsed() | Duration | Elapsed time from metadata |
| error() | Throwable | Error from metadata |

JudgmentStatus

public enum JudgmentStatus {
    PASS, FAIL, ABSTAIN, ERROR
}

Check

Granular sub-assertion within a judgment:
public record Check(String name, boolean passed, String message)
Factory methods:
Check.pass("file_exists")
Check.pass("file_exists", "File found at expected path")
Check.fail("content_match", "Expected header not found")

Score Types

Score is a sealed interface with three permitted implementations:

BooleanScore

Simple pass/fail:
public record BooleanScore(boolean value) implements Score
new BooleanScore(true)   // pass
new BooleanScore(false)  // fail

NumericalScore

Continuous scoring with bounds:
public record NumericalScore(double value, double min, double max) implements Score
NumericalScore score = new NumericalScore(85.0, 0.0, 100.0);
double normalized = score.normalized();  // 0.85 (scaled to [0, 1])

// Convenience factories
NumericalScore zeroToOne = NumericalScore.normalized(0.85);
NumericalScore zeroToTen = NumericalScore.outOfTen(7.5);
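The 0.85 result above is consistent with a linear rescale from [min, max] onto [0, 1]. The sketch below shows that arithmetic on a stand-in record named BoundedScore. Both the formula and the bounds validation are assumptions inferred from the example, not confirmed library behavior.

```java
// Stand-in mirroring the documented NumericalScore shape (value, min, max).
record BoundedScore(double value, double min, double max) {
    // Assumed validation: reject empty ranges and out-of-range values.
    BoundedScore {
        if (!(max > min)) {
            throw new IllegalArgumentException("max must exceed min");
        }
        if (value < min || value > max) {
            throw new IllegalArgumentException("value must lie within [min, max]");
        }
    }

    // Linear rescale of value from [min, max] onto [0, 1].
    double normalized() {
        return (value - min) / (max - min);
    }
}
```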

CategoricalScore

Discrete categories from a fixed set:
public record CategoricalScore(String value, List<String> allowedValues) implements Score
CategoricalScore score = new CategoricalScore(
    "GOOD", List.of("EXCELLENT", "GOOD", "FAIR", "POOR"));

Scores Utility

Convert between score types for heterogeneous aggregation:
// Numerical and Boolean scores normalize directly
// Categorical scores require a mapping from category names to numeric values
double normalized = Scores.toNormalized(anyScore, categoryMap);
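One way such a conversion could work is sketched below with stand-in types (AnyScore, BoolScore, NumScore, CatScore). Mapping true to 1.0 and false to 0.0, and throwing on an unmapped category, are assumptions; the library's Score is sealed, while the stand-in uses a plain interface to keep the sketch short.

```java
import java.util.Map;

// Stand-in score hierarchy; names are hypothetical.
interface AnyScore {}
record BoolScore(boolean value) implements AnyScore {}
record NumScore(double value, double min, double max) implements AnyScore {}
record CatScore(String value) implements AnyScore {}

class ScoreNormalizerSketch {
    // Converts any stand-in score to a value in [0, 1].
    static double toNormalized(AnyScore score, Map<String, Double> categoryMap) {
        if (score instanceof BoolScore b) {
            return b.value() ? 1.0 : 0.0; // assumed boolean mapping
        }
        if (score instanceof NumScore n) {
            return (n.value() - n.min()) / (n.max() - n.min());
        }
        if (score instanceof CatScore c) {
            Double mapped = categoryMap.get(c.value());
            if (mapped == null) {
                throw new IllegalArgumentException("No mapping for category: " + c.value());
            }
            return mapped;
        }
        throw new IllegalArgumentException("Unsupported score type: " + score);
    }
}
```

Once every score is on the same [0, 1] scale, heterogeneous judges can be averaged or compared by a voting strategy.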

Composition

Judges Utility

Static methods for creating and composing judges:
| Method | Description |
| --- | --- |
| named(Judge, String) | Wrap with a name |
| named(Judge, String, String) | Wrap with name and description |
| named(Judge, String, String, JudgeType) | Wrap with full metadata |
| alwaysPass(String) | Test judge that always passes |
| alwaysFail(String) | Test judge that always fails |
| tryMetadata(Judge) | Extract metadata if available (Optional<JudgeMetadata>) |
| and(Judge, Judge) | Short-circuit AND |
| or(Judge, Judge) | Short-circuit OR |
| allOf(Judge...) | All must pass (variadic AND) |
| anyOf(Judge...) | Any can pass (variadic OR) |
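The short-circuit behavior of the combinators can be sketched on a stand-in interface, MiniJudge, that reduces a judgment to a boolean. The names are hypothetical, and whether the library's allOf and anyOf also stop early is an assumption made here for brevity.

```java
// Stand-in judge that reduces a judgment to pass/fail; hypothetical name.
@FunctionalInterface
interface MiniJudge {
    boolean passes(String agentOutput);
}

class CompositionSketch {
    // Short-circuit AND: the second judge runs only if the first passes.
    static MiniJudge and(MiniJudge first, MiniJudge second) {
        return out -> first.passes(out) && second.passes(out);
    }

    // Short-circuit OR: the second judge runs only if the first fails.
    static MiniJudge or(MiniJudge first, MiniJudge second) {
        return out -> first.passes(out) || second.passes(out);
    }

    // Variadic AND: stops at the first failing judge (assumed behavior).
    static MiniJudge allOf(MiniJudge... judges) {
        return out -> {
            for (MiniJudge judge : judges) {
                if (!judge.passes(out)) {
                    return false;
                }
            }
            return true;
        };
    }

    // Variadic OR: stops at the first passing judge (assumed behavior).
    static MiniJudge anyOf(MiniJudge... judges) {
        return out -> {
            for (MiniJudge judge : judges) {
                if (judge.passes(out)) {
                    return true;
                }
            }
            return false;
        };
    }
}
```

Short-circuiting matters most when a cheap deterministic judge is placed before an expensive LLM-backed one, since the expensive call is skipped whenever the cheap check already decides the outcome.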

AI-Core Types

Framework-neutral infrastructure for AI-backed judges. Located in the agent-judge-ai-core module (zero external dependencies).

ModelBackedJudge

Composed AI-backed judge built via builder pattern. Pipeline: render prompt → invoke model → classify response → produce Judgment. No subclassing needed.
ModelBackedJudge judge = ModelBackedJudge.builder()
    .model(judgeModel)
    .template(promptTemplate)
    .classifier(LabelJudgmentClassifier.passFail())
    .build();

Judgment judgment = judge.judge(context);
| Builder Method | Required | Description |
| --- | --- | --- |
| model(JudgeModel) | Yes | AI backend to invoke |
| template(JudgePromptTemplate) | Yes | Prompt template with {{variable}} placeholders |
| classifier(JudgmentClassifier) | Yes | Maps model response to Judgment |
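The render, invoke, classify pipeline amounts to composing three functions in order. The sketch below expresses that with java.util.function.Function; PipelineSketch and its parameter names are hypothetical, and the real builder works with JudgePromptTemplate, JudgeModel, and JudgmentClassifier rather than plain functions.

```java
import java.util.function.Function;

class PipelineSketch {
    // Composes the three stages in order: render prompt, invoke model, classify response.
    static <C, R> Function<C, R> pipeline(
            Function<C, String> renderPrompt,
            Function<String, String> invokeModel,
            Function<String, R> classifyResponse) {
        return renderPrompt.andThen(invokeModel).andThen(classifyResponse);
    }
}
```

Because each stage is supplied separately, a test can pass a canned model function while keeping the production template and classifier.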

JudgeModel

Functional interface for AI model invocation. Framework-specific implementations live in bridge modules.
@FunctionalInterface
public interface JudgeModel {
    JudgeModelResponse call(JudgeModelRequest request);
}
| Implementation | Module | Backend |
| --- | --- | --- |
| SpringAiJudgeModel | agent-judge-llm | Spring AI ChatClient |
| AgentClientJudgeModel | agent-judge-agent-client | CLI agent via AgentClient |

JudgePromptTemplate

Loads, validates, and renders prompt templates with {{variable}} placeholders extracted from JudgmentContext.
JudgePromptTemplate template = JudgePromptTemplate.builder()
    .source(TextSource.classpath("/prompts/correctness.txt"))
    .renderer(new SimpleJudgeTemplateRenderer())
    .missingVariablePolicy(MissingVariablePolicy.STRICT)
    .build();
| Builder Method | Default | Description |
| --- | --- | --- |
| source(TextSource) | Required | Template text source (classpath, file, or string) |
| renderer(JudgeTemplateRenderer) | SimpleJudgeTemplateRenderer | Pluggable template engine |
| missingVariablePolicy(MissingVariablePolicy) | STRICT | STRICT, EMPTY_STRING, or LEAVE_PLACEHOLDER |
Available variables from JudgmentContext: {{goal}}, {{output}}, {{workspace}}, {{status}}, {{metadata.*}}.

JudgeTemplateRenderer

Pluggable template engine interface:
public interface JudgeTemplateRenderer {
    String render(String template, Map<String, String> variables);
}
Default implementation SimpleJudgeTemplateRenderer performs {{variable}} substitution.
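A minimal sketch of {{variable}} substitution with the three missing-variable policies follows. It is not the library's implementation: PlaceholderRendererSketch and MissingPolicy are hypothetical names, the placeholder grammar (letters, digits, underscores, dots) is an assumption, and throwing IllegalArgumentException under STRICT is a guess at reasonable behavior.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Mirrors the documented policy names; the enum itself is a stand-in.
enum MissingPolicy { STRICT, EMPTY_STRING, LEAVE_PLACEHOLDER }

class PlaceholderRendererSketch {
    // Matches {{name}} where name contains letters, digits, underscores, or dots.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{([\\w.]+)\\}\\}");

    static String render(String template, Map<String, String> variables, MissingPolicy policy) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String name = matcher.group(1);
            String value = variables.get(name);
            if (value == null) {
                switch (policy) {
                    case STRICT -> throw new IllegalArgumentException("Missing template variable: " + name);
                    case EMPTY_STRING -> value = "";
                    case LEAVE_PLACEHOLDER -> value = matcher.group();
                }
            }
            // quoteReplacement keeps '$' and '\' in values from being read as group references.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

STRICT fails fast on a typo in a variable name, which is usually preferable for judge prompts because a silently empty placeholder changes what the model is asked to evaluate.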

JudgmentClassifier

Functional interface that maps a model response to a Judgment:
@FunctionalInterface
public interface JudgmentClassifier {
    Judgment classify(JudgeModelResponse response);
}

LabelJudgmentClassifier

Exact normalized label matching with builder pattern:
// Built-in: maps "PASS"/"FAIL" labels to Judgment
JudgmentClassifier classifier = LabelJudgmentClassifier.passFail();

// Custom labels via builder
JudgmentClassifier classifier = LabelJudgmentClassifier.builder()
    .passLabel("CORRECT")
    .failLabel("INCORRECT")
    .build();
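"Exact normalized label matching" can be read as: normalize the response text, then compare it for equality against the configured labels. The sketch below assumes normalization means trimming whitespace and upper-casing, and assumes an unmatched response yields ABSTAIN; neither detail is confirmed by this page, and LabelMatchSketch and LabelOutcome are hypothetical names.

```java
import java.util.Locale;

// Stand-in for the outcome of classification; hypothetical name.
enum LabelOutcome { PASS, FAIL, ABSTAIN }

class LabelMatchSketch {
    static LabelOutcome classify(String responseText, String passLabel, String failLabel) {
        // Assumed normalization: trim surrounding whitespace, compare case-insensitively.
        String normalized = responseText == null
            ? ""
            : responseText.trim().toUpperCase(Locale.ROOT);
        if (normalized.equals(passLabel.toUpperCase(Locale.ROOT))) {
            return LabelOutcome.PASS;
        }
        if (normalized.equals(failLabel.toUpperCase(Locale.ROOT))) {
            return LabelOutcome.FAIL;
        }
        // Exact matching: a sentence that merely contains the label does not count.
        return LabelOutcome.ABSTAIN;
    }
}
```

Exact matching means the prompt template should instruct the model to answer with the label alone.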

Supporting Records

// Model request — messages with options and metadata
public record JudgeModelRequest(
    List<JudgeMessage> messages,
    Map<String, Object> options,
    Map<String, Object> metadata
)
// Factory: JudgeModelRequest.user(prompt)

// Model response — text, model identity, usage, metadata
public record JudgeModelResponse(
    String text,
    String model,
    Usage usage,
    Map<String, Object> metadata
)

// Message with role
public record JudgeMessage(JudgeMessageRole role, String content)
public enum JudgeMessageRole { SYSTEM, USER, ASSISTANT }

// Token usage
public record Usage(int inputTokens, int outputTokens, int totalTokens,
                    double estimatedCostUsd)

Jury System

Jury Interface

public interface Jury {
    Verdict vote(JudgmentContext context);
}

SimpleJury

Flat multi-judge aggregation. See Jury System for full usage. Builder:
| Method | Description |
| --- | --- |
| .judge(Judge) | Add with weight 1.0 |
| .judge(Judge, double) | Add with custom weight |
| .votingStrategy(VotingStrategy) | Required |
| .parallel(boolean) | Default true |
| .executor(Executor) | Custom thread pool |

CascadedJury

Sequential tiered evaluation. See Jury System for full usage. Builder:
| Method | Description |
| --- | --- |
| .tier(String, Jury, TierPolicy) | Add a named tier |
| .build() | Validates last tier is FINAL_TIER |

Verdict

public record Verdict(
    Judgment aggregated,
    List<Judgment> individual,
    Map<String, Judgment> individualByName,
    Map<String, Double> weights,
    List<Verdict> subVerdicts
)
| Field | Description |
| --- | --- |
| aggregated | The voting strategy’s aggregated result |
| individual | All individual judge results (ordered) |
| individualByName | Results keyed by judge name |
| weights | Weight assigned to each judge |
| subVerdicts | Per-tier verdicts (CascadedJury only) |

VotingStrategy

public interface VotingStrategy {
    Judgment aggregate(List<Judgment> judgments, Map<String, Double> weights);
    String getName();
}
Implementations:
| Class | Constructor |
| --- | --- |
| MajorityVotingStrategy | () or (TiePolicy, ErrorPolicy) |
| ConsensusStrategy | () |
| AverageVotingStrategy | () |
| WeightedAverageStrategy | () |
| MedianVotingStrategy | () |

TierPolicy

public enum TierPolicy {
    REJECT_ON_ANY_FAIL,   // Stop on failure
    ACCEPT_ON_ALL_PASS,   // Stop on full pass
    FINAL_TIER            // Always produces verdict (required for last tier)
}
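The control flow these policies imply can be sketched as a loop over tiers that stops at the first decisive one. In the sketch below each tier carries precomputed pass/fail results instead of a Jury, and the final tier accepts only if all of its results pass; both simplifications are assumptions, and TierRule, SketchTier, CascadeOutcome, and CascadeSketch are hypothetical names.

```java
import java.util.List;

// Mirrors the documented TierPolicy constants; the enum itself is a stand-in.
enum TierRule { REJECT_ON_ANY_FAIL, ACCEPT_ON_ALL_PASS, FINAL_TIER }

// A tier reduced to its name, its judges' pass/fail results, and its policy.
record SketchTier(String name, List<Boolean> results, TierRule rule) {}

record CascadeOutcome(String decidedBy, boolean accepted) {}

class CascadeSketch {
    static CascadeOutcome run(List<SketchTier> tiers) {
        if (tiers.isEmpty() || tiers.get(tiers.size() - 1).rule() != TierRule.FINAL_TIER) {
            throw new IllegalArgumentException("Last tier must be FINAL_TIER");
        }
        for (SketchTier tier : tiers) {
            boolean allPass = tier.results().stream().allMatch(Boolean::booleanValue);
            switch (tier.rule()) {
                case REJECT_ON_ANY_FAIL -> {
                    // Stop with a rejection on any failure; otherwise escalate.
                    if (!allPass) return new CascadeOutcome(tier.name(), false);
                }
                case ACCEPT_ON_ALL_PASS -> {
                    // Stop with an acceptance on a full pass; otherwise escalate.
                    if (allPass) return new CascadeOutcome(tier.name(), true);
                }
                case FINAL_TIER -> {
                    // Always decides (assumed rule: accept only when every result passes).
                    return new CascadeOutcome(tier.name(), allPass);
                }
            }
        }
        throw new IllegalStateException("Unreachable: the final tier always decides");
    }
}
```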

TiePolicy

public enum TiePolicy {
    PASS,     // Optimistic
    FAIL,     // Pessimistic (default)
    ABSTAIN   // Neutral
}

ErrorPolicy

public enum ErrorPolicy {
    TREAT_AS_FAIL,     // Default
    TREAT_AS_ABSTAIN,
    IGNORE
}
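How a tie policy and an error policy might interact in a majority vote is sketched below. This is an illustration under assumptions, not the library's MajorityVotingStrategy: weights are ignored, abstentions are not counted, and TREAT_AS_ABSTAIN and IGNORE behave identically here because neither adds to the pass or fail tally. Vote, Tie, OnError, and MajoritySketch are hypothetical names.

```java
import java.util.List;

// Stand-ins mirroring the documented enum constants.
enum Vote { PASS, FAIL, ABSTAIN, ERROR }
enum Tie { PASS, FAIL, ABSTAIN }
enum OnError { TREAT_AS_FAIL, TREAT_AS_ABSTAIN, IGNORE }

class MajoritySketch {
    static Vote aggregate(List<Vote> votes, Tie tie, OnError onError) {
        int pass = 0;
        int fail = 0;
        for (Vote vote : votes) {
            Vote effective = vote;
            if (vote == Vote.ERROR) {
                // Only TREAT_AS_FAIL changes the tally; the other policies drop the vote in this sketch.
                effective = (onError == OnError.TREAT_AS_FAIL) ? Vote.FAIL : Vote.ABSTAIN;
            }
            if (effective == Vote.PASS) {
                pass++;
            } else if (effective == Vote.FAIL) {
                fail++;
            }
        }
        if (pass > fail) return Vote.PASS;
        if (fail > pass) return Vote.FAIL;
        // Equal counts: the tie policy decides.
        return switch (tie) {
            case PASS -> Vote.PASS;
            case FAIL -> Vote.FAIL;
            case ABSTAIN -> Vote.ABSTAIN;
        };
    }
}
```

With the documented defaults (TiePolicy.FAIL and ErrorPolicy.TREAT_AS_FAIL), a single errored judge can turn a one-vote pass into a tie and therefore into a failure, which is the conservative choice.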

Juries Utility

// Quick jury from judges
Jury jury = Juries.fromJudges(strategy, judge1, judge2, judge3);

// Combine two juries into a meta-jury
Jury meta = Juries.combine(jury1, jury2, strategy);

// Multiple juries
Jury all = Juries.allOf(strategy, jury1, jury2, jury3);

Framework Bridge Evaluators

Each framework bridge provides an Evaluator (one-liner convenience) and a JudgmentContextBuilder (full control). All evaluators follow the same four-method pattern: one overload for a Judge and one for a Jury, each with and without extra metadata.
| Runtime | Input type | Evaluator | Context builder |
| --- | --- | --- | --- |
| Spring AI | ChatResponse | SpringAiEvaluator | SpringAiJudgmentContextBuilder |
| LangChain4j | Result<T> | LangChain4jEvaluator | LangChain4jJudgmentContextBuilder |
| Koog | AIAgent | KoogEvaluator | KoogJudgmentContextBuilder |
| AgentClient | AgentClientResponse | AgentClientEvaluator | AgentClientJudgmentContextBuilder |
Bridge modules declare framework dependencies with provided scope. Your application must already include the corresponding framework/runtime dependency.

SpringAiEvaluator

Bridges Spring AI ChatResponse output to agent-judge evaluation. Uses Supplier<ChatResponse> because Spring AI ChatClient calls don’t take the goal as an argument at call time.
// One-liner with a judge
Judgment result = SpringAiEvaluator.evaluate(
    "Summarize the document",
    () -> chatClient.prompt().user(prompt).call().chatResponse(),
    myJudge);

// One-liner with a jury
Verdict verdict = SpringAiEvaluator.evaluate(
    "Summarize the document",
    () -> chatClient.prompt().user(prompt).call().chatResponse(),
    myJury);
Metadata extracted (constants in SpringAiMetadataKeys):
| Key | Source |
| --- | --- |
| springai.responseId | ChatResponse.getMetadata().getId() |
| springai.model | ChatResponse.getMetadata().getModel() |
| springai.finishReason | Generation finish reason |
| springai.usage.promptTokens | Prompt token count |
| springai.usage.completionTokens | Completion token count |
| springai.usage.totalTokens | Total token count |
| springai.hasToolCalls | Whether tool calls were made |
| springai.toolCalls | Best-effort tool-call requests (not a full execution trace) |
Finish reason mapping: stop → SUCCESS, tool_calls → SUCCESS, length → SUCCESS (indicates truncation; judges may choose to abstain), content_filter → REFUSED, null → UNKNOWN
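The mapping above can be written as a single switch. The sketch below uses a stand-in enum, RunStatus, in place of ExecutionStatus; lower-casing the input and mapping unrecognized values to UNKNOWN are assumptions, since the page only specifies the null case.

```java
import java.util.Locale;

// Stand-in mirroring the documented ExecutionStatus constants.
enum RunStatus { SUCCESS, FAILED, TIMEOUT, CANCELLED, REFUSED, UNKNOWN }

class FinishReasonSketch {
    static RunStatus fromFinishReason(String finishReason) {
        if (finishReason == null) {
            return RunStatus.UNKNOWN;
        }
        return switch (finishReason.toLowerCase(Locale.ROOT)) {
            // "length" means the output was truncated but the call itself succeeded.
            case "stop", "tool_calls", "length" -> RunStatus.SUCCESS;
            case "content_filter" -> RunStatus.REFUSED;
            // Assumption: anything unrecognized is reported as UNKNOWN.
            default -> RunStatus.UNKNOWN;
        };
    }
}
```

A judge that cares about truncation can read springai.finishReason from the context metadata and abstain when it equals length.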

LangChain4jEvaluator

Bridges LangChain4j Result<T> to agent-judge evaluation. Uses Function<String, Result<T>> because LangChain4j AiServices are dynamic proxies — there’s no common agent interface.
Judgment result = LangChain4jEvaluator.evaluate(
    "Summarize the document",
    goal -> assistant.chat(goal),
    myJudge);
Metadata extracted:
| Key | Source |
| --- | --- |
| langchain4j.tokenUsage | Result.tokenUsage() |
| langchain4j.toolExecutions | Result.toolExecutions() |
| langchain4j.sources | Result.sources() (also used as RAG context fallback) |
| langchain4j.finishReason | Result.finishReason().name() |
Finish reason mapping: STOP/TOOL_EXECUTION → SUCCESS, LENGTH → SUCCESS (indicates truncation; judges may choose to abstain), CONTENT_FILTER → REFUSED, OTHER → UNKNOWN

KoogEvaluator

Bridges JetBrains Koog AIAgent to agent-judge evaluation. Calls agent.run(input) directly — Koog’s native Java API is synchronous from the caller’s perspective.
Judgment result = KoogEvaluator.evaluate(agent, "Summarize the document", myJudge);
Metadata extracted:
| Key | Source |
| --- | --- |
| koog.agentId | agent.getId() |

AgentClientEvaluator

Bridges CLI-delegated agents (Claude Code, Codex, Gemini CLI, Amazon Q, etc.) via AgentClient. Uses Supplier<AgentClientResponse> to keep process execution inside AgentClient.
Judgment result = AgentClientEvaluator.evaluate(
    "Fix the build", workspace,
    () -> agentClient.run("Fix the build"),
    myJudge);
Metadata extracted (constants in AgentClientMetadataKeys):
| Key | Source |
| --- | --- |
| agentclient.model | response.getMetadata().getModel() |
| agentclient.sessionId | response.getMetadata().getSessionId() |
| agentclient.finishReason | response.getMetadata().getFinishReason() |
AgentClientJudgmentContextBuilder also maps result text to agentOutput, success/failure to ExecutionStatus, workspace to JudgmentContext.workspace, and metadata duration to executionTime.

JudgmentContextBuilder (All Bridges)

For full control, use the JudgmentContextBuilder directly:
// Build context from a pre-existing response
JudgmentContext ctx = SpringAiJudgmentContextBuilder.from(
    chatResponse, "goal", startedAt, duration);

// Or execute and capture in one step
JudgmentContext ctx = SpringAiJudgmentContextBuilder.execute(
    "goal", () -> chatClient.prompt().user(prompt).call().chatResponse());
Each bridge’s builder follows the same two-entry-point pattern: from() for pre-existing responses, execute() for wrapping the call. Both have overloads accepting Map<String, Object> extraMetadata for attaching run IDs, experiment tags, etc.

RAG Evaluation

RagContext

Static helper for extracting RAG metadata from a JudgmentContext:
String question = RagContext.question(context);       // rag.question or goal
Optional<String> ctx = RagContext.context(context);   // rag.context or langchain4j.sources
Optional<String> answer = RagContext.answer(context); // rag.answer or agentOutput
Metadata key constants:
| Constant | Value | Fallback |
| --- | --- | --- |
| RagContext.QUESTION_KEY | rag.question | context.goal() |
| RagContext.CONTEXT_KEY | rag.context | langchain4j.sources |
| RagContext.ANSWER_KEY | rag.answer | context.agentOutput() |
The context() method handles both String and List<?> values — lists are joined with newlines.
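The fallback and list-joining rules can be sketched against a plain metadata map. RagLookupSketch is a hypothetical helper, not the library's RagContext; converting list elements with String.valueOf and returning an empty Optional for other value types are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

class RagLookupSketch {
    // Resolves the retrieval context: rag.context first, then langchain4j.sources.
    static Optional<String> context(Map<String, Object> metadata) {
        Object value = metadata.containsKey("rag.context")
            ? metadata.get("rag.context")
            : metadata.get("langchain4j.sources");
        if (value instanceof String text) {
            return Optional.of(text);
        }
        if (value instanceof List<?> items) {
            // Lists are joined with newlines, one element per line.
            return Optional.of(items.stream()
                .map(String::valueOf)
                .collect(Collectors.joining("\n")));
        }
        return Optional.empty();
    }

    // Resolves the question: rag.question if present, otherwise the goal.
    static String question(Map<String, Object> metadata, String goal) {
        return metadata.get("rag.question") instanceof String text ? text : goal;
    }
}
```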

RAG Judges

All three RAG judges extend LLMJudge and return ABSTAIN when required metadata is missing:
| Judge | Evaluates | Requires |
| --- | --- | --- |
| FaithfulnessJudge | Is the answer grounded in the context? | context + answer |
| ContextualRelevanceJudge | Is the context relevant to the question? | context |
| HallucinationJudge | Does the answer contain unsupported claims? | context + answer |
See Built-in Judges for usage examples.

Module Coordinates

Judge families:
| Module | Artifact | Key Dependencies |
| --- | --- | --- |
| Core | io.github.markpollack:agent-judge-core | None (zero deps) |
| AI Core | io.github.markpollack:agent-judge-ai-core | None (zero deps) |
| Exec | io.github.markpollack:agent-judge-exec | agent-sandbox |
| File | io.github.markpollack:agent-judge-file | JavaParser, Maven Model |
| LLM | io.github.markpollack:agent-judge-llm | Spring AI ChatClient, SpringAiJudgeModel |
| RAG | io.github.markpollack:agent-judge-rag | agent-judge-llm |
Framework bridges:
| Module | Artifact | Key Dependencies (provided) |
| --- | --- | --- |
| Spring AI | io.github.markpollack:agent-judge-spring-ai | Spring AI Model |
| LangChain4j | io.github.markpollack:agent-judge-langchain4j | LangChain4j |
| Koog | io.github.markpollack:agent-judge-koog | Koog Agents |
| AgentClient | io.github.markpollack:agent-judge-agent-client | AgentClient, AgentClientJudgeModel |
Add modules with explicit versions:
<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>agent-judge-core</artifactId>
    <version>0.11.0</version>
</dependency>