Judges are unit tests for your agent. You wouldn't ship application code without assertions, and agents need the same discipline. Agent Judge provides the testing framework: deterministic rules, command execution, and LLM-powered assessment that compose into juries with configurable voting strategies. Just as JUnit gives you assertEquals and AssertJ gives you assertThat, Agent Judge gives you FileExistsJudge, BuildSuccessJudge, and CorrectnessJudge.

Deterministic judges are free and fast: did it compile, did coverage go up? LLM judges handle the softer criteria: are the assertions meaningful, does the code follow conventions? Together, they score every agent run against the same rubric. When your judges pass consistently, your agent is ready for the field.

The core module (agent-judge-core) has zero external dependencies. Specialized modules add command execution (agent-judge-exec) and LLM evaluation (agent-judge-llm).

Core Abstractions

Judge

Functional interface — takes JudgmentContext, returns Judgment with score, status, reasoning, and granular checks
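
To make the shape of this abstraction concrete, here is a minimal, self-contained sketch of a judge as a functional interface. It does not use the Agent Judge library: the type names Judge, Judgment, and JudgmentContext come from the description above, but every field, method signature, and the containsAll helper are assumptions made for illustration, and the real API may differ.

```java
import java.util.List;

class JudgeSketch {
    enum Status { PASS, FAIL }

    // Hypothetical context: the only input here is the agent's output text.
    record JudgmentContext(String agentOutput) {}

    // One granular check inside a judgment (illustrative shape).
    record Check(String name, boolean passed) {}

    // Score, status, reasoning, and granular checks, as listed above.
    record Judgment(double score, Status status, String reasoning, List<Check> checks) {}

    // Assumed single-method shape: context in, judgment out.
    @FunctionalInterface
    interface Judge {
        Judgment judge(JudgmentContext context);
    }

    // A deterministic judge: passes only when the output contains every required token.
    static Judge containsAll(List<String> required) {
        return context -> {
            List<Check> checks = required.stream()
                .map(token -> new Check("contains:" + token,
                                        context.agentOutput().contains(token)))
                .toList();
            long passed = checks.stream().filter(Check::passed).count();
            double score = required.isEmpty() ? 1.0 : (double) passed / required.size();
            Status status = passed == required.size() ? Status.PASS : Status.FAIL;
            String reasoning = passed + " of " + required.size() + " checks passed";
            return new Judgment(score, status, reasoning, checks);
        };
    }

    public static void main(String[] args) {
        Judge judge = containsAll(List.of("@Test", "assertThat"));
        Judgment result = judge.judge(
            new JudgmentContext("@Test void works() { assertEquals(1, 1); }"));
        System.out.println(result.status() + " " + result.score() + " " + result.reasoning());
    }
}
```

Because the interface has a single method, any lambda with the right signature can act as a judge, which is what lets rule-based, command-based, and LLM-based judges share one contract.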

Jury

Multi-judge aggregation with voting strategies — majority, consensus, weighted average, median
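
The four voting strategies named above can be sketched as plain functions over the individual judges' verdicts and scores. This is an illustration of the aggregation arithmetic only, under assumed semantics (majority means strictly more than half pass, consensus means all pass); the class and method names are hypothetical and are not the library's API.

```java
import java.util.Arrays;

class JurySketch {
    // Majority: pass when strictly more than half of the verdicts pass (assumed semantics).
    static boolean majority(boolean[] verdicts) {
        int passes = 0;
        for (boolean verdict : verdicts) {
            if (verdict) passes++;
        }
        return passes * 2 > verdicts.length;
    }

    // Consensus: pass only when every verdict passes; an empty jury does not pass.
    static boolean consensus(boolean[] verdicts) {
        for (boolean verdict : verdicts) {
            if (!verdict) return false;
        }
        return verdicts.length > 0;
    }

    // Weighted average: each judge's score counts in proportion to its weight.
    static double weightedAverage(double[] scores, double[] weights) {
        if (scores.length != weights.length) {
            throw new IllegalArgumentException("scores and weights must have the same length");
        }
        double total = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            total += scores[i] * weights[i];
            weightSum += weights[i];
        }
        if (weightSum == 0.0) {
            throw new IllegalArgumentException("weights must not sum to zero");
        }
        return total / weightSum;
    }

    // Median: the middle score after sorting, which resists a single outlier judge.
    static double median(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("at least one score is required");
        }
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        boolean[] verdicts = {true, true, false};
        double[] scores = {1.0, 0.8, 0.2};
        double[] weights = {1.0, 1.0, 2.0};
        System.out.println("majority: " + majority(verdicts));
        System.out.println("consensus: " + consensus(verdicts));
        System.out.println("weighted average: " + weightedAverage(scores, weights));
        System.out.println("median: " + median(scores));
    }
}
```

With these inputs the strategies disagree: two of three judges pass, so majority accepts the run while consensus rejects it, and doubling the weight of the failing judge pulls the weighted average well below the median.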

Judge Types

| Type | Module | Cost | Example |
|---|---|---|---|
| Deterministic | agent-judge-core | Free | FileExistsJudge, FileContentJudge, custom rules |
| Command | agent-judge-exec | Compute only | BuildSuccessJudge (Maven/Gradle), CommandJudge |
| LLM | agent-judge-llm | Token cost | CorrectnessJudge, custom LLM evaluation |
| Agent | agent-judge-agent | Agent cost | Delegate evaluation to an AI agent |
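
As an illustration of why the deterministic tier is free and fast, the sketch below shows the kind of check a file-existence or file-content judge might perform, using only the JDK. It is not the implementation of FileExistsJudge or FileContentJudge; the class name, method names, and workspace layout are assumptions for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class DeterministicChecksSketch {
    // Hypothetical check: does the agent's workspace contain the expected file?
    static boolean fileExists(Path workspace, String relativePath) {
        return Files.isRegularFile(workspace.resolve(relativePath));
    }

    // Hypothetical check: does the file exist and contain the expected text?
    static boolean fileContains(Path workspace, String relativePath, String expected)
            throws IOException {
        Path file = workspace.resolve(relativePath);
        return Files.isRegularFile(file) && Files.readString(file).contains(expected);
    }

    public static void main(String[] args) throws IOException {
        // Simulate an agent run that wrote one file into a temporary workspace.
        Path workspace = Files.createTempDirectory("agent-run");
        Files.writeString(workspace.resolve("README.md"), "# Demo project\n");

        System.out.println("README.md exists: " + fileExists(workspace, "README.md"));
        System.out.println("pom.xml exists: " + fileExists(workspace, "pom.xml"));
        System.out.println("README.md has title: "
            + fileContains(workspace, "README.md", "# Demo"));
    }
}
```

Checks like these need no model call and no subprocess, so they are natural first-line filters before spending compute on a build or tokens on an LLM assessment.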

Documentation

Full Documentation

API reference, built-in judges, jury system, voting strategies, code examples

Source Code

6 modules — core, exec, llm, agent, advisor, BOM

Guides

Building a Jury

Wire judges into a cascaded jury for experiment evaluation

LLM as Judge (Blog)

Judge design, rubric creation, scoring patterns

Role in the Lab

  • Agent Workflow: JudgeGate wraps a jury as a quality checkpoint in workflow pipelines
  • Agent Experiment: jury system scores agent output across all experiments
  • Agent Client: JudgeAdvisor integrates evaluation into agent execution loops