assertEquals and AssertJ gives you assertThat, Agent Judge gives you FileExistsJudge, BuildSuccessJudge, and CorrectnessJudge. Deterministic judges are free and fast — did it compile, did coverage go up? LLM judges handle the softer criteria — are the assertions meaningful, does the code follow conventions? Together, they score every agent run against the same rubric. When your judges pass consistently, your agent is ready for the field.
The core module (agent-judge-core) has zero external dependencies. Specialized modules add command execution (agent-judge-exec) and LLM evaluation (agent-judge-llm).
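To make the "free and fast" idea concrete, here is a minimal sketch of a deterministic file-existence check. The `Judgment` record, the `Judge` interface, and the `fileExists` helper below are hypothetical stand-ins written for this example, not the agent-judge-core API; the real FileExistsJudge may take different inputs and return a richer result.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-ins for illustration; not the agent-judge-core types.
record Judgment(double score, boolean pass, String reasoning) {}

@FunctionalInterface
interface Judge {
    Judgment judge(Path workspace);
}

class DeterministicJudgeSketch {
    // Builds a judge that passes when relativePath exists inside the workspace.
    // No LLM call and no subprocess: just a filesystem lookup.
    static Judge fileExists(String relativePath) {
        return workspace -> {
            boolean found = Files.exists(workspace.resolve(relativePath));
            return new Judgment(
                    found ? 1.0 : 0.0,
                    found,
                    (found ? "Found " : "Missing ") + relativePath);
        };
    }

    public static void main(String[] args) throws Exception {
        // Simulate an agent workspace that contains a build file but no README.
        Path workspace = Files.createTempDirectory("agent-run");
        Files.createFile(workspace.resolve("pom.xml"));

        System.out.println(fileExists("pom.xml").judge(workspace));
        System.out.println(fileExists("README.md").judge(workspace));
    }
}
```

Because the check is a pure function of the workspace, it can run on every agent run at no token cost, which is the property the deterministic tier relies on.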
Core Abstractions
Judge
Functional interface: takes JudgmentContext, returns Judgment with score, status, reasoning, and granular checks
Jury
Multi-judge aggregation with voting strategies — majority, consensus, weighted average, median
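The four voting strategies can be sketched in a few lines each. Everything below is an assumption made for illustration: the `Judgment` record is reduced to three fields, `Judge<C>` uses a generic placeholder where the library uses JudgmentContext, and the exact rules (strict majority, unanimous consensus, tie handling in the median) may differ from how the real Jury implements them.

```java
import java.util.List;

// Hypothetical stand-in types; the real Judge, Judgment, and Jury signatures may differ.
record Judgment(double score, boolean pass, String reasoning) {}

@FunctionalInterface
interface Judge<C> {
    Judgment judge(C context);
}

class JurySketch {
    // Runs every judge against the same context and collects the judgments.
    static <C> List<Judgment> runAll(List<Judge<C>> judges, C context) {
        return judges.stream().map(j -> j.judge(context)).toList();
    }

    // Majority: pass when strictly more than half of the judgments pass.
    static boolean majority(List<Judgment> judgments) {
        long passes = judgments.stream().filter(Judgment::pass).count();
        return passes * 2 > judgments.size();
    }

    // Consensus (assumed here to mean unanimity): every judgment must pass.
    static boolean consensus(List<Judgment> judgments) {
        return !judgments.isEmpty() && judgments.stream().allMatch(Judgment::pass);
    }

    // Weighted average of scores; weights[i] belongs to judgments.get(i).
    static double weightedAverage(List<Judgment> judgments, double[] weights) {
        double weighted = 0.0;
        double total = 0.0;
        for (int i = 0; i < judgments.size(); i++) {
            weighted += judgments.get(i).score() * weights[i];
            total += weights[i];
        }
        return total == 0.0 ? 0.0 : weighted / total;
    }

    // Median score; averages the two middle values when the count is even.
    static double median(List<Judgment> judgments) {
        double[] s = judgments.stream().mapToDouble(Judgment::score).sorted().toArray();
        if (s.length == 0) return 0.0;
        int mid = s.length / 2;
        return s.length % 2 == 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Three toy judges over a string of generated test code.
        List<Judge<String>> judges = List.of(
                code -> new Judgment(1.0, true, "file is non-empty"),
                code -> new Judgment(0.4, false, "coverage did not improve"),
                code -> new Judgment(0.8, code.contains("assertThat"), "uses fluent assertions"));

        List<Judgment> results = runAll(judges, "assertThat(sum).isEqualTo(4);");
        System.out.println("majority: " + majority(results));
        System.out.println("consensus: " + consensus(results));
        System.out.println("median: " + median(results));
    }
}
```

Majority and consensus vote on pass/fail status, while weighted average and median aggregate the numeric scores, so a jury can produce both a verdict and a single comparable number per run.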
Judge Types
| Type | Module | Cost | Example |
|---|---|---|---|
| Deterministic | agent-judge-core | Free | FileExistsJudge, FileContentJudge, custom rules |
| Command | agent-judge-exec | Compute only | BuildSuccessJudge (Maven/Gradle), CommandJudge |
| LLM | agent-judge-llm | Token cost | CorrectnessJudge, custom LLM evaluation |
| Agent | agent-judge-agent | Agent cost | Delegate evaluation to an AI agent |
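One way to read the Cost column is as an evaluation order: run the free checks first and spend tokens only on runs that survive them. The sketch below illustrates that short-circuiting idea with hypothetical `Tier` and `Judgment` types invented for this example; it is not how the library's cascaded jury is necessarily implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical types for illustration only; not part of any agent-judge module.
record Judgment(String judge, boolean pass, String reasoning) {}

// A tier pairs a label with a lazily evaluated judge, so skipped tiers cost nothing.
record Tier(String name, Supplier<Judgment> judge) {}

class CascadeSketch {
    // Evaluates tiers in the given order and stops at the first failure,
    // so the costlier tiers that follow are never invoked.
    static List<Judgment> evaluate(List<Tier> tiersCheapestFirst) {
        List<Judgment> results = new ArrayList<>();
        for (Tier tier : tiersCheapestFirst) {
            Judgment judgment = tier.judge().get();
            results.add(judgment);
            if (!judgment.pass()) {
                break;
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Tier> tiers = List.of(
                new Tier("deterministic", () -> new Judgment("file-exists", true, "pom.xml present")),
                new Tier("command", () -> new Judgment("build", false, "compilation failed")),
                new Tier("llm", () -> new Judgment("correctness", true, "never reached in this run")));

        // The build tier fails, so the LLM tier's supplier is not called.
        evaluate(tiers).forEach(System.out::println);
    }
}
```

Wrapping each judge in a `Supplier` is the design choice that matters here: the expensive call is deferred until the cheaper tiers have passed.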
Documentation
Full Documentation
API reference, built-in judges, jury system, voting strategies, code examples
Source Code
6 modules — core, exec, llm, agent, advisor, BOM
Guides
Building a Jury
Wire judges into a cascaded jury for experiment evaluation
LLM as Judge (Blog)
Judge design, rubric creation, scoring patterns
Role in the Lab
- Agent Workflow: JudgeGate wraps a jury as a quality checkpoint in workflow pipelines
- Agent Experiment: jury system scores agent output across all experiments
- Agent Client: JudgeAdvisor integrates evaluation into agent execution loops
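As a rough picture of what "quality checkpoint" means in a pipeline, the sketch below gates an output on an aggregate score. `GateSketch`, its threshold parameter, and the scoring function are all hypothetical and written only for this example; the real JudgeGate and JudgeAdvisor APIs are not shown on this page and may look quite different.

```java
import java.util.Optional;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch; not the JudgeGate API.
class GateSketch<T> {
    // Stand-in for a jury that reduces an output to one aggregate score in [0, 1].
    private final ToDoubleFunction<T> jury;
    private final double threshold;

    GateSketch(ToDoubleFunction<T> jury, double threshold) {
        this.jury = jury;
        this.threshold = threshold;
    }

    // Returns the output unchanged when its score clears the threshold,
    // otherwise an empty Optional so downstream steps are skipped.
    Optional<T> check(T output) {
        return jury.applyAsDouble(output) >= threshold
                ? Optional.of(output)
                : Optional.empty();
    }

    public static void main(String[] args) {
        GateSketch<String> gate = new GateSketch<>(
                code -> code.contains("assertThat") ? 1.0 : 0.2, 0.7);

        // A later pipeline step runs only if the gate lets the output through.
        String next = gate.check("assertThat(sum).isEqualTo(4);")
                .map(code -> "publish: " + code)
                .orElse("retry");
        System.out.println(next);
    }
}
```

Returning an `Optional` rather than throwing keeps the gate composable: a workflow can map over accepted output, and an execution loop can treat the empty case as a signal to retry.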