What You’ll Build
An experiment that evaluates an AI agent against a dataset of coding tasks, scores the results with a jury of judges, and compares variants to test whether adding knowledge improves quality.
Prerequisites
Java 17+
Maven (the project includes ./mvnw)
Concepts
Dataset: A collection of items, each with a task description, a “before” source state, and a “reference” solution
AgentInvoker: Your agent, meaning anything that takes a prompt and a workspace and produces a result
Jury: One or more judges that score the agent’s output against the reference
AgentExperiment: Orchestrates the pipeline (load items → invoke agent → evaluate → persist results)
Step 1: Create a Dataset
A dataset is a directory with a manifest and per-item directories:
my-dataset/
├── dataset.json
└── items/
    └── RENAME-001/
        ├── item.json
        ├── before/
        │   └── src/main/java/com/example/Person.java
        └── reference/
            └── src/main/java/com/example/Person.java
dataset.json — the manifest:
{
  "schemaVersion": 1,
  "name": "rename-field",
  "version": "1.0.0",
  "description": "Field rename tasks",
  "items": [
    {
      "id": "RENAME-001",
      "slug": "simple-rename",
      "path": "items/RENAME-001",
      "bucket": "A",
      "taskType": "rename-field",
      "status": "active"
    }
  ]
}
item.json — per-item metadata:
{
  "schemaVersion": 1,
  "id": "RENAME-001",
  "slug": "simple-rename",
  "developerTask": "Rename the field 'name' to 'fullName' in Person.java and update all references",
  "taskType": "rename-field",
  "bucket": "A",
  "noChange": false,
  "knowledgeRefs": [],
  "tags": ["rename", "simple"],
  "status": "active"
}
The before/ directory is the starting state. The reference/ directory is the correct answer. The agent never sees the reference — it’s used by judges for comparison.
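Before running an experiment, it can help to confirm that every item directory has the three pieces described above. The following is a minimal sketch using only the JDK; `DatasetLayoutCheck` and `findProblems` are hypothetical names introduced for illustration and are not part of the library. It checks only that files and directories exist and does not parse the JSON.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not a library class): checks the on-disk dataset layout. */
public class DatasetLayoutCheck {

    /** Returns human-readable problems; an empty list means the layout looks complete. */
    public static List<String> findProblems(Path datasetDir) throws IOException {
        List<String> problems = new ArrayList<>();
        if (!Files.isRegularFile(datasetDir.resolve("dataset.json"))) {
            problems.add("missing dataset.json");
        }
        Path items = datasetDir.resolve("items");
        if (!Files.isDirectory(items)) {
            problems.add("missing items/");
            return problems;
        }
        List<Path> itemDirs;
        try (Stream<Path> children = Files.list(items)) {
            // Sort so the report order is stable across file systems
            itemDirs = children.filter(Files::isDirectory).sorted().collect(Collectors.toList());
        }
        for (Path item : itemDirs) {
            String id = item.getFileName().toString();
            if (!Files.isRegularFile(item.resolve("item.json"))) {
                problems.add(id + ": missing item.json");
            }
            if (!Files.isDirectory(item.resolve("before"))) {
                problems.add(id + ": missing before/");
            }
            if (!Files.isDirectory(item.resolve("reference"))) {
                problems.add(id + ": missing reference/");
            }
        }
        return problems;
    }
}
```

Calling `DatasetLayoutCheck.findProblems(Path.of("my-dataset"))` on the layout shown earlier should return an empty list when nothing is missing.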
Step 2: Implement an AgentInvoker
AgentInvoker is a single-method interface:
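import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.TimeUnit;
// Library imports (AgentInvoker, InvocationContext, InvocationResult) omitted

public class MyAgent implements AgentInvoker {

    @Override
    public InvocationResult invoke(InvocationContext context) {
        // Your agent works in context.workspacePath()
        // using context.prompt() as the task description
        ProcessBuilder pb = new ProcessBuilder(
                "my-agent", "--workspace", context.workspacePath().toString(),
                "--prompt", context.prompt());
        pb.directory(context.workspacePath().toFile());
        try {
            Process p = pb.start();
            boolean finished = p.waitFor(
                    context.timeout().toMillis(), TimeUnit.MILLISECONDS);
            if (!finished) {
                p.destroyForcibly();
                return InvocationResult.timeout(
                        context.timeout().toMillis(),
                        context.metadata(), "Timed out");
            }
        } catch (IOException e) {
            // start() throws a checked exception if the agent binary cannot be launched
            throw new UncheckedIOException("Could not launch my-agent", e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag before giving up
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for my-agent", e);
        }
        return InvocationResult.completed(
                List.of(), 0, 0, 0, 0.0,
                System.currentTimeMillis(),
                null, context.metadata());
    }
}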
For Claude Code, use the built-in ClaudeSdkInvoker from the experiment-claude module.
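If your agent is an external process, the launch-and-wait logic can be factored out and tested without the experiment framework. The sketch below uses only the JDK; `TimedProcess`, `Status`, and `Outcome` are hypothetical names, not library types. It distinguishes four outcomes (completed, timed out, failed to start, interrupted) and measures elapsed time with `System.nanoTime()`, which an `AgentInvoker` could then translate into whatever result type the library expects.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a library class): runs a command with a hard timeout. */
public class TimedProcess {

    public enum Status { COMPLETED, TIMED_OUT, FAILED_TO_START, INTERRUPTED }

    /** exitCode is -1 unless the process completed on its own. */
    public record Outcome(Status status, int exitCode, long elapsedMillis) {}

    public static Outcome run(List<String> command, Path workDir, Duration timeout) {
        long startNanos = System.nanoTime();
        Process process;
        try {
            process = new ProcessBuilder(command)
                    .directory(workDir.toFile())
                    .inheritIO() // avoid blocking on a full, unread output pipe
                    .start();
        } catch (IOException e) {
            // Typical causes: binary not on PATH, or working directory missing
            return new Outcome(Status.FAILED_TO_START, -1, elapsedMillis(startNanos));
        }
        try {
            if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                process.destroyForcibly();
                return new Outcome(Status.TIMED_OUT, -1, elapsedMillis(startNanos));
            }
            return new Outcome(Status.COMPLETED, process.exitValue(), elapsedMillis(startNanos));
        } catch (InterruptedException e) {
            process.destroyForcibly();
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return new Outcome(Status.INTERRUPTED, -1, elapsedMillis(startNanos));
        }
    }

    private static long elapsedMillis(long startNanos) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }
}
```

Returning a status instead of throwing keeps a single failing item from aborting the whole run, at the cost of the caller having to check it.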
Step 3: Wire a Jury
Start with a simple deterministic judge:
public class FileExistsJudge implements Judge, JudgeWithMetadata {

    private final String expectedFile;

    public FileExistsJudge(String expectedFile) {
        this.expectedFile = expectedFile;
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        boolean exists = Files.exists(
                context.workspacePath().resolve(expectedFile));
        return Judgment.builder()
                .score(new BooleanScore(exists))
                .status(exists ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
                .reasoning(exists ? "Found" : "Missing: " + expectedFile)
                .build();
    }

    @Override
    public JudgeMetadata metadata() {
        return new JudgeMetadata(
                "file_exists",
                "Checks that " + expectedFile + " exists",
                JudgeType.DETERMINISTIC);
    }
}

Jury jury = SimpleJury.builder()
        .judge(new FileExistsJudge("src/main/java/com/example/Person.java"), 1.0)
        .votingStrategy(new MajorityVotingStrategy())
        .build();
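With a single judge the voting strategy is trivial, but it matters once several weighted judges disagree. The sketch below shows one plausible reading of weighted majority voting using only the JDK. It is an assumption for illustration: `WeightedMajoritySketch` and `Vote` are hypothetical names, and the library's `MajorityVotingStrategy` may treat weights, ties, or abstentions differently.

```java
import java.util.List;

/** Hypothetical illustration (not the library's implementation) of weighted majority voting. */
public class WeightedMajoritySketch {

    /** One judge's verdict and the weight it was registered with. */
    public record Vote(String judgeName, boolean pass, double weight) {}

    /** Passes only if strictly more than half of the total weight voted PASS. */
    public static boolean passes(List<Vote> votes) {
        double passWeight = 0.0;
        double totalWeight = 0.0;
        for (Vote vote : votes) {
            if (vote.weight() < 0) {
                throw new IllegalArgumentException("negative weight for " + vote.judgeName());
            }
            totalWeight += vote.weight();
            if (vote.pass()) {
                passWeight += vote.weight();
            }
        }
        // An empty jury or an exact tie counts as a failure in this sketch
        return totalWeight > 0 && passWeight > totalWeight / 2.0;
    }
}
```

Treating ties as failures is a conservative choice made here for simplicity; a real jury might instead escalate ties to an additional judge.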
Step 4: Run the Experiment
DatasetManager datasetManager = new FileSystemDatasetManager();
ResultStore resultStore = new FileSystemResultStore(Path.of("results"));

ExperimentConfig config = ExperimentConfig.builder()
        .experimentName("rename-field-v1")
        .datasetDir(Path.of("my-dataset"))
        .model("sonnet")
        .promptTemplate("{{task}}")
        .perItemTimeout(Duration.ofMinutes(2))
        .outputDir(Path.of("results"))
        .build();

AgentExperiment experiment = new AgentExperiment(
        datasetManager, jury, resultStore, config);
ExperimentResult result = experiment.run(new MyAgent());

System.out.printf("Pass rate: %.0f%% (%d/%d)%n",
        result.passRate() * 100,
        result.passCount(),
        result.items().size());
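The `promptTemplate` value uses `{{task}}`-style placeholders. How the library expands them is not shown here, so the following is only a sketch of the general technique: replace each `{{name}}` with a value from a map and leave unknown placeholders untouched. `PromptTemplateSketch` and `render` are hypothetical names, and the real substitution rules may differ.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration (not the library's renderer) of {{placeholder}} substitution. */
public class PromptTemplateSketch {

    // Matches {{name}} where name consists of letters, digits, or underscores
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    public static String render(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            // Keep the placeholder text as-is when no value is supplied
            String replacement = value != null ? value : matcher.group();
            // quoteReplacement stops '$' and '\' in values from being read as group references
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Leaving unknown placeholders visible, rather than replacing them with an empty string, makes a misspelled placeholder easy to spot in the rendered prompt.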
Step 5: Compare Variants
The real power is variant comparison — same dataset, different agent configurations:
// datasetManager, jury, and resultStore are the same objects created in Step 4.
// datasetDir, baseAgent, and kbAgent are supplied by your own setup code.

// Variant A: base agent
ExperimentConfig configA = ExperimentConfig.builder()
        .experimentName("rename-v1-base")
        .datasetDir(datasetDir)
        .model("sonnet")
        .promptTemplate("{{task}}")
        .perItemTimeout(Duration.ofMinutes(2))
        .build();
ExperimentResult resultA = new AgentExperiment(
        datasetManager, jury, resultStore, configA).run(baseAgent);

// Variant B: agent with knowledge base
ExperimentConfig configB = ExperimentConfig.builder()
        .experimentName("rename-v1-with-kb")
        .datasetDir(datasetDir)
        .model("sonnet")
        .promptTemplate("{{task}}\n\nRelevant knowledge:\n{{knowledgeRefs}}")
        .knowledgeBaseDir(Path.of("knowledge"))
        .perItemTimeout(Duration.ofMinutes(2))
        .build();
ExperimentResult resultB = new AgentExperiment(
        datasetManager, jury, resultStore, configB).run(kbAgent);
Same model. Same dataset. Does adding curated knowledge improve agent quality? That’s the thesis in action.
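Once both variants have run, the comparison itself is simple arithmetic on pass counts. The sketch below uses only the JDK; `VariantComparisonSketch` and `Summary` are hypothetical names, and in practice the counts would come from each `ExperimentResult` (for example `passCount()` and `items().size()` as used in Step 4).

```java
import java.util.Locale;

/** Hypothetical illustration (not a library class) of comparing two variants by pass rate. */
public class VariantComparisonSketch {

    /** Pass counts for one variant; passRate() is 0.0 for an empty run. */
    public record Summary(String name, int passCount, int total) {
        public double passRate() {
            return total == 0 ? 0.0 : (double) passCount / total;
        }
    }

    /** Difference in pass rate, in percentage points (candidate minus base). */
    public static double deltaPercentagePoints(Summary base, Summary candidate) {
        return (candidate.passRate() - base.passRate()) * 100.0;
    }

    public static String report(Summary base, Summary candidate) {
        return String.format(Locale.ROOT,
                "%s: %.0f%% (%d/%d) -> %s: %.0f%% (%d/%d), delta %+.1f pp",
                base.name(), base.passRate() * 100, base.passCount(), base.total(),
                candidate.name(), candidate.passRate() * 100,
                candidate.passCount(), candidate.total(),
                deltaPercentagePoints(base, candidate));
    }
}
```

On a dataset with only a handful of items, a difference of one or two passes can easily be noise, so a delta is more convincing when the dataset is larger or the runs are repeated.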
What’s Next
Creating Experiments: Dataset design, variant ladders, and filter strategies
Building a Jury: Three-tier evaluation (deterministic, structural, and semantic)