What You’ll Build
An experiment that evaluates an AI agent against a dataset of coding tasks, scores the results with a jury of judges, and compares variants to test whether adding knowledge improves quality.

Prerequisites
- Java 17+
- Maven (the project includes ./mvnw)
Concepts
Dataset
A collection of items, each with a task description, “before” source state, and “reference” solution
AgentInvoker
Your agent — anything that takes a prompt + workspace and produces a result
Jury
One or more judges that score the agent’s output against the reference
ExperimentRunner
Orchestrates: load items → invoke agent → evaluate → persist results
Step 1: Create a Dataset
A dataset is a directory with a manifest and per-item directories. The before/ directory is the starting state. The reference/ directory is the correct answer. The agent never sees the reference — it’s used by judges for comparison.
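The exact manifest format and file names are defined by the project; as an illustrative sketch, the layout might look roughly like this (every name below is an assumption):

```
my-dataset/
  manifest.json        # illustrative: lists item ids and metadata
  item-001/
    task.md            # task description given to the agent
    before/            # starting source state (visible to the agent)
    reference/         # reference solution (hidden from the agent)
  item-002/
    ...
```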
Step 2: Implement an AgentInvoker
AgentInvoker is a single-method interface:
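The project defines the real signature; a hedged sketch of the general shape, with all names here as illustrative assumptions, might look like this:

```java
import java.nio.file.Path;

// Illustrative sketch only; the project's actual AgentInvoker signature may differ.
interface AgentInvoker {
    // Take a task prompt plus a workspace directory, return the agent's result.
    String invoke(String prompt, Path workspace);
}

// A trivial invoker, handy for smoke-testing the harness before wiring a real agent.
class EchoInvoker implements AgentInvoker {
    @Override
    public String invoke(String prompt, Path workspace) {
        return "echo: " + prompt + " (workspace: " + workspace + ")";
    }
}
```

Swapping in a real agent means implementing that one method; the rest of the harness stays unchanged.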
For a production implementation, see ClaudeSdkInvoker from the experiment-claude module.
Step 3: Wire a Jury
Start with a simple deterministic judge, such as an exact comparison of the agent’s output against the reference.

Step 4: Run the Experiment
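Steps 3 and 4 can be sketched together. Every name below (Judge, ExactMatchJudge, ExperimentRunner, Item) is an assumption for illustration; the project’s actual API will differ:

```java
import java.util.List;

// Illustrative types; not the project's real API.
interface AgentInvoker { String invoke(String task, String before); }
interface Judge { double score(String result, String reference); }
record Item(String task, String before, String reference) {}

// Step 3: a simple deterministic judge -- exact match against the reference.
class ExactMatchJudge implements Judge {
    @Override
    public double score(String result, String reference) {
        return result.strip().equals(reference.strip()) ? 1.0 : 0.0;
    }
}

// Step 4: orchestrate load -> invoke agent -> evaluate (persistence elided).
class ExperimentRunner {
    static double run(List<Item> items, AgentInvoker agent, List<Judge> jury) {
        double total = 0.0;
        for (Item item : items) {
            String result = agent.invoke(item.task(), item.before());
            // average the jury's scores for this item
            total += jury.stream()
                    .mapToDouble(j -> j.score(result, item.reference()))
                    .average().orElse(0.0);
        }
        return items.isEmpty() ? 0.0 : total / items.size();
    }
}
```

A deterministic judge like this is cheap and reproducible; richer judges can join the same jury later without changing the runner.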
Step 5: Compare Variants
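Running the same dataset through two differently configured agents and comparing mean jury scores can be sketched like this (types repeated from the earlier sketches for self-containment; all names are illustrative assumptions):

```java
import java.util.List;

// Illustrative types, repeated here so the sketch stands alone.
interface AgentInvoker { String invoke(String task, String before); }
interface Judge { double score(String result, String reference); }
record Item(String task, String before, String reference) {}

class VariantComparison {
    // Mean score of one agent configuration over the whole dataset.
    static double meanScore(List<Item> items, AgentInvoker agent, Judge judge) {
        return items.stream()
                .mapToDouble(i -> judge.score(agent.invoke(i.task(), i.before()), i.reference()))
                .average().orElse(0.0);
    }
}
```

Evaluate a baseline invoker and a knowledge-augmented invoker with meanScore over the same items; the difference in means is the effect under test. Keep the dataset and the jury fixed so the agent configuration is the only variable that changes between runs.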
The real power is variant comparison — same dataset, different agent configurations: run each variant against the dataset, then compare the mean jury scores.

What’s Next
Creating Experiments
Dataset design, variant ladders, and filter strategies
Building a Jury
Three-tier evaluation: deterministic, structural, and semantic