Quick Start with the Template
The fastest way to start is the agent-experiment-template, a pre-wired project with variant config, analysis scripts, and the improvement flywheel methodology built in. It includes ExperimentApp (a CLI with --variant, --item, and --run-all-variants), a pluggable AgentInvoker, a cascaded jury, Markov analysis scripts, and GrowthStoryReporter for variant comparison. Customize three things: the agent invoker, domain judges, and knowledge files.
To wire the experiment loop yourself instead, follow the steps below.
What You’ll Build
An experiment that evaluates an AI agent against a dataset of coding tasks, scores the results with a jury of judges, and compares variants to test whether adding knowledge improves quality.
Prerequisites
- Java 17+
- Maven (the project includes ./mvnw)
Concepts
Dataset
A collection of items, each with a task description, “before” source state, and “reference” solution
AgentInvoker
Your agent — anything that takes a prompt + workspace and produces a result
Jury
One or more judges that score the agent’s output against the reference
AgentExperiment
Orchestrates: load items → invoke agent → evaluate → persist results
Step 1: Create a Dataset
A dataset is a directory with a manifest and per-item directories. Each item's before/ directory is the starting state, and its reference/ directory is the correct answer. The agent never sees the reference; judges use it for comparison.
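To make the layout concrete, here is a minimal sketch in plain Java that walks a dataset root and keeps only the item directories containing both before/ and reference/. The class name DatasetLayoutSketch, the Item record, and the scan method are hypothetical names chosen for illustration, not part of the library. The sketch also ignores the manifest, since its name and format are not described here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical sketch: names and layout checks are assumptions, not the library's API.
final class DatasetLayoutSketch {

    /** One dataset item: an id plus its starting state and its reference solution. */
    record Item(String id, Path before, Path reference) {}

    /** Lists item directories that contain both before/ and reference/, sorted by path. */
    static List<Item> scan(Path datasetRoot) throws IOException {
        try (Stream<Path> children = Files.list(datasetRoot)) {
            return children
                    .filter(Files::isDirectory)
                    .filter(dir -> Files.isDirectory(dir.resolve("before"))
                            && Files.isDirectory(dir.resolve("reference")))
                    .sorted()
                    .map(dir -> new Item(dir.getFileName().toString(),
                            dir.resolve("before"), dir.resolve("reference")))
                    .toList();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: pass the dataset root as the first argument.
        for (Item item : scan(Path.of(args[0]))) {
            System.out.println(item.id() + " -> " + item.before());
        }
    }
}
```

Keeping before and reference as separate paths makes it easy to hand only the before path to the agent, so the reference stays out of its view.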
Step 2: Implement an AgentInvoker
Using the template invokers (recommended)
The template ships with ready-made invokers that handle journal integration, knowledge injection, and cost tracking. Pick the one that matches your orchestration. The single-step workflow is the simplest starting point: it wraps a ClaudeStep in a Workflow with automatic journal recording.
From scratch
If you need full control, AgentInvoker is a single-method interface that you can implement directly.
A ready-made ClaudeSdkInvoker is available from the experiment-claude module.
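As a rough illustration of the single-method shape, the sketch below defines a stand-in interface that takes a prompt and a workspace and returns a result. The names InvokerSketch, Invoker, Result, and ECHO are hypothetical, and the real AgentInvoker signature may differ, so treat this as a picture of the idea rather than the library's contract.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical shapes for illustration; the real AgentInvoker signature may differ.
final class InvokerSketch {

    /** Outcome of one agent run. */
    record Result(boolean success, String summary) {}

    /** A single-method contract: prompt and workspace in, result out. */
    @FunctionalInterface
    interface Invoker {
        Result invoke(String prompt, Path workspace) throws IOException;
    }

    /** A stand-in "agent" that only records the prompt in the workspace. */
    static final Invoker ECHO = (prompt, workspace) -> {
        Files.writeString(workspace.resolve("NOTES.md"), prompt);
        return new Result(true, "wrote NOTES.md");
    };

    public static void main(String[] args) throws IOException {
        Path workspace = Files.createTempDirectory("workspace");
        Result result = ECHO.invoke("Add a null check to the parser", workspace);
        System.out.println(result.summary());
    }
}
```

Because the contract has one method, an invoker can be written as a lambda, which makes it cheap to swap in a stub like ECHO while the dataset and judges are still being set up.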
Step 3: Wire a Jury
Start with a simple deterministic judge.
Step 4: Run the Experiment
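Running an experiment follows the loop described under Concepts: load items, invoke the agent, evaluate, and persist results. The sketch below shows that loop in plain Java with a deterministic exact-match judge. Everything in it (ExperimentLoopSketch, Item, Verdict, exactMatch, run) is a hypothetical stand-in rather than the library's AgentExperiment or Jury types. It also models the agent as a simple function from task to output, and it returns the verdicts in a list instead of persisting them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the loop; none of these types are the library's own.
final class ExperimentLoopSketch {

    record Item(String id, String task, String reference) {}

    record Verdict(String itemId, double score) {}

    /** Deterministic judge: 1.0 if the output equals the reference after trimming, else 0.0. */
    static double exactMatch(String output, String reference) {
        return output.strip().equals(reference.strip()) ? 1.0 : 0.0;
    }

    /** For each item: invoke the agent on the task, judge the output, collect the verdict. */
    static List<Verdict> run(List<Item> items, Function<String, String> agent) {
        List<Verdict> verdicts = new ArrayList<>();
        for (Item item : items) {
            String output = agent.apply(item.task()); // the agent sees only the task
            verdicts.add(new Verdict(item.id(), exactMatch(output, item.reference())));
        }
        return verdicts;
    }

    public static void main(String[] args) {
        List<Item> items = List.of(
                new Item("upper", "uppercase: abc", "ABC"),
                new Item("reverse", "reverse: abc", "cba"));
        // Toy agent that only knows how to uppercase the text after the colon.
        Function<String, String> agent =
                task -> task.substring(task.indexOf(':') + 1).strip().toUpperCase();
        run(items, agent).forEach(v -> System.out.println(v.itemId() + " " + v.score()));
    }
}
```

A deterministic judge like this is cheap and repeatable, which makes it a sensible first check before any more expensive judges are added to the jury.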
Step 5: Compare Variants
The real power is variant comparison: same dataset, different agent configurations.
What’s Next
Creating Experiments
Dataset design, variant ladders, and filter strategies
Building a Jury
Three-tier evaluation: deterministic, structural, and semantic