Quick Start with the Template

The fastest way to start is the agent-experiment-template — a pre-wired project with variant config, analysis scripts, and the improvement flywheel methodology built in:
# Clone the template
git clone https://github.com/markpollack/agent-experiment-template my-experiment
cd my-experiment

# Run the baseline variant
./mvnw compile exec:java -Dexec.args="--variant control"

# Run all variants and compare
./mvnw compile exec:java -Dexec.args="--run-all-variants"
The template includes ExperimentApp (CLI with --variant, --item, --run-all-variants), a pluggable AgentInvoker, cascaded jury, Markov analysis scripts, and GrowthStoryReporter for variant comparison. Customize three things: the agent invoker, domain judges, and knowledge files. If you want to wire the experiment loop yourself from scratch, follow the steps below.

What You’ll Build

An experiment that evaluates an AI agent against a dataset of coding tasks, scores the results with a jury of judges, and compares variants to test whether adding knowledge improves quality.

Prerequisites

  • Java 17+
  • Maven (the project includes ./mvnw)

Concepts

Dataset

A collection of items, each with a task description, “before” source state, and “reference” solution

AgentInvoker

Your agent — anything that takes a prompt + workspace and produces a result

Jury

One or more judges that score the agent’s output against the reference

AgentExperiment

Orchestrates: load items → invoke agent → evaluate → persist results
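
To make the four roles concrete, here is a minimal sketch of the loop that AgentExperiment is described as running: load items, invoke the agent, evaluate, persist. Every type in it (Item, ItemResult, the Function and BiPredicate stand-ins for the agent and jury) is hypothetical and exists only for illustration; the library's real interfaces are shown in the steps below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;

/** Illustration only: hypothetical stand-ins for Dataset, AgentInvoker, Jury, and the result store. */
public class ExperimentLoopSketch {

    /** A dataset item reduced to its id and task description. */
    public record Item(String id, String task) {}

    /** What gets persisted for each item. */
    public record ItemResult(String id, String output, boolean passed) {}

    public static List<ItemResult> run(List<Item> items,
                                       Function<Item, String> agent,
                                       BiPredicate<Item, String> jury) {
        List<ItemResult> persisted = new ArrayList<>();   // stands in for a result store
        for (Item item : items) {                         // load items
            String output = agent.apply(item);            // invoke agent
            boolean passed = jury.test(item, output);     // evaluate
            persisted.add(new ItemResult(item.id(), output, passed)); // persist
        }
        return persisted;
    }

    public static void main(String[] args) {
        List<Item> items = List.of(new Item("RENAME-001", "Rename the field 'name' to 'fullName'"));
        // A toy agent that echoes the task, and a toy jury that accepts any non-empty output.
        List<ItemResult> results = run(items, Item::task, (item, out) -> !out.isEmpty());
        results.forEach(r -> System.out.println(r.id() + " passed=" + r.passed()));
    }
}
```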

Step 1: Create a Dataset

A dataset is a directory with a manifest and per-item directories:
my-dataset/
├── dataset.json
└── items/
    └── RENAME-001/
        ├── item.json
        ├── before/
        │   └── src/main/java/com/example/Person.java
        └── reference/
            └── src/main/java/com/example/Person.java
dataset.json — the manifest:
{
  "schemaVersion": 1,
  "name": "rename-field",
  "version": "1.0.0",
  "description": "Field rename tasks",
  "items": [
    {
      "id": "RENAME-001",
      "slug": "simple-rename",
      "path": "items/RENAME-001",
      "bucket": "A",
      "taskType": "rename-field",
      "status": "active"
    }
  ]
}
item.json — per-item metadata:
{
  "schemaVersion": 1,
  "id": "RENAME-001",
  "slug": "simple-rename",
  "developerTask": "Rename the field 'name' to 'fullName' in Person.java and update all references",
  "taskType": "rename-field",
  "bucket": "A",
  "noChange": false,
  "knowledgeRefs": [],
  "tags": ["rename", "simple"],
  "status": "active"
}
The before/ directory is the starting state. The reference/ directory is the correct answer. The agent never sees the reference — it’s used by judges for comparison.
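
Before running anything, it can help to check that a dataset directory matches the layout above. The following is a minimal sketch using only java.nio; DatasetLayoutCheck is a hypothetical helper, not part of the library, and it checks only that the files and directories exist, not that the JSON is valid.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper, not part of the library: checks the directory layout described above. */
public class DatasetLayoutCheck {

    /** Returns human-readable problems; an empty list means the layout looks complete. */
    public static List<String> findProblems(Path datasetDir) throws IOException {
        List<String> problems = new ArrayList<>();
        if (!Files.isRegularFile(datasetDir.resolve("dataset.json"))) {
            problems.add("missing dataset.json");
        }
        Path items = datasetDir.resolve("items");
        if (!Files.isDirectory(items)) {
            problems.add("missing items/ directory");
            return problems;
        }
        // Each item directory needs its metadata plus the two source trees.
        try (Stream<Path> children = Files.list(items)) {
            for (Path item : children.filter(Files::isDirectory).sorted().toList()) {
                String id = item.getFileName().toString();
                if (!Files.isRegularFile(item.resolve("item.json"))) {
                    problems.add(id + ": missing item.json");
                }
                if (!Files.isDirectory(item.resolve("before"))) {
                    problems.add(id + ": missing before/");
                }
                if (!Files.isDirectory(item.resolve("reference"))) {
                    problems.add(id + ": missing reference/");
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Path.of(args.length > 0 ? args[0] : "my-dataset");
        List<String> problems = findProblems(dir);
        if (problems.isEmpty()) {
            System.out.println("Layout OK: " + dir);
        } else {
            problems.forEach(p -> System.out.println("PROBLEM: " + p));
        }
    }
}
```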

Step 2: Implement an AgentInvoker

The template ships with ready-made invokers that handle journal integration, knowledge injection, and cost tracking. Pick the one that matches your orchestration.

Single-step workflow, the simplest starting point. It wraps a ClaudeStep in a Workflow with automatic journal recording:
// WorkflowAgentInvoker works out of the box: journal wired, knowledge injected.
// Rename to {Domain}AgentInvoker and override hooks as needed.
Multi-step workflow, for pipelines with typed state flowing between steps:
public class MyWorkflow extends WorkflowInvoker<MyState> {

    @Override protected String workflowName() { return "my-experiment"; }

    @Override
    protected Workflow<Object, MyState> buildWorkflow(
            InvocationContext ctx, WorkflowExecutor executor) {
        return Workflow.<Object, MyState>define(workflowName())
                .withExecutor(executor)   // journal + cost tracking pre-wired
                .step(analyzeStep)
                .step(fixStep)
                .build();
    }

    @Override
    protected MyState buildInitialState(InvocationContext ctx) {
        return new MyState(ctx.workspacePath());
    }
}
See the API Reference for the full invoker hierarchy.

From scratch

If you need full control, AgentInvoker is a single-method interface:
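import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class MyAgent implements AgentInvoker {

    @Override
    public InvocationResult invoke(InvocationContext context) {
        // Your agent works in context.workspacePath()
        // using context.prompt() as the task description
        long startMillis = System.currentTimeMillis();

        ProcessBuilder pb = new ProcessBuilder(
            "my-agent", "--workspace", context.workspacePath().toString(),
                        "--prompt", context.prompt());
        pb.directory(context.workspacePath().toFile());
        pb.inheritIO();  // forward agent output so a full pipe buffer cannot stall it

        try {
            Process p = pb.start();
            boolean finished = p.waitFor(
                context.timeout().toSeconds(), TimeUnit.SECONDS);

            if (!finished) {
                p.destroyForcibly();
                return InvocationResult.timeout(
                    context.timeout().toMillis(),
                    context.metadata(), "Timed out");
            }
        } catch (IOException e) {
            throw new UncheckedIOException("Failed to start my-agent", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for my-agent", e);
        }

        // Elapsed time rather than a wall-clock timestamp; this argument is
        // assumed to be the duration in milliseconds, mirroring timeout(...) above.
        return InvocationResult.completed(
            List.of(), 0, 0, 0, 0.0,
            System.currentTimeMillis() - startMillis,
            null, context.metadata());
    }
}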
For Claude Code, use the built-in ClaudeSdkInvoker from the experiment-claude module.

Step 3: Wire a Jury

Start with a simple deterministic judge:
public class FileExistsJudge implements Judge, JudgeWithMetadata {
    private final String expectedFile;

    public FileExistsJudge(String expectedFile) {
        this.expectedFile = expectedFile;
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        boolean exists = Files.exists(
            context.workspacePath().resolve(expectedFile));

        return Judgment.builder()
            .score(new BooleanScore(exists))
            .status(exists ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
            .reasoning(exists ? "Found" : "Missing: " + expectedFile)
            .build();
    }

    @Override
    public JudgeMetadata metadata() {
        return new JudgeMetadata(
            "file_exists",
            "Checks that " + expectedFile + " exists",
            JudgeType.DETERMINISTIC);
    }
}

Jury jury = SimpleJury.builder()
    .judge(new FileExistsJudge("src/main/java/com/example/Person.java"), 1.0)
    .votingStrategy(new MajorityVotingStrategy())
    .build();
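
With a single judge the vote is trivial, but the weights start to matter once more judges are added. The sketch below shows one plausible reading of weighted majority voting: the jury passes when the passing judges hold strictly more than half of the total weight. This is an assumption for illustration, not a description of how MajorityVotingStrategy is implemented; the Vote record and the judge names "compiles" and "semantic_match" are hypothetical.

```java
import java.util.List;

/** Illustration only: hypothetical stand-ins, not the library's Jury classes. */
public class MajorityVoteSketch {

    /** One judge's verdict plus the weight it was registered with. */
    public record Vote(String judgeName, boolean pass, double weight) {}

    /** Passes when the passing judges hold strictly more than half of the total weight. */
    public static boolean weightedMajority(List<Vote> votes) {
        double total = 0.0;
        double passing = 0.0;
        for (Vote v : votes) {
            total += v.weight();
            if (v.pass()) {
                passing += v.weight();
            }
        }
        // An empty jury or an exact tie counts as a fail in this sketch.
        return total > 0.0 && passing > total / 2.0;
    }

    public static void main(String[] args) {
        List<Vote> votes = List.of(
            new Vote("file_exists", true, 1.0),
            new Vote("compiles", true, 1.0),
            new Vote("semantic_match", false, 1.0));
        System.out.println("Jury verdict: " + (weightedMajority(votes) ? "PASS" : "FAIL"));
    }
}
```

Treating ties as failures is a deliberate choice here: it keeps an evenly split jury from reporting success.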

Step 4: Run the Experiment

DatasetManager datasetManager = new FileSystemDatasetManager();
ResultStore resultStore = new FileSystemResultStore(Path.of("results"));

ExperimentConfig config = ExperimentConfig.builder()
    .experimentName("rename-field-v1")
    .datasetDir(Path.of("my-dataset"))
    .model("sonnet")
    .promptTemplate("{{task}}")
    .perItemTimeout(Duration.ofMinutes(2))
    .outputDir(Path.of("results"))
    .build();

AgentExperiment experiment = new AgentExperiment(
    datasetManager, jury, resultStore, config);

ExperimentResult result = experiment.run(new MyAgent());

System.out.printf("Pass rate: %.0f%% (%d/%d)%n",
    result.passRate() * 100,
    result.passCount(),
    result.items().size());
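
The promptTemplate value in the config uses a {{task}} placeholder. As a rough mental model, placeholder substitution can be sketched as below. PromptTemplateSketch is hypothetical, and the choice to leave unknown placeholders untouched is an assumption; the library's actual rendering rules may differ.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: a hypothetical renderer for {{name}} placeholders. */
public class PromptTemplateSketch {

    // Matches {{name}} where name is letters, digits, or underscores.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    /** Replaces each known placeholder with its value; unknown placeholders are left as-is. */
    public static String render(String template, Map<String, String> values) {
        return PLACEHOLDER.matcher(template).replaceAll(m -> {
            String value = values.get(m.group(1));
            // quoteReplacement keeps '$' and '\' in the value from being treated as group references.
            return Matcher.quoteReplacement(value != null ? value : m.group());
        });
    }

    public static void main(String[] args) {
        String prompt = render(
            "{{task}}",
            Map.of("task", "Rename the field 'name' to 'fullName' in Person.java"));
        System.out.println(prompt);
    }
}
```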

Step 5: Compare Variants

The real power is variant comparison — same dataset, different agent configurations:
// Reuses datasetManager, jury, and resultStore from Step 4
Path datasetDir = Path.of("my-dataset");

// Variant A: base agent
ExperimentConfig configA = ExperimentConfig.builder()
    .experimentName("rename-v1-base")
    .datasetDir(datasetDir)
    .model("sonnet")
    .promptTemplate("{{task}}")
    .perItemTimeout(Duration.ofMinutes(2))
    .build();
ExperimentResult resultA = new AgentExperiment(
    datasetManager, jury, resultStore, configA).run(baseAgent);

// Variant B: agent with knowledge base
ExperimentConfig configB = ExperimentConfig.builder()
    .experimentName("rename-v1-with-kb")
    .datasetDir(datasetDir)
    .model("sonnet")
    .promptTemplate("{{task}}\n\nRelevant knowledge:\n{{knowledgeRefs}}")
    .knowledgeBaseDir(Path.of("knowledge"))
    .perItemTimeout(Duration.ofMinutes(2))
    .build();
ExperimentResult resultB = new AgentExperiment(
    datasetManager, jury, resultStore, configB).run(kbAgent);
Same model. Same dataset. Does adding curated knowledge improve agent quality? That’s the thesis in action.
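
Once both variants have run, the comparison comes down to two questions: how did the pass rate move, and which items changed outcome? The sketch below works on plain maps of item id to pass/fail rather than on ExperimentResult, whose per-item accessors are not shown here. VariantComparisonSketch is hypothetical, and the ids and outcomes in main are invented sample inputs, not measured results.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: compares two variants given per-item pass/fail outcomes. */
public class VariantComparisonSketch {

    /** Fraction of items that passed; 0.0 for an empty result set. */
    public static double passRate(Map<String, Boolean> results) {
        if (results.isEmpty()) {
            return 0.0;
        }
        long passed = results.values().stream().filter(Boolean::booleanValue).count();
        return (double) passed / results.size();
    }

    /** Items present in both variants whose outcome changed, labeled "fixed" or "regressed". */
    public static Map<String, String> flips(Map<String, Boolean> base,
                                            Map<String, Boolean> candidate) {
        Map<String, String> flips = new TreeMap<>();
        for (Map.Entry<String, Boolean> e : base.entrySet()) {
            Boolean after = candidate.get(e.getKey());
            if (after == null || after.equals(e.getValue())) {
                continue;  // item missing from the candidate, or outcome unchanged
            }
            flips.put(e.getKey(), after ? "fixed" : "regressed");
        }
        return flips;
    }

    public static void main(String[] args) {
        // Invented sample outcomes, for illustration only.
        Map<String, Boolean> base = Map.of("RENAME-001", true, "RENAME-002", false);
        Map<String, Boolean> withKb = Map.of("RENAME-001", true, "RENAME-002", true);

        double delta = passRate(withKb) - passRate(base);
        System.out.printf("Pass rate delta: %+.0f points%n", delta * 100);
        flips(base, withKb).forEach((id, change) -> System.out.println(id + ": " + change));
    }
}
```

Looking at per-item flips alongside the aggregate delta matters because a variant can fix some items while regressing others, which a single pass rate would hide.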

What’s Next

Creating Experiments

Dataset design, variant ladders, and filter strategies

Building a Jury

Three-tier evaluation: deterministic, structural, and semantic