What You’ll Build

An experiment that evaluates an AI agent against a dataset of coding tasks, scores the results with a jury of judges, and compares variants to test whether adding knowledge improves quality.

Prerequisites

  • Java 17+
  • Maven (the project includes ./mvnw)

Concepts

Dataset

A collection of items, each with a task description, “before” source state, and “reference” solution

AgentInvoker

Your agent — anything that takes a prompt + workspace and produces a result

Jury

One or more judges that score the agent’s output against the reference

ExperimentRunner

Orchestrates: load items → invoke agent → evaluate → persist results

Step 1: Create a Dataset

A dataset is a directory with a manifest and per-item directories:
my-dataset/
├── dataset.json
└── items/
    └── RENAME-001/
        ├── item.json
        ├── before/
        │   └── src/main/java/com/example/Person.java
        └── reference/
            └── src/main/java/com/example/Person.java
dataset.json — the manifest:
{
  "schemaVersion": 1,
  "name": "rename-field",
  "version": "1.0.0",
  "description": "Field rename tasks",
  "items": [
    {
      "id": "RENAME-001",
      "slug": "simple-rename",
      "path": "items/RENAME-001",
      "bucket": "A",
      "taskType": "rename-field",
      "status": "active"
    }
  ]
}
item.json — per-item metadata:
{
  "schemaVersion": 1,
  "id": "RENAME-001",
  "slug": "simple-rename",
  "developerTask": "Rename the field 'name' to 'fullName' in Person.java and update all references",
  "taskType": "rename-field",
  "bucket": "A",
  "noChange": false,
  "knowledgeRefs": [],
  "tags": ["rename", "simple"],
  "status": "active"
}
The before/ directory is the starting state. The reference/ directory is the correct answer. The agent never sees the reference — it’s used by judges for comparison.
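If you want to bootstrap that layout programmatically rather than by hand, a minimal sketch using only `java.nio.file` (the `ScaffoldItem` class and `scaffold` helper are illustrative names, not part of the framework):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ScaffoldItem {
    // Creates the per-item layout shown above:
    // items/<id>/item.json, items/<id>/before/, items/<id>/reference/
    static void scaffold(Path datasetRoot, String id, String itemJson) throws IOException {
        Path item = datasetRoot.resolve("items").resolve(id);
        Files.createDirectories(item.resolve("before"));
        Files.createDirectories(item.resolve("reference"));
        Files.writeString(item.resolve("item.json"), itemJson);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("my-dataset");
        scaffold(root, "RENAME-001", "{\"schemaVersion\": 1, \"id\": \"RENAME-001\"}");
        System.out.println(Files.isDirectory(root.resolve("items/RENAME-001/before")));
    }
}
```

Remember to copy the same source tree into both `before/` and `reference/`, then apply the intended change only under `reference/`.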

Step 2: Implement an AgentInvoker

AgentInvoker is a single-method interface:
public class MyAgent implements AgentInvoker {

    @Override
    public InvocationResult invoke(InvocationContext context) {
        // Your agent works in context.workspacePath()
        // using context.prompt() as the task description

        ProcessBuilder pb = new ProcessBuilder(
            "my-agent", "--workspace", context.workspacePath().toString(),
                        "--prompt", context.prompt());
        pb.directory(context.workspacePath().toFile());
        pb.inheritIO(); // consume output so the child can't block on a full pipe buffer

        try {
            Process p = pb.start();
            boolean finished = p.waitFor(
                context.timeout().toSeconds(), TimeUnit.SECONDS);

            if (!finished) {
                p.destroyForcibly();
                return InvocationResult.timeout(
                    context.timeout().toMillis(),
                    context.metadata(), "Timed out");
            }

            return InvocationResult.completed(
                List.of(), 0, 0, 0, 0.0,
                System.currentTimeMillis(),
                null, context.metadata());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new RuntimeException(e);
        }
    }
}
For Claude Code, use the built-in ClaudeSdkInvoker from the experiment-claude module.

Step 3: Wire a Jury

Start with a simple deterministic judge:
public class FileExistsJudge implements Judge, JudgeWithMetadata {
    private final String expectedFile;

    public FileExistsJudge(String expectedFile) {
        this.expectedFile = expectedFile;
    }

    @Override
    public Judgment judge(JudgmentContext context) {
        boolean exists = Files.exists(
            context.workspacePath().resolve(expectedFile));

        return Judgment.builder()
            .score(new BooleanScore(exists))
            .status(exists ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
            .reasoning(exists ? "Found" : "Missing: " + expectedFile)
            .build();
    }

    @Override
    public JudgeMetadata metadata() {
        return new JudgeMetadata(
            "file_exists",
            "Checks that " + expectedFile + " exists",
            JudgeType.DETERMINISTIC);
    }
}

Jury jury = SimpleJury.builder()
    .judge(new FileExistsJudge("src/main/java/com/example/Person.java"), 1.0)
    .votingStrategy(new MajorityVotingStrategy())
    .build();
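`MajorityVotingStrategy` is supplied by the framework; as a rough mental model of what a weighted majority vote does (the `Vote` record and `majority` method below are illustrative, not the framework's API), it passes when the weighted mass of PASS judgments exceeds half the total weight:

```java
import java.util.List;

public class MajorityVote {
    // One judge's verdict plus its weight in the jury.
    record Vote(boolean pass, double weight) {}

    // PASS wins only with a strict majority of the total weight; an exact tie fails.
    static boolean majority(List<Vote> votes) {
        double total = votes.stream().mapToDouble(Vote::weight).sum();
        double passMass = votes.stream()
            .filter(Vote::pass)
            .mapToDouble(Vote::weight)
            .sum();
        return passMass > total / 2;
    }
}
```

With a single judge at weight 1.0, as in the jury above, the vote reduces to that judge's verdict.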

Step 4: Run the Experiment

DatasetManager datasetManager = new FileSystemDatasetManager();
ResultStore resultStore = new FileSystemResultStore(Path.of("results"));

ExperimentConfig config = ExperimentConfig.builder()
    .experimentName("rename-field-v1")
    .datasetDir(Path.of("my-dataset"))
    .model("sonnet")
    .promptTemplate("{{task}}")
    .perItemTimeout(Duration.ofMinutes(2))
    .outputDir(Path.of("results"))
    .build();

ExperimentRunner runner = new ExperimentRunner(
    datasetManager, jury, resultStore, config);

ExperimentResult result = runner.run(new MyAgent());

System.out.printf("Pass rate: %.0f%% (%d/%d)%n",
    result.passRate() * 100,
    result.passCount(),
    result.items().size());
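The `{{task}}` placeholder in `promptTemplate` is presumably filled with each item's `developerTask` before the agent is invoked. A minimal sketch of that substitution (illustrative only, not the framework's actual renderer):

```java
import java.util.Map;

public class PromptTemplate {
    // Replaces each {{key}} placeholder with its value from the map.
    // Unknown placeholders are left untouched.
    static String render(String template, Map<String, String> vars) {
        String out = template;
        for (var e : vars.entrySet()) {
            out = out.replace("{{" + e.getKey() + "}}", e.getValue());
        }
        return out;
    }
}
```

So with the default template, the agent sees exactly the `developerTask` text from `item.json`.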

Step 5: Compare Variants

The real power is variant comparison — same dataset, different agent configurations:
// Variant A: base agent
ExperimentConfig configA = ExperimentConfig.builder()
    .experimentName("rename-v1-base")
    .datasetDir(datasetDir)
    .model("sonnet")
    .promptTemplate("{{task}}")
    .perItemTimeout(Duration.ofMinutes(2))
    .build();
ExperimentRunner runnerA = new ExperimentRunner(
    datasetManager, jury, resultStore, configA);
ExperimentResult resultA = runnerA.run(baseAgent);

// Variant B: agent with knowledge base
ExperimentConfig configB = ExperimentConfig.builder()
    .experimentName("rename-v1-with-kb")
    .datasetDir(datasetDir)
    .model("sonnet")
    .promptTemplate("{{task}}\n\nRelevant knowledge:\n{{knowledgeRefs}}")
    .knowledgeBaseDir(Path.of("knowledge"))
    .perItemTimeout(Duration.ofMinutes(2))
    .build();
ExperimentRunner runnerB = new ExperimentRunner(
    datasetManager, jury, resultStore, configB);
ExperimentResult resultB = runnerB.run(kbAgent);
Same model. Same dataset. Does adding curated knowledge improve agent quality? That’s the thesis in action.
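Reading the comparison comes down to a pass-rate delta between the two runs. A standalone sketch (the `passRate` helper mirrors what `ExperimentResult.passRate()` reports, but both method names below are illustrative):

```java
import java.util.List;

public class Compare {
    // Fraction of items that passed; 0.0 for an empty run.
    static double passRate(List<Boolean> outcomes) {
        if (outcomes.isEmpty()) return 0.0;
        long passes = outcomes.stream().filter(b -> b).count();
        return (double) passes / outcomes.size();
    }

    // Percentage-point delta: positive means variant B did better.
    static double deltaPoints(double rateA, double rateB) {
        return (rateB - rateA) * 100.0;
    }
}
```

On small datasets a few flipped items can swing the delta substantially, so treat single-run differences as a signal to investigate, not a conclusion.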

What’s Next

Creating Experiments

Dataset design, variant ladders, and filter strategies

Building a Jury

Three-tier evaluation: deterministic, structural, and semantic