
Design Philosophy

Every experiment tests a hypothesis about what makes agents better. The experiment driver makes the independent variables explicit:
| Variable | How you control it |
| --- | --- |
| Knowledge | `knowledgeRefs` in dataset items, `knowledgeBaseDir` in config |
| Prompt structure | `promptTemplate` with `{{task}}` and `{{knowledgeRefs}}` placeholders |
| Model | `model` field in config |
| Execution strategy | Your `AgentInvoker` implementation |
| Evaluation criteria | Your `Jury` wiring |

Variant Ladders

The most informative experiments use a progressive variant ladder — each variant adds one thing to the previous:
| Variant | Change from previous | Tests |
| --- | --- | --- |
| 1. Simple prompt | — (baseline) | Model’s raw capability |
| 2. + System prompt | Add domain instructions | Does framing help? |
| 3. + Knowledge base | Add `knowledgeRefs` | Does knowledge help? |
| 4. + Skills (SkillsJars) | Same content, structured packaging | Does structure help? |
| 5. + SAE | Add Structured Agent Execution | Does execution structure help? |
Each step isolates one variable. If variant 4 outperforms variant 3 with identical knowledge content, the structure is what matters — not just the knowledge.

Dataset Design

Item structure

Each item needs:
  • developerTask — what you’re asking the agent to do (natural language)
  • before/ — the starting state (real source code)
  • reference/ — the correct result (for judge comparison)
  • bucket — difficulty classification (A = easy, B = medium, C = hard)
  • knowledgeRefs — paths to relevant KB entries (relative to knowledgeBaseDir)
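The metadata for an item might look like this on disk (a sketch only; the exact file name and schema are assumptions based on the fields above, with `before/` and `reference/` as sibling directories):

```json
{
  "id": "RENAME-001",
  "developerTask": "Rename the field `userId` to `accountId` across the module.",
  "bucket": "A",
  "tags": ["rename", "simple"],
  "knowledgeRefs": ["refactoring/rename-field.md"]
}
```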

Buckets

Use buckets to stratify difficulty:
| Bucket | Typical characteristics |
| --- | --- |
| A | Single file, mechanical change, clear instructions |
| B | Multi-file, requires understanding, some ambiguity |
| C | Cross-cutting concern, requires domain knowledge, creative problem-solving |

Filtering

Run subsets of the dataset:
```java
// Run only bucket A items
ExperimentConfig.builder()
    .itemFilter(ItemFilter.bucket("A"))
    // ...

// Run items with specific tags
ExperimentConfig.builder()
    .itemFilter(ItemFilter.tags("rename", "simple"))
    // ...

// Run a single item by ID
ExperimentConfig.builder()
    .itemFilter(ItemFilter.id("RENAME-001"))
    // ...
```

ExperimentConfig Reference

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| `experimentName` | Yes | — | Experiment identifier |
| `datasetDir` | Yes | — | Path to dataset directory |
| `model` | Yes | — | LLM model (`sonnet`, `opus`, `haiku`, or full ID) |
| `promptTemplate` | Yes | — | Template with `{{task}}` and `{{knowledgeRefs}}` |
| `perItemTimeout` | Yes | — | Timeout per item invocation |
| `itemFilter` | No | all items | Filter by bucket, tags, ID, status |
| `knowledgeBaseDir` | No | — | KB root (for ablation tracking) |
| `outputDir` | No | — | Directory for workspaces and logs |
| `experimentTimeout` | No | — | Timeout for entire experiment |
| `metadata` | No | — | Arbitrary key-value pairs |
| `baselineId` | No | — | Reference experiment for comparison |
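Putting the required fields together, a full configuration might look like the sketch below. Treat it as illustrative: the paths, timeout values, and the use of `Path`/`Duration` argument types are assumptions, not the documented signatures.

```java
// Illustrative config — field names from the reference table above,
// argument types and values are assumed.
ExperimentConfig config = ExperimentConfig.builder()
        .experimentName("rename-field-v1")
        .datasetDir(Path.of("datasets/rename-field"))
        .model("sonnet")
        .promptTemplate("""
                {{knowledgeRefs}}

                Task: {{task}}""")
        .perItemTimeout(Duration.ofMinutes(10))
        .knowledgeBaseDir(Path.of("kb"))        // optional: enables ablation tracking
        .itemFilter(ItemFilter.bucket("A"))     // optional: run bucket A only
        .build();
```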

Result Structure

Results are persisted by FileSystemResultStore:
```
results/
└── rename-field-v1/
    ├── index.json        # Experiment history
    └── a1b2c3d4.json     # Individual experiment result
```
Each result contains:
  • Experiment metadata (name, config, git version, timestamps)
  • Per-item results (agent output, jury verdict, tokens, cost, duration)
  • Aggregate statistics (pass rate, total cost, total duration)
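Aggregate pass rates are most useful when broken down by bucket, since a variant can ace bucket A while failing bucket C. A standalone sketch of that computation (the `ItemResult` record here is illustrative, not the driver's result type):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class BucketStats {
    // Illustrative per-item result: bucket label plus pass/fail verdict.
    record ItemResult(String bucket, boolean passed) {}

    /** Pass rate per bucket, e.g. {A=1.0, B=0.5}. */
    static Map<String, Double> passRateByBucket(List<ItemResult> results) {
        return results.stream().collect(Collectors.groupingBy(
                ItemResult::bucket,
                TreeMap::new,
                Collectors.averagingDouble(r -> r.passed() ? 1.0 : 0.0)));
    }

    public static void main(String[] args) {
        List<ItemResult> results = List.of(
                new ItemResult("A", true),
                new ItemResult("A", true),
                new ItemResult("B", true),
                new ItemResult("B", false));
        System.out.println(passRateByBucket(results)); // {A=1.0, B=0.5}
    }
}
```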

Cross-Run Comparison

```java
ComparisonEngine comparison = new ComparisonEngine();
ComparisonResult diff = comparison.compare(resultA, resultB);
```
The comparison engine aligns items by ID across two experiments and reports per-item and aggregate deltas.
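The core of the alignment step can be sketched in isolation. This is hypothetical standalone code, not the engine's implementation; the real `ComparisonResult` carries richer per-item deltas than a single integer:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AlignById {
    /**
     * Align two runs' per-item verdicts by item ID and report the delta:
     * +1 = fixed in run B, -1 = regressed in run B, 0 = unchanged.
     * Items present in only one run are skipped.
     */
    static Map<String, Integer> deltas(Map<String, Boolean> runA,
                                       Map<String, Boolean> runB) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Boolean> e : runA.entrySet()) {
            Boolean b = runB.get(e.getKey());
            if (b == null) continue; // item missing in run B
            out.put(e.getKey(), (b ? 1 : 0) - (e.getValue() ? 1 : 0));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Boolean> a = Map.of("RENAME-001", false, "RENAME-002", true);
        Map<String, Boolean> b = Map.of("RENAME-001", true, "RENAME-002", true);
        // RENAME-001 was fixed in run B (+1); RENAME-002 is unchanged (0).
        System.out.println(deltas(a, b));
    }
}
```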

Building a Jury

Three-tier evaluation: deterministic, structural, semantic

API Reference

Full config, dataset format, invoker contract