
Design Philosophy

Every experiment tests a hypothesis about what makes agents better. The experiment driver makes the independent variables explicit:
| Variable | How you control it |
| --- | --- |
| Knowledge | `knowledgeRefs` in dataset items, `knowledgeBaseDir` in config |
| Prompt structure | `promptTemplate` with `{{task}}` and `{{knowledgeRefs}}` placeholders |
| Model | `model` field in config |
| Execution strategy | Your `AgentInvoker` implementation |
| Evaluation criteria | Your `Jury` wiring |

Variant Ladders

The most informative experiments use a progressive variant ladder — each variant adds one thing to the previous:
| Variant | Change from previous | Tests |
| --- | --- | --- |
| 1. Simple prompt | — (baseline) | Model’s raw capability |
| 2. + System prompt | Add domain instructions | Does framing help? |
| 3. + Knowledge base | Add `knowledgeRefs` | Does knowledge help? |
| 4. + Skills (SkillsJars) | Same content, structured packaging | Does structure help? |
| 5. + SAE | Add Structured Agent Execution | Does execution structure help? |
Each step isolates one variable. If variant 4 outperforms variant 3 with identical knowledge content, the structure is what matters — not just the knowledge.

Dataset Design

Item structure

Each item needs:
  • developerTask — what you’re asking the agent to do (natural language)
  • before/ — the starting state (real source code)
  • reference/ — the correct result (for judge comparison)
  • bucket — difficulty classification (A = easy, B = medium, C = hard)
  • knowledgeRefs — paths to relevant KB entries (relative to knowledgeBaseDir)
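The metadata for an item might look like this on disk (a sketch only; the exact file name and schema are assumptions based on the fields above, with `before/` and `reference/` as sibling directories):

```json
{
  "id": "RENAME-001",
  "developerTask": "Rename the field `userId` to `accountId` across the module.",
  "bucket": "A",
  "tags": ["rename", "simple"],
  "knowledgeRefs": ["refactoring/rename-field.md"]
}
```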

Buckets

Use buckets to stratify difficulty:
| Bucket | Typical characteristics |
| --- | --- |
| A | Single file, mechanical change, clear instructions |
| B | Multi-file, requires understanding, some ambiguity |
| C | Cross-cutting concern, requires domain knowledge, creative problem-solving |

Filtering

Run subsets of the dataset:
```java
// Run only bucket A items
ExperimentConfig.builder()
    .itemFilter(ItemFilter.bucket("A"))
    // ...

// Run items with specific tags
ExperimentConfig.builder()
    .itemFilter(ItemFilter.tags("rename", "simple"))
    // ...

// Run a single item by ID
ExperimentConfig.builder()
    .itemFilter(ItemFilter.id("RENAME-001"))
    // ...
```

ExperimentConfig Reference

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| `experimentName` | Yes | — | Experiment identifier |
| `datasetDir` | Yes | — | Path to dataset directory |
| `model` | Yes | — | LLM model (`sonnet`, `opus`, `haiku`, or full ID) |
| `promptTemplate` | Yes | — | Template with `{{task}}` and `{{knowledgeRefs}}` |
| `perItemTimeout` | Yes | — | Timeout per item invocation |
| `itemFilter` | No | all items | Filter by bucket, tags, ID, status |
| `knowledgeBaseDir` | No | — | KB root (for ablation tracking) |
| `outputDir` | No | — | Directory for workspaces and logs |
| `experimentTimeout` | No | — | Timeout for entire experiment |
| `metadata` | No | — | Arbitrary key-value pairs |
| `baselineId` | No | — | Reference experiment for comparison |
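Putting the required fields together, a full configuration might look like the sketch below. Treat it as illustrative: the paths, timeout values, and the use of `Path`/`Duration` argument types are assumptions, not the documented signatures.

```java
// Illustrative config — field names from the reference table above,
// argument types and values are assumed.
ExperimentConfig config = ExperimentConfig.builder()
        .experimentName("rename-field-v1")
        .datasetDir(Path.of("datasets/rename-field"))
        .model("sonnet")
        .promptTemplate("""
                {{knowledgeRefs}}

                Task: {{task}}""")
        .perItemTimeout(Duration.ofMinutes(10))
        .knowledgeBaseDir(Path.of("kb"))        // optional: enables ablation tracking
        .itemFilter(ItemFilter.bucket("A"))     // optional: run bucket A only
        .build();
```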

Result Structure

Results are persisted by FileSystemResultStore:
```
results/
└── rename-field-v1/
    ├── index.json        # Experiment history
    └── a1b2c3d4.json     # Individual experiment result
```
Each result contains:
  • Experiment metadata (name, config, git version, timestamps)
  • Per-item results (agent output, jury verdict, tokens, cost, duration)
  • Aggregate statistics (pass rate, total cost, total duration)
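Aggregate pass rates are most useful when broken down by bucket, since a variant can ace bucket A while failing bucket C. A standalone sketch of that computation (the `ItemResult` record here is illustrative, not the driver's result type):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class BucketStats {
    // Illustrative per-item result: bucket label plus pass/fail verdict.
    record ItemResult(String bucket, boolean passed) {}

    /** Pass rate per bucket, e.g. {A=1.0, B=0.5}. */
    static Map<String, Double> passRateByBucket(List<ItemResult> results) {
        return results.stream().collect(Collectors.groupingBy(
                ItemResult::bucket,
                TreeMap::new,
                Collectors.averagingDouble(r -> r.passed() ? 1.0 : 0.0)));
    }

    public static void main(String[] args) {
        List<ItemResult> results = List.of(
                new ItemResult("A", true),
                new ItemResult("A", true),
                new ItemResult("B", true),
                new ItemResult("B", false));
        System.out.println(passRateByBucket(results)); // {A=1.0, B=0.5}
    }
}
```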

Cross-Run Comparison

```java
ComparisonEngine comparison = new ComparisonEngine();
ComparisonResult diff = comparison.compare(resultA, resultB);
```
The comparison engine aligns items by ID across two experiments and reports per-item and aggregate deltas.
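The core of the alignment step can be sketched in isolation. This is hypothetical standalone code, not the engine's implementation; the real `ComparisonResult` carries richer per-item deltas than a single integer:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AlignById {
    /**
     * Align two runs' per-item verdicts by item ID and report the delta:
     * +1 = fixed in run B, -1 = regressed in run B, 0 = unchanged.
     * Items present in only one run are skipped.
     */
    static Map<String, Integer> deltas(Map<String, Boolean> runA,
                                       Map<String, Boolean> runB) {
        Map<String, Integer> out = new LinkedHashMap<>();
        for (Map.Entry<String, Boolean> e : runA.entrySet()) {
            Boolean b = runB.get(e.getKey());
            if (b == null) continue; // item missing in run B
            out.put(e.getKey(), (b ? 1 : 0) - (e.getValue() ? 1 : 0));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Boolean> a = Map.of("RENAME-001", false, "RENAME-002", true);
        Map<String, Boolean> b = Map.of("RENAME-001", true, "RENAME-002", true);
        // RENAME-001 was fixed in run B (+1); RENAME-002 is unchanged (0).
        System.out.println(deltas(a, b));
    }
}
```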

Building a Jury

Three-tier evaluation: deterministic, structural, semantic

API Reference

Full config, dataset format, invoker contract