Every experiment in this lab follows the same pattern: define variants, control for one variable, measure with the four-tier jury, and analyze behavioral traces. We don’t just measure whether agents succeed; we measure how they behave on the way to success or failure, using Markov chain analysis of tool-call traces.
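As a rough illustration of what Markov chain analysis of tool-call traces can look like, the sketch below estimates first-order transition probabilities between consecutive tool calls. The function name `transition_probabilities` and the tool names (`read_file`, `edit_file`, `run_tests`) are hypothetical placeholders, not the lab's actual trace schema or implementation.

```python
from collections import Counter, defaultdict


def transition_probabilities(traces):
    """Estimate first-order Markov transition probabilities.

    `traces` is a list of tool-call sequences, each a list of tool names.
    Returns a nested dict: probs[current_tool][next_tool] -> probability.
    """
    counts = defaultdict(Counter)
    for trace in traces:
        # Count each adjacent (current, next) pair of tool calls.
        for current_tool, next_tool in zip(trace, trace[1:]):
            counts[current_tool][next_tool] += 1

    # Normalize each row of counts so outgoing probabilities sum to 1.
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# Hypothetical traces from two agent runs (illustrative data only).
traces = [
    ["read_file", "edit_file", "run_tests"],
    ["read_file", "read_file", "edit_file", "run_tests"],
]
probs = transition_probabilities(traces)
```

Comparing such transition tables between variants can show behavioral differences, such as how often an agent re-reads a file before editing, that a pass/fail score alone would hide.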

Active Experiments

Issue Classification

Infrastructure vs prompts on SWE-bench Lite: does tooling beat prompt engineering at 300-task scale?

Completed Experiments

Code Coverage v3

The exemplar effect: when existing tests use older patterns, skills can’t override them. T2 = 0.667 across all 6 runs.

Code Coverage v2

Skills vs flat knowledge bases: 7 variants on Spring PetClinic. Skills+preanalysis cuts steps by 31% with no quality loss.

Code Coverage v1

Knowledge injection baseline: 9 variants, two independent axes discovered.

Upcoming

| Experiment | Question | Status |
| --- | --- | --- |
| Code Coverage v4 | Fix the exemplar with a separate upgrade step | Planned |
| SWE-bench Results | Cross-experiment comparison on standardized tasks | Planned |

Experiment Design Principles

  1. One variable per experiment: isolate the thing being tested
  2. Deterministic preprocessing: parse inputs before the LLM sees them (zero LLM cost)
  3. Cascaded evaluation: T0→T1→T2→T3, cheap filters first
  4. Behavioral analysis: Markov chains on tool-call traces, not just pass/fail
  5. Reproducibility: all experiment repos are public with full trace data