Active Experiments
Issue Classification
Infrastructure vs prompts on SWE-bench Lite: does tooling beat prompt engineering at 300-task scale?
Completed Experiments
Code Coverage v3
The exemplar effect: when existing tests use older patterns, skills can't override them. T2 = 0.667 across all 6 runs.
Code Coverage v2
Skills vs flat knowledge bases: 7 variants on Spring PetClinic. Skills+preanalysis takes 31% fewer steps with no quality loss.
Code Coverage v1
Knowledge injection baseline: 9 variants, two independent axes discovered.
Upcoming
| Experiment | Question | Status |
|---|---|---|
| Code Coverage v4 | Can a separate upgrade step fix the exemplar effect? | Planned |
| SWE-bench Results | Cross-experiment comparison on standardized tasks | Planned |
Experiment Design Principles
- One variable per experiment: Isolate the thing being tested
- Deterministic preprocessing: Parse inputs before the LLM sees them (zero LLM cost)
- Cascaded evaluation: T0 → T1 → T2 → T3, cheap filters first
- Behavioral analysis: Markov chains on tool-call traces, not just pass/fail
- Reproducibility: All experiment repos are public with full trace data
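The cascaded evaluation principle above can be illustrated with a minimal sketch. The tier names T0 to T3 come from the list; the checks attached to them here are hypothetical placeholders, not the criteria the experiments actually use. The only point shown is the ordering: each candidate runs through the cheapest check first and stops at the first failure, so expensive tiers are never paid for on candidates that fail early.

```python
from typing import Callable, List, Sequence, Tuple

# A tier is (name, check). Tiers are ordered from cheapest to most expensive.
Tier = Tuple[str, Callable[[str], bool]]


def compiles(src: str) -> bool:
    """Hypothetical cheap check: does the candidate parse as Python?"""
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def run_cascade(candidate: str, tiers: Sequence[Tier]) -> List[str]:
    """Return the names of the tiers passed, stopping at the first failure."""
    passed = []
    for name, check in tiers:
        if not check(candidate):
            break  # later (more expensive) tiers are skipped
        passed.append(name)
    return passed


# Placeholder checks, assumed for illustration only.
TIERS: List[Tier] = [
    ("T0", lambda src: bool(src.strip())),   # non-empty output
    ("T1", compiles),                        # syntactically valid
    ("T2", lambda src: "assert" in src),     # contains an assertion
    ("T3", lambda src: True),                # stand-in for an expensive check
]

print(run_cascade("x = 1", TIERS))
print(run_cascade("assert 1 == 1", TIERS))
```

A candidate that reaches only T1 never triggers the T2 or T3 checks, which is where the cost saving comes from when the last tier is something slow such as a full test run.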
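For the behavioral analysis principle, a first-order Markov chain over tool-call traces can be estimated by counting consecutive pairs of calls and normalizing each row. The sketch below assumes a trace is simply an ordered list of tool names; the tool names and traces are invented for illustration and are not taken from the experiment data.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def transition_probabilities(
    traces: Sequence[Sequence[str]],
) -> Dict[str, Dict[str, float]]:
    """Estimate P(next tool | current tool) from consecutive call pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for trace in traces:
        for current, nxt in zip(trace, trace[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }


# Hypothetical traces; real ones would come from recorded agent runs.
traces: List[List[str]] = [
    ["read_file", "edit_file", "run_tests"],
    ["read_file", "read_file", "edit_file", "run_tests"],
]

probs = transition_probabilities(traces)
for state, row in probs.items():
    print(state, row)
```

Comparing these transition tables between variants shows how behavior differs (for example, more repeated reads before an edit) even when two variants have the same pass rate.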