In progress · Mar 2026

Hypothesis

Infrastructure-level optimization — knowledge bases, deterministic preprocessing, tool configuration, judge feedback loops — outperforms prompt-level optimization on SWE-bench Lite tasks.

Setup

| Parameter | Value |
| --- | --- |
| Target | SWE-bench Lite (300 tasks) |
| Variants | 5-variant ladder |
| Control | Arize ruleset_0.txt (20 rules, test_accuracy=0.40) |
| Key innovation | +Pre-analysis: deterministic preprocessing (parse `_pytest` imports → route KB, zero LLM cost) |
| Evaluation | Four-tier jury adapted for SWE-bench |
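The pre-analysis step can be sketched as a deterministic router: parse a task's imports with `ast` and map module prefixes to knowledge-base files, so routing costs zero LLM calls. This is a minimal sketch; the `KB_ROUTES` table, file names, and `route_kb` helper are illustrative assumptions, not the experiment's actual code.

```python
import ast

# Hypothetical routing table: imported-module prefix -> KB file.
# The real mapping lives in the experiment repo.
KB_ROUTES = {
    "_pytest": "kb/pytest_internals.md",
    "django": "kb/django_testing.md",
    "numpy": "kb/numpy_arrays.md",
}

def route_kb(source: str) -> list[str]:
    """Deterministically pick KB files from a task's imports (no LLM cost)."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    hits = set()
    for module in modules:
        for prefix, kb in KB_ROUTES.items():
            if module == prefix or module.startswith(prefix + "."):
                hits.add(kb)
    return sorted(hits)

print(route_kb("from _pytest.fixtures import fixture\nimport numpy as np"))
# → ['kb/numpy_arrays.md', 'kb/pytest_internals.md']
```

Because the routing is pure string matching on parsed imports, the same task always gets the same KB files, which keeps the variant comparison clean.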

The Variant Ladder

| # | Variant | Approach |
| --- | --- | --- |
| 1 | Baseline | No knowledge, standard prompt |
| 2 | + Prompt optimization | Better system prompt, few-shot examples |
| 3 | + Knowledge base | Flat-file domain knowledge for Python testing |
| 4 | + Pre-analysis | Deterministic import parsing → KB routing |
| 5 | + Full infrastructure | Pre-analysis + skills + judge feedback loop |
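The ladder is cumulative: each variant adds one capability on top of the previous one, and every variant runs over the same task set so scores are directly comparable. A minimal sketch of that structure, with hypothetical config field names (the real configs live in the experiment repo):

```python
# Hypothetical cumulative variant configs; field names are illustrative.
LADDER = [
    {"name": "baseline",       "few_shot": False, "kb": False, "pre_analysis": False, "judge_loop": False},
    {"name": "prompt_opt",     "few_shot": True,  "kb": False, "pre_analysis": False, "judge_loop": False},
    {"name": "knowledge_base", "few_shot": True,  "kb": True,  "pre_analysis": False, "judge_loop": False},
    {"name": "pre_analysis",   "few_shot": True,  "kb": True,  "pre_analysis": True,  "judge_loop": False},
    {"name": "full_infra",     "few_shot": True,  "kb": True,  "pre_analysis": True,  "judge_loop": True},
]

def run_ladder(tasks, run_variant):
    """Run every variant over the same tasks so results are comparable."""
    return {cfg["name"]: [run_variant(cfg, t) for t in tasks] for cfg in LADDER}
```

Holding the task set fixed across all five variants means any score difference between rung 2 and rung 5 is attributable to infrastructure, not task sampling.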

Current Status

Stage 5 complete (125 tests). Stage 7 is next: fix the SmokeTest package-private bug, wire up ClaudeSdkInvoker, and run the 5-variant ladder.

What This Proves

If the infrastructure variant significantly outperforms the prompt variant on SWE-bench, a well-studied benchmark with known baselines, that result is strong evidence for the knowledge-directed execution thesis.

Resources

Experiment Repo

Full experiment code, variant configs, SWE-bench task selection