IN PROGRESS · Mar 2026
Hypothesis
Infrastructure-level optimization — knowledge bases, deterministic preprocessing, tool configuration, judge feedback loops — outperforms prompt-level optimization on SWE-bench Lite tasks.
Setup
| Parameter | Value |
|---|---|
| Target | SWE-bench Lite (300 tasks) |
| Variants | 5-variant ladder |
| Control | Arize ruleset_0.txt (20 rules, test_accuracy=0.40) |
| Key innovation | +pre-analysis — deterministic preprocessing (parse _pytest imports → route KB, zero LLM cost); sketched after this table |
| Evaluation | Four-tier jury adapted for SWE-bench |
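The pre-analysis step is what makes variant 4 cheap: `_pytest` imports can be detected with plain text parsing, so KB routing needs no model call. A minimal sketch in Java (the harness language suggested by `ClaudeSdkInvoker`); the class name, regex, and `kb/python-testing.md` path are illustrative assumptions, not the repo's actual code:

```java
// Deterministic pre-analysis: detect _pytest imports in a Python source file
// and route to a testing knowledge base with zero LLM calls.
// Class name, regex, and KB path are illustrative assumptions.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Pattern;

final class PreAnalysis {
    // Matches "import _pytest..." or "from _pytest... import ..." at line start.
    private static final Pattern PYTEST_IMPORT = Pattern.compile(
        "^\\s*(?:from\\s+_pytest[\\w.]*\\s+import|import\\s+_pytest)\\b",
        Pattern.MULTILINE);

    /** Returns the KB file to attach to the prompt, if the heuristic fires. */
    static Optional<Path> routeKnowledgeBase(Path pythonSource) throws IOException {
        String source = Files.readString(pythonSource);
        return PYTEST_IMPORT.matcher(source).find()
            ? Optional.of(Path.of("kb/python-testing.md")) // hypothetical KB location
            : Optional.empty();
    }
}
```

A production version would likely walk the task's changed files and might use a real Python parser instead of a regex; either way the routing decision stays deterministic, which is the point of the zero-LLM-cost claim.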
The Variant Ladder
| # | Variant | Approach |
|---|---|---|
| 1 | Baseline | No knowledge, standard prompt |
| 2 | + Prompt optimization | Better system prompt, few-shot examples |
| 3 | + Knowledge base | Flat file domain knowledge for Python testing |
| 4 | + Pre-analysis | Deterministic import parsing → KB routing |
| 5 | + Full infrastructure | Pre-analysis + skills + judge feedback loop (loop sketched after this table) |
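Variant 5's judge feedback loop closes the gap between generation and evaluation: a rejected patch comes back with a critique that seeds the next attempt. A hedged sketch, where `AgentClient`, `Judge`, the `Verdict` record, and the 0.8 acceptance threshold are all assumptions for illustration, not the repo's API:

```java
// Judge feedback loop from variant 5: generate a patch, score it, and feed
// the judge's critique back into the next attempt. All types and the 0.8
// acceptance threshold are assumptions for illustration.
record Verdict(double score, String critique) {}

interface AgentClient { String attemptPatch(String task, String feedback); }

interface Judge { Verdict evaluate(String task, String patch); }

final class FeedbackLoop {
    static String run(AgentClient agent, Judge judge, String task, int maxRounds) {
        String feedback = "";   // empty on the first attempt
        String patch = "";
        for (int round = 0; round < maxRounds; round++) {
            patch = agent.attemptPatch(task, feedback);
            Verdict verdict = judge.evaluate(task, patch);
            if (verdict.score() >= 0.8) {
                return patch;               // judge accepts; stop early
            }
            feedback = verdict.critique();  // retry with the judge's critique
        }
        return patch;                       // best effort after maxRounds
    }
}
```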
Current Status
Stage 5 complete (125 tests). Stage 7 next: fix the SmokeTest package-private bug, wire up ClaudeSdkInvoker, and run the 5-variant ladder.
What This Proves
If the infrastructure variant significantly outperforms the prompt variant on SWE-bench — a well-studied benchmark with known baselines — it provides strong evidence for the knowledge-directed execution thesis.
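With 300 paired tasks, "significantly outperforms" can be checked with McNemar's test on the discordant pairs (tasks where exactly one variant succeeds). A sketch, assuming this framing rather than any analysis plan stated in the repo:

```java
// Paired comparison of two variants on the same 300 tasks: McNemar's test
// on discordant pairs, with continuity correction. This analysis framing is
// an assumption, not a plan stated in the experiment repo.
final class PairedComparison {
    /**
     * @param b tasks solved by the infrastructure variant only
     * @param c tasks solved by the prompt variant only
     * @return chi-square statistic; above 3.84 implies p < 0.05 at 1 df
     */
    static double mcnemarChiSquare(int b, int c) {
        double diff = Math.abs(b - c) - 1.0; // continuity correction
        return diff * diff / (b + c);
    }

    public static void main(String[] args) {
        // Hypothetical counts: infra-only = 34 tasks, prompt-only = 15 tasks.
        System.out.printf("chi-square = %.2f%n", mcnemarChiSquare(34, 15)); // ~6.61
    }
}
```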
Resources
Experiment Repo
Full experiment code, variant configs, SWE-bench task selection