IN PROGRESS · Mar 2026
Hypothesis
Infrastructure-level optimization — knowledge bases, deterministic preprocessing, tool configuration, judge feedback loops — outperforms prompt-level optimization on SWE-bench Lite tasks.
Setup
| Parameter | Value |
|---|---|
| Target | SWE-bench Lite (300 tasks) |
| Variants | 5-variant ladder |
| Control | Arize ruleset_0.txt (20 rules, test_accuracy=0.40) |
| Key innovation | +pre-analysis — deterministic preprocessing (parse _pytest imports → route KB, zero LLM cost); sketched after this table |
| Evaluation | Four-tier jury adapted for SWE-bench |
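The pre-analysis step is what makes variant 4 cheap: `_pytest` imports can be detected with plain text parsing, so KB routing needs no model call. A minimal sketch in Java (the harness language suggested by `ClaudeSdkInvoker`); the class name, regex, and `kb/python-testing.md` path are illustrative assumptions, not the repo's actual code:

```java
// Deterministic pre-analysis: detect _pytest imports in a Python source file
// and route to a testing knowledge base with zero LLM calls.
// Class name, regex, and KB path are illustrative assumptions.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Pattern;

final class PreAnalysis {
    // Matches "import _pytest..." or "from _pytest... import ..." at line start.
    private static final Pattern PYTEST_IMPORT = Pattern.compile(
        "^\\s*(?:from\\s+_pytest[\\w.]*\\s+import|import\\s+_pytest)\\b",
        Pattern.MULTILINE);

    /** Returns the KB file to attach to the prompt, if the heuristic fires. */
    static Optional<Path> routeKnowledgeBase(Path pythonSource) throws IOException {
        String source = Files.readString(pythonSource);
        return PYTEST_IMPORT.matcher(source).find()
            ? Optional.of(Path.of("kb/python-testing.md")) // hypothetical KB location
            : Optional.empty();
    }
}
```

A production version would likely walk the task's changed files and might use a real Python parser instead of a regex; either way the routing decision stays deterministic, which is the point of the zero-LLM-cost claim.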
The Variant Ladder
| # | Variant | Approach |
|---|---|---|
| 1 | Baseline | No knowledge, standard prompt |
| 2 | + Prompt optimization | Better system prompt, few-shot examples |
| 3 | + Knowledge base | Flat file domain knowledge for Python testing |
| 4 | + Pre-analysis | Deterministic import parsing → KB routing |
| 5 | + Full infrastructure | Pre-analysis + skills + judge feedback loop (loop sketched after this table) |
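Variant 5's judge feedback loop closes the gap between generation and evaluation: a rejected patch comes back with a critique that seeds the next attempt. A hedged sketch, where `AgentClient`, `Judge`, the `Verdict` record, and the 0.8 acceptance threshold are all assumptions for illustration, not the repo's API:

```java
// Judge feedback loop from variant 5: generate a patch, score it, and feed
// the judge's critique back into the next attempt. All types and the 0.8
// acceptance threshold are assumptions for illustration.
record Verdict(double score, String critique) {}

interface AgentClient { String attemptPatch(String task, String feedback); }

interface Judge { Verdict evaluate(String task, String patch); }

final class FeedbackLoop {
    static String run(AgentClient agent, Judge judge, String task, int maxRounds) {
        String feedback = "";   // empty on the first attempt
        String patch = "";
        for (int round = 0; round < maxRounds; round++) {
            patch = agent.attemptPatch(task, feedback);
            Verdict verdict = judge.evaluate(task, patch);
            if (verdict.score() >= 0.8) {
                return patch;               // judge accepts; stop early
            }
            feedback = verdict.critique();  // retry with the judge's critique
        }
        return patch;                       // best effort after maxRounds
    }
}
```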
Current Status
Stage 5 complete (125 tests). Stage 7 next: fix the SmokeTest package-private bug, wire up ClaudeSdkInvoker, and run the 5-variant ladder.
What This Proves
If the infrastructure variant significantly outperforms the prompt variant on SWE-bench — a well-studied benchmark with known baselines — it provides strong evidence for the knowledge-directed execution thesis.
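With 300 paired tasks, "significantly outperforms" can be checked with McNemar's test on the discordant pairs (tasks where exactly one variant succeeds). A sketch, assuming this framing rather than any analysis plan stated in the repo:

```java
// Paired comparison of two variants on the same 300 tasks: McNemar's test
// on discordant pairs, with continuity correction. This analysis framing is
// an assumption, not a plan stated in the experiment repo.
final class PairedComparison {
    /**
     * @param b tasks solved by the infrastructure variant only
     * @param c tasks solved by the prompt variant only
     * @return chi-square statistic; above 3.84 implies p < 0.05 at 1 df
     */
    static double mcnemarChiSquare(int b, int c) {
        double diff = Math.abs(b - c) - 1.0; // continuity correction
        return diff * diff / (b + c);
    }

    public static void main(String[] args) {
        // Hypothetical counts: infra-only = 34 tasks, prompt-only = 15 tasks.
        System.out.printf("chi-square = %.2f%n", mcnemarChiSquare(34, 15)); // ~6.61
    }
}
```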
Resources
Experiment Repo
Full experiment code, variant configs, SWE-bench task selection