Status: COMPLETE · Mar 2026
Hypothesis
Structured skills (SkillsJars) outperform flat knowledge injection — not because they contain more knowledge, but because structure itself changes agent behavior. Pre-analysis (a mandatory exploration pass before writing code) further reduces wasted steps by front-loading understanding.
Setup
| Parameter | Value |
|---|---|
| Target | spring-petclinic |
| Variants | 7 |
| N-count | 3 per variant (20 sessions total) |
| Evaluation | Four-tier jury (T0-T3) |
| Agent engine | Agent Workflow |
The 7 Variants
| # | Name | What it adds |
|---|---|---|
| 1 | simple | Minimal prompt, no knowledge, no stopping condition |
| 2 | hardened | Structured prompt + explicit stopping condition |
| 3 | hardened+kb | Hardened + flat knowledge base (Spring test imports, JaCoCo config) |
| 4 | hardened+skills | Hardened + SkillsJars (structured, modular knowledge packages) |
| 5 | hardened+preanalysis | Hardened + mandatory pre-analysis pass before writing tests |
| 6 | hardened+skills+preanalysis | Skills + pre-analysis together |
| 7 | hardened+skills+preanalysis+plan-act | Two-phase: deep exploration then sustained action |
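The variants compose from independent toggles layered on top of the base prompt. A minimal sketch of that composition, assuming a flag-per-feature config (the field names and `Variant` type are illustrative, not the experiment's actual schema):

```python
# Hypothetical sketch: the 7 variants as combinations of independent toggles.
# Flag names are illustrative assumptions, not the experiment's config schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    hardened: bool = False     # structured prompt + explicit stopping condition
    kb: bool = False           # flat knowledge base (imports, JaCoCo config)
    skills: bool = False       # SkillsJars: structured, modular knowledge
    preanalysis: bool = False  # mandatory exploration pass before writing tests
    plan_act: bool = False     # two-phase: deep exploration, then sustained action

VARIANTS = {
    "simple": Variant(),
    "hardened": Variant(hardened=True),
    "hardened+kb": Variant(hardened=True, kb=True),
    "hardened+skills": Variant(hardened=True, skills=True),
    "hardened+preanalysis": Variant(hardened=True, preanalysis=True),
    "hardened+skills+preanalysis": Variant(
        hardened=True, skills=True, preanalysis=True
    ),
    "hardened+skills+preanalysis+plan-act": Variant(
        hardened=True, skills=True, preanalysis=True, plan_act=True
    ),
}
```

Framing the variants this way makes the ablation structure explicit: each row of the results table differs from a neighbor by exactly one toggle.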
Results (N=3)
| Variant | Mean Steps | Mean Cost | T3 Quality |
|---|---|---|---|
| hardened+skills+preanalysis | 75.0 | $3.39 | 0.850 |
| hardened+preanalysis | 80.3 | $3.41 | 0.789 |
| hardened+kb | 83.3 | $3.21 | 0.847 |
| hardened+skills+preanalysis+plan-act | 95.0 | $5.11 | 0.878 |
| hardened+skills | 101.7 | $3.70 | 0.856 |
| hardened | 103.0 | $4.08 | 0.850 |
| simple | 109.5 | $3.47 | 0.783 |
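The step-reduction percentages quoted in the findings below can be recomputed directly from the Mean Steps column. A small sketch (numbers are from the table above; the choice of baseline per comparison is an inference from matching the quoted figures):

```python
# Recompute step-reduction percentages from the results table above.
# Mean-step values are copied from the table; baselines are inferred.
mean_steps = {
    "simple": 109.5,
    "hardened": 103.0,
    "hardened+kb": 83.3,
    "hardened+preanalysis": 80.3,
    "hardened+skills+preanalysis": 75.0,
}

def reduction(variant: str, baseline: str) -> float:
    """Percent step reduction of `variant` relative to `baseline`."""
    return 100 * (1 - mean_steps[variant] / mean_steps[baseline])

print(f"{reduction('hardened', 'simple'):.1f}")                     # 5.9
print(f"{reduction('hardened+preanalysis', 'hardened'):.1f}")       # 22.0
print(f"{reduction('hardened+skills+preanalysis', 'simple'):.1f}")  # 31.5
```

Note that the quoted figures mix baselines: -22% is relative to hardened, while -24% and -31% are relative to simple.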
Key Findings
- Prompt hardening is the biggest quality driver — simple → hardened: +0.067 quality, -6% steps. A free gain from structural discipline alone.
- Pre-analysis drives efficiency at a quality cost — hardened+preanalysis: -22% steps (vs. hardened) but quality drops to 0.789. Tight plans miss edge cases.
- Skills fix pre-analysis’s quality regression — hardened+skills+preanalysis: -31% steps (vs. simple) AND quality = 0.850 (matches hardened). Best tradeoff in the experiment.
- KB is a pure efficiency play on known codebases — hardened+kb: -24% steps (vs. simple), quality flat. On novel codebases the effect should be larger.
- Plan-act is high variance — highest quality ceiling (0.878) but also highest cost ($5.11) and rework spiral risk.
- Markov model predicts step counts — zero mean bias in leave-one-out cross-validation despite formal rejection of first-order assumption.
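The step-count prediction can be sketched as fitting a first-order Markov chain over step types and reading off the expected time to termination from the fundamental matrix, solving (I - Q)t = 1 for the transient block Q. The state names and traces below are toy assumptions, not the experiment's data or its actual model code:

```python
# Hedged sketch: first-order Markov step-count prediction.
# Fit transition probabilities from traces, then compute expected
# steps to the absorbing "done" state via (I - Q) t = 1.
import numpy as np

STATES = ["explore", "write", "verify"]  # transient states (hypothetical)
END = "done"                             # absorbing terminal state

def fit_and_predict(traces, start="explore"):
    idx = {s: i for i, s in enumerate(STATES)}
    counts = np.zeros((len(STATES), len(STATES) + 1))  # last column = END
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[idx[a], idx[b] if b != END else -1] += 1
    P = counts / counts.sum(axis=1, keepdims=True)  # row-normalize
    Q = P[:, :-1]                                   # transient-to-transient block
    t = np.linalg.solve(np.eye(len(STATES)) - Q, np.ones(len(STATES)))
    return t[idx[start]]                            # expected steps from `start`

traces = [
    ["explore", "write", "verify", "done"],
    ["explore", "explore", "write", "verify", "write", "verify", "done"],
]
print(round(fit_and_predict(traces), 2))  # 4.5 for these toy traces
```

Leave-one-out cross-validation would fit on all sessions but one and compare the held-out session's true step count against this expectation; zero mean bias means those errors average out even when the first-order assumption fails a formal test.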
What Comes Next
The PetClinic model floor (agents already know PetClinic well) limits how much knowledge injection can differentiate. The follow-up experiment targets Apache Fineract — a novel, complex codebase where the agent has no prior knowledge — to test whether skills make a larger difference on unfamiliar territory.
Resources
- Dataset & Traces (v2.0.0): full dataset download, variant configs, raw traces
- Blog: I Read My Agent's Diary: narrative walkthrough of the behavioral analysis
- Results Report (PDF): full quantitative results with figures and tables
- Reading Agent Behavior (PDF): how to read agent behavioral traces, a casual explainer
- v1 Baseline Experiment: the first experiment establishing the methodology