Status: Complete (Mar 2026)

Hypothesis

Structured skills (SkillsJars) outperform flat knowledge injection — not because they contain more knowledge, but because structure itself changes agent behavior. Pre-analysis (a mandatory exploration pass before writing code) further reduces wasted steps by front-loading understanding.

Setup

| Parameter | Value |
|---|---|
| Target | spring-petclinic |
| Variants | 7 |
| N-count | 3 per variant (20 sessions total) |
| Evaluation | Four-tier jury (T0-T3) |
| Agent engine | Agent Workflow |

The 7 Variants

| # | Name | What it adds |
|---|---|---|
| 1 | simple | Minimal prompt, no knowledge, no stopping condition |
| 2 | hardened | Structured prompt + explicit stopping condition |
| 3 | hardened+kb | Hardened + flat knowledge base (Spring test imports, JaCoCo config) |
| 4 | hardened+skills | Hardened + SkillsJars (structured, modular knowledge packages) |
| 5 | hardened+preanalysis | Hardened + mandatory pre-analysis pass before writing tests |
| 6 | hardened+skills+preanalysis | Skills + pre-analysis together |
| 7 | hardened+skills+preanalysis+plan-act | Two-phase: deep exploration then sustained action |
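The variants are additive layers on a common base. A minimal sketch of that composition; the feature names below are illustrative, not the actual config keys (the real variant configs ship with the dataset linked under Resources):

```python
# Illustrative sketch of how the seven variants layer features on each other.
# Feature names are invented for clarity and do not match the real configs.
HARDENED = ["structured_prompt", "explicit_stopping_condition"]

VARIANTS = {
    "simple":                               [],
    "hardened":                             HARDENED,
    "hardened+kb":                          HARDENED + ["flat_knowledge_base"],
    "hardened+skills":                      HARDENED + ["skills_jars"],
    "hardened+preanalysis":                 HARDENED + ["preanalysis_pass"],
    "hardened+skills+preanalysis":          HARDENED + ["skills_jars",
                                                        "preanalysis_pass"],
    "hardened+skills+preanalysis+plan-act": HARDENED + ["skills_jars",
                                                        "preanalysis_pass",
                                                        "plan_act_phases"],
}

for name, features in VARIANTS.items():
    print(f"{name}: {', '.join(features) or '(baseline)'}")
```

This layering is what makes the pairwise comparisons in the results meaningful: each variant differs from a neighbor by exactly one feature.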

Results (N=3)

| Variant | Mean Steps | Mean Cost | T3 Quality |
|---|---|---|---|
| hardened+skills+preanalysis | 75.0 | $3.39 | 0.850 |
| hardened+preanalysis | 80.3 | $3.41 | 0.789 |
| hardened+kb | 83.3 | $3.21 | 0.847 |
| hardened+skills+preanalysis+plan-act | 95.0 | $5.11 | 0.878 |
| hardened+skills | 101.7 | $3.70 | 0.856 |
| hardened | 103.0 | $4.08 | 0.850 |
| simple | 109.5 | $3.47 | 0.783 |
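The step-reduction percentages cited in the findings can be recomputed directly from this table. A minimal sketch, with values transcribed from the table above (note the findings compare against two different baselines, hardened and simple):

```python
# (mean_steps, mean_cost_usd, t3_quality) transcribed from the results table.
results = {
    "simple":                               (109.5, 3.47, 0.783),
    "hardened":                             (103.0, 4.08, 0.850),
    "hardened+kb":                          ( 83.3, 3.21, 0.847),
    "hardened+skills":                      (101.7, 3.70, 0.856),
    "hardened+preanalysis":                 ( 80.3, 3.41, 0.789),
    "hardened+skills+preanalysis":          ( 75.0, 3.39, 0.850),
    "hardened+skills+preanalysis+plan-act": ( 95.0, 5.11, 0.878),
}

def step_delta(variant: str, baseline: str) -> float:
    """Percent change in mean steps of `variant` relative to `baseline`."""
    v, b = results[variant][0], results[baseline][0]
    return round(100 * (v - b) / b, 1)

print(step_delta("hardened", "simple"))                     # -5.9
print(step_delta("hardened+preanalysis", "hardened"))       # -22.0
print(step_delta("hardened+skills+preanalysis", "simple"))  # -31.5
```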

Key Findings

  1. Prompt hardening is the biggest quality driver. simple → hardened gains +0.067 quality with 6% fewer steps, a free win from structural discipline alone.
  2. Pre-analysis buys efficiency at a quality cost. hardened+preanalysis cuts steps 22% relative to hardened, but quality drops to 0.789; tight plans miss edge cases.
  3. Skills fix pre-analysis's quality regression. hardened+skills+preanalysis cuts steps 31% relative to simple while holding quality at 0.850, matching hardened: the best tradeoff in the experiment.
  4. KB is a pure efficiency play on known codebases. hardened+kb cuts steps 24% relative to simple with flat quality; on a novel codebase the effect should be larger.
  5. Plan-act is high variance. It has the highest quality ceiling (0.878) but also the highest cost ($5.11) and a risk of rework spirals.
  6. The Markov model predicts step counts. Leave-one-out cross-validation shows zero mean bias, even though the first-order assumption is formally rejected.
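The step-count prediction in finding 6 rests on standard absorbing-Markov-chain machinery: estimate a transition matrix over agent action states from traces, then read the expected session length off the fundamental matrix. A minimal sketch; the state names and transition counts below are invented for illustration, not taken from the experiment's traces:

```python
# Hypothetical sketch: expected session length from a first-order Markov
# model over agent action states, via the fundamental matrix N = (I - Q)^-1.
import numpy as np

STATES = ["explore", "edit", "test", "done"]  # "done" is absorbing

def expected_steps(counts: np.ndarray, start: int = 0) -> float:
    """Expected steps to absorption from `start`.

    counts[i, j] = observed transitions from transient state i to state j;
    rows cover the transient states, the last column is the absorbing state.
    """
    P = counts / counts.sum(axis=1, keepdims=True)  # row-normalize to probabilities
    Q = P[:, :-1]                                   # transitions among transient states
    N = np.linalg.inv(np.eye(Q.shape[0]) - Q)       # fundamental matrix
    return float(N[start].sum())                    # row sum = expected visits = steps

# Toy counts (rows: explore, edit, test; cols: explore, edit, test, done).
counts = np.array([
    [10, 20,  5,  1],
    [ 2, 15, 20,  3],
    [ 5, 10,  8, 12],
], dtype=float)

print(f"expected steps from 'explore': {expected_steps(counts):.1f}")
```

Leave-one-out cross-validation then fits `counts` on all sessions but one and compares the predicted expected steps against the held-out session's actual length.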

What Comes Next

The PetClinic model floor (agents already know PetClinic well) limits how much knowledge injection can differentiate. The follow-up experiment targets Apache Fineract — a novel, complex codebase where the agent has no prior knowledge — to test whether skills make a larger difference on unfamiliar territory.

Resources

Dataset & Traces (v2.0.0)

Full dataset download, variant configs, raw traces

Blog: I Read My Agent's Diary

Narrative walkthrough of the behavioral analysis

Results Report (PDF)

Full quantitative results with figures and tables

Reading Agent Behavior (PDF)

How to read agent behavioral traces — a casual explainer

v1 Baseline Experiment

The first experiment establishing the methodology