Status: Complete · April 2026

Hypothesis

When existing test files use older patterns, the agent will follow those patterns even when skills explicitly teach the newer ones. The codebase is a stronger signal than knowledge injection.

Setup

| Parameter | Value |
| --- | --- |
| Target | spring-petclinic (Boot 4.0.1) |
| Variants | 2 |
| N-count | 3 per variant (6 sessions total) |
| Evaluation | Three-tier jury (T0–T2) |
| Model | Claude Sonnet 4.6 |
| Baseline coverage | 64.9% (stripped from full suite) |
| Existing test files | 6 (using `mockMvc.perform()`, no `flush()`/`clear()`) |
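For context, here is the pattern split the setup refers to: the six existing files use the classic `mockMvc.perform()` chain, while the installed skills teach the newer AssertJ-based `MockMvcTester` style (available since Spring Framework 6.2). A minimal sketch of the contrast; the controller, endpoint, and view names are assumptions for illustration, not taken from the experiment's traces:

```java
import static org.assertj.core.api.Assertions.assertThat;
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.view;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.WebMvcTest;
import org.springframework.test.web.servlet.MockMvc;
import org.springframework.test.web.servlet.assertj.MockMvcTester;

@WebMvcTest(OwnerController.class) // controller name assumed
class OwnerControllerTests {

    @Autowired
    MockMvc mockMvc; // older entry point, as used by the 6 existing files

    @Autowired
    MockMvcTester mvc; // newer entry point, auto-configured by Boot 3.4+

    @Test
    void showOwner_olderStyle() throws Exception {
        // The pattern the exemplars demonstrate: perform() + andExpect() matchers
        mockMvc.perform(get("/owners/1"))
            .andExpect(status().isOk())
            .andExpect(view().name("owners/ownerDetails"));
    }

    @Test
    void showOwner_newerStyle() {
        // The pattern the skills teach: AssertJ-fluent assertions on the result
        assertThat(mvc.get().uri("/owners/1"))
            .hasStatusOk()
            .hasViewName("owners/ownerDetails");
    }
}
```

Both variants carried knowledge of the right-hand style; the question is which one the agent actually reproduces.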

The 2 Variants

| # | Name | What the agent has |
| --- | --- | --- |
| 1 | simple | Two-line prompt. No process guidance. |
| 2 | hardened-skills | Seven-step structured prompt. Explicit stopping condition. Read-existing-tests-first instruction. |
Both variants have the same Spring testing skills installed globally.

Results (N=3)

| Variant | N | Mean Cost | Mean Turns | T2 Quality | Final Coverage |
| --- | --- | --- | --- | --- | --- |
| simple | 3 | $3.60 | 47.0 | 0.667 | 95.2% |
| hardened-skills | 3 | $3.46 | 46.7 | 0.667 | 92.9% |

Per-Run Breakdown

| Run | Variant | Cost | Turns | Duration | Final Cov. | T1 | T2 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| n1 | simple | $4.07 | 42 | 20.3 min | 95.3% | 0.608 | 0.667 |
| n1 | hardened-skills | $3.99 | 52 | 16.9 min | 94.6% | 0.595 | 0.667 |
| n2 | simple | $3.72 | 52 | 18.0 min | 94.6% | 0.595 | 0.667 |
| n2 | hardened-skills | $2.62 | 37 | 12.5 min | 89.2% | 0.486 | 0.667 |
| n3 | simple | $3.00 | 47 | 13.0 min | 95.6% | 0.615 | 0.667 |
| n3 | hardened-skills | $3.77 | 51 | 16.4 min | 94.9% | 0.601 | 0.667 |

T2 Quality Breakdown

| T2 Criterion | Score | Passed? |
| --- | --- | --- |
| test_slice_selection | 1.00 | Yes |
| assertion_quality | 0.80 | Yes |
| error_and_edge_case_coverage | 0.80 | Yes |
| domain_specific_test_patterns | 0.30 | No |
| coverage_target_selection | 0.80 | Yes |
| version_aware_patterns | 0.30 | No |
| **Average (T2)** | **0.667** | |
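The `domain_specific_test_patterns` miss is the `flush()`/`clear()` discipline the existing files lack. A sketch of what that pattern looks like in a `@DataJpaTest` slice; the repository and entity names follow petclinic's conventions but are assumptions here, not verified signatures:

```java
import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.orm.jpa.DataJpaTest;
import org.springframework.boot.test.autoconfigure.orm.jpa.TestEntityManager;

@DataJpaTest
class OwnerRepositoryTests {

    @Autowired
    OwnerRepository owners; // repository and Owner entity assumed

    @Autowired
    TestEntityManager em;

    @Test
    void renamedOwnerSurvivesRoundTrip() {
        Owner owner = owners.findById(1).orElseThrow();
        owner.setLastName("Changed");
        owners.save(owner);

        em.flush(); // force the pending UPDATE so constraint violations surface here
        em.clear(); // evict the persistence context so the re-read hits the database

        assertThat(owners.findById(1).orElseThrow().getLastName())
            .isEqualTo("Changed");
    }
}
```

Without the `flush()`/`clear()` pair, the final assertion can pass against the first-level cache even when nothing was actually written, which is exactly the weakness the exemplar files propagate.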

Behavioral Analysis

Despite identical quality scores, the two variants navigate the codebase differently.

| Metric | simple | hardened-skills |
| --- | --- | --- |
| Orientation phase | 72% of calls | 65% of calls |
| First file read | PetClinicApplication.java | pom.xml |
| Read order | production code first | existing tests first |
| Redundant reads per run | 16.7 | 4.3 |
| Avg tool calls | 84 | 61 |
| Expected steps (Markov) | 260 | 164 |
*Figure: combined state diagram for both variants. Each arrow shows simple value → hardened-skills value. The EXPLORE self-loop drops from 0.87 to 0.70 (less re-reading), and the WRITE→BUILD probability more than doubles (0.22 → 0.47): the agent builds sooner.*
4x fewer redundant reads. 37% fewer expected steps. Same quality. The efficiency story is invisible if you only look at the T2 score.
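The "expected steps (Markov)" metric is the standard expected time to absorption, t = (I − Q)⁻¹·1, where Q is the transient-to-transient submatrix of the fitted transition probabilities. A self-contained sketch with a toy two-state chain (EXPLORE, WRITE, with an implicit absorbing DONE state); the matrices below are illustrative stand-ins, not the experiment's fitted values:

```java
// Expected steps to absorption in an absorbing Markov chain:
// solve (I - Q) t = 1, where Q holds transient-to-transient probabilities.
public class MarkovSteps {

    // Gauss-Jordan elimination with partial pivoting on the augmented system.
    public static double[] expectedSteps(double[][] q) {
        int n = q.length;
        double[][] a = new double[n][n + 1];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                a[i][j] = (i == j ? 1.0 : 0.0) - q[i][j];
            }
            a[i][n] = 1.0; // right-hand side: one step is taken from each state
        }
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            }
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            for (int r = 0; r < n; r++) {
                if (r == col) continue;
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= n; c++) a[r][c] -= f * a[col][c];
            }
        }
        double[] t = new double[n];
        for (int i = 0; i < n; i++) t[i] = a[i][n] / a[i][i];
        return t;
    }

    public static void main(String[] args) {
        // Rows: from EXPLORE, from WRITE; missing mass goes to the DONE state.
        double[][] loose = {{0.87, 0.10}, {0.30, 0.40}}; // EXPLORE self-loop 0.87
        double[][] tight = {{0.70, 0.20}, {0.30, 0.20}}; // EXPLORE self-loop 0.70
        double fromLoose = expectedSteps(loose)[0]; // ~14.6 steps from EXPLORE
        double fromTight = expectedSteps(tight)[0]; // ~5.6 steps from EXPLORE
        System.out.println("loose loop: " + Math.round(fromLoose * 10) / 10.0);
        System.out.println("tight loop: " + Math.round(fromTight * 10) / 10.0);
    }
}
```

Shrinking the EXPLORE self-loop is what drives the expected-steps drop: in this toy chain, cutting it from 0.87 to 0.70 alone more than halves the expected walk length.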

The Exemplar Effect: v2 vs v3

| | v2 (zero tests) | v3 (existing tests) | What changed |
| --- | --- | --- | --- |
| Coverage (simple) | 92–94% | 89–96% | Converges either way |
| T2/T3 quality | 0.783–0.878 | 0.667 | Exemplar patterns cap quality |
| version_aware | 0.70–1.00 | 0.30 | Existing files use `perform()` |
| domain_specific | 0.70–0.90 | 0.30 | Existing files lack `flush()`/`clear()` |

Key Findings

  1. The codebase is the agent’s primary teacher. Skills and prompts are secondary signals. If the existing code demonstrates older patterns, the agent will reproduce them — even when it has explicit knowledge of the better approach.
  2. Quality ceilings come from exemplars, not prompts. T2 = 0.667 across all 6 runs. Two variants, three runs each, identical quality score. The ceiling moved when the existing tests changed (v2 vs v3), not when the prompt changed.
  3. Efficiency gains are still real. 37% fewer expected steps, 4x fewer redundant reads, half as many reading-loop cycles. Prompt hardening and skills make the agent faster — they just can’t make it better when the codebase says otherwise.
  4. Fix the code, not the prompt. The highest-leverage intervention for agent quality is updating the exemplars the agent will see.

What Comes Next

v4: Fix the exemplar — but not by hand. A separate “Boot best-practices upgrade” step — skill-driven, focused, run before the test-writing agent starts. Fix the code the agent will imitate, then let it imitate. Prediction: T2 rises to ≥0.85.

Resources

- Experiment Repo: variant configs, analysis scripts, raw traces
- Blog, "When You Come to a Fork in the Code": narrative walkthrough of the exemplar effect
- Results Report (PDF): full quantitative results with figures and tables
- v2 Experiment: the previous experiment, skills vs knowledge bases with zero existing tests