COMPLETE · Feb 2026

Hypothesis

Progressive knowledge injection (giving agents increasingly structured domain knowledge) improves JUnit test generation quality on Spring Boot projects more than model upgrades.

Setup

| Parameter | Value |
| --- | --- |
| Target | Spring Boot projects (gs-rest-service, spring-petclinic) |
| Variants | 9 (baseline → full forge with SAE) |
| Evaluation | Four-tier jury (T0-T3) |
| Build tool | Maven |
| Agent engine | Agent Workflow |

The 9 Variants

| # | Variant | Knowledge Level |
| --- | --- | --- |
| 1 | Simple prompt | None |
| 2 | + System prompt | Minimal guidance |
| 3 | + Flat knowledge base | File-based domain knowledge |
| 4 | + Skills (SkillsJars) | Structured, agent-accessible knowledge |
| 5 | + Skills + SAE | Skills + Structured Agent Execution |
| 6 | + Hardened prompt | Defensive instructions |
| 7 | + Hardened + KB | Hardened + flat knowledge |
| 8 | + Hardened + Skills | Hardened + structured knowledge |
| 9 | + Forge (full stack) | Complete knowledge-directed execution |

Key Findings

  1. Two independent axes discovered: knowledge injection and prompt hardening improve quality independently
  2. Model floor exists: PetClinic achieves 92-94% coverage across all variants (the model already knows PetClinic)
  3. SAE is most efficient: 70 expected steps, $2.84 per run
  4. Partial knowledge paradox: some knowledge without structure can decrease performance
  5. First Markov fingerprints: tool-call traces reveal distinct behavioral signatures per variant
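
This page does not say how the "expected steps" figure in finding 3 was computed. One plausible reading, offered here only as an assumption, is the expected number of transitions before a Markov chain over tool-call states reaches a terminal state. The sketch below illustrates that calculation on a made-up three-state chain; the state names, the probabilities, and the `expected_steps` helper are all hypothetical and are not taken from the experiment repo.

```python
def expected_steps(probs, start, terminal, max_sweeps=10_000, tol=1e-9):
    """Expected number of transitions to reach `terminal` from `start`.

    Iterates t(s) = 1 + sum_j P(s, j) * t(j) with t(terminal) = 0.
    Assumes every non-terminal state has outgoing transitions and can
    eventually reach `terminal`; otherwise the result is not meaningful.
    """
    states = set(probs) | {dst for row in probs.values() for dst in row}
    t = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in states:
            if s == terminal:
                continue
            new = 1.0 + sum(p * t[dst] for dst, p in probs.get(s, {}).items())
            delta = max(delta, abs(new - t[s]))
            t[s] = new
        if delta < tol:
            break
    return t[start]


# Hypothetical chain: after each TEST the agent finishes half the time
# and otherwise goes back to EDIT.
toy_chain = {
    "EDIT": {"BUILD": 1.0},
    "BUILD": {"TEST": 1.0},
    "TEST": {"EDIT": 0.5, "DONE": 0.5},
}
steps_from_edit = expected_steps(toy_chain, start="EDIT", terminal="DONE")
```

For this toy chain the equations solve by hand to 6 expected steps from EDIT, which gives a quick check on the iteration. Under this reading, a variant with a lower expected step count spends less time looping before it terminates.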

Markov Analysis

Agent behavior varies dramatically across variants even when final outcomes are similar. The Markov fingerprint analysis revealed:
  • JAR cluster patterns: how much time agents spend in dependency inspection
  • Thrashing loops: BUILD→TEST→EDIT cycles that indicate the agent is stuck
  • Loop amplification: quantified via transition probability engineering (TPE)
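
To make the idea of a fingerprint concrete, here is a minimal sketch of how first-order transition probabilities and BUILD→TEST→EDIT cycle counts could be derived from a single tool-call trace. It is not the experiment's actual analysis script: the trace, the state labels other than BUILD, TEST, and EDIT, and both helper functions are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(trace):
    """Estimate first-order transition probabilities from one trace."""
    counts = defaultdict(Counter)
    for src, dst in zip(trace, trace[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


def count_cycles(trace, cycle=("BUILD", "TEST", "EDIT")):
    """Count non-overlapping consecutive occurrences of `cycle`."""
    size, i, hits = len(cycle), 0, 0
    while i + size <= len(trace):
        if tuple(trace[i:i + size]) == cycle:
            hits += 1
            i += size  # skip past the matched cycle
        else:
            i += 1
    return hits


# Hypothetical trace: some dependency inspection (JAR), then two
# BUILD -> TEST -> EDIT cycles before the run finishes.
trace = [
    "READ", "JAR", "JAR", "EDIT",
    "BUILD", "TEST", "EDIT",
    "BUILD", "TEST", "EDIT",
    "BUILD", "TEST", "DONE",
]
probs = transition_probabilities(trace)
loops = count_cycles(trace)
```

In this toy trace, the probability of returning to EDIT after TEST is 2/3 and the cycle count is 2. A high TEST→EDIT probability combined with many cycles is the kind of signal the thrashing-loop bullet describes, and the JAR→JAR self-transition gives a rough measure of time spent in dependency inspection.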

Resources

Experiment Repo

Full traces, Markov analysis scripts, raw data

Blog: Agent Fingerprint

Narrative walkthrough of the Markov findings