Status: Complete (Feb 2026)

Hypothesis

Progressive knowledge injection — giving agents increasingly structured domain knowledge — improves JUnit test-generation quality on Spring Boot projects more than model upgrades do.

Setup

| Parameter | Value |
| --- | --- |
| Target | Spring Boot projects (gs-rest-service, spring-petclinic) |
| Variants | 9 (baseline → full forge with SAE) |
| Evaluation | Four-tier jury (T0-T3) |
| Build tool | Maven |
| Agent engine | Agent Workflow |

The 9 Variants

| # | Variant | Knowledge Level |
| --- | --- | --- |
| 1 | Simple prompt | None |
| 2 | + System prompt | Minimal guidance |
| 3 | + Flat knowledge base | File-based domain knowledge |
| 4 | + Skills (SkillsJars) | Structured, agent-accessible knowledge |
| 5 | + Skills + SAE | Skills + Structured Agent Execution |
| 6 | + Hardened prompt | Defensive instructions |
| 7 | + Hardened + KB | Hardened + flat knowledge |
| 8 | + Hardened + Skills | Hardened + structured knowledge |
| 9 | + Forge (full stack) | Complete knowledge-directed execution |

Key Findings

  1. Two independent axes discovered — Knowledge injection and prompt hardening improve quality independently
  2. Model floor exists — PetClinic achieves 92-94% coverage across all variants (the model already knows PetClinic)
  3. SAE is most efficient — 70 expected steps, $2.84 per run
  4. Partial knowledge paradox — Some knowledge without structure can decrease performance
  5. First Markov fingerprints — Tool-call traces reveal distinct behavioral signatures per variant
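The "expected steps" figure in finding 3 can be derived from a variant's tool-call transition matrix by treating task completion as an absorbing state. A minimal sketch of that computation, using entirely hypothetical state names and transition probabilities (the real matrices live in the experiment repo):

```python
def expected_steps(P, absorbing):
    """Expected number of tool calls before absorption, solved by
    fixed-point iteration on E[s] = 1 + sum_t P[s][t] * E[t]."""
    E = {s: 0.0 for s in P}
    for _ in range(100_000):
        new = {s: 0.0 if s in absorbing
               else 1.0 + sum(p * E[t] for t, p in out.items())
               for s, out in P.items()}
        if max(abs(new[s] - E[s]) for s in P) < 1e-12:
            return new
        E = new
    return E

# Hypothetical transition probabilities for one variant's trace
P = {
    "EDIT":  {"BUILD": 1.0},
    "BUILD": {"TEST": 0.7, "EDIT": 0.3},  # 30% of builds fail back to editing
    "TEST":  {"DONE": 0.5, "EDIT": 0.5},  # 50% of test runs finish the task
    "DONE":  {},                          # absorbing: task complete
}
E = expected_steps(P, absorbing={"DONE"})
print(round(E["EDIT"], 2))  # prints 7.71
```

A variant that reduces the failure-loop probabilities (BUILD→EDIT, TEST→EDIT) shrinks this number, which is what the SAE efficiency claim is measuring.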

Markov Analysis

Agent behavior varies dramatically across variants even when final outcomes are similar. The Markov fingerprint analysis revealed:
  • JAR cluster patterns — How much time agents spend in dependency inspection
  • Thrashing loops — BUILD→TEST→EDIT cycles that indicate the agent is stuck
  • Loop amplification — Quantified via transition probability engineering (TPE)
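A fingerprint of this kind can be sketched as a first-order transition matrix estimated from a single tool-call trace, with thrashing measured as the probability mass of one full pass around the BUILD→TEST→EDIT cycle. The trace, state names, and helper functions below are illustrative assumptions, not the experiment's actual analysis code:

```python
from collections import Counter, defaultdict

def transition_matrix(trace):
    """First-order transition probabilities estimated from one tool-call trace."""
    counts = defaultdict(Counter)
    for a, b in zip(trace, trace[1:]):
        counts[a][b] += 1
    return {s: {t: n / sum(c.values()) for t, n in c.items()}
            for s, c in counts.items()}

def cycle_probability(P, cycle):
    """Probability of completing one full lap around a cycle of states."""
    prob = 1.0
    for a, b in zip(cycle, cycle[1:] + cycle[:1]):
        prob *= P.get(a, {}).get(b, 0.0)
    return prob

# Hypothetical trace for a variant that loops twice before finishing
trace = ["READ", "EDIT", "BUILD", "TEST", "EDIT", "BUILD", "TEST",
         "EDIT", "BUILD", "TEST", "DONE"]
P = transition_matrix(trace)
print(round(cycle_probability(P, ["BUILD", "TEST", "EDIT"]), 3))  # prints 0.667
```

Comparing this cycle probability across variants is one way to quantify the loop amplification the analysis describes: a stuck agent concentrates its transition mass inside the cycle, a well-guided one leaks it toward DONE.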

Resources

  • Experiment Repo — full traces, Markov analysis scripts, raw data
  • Blog: Agent Fingerprint — narrative walkthrough of the Markov findings