Status: Complete (Feb 2026)
## Hypothesis
Progressive knowledge injection (giving agents increasingly structured domain knowledge) improves JUnit test generation quality on Spring Boot projects more than model upgrades do.

## Setup
| Parameter | Value |
|---|---|
| Target | Spring Boot projects (gs-rest-service, spring-petclinic) |
| Variants | 9 (baseline → full forge with SAE) |
| Evaluation | Four-tier jury (T0-T3) |
| Build tool | Maven |
| Agent engine | Agent Workflow |
## The 9 Variants
| # | Variant | Knowledge Level |
|---|---|---|
| 1 | Simple prompt | None |
| 2 | + System prompt | Minimal guidance |
| 3 | + Flat knowledge base | File-based domain knowledge |
| 4 | + Skills (SkillsJars) | Structured, agent-accessible knowledge |
| 5 | + Skills + SAE | Skills + Structured Agent Execution |
| 6 | + Hardened prompt | Defensive instructions |
| 7 | + Hardened + KB | Hardened + flat knowledge |
| 8 | + Hardened + Skills | Hardened + structured knowledge |
| 9 | + Forge (full stack) | Complete knowledge-directed execution |
## Key Findings
- Two independent axes discovered — Knowledge injection and prompt hardening improve quality independently
- Model floor exists — PetClinic achieves 92-94% coverage across all variants (the model already knows PetClinic)
- SAE is most efficient — 70 expected steps, $2.84 per run
- Partial knowledge paradox — injecting some domain knowledge without structure can decrease performance relative to the baseline
- First Markov fingerprints — Tool-call traces reveal distinct behavioral signatures per variant
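A Markov fingerprint of the kind described above can be sketched by estimating first-order transition probabilities between tool calls. This is a minimal illustration, not the experiment's actual analysis code; the tool names and trace format are hypothetical.

```python
from collections import Counter, defaultdict


def markov_fingerprint(trace):
    """Estimate first-order transition probabilities from a tool-call trace.

    Returns a nested dict: fp[src][dst] = P(next tool is dst | current tool is src).
    """
    counts = defaultdict(Counter)
    for src, dst in zip(trace, trace[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


# Hypothetical trace of one agent run.
trace = ["READ", "EDIT", "BUILD", "TEST", "EDIT", "BUILD", "TEST"]
fp = markov_fingerprint(trace)
print(fp["EDIT"])  # → {'BUILD': 1.0}: every EDIT in this trace is followed by BUILD
```

Comparing these transition matrices across variants (e.g. by a matrix distance) is one way the per-variant behavioral signatures could be made quantitative.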
## Markov Analysis
Agent behavior varies dramatically across variants even when final outcomes are similar. The Markov fingerprint analysis revealed:
- JAR cluster patterns — how much time agents spend in dependency inspection
- Thrashing loops — BUILD→TEST→EDIT cycles that indicate the agent is stuck
- Loop amplification — Quantified via transition probability engineering (TPE)
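The thrashing loops above can be detected directly in a trace by counting repetitions of the cycle. A minimal sketch, assuming the same hypothetical tool names as before; this is not the TPE machinery itself, only the cycle counter it would feed on.

```python
def thrashing_score(trace, cycle=("BUILD", "TEST", "EDIT")):
    """Count non-overlapping occurrences of a tool-call cycle in a trace.

    Many consecutive BUILD -> TEST -> EDIT cycles suggest the agent is stuck
    repeatedly patching tests that still fail to build or pass.
    """
    n, hits, i = len(cycle), 0, 0
    while i + n <= len(trace):
        if tuple(trace[i:i + n]) == cycle:
            hits += 1
            i += n  # skip past the matched cycle
        else:
            i += 1
    return hits


# Hypothetical stuck run: four full cycles, then one last BUILD/TEST.
stuck = ["BUILD", "TEST", "EDIT"] * 4 + ["BUILD", "TEST"]
print(thrashing_score(stuck))  # → 4
```

A threshold on this score (or on the corresponding loop probability in the transition matrix) gives a simple stuck-agent signal per run.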
## Resources
- Experiment repo — full traces, Markov analysis scripts, raw data
- Blog: Agent Fingerprint — narrative walkthrough of the Markov findings