Status: COMPLETE, Mar 2026

Hypothesis

Structured skills (SkillsJars) outperform flat knowledge injection — not because they contain more knowledge, but because structure itself changes agent behavior. Pre-analysis (a mandatory exploration pass before writing code) further reduces wasted steps by front-loading understanding.

Setup

| Parameter | Value |
| --- | --- |
| Target | spring-petclinic (Boot 4.0.1) |
| Variants | 7 |
| N-count | 3 per variant (20 sessions total) |
| Evaluation | Four-tier jury (T0-T3) |
| Model | Claude Sonnet 4 |
| Agent engine | Agent Workflow |
| Starting coverage | 0% (all tests deleted) |

The 7 Variants

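Each variant adds one intervention to an earlier variant, though not always to the one directly before it: variants 3, 4, and 5 each extend hardened, variant 6 combines skills with pre-analysis, and variant 7 adds a two-phase plan-act split to variant 6. Comparing a variant with its parent isolates the effect of that intervention.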
| # | Name | What it adds |
| --- | --- | --- |
| 1 | simple | Minimal prompt, no knowledge, no stopping condition |
| 2 | hardened | Structured prompt + explicit stopping condition |
| 3 | hardened+kb | Hardened + flat knowledge base (Spring test imports, JaCoCo config) |
| 4 | hardened+skills | Hardened + SkillsJars (structured, modular knowledge packages) |
| 5 | hardened+preanalysis | Hardened + mandatory pre-analysis pass before writing tests |
| 6 | hardened+skills+preanalysis | Skills + pre-analysis together |
| 7 | hardened+skills+preanalysis+plan-act | Two-phase: deep exploration then sustained action |

Results (N=3)

| Variant | Mean Steps | Mean Cost | T3 Quality |
| --- | --- | --- | --- |
| hardened+skills+preanalysis | 75.0 | $3.39 | 0.850 |
| hardened+preanalysis | 80.3 | $3.41 | 0.789 |
| hardened+kb | 83.3 | $3.21 | 0.847 |
| hardened+skills+preanalysis+plan-act | 95.0 | $5.11 | 0.878 |
| hardened+skills | 101.7 | $3.70 | 0.856 |
| hardened | 103.0 | $4.08 | 0.850 |
| simple | 109.5 | $3.47 | 0.783 |
Figure: Mean cost and step count per variant. Skills+preanalysis (variant 6) achieves the lowest step count without inflating cost. Plan-act (variant 7) pays $5.11 for a marginal quality gain.
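The percentage step changes quoted in the findings below can be rechecked from the table. The following is a minimal sketch, not the authors' analysis code: the mean step counts are copied from the results table, and the helper name `step_delta_pct` is made up for illustration. It prints each variant's change against both `simple` and `hardened`, since the two baselines give noticeably different percentages.

```python
# Mean steps per variant, copied from the results table above.
MEAN_STEPS = {
    "simple": 109.5,
    "hardened": 103.0,
    "hardened+kb": 83.3,
    "hardened+skills": 101.7,
    "hardened+preanalysis": 80.3,
    "hardened+skills+preanalysis": 75.0,
    "hardened+skills+preanalysis+plan-act": 95.0,
}


def step_delta_pct(variant: str, baseline: str) -> float:
    """Relative change in mean steps of `variant` versus `baseline`, in percent."""
    return 100.0 * (MEAN_STEPS[variant] / MEAN_STEPS[baseline] - 1.0)


for name in MEAN_STEPS:
    print(
        f"{name:38s} "
        f"vs simple {step_delta_pct(name, 'simple'):+6.1f}%   "
        f"vs hardened {step_delta_pct(name, 'hardened'):+6.1f}%"
    )
```

Because the table reports rounded means, results can differ from percentages computed on raw per-session data by a fraction of a point.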

Key Findings

1. Prompt hardening is the biggest quality driver

simple → hardened: +0.067 quality, -6% steps. A free gain from structural discipline alone — no knowledge injection, just telling the agent when to stop and how to structure its work.

2. Pre-analysis drives efficiency at a quality cost

hardened+preanalysis: -22% steps relative to hardened, but quality drops from 0.850 to 0.789. The agent follows its pre-analysis plan too rigidly, missing edge cases it would have discovered through exploration. Same attention budget, worse allocation.

3. Skills fix pre-analysis’s quality regression

hardened+skills+preanalysis: -31% steps relative to simple (-27% relative to hardened) and quality = 0.850, matching hardened. Skills give the agent the right vocabulary for each step, so it doesn’t waste attention discovering patterns. Best tradeoff in the experiment.

4. KB is a pure efficiency play on known codebases

hardened+kb: -24% steps relative to simple (-19% relative to hardened), quality flat (0.847 vs 0.850 for hardened). The knowledge base eliminates JAR inspection cycles (the agent no longer needs to discover Spring Boot 4 import changes). On novel codebases the effect should be larger.

5. Plan-act is high variance

Highest quality ceiling (0.878) but also highest cost ($5.11) and rework spiral risk. The two-phase approach (deep exploration then sustained writing) occasionally gets stuck in fix loops.

6. Markov model predicts step counts

Zero mean bias in leave-one-out cross-validation despite formal rejection of the first-order assumption. The model is wrong in theory but useful in practice.
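As a rough illustration of what a leave-one-out check of this kind involves, the sketch below fits a first-order chain on all sessions but one, predicts the held-out session's step count as the expected number of steps to absorption, and averages the signed errors. Everything here is an assumption for illustration: the `END` absorbing state, the function names, the fixed-point solver, and the toy sessions are not taken from the experiment's code or data.

```python
from collections import Counter, defaultdict

END = "END"  # hypothetical absorbing state appended to every session


def fit(sessions):
    """Estimate first-order transition probabilities from state sequences."""
    counts = defaultdict(Counter)
    for seq in sessions:
        path = list(seq) + [END]
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


def expected_steps(P, start, iters=10_000, tol=1e-10):
    """Expected steps until END, solving t = 1 + Q t by fixed-point iteration."""
    t = {s: 0.0 for s in P}
    for _ in range(iters):
        new = {
            s: 1.0 + sum(p * t.get(d, 0.0) for d, p in row.items() if d != END)
            for s, row in P.items()
        }
        if max(abs(new[s] - t[s]) for s in P) < tol:
            return new[start]
        t = new
    return t[start]


def loo_mean_bias(sessions):
    """Mean of (predicted - actual) step counts over leave-one-out folds."""
    errors = []
    for i, held_out in enumerate(sessions):
        train = sessions[:i] + sessions[i + 1:]
        P = fit(train)
        start = held_out[0]
        if start not in P:  # start state unseen in training: skip this fold
            continue
        errors.append(expected_steps(P, start) - len(held_out))
    return sum(errors) / len(errors)


# Toy sessions with made-up state sequences (not experiment data).
toy = [
    ["EXPLORE", "EXPLORE", "WRITE", "BUILD"],
    ["EXPLORE", "WRITE", "BUILD", "WRITE", "BUILD"],
    ["EXPLORE", "JAR_INSPECT", "WRITE", "BUILD"],
]
print(loo_mean_bias(toy))
```

A mean bias near zero means over- and under-predictions cancel on average; it does not by itself say how large individual errors are, so a spread measure is worth reporting alongside it.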

Behavioral Analysis

Every tool call across all 20 sessions was classified into one of 9 behavioral states using the Markov fingerprinting methodology. This reveals how variants differ, not just whether they produce different outcomes.

Transition Probability Matrix

Figure: Transition probability matrix across all variants. Each cell shows the probability of transitioning from one behavioral state to another. Darker cells indicate higher probability. The diagonal (self-loops) dominates: agents spend most of their time repeating the same type of action.
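A matrix like this can be estimated by counting consecutive state pairs and normalizing each row. The sketch below assumes each session is available as a list of state labels. `EXPLORE`, `WRITE`, `BUILD`, and `JAR_INSPECT` appear on this page, while the `DONE` label, the toy sequences, and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def transition_matrix(sessions):
    """Maximum-likelihood first-order transition probabilities.

    `sessions` is a list of state-label sequences, one per session.
    Returns {from_state: {to_state: probability}}; each row sums to 1.
    """
    counts = defaultdict(Counter)
    for seq in sessions:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(row.values()) for dst, n in row.items()}
        for src, row in counts.items()
    }


# Toy sequences, invented for illustration only.
toy_sessions = [
    ["EXPLORE", "EXPLORE", "WRITE", "BUILD", "DONE"],
    ["EXPLORE", "JAR_INSPECT", "JAR_INSPECT", "WRITE", "BUILD", "DONE"],
]

P = transition_matrix(toy_sessions)
for src, row in P.items():
    print(src, {dst: round(p, 3) for dst, p in row.items()})
```

Pooling all sessions of a variant before counting gives one matrix per variant. Rows for rarely visited states rest on few observations and should be read with caution.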

The JAR Cluster: Knowledge Friction

The most distinctive behavioral signature was the JAR_INSPECT cluster — the agent downloading and inspecting Spring Boot JARs to discover import paths that changed between Boot 3 and Boot 4.
Figure: JAR inspection loop, with the agent stuck in a discovery cycle. Without knowledge injection, the agent cycles through JAR inspection trying to discover Boot 4 import changes. This loop consumed 6–18% of all tool calls in variants without KB or skills.
Figure: JAR loop eliminated with skills. With skills or KB providing the correct imports, the JAR inspection loop disappears entirely. The agent goes straight from reading to writing.

Loop Amplification

Figure: Expected cycles through each behavioral loop. The Markov fundamental matrix predicts how many times the agent cycles through each loop before absorbing. Skills+preanalysis cuts the EXPLORE loop from 189 expected cycles to 93: same attention budget, better allocation.
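For readers unfamiliar with the term, the standard textbook definition of the fundamental matrix of an absorbing Markov chain is given below. How the report maps entries of this matrix to "cycles through a loop" is not stated on this page, so treat this as background on the general technique rather than the authors' exact computation.

```latex
% Canonical form of an absorbing chain: Q holds transient-to-transient
% probabilities, R holds transient-to-absorbing probabilities.
P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix},
\qquad
N = (I - Q)^{-1} = \sum_{k=0}^{\infty} Q^{k}

% N_{ij}: expected number of visits to transient state j when starting in i.
% Expected total number of steps before absorption, per start state:
t = N \mathbf{1}

% Special case, a single state with self-loop probability p:
% expected consecutive visits before leaving that state.
\mathbb{E}[\text{visits}] = \sum_{k=0}^{\infty} p^{k} = \frac{1}{1 - p}
```

For example, a state with self-loop probability 0.9 is visited 10 times in a row on average before the agent leaves it, which is why small changes to a dominant self-loop can shift total step counts substantially.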

Intervention Deltas

Figure: How each intervention changes transition probabilities. Each panel shows the change in transition probabilities when adding one intervention. Red = increased probability, blue = decreased. Skills most visibly reduce the EXPLORE self-loop and increase WRITE→BUILD transitions.
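A delta panel of this kind amounts to a cell-wise subtraction of two transition matrices. The sketch below assumes each matrix is a nested dict of probabilities. The function name and the probabilities are hypothetical, chosen only to show the sign convention (positive = more likely after the intervention).

```python
def transition_delta(base, treated):
    """Cell-wise difference (treated - base) between two transition matrices.

    Both arguments map from_state -> {to_state: probability}; a missing
    transition is treated as probability 0.
    """
    states = set(base) | set(treated)
    for matrix in (base, treated):
        for row in matrix.values():
            states |= set(row)
    ordered = sorted(states)
    return {
        src: {
            dst: treated.get(src, {}).get(dst, 0.0) - base.get(src, {}).get(dst, 0.0)
            for dst in ordered
        }
        for src in ordered
    }


# Hypothetical probabilities, not measured values.
base = {
    "EXPLORE": {"EXPLORE": 0.75, "WRITE": 0.25},
    "WRITE": {"WRITE": 0.5, "BUILD": 0.5},
}
with_skills = {
    "EXPLORE": {"EXPLORE": 0.5, "WRITE": 0.5},
    "WRITE": {"WRITE": 0.25, "BUILD": 0.75},
}

delta = transition_delta(base, with_skills)
print(delta["EXPLORE"]["EXPLORE"])  # negative: the self-loop became less likely
print(delta["WRITE"]["BUILD"])      # positive: WRITE→BUILD became more likely
```

Each row of a delta matrix sums to zero when both source rows are proper probability distributions, so an increase in one transition always comes at the expense of others from the same state.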

Sankey Flow Comparison

Figure: Sankey flow comparing simple vs hardened+skills+preanalysis. Tool-call flow runs from left to right. The simple variant (top) shows wide, diffuse flows through many states. The skills+preanalysis variant (bottom) is narrower and more directed, with fewer detours and more time writing.

Behavioral Heatmaps

Figure: Behavioral heatmap, simple variant. Broad exploration, scattered writes, high JAR_INSPECT activity. The agent discovers everything through trial and error.
Figure: Behavioral heatmap, hardened+skills+preanalysis. Focused exploration phase, then sustained writing. JAR_INSPECT is eliminated. The agent knows what it needs and writes it.

What Comes Next

The follow-up experiment (Code Coverage v3) tests what happens when the agent has skills but the existing codebase demonstrates older patterns. Spoiler: the codebase wins.

Resources

Dataset & Traces (v2.0.0): Full dataset download, variant configs, raw traces

Blog: I Read My Agent's Diary. Narrative walkthrough of the behavioral analysis

Results Report (PDF): Full quantitative results with figures and tables

Reading Agent Behavior (PDF): How to read agent behavioral traces, a casual explainer

v1 Baseline Experiment: The first experiment establishing the methodology

v3 Follow-up: The Exemplar Effect. What happens when existing code contradicts skill guidance