The Core Insight
Every iteration should identify a measurable gap between desired agent behavior and observed agent behavior. That gap is the loss signal. The flywheel turns the loss signal into a diagnosis, then into a targeted intervention, then into a verification run. This is gradient-inspired, not gradient-computed. Agent systems are not differentiable, but their journals, scores, state transitions, and failure paths provide directional evidence about where the next intervention should be applied.
The Cycle
Phase 0: State Taxonomy Discovery
For projects that use Markov analysis, the flywheel begins with state taxonomy discovery: defining the named states that the classifier maps tool calls to. This taxonomy is domain-specific and must be discovered empirically.
Run control variant 3–5 times
Generate enough tool-call data to see the agent’s natural behavior patterns.
Define state taxonomy
Name the clusters. Each state should represent a distinct kind of work: exploring, building, fixing, verifying, searching, reading knowledge. Aim for 5–12 states.
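
A minimal sketch of a classifier that maps tool calls to named states, assuming each tool call arrives as a tool name plus an argument string. The tool names (`bash`, `read_file`, `write_file`), the regex patterns, and the state names are all hypothetical; a real taxonomy should come from the control runs, not from this list.

```python
import re

# Hypothetical rules: (state, tool name, regex applied to the argument string).
# The first matching rule wins, so more specific rules go first.
RULES = [
    ("BUILD",   "bash",       re.compile(r"\b(mvn|gradle|make)\b")),
    ("VERIFY",  "bash",       re.compile(r"\b(pytest|test)\b")),
    ("SEARCH",  "bash",       re.compile(r"\b(grep|rg|find)\b")),
    ("READ_KB", "read_file",  re.compile(r"^knowledge/")),
    ("EXPLORE", "read_file",  re.compile(r".*")),
    ("WRITE",   "write_file", re.compile(r".*")),
]


def classify(tool: str, args: str) -> str:
    """Map one tool call to a named state; unmatched calls become OTHER."""
    for state, rule_tool, pattern in RULES:
        if tool == rule_tool and pattern.search(args):
            return state
    return "OTHER"


def classify_run(calls):
    """Turn a list of (tool, args) pairs into a list of state labels."""
    return [classify(tool, args) for tool, args in calls]
```

A large share of `OTHER` labels in the control runs is a sign that the taxonomy is still missing a state.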
Loss Signal Taxonomy
The loss signal is multi-dimensional. Not every dimension matters for every iteration, but the full surface is:

| Loss Dimension | What It Measures | Example Signal |
|---|---|---|
| Outcome | Task failure or low judge score | 3 of 10 benchmark cases fail |
| Behavioral | Unnecessary exploration or loops | BUILD→FIX loop amplification 3.2 |
| Knowledge | Repeated search or oracle calls | Repeated fallback inspection (e.g., Maven cache decompilation) |
| Tooling | Errors reachable from multiple paths | Same exception from 4 different states |
| Evaluation | Judge variance or malformed output | Non-JSON judge response 2/7 runs |
| Stability | Large run-to-run variance | Quality scores range 0.28–0.72 |
| Regression | One metric improves, another worsens | Batch score +0.4 but scheduling score −0.3 |
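
As an illustration, two of these dimensions (Evaluation and Stability) could be measured from raw judge output roughly as follows. This sketch assumes each judge response is meant to be a JSON object with a numeric `score` field; that field name and the function names are assumptions, not part of any documented judge format.

```python
import json


def judge_loss(raw_responses):
    """Return (malformed_count, scores) for a list of raw judge outputs."""
    scores, malformed = [], 0
    for raw in raw_responses:
        try:
            parsed = json.loads(raw)
            scores.append(float(parsed["score"]))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Non-JSON text, a missing field, or a non-numeric score
            # all count as evaluation loss.
            malformed += 1
    return malformed, scores


def score_range(scores):
    """Return (min, max) of the valid scores, or None if there are none."""
    return (min(scores), max(scores)) if scores else None
```

The malformed count feeds the Evaluation row; the spread of the remaining scores feeds the Stability row.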
Diagnostic Lenses
Multiple analytical tools illuminate the loss signal. No single lens is the methodology; they are instruments in the measurement apparatus.

| Lens | What It Reveals | Best For |
|---|---|---|
| Markov analysis | State transition patterns, loop amplification, transition gaps | Behavioral loss — where the agent gets stuck |
| Judge scores | Per-criterion quality assessment | Outcome loss — what the agent produces |
| Reasoning/intent traces | Intent-to-action policy, planning distribution | Knowledge loss — what the agent is searching for |
| Oracle call log | KB gaps the agent couldn’t resolve alone | Knowledge loss — what’s missing from the KB |
| Cost/token accounting | Where the budget goes | Behavioral loss — which states burn tokens |
| Run-to-run comparison | Variance across identical inputs | Stability loss — what’s nondeterministic |
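
The Markov lens rests on an empirical transition matrix. A minimal sketch, assuming each run has already been reduced to a list of state labels (the labels used in the usage example are illustrative):

```python
from collections import Counter, defaultdict


def transition_probabilities(runs):
    """Estimate P(next_state | state) from lists of state labels."""
    counts = defaultdict(Counter)
    for states in runs:
        # Pair every state with its successor within the same run.
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for state, following in counts.items()
    }
```

For example, given the runs `["EXPLORE", "BUILD", "FIX", "BUILD", "VERIFY"]` and `["EXPLORE", "EXPLORE", "BUILD", "VERIFY"]`, BUILD is followed by VERIFY in two of its three outgoing transitions and by FIX in one.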
Loop Types
Not all loops are problems. The DIAGNOSE step must classify the type of loop before choosing an intervention.

| Loop Type | Pattern | Meaning | Action |
|---|---|---|---|
| Productive | WRITE → VERIFY → FIX → VERIFY | Expected refinement cycle | Leave it alone |
| Friction | SEARCH → READ → SEARCH → READ | Agent lacks context or structure | Add knowledge or routing |
| Failure | BUILD → FIX → BUILD → FIX (same error) | Agent repeats an invalid strategy | Change strategy, not retry count |
| Diagnostic | BUILD → ERROR → READ_LOG → FIX | Agent is gathering useful failure information | Leave it alone |
| Degenerate | EXPLORE → EXPLORE → EXPLORE | No new information is being gained | The agent is stuck — intervene |
Optimizing loop amplification to zero is an anti-pattern. Some loops are productive. The goal is to eliminate friction, failure, and degenerate loops while preserving productive and diagnostic ones.
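
Two of the loop types in the table can be flagged mechanically. The sketch below is one possible heuristic, not a definition of the categories: it treats a stretch of one repeated state as degenerate and a recurring error signature as a failure loop. The thresholds and the `(state, error_signature)` step format are assumptions.

```python
from itertools import groupby


def find_degenerate_runs(states, min_length=3):
    """Return (state, length) for each stretch of one state repeated min_length+ times."""
    runs = []
    for state, group in groupby(states):
        length = len(list(group))
        if length >= min_length:
            runs.append((state, length))
    return runs


def is_failure_loop(steps, min_repeats=2):
    """True when the same error signature appears at least min_repeats times.

    steps: list of (state, error_signature or None).
    """
    seen = {}
    for _state, error in steps:
        if error is not None:
            seen[error] = seen.get(error, 0) + 1
    return any(count >= min_repeats for count in seen.values())
```

A BUILD → FIX cycle whose errors differ each time is not flagged, which keeps productive and diagnostic loops out of the intervention list.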
Intervention Levers
The type of loss determines which lever to pull.
Lever 1: Prompt
Clarify task decomposition, add stopping conditions, add execution ordering. The simple → hardened jump in code-coverage experiments produced the single largest quality gain (+0.07) with no external knowledge, just structure and an explicit stopping condition.
Pull when: diffuse waste, no dominant failure pattern, agent doesn’t know when it’s done.
Lever 2: Knowledge and Skills
Add domain recipes, examples, routing hints. Targeted KB entries eliminate specific search loops without touching the prompt. In one observed experiment, a single knowledge package reduced JAR_INSPECT from 18% to under 2% of all steps.
Pull when: friction loops around a specific knowledge gap.
Lever 3: Execution Structure
Three sub-levers that replace exploratory LLM behavior with deterministic execution:
- Deterministic tools: Replace states that don’t require reasoning. A build script that returns structured results eliminates the BUILD/FIX reasoning loop.
- Templates and scaffolds: Pre-generate structure or use cached known-good baselines. When the flywheel reveals the agent consistently discovers the same pattern through exploration, codify it.
- Steering: Runtime hooks that intercept tool calls and enforce behavioral constraints.
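
A steering hook can be sketched as a small stateful object consulted before each tool call. The class below is hypothetical and does not mirror any particular agent framework's hook API; `ExplorationBudget`, `before_tool_call`, and the `classify` callable (any function mapping a tool call to a state label) are assumed names.

```python
class ExplorationBudget:
    """Hypothetical pre-tool-call hook: deny a call once too many
    consecutive calls have landed in the same state."""

    def __init__(self, classify, state="EXPLORE", max_consecutive=3):
        self.classify = classify
        self.state = state
        self.max_consecutive = max_consecutive
        self.streak = 0

    def before_tool_call(self, tool, args):
        """Return (allowed, message) for the proposed tool call."""
        if self.classify(tool, args) != self.state:
            self.streak = 0  # any other state resets the budget
            return True, ""
        if self.streak >= self.max_consecutive:
            return False, (
                f"Limit of {self.max_consecutive} consecutive "
                f"{self.state} calls reached."
            )
        self.streak += 1
        return True, ""
```

The denial message is returned to the agent instead of the tool result, so the constraint also acts as a hint about what to do next.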
Lever 4: Model
Pick a model that clears the capability floor; below it, nothing else helps. But above that floor, the other levers are cheaper and often more effective.
Pull when: the agent fundamentally cannot perform the task, even with perfect knowledge and structure.
Lever 5: Rubric and Evaluation
Tighten judge criteria, add anchors with concrete examples, add per-criterion scoring. A rubric intervention doesn’t change the agent; it changes the measurement, which changes what the next iteration optimizes for.
Pull when: evaluation loss dominates (judge variance, malformed output, scores that don’t correlate with actual quality).
The Deterministic-Over-Exploratory Principle
The flywheel’s purpose is to systematically shrink the agent’s exploration space. When the measurement apparatus reveals the agent consistently discovers the same pattern through exploration, that pattern should be codified as a deterministic step.

| Execution Path | Quality Range | Reliability |
|---|---|---|
| Cached templates (deterministic) | 0.70 – 0.93 | Stable across runs |
| Expansion path (LLM with constraints) | 0.28 – 0.72 | Varies by run |
| Raw Claude Code (pure exploration) | 0.19 – 0.63 | High variance |

Findings and their corresponding codifications:

| Finding | Codification |
|---|---|
| Agent always discovers the same file structure | Template or scaffold |
| Agent always applies the same fix pattern | Recipe in knowledge/ |
| Agent always needs the same context | Structured context in the prompt |
| Agent always makes the same tool-call sequence | Deterministic workflow step |
| Agent’s orientation thinking dominates | Pre-analysis script that front-loads context |
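
One way to surface the "agent always makes the same tool-call sequence" finding is to look for call sequences shared by every run. A minimal sketch, assuming each run is a list of hashable call labels; the function names and the window size `n` are illustrative choices.

```python
def ngrams(seq, n):
    """Return the set of length-n windows in seq."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def shared_sequences(runs, n=3):
    """Return the length-n call sequences that appear in every run."""
    if not runs:
        return set()
    common = ngrams(runs[0], n)
    for run in runs[1:]:
        common &= ngrams(run, n)
    return common
```

Any sequence this returns is a candidate for a deterministic workflow step, since the agent reproduces it regardless of what else happens in the run.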
Variant Progression
Variants are empirically motivated. Each exists because the previous variant’s analysis revealed a specific gap.
Verification Discipline
The VERIFY step requires tracking what changed between iterations and what improved. Per-iteration record:
- Per-criterion scores: Not just the aggregate. A rising aggregate can hide a regression in a specific criterion.
- Loop amplification per state: The primary behavioral metric. Did the friction loop shrink?
- Transition probabilities: Did the agent’s navigation pattern change as expected?
- Variant-over-variant delta: Before/after comparison for the specific change made.
- Stability: Run the same variant multiple times to distinguish signal from variance.
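
The per-criterion and variant-over-variant checks can be sketched as a comparison of averaged scores. This assumes each run is a dict of criterion name to score and that every run reports the same criteria; the `tolerance` value is a placeholder that should really be derived from the observed run-to-run variance.

```python
from statistics import mean


def mean_scores(runs):
    """Average each criterion across repeated runs of one variant."""
    criteria = runs[0].keys()
    return {c: mean(r[c] for r in runs) for c in criteria}


def variant_delta(before_runs, after_runs, tolerance=0.05):
    """Return per-criterion deltas and the criteria that regressed beyond tolerance."""
    before, after = mean_scores(before_runs), mean_scores(after_runs)
    delta = {c: after[c] - before[c] for c in before}
    regressions = [c for c, d in delta.items() if d < -tolerance]
    return delta, regressions
```

With illustrative inputs where the batch criterion rises by 0.4 and the scheduling criterion falls by 0.3, the aggregate still goes up, but `regressions` lists the scheduling criterion.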
Anti-Patterns
- Skipping taxonomy discovery — Jumping to Markov analysis with a generic state taxonomy
- Figures without interpretation — Running analysis and looking at pictures without mapping findings to interventions
- Unmotivated variants — Creating variants without a clear hypothesis from prior measurement
- Aggregate-only scoring — Tracking only the overall batch score instead of per-criterion metrics
- Wrong lever — Throwing knowledge at a reasoning gap, or a bigger model at a knowledge gap
- Fixing without verifying — Making an improvement and moving on without confirming it worked
- Over-rotation on a single metric — Optimizing loop amplification to zero removes productive loops
- Ignoring the deterministic principle — Improving agent exploration instead of converting exploration into deterministic steps
Evidence
| Project | Iterations | Key Finding |
|---|---|---|
| Code Coverage v1→v2→v3 | 7 variants, 20 runs | Structure beats knowledge; skills reduce waste but not quality ceiling |
| bud-eval | 6 iterations | Batch 0.519→0.926 (template fix), scheduling oscillation→stable 0.741 (test fix) |
Related
Forge Methodology
The pipeline that produces the artifacts the flywheel improves
Markov Fingerprinting
Primary diagnostic lens — the DIAGNOSE step
Four-Tier Jury
Cascaded evaluation — the MEASURE step
Experiment Driver
IterationMetadata and variant progression in practice