
The Core Insight

Every iteration should identify a measurable gap between desired agent behavior and observed agent behavior. That gap is the loss signal. The flywheel turns the loss signal into a diagnosis, then into a targeted intervention, then into a verification run. This is gradient-inspired, not gradient-computed. Agent systems are not differentiable, but their journals, scores, state transitions, and failure paths provide directional evidence about where the next intervention should be applied.

The Cycle

1. RUN        — Execute variants and capture journals
2. MEASURE    — Compute scores, traces, behavioral metrics
3. DIAGNOSE   — Convert signals into hypotheses about causes
4. INTERVENE  — Change prompt, KB, tool, workflow, rubric, or template
5. VERIFY     — Re-run and compare deltas/regressions
Each iteration estimates where the system is failing, chooses the most promising improvement direction, applies an intervention, and measures whether the system moved in the intended direction. Variants are empirically motivated, not pre-planned — each variant exists because the previous variant’s analysis revealed a specific gap.
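
The five steps can be sketched as a plain loop. This is a minimal illustration under stated assumptions, not the project's actual driver: `flywheel` and its four callables are hypothetical names, and the sketch accepts every candidate variant instead of rejecting regressions.

```python
from typing import Any, Callable, Dict, List

Metrics = Dict[str, float]


def flywheel(
    variant: Any,
    run: Callable[[Any], List[Any]],          # RUN: variant -> journals
    measure: Callable[[List[Any]], Metrics],  # MEASURE: journals -> metrics
    diagnose: Callable[[Metrics], str],       # DIAGNOSE: metrics -> hypothesis
    intervene: Callable[[Any, str], Any],     # INTERVENE: (variant, hypothesis) -> new variant
    iterations: int = 3,
) -> List[Dict[str, Any]]:
    """Run the cycle and return one record per iteration (hypothetical helper)."""
    history: List[Dict[str, Any]] = []
    before = measure(run(variant))
    for _ in range(iterations):
        hypothesis = diagnose(before)
        candidate = intervene(variant, hypothesis)
        after = measure(run(candidate))       # VERIFY: re-run and compare
        deltas = {k: after[k] - before[k] for k in before if k in after}
        history.append(
            {"hypothesis": hypothesis, "before": before, "after": after, "deltas": deltas}
        )
        # Assumption: the candidate always becomes the next baseline.
        variant, before = candidate, after
    return history
```

In practice the VERIFY step would also decide whether to keep or roll back the candidate; that decision is left out here to keep the shape of the cycle visible.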

Phase 0: State Taxonomy Discovery

For projects that use Markov analysis, the flywheel begins with state taxonomy discovery. You need a state taxonomy — the named states that the classifier maps tool calls to. This taxonomy is domain-specific and must be discovered empirically.
1. Run control variant 3–5 times
   Generate enough tool-call data to see the agent’s natural behavior patterns.

2. Run discovery mode
   See raw tool name + target frequencies without a predefined taxonomy.

3. Inspect clusters
   Look for related tool calls that represent a coherent activity.

4. Define state taxonomy
   Name the clusters. Each state should represent a distinct kind of work: exploring, building, fixing, verifying, searching, reading knowledge. Aim for 5–12 states.

5. Define cluster groups
   Group states into higher-level categories: productive work (WRITE, BUILD, VERIFY), friction (FIX, SEARCH), knowledge access (READ_KB, READ_SKILL).
What makes a good taxonomy: States are verbs, not nouns — they describe what the agent is doing, not what it’s looking at. Each state should have diagnostic value: its frequency change tells you something about agent quality.
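
Discovery mode and the resulting classifier might look roughly like the sketch below. The tool names (`Read`, `Write`, `Bash`), the target strings, and the `RULES` table are illustrative assumptions; a real taxonomy comes from inspecting your own control runs.

```python
from collections import Counter
from typing import Iterable, List, Tuple

ToolCall = Tuple[str, str]  # (tool name, target), e.g. ("Read", "knowledge/maven.md")


def discover(calls: Iterable[ToolCall]) -> List[Tuple[ToolCall, int]]:
    """Discovery mode: raw (tool, target) frequencies, most common first."""
    return Counter(calls).most_common()


# Hypothetical taxonomy written down after inspecting the clusters.
# Rules are checked in order; each is (tool, target prefix, state).
RULES = [
    ("Read", "knowledge/", "READ_KB"),
    ("Read", "", "EXPLORE"),
    ("Write", "", "WRITE"),
    ("Bash", "mvn test", "VERIFY"),
    ("Bash", "mvn", "BUILD"),
]


def classify(call: ToolCall) -> str:
    """Map one tool call to a named state; unknown calls fall into OTHER."""
    tool, target = call
    for rule_tool, prefix, state in RULES:
        if tool == rule_tool and target.startswith(prefix):
            return state
    return "OTHER"
```

A large `OTHER` bucket after classification suggests the taxonomy is missing a state and another discovery pass is needed.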

Loss Signal Taxonomy

The loss signal is multi-dimensional. Not every dimension matters for every iteration, but the full surface is:
| Loss Dimension | What It Measures | Example Signal |
|---|---|---|
| Outcome | Task failure or low judge score | 3 of 10 benchmark cases fail |
| Behavioral | Unnecessary exploration or loops | BUILD→FIX loop amplification 3.2 |
| Knowledge | Repeated search or oracle calls | Repeated fallback inspection (e.g., Maven cache decompilation) |
| Tooling | Errors reachable from multiple paths | Same exception from 4 different states |
| Evaluation | Judge variance or malformed output | Non-JSON judge response 2/7 runs |
| Stability | Large run-to-run variance | Quality scores range 0.28–0.72 |
| Regression | One metric improves, another worsens | Batch score +0.4 but scheduling score −0.3 |

The MEASURE step quantifies these signals. The DIAGNOSE step identifies which dimension dominates. The INTERVENE step targets that dimension specifically.

Diagnostic Lenses

Multiple analytical tools illuminate the loss signal. No single lens is the methodology — they are instruments in the measurement apparatus.
| Lens | What It Reveals | Best For |
|---|---|---|
| Markov analysis | State transition patterns, loop amplification, transition gaps | Behavioral loss: where the agent gets stuck |
| Judge scores | Per-criterion quality assessment | Outcome loss: what the agent produces |
| Reasoning/intent traces | Intent-to-action policy, planning distribution | Knowledge loss: what the agent is searching for |
| Oracle call log | KB gaps the agent couldn’t resolve alone | Knowledge loss: what’s missing from the KB |
| Cost/token accounting | Where the budget goes | Behavioral loss: which states burn tokens |
| Run-to-run comparison | Variance across identical inputs | Stability loss: what’s nondeterministic |
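
For the Markov lens, the basic quantity is a table of transition probabilities estimated from classified state sequences. The following is a minimal sketch; `transition_probabilities` is a hypothetical helper name, and each input sequence is assumed to be one run's tool calls already mapped to states.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def transition_probabilities(sequences: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(next state | current state) from observed state sequences."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        # Count each adjacent (current, next) pair within a run.
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


probs = transition_probabilities([["BUILD", "FIX", "BUILD", "VERIFY"]])
```

Comparing these tables between two variants shows whether an intervention changed how the agent moves between states, which an aggregate score alone would not reveal.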

Loop Types

Not all loops are problems. The DIAGNOSE step must classify the type of loop before choosing an intervention.
| Loop Type | Pattern | Meaning | Action |
|---|---|---|---|
| Productive | WRITE → VERIFY → FIX → VERIFY | Expected refinement cycle | Leave it alone |
| Friction | SEARCH → READ → SEARCH → READ | Agent lacks context or structure | Add knowledge or routing |
| Failure | BUILD → FIX → BUILD → FIX (same error) | Agent repeats an invalid strategy | Change strategy, not retry count |
| Diagnostic | BUILD → ERROR → READ_LOG → FIX | Agent is gathering useful failure information | Leave it alone |
| Degenerate | EXPLORE → EXPLORE → EXPLORE | No new information is being gained | The agent is stuck; intervene |

Optimizing loop amplification to zero is an anti-pattern. Some loops are productive. The goal is to eliminate friction, failure, and degenerate loops while preserving productive and diagnostic ones.
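
Two of these loop types can be told apart mechanically. The sketch below is an assumption-laden illustration: it supposes each step carries a state plus an optional error signature (a hypothetical field), and it only produces counts; deciding how many repeats count as "stuck" is left to the analyst.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[str]]  # (state, error signature or None)


def longest_run(steps: List[Step], state: str) -> int:
    """Longest streak of consecutive steps in `state` (degenerate-loop signal)."""
    best = current = 0
    for s, _ in steps:
        current = current + 1 if s == state else 0
        best = max(best, current)
    return best


def repeated_failures(steps: List[Step]) -> int:
    """Count BUILD steps that fail with the same error as the previous failed BUILD."""
    repeats = 0
    last_error: Optional[str] = None
    for state, error in steps:
        if state != "BUILD":
            continue
        if error is not None and error == last_error:
            repeats += 1
        # A successful BUILD (error is None) resets the comparison.
        last_error = error
    return repeats
```

A BUILD → FIX cycle where the error signature changes each time looks like a productive or diagnostic loop; the same signature repeating is the failure-loop pattern from the table.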

Intervention Levers

The type of loss determines which lever to pull.

Lever 1: Prompt

Clarify task decomposition, add stopping conditions, add execution ordering. The simple→hardened jump in code-coverage experiments produced the single largest quality gain (+0.07) with no external knowledge, only structure and an explicit stopping condition. Pull when: diffuse waste, no dominant failure pattern, agent doesn’t know when it’s done.

Lever 2: Knowledge and Skills

Add domain recipes, examples, routing hints. Targeted KB entries eliminate specific search loops without touching the prompt. In one observed experiment, a single knowledge package reduced JAR_INSPECT from 18% to under 2% of all steps. Pull when: friction loops around a specific knowledge gap.

Lever 3: Execution Structure

Three sub-levers that replace exploratory LLM behavior with deterministic execution:
  • Deterministic tools — Replace states that don’t require reasoning. A build script that returns structured results eliminates the BUILD/FIX reasoning loop.
  • Templates and scaffolds — Pre-generate structure or use cached known-good baselines. When the flywheel reveals the agent consistently discovers the same pattern through exploration, codify it.
  • Steering — Runtime hooks that intercept tool calls and enforce behavioral constraints.
Pull when: loops around states that could be deterministic, agent repeatedly discovers the same answer, or agent makes predictable wrong choices.
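
As one illustration of the steering sub-lever, a runtime hook can wrap whatever function executes tool calls and refuse a call once it has been repeated too often. `RepeatGuard`, its `limit` parameter, and the `(tool, target)` call shape are hypothetical; real agent frameworks expose their own hook interfaces.

```python
from collections import Counter
from typing import Callable


class RepeatGuard:
    """Steering hook: block a tool call once it has been issued `limit` times."""

    def __init__(self, execute: Callable[[str, str], str], limit: int = 3):
        self._execute = execute
        self._limit = limit
        self._seen: Counter = Counter()

    def __call__(self, tool: str, target: str) -> str:
        self._seen[(tool, target)] += 1
        if self._seen[(tool, target)] > self._limit:
            # Return a message instead of executing, so the agent sees why.
            return (
                f"BLOCKED: {tool} {target} already ran {self._limit} times; "
                "try a different approach."
            )
        return self._execute(tool, target)


guard = RepeatGuard(lambda tool, target: "ok", limit=2)
results = [guard("Bash", "mvn test") for _ in range(3)]
```

Returning an explanatory message rather than raising keeps the constraint visible in the journal, so the next MEASURE step can count how often steering fired.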

Lever 4: Model

Pick a model that clears the capability floor — below it, nothing else helps. But above that floor, the other levers are cheaper and often more effective. Pull when: the agent fundamentally cannot perform the task, even with perfect knowledge and structure.

Lever 5: Rubric and Evaluation

Tighten judge criteria, add anchors with concrete examples, add per-criterion scoring. A rubric intervention doesn’t change the agent — it changes the measurement, which changes what the next iteration optimizes for. Pull when: evaluation loss dominates (judge variance, malformed output, scores that don’t correlate with actual quality).
The critical distinction: Knowledge can’t fix a reasoning gap. Steering can’t fix a knowledge gap. A better model can’t fix either. Diagnose which problem you have before you reach for a lever.

The Deterministic-Over-Exploratory Principle

The flywheel’s purpose is to systematically shrink the agent’s exploration space. When the measurement apparatus reveals the agent consistently discovers the same pattern through exploration, that pattern should be codified as a deterministic step.
| Execution Path | Quality Range | Reliability |
|---|---|---|
| Cached templates (deterministic) | 0.70 – 0.93 | Stable across runs |
| Expansion path (LLM with constraints) | 0.28 – 0.72 | Varies by run |
| Raw Claude Code (pure exploration) | 0.19 – 0.63 | High variance |

Every decision point the LLM doesn’t have to make is a source of variance eliminated. LLM steps are reserved for genuinely creative decisions where the search space can’t be pre-constrained.
| Finding | Codification |
|---|---|
| Agent always discovers the same file structure | Template or scaffold |
| Agent always applies the same fix pattern | Recipe in knowledge/ |
| Agent always needs the same context | Structured context in the prompt |
| Agent always makes the same tool-call sequence | Deterministic workflow step |
| Agent’s orientation thinking dominates | Pre-analysis script that front-loads context |
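
One way to spot a tool-call sequence worth codifying is to look for n-grams that appear in every run of a variant. This is a rough sketch, assuming each run is a list of normalized tool-call strings; `shared_sequences` and the window size `n` are illustrative choices, not part of the methodology.

```python
from typing import List, Set, Tuple


def ngrams(seq: List[str], n: int) -> Set[Tuple[str, ...]]:
    """All contiguous windows of length n in one run."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}


def shared_sequences(runs: List[List[str]], n: int = 3) -> Set[Tuple[str, ...]]:
    """Tool-call n-grams present in every run: candidates for a deterministic step."""
    if not runs:
        return set()
    common = ngrams(runs[0], n)
    for run in runs[1:]:
        common &= ngrams(run, n)
    return common


runs = [
    ["ls", "cat pom.xml", "mvn compile", "edit"],
    ["ls", "cat pom.xml", "mvn compile", "mvn test"],
]
candidates = shared_sequences(runs, n=3)
```

A sequence that survives the intersection across many runs is something the agent rediscovers every time, which is the signal the table above maps to a deterministic workflow step.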

Variant Progression

Variants are empirically motivated. Each exists because the previous variant’s analysis revealed a specific gap.
v0: baseline (control)
    → Run, measure: identify dominant loss dimension

v1: address the dominant loss
    → Typically prompt improvement (Lever 1) — clearest signal first
    → Run, measure: did the loss decrease? What's the next loss?

v2: address the next loss
    → Typically knowledge injection (Lever 2) — domain files for remaining gaps
    → Run, measure: repeat

v3+: address remaining losses
    → Structural fixes (Lever 3), rubric tightening (Lever 5)
    → Each variant is motivated by the previous variant's measurement
Every variant links back to its motivating finding and hypothesis, creating an audit trail: for every variant you can trace back to the observation that motivated it and verify whether the hypothesis held.
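
The audit trail can be kept as a small record per variant with a link to its parent. The `Variant` dataclass and its field names below are hypothetical and chosen for illustration; they are not the schema of any particular experiment driver.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Variant:
    name: str
    parent: Optional[str]        # variant whose measurement motivated this one
    finding: str                 # observation from the parent's analysis
    hypothesis: str              # what the change is expected to do
    lever: str                   # e.g. "prompt", "knowledge", "structure"
    held: Optional[bool] = None  # filled in after the verification run


def audit_trail(variants: Dict[str, Variant], name: str) -> List[Variant]:
    """Walk parent links from `name` back to the baseline."""
    trail: List[Variant] = []
    current: Optional[str] = name
    while current is not None:
        variant = variants[current]
        trail.append(variant)
        current = variant.parent
    return trail
```

A variant with an empty `finding` or `hypothesis` is the "unmotivated variant" case and is easy to flag with a record like this.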

Verification Discipline

The VERIFY step requires tracking what changed between iterations and what improved. Per-iteration record:
Iteration: v0 → v1
Change: Added structured execution steps to prompt
Metrics before: batch score 0.519, BUILD→FIX amplification 3.2
Metrics after:  batch score 0.926, BUILD→FIX amplification 1.1
Delta: +0.407 outcome score, −2.1 behavioral amplification
Regression: none detected
What to track:
  • Per-criterion scores — Not just the aggregate. A rising aggregate can hide a regression in a specific criterion.
  • Loop amplification per state — The primary behavioral metric. Did the friction loop shrink?
  • Transition probabilities — Did the agent’s navigation pattern change as expected?
  • Variant-over-variant delta — Before/after comparison for the specific change made.
  • Stability — Run the same variant multiple times to distinguish signal from variance.
Regression detection: Every improvement can introduce regressions. Did the targeted loss decrease? Did any other dimension increase? Is the improvement stable across multiple runs?
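
The three regression questions can be checked per metric across repeated runs. The sketch below makes a simplifying assumption that is not from the methodology itself: a change counts as real only if the shift in the mean exceeds the larger of the two run-to-run standard deviations. `compare` and the metric names passed to it are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List

Runs = List[Dict[str, float]]  # one dict of per-criterion metrics per run


def compare(
    before: Runs, after: Runs, higher_is_better: Dict[str, bool]
) -> Dict[str, Dict[str, object]]:
    """Per-metric delta between two variants, flagged against run-to-run noise."""
    report: Dict[str, Dict[str, object]] = {}
    for metric, up in higher_is_better.items():
        b = [r[metric] for r in before]
        a = [r[metric] for r in after]
        delta = mean(a) - mean(b)
        # Assumption: changes smaller than the run-to-run spread are treated as noise.
        noise = max(stdev(b), stdev(a)) if len(b) > 1 and len(a) > 1 else 0.0
        improved = delta > noise if up else delta < -noise
        regressed = delta < -noise if up else delta > noise
        report[metric] = {
            "delta": delta, "noise": noise, "improved": improved, "regressed": regressed,
        }
    return report
```

Passing per-criterion scores rather than only the aggregate is what lets a check like this surface a regression that a rising batch score would otherwise hide. With a single run per variant the noise estimate is zero, so every change is flagged; that is a reason to repeat runs, not a feature.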

Anti-Patterns

  • Skipping taxonomy discovery — Jumping to Markov analysis with a generic state taxonomy
  • Figures without interpretation — Running analysis and looking at pictures without mapping findings to interventions
  • Unmotivated variants — Creating variants without a clear hypothesis from prior measurement
  • Aggregate-only scoring — Tracking only the overall batch score instead of per-criterion metrics
  • Wrong lever — Throwing knowledge at a reasoning gap, or a bigger model at a knowledge gap
  • Fixing without verifying — Making an improvement and moving on without confirming it worked
  • Over-rotation on a single metric — Optimizing loop amplification to zero removes productive loops
  • Ignoring the deterministic principle — Improving agent exploration instead of converting exploration into deterministic steps

Evidence

| Project | Iterations | Key Finding |
|---|---|---|
| Code Coverage v1→v2→v3 | 7 variants, 20 runs | Structure beats knowledge; skills reduce waste but not quality ceiling |
| bud-eval | 6 iterations | Batch 0.519→0.926 (template fix), scheduling oscillation→stable 0.741 (test fix) |

Forge Methodology

The pipeline that produces the artifacts the flywheel improves

Markov Fingerprinting

Primary diagnostic lens — the DIAGNOSE step

Four-Tier Jury

Cascaded evaluation — the MEASURE step

Experiment Driver

IterationMetadata and variant progression in practice