The Worldview
An agent is not a one-shot generator. It is a controlled system operating under feedback: observed through journals, scored by judges, and steered by the evidence both produce. That worldview has practical consequences:
- Everything iterates. This is not spec-driven development in the waterfall sense. The specs, the judges, and the workflow steps all iterate together until the agent performs. No artifact is finished before the loop starts; the loop is how artifacts get finished.
- The judge set becomes a benchmark. Once the judges converge, they stop being development scaffolding and become the measuring stick: every improvement, and every alternative approach, is scored against them.
- Determinism is the substrate. Agents are workflows — deterministic steps wherever possible, AI only where necessary. Every decision the LLM doesn’t have to make is a source of variance eliminated.
- Behavioral analysis finds the gradient. Agent systems aren’t differentiable, but Markov analysis of journals, combined with judge scores, provides directional evidence: a gradient the iteration descends.
- The loop continues into production. Deployment doesn’t stop the journals or retire the judges. The same flywheel that built the agent keeps measuring and improving it.
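As a rough illustration of what Markov analysis of journals can mean in practice, the sketch below estimates first-order transition probabilities from sequences of behavioral states. The state names (`BUILD`, `TEST`, `EDIT`, `DONE`) and the sample traces are hypothetical, and the journal format is assumed to reduce to a simple list of state labels per run.

```python
from collections import Counter, defaultdict


def transition_probabilities(traces):
    """Estimate first-order Markov transition probabilities from state sequences."""
    counts = defaultdict(Counter)
    for trace in traces:
        # Count each consecutive (current, next) pair in the trace.
        for current, nxt in zip(trace, trace[1:]):
            counts[current][nxt] += 1
    probs = {}
    for state, nexts in counts.items():
        total = sum(nexts.values())
        probs[state] = {nxt: n / total for nxt, n in nexts.items()}
    return probs


# Hypothetical journals: one list of behavioral states per agent run.
traces = [
    ["BUILD", "TEST", "EDIT", "TEST", "DONE"],
    ["BUILD", "TEST", "DONE"],
]
probs = transition_probabilities(traces)
for state, row in probs.items():
    print(state, row)
```

A row with high probability mass on a cycle back into the same few states is the kind of directional evidence the text describes; what counts as a harmful loop remains a judgment made per domain.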
Six Pillars
Six pillars (the thesis, the pipeline, knowledge design, execution structure, evaluation, and behavioral analysis), plus the improvement flywheel that ties them into a single loop:
- The Thesis. Knowledge + structured execution > model
- Forge. Deterministic customization pipeline: Define → Forge → Run → Grow
- KB Design. How to structure knowledge for agent consumption
- SAE. Phased execution with checkpoints and guard rails
- Evaluation. Four-tier cascaded jury: deterministic first, LLM last
- Behavioral Analysis. Markov chain modeling of agent tool-call traces
- Improvement Flywheel. Loss-driven iteration: from measured gaps to targeted interventions
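The "deterministic first, LLM last" idea behind the cascaded jury can be sketched as a list of judges ordered from cheapest to most expensive. Everything below is an assumption made for illustration: the `Verdict` type, the individual checks, the stubbed T3 judge, and in particular the early-exit rule (stop at the first failing tier), which the text implies with "cheap filters first" but does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    tier: str
    passed: bool
    score: float


Judge = Callable[[str], Verdict]


def run_cascade(output: str, judges: List[Judge]) -> List[Verdict]:
    """Run judges cheapest-first and stop at the first failing tier."""
    verdicts = []
    for judge in judges:
        verdict = judge(output)
        verdicts.append(verdict)
        if not verdict.passed:
            break  # later, more expensive tiers are skipped
    return verdicts


llm_calls = []  # records how often the expensive tier was reached


def t0_nonempty(output: str) -> Verdict:
    ok = bool(output.strip())
    return Verdict("T0", ok, 1.0 if ok else 0.0)


def t1_mentions_return(output: str) -> Verdict:
    ok = "return" in output  # hypothetical deterministic content check
    return Verdict("T1", ok, 1.0 if ok else 0.0)


def t2_short_enough(output: str) -> Verdict:
    ok = len(output) <= 200  # hypothetical size limit
    return Verdict("T2", ok, 1.0 if ok else 0.0)


def t3_llm_stub(output: str) -> Verdict:
    llm_calls.append(output)  # stands in for a paid model call
    return Verdict("T3", True, 0.8)


JUDGES = [t0_nonempty, t1_mentions_return, t2_short_enough, t3_llm_stub]

rejected = run_cascade("   ", JUDGES)
accepted = run_cascade("def f(): return 1", JUDGES)
print([v.tier for v in rejected], [v.tier for v in accepted], len(llm_calls))
```

In this sketch the blank output is rejected at T0 and never reaches the stubbed model tier, which is where the cost saving of a cascade comes from. Returning every verdict rather than a single aggregate keeps per-tier results available for later diagnosis.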
Experiment Protocol
Every experiment follows the Improvement Flywheel cycle:
1. Discover state taxonomy. Run control variant 3–5 times. Inspect tool-call clusters and define domain-specific behavioral states for Markov analysis.
2. Run control baseline. Each variant runs N=3+ times for statistical confidence. Full traces captured. Deterministic preprocessing routes to knowledge bases at zero LLM cost.
3. Measure with four-tier jury. T0 → T1 → T2 → T3 cascade. Cheap filters first. Capture per-criterion scores, not just aggregates.
4. Diagnose with behavioral analysis. Build Markov chains from tool-call sequences. Identify dominant loss dimension. Classify loops before intervening.
5. Intervene: create next variant. Apply a targeted lever: prompt, knowledge, execution structure, model, or rubric. Each variant is motivated by the previous variant’s measurement.
6. Verify and iterate. Re-run and compare deltas. Check for regressions across all dimensions. Return to step 3 until loss plateaus.
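The verify step (compare deltas, check for regressions across all dimensions) could look roughly like the sketch below. The criterion names, the scores, and the `tolerance` parameter are hypothetical. The only assumption about the jury is that it emits one numeric score per criterion for each variant.

```python
def compare_variants(baseline, candidate, tolerance=0.0):
    """Return per-criterion deltas and the sorted list of regressed criteria."""
    deltas = {
        name: candidate[name] - baseline[name]
        for name in baseline
        if name in candidate
    }
    # A criterion regresses when it drops by more than the tolerance.
    regressions = sorted(name for name, d in deltas.items() if d < -tolerance)
    return deltas, regressions


# Hypothetical per-criterion jury scores, averaged over runs of each variant.
baseline = {"correctness": 0.70, "style": 0.80, "efficiency": 0.60}
candidate = {"correctness": 0.85, "style": 0.75, "efficiency": 0.60}

deltas, regressions = compare_variants(baseline, candidate)
print(regressions)  # criteria that got worse after the intervention
```

Here the aggregate improves while one criterion regresses, which is the case that per-criterion capture is meant to expose. With only a few runs per variant, a nonzero `tolerance` may be a sensible guard against treating noise as regression.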
Key Metrics
| Metric | What It Measures | Source |
|---|---|---|
| T3 Score | Overall quality (LLM jury assessment) | Agent Judge |
| Expected Steps | Efficiency (Markov fundamental matrix) | Agent Journal |
| P(success) | Reliability (absorbing chain probability) | Markov Analysis |
| Thrash Score | Behavioral loss (loop amplification in BUILD→TEST→EDIT) | Markov Analysis |
| Loop Amplification | Per-state revisit rate — friction vs productive loops | Markov Analysis |
| Regression Count | Dimensions that worsened after an intervention | Jury Comparison |
| Cost | Practical efficiency ($ per experiment run) | Token tracking |
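For the two metrics the table attributes to the fundamental matrix and the absorbing chain, the standard construction is N = (I − Q)⁻¹, where Q holds the transition probabilities among transient states. Expected steps are the row sums of N, and P(success) is N applied to the column of one-step probabilities into the success state. The sketch below solves the equivalent linear systems in plain Python instead of inverting the matrix. The states and probabilities are hypothetical, and how the actual tooling computes these values is not stated in the source.

```python
def solve(a, b):
    """Solve a·x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def absorbing_metrics(q, r_success):
    """q: transient-to-transient probabilities.
    r_success: one-step probability of reaching SUCCESS from each transient state.
    """
    n = len(q)
    i_minus_q = [
        [(1.0 if i == j else 0.0) - q[i][j] for j in range(n)] for i in range(n)
    ]
    expected_steps = solve(i_minus_q, [1.0] * n)  # row sums of N
    p_success = solve(i_minus_q, r_success)       # N applied to r_success
    return expected_steps, p_success


# Hypothetical chain. Transient states in order: BUILD, TEST, EDIT.
# TEST also jumps to SUCCESS with 0.5 and to FAIL with 0.1 (absorbing states).
q = [
    [0.0, 1.0, 0.0],  # BUILD -> TEST
    [0.0, 0.0, 0.4],  # TEST  -> EDIT
    [0.0, 1.0, 0.0],  # EDIT  -> TEST
]
r_success = [0.0, 0.5, 0.0]

expected_steps, p_success = absorbing_metrics(q, r_success)
print(expected_steps[0], p_success[0])  # values when starting from BUILD
```

For this toy chain, a run starting in BUILD takes 10/3 transitions on average before absorption and ends in SUCCESS with probability 5/6. Lowering the TEST → EDIT probability improves both numbers.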