The lab uses a consistent methodology across all experiments. This page is the overview — each component has its own deep-dive.

The Worldview

An agent is not a one-shot generator. It is a controlled system operating under feedback — observed through journals, scored by judges, and steered by the evidence both produce. That worldview has practical consequences:
  • Everything iterates. This is not spec-driven development in the waterfall sense. The specs, the judges, and the workflow steps all iterate together until the agent performs. No artifact is finished before the loop starts — the loop is how artifacts get finished.
  • The judge set becomes a benchmark. Once the judges converge, they stop being development scaffolding and become the measuring stick: every improvement, and every alternative approach, is scored against them.
  • Determinism is the substrate. Agents are workflows — deterministic steps wherever possible, AI only where necessary. Every decision the LLM doesn’t have to make is a source of variance eliminated.
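  • Behavioral analysis finds the gradient. Agent systems aren’t differentiable, but Markov analysis of journals combined with judge scores provides directional evidence: it shows which change is likely to reduce loss, and each iteration follows that direction the way gradient descent follows a gradient.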
  • The loop continues into production. Deployment doesn’t stop the journals or retire the judges. The same flywheel that built the agent keeps measuring and improving it.
The open frontier is the next layer up: many such agents, coordinating.
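
To make the behavioral-analysis point above concrete, here is a minimal sketch of how transition probabilities might be estimated from journaled state sequences. The state names (PLAN, EDIT, BUILD, TEST, DONE), the sample traces, and the `transition_probs` helper are assumptions for illustration only; the lab's actual state taxonomy is discovered per experiment.

```python
from collections import Counter, defaultdict

def transition_probs(traces):
    """Estimate first-order transition probabilities from state sequences."""
    counts = defaultdict(Counter)
    for trace in traces:
        # Count each adjacent (current, next) pair in the trace.
        for cur, nxt in zip(trace, trace[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for cur, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        # Normalize counts so each row sums to 1.
        probs[cur] = {nxt: n / total for nxt, n in nxt_counts.items()}
    return probs

# Hypothetical journals, already mapped from tool calls to behavioral states.
traces = [
    ["PLAN", "EDIT", "BUILD", "TEST", "DONE"],
    ["PLAN", "EDIT", "BUILD", "TEST", "EDIT", "BUILD", "TEST", "DONE"],
]
probs = transition_probs(traces)
```

In this toy data, one of the three exits from TEST loops back to EDIT. A rising TEST → EDIT probability across variants would be one kind of directional signal an intervention could target.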

Six Pillars

The methodology rests on six pillars (the thesis, the pipeline, knowledge design, execution structure, evaluation, and behavioral analysis), plus the improvement flywheel that ties them into a single loop:

  • The Thesis. Knowledge + structured execution > model
  • Forge. Deterministic customization pipeline: Define → Forge → Run → Grow
  • KB Design. How to structure knowledge for agent consumption
  • SAE. Phased execution with checkpoints and guard rails
  • Evaluation. Four-tier cascaded jury: deterministic first, LLM last
  • Behavioral Analysis. Markov chain modeling of agent tool-call traces
  • Improvement Flywheel. Loss-driven iteration: from measured gaps to targeted interventions
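
The Evaluation pillar's "deterministic first, LLM last" ordering can be sketched as a short-circuiting cascade. Only the tier labels T0 to T3 come from this page. The individual checks, the `Verdict` shape, the stop-on-first-failure policy, and the stubbed LLM judge are all assumptions made for illustration, not the lab's actual jury.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verdict:
    tier: str
    passed: bool

llm_calls: List[str] = []  # records inputs that reached the stubbed LLM tier

def t0_nonempty(output: str) -> Verdict:
    # Hypothetical T0: cheapest possible check.
    return Verdict("T0", bool(output.strip()))

def t1_parses(output: str) -> Verdict:
    # Hypothetical T1: the output must be syntactically valid Python.
    try:
        ast.parse(output)
        return Verdict("T1", True)
    except SyntaxError:
        return Verdict("T1", False)

def t2_defines_function(output: str) -> Verdict:
    # Hypothetical T2: a structural check on the parsed tree.
    tree = ast.parse(output)
    has_def = any(isinstance(node, ast.FunctionDef) for node in ast.walk(tree))
    return Verdict("T2", has_def)

def t3_llm_stub(output: str) -> Verdict:
    # Placeholder for an LLM judge; a real one would call a model here.
    llm_calls.append(output)
    return Verdict("T3", True)

def run_cascade(output: str, judges: List[Callable[[str], Verdict]]) -> List[Verdict]:
    """Run judges cheapest-first and stop at the first failing tier."""
    verdicts = []
    for judge in judges:
        verdict = judge(output)
        verdicts.append(verdict)
        if not verdict.passed:
            break  # later, more expensive tiers are never invoked
    return verdicts

JUDGES = [t0_nonempty, t1_parses, t2_defines_function, t3_llm_stub]
bad = run_cascade("def broken(:", JUDGES)
good = run_cascade("def add(a, b):\n    return a + b\n", JUDGES)
```

The malformed input is rejected at T1, so the stubbed LLM tier is reached only by the well-formed one. That is the cost argument for ordering the tiers this way.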

Experiment Protocol

Every experiment follows the Improvement Flywheel cycle:
1. Define hypothesis. What are we testing? One variable, controlled comparison.
2. Discover state taxonomy. Run the control variant 3–5 times. Inspect tool-call clusters and define domain-specific behavioral states for Markov analysis.
3. Run control baseline. Run each variant at least three times (N ≥ 3) for statistical confidence, capturing full traces. Deterministic preprocessing routes to knowledge bases at zero LLM cost.
4. Measure with four-tier jury. Run the T0 → T1 → T2 → T3 cascade, cheap filters first. Capture per-criterion scores, not just aggregates.
5. Diagnose with behavioral analysis. Build Markov chains from tool-call sequences. Identify the dominant loss dimension. Classify loops before intervening.
6. Intervene: create next variant. Apply a targeted lever: prompt, knowledge, execution structure, model, or rubric. Each variant is motivated by the previous variant’s measurement.
7. Verify and iterate. Re-run and compare deltas. Check for regressions across all dimensions. Return to step 3 until loss plateaus.
8. Publish. Experiment page here in the lab, narrative on blog.pollack.ai, raw data on GitHub.
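
The "compare deltas" and "check for regressions across all dimensions" part of step 7 can be sketched as a per-criterion comparison between control and variant runs. The criterion names, the scores, the `tolerance` threshold, and both helper functions are invented for illustration; the lab's real rubric dimensions are not listed on this page.

```python
from statistics import mean

def mean_scores(runs):
    """Average each criterion over repeated runs of one variant."""
    criteria = runs[0].keys()
    return {c: mean(run[c] for run in runs) for c in criteria}

def compare(control_runs, variant_runs, tolerance=0.02):
    """Return per-criterion deltas and the criteria that regressed."""
    control = mean_scores(control_runs)
    variant = mean_scores(variant_runs)
    deltas = {c: variant[c] - control[c] for c in control}
    # A criterion regresses when it drops by more than the tolerance.
    regressions = sorted(c for c, d in deltas.items() if d < -tolerance)
    return deltas, regressions

# Hypothetical per-criterion jury scores, three runs per variant.
control_runs = [
    {"correctness": 0.70, "coverage": 0.60, "style": 0.90},
    {"correctness": 0.74, "coverage": 0.58, "style": 0.88},
    {"correctness": 0.72, "coverage": 0.62, "style": 0.92},
]
variant_runs = [
    {"correctness": 0.85, "coverage": 0.61, "style": 0.80},
    {"correctness": 0.83, "coverage": 0.59, "style": 0.82},
    {"correctness": 0.84, "coverage": 0.60, "style": 0.78},
]
deltas, regressions = compare(control_runs, variant_runs)
```

Here the variant improves correctness but loses ground on style, which an aggregate score could hide. Keeping per-criterion scores is what makes that regression visible.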

Key Metrics

| Metric | What It Measures | Source |
| --- | --- | --- |
| T3 Score | Overall quality (LLM jury assessment) | Agent Judge |
| Expected Steps | Efficiency (Markov fundamental matrix) | Agent Journal |
| P(success) | Reliability (absorbing chain probability) | Markov Analysis |
| Thrash Score | Behavioral loss (loop amplification in BUILD→TEST→EDIT) | Markov Analysis |
| Loop Amplification | Per-state revisit rate (friction vs. productive loops) | Markov Analysis |
| Regression Count | Dimensions that worsened after an intervention | Jury Comparison |
| Cost | Practical efficiency ($ per experiment run) | Token tracking |
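
Expected Steps and P(success) both follow from standard absorbing Markov chain results. With Q the transient-to-transient block and R the transient-to-absorbing block, the fundamental matrix is N = (I − Q)⁻¹, the expected steps to absorption are N·1, and the absorption probabilities are N·R. The sketch below solves the equivalent linear systems with exact fractions. The two transient states, the DONE and FAIL absorbing states, and every probability are made-up values for illustration, not measurements from the lab.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(a)
    # Augmented matrix [a | b], copied so the inputs stay untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [x / p for x in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]

# Hypothetical chain: transient states EDIT and TEST; absorbing DONE and FAIL.
states = ["EDIT", "TEST"]
Q = [
    [Fraction(0), Fraction(1)],      # EDIT always moves to TEST
    [Fraction(3, 10), Fraction(0)],  # TEST returns to EDIT 30% of the time
]
r_done = [Fraction(0), Fraction(6, 10)]  # TEST reaches DONE 60% of the time

n = len(Q)
i_minus_q = [
    [(Fraction(1) if i == j else Fraction(0)) - Q[i][j] for j in range(n)]
    for i in range(n)
]
expected_steps = solve(i_minus_q, [Fraction(1)] * n)  # equals N · 1
p_success = solve(i_minus_q, r_done)                  # DONE column of N · R
```

Starting from EDIT, this toy chain takes 20/7 (about 2.86) transitions on average before absorbing and reaches DONE with probability 6/7. Solving the systems directly avoids forming the inverse, which is a convenience of this sketch and not a claim about how the lab computes its metrics.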