Diagnostic Reasoning

What Diagnostic Reasoning Does

After an experiment runs and the jury produces verdicts, the diagnostic system answers: why did items fail, and what should change? It works in three stages:

1. Gap classification: map each failing verdict to a category (agent error, plan gap, missing knowledge, etc.)
2. Deterministic reasoning: apply rule-based logic to produce actionable fixes
3. LLM fallback: for checks the rules can’t resolve, an LLM analyzes execution traces and proposes new artifacts

ExperimentResult (jury verdicts)
        │
        ▼
  DiagnosticAnalyzer
   └─ GapClassifier: verdict → DiagnosticCheck (with GapCategory)
        │
        ▼
  DiagnosticReport (per-item checks, gap distribution)
        │
        ▼
  DiagnosticReasoner
   ├─ DeterministicReasoner → RemediationAction[]
   └─ LlmDiagnosticReasoner → RemediationAction[] + RemediationProposal[]
        │
        ▼
  RemediationReport (actions, proposals, unresolved checks)

Gap Categories

Every failing verdict check is classified into a gap category that identifies where in the system the problem lives:

Category	Meaning	Fix target
`AGENT_EXECUTION_GAP`	Plan was correct, agent didn’t follow it	Agent prompting
`PLAN_GENERATION_GAP`	Plan didn’t cover this pattern	Planner or planning model
`KB_GAP`	Knowledge base doesn’t cover this pattern	Add KB entry
`TOOL_GAP`	No deterministic tool handles this	Build new tool
`ANALYSIS_GAP`	Static analysis missed a signal	Improve analysis tools
`CRITERIA_GAP`	VERIFY criteria were redundant/ambiguous/missing	Criteria generation
`EVALUATION_GAP`	Jury itself is wrong (false positive/negative)	Judge calibration
`STOCHASTICITY_GAP`	Same config produces different outcomes across runs	Requires N≥3 runs

DiagnosticAnalyzer

Entry point for analysis. Takes an ExperimentResult and produces a DiagnosticReport:

DiagnosticAnalyzer analyzer = new DiagnosticAnalyzer(gapClassifier);
DiagnosticReport report = analyzer.analyze(experimentResult);

report.distribution().dominant();  // e.g., AGENT_EXECUTION_GAP
report.items();                    // per-item ItemDiagnostic list
report.recommendations();         // human-readable suggestions

GapClassifier

Assigns gap categories to verdict checks. The default HeuristicGapClassifier uses 22 judge-specific classification rules to map failures to categories based on the judge name, check content, and available analysis data.

GapClassifier classifier = new HeuristicGapClassifier();
List<DiagnosticCheck> checks = classifier.classify(verdict, analysisEnvelope, executionPlan);

DiagnosticReport

Field	Type	Description
`experimentId`	`String`	Experiment run ID
`items`	`List<ItemDiagnostic>`	Per-item classified checks with dominant gap
`distribution`	`GapDistribution`	Aggregate counts and fractions by category
`recommendations`	`List<String>`	Human-readable improvement suggestions

GapDistribution.dominant() returns the most frequent gap category — the highest-leverage fix target.
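
As a rough illustration of what a dominant-gap computation involves, the sketch below counts categories and picks the most frequent one. It is not the library's GapDistribution: the class name GapDistributionSketch, the trimmed Gap enum, and the tie-breaking rule (the earlier enum constant wins) are assumptions made for the example.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for GapDistribution; every name here is an assumption.
class GapDistributionSketch {
    // Trimmed copy of the gap categories, enough for the example.
    enum Gap { AGENT_EXECUTION_GAP, PLAN_GENERATION_GAP, KB_GAP, TOOL_GAP }

    // Count how often each category appears among the classified checks.
    static Map<Gap, Integer> count(List<Gap> checks) {
        Map<Gap, Integer> tally = new EnumMap<>(Gap.class);
        for (Gap gap : checks) {
            tally.merge(gap, 1, Integer::sum);
        }
        return tally;
    }

    // Share of all checks that fall into one category (0.0 when there are none).
    static double fraction(Map<Gap, Integer> tally, Gap gap) {
        int total = tally.values().stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0.0 : (double) tally.getOrDefault(gap, 0) / total;
    }

    // Most frequent category, or null for an empty tally.
    // Assumed tie-break: the first category encountered keeps the lead.
    static Gap dominant(Map<Gap, Integer> tally) {
        Gap best = null;
        for (Map.Entry<Gap, Integer> e : tally.entrySet()) {
            if (best == null || e.getValue() > tally.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<Gap, Integer> tally = count(List.of(
            Gap.AGENT_EXECUTION_GAP, Gap.KB_GAP,
            Gap.AGENT_EXECUTION_GAP, Gap.PLAN_GENERATION_GAP));
        System.out.println(dominant(tally));
        System.out.println(fraction(tally, Gap.KB_GAP));
    }
}
```

With two of four checks classified as AGENT_EXECUTION_GAP, that category is dominant and KB_GAP accounts for a quarter of the checks.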

DiagnosticReasoner

Transforms a DiagnosticReport into actionable remediation:

public interface DiagnosticReasoner {
    RemediationReport reason(DiagnosticReport report, ReasoningContext context);
}

ReasoningContext

Provides the full data menu for reasoning — analysis output, execution plan, trajectory exhaust, and file pointers:

Field	Type	Description
`analysis`	`AnalysisEnvelope`	Static analysis data (from pipeline)
`plan`	`ExecutionPlan`	Execution roadmap (from pipeline)
`availableTools`	`Set<String>`	Tools available to the agent
`phases`	`List<PhaseCapture>`	Agent thinking, tool calls, and results

Helper methods: unusedTools(), errorToolResults(), toolUsesByName(String).
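
The helper methods are only named here, so the following is a guess at their behavior rather than the real implementation. It assumes a simplified ToolCall record in place of PhaseCapture (whose shape this page does not show) and treats unusedTools() as a set difference between the available tools and the tools seen in the trace. The tool names other than pom-upgrader are made up.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; requires Java 16+ for records and Stream.toList().
class ReasoningContextSketch {
    // Assumed minimal shape of one tool call in the execution trace.
    record ToolCall(String toolName, boolean error) {}

    record Context(Set<String> availableTools, List<ToolCall> toolCalls) {
        // Tools offered to the agent that never appear in the trace.
        Set<String> unusedTools() {
            Set<String> unused = new TreeSet<>(availableTools);
            for (ToolCall call : toolCalls) {
                unused.remove(call.toolName());
            }
            return unused;
        }

        // Tool calls whose result was flagged as an error.
        List<ToolCall> errorToolResults() {
            return toolCalls.stream().filter(ToolCall::error).toList();
        }

        // All calls made to one named tool, in trace order.
        List<ToolCall> toolUsesByName(String name) {
            return toolCalls.stream().filter(c -> c.toolName().equals(name)).toList();
        }
    }

    public static void main(String[] args) {
        Context ctx = new Context(
            Set.of("maven-build", "pom-upgrader", "grep"),
            List.of(new ToolCall("maven-build", true),
                    new ToolCall("maven-build", false),
                    new ToolCall("grep", false)));
        System.out.println(ctx.unusedTools());
        System.out.println(ctx.errorToolResults().size());
        System.out.println(ctx.toolUsesByName("maven-build").size());
    }
}
```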

DeterministicReasoner

Rule-based reasoning with two rule categories:

Verdict rules (fire on failing judge checks):

Pattern-match on gap category and structured analysis data
Target specific components: planner-prompt, pom-upgrader, agent-prompt, dependency-analysis

Trajectory rules (fire on execution context regardless of judge outcomes):

Detect efficiency gaps where the agent recovered but deterministic tools could have prevented the problem
Examples: unused tools, implicit JDK dependencies, repeated build errors, format violations

DeterministicReasoner reasoner = new DeterministicReasoner();
RemediationReport report = reasoner.reason(diagnosticReport, context);

report.remediations();      // actionable fixes
report.unresolvedChecks();  // checks the rules couldn't resolve
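
To make the split between remediations and unresolved checks concrete, here is a minimal sketch of rule matching. It is not the actual DeterministicReasoner: the Check, Action, Rule, and Result records, the defaultRules() helper, the "knowledge-base" target, and the first-match-wins policy are all assumptions for illustration. Only the gap category and action type names come from this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical sketch of rule-based reasoning; requires Java 16+ for records.
class DeterministicRulesSketch {
    record Check(String judge, String gap, String message) {}
    record Action(String target, String actionType, String summary) {}
    record Rule(Predicate<Check> matches, Function<Check, Action> fix) {}
    record Result(List<Action> remediations, List<Check> unresolved) {}

    // Two example verdict rules keyed on the gap category.
    static List<Rule> defaultRules() {
        return List.of(
            new Rule(c -> c.gap().equals("KB_GAP"),
                     c -> new Action("knowledge-base", "ADD_KB_ENTRY", c.message())),
            new Rule(c -> c.gap().equals("AGENT_EXECUTION_GAP"),
                     c -> new Action("agent-prompt", "IMPROVE_PROMPT", c.message())));
    }

    // Apply the first matching rule to each check; checks with no match stay unresolved.
    static Result reason(List<Check> checks, List<Rule> rules) {
        List<Action> actions = new ArrayList<>();
        List<Check> unresolved = new ArrayList<>();
        for (Check check : checks) {
            Optional<Rule> rule = rules.stream()
                .filter(r -> r.matches().test(check))
                .findFirst();
            if (rule.isPresent()) {
                actions.add(rule.get().fix().apply(check));
            } else {
                unresolved.add(check);
            }
        }
        return new Result(actions, unresolved);
    }

    public static void main(String[] args) {
        Result result = reason(List.of(
            new Check("build-judge", "KB_GAP", "No entry for this migration pattern"),
            new Check("build-judge", "TOOL_GAP", "No tool rewrites this config")),
            defaultRules());
        System.out.println(result.remediations());
        System.out.println(result.unresolved());
    }
}
```

In this sketch the TOOL_GAP check matches no rule, so it ends up in the unresolved list, which is the input an LLM fallback would receive.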

LlmDiagnosticReasoner

Handles checks that deterministic rules can’t resolve. Analyzes execution traces (thinking, tool calls, results) and produces:

RemediationActions — fixes with LLM_INFERRED confidence
RemediationProposals — new deterministic artifacts (rules, KB entries, tool specs) for the flywheel

public interface LlmDiagnosticReasoner {
    LlmReasoningResult reasonUnresolved(
        List<DiagnosticCheck> unresolvedChecks, ReasoningContext context);
}

CompositeDiagnosticReasoner

Chains deterministic and LLM reasoning:

CompositeDiagnosticReasoner reasoner = new CompositeDiagnosticReasoner(
    new DeterministicReasoner(), llmReasoner);

RemediationReport report = reasoner.reason(diagnosticReport, context);

1. The deterministic layer runs first (fast, proof-based).
2. If unresolved checks remain and an LLM fallback is available, they are forwarded to the LLM.
3. The results are merged into a single RemediationReport.

If deterministic reasoning resolves all checks, the LLM is never called.
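
The chaining logic can be sketched in a few lines. This is an illustrative reduction, not the real CompositeDiagnosticReasoner: checks and actions are plain strings, and the Outcome record and both interfaces are hypothetical names. The counter in main shows that the fallback is skipped when nothing is left unresolved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of deterministic-first chaining; requires Java 16+ for records.
class CompositeSketch {
    record Outcome(List<String> actions, List<String> unresolved) {}

    interface Deterministic { Outcome reason(List<String> checks); }
    interface LlmFallback { Outcome reasonUnresolved(List<String> unresolved); }

    static Outcome reason(List<String> checks, Deterministic rules, LlmFallback llm) {
        Outcome first = rules.reason(checks);
        // Skip the LLM when everything is resolved or no fallback is configured.
        if (first.unresolved().isEmpty() || llm == null) {
            return first;
        }
        Outcome second = llm.reasonUnresolved(first.unresolved());
        // Merge: deterministic actions first, then the LLM-inferred ones.
        List<String> merged = new ArrayList<>(first.actions());
        merged.addAll(second.actions());
        return new Outcome(merged, second.unresolved());
    }

    public static void main(String[] args) {
        AtomicInteger llmCalls = new AtomicInteger();
        LlmFallback llm = unresolved -> {
            llmCalls.incrementAndGet();
            return new Outcome(List.of("llm-fix"), List.of());
        };
        Deterministic resolvesAll = checks -> new Outcome(List.of("rule-fix"), List.of());

        System.out.println(reason(List.of("check-1"), resolvesAll, llm).actions());
        System.out.println(llmCalls.get());
    }
}
```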

RemediationReport

Field	Type	Description
`experimentId`	`String`	Experiment run ID
`remediations`	`List<RemediationAction>`	Actionable fixes, highest-impact first
`proposals`	`List<RemediationProposal>`	New artifacts proposed by LLM
`unresolvedChecks`	`List<DiagnosticCheck>`	Checks neither layer could resolve

RemediationAction

Each action targets a specific component and carries a confidence level:

Field	Type	Description
`target`	`String`	Component to fix (e.g., “pom-upgrader”, “agent-prompt”)
`actionType`	`ActionType`	`ADD_RULE`, `ENHANCE_TOOL`, `IMPROVE_PROMPT`, `ADD_KB_ENTRY`, `ENHANCE_ANALYSIS`, `CALIBRATE_JUDGE`, `MANUAL_INVESTIGATION`
`summary`	`String`	One-line description
`detail`	`String`	Full explanation
`confidence`	`Confidence`	`DETERMINISTIC`, `HEURISTIC`, or `LLM_INFERRED`

RemediationProposal

LLM-discovered patterns that can become new deterministic infrastructure:

Field	Type	Description
`proposalType`	`ProposalType`	`NEW_REASONER_RULE`, `KB_ENTRY_DRAFT`, `TOOL_ENHANCEMENT`, `PROMPT_PATCH`, `NEW_TOOL_SPEC`, `JUDGE_CALIBRATION`
`target`	`String`	Component the proposal targets
`proposalMarkdown`	`String`	Full specification for review
`confidence`	`Confidence`	Always `LLM_INFERRED`

The Flywheel

RemediationProposals are the flywheel mechanism. When the LLM discovers a novel failure pattern:

1. It creates a RemediationProposal (e.g., a new deterministic reasoner rule).
2. A human reviews and applies the proposal.
3. The new rule is added to DeterministicReasoner.
4. On the next run, that pattern is resolved deterministically: faster, cheaper, and with higher confidence.

Over time, the LLM fallback is invoked less as more patterns move into deterministic rules.

Cross-Run Aggregation

DiagnosticAggregator analyzes multiple DiagnosticReport instances from repeated runs to detect stochasticity:

DiagnosticAggregator aggregator = new DiagnosticAggregator();
AggregatedDiagnostic agg = aggregator.aggregate(List.of(report1, report2, report3));

agg.stochasticItems();    // items with different dominant gaps across runs
agg.stableItems();        // items that fail consistently for the same reason
agg.stabilityFraction();  // fraction of items that are stable

An item classified as AGENT_EXECUTION_GAP in one run and PLAN_GENERATION_GAP in another is flagged as stochastic. Stochastic items need N≥3 runs before you can draw reliable conclusions. Stable items are immediately actionable.
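
The stable/stochastic distinction can be illustrated with a small sketch. It does not reproduce DiagnosticAggregator: each run is reduced to a hypothetical map from item ID to that run's dominant gap, and the class and method names are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of cross-run stability detection; requires Java 16+ for Stream.toList().
class AggregationSketch {
    // Collect, per item, the set of dominant gaps observed across all runs.
    static Map<String, Set<String>> gapsByItem(List<Map<String, String>> runs) {
        Map<String, Set<String>> seen = new TreeMap<>();
        for (Map<String, String> run : runs) {
            run.forEach((item, gap) ->
                seen.computeIfAbsent(item, k -> new TreeSet<>()).add(gap));
        }
        return seen;
    }

    // Items whose dominant gap differed in at least two runs.
    static List<String> stochasticItems(List<Map<String, String>> runs) {
        return gapsByItem(runs).entrySet().stream()
            .filter(e -> e.getValue().size() > 1)
            .map(Map.Entry::getKey)
            .toList();
    }

    // Fraction of items with the same dominant gap in every run (0.0 if there are no items).
    static double stabilityFraction(List<Map<String, String>> runs) {
        Map<String, Set<String>> seen = gapsByItem(runs);
        if (seen.isEmpty()) {
            return 0.0;
        }
        long stable = seen.values().stream().filter(gaps -> gaps.size() == 1).count();
        return (double) stable / seen.size();
    }

    public static void main(String[] args) {
        List<Map<String, String>> runs = List.of(
            Map.of("item-a", "AGENT_EXECUTION_GAP", "item-b", "KB_GAP"),
            Map.of("item-a", "PLAN_GENERATION_GAP", "item-b", "KB_GAP"),
            Map.of("item-a", "AGENT_EXECUTION_GAP", "item-b", "KB_GAP"));
        System.out.println(stochasticItems(runs));
        System.out.println(stabilityFraction(runs));
    }
}
```

Here item-a flips between two categories and is flagged as stochastic, while item-b fails for the same reason in all three runs, so half of the items are stable.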

Efficiency Evaluation

EfficiencyEvaluator scores execution efficiency across four metrics:

Metric	Weight	What it measures
`buildErrors`	0.35	How many build errors occurred before success
`toolUtilization`	0.25	Fraction of available tools actually used
`cost`	0.20	LLM cost relative to a configurable ceiling
`recoveryCycles`	0.20	How many error-recovery loops the agent needed

EfficiencyConfig config = new EfficiencyConfig(5.0, defaultWeights, 8);
EfficiencyReport report = evaluator.evaluate(result, context, config);

report.compositeScore();  // weighted average [0, 1] where 1.0 = perfect
report.checks();          // per-metric breakdown

Metrics degrade gracefully: if the data for a metric is missing, that metric is omitted from the report rather than causing the evaluation to fail.
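
One way a weighted average can tolerate a missing metric is to renormalize over the weights that are present. The sketch below uses the weights from the table, but the renormalization itself is an assumption: this page does not say how EfficiencyEvaluator treats the weight of an omitted metric. The class and method names are hypothetical, and per-metric scores are assumed to already lie in [0, 1].

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a composite score that skips missing metrics.
class EfficiencyScoreSketch {
    static final Map<String, Double> WEIGHTS = weights();

    // Weights as listed in the metrics table.
    private static Map<String, Double> weights() {
        Map<String, Double> w = new LinkedHashMap<>();
        w.put("buildErrors", 0.35);
        w.put("toolUtilization", 0.25);
        w.put("cost", 0.20);
        w.put("recoveryCycles", 0.20);
        return w;
    }

    // Weighted average over the metrics that have a score.
    // Assumption: weights of missing metrics are dropped and the rest renormalized.
    static double compositeScore(Map<String, Double> scores) {
        double weighted = 0.0;
        double weightSum = 0.0;
        for (Map.Entry<String, Double> e : WEIGHTS.entrySet()) {
            Double score = scores.get(e.getKey());
            if (score == null) {
                continue; // metric omitted: no data available
            }
            weighted += e.getValue() * score;
            weightSum += e.getValue();
        }
        return weightSum == 0.0 ? 0.0 : weighted / weightSum;
    }

    public static void main(String[] args) {
        // The cost metric is missing in this example.
        Map<String, Double> scores = Map.of(
            "buildErrors", 0.5, "toolUtilization", 1.0, "recoveryCycles", 0.75);
        System.out.println(compositeScore(scores));
    }
}
```

With cost missing, the example averages over a total weight of 0.80: (0.35 × 0.5 + 0.25 × 1.0 + 0.20 × 0.75) / 0.80, which is roughly 0.719.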

Behavioral Diagnostics (Markov Analysis)

Gap classification and remediation operate on judge verdicts — the outcome layer. A complementary diagnostic lens operates on tool-call sequences — the behavioral layer. The agent-experiment-template includes Markov chain analysis scripts that reveal how the agent behaves, not just what it produces.
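
For readers unfamiliar with the underlying technique, the sketch below estimates first-order transition probabilities from a sequence of tool-call states. It is written in Java for consistency with the rest of this page and is not the template's script. The class name and the DONE state are assumptions, and it does not attempt to reproduce the template's loop amplification formula, which is not given here.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: estimate Markov transition probabilities from a state sequence.
class TransitionSketch {
    // Count each observed transition from one state to the next.
    static Map<String, Map<String, Integer>> transitionCounts(List<String> states) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (int i = 0; i + 1 < states.size(); i++) {
            counts.computeIfAbsent(states.get(i), k -> new TreeMap<>())
                  .merge(states.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    // Observed share of departures from 'from' that went to 'to' (0.0 if 'from' was never left).
    static double probability(Map<String, Map<String, Integer>> counts, String from, String to) {
        Map<String, Integer> row = counts.getOrDefault(from, Map.of());
        int total = row.values().stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0.0 : (double) row.getOrDefault(to, 0) / total;
    }

    public static void main(String[] args) {
        List<String> trace = List.of("BUILD", "FIX", "BUILD", "FIX", "BUILD", "DONE");
        Map<String, Map<String, Integer>> counts = transitionCounts(trace);
        System.out.println(counts);
        System.out.println(probability(counts, "BUILD", "FIX"));
    }
}
```

In this trace two of the three departures from BUILD lead to FIX, and every FIX leads back to BUILD, which is the kind of back-and-forth pattern a loop analysis would look for.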

Loop amplification

The Markov analysis computes loop amplification — how many times the agent revisits a state before moving forward. High amplification indicates friction or failure loops:

Signal	Diagnosis	Intervention
Amplification > 2.0 on BUILD→FIX	Agent in a fix loop — build fails, fix fails, rebuild fails	Knowledge (add fix patterns) or execution structure (pre-validate)
Amplification > 2.0 on SEARCH states	Agent searching for something it can’t find	Knowledge (add the target information)
Amplification > 2.0 on EXPLORE	Agent reading many files without progress	Prompt (clarify task decomposition) or pre-analysis script

Loop types

Not all loops are problems. Classify before intervening:

Loop type	Pattern	Action
Productive	WRITE → VERIFY → FIX → VERIFY	Leave it alone
Friction	SEARCH → READ → SEARCH → READ	Add knowledge or routing
Failure	BUILD → FIX → BUILD → FIX (same error)	Change strategy, not retry count
Diagnostic	BUILD → ERROR → READ_LOG → FIX	Leave it alone
Degenerate	EXPLORE → EXPLORE → EXPLORE	Agent is stuck — intervene

Interpretation output

The template’s make_markov_analysis.py writes analysis/markov-interpretation.md with:

Per-variant loop amplification summary with threshold-based classification
Recommended intervention lever for each high-amplification state
Suggested next variant with a hypothesis template

This connects the behavioral DIAGNOSE step to the flywheel’s INTERVENE step — the interpretation tells you which lever to pull based on measured state-transition patterns. See the Improvement Flywheel for the full methodology.

Related

Pipeline

Analyze, plan, and execute — the three pipeline phases

Jury System

Three-tier evaluation: deterministic, structural, semantic
