What Diagnostic Reasoning Does
After an experiment runs and the jury produces verdicts, the diagnostic system answers: why did items fail, and what should change?
It works in three stages:
1. Gap classification: map each failing verdict to a category (agent error, plan gap, missing knowledge, etc.)
2. Deterministic reasoning: apply rule-based logic to produce actionable fixes
3. LLM fallback: for checks the rules can’t resolve, an LLM analyzes execution traces and proposes new artifacts
ExperimentResult (jury verdicts)
        │
        ▼
DiagnosticAnalyzer
  └─ GapClassifier: verdict → DiagnosticCheck (with GapCategory)
        │
        ▼
DiagnosticReport (per-item checks, gap distribution)
        │
        ▼
DiagnosticReasoner
  ├─ DeterministicReasoner → RemediationAction[]
  └─ LlmDiagnosticReasoner → RemediationAction[] + RemediationProposal[]
        │
        ▼
RemediationReport (actions, proposals, unresolved checks)
Gap Categories
Every failing verdict check is classified into a gap category that identifies where in the system the problem lives:
| Category | Meaning | Fix target |
| --- | --- | --- |
| AGENT_EXECUTION_GAP | Plan was correct, agent didn’t follow it | Agent prompting |
| PLAN_GENERATION_GAP | Plan didn’t cover this pattern | Planner or planning model |
| KB_GAP | Knowledge base doesn’t cover this pattern | Add KB entry |
| TOOL_GAP | No deterministic tool handles this | Build new tool |
| ANALYSIS_GAP | Static analysis missed a signal | Improve analysis tools |
| CRITERIA_GAP | VERIFY criteria were redundant/ambiguous/missing | Criteria generation |
| EVALUATION_GAP | Jury itself is wrong (false positive/negative) | Judge calibration |
| STOCHASTICITY_GAP | Same config produces different outcomes across runs | Requires N≥3 runs |
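The category table can be read as a simple lookup from category to fix target. The enum below is a minimal sketch of that lookup; the type name GapCategorySketch and the fixTarget() accessor are assumptions for illustration, and the real GapCategory type may look different.

```java
// Hypothetical mirror of the gap-category table above.
// The enum name and the fixTarget() accessor are illustrative assumptions.
enum GapCategorySketch {
    AGENT_EXECUTION_GAP("Agent prompting"),
    PLAN_GENERATION_GAP("Planner or planning model"),
    KB_GAP("Add KB entry"),
    TOOL_GAP("Build new tool"),
    ANALYSIS_GAP("Improve analysis tools"),
    CRITERIA_GAP("Criteria generation"),
    EVALUATION_GAP("Judge calibration"),
    STOCHASTICITY_GAP("Requires N>=3 runs");

    private final String fixTarget;

    GapCategorySketch(String fixTarget) {
        this.fixTarget = fixTarget;
    }

    // Where the table says the fix for this category belongs.
    String fixTarget() {
        return fixTarget;
    }
}
```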
DiagnosticAnalyzer
Entry point for analysis. Takes an ExperimentResult and produces a DiagnosticReport:
DiagnosticAnalyzer analyzer = new DiagnosticAnalyzer(gapClassifier);
DiagnosticReport report = analyzer.analyze(experimentResult);
report.distribution().dominant(); // e.g., AGENT_EXECUTION_GAP
report.items();                   // per-item ItemDiagnostic list
report.recommendations();         // human-readable suggestions
GapClassifier
Assigns gap categories to verdict checks. The default HeuristicGapClassifier uses 22 judge-specific classification rules to map failures to categories based on the judge name, check content, and available analysis data.
GapClassifier classifier = new HeuristicGapClassifier();
List<DiagnosticCheck> checks = classifier.classify(verdict, analysisEnvelope, executionPlan);
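To make the idea of judge-specific rules concrete, here is a minimal sketch of first-match rule classification. Everything in it is an assumption for illustration: the judge names, the message fragments, the reduced Gap enum, and the choice to return an empty Optional when no rule matches. It is not the actual rule set of HeuristicGapClassifier.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BiPredicate;

// Illustrative only: rule contents and judge names are invented for this sketch.
final class HeuristicRuleSketch {

    // Reduced, hypothetical subset of gap categories.
    enum Gap { AGENT_EXECUTION_GAP, PLAN_GENERATION_GAP, KB_GAP }

    // One rule: a predicate over (judge name, check message) plus the category it assigns.
    record Rule(BiPredicate<String, String> matches, Gap category) {}

    private static final List<Rule> RULES = List.of(
        new Rule((judge, msg) -> judge.equals("build-judge") && msg.contains("plan step skipped"),
                 Gap.AGENT_EXECUTION_GAP),
        new Rule((judge, msg) -> msg.contains("not covered by plan"),
                 Gap.PLAN_GENERATION_GAP),
        new Rule((judge, msg) -> msg.contains("unknown migration pattern"),
                 Gap.KB_GAP));

    // Rules are tried in order and the first match wins.
    // An empty result means no rule recognized the failure.
    static Optional<Gap> classify(String judgeName, String checkMessage) {
        for (Rule rule : RULES) {
            if (rule.matches().test(judgeName, checkMessage)) {
                return Optional.of(rule.category());
            }
        }
        return Optional.empty();
    }
}
```

Keeping the rules in an ordered list makes precedence explicit: a more specific rule placed earlier shadows a broader one placed later.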
DiagnosticReport
| Field | Type | Description |
| --- | --- | --- |
| experimentId | String | Experiment run ID |
| items | List<ItemDiagnostic> | Per-item classified checks with dominant gap |
| distribution | GapDistribution | Aggregate counts and fractions by category |
| recommendations | List<String> | Human-readable improvement suggestions |
GapDistribution.dominant() returns the most frequent gap category — the highest-leverage fix target.
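As a rough illustration of how counts, fractions, and a dominant category relate, the sketch below tallies category names and picks the most frequent one. The class name, the use of plain strings for categories, the first-seen tie-break, and the null result for an empty input are all assumptions; the real GapDistribution may behave differently.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for GapDistribution, using category names as strings.
final class GapDistributionSketch {
    private final Map<String, Integer> counts = new LinkedHashMap<>();
    private int total;

    GapDistributionSketch(List<String> categories) {
        for (String category : categories) {
            counts.merge(category, 1, Integer::sum);
            total++;
        }
    }

    // Share of all classified checks that fall into the given category.
    double fraction(String category) {
        return total == 0 ? 0.0 : counts.getOrDefault(category, 0) / (double) total;
    }

    // Most frequent category. Ties go to the category seen first (an assumption);
    // returns null when nothing was classified.
    String dominant() {
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```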
DiagnosticReasoner
Transforms a DiagnosticReport into actionable remediation:
public interface DiagnosticReasoner {
    RemediationReport reason(DiagnosticReport report, ReasoningContext context);
}
ReasoningContext
Provides the full data menu for reasoning — analysis output, execution plan, trajectory exhaust, and file pointers:
| Field | Type | Description |
| --- | --- | --- |
| analysis | AnalysisEnvelope | Static analysis data (from pipeline) |
| plan | ExecutionPlan | Execution roadmap (from pipeline) |
| availableTools | Set<String> | Tools available to the agent |
| phases | List<PhaseCapture> | Agent thinking, tool calls, and results |
Helper methods: unusedTools(), errorToolResults(), toolUsesByName(String).
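The helper names suggest simple queries over the recorded tool calls. The sketch below shows one plausible reading, assuming a flattened ToolCall record with a name and an error flag. That record, the class name, and the return types are hypothetical, since the shape of PhaseCapture is not shown on this page.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical stand-in for the helper methods on ReasoningContext.
final class ReasoningContextSketch {

    // Assumed flattened view of one tool call taken from the captured phases.
    record ToolCall(String name, boolean error) {}

    private final Set<String> availableTools;
    private final List<ToolCall> calls;

    ReasoningContextSketch(Set<String> availableTools, List<ToolCall> calls) {
        this.availableTools = availableTools;
        this.calls = calls;
    }

    // Tools the agent could have used but never called.
    Set<String> unusedTools() {
        Set<String> unused = new TreeSet<>(availableTools);
        for (ToolCall call : calls) {
            unused.remove(call.name());
        }
        return unused;
    }

    // Tool calls whose result was an error.
    List<ToolCall> errorToolResults() {
        return calls.stream().filter(ToolCall::error).collect(Collectors.toList());
    }

    // Every call made to the named tool, in call order.
    List<ToolCall> toolUsesByName(String name) {
        return calls.stream().filter(c -> c.name().equals(name)).collect(Collectors.toList());
    }
}
```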
DeterministicReasoner
Rule-based reasoning with two rule categories:
Verdict rules (fire on failing judge checks):
- Pattern-match on gap category and structured analysis data
- Target specific components: planner-prompt, pom-upgrader, agent-prompt, dependency-analysis

Trajectory rules (fire on execution context regardless of judge outcomes):
- Detect efficiency gaps where the agent recovered but deterministic tools could have prevented the problem
- Examples: unused tools, implicit JDK dependencies, repeated build errors, format violations
DeterministicReasoner reasoner = new DeterministicReasoner();
RemediationReport report = reasoner.reason(diagnosticReport, context);
report.remediations();     // actionable fixes
report.unresolvedChecks(); // checks the rules couldn't resolve
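A trajectory rule inspects what happened during execution rather than what the jury concluded. As a hedged illustration of the "repeated build errors" example, the sketch below flags any error message seen at least a given number of times. The class name, the threshold parameter, and the suggestion text are assumptions, not the rule DeterministicReasoner actually ships.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical trajectory rule: detect build errors the agent hit repeatedly.
final class RepeatedBuildErrorRuleSketch {

    // Returns one suggestion per error message seen at least `threshold` times.
    static List<String> apply(List<String> buildErrors, int threshold) {
        Map<String, Integer> seen = new LinkedHashMap<>();
        for (String error : buildErrors) {
            seen.merge(error, 1, Integer::sum);
        }
        List<String> suggestions = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : seen.entrySet()) {
            if (entry.getValue() >= threshold) {
                suggestions.add("Repeated build error (" + entry.getValue() + "x): " + entry.getKey());
            }
        }
        return suggestions;
    }
}
```

The rule fires even when the run eventually succeeded, which matches the stated purpose of trajectory rules: surfacing problems the agent recovered from but a deterministic tool could have prevented.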
LlmDiagnosticReasoner
Handles checks that deterministic rules can’t resolve. Analyzes execution traces (thinking, tool calls, results) and produces:
- RemediationActions: fixes with LLM_INFERRED confidence
- RemediationProposals: new deterministic artifacts (rules, KB entries, tool specs) for the flywheel
public interface LlmDiagnosticReasoner {
    LlmReasoningResult reasonUnresolved(
        List<DiagnosticCheck> unresolvedChecks, ReasoningContext context);
}
CompositeDiagnosticReasoner
Chains deterministic and LLM reasoning:
CompositeDiagnosticReasoner reasoner = new CompositeDiagnosticReasoner(
    new DeterministicReasoner(), llmReasoner);
RemediationReport report = reasoner.reason(diagnosticReport, context);
1. Deterministic layer runs first (fast, proof-based)
2. If unresolved checks remain and an LLM fallback is available, forward them to the LLM
3. Merge results into a single RemediationReport
If deterministic reasoning resolves all checks, the LLM is never called.
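The chaining behavior can be sketched with two functions standing in for the two layers. This is an illustration under stated assumptions, not the CompositeDiagnosticReasoner source: checks and fixes are plain strings, LayerResult is a hypothetical type, and a null fallback represents "no LLM available".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of deterministic-first reasoning with an optional LLM fallback.
final class CompositeSketch {

    // Output of one layer: the fixes it produced and the checks it could not resolve.
    record LayerResult(List<String> remediations, List<String> unresolved) {}

    static LayerResult reason(List<String> checks,
                              Function<List<String>, LayerResult> deterministic,
                              Function<List<String>, LayerResult> llmFallback) {
        // Step 1: the deterministic layer always runs first.
        LayerResult first = deterministic.apply(checks);

        // Step 2: skip the LLM when nothing is left over or no fallback is configured.
        if (first.unresolved().isEmpty() || llmFallback == null) {
            return first;
        }
        LayerResult second = llmFallback.apply(first.unresolved());

        // Step 3: merge both layers, deterministic fixes first.
        List<String> merged = new ArrayList<>(first.remediations());
        merged.addAll(second.remediations());
        return new LayerResult(merged, second.unresolved());
    }
}
```

Passing only the unresolved checks to the fallback keeps the expensive layer's input as small as possible.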
The resulting RemediationReport carries these fields:

| Field | Type | Description |
| --- | --- | --- |
| experimentId | String | Experiment run ID |
| remediations | List<RemediationAction> | Actionable fixes, highest-impact first |
| proposals | List<RemediationProposal> | New artifacts proposed by LLM |
| unresolvedChecks | List<DiagnosticCheck> | Checks neither layer could resolve |
Each RemediationAction targets a specific component and carries a confidence level:

| Field | Type | Description |
| --- | --- | --- |
| target | String | Component to fix (e.g., “pom-upgrader”, “agent-prompt”) |
| actionType | ActionType | ADD_RULE, ENHANCE_TOOL, IMPROVE_PROMPT, ADD_KB_ENTRY, ENHANCE_ANALYSIS, CALIBRATE_JUDGE, MANUAL_INVESTIGATION |
| summary | String | One-line description |
| detail | String | Full explanation |
| confidence | Confidence | DETERMINISTIC, HEURISTIC, or LLM_INFERRED |
A RemediationProposal describes an LLM-discovered pattern that can become new deterministic infrastructure:

| Field | Type | Description |
| --- | --- | --- |
| proposalType | ProposalType | NEW_REASONER_RULE, KB_ENTRY_DRAFT, TOOL_ENHANCEMENT, PROMPT_PATCH, NEW_TOOL_SPEC, JUDGE_CALIBRATION |
| target | String | Component the proposal targets |
| proposalMarkdown | String | Full specification for review |
| confidence | Confidence | Always LLM_INFERRED |
The Flywheel
RemediationProposals are the flywheel mechanism. When the LLM discovers a novel failure pattern:
1. It creates a RemediationProposal (e.g., a new deterministic reasoner rule)
2. A human reviews and applies the proposal
3. The new rule is added to DeterministicReasoner
4. On the next run, that pattern is resolved deterministically: faster, cheaper, and with higher confidence
Over time, the LLM fallback is invoked less as more patterns move into deterministic rules.
Cross-Run Aggregation
DiagnosticAggregator analyzes multiple DiagnosticReport instances from repeated runs to detect stochasticity:
DiagnosticAggregator aggregator = new DiagnosticAggregator();
AggregatedDiagnostic agg = aggregator.aggregate(List.of(report1, report2, report3));
agg.stochasticItems();   // items with different dominant gaps across runs
agg.stableItems();       // items that fail consistently for the same reason
agg.stabilityFraction(); // fraction of items that are stable
An item classified as AGENT_EXECUTION_GAP in one run and PLAN_GENERATION_GAP in another is flagged as stochastic. Stochastic items need N≥3 runs before you can draw reliable conclusions. Stable items are immediately actionable.
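The stochasticity check described above amounts to asking whether an item's dominant gap is the same in every run. The sketch below shows that comparison, assuming each run is reduced to a map from item ID to dominant gap name. The class name and this input shape are hypothetical, and the sketch simplifies one case: an item missing from some runs is still counted as stable if its gap agrees wherever it does appear.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical cross-run comparison; each run maps item ID -> dominant gap category.
final class StabilitySketch {

    // Items whose dominant gap differs between at least two runs.
    static Set<String> stochasticItems(List<Map<String, String>> runs) {
        Map<String, Set<String>> gapsByItem = new LinkedHashMap<>();
        for (Map<String, String> run : runs) {
            for (Map.Entry<String, String> entry : run.entrySet()) {
                gapsByItem.computeIfAbsent(entry.getKey(), k -> new LinkedHashSet<>())
                          .add(entry.getValue());
            }
        }
        Set<String> stochastic = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> entry : gapsByItem.entrySet()) {
            if (entry.getValue().size() > 1) {
                stochastic.add(entry.getKey());
            }
        }
        return stochastic;
    }

    // Fraction of observed items that are stable; 1.0 when no items were observed.
    static double stabilityFraction(List<Map<String, String>> runs) {
        Set<String> allItems = new LinkedHashSet<>();
        for (Map<String, String> run : runs) {
            allItems.addAll(run.keySet());
        }
        if (allItems.isEmpty()) {
            return 1.0;
        }
        return 1.0 - stochasticItems(runs).size() / (double) allItems.size();
    }
}
```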
Efficiency Evaluation
EfficiencyEvaluator scores execution efficiency across four metrics:
| Metric | Weight | What it measures |
| --- | --- | --- |
| buildErrors | 0.35 | How many build errors occurred before success |
| toolUtilization | 0.25 | Fraction of available tools actually used |
| cost | 0.20 | LLM cost relative to a configurable ceiling |
| recoveryCycles | 0.20 | How many error-recovery loops the agent needed |
EfficiencyConfig config = new EfficiencyConfig(5.0, defaultWeights, 8);
EfficiencyReport report = evaluator.evaluate(result, context, config);
report.compositeScore(); // weighted average [0, 1] where 1.0 = perfect
report.checks();         // per-metric breakdown
Metrics gracefully degrade — if data for a metric is missing, the metric is omitted rather than failing.
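One way to combine a weighted average with omitted metrics is to renormalize by the weights of the metrics that are present. The sketch below does that using the weights from the table. The renormalization itself, the class name, and the 0.0 result when no metric is available are assumptions; the page does not say how EfficiencyEvaluator handles the remaining weights.

```java
import java.util.Map;

// Hypothetical composite score: weighted average over the metrics that have data.
final class EfficiencyScoreSketch {

    // Weights taken from the metrics table above.
    static final Map<String, Double> WEIGHTS = Map.of(
        "buildErrors", 0.35,
        "toolUtilization", 0.25,
        "cost", 0.20,
        "recoveryCycles", 0.20);

    // scores: metric name -> score in [0, 1]; a metric without data is simply absent.
    static double composite(Map<String, Double> scores) {
        double weighted = 0.0;
        double weightSum = 0.0;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            Double weight = WEIGHTS.get(entry.getKey());
            if (weight == null) {
                continue; // unknown metric names are ignored
            }
            weighted += weight * entry.getValue();
            weightSum += weight;
        }
        // Renormalize by the weights actually used (an assumption of this sketch).
        return weightSum == 0.0 ? 0.0 : weighted / weightSum;
    }
}
```

With renormalization, a run that scores perfectly on every metric it has data for still reaches 1.0, so missing data neither rewards nor penalizes the run.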