ExperimentConfig
ExperimentConfig.builder()
    .experimentName("name")                  // Required: experiment identifier
    .datasetDir(Path.of("dataset"))          // Required: dataset directory
    .model("sonnet")                         // Required: LLM model
    .promptTemplate("{{task}}")              // Required: prompt with placeholders
    .perItemTimeout(Duration.ofMinutes(2))   // Required: per-item timeout
    .itemFilter(ItemFilter.bucket("A"))      // Optional: filter items
    .knowledgeBaseDir(Path.of("kb"))         // Optional: KB root
    .outputDir(Path.of("results"))           // Optional: persist workspaces
    .experimentTimeout(Duration.ofHours(1))  // Optional: overall timeout
    .metadata(Map.of("key", "value"))        // Optional: arbitrary metadata
    .baselineId("abc123")                    // Optional: comparison baseline
    .build();
AgentInvoker
Single-method interface — implement this to plug in any agent:
public interface AgentInvoker {
    InvocationResult invoke(InvocationContext context)
        throws AgentInvocationException;
}
Contract:
- Blocking: returns when agent completes, times out, or fails
- Thread-safe: callable from multiple threads
- NOT responsible for: timeout enforcement, workspace setup, result tracking
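The contract above can be illustrated with a minimal, stateless invoker. This is only a sketch: the nested `InvocationContext`, `InvocationResult`, `AgentInvocationException`, and `AgentInvoker` types are simplified stand-ins declared locally so the example compiles on its own, and `EchoInvoker` and `runOnce` are hypothetical names. The real library types carry more fields and use factory methods.

```java
import java.nio.file.Path;
import java.time.Duration;
import java.util.Map;

final class AgentInvokerSketch {

    // Simplified stand-ins for the library types (assumptions, not the real definitions).
    record InvocationContext(Path workspacePath, String prompt, String model,
                             Duration timeout, Map<String, String> metadata) {}

    record InvocationResult(String status, long durationMs, String errorMessage) {}

    static class AgentInvocationException extends Exception {
        AgentInvocationException(String message) { super(message); }
    }

    interface AgentInvoker {
        InvocationResult invoke(InvocationContext context) throws AgentInvocationException;
    }

    // Holds no mutable state, so concurrent calls from several threads are safe.
    static final class EchoInvoker implements AgentInvoker {
        @Override
        public InvocationResult invoke(InvocationContext context) throws AgentInvocationException {
            if (context.prompt() == null || context.prompt().isBlank()) {
                throw new AgentInvocationException("prompt must not be blank");
            }
            long start = System.nanoTime();
            // A real agent would work inside context.workspacePath() here and block
            // until it finishes. Timeout enforcement is left to the caller.
            long durationMs = (System.nanoTime() - start) / 1_000_000;
            return new InvocationResult("COMPLETED", durationMs, null);
        }
    }

    // Invokes the sketch once and reports the outcome as a string.
    static String runOnce(String prompt) {
        InvocationContext ctx = new InvocationContext(Path.of("workspace"), prompt, "sonnet",
                Duration.ofMinutes(2), Map.of("itemId", "ITEM-001"));
        try {
            return new EchoInvoker().invoke(ctx).status();
        } catch (AgentInvocationException e) {
            return "ERROR: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnce("Rename the class"));
    }
}
```

Keeping the invoker free of shared mutable state is the simplest way to satisfy the thread-safety requirement; an invoker that caches sessions or clients would need its own synchronization.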
InvocationContext
What the runner passes to your agent:
| Field | Type | Description |
|---|---|---|
| workspacePath | Path | Directory where agent operates |
| prompt | String | Fully constructed prompt |
| systemPrompt | String | Optional additional system instructions |
| model | String | Model identifier |
| timeout | Duration | Timeout hint |
| metadata | Map | Pass-through (experimentId, itemId, etc.) |
| runDir | Path | Optional directory for trace artifacts |
InvocationResult
What your agent returns:
// Success
InvocationResult.completed(phases, inputTokens, outputTokens,
    thinkingTokens, totalCostUsd, durationMs, sessionId, metadata);
// Timeout
InvocationResult.timeout(durationMs, metadata, errorMessage);
// Error
InvocationResult.error(errorMessage, metadata);
| Field | Type | Description |
|---|---|---|
| success | boolean | Agent completed without error |
| status | TerminalStatus | COMPLETED, ERROR, TIMEOUT |
| inputTokens | int | Total input tokens consumed |
| outputTokens | int | Total output tokens produced |
| totalCostUsd | double | Estimated cost |
| durationMs | long | Wall-clock execution time |
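As an illustration of how these fields might be consumed, the sketch below totals tokens and cost across several results. The nested `InvocationResult` record is a local stand-in carrying only the fields from the table above, and `Summary`, `summarize`, and `sample` are hypothetical helpers, not library API.

```java
import java.util.List;

final class InvocationSummarySketch {

    enum TerminalStatus { COMPLETED, ERROR, TIMEOUT }

    // Stand-in with only the tabulated fields (assumption: the real type has more).
    record InvocationResult(boolean success, TerminalStatus status, int inputTokens,
                            int outputTokens, double totalCostUsd, long durationMs) {}

    record Summary(long totalTokens, double totalCostUsd, long completed) {}

    // Adds up tokens and cost, and counts results whose status is COMPLETED.
    static Summary summarize(List<InvocationResult> results) {
        long tokens = 0;
        double cost = 0.0;
        long completed = 0;
        for (InvocationResult r : results) {
            tokens += (long) r.inputTokens() + r.outputTokens();
            cost += r.totalCostUsd();
            if (r.status() == TerminalStatus.COMPLETED) {
                completed++;
            }
        }
        return new Summary(tokens, cost, completed);
    }

    // Illustrative data only.
    static List<InvocationResult> sample() {
        return List.of(
            new InvocationResult(true, TerminalStatus.COMPLETED, 1000, 200, 0.25, 1_500L),
            new InvocationResult(false, TerminalStatus.TIMEOUT, 500, 0, 0.5, 120_000L));
    }

    public static void main(String[] args) {
        System.out.println(summarize(sample()));
    }
}
```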
ExecutionDetail
Marker interface that decouples shared experiment infrastructure (ComparisonEngine, ResultStore, VerdictExtractor) from domain-specific per-item execution details.
public interface ExecutionDetail {
    // Marker: shared infrastructure stores it but never interprets it
}
| Implementation | Used By | Contains |
|---|---|---|
| InvocationResult | AgentExperiment | Agent invocation output, tokens, cost, phases |
| JudgeExecutionDetail | JudgeExperiment | Candidate judgment, expected label, scorer result |
ItemResult.executionDetail() returns @Nullable ExecutionDetail. Consumers use instanceof pattern matching to access domain-specific fields:
if (item.executionDetail() instanceof InvocationResult inv) {
    System.out.println("Cost: $" + inv.totalCostUsd());
}
dataset.json
{
  "schemaVersion": 1,
  "name": "dataset-name",
  "version": "1.0.0",
  "description": "What this dataset tests",
  "items": [
    {
      "id": "ITEM-001",
      "slug": "short-description",
      "path": "items/ITEM-001",
      "bucket": "A",
      "taskType": "task-type",
      "status": "active"
    }
  ]
}
item.json
{
  "schemaVersion": 1,
  "id": "ITEM-001",
  "slug": "short-description",
  "developerTask": "Natural language task description",
  "taskType": "task-type",
  "bucket": "A",
  "noChange": false,
  "knowledgeRefs": ["path/to/kb-entry.md"],
  "tags": ["tag1", "tag2"],
  "status": "active"
}
Directory layout
dataset/
├── dataset.json
└── items/
    └── ITEM-001/
        ├── item.json
        ├── before/        # Starting state
        │   └── src/...
        └── reference/     # Correct result
            └── src/...
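The layout above implies how an item's files can be located from the `path` entry in dataset.json. The sketch below assumes that `path` is relative to the dataset directory, which matches the example values but is not stated explicitly. `ItemPaths` and `resolve` are hypothetical names, not library API.

```java
import java.nio.file.Path;

final class DatasetLayoutSketch {

    // Locations derived from the directory layout shown above.
    record ItemPaths(Path itemJson, Path before, Path reference) {}

    // Resolves an item's "path" value against the dataset root.
    static ItemPaths resolve(Path datasetDir, String itemPath) {
        Path root = datasetDir.normalize();
        Path itemDir = root.resolve(itemPath).normalize();
        // Guard against entries such as "../x" that would leave the dataset directory.
        if (!itemDir.startsWith(root)) {
            throw new IllegalArgumentException("item path escapes dataset dir: " + itemPath);
        }
        return new ItemPaths(
            itemDir.resolve("item.json"),
            itemDir.resolve("before"),
            itemDir.resolve("reference"));
    }

    public static void main(String[] args) {
        System.out.println(resolve(Path.of("dataset"), "items/ITEM-001"));
    }
}
```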
ItemFilter
ItemFilter.all() // No filtering
ItemFilter.bucket("A") // Single bucket
ItemFilter.tags("rename", "simple") // By tags
ItemFilter.id("ITEM-001") // Single item
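One way to think about these factories is as predicates over dataset items. The sketch below models them with `java.util.function.Predicate` and a local stand-in `Item` record. It is not the library's implementation, and it assumes that `tags(...)` matches an item carrying at least one of the given tags; the actual semantics (any versus all) are not specified here.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

final class ItemFilterSketch {

    // Stand-in for a dataset item with only the fields the filters inspect.
    record Item(String id, String bucket, Set<String> tags) {}

    static Predicate<Item> all() {
        return item -> true;
    }

    static Predicate<Item> bucket(String bucket) {
        return item -> bucket.equals(item.bucket());
    }

    // Assumption: an item matches when it has at least one of the wanted tags.
    static Predicate<Item> tags(String... wanted) {
        Set<String> wantedSet = Set.copyOf(List.of(wanted));
        return item -> item.tags().stream().anyMatch(wantedSet::contains);
    }

    static Predicate<Item> id(String id) {
        return item -> id.equals(item.id());
    }

    // Applies a filter and returns the matching item IDs in dataset order.
    static List<String> select(List<Item> items, Predicate<Item> filter) {
        return items.stream().filter(filter).map(Item::id).toList();
    }

    // Illustrative data only.
    static List<Item> sample() {
        return List.of(
            new Item("ITEM-001", "A", Set.of("rename", "simple")),
            new Item("ITEM-002", "B", Set.of("extract")));
    }

    public static void main(String[] args) {
        System.out.println(select(sample(), bucket("A")));
    }
}
```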
ResultStore
| Implementation | Use case |
|---|---|
| FileSystemResultStore(path) | Production: persists to disk |
| InMemoryResultStore() | Testing: HashMap-backed |
Both implement:
void save(ExperimentResult result);
Optional<ExperimentResult> load(String id);
List<ExperimentResult> listByName(String experimentName);
Optional<ExperimentResult> mostRecent(String experimentName);
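To make the four operations concrete, here is a minimal map-backed store in the spirit of the in-memory implementation. It is a sketch under assumptions: `ExperimentResult` is reduced to a two-field stand-in, `MapBackedStore` is a hypothetical name, and "most recent" is taken to mean "most recently saved", since the ordering used by the real stores is not described here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class InMemoryStoreSketch {

    // Stand-in with only the fields the store needs for lookups.
    record ExperimentResult(String experimentId, String experimentName) {}

    interface ResultStore {
        void save(ExperimentResult result);
        Optional<ExperimentResult> load(String id);
        List<ExperimentResult> listByName(String experimentName);
        Optional<ExperimentResult> mostRecent(String experimentName);
    }

    static final class MapBackedStore implements ResultStore {
        // LinkedHashMap keeps insertion order, used here as the recency order.
        private final Map<String, ExperimentResult> byId = new LinkedHashMap<>();

        @Override
        public synchronized void save(ExperimentResult result) {
            // Remove first so that re-saving an ID moves it to the end.
            byId.remove(result.experimentId());
            byId.put(result.experimentId(), result);
        }

        @Override
        public synchronized Optional<ExperimentResult> load(String id) {
            return Optional.ofNullable(byId.get(id));
        }

        @Override
        public synchronized List<ExperimentResult> listByName(String experimentName) {
            List<ExperimentResult> matches = new ArrayList<>();
            for (ExperimentResult r : byId.values()) {
                if (r.experimentName().equals(experimentName)) {
                    matches.add(r);
                }
            }
            return matches;
        }

        @Override
        public synchronized Optional<ExperimentResult> mostRecent(String experimentName) {
            List<ExperimentResult> matches = listByName(experimentName);
            return matches.isEmpty()
                ? Optional.empty()
                : Optional.of(matches.get(matches.size() - 1));
        }
    }

    public static void main(String[] args) {
        MapBackedStore store = new MapBackedStore();
        store.save(new ExperimentResult("run-1", "baseline"));
        store.save(new ExperimentResult("run-2", "baseline"));
        System.out.println(store.mostRecent("baseline"));
    }
}
```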
ExperimentResult
| Method | Type | Description |
|---|---|---|
| experimentId() | String | Unique run ID |
| experimentName() | String | Experiment name from config |
| items() | List<ItemResult> | Per-item results |
| passCount() | int | Items that passed all judges |
| failCount() | int | Items that failed |
| passRate() | double | Pass count / total (0.0–1.0) |
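The relationship between the three counters can be sketched as follows. The records are local stand-ins, each item is reduced to a single `passed` flag, and returning 0.0 for an empty result is an assumption made here to avoid dividing by zero; the library may handle that case differently.

```java
import java.util.List;

final class PassRateSketch {

    // Stand-in: the real ItemResult carries verdicts and execution detail.
    record ItemResult(String itemId, boolean passed) {}

    record ExperimentResult(String experimentId, String experimentName, List<ItemResult> items) {

        int passCount() {
            return (int) items.stream().filter(ItemResult::passed).count();
        }

        int failCount() {
            return items.size() - passCount();
        }

        // Pass count divided by total, in the range 0.0 to 1.0.
        double passRate() {
            return items.isEmpty() ? 0.0 : (double) passCount() / items.size();
        }
    }

    public static void main(String[] args) {
        ExperimentResult result = new ExperimentResult("run-1", "demo", List.of(
            new ItemResult("ITEM-001", true),
            new ItemResult("ITEM-002", false)));
        System.out.println(result.passRate());
    }
}
```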
Re-Evaluation
Re-evaluate stored experiment results with a different jury without re-invoking the system under test.
ReEvaluationContextFactory
Functional interface that reconstructs a JudgmentContext from a stored ItemResult:
@FunctionalInterface
public interface ReEvaluationContextFactory {
    Optional<JudgmentContext> create(ItemResult item);
}
Returns Optional.empty() when re-evaluation is not possible (failed item, missing execution detail, workspace not preserved).
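A custom factory could follow the same rules. The sketch below uses local stand-ins for `ItemResult` and `JudgmentContext` (the real types have different shapes) and a hypothetical constant `WORKSPACE_BACKED`. It returns an empty `Optional` for failed items and for items whose workspace directory is missing.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

final class ReEvalFactorySketch {

    // Simplified stand-ins (assumptions, not the library's records).
    record ItemResult(String itemId, boolean success, Path workspacePath) {}

    record JudgmentContext(String itemId, Path workspace) {}

    @FunctionalInterface
    interface ReEvaluationContextFactory {
        Optional<JudgmentContext> create(ItemResult item);
    }

    // Rebuilds a context only when the item succeeded and its workspace still exists.
    static final ReEvaluationContextFactory WORKSPACE_BACKED = item -> {
        if (!item.success()) {
            return Optional.empty(); // failed item
        }
        Path workspace = item.workspacePath();
        if (workspace == null || !Files.isDirectory(workspace)) {
            return Optional.empty(); // workspace not preserved
        }
        return Optional.of(new JudgmentContext(item.itemId(), workspace));
    };

    public static void main(String[] args) {
        ItemResult item = new ItemResult("ITEM-001", true, Path.of("."));
        System.out.println(WORKSPACE_BACKED.create(item).isPresent());
    }
}
```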
AgentReEvaluationContextFactory
Default implementation for agent experiment results. Pattern-matches on InvocationResult to reconstruct the context:
ReEvaluationContextFactory factory =
    AgentReEvaluationContextFactory.defaults();
Maps TerminalStatus to ExecutionStatus (COMPLETED → SUCCESS, TIMEOUT → TIMEOUT, ERROR → FAILED). Preserves original costUsd and totalTokens.
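The status mapping described above can be written as a small switch. Both enums are declared locally as stand-ins so the sketch compiles by itself; the real `ExecutionStatus` may define additional values, and `toExecutionStatus` is a hypothetical helper name.

```java
final class StatusMappingSketch {

    enum TerminalStatus { COMPLETED, ERROR, TIMEOUT }

    // Stand-in listing only the three values named in the mapping.
    enum ExecutionStatus { SUCCESS, TIMEOUT, FAILED }

    // COMPLETED -> SUCCESS, TIMEOUT -> TIMEOUT, ERROR -> FAILED.
    static ExecutionStatus toExecutionStatus(TerminalStatus status) {
        return switch (status) {
            case COMPLETED -> ExecutionStatus.SUCCESS;
            case TIMEOUT -> ExecutionStatus.TIMEOUT;
            case ERROR -> ExecutionStatus.FAILED;
        };
    }

    public static void main(String[] args) {
        for (TerminalStatus status : TerminalStatus.values()) {
            System.out.println(status + " -> " + toExecutionStatus(status));
        }
    }
}
```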
ReEvaluator
Orchestrates post-hoc re-scoring of stored experiment results:
ReEvaluator reEvaluator = ReEvaluator.builder()
    .contextFactory(AgentReEvaluationContextFactory.defaults())
    .resultStore(store)
    .build();
ExperimentResult reScored = reEvaluator.reEvaluate(originalResult, newJury);
| Method | Description |
|---|---|
| reEvaluate(ExperimentResult, Jury) | Re-score a loaded result with a new jury |
| reEvaluate(String experimentId, Jury) | Load by ID, then re-score |
| agentDefaults(ResultStore) | Convenience factory with AgentReEvaluationContextFactory |
Re-evaluated results carry metadata: reEvaluated=true, systemReinvoked=false, originalCostUsd, reEvaluationJury, originalTimestamp. Skipped items carry reEvaluationSkipped=true with a reason.
Judge Experiment
Run a judge as the system under test against a labeled dataset to measure agreement rate.
JudgeScorer
Functional interface that scores a candidate judge’s Judgment against the expected label:
@FunctionalInterface
public interface JudgeScorer {
    JudgeScorerResult score(JudgeScoringInput input);
}
public record JudgeScoringInput(
    DatasetItem item,     // dataset item (for item-level context)
    Judgment actual,      // candidate judge's judgment
    String expectedLabel  // expected label from dataset
) {}
JudgeScorerResult
public record JudgeScorerResult(
    boolean match,    // judge agreed with expected label
    double score,     // normalized agreement score [0, 1]
    String reasoning  // explanation of match/mismatch
) {}
JudgeScorers
Built-in scoring implementations:
| Factory Method | Scoring Rule |
|---|---|
| exactVerdictMatch() | PASS/FAIL must exactly match expected "PASS"/"FAIL" label |
| exactCategoryMatch() | CategoricalScore value must match expected label (case-insensitive) |
| numericalTolerance(double) | NumericalScore within tolerance of expected numeric value |
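The scoring rules in the table can be approximated with plain values. The sketch below is not the library's implementation: it works on a raw `String` or `double` instead of a `Judgment`, assumes a binary score (1.0 on match, 0.0 otherwise), and treats a non-numeric expected label as a mismatch. `JudgeScorerResult` is redeclared locally so the example is self-contained.

```java
final class ScorerRulesSketch {

    record JudgeScorerResult(boolean match, double score, String reasoning) {}

    // Case-insensitive comparison of a categorical value with the expected label.
    static JudgeScorerResult exactCategoryMatch(String actual, String expectedLabel) {
        boolean match = actual.equalsIgnoreCase(expectedLabel);
        String reasoning = "actual=" + actual + ", expected=" + expectedLabel;
        return new JudgeScorerResult(match, match ? 1.0 : 0.0, reasoning);
    }

    // Matches when |actual - expected| is within the tolerance.
    static JudgeScorerResult numericalTolerance(double actual, String expectedLabel,
                                                double tolerance) {
        double expected;
        try {
            expected = Double.parseDouble(expectedLabel.trim());
        } catch (NumberFormatException e) {
            // Assumption: an unparseable label counts as a mismatch.
            return new JudgeScorerResult(false, 0.0,
                "expected label is not numeric: " + expectedLabel);
        }
        double diff = Math.abs(actual - expected);
        boolean match = diff <= tolerance;
        String reasoning = "diff=" + diff + ", tolerance=" + tolerance;
        return new JudgeScorerResult(match, match ? 1.0 : 0.0, reasoning);
    }

    public static void main(String[] args) {
        System.out.println(exactCategoryMatch("Pass", "PASS"));
        System.out.println(numericalTolerance(0.8, "0.75", 0.1));
    }
}
```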
JudgeExecutionDetail
Domain evidence preserved for each item:
public record JudgeExecutionDetail(
    Judgment candidateJudgment,
    String expectedLabel,
    JudgeScorerResult scorerResult
) implements ExecutionDetail {}
JudgeExperiment
Builder-based experiment runner where the system under test is a Judge:
JudgeExperimentResult result = JudgeExperiment.builder()
    .name("correctness-judge-calibration")
    .candidate(myCorrectnessJudge)
    .items(labeledItems)
    .input(item -> buildContextFromItem(item))
    .expected(item -> item.metadata().get("expectedVerdict"))
    .scorer(JudgeScorers.exactVerdictMatch())
    .resultStore(store)
    .build()
    .run();
| Builder Method | Required | Description |
|---|---|---|
| name(String) | Yes | Experiment name |
| candidate(Judge) | Yes | Judge to evaluate |
| items(List<DatasetItem>) | Yes | Labeled dataset items |
| input(Function<DatasetItem, JudgmentContext>) | Yes | Builds context from item |
| expected(Function<DatasetItem, String>) | Yes | Extracts expected label from item |
| scorer(JudgeScorer) | Yes | Scoring strategy |
| resultStore(ResultStore) | Yes | Persistence |
| datasetVersion(String) | No | Defaults to "1.0.0" |
Takes List<DatasetItem> directly — judge datasets do not require filesystem loading.
JudgeExperimentResult
public record JudgeExperimentResult(
    ExperimentResult experimentResult,
    double agreementRate,
    List<JudgeDisagreement> disagreements
) {
    // from(ExperimentResult) and asExperimentResult() omitted
}
| Method | Description |
|---|---|
| agreementRate() | Fraction of items where judge agreed with expected label |
| disagreements() | Items where judge disagreed |
| from(ExperimentResult) | Create from an ExperimentResult containing JudgeExecutionDetail items |
| asExperimentResult() | Unwrap for ComparisonEngine and ResultStore compatibility |
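The agreement rate and the disagreement list can both be derived from per-item scorer results. The sketch below uses local stand-in records and hypothetical names (`ScoredItem`, `Summary`, `summarize`), and assumes an empty input yields a rate of 0.0; it illustrates the arithmetic rather than the behavior of `from(ExperimentResult)`.

```java
import java.util.ArrayList;
import java.util.List;

final class AgreementRateSketch {

    record JudgeScorerResult(boolean match, double score, String reasoning) {}

    // Stand-in pairing an item ID with its scorer result.
    record ScoredItem(String itemId, JudgeScorerResult scorerResult) {}

    record Summary(double agreementRate, List<String> disagreementIds) {}

    // Agreement rate = items with match == true divided by total items.
    static Summary summarize(List<ScoredItem> items) {
        List<String> disagreements = new ArrayList<>();
        for (ScoredItem item : items) {
            if (!item.scorerResult().match()) {
                disagreements.add(item.itemId());
            }
        }
        int agreed = items.size() - disagreements.size();
        double rate = items.isEmpty() ? 0.0 : (double) agreed / items.size();
        return new Summary(rate, List.copyOf(disagreements));
    }

    // Illustrative data only: three agreements and one disagreement.
    static List<ScoredItem> sample() {
        return List.of(
            new ScoredItem("ITEM-001", new JudgeScorerResult(true, 1.0, "PASS matches PASS")),
            new ScoredItem("ITEM-002", new JudgeScorerResult(true, 1.0, "FAIL matches FAIL")),
            new ScoredItem("ITEM-003", new JudgeScorerResult(false, 0.0, "PASS vs FAIL")),
            new ScoredItem("ITEM-004", new JudgeScorerResult(true, 1.0, "PASS matches PASS")));
    }

    public static void main(String[] args) {
        System.out.println(summarize(sample()));
    }
}
```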
JudgeDisagreement
public record JudgeDisagreement(
    String itemId,
    JudgeExecutionDetail detail
) {}
Modules
<!-- Core: runner, dataset, jury, stores, diagnostics -->
<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>experiment-core</artifactId>
</dependency>
<!-- Claude SDK integration (ClaudeSdkInvoker, SemanticDiffJudge) -->
<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>experiment-claude</artifactId>
</dependency>