ExperimentConfig

ExperimentConfig.builder()
    .experimentName("name")                    // Required: experiment identifier
    .datasetDir(Path.of("dataset"))            // Required: dataset directory
    .model("sonnet")                           // Required: LLM model
    .promptTemplate("{{task}}")                // Required: prompt with placeholders
    .perItemTimeout(Duration.ofMinutes(2))     // Required: per-item timeout
    .itemFilter(ItemFilter.bucket("A"))        // Optional: filter items
    .knowledgeBaseDir(Path.of("kb"))           // Optional: KB root
    .outputDir(Path.of("results"))             // Optional: persist workspaces
    .experimentTimeout(Duration.ofHours(1))    // Optional: overall timeout
    .metadata(Map.of("key", "value"))          // Optional: arbitrary metadata
    .baselineId("abc123")                      // Optional: comparison baseline
    .build();

AgentInvoker

Single-method interface — implement this to plug in any agent:
public interface AgentInvoker {
    InvocationResult invoke(InvocationContext context)
        throws AgentInvocationException;
}
Contract:
  • Blocking: returns when agent completes, times out, or fails
  • Thread-safe: callable from multiple threads
  • NOT responsible for: timeout enforcement, workspace setup, result tracking
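The contract above can be sketched with a toy implementation. Everything in this block is a local stand-in declared for illustration: the trimmed InvocationContext and InvocationResult records, the Map<String, Object> metadata type, the EchoInvoker class, and the demo helper are assumptions, not the library's classes. The point is only the shape: invoke blocks until done, shares no unsynchronized state, and leaves timeout enforcement to the caller.

```java
import java.nio.file.Path;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

final class AgentInvokerSketch {

    // Local stand-ins, not the library's classes: trimmed to what this sketch needs.
    record InvocationContext(Path workspacePath, String prompt, String model,
                             Duration timeout, Map<String, Object> metadata) {}

    record InvocationResult(boolean success, String status, long durationMs,
                            Map<String, Object> metadata) {}

    static final class AgentInvocationException extends Exception {
        AgentInvocationException(String message) { super(message); }
    }

    interface AgentInvoker {
        InvocationResult invoke(InvocationContext context) throws AgentInvocationException;
    }

    // Hypothetical agent: does no real work, just reports completion.
    static final class EchoInvoker implements AgentInvoker {
        // The only shared state is an atomic counter, so concurrent calls are safe.
        private final AtomicLong callCount = new AtomicLong();

        @Override
        public InvocationResult invoke(InvocationContext context) throws AgentInvocationException {
            if (context.prompt() == null || context.prompt().isBlank()) {
                throw new AgentInvocationException("prompt is empty");
            }
            long start = System.nanoTime();
            callCount.incrementAndGet();
            // A real agent would block here until it finishes. context.timeout() is
            // only a hint; this sketch leaves enforcement to the caller.
            long durationMs = (System.nanoTime() - start) / 1_000_000;
            return new InvocationResult(true, "COMPLETED", durationMs, context.metadata());
        }

        long calls() { return callCount.get(); }
    }

    // Plays the runner's role for one item and turns a thrown exception into an error result.
    static InvocationResult demo(String prompt) {
        InvocationContext context = new InvocationContext(
                Path.of("workspace"), prompt, "sonnet",
                Duration.ofMinutes(2), Map.of("itemId", "ITEM-001"));
        try {
            return new EchoInvoker().invoke(context);
        } catch (AgentInvocationException e) {
            return new InvocationResult(false, "ERROR", 0L, Map.of("error", e.getMessage()));
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("Rename the Foo class to Bar").status());
        System.out.println(demo(" ").status());
    }
}
```

A real invoker would build its result through the InvocationResult factory methods documented on this page rather than a record constructor.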

InvocationContext

What the runner passes to your agent:

| Field | Type | Description |
|---|---|---|
| workspacePath | Path | Directory where agent operates |
| prompt | String | Fully constructed prompt |
| systemPrompt | String | Optional additional system instructions |
| model | String | Model identifier |
| timeout | Duration | Timeout hint |
| metadata | Map | Pass-through (experimentId, itemId, etc.) |
| runDir | Path | Optional directory for trace artifacts |

InvocationResult

What your agent returns:
// Success
InvocationResult.completed(phases, inputTokens, outputTokens,
    thinkingTokens, totalCostUsd, durationMs, sessionId, metadata);

// Timeout
InvocationResult.timeout(durationMs, metadata, errorMessage);

// Error
InvocationResult.error(errorMessage, metadata);

| Field | Type | Description |
|---|---|---|
| success | boolean | Agent completed without error |
| status | TerminalStatus | COMPLETED, ERROR, TIMEOUT |
| inputTokens | int | Total input tokens consumed |
| outputTokens | int | Total output tokens produced |
| totalCostUsd | double | Estimated cost |
| durationMs | long | Wall-clock execution time |

ExecutionDetail

Marker interface that decouples shared experiment infrastructure (ComparisonEngine, ResultStore, VerdictExtractor) from domain-specific per-item execution details.
public interface ExecutionDetail {
    // Marker — shared infrastructure stores but never interprets
}

| Implementation | Used By | Contains |
|---|---|---|
| InvocationResult | AgentExperiment | Agent invocation output, tokens, cost, phases |
| JudgeExecutionDetail | JudgeExperiment | Candidate judgment, expected label, scorer result |

ItemResult.executionDetail() returns @Nullable ExecutionDetail. Consumers use instanceof pattern matching to access domain-specific fields:
if (item.executionDetail() instanceof InvocationResult inv) {
    System.out.println("Cost: $" + inv.totalCostUsd());
}

Dataset Format

dataset.json

{
  "schemaVersion": 1,
  "name": "dataset-name",
  "version": "1.0.0",
  "description": "What this dataset tests",
  "items": [
    {
      "id": "ITEM-001",
      "slug": "short-description",
      "path": "items/ITEM-001",
      "bucket": "A",
      "taskType": "task-type",
      "status": "active"
    }
  ]
}

item.json

{
  "schemaVersion": 1,
  "id": "ITEM-001",
  "slug": "short-description",
  "developerTask": "Natural language task description",
  "taskType": "task-type",
  "bucket": "A",
  "noChange": false,
  "knowledgeRefs": ["path/to/kb-entry.md"],
  "tags": ["tag1", "tag2"],
  "status": "active"
}

Directory layout

dataset/
├── dataset.json
└── items/
    └── ITEM-001/
        ├── item.json
        ├── before/          # Starting state
        │   └── src/...
        └── reference/       # Correct result
            └── src/...
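
The layout above can be checked before a run with a small standalone helper. This is a sketch only: DatasetLayoutCheck is a hypothetical class, not part of the library, and it assumes every item directory must contain item.json, before/, and reference/. The page does not say whether any of these is optional (for example for noChange items), so treat the rules as placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

final class DatasetLayoutCheck {

    // Returns human-readable problems; an empty list means the directory
    // matches the tree shown above (under the assumptions stated in the lead-in).
    static List<String> problems(Path datasetDir) throws IOException {
        List<String> problems = new ArrayList<>();
        if (!Files.isRegularFile(datasetDir.resolve("dataset.json"))) {
            problems.add("missing dataset.json");
        }
        Path items = datasetDir.resolve("items");
        if (!Files.isDirectory(items)) {
            problems.add("missing items/ directory");
            return problems;
        }
        // Each child directory of items/ is treated as one dataset item.
        try (Stream<Path> children = Files.list(items)) {
            for (Path item : children.filter(Files::isDirectory).sorted().toList()) {
                String id = item.getFileName().toString();
                if (!Files.isRegularFile(item.resolve("item.json"))) {
                    problems.add(id + ": missing item.json");
                }
                if (!Files.isDirectory(item.resolve("before"))) {
                    problems.add(id + ": missing before/");
                }
                if (!Files.isDirectory(item.resolve("reference"))) {
                    problems.add(id + ": missing reference/");
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : "dataset");
        problems(root).forEach(System.out::println);
    }
}
```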

ItemFilter

ItemFilter.all()                     // No filtering
ItemFilter.bucket("A")               // Single bucket
ItemFilter.tags("rename", "simple")  // By tags
ItemFilter.id("ITEM-001")            // Single item

ResultStore

| Implementation | Use case |
|---|---|
| FileSystemResultStore(path) | Production: persists to disk |
| InMemoryResultStore() | Testing: HashMap-backed |

Both implement:
void save(ExperimentResult result);
Optional<ExperimentResult> load(String id);
List<ExperimentResult> listByName(String experimentName);
Optional<ExperimentResult> mostRecent(String experimentName);
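
To make the four methods concrete, here is a rough map-backed store in the spirit of the testing implementation. The types are local stand-ins: ExperimentResult is cut down to two fields, MapBackedStore is a hypothetical name, and "most recent" is taken to mean "saved last", which is an assumption. The library may order results by timestamp instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class ResultStoreSketch {

    // Stand-in with only the two fields this sketch needs.
    record ExperimentResult(String experimentId, String experimentName) {}

    interface ResultStore {
        void save(ExperimentResult result);
        Optional<ExperimentResult> load(String id);
        List<ExperimentResult> listByName(String experimentName);
        Optional<ExperimentResult> mostRecent(String experimentName);
    }

    static final class MapBackedStore implements ResultStore {
        // LinkedHashMap keeps insertion order, which stands in for recency here.
        private final Map<String, ExperimentResult> byId = new LinkedHashMap<>();

        @Override
        public synchronized void save(ExperimentResult result) {
            // Remove first so a re-saved ID moves to the end of the ordering.
            byId.remove(result.experimentId());
            byId.put(result.experimentId(), result);
        }

        @Override
        public synchronized Optional<ExperimentResult> load(String id) {
            return Optional.ofNullable(byId.get(id));
        }

        @Override
        public synchronized List<ExperimentResult> listByName(String experimentName) {
            List<ExperimentResult> matches = new ArrayList<>();
            for (ExperimentResult r : byId.values()) {
                if (r.experimentName().equals(experimentName)) {
                    matches.add(r);
                }
            }
            return matches;
        }

        @Override
        public synchronized Optional<ExperimentResult> mostRecent(String experimentName) {
            List<ExperimentResult> matches = listByName(experimentName);
            return matches.isEmpty()
                    ? Optional.empty()
                    : Optional.of(matches.get(matches.size() - 1));
        }
    }
}
```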

ExperimentResult

| Method | Type | Description |
|---|---|---|
| experimentId() | String | Unique run ID |
| experimentName() | String | Experiment name from config |
| items() | List<ItemResult> | Per-item results |
| passCount() | int | Items that passed all judges |
| failCount() | int | Items that failed |
| passRate() | double | Pass count / total (0.0–1.0) |

Re-Evaluation

Re-evaluate stored experiment results with a different jury without re-invoking the system under test.

ReEvaluationContextFactory

Functional interface that reconstructs a JudgmentContext from a stored ItemResult:
@FunctionalInterface
public interface ReEvaluationContextFactory {
    Optional<JudgmentContext> create(ItemResult item);
}
Returns Optional.empty() when re-evaluation is not possible (failed item, missing execution detail, workspace not preserved).

AgentReEvaluationContextFactory

Default implementation for agent experiment results. Pattern-matches on InvocationResult to reconstruct the context:
ReEvaluationContextFactory factory =
    AgentReEvaluationContextFactory.defaults();
Maps TerminalStatus to ExecutionStatus (COMPLETED → SUCCESS, TIMEOUT → TIMEOUT, ERROR → FAILED). Preserves the original costUsd and totalTokens.
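
The status mapping can be written as an exhaustive switch. The two enums below are local stand-ins holding only the constants named on this page. The real ExecutionStatus may define more values, and toExecutionStatus is a hypothetical helper name.

```java
final class StatusMappingSketch {

    // Stand-in enums limited to the constants this page mentions.
    enum TerminalStatus { COMPLETED, ERROR, TIMEOUT }

    enum ExecutionStatus { SUCCESS, TIMEOUT, FAILED }

    // Exhaustive switch: adding a TerminalStatus constant without a case
    // becomes a compile error rather than a silent fall-through.
    static ExecutionStatus toExecutionStatus(TerminalStatus status) {
        return switch (status) {
            case COMPLETED -> ExecutionStatus.SUCCESS;
            case TIMEOUT -> ExecutionStatus.TIMEOUT;
            case ERROR -> ExecutionStatus.FAILED;
        };
    }

    public static void main(String[] args) {
        for (TerminalStatus status : TerminalStatus.values()) {
            System.out.println(status + " -> " + toExecutionStatus(status));
        }
    }
}
```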

ReEvaluator

Orchestrates post-hoc re-scoring of stored experiment results:
ReEvaluator reEvaluator = ReEvaluator.builder()
    .contextFactory(AgentReEvaluationContextFactory.defaults())
    .resultStore(store)
    .build();

ExperimentResult reScored = reEvaluator.reEvaluate(originalResult, newJury);

| Method | Description |
|---|---|
| reEvaluate(ExperimentResult, Jury) | Re-score a loaded result with a new jury |
| reEvaluate(String experimentId, Jury) | Load by ID, then re-score |
| agentDefaults(ResultStore) | Convenience factory with AgentReEvaluationContextFactory |

Re-evaluated results carry metadata: reEvaluated=true, systemReinvoked=false, originalCostUsd, reEvaluationJury, originalTimestamp. Skipped items carry reEvaluationSkipped=true with a reason.

Judge Experiment

Run a judge as the system under test against a labeled dataset to measure agreement rate.

JudgeScorer

Functional interface that scores a candidate judge’s Judgment against the expected label:
@FunctionalInterface
public interface JudgeScorer {
    JudgeScorerResult score(JudgeScoringInput input);
}

JudgeScoringInput

public record JudgeScoringInput(
    DatasetItem item,        // dataset item (for item-level context)
    Judgment actual,         // candidate judge's judgment
    String expectedLabel     // expected label from dataset
)

JudgeScorerResult

public record JudgeScorerResult(
    boolean match,           // judge agreed with expected label
    double score,            // normalized agreement score [0, 1]
    String reasoning         // explanation of match/mismatch
)

JudgeScorers

Built-in scoring implementations:

| Factory Method | Scoring Rule |
|---|---|
| exactVerdictMatch() | PASS/FAIL must exactly match expected "PASS"/"FAIL" label |
| exactCategoryMatch() | CategoricalScore value must match expected label (case-insensitive) |
| numericalTolerance(double) | NumericalScore within tolerance of expected numeric value |
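
The category and tolerance rules can be illustrated on plain values. This sketch does not use the library's Judgment or score types: it works on a String and a double directly, JudgeScorerResult is a local copy of the record shape shown above, and the binary 1.0/0.0 score, the inclusive tolerance bound, and the handling of non-numeric labels are assumptions.

```java
final class ScoringRulesSketch {

    // Local copy of the documented record shape, for a self-contained example.
    record JudgeScorerResult(boolean match, double score, String reasoning) {}

    // Case-insensitive comparison of a categorical value against the label.
    static JudgeScorerResult categoryMatch(String actualCategory, String expectedLabel) {
        boolean match = actualCategory.equalsIgnoreCase(expectedLabel);
        String reasoning = match
                ? "category matches expected label " + expectedLabel
                : "expected " + expectedLabel + " but got " + actualCategory;
        return new JudgeScorerResult(match, match ? 1.0 : 0.0, reasoning);
    }

    // Numeric comparison: the label is parsed as a double, and the actual
    // value matches when it lies within +/- tolerance (inclusive).
    static JudgeScorerResult withinTolerance(double actual, String expectedLabel, double tolerance) {
        double expected;
        try {
            expected = Double.parseDouble(expectedLabel.trim());
        } catch (NumberFormatException e) {
            return new JudgeScorerResult(false, 0.0,
                    "expected label is not numeric: " + expectedLabel);
        }
        double delta = Math.abs(actual - expected);
        boolean match = delta <= tolerance;
        return new JudgeScorerResult(match, match ? 1.0 : 0.0,
                "difference " + delta + " vs tolerance " + tolerance);
    }

    public static void main(String[] args) {
        System.out.println(categoryMatch("Pass", "PASS"));
        System.out.println(withinTolerance(0.8, "0.75", 0.1));
    }
}
```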

JudgeExecutionDetail

Domain evidence preserved for each item:
public record JudgeExecutionDetail(
    Judgment candidateJudgment,
    String expectedLabel,
    JudgeScorerResult scorerResult
) implements ExecutionDetail

JudgeExperiment

Builder-based experiment runner where the system under test is a Judge:
JudgeExperimentResult result = JudgeExperiment.builder()
    .name("correctness-judge-calibration")
    .candidate(myCorrectnessJudge)
    .items(labeledItems)
    .input(item -> buildContextFromItem(item))
    .expected(item -> item.metadata().get("expectedVerdict"))
    .scorer(JudgeScorers.exactVerdictMatch())
    .resultStore(store)
    .build()
    .run();

| Builder Method | Required | Description |
|---|---|---|
| name(String) | Yes | Experiment name |
| candidate(Judge) | Yes | Judge to evaluate |
| items(List<DatasetItem>) | Yes | Labeled dataset items |
| input(Function<DatasetItem, JudgmentContext>) | Yes | Builds context from item |
| expected(Function<DatasetItem, String>) | Yes | Extracts expected label from item |
| scorer(JudgeScorer) | Yes | Scoring strategy |
| resultStore(ResultStore) | Yes | Persistence |
| datasetVersion(String) | No | Defaults to "1.0.0" |

Takes List<DatasetItem> directly; judge datasets do not require filesystem loading.

JudgeExperimentResult

public record JudgeExperimentResult(
    ExperimentResult experimentResult,
    double agreementRate,
    List<JudgeDisagreement> disagreements
)

| Method | Description |
|---|---|
| agreementRate() | Fraction of items where judge agreed with expected label |
| disagreements() | Items where judge disagreed |
| from(ExperimentResult) | Create from an ExperimentResult containing JudgeExecutionDetail items |
| asExperimentResult() | Unwrap for ComparisonEngine and ResultStore compatibility |
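
As a worked example of the agreement rate: with four labeled items of which three match, the rate is 3 / 4 = 0.75 and one item is reported as a disagreement. The sketch below shows that arithmetic on a simplified Scored record, a hypothetical stand-in for an item plus its scorer result. Returning 0.0 for an empty list is an assumption, since the page does not define that case.

```java
import java.util.List;

final class AgreementRateSketch {

    // Hypothetical stand-in: one scored item reduced to its ID and match flag.
    record Scored(String itemId, boolean match) {}

    // Fraction of items where the judge agreed with the expected label.
    static double agreementRate(List<Scored> scored) {
        if (scored.isEmpty()) {
            return 0.0; // assumption: empty input yields 0.0
        }
        long agreed = scored.stream().filter(Scored::match).count();
        return (double) agreed / scored.size();
    }

    // IDs of the items where the judge disagreed, in input order.
    static List<String> disagreements(List<Scored> scored) {
        return scored.stream()
                .filter(s -> !s.match())
                .map(Scored::itemId)
                .toList();
    }

    public static void main(String[] args) {
        List<Scored> scored = List.of(
                new Scored("ITEM-001", true),
                new Scored("ITEM-002", true),
                new Scored("ITEM-003", false),
                new Scored("ITEM-004", true));
        System.out.println(agreementRate(scored));  // 0.75
        System.out.println(disagreements(scored));  // [ITEM-003]
    }
}
```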

JudgeDisagreement

public record JudgeDisagreement(
    String itemId,
    JudgeExecutionDetail detail
)

Modules

<!-- Core: runner, dataset, jury, stores, diagnostics -->
<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>experiment-core</artifactId>
</dependency>

<!-- Claude SDK integration (ClaudeSdkInvoker, SemanticDiffJudge) -->
<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>experiment-claude</artifactId>
</dependency>