ExperimentConfig

ExperimentConfig.builder()
    .experimentName("name")                    // Required: experiment identifier
    .datasetDir(Path.of("dataset"))            // Required: dataset directory
    .model("sonnet")                           // Required: LLM model
    .promptTemplate("{{task}}")                // Required: prompt with placeholders
    .perItemTimeout(Duration.ofMinutes(2))     // Required: per-item timeout
    .itemFilter(ItemFilter.bucket("A"))        // Optional: filter items
    .knowledgeBaseDir(Path.of("kb"))           // Optional: KB root
    .outputDir(Path.of("results"))             // Optional: persist workspaces
    .experimentTimeout(Duration.ofHours(1))    // Optional: overall timeout
    .metadata(Map.of("key", "value"))          // Optional: arbitrary metadata
    .baselineId("abc123")                      // Optional: comparison baseline
    .build();

AgentInvoker

Single-method interface — implement this to plug in any agent:
public interface AgentInvoker {
    InvocationResult invoke(InvocationContext context)
        throws AgentInvocationException;
}
Contract:
  • Blocking: returns when agent completes, times out, or fails
  • Thread-safe: callable from multiple threads
  • NOT responsible for: timeout enforcement, workspace setup, result tracking
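The contract above can be sketched with simplified stand-in types. Note that InvocationContext, InvocationResult, and AgentInvocationException here are hypothetical minimal shapes for illustration, not the library's real classes (the exception is unchecked here to keep the sketch short):

```java
import java.util.Map;

// Simplified stand-ins for the library types, for illustration only.
record InvocationContext(String prompt, String model, Map<String, String> metadata) {}

record InvocationResult(boolean success, String output) {
    static InvocationResult completed(String output) { return new InvocationResult(true, output); }
    static InvocationResult error(String message)    { return new InvocationResult(false, message); }
}

class AgentInvocationException extends RuntimeException {
    AgentInvocationException(String message) { super(message); }
}

interface AgentInvoker {
    InvocationResult invoke(InvocationContext context) throws AgentInvocationException;
}

// A trivial invoker that "completes" by echoing the prompt. A real implementation
// would call an agent process or API here, blocking until it returns. It must be
// safe to call from multiple threads; this one is stateless, so it is.
class EchoInvoker implements AgentInvoker {
    @Override
    public InvocationResult invoke(InvocationContext context) throws AgentInvocationException {
        if (context.prompt() == null || context.prompt().isBlank()) {
            throw new AgentInvocationException("empty prompt");
        }
        return InvocationResult.completed("echo: " + context.prompt());
    }
}
```

The runner, not the invoker, handles timeouts and workspace setup, so an implementation like this stays focused on the single call to the agent.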

InvocationContext

What the runner passes to your agent:
| Field | Type | Description |
|---|---|---|
| workspacePath | Path | Directory where agent operates |
| prompt | String | Fully constructed prompt |
| systemPrompt | String | Optional additional system instructions |
| model | String | Model identifier |
| timeout | Duration | Timeout hint |
| metadata | Map | Pass-through (experimentId, itemId, etc.) |
| runDir | Path | Optional directory for trace artifacts |

InvocationResult

What your agent returns:
// Success
InvocationResult.completed(phases, inputTokens, outputTokens,
    thinkingTokens, totalCostUsd, durationMs, sessionId, metadata);

// Timeout
InvocationResult.timeout(durationMs, metadata, errorMessage);

// Error
InvocationResult.error(errorMessage, metadata);
| Field | Type | Description |
|---|---|---|
| success | boolean | Agent completed without error |
| status | TerminalStatus | COMPLETED, ERROR, TIMEOUT |
| inputTokens | int | Total input tokens consumed |
| outputTokens | int | Total output tokens produced |
| totalCostUsd | double | Estimated cost |
| durationMs | long | Wall-clock execution time |

Dataset Format

dataset.json

{
  "schemaVersion": 1,
  "name": "dataset-name",
  "version": "1.0.0",
  "description": "What this dataset tests",
  "items": [
    {
      "id": "ITEM-001",
      "slug": "short-description",
      "path": "items/ITEM-001",
      "bucket": "A",
      "taskType": "task-type",
      "status": "active"
    }
  ]
}

item.json

{
  "schemaVersion": 1,
  "id": "ITEM-001",
  "slug": "short-description",
  "developerTask": "Natural language task description",
  "taskType": "task-type",
  "bucket": "A",
  "noChange": false,
  "knowledgeRefs": ["path/to/kb-entry.md"],
  "tags": ["tag1", "tag2"],
  "status": "active"
}

Directory layout

dataset/
├── dataset.json
└── items/
    └── ITEM-001/
        ├── item.json
        ├── before/          # Starting state
        │   └── src/...
        └── reference/       # Correct result
            └── src/...

ItemFilter

ItemFilter.all()                     // No filtering
ItemFilter.bucket("A")               // Single bucket
ItemFilter.tags("rename", "simple")  // By tags
ItemFilter.id("ITEM-001")            // Single item
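These factories can be sketched as simple predicates over items. The Item record and the predicate-style ItemFilter below are hypothetical stand-ins for illustration, not the library's actual implementation:

```java
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical minimal item shape, for illustration.
record Item(String id, String bucket, Set<String> tags) {}

// Sketch of the filter factories as predicates over items:
// all() accepts everything; the others match on one field.
interface ItemFilter extends Predicate<Item> {
    static ItemFilter all()              { return item -> true; }
    static ItemFilter bucket(String b)   { return item -> item.bucket().equals(b); }
    static ItemFilter tags(String... ts) { return item -> item.tags().containsAll(Set.of(ts)); }
    static ItemFilter id(String id)      { return item -> item.id().equals(id); }
}
```

Modeled this way, tags(...) requires an item to carry every listed tag, while bucket(...) and id(...) match a single field exactly.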

ResultStore

| Implementation | Use case |
|---|---|
| FileSystemResultStore(path) | Production: persists to disk |
| InMemoryResultStore() | Testing: HashMap-backed |
Both implement:
void save(ExperimentResult result);
Optional<ExperimentResult> load(String id);
List<ExperimentResult> listByName(String experimentName);
Optional<ExperimentResult> mostRecent(String experimentName);
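A sketch of how the in-memory variant could back these four methods with a HashMap. The ExperimentResult record here is a simplified stand-in, and the createdAtMillis field used for recency ordering is an assumption:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Minimal stand-in for ExperimentResult, for illustration only.
record ExperimentResult(String experimentId, String experimentName, long createdAtMillis) {}

// HashMap-backed store sketch matching the four-method contract above.
class InMemoryResultStore {
    private final Map<String, ExperimentResult> byId = new HashMap<>();

    void save(ExperimentResult result) {
        byId.put(result.experimentId(), result);
    }

    Optional<ExperimentResult> load(String id) {
        return Optional.ofNullable(byId.get(id));
    }

    List<ExperimentResult> listByName(String experimentName) {
        return byId.values().stream()
                .filter(r -> r.experimentName().equals(experimentName))
                .sorted(Comparator.comparingLong(ExperimentResult::createdAtMillis))
                .toList();
    }

    Optional<ExperimentResult> mostRecent(String experimentName) {
        return byId.values().stream()
                .filter(r -> r.experimentName().equals(experimentName))
                .max(Comparator.comparingLong(ExperimentResult::createdAtMillis));
    }
}
```

Keying by experimentId makes load(id) O(1), while the name-based queries scan all stored results, which is acceptable for a test-oriented store.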

ExperimentResult

| Method | Type | Description |
|---|---|---|
| experimentId() | String | Unique run ID |
| experimentName() | String | Experiment name from config |
| items() | List&lt;ItemResult&gt; | Per-item results |
| passCount() | int | Items that passed all judges |
| failCount() | int | Items that failed |
| passRate() | double | Pass count / total (0.0–1.0) |
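The aggregate accessors follow directly from the per-item results. A sketch using a hypothetical ItemResult stand-in (the real class likely carries more fields):

```java
import java.util.List;

// Hypothetical per-item result, for illustration.
record ItemResult(String itemId, boolean passed) {}

// How the aggregates could derive from items(): pass/fail counts,
// and passRate() = passCount / total, in the range 0.0 to 1.0.
record ExperimentSummary(List<ItemResult> items) {
    int passCount() { return (int) items.stream().filter(ItemResult::passed).count(); }
    int failCount() { return items.size() - passCount(); }
    double passRate() { return items.isEmpty() ? 0.0 : (double) passCount() / items.size(); }
}
```

Guarding the empty-items case keeps passRate() from dividing by zero when an experiment produced no results.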

Modules

<!-- Core: runner, dataset, jury, stores, diagnostics -->
<dependency>
    <groupId>ai.tuvium</groupId>
    <artifactId>experiment-core</artifactId>
</dependency>

<!-- Claude SDK integration (ClaudeSdkInvoker, SemanticDiffJudge) -->
<dependency>
    <groupId>ai.tuvium</groupId>
    <artifactId>experiment-claude</artifactId>
</dependency>