ExperimentConfig
AgentInvoker
Single-method interface. Implement this to plug in any agent:

- Blocking: returns when the agent completes, times out, or fails
- Thread-safe: callable from multiple threads
- NOT responsible for: timeout enforcement, workspace setup, result tracking
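
The contract above can be sketched as follows. This is a minimal illustration, not the library's actual source: the method name `invoke`, the record shapes, the `Map<String, String>` parameterization, and the `StubAgent` class are all assumptions made for this example. Field names follow the `InvocationContext` and `InvocationResult` tables in this reference.

```java
import java.nio.file.Path;
import java.time.Duration;
import java.util.Map;

public class AgentInvokerSketch {

    // Status values as listed in the InvocationResult table.
    enum TerminalStatus { COMPLETED, ERROR, TIMEOUT }

    // Hypothetical record shapes; field names and types follow the tables in this
    // reference, and Map<String, String> is an assumed parameterization.
    record InvocationContext(Path workspacePath, String prompt, String systemPrompt,
                             String model, Duration timeout,
                             Map<String, String> metadata, Path runDir) {}

    record InvocationResult(boolean success, TerminalStatus status, int inputTokens,
                            int outputTokens, double totalCostUsd, long durationMs) {}

    // Assumed method name and signature; the source only says "single-method".
    @FunctionalInterface
    interface AgentInvoker {
        InvocationResult invoke(InvocationContext context);
    }

    // A stateless stand-in agent: it has no mutable fields, so concurrent calls are safe.
    static final class StubAgent implements AgentInvoker {
        @Override
        public InvocationResult invoke(InvocationContext context) {
            long start = System.nanoTime();
            TerminalStatus status;
            try {
                if (context.prompt() == null || context.prompt().isBlank()) {
                    throw new IllegalArgumentException("prompt must not be blank");
                }
                // A real agent would do its work inside context.workspacePath() here.
                status = TerminalStatus.COMPLETED;
            } catch (RuntimeException e) {
                // Assumption: failures are reported through the result, not rethrown.
                status = TerminalStatus.ERROR;
            }
            long durationMs = (System.nanoTime() - start) / 1_000_000;
            return new InvocationResult(status == TerminalStatus.COMPLETED, status,
                    0, 0, 0.0, durationMs);
        }
    }

    public static void main(String[] args) {
        InvocationContext ctx = new InvocationContext(
                Path.of("workspace"), "Fix the failing test", null,
                "example-model", Duration.ofMinutes(5),
                Map.of("itemId", "item-1"), null);
        InvocationResult result = new StubAgent().invoke(ctx);
        System.out.println(result.status() + " success=" + result.success());
    }
}
```

Keeping the implementation free of shared mutable state is one simple way to satisfy the thread-safety requirement; an agent that needs state would have to synchronize it.
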
InvocationContext
What the runner passes to your agent:

| Field | Type | Description |
|---|---|---|
| `workspacePath` | `Path` | Directory where the agent operates |
| `prompt` | `String` | Fully constructed prompt |
| `systemPrompt` | `String` | Optional additional system instructions |
| `model` | `String` | Model identifier |
| `timeout` | `Duration` | Timeout hint |
| `metadata` | `Map` | Pass-through (experimentId, itemId, etc.) |
| `runDir` | `Path` | Optional directory for trace artifacts |
InvocationResult
What your agent returns:

| Field | Type | Description |
|---|---|---|
| `success` | `boolean` | Agent completed without error |
| `status` | `TerminalStatus` | `COMPLETED`, `ERROR`, or `TIMEOUT` |
| `inputTokens` | `int` | Total input tokens consumed |
| `outputTokens` | `int` | Total output tokens produced |
| `totalCostUsd` | `double` | Estimated cost in USD |
| `durationMs` | `long` | Wall-clock execution time in milliseconds |
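
Because the invoker treats `timeout` as a hint and is not responsible for enforcing it, the caller has to bound the call itself. How the runner actually does this is not shown in this reference; the sketch below is one common approach using `java.util.concurrent`, with the hypothetical helper `runWithTimeout` mapping each outcome to a `TerminalStatus`.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {

    // Status values as listed in the InvocationResult table.
    enum TerminalStatus { COMPLETED, ERROR, TIMEOUT }

    // Hypothetical helper: runs a blocking agent call on a worker thread
    // and maps the outcome to a terminal status.
    static TerminalStatus runWithTimeout(Callable<Void> agentCall, Duration timeout) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Void> future = executor.submit(agentCall);
        try {
            future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            return TerminalStatus.COMPLETED;
        } catch (TimeoutException e) {
            // The deadline passed: interrupt the worker and report TIMEOUT.
            future.cancel(true);
            return TerminalStatus.TIMEOUT;
        } catch (ExecutionException e) {
            // The agent call itself threw.
            return TerminalStatus.ERROR;
        } catch (InterruptedException e) {
            // The calling thread was interrupted; restore the flag and give up.
            Thread.currentThread().interrupt();
            future.cancel(true);
            return TerminalStatus.ERROR;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        TerminalStatus fast = runWithTimeout(() -> null, Duration.ofSeconds(1));
        TerminalStatus slow = runWithTimeout(() -> {
            Thread.sleep(2_000);
            return null;
        }, Duration.ofMillis(50));
        System.out.println(fast + " " + slow);
    }
}
```

Interruption is cooperative, so this only stops an agent that responds to `Thread.interrupt()`; an agent that wraps an external process would also need to terminate that process.
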
Dataset Format
dataset.json
item.json
Directory layout
ItemFilter
ResultStore
| Implementation | Use case |
|---|---|
| `FileSystemResultStore(path)` | Production: persists to disk |
| `InMemoryResultStore()` | Testing: HashMap-backed |
ExperimentResult
| Method | Type | Description |
|---|---|---|
| `experimentId()` | `String` | Unique run ID |
| `experimentName()` | `String` | Experiment name from config |
| `items()` | `List<ItemResult>` | Per-item results |
| `passCount()` | `int` | Items that passed all judges |
| `failCount()` | `int` | Items that failed |
| `passRate()` | `double` | Pass count / total (0.0–1.0) |
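
To make the relationship between these aggregates concrete, here is a small sketch of how they could be derived from the per-item results. The `ItemResult` shape (an `itemId` plus a `passed` flag) is hypothetical, as are two further assumptions: that `failCount()` is every item that did not pass, and that an empty run reports a pass rate of 0.0.

```java
import java.util.List;

public class ExperimentResultSketch {

    // Hypothetical per-item shape; the real ItemResult is not shown in this reference.
    record ItemResult(String itemId, boolean passed) {}

    record ExperimentResult(String experimentId, String experimentName, List<ItemResult> items) {

        int passCount() {
            return (int) items.stream().filter(ItemResult::passed).count();
        }

        // Assumption: every item that did not pass counts as failed.
        int failCount() {
            return items.size() - passCount();
        }

        // Pass count / total, guarded so an empty run yields 0.0 instead of NaN.
        double passRate() {
            return items.isEmpty() ? 0.0 : (double) passCount() / items.size();
        }
    }

    public static void main(String[] args) {
        ExperimentResult result = new ExperimentResult("run-1", "demo", List.of(
                new ItemResult("a", true),
                new ItemResult("b", true),
                new ItemResult("c", false),
                new ItemResult("d", true)));
        // Prints: 3 passed, 1 failed, rate=0.75
        System.out.println(result.passCount() + " passed, " + result.failCount()
                + " failed, rate=" + result.passRate());
    }
}
```
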