Overview

Agent Experiment is the execution backbone of every experiment in this lab. It provides a complete pipeline for running reproducible evaluations: load fixture datasets, invoke agents, judge results with Agent Judge juries, persist structured results, and compare runs across variants. The framework is agent-agnostic at its core: the experiment-core module has no AI SDK dependencies, while experiment-claude adds Claude Code SDK integration for agent invocation, LLM-based planning, and semantic evaluation.
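The pipeline stages described above can be sketched in a few lines of plain Java. This is an illustrative sketch only: `PipelineSketch`, `Item`, `ItemResult`, and `RunResult` are hypothetical names invented for this example and are not the library's types, the agent and judge are plain functional interfaces, and an in-memory list stands in for persistence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Hypothetical sketch of the load -> invoke -> judge -> aggregate -> persist loop.
// None of these types belong to experiment-core.
final class PipelineSketch {

    /** One fixture item: an id plus the task text handed to the agent. */
    record Item(String id, String task) {}

    /** Outcome for one item after the agent ran and the judge ruled. */
    record ItemResult(String id, String output, boolean passed) {}

    /** Aggregated run: per-item results plus a pass rate in [0, 1]. */
    record RunResult(List<ItemResult> items, double passRate) {}

    static RunResult run(List<Item> dataset,
                         Function<Item, String> agent,
                         BiPredicate<Item, String> judge,
                         List<RunResult> store) {
        List<ItemResult> results = new ArrayList<>();
        for (Item item : dataset) {
            String output = agent.apply(item);          // invoke the agent
            boolean passed = judge.test(item, output);  // judge its output
            results.add(new ItemResult(item.id(), output, passed));
        }
        // Aggregate: fraction of items the judge accepted (0.0 for an empty dataset).
        long passedCount = results.stream().filter(ItemResult::passed).count();
        double passRate = results.isEmpty() ? 0.0 : (double) passedCount / results.size();
        RunResult result = new RunResult(List.copyOf(results), passRate);
        store.add(result);                              // persist (in-memory stand-in)
        return result;
    }

    public static void main(String[] args) {
        List<Item> dataset = List.of(new Item("a", "x"), new Item("b", "yy"));
        List<RunResult> store = new ArrayList<>();
        RunResult result = run(dataset,
                item -> item.task().toUpperCase(),      // stand-in agent
                (item, out) -> out.length() == 1,       // stand-in judge
                store);
        System.out.println("pass rate: " + result.passRate());
    }
}
```

Keeping the agent and the judge behind narrow interfaces is one way to get the agent-agnostic property: the loop itself never needs to know which SDK produced the output.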

Architecture

Dataset

Git-managed fixture datasets with items, before/reference snapshots, and version tracking

ExperimentRunner

Orchestrates the full loop: load items, invoke agent, judge, aggregate, persist

Comparison Engine

Compare runs across variants with per-judge deltas, regression detection, and summary statistics

Sessions & Sweeps

Group variant results into sessions, and sessions into sweeps, for multi-run analysis
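To make "per-judge deltas" and "regression detection" concrete, here is a minimal sketch of how two runs might be compared. It assumes each run can be reduced to a map from judge name to a score in [0, 1], and that a regression means a drop larger than a chosen tolerance. `ComparisonSketch` and both method names are hypothetical, not the Comparison Engine's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: compare per-judge scores of a baseline and a candidate run.
final class ComparisonSketch {

    /** Per-judge change (candidate minus baseline) for judges present in both runs. */
    static Map<String, Double> deltas(Map<String, Double> baseline,
                                      Map<String, Double> candidate) {
        Map<String, Double> out = new TreeMap<>(); // sorted by judge name for stable output
        for (Map.Entry<String, Double> e : baseline.entrySet()) {
            Double c = candidate.get(e.getKey());
            if (c != null) {
                out.put(e.getKey(), c - e.getValue());
            }
        }
        return out;
    }

    /** Judges whose score dropped by more than the tolerance. */
    static List<String> regressions(Map<String, Double> deltas, double tolerance) {
        List<String> regressed = new ArrayList<>();
        deltas.forEach((judge, delta) -> {
            if (delta < -tolerance) {
                regressed.add(judge);
            }
        });
        return regressed;
    }

    public static void main(String[] args) {
        Map<String, Double> baseline = Map.of("build", 0.75, "tests", 0.5);
        Map<String, Double> candidate = Map.of("build", 0.5, "tests", 0.75);
        Map<String, Double> d = deltas(baseline, candidate);
        System.out.println("deltas: " + d);
        System.out.println("regressions: " + regressions(d, 0.1));
    }
}
```

The tolerance is a design choice in this sketch: with noisy, LLM-based judges, a small negative delta may not be a meaningful regression.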

Modules

| Module | Description | Key Dependencies |
|---|---|---|
| experiment-core | Datasets, runner, comparison, results, storage | agent-judge-core, agent-judge-exec, Jackson |
| experiment-claude | Claude SDK invoker, plan generator, semantic judge | claude-code-sdk |

Documentation

Getting Started

Run your first experiment: dataset, agent, jury, variant comparison

Creating Experiments

Design datasets, configure variants, wire custom judges

Jury System

Build cascaded juries for tiered evaluation

API Reference

Core types, runner, comparison, storage, diagnostics
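As an illustration of what a cascaded jury for tiered evaluation could look like, the sketch below runs cheap checks first and escalates to the next tier only when a tier abstains. This escalation rule is an assumption made for the example, and the Jury System's actual semantics may differ. `CascadeSketch`, `Verdict`, `Tier`, and `Decision` are hypothetical names, not Agent Judge types.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a cascade: tiers run in order, and the first tier that
// reaches a definite verdict decides. Abstaining tiers defer to the next one.
final class CascadeSketch {

    enum Verdict { PASS, FAIL, ABSTAIN }

    /** A named tier wrapping a judge function over the agent's output. */
    record Tier(String name, Function<String, Verdict> judge) {}

    /** Final verdict plus the name of the tier that produced it. */
    record Decision(Verdict verdict, String decidedBy) {}

    static Decision evaluate(List<Tier> tiers, String output) {
        for (Tier tier : tiers) {
            Verdict v = tier.judge().apply(output);
            if (v != Verdict.ABSTAIN) {
                return new Decision(v, tier.name()); // later (costlier) tiers are skipped
            }
        }
        return new Decision(Verdict.ABSTAIN, "none"); // no tier could decide
    }

    public static void main(String[] args) {
        List<Tier> tiers = List.of(
                // Tier 1: cheap deterministic check that can only reject.
                new Tier("non-empty", out -> out.isBlank() ? Verdict.FAIL : Verdict.ABSTAIN),
                // Tier 2: stand-in for a more expensive (for example, LLM-based) judge.
                new Tier("keyword", out -> out.contains("OK") ? Verdict.PASS : Verdict.FAIL));
        System.out.println(evaluate(tiers, ""));
        System.out.println(evaluate(tiers, "all OK"));
    }
}
```

Ordering tiers from cheapest to most expensive means the costly judge only runs on outputs the cheap checks could not settle.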

Quick Start

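Add the core module to your Maven build:

<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>experiment-core</artifactId>
    <version>0.1.0</version>
</dependency>

Then configure and run an experiment:

import java.nio.file.Path;
import java.time.Duration;
// Imports for the experiment-core types are omitted here.
// `jury`, `invoker`, and `resultsDir` are assumed to be defined elsewhere.

// Describe the experiment: dataset location, model, prompt, and per-item timeout.
ExperimentConfig config = ExperimentConfig.builder()
    .experimentName("my-experiment")
    .datasetDir(Path.of("datasets/my-benchmark"))
    .model("sonnet")
    .promptTemplate("Your task: {{task}}")
    .perItemTimeout(Duration.ofMinutes(10))
    .build();

// File-system-backed dataset loading and result persistence.
DatasetManager dm = new FileSystemDatasetManager();
ResultStore store = new FileSystemResultStore(resultsDir);

// Run every dataset item through the invoker and judge it with the jury.
ExperimentRunner runner = new ExperimentRunner(dm, jury, store, config);
ExperimentResult result = runner.run(invoker);
// result.passRate(), result.totalCostUsd(), result.items()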
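The `promptTemplate` value above contains a `{{task}}` placeholder. The source does not say how the runner expands it, so the following is only a sketch of one plausible approach: a regex-based substitution that leaves unknown placeholders untouched. `TemplateSketch` and `render` are hypothetical names and are not part of experiment-core.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of {{name}} placeholder substitution; the library's real
// template handling is not documented here and may work differently.
final class TemplateSketch {

    // Matches {{name}} where name is one or more word characters.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    static String render(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            // Unknown placeholders are kept verbatim rather than replaced with "".
            String replacement = values.getOrDefault(m.group(1), m.group());
            // quoteReplacement stops '$' and '\' in values being read as group references.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("Your task: {{task}}", Map.of("task", "fix the failing build")));
    }
}
```

Keeping unknown placeholders visible in the rendered prompt makes a misspelled variable name easy to spot in logged prompts.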

Role in the Lab

Agent Experiment is the execution layer that ties the other AgentWorks projects together, and it is used by every experiment in the lab.

Source

GitHub

Source code (BSL 1.1): two modules, 477 tests