Overview
Agent Experiment is the execution backbone of every experiment in this lab. It provides a complete pipeline for running reproducible evaluations: load fixture datasets, invoke agents, judge results with Agent Judge juries, persist structured results, and compare runs across variants. The framework is agent-agnostic at its core: experiment-core has no AI SDK dependencies, while experiment-claude adds Claude Code SDK integration for agent invocation, LLM-based planning, and semantic evaluation.
Architecture
Dataset
Git-managed fixture datasets with items, before/reference snapshots, and version tracking
ExperimentRunner
Orchestrates the full loop: load items, invoke agent, judge, aggregate, persist
Comparison Engine
Compare runs across variants with per-judge deltas, regression detection, and summary statistics
Sessions & Sweeps
Group variant results into sessions, group sessions into sweeps for multi-run analysis
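The runner loop described above (load items, invoke agent, judge, aggregate, persist) can be sketched as follows. This is a minimal illustration only: the `Agent`, `Judge`, and `ItemResult` types here are hypothetical stand-ins, not the experiment-core API.

```java
import java.util.ArrayList;
import java.util.List;

public class RunnerSketch {
    // Hypothetical interfaces; experiment-core's real types will differ.
    interface Agent { String invoke(String item); }
    interface Judge { double score(String item, String output); }

    record ItemResult(String item, String output, double score) {}

    static List<ItemResult> run(List<String> items, Agent agent, Judge judge) {
        List<ItemResult> results = new ArrayList<>();
        for (String item : items) {                       // load items
            String output = agent.invoke(item);           // invoke agent
            double score = judge.score(item, output);     // judge
            results.add(new ItemResult(item, output, score)); // persist (in-memory here)
        }
        return results;
    }

    public static void main(String[] args) {
        Agent echo = item -> item.toUpperCase();
        Judge sameLength = (item, out) -> out.length() == item.length() ? 1.0 : 0.0;
        List<ItemResult> results = run(List.of("fix bug", "add test"), echo, sameLength);
        // aggregate: mean score across items
        double mean = results.stream().mapToDouble(ItemResult::score).average().orElse(0);
        System.out.println("items=" + results.size() + " mean=" + mean);
        // prints: items=2 mean=1.0
    }
}
```

In the real framework, persistence writes structured results to storage and aggregation feeds the comparison engine; the shape of the loop is the same.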
Modules
| Module | Description | Key Dependencies |
|---|---|---|
| experiment-core | Datasets, runner, comparison, results, storage | agent-judge-core, agent-judge-exec, Jackson |
| experiment-claude | Claude SDK invoker, plan generator, semantic judge | claude-code-sdk |
Documentation
Getting Started
Run your first experiment: dataset, agent, jury, variant comparison
Creating Experiments
Design datasets, configure variants, wire custom judges
Jury System
Build cascaded juries for tiered evaluation
API Reference
Core types, runner, comparison, storage, diagnostics
Quick Start
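As a taste of what run comparison looks like, here is a minimal sketch of computing per-judge deltas between a baseline run and a candidate variant, with a simple regression check. The method and type names are illustrative assumptions, not the experiment-core comparison API.

```java
import java.util.Map;
import java.util.stream.Collectors;

public class CompareSketch {
    // Per-judge delta: candidate score minus baseline score (hypothetical helper).
    static Map<String, Double> delta(Map<String, Double> baseline, Map<String, Double> candidate) {
        return baseline.keySet().stream().collect(Collectors.toMap(
                judge -> judge,
                judge -> candidate.getOrDefault(judge, 0.0) - baseline.get(judge)));
    }

    public static void main(String[] args) {
        Map<String, Double> baseline = Map.of("correctness", 0.80, "style", 0.90);
        Map<String, Double> candidate = Map.of("correctness", 0.85, "style", 0.70);
        // Regression detection: flag any judge whose score dropped past a threshold.
        delta(baseline, candidate).forEach((judge, d) -> {
            if (d < -0.05) System.out.printf("REGRESSION %s %.2f%n", judge, d);
        });
        // prints: REGRESSION style -0.20
    }
}
```

The actual comparison engine also produces summary statistics across sessions and sweeps; this sketch covers only the per-judge delta idea.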
Role in the Lab
Agent Experiment is the execution layer that ties the other AgentWorks projects together:
- Agent Judge — juries score every item
- Agent Journal — Traces captured during invocation
- Agent Sandbox — Isolated execution environments
- Agent Bench — Benchmark datasets consumed by experiments
Source
GitHub
Source code (BSL 1.1) — two modules, 477 tests