Agent Journal

What’s new in 1.6.0: first-class journal-capture primitives that downstream repos can import: PhaseCapture.stepCosts(), JournalSteps.fromEvents(), a production fail-loud RunRecorder, and per-turn usage in the immutable log (slice 1 of the cross-repo capture contract). This release also corrects cost metering: the headline LLMCallEvent.tokenUsage is now the cost-bearing Σ-per-turn aggregate (including cache), which fixes a roughly 2× under-count on long runs. In addition, events.jsonl and analysis.jsonl now carry a per-file schemaVersion header. All of this is additive on the frozen capture contract, so a 1.5.0 consumer keeps working. (1.5.0 added first-class Gemini CLI capture through the gemini-cli-capture module, which uses the same portable trace and cost schema as Claude. There are now three modules: journal-core, claude-code-capture, and gemini-cli-capture.)

This project has moved from the spring-ai-community GitHub organization to markpollack. New releases are published under the Maven groupId io.github.markpollack, and Java packages now use the io.github.markpollack namespace. If you previously used org.springaicommunity, update your dependency coordinates and imports to the current values shown below.

Overview

Agent Journal captures the structured behavioral traces that make agent research possible. Every LLM call, tool invocation, state transition, and decision point is logged as a typed event in an append-only journal. The EvalSubject extraction layer converts heterogeneous event sources into a uniform stream of behavioral units ready for evaluation by Agent Judge. A human feedback API records reviewer judgments with typed scores for judge calibration and golden dataset creation.

Architecture

Event System

Sealed event hierarchy: LLM calls, tool calls, state changes, git events, metrics, custom events

EvalSubject Extraction

Source-neutral behavioral units for evaluation — 9 subject kinds from journal events or SDK captures

Human Feedback

Typed feedback events with binary, numerical, and categorical scores for judge calibration

Modules

Module	Description	Dependencies
`journal-core`	Events, storage, EvalSubject, Feedback, portable `TraceWriter`	Zero external deps
`claude-code-capture`	Claude Code SDK → journal bridge, `PhaseCaptureSources`	claude-code-sdk
`gemini-cli-capture`	Gemini CLI → journal bridge — same trace + cost schema	gemini-cli-sdk

Installation

<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>journal-core</artifactId>
    <version>1.6.0</version>
</dependency>

For Claude Code SDK integration:

<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>claude-code-capture</artifactId>
    <version>1.6.0</version>
</dependency>

For Gemini CLI capture:

<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>gemini-cli-capture</artifactId>
    <version>1.6.0</version>
</dependency>

EvalSubject

EvalSubject is the source-neutral unit of recorded agent behavior that can be judged. Each subject carries an ID, kind, source reference, and metadata extracted from the original event.

EvalSubjectKind

Nine kinds classify the type of behavior:

Kind	Source Event
`LLM_CALL`	LLM invocation with tokens, cost, duration
`TOOL_CALL`	Individual tool invocation
`WORKFLOW_STEP`	High-level workflow phase
`ROUTER_DECISION`	Routing or dispatch decision
`RETRIEVAL_RESULT`	RAG or search retrieval
`FINAL_OUTPUT`	Terminal agent output
`FEEDBACK`	Human feedback event
`STATE_CHANGE`	State transition
`CUSTOM`	Application-defined behavior

EvalSubjectSource

Functional interface that adapts a specific data source into a stream of EvalSubject records:

@FunctionalInterface
public interface EvalSubjectSource {
    Stream<EvalSubject> subjects();
}
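
Because the interface has a single abstract method, a lambda or method reference can act as a source. The following is a minimal sketch, not the library's code: the EvalSubject record below is a stand-in whose component names (and the String-typed kind) are assumptions, and the interface is redeclared locally only so the example compiles on its own.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Stand-in for the library's EvalSubject; component names are assumptions.
record EvalSubject(String id, String kind, String sourceRef, Map<String, Object> metadata) {}

// Local copy of the interface shown above so the sketch is self-contained.
@FunctionalInterface
interface EvalSubjectSource {
    Stream<EvalSubject> subjects();
}

class InMemorySourceExample {

    // Hypothetical helper: wraps a fixed list as a source. Each call to
    // subjects() opens a fresh stream, so the source can be read repeatedly.
    static EvalSubjectSource fromList(List<EvalSubject> subjects) {
        List<EvalSubject> copy = List.copyOf(subjects);
        return copy::stream;
    }

    public static void main(String[] args) {
        EvalSubjectSource source = fromList(List.of(
            new EvalSubject("subj-1", "LLM_CALL", "event-1", Map.of()),
            new EvalSubject("subj-2", "TOOL_CALL", "event-2", Map.of("success", true))));
        System.out.println(source.subjects().count());
    }
}
```

Returning a new stream on every call matters because a Java Stream can be consumed only once; a source that cached a single stream would fail on the second read.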

EvalSubjectSources

Factory for creating sources from journal data:

// From stored journal events
EvalSubjectSource source = EvalSubjectSources.fromJournal(
    storage, experimentId, runId);

// From a pre-loaded event list
EvalSubjectSource source = EvalSubjectSources.fromEvents(events, runId);

The claude-code-capture module provides PhaseCaptureSources.fromPhaseCaptures(captures) for extracting subjects from Claude Code SDK PhaseCapture records — each phase becomes an LLM_CALL subject and individual tool uses become separate TOOL_CALL subjects.

EvalSubjectQuery

Fluent selection and grouping over subjects:

EvalSubjectSet toolCalls = EvalSubjectQuery.from(source)
    .kind(EvalSubjectKind.TOOL_CALL)
    .where(s -> s.metadata().containsKey("success"))
    .toSet();

Map<EvalSubjectKind, EvalSubjectSet> byKind =
    EvalSubjectQuery.from(source).groupBy(EvalSubject::kind);
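
To make the selection semantics concrete, the two queries above correspond roughly to the following plain java.util.stream operations. This is a sketch under assumptions: Kind and Subject are hypothetical stand-ins for EvalSubjectKind and EvalSubject, a List stands in for EvalSubjectSet, and nothing here claims to mirror how EvalSubjectQuery is actually implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class QuerySemanticsSketch {

    // Stand-ins for the library types; only the parts the queries touch.
    enum Kind { LLM_CALL, TOOL_CALL, FINAL_OUTPUT }

    record Subject(String id, Kind kind, Map<String, Object> metadata) {}

    // Roughly: .kind(TOOL_CALL).where(s -> s.metadata().containsKey("success"))
    static List<Subject> toolCallsWithOutcome(List<Subject> subjects) {
        return subjects.stream()
            .filter(s -> s.kind() == Kind.TOOL_CALL)
            .filter(s -> s.metadata().containsKey("success"))
            .collect(Collectors.toList());
    }

    // Roughly: groupBy(EvalSubject::kind)
    static Map<Kind, List<Subject>> byKind(List<Subject> subjects) {
        return subjects.stream().collect(Collectors.groupingBy(Subject::kind));
    }

    public static void main(String[] args) {
        List<Subject> subjects = List.of(
            new Subject("s1", Kind.TOOL_CALL, Map.of("success", true)),
            new Subject("s2", Kind.TOOL_CALL, Map.of()),
            new Subject("s3", Kind.LLM_CALL, Map.of()));
        System.out.println(toolCallsWithOutcome(subjects).size());
        System.out.println(byKind(subjects).keySet());
    }
}
```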

Human Feedback

Record human reviewer judgments for judge agreement analysis and golden dataset creation.

FeedbackTarget

Identifies what the feedback applies to:

FeedbackTarget.item("ITEM-001")                              // item-level
FeedbackTarget.subject("ITEM-001", "subj-42", "TOOL_CALL")   // subject-level
FeedbackTarget.run()                                         // run-level

FeedbackEvent

Records a single piece of human feedback. Implements JournalEvent and is stored in the journal’s feedback.jsonl sidecar:

FeedbackEvent.thumbsUp(target, "reviewer-name")
FeedbackEvent.thumbsDown(target, "reviewer-name")
FeedbackEvent.rated(target, 0.8, 1.0, "reviewer-name")
FeedbackEvent.labeled(target, List.of("correct", "efficient"), "reviewer-name")

FeedbackScore

Typed score with three kinds:

Factory Method	ScoreKind	Description
`binary(true)`	`BINARY`	Thumbs up / thumbs down
`numerical(0.8, 1.0)`	`NUMERICAL`	Value with a maximum; `normalized()` returns an `OptionalDouble`
`categorical("correct")`	`CATEGORICAL`	Discrete category label
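
One plausible reading of why normalized() returns an OptionalDouble is that not every score kind has a meaningful value-over-max ratio. The sketch below illustrates that idea only. Score is a hypothetical stand-in for FeedbackScore, its fields are assumptions, and the choice to return empty for binary and categorical scores is this example's assumption, not documented library behavior.

```java
import java.util.OptionalDouble;

class ScoreSketch {

    enum ScoreKind { BINARY, NUMERICAL, CATEGORICAL }

    // Hypothetical stand-in for FeedbackScore; field names are assumptions.
    record Score(ScoreKind kind, double value, double max, String label) {

        static Score binary(boolean up) {
            return new Score(ScoreKind.BINARY, up ? 1.0 : 0.0, 1.0, null);
        }

        static Score numerical(double value, double max) {
            return new Score(ScoreKind.NUMERICAL, value, max, null);
        }

        static Score categorical(String label) {
            return new Score(ScoreKind.CATEGORICAL, Double.NaN, Double.NaN, label);
        }

        // Assumption of this sketch: only numerical scores with a positive
        // max yield a ratio; every other case yields an empty result.
        OptionalDouble normalized() {
            if (kind == ScoreKind.NUMERICAL && max > 0) {
                return OptionalDouble.of(value / max);
            }
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(Score.numerical(4.0, 5.0).normalized());
        System.out.println(Score.categorical("correct").normalized().isPresent());
    }
}
```

Returning an optional rather than a sentinel such as -1 or NaN forces callers to handle the "not comparable" case explicitly when averaging scores for judge calibration.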

FeedbackService

Records and queries feedback, exports reviewed items for golden dataset creation:

FeedbackService feedback = new DefaultFeedbackService(storage);

// Record
feedback.recordFeedback(experimentId, runId, event);

// Query
List<FeedbackEvent> events = feedback.getFeedback(experimentId, runId);
List<FeedbackEvent> itemEvents = feedback.getFeedbackForItem(experimentId, itemId);

// Export for golden dataset creation
List<ReviewedItem> reviewed = feedback.exportReviewedItems(experimentId);

ReviewedItem is a projection record containing itemId, runId, feedback, and itemMetadata — suitable for building labeled datasets from human judgments.

Workflow Integration

The workflow-journal module bridges agent-workflow step execution to the journal. Each workflow step completion is recorded as a WorkflowStepEvent containing step name, duration, tokens, and cost.

WorkflowJournal.registerEventType();  // register once

try (Run run = Journal.run("my-experiment").start()) {
    WorkflowExecutor executor = new WorkflowExecutor(
        new LocalStepRunner(), WorkflowJournal.forRun(run));

    Workflow.<String, String>define("my-workflow")
        .withExecutor(executor)
        .step(claudeStep)
        .run(input);
}

The experiment template’s WorkflowAgentInvoker and WorkflowInvoker<S> wire this automatically — no manual setup needed in consumer projects.

What It Captures

Events are stored as an append-only events.jsonl log per run:

.tuvium/experiments/{experimentId}/runs/{runId}/
├── events.jsonl      # immutable execution log
└── feedback.jsonl    # append-only human feedback
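
Given that layout, a run's files can be located and read with the standard library alone. The sketch below assumes the .tuvium directory sits under a root directory the caller chooses, and the class and method names are hypothetical. It returns raw lines without parsing them, since the exact form of the schemaVersion header and of each event's JSON is not described here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RunLayoutSketch {

    // Resolves {root}/.tuvium/experiments/{experimentId}/runs/{runId}.
    static Path runDir(Path root, String experimentId, String runId) {
        return root.resolve(".tuvium")
            .resolve("experiments").resolve(experimentId)
            .resolve("runs").resolve(runId);
    }

    // Returns the non-blank lines of a JSONL file, unparsed.
    // The first line may be a header rather than an event.
    static List<String> readLines(Path jsonl) {
        try (Stream<String> lines = Files.lines(jsonl)) {
            return lines.filter(line -> !line.isBlank())
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path events = runDir(Path.of("."), "my-experiment", "run-001")
            .resolve("events.jsonl");
        if (Files.exists(events)) {
            System.out.println(readLines(events).size() + " lines in " + events);
        } else {
            System.out.println("no journal at " + events);
        }
    }
}
```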

The sealed JournalEvent hierarchy includes:

LLMCallEvent — tokens, cost, duration, provider, model
ToolCallEvent — tool name, arguments, result, duration
StateChangeEvent — from/to states in the 9-state taxonomy
MetricEvent — counters, timers, gauges with dimensional tags
GitEvent — commit, diff, branch operations
CustomEvent — application-defined events
FeedbackEvent — human reviewer feedback
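
A sealed hierarchy lets consumers handle each event type explicitly while the compiler knows the full set of subtypes. The sketch below shows the pattern with a reduced, hypothetical three-type hierarchy. The type names, record components, and helper methods are illustrative assumptions and do not reproduce the real JournalEvent classes. It needs Java 17 or later.

```java
import java.util.List;

class SealedEventsSketch {

    // Reduced stand-in for the sealed hierarchy; fields are illustrative.
    sealed interface Event permits LlmCall, ToolCall, StateChange {}

    record LlmCall(String model, long tokens) implements Event {}

    record ToolCall(String toolName, long durationMs) implements Event {}

    record StateChange(String from, String to) implements Event {}

    // Sums tokens over LLM calls and ignores every other event type.
    static long totalTokens(List<Event> events) {
        long total = 0;
        for (Event e : events) {
            if (e instanceof LlmCall call) {
                total += call.tokens();
            }
        }
        return total;
    }

    // Produces a one-line summary per event using type patterns.
    static String describe(Event e) {
        if (e instanceof LlmCall call) {
            return "llm:" + call.model();
        } else if (e instanceof ToolCall call) {
            return "tool:" + call.toolName();
        } else if (e instanceof StateChange change) {
            return "state:" + change.from() + "->" + change.to();
        }
        throw new IllegalStateException("unreachable for a sealed type");
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new LlmCall("model-a", 1200),
            new ToolCall("grep", 35),
            new LlmCall("model-a", 300));
        System.out.println(totalTokens(events));
        events.forEach(e -> System.out.println(describe(e)));
    }
}
```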

Why It Matters

Without structured traces, agent behavior is a black box. Agent Journal transforms agent runs into analyzable data — enabling Markov fingerprinting, loop detection, and cross-variant behavioral comparison.
