This project has moved from the spring-ai-community GitHub organization to markpollack. New releases are published under the Maven groupId io.github.markpollack, and Java packages now use the io.github.markpollack namespace. If you previously used org.springaicommunity, update your dependency coordinates and imports to the current values shown below.

Overview

Agent Journal captures the structured behavioral traces that make agent research possible. Every LLM call, tool invocation, state transition, and decision point is logged as a typed event in an append-only journal. The EvalSubject extraction layer converts heterogeneous event sources into a uniform stream of behavioral units ready for evaluation by Agent Judge. A human feedback API records reviewer judgments with typed scores for judge calibration and golden dataset creation.

Architecture

Event System

Sealed event hierarchy: LLM calls, tool calls, state changes, git events, metrics, custom events

EvalSubject Extraction

Source-neutral behavioral units for evaluation — 9 subject kinds from journal events or SDK captures

Human Feedback

Typed feedback events with binary, numerical, and categorical scores for judge calibration

Modules

| Module | Description | Dependencies |
| --- | --- | --- |
| journal-core | Events, storage, EvalSubject, Feedback | Zero external deps |
| claude-code-capture | Claude Code SDK → journal bridge, PhaseCaptureSources | claude-code-sdk |

Installation

<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>journal-core</artifactId>
    <version>1.2.0</version>
</dependency>
For Claude Code SDK integration:
<dependency>
    <groupId>io.github.markpollack</groupId>
    <artifactId>claude-code-capture</artifactId>
    <version>1.2.0</version>
</dependency>

EvalSubject

EvalSubject is the source-neutral unit of recorded agent behavior that can be judged. Each subject carries an ID, kind, source reference, and metadata extracted from the original event.

EvalSubjectKind

Nine kinds classify the type of behavior:
| Kind | Source Event |
| --- | --- |
| LLM_CALL | LLM invocation with tokens, cost, duration |
| TOOL_CALL | Individual tool invocation |
| WORKFLOW_STEP | High-level workflow phase |
| ROUTER_DECISION | Routing or dispatch decision |
| RETRIEVAL_RESULT | RAG or search retrieval |
| FINAL_OUTPUT | Terminal agent output |
| FEEDBACK | Human feedback event |
| STATE_CHANGE | State transition |
| CUSTOM | Application-defined behavior |

EvalSubjectSource

Functional interface that adapts a specific data source into a stream of EvalSubject records:
@FunctionalInterface
public interface EvalSubjectSource {
    Stream<EvalSubject> subjects();
}
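
Because the interface has a single method, a custom source can be written as a lambda or method reference. The following is a minimal sketch under stated assumptions: the `EvalSubject` record here is a hypothetical stand-in with the four fields described above (the library's actual record may differ), and `fromList` is an illustrative helper, not part of the library.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

final class CustomSourceSketch {

    // Hypothetical stand-in for the library's EvalSubject; field names are assumptions.
    record EvalSubject(String id, String kind, String sourceRef, Map<String, Object> metadata) {}

    // Mirrors the single-method interface shown above.
    @FunctionalInterface
    interface EvalSubjectSource {
        Stream<EvalSubject> subjects();
    }

    // Illustrative helper: adapts an in-memory list. Each call to subjects()
    // opens a fresh stream, so the source can be consumed more than once.
    static EvalSubjectSource fromList(List<EvalSubject> subjects) {
        List<EvalSubject> copy = List.copyOf(subjects);
        return copy::stream;
    }

    public static void main(String[] args) {
        EvalSubjectSource source = fromList(List.of(
            new EvalSubject("subj-1", "LLM_CALL", "events.jsonl#1", Map.of("model", "example-model")),
            new EvalSubject("subj-2", "TOOL_CALL", "events.jsonl#2", Map.of("success", true))));
        source.subjects().forEach(s -> System.out.println(s.id() + " " + s.kind()));
    }
}
```

Returning a new stream on every call matters because a Java `Stream` can only be traversed once.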

EvalSubjectSources

Factory for creating sources from journal data:
// From stored journal events
EvalSubjectSource source = EvalSubjectSources.fromJournal(
    storage, experimentId, runId);

// From a pre-loaded event list (named differently so both declarations compile in one scope)
EvalSubjectSource eventSource = EvalSubjectSources.fromEvents(events, runId);
The claude-code-capture module provides PhaseCaptureSources.fromPhaseCaptures(captures) for extracting subjects from Claude Code SDK PhaseCapture records — each phase becomes an LLM_CALL subject and individual tool uses become separate TOOL_CALL subjects.

EvalSubjectQuery

Fluent selection and grouping over subjects:
EvalSubjectSet toolCalls = EvalSubjectQuery.from(source)
    .kind(EvalSubjectKind.TOOL_CALL)
    .where(s -> s.metadata().containsKey("success"))
    .toSet();

Map<EvalSubjectKind, EvalSubjectSet> byKind =
    EvalSubjectQuery.from(source).groupBy(EvalSubject::kind);
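
For intuition, the same selection and grouping can be expressed with plain `java.util.stream` operations. This is only a sketch of the idea, not the library's implementation: `Subject` and `Kind` are hypothetical stand-ins, and the results are ordinary lists rather than `EvalSubjectSet`.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class SubjectQuerySketch {

    // Hypothetical stand-ins; only three of the nine kinds are listed for brevity.
    enum Kind { LLM_CALL, TOOL_CALL, FINAL_OUTPUT }

    record Subject(String id, Kind kind, Map<String, Object> metadata) {}

    // Rough equivalent of kind(...).where(...).toSet(): filter by kind, then by predicate.
    static List<Subject> select(Stream<Subject> subjects, Kind kind, Predicate<Subject> where) {
        return subjects
            .filter(s -> s.kind() == kind)
            .filter(where)
            .collect(Collectors.toList());
    }

    // Rough equivalent of groupBy(EvalSubject::kind).
    static Map<Kind, List<Subject>> groupByKind(Stream<Subject> subjects) {
        return subjects.collect(Collectors.groupingBy(Subject::kind));
    }
}
```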

Human Feedback

Record human reviewer judgments for judge agreement analysis and golden dataset creation.

FeedbackTarget

Identifies what the feedback applies to:
FeedbackTarget.item("ITEM-001")                         // item-level
FeedbackTarget.subject("ITEM-001", "subj-42", "TOOL_CALL")  // subject-level
FeedbackTarget.run()                                     // run-level

FeedbackEvent

Records a single piece of human feedback. Implements JournalEvent and is stored in the journal’s feedback.jsonl sidecar:
FeedbackEvent.thumbsUp(target, "reviewer-name")
FeedbackEvent.thumbsDown(target, "reviewer-name")
FeedbackEvent.rated(target, 0.8, 1.0, "reviewer-name")
FeedbackEvent.labeled(target, List.of("correct", "efficient"), "reviewer-name")

FeedbackScore

Typed score with three kinds:
| Factory Method | ScoreKind | Description |
| --- | --- | --- |
| binary(true) | BINARY | Thumbs up / thumbs down |
| numerical(0.8, 1.0) | NUMERICAL | Value with max; normalized() returns OptionalDouble |
| categorical("correct") | CATEGORICAL | Discrete category label |
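
To illustrate why `normalized()` returns an `OptionalDouble`, here is one way a typed score could behave. This is a sketch built on assumptions, not the library's code: it assumes normalization means `value / max`, that a binary score maps to 0.0 or 1.0, and that a categorical score has no numeric form and therefore yields an empty result. The `Score` record and its fields are hypothetical.

```java
import java.util.OptionalDouble;

final class FeedbackScoreSketch {

    enum ScoreKind { BINARY, NUMERICAL, CATEGORICAL }

    // Hypothetical score record; the real FeedbackScore may be shaped differently.
    record Score(ScoreKind kind, double value, double max, String label) {

        static Score binary(boolean thumbsUp) {
            return new Score(ScoreKind.BINARY, thumbsUp ? 1.0 : 0.0, 1.0, null);
        }

        static Score numerical(double value, double max) {
            return new Score(ScoreKind.NUMERICAL, value, max, null);
        }

        static Score categorical(String label) {
            return new Score(ScoreKind.CATEGORICAL, 0.0, 0.0, label);
        }

        // Assumed semantics: value / max, or empty when no numeric reading exists.
        OptionalDouble normalized() {
            if (kind == ScoreKind.CATEGORICAL || max <= 0.0) {
                return OptionalDouble.empty();
            }
            return OptionalDouble.of(value / max);
        }
    }
}
```

Returning an optional rather than a bare `double` lets callers that compare reviewers with judges skip scores that cannot be placed on a numeric scale.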

FeedbackService

Records and queries feedback, and exports reviewed items for golden dataset creation:
FeedbackService feedback = new DefaultFeedbackService(storage);

// Record
feedback.recordFeedback(experimentId, runId, event);

// Query
List<FeedbackEvent> events = feedback.getFeedback(experimentId, runId);
List<FeedbackEvent> itemEvents = feedback.getFeedbackForItem(experimentId, itemId);

// Export for golden dataset creation
List<ReviewedItem> reviewed = feedback.exportReviewedItems(experimentId);
ReviewedItem is a projection record containing itemId, runId, feedback, and itemMetadata — suitable for building labeled datasets from human judgments.
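
One way exported reviews might be turned into labels is a per-item majority vote over binary feedback. The sketch below is illustrative only: `Vote` is a hypothetical, flattened view of a reviewed item (one thumbs up or down per record), and dropping tied items is an assumption rather than something the library prescribes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GoldenLabelSketch {

    // Hypothetical simplification of a reviewed item: one binary vote per record.
    record Vote(String itemId, String reviewer, boolean thumbsUp) {}

    // Returns itemId -> majority label; items whose votes tie are left out.
    static Map<String, Boolean> majorityLabels(List<Vote> votes) {
        Map<String, int[]> tally = new LinkedHashMap<>(); // [0] = thumbs down, [1] = thumbs up
        for (Vote v : votes) {
            int[] counts = tally.computeIfAbsent(v.itemId(), k -> new int[2]);
            counts[v.thumbsUp() ? 1 : 0]++;
        }
        Map<String, Boolean> labels = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : tally.entrySet()) {
            int down = e.getValue()[0];
            int up = e.getValue()[1];
            if (up != down) {
                labels.put(e.getKey(), up > down);
            }
        }
        return labels;
    }
}
```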

What It Captures

Events are stored as an append-only events.jsonl log per run:
.tuvium/experiments/{experimentId}/runs/{runId}/
├── events.jsonl      # immutable execution log
└── feedback.jsonl    # append-only human feedback
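
The layout above can be mimicked with the standard `java.nio.file` API. This sketch shows only the append-only JSONL pattern, one JSON object per line opened with `CREATE` and `APPEND`; it is not the library's storage code, and the helper names and the JSON passed in by callers are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

final class JsonlAppendSketch {

    // Builds {root}/.tuvium/experiments/{experimentId}/runs/{runId}.
    static Path runDir(Path root, String experimentId, String runId) {
        return root.resolve(".tuvium").resolve("experiments")
                   .resolve(experimentId).resolve("runs").resolve(runId);
    }

    // Appends one JSON line; existing content is never rewritten.
    static void appendLine(Path file, String jsonLine) {
        try {
            Files.createDirectories(file.getParent());
            Files.writeString(file, jsonLine + "\n", StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the log back in write order.
    static List<String> readLines(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```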
The sealed JournalEvent hierarchy includes:
  • LLMCallEvent — tokens, cost, duration, provider, model
  • ToolCallEvent — tool name, arguments, result, duration
  • StateChangeEvent — from/to states in the 9-state taxonomy
  • MetricEvent — counters, timers, gauges with dimensional tags
  • GitEvent — commit, diff, branch operations
  • CustomEvent — application-defined events
  • FeedbackEvent — human reviewer feedback
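
A sealed hierarchy lets consumers handle every event type explicitly. The sketch below uses a reduced, hypothetical hierarchy (three record types with invented fields) to show the pattern with Java 17 `instanceof` patterns; the real `JournalEvent` types carry the fields listed above and may expose different accessors.

```java
final class SealedEventSketch {

    // Hypothetical, reduced hierarchy; not the library's JournalEvent.
    sealed interface Event permits LlmCall, ToolCall, StateChange {}

    record LlmCall(String model, int inputTokens, int outputTokens) implements Event {}

    record ToolCall(String toolName, boolean success) implements Event {}

    record StateChange(String from, String to) implements Event {}

    // One branch per permitted type; the final throw is unreachable while
    // the permits list and the branches stay in sync.
    static String describe(Event e) {
        if (e instanceof LlmCall c) {
            return "llm " + c.model() + " tokens=" + (c.inputTokens() + c.outputTokens());
        }
        if (e instanceof ToolCall t) {
            return "tool " + t.toolName() + (t.success() ? " ok" : " failed");
        }
        if (e instanceof StateChange s) {
            return "state " + s.from() + " -> " + s.to();
        }
        throw new IllegalStateException("unhandled event type: " + e);
    }
}
```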

Why It Matters

Without structured traces, agent behavior is a black box. Agent Journal transforms agent runs into analyzable data — enabling Markov fingerprinting, loop detection, and cross-variant behavioral comparison.
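
As a rough illustration of the kind of analysis structured traces allow, the sketch below counts first-order transitions between consecutive event labels and finds the longest run of a repeated label. It is a generic example, not the project's fingerprinting or loop-detection algorithm, and the labels are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TransitionCountSketch {

    // Counts transitions "from -> to" over consecutive labels.
    static Map<String, Integer> transitionCounts(List<String> labels) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < labels.size(); i++) {
            counts.merge(labels.get(i) + " -> " + labels.get(i + 1), 1, Integer::sum);
        }
        return counts;
    }

    // Length of the longest run of identical consecutive labels, a crude loop signal.
    static int longestRepeatRun(List<String> labels) {
        int best = 0;
        int run = 0;
        String prev = null;
        for (String label : labels) {
            run = label.equals(prev) ? run + 1 : 1;
            best = Math.max(best, run);
            prev = label;
        }
        return best;
    }
}
```

Two runs with similar transition counts behaved similarly at this coarse level, while a long repeat run hints that the agent may be stuck retrying the same step.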

Source

GitHub

Source code (BSL 1.1) — two modules, 401 tests