Overview

Agent Memory gives AI agents the ability to manage conversational context intelligently. Without memory management, every prior tool result is re-sent every turn; on long tasks, context fills with stale noise, costs climb, and the model loses focus. In one production run, a single code-coverage task consumed 18M input tokens. With compaction, the same task took 854K: a 21x reduction at the same quality.

The library starts with a proven context compaction strategy and progressively adds structured retrieval, reflection, and autonomous memory management, eventually reaching MemGPT-level capabilities. Each tier is independently useful.

Agent Memory ships as a Spring AI BaseAdvisor; plug it into any ChatClient pipeline with a single defaultAdvisors call:
var memoryStore = new FileSystemMemoryStore(Path.of(".memory"));

var advisor = CompactionMemoryAdvisor.builder(memoryStore)
    .compactionChatClient(ChatClient.create(haikuModel))
    .memoryTokenBudget(8192)
    .compactionRatio(0.75)
    .build();

ChatClient agent = ChatClient.builder(chatModel)
    .defaultAdvisors(advisor)
    .build();
On each request, the advisor retrieves accumulated learnings (within the token budget) and injects them into the system message. After each response, it appends the assistant’s output to the store. When uncompacted entries exceed budget × ratio, compaction summarizes them via a cheap model and replaces them with dense summaries.
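The "retrieve within the token budget" step above can be sketched in plain Java. The Entry record and the greedy newest-first selection here are illustrative assumptions, not the library's actual retrieval code:

```java
import java.util.ArrayList;
import java.util.List;

public class BudgetedRetrieval {
    record Entry(String text, int tokens) {}

    // Select entries newest-first until the budget is exhausted,
    // so recent learnings win when the budget is tight.
    static List<Entry> retrieve(List<Entry> store, int budget) {
        List<Entry> selected = new ArrayList<>();
        int used = 0;
        for (int i = store.size() - 1; i >= 0; i--) {
            Entry e = store.get(i);
            if (used + e.tokens() > budget) break;
            selected.add(0, e);  // keep chronological order in the prompt
            used += e.tokens();
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Entry> store = List.of(
            new Entry("old summary", 3000),
            new Entry("recent tool result", 4000),
            new Entry("latest learning", 2000));
        // Budget 6,144 fits only the two newest entries (2,000 + 4,000 tokens).
        System.out.println(retrieve(store, 6144).size()); // 2
    }
}
```

The chronological re-insertion matters: the prompt should read oldest-to-newest even though selection walks the store in reverse.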

How Compaction Works

When accumulated context exceeds a token budget, older entries are summarized by a cheap model (e.g., Haiku) and replaced with a compact summary. The agent continues with dense, relevant context instead of an ever-growing prompt. Two parameters control it:
| Parameter | Default | Description |
| --- | --- | --- |
| memoryTokenBudget | 8,192 | Max tokens of memory included in each prompt |
| compactionRatio | 0.75 | Fraction of budget that triggers compaction |
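The trigger rule is simple arithmetic. A minimal sketch (method name is illustrative, not the library's API): compaction fires once uncompacted memory exceeds memoryTokenBudget × compactionRatio.

```java
public class CompactionTrigger {
    static boolean shouldCompact(int uncompactedTokens, int budget, double ratio) {
        return uncompactedTokens > budget * ratio;
    }

    public static void main(String[] args) {
        // With the defaults (budget 8,192, ratio 0.75) the threshold is 6,144 tokens.
        System.out.println(shouldCompact(6000, 8192, 0.75)); // false
        System.out.println(shouldCompact(6500, 8192, 0.75)); // true
    }
}
```

Lowering the ratio compacts earlier and more often; lowering the budget shrinks how much memory reaches the prompt at all.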

Benchmark Data

Real LLM benchmarks against Anthropic Haiku 4.5 on a 12-story e-commerce platform PRD:
| Metric | Without Compaction | With Compaction |
| --- | --- | --- |
| Stories passed | 9/12 | 11/12 |
| Total tokens | 56,876 | 40,152 |
| Total cost | $0.34 | $0.24 |
Token growth without compaction is linear and unbounded (~800 tokens/story, reaching 9,000+ by story 12). With compaction it plateaus around 4,600 tokens after the first compaction cycle.
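The two growth curves can be modeled with the numbers quoted above. The per-story rate and the plateau are taken from the benchmark text; this is an illustration of the shapes, not the measurement itself:

```java
public class TokenGrowth {
    // Linear, unbounded: roughly 800 tokens accumulate per story.
    static int withoutCompaction(int story) {
        return story * 800;
    }

    // Same slope until the first compaction cycle, then a plateau.
    static int withCompaction(int story, int plateau) {
        return Math.min(story * 800, plateau);
    }

    public static void main(String[] args) {
        System.out.println(withoutCompaction(12));    // 9600, past 9,000 by story 12
        System.out.println(withCompaction(12, 4600)); // 4600, flat after compaction
    }
}
```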

Budget Sensitivity

The memory budget is the critical tuning parameter:
| Budget | Result | Notes |
| --- | --- | --- |
| 2,048 | 7/12 stories | Too aggressive; destroys critical details |
| 4,096 | 11/12 stories | Sweet spot for structured tasks |
| 8,192 | Good | Best for unstructured conversations |
At 2,048 tokens, compaction destroys critical details — table names, endpoint signatures, auth token formats — that later tasks depend on. At 4,096, it preserves what matters and discards what doesn’t.

Production Validation

In a code-coverage experiment with Loopy:
| Configuration | Compaction | Input Tokens | Outcome |
| --- | --- | --- | --- |
| No compaction | none | 18,336,594 | Failed |
| Threshold 0.5 | late | | Cost cap hit |
| Threshold 0.3 | early | 854,353 | Passed |
Compaction does not change answer quality. It determines whether the run finishes within budget.
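The headline reduction follows directly from the table above; the values below are copied from it:

```java
public class Reduction {
    static double factor(long without, long with) {
        return (double) without / with;
    }

    public static void main(String[] args) {
        // 18,336,594 input tokens without compaction vs 854,353 with it.
        System.out.printf("%.1fx%n", factor(18_336_594L, 854_353L)); // prints 21.5x
    }
}
```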

Roadmap

| Tier | Name | Status | Description |
| --- | --- | --- | --- |
| 1 | Compaction | Shipping | Token-budgeted retrieval + LLM summarization |
| 2 | Structured | Planned | Categorized memory with selective retrieval and per-category retention policies |
| 3 | Reflective | Planned | Importance scoring + periodic reflection synthesis (Generative Agents pattern) |
| 4 | Autonomous | Planned | Agent-controlled memory via tools: virtual context management (MemGPT pattern) |

Modules

| Module | Description |
| --- | --- |
| memory-core | MemoryStore interface, FileSystemMemoryStore, MemoryCompactor, TokenEstimator |
| memory-advisor | CompactionMemoryAdvisor, a Spring AI BaseAdvisor for ChatClient integration |
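The module list names a TokenEstimator. A common heuristic for English text, offered here only as a sketch (the library's actual estimator may differ), is roughly four characters per token:

```java
public class SimpleTokenEstimator {
    // Rough chars/4 heuristic; real tokenizers vary by model and language.
    static int estimate(String text) {
        return (int) Math.ceil(text.length() / 4.0);
    }

    public static void main(String[] args) {
        System.out.println(estimate("compaction keeps prompts dense"));
    }
}
```

Cheap estimation is enough here: the budget and ratio checks only need token counts that are approximately right, not tokenizer-exact.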

GitHub

Source code (0.1.0 on Maven Central)

Origin

Extracted from wiggum-memory — a research project that explored the Ralph Wiggum pattern for context management in AI agent loops. The memory subsystem proved to be the most broadly useful component, so it was promoted to a standalone library as part of the AgentWorks stack.