Designing Agent Memory Systems: From Short-Term Context to Persistent Knowledge
A deep dive into the four-layer agent memory architecture, with practical code for vector retrieval and memory compression to help you build scalable long-term memory systems.
Most agents handle memory by stuffing recent conversation turns into the context window. That works for demos but fails in production, where users expect agents to remember past decisions, understand long-term preferences, and improve based on experience. This article breaks down a four-layer memory architecture with implementation code for each layer.
Why "Stuff Everything Into Context" Does Not Work
Loading full conversation history into context has three fatal flaws:
- Token costs compound: every turn re-sends the full history, so prompt size grows linearly with turn count and cumulative inference cost grows quadratically (see the sketch after this list)
- Retrieval is unreliable: An LLM finding a key fact in 50K tokens of context is far less reliable than finding it in 500 tokens of targeted retrieval results
- No cross-session persistence: Context windows vanish when sessions end; returning users face a blank-slate agent
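A back-of-envelope sketch of the first point (the 500-tokens-per-turn figure is an assumption, not a measurement):

def cumulative_prompt_tokens(turns: int, tokens_per_turn: int = 500) -> int:
    # The prompt for turn t carries all t turns so far, so totals grow quadratically.
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

print(cumulative_prompt_tokens(10))  # 27,500
print(cumulative_prompt_tokens(50))  # 637,500: 5x the turns, ~23x the tokens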
A real memory system needs a layered design, where each layer has its own storage medium, retrieval strategy, and lifecycle.
Four-Layer Memory Architecture
┌────────────────────────────────────┐
│ Layer 1: Working Memory            │ ← Current context window
├────────────────────────────────────┤
│ Layer 2: Episodic Memory           │ ← Recent summaries + key events
├────────────────────────────────────┤
│ Layer 3: Semantic Memory           │ ← Vector-stored knowledge and facts
├────────────────────────────────────┤
│ Layer 4: Procedural Memory         │ ← Learned behavior patterns
└────────────────────────────────────┘
Layer 1: Working Memory
Working memory is the current context window. No extra storage needed, but it requires careful management:
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    system_prompt: str
    recent_messages: list[dict] = field(default_factory=list)
    max_tokens: int = 8000
    reserved_for_output: int = 2000

    def add_message(self, role: str, content: str):
        self.recent_messages.append({"role": role, "content": content})
        self._trim_if_needed()

    def _trim_if_needed(self):
        """Remove oldest messages when approaching the token budget."""
        # Rough heuristic: ~3 characters per token, deliberately conservative.
        estimated_tokens = sum(
            len(m["content"]) // 3 for m in self.recent_messages
        )
        budget = (
            self.max_tokens
            - self.reserved_for_output
            - len(self.system_prompt) // 3
        )
        # Drop from the front (oldest first), but always keep the last 2 turns.
        while estimated_tokens > budget and len(self.recent_messages) > 2:
            removed = self.recent_messages.pop(0)
            estimated_tokens -= len(removed["content"]) // 3

    def get_context(self) -> list[dict]:
        return [
            {"role": "system", "content": self.system_prompt}
        ] + self.recent_messages
Design point: Reserve space for output (reserved_for_output) to avoid truncation when context is full.
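A minimal usage sketch (the message contents are placeholder values):

wm = WorkingMemory(system_prompt="You are a concise coding assistant.")
wm.add_message("user", "Should we shard the events table by tenant?")
wm.add_message("assistant", "Yes; hash-shard on tenant_id to avoid hot spots.")
messages = wm.get_context()  # system prompt + recent turns, trimmed to budget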
Layer 2: Episodic Memory
Episodic memory stores key information about recent events. When working memory overflows, evicted content is compressed and stored here rather than discarded.
from datetime import datetime, timedelta

@dataclass
class Episode:
    timestamp: datetime
    summary: str
    key_entities: list[str]
    importance: float  # 0.0–1.0
    raw_excerpt: str | None = None

class EpisodicMemory:
    def __init__(self, max_episodes: int = 200, ttl_days: int = 30):
        self.episodes: list[Episode] = []
        self.max_episodes = max_episodes
        self.ttl = timedelta(days=ttl_days)

    def add(self, summary: str, key_entities: list[str],
            importance: float = 0.5, raw: str | None = None):
        episode = Episode(
            timestamp=datetime.now(),
            summary=summary,
            key_entities=key_entities,
            importance=importance,
            raw_excerpt=raw,
        )
        self.episodes.append(episode)
        self._evict_if_needed()

    def retrieve_recent(self, hours: int = 24, limit: int = 10) -> list[Episode]:
        cutoff = datetime.now() - timedelta(hours=hours)
        candidates = [e for e in self.episodes if e.timestamp > cutoff]
        # Rank by importance weighted by recency, best first.
        candidates.sort(
            key=lambda e: e.importance * self._recency_score(e),
            reverse=True,
        )
        return candidates[:limit]

    def _recency_score(self, episode: Episode) -> float:
        # Linear decay over 30 days (720 hours), floored at 0.1.
        age_hours = (datetime.now() - episode.timestamp).total_seconds() / 3600
        return max(0.1, 1.0 - age_hours / 720)

    def _evict_if_needed(self):
        # Hard TTL cutoff first, then keep only the highest-scoring episodes.
        cutoff = datetime.now() - self.ttl
        self.episodes = [e for e in self.episodes if e.timestamp > cutoff]
        if len(self.episodes) > self.max_episodes:
            self.episodes.sort(key=lambda e: e.importance * self._recency_score(e))
            self.episodes = self.episodes[-self.max_episodes:]
Key design: Not all memories are equally important. The importance field and recency_score enable priority-based eviction, ensuring high-value memories are retained.
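A usage sketch with illustrative values:

em = EpisodicMemory()
em.add(
    summary="User chose Postgres over DynamoDB for the billing service",
    key_entities=["Postgres", "DynamoDB", "billing"],
    importance=0.9,
)
for e in em.retrieve_recent(hours=24, limit=5):
    print(f"[{e.importance:.1f}] {e.summary}")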
Layer 3: Semantic Memory
Semantic memory is the long-term knowledge layer, using a vector store for semantic retrieval.
import numpy as np

class SemanticMemory:
    def __init__(self, embedding_dim: int = 1536):
        self.vectors: np.ndarray = np.empty((0, embedding_dim))
        self.documents: list[dict] = []

    def add(self, text: str, embedding: list[float], metadata: dict | None = None):
        vec = np.array(embedding).reshape(1, -1)
        self.vectors = (
            np.vstack([self.vectors, vec])
            if len(self.vectors) > 0 else vec
        )
        self.documents.append({
            "text": text,
            "metadata": metadata or {},
            "access_count": 0,
        })

    def search(self, query_embedding: list[float], top_k: int = 5,
               threshold: float = 0.7) -> list[dict]:
        if len(self.vectors) == 0:
            return []
        query = np.array(query_embedding).reshape(1, -1)
        # Cosine similarity; clip the norm product to avoid division by zero.
        norms = np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(query)
        similarities = (self.vectors @ query.T).flatten() / np.clip(norms, 1e-8, None)
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        results = []
        for idx in top_indices:
            if similarities[idx] >= threshold:
                doc = self.documents[idx]
                doc["access_count"] += 1
                doc["score"] = float(similarities[idx])
                results.append(doc)
        return results

    def consolidate(self, min_access: int = 3, max_entries: int = 5000):
        """Once over capacity, drop memories that were rarely retrieved."""
        if len(self.documents) <= max_entries:
            return
        keep_indices = [
            i for i, doc in enumerate(self.documents)
            if doc["access_count"] >= min_access
        ]
        if not keep_indices:
            return  # refuse to wipe the store if nothing qualifies
        self.vectors = self.vectors[keep_indices]
        self.documents = [self.documents[i] for i in keep_indices]
Retrieval strategy: A similarity threshold (threshold) filters noise. access_count tracks usage frequency and informs memory consolidation.
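A usage sketch. It assumes an embed() helper wrapping your embedding model; the placeholder below is hypothetical, not part of the class:

def embed(text: str) -> list[float]:
    # Hypothetical: call your embedding model here, return a 1536-dim vector.
    raise NotImplementedError

sm = SemanticMemory(embedding_dim=1536)
fact = "Team convention: all public APIs are versioned under /v2/"
sm.add(fact, embed(fact), metadata={"source": "design-doc"})
hits = sm.search(embed("how do we version our APIs?"), top_k=3, threshold=0.7)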
Layer 4: Procedural Memory
Procedural memory stores behavior patterns the agent has learned from experience — not "what happened" but "what to do."
@dataclass
class LearnedPattern:
    trigger: str
    action: str
    success_rate: float
    sample_size: int

class ProceduralMemory:
    def __init__(self):
        self.patterns: list[LearnedPattern] = []

    def record_outcome(self, trigger: str, action: str, success: bool):
        existing = next(
            (p for p in self.patterns
             if p.trigger == trigger and p.action == action),
            None,
        )
        if existing:
            # Incremental running average of the success rate.
            total = existing.sample_size + 1
            existing.success_rate = (
                existing.success_rate * existing.sample_size + int(success)
            ) / total
            existing.sample_size = total
        else:
            self.patterns.append(LearnedPattern(
                trigger=trigger,
                action=action,
                success_rate=float(success),
                sample_size=1,
            ))

    def get_best_action(self, trigger: str, min_samples: int = 5) -> str | None:
        candidates = [
            p for p in self.patterns
            if p.trigger == trigger and p.sample_size >= min_samples
        ]
        if not candidates:
            return None
        # Only recommend actions that succeed clearly more often than not.
        best = max(candidates, key=lambda p: p.success_rate)
        return best.action if best.success_rate > 0.6 else None
Design rationale: Procedural memory stores abstracted trigger-action-success_rate triples, not raw conversations. A minimum of min_samples observations is required before making a recommendation, avoiding small-sample bias.
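A usage sketch (the trigger and action strings are illustrative):

pm = ProceduralMemory()
for _ in range(4):
    pm.record_outcome("user reports flaky test", "rerun in isolation first", success=True)
pm.record_outcome("user reports flaky test", "rerun in isolation first", success=False)
# sample_size=5 meets min_samples, success_rate=0.8 > 0.6, so the action is returned
pm.get_best_action("user reports flaky test")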
Putting It Together: The Memory Manager
class AgentMemoryManager:
    def __init__(self, system_prompt: str):
        self.working = WorkingMemory(system_prompt=system_prompt)
        self.episodic = EpisodicMemory()
        self.semantic = SemanticMemory()
        self.procedural = ProceduralMemory()

    def build_context(self, current_query: str) -> list[dict]:
        """Assemble optimal context for the current query."""
        context = self.working.get_context()
        # Procedural layer: exact match on a truncated trigger string.
        best_action = self.procedural.get_best_action(current_query[:100])
        if best_action:
            context.append({
                "role": "system",
                "content": f"[Learned behavior] Based on past experience: {best_action}",
            })
        # Episodic layer: recent, high-importance events.
        recent_episodes = self.episodic.retrieve_recent(hours=48, limit=5)
        if recent_episodes:
            episode_text = "\n".join(f"- {e.summary}" for e in recent_episodes)
            context.append({
                "role": "system",
                "content": f"[Recent memory]\n{episode_text}",
            })
        return context
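Note that build_context taps the procedural and episodic layers but not the semantic layer, because semantic search needs a query embedding. One way to wire it in, as a sketch that assumes an embed_fn callable (text in, list[float] out) supplied by the caller:

def build_context_with_semantic(manager: AgentMemoryManager,
                                query: str, embed_fn) -> list[dict]:
    context = manager.build_context(query)
    # embed_fn is an assumed external embedding call, e.g. a model API wrapper.
    facts = manager.semantic.search(embed_fn(query), top_k=3)
    if facts:
        fact_text = "\n".join(f"- {f['text']}" for f in facts)
        context.append({
            "role": "system",
            "content": f"[Relevant knowledge]\n{fact_text}",
        })
    return context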
Memory Retrieval Decision Framework
Different scenarios call for different memory layers:
| Scenario | Primary Layer | Strategy | Reason |
|---|---|---|---|
| Coreference in multi-turn dialogue | Working memory | Last N messages | Information is in context |
| "What did we discuss last time?" | Episodic memory | Time range + importance | Needs temporal cues |
| "Have we seen similar technical solutions?" | Semantic memory | Vector similarity search | Needs semantic matching |
| "How should we handle this situation?" | Procedural memory | Trigger condition matching | Needs experience patterns |
| First-time user interaction | Procedural memory | Default behavior patterns | Falls back to general experience |
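A minimal router sketch of this table. The keyword triggers are illustrative only; a production system would likely use a classifier rather than substring matching:

def route_query(query: str) -> str:
    q = query.lower()
    if any(k in q for k in ("last time", "yesterday", "earlier")):
        return "episodic"    # temporal cues
    if any(k in q for k in ("similar", "have we seen", "like this")):
        return "semantic"    # semantic matching
    if q.startswith(("how should", "what should")):
        return "procedural"  # experience patterns
    return "working"         # default: answer from current context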
Common Mistakes
Mistake 1: "More memory is always better"
Memory quality > memory quantity. Undifferentiated storage drowns retrieval in noise. Every layer needs an eviction mechanism (TTL, importance scoring, or access frequency).
Mistake 2: "Vector search solves everything"
Vector search excels at semantic matching but struggles with exact matches and temporal ordering. For "what did we discuss yesterday," episodic memory with time indexing beats vector search. Choosing the right layer matters more than tuning the vector model.
Mistake 3: "No need for compression, just store raw text"
Raw text storage has high cost and retrieval noise. Good compression preserves decision-relevant information while removing pleasantries and redundancy.
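A minimal sketch of such a compression step, assuming an llm callable (prompt string in, completion string out) that wraps your model:

COMPRESS_PROMPT = """Summarize the conversation below in at most 3 sentences.
Keep: decisions made, constraints stated, user preferences, open questions.
Drop: greetings, acknowledgements, and repeated context.

{transcript}"""

def compress_evicted(messages: list[dict], llm) -> str:
    # llm is an assumed callable wrapping a completion API.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return llm(COMPRESS_PROMPT.format(transcript=transcript))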
Summary
- Four memory layers, each with a distinct role: working memory for current context, episodic for recent events, semantic for long-term knowledge, procedural for behavior patterns
- Each layer needs its own eviction mechanism: TTL, importance scoring, access frequency — pick at least two
- Retrieval strategy matters more than storage: choose the right layer for each query
- Memory compression is a necessity, not a luxury: good compression keeps signal, removes noise
- Procedural memory is the most overlooked layer but the most valuable in long-running agents
Prepared by AgentList. Explore more agent memory projects in our directory.
Projects in this article
A-MEM
1.0k ⭐ · An agentic memory system for LLM agents inspired by human memory mechanisms, enabling dynamic memory generation, retrieval, and consolidation with automatic memory evolution and self-organization.
SimpleMem
3.2k ⭐ · SimpleMem: Efficient Lifelong Memory for LLM Agents — supports text and multimodal memory for long-term information retention and retrieval.
Agentic Memory
533 ⭐ · Implementing cognitive architecture and psychological memory concepts into Agentic LLM Systems. Explores short-term, long-term, and working memory engineering for AI agents.
MemAgent
1.0k ⭐ · A MemAgent framework that can extrapolate to 3.5M context tokens, along with a training framework for RL training of any agent workflow.
OpenMemory
4.1k ⭐ · Local persistent memory store for LLM applications including Claude Desktop, GitHub Copilot, Codex, and more. Provides durable context memory capabilities for AI agents.