Designing Agent Memory Systems: From Short-Term Context to Persistent Knowledge

A deep dive into the four-layer agent memory architecture, with practical code for vector retrieval and memory compression to help you build scalable long-term memory systems.

AgentList Team · April 21, 2026
AI Agent · Memory Systems · Vector Retrieval · Memory Architecture Design

Most agents handle memory by stuffing recent conversation turns into the context window. That works for demos but fails in production, where users expect agents to remember past decisions, understand long-term preferences, and improve based on experience. This article breaks down a four-layer memory architecture with implementation code for each layer.

Why "Stuff Everything Into Context" Does Not Work

Loading full conversation history into context has three fatal flaws:

  • Token costs compound: every turn re-sends the full history, so per-turn input cost grows linearly with conversation length and cumulative cost grows quadratically
  • Retrieval is unreliable: An LLM finding a key fact in 50K tokens of context is far less reliable than finding it in 500 tokens of targeted retrieval results
  • No cross-session persistence: Context windows vanish when sessions end; returning users face a blank-slate agent
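
A back-of-the-envelope sketch of the cost problem (the ~200 tokens per turn is an assumed figure): because every turn re-sends the full history, input size on turn n grows linearly, and total billed input tokens grow quadratically.

```python
# Assumed figure: each turn adds ~200 tokens of new content.
TOKENS_PER_TURN = 200

def input_tokens_at_turn(n: int) -> int:
    # Turn n re-sends the entire history: n * TOKENS_PER_TURN input tokens.
    return n * TOKENS_PER_TURN

def cumulative_input_tokens(turns: int) -> int:
    # Summing a linearly growing per-turn cost gives a quadratic total.
    return sum(input_tokens_at_turn(n) for n in range(1, turns + 1))

print(input_tokens_at_turn(10))     # 2000 tokens re-sent on turn 10 alone
print(cumulative_input_tokens(10))  # 11000 tokens billed across 10 turns
```

Targeted retrieval, by contrast, keeps the per-turn figure roughly flat.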

A real memory system needs layered design, where each layer has its own storage medium, retrieval strategy, and lifecycle.

Four-Layer Memory Architecture

┌─────────────────────────────────────────┐
│  Layer 1: Working Memory                 │  ← Current context window
├─────────────────────────────────────────┤
│  Layer 2: Episodic Memory               │  ← Recent summaries + key events
├─────────────────────────────────────────┤
│  Layer 3: Semantic Memory               │  ← Vector-stored knowledge and facts
├─────────────────────────────────────────┤
│  Layer 4: Procedural Memory             │  ← Learned behavior patterns
└─────────────────────────────────────────┘

Layer 1: Working Memory

Working memory is the current context window. No extra storage needed, but it requires careful management:

from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    system_prompt: str
    recent_messages: list[dict] = field(default_factory=list)
    max_tokens: int = 8000
    reserved_for_output: int = 2000

    def add_message(self, role: str, content: str):
        self.recent_messages.append({"role": role, "content": content})
        self._trim_if_needed()

    def _trim_if_needed(self):
        """Remove oldest messages when approaching token budget"""
        # Rough estimate: ~3-4 characters per token for English text
        estimated_tokens = sum(
            len(m["content"]) // 3 for m in self.recent_messages
        )
        budget = self.max_tokens - self.reserved_for_output - len(self.system_prompt) // 3
        while estimated_tokens > budget and len(self.recent_messages) > 2:
            removed = self.recent_messages.pop(0)
            estimated_tokens -= len(removed["content"]) // 3

    def get_context(self) -> list[dict]:
        return [
            {"role": "system", "content": self.system_prompt}
        ] + self.recent_messages

Design point: Reserve space for output (reserved_for_output) to avoid truncation when context is full.
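
A note on the trimming heuristic above: len(text) // 3 assumes roughly 3-4 characters per token, which holds loosely for English text. A minimal sketch with made-up messages; an exact tokenizer (e.g. tiktoken) would be tighter, but the estimate only needs to err on the safe side:

```python
# Cheap token estimate: ~3 characters per token (rough English-text heuristic).
def estimate_tokens(text: str) -> int:
    return len(text) // 3

msgs = [
    {"role": "user", "content": "Summarize yesterday's deployment log."},
    {"role": "assistant", "content": "Three services rolled out; one rollback."},
]
total = sum(estimate_tokens(m["content"]) for m in msgs)
print(total)  # a conservative token budget for these two turns
```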

Layer 2: Episodic Memory

Episodic memory stores key information about recent events. When working memory overflows, evicted content is compressed and stored here rather than discarded.

from datetime import datetime, timedelta

@dataclass
class Episode:
    timestamp: datetime
    summary: str
    key_entities: list[str]
    importance: float  # 0.0 ~ 1.0
    raw_excerpt: str | None = None

class EpisodicMemory:
    def __init__(self, max_episodes: int = 200, ttl_days: int = 30):
        self.episodes: list[Episode] = []
        self.max_episodes = max_episodes
        self.ttl = timedelta(days=ttl_days)

    def add(self, summary: str, key_entities: list[str],
            importance: float = 0.5, raw: str | None = None):
        episode = Episode(
            timestamp=datetime.now(),
            summary=summary,
            key_entities=key_entities,
            importance=importance,
            raw_excerpt=raw,
        )
        self.episodes.append(episode)
        self._evict_if_needed()

    def retrieve_recent(self, hours: int = 24, limit: int = 10) -> list[Episode]:
        cutoff = datetime.now() - timedelta(hours=hours)
        candidates = [e for e in self.episodes if e.timestamp > cutoff]
        candidates.sort(
            key=lambda e: e.importance * self._recency_score(e),
            reverse=True,
        )
        return candidates[:limit]

    def _recency_score(self, episode: Episode) -> float:
        """Linear decay over 30 days (720 hours), floored at 0.1"""
        age_hours = (datetime.now() - episode.timestamp).total_seconds() / 3600
        return max(0.1, 1.0 - age_hours / 720)

    def _evict_if_needed(self):
        cutoff = datetime.now() - self.ttl
        self.episodes = [e for e in self.episodes if e.timestamp > cutoff]
        if len(self.episodes) > self.max_episodes:
            self.episodes.sort(key=lambda e: e.importance * self._recency_score(e))
            self.episodes = self.episodes[-self.max_episodes:]

Key design: Not all memories are equally important. The importance field and _recency_score enable priority-based eviction, ensuring high-value memories are retained.
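
The hand-off from Layer 1 to Layer 2 implies a compression step: evicted messages must be turned into an Episode-shaped record. A minimal, hypothetical sketch — a real system would ask an LLM for the summary and entities; the heuristics here are crude stand-ins so the flow is runnable:

```python
from datetime import datetime

def compress_evicted(messages: list[dict]) -> dict:
    """Turn messages evicted from working memory into episode fields."""
    text = " ".join(m["content"] for m in messages)
    summary = text[:120]  # stand-in summary: truncation instead of an LLM call
    # Stand-in entity extraction: deduplicated capitalized tokens
    entities = sorted({w.strip(".,!?") for w in text.split() if w[:1].isupper()})
    # Crude importance signal: longer exchanges score slightly higher
    importance = min(1.0, 0.3 + 0.1 * len(messages))
    return {
        "timestamp": datetime.now(),
        "summary": summary,
        "key_entities": entities,
        "importance": importance,
    }

evicted = [
    {"role": "user", "content": "Deploy the Berlin cluster tonight."},
    {"role": "assistant", "content": "Scheduled the Berlin rollout for 02:00."},
]
episode = compress_evicted(evicted)
print(episode["key_entities"])  # includes "Berlin"
```

The resulting dict maps directly onto EpisodicMemory.add(summary, key_entities, importance, raw).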

Layer 3: Semantic Memory

Semantic memory is the long-term knowledge layer, using a vector store for semantic retrieval.

import numpy as np

class SemanticMemory:
    def __init__(self, embedding_dim: int = 1536):
        self.vectors: np.ndarray = np.empty((0, embedding_dim))
        self.documents: list[dict] = []

    def add(self, text: str, embedding: list[float], metadata: dict | None = None):
        vec = np.array(embedding).reshape(1, -1)
        self.vectors = (
            np.vstack([self.vectors, vec])
            if len(self.vectors) > 0 else vec
        )
        self.documents.append({
            "text": text,
            "metadata": metadata or {},
            "access_count": 0,
        })

    def search(self, query_embedding: list[float], top_k: int = 5,
               threshold: float = 0.7) -> list[dict]:
        if len(self.vectors) == 0:
            return []
        query = np.array(query_embedding).reshape(1, -1)
        norms = np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(query)
        similarities = (self.vectors @ query.T).flatten() / np.clip(norms, 1e-8, None)
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        results = []
        for idx in top_indices:
            if similarities[idx] >= threshold:
                doc = self.documents[idx]
                doc["access_count"] += 1
                # Append a copy so the score does not leak into stored documents
                results.append({**doc, "score": float(similarities[idx])})
        return results

    def consolidate(self, min_access: int = 3, max_entries: int = 5000):
        """Remove low-value memories that were never accessed"""
        if len(self.documents) <= max_entries:
            return
        keep_indices = [
            i for i, doc in enumerate(self.documents)
            if doc["access_count"] >= min_access
        ]
        if not keep_indices:
            return
        self.vectors = self.vectors[keep_indices]
        self.documents = [self.documents[i] for i in keep_indices]

Retrieval strategy: A similarity threshold (threshold) filters noise. access_count tracks usage frequency and informs memory consolidation.
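
To make the cosine scoring in search() concrete, a toy example with made-up 3-dimensional vectors (real embeddings would come from an embedding model, e.g. 1536 dimensions):

```python
import numpy as np

vectors = np.array([
    [1.0, 0.0, 0.0],   # "deployment runbook"
    [0.0, 1.0, 0.0],   # "billing policy"
    [0.9, 0.1, 0.0],   # "rollback procedure"
])
query = np.array([1.0, 0.05, 0.0])  # a deployment-flavored query

# Same cosine similarity as search(): dot products over the norm product
norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
sims = (vectors @ query) / np.clip(norms, 1e-8, None)
ranked = np.argsort(sims)[::-1]
# Documents 0 and 2 score near 1.0; document 1 scores near 0.05
# and would be dropped by the 0.7 threshold.
```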

Layer 4: Procedural Memory

Procedural memory stores behavior patterns the agent has learned from experience — not "what happened" but "what to do."

@dataclass
class LearnedPattern:
    trigger: str
    action: str
    success_rate: float
    sample_size: int

class ProceduralMemory:
    def __init__(self):
        self.patterns: list[LearnedPattern] = []

    def record_outcome(self, trigger: str, action: str, success: bool):
        existing = next(
            (p for p in self.patterns
             if p.trigger == trigger and p.action == action),
            None,
        )
        if existing:
            total = existing.sample_size + 1
            existing.success_rate = (
                existing.success_rate * existing.sample_size + int(success)
            ) / total
            existing.sample_size = total
        else:
            self.patterns.append(LearnedPattern(
                trigger=trigger,
                action=action,
                success_rate=float(success),
                sample_size=1,
            ))

    def get_best_action(self, trigger: str, min_samples: int = 5) -> str | None:
        candidates = [
            p for p in self.patterns
            if p.trigger == trigger and p.sample_size >= min_samples
        ]
        if not candidates:
            return None
        best = max(candidates, key=lambda p: p.success_rate)
        return best.action if best.success_rate > 0.6 else None

Design rationale: Procedural memory stores abstracted trigger-action-success_rate triples, not raw conversations. A minimum of min_samples observations is required before making a recommendation, avoiding small-sample bias.
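
The running-average update inside record_outcome() can be replayed in isolation; a minimal sketch of the same arithmetic:

```python
# Incremental success-rate update: rate_new = (rate * n + outcome) / (n + 1),
# identical to the arithmetic inside record_outcome().
def update(rate: float, n: int, success: bool) -> tuple[float, int]:
    return (rate * n + int(success)) / (n + 1), n + 1

rate, n = 0.0, 0
for outcome in [True, True, True, False]:  # 3 successes in 4 tries
    rate, n = update(rate, n, outcome)

print(rate, n)  # 0.75 4 — above the 0.6 recommendation cutoff
```

Keeping only (rate, n) per pattern avoids storing the full outcome history while producing the same mean.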

Putting It Together: The Memory Manager

class AgentMemoryManager:
    def __init__(self, system_prompt: str):
        self.working = WorkingMemory(system_prompt=system_prompt)
        self.episodic = EpisodicMemory()
        self.semantic = SemanticMemory()
        self.procedural = ProceduralMemory()

    def build_context(self, current_query: str) -> list[dict]:
        """Assemble optimal context for the current query"""
        context = self.working.get_context()

        # Note: triggers are matched exactly; a production system would first
        # normalize or classify the query into a stable trigger key
        best_action = self.procedural.get_best_action(current_query[:100])
        if best_action:
            context.append({
                "role": "system",
                "content": f"[Learned behavior] Based on past experience: {best_action}",
            })

        recent_episodes = self.episodic.retrieve_recent(hours=48, limit=5)
        if recent_episodes:
            episode_text = "\n".join(f"- {e.summary}" for e in recent_episodes)
            context.append({
                "role": "system",
                "content": f"[Recent memory]\n{episode_text}",
            })

        return context

Memory Retrieval Decision Framework

Different scenarios call for different memory layers:

  • Coreference in multi-turn dialogue → Working memory (last N messages): the information is still in context
  • "What did we discuss last time?" → Episodic memory (time range + importance): needs temporal cues
  • "Have we seen similar technical solutions?" → Semantic memory (vector similarity search): needs semantic matching
  • "How should we handle this situation?" → Procedural memory (trigger condition matching): needs experience patterns
  • First-time user interaction → Procedural memory (default behavior patterns): falls back to general experience
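
These scenarios can be sketched as a routing function. The keyword matching here is deliberately naive and purely illustrative — a production router would classify the query with an LLM or embedding similarity:

```python
def route_query(query: str, is_multi_turn: bool = True) -> str:
    """Pick a memory layer from surface cues in the query (illustrative only)."""
    q = query.lower()
    if any(k in q for k in ("last time", "yesterday", "earlier")):
        return "episodic"    # temporal cues → time-indexed lookup
    if any(k in q for k in ("similar", "have we seen", "like this")):
        return "semantic"    # semantic matching → vector search
    if any(k in q for k in ("how should", "what should")):
        return "procedural"  # experience patterns → trigger matching
    # Coreference in an ongoing session stays in working memory;
    # first-time users fall back to general procedural patterns.
    return "working" if is_multi_turn else "procedural"

print(route_query("What did we discuss last time?"))   # episodic
print(route_query("Have we seen similar solutions?"))  # semantic
```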

Common Mistakes

Mistake 1: "More memory is always better" Memory quality > memory quantity. Undifferentiated storage drowns retrieval in noise. Every layer needs an eviction mechanism (TTL, importance scoring, or access frequency).

Mistake 2: "Vector search solves everything" Vector search excels at semantic matching but struggles with exact matches and temporal ordering. For "what did we discuss yesterday," episodic memory with time indexing beats vector search. Choosing the right layer matters more than tuning the vector model.

Mistake 3: "No need for compression, just store raw text" Raw text storage has high cost and retrieval noise. Good compression preserves decision-relevant information while removing pleasantries and redundancy.

Summary

  • Four memory layers, each with a distinct role: working memory for current context, episodic for recent events, semantic for long-term knowledge, procedural for behavior patterns
  • Each layer needs its own eviction mechanism: TTL, importance scoring, access frequency — pick at least two
  • Retrieval strategy matters more than storage: choose the right layer for each query
  • Memory compression is a necessity, not a luxury: good compression keeps signal, removes noise
  • Procedural memory is the most overlooked layer but the most valuable in long-running agents

Prepared by AgentList. Explore more agent memory projects in our directory.