Context Engineering: Context Decay and Recovery in Long-Conversation Agents
Long-conversation agents fail at context management, not model capability. A systematic comparison of sliding window, retrieval injection, and layered compression strategies with practical decay diagnosis and recovery patterns.
After building a few multi-turn agent systems, you hit the same bottleneck: the model is rarely the limiting factor. The context window is.
Consider a customer service agent that averages 12 turns per conversation at 300-500 tokens per turn. The messages themselves are small, but once you add tool returns, system prompts, and retrieval results, the context can pass 32K tokens by turn 8. In longer conversations, by around turn 15, the model starts "forgetting" constraints the user mentioned in turn 3, or conflating tool-returned JSON with the user's question.
This is context decay -- the information has not actually disappeared. Its position and salience have been diluted by subsequent content.
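To make the token arithmetic concrete, here is a minimal sketch of how an uncompressed context grows. Only the 300-500 token message range comes from the scenario above; the system prompt, tool return, and retrieval sizes are assumed values chosen for illustration, so substitute numbers from your own logs.

```python
from typing import Optional


def context_tokens_at_turn(
    turn: int,
    system_prompt: int = 2_000,  # assumed system prompt size
    message: int = 400,          # midpoint of the 300-500 range
    tool_return: int = 2_500,    # assumed average tool output per turn
    retrieval: int = 1_000,      # assumed retrieved passages per turn
) -> int:
    """Total tokens in context after `turn` turns if nothing is discarded."""
    return system_prompt + turn * (message + tool_return + retrieval)


def first_turn_over(budget: int, max_turns: int = 1_000, **sizes: int) -> Optional[int]:
    """First turn at which the accumulated context exceeds `budget` tokens."""
    for turn in range(1, max_turns + 1):
        if context_tokens_at_turn(turn, **sizes) > budget:
            return turn
    return None  # budget never exceeded within max_turns


print(context_tokens_at_turn(8))  # 33200
print(first_turn_over(32_000))    # 8
```

Under these assumptions the messages alone account for only 3,200 tokens at turn 8; the rest is tool output, retrieval results, and the system prompt.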
This article skips the "how to expand your context window" primer and answers a more practical question: when context is running out, what should you discard, what should you preserve, and how do you recover what was lost?
Three Mechanisms of Decay
Before choosing a strategy, understand why decay happens. There are three primary mechanisms:
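Positional decay: Models attend less reliably to content that sits far from the current turn, and information buried in the middle of a long context tends to be used less than information at the start or end. The exact falloff depends on the model and its positional encoding, so a figure such as "turn 1 retains only 5-10% of its original attention weight by turn 20" is illustrative rather than a measured constant. This follows from how these models are built and trained; it is not a bug.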
Saliency dilution: Every new message competes for the attention "budget." Large tool-return JSON blobs, verbose system prompts, even emojis crowd out critical information.
Semantic drift: In multi-turn conversations, topics naturally shift. The "order #12345" discussed in turn 3 is still in the context by turn 12, but the model is much less likely to proactively connect it to the current topic.
These three mechanisms compound, creating a quality inflection point around turns 10-20 where responses become vague, tool calls become imprecise, and constraints get ignored.
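Saliency dilution is the easiest of the three to measure. The sketch below estimates what share of the context each role occupies. The 4-characters-per-token estimate is a rough assumption; swap in your model's tokenizer for real numbers.

```python
from collections import Counter


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def token_share_by_role(messages: list) -> dict:
    """Fraction of the estimated context tokens taken up by each role."""
    totals = Counter()
    for message in messages:
        totals[message["role"]] += estimate_tokens(message["content"])
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {role: count / grand_total for role, count in totals.items()}


# Hypothetical conversation: one short question, one large tool payload.
messages = [
    {"role": "user", "content": "Where is order 12345?"},
    {"role": "tool", "content": '{"status": "shipped"}' * 40},
    {"role": "assistant", "content": "It shipped yesterday."},
]
shares = token_share_by_role(messages)
```

If the `tool` share dominates, as it does in this example, the user's actual question is competing with payload data for the model's attention.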
Strategy 1: Sliding Window + Summary Bridging
The most straightforward approach: keep the last N turns in full, summarize older turns.
```python
from dataclasses import dataclass
from enum import Enum


class MessageRole(Enum):
    USER = "user"
    ASSISTANT = "assistant"
    SYSTEM = "system"
    TOOL = "tool"


@dataclass
class Message:
    role: MessageRole
    content: str
    turn: int
    token_count: int = 0


@dataclass
class SlidingWindowConfig:
    window_size: int = 6           # messages kept verbatim
    summary_trigger: int = 10      # buffer length that triggers compression
    max_summary_tokens: int = 500  # summary budget; not enforced by this sketch


class ConversationBuffer:
    def __init__(self, config: SlidingWindowConfig):
        self.config = config
        self.messages: list[Message] = []
        self.summary: str = ""

    def add(self, message: Message):
        self.messages.append(message)
        self._maybe_compress()

    def _maybe_compress(self):
        if len(self.messages) < self.config.summary_trigger:
            return
        boundary = len(self.messages) - self.config.window_size
        if boundary <= 0:
            return
        # Fold everything older than the window into the running summary.
        to_compress = self.messages[:boundary]
        self.summary = self._compress(to_compress, self.summary)
        self.messages = self.messages[boundary:]

    def _compress(self, old_messages: list[Message], prev_summary: str) -> str:
        # Rule-based placeholder; a production system would call an LLM here.
        parts = []
        if prev_summary:
            # Carry the old summary forward as-is so its label does not nest
            # one level deeper on every pass.
            parts.append(prev_summary)
        for msg in old_messages:
            if msg.role == MessageRole.USER:
                parts.append(f"User asked: {msg.content[:100]}")
            elif msg.role == MessageRole.TOOL:
                parts.append(f"Tool returned: {msg.content[:80]}")
        return " | ".join(parts)

    def build_context(self) -> list[dict]:
        context = []
        if self.summary:
            # The summary goes first so the verbatim window stays closest
            # to the model's next response.
            context.append({
                "role": "system",
                "content": f"[Conversation summary] {self.summary}"
            })
        for msg in self.messages:
            context.append({
                "role": msg.role.value,
                "content": msg.content
            })
        return context
```
Key design decisions:
- Summaries preserve structured information -- who did what and what was the result -- rather than rewriting sentences
- Set `summary_trigger` to roughly `window_size * 1.5` to give summaries breathing room
- Include the old summary in the input for each compression pass to avoid information gaps
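What "breathing room" buys you is easier to see by simulating how often compression fires. The helper below is a hypothetical standalone function that mirrors the trigger logic of the buffer above (compress once the buffer reaches `summary_trigger`, keep the last `window_size` messages); it is an illustration, not part of the buffer itself.

```python
def compression_schedule(total_messages: int, window_size: int, summary_trigger: int) -> list:
    """Return (message_number, messages_compressed) for every compression pass."""
    passes = []
    buffered = 0
    for message_number in range(1, total_messages + 1):
        buffered += 1
        # Same condition as the buffer: trigger reached and something to fold.
        if buffered >= summary_trigger and buffered > window_size:
            passes.append((message_number, buffered - window_size))
            buffered = window_size
    return passes


print(compression_schedule(20, window_size=6, summary_trigger=10))
# [(10, 4), (14, 4), (18, 4)]
print(len(compression_schedule(20, window_size=6, summary_trigger=7)))
# 14
```

With the trigger set just above the window size, every new message causes a one-message compression pass, which means a summarization call per message. A wider gap batches the work: here, four messages are folded in every fourth message.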
Best for: Customer service, technical support, and other linear conversations where backtracking is rare.
Limitation: Summaries lose detail. If the conversation contains precise values -- order numbers, amounts, person names -- the sliding window will eventually push them out. Summary quality also depends heavily on the LLM's compression ability.
Strategy 2: Retrieval-Augmented Context
Instead of compressing all history, treat past conversations as a "document library" and retrieve relevant passages on each turn.
```python
import hashlib
from dataclasses import dataclass, field
from typing import Any

# Order of chunks within a turn when the context is rebuilt.
ROLE_ORDER = {"user": 0, "assistant": 1, "tool": 2}


@dataclass
class ContextChunk:
    chunk_id: str
    turn: int
    role: str
    content: str
    embedding: list[float] | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


class RetrievalAugmentedBuffer:
    def __init__(self, embedding_fn, vector_store, top_k: int = 5):
        self.embedding_fn = embedding_fn
        # vector_store must provide upsert(chunk) and query(embedding, top_k=...)
        self.vector_store = vector_store
        self.top_k = top_k
        self.recent: list[ContextChunk] = []
        self.recent_limit = 4  # recent turns always kept in context
        self.turn_count = 0

    def add_turn(self, user_msg: str, assistant_msg: str, tool_results: list[str]):
        # Count turns explicitly: self.recent holds chunks, not turns.
        self.turn_count += 1
        chunks = self._chunk_turn(self.turn_count, user_msg, assistant_msg, tool_results)
        for chunk in chunks:
            chunk.embedding = self.embedding_fn(chunk.content)
            self.vector_store.upsert(chunk)
        self.recent.extend(chunks)

    @staticmethod
    def _chunk_id(turn: int, part: str) -> str:
        return hashlib.sha256(f"t{turn}-{part}".encode()).hexdigest()[:12]

    def _chunk_turn(self, turn: int, user: str, assistant: str, tools: list[str]) -> list[ContextChunk]:
        chunks = [ContextChunk(
            chunk_id=self._chunk_id(turn, "user"),
            turn=turn,
            role="user",
            content=user,
        )]
        if assistant:
            chunks.append(ContextChunk(
                chunk_id=self._chunk_id(turn, "asst"),
                turn=turn,
                role="assistant",
                content=assistant,
            ))
        if tools:
            # Tool output gets its own chunk so precise data is retrievable.
            chunks.append(ContextChunk(
                chunk_id=self._chunk_id(turn, "tools"),
                turn=turn,
                role="tool",
                content="\n".join(tools),
                metadata={"type": "tool_result"},
            ))
        return chunks

    def build_context(self, current_query: str) -> list[dict]:
        query_emb = self.embedding_fn(current_query)
        relevant = self.vector_store.query(query_emb, top_k=self.top_k)
        relevant_ids = {c.chunk_id for c in relevant}
        # Up to 3 chunks per turn, so this slice covers the last recent_limit turns.
        recent_chunks = [
            c for c in self.recent[-self.recent_limit * 3:]
            if c.chunk_id not in relevant_ids
        ]
        # Restore chronological order. Sorting on the hashed chunk_id would
        # shuffle the chunks inside a turn, so sort on role order instead.
        all_chunks = sorted(
            relevant + recent_chunks,
            key=lambda c: (c.turn, ROLE_ORDER.get(c.role, len(ROLE_ORDER)))
        )
        return [{"role": c.role, "content": c.content} for c in all_chunks]
```
Key design decisions:
- Split each turn into 2-3 semantic chunks (user/assistant/tool) rather than packing the entire turn into one block, which improves retrieval precision
- Always retain the last N turns in full so the retrieval system does not "forget" what just happened
- Tool returns get their own chunks because they typically contain precise data (IDs, amounts, statuses) that should be weighted heavily during retrieval
Best for: Conversations with frequent topic shifts, knowledge QA agents, and scenarios requiring historical detail recall.
Tool reference: Context7 provides an out-of-the-box solution for library documentation retrieval injection. Claude Context uses Milvus for code-level semantic retrieval -- the same idea applied to codebases.
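The buffer above expects a `vector_store` that exposes `upsert` and `query`. For local experiments, a minimal in-memory stand-in might look like the sketch below. Everything here is an assumption made for illustration: `Chunk`, `toy_embed` (a bag-of-words count, not a real embedding model), and `InMemoryVectorStore` are hypothetical names, and the sparse dictionary embedding differs from the `list[float]` a real model would return.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    turn: int
    role: str
    content: str
    embedding: Optional[dict] = None


def toy_embed(text: str) -> dict:
    """Bag-of-words term counts; a stand-in for a real embedding model."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    def __init__(self):
        self._chunks = {}

    def upsert(self, chunk: Chunk) -> None:
        # Same chunk_id overwrites the previous entry.
        self._chunks[chunk.chunk_id] = chunk

    def query(self, embedding: dict, top_k: int = 5) -> list:
        ranked = sorted(
            self._chunks.values(),
            key=lambda c: cosine(embedding, c.embedding or {}),
            reverse=True,
        )
        return ranked[:top_k]


store = InMemoryVectorStore()
history = [
    ("t1-user", 1, "user", "I want a refund for order 12345"),
    ("t2-user", 2, "user", "Also, what are your opening hours?"),
    ("t3-tool", 3, "tool", '{"weather": "sunny"}'),
]
for chunk_id, turn, role, content in history:
    store.upsert(Chunk(chunk_id, turn, role, content, toy_embed(content)))

top = store.query(toy_embed("status of the refund on order 12345"), top_k=1)
```

The query shares "refund", "order", and "12345" with the turn-1 message, so that chunk ranks first even though two unrelated turns came after it.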
Strategy 3: Layered Compression + Key Fact Externalization
The first two strategies each have strengths, but neither solves a fundamental problem: some information must never be lost -- the user's name, order number, current task objective, confirmed constraints.
Layered compression splits context into a "hot layer" and a "cold layer." The hot layer retains all details needed for the current turn. The cold layer stores only structured summaries and key entities (`warm_summary` and `key_facts` in the code below).
```python
import re
from dataclasses import dataclass, field

# Reference IDs such as "AB-12345": 2-3 capital letters, a hyphen, 4+ digits.
REFERENCE_ID = re.compile(r"\b[A-Z]{2,3}-\d{4,}\b")


@dataclass
class KeyFact:
    key: str
    value: str
    source_turn: int
    confidence: float = 1.0


@dataclass
class LayeredContext:
    hot_messages: list[dict] = field(default_factory=list)
    warm_summary: str = ""
    key_facts: dict[str, KeyFact] = field(default_factory=dict)
    hot_limit: int = 6   # exchanges (user + assistant pairs) kept verbatim
    turn_count: int = 0  # survives demotion, unlike len(hot_messages)

    def add_exchange(self, user: str, assistant: str):
        self.turn_count += 1
        self.hot_messages.append({"role": "user", "content": user})
        self.hot_messages.append({"role": "assistant", "content": assistant})
        self._extract_facts(user, assistant)
        self._maybe_demote()

    def _extract_facts(self, user: str, assistant: str):
        # Rule-based extraction; only matches the REFERENCE_ID pattern.
        for id_ in REFERENCE_ID.findall(user + " " + assistant):
            self.key_facts[id_] = KeyFact(
                key="reference_id",
                value=id_,
                source_turn=self.turn_count,
            )

    def _maybe_demote(self):
        # Each exchange is two messages, hence hot_limit * 2.
        if len(self.hot_messages) > self.hot_limit * 2:
            boundary = len(self.hot_messages) - self.hot_limit * 2
            old = self.hot_messages[:boundary]
            self.warm_summary = self._summarize(old, self.warm_summary)
            self.hot_messages = self.hot_messages[boundary:]

    def build_prompt(self) -> str:
        parts = []
        # Key facts go first and are never subject to demotion.
        if self.key_facts:
            facts_str = "\n".join(
                f"- {k}: {v.value}" for k, v in self.key_facts.items()
            )
            parts.append(f"[Key facts that must not be lost]\n{facts_str}")
        if self.warm_summary:
            parts.append(f"[Earlier conversation summary]\n{self.warm_summary}")
        parts.extend([f"{m['role']}: {m['content']}" for m in self.hot_messages])
        return "\n\n".join(parts)

    def _summarize(self, old_messages: list[dict], prev: str) -> str:
        # Truncation placeholder; avoids a leading separator on the first pass.
        digest = "; ".join(m["content"][:60] for m in old_messages)
        return f"{prev} | {digest}" if prev else digest
```
Key design decisions:
- `key_facts` is the cold layer's core value -- it never scrolls away with the context window and is always injected as part of the system prompt
- Key entity extraction can be rule-based (regex for order numbers, emails) or LLM-judged. The former is fast but narrow; the latter is more flexible but costs more
- Layering signals to the model which information must never be lost, making it more likely to reference that information during generation
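As a sketch of the rule-based option, the extractor below stores facts under typed keys such as `email` and `order_id`, with the most recent value winning. The patterns, the key names, and the single-value-per-key policy are assumptions for illustration; a real agent would need patterns matched to its own ID formats and a policy for conversations that mention several orders.

```python
import re
from typing import Optional

# Hypothetical patterns; adjust them to the identifiers your system uses.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "order_id": re.compile(r"\b[A-Z]{2,3}-\d{4,}\b"),
}


def extract_key_facts(text: str, turn: int, facts: Optional[dict] = None) -> dict:
    """Update `facts` in place with entities found in `text`; latest value wins."""
    facts = {} if facts is None else facts
    for key, pattern in PATTERNS.items():
        for value in pattern.findall(text):
            facts[key] = {"value": value, "source_turn": turn}
    return facts


def render_facts(facts: dict) -> str:
    """Format facts as a bullet block suitable for a system prompt."""
    return "\n".join(
        f"- {key}: {fact['value']} (turn {fact['source_turn']})"
        for key, fact in sorted(facts.items())
    )


facts = extract_key_facts("Reach me at ana@example.com about AB-12345.", turn=2)
extract_key_facts("Use ana.new@example.com instead", turn=5, facts=facts)
print(render_facts(facts))
# - email: ana.new@example.com (turn 5)
# - order_id: AB-12345 (turn 2)
```

Keeping `source_turn` alongside each value makes it possible to tell a stale fact from a freshly corrected one when the two conflict.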
Best for: Task-oriented agents (order processing, ticketing systems) and any scenario requiring precise memory across many turns.
Tool reference: Context Mode's "Think in Code" paradigm is essentially a form of layered compression -- having the LLM write analysis scripts instead of processing raw data directly, compressing 315KB of tool output into 5.4KB for a 98% context reduction.
Decay Diagnosis Checklist
Before choosing a strategy, run a quick diagnosis on your agent:
| Diagnostic | How to check | If it matches |
|---|---|---|
| Average conversation turns > 15 | Count user messages in logs | Some form of compression is needed |
| Tool returns average > 500 tokens | Sample tool output lengths | Prioritize layered compression |
| Users frequently reference earlier content | Search for "earlier", "before", "that" | Retrieval injection beats summaries |
| Heavy precise data (IDs, amounts) | Analyze numeric/coded patterns in messages | Must externalize key entities |
| Linear topic progression, rarely backtracks | Observe topic change curves | Sliding window is usually enough |
Do not reach for RAG or layered compression immediately. If your conversations average only 8 turns, a sliding window with a 200-token summary is sufficient. Premature optimization is the most common mistake in context engineering.
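The checklist can be turned into a small helper that you run over conversation logs. The 15-turn and 500-token thresholds and the search words come from the table; the 20% backreference threshold, the `ConversationStats` fields, and the recommendation labels are assumptions for this sketch. The word search is deliberately crude ("that" is a very common word), so treat the rate as an upper bound.

```python
import re
from dataclasses import dataclass

BACKREFERENCE_WORDS = ("earlier", "before", "that")


@dataclass
class ConversationStats:
    avg_turns: float
    avg_tool_tokens: float
    backreference_rate: float  # share of user messages referring to earlier content
    has_precise_data: bool     # IDs, amounts, and similar exact values


def backreference_rate(user_messages: list) -> float:
    """Share of user messages containing at least one backreference word."""
    if not user_messages:
        return 0.0
    hits = sum(
        any(re.search(rf"\b{word}\b", message.lower()) for word in BACKREFERENCE_WORDS)
        for message in user_messages
    )
    return hits / len(user_messages)


def recommend(stats: ConversationStats, backreference_threshold: float = 0.2) -> list:
    """Map the diagnostic rows to strategies, defaulting to the simplest one."""
    recommendations = []
    if stats.has_precise_data:
        recommendations.append("externalize key entities")
    if stats.avg_tool_tokens > 500:
        recommendations.append("layered compression")
    if stats.backreference_rate > backreference_threshold:
        recommendations.append("retrieval injection")
    if not recommendations:
        recommendations.append("sliding window + summary")
    return recommendations


print(recommend(ConversationStats(8, 200, 0.0, False)))
# ['sliding window + summary']
```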
Three Common Mistakes
Mistake 1: Treating summaries as translations
When using LLMs for summarization, many people prompt with "translate this conversation into an English summary." A summary is not a translation -- it should preserve entities, decisions, and action items, not rewrite sentences. A 500-word summary that drops the only order number is worthless no matter how fluent it reads.
Fix: Explicitly require the summary prompt to extract "key entities + confirmed items + action items," not "summarize this conversation."
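One way to phrase such a prompt is sketched below. The wording and the `build_summary_prompt` helper are assumptions, not a tested template; the point is that the three extraction targets are named explicitly and exact values are copied rather than paraphrased.

```python
def build_summary_prompt(messages: list, prev_summary: str = "") -> str:
    """Build an extraction-style summary prompt from role/content messages."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    sections = [
        "Extract the following from the conversation. "
        "Copy IDs, amounts, names, and dates verbatim; do not paraphrase them.",
        "1. Key entities (order numbers, emails, names, amounts)\n"
        "2. Confirmed items (decisions and constraints the user agreed to)\n"
        "3. Action items (what remains to be done, and by whom)",
    ]
    if prev_summary:
        # Carry earlier extractions forward so nothing drops between passes.
        sections.append(f"Previous summary to carry forward:\n{prev_summary}")
    sections.append(f"Conversation:\n{transcript}")
    return "\n\n".join(sections)


prompt = build_summary_prompt(
    [
        {"role": "user", "content": "Please cancel order AB-12345."},
        {"role": "assistant", "content": "Cancelled. The refund takes 3-5 days."},
    ],
    prev_summary="Key entities: email ana@example.com",
)
```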
Mistake 2: Poor retrieval precision is worse than no retrieval
RAG effectiveness depends heavily on retrieval quality. If the embedding model processes short texts ("okay", "yes") the same way as long texts, retrieval results will be noisy. And references like "that order" may be far from the actually relevant history in vector space.
Fix: Perform query rewriting before retrieval -- resolve pronouns and references to their full forms, then search. Also consider weighting retrieval scores differently for tool returns and user messages, since the two carry different kinds of information.
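A rule-based version of query rewriting can be sketched in a few lines. The `rewrite_query` helper, the noun list, and the `last_seen` mapping (entity type to most recently mentioned value) are all assumptions; an LLM-based rewriter would handle far more phrasings.

```python
import re

# Hypothetical pattern: a demonstrative followed by a known entity noun.
VAGUE_REFERENCE = re.compile(r"\b(that|this)\s+(order|ticket|invoice)\b", re.IGNORECASE)


def rewrite_query(query: str, last_seen: dict) -> str:
    """Replace vague references with the most recently mentioned entity."""

    def resolve(match):
        noun = match.group(2).lower()
        value = last_seen.get(noun)
        # Leave the text untouched when no entity of that type is known.
        return f"{noun} {value}" if value else match.group(0)

    return VAGUE_REFERENCE.sub(resolve, query)


print(rewrite_query("Where is that order?", {"order": "AB-12345"}))
# Where is order AB-12345?
```

The rewritten query now contains the identifier itself, so it lands near the turn that first mentioned the order instead of near other sentences containing "that".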
Mistake 3: A cold layer with only summaries, no entities
Layered compression is often implemented as "hot layer + longer summary." This misses the point. The cold layer's value is not "shorter history" -- it is structured key information. The user's email address from turn 2 should not be buried on line 50 of a summary. It should be a key-value pair, always visible in the system prompt.
Summary
- Context decay is an inherent property of the Transformer architecture, not solvable by "better models." Positional sensitivity decline, saliency dilution, and semantic drift compound to create a quality inflection point around turns 10-20
- Sliding window + summary suits linear conversations -- simple to implement but weak at preserving precise information. Best for customer service and technical support where backtracking is rare
- Retrieval injection suits topic-jumping conversations -- treats history as a document library retrieved on demand. Retains detail but retrieval precision is the bottleneck. Requires query rewriting and chunking strategy
- Layered compression is usually the strongest choice for production task agents -- hot layer preserves detail, cold layer stores entities. Critical information is always externalized outside summaries so the model does not "forget" names, order numbers, and constraints
- Diagnose before choosing. Average turns, tool return size, backtracking frequency, and precise data density determine which strategy fits. Do not reach for the most complex solution first
For hands-on comparison, explore Agent Skills for Context Engineering (context degradation patterns and compression strategies), Claude Context (codebase semantic retrieval), Context7 (library documentation context injection), and Context Mode (tool output context reduction) to see different context management approaches in practice.
Projects in this article
Agent Skills for Context Engineering
16.8k ⭐ A comprehensive collection of Agent Skills for context engineering, multi-agent architectures, and production agent systems. Use when building, optimizing, or debugging agent systems.
Claude Context
12.0k ⭐ Code search MCP for Claude Code and coding agents. Makes entire codebases available as context for AI coding assistants using vector-based semantic code search for precise understanding of large projects.
Context7
58.4k ⭐ Context7 is Upstash's MCP server that pulls up-to-date, version-specific library documentation and code examples into the context of LLMs and AI code editors.
Context Mode
18.4k ⭐ Context Mode is a context window optimization tool for AI coding agents that sandboxes tool output for 98% context reduction across 12 major platforms.