Context Engineering: Context Decay and Recovery in Long-Conversation Agents

Long-conversation agents fail at context management, not model capability. A systematic comparison of sliding window, retrieval injection, and layered compression strategies with practical decay diagnosis and recovery patterns.

AgentList Team · June 29, 2026
Context Engineering · Long Context · RAG · Context7 · Memory Systems

After building a few multi-turn agent systems, you hit the same bottleneck: the model is rarely the limiting factor. The context window is.

Consider a customer service agent that averages 12 turns per conversation, with 300-500 tokens of messages per turn. Add tool returns, system prompts, and retrieval results, and by turn 8 you can already exceed 32K tokens. In longer sessions, by turn 15 the model starts "forgetting" constraints the user mentioned in turn 3, or conflating tool-returned JSON with the user's question.
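The arithmetic is easy to check with a back-of-the-envelope sketch. Every constant below except the 300-500 token message range is an assumed figure chosen for illustration, not a measurement; substitute numbers from your own logs.

```python
from typing import Optional

# All constants are illustrative assumptions, not measurements.
SYSTEM_PROMPT_TOKENS = 2_000   # assumed size of the system prompt
MESSAGE_TOKENS = 400           # midpoint of the 300-500 tokens per turn
TOOL_RETURN_TOKENS = 2_500     # assumed average tool output per turn
RETRIEVAL_TOKENS = 1_000       # assumed retrieved passages per turn


def context_size_at(turn: int) -> int:
    """Total tokens in context after `turn` turns if nothing is ever dropped."""
    per_turn = MESSAGE_TOKENS + TOOL_RETURN_TOKENS + RETRIEVAL_TOKENS
    return SYSTEM_PROMPT_TOKENS + turn * per_turn


def first_turn_over(budget: int, max_turns: int = 100) -> Optional[int]:
    """First turn at which the accumulated context exceeds `budget` tokens."""
    for turn in range(1, max_turns + 1):
        if context_size_at(turn) > budget:
            return turn
    return None


print(first_turn_over(32_000))  # 8 under the assumptions above
```

Under these assumptions the messages themselves are the smallest contributor; tool returns and retrieved passages account for most of the growth.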

This is context decay -- the information has not actually disappeared. Its position and salience have been diluted by subsequent content.

This article skips the "how to expand your context window" primer and answers a more practical question: when context is running out, what should you discard, what should you preserve, and how do you recover what was lost?

Three Mechanisms of Decay

Before choosing a strategy, understand why decay happens. There are three primary mechanisms:

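Positional decay: Models tend to use distant context less reliably than recent context. The size of the effect depends on the model and its positional encoding, so read the following figure as an illustration rather than a measurement: information from turn 1 may carry only 5-10% of its original attention weight by turn 20. This behavior comes from how attention is distributed over long inputs, not from a bug in your agent.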

Saliency dilution: Every new message competes for the attention "budget." Large tool-return JSON blobs, verbose system prompts, even emojis crowd out critical information.

Semantic drift: In multi-turn conversations, topics naturally shift. The "order #12345" discussed in turn 3 is still in the context by turn 12, but the model is much less likely to proactively connect it to the current topic.

These three mechanisms compound, creating a quality inflection point around turns 10-20 where responses become vague, tool calls become imprecise, and constraints get ignored.

Strategy 1: Sliding Window + Summary Bridging

The most straightforward approach: keep the last N turns in full, summarize older turns.

from dataclasses import dataclass, field
from typing import Any
from enum import Enum


class MessageRole(Enum):
    USER = "user"
    ASSISTANT = "assistant"
    SYSTEM = "system"
    TOOL = "tool"


@dataclass
class Message:
    role: MessageRole
    content: str
    turn: int
    token_count: int = 0


@dataclass
class SlidingWindowConfig:
    window_size: int = 6
    summary_trigger: int = 10
    max_summary_tokens: int = 500


class ConversationBuffer:
    def __init__(self, config: SlidingWindowConfig):
        self.config = config
        self.messages: list[Message] = []
        self.summary: str = ""

    def add(self, message: Message):
        self.messages.append(message)
        self._maybe_compress()

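    def _maybe_compress(self):
        # Both thresholds count messages, not turns: a turn with a tool call
        # adds three messages (user, tool, assistant).
        if len(self.messages) < self.config.summary_trigger:
            return
        boundary = len(self.messages) - self.config.window_size
        if boundary <= 0:
            return
        # Fold everything older than the window into the running summary.
        to_compress = self.messages[:boundary]
        self.summary = self._compress(to_compress, self.summary)
        self.messages = self.messages[boundary:]

    def _compress(self, old_messages: list[Message], prev_summary: str) -> str:
        # Placeholder: truncation stands in for an LLM summarization call.
        # Assistant messages are dropped, and max_summary_tokens is not
        # enforced here, so the summary grows with every pass.
        parts = []
        if prev_summary:
            parts.append(f"Previous summary: {prev_summary}")
        for msg in old_messages:
            if msg.role == MessageRole.USER:
                parts.append(f"User asked: {msg.content[:100]}")
            elif msg.role == MessageRole.TOOL:
                parts.append(f"Tool returned: {msg.content[:80]}")
        return " | ".join(parts)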

    def build_context(self) -> list[dict]:
        context = []
        if self.summary:
            context.append({
                "role": "system",
                "content": f"[Conversation summary] {self.summary}"
            })
        for msg in self.messages:
            context.append({
                "role": msg.role.value,
                "content": msg.content
            })
        return context

Key design decisions:

  • Summaries preserve structured information -- who did what and what was the result -- rather than rewriting sentences
  • Set summary_trigger to roughly window_size * 1.5 to give summaries breathing room
  • Include the old summary in the input for each compression pass to avoid information gaps
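The buffer above cuts by message count and never reads the token_count field on Message. A variant that cuts by token budget may behave better when tool returns vary widely in size. The sketch below is one possible way to do this; `Msg` and `split_by_token_budget` are hypothetical names, and token counts are assumed to be computed elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    role: str
    content: str
    token_count: int  # assumed to be filled in by your tokenizer


def split_by_token_budget(messages: list[Msg], budget: int) -> tuple[list[Msg], list[Msg]]:
    """Return (to_summarize, to_keep), keeping the newest messages that fit in `budget`."""
    kept_tokens = 0
    cut = len(messages)  # index of the oldest message that is kept
    # Walk backwards from the newest message until the budget is exhausted.
    for i in range(len(messages) - 1, -1, -1):
        if kept_tokens + messages[i].token_count > budget:
            break
        kept_tokens += messages[i].token_count
        cut = i
    # Always keep the latest message, even if it alone exceeds the budget.
    if messages and cut == len(messages):
        cut = len(messages) - 1
    return messages[:cut], messages[cut:]


history = [
    Msg("user", "Where is my order?", 100),
    Msg("tool", '{"status": "shipped"}', 900),
    Msg("assistant", "It shipped yesterday.", 200),
    Msg("user", "Can I change the address?", 150),
]
old, recent = split_by_token_budget(history, budget=400)
print(len(old), len(recent))  # 2 2
```

The first list is what would be passed to the summarizer; the second stays in the window verbatim.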

Best for: Customer service, technical support, and other linear conversations where backtracking is rare.

Limitation: Summaries lose detail. If the conversation contains precise values -- order numbers, amounts, person names -- the sliding window will eventually push them out. Summary quality also depends heavily on the LLM's compression ability.

Strategy 2: Retrieval-Augmented Context

Instead of compressing all history, treat past conversations as a "document library" and retrieve relevant passages on each turn.

import hashlib
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ContextChunk:
    chunk_id: str
    turn: int
    role: str
    content: str
    embedding: list[float] | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


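class RetrievalAugmentedBuffer:
    def __init__(self, embedding_fn, vector_store, top_k: int = 5):
        self.embedding_fn = embedding_fn
        self.vector_store = vector_store
        self.top_k = top_k
        self.recent: list[ContextChunk] = []
        self.recent_limit = 4
        # self.recent holds chunks (up to 3 per turn), so turns are counted separately.
        self.turn_count = 0

    def add_turn(self, user_msg: str, assistant_msg: str, tool_results: list[str]):
        self.turn_count += 1
        turn = self.turn_count
        chunks = self._chunk_turn(turn, user_msg, assistant_msg, tool_results)
        for chunk in chunks:
            # Every chunk is embedded and stored so older turns stay retrievable.
            chunk.embedding = self.embedding_fn(chunk.content)
            self.vector_store.upsert(chunk)
        self.recent.extend(chunks)
        # Keep only what build_context reads: up to 3 chunks per recent turn.
        self.recent = self.recent[-self.recent_limit * 3:]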

    def _chunk_turn(self, turn: int, user: str, assistant: str, tools: list[str]) -> list[ContextChunk]:
        chunks = []
        chunks.append(ContextChunk(
            chunk_id=hashlib.sha256(f"t{turn}-user".encode()).hexdigest()[:12],
            turn=turn,
            role="user",
            content=user,
        ))
        if assistant:
            chunks.append(ContextChunk(
                chunk_id=hashlib.sha256(f"t{turn}-asst".encode()).hexdigest()[:12],
                turn=turn,
                role="assistant",
                content=assistant,
            ))
        if tools:
            chunks.append(ContextChunk(
                chunk_id=hashlib.sha256(f"t{turn}-tools".encode()).hexdigest()[:12],
                turn=turn,
                role="tool",
                content="\n".join(tools),
                metadata={"type": "tool_result"},
            ))
        return chunks

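    def build_context(self, current_query: str) -> list[dict]:
        query_emb = self.embedding_fn(current_query)
        relevant = self.vector_store.query(query_emb, top_k=self.top_k)
        relevant_ids = {c.chunk_id for c in relevant}
        # Up to 3 chunks per turn, so this slice approximates the last recent_limit turns.
        recent_chunks = [c for c in self.recent[-self.recent_limit * 3:] if c.chunk_id not in relevant_ids]
        # Order by turn, then by role in the order _chunk_turn emits them.
        # Sorting on chunk_id (a hash) would shuffle messages within a turn.
        role_order = {"user": 0, "assistant": 1, "tool": 2}
        all_chunks = sorted(
            relevant + recent_chunks,
            key=lambda c: (c.turn, role_order.get(c.role, 3))
        )
        return [{"role": c.role, "content": c.content} for c in all_chunks]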

Key design decisions:

  • Split each turn into 2-3 semantic chunks (user/assistant/tool) rather than packing the entire turn into one block, which improves retrieval precision
  • Always retain the last N turns in full so the retrieval system does not "forget" what just happened
  • Tool returns get their own chunks because they typically contain precise data (IDs, amounts, statuses) that should be weighted heavily during retrieval

Best for: Conversations with frequent topic shifts, knowledge QA agents, and scenarios requiring historical detail recall.
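The buffer above expects an embedding function and a vector store exposing upsert and query. For local experiments, an in-memory stand-in may be enough. The sketch below is an assumption-laden toy: `InMemoryVectorStore`, `toy_embedding`, and `Chunk` are hypothetical names, and the hashed bag-of-words embedding is only a placeholder for a real embedding model.

```python
import math
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """Stand-in for ContextChunk with only the fields the store needs."""
    chunk_id: str
    turn: int
    role: str
    content: str
    embedding: Optional[list[float]] = None


def toy_embedding(text: str, dims: int = 64) -> list[float]:
    """Hashed bag-of-words vector; a placeholder for a real embedding model."""
    vec = [0.0] * dims
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[sum(word.encode()) % dims] += 1.0  # deterministic bucket per word
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    def __init__(self) -> None:
        self._chunks: dict[str, Chunk] = {}

    def upsert(self, chunk: Chunk) -> None:
        # A chunk with an existing chunk_id overwrites the earlier entry.
        self._chunks[chunk.chunk_id] = chunk

    def query(self, query_emb: list[float], top_k: int = 5) -> list[Chunk]:
        # Brute-force scan: fine for one conversation, not for a large corpus.
        candidates = [c for c in self._chunks.values() if c.embedding is not None]
        candidates.sort(key=lambda c: cosine(query_emb, c.embedding), reverse=True)
        return candidates[:top_k]
```

Passing `toy_embedding` and an `InMemoryVectorStore()` to the buffer's constructor should be enough to exercise the retrieval path end to end before wiring in a real embedding model and store.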

Tool reference: Context7 provides an out-of-the-box solution for library documentation retrieval injection. Claude Context uses Milvus for code-level semantic retrieval -- the same idea applied to codebases.

Strategy 3: Layered Compression + Key Fact Externalization

The first two strategies each have strengths, but neither solves a fundamental problem: some information must never be lost -- the user's name, order number, current task objective, confirmed constraints.

Layered compression splits context into a "hot layer" and a "cold layer." The hot layer retains all details needed for the current turn. The cold layer stores only structured summaries and key entities (warm_summary and key_facts in the code that follows).

import re
from dataclasses import dataclass, field
from typing import Any


@dataclass
class KeyFact:
    key: str
    value: str
    source_turn: int
    confidence: float = 1.0


@dataclass
class LayeredContext:
    hot_messages: list[dict] = field(default_factory=list)
    warm_summary: str = ""
    key_facts: dict[str, KeyFact] = field(default_factory=dict)
    hot_limit: int = 6   # exchanges (user + assistant pairs) kept in the hot layer
    turn_count: int = 0  # total exchanges seen, including demoted ones

    def add_exchange(self, user: str, assistant: str):
        self.turn_count += 1
        self.hot_messages.append({"role": "user", "content": user})
        self.hot_messages.append({"role": "assistant", "content": assistant})
        self._extract_facts(user, assistant)
        self._maybe_demote()

    def _extract_facts(self, user: str, assistant: str):
        # Rule-based extraction: matches IDs shaped like "AB-1234" or "ORD-12345".
        ids = re.findall(r'\b[A-Z]{2,3}-\d{4,}\b', user + " " + assistant)
        for id_ in ids:
            self.key_facts[id_] = KeyFact(
                key="reference_id",
                value=id_,
                # len(hot_messages) // 2 would undercount once messages are demoted.
                source_turn=self.turn_count,
            )

    def _maybe_demote(self):
        if len(self.hot_messages) > self.hot_limit * 2:
            boundary = len(self.hot_messages) - self.hot_limit * 2
            old = self.hot_messages[:boundary]
            self.warm_summary = self._summarize(old, self.warm_summary)
            self.hot_messages = self.hot_messages[boundary:]

    def build_prompt(self) -> str:
        parts = []
        if self.key_facts:
            # Label each fact with its type; the dict key is just the value again.
            facts_str = "\n".join(
                f"- {v.key}: {v.value}" for v in self.key_facts.values()
            )
            parts.append(f"[Key facts that must not be lost]\n{facts_str}")
        if self.warm_summary:
            parts.append(f"[Earlier conversation summary]\n{self.warm_summary}")
        parts.extend([f"{m['role']}: {m['content']}" for m in self.hot_messages])
        return "\n\n".join(parts)

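    def _summarize(self, old_messages: list[dict], prev: str) -> str:
        # Placeholder: truncation stands in for an LLM summarization call.
        digest = "; ".join(m["content"][:60] for m in old_messages)
        # Avoid a dangling " | " separator when there is no previous summary.
        return f"{prev} | {digest}" if prev else digest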

Key design decisions:

  • key_facts is the cold layer's core value -- it never scrolls away with the context window and is always injected as part of the system prompt
  • Key entity extraction can be rule-based (regex for order numbers, emails) or LLM-judged. The former is fast but narrow; the latter is more flexible but costs more
  • Labeling the facts block explicitly tells the model which information must not be lost, which makes it more likely to reference those facts during generation

Best for: Task-oriented agents (order processing, ticketing systems) and any scenario requiring precise memory across many turns.
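The rule-based route mentioned above can be extended to more than one entity type. The sketch below is a minimal illustration: `FACT_PATTERNS`, `extract_key_facts`, and `render_facts` are hypothetical names, and both regexes are assumptions that need adjusting to your own ID and contact formats.

```python
import re

# Hypothetical patterns; adjust to your own ID and contact formats.
FACT_PATTERNS = {
    "reference_id": re.compile(r"\b[A-Z]{2,3}-\d{4,}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract_key_facts(text: str, facts: dict[str, list[str]]) -> dict[str, list[str]]:
    """Add every newly seen entity in `text` to `facts`, grouped by type."""
    for key, pattern in FACT_PATTERNS.items():
        for value in pattern.findall(text):
            values = facts.setdefault(key, [])
            if value not in values:  # keep first-seen order, no duplicates
                values.append(value)
    return facts


def render_facts(facts: dict[str, list[str]]) -> str:
    """Format the facts as a block that can be prepended to the prompt."""
    lines = [f"- {key}: {', '.join(values)}" for key, values in sorted(facts.items())]
    return "[Key facts that must not be lost]\n" + "\n".join(lines)


facts: dict[str, list[str]] = {}
extract_key_facts("My order is ORD-12345, reach me at jane@example.com", facts)
extract_key_facts("Also check AB-9876 and ORD-12345 again.", facts)
print(render_facts(facts))
# [Key facts that must not be lost]
# - email: jane@example.com
# - reference_id: ORD-12345, AB-9876
```

Regexes only catch entities with a predictable shape. Names, goals, and constraints would still need an LLM-judged extraction pass.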

Tool reference: Context Mode's "Think in Code" paradigm is essentially a form of layered compression -- having the LLM write analysis scripts instead of processing raw data directly, compressing 315KB of tool output into 5.4KB for a 98% context reduction.

Decay Diagnosis Checklist

Before choosing a strategy, run a quick diagnosis on your agent:

| Diagnostic | How to check | If it matches |
| --- | --- | --- |
| Average conversation turns > 15 | Count user messages in logs | Some form of compression is needed |
| Tool returns average > 500 tokens | Sample tool output lengths | Prioritize layered compression |
| Users frequently reference earlier content | Search for "earlier", "before", "that" | Retrieval injection beats summaries |
| Heavy precise data (IDs, amounts) | Analyze numeric/coded patterns in messages | Must externalize key entities |
| Linear topic progression, rarely backtracks | Observe topic change curves | Sliding window is usually enough |

Do not reach for RAG or layered compression immediately. If your conversations average only 8 turns, a sliding window with a 200-token summary is sufficient. Premature optimization is the most common mistake in context engineering.
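The checklist can be turned into a small helper that maps log statistics to recommendations. This is only a sketch: `ConversationStats` and `recommend` are hypothetical names, the turn and token thresholds come from the checklist, and the two rate thresholds (0.2) are assumptions because the checklist gives no numbers for "frequently" or "heavy".

```python
from dataclasses import dataclass


@dataclass
class ConversationStats:
    avg_turns: float
    avg_tool_tokens: float
    backreference_rate: float  # share of user messages referring to earlier content
    precise_data_rate: float   # share of messages containing IDs, amounts, codes
    linear_topics: bool        # topics progress linearly, rarely backtrack


def recommend(stats: ConversationStats,
              backref_threshold: float = 0.2,      # assumed cutoff
              precise_threshold: float = 0.2) -> list[str]:  # assumed cutoff
    """Return the checklist outcomes that match the given statistics."""
    recs = []
    if stats.avg_turns > 15:
        recs.append("some form of compression is needed")
    if stats.avg_tool_tokens > 500:
        recs.append("prioritize layered compression")
    if stats.backreference_rate > backref_threshold:
        recs.append("prefer retrieval injection over summaries")
    if stats.precise_data_rate > precise_threshold:
        recs.append("externalize key entities")
    if stats.linear_topics:
        recs.append("sliding window is usually enough")
    # Nothing matched: start with the simplest option.
    return recs or ["start with a sliding window and a short summary"]


print(recommend(ConversationStats(8, 200, 0.05, 0.05, True)))
# ['sliding window is usually enough']
```

Several rows can match at once, in which case the recommendations stack rather than compete.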

Three Common Mistakes

Mistake 1: Treating summaries as translations

When using LLMs for summarization, many people prompt with "translate this conversation into an English summary." A summary is not a translation -- it should preserve entities, decisions, and action items, not rewrite sentences. A 500-word summary that drops the only order number is worthless no matter how fluent it reads.

Fix: Have the summary prompt explicitly ask for "key entities + confirmed items + action items" instead of "summarize this conversation."
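One way to apply this fix is to pair an extraction-style prompt with a cheap check that no ID from the source went missing. The sketch below assumes messages are dicts with "role" and "content" keys; `build_summary_prompt` and `missing_ids` are hypothetical names, the prompt wording is only a starting point, and the ID pattern reuses the shape from the Strategy 3 regex.

```python
import re

ID_PATTERN = re.compile(r"\b[A-Z]{2,3}-\d{4,}\b")  # assumed ID shape


def build_summary_prompt(messages: list[dict], previous_summary: str = "") -> str:
    """Build a prompt that asks for extraction rather than a retelling."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return (
        "Extract the following from the conversation. Do not retell it.\n"
        "1. Key entities (names, IDs, amounts, dates), copied verbatim\n"
        "2. Confirmed items (decisions and constraints that were agreed on)\n"
        "3. Action items (who does what next)\n\n"
        f"Previous summary:\n{previous_summary or '(none)'}\n\n"
        f"Conversation:\n{transcript}"
    )


def missing_ids(messages: list[dict], summary: str) -> list[str]:
    """IDs that appear in the source messages but not in the summary."""
    source_ids = set()
    for m in messages:
        source_ids.update(ID_PATTERN.findall(m["content"]))
    return sorted(source_ids - set(ID_PATTERN.findall(summary)))


messages = [
    {"role": "user", "content": "Refund ORD-12345 please"},
    {"role": "assistant", "content": "Done, ticket TK-0042 opened"},
]
print(missing_ids(messages, "User wants a refund for ORD-12345."))  # ['TK-0042']
```

A non-empty result is a signal to re-run the summary or to copy the missing IDs into the key facts directly.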

Mistake 2: Poor retrieval precision is worse than no retrieval

RAG effectiveness depends heavily on retrieval quality. Short acknowledgments ("okay", "yes") carry almost no semantic content, so if they are embedded and indexed the same way as substantive messages, retrieval results will be noisy. References like "that order" are a second problem: in vector space they may sit far from the history they actually point to.

Fix: Rewrite the query before retrieval -- resolve pronouns and references to their full forms, then search. Also weight tool returns differently from user messages when ranking results.
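A minimal rule-based version of query rewriting might look like the sketch below. `REFERENCE_PATTERNS` and `rewrite_query` are hypothetical names, and the phrase patterns are assumptions; production systems often use an LLM for this step because fixed phrases miss many references.

```python
import re

# Hypothetical mapping from vague phrases to the fact type they usually denote.
REFERENCE_PATTERNS = {
    "reference_id": re.compile(r"\b(?:that|this|my) order\b", re.IGNORECASE),
    "email": re.compile(r"\b(?:that|this|my) email(?: address)?\b", re.IGNORECASE),
}


def rewrite_query(query: str, key_facts: dict[str, str]) -> str:
    """Append the known entity after each vague reference so retrieval can match it."""
    rewritten = query
    for fact_type, pattern in REFERENCE_PATTERNS.items():
        value = key_facts.get(fact_type)
        if value:
            # A function replacement avoids escape issues if the value contains backslashes.
            rewritten = pattern.sub(lambda m, v=value: f"{m.group(0)} {v}", rewritten)
    return rewritten


print(rewrite_query("Where is that order?", {"reference_id": "ORD-12345"}))
# Where is that order ORD-12345?
```

The rewritten query is used only for retrieval; the user's original wording still goes to the model.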

Mistake 3: A cold layer with only summaries, no entities

Layered compression is often implemented as "hot layer + longer summary." This misses the point. The cold layer's value is not "shorter history" -- it is structured key information. The user's email address from turn 2 should not be buried on line 50 of a summary. It should be a key-value pair, always visible in the system prompt.

Summary

  • Context decay stems from how Transformer attention handles long inputs, so a "better model" reduces it but does not remove it. Positional decay, saliency dilution, and semantic drift compound to create a quality inflection point around turns 10-20
  • Sliding window + summary suits linear conversations -- simple to implement but weak at preserving precise information. Best for customer service and technical support where backtracking is rare
  • Retrieval injection suits topic-jumping conversations -- treats history as a document library retrieved on demand. Retains detail but retrieval precision is the bottleneck. Requires query rewriting and chunking strategy
  • Layered compression is the strongest fit for production task agents -- hot layer preserves detail, cold layer stores entities. Critical information is always externalized outside summaries so the model does not "forget" names, order numbers, and constraints
  • Diagnose before choosing. Average turns, tool return size, backtracking frequency, and precise data density determine which strategy fits. Do not reach for the most complex solution first

For hands-on comparison, explore Agent Skills for Context Engineering (context degradation patterns and compression strategies), Claude Context (codebase semantic retrieval), Context7 (library documentation context injection), and Context Mode (tool output context reduction) to see different context management approaches in practice.