Failure Modes / Context / Context Rot

Context Rot

Loses reliable use of earlier context as a long interaction progresses, as facts, plans, constraints, state, or instructions lose force or are misremembered even though they remain nominally available. Unlike Memory Staleness, the degraded information is in active context rather than persisted memory. Unlike Recency Bias, earlier context degrades generally rather than being actively outweighed by newer material.

Examples

FAIL

Constraint loses force

An early constraint is never revoked, but after enough turns it silently stops shaping the model's output.

User: One hard requirement — everything must run on Python 3.8. Assistant: Got it, I'll keep all code 3.8-compatible. [… 40 turns of debugging and refactoring …] Assistant: I've rewritten the parser with a match statement to keep the branching clean.
FAIL

Misremembered decision

Worse than forgetting — the model confidently recalls the opposite of an earlier decision and cites the conversation itself as support.

User: Let's standardize on UTC for all stored timestamps. [… 60 turns later …] Assistant: As we agreed earlier, timestamps are stored in local time, so the migration script converts everything accordingly.
WARN

Degradation showing

No damage yet, but earlier context is visibly losing force — the model re-asks for information that is still in its window.

User: One hard requirement — everything must run on Python 3.8. [… 40 turns later …] Assistant: Before I write this, can you remind me which Python version we're targeting?
PASS

Constraint held under length

Equally deep into the session, the model still honors the early constraint unprompted and anchors its work to it.

User: One hard requirement — everything must run on Python 3.8. [… 40 turns later …] Assistant: Since the spec requires Python 3.8, I've used if/elif chains here instead of a match statement.

Why it happens

01

Model performance degrades nonuniformly as input length grows. Even simple retrieval and reasoning tasks become less reliable at long context, so facts that were used correctly early in a session stop being used correctly later (Hong et al., 2025, "Context Rot: How Increasing Input Tokens Impacts LLM Performance"; Levy et al., 2024, "Same Task, More Tokens").

02

Long-context training data is scarce relative to short exchanges. Models see far fewer examples of honoring an instruction issued tens of thousands of tokens ago, so that behavior is weakly reinforced.

03

Attention to a token competes with attention to everything added since. As a conversation accumulates turns, tool output, and corrections, earlier constraints receive a progressively smaller share of attention without being explicitly revoked.

04

Long sessions accumulate near-duplicate and superseded statements such as drafts, retractions, and partial plans. The model must infer which version is authoritative, and interference between similar passages makes earlier commitments easy to misremember.

05

Instructions stated once are rarely restated, while chat-style training rewards responsiveness to the latest message. Nothing in the objective specifically rewards re-checking old constraints before each new response.

06

Few evaluations measure consistency across very long interactions. Benchmarks mostly test single-turn or short multi-turn quality, so gradual degradation over long sessions goes unmeasured and unoptimized.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

📏

Constraint preservation checks

Extract the constraints and decisions stated early in a session and check each later response against them. The signature is a constraint that shaped the first stretch of work and then silently stops being honored without ever being revoked.

⚖️

LLM-as-judge evaluation

Run a judge over the full transcript to flag late turns that contradict earlier agreements, misattribute past decisions ("as we agreed earlier…"), or re-ask for information still present in the window — the misremembering cases that simple constraint checks miss.

🧪

Golden-set evals

Script long multi-turn sessions that plant facts and constraints early, run many realistic filler turns, then probe — re-ask the fact, request work the constraint should shape — and chart accuracy by session depth to find where reliability starts to decay.

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

📌

Context pinning

Keep standing constraints and decisions restated where attention is strongest — the system prompt or a maintained working-notes block near the active end of context — instead of relying on a single statement forty turns back to keep competing with everything added since.

🧹

Context curation

Prune the near-duplicates that cause misremembering — superseded drafts, retracted plans, abandoned branches — or compact the session around an authoritative record of what was actually decided, so "as we agreed earlier" has one version to bind to instead of several.

Self-check pass

Before responding deep in a session, have the model re-scan the early turns for unrevoked constraints and check its draft against them — and require any "as we agreed" claim to point at the actual turn, which catches confident misrecall like UTC becoming local time.