Midsequence Neglect/Lost in the Middle

Overlooks or underuses information located in the middle of a long prompt, document set, or conversation context. Unlike Recency Bias, the neglected information is not necessarily earlier than the competing context; its position in the middle is the salient factor. Unlike Context Dilution, the problem is positional neglect rather than broad signal weakening.

Examples

FAIL

Middle document missed

The answer sits in the middle document of three, and the model reports it doesn't exist while handling the first and last documents fine.

User: Here are our three vendor contracts. Which one lets us terminate without penalty? [… contract A, then contract B (whose §9.2 allows no-penalty termination), then contract C …]

A: None of the three contracts include a no-penalty termination clause. Contract A requires 90 days' notice and contract C imposes an early-exit fee.

WARN

Edges summarized, middle thinned

No outright error, but a summary of a long report covers the opening and closing sections in detail while the middle chapters get one vague sentence.

User: Summarize this 60-page audit report for the leadership meeting. [… report pasted; chapters 4–6 in the middle contain the main findings …]

A: The report opens with scope and methodology, and closes with a strong set of recommendations. The intervening chapters cover various operational details.

PASS

Middle content retrieved

Given the same contracts, the model finds and cites the clause buried in the middle document.

User: Here are our three vendor contracts. Which one lets us terminate without penalty? [… contract A, then contract B (whose §9.2 allows no-penalty termination), then contract C …]

A: Contract B: §9.2 allows either party to terminate with 30 days' notice and no penalty. Contracts A and C both impose early-exit costs.

Why it happens

Accuracy on long-context retrieval tends to follow a U-shaped curve over position: models use information at the beginning and end of the input far better than information in the middle, even when the documents and question are otherwise identical (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts").

Models assign disproportionately high attention to the earliest tokens in the sequence, a pattern known as attention sinks (Xiao et al., 2023, "Efficient Streaming Language Models with Attention Sinks"). This may concentrate capacity at the start of the input at the expense of the interior.

Common positional encoding schemes, such as RoPE and ALiBi, tend to weight nearby tokens more heavily than distant ones. Tokens in the middle of a long input are far from the generation point, so they lose the recency advantage while lacking the primacy advantage of the opening tokens.

Natural documents concentrate important content at their edges. Leads, abstracts, and conclusions carry disproportionate information in training data, which can teach models a prior that the middle of a text matters less.

Instruction tuning reinforces edge positions. System prompts sit at the start and user questions at the end of most training examples, so the middle of the context is rarely where the supervised signal points.

Long-context evaluations often rely on simple needle-in-a-haystack tests that models can pass while still neglecting mid-sequence content on realistic tasks, so the failure can persist despite strong benchmark scores (Hsieh et al., 2024, "RULER: What's the Real Context Size of Your Long-Context Language Models?").

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

🔀

Position permutation testing

Hold the documents and question fixed, vary only where the answer-bearing passage sits, and compare accuracy across positions. The U-shaped curve — strong at the edges, weak in the middle — is the fingerprint of this mode.
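
A minimal sketch of such a position sweep, under stated assumptions: `ask_model` is a hypothetical callable standing in for whatever client sends documents and a question to the model, the case dictionary keys are invented for illustration, and correctness is judged by a crude substring match that a real harness would likely replace.

```python
from typing import Callable, Dict, List, Sequence


def build_context(distractors: Sequence[str], answer_passage: str, position: int) -> List[str]:
    """Return the documents with the answer-bearing passage inserted at `position`."""
    docs = list(distractors)
    docs.insert(position, answer_passage)
    return docs


def accuracy_by_position(
    cases: Sequence[dict],
    ask_model: Callable[[List[str], str], str],  # hypothetical model client
) -> Dict[int, float]:
    """Sweep the answer passage through every slot and report accuracy per slot.

    Each case is a dict with the (assumed) keys:
    "distractors", "answer_passage", "question", "expected".
    """
    hits: Dict[int, int] = {}
    totals: Dict[int, int] = {}
    for case in cases:
        n_positions = len(case["distractors"]) + 1
        for pos in range(n_positions):
            docs = build_context(case["distractors"], case["answer_passage"], pos)
            reply = ask_model(docs, case["question"])
            totals[pos] = totals.get(pos, 0) + 1
            # Crude correctness check: the expected string appears in the reply.
            if case["expected"].lower() in reply.lower():
                hits[pos] = hits.get(pos, 0) + 1
    return {pos: hits.get(pos, 0) / totals[pos] for pos in sorted(totals)}
```

Plotting the returned dictionary against position makes the shape visible; interior positions scoring well below the first and last ones is the pattern to look for.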

⚖️

LLM-as-judge evaluation

For summaries and analyses of long inputs, run a judge that scores coverage per source section rather than overall quality. Detailed treatment of the opening and closing with one vague sentence for the interior is the soft form of the failure.
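
One way to turn per-section judge scores into a signal is sketched below. It assumes the judge has already produced one coverage score per source section, in document order, on a 0 to 1 scale; the function names, the `edge_fraction` split, and the 0.3 threshold are illustrative choices, not established values.

```python
from statistics import mean
from typing import Sequence


def middle_coverage_gap(scores: Sequence[float], edge_fraction: float = 0.25) -> float:
    """Mean edge coverage minus mean interior coverage.

    A positive value means the middle sections are covered more thinly
    than the opening and closing sections.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("need at least three sections to separate edges from the middle")
    k = max(1, int(n * edge_fraction))  # sections counted as an edge on each side
    k = min(k, (n - 1) // 2)            # always leave at least one interior section
    edges = list(scores[:k]) + list(scores[-k:])
    middle = list(scores[k:n - k])
    return mean(edges) - mean(middle)


def flags_midsequence_neglect(scores: Sequence[float], threshold: float = 0.3) -> bool:
    """Flag a summary whose interior coverage trails edge coverage by `threshold` or more."""
    return middle_coverage_gap(scores) >= threshold
```

Comparing edges to the interior, rather than thresholding an overall score, keeps a summary that is uniformly shallow from being confused with one that specifically skips the middle.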

🧪

Golden-set evals

Maintain retrieval and summarization tasks whose key content sits deliberately mid-context — the middle contract of three, the central chapters of a long report — and rerun them on every model or context-assembly change, since edge-position tests will not catch regressions here.
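
A golden set of this kind could be represented roughly as follows. This is a sketch, not a prescribed format: `GoldenCase`, its fields, and the `ask_model` callable are hypothetical names, and the pass criterion is a simple substring match. The runner refuses cases whose key document sits at an edge, so the set cannot quietly drift into testing only edge positions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class GoldenCase:
    name: str
    documents: Sequence[str]
    key_index: int      # index of the document holding the key content
    question: str
    must_mention: str   # string a passing answer has to contain

    def is_mid_context(self) -> bool:
        """True when the key document is neither first nor last."""
        return 0 < self.key_index < len(self.documents) - 1


def run_golden_set(
    cases: Sequence[GoldenCase],
    ask_model: Callable[[List[str], str], str],  # hypothetical model client
) -> List[str]:
    """Return the names of failing cases; reject cases that do not test the middle."""
    failures = []
    for case in cases:
        if not case.is_mid_context():
            raise ValueError(f"{case.name}: key content sits at an edge")
        reply = ask_model(list(case.documents), case.question)
        if case.must_mention.lower() not in reply.lower():
            failures.append(case.name)
    return failures
```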

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

📐

Position-aware ordering

Assemble context so the most important material sits at the edges the model actually uses — most relevant documents first or last, key instructions at the start and restated at the end — exploiting the U-shaped curve instead of being ambushed by it (Peysakhovich & Lerer, 2023, "Attention Sorting Combats Recency Bias in Long Context Language Models").
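
As an illustration, the sketch below interleaves documents so the highest-scoring ones land at the two edges and the weakest in the middle. It assumes relevance scores already exist (from a retriever or reranker, for example); it is a simple heuristic and not the attention-sorting procedure from the cited paper. The function names and the "Reminder:" wording are made up for this example.

```python
from typing import List, Sequence, Tuple


def order_for_edges(docs_with_scores: Sequence[Tuple[str, float]]) -> List[str]:
    """Place the highest-scoring documents at the edges and the lowest in the middle."""
    ranked = sorted(docs_with_scores, key=lambda pair: pair[1], reverse=True)
    front: List[str] = []
    back: List[str] = []
    for i, (doc, _score) in enumerate(ranked):
        # Alternate: best goes first, second best goes last, and so on inward.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


def assemble_prompt(instruction: str, ordered_docs: Sequence[str]) -> str:
    """State the key instruction at the start and restate it at the end."""
    body = "\n\n".join(ordered_docs)
    return f"{instruction}\n\n{body}\n\nReminder: {instruction}"
```

With scores 0.9, 0.7, 0.5, 0.3 and 0.1, the documents come out in the order 0.9, 0.5, 0.1, 0.3, 0.7, leaving the least relevant one in the center slot.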

🧭

Query decomposition

Process long document sets piecewise instead of in one stuffed prompt — examine each contract in its own call, or each chapter in its own pass, then combine verdicts — so no document is ever "the middle one" when the model reads it.
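
The piecewise pattern might look like the sketch below, where each document gets its own call and the verdicts are combined afterwards. `ask_model` is a hypothetical callable that takes a prompt string and returns the model's reply, and the prompt wording is illustrative only.

```python
from typing import Callable, Dict, Mapping


def answer_per_document(
    documents: Mapping[str, str],
    question: str,
    ask_model: Callable[[str], str],  # hypothetical model client
) -> Dict[str, str]:
    """Ask the question once per document so each one is read in isolation."""
    verdicts: Dict[str, str] = {}
    for name, text in documents.items():
        prompt = (
            f"Document: {name}\n\n{text}\n\n"
            f"Question: {question}\n"
            "Answer for this document only."
        )
        verdicts[name] = ask_model(prompt)
    return verdicts


def combine_verdicts(verdicts: Mapping[str, str]) -> str:
    """Join the per-document verdicts into one block for a final pass or for display."""
    return "\n".join(f"{name}: {verdict}" for name, verdict in verdicts.items())
```

The combined block is short, so a final synthesizing call over it is much less exposed to positional effects than the original stuffed prompt. The trade-off is more calls and no cross-document reasoning within a single pass.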

📝

Instruction constraints

Require explicit per-document or per-section output — a verdict on each contract, coverage of every chapter in a summary — so the middle can't be skipped silently; "the intervening chapters cover various operational details" stops being an acceptable sentence.
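
Such a constraint is easier to enforce when the output is checked mechanically. The sketch below assumes a "Label: text" line format that the prompt requests; the format, the function names, and the placeholder wording are illustrative assumptions rather than a standard.

```python
import re
from typing import List, Sequence


def per_section_instruction(required: Sequence[str]) -> str:
    """Build an instruction that demands one labeled line per document or section."""
    lines = "\n".join(f"{label}: <verdict>" for label in required)
    return "Answer with exactly one line per item, in this format:\n" + lines


def missing_sections(answer: str, required: Sequence[str]) -> List[str]:
    """Return the required labels that have no non-empty 'Label: text' line in the answer."""
    missing = []
    for label in required:
        # Match the label at the start of a line, followed by a colon and some text
        # on that same line.
        pattern = rf"^[ \t]*{re.escape(label)}[ \t]*:[ \t]*\S"
        if not re.search(pattern, answer, flags=re.IGNORECASE | re.MULTILINE):
            missing.append(label)
    return missing
```

A non-empty result from `missing_sections` can trigger a retry or a targeted follow-up call for just the skipped items. The check only confirms that each item was addressed, not that the verdict is correct.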