Failure Modes / Faithfulness / Source Misrepresentation

Source Misrepresentation

Misstates, exaggerates, reverses, or selectively distorts what a cited, retrieved, uploaded, or tool-returned source actually says. Unlike Citation Span Mismatch, the problem is the model's interpretation of the source, not merely an unsupported citation attachment. Unlike Quote Hallucination, the failure can occur without fabricating exact quoted wording.

Examples

FAIL

Dropped hedge strengthens the claim

Paraphrasing discards "may" and "in some patients," turning a cautious finding into a flat assertion the source never made.

[… uploaded study reads: "The supplement may modestly reduce fatigue in some patients with mild deficiency." …] User: What does this study say about the supplement? Assistant: The study shows that the supplement reduces fatigue.
FAIL

Opposing view read as the document's claim

The source quotes a critic in order to rebut them, but retrieval strips the surrounding argument and the model attributes the criticism to the document itself.

[… retrieved passage: "Critics argue the policy raised rents; the report's data show rents were flat." …] User: What does the report conclude about the policy's effect on rents? Assistant: The report concludes that the policy raised rents.
WARN

Faithful facts, shifted emphasis

Every individual statement matches the source, but a minor caveat is promoted to the headline finding, leaving the reader with a conclusion the source did not draw.

[… study's main result is a treatment benefit; a limitations note mentions two cases of mild nausea …] User: What does this study say about the supplement? Assistant: The key takeaway is the study's safety concerns — it documents adverse reactions, alongside some reported benefits.
PASS

Qualifiers preserved

Restates the finding at exactly the strength the source gives it, keeping scope limits and uncertainty intact.

[… uploaded study reads: "The supplement may modestly reduce fatigue in some patients with mild deficiency." …] User: What does this study say about the supplement? Assistant: It reports a possible modest reduction in fatigue, and only for some patients with mild deficiency — the authors stop well short of a general claim that it works.

Why it happens

01

Models paraphrase by regenerating meaning through their own learned distributions rather than copying. Hedges, scope limits, and negations are low-salience tokens that paraphrasing easily drops or flips, which reverses or strengthens the original claim (Kryscinski et al., 2020, "Evaluating the Factual Consistency of Abstractive Text Summarization").

02

Pretraining priors pull interpretation toward the typical version of a claim. When a source says something surprising or carefully qualified, the model drifts toward the more common framing it saw during training (Longpre et al., 2021, "Entity-Based Knowledge Conflicts in Question Answering").

03

Preference tuning rewards clear, decisive prose. Faithfully preserving a source's uncertainty or mixed findings reads as wishy-washy to raters, so tuning pressures the model toward cleaner but stronger restatements.

04

Sycophancy pressures bend interpretation toward the user's apparent position; models tuned on human feedback tend to read sources as agreeing with what the user wants to hear (Sharma et al., 2023, "Towards Understanding Sycophancy in Language Models").

05

Retrieval hands the model fragments stripped of surrounding argument. A passage that was a counterexample, a quotation of an opposing view, or a conditional claim gets read as the document's own assertion.

06

Error analyses of summarization systems find that misrepresented relations between correct entities, such as wrong causal direction or wrong attribution, are among the most common factual errors, yet standard quality metrics barely penalize them (Pagnoni et al., 2021, "Understanding Factuality in Abstractive Summarization with FRANK").

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

🔗

Entailment checking

Run an NLI model between each restated claim and the source passage it paraphrases. Dropped hedges and flipped negations surface as non-entailment — "the supplement reduces fatigue" is not entailed by "may modestly reduce fatigue in some patients," even though every entity matches.

⚖️

LLM-as-judge evaluation

Run a judge that compares the restatement against the source on the axes paraphrasing distorts — claim strength, scope qualifiers, causal direction, and whether a quoted opposing view got read as the document's own position. Entity-level checks miss these relation errors entirely.

🧪

Golden-set evals

Maintain sources that are heavily hedged, carefully scoped, or surprising relative to common framings, and regression-test whether restatements preserve the qualifiers and direction — measuring drift toward the stronger, more typical version of each claim.

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

📝

Instruction constraints

Instruct the model to restate claims at exactly the strength the source gives them — keeping "may," "in some patients," and scope limits intact — and to quote the operative sentence when the finding is hedged, since paraphrase is where qualifiers get dropped.

Self-check pass

Before answering, have the model compare its restatement against the source on the axes paraphrasing distorts — claim strength, scope, causal direction, attribution — and restore any hedge or qualifier the draft strengthened away.

🧩

Retrieval tuning

Retrieve passages with enough surrounding context to preserve their argumentative role — expanding chunks or including neighboring sentences — so a quoted critic or a conditional claim arrives marked as such instead of reading as the document's own assertion.