Irrelevance

Includes content that does not materially help answer the user's question, solve the task, or support the needed decision. Unlike Genericism, irrelevant content may be specific but off-target. Unlike Verbosity Failure, the issue is topical fit rather than length alone.

Examples

FAIL

Answers the topic, not the question

Responds with the topic's standard material while the actual question goes unanswered.

User: What port does PostgreSQL listen on by default?
Assistant: PostgreSQL is a powerful open-source relational database first released in 1996. To get started, install it with your package manager, then initialize a data directory with initdb and create your first database with createdb.

WARN

Answer plus digressions

Contains the answer, but surrounds it with adjacent material that serves no part of the question.

User: What port does PostgreSQL listen on by default?
Assistant: The default port is 5432. While we're on the topic, you may also want to know about connection pooling with PgBouncer, the pros and cons of logical replication, and how Postgres compares to MySQL for high-write workloads.

PASS

Material content only

Answers the question and adds only what the asker of that question would plausibly need next.

User: What port does PostgreSQL listen on by default?
Assistant: 5432. You can change it with the port setting in postgresql.conf.

Why it happens

Association drives generation. Surface terms in the prompt activate everything that co-occurred with them in training, and content related to the topic's neighborhood flows into the answer whether or not it serves the actual question.

The model answers the template the question resembles. Common phrasings pull toward their most frequent training-data treatment, so a question that superficially matches a popular pattern gets that pattern's standard answer, including its standard digressions.

Coverage is rewarded over selection. Raters score visible thoroughness more reliably than they penalize off-target passages, so including adjacent material is a safe strategy and pruning it is rarely reinforced.

Relevance requires modeling the decision behind the question. What the user will do with the answer determines what counts as material, and that purpose usually goes unstated, leaving the model to fill the gap with everything plausibly related.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

QA-based consistency checks

Run the original question against the response with an extractive QA model — if "what port does PostgreSQL listen on" cannot pull an answer out of the response, the response answered the topic instead of the question. The fail case never contains the answer at all, and this catches it mechanically.
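A minimal sketch of this check, assuming the QA model is wrapped as a callable that returns a span and a confidence. The `QAModel` interface, the `min_confidence` threshold, and `stub_qa` are all hypothetical names introduced for illustration. The threshold is there because some extractive models always return a span, so "cannot pull an answer" needs a cutoff in practice.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interface: any extractive QA model wrapped as
# (question, context) -> (answer_span, confidence in [0, 1]).
QAModel = Callable[[str, str], Tuple[str, float]]


def extractable_answer(question: str, response: str, qa: QAModel,
                       min_confidence: float = 0.5) -> Optional[str]:
    """Return the span the QA model extracts, or None if nothing usable is found."""
    span, confidence = qa(question, response)
    span = span.strip()
    if not span or confidence < min_confidence:
        return None
    return span


def stub_qa(question: str, context: str) -> Tuple[str, float]:
    """Stand-in so the sketch runs without a model; it only recognizes the literal '5432'."""
    return ("5432", 0.9) if "5432" in context else ("", 0.0)


question = "What port does PostgreSQL listen on by default?"
fail_response = "PostgreSQL is a powerful open-source relational database first released in 1996."
pass_response = "5432. You can change it with the port setting in postgresql.conf."

fail_result = extractable_answer(question, fail_response, stub_qa)  # no span: topic answered, question not
pass_result = extractable_answer(question, pass_response, stub_qa)  # span found
```

A `None` result flags the fail-level shape. A non-`None` result says only that an answer is present, not that the rest of the response is material.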

LLM-as-judge evaluation

Score materiality passage by passage — for each sentence, does it help answer the question, or does it belong to the topic's neighborhood? The warn-level shape is an answer present but surrounded by digressions, and only per-passage scoring separates the 5432 from the PgBouncer tour around it.
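Per-passage scoring could be sketched as follows. The `Judge` callable stands in for an LLM call and is an assumption, as is the naive sentence splitter. The keyword-based `stub_judge` exists only so the example runs and is not a recommended materiality test.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical judge interface: (question, passage) -> True if the passage is material.
Judge = Callable[[str, str], bool]


def split_sentences(text: str) -> List[str]:
    """Naive split on ., ?, or ! followed by whitespace; adequate for a sketch only."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_passages(question: str, response: str, judge: Judge) -> List[Tuple[str, bool]]:
    """Label every sentence of the response as material or not."""
    return [(sentence, judge(question, sentence)) for sentence in split_sentences(response)]


def stub_judge(question: str, passage: str) -> bool:
    """Stand-in for an LLM judge; a real judge would reason about the question, not keywords."""
    return "port" in passage.lower()


warn_response = (
    "The default port is 5432. While we're on the topic, you may also want to know "
    "about connection pooling with PgBouncer, the pros and cons of logical replication, "
    "and how Postgres compares to MySQL for high-write workloads."
)
labels = score_passages("What port does PostgreSQL listen on by default?",
                        warn_response, stub_judge)
```

With the stub, the first sentence is labeled material and the digression is not, which is the separation a whole-response score would miss.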

Golden-set evals

Maintain questions with short checkable answers and score two things separately — whether the answer appears, and what fraction of the response is material to the question. A system can ace the first while drowning it in the second, and both numbers matter.
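The two scores can be kept separate with something like the sketch below. `GoldenItem`, the substring match, and the character-weighted fraction are illustrative choices, not a prescribed metric. The per-passage material flags are assumed to come from a separate labeling step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GoldenItem:
    question: str
    answer: str  # short, checkable string


def answer_present(item: GoldenItem, response: str) -> bool:
    """Score 1: does the expected answer string appear in the response?"""
    return item.answer.lower() in response.lower()


def material_fraction(material_flags: List[bool], lengths: List[int]) -> float:
    """Score 2: share of response characters that sit in passages flagged as material."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(n for flag, n in zip(material_flags, lengths) if flag) / total


item = GoldenItem("What port does PostgreSQL listen on by default?", "5432")
passages = ["The default port is 5432.", "While we're on the topic, consider PgBouncer."]
flags = [True, False]  # assumed output of a per-passage materiality label

present = answer_present(item, " ".join(passages))
fraction = material_fraction(flags, [len(p) for p in passages])
```

Reporting `present` and `fraction` side by side keeps a response that contains the answer but buries it from scoring the same as one that answers cleanly.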

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

Instruction constraints

Require the answer to the question asked to come first, and limit additions to what the asker of that question plausibly needs next — the port number, then the config setting that changes it, and nothing from the topic's neighborhood. "While we're on the topic" is a stateable anti-pattern the prompt can ban by name.
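As a rough illustration, the constraint can be stated in the prompt and the named anti-pattern linted for in outputs. The instruction wording and the phrase list below are assumptions to adapt, not tested prompt text.

```python
from typing import List

# Illustrative wording only; adjust for the system at hand.
SYSTEM_INSTRUCTION = (
    "Answer the question that was asked in the first sentence. "
    "Add only what the asker of that question would plausibly need next. "
    "Do not introduce adjacent topics, and do not use phrases such as "
    "\"while we're on the topic\"."
)

# Hypothetical list of digression openers to flag in generated responses.
BANNED_OPENERS = [
    "while we're on the topic",
    "you may also want to know",
]


def find_banned_openers(response: str) -> List[str]:
    """Return the banned phrases that occur in the response (case-insensitive)."""
    normalized = response.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return [phrase for phrase in BANNED_OPENERS if phrase in normalized]


warn_hits = find_banned_openers(
    "The default port is 5432. While we're on the topic, you may also want to know "
    "about connection pooling with PgBouncer."
)
pass_hits = find_banned_openers(
    "5432. You can change it with the port setting in postgresql.conf."
)
```

A phrase lint catches only the announced digression. Tangents that arrive without an opener still need a per-passage check.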

Self-check pass

Add the revision pass that generation lacks. Before delivering, have the model score each passage against one question: does this help answer what was asked? The PgBouncer tour develops fluently because tangents are never re-examined once begun; a per-passage materiality check is where they finally get cut.
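One way to wire such a pass is sketched below, assuming a hypothetical `is_material` callable (in practice a second model call) and a naive sentence splitter. Falling back to the unmodified draft when nothing passes is a design choice made here so the check cannot delete an entire answer.

```python
import re
from typing import Callable

# Hypothetical check: (question, passage) -> True if the passage helps answer the question.
MaterialityCheck = Callable[[str, str], bool]


def self_check(question: str, draft: str, is_material: MaterialityCheck) -> str:
    """Drop sentences that fail the materiality check; keep the draft if all would be dropped."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", draft) if s.strip()]
    kept = [s for s in sentences if is_material(question, s)]
    return " ".join(kept) if kept else draft


def stub_check(question: str, passage: str) -> bool:
    """Stand-in for a model call; keyword matching is for demonstration only."""
    return "port" in passage.lower()


draft = (
    "The default port is 5432. While we're on the topic, you may also want to know "
    "about connection pooling with PgBouncer."
)
revised = self_check("What port does PostgreSQL listen on by default?", draft, stub_check)
```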

Relevance gating

Filter retrieved passages by usefulness to the question, not embedding similarity, before they reach the model. RAG injects the topic's neighborhood by construction, and the model is tuned to use what it is given — gating what arrives is cheaper than teaching it to discard.
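A gating step might look like the following sketch. The `usefulness` scorer is a hypothetical callable (for example a reranker or a judge prompt), and the threshold and cap are placeholder values rather than recommendations.

```python
from typing import Callable, List, Sequence

# Hypothetical scorer: (question, passage) -> usefulness in [0, 1].
# This is meant to measure whether the passage helps answer the question,
# not how similar its embedding is to the query.
Usefulness = Callable[[str, str], float]


def gate_passages(question: str, passages: Sequence[str], usefulness: Usefulness,
                  threshold: float = 0.5, max_passages: int = 3) -> List[str]:
    """Keep only passages scored at or above the threshold, best first, capped in number."""
    scored = [(usefulness(question, p), p) for p in passages]
    kept = [(score, p) for score, p in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in kept[:max_passages]]


# Stand-in scores so the sketch runs; a real scorer would be a model call.
stub_scores = {
    "The port setting in postgresql.conf defaults to 5432.": 0.9,
    "PgBouncer provides connection pooling.": 0.2,
    "Clients connect to port 5432 unless configured otherwise.": 0.6,
}
gated = gate_passages(
    "What port does PostgreSQL listen on by default?",
    list(stub_scores),
    lambda q, p: stub_scores[p],
)
```

Passages that fall below the threshold never reach the model, so there is nothing from the topic's neighborhood for it to use.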