Genericism

Gives vague, boilerplate, or template-like guidance that is too nonspecific or abstract for the user to act on, instead of concrete help grounded in their task. Unlike Irrelevance, generic content may be on topic but too nonspecific to be useful. Unlike Incompleteness, the topic may be covered while the answer still offers nothing the user can actually execute.

Examples

FAIL

Boilerplate instead of help

Returns the listicle any churn question gets, using none of the specifics the user provided.

User: My B2B SaaS churn jumped from 3% to 8% last quarter, right after we raised prices 40% and sunset the legacy plan. What should I do? Assistant: Reducing churn starts with understanding your customers. Improve onboarding, gather feedback regularly, build customer loyalty, and deliver consistent value. Engaged customers are retained customers!
WARN

On-topic but unexecutable

Engages with the situation but stays at an altitude where nothing said implies a next action.

User: My B2B SaaS churn jumped from 3% to 8% last quarter, right after we raised prices 40% and sunset the legacy plan. What should I do? Assistant: A price increase of that size can definitely impact retention. You'll want to look at whether the churn is price-related and consider whether your value communication matched the change.
PASS

Grounded in the specifics

Builds the answer from the facts given and ends in steps the user can execute this week.

User: My B2B SaaS churn jumped from 3% to 8% last quarter, right after we raised prices 40% and sunset the legacy plan. What should I do? Assistant: The timing points at the migration, not churn in general. First, split last quarter's churned accounts into legacy-plan migrants vs. everyone else โ€” if migrants dominate, this is a one-time cohort event, not a trend. For accounts still in their migration window, offer a 12-month grandfathered rate; that usually stanches the bleeding while you confirm the cohort theory.

Why it happens

01

Next-token prediction averages over the training distribution. For any common question, the highest-likelihood response is the one that fit the most past contexts, which is by construction the least specific to this one.

02

Alignment tuning collapses output diversity. RLHF concentrates probability on a narrow band of safe, high-scoring response shapes, trading the distribution's tails, where situation-specific answers live, for the reliable middle (Kirk et al., 2024, "Understanding the Effects of RLHF on LLM Generalisation and Diversity").

03

Generic advice is unfalsifiable, and the reward signal notices. Boilerplate cannot be wrong, so it never draws the penalty a concrete recommendation risks, and preference tuning steadily favors the answer that no rater can fault.

04

The model lacks the context specificity requires, and is tuned not to ask. Concrete guidance depends on constraints the prompt omits, and since clarifying questions rate poorly, the model answers anyway at the only altitude the missing information allows.

05

Self-help and content-marketing prose saturates the corpus. Listicle advice and template documents are among the most common shapes for guidance text on the web, so the genre itself is what the model learned to write.

06

Surface-level evaluation cannot tell generic from grounded. A polished, well-structured answer scores well on benchmarks and quick ratings whether or not anyone could act on it, so nothing in the loop selects for executability.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

๐Ÿ“

Specificity scoring

Count how many of the prompt's particulars the answer actually uses โ€” the 40% price raise, the sunset legacy plan, the 3-to-8% jump. An answer that references none of them could have been written without reading the question, which is the definition of template output, measurable mechanically.

โš–๏ธ

LLM-as-judge evaluation

Ask the judge two questions โ€” could this response have been generated from the topic alone, and what concrete action does it imply for this user this week? The warn-level case engages with the situation but answers at an altitude where nothing said implies a next step, and only the second question catches it.

๐Ÿงช

Golden-set evals

Build detail-rich scenarios where the right answer depends on the supplied facts โ€” the timing of the churn spike points at the migration cohort, not churn in general โ€” and score whether the recommendations change when the details do. Generic output scores identically across scenario variants; grounded output cannot.

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

๐Ÿ“

Instruction constraints

Require the answer to use the particulars the user supplied and to end in steps executable this week โ€” and when the specifics needed for concrete guidance are missing, to ask for them instead of answering at the altitude the gap allows. The listicle is the trained default; both halves of the constraint exist to make it unavailable.

โœ…

Self-check pass

Before delivering, have the model check which of the prompt's facts the draft actually uses โ€” the 40% raise, the sunset plan, the 3-to-8% jump. A draft that uses none could have been written without reading the question; that test failing is the signal to regenerate from the specifics rather than polish the template.

๐Ÿง‘โ€๐Ÿซ

Preference data curation

Rate concrete, falsifiable recommendations above polished boilerplate, deliberately rewarding answers that take a position a rater could fault. Genericism is reinforced precisely because unfalsifiable advice never draws a penalty; until the data charges for it, tuning keeps collapsing toward the safe middle.