Poor Structure

Organizes information in a way that makes the answer hard to scan, compare, execute, or verify. Unlike Format Failure, the response may satisfy the requested format while still being poorly organized. Unlike Verbosity Failure, the issue is arrangement and hierarchy rather than amount of content.

Examples

FAIL

Steps out of executable order

Presents a procedure in the order it was generated, with prerequisites arriving after the steps that need them.

User: How do I migrate my database to the new schema? Assistant: Run the migration script with ./migrate.sh. Then take a backup — actually, the backup should happen before the script. You'll also need to stop the app server first, which really belongs before all of this.
WARN

Tidy but unusable arrangement

Applies the standard heading-and-bullets scaffold to a comparison the reader needs side by side.

User: Compare these three laptops on price, battery life, and weight so I can pick one. Assistant: ## Laptop A — A great all-rounder! Price: $999. Strong battery. ## Laptop B — Excellent choice for travelers. Very light. Battery life is also good. ## Laptop C — Premium build. $1,499. [… reader must reconstruct the comparison themselves; weights and half the prices never appear …]
PASS

Structure derived from the task

Orders a procedure by execution, leads with the answer, and shapes comparisons for comparing.

User: How do I migrate my database to the new schema? Assistant: In order: 1) stop the app server, 2) take a backup, 3) run ./migrate.sh, 4) verify with ./check-schema.sh, 5) restart the server. Steps 1–2 matter — the script can't be safely rerun against a half-migrated database.

Why it happens

01

Generation is linear and unrevised. Tokens are emitted in order without an outlining pass before or a reorganization pass after, so the answer's structure is whatever order the content happened to come out in.

02

Structure is pasted on as style, not derived from the content. Tuning instills heading-and-bullet habits that raters reward, so the model applies the same scaffolding to every answer, including ones whose logic calls for a comparison table, a sequence, or plain prose.

03

Good structure depends on the reader's task. Scanning for one fact, comparing options, and executing steps each demand different arrangements, and the prompt rarely says which use the answer must serve, so the model defaults to its generic template.

04

Raters judge appearance, not usability. A response with tidy headings and bullets looks organized in a ten-second review, and nobody in the rating loop actually executes the steps or compares the options, so navigability is never tested.

05

Hierarchy errors compound over long outputs. As generation proceeds, the model loses track of its own outline, repeating points under different headings, nesting inconsistently, and burying the answer mid-document instead of leading with it.

06

The model cannot see the rendering surface. Whether output lands in a terminal, a chat bubble, or a wiki page changes what structure works, but the destination is invisible at generation time, so formatting choices are made blind.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

🔬

Sandboxed execution

Execute procedural answers exactly as written, in the order given. Prerequisites that arrive after the steps needing them fail mechanically — the migration script that runs before the backup it assumes is caught by the sandbox, not by anyone's reading of the prose.

⚖️

LLM-as-judge evaluation

Tell the judge the reader's task — execute, compare, or scan — and have it score whether the arrangement serves that use. Tidy headings wrapped around a comparison the reader must reconstruct themselves is the warn-level case; it passes every appearance check and fails the only test that matters.

🧪

Golden-set evals

Maintain tasks whose correct structure is checkable — procedures scored on dependency order, comparisons scored on whether every requested dimension appears for every option, lookups scored on whether the answer leads. The laptop comparison missing half the prices and all the weights is countable, not a matter of taste.

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

🗂️

Outline-first generation

Have the model derive the arrangement from the reader's task before drafting — a comparison wants a table with every dimension per option, a procedure wants dependency order, a lookup wants the answer first. Generation is linear and unrevised, so the backup-before-migrate ordering has to be decided before tokens commit, not discovered mid-sentence.

🧹

Output post-processing

Run a reorganization pass on the draft — reorder steps by dependency, convert heading-wrapped comparisons into actual side-by-side form, hoist the buried answer to the top. The fail example contains all the right content in generation order; a revision pass is the step the architecture lacks, added back in the pipeline.

📝

Instruction constraints

Tell the model the reader's task and the rendering surface — "this will be executed top to bottom," "this lands in a terminal" — and require structure chosen for that use rather than the heading-and-bullets default. The generic scaffold is applied because nothing in the prompt says what the answer must serve; saying it is cheap.