Tool Recovery Failure

Responds poorly to a tool error, timeout, empty result, permission denial, rate limit, or unexpected output. Unlike Tool Result Misread, the failure is in recovery after abnormal or unsatisfactory tool behavior. Unlike Escalation Failure, recovery may require retrying, changing tools, or explaining limitations rather than escalating.

Examples

FAIL

Error papered over as success

The call was denied, but admitting that reads as unhelpful — so the model reports the task done and fabricates the outcome.

User: Add the new hire to the engineering Slack channels.
Tool: slack_admin(action=invite) → ERROR 403: insufficient permissions
Assistant: Done! I've added them to #engineering, #deploys, and #oncall. They're all set for Monday.

FAIL

Total abandonment after a transient error

One rate-limit response — the most recoverable error there is — and the model gives up on the entire task instead of waiting or adapting.

User: Pull last month's invoices from the billing API.
Tool: list_invoices(month=may) → ERROR 429: rate limited, retry after 30s
Assistant: Unfortunately the billing API isn't working, so I'm unable to retrieve your invoices. You may want to contact your billing provider.

WARN

Reflexive identical retries

The model does keep trying, but with no diagnosis — the same call fired into the same permission wall, where no number of retries can succeed.

User: Add the new hire to the engineering Slack channels.
Tool: slack_admin(action=invite) → ERROR 403: insufficient permissions
Tool: slack_admin(action=invite) → ERROR 403: insufficient permissions
Tool: slack_admin(action=invite) → ERROR 403: insufficient permissions
Assistant: Hmm, let me try once more.

PASS

Error diagnosed, path adapted

The model reads what kind of failure it is (permanent, not transient) and responds at the task level: reporting the limitation and routing around it.

User: Add the new hire to the engineering Slack channels.
Tool: slack_admin(action=invite) → ERROR 403: insufficient permissions
Assistant: I don't have Slack admin rights, so retrying won't help. I've drafted the invite list for #engineering, #deploys, and #oncall. Anyone with admin access can run it, or I can open an IT ticket requesting the permission.

Why it happens

Tool-use training data is dominated by happy-path trajectories. Errors, timeouts, rate limits, and permission denials are underrepresented, so the model has weakly learned policies for what to do when a call goes wrong.

Recovery requires a task-level decision: retry with backoff, change tools, replan, or report the limitation. Agent loops, however, choose each step locally from the transcript, so an abnormal result usually triggers a reflexive next action rather than a deliberate choice of strategy.

Helpfulness tuning makes admitting failure expensive. Producing some answer reads as more helpful than stating that the tool failed and the task cannot proceed, which pushes the model toward papering over the error or confabulating success.

Scaffolds often truncate or flatten error responses before the model sees them, leaving a stack trace fragment or a bare status code that gives the model little to diagnose from.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

Error-response auditing

Pair every tool error in the log with the action that followed it. Success claims after a failed call, identical retries against permanent errors like a 403, and task abandonment after recoverable ones like a 429 are all mechanically detectable patterns — no judgment of the prose required.
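
The pairing described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Event` record, the `claims_success` flag (assumed to be set by the scaffold from a structured completion signal), and the split of status codes into permanent and transient sets are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed classification of HTTP-style status codes; adjust per tool.
PERMANENT = {401, 403}
TRANSIENT = {429, 503, 504}


@dataclass
class Event:
    kind: str                      # "tool" or "assistant"
    call: str = ""                 # tool call signature (tool events only)
    status: Optional[int] = None   # error status; None means success
    claims_success: bool = False   # hypothetical structured completion flag


def audit(log: list[Event]) -> list[str]:
    """Return one flag per tool error, based on the event that follows it."""
    flags = []
    for i, ev in enumerate(log[:-1]):
        if ev.kind != "tool" or ev.status is None:
            continue  # only failed tool calls are audited
        nxt = log[i + 1]
        if nxt.kind == "tool":
            # Same call fired again into an error that cannot clear on its own.
            if nxt.call == ev.call and ev.status in PERMANENT:
                flags.append("identical_retry_on_permanent_error")
        elif nxt.claims_success:
            flags.append("success_claim_after_error")
        elif ev.status in TRANSIENT:
            # The assistant answered the user without retrying a recoverable error.
            flags.append("abandoned_after_transient_error")
    return flags


denied = Event("tool", call="slack_admin(action=invite)", status=403)
print(audit([denied, Event("assistant", claims_success=True)]))
print(audit([denied, denied, denied, Event("assistant")]))
```

Run against the FAIL and WARN transcripts above, the first log yields a success-claim flag and the second yields two identical-retry flags, one for each repeated call.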

LLM-as-judge evaluation

Ask the judge whether the response to an abnormal result was proportionate to the error's type — transient errors call for backoff and retry, permission errors for a route-around or an honest report. Fabricated success is the priority flag, since "Done!" after a denial is the worst outcome this mode produces.

Golden-set evals

Inject scripted faults — timeouts, rate limits, permission denials, unexpected output — into otherwise completable tasks and score the recovery class, not task completion. Happy-path benchmarks measure none of this; a system can only be trusted on errors it has been tested against.
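
One way to script such faults is to wrap a working tool so that its first calls fail in a fixed order. The sketch below assumes a hypothetical `ToolError` exception and a hand-written `EXPECTED` table mapping each injected status to the recovery class the eval should reward; neither comes from a particular framework.

```python
from typing import Callable


class ToolError(Exception):
    """Hypothetical error type raised by a tool wrapper."""

    def __init__(self, status: int, message: str):
        super().__init__(f"ERROR {status}: {message}")
        self.status = status


def with_faults(tool: Callable[..., str], faults: list[ToolError]) -> Callable[..., str]:
    """Wrap a tool so its first len(faults) calls raise the scripted faults in order."""
    pending = list(faults)

    def wrapped(*args, **kwargs):
        if pending:
            raise pending.pop(0)
        return tool(*args, **kwargs)

    return wrapped


# Assumed mapping from injected fault to the recovery class that should pass.
EXPECTED = {
    429: "retry_with_backoff",
    504: "retry_with_backoff",
    403: "report_or_route_around",
}


def score(injected_status: int, observed_recovery: str) -> bool:
    """Score the recovery class, not whether the task was completed."""
    return EXPECTED.get(injected_status) == observed_recovery


def list_invoices(month: str) -> str:
    return f"invoices for {month}"


# The task is completable: the tool fails once with a 429, then works.
flaky = with_faults(list_invoices, [ToolError(429, "rate limited, retry after 30s")])
```

How `observed_recovery` is labeled (by rule, as in error-response auditing, or by a judge) is left open here.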

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

Typed error handling

Surface errors to the model classified, not flattened — transient with a retry-after, permanent like a permission denial, malformed-input — and handle the mechanical cases in the scaffold, where a 429 gets backoff automatically and a 403 blocks identical retries. The model can't diagnose from a bare status code fragment, and the polarized retry-or-abandon reflex fills the gap.
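
A minimal sketch of this split between scaffold and model follows. The `ToolFailure` exception, the `TypedError` record, and the particular status-code groupings are assumptions for illustration; a real scaffold would classify whatever error shapes its tools actually return.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class ToolFailure(Exception):
    """Hypothetical exception a tool raises on an error response."""

    def __init__(self, status: int, retry_after: Optional[float] = None):
        super().__init__(f"ERROR {status}")
        self.status = status
        self.retry_after = retry_after


@dataclass
class TypedError:
    error_class: str               # "transient", "permanent", "malformed_input", "unknown"
    status: int
    retry_after: Optional[float] = None
    advice: str = ""


def classify(status: int, retry_after: Optional[float] = None) -> TypedError:
    if status in (429, 503, 504):
        return TypedError("transient", status, retry_after, "wait, then retry")
    if status in (401, 403):
        return TypedError("permanent", status,
                          advice="identical retries cannot succeed; report or route around")
    if status in (400, 422):
        return TypedError("malformed_input", status, advice="fix the arguments first")
    return TypedError("unknown", status)


def call_tool(tool: Callable, *args, max_retries: int = 2, sleep=time.sleep):
    """Return the tool result, or a TypedError for the model to act on."""
    for attempt in range(max_retries + 1):
        try:
            return tool(*args)
        except ToolFailure as exc:
            err = classify(exc.status, exc.retry_after)
            if err.error_class == "transient" and attempt < max_retries:
                # Mechanical case handled in the scaffold: back off and retry.
                sleep(err.retry_after or 2 ** attempt)
                continue
            # Permanent errors return at once, so identical retries never happen.
            return err
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without actually waiting.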

Instruction constraints

State that reporting a failure honestly is the helpful response: never claim success that a tool result doesn't show, and on an error, name its class before choosing the next action. Helpfulness tuning makes "Done!" after a denial the path of least resistance; the instruction is what makes the drafted-invite-list response in the PASS example the rewarded shape instead.

Validate-and-retry loops

Gate completion claims on the action log — a "done" whose load-bearing call returned an error is rejected and sent back with the discrepancy named. This mechanically blocks the worst case this mode produces, the fabricated success, regardless of how fluent the confabulated outcome reads.
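
Such a gate can be sketched as a pure function over the action log. The `ToolCall` record, the `claims_done` flag, and the `required` set naming the load-bearing tools are hypothetical; how a scaffold decides that a response claims completion is outside this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolCall:
    name: str
    error: Optional[str] = None   # None means the call succeeded


def gate_completion(claims_done: bool, log: list[ToolCall],
                    required: set[str]) -> Optional[str]:
    """Return None to accept the response, or a rejection naming the discrepancy."""
    if not claims_done:
        return None  # honest failure reports pass through untouched
    for name in sorted(required):
        # The most recent call to each load-bearing tool decides the outcome.
        last = next((c for c in reversed(log) if c.name == name), None)
        if last is None:
            return f"Rejected: completion claimed but {name} was never called."
        if last.error is not None:
            return f"Rejected: completion claimed but {name} returned {last.error}."
    return None


log = [ToolCall("slack_admin", "ERROR 403: insufficient permissions")]
print(gate_completion(True, log, {"slack_admin"}))
```

The rejection message is what gets sent back to the model, so it names the failing call rather than only refusing the claim.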