Tool Argument Error

Calls a tool with arguments that are malformed, incomplete, unauthorized, stale, poorly scoped, or semantically wrong for the intended operation. Unlike Wrong Tool Selection, the chosen tool may be appropriate while the call parameters are not. Unlike JSON/Schema Failure, the malformed structure is sent to a tool rather than returned as final output.

Examples

FAIL

Plausible value invented

The model doesn't know the customer's ID, so it fills the argument slot with a fabricated one that looks right, and the call silently operates on the wrong account.

User: Pull up the subscription details for Meridian Labs.
Tool: get_subscription(customer_id="cust_meridian_001") → returns a different company that happens to own that ID
Assistant: Meridian Labs is on the Starter plan at $49/month.

FAIL

Stale value replayed

The deploy target was switched mid-session, but the model copies the argument from an earlier call instead of the current state.

User: Actually, deploy to the EU cluster instead. We just discussed this.
Assistant: Deploying now.
Tool: deploy(cluster="us-east-1") → deployed
[… the argument was copied from the previous deploy call, not the updated instruction …]

WARN

Missing parameter guessed silently

A required scope was never specified, and instead of asking, the model picks a default that happens to be far broader than intended.

User: Grant the contractor access to the project.
Tool: grant_access(user="contractor@ext.com", scope="org-admin")
Assistant: Done. Access granted.

PASS

Unknown value looked up, not guessed

The model treats the ID as something to resolve, not generate: it searches for the customer by name first and passes the returned ID to the operational call.

User: Pull up the subscription details for Meridian Labs.
Tool: search_customers("Meridian Labs") → cust_8842
Tool: get_subscription(customer_id="cust_8842") → Enterprise, $2,400/month
Assistant: Meridian Labs (cust_8842) is on Enterprise at $2,400/month.

Why it happens

01

Arguments are generated by the same next-token process that hallucinates facts. When the model does not know a value, it fills in a plausible-looking one, producing fabricated IDs, paths, and parameters (Patil et al., 2023, "Gorilla: Large Language Model Connected with Massive APIs").

02

Schema enforcement constrains syntax, not semantics. Constrained decoding can guarantee well-formed JSON while doing nothing to verify that a file exists, an ID is current, or a scope is appropriate.
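The gap between the two kinds of check can be made concrete with a small sketch. Everything here is hypothetical: the schema shape, the cluster inventory, and the name eu-west-1 (the example above says only "the EU cluster") are assumptions for illustration, and the structural check is a hand-rolled stand-in for a real schema validator.

```python
# Hypothetical schema for the deploy tool: parameter names and types only.
DEPLOY_SCHEMA = {"cluster": str}

# Hypothetical live state that a schema validator never sees.
EXISTING_CLUSTERS = {"us-east-1", "eu-west-1"}


def schema_valid(args: dict, schema: dict) -> bool:
    """Structural check: exactly the expected keys, each with the expected type."""
    return args.keys() == schema.keys() and all(
        isinstance(args[name], expected) for name, expected in schema.items()
    )


def semantically_valid(args: dict, requested_cluster: str) -> bool:
    """Semantic check: the cluster exists and is the one currently requested."""
    cluster = args["cluster"]
    return cluster in EXISTING_CLUSTERS and cluster == requested_cluster


call_args = {"cluster": "us-east-1"}  # replayed from an earlier call
print(schema_valid(call_args, DEPLOY_SCHEMA))      # True
print(semantically_valid(call_args, "eu-west-1"))  # False
```

The replayed call passes the structural check and fails only the semantic one, which is the case schema enforcement alone cannot catch.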

03

Values must often be copied from earlier in a long transcript, and retrieval from the middle of a long context is unreliable. The model grabs a similar-looking value from the wrong place or misreads the right one (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts").

04

Conversation state changes as a task progresses, but the model has no mechanism marking which earlier values are stale. Arguments correct at one step get replayed after the underlying state has moved on.

05

Preference tuning can reward answering directly over pausing to ask questions, so when a required parameter is missing or ambiguous the model tends to guess rather than request clarification.

06

Tool specifications themselves are often underdocumented, with ambiguous parameter names, unstated units, or silent defaults. The model inherits every gap in the spec as a guessing problem.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

🔗

Argument provenance checking

Trace each argument value back to its source — a user statement, a prior tool result, or the tool spec's documented default. A customer_id that appears nowhere upstream is fabricated; a cluster name that matches an early turn but not the latest correction is stale. Both are detectable from the transcript without knowing the right answer.
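One possible shape for such a check is sketched below. It assumes that candidate values for a parameter have already been extracted from user turns and tool results in chronological order; that extraction step, the function name classify_argument, and the label strings are all hypothetical.

```python
from typing import Optional


def classify_argument(
    value: str,
    upstream_values: list[str],
    spec_default: Optional[str] = None,
) -> str:
    """Classify where an argument value came from.

    upstream_values holds the values seen for this parameter in user turns
    and tool results, oldest first (assumed to be extracted elsewhere).
    """
    if value in upstream_values:
        # Seen before: current only if it matches the most recent mention.
        return "grounded" if value == upstream_values[-1] else "stale"
    if spec_default is not None and value == spec_default:
        return "spec_default"
    return "fabricated"


print(classify_argument("cust_meridian_001", []))                  # fabricated
print(classify_argument("us-east-1", ["us-east-1", "eu-west-1"]))  # stale
print(classify_argument("cust_8842", ["cust_8842"]))               # grounded
```

None of the three classifications requires knowing the correct value in advance; each is derived purely from what appears upstream of the call.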

⚖️

LLM-as-judge evaluation

Have a judge model compare each call's arguments against the current conversation state, flagging values guessed where the user never specified them, especially silently chosen defaults such as an org-admin scope where the request implied something far narrower.

🧪

Golden-set evals

Build tasks where required values must be resolved by lookup rather than generated, and tasks that change a parameter mid-session, then score the arguments of each call rather than the well-formedness of the JSON. Schema-valid calls carrying wrong values are exactly the cases worth counting.
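A minimal scorer along these lines might report well-formedness and argument correctness as separate fields, so the two cannot be conflated. The call format (a JSON object with tool and args keys) and the function name score_call are assumptions made for this sketch, not a standard.

```python
import json


def score_call(raw_call: str, expected: dict) -> dict:
    """Score one emitted tool call against a golden expectation."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return {"well_formed": False, "args_correct": False}
    if not isinstance(call, dict):
        return {"well_formed": False, "args_correct": False}
    args_correct = (
        call.get("tool") == expected["tool"] and call.get("args") == expected["args"]
    )
    return {"well_formed": True, "args_correct": args_correct}


expected = {"tool": "get_subscription", "args": {"customer_id": "cust_8842"}}
guessed = '{"tool": "get_subscription", "args": {"customer_id": "cust_meridian_001"}}'
resolved = '{"tool": "get_subscription", "args": {"customer_id": "cust_8842"}}'

print(score_call(guessed, expected))   # {'well_formed': True, 'args_correct': False}
print(score_call(resolved, expected))  # {'well_formed': True, 'args_correct': True}
```

The first result is the row worth counting: a call that parses cleanly and still carries the wrong customer.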

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

🔎

Tool-backed lookup

Make identifiers things the model resolves, never generates — provide search and resolve tools beside the operational ones, and instruct that any ID, path, or account reference must come from a lookup result or the user's own words. Arguments come from the same next-token process that hallucinates facts, and cust_meridian_001 looked exactly as plausible as the real ID.
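As a rough sketch of how a tool layer could enforce this, the operational tool below refuses any ID that a lookup has not returned earlier in the session. The CustomerTools class, its in-memory data, and UnresolvedIdError are hypothetical stand-ins; only the tool names and values come from the examples above.

```python
class UnresolvedIdError(ValueError):
    """Raised when an ID was not produced by a lookup in this session."""


class CustomerTools:
    # Hypothetical in-memory data standing in for real backends.
    _DIRECTORY = {"Meridian Labs": "cust_8842"}
    _SUBSCRIPTIONS = {"cust_8842": "Enterprise, $2,400/month"}

    def __init__(self) -> None:
        self._resolved_ids: set[str] = set()

    def search_customers(self, name: str) -> str:
        """Lookup tool: resolves a name and records the ID as trusted."""
        customer_id = self._DIRECTORY[name]
        self._resolved_ids.add(customer_id)
        return customer_id

    def get_subscription(self, customer_id: str) -> str:
        """Operational tool: accepts only IDs that a lookup returned."""
        if customer_id not in self._resolved_ids:
            raise UnresolvedIdError(f"{customer_id!r} did not come from a lookup")
        return self._SUBSCRIPTIONS[customer_id]


tools = CustomerTools()
try:
    tools.get_subscription("cust_meridian_001")
except UnresolvedIdError as err:
    print("rejected:", err)

customer_id = tools.search_customers("Meridian Labs")
print(tools.get_subscription(customer_id))  # Enterprise, $2,400/month
```

A fuller version would also accept IDs quoted directly from the user's own words; this sketch covers only the lookup path.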

🗃️

Explicit state tracking

Hold current parameter values (the deploy target, the active customer, the working directory) in a structured state block that updates when the user changes them, and have calls draw from it rather than from transcript recall. The us-east-1 replay happened because the argument was copied from the previous deploy call rather than from the updated instruction; a maintained current-values table has no stale copy to grab.
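A state block of this kind could be as small as the sketch below. SessionState, build_deploy_call, the deploy_target key, and the cluster name eu-west-1 are hypothetical; the point is only that the call builder reads from the state and never from the transcript.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Single source of truth for values that can change mid-session."""

    current: dict = field(default_factory=dict)

    def update(self, name: str, value: str) -> None:
        # Overwrites in place, so no older copy survives to be replayed.
        self.current[name] = value

    def require(self, name: str) -> str:
        if name not in self.current:
            raise KeyError(f"no current value for {name!r}; ask the user")
        return self.current[name]


def build_deploy_call(state: SessionState) -> dict:
    """Builds the call from state rather than from transcript recall."""
    return {"tool": "deploy", "args": {"cluster": state.require("deploy_target")}}


state = SessionState()
state.update("deploy_target", "us-east-1")
state.update("deploy_target", "eu-west-1")  # the user switches to the EU cluster
print(build_deploy_call(state))
# {'tool': 'deploy', 'args': {'cluster': 'eu-west-1'}}
```

Raising on a missing key, instead of returning a default, keeps an unset parameter from turning into a silent guess.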

🔁

Validate-and-retry loops

Validate arguments semantically at the boundary before execution (does the ID exist, is the file present, is the scope plausible for the request) and treat missing required parameters as errors rather than slots for silent defaults. An org-admin grant where the request named one contractor should fail the scope check, not succeed quietly.
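The loop can be sketched as follows. The scope names in NARROW_SCOPES, the validate_grant rules, and the propose callback (a stand-in for the model proposing arguments, given the previous validation errors) are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical policy: scopes plausible for a single external contractor.
NARROW_SCOPES = {"project-read", "project-write"}


def validate_grant(args: dict) -> list[str]:
    """Return semantic errors for a grant_access call; empty means valid."""
    errors = []
    if "user" not in args:
        errors.append("missing required parameter: user")
    scope = args.get("scope")
    if scope is None:
        errors.append("missing required parameter: scope")
    elif scope not in NARROW_SCOPES:
        errors.append(f"scope {scope!r} is broader than the request implies")
    return errors


def call_with_retry(
    propose: Callable[[list[str]], dict], max_attempts: int = 2
) -> Optional[dict]:
    """Ask for arguments, validate them, and feed errors back on failure."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        args = propose(feedback)
        feedback = validate_grant(args)
        if not feedback:
            return args  # safe to execute
    return None  # give up and ask the user instead of executing


attempts = iter([
    {"user": "contractor@ext.com", "scope": "org-admin"},
    {"user": "contractor@ext.com", "scope": "project-read"},
])
print(call_with_retry(lambda feedback: next(attempts)))
# {'user': 'contractor@ext.com', 'scope': 'project-read'}
```

Returning None after the last attempt leaves the decision with the caller, which can then ask the user for the scope rather than execute a guess.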