Arithmetic Error

Computes or transforms numeric inputs incorrectly, including arithmetic, aggregation, unit conversion, comparison, or formula application. Unlike Numerical Hallucination, the relevant numbers are grounded or supplied but the calculation over them is wrong. Unlike Reasoning Error, this mode is specifically about numeric operations.

Examples

FAIL

Incorrect total

All inputs are supplied and unambiguous; the addition over them is simply wrong.

User: What does the invoice come to? Line items are $52, $48, and $46.

A: The total is $142.

FAIL

Botched unit conversion

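The model reproduces the shape of a conversion it has seen but with the wrong factor, then builds on the bad number. Here it applies the imperial-gallon factor (about 4.546 liters) to a question about US gallons (about 3.785 liters); the correct figures are about 100 gallons and 21 jugs.

User: The tank holds 380 liters. How many US gallons is that, and how many 5-gallon jugs do we need?

A: 380 liters is about 84 gallons, so you'll need 17 jugs.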

WARN

Unchecked inline math

The total happens to be right, but it was computed inline with no steps or verification — the same path that produces the failures above.

User: What does the invoice come to? Line items are $52, $48, and $46.

A: The total is $146.

PASS

Tool-backed calculation

The arithmetic is routed to deterministic execution and the result is grounded in the computation.

User: What does the invoice come to? Line items are $52, $48, and $46.

Tool: calculate(52 + 48 + 46) → 146

A: The total is $146.

Why it happens

Tokenizers split numbers into arbitrary fragments, so models represent digits poorly and arithmetic does not generalize the way it would on a calculator (Nogueira et al., 2021, "Investigating the Limitations of Transformers with Simple Arithmetic Tasks").

Models learn arithmetic as statistical association, not algorithm execution. Accuracy tracks how often an operation and its operands appear in training data, so common results are memorized while rare operand combinations are interpolated and often wrong (McCoy et al., 2023, "Embers of Autoregression: Understanding Large Language Models Through the Problem They Are Trained to Solve").

Multi-digit operations require precise intermediate bookkeeping such as carries and partial products. Attention performs this tracking only approximately, and accuracy collapses as the number of required intermediate steps grows (Dziri et al., 2023, "Faith and Fate: Limits of Transformers on Compositionality").

Many systems answer numeric questions inline instead of routing them to a calculator or code execution, so nothing deterministic checks the result before it reaches the user.

Detection Approaches

Categories of checks that can identify the issue. These are strategies, not specific implementations.

Deterministic recomputation

Extract every numeric operation from the output — sums, conversions, formula applications — and recompute it with code or a calculator. Inputs and operations are visible in the response, so mismatches are mechanically detectable with no judgment call involved.
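As a rough illustration of recomputation for the invoice example, the sketch below assumes the check already has the supplied line items and only needs to pull the claimed total out of the response. The function name `check_total` and the dollar-amount regex are hypothetical, and a real extractor would need to handle more than one number per response.

```python
import re
from decimal import Decimal

def check_total(line_items, response):
    """Recompute the sum of line_items and compare it with the first
    dollar amount found in the response (hypothetical helper)."""
    # Decimal avoids binary floating-point drift on currency values.
    expected = sum((Decimal(item) for item in line_items), Decimal(0))
    match = re.search(r"\$(\d+(?:\.\d+)?)", response)
    claimed = Decimal(match.group(1)) if match else None
    return {"expected": expected, "claimed": claimed, "ok": claimed == expected}

result = check_total(["52", "48", "46"], "The total is $142.")
print(result["ok"], result["expected"], result["claimed"])
```

A response with no extractable amount is reported as a failure here rather than skipped, which is one possible policy, not the only one.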

Step-level validation

Check intermediate values, not just the final answer. A botched conversion factor or a dropped carry is visible at the step where it happens, and a final answer that is accidentally right can still hide broken arithmetic that will fail on the next input.
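One way to picture step-level checking is the tank example, validated step by step. This sketch assumes US liquid gallons and assumes the intermediate values have already been extracted from the response; `validate_steps` and its tolerance are illustrative choices.

```python
import math

LITERS_PER_US_GALLON = 3.785411784  # exact by definition of the US liquid gallon

def validate_steps(liters, claimed_gallons, claimed_jugs, jug_size=5, tolerance=0.5):
    """Check the conversion step and the jug-count step separately and
    return a list of failure descriptions (empty if every step passes)."""
    failures = []
    gallons = liters / LITERS_PER_US_GALLON
    if abs(claimed_gallons - gallons) > tolerance:
        failures.append(
            f"conversion: expected about {gallons:.1f} gallons, got {claimed_gallons}"
        )
    # Compare against the claimed gallons as well, so a correct ceiling
    # division over a wrong input is told apart from a second, separate error.
    if claimed_jugs != math.ceil(claimed_gallons / jug_size):
        failures.append("jug count: inconsistent with the claimed gallons")
    if claimed_jugs != math.ceil(gallons / jug_size):
        failures.append(
            f"jug count: expected {math.ceil(gallons / jug_size)}, got {claimed_jugs}"
        )
    return failures

for failure in validate_steps(380, 84, 17):
    print(failure)
```

For the failing response above, the conversion step is flagged as the origin of the error, while the jug count is shown to be internally consistent with the bad intermediate value.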

Golden-set evals

Weight the eval toward rare operand combinations, multi-digit operations, and conversions with uncommon factors — accuracy tracks training-data frequency, so round numbers and familiar pairs the model has memorized will overstate reliability.
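A golden set along these lines could be generated rather than hand-written. The sketch below is one possible generator, under the assumption that operands of four to seven digits that do not end in zero are a reasonable proxy for "rare"; `build_golden_set` and the prompt wording are hypothetical.

```python
import random

def build_golden_set(n, digits=(4, 7), seed=0):
    """Build n arithmetic cases over multi-digit, non-round operands,
    with expected answers computed deterministically."""
    rng = random.Random(seed)  # seeded so the set is reproducible
    low, high = 10 ** (digits[0] - 1), 10 ** digits[1] - 1
    cases = []
    while len(cases) < n:
        a, b = rng.randint(low, high), rng.randint(low, high)
        if a % 10 == 0 or b % 10 == 0:
            continue  # skip operands ending in zero, which are closer to round numbers
        op = rng.choice(["+", "-", "*"])
        expected = {"+": a + b, "-": a - b, "*": a * b}[op]
        cases.append({"prompt": f"What is {a} {op} {b}?", "expected": expected})
    return cases

for case in build_golden_set(3):
    print(case["prompt"], case["expected"])
```

Unit conversions with uncommon factors would need their own generator; this one covers only integer addition, subtraction, and multiplication.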

Mitigation Approaches

High-level reliability strategies that reduce how often this failure occurs.

Tool-backed math

Route every numeric operation — sums, conversions, formula applications — to a calculator or code execution instead of computing inline. Arithmetic is the rare failure with a complete mechanical fix; the model's job becomes setting up the right computation, which is what it's actually good at, and the digits come from something deterministic.
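The `calculate` tool in the PASS example could be backed by something like the following. This is a minimal sketch, assuming the model emits a plain arithmetic expression as a string; it walks the parsed syntax tree instead of calling `eval`, and supports only the four basic operators. How the tool is registered with a particular model API is out of scope here.

```python
import ast
import operator

_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.USub: operator.neg, ast.UAdd: operator.pos}

def calculate(expression):
    """Evaluate +, -, *, / over numeric literals and reject anything else."""
    def evaluate(node):
        # Accept only int and float literals (bool is deliberately excluded).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](evaluate(node.operand))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

print(calculate("52 + 48 + 46"))
```

Restricting the grammar this way means a malformed or adversarial expression fails loudly instead of executing arbitrary code.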

Instruction constraints

Forbid silent inline arithmetic — require each operation written out with its operands and intermediate values, including conversion factors stated before they're applied. The accidentally-right inline total travels the same path as the wrong ones; making the work explicit gives a recomputation check something to bind to and surfaces the wrong factor before it gets built on.

Validate-and-retry loops

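Where tool routing isn't available, extract the operations from the draft, recompute them deterministically, and regenerate on mismatch with the wrong value named. Per-attempt accuracy still depends on training-data frequency, but a draft reaches the user only once every extracted operation recomputes correctly, at the cost of an occasional retry. The guarantee covers only the operations the extractor finds, so cap the number of retries and fail closed when no attempt passes.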
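The loop described above might look roughly like this. In the sketch, `generate` is a hypothetical stand-in for the model call (it receives the feedback string, or `None` on the first attempt), `extract_claimed` is a hypothetical extractor, and `expected` is the deterministically recomputed value.

```python
def answer_with_retry(generate, extract_claimed, expected, max_attempts=3):
    """Regenerate until the claimed value matches the recomputed one.

    Returns (draft, attempts_used) on success and raises if every
    attempt fails, so an unverified draft is never returned.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        draft = generate(feedback)
        claimed = extract_claimed(draft)
        if claimed == expected:
            return draft, attempt
        # Name the wrong value so the next attempt knows what to fix.
        feedback = f"The value {claimed} is wrong; recompute it."
    raise RuntimeError("no attempt passed the recomputation check")
```

Raising after the final attempt, instead of returning the last draft, is a design choice here: the caller then decides whether to fall back to a refusal or to a human.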