InvariantBench: Can Large Language Models Exhibit Inherent Reasoning Across Equivalent Transformations?
Keywords: Large Language Models, reasoning, benchmark, invariance, robustness
Abstract: Reasoning is often attributed to large language models (LLMs), yet it remains unclear whether they operate over underlying semantics or rely on surface-form patterns. Existing benchmarks evaluate correctness on fixed problem instances, but overlook a fundamental property of reasoning: invariance under semantics-preserving transformations. If a model truly understands a problem, its predictions should remain consistent across equivalent representations.
We introduce InvariantBench, a benchmark of $1{,}200$ seed problems $\times$ $4$ invariant forms, spanning 16 tasks across 3 reasoning families and 12 fine-grained invariance axes. Each problem is paired with multiple semantically equivalent variants under a strict invariance contract, enabling evaluation beyond accuracy to measure consistency across representations.
Experiments on 15 frontier and open-weight LLMs reveal a persistent invariance gap: accuracy drops of $1$ to $5\%$ for strong models and $>10$ pp for smaller systems under transformation, with full-consistency rates remaining below $5\%$ for most models, compared to $>48\%$ for the best systems and $>98\%$ for humans. These results show that high base accuracy substantially overestimates reasoning ability, and establish invariance as a necessary axis for evaluating and improving robust language understanding.
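To make the two headline quantities concrete, the following is a minimal sketch of how an invariance gap and a full-consistency rate could be computed from per-item correctness. It is not the benchmark's actual scoring code: the record layout `(seed_id, form, is_correct)`, the `"base"` label for the original form, the function name `invariance_metrics`, and the definition of full consistency as "correct on the base form and on every variant of the same seed" are all assumptions made for illustration.

```python
from collections import defaultdict


def invariance_metrics(records):
    """Summarise accuracy and consistency over equivalent problem forms.

    records: iterable of (seed_id, form, is_correct) tuples, where the
    hypothetical form label "base" marks the untransformed problem.
    """
    # Group correctness flags by seed problem, keyed by form name.
    by_seed = defaultdict(dict)
    for seed_id, form, is_correct in records:
        by_seed[seed_id][form] = bool(is_correct)

    base = [forms["base"] for forms in by_seed.values()]
    variants = [
        ok
        for forms in by_seed.values()
        for name, ok in forms.items()
        if name != "base"
    ]

    base_acc = sum(base) / len(base)
    variant_acc = sum(variants) / len(variants)

    # Assumed definition: a seed is fully consistent only if every form
    # (base and all variants) is answered correctly.
    full_consistency = sum(all(f.values()) for f in by_seed.values()) / len(by_seed)

    return {
        "base_accuracy": base_acc,
        "variant_accuracy": variant_acc,
        "gap_pp": 100.0 * (base_acc - variant_acc),  # percentage points
        "full_consistency": full_consistency,
    }


if __name__ == "__main__":
    # Toy data: two seed problems, each with a base form and two variants.
    demo = [
        ("s1", "base", True), ("s1", "v1", True), ("s1", "v2", True),
        ("s2", "base", True), ("s2", "v1", False), ("s2", "v2", True),
    ]
    print(invariance_metrics(demo))
```

In this toy case base accuracy is perfect while only one of the two seeds is answered correctly in every form, which illustrates how accuracy on fixed instances can sit well above the full-consistency rate.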
Submission Number: 103