InvariantBench: Can Large Language Models Exhibit Inherent Reasoning Across Equivalent Transformations?

Azmine Toushik Wasi; Mahir Absar Khan; Abdur Rahman; Wahid Faisal; Sukanta Saha; Saimon Bhuiyan; Mahdiya Rahman Sukanya; Rahatun Nesa Priti; Md. Iqramul Hoque; Munem Shahriar; Raima Islam; Sanatan Sushil; Shahriyar Zaman Ridoy; Kazi Rajwan Sultan; MD Shafikul Islam; Md Manjurul Ahsan; Md Rizwan Parvez


Published: 17 Jun 2026, Last Modified: 26 Jun 2026

Venue: ICML 2026 AI4Math Workshop Poster

License: CC BY 4.0

Keywords: Large Language Models, reasoning, benchmark, invariance, robustness

Abstract: Reasoning is often attributed to large language models (LLMs), yet it remains unclear whether they operate over underlying semantics or rely on surface-form patterns. Existing benchmarks evaluate correctness on fixed problem instances, but overlook a fundamental property of reasoning: invariance under semantics-preserving transformations. If a model truly understands a problem, its predictions should remain consistent across equivalent representations. We introduce InvariantBench, a benchmark of $1{,}200$ seed problems $\times$ $4$ invariant forms, spanning 16 tasks across 3 reasoning families and 12 fine-grained invariance axes. Each problem is paired with multiple semantically equivalent variants under a strict invariance contract, enabling evaluation beyond accuracy to measure consistency across representations. Experiments on 15 frontier and open-weight LLMs reveal a persistent invariance gap: accuracy drops of $1$–$5\%$ for strong models and $>10$ pp for smaller systems under transformation, with full-consistency rates remaining below $5\%$ for most models, compared to $>48\%$ for the best systems and $>98\%$ for humans. These results show that high base accuracy substantially overestimates reasoning ability, and establish invariance as a necessary axis for evaluating and improving robust language understanding.
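The distinction the abstract draws between accuracy and consistency can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the benchmark's published scoring code: the function name `invariance_metrics`, the record layout `(seed_id, form_id, is_correct)`, the convention that `form_id == 0` marks the untransformed seed form, and the reading of "full consistency" as a model answering every form of a seed problem correctly.

```python
from collections import defaultdict


def invariance_metrics(records):
    """Summarise per-form correctness into accuracy and consistency scores.

    records: iterable of (seed_id, form_id, is_correct) tuples.
    Assumption: form_id 0 is the original form; other ids are transformed variants.
    """
    # Group correctness flags by seed problem, keyed by form.
    by_seed = defaultdict(dict)
    for seed_id, form_id, is_correct in records:
        by_seed[seed_id][form_id] = bool(is_correct)

    n_seeds = len(by_seed)

    # Accuracy on the original (untransformed) forms only.
    base_acc = sum(forms[0] for forms in by_seed.values()) / n_seeds

    # Accuracy pooled over all transformed variants.
    variant_flags = [
        ok for forms in by_seed.values() for f, ok in forms.items() if f != 0
    ]
    variant_acc = sum(variant_flags) / len(variant_flags)

    # Assumed definition: a seed is fully consistent if every form is correct.
    full_consistency = sum(all(forms.values()) for forms in by_seed.values()) / n_seeds

    return {
        "base_acc": base_acc,
        "variant_acc": variant_acc,
        "gap_pp": 100.0 * (base_acc - variant_acc),  # drop in percentage points
        "full_consistency": full_consistency,
    }
```

Under these assumptions, a model can score perfectly on the original forms while its full-consistency rate is much lower, because a single failed variant disqualifies the whole seed problem. That is the sense in which base accuracy alone can overstate robustness.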

Submission Number: 103
