A²RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

Qingchuan Ma; Yuexiao Ma; Yongkang Xie; Tianyu Xie; Xiawu Zheng; Rongrong Ji

Published: 30 Apr 2026, Last Modified: 24 Jun 2026. Venue: ICML 2026 regular. License: CC BY 4.0

Abstract: Abstract reasoning, the ability to extract and apply abstract rules, reflects the intelligence and generalization capacity of LLMs. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, which limits their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A²RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks that demand genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand into new input spaces to generate task variations, which scales the benchmark. However, this automated process can introduce hallucinations. To eliminate them, we further establish a theoretical framework and prove that programmatic verification, which tests whether the inverse operation exactly reverses the forward operation (cycle consistency), guarantees a unique solution. Through extensive evaluations on mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans on a representative subset (39.8% vs. 68.5%). (2) The 3D tasks that current LLMs generate are far less complex than the 1D and 2D tasks they generate, revealing a lack of understanding of high-dimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process. Code and data are available at: https://github.com/MAC-AutoML/A2Rbench.
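
As a rough illustration of the cycle-consistency check described in the abstract, the sketch below tests whether a candidate inverse operation exactly reverses a forward operation on a handful of sample inputs. It is not taken from the A²RBench code base: the function names (`is_cycle_consistent`, `rotate_right`, `rotate_left`, `sort_cells`, `identity`), the 1D list-of-integers grid encoding, and the sample inputs are all assumptions made for this example. It also checks only a finite sample, whereas the paper's uniqueness guarantee rests on its theoretical framework.

```python
from typing import Callable, Iterable, List

# Hypothetical encoding: a 1D task input is a flat list of integer cell values.
Grid = List[int]
Rule = Callable[[Grid], Grid]


def is_cycle_consistent(forward: Rule, inverse: Rule, inputs: Iterable[Grid]) -> bool:
    """Return True if inverse(forward(x)) == x for every sampled input x."""
    return all(inverse(forward(list(x))) == list(x) for x in inputs)


def rotate_right(grid: Grid) -> Grid:
    """Forward rule: cyclically shift every cell one position to the right."""
    return grid[-1:] + grid[:-1]


def rotate_left(grid: Grid) -> Grid:
    """Candidate inverse: cyclically shift every cell one position to the left."""
    return grid[1:] + grid[:1]


def sort_cells(grid: Grid) -> Grid:
    """A lossy forward rule: sorting discards the original cell order."""
    return sorted(grid)


def identity(grid: Grid) -> Grid:
    """A (wrong) candidate inverse for sort_cells, used to show a failing check."""
    return list(grid)


# Illustrative sample inputs; at least one is unsorted so the lossy rule is exposed.
samples = [[1, 2, 3, 4], [3, 1, 2], [7]]

# The rotation pair passes: the left shift undoes the right shift on every sample.
rotation_ok = is_cycle_consistent(rotate_right, rotate_left, samples)

# The sorting pair fails: [3, 1, 2] cannot be recovered from [1, 2, 3].
sorting_ok = is_cycle_consistent(sort_cells, identity, samples)
```

In this sketch a rule pair that fails the check would be discarded, since a forward rule with no exact inverse can map different inputs to the same output and so leave the task's answer ambiguous.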

Lay Summary: Accurately measuring the abstract reasoning ability of AI models remains challenging: existing benchmarks either rely on expensive manual creation, limiting their scale, or risk measuring mere memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A²RBench, where models generate and expand diverse reasoning tasks at scale. Because this automated process can produce logical errors (hallucinations), we establish a framework using programmatic verification: we prove that testing whether an "inverse" operation perfectly reverses a "forward" operation guarantees that every generated puzzle has a unique, verifiable solution. Through extensive evaluations, our system reveals three key insights. First, current models exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans (39.8% vs. 68.5%). Second, models reveal a deep lack of understanding when handling high-dimensional (3D) spatial tasks. Third, counterintuitively, providing these models with inputs that have higher information complexity can actually simplify their reasoning process.

Link To Code: https://github.com/MAC-AutoML/A2Rbench

Primary Area: Deep Learning->Large Language Models

Keywords: Large Language Models, Abstract Reasoning, Benchmark

Submission Number: 11906
