Abstract: Abstract reasoning, the ability to extract and apply abstract rules, reflects the intelligence and generalization capacity of LLMs.
However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning.
To address this, we introduce an automated pipeline named A$^2$RBench, encompassing generation, expansion, evaluation, and analysis.
Specifically, in the generation stage, LLMs create diverse tasks that demand genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand into new input spaces to generate task variations, which allows the benchmark to scale.
However, this process may introduce hallucinations.
To eliminate them, we further establish a theoretical framework and prove that programmatic verification, which tests whether the inverse operation perfectly reverses the forward operation (cycle consistency), guarantees a unique solution.
Through extensive evaluations on mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans on a representative subset (39.8\% vs. 68.5\%). (2) The 3D tasks that current LLMs generate fall far short of their 2D and 1D tasks in complexity, revealing a lack of understanding of high-dimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process.
Code and data are available at: https://github.com/MAC-AutoML/A2Rbench.
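The cycle-consistency check described in the abstract can be illustrated with a minimal sketch. This is not the authors' verifier: the grid rules (`rotate_cw`, `rotate_ccw`, `flip_h`) and the helper `is_cycle_consistent` are hypothetical names chosen for illustration, and the check only tests a finite sample of inputs, so it should be read as a sanity test under these assumptions rather than as the proof the paper refers to.

```python
from typing import Callable, Iterable, List

Grid = List[List[int]]


def rotate_cw(grid: Grid) -> Grid:
    """Hypothetical forward rule: rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate_ccw(grid: Grid) -> Grid:
    """Hypothetical inverse rule: rotate a 2D grid 90 degrees counterclockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def flip_h(grid: Grid) -> Grid:
    """A deliberately wrong 'inverse': mirror each row left to right."""
    return [row[::-1] for row in grid]


def is_cycle_consistent(
    forward: Callable[[Grid], Grid],
    inverse: Callable[[Grid], Grid],
    inputs: Iterable[Grid],
) -> bool:
    """Return True if inverse(forward(x)) == x for every sampled input x."""
    return all(inverse(forward(x)) == x for x in inputs)


# Assumed sample inputs; a real pipeline would draw these from the task's input space.
samples: List[Grid] = [
    [[1, 2], [3, 4]],
    [[0, 1, 2], [3, 4, 5]],
    [[7]],
]

valid_pair = is_cycle_consistent(rotate_cw, rotate_ccw, samples)
invalid_pair = is_cycle_consistent(rotate_cw, flip_h, samples)
```

In this sketch, a generated rule whose claimed inverse fails the round trip on any sampled input (as `flip_h` does here) would be rejected, which is one way a hallucinated task could be filtered out programmatically.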
Lay Summary: Accurately measuring the abstract reasoning ability of AI models remains challenging: existing benchmarks either rely on expensive manual creation, limiting their scale, or risk measuring mere memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A²RBench, where models generate and expand diverse reasoning tasks at scale. Because this automated process can produce logical errors (hallucinations), we establish a framework using programmatic verification: we prove that testing whether an "inverse" operation perfectly reverses a "forward" operation guarantees that every generated puzzle has a unique, verifiable solution. Through extensive evaluations, our system reveals three key insights. First, current models exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans (39.8% vs. 68.5%). Second, models reveal a deep lack of understanding when handling high-dimensional (3D) spatial tasks. Third, counterintuitively, providing these models with inputs that have higher information complexity can actually simplify their reasoning process.
Link To Code: https://github.com/MAC-AutoML/A2Rbench
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Abstract Reasoning, Benchmark
Submission Number: 11906