Diverse Inference for Solving ARC at a Human Level

Published: 08 Oct 2025, Last Modified: 08 Oct 2025, Venue: Agents4Science, Readers: Everyone, License: CC BY 4.0
Keywords: Abstraction and Reasoning Corpus (ARC), diverse inference, model aggregation, reasoning LLMs, near‑perfect verifiers
TL;DR: Aggregate diverse models and methods with near‑perfect verifiers to solve ARC puzzles beyond human accuracy
Abstract: Reasoning LLMs have made significant progress in mathematics and coding, yet they struggle with advanced generalization tasks such as Abstraction and Reasoning Corpus (ARC) puzzles. To address this, we propose a diverse inference approach that aggregates multiple models and methods at test time to improve performance. We automatically verify the correctness of solutions to ARC puzzles by code. On a validation set of 400 ARC puzzles, our approach increases the success rate from 53% to 69.5% without reasoning models and from 91.5% to 93.75% with them, exceeding average human accuracy, which is between 73.3% and 77.2%. Our approach solves ARC puzzles that state-of-the-art reasoning LLMs and 948 humans could not: it solves 26.5% of the ARC puzzles that reasoning models do not solve and 80% of the ARC puzzles that 948 humans could not. We identify the relationship between the number of diverse models and methods and the performance on verifiable problems. We automate the MARC method with an agentic framework built on a computation graph, enabling modular, scalable, and autonomous execution and comparison of ARC problem-solving pipelines. This work makes progress toward building flexible, generalizable reasoning systems.
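The abstract's core idea, aggregating diverse solvers and keeping only outputs that pass a code-based check, can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes that "verification by code" means running each candidate transformation on a puzzle's training input/output pairs and accepting it only if every pair is reproduced exactly. The names `passes_training_pairs`, `solve_with_diverse_candidates`, and the three toy solvers are hypothetical stand-ins for real models and methods.

```python
from typing import Callable, Optional

Grid = list[list[int]]
Solver = Callable[[Grid], Grid]


def passes_training_pairs(solver: Solver, train_pairs: list[tuple[Grid, Grid]]) -> bool:
    """Assumed verifier: a candidate passes only if it reproduces every training output."""
    for grid_in, grid_out in train_pairs:
        try:
            if solver(grid_in) != grid_out:
                return False
        except Exception:
            # A candidate that crashes on a training input is rejected.
            return False
    return True


def solve_with_diverse_candidates(
    candidates: list[Solver],
    train_pairs: list[tuple[Grid, Grid]],
    test_input: Grid,
) -> Optional[Grid]:
    """Return the output of the first verified candidate, or None if none verifies."""
    for solver in candidates:
        if passes_training_pairs(solver, train_pairs):
            return solver(test_input)
    return None


# Hypothetical toy candidates standing in for diverse models and methods.
def flip_horizontal(grid: Grid) -> Grid:
    return [list(reversed(row)) for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    return [list(row) for row in reversed(grid)]


def transpose(grid: Grid) -> Grid:
    return [list(col) for col in zip(*grid)]


# One training pair whose hidden rule is a vertical flip.
train_pairs = [([[1, 2], [3, 4]], [[3, 4], [1, 2]])]
candidates = [flip_horizontal, transpose, flip_vertical]
prediction = solve_with_diverse_candidates(candidates, train_pairs, [[5, 6], [7, 8]])
print(prediction)
```

Under this assumption, adding a candidate to the pool can only help: a puzzle is solved if any one candidate verifies, which is one way to read the abstract's link between the number of diverse models and methods and performance on verifiable problems.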
Supplementary Material: zip
Submission Number: 207