Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective

Published: 01 May 2025 · Last Modified: 23 Jul 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We achieve state-of-the-art open-source performance on the ARC-AGI challenge by combining LLMs with algorithmically supported sampling and product-of-experts scoring.
Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2 cents per task on readily available hardware.
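As a rough illustration of the search step described in the abstract, here is a minimal depth-first sketch that enumerates every completion whose cumulative log-probability stays above a threshold. It is not the authors' implementation; `logprob_fn`, the threshold, and all parameters are hypothetical stand-ins for a next-token call to an LLM.

```python
def dfs_candidates(logprob_fn, prompt_tokens, min_logp=-6.0, eos_id=0,
                   max_len=512, top_k=8):
    """Enumerate completions whose cumulative log-probability stays above
    `min_logp`, via depth-first search over the token tree.

    `logprob_fn(tokens)` is a hypothetical stand-in for one LLM forward
    pass: it should return (token_id, logp) pairs for the next token.
    """
    results = []

    def recurse(tokens, logp):
        if len(tokens) >= max_len:
            return  # give up on over-long branches
        for tok, lp in sorted(logprob_fn(tokens), key=lambda x: -x[1])[:top_k]:
            branch_logp = logp + lp
            if branch_logp < min_logp:
                continue  # prune: log-probs only decrease deeper in the tree
            if tok == eos_id:
                results.append((tokens + [tok], branch_logp))
            else:
                recurse(tokens + [tok], branch_logp)

    recurse(list(prompt_tokens), 0.0)
    return results  # diverse, high-probability candidates for scoring
```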
Lay Summary: Large language models (LLMs) have achieved remarkable success in various domains, yet they often falter when faced with tasks demanding abstract reasoning. The Abstraction and Reasoning Corpus (ARC) serves as a benchmark to evaluate such capabilities, presenting challenges that, while intuitive for humans, remain elusive for AI systems. In this work, we explored whether refining the training and evaluation processes of LLMs could enhance their performance on such complex tasks. We generated multiple variations of each ARC puzzle and used the same LLM both to produce solution candidates and to judge how well each candidate fits the given patterns. By combining these perspectives through a "product of experts" approach that multiplies the independent expert assessments, we achieved state-of-the-art performance among publicly available methods. This methodology led to a significant improvement, achieving a 71.6% success rate on the public ARC evaluation set, surpassing average human performance. Notably, our solution is transparent, reproducible, and cost-effective, requiring only about 2 cents per task on standard hardware. We also evaluated our method on Sudoku puzzles, achieving a high solve rate, and we provide a mathematical explanation for why this approach improves performance. Our findings demonstrate that with thoughtful design and leveraging existing model capabilities, it's possible to bridge the gap between AI and human abstract reasoning on the ARC benchmark.
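To illustrate the "product of experts" selection in the spirit of the summary above, here is a minimal sketch. The names `augmentations` and `seq_logprob` are hypothetical placeholders for the task transformations and the LLM log-probability call, not the interface from the released code.

```python
def product_of_experts_score(task, candidate, augmentations, seq_logprob):
    """Combine the model's verdicts across augmented views of one task.

    Each augmentation maps (task, candidate) to a consistently transformed
    pair, e.g. a rotation or color permutation applied to both. `seq_logprob`
    is a hypothetical helper returning the LLM's log-probability of the
    candidate answer given the (augmented) task prompt.
    """
    # A product of probabilities is a sum of log-probabilities; a single
    # near-zero probability under any one view vetoes the candidate, so
    # only answers that look plausible from every perspective survive.
    return sum(seq_logprob(aug_task, aug_answer)
               for aug_task, aug_answer in (aug(task, candidate)
                                            for aug in augmentations))

def select_answer(task, candidates, augmentations, seq_logprob):
    # Pick the candidate the combined experts agree on most strongly.
    return max(candidates, key=lambda c: product_of_experts_score(
        task, c, augmentations, seq_logprob))
```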
Link To Code: https://github.com/da-fr/Product-of-Experts-ARC-Paper
Primary Area: Deep Learning->Large Language Models
Keywords: Abstraction and Reasoning Corpus (ARC), ARC-AGI, Large Language Models, Product-of-Experts, Search-based Methods, Test-Time Training, Data Augmentation, Abstract Reasoning, Puzzle Solving
Submission Number: 10238