Tracking the Evolution of Multimodal Reasoning on Visual Puzzles

Published: 28 Dec 2025, Last Modified: 08 Mar 2026 · AAAI 2026 Bridge LMReasoning (Oral) · CC BY 4.0
Keywords: multimodal, reasoning, analysis
TL;DR: A comprehensive analysis of the reasoning performance of the latest closed-source and open-source models on multimodal puzzles
Abstract: The releases of OpenAI's o-[n] series, such as o1, o3, and o4-mini, mark a significant paradigm shift in Large Language Models towards advanced reasoning capabilities. Notably, models like o3 have demonstrated strong performance on benchmarks like the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns, whereas humans often perceive and reason about multimodal scenarios involving both vision and language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models (including o1, o3, and o4-mini) and compare them against leading open-source alternatives on challenging multimodal puzzles from PuzzleVQA and AlgoPuzzleVQA. Our results reveal that the o-[n] series, particularly later iterations, significantly outperform both the GPT-[n] series and the evaluated open-source models, establishing clear performance tiers. Nonetheless, despite these substantial advancements, our findings highlight that even leading models face persistent challenges. Difficulties are particularly evident in tasks requiring precise visual perception, robust compositional reasoning, and solving complex algorithmic puzzles. Our analysis further reveals that while closed-source models are primarily bottlenecked by perception, open-source models struggle with both perception and subsequent inductive reasoning, indicating critical areas for future AGI development. We plan to continuously track new models and update our results in this paper accordingly. All resources used in this evaluation are openly available at https://github.com/declare-lab/LLM-PuzzleTest.
Submission Number: 73