Pixels Lie, Code Doesn't: Thinking with Visual Programming for "Seemingly Impossible" Multimodal Agentic Reasoning Tasks

ICLR 2026 Conference Submission 15045 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: Multimodal Reasoning, Thinking with Visual Programming

TL;DR: We introduce MMR-VIP, a MultiModal Agentic Reasoning benchmark that consists of Visual Impossible Problems.

Abstract: To overcome the inherent limitations of Chain-of-Thought (CoT) and to further push the upper bound of multimodal reasoning capabilities, we introduce Thinking with Visual Programming (TVP), where models can iteratively interact with an external code executor to generate, run, and verify both visual and textual agentic operations as part of the reasoning loop. Motivated by the open question of how far Multimodal Large Language Models (MLLMs) still lag behind this paradigm, we introduce MMR-VIP, a MultiModal Agentic Reasoning benchmark built on Visual Impossible Problems. We design MMR-VIP with two key principles: (1) We construct a Difficulty Ladder grounded in computational complexity theory, structuring tasks from easy problems that can be solved with inherent perception and reasoning, through medium problems that require external computational tools, to hard problems that remain intractable even with programming assistance. (2) We decompose the paradigm of Thinking with Visual Programming into three Cognitive Skills, namely Perception, Abstraction, and Optimization, which correspond to perceiving visual inputs, abstracting them into problem formulations, and optimizing algorithms to obtain efficient solutions. Our experiments on MMR-VIP yield the following findings: (1) GPT-5, as a native TVP model, delivers the strongest overall results, yet its accuracy remains only 38.2%, underscoring substantial room for progress. (2) For commercial models, multi-turn code execution consistently surpasses direct CoT and single-turn execution, providing stable and significant improvements. (3) Across difficulty levels, performance follows a ladder-shaped trend, with negligible gains on easy tasks, the largest improvements on medium tasks, and steady advances on hard tasks. (4) From a cognitive perspective, TVP enhances optimization by offloading complex computation, search, and planning, but models still encounter bottlenecks in abstraction.
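The multi-turn loop described in the abstract (generate code, run it, inspect the result, revise) can be illustrated with a minimal sketch. This is not the authors' implementation: `run_code`, `tvp_loop`, and `scripted_model` are hypothetical names, the "model" is a hard-coded stub standing in for an MLLM, and the executor uses plain `exec` with no sandboxing, which a real system would need.

```python
import contextlib
import io
import traceback


def run_code(code):
    """Run a snippet and return its stdout, or an error report on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustration only: no sandboxing or time limits
    except Exception:
        return "ERROR\n" + traceback.format_exc(limit=1)
    return buffer.getvalue()


def tvp_loop(model, question, max_turns=4):
    """Alternate between model steps and code execution until an answer appears."""
    history = []  # list of (code, observation) pairs fed back to the model
    for _ in range(max_turns):
        kind, payload = model(question, history)
        if kind == "answer":
            return payload, history
        observation = run_code(payload)
        history.append((payload, observation))
    return None, history  # turn budget exhausted without an answer


def scripted_model(question, history):
    """Hypothetical stand-in for an MLLM: fails once, repairs, then answers."""
    if not history:
        return "code", "print(sum(grid))"  # buggy: grid is undefined
    last_observation = history[-1][1]
    if last_observation.startswith("ERROR"):
        return "code", "grid = [3, 1, 4]\nprint(sum(grid))"
    return "answer", last_observation.strip()


answer, history = tvp_loop(scripted_model, "What is the sum of the cells?")
```

In this toy run the first snippet fails, the error text is returned to the stub as an observation, and the second snippet succeeds. A single-turn setting, by contrast, would stop after the first failure.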

Supplementary Material: zip

Primary Area: datasets and benchmarks

Submission Number: 15045
