Keywords: Multimodal Large Language Models, Reasoning Models
TL;DR: We propose an MLLM that uses executable code for visual chain-of-thought reasoning: atomic abilities are learned during supervised training, while emergent behaviors arise during the RL stage, enabling flexible tool use and strong benchmark performance.
Abstract: Recent releases such as o3 highlight human-like “thinking with images” reasoning that combines structured tool use with stepwise verification. Most open-source approaches, however, still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce ExeVision, which explores executable code as a universal solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), ExeVision defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we design a Balanced Adaptive Tool-call reward that balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe emergent behaviors during RL training: ExeVision exhibits novel tool invocations, previously unseen tool compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism for executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that ExeVision not only consistently outperforms schema-driven and text-only baselines but also surpasses advanced closed models such as GPT-4o and larger open-source models.
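To make the "executable code as a universal solver" idea concrete, here is a minimal sketch of what one code-driven reasoning step could look like: model-emitted Python that crops a region of interest, draws a bounding box as a visual artifact, and hands the rendered evidence back for the next turn. The helper names (`zoom_in`, `draw_box`), the PIL-based rendering, and the file paths are illustrative assumptions, not the paper's implementation.

```python
# Sketch of one executable visual-reasoning step (assumed helpers, not the authors' code).
from PIL import Image, ImageDraw


def zoom_in(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop the region of interest so fine-grained detail can be inspected."""
    return image.crop(box)


def draw_box(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Render a bounding box as a self-checkable visual artifact."""
    annotated = image.copy()
    ImageDraw.Draw(annotated).rectangle(box, outline="red", width=3)
    return annotated


# Hypothetical code emitted by the model during one reasoning turn:
image = Image.open("chart.png")          # input image (placeholder path)
roi = (120, 80, 360, 240)                # candidate region proposed in the chain of thought
crop = zoom_in(image, roi)               # intermediate result fed back to the model
evidence = draw_box(image, roi)          # rendered evidence supporting verification
evidence.save("step1_evidence.png")
```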
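The Balanced Adaptive Tool-call reward is described only at a high level here; the following sketch shows one plausible shaping scheme that rewards task success, gives a small bonus for using tools at all (exploration), and penalizes calls beyond a budget (efficiency). The functional form, budget, and weights are assumptions for illustration.

```python
# Assumed reward shaping in the spirit of the Balanced Adaptive Tool-call reward;
# the exact form used in the paper is not specified here.
def balanced_adaptive_tool_call_reward(
    task_reward: float,      # e.g., 1.0 if the final answer is correct, else 0.0
    num_tool_calls: int,     # number of code/tool executions in the rollout
    target_calls: int = 2,   # assumed reasonable tool budget for the task
    bonus: float = 0.05,     # assumed small bonus encouraging tool exploration
    penalty: float = 0.1,    # assumed per-call penalty discouraging overuse
) -> float:
    reward = task_reward
    if num_tool_calls > 0:
        reward += bonus
    reward -= penalty * max(0, num_tool_calls - target_calls)
    return reward
```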
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 345