SiTu: A Simple Training-Free Thinking-with-Image Approach via Uncertainty Guidance

12 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Thinking with Image, MLLM, Training-free, test-time scaling
Abstract: Large Multimodal Models (LMMs) have shown great promise in complex reasoning by incorporating images as intermediate steps, a paradigm known as "thinking with images". However, most current "thinking with images" techniques are training-based, incurring significant computational cost, limiting model versatility, and risking catastrophic forgetting. To bridge this gap, we propose SiTu, a simple, training-free framework for "thinking with images" that leverages an LMM's inherent uncertainty to achieve test-time scaling for multimodal reasoning. The core of our approach is the discovery of a stable, entropy-based uncertainty estimate native to LMMs, which reliably guides the dynamic combination of diverse perception-enhancement paths. We implement three simple perceptual actions, categorized as visual highlighting and irrelevant-information suppression, and demonstrate a notable scaling phenomenon: as the number and diversity of these actions increase, the LMM's reasoning ability improves consistently. Extensive experiments on fine-grained visual understanding benchmarks, including V*, HR-Bench 4K, HR-Bench 8K, and MME-RealWorld, show that SiTu significantly outperforms existing training-free perception-enhancement methods. Surprisingly, SiTu even surpasses current state-of-the-art training-based "thinking with images" methods, highlighting the immense potential of test-time scaling for multimodal reasoning.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4497