Observe-Then-Think: Learning to Elicit Multimodal Understanding by Decoupling Perception and Reasoning
Keywords: Vision–language models, Multimodal reasoning
Abstract: Vision-language models (VLMs) have achieved impressive results in fusing visual and textual inputs, yet they often stumble on tasks that demand complex multimodal reasoning. This imbalance arises from the inherent separation between perception (accurately interpreting sensory data) and reasoning (conducting multi-step, symbolic inference). To bridge this gap, we introduce a novel framework for multimodal reasoning, $\textit{OTT (Observe-Then-Think)}$, which comprises a two-stage post-training process: supervised fine-tuning (SFT) followed by reinforcement learning (RL). During SFT, the model learns to decouple perceptual understanding from logical inference, mastering structured output formats and maintaining logical consistency. In the RL stage, our Perception-Guided Consistency Optimization (PGCO) algorithm, inspired by human cognition, enhances visual understanding through perception rewards and employs consistency rewards to align perception with reasoning, improving final-answer accuracy and eliminating logical contradictions without external tool support. Extensive evaluations across six challenging benchmarks demonstrate that our method consistently outperforms state-of-the-art baselines, with an average improvement of 3.8\% over the baseline models, delivering both stronger perceptual grounding and more reliable multimodal reasoning in VLMs.
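As a rough illustration of how such a reward scheme could be composed (the abstract does not specify the exact formulation, so the decomposition and the weights $\lambda_{p}$, $\lambda_{c}$, $\lambda_{a}$ below are assumptions for exposition only), the RL objective for a response $o$ to a multimodal input $x$ might take the form of a weighted sum of perception, consistency, and answer rewards:
$$R(x, o) \;=\; \lambda_{p}\, R_{\text{perc}}(x, o) \;+\; \lambda_{c}\, R_{\text{cons}}(x, o) \;+\; \lambda_{a}\, R_{\text{ans}}(x, o),$$
where $R_{\text{perc}}$ would score the fidelity of the observation step, $R_{\text{cons}}$ would measure agreement between the stated observations and the subsequent reasoning chain, and $R_{\text{ans}}$ would check the final answer; this is a sketch of one plausible instantiation, not the paper's stated definition.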
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18348