Observe-Then-Think: Learning to Elicit Multimodal Understanding by Decoupling Perception and Reasoning
Keywords: Vision–language models, Multimodal reasoning
Abstract: Vision-language models (VLMs) have achieved impressive results in fusing visual and textual inputs, yet they often stumble on tasks that demand complex multimodal reasoning. This imbalance arises from the inherent separation between perception (accurately interpreting sensory data) and reasoning (conducting multi-step, symbolic inference). To bridge this gap, we introduce a novel framework for multimodal reasoning, $\textit{OTT (Observe-Then-Think)}$, which comprises a two-stage post-training process: supervised fine-tuning (SFT) followed by reinforcement learning (RL). During SFT, the model learns to decouple perceptual understanding from logical inference, mastering structured output formats and maintaining logical consistency. In the RL stage, our Perception-Guided Consistency Optimization (PGCO) algorithm, inspired by human cognition, enhances visual understanding through perception rewards and employs consistency rewards to align perception with reasoning, improving final-answer accuracy and eliminating logical contradictions without external tool support. Extensive evaluations across six challenging benchmarks demonstrate that our method consistently outperforms state-of-the-art baselines, with an average improvement of 3.8\% over the baseline models, delivering both stronger perceptual grounding and more reliable multimodal reasoning in VLMs.
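As a rough illustration of how such a reward scheme could be composed (the abstract does not specify the exact formulation, so the decomposition and the weights $\lambda_{p}$, $\lambda_{c}$, $\lambda_{a}$ below are assumptions for exposition only), the RL objective for a response $o$ to a multimodal input $x$ might take the form of a weighted sum of perception, consistency, and answer rewards:
$$R(x, o) \;=\; \lambda_{p}\, R_{\text{perc}}(x, o) \;+\; \lambda_{c}\, R_{\text{cons}}(x, o) \;+\; \lambda_{a}\, R_{\text{ans}}(x, o),$$
where $R_{\text{perc}}$ would score the fidelity of the observation step, $R_{\text{cons}}$ would measure agreement between the stated observations and the subsequent reasoning chain, and $R_{\text{ans}}$ would check the final answer; this is a sketch of one plausible instantiation, not the paper's stated definition.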
Primary Area: foundation or frontier models, including LLMs
Submission Number: 18348