S-ATM: Self-Boosting Visual Reasoning via Adaptive Token Merging

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: VLM, Vision Reasoning, Token Merging
TL;DR: We propose S-ATM, a training-free decoding strategy that mitigates reasoning degradation in VLMs without extra supervision or reliance on external models.
Abstract: Vision-language models (VLMs), often adapted from large language models, tend to show degraded reasoning capabilities when visual inputs are introduced. To address this issue, we propose S-ATM, a training-free decoding strategy that enhances visual reasoning without relying on external priors. For each input, two parallel pathways are constructed: one using the original image–text input and the other using a self-generated caption–text input. Their decoding distributions are adaptively merged at each step, with the merging weight guided by the model's attention to visual tokens. A momentum-based smoothing mechanism further stabilizes this merging over time. We conduct comprehensive experiments on diverse visual reasoning benchmarks to demonstrate the effectiveness of S-ATM. Further analysis shows that S-ATM primarily activates at high-entropy forking tokens, which often correspond to reasoning transitions, and that momentum smoothing reduces decoding instability and maintains reasoning coherence. These findings underscore the role of token-level dynamics in supporting long-chain reasoning in VLMs.
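To make the decoding rule described in the abstract concrete, here is a minimal sketch of one merging step, assuming each pathway exposes per-step next-token logits and that a scalar visual-attention mass is available. The function name `s_atm_step`, the momentum value 0.9, and the choice to merge in probability space are illustrative assumptions, not the paper's specification.

```python
import torch

def s_atm_step(logits_vis: torch.Tensor,
               logits_cap: torch.Tensor,
               visual_attn: float,
               prev_weight: float,
               momentum: float = 0.9) -> tuple[torch.Tensor, float]:
    """One hypothetical S-ATM decoding step.

    logits_vis  -- next-token logits from the image-text pathway
    logits_cap  -- next-token logits from the caption-text pathway
    visual_attn -- attention mass the model places on visual tokens, in [0, 1]
    prev_weight -- smoothed merging weight carried over from the previous step
    momentum    -- smoothing coefficient (assumed value, not from the paper)
    """
    # Momentum-based smoothing of the attention-guided merging weight.
    weight = momentum * prev_weight + (1.0 - momentum) * visual_attn
    # Adaptive merge of the two pathway distributions; merging in
    # probability space (rather than logit space) is one plausible choice.
    probs = weight * logits_vis.softmax(-1) + (1.0 - weight) * logits_cap.softmax(-1)
    return probs.log(), weight

# Toy usage with random logits standing in for the two pathways.
vocab_size = 32000
prev_w = 0.5
for step in range(3):
    lv = torch.randn(vocab_size)      # image-text pathway logits
    lc = torch.randn(vocab_size)      # caption-text pathway logits
    attn = torch.rand(1).item()       # stand-in for visual-attention mass
    merged, prev_w = s_atm_step(lv, lc, attn, prev_w)
    token = merged.argmax().item()    # greedy pick from the merged distribution
```

One note on the design choice assumed here: merging in probability space keeps the result a valid distribution even if the two pathways produce logits on different scales, while the momentum term prevents the merging weight from oscillating step to step, matching the stabilization role the abstract attributes to smoothing.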
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10591