S-ATM: Self-Boosting Visual Reasoning via Adaptive Token Merging

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: VLM, Vision Reasoning, Token Merging
TL;DR: We propose S-ATM, a training-free decoding strategy that mitigates reasoning degradation in VLMs without extra supervision or reliance on external models.
Abstract: Vision-language models (VLMs), often adapted from large language models, tend to show degraded reasoning capabilities when visual inputs are introduced. To address this issue, we propose S-ATM, a training-free decoding strategy that enhances visual reasoning without relying on external priors. For each input, two parallel pathways are constructed: one using the original image–text input and the other using a self-generated caption–text input. Their decoding distributions are adaptively merged at each step, with the merging weight guided by the model's attention to visual tokens. A momentum-based smoothing mechanism further stabilizes this merging over time. We conduct comprehensive experiments on diverse visual reasoning benchmarks to demonstrate the effectiveness of S-ATM. Further analysis shows that S-ATM primarily activates at high-entropy forking tokens, which often correspond to reasoning transitions, and that momentum smoothing reduces decoding instability and maintains reasoning coherence. These findings underscore the role of token-level dynamics in supporting long-chain reasoning in VLMs.
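To make the decoding rule described in the abstract concrete, here is a minimal sketch of one merging step, assuming each pathway exposes per-step next-token logits and that a scalar visual-attention mass is available. The function name `s_atm_step`, the momentum value 0.9, and the choice to merge in probability space are illustrative assumptions, not the paper's specification.

```python
import torch

def s_atm_step(logits_vis: torch.Tensor,
               logits_cap: torch.Tensor,
               visual_attn: float,
               prev_weight: float,
               momentum: float = 0.9) -> tuple[torch.Tensor, float]:
    """One hypothetical S-ATM decoding step.

    logits_vis  -- next-token logits from the image-text pathway
    logits_cap  -- next-token logits from the caption-text pathway
    visual_attn -- attention mass the model places on visual tokens, in [0, 1]
    prev_weight -- smoothed merging weight carried over from the previous step
    momentum    -- smoothing coefficient (assumed value, not from the paper)
    """
    # Momentum-based smoothing of the attention-guided merging weight.
    weight = momentum * prev_weight + (1.0 - momentum) * visual_attn
    # Adaptive merge of the two pathway distributions; merging in
    # probability space (rather than logit space) is one plausible choice.
    probs = weight * logits_vis.softmax(-1) + (1.0 - weight) * logits_cap.softmax(-1)
    return probs.log(), weight

# Toy usage with random logits standing in for the two pathways.
vocab_size = 32000
prev_w = 0.5
for step in range(3):
    lv = torch.randn(vocab_size)      # image-text pathway logits
    lc = torch.randn(vocab_size)      # caption-text pathway logits
    attn = torch.rand(1).item()       # stand-in for visual-attention mass
    merged, prev_w = s_atm_step(lv, lc, attn, prev_w)
    token = merged.argmax().item()    # greedy pick from the merged distribution
```

One note on the design choice assumed here: merging in probability space keeps the result a valid distribution even if the two pathways produce logits on different scales, while the momentum term prevents the merging weight from oscillating step to step, matching the stabilization role the abstract attributes to smoothing.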
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10591