Caption as Reward: Enhancing Vision-Language Reasoning through Dense Visual Description

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision-Language Models, Reinforcement Learning, Visual Reasoning, Caption Generation, Reward Modeling, Multimodal Learning
TL;DR: We train vision-language models using reinforcement learning where rewards are based on how much generated captions improve downstream reasoning performance.
Abstract: Recent advances in reinforcement learning (RL) for large language models have demonstrated remarkable reasoning capabilities using simple question-answer supervision. A natural question arises: can we train vision-language models (VLMs) to reason over images through reinforcement learning alone, without explicit chain-of-thought (CoT) annotations? Our investigation reveals a critical bottleneck: over 60\% of VLM reasoning failures stem from inadequate visual perception rather than logical errors. Furthermore, we find that standard RL approaches optimize reasoning chains without ensuring accurate visual understanding, leading to confident but incorrect answers. We argue that the key to effective visual reasoning is to explicitly evaluate whether visual descriptions actually improve task performance. Therefore, we propose Caption as Reward (CaR), a framework that assigns rewards to captions based on their downstream reasoning utility rather than their linguistic quality. CaR uses a gain-based mechanism: captions that fix reasoning errors receive high rewards, while those that degrade correct predictions are penalized. Trained on 50K visual question-answer pairs without any CoT supervision, our 3B model outperforms strong baselines including Visionary-R1, TBAC-VLR1, and VLAA-Thinker on eight challenging visual reasoning benchmarks. Additional evaluation on MME-RealWorld confirms substantial improvements in visual perception, particularly for diagram understanding and OCR tasks. Code and checkpoints will be released upon acceptance.
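For illustration, here is a minimal sketch of the gain-based reward idea described in the abstract, assuming a hypothetical `model.generate` API and a simple +1 / 0 / -1 scoring rule; the paper's exact formulation is not specified here, so treat this as an assumed approximation rather than the authors' implementation.

```python
# Sketch of a gain-based caption reward: score a caption by how it changes
# downstream answer correctness. The model API and scoring values are assumptions.

def answer_is_correct(model, question: str, answer: str, caption: str | None = None) -> bool:
    """Hypothetical helper: query the reasoning model (with or without the caption)
    and compare its prediction against the ground-truth answer."""
    prompt = question if caption is None else f"Image description: {caption}\n{question}"
    prediction = model.generate(prompt)  # assumed text-generation interface
    return prediction.strip() == answer.strip()


def caption_reward(model, question: str, answer: str, caption: str) -> float:
    """Reward the caption by the gain (or loss) it induces in answer correctness."""
    correct_without = answer_is_correct(model, question, answer)
    correct_with = answer_is_correct(model, question, answer, caption)

    if correct_with and not correct_without:
        return 1.0   # caption fixed a reasoning error -> high reward
    if correct_without and not correct_with:
        return -1.0  # caption degraded a correct prediction -> penalty
    return 0.0       # outcome unchanged -> neutral reward
```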
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 14678