Caption as Reward: Enhancing Vision-Language Reasoning through Dense Visual Description

ICLR 2026 Conference Submission14678 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision-Language Models, Reinforcement Learning, Visual Reasoning, Caption Generation, Reward Modeling, Multimodal Learning
TL;DR: We train vision-language models using reinforcement learning where rewards are based on how much generated captions improve downstream reasoning performance.
Abstract: Vision-Language Models (VLMs) struggle with complex visual reasoning tasks, and the contribution of intermediate visual understanding to their performance remains underexplored. We present Caption as Reward (CaR), a reinforcement learning framework that evaluates visual understanding quality through its impact on task performance. Unlike approaches that assess visual description quality independently through linguistic metrics, CaR introduces a gain-based reward mechanism that measures how much a generated visual description improves task performance relative to direct reasoning without one. This signal encourages models to adapt their visual understanding strategy to task complexity. We evaluate CaR on eight reasoning benchmarks using Qwen2.5-VL models (3B and 7B parameters). CaR achieves consistent improvements across model scales: our 3B model trained on 30K samples reaches 34.2\% average accuracy, outperforming both the SFT baseline (22.9\% with 20K samples, a +11.3 percentage point gain) and the 3B-Instruct baseline (29.8\%). For the 7B model, CaR improves performance from 36.5\% (GRPO) to 38.1\%, a 1.6 percentage point gain, demonstrating robust improvements regardless of model size. CaR's gain-based reward provides a principled training signal that directly links visual description quality to task performance, opening new directions for improving visual reasoning in VLMs without requiring expensive human annotations. Additional evaluation on MME-RealWorld confirms CaR's effectiveness in enhancing visual perception, with particularly strong gains in diagram understanding (+31.4 points) and OCR tasks (+8.1 points).
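The core idea of the gain-based reward can be sketched as follows: the reward for a generated caption is the improvement in downstream task score when the model reasons with the caption versus reasoning directly from the image. This is a minimal illustrative sketch only; the function names, the exact-match task metric, and the reward scale are assumptions, not the authors' actual implementation.

```python
def task_score(answer: str, reference: str) -> float:
    """Toy downstream task metric: 1.0 for an exact match, else 0.0.
    (Stand-in for the benchmark accuracy used in the paper.)"""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0

def gain_based_reward(answer_with_caption: str,
                      answer_direct: str,
                      reference: str) -> float:
    """Gain-based reward: score of caption-conditioned reasoning minus
    score of direct reasoning. Positive when the caption helps the task,
    zero when it makes no difference, negative when it hurts."""
    return task_score(answer_with_caption, reference) - task_score(answer_direct, reference)

# Example: the caption-conditioned answer is correct, the direct answer is not,
# so the caption earns a positive reward.
reward = gain_based_reward("a red bus", "a fire truck", reference="a red bus")
print(reward)  # → 1.0
```

In an RL loop such as GRPO, this scalar would replace (or augment) a purely linguistic caption-quality score, tying the caption policy's training signal directly to task performance.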
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 14678