Keywords: Image Captioning, Reinforcement Learning, Large Vision-Language Model
TL;DR: We present CapRL, an effective decoupled two-stage training scheme with a verifiable caption reward for boosting image captioning models.
Abstract: Image captioning is a fundamental task that bridges the visual and linguistic domains and plays a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on expensive, non-scalable data annotated by humans or proprietary models. This approach often leads to models that memorize specific ground-truth answers, limiting their generality and their ability to generate diverse, creative descriptions. To overcome these limitations of SFT, we propose applying the Reinforcement Learning with Verifiable Rewards (RLVR) paradigm to the open-ended task of image captioning. A primary challenge, however, is designing an objective reward function for the inherently subjective question of what constitutes a "good" caption. We introduce Captioning Reinforcement Learning (CapRL), a novel training framework that redefines caption quality through its utility: a high-quality caption should enable a non-visual language model to accurately answer questions about the corresponding image. CapRL employs a decoupled two-stage pipeline in which an LVLM generates a caption, and the objective reward is derived from the accuracy of a separate, vision-free LLM answering Multiple-Choice Questions based solely on that caption. As the first study to apply RLVR to the subjective image captioning task, we demonstrate that CapRL yields substantial improvements across multiple settings. Pre-training on the CapRL-5M caption dataset, annotated by CapRL-3B, results in substantial gains across 12 benchmarks. Moreover, within the Prism framework for caption quality evaluation, CapRL achieves performance comparable to Qwen2.5-VL-72B while exceeding the baseline by an average margin of 8.4%. These results validate that CapRL effectively trains models to produce more general and accurate image descriptions, moving beyond the limitations of traditional SFT-based image captioning models.
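For concreteness, the sketch below illustrates the utility-based reward the abstract describes: a caption is scored by how accurately a vision-free LLM can answer multiple-choice questions about the image from the caption alone. The `MCQ` container, prompt format, and `answer_fn` wrapper are illustrative assumptions, not the paper's reference implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MCQ:
    question: str
    options: list[str]  # e.g. ["A. two dogs", "B. one dog", "C. a cat"]
    answer: str         # ground-truth option letter, e.g. "B"

def build_prompt(caption: str, mcq: MCQ) -> str:
    """Prompt for the vision-free LLM: it sees only the caption,
    never the image, and must reply with a single option letter.
    (Hypothetical prompt format, for illustration.)"""
    return (
        f"Image description: {caption}\n\n"
        f"Question: {mcq.question}\n"
        + "\n".join(mcq.options)
        + "\n\nAnswer with a single option letter."
    )

def caption_reward(
    caption: str,
    mcqs: list[MCQ],
    answer_fn: Callable[[str], str],  # wraps an LLM call: prompt -> letter
) -> float:
    """Verifiable reward for the second stage: the fraction of image
    MCQs the text-only LLM answers correctly from the caption alone."""
    if not mcqs:
        return 0.0
    correct = sum(answer_fn(build_prompt(caption, q)) == q.answer for q in mcqs)
    return correct / len(mcqs)
```

In this framing, an uninformative caption earns near-chance accuracy while a comprehensive one earns a high score, and the resulting scalar can drive a standard RLVR policy-gradient update on the captioning LVLM (the abstract does not specify the particular policy-optimization algorithm).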
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3123