Keywords: Large Vision Language Model, Image Captioning
TL;DR: We present ScaleCap, a scalable image captioning framework.
Abstract: This paper presents ScaleCap, a scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in two inherent biases of LVLMs: multimodal bias, which results in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; and linguistic bias, which leads to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy that continuously enriches and calibrates the caption as the inference budget grows. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic bias. With increased inference cost, ScaleCap raises more heuristic questions to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality-alignment experiments demonstrate the effectiveness of ScaleCap: annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap showcases the richness and fidelity of its generated captions on two additional tasks: replacing images with captions in VQA, and reconstructing images from captions to assess semantic coverage.
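To make the contrastive sentence rating idea concrete, here is a minimal sketch of sentence-level contrastive filtering. The function names and the toy scorers are hypothetical stand-ins: in the actual method the two scores would come from an LVLM's log-likelihood of each caption sentence with and without the image, whereas here plain dictionaries play that role. A sentence whose likelihood barely improves when the image is visible is attributed to linguistic bias and dropped.

```python
def contrastive_sentence_filter(sentences, score_with_image, score_without_image, margin=0.0):
    """Keep sentences whose log-likelihood rises once the image is visible.

    A sentence the language model finds (nearly) equally likely with or
    without the image is suspected to stem from linguistic bias, i.e. a
    hallucination, and is removed from the caption.
    """
    kept = []
    for s in sentences:
        contrastive_score = score_with_image(s) - score_without_image(s)
        if contrastive_score > margin:
            kept.append(s)
    return kept


# Toy log-likelihood scorers standing in for LVLM calls (illustrative only).
with_image = {"A red bus is parked.": -2.0, "People are smiling.": -3.0}
without_image = {"A red bus is parked.": -5.0, "People are smiling.": -2.9}

caption_sentences = list(with_image)
kept = contrastive_sentence_filter(caption_sentences, with_image.get, without_image.get)
# The grounded sentence gains 3.0 nats from seeing the image and is kept;
# the ungrounded one gains nothing and is filtered out.
```

Because scoring is done sentence by sentence on a finished draft caption, this filtering can run offline, after generation, which is what makes the calibration step scale with the inference budget.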
Primary Area: foundation or frontier models, including LLMs
Submission Number: 2248