Balancing Precision and Richness in Image Caption Services for Enhanced Descriptive Accuracy

17 Sept 2025 (modified: 27 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Image Understanding, Image Captioning, Descriptive Precision, Richness Enhancement
Abstract: Current image captioning services typically learn to generate captions by imitating ground truth references, which are constrained by the limitations of manual annotation. As a result, details in images are overlooked and captions lack the richness and descriptive precision critical for enhanced image captioning services. To address this, we propose a CLIP-based image captioning framework designed to balance descriptive precision with richness enhancement. Our approach uses fine-grained pseudo tags for learning and integrates an asymmetric attention multi-modal projector to map and fuse information across modalities effectively. We also introduce an evaluation metric, Tags Coverage, to measure the granularity of generated captions, and incorporate it into the reward function for reinforcement learning. This eliminates the need for additional text annotations while capturing details left unannotated in the references. Experimental results on the MS-COCO Karpathy test set demonstrate the model's effectiveness, with improvements in CIDEr and Tags Coverage over state-of-the-art baselines, highlighting its potential for advancing precision and richness in image captioning services.
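The abstract does not give a formal definition of Tags Coverage, but a natural reading is the fraction of an image's reference tags that appear in the generated caption. The sketch below illustrates that interpretation; the function name, the word-level matching, and the empty-tag convention are all assumptions, not the paper's actual formulation.

```python
def tags_coverage(caption: str, tags: set[str]) -> float:
    """Fraction of reference tags mentioned in the generated caption.

    Assumed formulation: exact word-level matching after lowercasing.
    The paper's metric may use finer-grained matching (e.g. lemmatization
    or phrase-level tags); this is an illustrative simplification.
    """
    if not tags:
        return 0.0
    caption_words = set(caption.lower().split())
    covered = sum(1 for tag in tags if tag.lower() in caption_words)
    return covered / len(tags)
```

Used as a reward term in reinforcement learning, such a score would directly encourage the policy to mention fine-grained tags that standard reference-based rewards like CIDEr do not explicitly credit.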
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9232