Abstract: Diffusion-based image captioning methods have been proposed to address the inherent issues of autoregressive models, such as slow inference, accumulative errors, and limited generative diversity. However, due to an excessive reliance on textual data and a constrained training objective, existing diffusion-based methods suffer from a semantic gap between vision and language, ultimately resulting in poor-quality captions. To address this issue, we propose a novel diffusion-based semantics-aligned image captioning framework, namely DSACap. Specifically, DSACap departs from existing methods, which treat text as the target of noise addition and denoising, and instead applies these processes directly to the image, thus reducing the loss of visual-semantic alignment. In addition, we introduce a reinforcement learning-based training strategy to maximize the semantic alignment between image and text. We feed the generated textual descriptions into an image generation model to reconstruct the original image and use the cosine similarity between the reconstructed image and the original image as the reward to train the image captioning model. Extensive experimental results on the MS COCO dataset demonstrate that DSACap achieves a CIDEr score of 128.8, clearly outperforming existing diffusion-based image captioning methods. Our code will be made publicly available soon.
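To make the reward described above concrete, the following is a minimal sketch of the reconstruction-based reward computation, assuming a PyTorch setting; `text_to_image_model` and `image_encoder` are hypothetical placeholders (the abstract does not specify which generation model or image encoder is used), and only the cosine-similarity step reflects the stated method.

```python
import torch
import torch.nn.functional as F


def reconstruction_reward(original_image: torch.Tensor,
                          generated_caption: str,
                          text_to_image_model,
                          image_encoder) -> torch.Tensor:
    """Sketch of the reward: cosine similarity between embeddings of the
    original image and an image reconstructed from the generated caption.

    `text_to_image_model` and `image_encoder` are assumed callables,
    not part of any specific library API.
    """
    # Reconstruct an image from the generated caption (assumed interface).
    reconstructed = text_to_image_model(generated_caption)

    # Embed both images with a shared image encoder (e.g., a CLIP-style encoder).
    with torch.no_grad():
        z_orig = image_encoder(original_image.unsqueeze(0))
        z_recon = image_encoder(reconstructed.unsqueeze(0))

    # The cosine similarity in embedding space serves as the scalar reward
    # used to update the captioning model via reinforcement learning.
    return F.cosine_similarity(z_orig, z_recon, dim=-1).squeeze(0)
```

In this sketch the reward is a single scalar per image-caption pair; how it is plugged into the policy-gradient update is left unspecified, as the abstract does not detail the reinforcement learning algorithm.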
DOI: 10.1145/3746027.3755156