KE-Diffusion: Knowledge-Enhanced Diffusion for Image Captioning via Object-Level Semantic Conditioning

ACL ARR 2026 January Submission 1480 Authors

30 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Image Captioning, Diffusion Models, Multimodal Generation, Semantic Conditioning, Knowledge-Enhanced Models, Non-Autoregressive Generation
Abstract: Image captioning systems face a long-standing trade-off between generation diversity, semantic fidelity, and computational efficiency. Autoregressive models often suffer from limited diversity and error accumulation, while recent diffusion-based approaches improve diversity at the cost of increased model complexity or insufficient semantic grounding. In this work, we propose KE-diffusion, a knowledge-enhanced lightweight diffusion model for image captioning that integrates object-level visual perception with semantic conditioning in a parameter-efficient manner. Instead of relying on global image embeddings or prefix-based conditioning, KE-diffusion constructs compact visual–semantic condition vectors from detected object regions and injects them directly into the reverse diffusion process via model-level feature concatenation. This design enables effective semantic guidance while preserving the efficiency and parallel generation advantages of diffusion models. Extensive experiments on MS-COCO and Flickr30k demonstrate that KE-diffusion consistently improves semantic accuracy and caption diversity over prior lightweight diffusion models. Additional analyses on cross-domain captioning and visualization further validate the robustness and interpretability of the proposed approach.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodal learning, vision-language grounding, image captioning, conditional text generation, diffusion models
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 1480