Enriched Image Captioning Based on Knowledge Divergence and Focus

An-An Liu, Quanhan Wu, Ning Xu, Hongshuo Tian, Lanjun Wang

Published: 2025, Last Modified: 23 Jan 2026. IEEE Trans. Circuits Syst. Video Technol. 2025. CC BY-SA 4.0

Abstract: Image captioning is a fundamental task in computer vision that aims to automatically generate precise and comprehensive descriptions of images. Intuitively, humans first rely on the image content, e.g., “cake on a plate”, to gradually gather relevant knowledge facts, e.g., “birthday party”, “candles”, a process referred to as divergence. They then perform step-by-step reasoning based on the image to refine and rearrange these knowledge facts for explicit sentence generation, a process referred to as focus. However, existing image captioning methods mainly rely on the encoder-decoder framework, which does not fit the “divergence-focus” nature of the task well. To this end, we propose the knowledge “divergence-focus” method for Image Captioning (K-DFIC) to gather and polish knowledge facts for image understanding. It consists of two components: 1) The Knowledge Divergence Module leverages the divergence capability of a large-scale pre-trained model to acquire knowledge facts relevant to the image content. To achieve this, we design a scene-graph-aware prompt that serves as a “trigger” for GPT-3.5, encouraging it to “diverge” and generate more sophisticated, human-like knowledge. 2) The Knowledge Focus Module refines the acquired knowledge facts and rearranges them in a coherent manner. We design an interactive refining network that encodes the knowledge and refines it with the visual features to remove irrelevant words. Then, to generate fluent image descriptions, we design a rearrangement method based on a large-scale pre-trained model that estimates the importance of each knowledge word for an image. Finally, we fuse the refined knowledge and visual features to assist the decoder in generating captions. We demonstrate the superiority of our approach through extensive experiments on the MSCOCO dataset. Our approach surpasses state-of-the-art performance across all metrics on the Karpathy split. For example, our model obtains the best CIDEr-D score of 148.4%. Additional ablation studies and visualizations further validate the effectiveness of our approach.

External IDs: dblp:journals/tcsv/LiuWXTW25