Abstract: News captioning involves generating descriptions for news images based on the content of related news articles. Because these articles often contain extensive information not directly related to the image, generated captions can end up misaligned with the visual content. To mitigate this issue, we propose a novel cross-modal coherence-enhanced feedback prompting method that identifies the elements of an article that align closely with the visual content. Specifically, we first adapt CLIP into a news-specific image-text matching module, enriched with insights from the language model MPNet via a matching-score comparative loss that enables effective cross-modal knowledge distillation. This module strengthens image-sentence coherence by assigning a confidence score to each news sentence. We then design confidence-aware prompts to fine-tune the LLaVA model with the LoRA strategy, focusing it on the essential details within lengthy articles. Finally, we evaluate each generated caption with the refined CLIP module and construct confidence-feedback prompts that further enhance LLaVA through feedback learning, iteratively refining captions to improve their accuracy. Extensive experiments conducted on two public datasets, GoodNews and NYTimes800k, validate the effectiveness of our method.
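As a minimal sketch of the distillation idea described above: a pairwise comparative loss asks the student (CLIP-style image-sentence scores) to preserve the sentence ranking implied by the teacher (MPNet-derived scores). The function name, hinge form, and margin value below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def comparative_loss(student_scores, teacher_scores, margin=0.1):
    """Pairwise margin (hinge) loss for rank distillation.

    Wherever the teacher scores sentence i above sentence j, the
    student is penalized if its own score gap s[i] - s[j] falls
    below `margin`. A loss of 0 means the student reproduces every
    teacher-preferred ordering with at least the required gap.
    """
    s = np.asarray(student_scores, dtype=float)
    t = np.asarray(teacher_scores, dtype=float)
    losses = []
    for i in range(len(s)):
        for j in range(len(s)):
            if t[i] > t[j]:  # teacher prefers sentence i over j
                losses.append(max(0.0, margin - (s[i] - s[j])))
    return float(np.mean(losses)) if losses else 0.0

# Student agrees with the teacher's ranking by a wide gap: zero loss.
print(comparative_loss([0.9, 0.1], [0.8, 0.2]))  # → 0.0

# Student inverts the teacher's ranking: positive loss.
print(comparative_loss([0.1, 0.9], [0.8, 0.2]))
```

In practice the scores would come from CLIP image-sentence similarities (student) and MPNet sentence embeddings (teacher), with the loss backpropagated only through the student.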
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion, [Generation] Generative Multimedia
Relevance To Conference: This work not only enhances the precision of news captioning by effectively mitigating the impact of irrelevant details, but also contributes to the broader domain of multimedia processing by illustrating a successful application of cross-modal knowledge distillation and feedback learning mechanisms. The ability to distill knowledge across modalities and refine outputs through feedback is particularly relevant in a multimedia context, where ensuring coherence and relevance across different types of content is crucial. This work's success in improving news captioning fidelity showcases the potential of these techniques to advance the state-of-the-art in multimodal processing.
Submission Number: 4313