Keywords: Image Captioning, Caption Evaluation Metric, Multimodal Large Language Model, Large Language Model
TL;DR: We propose V-FactER, a method for improving the factuality of detailed image captioning models through the collaboration of an MLLM and an LLM, along with a framework and benchmark dataset for evaluating detailed image captions.
Abstract: Multimodal large language models (MLLMs) capable of interpreting images can generate highly detailed and extensive captions, owing to their advanced language modeling capabilities. However, the captions they produce frequently contain hallucinations. Furthermore, our empirical analysis reveals that existing hallucination detection methods are less effective in detailed image captioning tasks. We attribute this to the increasing reliance of MLLMs on their own generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a novel corrector-based method that decomposes a given caption into atomic propositions, evaluates the factuality of each unit, and revises the caption accordingly. Our method is training-free and can be applied in a plug-and-play manner to any captioning model. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that existing approaches to improve the factuality of MLLM outputs may fall short in detailed image captioning tasks. In contrast, our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.
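The abstract's corrector pipeline (decompose a caption into atomic propositions, verify each one, then revise) can be illustrated with a minimal toy sketch. The function names and the stand-in "models" below are purely hypothetical: a real system would call an LLM to decompose and an MLLM to verify against the image, whereas here sentence splitting and a set of known-true propositions stand in for both.

```python
# Hypothetical sketch of a decompose-verify-revise corrector loop, in the
# spirit described in the abstract. All names and the toy "models" are
# illustrative stand-ins, not the authors' actual implementation.

def decompose(caption: str) -> list[str]:
    """Stand-in for an LLM that splits a caption into atomic propositions.
    Here we naively split on sentence boundaries."""
    return [s.strip() for s in caption.split(".") if s.strip()]

def verify(proposition: str, image_facts: set[str]) -> bool:
    """Stand-in for an MLLM factuality check against the input image.
    Here the 'image' is a toy set of known-true propositions."""
    return proposition in image_facts

def revise(caption: str, image_facts: set[str]) -> str:
    """Keep only the propositions the verifier judges factual."""
    kept = [p for p in decompose(caption) if verify(p, image_facts)]
    return ". ".join(kept) + ("." if kept else "")

image_facts = {"A dog sits on a bench", "The bench is green"}
caption = "A dog sits on a bench. The bench is green. A cat sleeps nearby."
print(revise(caption, image_facts))
# -> A dog sits on a bench. The bench is green.
```

In this toy run, the hallucinated proposition about the cat is dropped while the two verifiable propositions survive, mirroring the training-free, plug-and-play correction the abstract describes.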
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4121