Keywords: Image Captioning, Evaluation, Vision and Language, LLM as a judge
Abstract: Given the accelerating progress of vision and language modeling, accurate evaluation of machine-generated image captions remains critical. To align evaluation with human preferences, metrics must discriminate between captions of varying quality and content. Conventional metrics, however, compare captions only through superficial word overlap or embedding similarity and therefore leave room for improvement. This paper presents VisCE2, a vision-language model (VLM)-based caption evaluation method. Our method focuses on visual context, i.e., the detailed content of an image, including objects, attributes, and relationships. By extracting these elements and organizing them into a structured format, we replace human-written references with visual contexts, helping the VLM better understand the image and improving evaluation performance. Through meta-evaluation on multiple datasets, we show that VisCE2 outperforms conventional pre-trained metrics in capturing caption quality and achieves superior consistency with human judgment.
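To make the two-stage idea in the abstract concrete, here is a minimal sketch of a VisCE2-style pipeline: a first VLM call extracts a structured visual context, and a second call scores the candidate caption against that context. The function names, prompt wording, and the generic `vlm` callable are illustrative assumptions, not the paper's exact prompts or implementation.

```python
# Hedged sketch of a VisCE2-style two-stage evaluation pipeline.
# Prompts and helper names are hypothetical; plug in any VLM backend.
from typing import Callable

# A VLM is modeled as a callable: (image_path, prompt) -> response text.
VLM = Callable[[str, str], str]


def extract_visual_context(vlm: VLM, image_path: str) -> str:
    """Stage 1: ask the VLM to enumerate objects, attributes, and relationships."""
    prompt = (
        "List the objects in the image, their attributes, and the "
        "relationships between them, as a structured bullet list."
    )
    return vlm(image_path, prompt)


def score_caption(vlm: VLM, image_path: str, caption: str, visual_context: str) -> str:
    """Stage 2: rate the candidate caption against the extracted visual context."""
    prompt = (
        "Visual context:\n"
        f"{visual_context}\n\n"
        f"Candidate caption: {caption}\n"
        "On a scale of 0-100, how accurately and completely does the caption "
        "describe the image? Answer with a single integer."
    )
    return vlm(image_path, prompt)


# Example wiring with a user-supplied VLM backend (hypothetical):
# context = extract_visual_context(my_vlm, "photo.jpg")
# score = score_caption(my_vlm, "photo.jpg", "A dog catching a frisbee.", context)
```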
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4186