Evaluating the Fidelity of Image Captioning via Weighted Boolean Question Answering

Kaixuan Wang, Shasha Li, Jintao Tang, Kehan Long, Yongzhu Miao, Fangda Chen, Ting Wang

Published: 2024, Last Modified: 07 Jan 2026, Venue: NLPCC (3) 2024, License: CC BY-SA 4.0

Abstract: Image captioning evaluation is of great significance in guiding caption generation, and practically valuable evaluation requires fine-grained metrics. However, most current metrics can only measure the overall quality of a caption, so a low score tells users nothing about where the caption is defective. As fidelity is one of the quality criteria of image captioning and has a great impact on applications, we propose WBQA, a metric that evaluates the fidelity of a caption by Weighted Boolean Question Answering. Because the questions that check whether the caption matches the image are expressed in natural language, WBQA can show which parts of a caption are unfaithful. Experiments show that our metric measures fidelity well (raising the Pearson correlation coefficient from 0.31 and 0.34 to 0.52 and 0.58) and achieves state-of-the-art results on multiple image captioning evaluation datasets.
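To make the idea of a weighted boolean score concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how questions are generated, answered, or weighted, so everything here is an assumption for illustration: the `QuestionResult` type, the `weighted_boolean_score` function, the example questions and weights, and the choice of aggregating as the weighted fraction of questions answered "yes".

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    # All fields are hypothetical; the abstract does not define them.
    question: str   # yes/no question derived from the caption
    answer: bool    # True if a (hypothetical) answering model confirms it for the image
    weight: float   # assumed non-negative importance of the question


def weighted_boolean_score(results):
    """Return (score, failed_questions) under the assumed aggregation.

    score is the weighted fraction of questions answered True;
    failed_questions lists the questions answered False, which is
    what makes the unfaithful parts of the caption visible.
    """
    total = sum(r.weight for r in results)
    if total <= 0:
        raise ValueError("at least one question must have positive weight")
    supported = sum(r.weight for r in results if r.answer)
    failed = [r.question for r in results if not r.answer]
    return supported / total, failed


# Made-up example for the caption "a brown dog on a beach".
results = [
    QuestionResult("Is there a dog in the image?", True, 2.0),
    QuestionResult("Is the dog brown?", False, 1.0),
    QuestionResult("Is the dog on a beach?", True, 1.0),
]
score, failed = weighted_boolean_score(results)
print(score, failed)
```

In this sketch the list of failed questions plays the role the abstract describes: it points at the specific claim in the caption that the image does not support, rather than reporting only a single overall number.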