HalCap-Bench: Benchmarking Hallucination Detectors in Image Captioning

05 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: vision-language model, hallucination, image captioning
TL;DR: A new benchmark to evaluate hallucination detectors in image captioning
Abstract: Recent progress in large vision-language models (VLMs) has been driven by advances in image-text alignment, i.e., learning the relationship between images and text. Hallucination detection in captions, \textbf{HalDec}, can assess a VLM's image-text alignment ability; it aims to identify errors in VLM-generated captions that misrepresent image content. Detecting these errors is crucial not only for evaluating alignment ability but also for curating high-quality image-caption pairs used to train VLMs. While VLMs have been explored as hallucination detectors, their generalizability across captioning models, image domains, and hallucination types remains unclear due to the lack of a benchmark. In this work, we present HalDec-Bench, the first benchmark for principled and interpretable evaluation of HalDec models. It covers diverse VLMs used as captioning models and diverse image domains, and it provides high-quality hallucination-existence annotations enriched with hallucination-type labels. HalDec-Bench thus serves as a comprehensive testbed to advance HalDec and probe the image-text alignment ability of VLMs. Our analysis shows that HalDec-Bench offers tasks of varying difficulty, making it well suited as a HalDec benchmark. Evaluating diverse VLMs reveals key limitations: (i) CLIP-like models are nearly blind to hallucinations produced by recent VLMs, (ii) detectors tend to over-score early sentences, and (iii) they display strong self-preference, favoring their own captions, which undermines detection performance. We will release our evaluation code and dataset upon acceptance.
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2251