Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models

Published: 27 Jul 2025 · Last Modified: 27 Jul 2025 · Accepted by TMLR · License: CC BY 4.0
Abstract: Despite their outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) may generate hallucinated content that does not exist in the given image. Most existing LVLM hallucination benchmarks are constrained to evaluating object-related hallucinations. However, the potential hallucination on the relation between two objects, i.e., relation hallucination, remains under-investigated. To remedy this, we design a unified framework that measures object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to evaluate hallucinations via (object, relation, object) triplets extracted from LVLMs' responses, making it easily generalizable to various vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark that can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that relation hallucination is an even more serious issue than object hallucination among existing LVLMs, highlighting a previously neglected problem on the path toward reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs.
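For concreteness, the triplet-level idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's released implementation: the function names, the scene-graph representation, and the exact-match check (standing in for the benchmark's actual judging procedure) are all simplifying assumptions of this sketch.

```python
# Hypothetical sketch: classify (object, relation, object) triplets extracted
# from an LVLM response against a ground-truth scene graph for the image.
from typing import Dict, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject object, relation, target object)

def evaluate_triplets(
    response_triplets: List[Triplet],
    scene_graph: Set[Triplet],
) -> Dict[str, float]:
    """Label each extracted triplet as grounded or hallucinated.

    In this simplified scheme, a triplet is an object hallucination if
    either endpoint object is absent from the scene graph, and a relation
    hallucination if both objects exist but the stated relation does not.
    """
    gt_objects = {o for s, _, t in scene_graph for o in (s, t)}
    object_halluc, relation_halluc = [], []
    for subj, rel, obj in response_triplets:
        if subj not in gt_objects or obj not in gt_objects:
            object_halluc.append((subj, rel, obj))
        elif (subj, rel, obj) not in scene_graph:
            relation_halluc.append((subj, rel, obj))
    n = max(len(response_triplets), 1)  # avoid division by zero
    return {
        "object_hallucination_rate": len(object_halluc) / n,
        "relation_hallucination_rate": len(relation_halluc) / n,
    }

# Toy usage: one grounded triplet, one object hallucination ("dog").
scene_graph = {("man", "riding", "horse"), ("horse", "on", "beach")}
response = [("man", "riding", "horse"), ("dog", "chasing", "horse")]
print(evaluate_triplets(response, scene_graph))
```

Because the unit of evaluation is the triplet rather than the full response, the same check applies to any task whose output can be parsed into triplets, which is what makes the framework task-agnostic.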
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We would like to sincerely thank the reviewers and the Area Editor for their time, thoughtful feedback, and constructive suggestions throughout the review process. We have carefully revised the manuscript to address all comments and incorporate the suggested improvements, and the updated version of the paper has been uploaded accordingly.

First, we have included additional evaluation results on Tri-HE using more recent LVLMs.

> Comparison with Recent LVLMs

We have added comparisons with more recent LVLMs, including Qwen2.5-VL-7B-Instruct and InternVL2_5-8B. Further details can be found in Appendix D.

Second, we have provided baseline comparisons for our hallucination mitigation method.

> Baseline Comparison

We have incorporated results using LogicCheckGPT, a prompting-based, training-free method that aligns closely with our setting. As shown in Table 7, our method outperforms LogicCheckGPT, further demonstrating its effectiveness in mitigating hallucinations. To provide a more comprehensive evaluation, we also include results from two additional directions for addressing hallucinations: a decoding-based method (VCD) and a reinforcement learning (RL)-based method (OPA-DPO). Detailed comparisons are presented in Appendix D.2. The results in Table 10 further highlight the effectiveness of our approach among training-free methods. While OPA-DPO achieves the lowest hallucination rate, which is expected given its additional training, it could potentially be integrated with our method for further fine-tuning using RL techniques. We leave this integration as future work.

Third, we thank the reviewers for highlighting the potential of applying our proposed methods within the broader visual hallucination literature. While we appreciate this perspective, we would like to emphasize that our primary motivation is to identify a rarely discussed hallucination type, relation hallucination, alongside the more common object hallucination in a unified framework. Our proposed method is designed to be generalizable across a wide range of vision-language tasks involving model-generated responses. That said, we agree that extending the application scope of our method could enhance its relevance in broader contexts, and we have accordingly added a discussion in Appendix E.
Code: https://kaichen1998.github.io/projects/tri-he/
Supplementary Material: zip
Assigned Action Editor: ~Chunyuan_Li1
Submission Number: 4566