Holistic Analysis of Hallucination in Large Vision-Language Models: Bias and Interference Challenges
Abstract: While GPT-4V(ision) impressively models both visual and textual information simultaneously, its hallucination behavior has not been systematically assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias and Interference Challenges in Visual Language Models (Bingo). This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference. Here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the text prompt is phrased or how the input image is presented. We identify a notable regional bias, whereby GPT-4V(ision) is better at interpreting Western images or images with English writing compared to images from other countries or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. We also identified similar biases and interference vulnerabilities with LLaVA and Bard. Our results characterize the hallucination challenges in GPT-4V(ision) and state-of-the-art visual-language models, and highlight the need for new solutions.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English; Chinese; French; Japanese; Arabic
Preprint Status: There is a non-anonymous preprint (URL specified in the next question).
A1: yes
A1 Elaboration For Yes Or No: In the section before the References section
A2: yes
A2 Elaboration For Yes Or No: Appendix Section B.1
A3: yes
A3 Elaboration For Yes Or No: Section 1
B: no
C: yes
C1: no
C1 Elaboration For Yes Or No: We ran our experiments through the GPT API. Therefore, we did not have access to the total computational budget.
C2: yes
C2 Elaboration For Yes Or No: Section 2
C3: yes
C3 Elaboration For Yes Or No: Section 3
C4: no
C4 Elaboration For Yes Or No: Our analysis did not involve the use of existing packages.
D: yes
D1: yes
D1 Elaboration For Yes Or No: Section 3
D2: no
D2 Elaboration For Yes Or No: We recruited student volunteers for annotation, and we did not provide payment for their participation. As such, we did not discuss the adequacy of payment for participants' demographics, as no payment was involved.
D3: yes
D3 Elaboration For Yes Or No: Section 2
D4: no
D4 Elaboration For Yes Or No: The data is not within the scope of ethical review.
D5: no
D5 Elaboration For Yes Or No: We obtained the data from the internet, and therefore, we did not have access to the basic demographic and geographic characteristics of the annotator population.
E: yes
E1: no
E1 Elaboration For Yes Or No: We used GPT for sentence proofreading, which was not highly relevant to the content of the paper itself.