Holistic Analysis of Hallucination in Large Vision-Language Models: Bias and Interference Challenges

Anonymous

16 Feb 2024, ACL ARR 2024 February Blind Submission, Readers: Everyone

Abstract: While GPT-4V(ision) impressively models both visual and textual information simultaneously, its hallucination behavior has not been systematically assessed. To bridge this gap, we introduce a new benchmark, namely, the Bias and Interference Challenges in Visual Language Models (Bingo). This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference. Here, bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data. Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted due to how the text prompt is phrased or how the input image is presented. We identify a notable regional bias, whereby GPT-4V(ision) is better at interpreting Western images or images with English writing compared to images from other countries or containing text in other languages. Moreover, GPT-4V(ision) is vulnerable to leading questions and is often confused when interpreting multiple images together. Popular mitigation approaches, such as self-correction and chain-of-thought reasoning, are not effective in resolving these challenges. We also identify similar biases and interference vulnerabilities in LLaVA and Bard. Our results characterize the hallucination challenges in GPT-4V(ision) and state-of-the-art visual-language models, and highlight the need for new solutions.

Paper Type: long

Research Area: Resources and Evaluation

Contribution Types: Model analysis & interpretability, NLP engineering experiment

Languages Studied: English; Chinese; French; Japanese; Arabic

Preprint Status: There is a non-anonymous preprint (URL specified in the next question).

A1: yes

A1 Elaboration For Yes Or No: In the Section before References section

A2: yes

A2 Elaboration For Yes Or No: Appendix Section B.1

A3: yes

A3 Elaboration For Yes Or No: Section 1

B: no

C: yes

C1: no

C1 Elaboration For Yes Or No: We ran our experiments through the GPT API. Therefore, we did not have access to the total computational budget.

C2: yes

C2 Elaboration For Yes Or No: Section 2

C3: yes

C3 Elaboration For Yes Or No: Section 3

C4: no

C4 Elaboration For Yes Or No: Our analysis did not involve the use of existing packages.

D: yes

D1: yes

D1 Elaboration For Yes Or No: Section 3

D2: no

D2 Elaboration For Yes Or No: We recruited student volunteers for annotation, and we did not provide payment for their participation. As such, we did not discuss the adequacy of payment for participants' demographics, as no payment was involved.

D3: yes

D3 Elaboration For Yes Or No: Section 2

D4: no

D4 Elaboration For Yes Or No: The data is not within the scope of ethical review.

D5: no

D5 Elaboration For Yes Or No: We obtained the data from the internet, and therefore, we did not have access to the basic demographic and geographic characteristics of the annotator population.

E: yes

E1: no

E1 Elaboration For Yes Or No: We used GPT for sentence proofreading, which was not highly relevant to the content of the paper itself.
