Research Area: Data, Evaluation, Human mind, brain, philosophy, laws and LMs
Keywords: Vision Language Model, Optical Illusions, Dataset, Benchmark, Hallucination
TL;DR: IllusionVQA dataset tests Vision Language Models' comprehension and localization of optical illusions, revealing their limitations compared to human performance.
Abstract: The advent of Vision Language Models (VLMs) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally led to the question: How do VLMs respond when the image itself is inherently *unreasonable*? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice VQA tasks - comprehension and soft localization. GPT-4V, the best-performing VLM, achieves 62.99\% accuracy (4-shot) on the comprehension task and 49.7\% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03\% and 100\% accuracy on comprehension and localization, respectively. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 601