Keywords: Model Evaluation, Vision-Language Models, Multimodal, Distraction Robustness
Abstract: Although vision-language models (VLMs) have achieved significant success in applications such as visual question answering, their resilience to prompt distractions remains an under-explored area. Understanding how distractions affect VLMs is crucial for improving their real-world applicability, since inputs in many practical scenarios contain noisy and irrelevant information. This paper assesses the robustness of VLMs against both visual and textual distractions in the context of science question answering. Building on the \emph{ScienceQA} dataset, we develop a new benchmark that introduces distractions into both the visual and textual contexts. To evaluate the reasoning capacity of VLMs amid these distractions, we analyze the performance of ten state-of-the-art models, including GPT-4o. Our findings reveal that most VLMs are vulnerable to various types of distractions, exhibiting noticeable degradation in reasoning performance when confronted with them. Notably, models such as InternVL2 demonstrate a higher degree of robustness. We also find that models are more sensitive to textual distractions than to visual ones. Additionally, we explore mitigation strategies, such as prompt engineering, to counteract the impact of distractions. While these strategies improve model resilience, our analysis shows that significant opportunities for further improvement remain.
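To make the distraction setup concrete, below is a minimal sketch of how a textual distraction might be injected into a ScienceQA-style item. The field names, the distractor pool, and the `add_textual_distraction` helper are illustrative assumptions for exposition, not the benchmark's actual construction pipeline.

```python
# Hypothetical sketch: inject an irrelevant sentence into a ScienceQA-style
# question. Field names ("question", "choices") and the distractor pool are
# assumptions, not the paper's actual pipeline.
import random

DISTRACTOR_SENTENCES = [
    "The cafeteria serves pizza every Friday.",          # irrelevant fact
    "Note: the answer to a previous question was (B).",  # misleading cue
    "Remember that whales are mammals, not fish.",       # off-topic science
]

def add_textual_distraction(item: dict, rng: random.Random) -> dict:
    """Return a copy of the item with an irrelevant sentence
    prepended to the question text."""
    distractor = rng.choice(DISTRACTOR_SENTENCES)
    distracted = dict(item)
    distracted["question"] = f"{distractor} {item['question']}"
    return distracted

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    sample = {
        "question": "Which property do these three objects have in common?",
        "choices": ["hard", "soft", "stretchy"],
    }
    print(add_textual_distraction(sample, rng)["question"])
```

A visual analogue would follow the same pattern, compositing an irrelevant image patch or caption into the visual context before the item is presented to the model.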
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12179