FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability

ACL ARR 2025 February Submission1556 Authors

14 Feb 2025 (modified: 09 May 2025)ACL ARR 2025 February SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Large Vision Language Models (LVLMs) have achieved significant progress in integrating visual and textual inputs for multimodal reasoning. However, a recurring challenge is ensuring these models utilize visual information as effectively as linguistic content when both modalities are necessary to formulate an accurate answer. We hypothesize that hallucinations arise due to the lack of effective visual grounding in current LVLMs. Furthermore, current vision-language benchmarks are not specifically measuring the degree to which the answer require the visual input. This limitation makes it challenging to confirm that the image is truly necessary, particularly in tasks like visual question answering. In this work, we introduce FiVL, a novel method for constructing datasets designed to train LVLMs for enhanced visual grounding and also evaluate their effectiveness in achieving it. We demonstrate the value of our datasets through three approaches. First, we introduce a novel training task based on our augmented training dataset, resulting in better performance than the baseline. Second, we present benchmarks to assess the model's ability to use image as substantive evidence, rather than relying solely on linguistic priors. Finally, we identify attention heads with the strongest vision-language alignment, enabling explainability on visual-driven hallucinations. The dataset and code will be publicly available.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, vision-language grounding, vision-language alignment, lvlm
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: english
Submission Number: 1556
Loading