EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models

08 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Hallucinations, Instructional Visual Language Models, Visual Grounding
TL;DR: EAGLE is a post-pretraining method that strengthens the visual encoder of multimodal LLMs, reducing hallucinations in IT-VLMs and improving visual grounding without additional instructional tuning. It achieves consistent gains across six models and four benchmarks.
Abstract: Large language models and vision transformers have shown impressive zero-shot capabilities, enabling significant transferability in downstream tasks. The fusion of these models has resulted in multi-modal architectures with enhanced instructional capabilities. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. These failure cases (false positives) are known as hallucinations. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve visual representation. In this paper, we address the hallucination issue by directly enhancing the capabilities of the visual component. Our approach, named EAGLE, is fully agnostic to the LLM or fusion module and works as a post-pretraining stage that improves the grounding and language alignment of the visual encoder. We show that a straightforward reformulation of the original contrastive pre-training task results in an improved visual encoder that can be incorporated into the instructional multi-modal architecture without any additional instructional training. Extensive empirical validation shows that EAGLE significantly reduces hallucinations across six different instructional multi-modal models and four challenging benchmarks.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 3091