PostAlign: Multimodal Grounding as a Corrective Lens for MLLMs

Published: 26 Jan 2026, Last Modified: 11 Feb 2026, ICLR 2026 Poster, CC BY 4.0
Keywords: multimodal grounding, MLLM hallucination, alignment
Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable performance on vision-language tasks such as image captioning and visual question answering. However, these models often struggle with fine-grained visual understanding and are prone to hallucination, largely because over-reliance on linguistic priors distracts them from the actual visual input, yielding outputs that are unanchored in the image content. To address these challenges, we introduce MMGrounded-PostAlign, a post-multimodal-alignment framework designed to enhance the visual understanding of MLLMs and mitigate hallucinations. In this framework, a visual grounding module identifies the objects referred to in the image, while a textual grounding module generates the rationale for the final answer; this dual grounding ensures that outputs are anchored in both visual and textual evidence. In particular, we incorporate a negative rejection mechanism within the visual grounding module to distinguish grounded entities from non-existent objects induced by linguistic bias. Moreover, we propose a selective reasoning mechanism within the textual grounding module that adapts the model's reasoning strategy to the complexity of the query. Together, these mechanisms suppress hallucination and strengthen the alignment between visual and textual modalities. Extensive evaluations on benchmarks including POPE, HaloQuest, ReasonSeg, MME, and MMBench demonstrate significant improvements in fine-grained visual understanding and hallucination suppression, showcasing the effectiveness of our approach on real-world multimodal tasks.
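The abstract names two mechanisms: negative rejection inside the visual grounding module and selective reasoning inside the textual grounding module. The sketch below illustrates one plausible way these could compose at inference time. It is a minimal sketch under our own assumptions, not the authors' implementation; every name in it (ground_visual, score_fn, complexity_fn, the prompt strings) is a hypothetical placeholder.

```python
# Hypothetical sketch of the dual-grounding inference flow described above.
# ground_visual, score_fn, llm, and complexity_fn are illustrative
# placeholders, not the paper's actual API.
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image coordinates


@dataclass
class GroundingResult:
    entity: str
    box: Optional[Box]  # None means the entity was rejected as non-existent


def ground_visual(image, entities: List[str], score_fn: Callable,
                  threshold: float = 0.5) -> List[GroundingResult]:
    """Visual grounding with negative rejection: an entity whose best region
    score falls below `threshold` is treated as a hallucinated object and
    rejected, rather than force-localized somewhere in the image."""
    results = []
    for entity in entities:
        box, score = score_fn(image, entity)  # best-matching region + confidence
        results.append(GroundingResult(entity, box if score >= threshold else None))
    return results


def answer(query: str, image, entities: List[str],
           score_fn: Callable, llm: Callable, complexity_fn: Callable):
    grounded = ground_visual(image, entities, score_fn)
    kept = [g.entity for g in grounded if g.box is not None]
    rejected = [g.entity for g in grounded if g.box is None]
    # Selective reasoning: generate an explicit rationale only when the
    # query is judged complex; simple queries are answered directly.
    rationale = None
    if complexity_fn(query):
        rationale = llm(f"Reason step by step using only grounded evidence {kept}: {query}")
    final = llm(f"Answer the query, treating {rejected} as absent from the image: {query}")
    return final, rationale
```

The design point this sketch tries to capture is that rejection is an explicit outcome of visual grounding (an entity may resolve to no region at all) rather than a forced localization, and that the cost of explicit reasoning is paid only for queries the complexity estimator deems hard.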
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7334