Building More Accountable Multi-Modal LLMs Through Spatially-Informed Visual Reasoning

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster (CC BY 4.0)
Keywords: Self-reflection, Self-evaluation
TL;DR: We propose a multi-modal LLM framework for self-evaluation and self-reflection.
Abstract: Recent research has demonstrated that debate mechanisms among Large Language Models (LLMs) hold remarkable potential for enhancing reasoning capabilities and promoting responsible text generation. However, it remains an open question whether debate strategies generalize effectively to Multi-Modal Large Language Models (MLLMs). In this paper, we address this challenge by proposing a location-aware debate framework designed specifically for MLLMs to mitigate hallucination without requiring additional external knowledge. Our approach introduces an asymmetric debate structure across both textual and visual modalities. For textual processing, one MLLM instance generates a comprehensive image description while identifying object locations, and a second instance "zooms in" on specific regions of interest to evaluate and refine the initial description. For visual processing, we introduce a novel hybrid attention module that fuses visual self-attention with cross-modal attention between textual and visual information, effectively highlighting critical content regions. The framework incorporates a judge component that evaluates the complete debate process and selects the more reliable output of the two debating instances. Our experimental results demonstrate that this approach substantially reduces hallucination across diverse MLLMs and evaluation metrics. Moreover, the framework serves as a readily integrable complement to existing hallucination mitigation methods. By employing consistent procedures and standardized prompts across all investigated tasks, our framework proves both effective and highly adaptable, enabling direct application to a broad range of black-box MLLMs without architectural modifications.
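To make the hybrid attention idea concrete, the sketch below blends per-patch visual self-attention with text-to-image cross-attention into a single saliency weighting. This is a minimal illustration, not the paper's implementation: the mixing weight `alpha`, the mean-pooling of attention maps, and the function name `hybrid_attention` are all assumptions introduced here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_attention(visual, text, alpha=0.5):
    """Fuse visual self-attention with text-to-visual cross-attention.

    visual: (Nv, d) image-patch features
    text:   (Nt, d) text-token features
    alpha:  hypothetical mixing weight between the two attention maps
    Returns re-weighted patch features of shape (Nv, d).
    """
    d = visual.shape[-1]
    # Self-attention scores among image patches, pooled to a per-patch weight
    self_map = softmax(visual @ visual.T / np.sqrt(d)).mean(axis=0)   # (Nv,)
    # Cross-modal scores: each text token attends over the image patches,
    # then the maps are averaged into one per-patch saliency vector
    cross_map = softmax(text @ visual.T / np.sqrt(d)).mean(axis=0)    # (Nv,)
    # Blend the two maps so patches relevant to the text are highlighted
    weights = alpha * self_map + (1 - alpha) * cross_map
    return visual * weights[:, None]
```

In this toy form, patches that score highly under both the image's own self-attention and the text-conditioned cross-attention retain the most signal, which mirrors the abstract's goal of highlighting critical content regions.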
Submission Number: 36