$A^{4}$-MLRM: Fourfold Attention for Adaptive Hallucination Suppression in Multimodal Large Reasoning Model

03 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Hallucination, Multimodal Large Language Models, Multimodal Reasoning
TL;DR: An attention-based, training-free inference method that suppresses hallucinations in multimodal large reasoning models.
Abstract: Large multimodal reasoning models have recently shown a strong ability to solve complex problems by gathering evidence and performing multi-step inference. However, the long reasoning chain makes them more prone to hallucination, that is, generating content that is not supported by the input image or the question. In examining how hallucination arises, we further identify \emph{reasoning drift}: during evidence gathering, the model over-focuses on entities unrelated to the question, diluting attention to task-relevant cues. As a result, previous attention-based methods developed for non-reasoning models often fail to localize the true evidence in reasoning settings. Based on these insights, we introduce \emph{AttnRecall}, a metric for assessing visual perception, and present $A^{4}$-MLRM, a training-free, parameter-free, and architecture-agnostic plugin for hallucination suppression. $A^{4}$-MLRM uses the model output as a conduit from the question to the visual tokens, identifying question-relevant patches and steering focus toward task-relevant regions. Remarkably, \textbf{without any additional training}, $A^{4}$-MLRM improves all \textbf{reasoning} architectures (including \texttt{R1-OneVision}, \texttt{Ocean-R1}, \texttt{MM-Eureka}, \textit{etc.}) by $\mathbf{1.21\times}$ on reasoning benchmarks. When transferred to \textbf{non-reasoning} settings, it yields a $\mathbf{1.16\times}$ gain.
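For intuition only, below is a minimal NumPy sketch of the "output as a conduit" idea the abstract describes, not the paper's actual algorithm: the names `patch_relevance`, `attn_recall`, and `steer`, the boost strength `alpha`, and the top-$k$ definition of AttnRecall are all assumptions, and a real implementation would operate on per-layer, per-head attention slices inside the model.

```python
import numpy as np

def patch_relevance(attn_out_to_q: np.ndarray, attn_out_to_img: np.ndarray) -> np.ndarray:
    """Score image patches using output tokens as a conduit (assumed reading).

    attn_out_to_q:   (T_out, T_q)   attention from output tokens to question tokens
    attn_out_to_img: (T_out, T_img) attention from output tokens to image patches

    An output token that attends strongly to the question is treated as a
    reliable router: its image attention is counted toward patch relevance.
    """
    conduit_weight = attn_out_to_q.sum(axis=1)        # (T_out,) question-focus per output token
    relevance = conduit_weight @ attn_out_to_img      # (T_img,) weighted image attention
    return relevance / (relevance.sum() + 1e-8)       # normalize to a distribution

def attn_recall(relevance: np.ndarray, gt_patches: list[int], k: int) -> float:
    """Hypothetical AttnRecall: fraction of ground-truth evidence patches
    that appear among the top-k patches ranked by relevance."""
    topk = set(np.argsort(relevance)[-k:].tolist())
    return len(topk & set(gt_patches)) / max(len(gt_patches), 1)

def steer(attn_out_to_img: np.ndarray, relevance: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Reweight image attention toward high-relevance patches, then renormalize."""
    boosted = attn_out_to_img * (1.0 + alpha * relevance)
    return boosted / boosted.sum(axis=1, keepdims=True)

# Toy usage with random attention maps (8 output tokens, 5 question tokens, 16 patches).
rng = np.random.default_rng(0)
A_q = rng.random((8, 5))
A_v = rng.random((8, 16)); A_v /= A_v.sum(axis=1, keepdims=True)
rel = patch_relevance(A_q, A_v)
print(attn_recall(rel, gt_patches=[3, 7, 11], k=4))
print(steer(A_v, rel).shape)  # (8, 16)
```

Under this reading, the training-free property follows directly: the method only rescales attention at inference time, so no weights are updated.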
Primary Area: generative models
Submission Number: 1335