Abstract: Large Vision-Language Models (LVLMs) have recently achieved impressive results in multimodal tasks such as image captioning and visual question answering. However, they remain prone to \textbf{object hallucination}—generating descriptions of nonexistent or misidentified objects. Prior work has partially mitigated this issue via auxiliary training objectives or external modules, but such approaches often lack scalability, adaptability, or model independence. To address these limitations, we propose \textbf{Adaptive Token Ensemble Decoding (ATED)}, a training-free, token-level ensemble framework that mitigates hallucination by aggregating predictions from multiple LVLMs during inference. ATED dynamically computes uncertainty-based weights for each model, reflecting its reliability at each decoding step, and integrates diverse decoding paths to improve contextual grounding and semantic consistency. Experiments on standard hallucination detection benchmarks demonstrate that ATED significantly outperforms state-of-the-art methods, reducing hallucination without compromising fluency or relevance. Our findings highlight the benefits of adaptive ensembling and point to a promising direction for improving LVLM robustness in high-stakes vision-language applications. Code is available at [https://anonymous.4open.science/r/ATED](https://anonymous.4open.science/r/ATED)
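The abstract describes adaptive, uncertainty-weighted token-level ensembling across LVLMs at inference time. Below is a minimal, hypothetical sketch of one such decoding step; the uncertainty measure (predictive entropy), the temperature-scaled softmax weighting, and the assumption of a shared vocabulary across models are illustrative choices and not necessarily the paper's exact method (see the linked repository for the authors' implementation).

```python
# Illustrative sketch of one uncertainty-weighted ensemble decoding step.
# Names (ensemble_step, tau) are hypothetical, not from the ATED codebase.
import torch
import torch.nn.functional as F

def ensemble_step(logits_per_model, tau=1.0):
    """Aggregate next-token distributions from several LVLMs.

    logits_per_model: list of [vocab_size] tensors, one per model,
        assumed to share a common vocabulary (an assumption of this sketch).
    Returns an aggregated distribution over the vocabulary.
    """
    probs, uncertainties = [], []
    for logits in logits_per_model:
        p = F.softmax(logits, dim=-1)
        # Predictive entropy: higher entropy means a less confident model.
        entropy = -(p * torch.log(p + 1e-12)).sum()
        probs.append(p)
        uncertainties.append(entropy)

    # Turn uncertainties into weights: lower entropy yields a larger weight.
    u = torch.stack(uncertainties)
    weights = F.softmax(-u / tau, dim=-1)

    # Weighted mixture of the per-model token distributions.
    mixed = sum(w * p for w, p in zip(weights, probs))
    return mixed / mixed.sum()

# Usage (per decoding step): gather each model's next-token logits,
# ensemble them, then pick the next token, e.g. greedily:
#   next_token = ensemble_step([logits_a, logits_b]).argmax()
```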
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: visual question answering, multimodality
Languages Studied: English
Submission Number: 5886