TL;DR: This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict.
Abstract: Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works that focus on conflicts between model responses and inputs, we study the inherent conflicts between inputs from different modalities, which place MLLMs in a dilemma and directly lead to hallucinations. We formally define modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucinations caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating hallucinations under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on this previously unnoticed source of hallucinations and provides further insight into the robustness of MLLMs.
Lay Summary: Multimodal Large Language Models (MLLMs) are a type of AI technology that can understand images and answer questions about them. These models have shown near-human performance on many tasks. However, they still often make mistakes, such as generating irrelevant or nonsensical answers, a phenomenon known as "hallucination." Our research focuses on reducing these errors to make MLLMs more reliable and accurate.
We found that one key source of errors is when the question asked by a person conflicts with the content of the image. For instance, if an image shows a small dog but the question mistakenly asks, "What color is the cat in this picture?", the model often fails to recognize this conflict and produces an incorrect answer. To address this, we constructed a specialized dataset that simulates such conflicting scenarios and trained MLLMs to handle these situations. This additional training enables the model to recognize conflicts and respond correctly more often.
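To make the scenario concrete, here is a minimal, hypothetical sketch of a conflicting versus a consistent image-question pair; the field names, file paths, and responses are illustrative assumptions and do not reflect the actual MMMC data format (see the linked repository for the real data and training code).

```python
# Hypothetical sketch of modality-conflict vs. consistent samples.
# All field names and values below are illustrative, not the MMMC schema.

conflicting_sample = {
    "image": "images/small_dog.jpg",  # image that actually shows a small dog
    "question": "What color is the cat in this picture?",  # premise conflicts with the image
    "conflict": True,
    # Desired behavior: flag the conflict instead of hallucinating an answer
    # about an object that is not in the image.
    "target_response": "The image shows a dog, not a cat, so the question "
                       "conflicts with the image content.",
}

consistent_sample = {
    "image": "images/small_dog.jpg",
    "question": "What color is the dog in this picture?",
    "conflict": False,
    "target_response": "The dog is brown.",
}

# Print the behavior a conflict-aware MLLM should ideally exhibit on each case.
for sample in (conflicting_sample, consistent_sample):
    print(sample["question"], "->", sample["target_response"])
```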
By identifying this key source of errors and proposing targeted solutions, our work improves the accuracy and reliability of MLLMs. This may accelerate the practical application of these models in areas such as education and healthcare, where accurate and dependable AI systems are crucial.
Link To Code: https://github.com/zmzhang2000/MMMC
Primary Area: Deep Learning->Large Language Models
Keywords: Multimodal Large Language Models, Modality Conflict, Hallucinations, Reinforcement Learning, Robustness
Submission Number: 11639