Abstract: Hallucination in vision-language models severely limits their use in downstream tasks that demand high-precision reasoning and remains a critical obstacle to trustworthy artificial intelligence. Existing approaches rely mainly on dedicated training paradigms or heuristic decoding strategies; while these partially alleviate hallucination, they suffer from two major limitations: substantial computational overhead for data construction and model fine-tuning, and a lack of fine-grained attribution for explaining how hallucinations arise.
In this work, we propose CARE (Causal-Aware Robust Estimation), a causal intervention-based generation tracing framework that suppresses hallucination at the token level without additional training. Our key insight is that over-reliance on linguistic priors is the core mechanism driving hallucination, and that its statistical signature can be captured precisely by contrasting two generation pathways. CARE constructs paired factual and counterfactual generation pathways and applies robust interquartile effect detection to quantify the visual dependency of each generated token.
Experiments on multimodal evaluation benchmarks including HallusionBench, MMHalBench, and POPE demonstrate that CARE reduces the hallucination rate of LLaVA-1.5 by an average of 9\% and improves answer accuracy by 11\%, while preserving the fluency of generated outputs. This work offers a novel, interpretable paradigm for understanding and mitigating hallucination in multimodal systems.
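The abstract only sketches the mechanism, so the following is a minimal, hypothetical illustration of the core idea it describes: score each generated token by the gap between its log-probability under a factual pathway (real image) and a counterfactual pathway (degraded or absent image), then use an interquartile-range rule to flag tokens with unusually low visual dependency. All function names, the `k=1.5` fence, and the numeric values are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def visual_dependency_scores(logp_factual, logp_counterfactual):
    """Per-token visual dependency: gap between the log-probability a token
    receives with the real image (factual pathway) and with a degraded or
    absent image (counterfactual pathway)."""
    return np.asarray(logp_factual) - np.asarray(logp_counterfactual)

def flag_low_dependency_tokens(scores, k=1.5):
    """Robust interquartile (IQR) rule: tokens whose visual dependency falls
    below Q1 - k*IQR are flagged as likely driven by linguistic priors
    rather than by the image."""
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    return scores < (q1 - k * iqr)

# Hypothetical token-level log-probs from the two decoding pathways.
factual        = [-0.2, -0.5, -1.1, -0.3, -2.0]
counterfactual = [-0.9, -1.4, -1.0, -1.1, -2.1]
scores = visual_dependency_scores(factual, counterfactual)
print(flag_low_dependency_tokens(scores))  # True where a token shows little visual grounding
```

In this sketch, flagged tokens would be candidates for suppression or re-ranking during decoding; how CARE actually intervenes on them is specified in the paper itself.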
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Ethics, Bias, and Fairness, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6030