Abstract: Large Language Models (LLMs) often automatically revise facts in provided content to align with their internal knowledge, a behavior that, while aiming for factual accuracy, can detrimentally override source material. This paper systematically investigates and formally defines this critical issue as Harmful Factuality Hallucination, where LLMs unexpectedly correct perceived inaccuracies in the input, prioritizing global factual correctness over essential source fidelity. Moving beyond anecdotal evidence, we introduce a robust framework to induce and quantify Harmful Factuality by applying controlled soft (Gaussian Embedding Perturbation) and hard (LLM-Instructed Entity Replacement) entity perturbations. We evaluate a diverse set of open-source (e.g., Llama series) and commercial (e.g., GPT-4o) LLMs of varying scales across abstractive summarization, rephrasing, and context-grounded question-answering tasks. Our experiments reveal that Harmful Factuality is prevalent, with its incidence significantly influenced by model scale (larger models often exhibit higher rates), perturbation type, entity position within the source, and task characteristics. Furthermore, through analysis of Dual Presence outputs, we identify and categorize three core behavioral mechanisms that underlie this phenomenon. Importantly, we also demonstrate that a simple instructional defense prompt can substantially mitigate Harmful Factuality, reducing it by approximately 50% in several leading models. This research provides a foundational methodology and crucial insights for evaluating and alleviating source-conflicting behaviors, thereby supporting the development of more reliable and source-faithful LLM systems.
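To make the two ideas named in the abstract more concrete, the minimal sketch below (in Python with PyTorch, our own choice of tooling) shows one way a soft Gaussian Embedding Perturbation of an entity's input embeddings and an instructional defense prompt could look. The function name gaussian_embedding_perturbation, the sigma parameter, and the wording of DEFENSE_PROMPT are illustrative assumptions, not the authors' implementation.

import torch

# Hypothetical defense instruction; the paper's exact prompt wording is not
# given in the abstract.
DEFENSE_PROMPT = (
    "Reproduce the source content faithfully. Do not correct, replace, or "
    "update any entities, even if you believe they are factually wrong."
)

def gaussian_embedding_perturbation(embeddings: torch.Tensor,
                                    entity_positions: list[int],
                                    sigma: float = 0.1) -> torch.Tensor:
    """Add zero-mean Gaussian noise to the embedding vectors of the tokens
    forming a target entity, leaving all other positions untouched.

    embeddings:       (seq_len, hidden_dim) input-embedding matrix
    entity_positions: indices of the entity's tokens in the sequence
    sigma:            noise standard deviation (a free parameter here)
    """
    perturbed = embeddings.clone()
    for pos in entity_positions:
        perturbed[pos] = perturbed[pos] + sigma * torch.randn_like(perturbed[pos])
    return perturbed

In a setup along these lines, the perturbed matrix would be passed to the model through its input-embeddings interface, whereas the hard perturbation would instead replace the entity string in the text before tokenization.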
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: counterfactual/contrastive explanations, analysis
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6271