Abstract: Large Language Models (LLMs) often automatically revise facts in provided content to align with their internal knowledge, a behavior that, while aiming for factual accuracy, can detrimentally override source material. This paper systematically investigates and formally defines this critical issue as Harmful Factuality Hallucination, where LLMs unexpectedly correct perceived inaccuracies in the input, prioritizing global factual correctness over essential source fidelity. Moving beyond anecdotal evidence, we introduce a robust framework to induce and quantify Harmful Factuality by applying controlled soft (Gaussian Embedding Perturbation) and hard (LLM-Instructed Entity Replacement) entity perturbations. We evaluate a diverse set of open-source (e.g., Llama series) and commercial (e.g., GPT-4o) LLMs of varying scales across abstractive summarization, rephrasing, and context-grounded question-answering tasks. Our experiments reveal that Harmful Factuality is prevalent, with its incidence significantly influenced by model scale (larger models often exhibit higher rates), perturbation type, entity position within the source, and task characteristics. Furthermore, through analysis of Dual Presence outputs, we identify and categorize three core behavioral mechanisms that underlie this phenomenon. Importantly, we also demonstrate that a simple instructional defense prompt can substantially mitigate Harmful Factuality, reducing it by approximately 50% in several leading models. This research provides a foundational methodology and crucial insights for evaluating and alleviating source-conflicting behaviors, thereby supporting the development of more reliable and source-faithful LLM systems.
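To make the two ideas named in the abstract more concrete, the minimal sketch below (in Python with PyTorch, our own choice of tooling) shows one way a soft Gaussian Embedding Perturbation of an entity's input embeddings and an instructional defense prompt could look. The function name gaussian_embedding_perturbation, the sigma parameter, and the wording of DEFENSE_PROMPT are illustrative assumptions, not the authors' implementation.

import torch

# Hypothetical defense instruction; the paper's exact prompt wording is not
# given in the abstract.
DEFENSE_PROMPT = (
    "Reproduce the source content faithfully. Do not correct, replace, or "
    "update any entities, even if you believe they are factually wrong."
)

def gaussian_embedding_perturbation(embeddings: torch.Tensor,
                                    entity_positions: list[int],
                                    sigma: float = 0.1) -> torch.Tensor:
    """Add zero-mean Gaussian noise to the embedding vectors of the tokens
    forming a target entity, leaving all other positions untouched.

    embeddings:       (seq_len, hidden_dim) input-embedding matrix
    entity_positions: indices of the entity's tokens in the sequence
    sigma:            noise standard deviation (a free parameter here)
    """
    perturbed = embeddings.clone()
    for pos in entity_positions:
        perturbed[pos] = perturbed[pos] + sigma * torch.randn_like(perturbed[pos])
    return perturbed

In a setup along these lines, the perturbed matrix would be passed to the model through its input-embeddings interface, whereas the hard perturbation would instead replace the entity string in the text before tokenization.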
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: counterfactual/contrastive explanations, analysis
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 6271