The Privacy-Hallucination Tradeoff in Differentially Private Language Models

17 Sept 2025 (modified: 06 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: privacy, factuality, language models, text generation
Abstract: While prior work has studied privacy tradeoffs with utility and fairness, the impact of privacy preservation on factual consistency and hallucination in LLM outputs remains unexplored. Given that privacy preservation is paramount in high-stakes domains like healthcare, the factual accuracy of these systems is critical. In this study, we uncover and investigate a privacy-hallucination tradeoff in differentially private language models. We show that while stricter DP guarantees do not distort knowledge acquired during standard pre-training, they hinder the model's ability to learn new factual associations when fine-tuned on previously unseen data, causing the model to hallucinate incorrect or irrelevant information instead. We find that the proportion of factual texts generated drops by 17-24% when models are fine-tuned on the same data with DP (ε = 8), compared to non-DP models, and on average, factuality scores differ by at least 3-5%. This disparity is even more pronounced when pre-training with DP, where we observe a 43% drop in the number of factually consistent texts. Our findings underscore the need for more nuanced privacy-preserving interventions that offer rigorous privacy guarantees without compromising factual accuracy.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9493
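For concreteness, the sketch below illustrates how DP fine-tuning at a target ε = 8 (the budget cited in the abstract) is typically set up with DP-SGD. The abstract does not name a training library or model, so the use of PyTorch with Opacus, the toy next-token model, and the synthetic data here are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch of DP fine-tuning via DP-SGD with Opacus (assumed library).
# Per-sample gradients are clipped and noised; the noise multiplier is
# calibrated so the training run stays within the target epsilon.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy next-token classifier and synthetic token data, standing in for an LLM
# and a previously unseen fine-tuning corpus (hypothetical stand-ins).
vocab, dim, seq = 1000, 64, 32
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Flatten(), nn.Linear(dim * seq, vocab))
tokens = torch.randint(0, vocab, (256, seq))
targets = torch.randint(0, vocab, (256,))
loader = DataLoader(TensorDataset(tokens, targets), batch_size=32)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Wrap model, optimizer, and dataloader so training enforces (ε = 8, δ = 1e-5).
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    target_epsilon=8.0,
    target_delta=1e-5,
    epochs=1,
    max_grad_norm=1.0,  # per-sample gradient clipping bound
)

model.train()
for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# Report the privacy budget actually spent during fine-tuning.
print(f"spent epsilon: {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```

The non-DP baseline referenced in the abstract corresponds to the same loop without the PrivacyEngine wrapping, i.e., standard fine-tuning with no gradient clipping or noise.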