Dreaming Up Authors: Factual Prevalence and LLM Hallucinations in Academic Authorship Attribution

ACL ARR 2026 January Submission 7950 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Large Language Models, Hallucination, Factual Prevalence, Authorship Attribution, Citation Analysis, Model Evaluation
Abstract: Large Language Models (LLMs) are widely used for question answering but are prone to hallucinations, producing confident yet incorrect responses. While prior work has proposed various hallucination detection methods, the underlying causes, particularly the role of factual prevalence in LLMs' training data, remain underexplored. To study this cause, we use academic authorship attribution, i.e., retrieving the list of authors of a given research paper, as a representative task and hypothesize that hallucinations are more likely for less prevalent papers. We use citation count as a proxy for factual prevalence and validate this choice by demonstrating strong correlations with Google search hits and corpus-level occurrences in large LLM training datasets. We curate a dataset of 1.92M papers spanning eight academic disciplines, along with samples stratified by discipline and citation level. Our experiments on three state-of-the-art LLMs reveal a consistent inverse relationship between hallucination rates and citation counts across all models and disciplines, with hallucination rates decreasing from over 98% for rarely cited papers (0–2 citations) to below 36.4% for highly cited papers (4,097+ citations). We also analyze the relationship between hallucination rates and author position, author count, title length, and author demographics, further highlighting factual prevalence as a key factor in LLM reliability and offering important insights toward building more trustworthy AI systems.
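To make the citation-level stratification concrete, below is a minimal Python sketch of how papers could be binned by citation count and a per-bin hallucination rate computed. The bin edges, function names, and toy data are illustrative assumptions, not the paper's actual protocol; only the endpoint bins (0–2 and 4,097+ citations) are taken from the abstract.

```python
# Hypothetical sketch: stratify papers by citation count and compute the
# hallucination rate per bin. Bin edges beyond 0-2 and 4,097+ are assumed.
from bisect import bisect_right
from collections import defaultdict

# Illustrative bin edges; bin i holds citation counts in [edge_{i-1}, edge_i),
# and the final bin (4,097+) is open-ended.
BIN_EDGES = [3, 17, 65, 257, 1025, 4097]

def citation_bin(citations: int) -> int:
    """Map a citation count to a bin index (last bin is open-ended)."""
    return bisect_right(BIN_EDGES, citations)

def hallucination_rate_by_bin(papers):
    """papers: iterable of (citation_count, is_hallucinated) pairs."""
    totals, errors = defaultdict(int), defaultdict(int)
    for citations, hallucinated in papers:
        b = citation_bin(citations)
        totals[b] += 1
        errors[b] += int(hallucinated)
    return {b: errors[b] / totals[b] for b in sorted(totals)}

if __name__ == "__main__":
    # Toy data mimicking the reported trend: rarely cited papers are
    # hallucinated far more often than highly cited ones.
    sample = [(0, True), (1, True), (150, True), (200, False), (5000, False)]
    print(hallucination_rate_by_bin(sample))
```

Under this kind of stratification, the paper's reported inverse relationship would appear as a monotonically decreasing rate from the lowest bin to the highest.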
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, NLP Applications
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 7950