Dreaming Up Authors: Factual Prevalence and LLM Hallucinations in Academic Authorship Attribution

ACL ARR 2026 January Submission 7950 Authors

06 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Large Language Models, Hallucination, Factual Prevalence, Authorship Attribution, Citation Analysis, Model Evaluation
Abstract: Large Language Models (LLMs) are widely used for question answering but are prone to hallucinations, producing confident yet incorrect responses. While prior work has proposed various hallucination detection methods, the underlying causes, particularly the role of factual prevalence in LLMs' training data, remain underexplored. To study this cause, we use academic authorship attribution, i.e., retrieving the list of authors of a given research paper, as a representative task and hypothesize that hallucinations are more likely for less prevalent papers. We use citation count as a proxy for factual prevalence and validate this choice by demonstrating strong correlations with Google search hits and corpus-level occurrences in large LLM training datasets. We curate a dataset of 1.92M papers spanning eight academic disciplines, along with samples stratified by discipline and citation level. Our experiments on three state-of-the-art LLMs reveal a consistent inverse relationship between hallucination rates and citation counts across all models and disciplines, with hallucination rates decreasing from over 98% for rarely cited papers (0–2 citations) to below 36.4% for highly cited papers (4,097+ citations). We also analyze the relationship between hallucination rates and author position, author count, title length, and author demographics, further highlighting factual prevalence as a key factor in LLM reliability and offering important insights toward building more trustworthy AI systems.
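To make the citation-level stratification concrete, below is a minimal Python sketch of how papers could be binned by citation count and a per-bin hallucination rate computed. The bin edges, function names, and toy data are illustrative assumptions, not the paper's actual protocol; only the endpoint bins (0–2 and 4,097+ citations) are taken from the abstract.

```python
# Hypothetical sketch: stratify papers by citation count and compute the
# hallucination rate per bin. Bin edges beyond 0-2 and 4,097+ are assumed.
from bisect import bisect_right
from collections import defaultdict

# Illustrative bin edges; bin i holds citation counts in [edge_{i-1}, edge_i),
# and the final bin (4,097+) is open-ended.
BIN_EDGES = [3, 17, 65, 257, 1025, 4097]

def citation_bin(citations: int) -> int:
    """Map a citation count to a bin index (last bin is open-ended)."""
    return bisect_right(BIN_EDGES, citations)

def hallucination_rate_by_bin(papers):
    """papers: iterable of (citation_count, is_hallucinated) pairs."""
    totals, errors = defaultdict(int), defaultdict(int)
    for citations, hallucinated in papers:
        b = citation_bin(citations)
        totals[b] += 1
        errors[b] += int(hallucinated)
    return {b: errors[b] / totals[b] for b in sorted(totals)}

if __name__ == "__main__":
    # Toy data mimicking the reported trend: rarely cited papers are
    # hallucinated far more often than highly cited ones.
    sample = [(0, True), (1, True), (150, True), (200, False), (5000, False)]
    print(hallucination_rate_by_bin(sample))
```

Under this kind of stratification, the paper's reported inverse relationship would appear as a monotonically decreasing rate from the lowest bin to the highest.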
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, NLP Applications
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 7950