When AI Cannot Reproduce Itself: Citation Drift as a Reproducibility Failure in Scientific LLMs

Published: 24 Dec 2025, Last Modified: 24 Dec 2025 · MURE Workshop Poster · CC BY 4.0
Track: Main track
Published Or Accepted: false
Keywords: Rashomon Effect, Model Uncertainty, Reproducibility, Large Language Models, Citation Drift, Scientific AI, Predictive Multiplicity, Uncertainty Quantification
TL;DR: Large Language Models cannot reproduce their own citations consistently—revealing a Rashomon-style reproducibility failure in AI-assisted science.
Abstract: Reproducibility is a cornerstone of scientific reliability, yet today's AI assistants often fail this test themselves. Large Language Models (LLMs) are increasingly used for scientific writing and research assistance, but their ability to maintain consistent citations across multi-turn conversations remains largely unexplored. This study introduces the concept of citation drift: the phenomenon in which references mutate, disappear, or are fabricated during extended LLM interactions. Through a comprehensive analysis of 240 conversations across four LLaMA models, using 36 authentic scientific papers from six domains, this work demonstrates significant citation instability. We introduce novel metrics, including citation drift entropy and willingness-to-cite, providing a framework for evaluating LLM citation reliability in scientific contexts. We further interpret citation drift as a manifestation of the Rashomon Effect, whereby multiple equally capable models produce divergent yet comparably valid factual outputs under deterministic conditions. This study establishes citation drift as a meta-reproducibility benchmark, revealing that LLMs cannot consistently reproduce their own scientific outputs.
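The abstract does not spell out how citation drift entropy is computed, but one natural formalization is the Shannon entropy over the distinct citation variants a model emits for the same source paper across repeated turns or runs. The sketch below is a minimal illustration under that assumption; the function name, the example citation strings, and the per-turn granularity are hypothetical, not the paper's definition.

```python
import math
from collections import Counter

def citation_drift_entropy(citation_variants):
    """Shannon entropy (in bits) over the distinct citation strings a model
    emits for the same source paper across repeated turns or runs.

    0.0 means the model reproduces one citation consistently; higher values
    indicate more drift (mutated, swapped, or fabricated references)."""
    counts = Counter(citation_variants)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical example: four turns yield three distinct citation variants.
turns = [
    "Smith et al., 2021",
    "Smith et al., 2021",
    "Smith & Jones, 2020",  # mutated author list and year
    "Smith et al., 2022",   # mutated year
]
print(citation_drift_entropy(turns))  # 1.5 bits
```

Under this reading, a perfectly reproducible model scores 0 bits for every source paper, while the example above scores 1.5 bits because half the turns agree and the other half split across two mutated variants.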
Submission Number: 13