Mapping the Increasing Use of LLMs in Scientific Papers

Weixin Liang; Yaohui Zhang; Zhengxuan Wu; Haley Lepp; Wenlong Ji; Xuandong Zhao; Hancheng Cao; Sheng Liu; Siyu He; Zhi Huang; Diyi Yang; Christopher Potts; Christopher D Manning; James Y. Zou

Mapping the Increasing Use of LLMs in Scientific Papers

Weixin Liang, Yaohui Zhang, Zhengxuan Wu, Haley Lepp, Wenlong Ji, Xuandong Zhao, Hancheng Cao, Sheng Liu, Siyu He, Zhi Huang, Diyi Yang, Christopher Potts, Christopher D Manning, James Y. Zou

Published: 10 Jul 2024, Last Modified: 26 Aug 2024COLMEveryoneRevisionsBibTeXCC BY-NC-SA 4.0

Research Area: Societal implications

Keywords: Academic Writing, Computational Social Science, Science and Technology Studies, Societal Impact and LLM Adoption

TL;DR: We conducted the first systematic analysis to quantify the impact of Large Language Models (LLMs) on academic writing over time.

Abstract: Scientific publishing lays the foundation of science by disseminating research findings, fostering collaboration, encouraging reproducibility, and ensuring that scientific knowledge is accessible, verifiable, and built upon over time. Recently, there has been immense speculation about how many people are using large language models (LLMs) like ChatGPT in their academic writing, and to what extent this tool might have an effect on global scientific practices. However, we lack a precise measure of the proportion of academic writing substantially modified or produced by LLMs. To address this gap, we conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on the $\textit{arXiv}$, $\textit{bioRxiv}$, and $\textit{Nature}$ portfolio journals, using a population-level statistical framework to measure the prevalence of LLM-modified content over time. The statistical framework operates on the population level without the need to perform inference on any individual instance. Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers (up to 17.5\%). In comparison, Mathematics papers and the Nature portfolio showed the least LLM modification (up to 6.3\%). Moreover, at an aggregate level, our analysis reveals that higher levels of LLM-modification are associated with papers whose first authors post preprints more frequently, papers in more crowded areas, and papers with shorter lengths. Our findings suggests that LLMs are being broadly used in scientific papers.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html

Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html

Submission Number: 1404

Loading