Datasets for Scientific Literature Understanding: A Survey

ACL ARR 2026 January Submission8279 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: Scientific Literature, Survey, Dataset, Large Language Models
Abstract: Empowering machines to understand scientific literature is crucial for accelerating scientific discovery and advancing the AI for Science (AI4S) paradigm. In this paper, we present a comprehensive survey of datasets serving this domain. We propose a systematic taxonomy that organizes resources spanning structural parsing, text understanding, multimodal understanding, and pre-training/instruction fine-tuning. Beyond a structured overview, we discuss the evolution of the field, elucidating how the emergence of Large Language Models (LLMs) has reshaped research priorities in dataset construction. By synthesizing existing datasets and identifying critical future directions, this work provides a roadmap for advancing intelligent scientific research systems.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: NLP datasets
Contribution Types: Surveys
Languages Studied: English, Chinese
Submission Number: 8279