Datasets for Scientific Literature Understanding: A Survey

ACL ARR 2026 January Submission 8279 Authors

06 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Scientific Literature, Survey, Dataset, Large Language Models

Abstract: Enabling machines to understand scientific literature is crucial for accelerating scientific discovery and advancing the AI for Science (AI4S) paradigm. In this paper, we present a comprehensive survey of datasets for this domain. We propose a systematic taxonomy that organizes resources into four areas: structural parsing, text understanding, multimodal understanding, and pre-training/instruction fine-tuning. Beyond a structured overview, we discuss the evolution of the field and show how the emergence of Large Language Models (LLMs) has reshaped research priorities in dataset construction. By synthesizing existing datasets and identifying critical future directions, this work provides a roadmap for advancing intelligent scientific research systems.

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: NLP datasets

Contribution Types: Surveys

Languages Studied: English, Chinese

Submission Number: 8279
