LongSciArxiv: Dual Manual Synthetic Datasets and LLM Benchmarking for Long-to-Long Scientific Survey Generation

ACL ARR 2025 May Submission 938 Authors

16 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Survey generation involves synthesizing comprehensive scientific papers from large collections of research literature. Despite recent advances, this task remains challenging for natural language processing (NLP), especially when both the input and the output are long. While current large language models (LLMs) support extended context lengths, their ability to produce full-length surveys remains underexplored due to the lack of suitable datasets and benchmarks. We introduce $\textbf{GenSurvey}$, a dataset of $700$ human-written surveys paired with reference abstracts. We further create $\textbf{GenSection}$, a synthetic dataset for section-level generation, built with chain-of-thought prompting with GPT-$4$ and refined through human verification. Together, these datasets form $\textbf{LongSciArxiv}$, a dual benchmark designed for real-world tasks in education and research that require models to integrate hundreds of abstracts into coherent surveys exceeding $10{,}000$ words. In our experiments, we evaluate $10$ open-source LLMs ranging from $1\text{B}$ to $70\text{B}$ parameters. Results show that mid-sized models such as Mistral $7\text{B}$ and LLaMa3 $8\text{B}$ offer the best trade-off between performance and cost. Our findings highlight the complexity of long-to-long generation and the need for scale-aware model design and benchmarking.
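For illustration only, the sketch below shows one way section-level chain-of-thought generation of the kind described for GenSection might be prototyped: an outline step followed by a prose step over a set of reference abstracts. The two-step prompt structure, the function name `draft_section`, the prompt wording, and the model string are assumptions for this sketch, not the authors' actual pipeline.

```python
# Minimal sketch: two-step chain-of-thought prompting for section-level survey drafting.
# Assumes the OpenAI chat API and an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def draft_section(section_title: str, abstracts: list[str], model: str = "gpt-4") -> str:
    """Draft one survey section from reference abstracts:
    step 1 asks for an outline (the chain-of-thought), step 2 expands it into prose."""
    context = "\n\n".join(f"[{i + 1}] {a}" for i, a in enumerate(abstracts))

    # Step 1: reason about structure before writing.
    outline = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are an expert survey writer."},
            {"role": "user", "content": (
                f"Reference abstracts:\n{context}\n\n"
                f"Think step by step and produce a bullet outline for the "
                f"survey section titled '{section_title}'."
            )},
        ],
    ).choices[0].message.content

    # Step 2: expand the outline into a coherent section, citing abstracts by index.
    section = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are an expert survey writer."},
            {"role": "user", "content": (
                f"Reference abstracts:\n{context}\n\nOutline:\n{outline}\n\n"
                f"Write the full section '{section_title}', citing abstracts as [n]."
            )},
        ],
    ).choices[0].message.content
    return section
```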
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation; benchmarking; automatic creation and evaluation of language resources; NLP datasets; automatic evaluation of datasets
Contribution Types: Model analysis & interpretability, Approaches low compute settings-efficiency, Data resources, Data analysis
Languages Studied: English
Submission Number: 938