Abstract: Literature review tables are essential for summarizing and comparing collections of scientific papers.
We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers.
Building on recent work (Newman et al., 2024), we extend prior approaches to handle real-world complexities through a combination of LLM-based methods and human annotations.
Our contributions focus on three key challenges encountered in real-world use:
(i) User prompts are often under-specified;
(ii) Retrieved candidate papers frequently contain irrelevant content; and
(iii) Task evaluation should move beyond surface-level text similarity and instead assess the utility of generated tables for information-seeking tasks (e.g., comparing papers).
To support reproducible evaluation, we introduce arXiv2Table, a more realistic and challenging benchmark for this task, along with a novel approach to improve literature review table generation in real-world scenarios.
Our extensive experiments on this benchmark show that both open-weight and proprietary LLMs struggle with the task, highlighting its difficulty and the need for further advancements.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, language resources, automatic creation and evaluation of language resources, NLP datasets
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 6402