arXiv2Table: Toward Realistic Benchmarking and Evaluation for LLM-Based Literature-Review Table Generation

ACL ARR 2026 January Submission 921 Authors

26 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Tabular generation, Benchmark, Evaluation, AI4Science
Abstract: Literature-review tables are essential for summarizing and comparing collections of scientific papers. In this paper, we study the automatic generation of such tables from a pool of papers to satisfy a user's information need. Building on recent work (Newman et al., 2024), we move beyond oracle settings by (i) simulating well-specified yet schema-agnostic user demands that avoid leaking gold column names or values, (ii) explicitly modeling retrieval noise via semantically related but out-of-scope distractor papers verified by human annotators, and (iii) introducing a lightweight, annotation-free, utilization-oriented evaluation that decomposes table utility into schema coverage, unary cell fidelity, and pairwise relational consistency, and measures paper selection via a two-way QA procedure (gold→system and system→gold) scored with recall, precision, and F1. To support reproducible evaluation, we introduce **arXiv2Table**, *a benchmark of 1,957 tables referencing 7,158 papers*, with human-verified distractors and rewritten, schema-agnostic user demands. We also **develop an iterative, batch-based generation method** that co-refines paper filtering and the table schema over multiple rounds. We validate the evaluation protocol with human audits and cross-evaluator checks. Extensive experiments show that our method consistently improves over strong baselines, while absolute scores remain modest, underscoring the task's difficulty. Code will be released upon acceptance.
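For intuition, the paper-selection metrics described in the abstract can be read as a standard two-way matching score. The sketch below is illustrative only: it assumes exact paper-ID matching as a stand-in for the paper's QA-based gold→system and system→gold matching, and `selection_scores` and the sample IDs are hypothetical names, not the authors' released code.

```python
# Minimal sketch of a two-way paper-selection score (hypothetical, not the paper's code).
# Each table is represented as a set of paper IDs; the intersection stands in for the
# QA-based matching that the paper applies in both directions.

def selection_scores(gold_papers: set[str], system_papers: set[str]) -> dict[str, float]:
    """Compute recall (gold→system), precision (system→gold), and their F1."""
    if not gold_papers or not system_papers:
        return {"recall": 0.0, "precision": 0.0, "f1": 0.0}
    matched = gold_papers & system_papers           # stand-in for QA-based matching
    recall = len(matched) / len(gold_papers)        # gold→system: gold papers recovered
    precision = len(matched) / len(system_papers)   # system→gold: system papers justified
    f1 = 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)
    return {"recall": recall, "precision": precision, "f1": f1}

# Example: two of three gold papers recovered, one spurious system paper.
print(selection_scores({"p1", "p2", "p3"}, {"p2", "p3", "p4"}))
# -> {'recall': 0.666..., 'precision': 0.666..., 'f1': 0.666...}
```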
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, language resources, automatic creation and evaluation of language resources, NLP datasets
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 921