AcademicEval: Live Long-Context LLM Benchmark

27 Sept 2024 (modified: 15 Dec 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large Language Models, Ultra-long Context Understanding, Live Benchmark, Long-context LLM Benchmarks
TL;DR: We propose AcademicEval, a live long-context LLM benchmark with flexible and scalable context length that avoids label leakage and requires no labor-intensive annotation.
Abstract:

Large Language Models (LLMs) have achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context lengths and labor-intensive annotation, and label leakage into LLM training data poses a further pressing challenge. We therefore propose \textsc{AcademicEval}, a live benchmark for evaluating LLMs on long-context generation tasks. \textsc{AcademicEval} draws on arXiv papers to define several academic writing tasks with long-context inputs, \textit{i.e.}, \textsc{Title}, \textsc{Abstract}, \textsc{Introduction}, and \textsc{Related Work}, which cover a wide range of abstraction levels and require no manual labeling. Moreover, \textsc{AcademicEval} integrates high-quality, expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Notably, \textsc{AcademicEval} features an efficient live evaluation protocol that ensures no label leakage. We conduct holistic experiments on \textsc{AcademicEval}, and the results show that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the difficulty of our benchmark. We also provide insightful analysis for enhancing LLMs' long-context modeling capabilities.
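For illustration only (not the authors' released code): a minimal sketch of how an \textsc{Abstract}-style sample with co-author few-shot demonstrations might be assembled. All field names (e.g., `body_text`, `coauthor_papers`) and the prompt format are hypothetical assumptions, not AcademicEval's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Paper:
    """Hypothetical paper record; field names are illustrative only."""
    title: str
    abstract: str
    body_text: str  # full text with the abstract removed (the target is held out)
    coauthor_papers: List["Paper"] = field(default_factory=list)  # neighbors in the co-author graph

def build_abstract_task(paper: Paper, num_shots: int = 2) -> str:
    """Assemble a long-context prompt: few-shot demos from co-author papers, then the target paper's body."""
    demos = [
        f"Paper:\n{demo.body_text}\n\nAbstract:\n{demo.abstract}"
        for demo in paper.coauthor_papers[:num_shots]
    ]
    prompt = (
        "Write the abstract for the final paper, following the examples.\n\n"
        + "\n\n---\n\n".join(demos)
        + f"\n\n---\n\nPaper:\n{paper.body_text}\n\nAbstract:\n"
    )
    return prompt  # the held-out reference for scoring is paper.abstract
```

Under these assumptions, varying `num_shots` (i.e., how many co-author papers are prepended) is one way the flexible, scalable context length described in the abstract could be realized.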

Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10802