100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

ACL ARR 2024 December Submission1446 Authors

16 Dec 2024 (modified: 05 Feb 2025), ACL ARR 2024 December Submission, License: CC BY 4.0
Abstract: Long-context capability is considered one of the most important abilities of LLMs, since a truly long-context-capable LLM lets its users handle tasks that would otherwise be exhausting (e.g., directly asking the LLM about a long-form document instead of digesting it manually to find answers). However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks such as LongBench often do not provide metrics that separate long-context performance from the model's baseline ability, so in cross-model comparisons this conflation prevents users from understanding how much a model or method excels at the long-context task relative to its baseline ability. Second, such benchmarks typically fix the sequence length of each data sample, which not only restricts them to models with a certain range of context windows but also offers no way to determine at what length the model or method of interest would fail. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from long-context capability. Experiments demonstrate the superiority of our approach for effectively evaluating LLMs.
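The abstract does not spell out how the benchmark or the metric is constructed, so the following is only a minimal sketch of the general idea, under two assumptions: (i) a sample's context is padded with distractor passages to a chosen target length, and (ii) the disentangling metric is a simple ratio of long-context accuracy to short-context baseline accuracy. The names Sample, build_context, and retention_score are hypothetical illustrations, not the paper's actual implementation.

    # Illustrative sketch only; not the paper's method.
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Sample:
        question: str
        answer: str
        evidence: str            # passage actually needed to answer the question
        distractors: List[str]   # filler passages used to pad the context

    def build_context(sample: Sample, target_tokens: int,
                      count_tokens: Callable[[str], int]) -> str:
        """Pad the evidence with distractors until the prompt reaches roughly
        target_tokens, making the same sample usable at many context lengths."""
        parts = [sample.evidence]
        total = count_tokens(sample.evidence)
        for d in sample.distractors:
            if total >= target_tokens:
                break
            parts.append(d)
            total += count_tokens(d)
        return "\n\n".join(parts)

    def retention_score(acc_long: float, acc_baseline: float) -> float:
        """Hypothetical disentangling metric: the fraction of the model's
        short-context (baseline) accuracy it retains at the long target length."""
        return 0.0 if acc_baseline == 0 else acc_long / acc_baseline

Under these assumptions, a model scoring 0.8 on short contexts and 0.6 at a 64k-token target would receive a retention score of 0.75, a number that can be compared across models whose baseline abilities differ.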
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Long-context Capability Evaluation, Large Language Models, Benchmarking
Languages Studied: English
Submission Number: 1446