Keywords: Long context LLMs; Long-form generation; Benchmark
Abstract: Current benchmarks such as ``$\textit{Needle-in-a-Haystack}$'' ($\textit{NIAH}$), $\textit{Ruler}$, and $\textit{Needlebench}$ focus on models' ability to understand long-context input sequences but fail to capture a critical dimension: the generation of high-quality long-form text. Applications such as design proposals, technical documentation, and creative writing rely on coherent, instruction-following outputs over extended sequences—a challenge that existing benchmarks do not adequately address. To fill this gap, we introduce $\textit{LongGenBench}$, a novel benchmark designed to rigorously evaluate the ability of large language models (LLMs) to generate long text while adhering to complex instructions. Through tasks that require specific events or constraints to appear within the generated text, $\textit{LongGenBench}$ evaluates model performance across four distinct scenarios, three instruction types, and two generation lengths (16K and 32K tokens). Our evaluation of ten state-of-the-art LLMs reveals that, despite strong results on $\textit{Ruler}$, all models struggle with long text generation on $\textit{LongGenBench}$, particularly as the target length increases. This suggests that current LLMs are not yet equipped to meet the demands of real-world, long-form text generation. We open-source $\textit{LongGenBench}$ to promote comprehensive evaluation and improvement in this critical area, with code and data available at ${anonymousurl}$.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3588