Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their ability to generate long-form content remains poorly understood and under-evaluated.
Our analysis reveals that current LLMs struggle to satisfy length requirements and maintain information density in long-text generation, with performance deteriorating as the target length increases.
To quantitatively locate this performance degradation and provide further insights for model development, we present \textbf{LongEval}, a benchmark that evaluates long-text generation through both \textit{direct} and \textit{plan-based} generation paradigms, inspired by cognitive and linguistic models of writing.
The comprehensive experiments in this work reveal interesting findings, such as that while model size generally correlates with generation ability, small-scale models trained extensively on long texts (e.g., LongWriter) can achieve comparable performance.
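To make the two paradigms concrete, here is a minimal Python sketch of how they might be instantiated; the names `llm`, `direct_generation`, and `plan_based_generation` are hypothetical illustrations, not the LongEval implementation, and the actual prompts and interfaces in the benchmark may differ.

```python
def llm(prompt: str) -> str:
    """Placeholder for a call to any text-generation model or API."""
    raise NotImplementedError("plug in your model call here")


def direct_generation(topic: str, target_words: int) -> str:
    # Direct paradigm: request the full long-form text in a single pass.
    return llm(f"Write about {topic} in roughly {target_words} words.")


def plan_based_generation(topic: str, target_words: int) -> str:
    # Plan-based paradigm: first elicit an outline, then expand each
    # outline item, mirroring the plan-then-draft structure of
    # cognitive models of writing.
    outline = llm(
        f"Draft a section-by-section outline for a "
        f"{target_words}-word piece on {topic}."
    )
    sections = [line for line in outline.splitlines() if line.strip()]
    drafts = [llm(f"Expand this outline item into prose: {s}") for s in sections]
    return "\n\n".join(drafts)
```

Under this framing, the direct paradigm stresses a model's ability to sustain length and density in one generation, while the plan-based paradigm isolates whether an explicit intermediate plan mitigates the degradation observed at longer lengths.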
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Long text generation
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 2587