Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation Paradigm

ACL ARR 2026 January Submission 8578 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Large Language Model, Long-form Text Generation, Data Synthesis
Abstract: Existing long-text generation methods primarily focus on creating lengthy outputs from short inputs, neglecting the critical challenge of long-input-to-long-output generation. With long input sequences, important information in the middle of the sequence may be overlooked, a problem commonly known as the "lost-in-the-middle" phenomenon. This phenomenon becomes more pronounced as input length increases, leading to inconsistencies and incoherence in the generated output. To address this, we propose the **R**etrieval-**A**ugmented **L**ong-Text **Writer** (RAL-Writer), which consists of a *Planner* that generates writing steps and a *Writer* that generates content based on those steps. The *Writer* computes an importance score to dynamically retrieve and strategically restate critical input segments, jointly modeling semantic relevance and positional bias. We also construct the first dataset for long-input generation and introduce three evaluation metrics covering length, consistency, and quality. On this dataset, we evaluate RAL-Writer against comparable baselines, and the results demonstrate the effectiveness of our approach.
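The importance-scoring retrieval described in the abstract could be sketched as follows. This is a minimal illustration, not the authors' implementation: the triangular positional weighting (boosting middle segments, which are most prone to being "lost"), the mixing weight `alpha`, and the function names are all assumptions for the sake of the example; semantic relevance is stood in for by cosine similarity over precomputed segment embeddings.

```python
import math

def importance_score(query_vec, seg_vec, position, n_segments, alpha=0.5):
    """Hypothetical score mixing semantic relevance with a positional-bias term."""
    # Semantic relevance: cosine similarity between query and segment embeddings.
    dot = sum(q * s for q, s in zip(query_vec, seg_vec))
    norm = math.sqrt(sum(q * q for q in query_vec)) * math.sqrt(sum(s * s for s in seg_vec))
    relevance = dot / norm if norm else 0.0
    # Positional term (assumption): triangular weight peaking at the middle of the
    # input, so mid-sequence segments get an extra boost against lost-in-the-middle.
    centrality = 1.0 - abs(position / max(n_segments - 1, 1) - 0.5) * 2.0
    return (1 - alpha) * relevance + alpha * centrality

def retrieve_top_k(query_vec, segments, k=2, alpha=0.5):
    """Return the k highest-scoring (index, text) segments for restatement."""
    scored = [
        (importance_score(query_vec, vec, i, len(segments), alpha), i, text)
        for i, (text, vec) in enumerate(segments)
    ]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(i, text) for _, i, text in scored[:k]]
```

For instance, with three segments whose embeddings are `[1, 0]`, `[1, 0]`, `[0, 1]` and a query embedding `[1, 0]`, the middle segment scores highest: its relevance is maximal and its centrality term is 1.0.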
Paper Type: Long
Research Area: Natural Language Generation
Research Area Keywords: automatic evaluation, text-to-text generation
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 8578