Long-Short Alignment for Effective Long-Context Modeling in LLMs

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. However, their effectiveness remains limited by the fixed context window of the transformer architecture, posing challenges for long-context modeling. Among these challenges, length generalization — the ability to generalize to sequences longer than those seen during training — is a classical and fundamental problem. In this work, we propose a fresh perspective on length generalization, shifting the focus from the conventional emphasis on input features such as positional encodings or data structures to the output distribution of the model. Specifically, through case studies on synthetic tasks, we highlight the critical role of **long-short alignment** — the consistency of output distributions across sequences of varying lengths. Extending this insight to natural language tasks, we propose a metric called Long-Short Misalignment to quantify this phenomenon, uncovering a strong correlation between the metric and length generalization performance. Building on these findings, we develop a regularization term that promotes long-short alignment during training. Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at https://github.com/PKU-ML/LongShortAlignment.
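To make the idea of long-short alignment concrete, below is a minimal, hypothetical sketch of how one might measure misalignment between a model's output distributions on short versus extended inputs and use it as a training regularizer. This is an illustration only, not the paper's exact metric; the function names, the choice of symmetric KL divergence, and the HuggingFace-style `.logits` output are all assumptions.

```python
# Hypothetical sketch (not the paper's exact definition): quantify "long-short
# misalignment" as the divergence between the model's next-token distributions
# conditioned on a short context versus a longer context that ends with the
# same tokens, and add it to the language-modeling loss as a regularizer.
import torch
import torch.nn.functional as F

def long_short_misalignment(model, short_ids, long_ids):
    """Symmetric KL between next-token distributions from a short input and a
    longer input sharing the same suffix. Assumes a HuggingFace-style model
    whose forward pass returns an object with a `.logits` field."""
    logits_short = model(short_ids).logits[:, -1, :]  # next-token logits, short context
    logits_long = model(long_ids).logits[:, -1, :]    # next-token logits, long context
    log_p = F.log_softmax(logits_short, dim=-1)
    log_q = F.log_softmax(logits_long, dim=-1)
    # symmetric KL divergence between the two output distributions
    kl_pq = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)

# During training, this term could be weighted and added to the usual
# cross-entropy objective, e.g.:
#     loss = lm_loss + lambda_align * long_short_misalignment(model, s, l)
# where lambda_align is a hypothetical regularization coefficient.
```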
Lay Summary: Today's large language models (like ChatGPT) are trained on relatively short texts but are increasingly asked to process much longer ones — including books, scientific papers, or lengthy conversations. However, these models often struggle when tested on longer inputs than they’ve seen during training. Why does this happen? In this work, we uncover one key reason: the model’s predictions start to behave differently when the input gets longer — even if the actual task stays the same. To address this, we propose a new way to measure and improve how consistent a model’s behavior is across short and long inputs. Our method gives the model feedback during training to reduce this inconsistency. We show that this approach helps models better generalize to longer texts — a critical step for making language models more reliable and capable in real-world applications like document analysis, multi-turn dialogue, or long-form writing. This opens up a new direction for improving language models by focusing on their output behavior.
Primary Area: Deep Learning->Self-Supervised Learning
Keywords: Large language model, length generalization, long-short alignment
Submission Number: 11567