Keywords: Long-context Extrapolation, Positional encoding, Extra-PE, Extra-MPE, LLM
Abstract: Long-context extrapolation aims to extend the context window of large language models so they can process more contextual information, a capability widely adopted in industrial applications.
Current mainstream solutions increase the rotation base of RoPE to varying degrees or introduce optimization strategies such as "low-frequency extrapolation and high-frequency interpolation" to enhance the model's long-context extrapolation capabilities. In practice, these methods alter the representational distribution of positional information by adjusting the rotation frequency of the positional encoding, which inevitably disrupts the attention distribution within the original training length.
In this paper, we analyze this phenomenon from a theoretical perspective and propose a long-context extrapolation strategy that preserves the known distribution via periodic extension of high-dimensional positional encoding. Based on this strategy, we design two methods, namely Extra-PE and Extra-MPE, to significantly enhance the models' long-context extrapolation capabilities without disrupting the positional encoding distribution within the original training length.
Extensive experiments show that long-context extrapolation based on periodic extension enhances the model's ability to extrapolate to long contexts. Specifically, a model fine-tuned on 32k tokens can extrapolate beyond 80k tokens, surpassing the performance of the NTK-32k model and approaching that of the YaRN-64k model. Our method also performs significantly better than other methods when extrapolating to extremely long contexts: notably, a model fine-tuned on 8k tokens still exhibits no perplexity explosion when extrapolating to 80k tokens. Additionally, during fine-tuning, our approach reaches optimal performance using only one-quarter of the fine-tuning steps (100 steps) required by the YaRN method. In our comparative experiments, we further found that the period over which the model learns a sufficient number of positional encodings has a significant impact on long-context extrapolation capability. Finally, through attention analysis, we discovered that our method maintains a stable level of attention at ultra-long distances, with the mean attention value remaining above 0 at these distances.
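To illustrate the mechanism the abstract refers to, the sketch below shows how the rotation base of RoPE determines the per-dimension rotation frequencies, and hence how NTK-style base scaling stretches the positional-encoding distribution (the function name and dimensions here are illustrative, not from the paper):

```python
import numpy as np

def rope_angles(position, dim, base=10000.0):
    """Per-pair rotation angles for RoPE at a given position.

    Increasing `base` (as in NTK-style scaling) lowers every non-trivial
    rotation frequency, stretching the effective period of the positional
    encoding. The periodic-extension strategy described in the abstract
    instead leaves these frequencies untouched within the original
    training length.
    """
    # Frequencies theta_i = base^(-2i/dim), one per rotary pair.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return position * inv_freq

# With a larger base, every angle except the first (base-independent)
# pair shrinks, so the same position is mapped to different phases.
a = rope_angles(1024, 128, base=10000.0)
b = rope_angles(1024, 128, base=500000.0)
assert np.all(b[1:] < a[1:])
```

This is a minimal sketch of standard RoPE frequency computation, not an implementation of Extra-PE or Extra-MPE; it only makes concrete why changing the base shifts the attention distribution inside the trained range, the problem the proposed methods avoid.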
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11110