【Proposal】A Comparative Study of RoPE-based Positional Encodings from a Scaling Perspective

20 Oct 2024 (modified: 05 Nov 2024) · THU 2024 Fall AML Submission · CC BY 4.0
Keywords: Position Encodings, RoPE, Long Context Modeling
Abstract: Transformers have become the dominant architecture for natural language processing, especially as the backbone of Large Language Models (LLMs). However, their quadratic computational complexity makes training directly on long sequences expensive. A common strategy is therefore to pre-train on shorter sequences and then extrapolate to longer ones. Positional encoding plays a key role in this process, with Rotary Position Embedding (RoPE) being widely adopted for its strong performance. Despite its utility, RoPE faces out-of-distribution (OOD) issues when sequence lengths extend beyond the pre-trained context window, prompting the development of RoPE variants such as Position Interpolation (PI), ABF, NTK-aware scaling, and YaRN. These approaches aim to improve performance on longer sequences through scaling mechanisms, yet it remains unclear which variant is superior, or why RoPE works so effectively in Transformers in the first place. This work compares the performance and underlying principles of these RoPE-based positional encodings in long-context scenarios and seeks to uncover the mechanism behind RoPE's success. Ultimately, we aim to propose a novel positional encoding method that surpasses existing approaches in handling extended contexts.
Submission Number: 9
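To make the scaling mechanisms mentioned in the abstract concrete, the following is a minimal, illustrative sketch (not code from the submission) of RoPE together with two of the variants it compares: Position Interpolation, which compresses the position index by a scale factor, and NTK-aware scaling, which enlarges the rotary base. The function names, parameters, and the example context lengths (2048 pre-trained, 8192 extended) are assumptions chosen for illustration.

```python
# Illustrative sketch of RoPE with PI and NTK-style scaling (assumed names/values).
import torch


def rope_angles(seq_len, head_dim, base=10000.0, scale=1.0, ntk_alpha=1.0):
    """Rotation angles theta[m, i] = (m * scale) * base'^(-2i / head_dim)."""
    # NTK-style scaling enlarges the base so low-frequency dimensions are
    # interpolated while high-frequency dimensions stay close to the original.
    base = base * ntk_alpha ** (head_dim / (head_dim - 2))
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    # Position Interpolation rescales positions so an extended context maps
    # back into the pre-trained range (scale = trained_len / extended_len < 1).
    positions = torch.arange(seq_len).float() * scale
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)


def apply_rope(x, angles):
    """Rotate channel pairs of x with shape (..., seq_len, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


# Example: queries for a context 4x longer than pre-training, using PI.
q = torch.randn(1, 8, 8192, 64)                    # (batch, heads, seq, dim)
angles = rope_angles(8192, 64, scale=2048 / 8192)  # compress positions into the trained range
q_rot = apply_rope(q, angles)
```

Plain RoPE corresponds to `scale=1.0, ntk_alpha=1.0`; the variants in the study differ mainly in how they choose these rescalings (and, for YaRN, in applying them per frequency band).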