Long-context continual pretraining enables Transformer-based large language models (LLMs) to comprehend input sequences within a larger context window than the one used during pretraining. Common approaches modify the positional encoding through interpolation methods such as PI, NTK-aware scaling, ABF, YaRN, and LongRoPE. While these positional encodings have proven effective, each overlooks part of the interpolation design space. In this study, we show that these positional encodings can be expressed within a unified functional framework. Building on this insight, we propose a guiding principle for optimal positional encoding interpolation and introduce a novel positional encoding scheme, S$^3$PE, designed to approximate this theoretical optimum. We conduct length-extrapolation experiments across models of varying scales, comprehensively comparing existing mainstream positional encoding approaches. The results indicate that S$^3$PE consistently outperforms current mainstream positional encodings across all configurations. Our findings show that S$^3$PE provides a more robust solution for long-context modeling, with superior performance in length-extrapolation scenarios.
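For readers unfamiliar with the baselines named above, the sketch below illustrates what "positional encoding interpolation" means for two of them, PI and NTK-aware scaling, applied to RoPE frequencies. This is a minimal illustration of the publicly known baseline formulas, not the paper's unified framework or S$^3$PE; the dimension, base, and scale values are arbitrary examples.

```python
import numpy as np

def rope_inv_freq(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies: base^(-2i/d) for each rotary dimension pair."""
    return base ** (-np.arange(0, dim, 2) / dim)

def pi_angles(positions: np.ndarray, dim: int, scale: float) -> np.ndarray:
    """Position Interpolation (PI): shrink position indices by the context-extension factor."""
    inv_freq = rope_inv_freq(dim)
    return np.outer(positions / scale, inv_freq)

def ntk_aware_angles(positions: np.ndarray, dim: int, scale: float) -> np.ndarray:
    """NTK-aware interpolation: enlarge the RoPE base instead of shrinking positions."""
    new_base = 10000.0 * scale ** (dim / (dim - 2))
    inv_freq = rope_inv_freq(dim, base=new_base)
    return np.outer(positions, inv_freq)

# Example: extending a 4k context window to 16k tokens (scale = 4), head dim 128.
positions = np.arange(16384)
angles_pi = pi_angles(positions, dim=128, scale=4.0)
angles_ntk = ntk_aware_angles(positions, dim=128, scale=4.0)
print(angles_pi.shape, angles_ntk.shape)  # (16384, 64) (16384, 64)
```

Both methods keep the rotation angles at extended positions within the range seen during pretraining, but they distribute the adjustment differently across frequency bands; methods such as YaRN and LongRoPE refine this per-band treatment further.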