An Efficient Framework for Length Extension via Dynamically Growing Positional Embedding and Correlation-Aware Routing Attention
Keywords: Length Extension, Positional Embedding, Efficient Attention
Abstract: Modeling long sequences is critical for numerous large-scale models. However, extending existing architectures to handle significantly longer sequences poses substantial technical and computational challenges. One inevitable issue is that large models overfit to positional encodings during pretraining, which limits their ability to generalize to unseen positional encoding scales. In addition, extending sequence lengths demands extensive computational resources and time. Existing positional encoding methods often rely on carefully designed scaling factors yet typically yield suboptimal results. To tackle these challenges, we propose \textbf{Cyclic, Randomly Truncated, and Dynamically Growing NTK Positional Embedding (CRG NTK)}, a data-augmentation-based technique that fully explores the RoPE encoding space, enabling models to adapt to various positional scales and achieve state-of-the-art extrapolation on length extension governed by positional encoding. Furthermore, we introduce \textbf{an efficient attention mechanism with a correlation-based routing strategy to enhance the fitting of the augmented positional encoding}, yielding superior performance and more efficient fine-tuning. With our approach, LLaMA-7B and Mistral-7B fine-tuned at 16K context length achieve extrapolation factors of at least 128$\times$ on simple tasks, maintain stable perplexity over 32$\times$ sequence length extensions, and save at least 16$\times$ the GPU training resources compared to the best existing method. Experiments further show that correlation routing achieves strong performance by filtering out substantial noise in long sequences.
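The abstract builds on NTK-scaled RoPE, the mechanism CRG NTK augments. As a point of reference, the following is a minimal NumPy sketch of standard RoPE with NTK-aware base scaling (inflating the base by $s^{d/(d-2)}$ for a scale factor $s$); the cyclic, randomly truncated, and dynamically growing augmentations described in the paper are not shown, and the helper names here are illustrative, not from the paper.

```python
import numpy as np

def rope_freqs(dim, base=10000.0, ntk_factor=1.0):
    # NTK-aware scaling: inflate the base so the rotation frequencies
    # stretch to cover a longer context window. With ntk_factor=1.0
    # this reduces to vanilla RoPE.
    base = base * ntk_factor ** (dim / (dim - 2))
    return 1.0 / base ** (np.arange(0, dim, 2) / dim)

def apply_rope(x, positions, base=10000.0, ntk_factor=1.0):
    # x: (seq, dim) with even dim; rotate each consecutive pair of
    # channels by an angle proportional to the token position.
    freqs = rope_freqs(x.shape[-1], base, ntk_factor)
    angles = positions[:, None] * freqs[None, :]   # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair of channels undergoes a pure rotation, norms are preserved and query-key dot products depend only on the relative position offset, which is the property that length-extension methods such as NTK scaling exploit.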
Primary Area: generative models
Submission Number: 12907