Keywords: Positional Encoding, Transformers, Long Context Modeling
TL;DR: We introduce TAPA, a new positional encoding based on a learnable phase, which provably and empirically achieves better long-context ability than RoPE-family methods.
Abstract: We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long ranges, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long-context tasks than RoPE-family methods.
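The abstract does not give TAPA's exact formulation, so the following is only a minimal, hypothetical sketch of the general idea it describes: attention scores modulated by a phase that depends on the tokens themselves through a small learnable map, rather than by RoPE's fixed position-only rotation. All names, shapes, and the specific phase parameterization here are illustrative assumptions, not the paper's method.

```python
import torch
import torch.nn as nn


class TokenAwarePhaseAttention(nn.Module):
    """Illustrative sketch of a 'token-aware learnable phase' in attention.

    Assumption: each attention logit between positions m and n is modulated
    by cos(freq * (m - n) + phi(x_m) - phi(x_n)), where phi is a small
    learnable map from token features to per-frequency phases. This is a
    guess at the general mechanism, not TAPA's actual formulation.
    """

    def __init__(self, dim: int, n_freq: int = 16):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Learnable map from token features to per-frequency phases
        # (hypothetical component; name and shape are assumptions).
        self.phase = nn.Linear(dim, n_freq)
        # Learnable base frequencies instead of RoPE's fixed schedule.
        self.freqs = nn.Parameter(torch.randn(n_freq) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        b, t, d = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = torch.einsum("bid,bjd->bij", q, k) * self.scale

        # Relative positions m - n, shape (t, t).
        pos = torch.arange(t, device=x.device)
        rel = (pos[:, None] - pos[None, :]).float()

        # Token-dependent phases phi(x): (batch, seq, n_freq).
        phi = self.phase(x)
        # Phase argument freq * (m - n) + phi_m - phi_n, per frequency:
        # shape (batch, t, t, n_freq) after broadcasting.
        arg = (self.freqs.view(1, 1, 1, -1) * rel[None, :, :, None]
               + phi[:, :, None, :] - phi[:, None, :, :])
        # Average cosine modulation over frequencies scales the logits.
        logits = logits * torch.cos(arg).mean(-1)

        attn = logits.softmax(dim=-1)
        return torch.einsum("bij,bjd->bid", attn, v)
```

Because the phases are learned functions of the tokens (and the frequencies are free parameters), nothing in this sketch hard-codes a decay with distance, which is consistent with the abstract's claim that token interactions can be preserved over long ranges; how the actual paper achieves and proves this may differ.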
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9881