Decoupling the "What" and "Where" With Polar Coordinate Positional Embedding

Published: 23 Sept 2025, Last Modified: 07 Dec 2025 · FoRLM 2025 · CC BY 4.0
Keywords: relative positional encoding, RoPE, Transformers, sequence modelling, length generalization, complex-valued activations
TL;DR: We propose an improvement to RoPE that decouples matching based on content from matching based on position, leading to improved sequence modeling performance across several domains and strong zero-shot length generalization.
Abstract: The attention mechanism in a Transformer matches query and key based on both content---the what---and position in a sequence---the where. We present an analysis indicating that what and where are entangled in the popular rotary position embedding (RoPE), which can impair performance, particularly when decision making requires independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embedding (PoPE), that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to training loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities, whereas RoPE's performance degrades significantly on longer sequences at test time without fine-tuning or the use of position-interpolation methods.
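For context on the baseline the abstract contrasts against, below is a minimal NumPy sketch of standard RoPE, where content and position enter the same rotated dot product (this illustrates the RoPE baseline only, not the paper's PoPE formulation, which is given in the full text; the helper name `rope`, the toy head dimension, and the example positions are illustrative assumptions).

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply standard rotary position embedding (RoPE) to a vector x at position `pos`.

    x has even dimension d; each consecutive pair (x[2i], x[2i+1]) is treated as a
    2-D point and rotated by the angle pos * theta_i, with theta_i = base^(-2i/d).
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # (d/2,) per-pair frequencies
    angles = pos * theta                        # rotation angle for each pair
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

# The attention logit between a rotated query at position m and a rotated key at
# position n depends only on the content vectors and the relative offset m - n:
# content and position are combined inside one dot product, which is the
# what-where entanglement the abstract refers to.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
score_1 = rope(q, 5) @ rope(k, 3)   # relative offset 2
score_2 = rope(q, 9) @ rope(k, 7)   # relative offset 2 again
assert np.allclose(score_1, score_2)  # RoPE logits depend only on the relative offset
```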
Submission Number: 70