Keywords: RoPE, Linear Transformer, Attention, State Space Models, Forget Gate
TL;DR: We introduce Selective RoPE, an input-dependent rotary embedding that enhances gated linear transformers.
Abstract: Positional information is essential for language modeling. Softmax Transformers with Rotary Position Embeddings (RoPE) encode it with fixed-angle rotations, while linear Transformers rely on input-dependent gates that only decay past key-value norms. We provide a theoretical argument that well-performing sequence models need both a rotation and a decay component, and observe that the missing ingredient in linear models is precisely the rotation that softmax attention performs implicitly. We introduce Selective Rotary Position Embedding (*Selective RoPE*), an input-dependent, learnable rotary embedding that generalizes RoPE to arbitrary angles and composes seamlessly with decay gates. Equipping gated linear attention with *Selective RoPE* yields a complex-valued recurrent layer that can be implemented efficiently with the “RoPE trick”. On synthetic benchmarks (MQAR, copying, state tracking) and 370M-parameter language-model pre-training, the method improves recall, downstream accuracy, and expressivity while adding minimal architectural overhead. We open-source our implementation [here](https://github.com/timurcarstensen/selective-rope).
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 21436
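As a rough illustration of the mechanism the abstract describes, the NumPy sketch below compares two ways of computing a complex-valued gated recurrence with input-dependent rotation angles. It is not the authors' implementation. The function names, the projections `W_phi` and `w_a`, the scalar per-step decay, and the exact form of the recurrence are all assumptions made for this example. The second function shows one plausible reading of the "RoPE trick": rotate queries and keys by cumulative angles, then run ordinary decayed causal linear attention.

```python
import numpy as np

def recurrent_form(q, k, v, phi, a):
    """Assumed recurrence: S_t = a_t * exp(i*phi_t) * S_{t-1} + k_t v_t^T,
    with output o_t = Re(conj(q_t)^T S_t)."""
    T, m = q.shape
    S = np.zeros((m, v.shape[1]), dtype=complex)
    out = np.empty((T, v.shape[1]))
    for t in range(T):
        gate = a[t] * np.exp(1j * phi[t])          # decay times rotation, shape (m,)
        S = gate[:, None] * S + np.outer(k[t], v[t])
        out[t] = (np.conj(q[t]) @ S).real
    return out

def rotated_form(q, k, v, phi, a):
    """Same output, computed by rotating q and k with cumulative angles."""
    Phi = np.cumsum(phi, axis=0)                   # cumulative angles, shape (T, m)
    logA = np.cumsum(np.log(a))                    # cumulative log-decay, shape (T,)
    q_rot = q * np.exp(-1j * Phi)
    k_rot = k * np.exp(-1j * Phi)
    scores = (np.conj(q_rot) @ k_rot.T).real       # depends only on angle differences
    decay = np.exp(logA[:, None] - logA[None, :])  # A_t / A_s
    causal = np.tril(np.ones_like(decay))          # keep s <= t only
    return (scores * decay * causal) @ v

rng = np.random.default_rng(0)
T, d_in, m, d_v = 6, 8, 4, 3
x = rng.normal(size=(T, d_in))                     # token features (toy data)
W_phi = rng.normal(size=(d_in, m))                 # hypothetical angle projection
w_a = rng.normal(size=d_in)                        # hypothetical decay projection

phi = x @ W_phi                                    # input-dependent angles
a = 1.0 / (1.0 + np.exp(-(x @ w_a)))               # sigmoid decay gate in (0, 1)
q = rng.normal(size=(T, m)) + 1j * rng.normal(size=(T, m))
k = rng.normal(size=(T, m)) + 1j * rng.normal(size=(T, m))
v = rng.normal(size=(T, d_v))

out_rec = recurrent_form(q, k, v, phi, a)
out_rot = rotated_form(q, k, v, phi, a)
print(np.allclose(out_rec, out_rot))
```

In this sketch, setting every row of `phi` to the same constant vector gives a fixed-angle rotation of the kind standard RoPE uses, and setting `a` to all ones removes the decay gate. The quadratic `rotated_form` is only meant to make the equivalence easy to check; an efficient implementation would presumably use a chunked or recurrent kernel instead.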