TL;DR: Lie group relative position encodings for 2D and 3D data.
Abstract: Transformer architectures rely on position encodings to model the spatial structure of input data. Rotary Position Encoding (RoPE) is a widely used method in language models that encodes relative positions through fixed block-diagonal rotation matrices applied to key-query interactions. We hypothesize that this inductive bias limits RoPE's effectiveness for modalities with high-dimensional structure.
Lie Relative Encodings (LieRE) introduce a principled generalization of RoPE, aimed at increasing the representational capacity of positional encodings in transformers. Instead of fixed 2D rotations, LieRE learns dense skew-symmetric matrices (Lie algebra elements), which are then differentiably mapped to high-dimensional rotation matrices (Lie group elements). This results in richer, learnable, and continuous encodings of both relative and absolute positional information.
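The core mechanism can be illustrated with a minimal sketch (not the authors' implementation): a learned linear map sends each position to a dense skew-symmetric matrix, and the matrix exponential turns that Lie algebra element into a rotation matrix. The generator matrices `gens` below are hypothetical stand-ins for learned parameters.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d, n = 8, 2                       # head dimension, spatial dimension (2D positions)

# Hypothetical "learned" parameters: one dense skew-symmetric generator per axis.
gens = []
for _ in range(n):
    M = rng.normal(size=(d, d))
    gens.append(M - M.T)          # A^T = -A, so A lies in the Lie algebra so(d)

def rotation(p):
    """Map a position p in R^n to a d x d rotation matrix (Lie group element)."""
    A = sum(pi * G for pi, G in zip(p, gens))  # linear in the position
    return expm(A)                              # exp of skew-symmetric is orthogonal

# Rotations would multiply the keys and queries before the attention dot product.
p = np.array([0.3, -0.7])
R = rotation(p)
assert np.allclose(R.T @ R, np.eye(d), atol=1e-8)   # orthogonality check
```

Because the generators for different axes need not commute, the resulting encodings are strictly more expressive than block-diagonal 2D rotations.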
We demonstrate the effectiveness of LieRE on 2D and 3D vision tasks, showing that it generalizes well to higher input resolutions while maintaining computational efficiency. The code and checkpoints are publicly available at https://github.com/StanfordMIMI/LieRE.
Lay Summary: Transformers are widely used in AI, but they need position information to understand the structure of data like images or 3D scenes. For example, when processing an image, it is divided into small blocks called patches. Each patch becomes a token, but these tokens lack information about their original location unless position encodings are added.
LieRE improves how transformers handle position encodings, especially in high-dimensional settings. It encodes token positions using rotation matrices, leveraging the relationship between skew-symmetric matrices and dense rotation matrices to preserve geometric structure.
LieRE builds on RoPE (Rotary Position Embedding), which uses block-diagonal 2D rotation matrices to encode relative positions directly into model computations. While RoPE was originally developed for one-dimensional sequence data like text, LieRE generalizes these ideas to higher dimensions, allowing the model to reason about both absolute and relative positions in complex spatial data.
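For context, RoPE's defining property can be sketched in a few lines: rotating each consecutive feature pair of the query and key by position-dependent angles makes their dot product depend only on the relative offset. The frequency schedule below follows the common `1/10000^(2i/d)` convention; it is illustrative, not the paper's exact setup.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x by angles pos * theta_i (1D RoPE)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D block
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(1)
q, k = rng.normal(size=4), rng.normal(size=4)

# Relative-position property: the score depends only on the offset (here, 2).
s1 = rope_rotate(q, 5) @ rope_rotate(k, 3)
s2 = rope_rotate(q, 7) @ rope_rotate(k, 5)
assert np.isclose(s1, s2)
```

LieRE replaces these fixed commuting 2D blocks with learned dense generators, which is what lets it extend naturally to 2D and 3D position inputs.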
Link To Code: https://github.com/StanfordMIMI/LieRE
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Position encoding, Vision Transformer
Submission Number: 8240