Keywords: Positional Embedding, Vision Transformer, RoPE
TL;DR: This paper introduces GeoPE, which extends 1D RoPE to higher dimensions using quaternions and Lie algebra, achieving consistently strong performance on various 2D and 3D vision tasks while maintaining excellent generalization to unseen resolutions.
Abstract: Rotary Positional Embedding (RoPE) excels at encoding relative positions in 1D sequences, but its generalization to higher-dimensional structured data like images and videos remains a challenge. Existing approaches often treat spatial axes independently or combine them heuristically, failing to capture their geometric coupling in a symmetric and consistent manner. To address this, we introduce Geometric Positional Embedding (GeoPE), a framework that extends rotations to 3D Euclidean space using quaternions. To overcome the non-commutativity of quaternion multiplication and ensure symmetry, GeoPE constructs a unified rotational operator by computing the geometric mean of rotations within the corresponding Lie algebra. We also propose a linear variant that preserves the strict relative positional encoding of 1D RoPE, offering superior extrapolation. Extensive experiments on image classification, object detection, and 3D semantic segmentation demonstrate that GeoPE consistently outperforms standard baselines and existing 2D RoPE variants, while retaining the strong extrapolation properties of its 1D predecessor.
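To make the Lie-algebra averaging idea in the abstract concrete, below is a minimal, hedged sketch of how per-axis rotations could be fused through a log-space (geometric) mean and applied to feature triples. All function names, the triple-wise channel chunking, and the frequency schedule are illustrative assumptions for exposition, not the paper's actual GeoPE implementation.

```python
# Illustrative sketch only: combining per-axis rotations via a Lie-algebra (log-space) mean.
# The names and the triple-chunking of features are assumptions, not the paper's code.
import numpy as np

def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula: rotation matrix for a rotation of `angle` about unit `axis`."""
    x, y, z = axis
    K = np.array([[0, -z, y],
                  [z,  0, -x],
                  [-y, x,  0]])
    return np.eye(3) + np.sin(angle) * K + (1 - np.cos(angle)) * (K @ K)

def geometric_mean_rotation(angles, axes):
    """Average rotations in the Lie algebra so(3): take the mean of the axis-angle (log)
    vectors, then exponentiate back to SO(3). Averaging in the algebra is order-independent,
    which sidesteps the non-commutativity of composing the rotations directly."""
    log_vectors = [theta * np.asarray(ax, dtype=float) for theta, ax in zip(angles, axes)]
    mean_log = np.mean(log_vectors, axis=0)
    angle = np.linalg.norm(mean_log)
    if angle < 1e-12:
        return np.eye(3)
    return axis_angle_to_matrix(mean_log / angle, angle)

def geope_like_embedding(features, pos_xy, base=10000.0):
    """Apply a position-dependent 3D rotation to each consecutive triple of channels.
    The (x, y) grid coordinates drive rotations about two distinct axes, fused by the
    log-space mean above (a stand-in for the paper's unified rotational operator)."""
    d = features.shape[-1]
    assert d % 3 == 0, "channel dim must be divisible by 3 for triple-wise rotation"
    out = features.copy()
    x, y = pos_xy
    for i in range(d // 3):
        freq = base ** (-3.0 * i / d)  # per-block frequency, analogous to 1D RoPE
        R = geometric_mean_rotation(
            angles=[x * freq, y * freq],
            axes=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],  # rotations about the x- and y-axes
        )
        out[3 * i:3 * i + 3] = R @ features[3 * i:3 * i + 3]
    return out

# Example: rotate a 6-channel token sitting at grid position (2, 5).
token = np.arange(6, dtype=float)
print(geope_like_embedding(token, pos_xy=(2, 5)))
```

Because the two axis rotations enter only through the mean of their logarithms, swapping the roles of the spatial axes leaves the combined operator unchanged, which is the symmetry property the abstract emphasizes.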
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 19045