Keywords: Weierstrass Elliptic Function, Positional Encoding, Vision Transformers, Double periodicity
TL;DR: We propose WePE for ViTs. By exploiting the double periodicity of the Weierstrass elliptic function, it avoids flattening 2D images into 1D sequences and preserves the natural geometric inductive bias disrupted by conventional encodings.
Abstract: Vision Transformers (ViTs) have demonstrated remarkable success in computer vision tasks. However, their reliance on learnable one-dimensional positional encoding disrupts the inherent two-dimensional spatial structure of images due to patch flattening. Existing positional encoding approaches lack geometric constraints and fail to preserve a monotonic correspondence between Euclidean spatial distances and sequential index distances, thereby limiting the model's capacity to leverage spatial proximity priors effectively. Recognizing that periodicity is particularly beneficial for positional encoding, we propose Weierstrass elliptic Positional Encoding (WePE), a mathematically principled approach that encodes two-dimensional coordinates in the complex domain. This method maps the normalized two-dimensional patch coordinates onto the complex plane and constructs a compact four-dimensional positional feature based on the Weierstrass elliptic function $\wp(z)$ and its derivative $\wp'(z)$. The doubly periodic property of $\wp(z)$ enables a principled encoding of 2D positional information, while its intrinsic lattice structure aligns naturally with the geometric regularities of patch grids in images. Its nonlinear geometric characteristics enable faithful modeling of spatial distance relationships, while the associated algebraic addition formula allows relative positional information between arbitrary patch pairs to be derived directly from their absolute encodings. WePE is a plug-and-play, resolution-agnostic positional module that integrates seamlessly with existing ViTs. Extensive experiments demonstrate that WePE delivers consistent performance gains in most scenarios, while its implementation with precomputed lookup tables ensures that these improvements incur no noticeable computational or memory overhead. Additional analyses and ablation studies further confirm the effectiveness of our method.
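The construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a square lattice with periods $\omega_1 = 1$ and $\omega_2 = i$, patch centers normalized into the open unit square so that no coordinate lands on a pole, a truncated lattice sum for $\wp(z) = z^{-2} + \sum_{\omega \neq 0} \left[(z-\omega)^{-2} - \omega^{-2}\right]$ and $\wp'(z) = -2 \sum_{\omega} (z-\omega)^{-3}$, and a four-dimensional feature formed from the real and imaginary parts of both. The function names `weierstrass_p` and `wepe_features` and the parameter `n_terms` are hypothetical.

```python
import numpy as np

def weierstrass_p(z, omega1=1.0, omega2=1.0j, n_terms=40):
    """Approximate wp(z) and wp'(z) by a truncated sum over the lattice
    {m*omega1 + n*omega2 : |m|, |n| <= n_terms}. Periods are assumptions."""
    z = np.asarray(z, dtype=np.complex128)
    idx = np.arange(-n_terms, n_terms + 1)
    m, n = np.meshgrid(idx, idx, indexing="ij")
    lattice = (m * omega1 + n * omega2).ravel()
    lattice = lattice[lattice != 0]           # the origin term is handled separately
    diff = z[..., None] - lattice             # broadcast z against all lattice points
    p = 1.0 / z**2 + np.sum(1.0 / diff**2 - 1.0 / lattice**2, axis=-1)
    dp = -2.0 / z**3 - 2.0 * np.sum(1.0 / diff**3, axis=-1)
    return p, dp

def wepe_features(height, width, **kwargs):
    """Return an array of shape (height, width, 4) holding
    [Re wp, Im wp, Re wp', Im wp'] for each patch center (assumed layout)."""
    ys = (np.arange(height) + 0.5) / height   # patch centers, strictly inside (0, 1)
    xs = (np.arange(width) + 0.5) / width
    u, v = np.meshgrid(xs, ys, indexing="xy") # both have shape (height, width)
    z = u + 1j * v                            # 2D coordinate as a complex number
    p, dp = weierstrass_p(z, **kwargs)
    return np.stack([p.real, p.imag, dp.real, dp.imag], axis=-1)

if __name__ == "__main__":
    table = wepe_features(14, 14)             # e.g. a 14x14 patch grid
    print(table.shape)
```

Because the features depend only on the grid size, such a table could be computed once and reused as a lookup, which is consistent with the precomputed lookup tables mentioned in the abstract. The truncated sum is only approximately periodic, and the error shrinks as `n_terms` grows. For relative positions, the standard addition formula, combined with the evenness of $\wp$ and the oddness of $\wp'$, gives $\wp(z_1 - z_2) = \frac{1}{4}\left(\frac{\wp'(z_1) + \wp'(z_2)}{\wp(z_1) - \wp(z_2)}\right)^2 - \wp(z_1) - \wp(z_2)$ whenever $\wp(z_1) \neq \wp(z_2)$. How the paper applies this identity in practice is not stated in the abstract.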
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 19480