Analysing Extrapolation Capabilities of Modern Positional Embeddings in Vision Transformers

TMLR Paper9150 Authors

22 May 2026 (modified: 31 May 2026), Under review for TMLR, CC BY 4.0
Abstract: Vision Transformers scale quadratically with input resolution, making high-resolution training prohibitively expensive, yet detection and segmentation demand it. A natural alternative is resolution extrapolation: train at low resolution and deploy at higher resolution without fine-tuning. Whether this works depends entirely on the positional encoding. Modern encodings such as RoPE, ALiBi, YaRN, and FIRE enable striking length extrapolation in language models, but their behaviour on 2D image grids is largely unknown. We present a systematic study, training a single ViT-Tiny on ImageNet-100 at 224 × 224 and evaluating zero-shot up to 1024 × 1024 (4.57×) on classification, COCO detection, and ADE20K segmentation. Across all three tasks, additive attention-bias methods consistently outperform rotation-based methods at large scales: at 4.57×, FIRE retains 63.1% Top-1 accuracy while RoPE collapses to 18.9%. An attention-entropy analysis shows that bias-based encodings preserve focused, semantically coherent attention, whereas rotation-based phases drift out of distribution and induce attention collapse. These results recast extrapolation robustness as the primary axis for choosing positional encodings, and yield a practical recipe for resolution-flexible Vision Transformers.
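The scale factor and cost argument in the abstract can be checked with a short arithmetic sketch. The patch size of 16 is an assumption (the usual ViT-Tiny setting; the abstract does not state it), and the helper names are illustrative only, not taken from the paper. The cost ratio counts pairwise attention scores, which grow with the square of the token count.

```python
# Assumption: 16x16 patches, the common ViT-Tiny setting.
# The abstract does not state the patch size.
PATCH = 16


def num_tokens(resolution: int, patch: int = PATCH) -> int:
    """Number of patch tokens for a square image (class token not counted)."""
    side = resolution // patch
    return side * side


def scale_factor(train_res: int, test_res: int) -> float:
    """Linear resolution ratio between deployment and training."""
    return test_res / train_res


def attention_cost_ratio(train_res: int, test_res: int, patch: int = PATCH) -> float:
    """Ratio of pairwise attention scores, which scale with tokens squared."""
    return (num_tokens(test_res, patch) / num_tokens(train_res, patch)) ** 2


if __name__ == "__main__":
    train_res, test_res = 224, 1024
    print("scale factor:", round(scale_factor(train_res, test_res), 2))
    print("tokens:", num_tokens(train_res), "->", num_tokens(test_res))
    print("attention cost ratio:", round(attention_cost_ratio(train_res, test_res), 1))
```

Under this assumption, 224 × 224 gives 196 tokens and 1024 × 1024 gives 4096 tokens, so the 4.57× linear scale corresponds to roughly 21 times as many tokens and a several-hundred-fold increase in attention scores per layer.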
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=d3T3odbDa5&noteId=d3T3odbDa5
Changes Since Last Submission: Fixed the formatting, as the previous version did not follow the TMLR format.
Assigned Action Editor: ~Farzan_Farnia1
Submission Number: 9150