FourierRoFormer: Learned Fourier Attention for Vision Transformers

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Vision Transformers, Fourier Analysis, Attention Mechanisms, Positional Embeddings, Multi-scale Learning, Frequency-aware Networks, Rotary Embeddings, Computer Vision, Representation Learning, Image Classification, Deep Learning, Interpretable Machine Learning
TL;DR: We introduce learnable Fourier attention for Vision Transformers that automatically discovers optimal spatial frequencies for multi-scale image understanding, achieving significant performance gains through interpretable frequency specialization.
Abstract: Vision Transformers (ViTs) excel at long-range reasoning but lack principled mechanisms for modeling spatial frequencies and controlling how attention decays with distance. We propose FourierRoFormer, a frequency-aware Transformer that augments rotary positional embeddings with learnable Fourier components. This enables explicit modeling of multi-scale visual patterns and adaptive distance-dependent modulation of attention. Our analysis shows that FourierRoFormer produces attention hierarchies aligned with object boundaries (correlation $r=0.85$) and distinct specialization across attention heads. On ImageNet-1K, FourierRoFormer achieves 84.1\% top-1 accuracy (+1.8pp over RoFormer-B) and outperforms non-hierarchical spectral methods, including SpectFormer-B (+1.98pp) and GFNet-B (+3.4pp), while maintaining comparable parameter efficiency. Our hierarchical variant, FourierRoFormer-H-B, achieves 85.3\% top-1 accuracy, demonstrating compatibility with hierarchical architectures. The method improves transfer to dense prediction tasks, yielding +2.6 mAP on COCO detection and +2.2 mAP on instance segmentation. Ablation studies highlight the complementary roles of frequency modulation (+4.43pp) and adaptive damping (+2.09pp). The approach introduces only 0.04\% additional parameters and $\sim3\%$ computational overhead.
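To make the abstract's mechanism concrete, the following is a minimal NumPy sketch of rotary attention logits augmented with a distance-dependent damping term. The specific functional form (a learnable quadratic penalty on token distance) and the names `rotary_embed`, `fourier_attention_logits`, `freqs`, and `damping` are illustrative assumptions, not the paper's actual implementation; in FourierRoFormer the frequencies and damping would be learned per head rather than fixed.

```python
import numpy as np

def rotary_embed(x, pos, freqs):
    """Rotate consecutive feature pairs of x by angle pos * freq (standard RoPE).

    x: (seq, d) features; pos: (seq,) positions; freqs: (d/2,) frequencies.
    In FourierRoFormer these frequencies are learnable Fourier components
    (assumed here, following the abstract's description).
    """
    angles = pos[:, None] * freqs[None, :]            # (seq, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def fourier_attention_logits(q, k, pos, freqs, damping):
    """Scaled dot-product logits with rotary embeddings plus an adaptive
    distance-dependent damping term (hypothetical quadratic form)."""
    qr = rotary_embed(q, pos, freqs)
    kr = rotary_embed(k, pos, freqs)
    logits = qr @ kr.T / np.sqrt(q.shape[-1])
    dist = np.abs(pos[:, None] - pos[None, :])
    # Larger damping makes attention decay faster with token distance.
    return logits - damping * dist**2
```

With `damping = 0` this reduces to plain rotary attention; increasing it suppresses long-range logits while leaving the diagonal (zero-distance) entries untouched, which is the "adaptive distance-dependent modulation" the abstract claims.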
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 14942