FourierRoFormer: Learned Fourier Attention for Vision Transformers

ICLR 2026 Conference Submission 14942 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Vision Transformers, Fourier Analysis, Attention Mechanisms, Positional Embeddings, Multi-scale Learning, Frequency-aware Networks, Rotary Embeddings, Computer Vision, Representation Learning, Image Classification, Deep Learning, Interpretable Machine Learning
TL;DR: We introduce learnable Fourier attention for Vision Transformers that automatically discovers optimal spatial frequencies for multi-scale image understanding, achieving significant performance gains through interpretable frequency specialization.
Abstract: Vision Transformers (ViTs) excel at long-range reasoning but lack principled mechanisms for modeling spatial frequencies and for controlling how attention decays with distance. We propose \textbf{FourierRoFormer}, a frequency-aware Transformer that augments rotary positional embeddings with learnable Fourier components, enabling explicit modeling of multi-scale visual patterns and adaptive, distance-dependent modulation of attention. Our analysis shows that FourierRoFormer produces attention hierarchies aligned with object boundaries (correlation $r=0.85$) and distinct specialization across attention heads. On ImageNet-1K, FourierRoFormer achieves \textbf{84.1\% top-1 accuracy} (+1.8pp over RoFormer) while using 25\% fewer parameters than competitive spectral methods. It also improves transfer to dense prediction tasks, yielding +2.6 mAP on COCO detection and +2.2 mAP on instance segmentation. Ablation studies highlight the complementary roles of frequency modulation (+4.43pp) and adaptive damping (+2.09pp). Despite this added expressiveness, the method introduces only \textbf{0.04\% additional parameters} and $\sim3\%$ computational overhead, as confirmed by complexity and FLOPs analyses.
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 14942
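
To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of rotary attention with learnable per-head Fourier frequencies and an adaptive distance-damping term. This is an illustrative reconstruction, not the authors' implementation: the class name, the parameterization (per-head learnable frequencies initialized as in standard RoPE, a softplus-constrained damping rate applied as a token-distance penalty), and the 1-D position indexing are all assumptions made for brevity.

```python
# Illustrative sketch only (not the submission's code): rotary attention with
# learnable Fourier frequencies and adaptive distance damping. Uses a 1-D
# position index for brevity; the paper targets 2-D image patches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourierRotaryAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)
        # Learnable Fourier frequencies per head, initialized like standard RoPE.
        base = 1.0 / (10000 ** (torch.arange(0, self.head_dim, 2) / self.head_dim))
        self.freqs = nn.Parameter(base.repeat(num_heads, 1))           # (H, D/2)
        # Learnable damping rate per head: controls how attention decays with distance.
        self.damping = nn.Parameter(torch.zeros(num_heads))

    def rotate(self, x, pos):
        # x: (B, H, N, D); pos: (N,) token positions.
        angles = pos[None, None, :, None] * self.freqs[None, :, None, :]  # (1, H, N, D/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
        return out.flatten(-2)

    def forward(self, x):
        B, N, C = x.shape
        pos = torch.arange(N, device=x.device, dtype=x.dtype)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                            # each (B, H, N, D)
        q, k = self.rotate(q, pos), self.rotate(k, pos)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5        # (B, H, N, N)
        # Adaptive damping: subtract a per-head penalty proportional to |i - j|.
        dist = (pos[:, None] - pos[None, :]).abs()                       # (N, N)
        logits = logits - F.softplus(self.damping)[None, :, None, None] * dist
        attn = logits.softmax(dim=-1)
        return self.proj((attn @ v).transpose(1, 2).reshape(B, N, C))
```

In this sketch the damping term penalizes attention logits in proportion to token distance before the softmax, which is one straightforward way to realize the adaptive decay the abstract refers to; a faithful 2-D formulation would replace the scalar position index with patch coordinates. Usage, under the same assumptions: `FourierRotaryAttention(dim=768, num_heads=12)(torch.randn(2, 196, 768))`.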