Keywords: representation learning
Abstract: Effective information extraction has long been a central challenge in Computer Vision (CV). Transformer- and Mamba-based backbones have significantly advanced this field by providing powerful long-range modeling capability, even though they were originally developed for Natural Language Processing (NLP). Recent progress has highlighted the potential of the Fourier Neural Operator (FNO), which, with its favorable quasi-linear complexity and strong global modeling capacity, offers a promising alternative for visual representation learning. However, FNO exhibits a fundamental limitation in capturing local high-frequency patterns due to its over-smoothing effect and bandwidth bottleneck. To address this limitation, we propose Vision Filter (ViF), a generic backbone for CV, consisting of two complementary components: adaptive modulation, which enhances sensitivity to high-frequency components in the frequency domain, and selective activation, which balances local time-domain and global frequency-domain information flow. Extensive experiments reveal that ViF consistently outperforms prominent variants of Transformer- and Mamba-based backbones across diverse visual tasks, including image classification, object detection, and semantic segmentation. ViF demonstrates lower computational complexity than Transformer-based models and better structural modeling than Mamba-based models, which suffer from spatial disruption due to their directional scanning mechanism. The joint time- and frequency-domain mechanism introduced in ViF may establish a promising paradigm for designing effective visual representation learning methods, bridging local high-frequency information with global low-frequency information.
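The bandwidth bottleneck the abstract attributes to FNO can be seen in a minimal sketch of an FNO-style spectral layer: the input is transformed to the frequency domain, only a small number of low-frequency modes are kept and multiplied by learnable weights, and the rest are discarded, which suppresses local high-frequency detail. The mode count and random weights below are illustrative assumptions, not details from the paper; ViF's actual adaptive-modulation and selective-activation components are not specified in the abstract.

```python
import numpy as np

def spectral_filter(x, weights, modes):
    """FNO-style spectral layer sketch: FFT, keep the lowest `modes`
    frequencies, multiply by learnable complex weights, inverse FFT.
    High-frequency modes are zeroed out -- the bandwidth bottleneck."""
    n = x.shape[-1]
    xf = np.fft.rfft(x)                           # global frequency-domain view, O(n log n)
    out = np.zeros_like(xf)
    out[..., :modes] = xf[..., :modes] * weights  # retained low-frequency band only
    return np.fft.irfft(out, n=n)                 # back to the (real) time/spatial domain

rng = np.random.default_rng(0)
x = rng.standard_normal(64)                       # toy 1-D signal (assumed shape)
w = rng.standard_normal(8) + 1j * rng.standard_normal(8)  # 8 retained modes (assumed)
y = spectral_filter(x, w, modes=8)
```

Because only 8 of the 33 available rFFT modes survive, `y` is a smoothed, band-limited version of `x` regardless of the learned weights, which is the over-smoothing behavior the abstract says ViF is designed to counteract.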
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 6047