Keywords: Sparse Spectral; Hierarchical FFT; Low‑Rank Cross‑Frequency Mixer
TL;DR: SHFIN replaces convolutions and self-attention with a sparse Fourier operator that combines patch-wise FFTs, learnable K-sparse frequency selection, and low-rank spectral mixing, cutting parameter and compute costs by up to 60% while retaining comparable accuracy.
Abstract: In this work, we introduce \emph{Sparse Hierarchical Fourier Interaction Networks} (SHFIN), a novel architectural primitive designed to replace both convolutional kernels and the quadratic self‑attention mechanism with a unified, spectrum‑sparse Fourier operator. SHFIN is built upon three core components: (1) a hierarchical patch‑wise fast Fourier transform (FFT) stage that partitions inputs into localized patches and computes an $O(s\log s)$ transform on each, preserving spatial locality while enabling global information mixing; (2) a learnable $K$‑sparse frequency masking mechanism, realized via a Gumbel‑Softmax relaxation, which dynamically selects only the $K$ most informative spectral components per patch, thereby pruning redundant high‑frequency bands; and (3) a gated cross‑frequency mixer, implemented as a low‑rank bilinear interaction in the retained spectral subspace, which captures dependencies across channels at $O(K^2)$ cost rather than $O(N^2)$. An inverse FFT and residual fusion complete the SHFIN block, seamlessly integrating with existing layer‑norm and feed‑forward modules.
Empirically, we integrate SHFIN blocks into both convolutional and transformer‑style backbones and conduct extensive experiments on ImageNet‑1k. At the ResNet‑50 and ViT‑Small scales, our SHFIN variants achieve comparable Top‑1 accuracy (within 0.5 pp) while reducing total parameter count by up to 60\% and improving end‑to‑end inference latency by roughly 3× on NVIDIA A100 GPUs. Moreover, on the WMT14 English–German translation benchmark, a Transformer‑Small augmented with SHFIN cross‑attention layers matches a 28.1 BLEU baseline with 55\% lower peak GPU memory usage during training. These results demonstrate that SHFIN can serve as a drop‑in replacement for both local convolution and global attention, offering a new pathway toward efficient, spectrum‑aware deep architectures.
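To make the block structure described in the abstract concrete, below is a minimal, illustrative PyTorch sketch of one SHFIN block on a 1‑D token sequence. All names and hyperparameters (SHFINBlock, patch_size, k, rank, tau) are hypothetical; the hard top‑K mask is a simplified straight‑through variant of the Gumbel‑Softmax relaxation, and the low‑rank mixer is applied along the frequency axis as one possible reading of the bilinear cross‑frequency interaction. It is a sketch of the idea, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SHFINBlock(nn.Module):
    """Sketch of one SHFIN block: patch-wise rFFT -> learnable top-K frequency
    mask (Gumbel straight-through) -> low-rank cross-frequency mixing -> irFFT
    -> gated residual fusion. Hypothetical illustration, not the paper's code."""

    def __init__(self, dim: int, patch_size: int, k: int, rank: int, tau: float = 1.0):
        super().__init__()
        self.patch_size, self.k, self.tau = patch_size, k, tau
        n_freq = patch_size // 2 + 1                           # rFFT bins per patch
        self.freq_logits = nn.Parameter(torch.zeros(n_freq))   # mask logits, one per bin
        # Low-rank mixer along the frequency axis (factorized bilinear interaction).
        self.down = nn.Linear(n_freq, rank, bias=False)
        self.up = nn.Linear(rank, n_freq, bias=False)
        self.gate = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def topk_mask(self) -> torch.Tensor:
        # Gumbel-perturbed logits -> hard top-K mask, with a straight-through
        # estimator so gradients still reach the logits.
        gumbel = -torch.log(-torch.log(torch.rand_like(self.freq_logits) + 1e-9) + 1e-9)
        scores = (self.freq_logits + gumbel) / self.tau
        soft = torch.sigmoid(scores)
        hard = torch.zeros_like(soft).scatter_(0, scores.topk(self.k).indices, 1.0)
        return hard + soft - soft.detach()

    def _mix(self, real_or_imag: torch.Tensor) -> torch.Tensor:
        # real_or_imag: (batch, n_patches, n_freq, dim); mix along the frequency axis.
        h = real_or_imag.transpose(2, 3)                       # (b, n_patches, dim, n_freq)
        return self.up(self.down(h)).transpose(2, 3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) with seq_len divisible by patch_size.
        b, n, d = x.shape
        p = self.patch_size
        h = self.norm(x).reshape(b, n // p, p, d)

        # (1) Patch-wise FFT along each patch: O(p log p) per patch.
        spec = torch.fft.rfft(h, dim=2)                        # complex, (b, n/p, p//2+1, d)

        # (2) Learnable K-sparse frequency masking (non-selected bins zeroed here;
        #     a full implementation would gather only the K retained bins).
        spec = spec * self.topk_mask().view(1, 1, -1, 1)

        # (3) Low-rank cross-frequency mixing on real and imaginary parts.
        mixed = torch.complex(self._mix(spec.real), self._mix(spec.imag))

        # Inverse FFT, gating, and residual fusion.
        out = torch.fft.irfft(mixed, n=p, dim=2).reshape(b, n, d)
        return x + torch.sigmoid(self.gate(out)) * out


# Example: a ViT-Small-like width with 16-token patches, keeping K=4 of 9 bins.
x = torch.randn(2, 64, 384)
block = SHFINBlock(dim=384, patch_size=16, k=4, rank=2)
print(block(x).shape)  # torch.Size([2, 64, 384])
```

A 2‑D image variant would follow the same pipeline with torch.fft.rfft2 over spatial patches; gathering only the K retained bins, rather than zeroing the rest as done here for brevity, is what keeps the mixing cost tied to K rather than to the full spectrum.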
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 5723