Keywords: Subquadratic Transformer; Spectral Mixing; Multi-Modal Fusion; Fourier-Wavelet Attention
TL;DR: This paper introduces MDFWA, a subquadratic Transformer that combines Fourier and wavelet token mixing to efficiently model global and local dependencies in long and multi-modal sequences.
Abstract: We revisit the use of spectral techniques to replace the attention mechanism in Transformers with Fourier Transform–based token mixing, and present a comprehensive reformulation of this technique for next-generation Transformer models. We provide expanded literature context and detailed mathematical formulations of Fourier mixing and causal masking, and introduce a novel \emph{Multi-Domain Fourier-Wavelet Attention} (MDFWA) that integrates frequency- and time-localized transforms to capture both global and local dependencies efficiently. We derive complexity bounds and gradient formulas, and show that MDFWA achieves sub-quadratic time and memory cost while improving expressive power. We validate our design on an abstractive summarization task using the PubMed dataset, enhancing the proposed approach with learned frequency bases, adaptive scale selection, and multi-modal extensions.
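To make the combined frequency- and time-localized mixing concrete, here is a minimal sketch of a Fourier-wavelet token mixer. It pairs an FNet-style FFT over the sequence axis (global branch) with a single-level Haar transform (local branch); the module name `MDFWAMixer`, the learned fusion projection, and all hyperparameters are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MDFWAMixer(nn.Module):
    """Hypothetical token mixer: fuses a Fourier (global) branch with a
    Haar-wavelet (local) branch, each sub-quadratic in sequence length."""
    def __init__(self, d_model: int):
        super().__init__()
        # Learned projection to fuse the two spectral views (an assumption).
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len assumed even.
        # Global branch: real part of the FFT along the sequence axis,
        # in the spirit of FNet-style token mixing. O(n log n).
        fourier = torch.fft.fft(x, dim=1).real
        # Local branch: single-level Haar transform, i.e. pairwise
        # averages (approximation) and differences (detail). O(n).
        lo = (x[:, 0::2] + x[:, 1::2]) / 2.0
        hi = (x[:, 0::2] - x[:, 1::2]) / 2.0
        wavelet = torch.cat([lo, hi], dim=1)  # restore original length
        # Concatenate the two views feature-wise and project back.
        return self.proj(torch.cat([fourier, wavelet], dim=-1))

x = torch.randn(2, 128, 64)
print(MDFWAMixer(64)(x).shape)  # torch.Size([2, 128, 64])
```

The paper's learned frequency bases and adaptive scale selection would replace the fixed FFT/Haar transforms above; this sketch only illustrates the multi-domain fusion idea.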
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 6181