Is the Attention Matrix Really the Key to Self‑Attention in Multivariate Long‑Term Time Series Forecasting?
Keywords: Multivariate Long‑Term Time Series Forecasting, Self‑Attention, Attention Matrix
TL;DR: This paper argues that the source of self-attention's performance has been misattributed: the true benefit lies in the architectural principle of multi-branch mapping and fusion, not in the attention matrix.
Abstract: In multivariate long-term time series forecasting, the success of self-attention is commonly attributed to the attention matrix that encodes token interactions. In this paper, we provide evidence that challenges this view. Through extensive experiments on three classic and three recent Transformer models, we find that dot-product attention can be replaced by element-wise operations that involve no token interaction, such as addition and the Hadamard product, while maintaining or even improving accuracy. This motivates our central hypothesis: the effectiveness of self-attention in this task arises not from the dynamic attention matrix, but from the multi-branch feature extraction enabled by the parallel Query, Key, and Value projections and their fusion. To validate this hypothesis, we construct a minimalist multi-branch MLP that isolates the ‘multi-branch mapping with element-wise operation’ structure from the Transformer and show that it achieves competitive performance. Our findings indicate that the source of performance in self-attention is often misinterpreted: its actual advantage stems from the architectural principle of multi-branch mapping and fusion rather than from the attention matrix. Anonymous code is available at: https://anonymous.4open.science/r/Attention-01F4/
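The sketch below illustrates the idea described in the abstract: the softmax(QK^T)V attention matrix is replaced by an element-wise fusion of the parallel Query, Key, and Value projections, so no token-token interaction is computed. This is a minimal illustration under our own assumptions (module names, dimensions, and fusion choices are hypothetical), not the authors' released code.

```python
# Hypothetical sketch of "multi-branch mapping with element-wise operation":
# three parallel linear projections (the Q/K/V branches) are combined with an
# element-wise operation instead of the dot-product attention matrix.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class ElementwiseFusionBlock(nn.Module):
    def __init__(self, d_model: int, fusion: str = "hadamard"):
        super().__init__()
        # Parallel branches playing the role of the Q, K, V projections.
        self.branch_q = nn.Linear(d_model, d_model)
        self.branch_k = nn.Linear(d_model, d_model)
        self.branch_v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)
        self.fusion = fusion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); no attention matrix is formed, so
        # tokens are processed independently by each branch.
        q, k, v = self.branch_q(x), self.branch_k(x), self.branch_v(x)
        if self.fusion == "hadamard":
            fused = q * k * v  # element-wise (Hadamard) product of the branches
        else:
            fused = q + k + v  # element-wise addition of the branches
        return self.out(fused)


# Usage: a drop-in replacement for a self-attention layer in a forecasting model.
x = torch.randn(32, 96, 512)  # (batch, lookback length, model dimension)
block = ElementwiseFusionBlock(d_model=512, fusion="hadamard")
y = block(x)  # same shape as x: (32, 96, 512)
```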
Primary Area: learning on time series and dynamical systems
Submission Number: 10429