Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention often outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying vector autoregressive (VAR) structure embedded within linear attention and hindering their ability to capture the data-generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic VAR structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. We then propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with the autoregressive objective, SAMoVAR delivers improved performance, interpretability, and computational efficiency compared to SOTA TSF models.
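
To make the abstract's central claim concrete, here is a minimal NumPy sketch (not the authors' code; all names, shapes, and the feature map are illustrative assumptions) of how one causal linear attention layer can be read as a VAR with data-dependent coefficient matrices: the output o_t = sum_{i<=t} phi(q_t)^T phi(k_i) v_i can be rewritten as o_t = sum_{i<=t} A_{t,i} x_i with A_{t,i} = (phi(q_t)^T phi(k_i)) W_V.

import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4                               # sequence length, channel dimension (assumed)
X = rng.standard_normal((T, d))           # multivariate series x_1..x_T
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
phi = lambda z: np.maximum(z, 0.0) + 1.0  # a positive feature map, e.g. relu(z)+1 (assumed)

Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T

# (1) Causal (unnormalized) linear attention: o_t = sum_{i<=t} phi(q_t)^T phi(k_i) v_i
out_attn = np.stack([
    sum(phi(Q[t]) @ phi(K[i]) * V[i] for i in range(t + 1)) for t in range(T)
])

# (2) The same output written as a VAR over the raw inputs with dynamic,
#     data-dependent coefficient matrices A_{t,i} = (phi(q_t)^T phi(k_i)) * W_V
out_var = np.stack([
    sum((phi(Q[t]) @ phi(K[i])) * W_V @ X[i] for i in range(t + 1)) for t in range(T)
])

assert np.allclose(out_attn, out_var)     # the two views coincide

Under these assumptions the equivalence is exact for a single layer; the paper's contribution concerns how stacking layers with MLPs breaks this correspondence and how the flow can be rearranged to restore it.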
Lay Summary: Forecasting future data, such as weather or stock prices, is often done using powerful but complex machine learning models like Transformers. However, deeper Transformers usually lose interpretability because they stray from clear, understandable methods like Vector Autoregression (VAR). Our research reveals that a simpler Transformer variant ("linear attention") aligns well with VAR. Building on this insight, we propose SAMoVAR, a Transformer designed specifically to maintain VAR’s clear structure. SAMoVAR enhances forecasting accuracy, interpretability, and speed, clearly showing how past data affects future outcomes. This helps users better understand predictions made from time series data.
Link To Code: https://github.com/LJC-FVNR/Structural-Aligned-Mixture-of-VAR
Primary Area: Deep Learning->Sequential Models, Time series
Keywords: Linear Attention, Transformer, Time Series Forecasting, Vector Autoregression
Submission Number: 15182