A Closer Look at Transformers for Time Series Forecasting: Understanding Why They Work and Where They Struggle

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This study investigates the effectiveness of basic transformer architectures for time-series forecasting to reveal what drives the success of simpler architectures over more complex designs.
Abstract:

Time-series forecasting is crucial across various domains, including finance, healthcare, and energy. Transformer models, originally developed for natural language processing, have demonstrated significant potential in addressing challenges associated with time-series data. These models use different tokenization strategies (point-wise, patch-wise, and variate-wise) to represent time-series data, each yielding attention maps with a different scope. Despite the emergence of sophisticated architectures, simpler transformers consistently outperform their more complex counterparts on widely used benchmarks. This study examines why point-wise transformers are generally less effective, why intra- and inter-variate attention mechanisms yield similar outcomes, and which architectural components drive the success of simpler models. By analyzing mutual information and evaluating models on synthetic datasets, we demonstrate that intra-variate dependencies are the primary contributors to prediction performance on benchmarks, while inter-variate dependencies have a minor impact. Techniques such as Z-score normalization and skip connections are also crucial. However, these results are largely influenced by the self-dependent and stationary nature of benchmark datasets. By validating our findings on real-world healthcare data, we provide insights for designing more effective transformers for practical applications.
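The three tokenization strategies differ only in how a multivariate series is sliced into tokens before attention is applied. The minimal NumPy sketch below is illustrative only and not the paper's code; the function names, patch length, and shapes are assumptions. It shows how point-wise, patch-wise, and variate-wise tokens are formed, together with the Z-score (instance) normalization mentioned in the abstract.

```python
import numpy as np

def zscore_normalize(x, eps=1e-5):
    # Z-score (instance) normalization per variate; the statistics are kept
    # so forecasts can be de-normalized afterwards.
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True) + eps
    return (x - mean) / std, mean, std

def tokenize(x, strategy, patch_len=16):
    # x: array of shape (T, V) -- T time steps, V variates.
    T, V = x.shape
    if strategy == "point":
        # Point-wise: one token per time step, covering all variates.
        return x                                   # T tokens of size V
    if strategy == "patch":
        # Patch-wise: each variate is split into non-overlapping windows of
        # patch_len steps; each window becomes one token (intra-variate scope).
        n = T // patch_len
        patches = x[: n * patch_len].reshape(n, patch_len, V)
        return patches.transpose(2, 0, 1).reshape(V * n, patch_len)  # V*n tokens
    if strategy == "variate":
        # Variate-wise: the whole series of each variate is one token, so
        # attention acts across variates (inter-variate scope).
        return x.T                                 # V tokens of size T
    raise ValueError(f"unknown strategy: {strategy}")

# Example: 96 hourly steps of 7 variates (shapes only, random data).
series = np.random.randn(96, 7)
normed, mu, sigma = zscore_normalize(series)
for s in ("point", "patch", "variate"):
    print(s, tokenize(normed, s).shape)
```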

Lay Summary:

Accurately predicting future trends from time-based data, such as patient health, stock prices, or energy use, is essential in many fields. Transformer models, originally developed for language processing, have shown promise for this task. These models represent time-based data in different ways, but interestingly, simpler transformer designs often outperform more complex ones. This study explores why simpler transformer architectures work better in forecasting, and how different ways of representing time-series data affect performance. We find that most useful information comes from tracking patterns within individual variables over time, rather than between variables. Techniques like data normalization and skip connections also contribute significantly to forecasting performance. However, we show that many commonly used datasets are easier to predict than real-world data, which can mislead model design. By testing on real healthcare data, we highlight what matters most when building effective transformer models for practical applications.

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Sequential Models, Time series
Keywords: Time series forecasting, transformers, point-wise, patch-wise, variate-wise, tokenization, model evaluation
Submission Number: 11621