Time-series forecasting is crucial across various domains, including finance, healthcare, and energy. Transformer models, originally developed for natural language processing, have demonstrated significant potential in addressing the challenges associated with time-series data. These models employ different tokenization strategies (point-wise, patch-wise, and variate-wise) to represent time-series data, each resulting in attention maps of a different scope. Despite the emergence of sophisticated architectures, simpler transformers consistently outperform their more complex counterparts on widely used benchmarks. This study examines why point-wise transformers are generally less effective, why intra- and inter-variate attention mechanisms yield similar outcomes, and which architectural components drive the success of simpler models. By analyzing mutual information and evaluating models on synthetic datasets, we demonstrate that intra-variate dependencies are the primary contributors to prediction performance on benchmarks, while inter-variate dependencies have only a minor impact. Techniques such as Z-score normalization and skip connections are also crucial. However, these results are largely shaped by the self-dependent and stationary nature of benchmark datasets. By validating our findings on real-world healthcare data, we provide insights for designing more effective transformers for practical applications.
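The three tokenization strategies differ only in how a multivariate series is sliced into tokens before attention is applied. The following is a minimal sketch, not the paper's implementation: the function names, the patch length of 16, and the tensor layout (batch, time, variates) are illustrative assumptions.

```python
import torch

def pointwise_tokens(x):
    # One token per time step, all variates embedded jointly at that step
    # (vanilla Transformer style); attention spans time points.
    B, T, V = x.shape
    return x.reshape(B, T, V)                        # T tokens of dimension V

def patchwise_tokens(x, patch_len=16):
    # One token per patch of consecutive steps within each variate
    # (PatchTST style); attention is intra-variate, over patches.
    B, T, V = x.shape
    assert T % patch_len == 0
    x = x.permute(0, 2, 1)                           # (B, V, T)
    x = x.reshape(B, V, T // patch_len, patch_len)
    return x.reshape(B * V, T // patch_len, patch_len)

def variatewise_tokens(x):
    # One token per variate, holding its entire history
    # (iTransformer style); attention is inter-variate.
    B, T, V = x.shape
    return x.permute(0, 2, 1)                        # V tokens of dimension T

x = torch.randn(8, 96, 7)                            # 96 steps, 7 variates
print(pointwise_tokens(x).shape,                     # (8, 96, 7)
      patchwise_tokens(x).shape,                     # (56, 6, 16)
      variatewise_tokens(x).shape)                   # (8, 7, 96)
```

Each reshaping fixes what the attention map can relate: time points (point-wise), patches within one variate (patch-wise), or whole variates (variate-wise).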
Accurately predicting future trends from time-based data, such as patient health, stock prices, or energy use, is essential in many fields. Transformer models, originally developed for language processing, have shown promise for this task. These models represent time-based data in different ways, but interestingly, simpler transformer designs often outperform more complex ones. This study explores why simpler transformer architectures work better in forecasting and how different ways of representing time-series data affect performance. We find that most useful information comes from tracking patterns within individual variables over time rather than between variables. Techniques like data normalization and skip connections also contribute significantly to forecasting performance. However, we show that many commonly used datasets are easier to predict than real-world data, which can mislead model design. By testing on real healthcare data, we highlight what matters most when building effective transformer models for practical applications.
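To make the normalization and skip-connection ingredients concrete, here is a minimal sketch under stated assumptions: `NormalizedForecaster`, `RepeatLast`, the history length of 96, and the horizon of 24 are hypothetical names and values, and `backbone` stands for any forecasting module mapping (batch, history, variates) to (batch, horizon, variates); this is not the authors' code.

```python
import torch
import torch.nn as nn

class NormalizedForecaster(nn.Module):
    """Wraps a backbone with instance-wise Z-score normalization and a
    per-variate linear skip connection from history to horizon."""
    def __init__(self, backbone, history_len, horizon):
        super().__init__()
        self.backbone = backbone
        self.skip = nn.Linear(history_len, horizon)   # linear skip path

    def forward(self, x):                              # x: (B, T, V)
        mean = x.mean(dim=1, keepdim=True)             # per-instance statistics
        std = x.std(dim=1, keepdim=True) + 1e-5
        x_norm = (x - mean) / std                      # Z-score normalization
        y_norm = self.backbone(x_norm)                 # (B, H, V)
        skip = self.skip(x_norm.transpose(1, 2)).transpose(1, 2)
        y_norm = y_norm + skip                         # skip connection
        return y_norm * std + mean                     # undo normalization

class RepeatLast(nn.Module):
    """Trivial backbone for the usage example: repeats the last observed value."""
    def __init__(self, horizon):
        super().__init__()
        self.h = horizon

    def forward(self, x):
        return x[:, -1:, :].repeat(1, self.h, 1)

model = NormalizedForecaster(RepeatLast(24), history_len=96, horizon=24)
print(model(torch.randn(8, 96, 7)).shape)              # (8, 24, 7)
```

Even with a trivial backbone, the normalization and the direct skip path already capture much of the self-dependent, stationary structure that makes common benchmarks comparatively easy to forecast.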