Abstract: Contextual embeddings learned by transformer-based models are effective in a wide range of downstream tasks in \emph{numerical domains} such as time series forecasting. The success of these models in capturing long-range dependencies and contextual semantics has led to their widespread adoption across a variety of architectures. However, their black-box nature raises concerns about reliability, especially in critical applications such as energy, nature, finance, healthcare, retail, and transportation. To provide prediction reliability in numerical domains, it is necessary to open the black box behind transformer-based models and to develop explanatory tools that can serve as good proxies for performance. Despite recent empirical successes, there is still little theoretical understanding of when transformer-based models, across both autoregressive and non-autoregressive architectures, generalize effectively to time series forecasting tasks. This paper seeks to bridge this gap through a novel analysis based on the concept of isotropy in the contextual embedding space. Specifically, a log-linear model, in which time series embeddings are related to predictive outputs via a softmax-based formulation, is considered as a simplified abstraction of the hidden representations of transformer-based models, providing a tractable lens for analyzing generalization and reliability. For this model, it is demonstrated that, in order to achieve state-of-the-art forecasting performance, the hidden representations of transformer-based models must possess a structure that accounts for the shift-invariance of the softmax function. By analyzing the gradient structure of self-attention in transformer-based models, it is shown how the isotropy of embeddings in the contextual embedding space preserves the underlying structure of the representations, thereby resolving the shift-invariance problem and providing insights into model reliability and generalization. Experiments across $22$ numerical datasets and $5$ transformer-based models show that data characteristics and architectural choices significantly affect isotropy, which in turn directly influences forecasting performance. This establishes isotropy as a theoretically grounded and empirically validated indicator of generalization and reliability in time series forecasting.
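To make the two quantities in the abstract concrete, the following is a minimal NumPy sketch, not taken from the paper: it checks numerically that the softmax is shift-invariant, $\mathrm{softmax}(z + c\mathbf{1})_i = e^{z_i + c}/\sum_j e^{z_j + c} = e^{z_i}/\sum_j e^{z_j} = \mathrm{softmax}(z)_i$, and computes a partition-function-style isotropy score for an embedding matrix. The `isotropy_score` helper, the synthetic embedding matrices, and the min/max partition-function measure (commonly used in the word-embedding isotropy literature) are illustrative assumptions, not necessarily the exact metric used in the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Shift-invariance: adding a constant to every logit leaves the output
# distribution unchanged, so the map from embeddings to predictive
# probabilities is only identified up to a common shift.
rng = np.random.default_rng(0)
logits = rng.normal(size=8)
assert np.allclose(softmax(logits), softmax(logits + 3.7))

def isotropy_score(E):
    """Partition-function isotropy proxy for an embedding matrix E
    (rows = embeddings): Z(c) = sum_i exp(c^T e_i) over unit directions c
    taken as eigenvectors of E^T E; returns min(Z)/max(Z), which is 1 for
    perfectly isotropic embeddings and approaches 0 for anisotropic ones."""
    _, directions = np.linalg.eigh(E.T @ E)   # unit-norm eigen-directions (columns)
    Z = np.exp(E @ directions).sum(axis=0)    # partition function per direction
    return Z.min() / Z.max()

E_iso = rng.normal(size=(1000, 16))                  # roughly isotropic cloud
E_aniso = E_iso * np.array([10.0] + [0.1] * 15)      # variance dominated by one axis
print(isotropy_score(E_iso), isotropy_score(E_aniso))  # score near 1 vs. near 0
```

Under these assumptions, the isotropic cloud scores close to 1 while the anisotropic one scores close to 0, which is the kind of contrast the experiments relate to forecasting performance.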
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Jacek_Cyranka1
Submission Number: 5168