Abstract: Vector representations of contextual embeddings learned by transformer-based models have proven effective even for downstream tasks in \emph{numerical domains} such as time series forecasting. Their success in capturing long-range dependencies and contextual semantics has led to broad adoption across architectures. Yet there is little theoretical understanding of when transformers, whether autoregressive or non-autoregressive, generalize well to forecasting tasks. This paper addresses this gap through an analysis of isotropy in contextual embedding space. Specifically, we adopt a log-linear model as a simplified abstraction of the hidden representations in transformer-based models. In this formulation, time series embeddings are mapped to predictive outputs through a softmax layer, providing a tractable lens for analyzing generalization. We show that state-of-the-art performance requires embeddings to possess a structure that accounts for the shift-invariance of the softmax function. By examining the gradient structure of self-attention, we demonstrate how isotropy preserves representation structure, resolves the shift-invariance problem, and yields insights into model reliability and generalization. Experiments across $22$ numerical datasets and $5$ transformer-based models show that data characteristics and architectural choices significantly affect isotropy, which in turn directly influences forecasting performance. This establishes isotropy as a theoretically grounded and empirically validated indicator of generalization and reliability in time series forecasting. The code for the isotropy analysis and all data are publicly available.
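To make the abstract's two central notions concrete, here is a minimal NumPy sketch, not taken from the submission: it verifies the shift-invariance of softmax and computes one common isotropy proxy, the partition-function ratio of Mu and Viswanath (2018). The function names `softmax` and `isotropy_score` are hypothetical, and the paper's actual isotropy measure may differ.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax.
    e = np.exp(z - z.max())
    return e / e.sum()

# Shift-invariance: softmax(z + c) == softmax(z) for any constant c,
# so the softmax layer cannot distinguish embeddings that differ by a
# common offset along the all-ones direction.
z = np.array([1.2, -0.5, 3.0])
assert np.allclose(softmax(z), softmax(z + 7.0))

def isotropy_score(E):
    """Partition-function isotropy proxy (Mu & Viswanath, 2018).

    For embeddings E (n x d), evaluate Z(u) = sum_i exp(u . e_i) over the
    principal directions u of E^T E; a ratio min Z / max Z close to 1
    indicates a near-isotropic embedding cloud.
    """
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    Z = np.exp(E @ Vt.T).sum(axis=0)  # one partition value per direction
    return Z.min() / Z.max()

rng = np.random.default_rng(0)
iso = isotropy_score(rng.normal(size=(1000, 32)))          # isotropic cloud
aniso = isotropy_score(rng.normal(size=(1000, 32)) + 5.0)  # common offset
print(f"isotropic: {iso:.3f}, shifted (anisotropic): {aniso:.3f}")
```

The shifted cloud scores far below the centered one because the common mean vector dominates one principal direction, which is the kind of anisotropy the abstract argues degrades generalization.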
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Jacek_Cyranka1
Submission Number: 5168