Keywords: time series, foundation models, rank structure, attention, embedding
Abstract: Transformers are widely used across data modalities, yet the principles distilled from text models often transfer imperfectly. In this paper, we analyze Transformers through the lens of rank structure. Our focus is on the time series setting, where the structural properties of the data differ markedly from those of text or vision. Time series embeddings, unlike those of text or vision, exhibit sharply decaying singular spectra: small patch sizes and smooth continuous mappings concentrate the data into low-rank subspaces. From this, we prove that the associated $Q/K/V$ projections admit accurate low-rank approximations, and that attention layers become compressible in proportion to the decay of the embedding spectrum. We introduce the concept of *flow-of-ranks*, a mechanism by which nonlinear mixing across depth inflates rank, explaining why early layers are most amenable to compression and why rank schedules should grow with depth. Guided by these results, we compress Chronos, a large time series foundation model, reducing inference time by $65\%$ and memory by $81\%$ without loss of accuracy. These findings provide principled guidance for allocating width, depth, and heads in time series foundation models, and for exploiting their inherent compressibility.
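The abstract's central claim is that a sharply decaying embedding spectrum makes the $Q/K/V$ projections accurately approximable at low rank. The following is a minimal illustrative sketch of that idea, not the paper's code: the dimensions, variable names (`d_model`, `W_q`, `rank`), and synthetic decay rate are assumptions chosen only to show how truncating the query matrix to a small rank incurs little error when the embedding singular values decay quickly.

```python
# Minimal sketch (illustrative only): low-rank truncation of queries when the
# embedding spectrum decays fast. All names and sizes here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 64, 256

# Synthetic time series embeddings with a sharply decaying singular spectrum.
U, _ = np.linalg.qr(rng.standard_normal((n_tokens, d_model)))
V, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
spectrum = np.exp(-0.3 * np.arange(d_model))        # fast singular-value decay
X = U @ np.diag(spectrum) @ V.T                     # embeddings (n_tokens x d_model)

W_q = rng.standard_normal((d_model, d_model))       # dense Q projection

# Low-rank surrogate: keep the top-r singular directions of the query matrix.
rank = 8
Uq, Sq, Vqt = np.linalg.svd(X @ W_q, full_matrices=False)
Q_lowrank = Uq[:, :rank] @ np.diag(Sq[:rank]) @ Vqt[:rank, :]

rel_err = np.linalg.norm(X @ W_q - Q_lowrank) / np.linalg.norm(X @ W_q)
print(f"relative error of rank-{rank} queries: {rel_err:.2e}")
# The error is small because the embedding spectrum, and hence the query
# spectrum, decays rapidly; slower decay would require a larger rank.
```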
Supplementary Material: zip
Primary Area: learning on time series and dynamical systems
Submission Number: 1173