Scaling to Billion Parameters for Time Series Foundation Models with Mixture of Experts

Published: 10 Oct 2024, Last Modified: 10 Oct 2024
Venue: NeurIPS 2024 TSALM Workshop
License: CC BY 4.0
Keywords: time series, foundation model, forecasting
Abstract: Deep learning has made significant strides in time series forecasting, yet the field lacks large-scale pre-trained models comparable to those in the language and vision domains. In this paper, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction while maintaining high model capacity. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support arbitrary forecasting horizons with varying input context lengths. We pre-trained these models on large-scale data spanning over 9 domains and encompassing over 117 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Our models consistently outperform dense models with the same number of activated parameters or equivalent computation budgets by a large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world forecasting challenges with superior capability, efficiency, and flexibility.
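To illustrate the sparse MoE idea the abstract describes, where only a subset of expert networks is activated per prediction, the sketch below shows a generic top-k gated mixture-of-experts feed-forward layer of the kind used inside decoder-only transformer blocks. This is a minimal, hypothetical example; the class name, hyperparameters (d_model, num_experts, top_k), and routing details are illustrative assumptions, not the paper's actual Time-MoE implementation.

```python
# Minimal sketch of a sparse mixture-of-experts feed-forward layer with top-k
# gating (illustrative only; not the authors' Time-MoE code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Gating network scores each token against every expert.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        scores = self.gate(x)                      # (batch, seq_len, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        # Only the selected experts run for each token, so compute scales with
        # top_k rather than with the total number of experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[..., slot] == e    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


# Example: route a batch of embedded time-series tokens through the MoE layer.
tokens = torch.randn(4, 96, 512)
layer = SparseMoEFeedForward()
print(layer(tokens).shape)  # torch.Size([4, 96, 512])
```

Because each token's output mixes only its top-k experts, total model capacity grows with the number of experts while per-token compute stays roughly constant, which is the efficiency argument the abstract makes for scaling to billions of parameters.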
Submission Number: 41