PatchMoE: A Time Series Foundation Model with Hierarchical Patch-wise Mixture-of-Experts

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: time series foundation model, forecasting, mixture of experts
Abstract: Recently, time series foundation models (TSFMs) pre-trained on massive datasets have achieved remarkable zero-shot performance. However, effectively modeling the diverse *inter*-series and *intra*-series patterns in large-scale datasets remains a significant challenge. Most existing methods, constrained by a single fixed tokenizer, lack the flexibility to capture this pattern diversity. To tackle this issue, we introduce PatchMoE, a novel hierarchical Mixture of Experts (MoE) architecture whose key components are Patch-wise Experts and a Sample-wise Hierarchical Router. Specifically, the Patch-wise Experts capture diverse *inter*-series patterns with specialized patch tokenizers, while the Sample-wise Hierarchical Router tackles *intra*-series patterns by dispatching the entire sample to experts. Each sample thus undergoes hierarchical routing through multiple MoE layers, with each layer gradually producing a partial forecast. Furthermore, to address the efficiency bottleneck of the MoE architecture, we develop a highly efficient training framework for the time series modality based on Megatron-LM, which implements expert parallelism and achieves a 3 to 5 times training speedup under identical experimental settings. Benefiting from this, we scale a time series foundation model to 8.5 billion parameters for the first time, achieving state-of-the-art results on zero-shot forecasting tasks. Compared with dense and sparse models of equivalent parameter scale, PatchMoE demonstrates significant improvements in both effectiveness and efficiency.
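To make the architectural idea concrete, below is a minimal sketch, assuming details not given in the abstract: each expert owns its own patch tokenizer (here, a different patch length), a router scores the entire sample and dispatches it to its top-k experts, and every layer emits a partial forecast that is summed across stacked layers. The class names (`PatchExpert`, `PatchMoELayer`), the patch lengths, and the top-k gating scheme are hypothetical illustrations, not the paper's implementation.

```python
# Minimal PatchMoE-style sketch (illustrative only; not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchExpert(nn.Module):
    """One expert: a specialized patch tokenizer followed by a small forecast head."""

    def __init__(self, context_len: int, horizon: int, patch_len: int, d_model: int = 64):
        super().__init__()
        self.patch_len = patch_len
        n_patches = context_len // patch_len
        self.tokenize = nn.Linear(patch_len, d_model)        # patch-wise tokenizer
        self.head = nn.Linear(n_patches * d_model, horizon)  # partial forecast head

    def forward(self, x):                                    # x: (batch, context_len)
        b = x.shape[0]
        usable = (x.shape[1] // self.patch_len) * self.patch_len
        patches = x[:, :usable].view(b, -1, self.patch_len)  # (batch, n_patches, patch_len)
        tokens = self.tokenize(patches)                      # (batch, n_patches, d_model)
        return self.head(tokens.flatten(1))                  # (batch, horizon)


class PatchMoELayer(nn.Module):
    """Sample-wise router dispatching each series to its top-k patch-wise experts."""

    def __init__(self, context_len: int, horizon: int,
                 patch_lens=(8, 16, 32, 64), top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            PatchExpert(context_len, horizon, p) for p in patch_lens
        )
        self.router = nn.Linear(context_len, len(patch_lens))  # routes the whole sample
        self.top_k = top_k

    def forward(self, x):                                      # x: (batch, context_len)
        gate = F.softmax(self.router(x), dim=-1)               # (batch, n_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)           # top-k experts per sample
        weights = weights / weights.sum(dim=-1, keepdim=True)
        horizon = self.experts[0].head.out_features
        out = torch.zeros(x.shape[0], horizon, device=x.device)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                          # samples routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out                                             # this layer's partial forecast


if __name__ == "__main__":
    layers = nn.ModuleList(PatchMoELayer(context_len=256, horizon=32) for _ in range(3))
    x = torch.randn(4, 256)
    forecast = sum(layer(x) for layer in layers)  # sum of per-layer partial forecasts
    print(forecast.shape)                         # torch.Size([4, 32])
```

In this sketch the per-layer outputs are simply summed into the final forecast; how PatchMoE actually combines the gradually produced partial forecasts, and how routing decisions cascade across layers, is specified in the paper itself.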
Supplementary Material: zip
Primary Area: learning on time series and dynamical systems
Submission Number: 7234