Towards a Multimodal Foundation Model for Time Series Analysis

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Time Series Analysis; Foundation Models; Multi-Modality
Abstract: Time series analysis supports a wide range of real-world applications. Existing time series foundation models rely primarily on large-scale unimodal pretraining and therefore lack complementary modalities that could enhance time series understanding. Building multimodal foundation models is a natural next step, but it introduces key challenges: 1) the scarcity of large-scale, high-quality multimodal time series data; and 2) the difficulty of effectively integrating heterogeneous modalities while improving generalization across both modalities and domains. To address these challenges, we take an early step toward multimodal foundation models for time series analysis. We first construct MM-TS, a large-scale multimodal dataset spanning time series, text, and images across six domains, with more than one billion time points. We then propose HORAI, a frequency-enhanced multimodal foundation model. HORAI integrates two core components: a Frequency-guided Cross-Modality Encoder, which exploits the correspondence between modality-specific information and different frequency components of time series to fuse the modalities effectively, and a Time-Frequency Decoder, which incorporates frequency information into an MoE router to improve pattern discrimination and generalization. After pretraining on MM-TS, HORAI achieves state-of-the-art zero-shot performance on time series forecasting and anomaly detection tasks, demonstrating strong task versatility and generalization.
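The abstract does not specify how frequency information enters the MoE router, so the following is only a minimal, hypothetical sketch of one way such a frequency-informed router could be wired up in PyTorch. The module name (`FreqMoERouter`), the use of per-patch rFFT magnitudes as gating features, and the top-k dispatch scheme are all illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch of a frequency-informed MoE router (not the paper's design).
import torch
import torch.nn as nn


class FreqMoERouter(nn.Module):
    """Routes patch embeddings to experts using time- and frequency-domain features."""

    def __init__(self, d_model: int, patch_len: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        n_freq = patch_len // 2 + 1                       # rFFT bins per patch (assumption)
        self.gate = nn.Linear(d_model + n_freq, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )

    def forward(self, h: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # h:       (batch, num_patches, d_model)  patch embeddings
        # patches: (batch, num_patches, patch_len) raw time-series patches
        freq = torch.fft.rfft(patches, dim=-1).abs()      # per-patch frequency magnitudes
        freq = torch.log1p(freq)                          # compress dynamic range
        logits = self.gate(torch.cat([h, freq], dim=-1))  # frequency-aware gating
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)

        out = torch.zeros_like(h)
        for k in range(self.top_k):                       # dispatch tokens to chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(h[mask])
        return out


if __name__ == "__main__":
    router = FreqMoERouter(d_model=64, patch_len=16)
    h = torch.randn(2, 10, 64)
    patches = torch.randn(2, 10, 16)
    print(router(h, patches).shape)  # torch.Size([2, 10, 64])
```

The intent of the sketch is only to illustrate the general idea named in the abstract: conditioning expert selection on spectral content so that patches with different dominant frequencies can be routed to different experts.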
Primary Area: learning on time series and dynamical systems
Submission Number: 5892