TL;DR: A novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for augmented time series forecasting.
Abstract: Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting. Code is available at https://github.com/CityMind-Lab/ICML25-TimeVLM.
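As a rough illustration of the architecture described in the abstract, the sketch below shows how a retrieval-augmented temporal branch might be fused with a frozen-VLM embedding before the forecasting head. This is an assumption-based simplification, not the released implementation: all module names, dimensions, and the fusion scheme are hypothetical, and the placeholder VLM embedding stands in for the output of an actual frozen pre-trained vision-language model applied to the series-as-image and its generated text description.

```python
# Minimal sketch of a Time-VLM-style forward pass (illustrative assumptions only).
import torch
import torch.nn as nn


class RetrievalAugmentedLearner(nn.Module):
    """Enriches temporal tokens by attending over a learnable memory bank."""
    def __init__(self, d_model: int, memory_size: int = 64):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(memory_size, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, temporal_tokens: torch.Tensor) -> torch.Tensor:
        mem = self.memory.unsqueeze(0).expand(temporal_tokens.size(0), -1, -1)
        enriched, _ = self.attn(temporal_tokens, mem, mem)  # query the memory bank
        return temporal_tokens + enriched


class TimeVLMSketch(nn.Module):
    """Fuses temporal features with a (precomputed) frozen-VLM embedding."""
    def __init__(self, seq_len: int, pred_len: int, d_model: int, vlm_dim: int):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                 # per-step value embedding
        self.retrieval = RetrievalAugmentedLearner(d_model)
        self.fuse = nn.Linear(d_model + vlm_dim, d_model)  # temporal + multimodal fusion
        self.head = nn.Linear(seq_len * d_model, pred_len)

    def forward(self, x: torch.Tensor, vlm_embedding: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len) univariate series; vlm_embedding: (batch, vlm_dim)
        tokens = self.embed(x.unsqueeze(-1))               # (batch, seq_len, d_model)
        tokens = self.retrieval(tokens)
        vlm = vlm_embedding.unsqueeze(1).expand(-1, tokens.size(1), -1)
        fused = self.fuse(torch.cat([tokens, vlm], dim=-1))
        return self.head(fused.flatten(1))                 # (batch, pred_len)


# Usage with a dummy VLM embedding; in the actual framework this would come from
# encoding the time series rendered as an image plus its textual description.
model = TimeVLMSketch(seq_len=96, pred_len=24, d_model=32, vlm_dim=512)
x = torch.randn(8, 96)
vlm_emb = torch.randn(8, 512)   # placeholder for frozen VLM output
print(model(x, vlm_emb).shape)  # torch.Size([8, 24])
```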
Lay Summary: Accurate forecasting of time series — such as electricity usage, weather patterns, or traffic flow — is essential for planning and decision-making in many areas. However, traditional models often struggle with complex patterns, and deep learning methods typically require large amounts of data to work well.
In this research, we introduce Time-VLM, a new approach that combines time series with visual and textual information using pre-trained vision-language models (VLMs). Our method automatically transforms time series data into meaningful images and text descriptions, allowing the model to understand both the temporal dynamics and the real-world context behind the numbers.
This makes it possible to produce accurate predictions even when only limited data is available — for example, forecasting energy consumption trends from just a few past observations. We show that our model outperforms existing methods in low-data settings, especially in cross-domain scenarios where no historical data is available for training.
Time-VLM opens up new possibilities for applying powerful AI models like VLMs beyond image and language tasks, helping us better understand and predict time-dependent phenomena across a wide range of applications.
Link To Code: https://github.com/CityMind-Lab/ICML25-TimeVLM
Primary Area: Deep Learning->Sequential Models, Time series
Keywords: Time Series Forecasting, Vision-Language Models
Submission Number: 586