TL;DR: We propose VisionTS, a time series forecasting foundation model built from rich, high-quality natural images.
Abstract: Foundation models have emerged as a promising approach in time series forecasting (TSF). Existing approaches either repurpose large language models (LLMs) or build large-scale time series datasets to develop TSF foundation models for universal forecasting. However, these methods face challenges due to the severe cross-domain gap or in-domain heterogeneity. This paper explores a new road to building a TSF foundation model from rich, high-quality natural images. Our key insight is that a visual masked autoencoder, pre-trained on the ImageNet dataset, can naturally be a numeric series forecaster. By reformulating TSF as an image reconstruction task, we bridge the gap between image pre-training and TSF downstream tasks. Surprisingly, without further adaptation in the time series domain, the proposed VisionTS achieves better zero-shot forecasting performance than existing TSF foundation models. With fine-tuning for just one epoch, VisionTS further improves its forecasts and achieves state-of-the-art performance in most cases. Extensive experiments reveal intrinsic similarities between images and real-world time series, suggesting that visual models may offer a "free lunch" for TSF and highlighting the potential for future cross-modality research. Our code is available at https://github.com/Keytoyze/VisionTS.
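To make the reformulation concrete, here is a minimal sketch of the general idea under simple assumptions: a univariate series is folded by its assumed period into a 2-D grayscale "image", the columns corresponding to the forecast horizon are masked, and an image reconstructor fills them in. The helper names (`series_to_image`, `reconstruct_masked`, `forecast`) and the mean-filling reconstructor are illustrative stand-ins, not the paper's actual pipeline; in VisionTS the masked region would instead be reconstructed by an ImageNet-pretrained visual masked autoencoder.

```python
import numpy as np

def series_to_image(x: np.ndarray, period: int) -> np.ndarray:
    """Fold a 1-D series into a 2-D (period x n_cycles) array: one column per cycle."""
    n_cycles = len(x) // period
    x = x[: n_cycles * period]
    return x.reshape(n_cycles, period).T  # rows = phase within period, cols = cycles

def reconstruct_masked(img: np.ndarray, n_masked_cols: int) -> np.ndarray:
    """Stand-in for a pretrained visual masked autoencoder: fill the masked
    right-hand columns with the per-row mean of the visible columns."""
    visible = img[:, : img.shape[1] - n_masked_cols]
    fill = visible.mean(axis=1, keepdims=True)
    out = img.copy()
    out[:, img.shape[1] - n_masked_cols:] = fill
    return out

def forecast(history: np.ndarray, period: int, horizon: int) -> np.ndarray:
    """Forecast `horizon` steps by reconstructing masked future columns of the image."""
    n_future_cols = int(np.ceil(horizon / period))
    # Normalize, fold the history into an image, and append empty (masked) future columns.
    mean, std = history.mean(), history.std() + 1e-8
    img = series_to_image((history - mean) / std, period)
    img = np.concatenate([img, np.zeros((period, n_future_cols))], axis=1)
    # Reconstruct the masked columns (here: a trivial stand-in instead of an MAE).
    recon = reconstruct_masked(img, n_future_cols)
    # Unfold the reconstructed future columns back into a 1-D forecast and de-normalize.
    future = recon[:, -n_future_cols:].T.reshape(-1)[:horizon]
    return future * std + mean

history = np.sin(np.arange(240) * 2 * np.pi / 24)  # toy series with period 24
print(forecast(history, period=24, horizon=48))
```

The sketch only illustrates how forecasting can be cast as inpainting the masked part of an image; the quality of the forecast in this framing comes entirely from the strength of the pretrained image reconstructor that replaces `reconstruct_masked`.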
Lay Summary: These days, powerful AI models called "foundation models" have revolutionized fields like language processing and computer vision. People now want a single *universal* foundation model for many kinds of time series forecasting, such as predicting energy use, weather, or traffic. But current approaches struggle because time series data from different areas look very different, making it hard to build one universal model.
We had a creative idea: **Why not use a model already trained on *images* for forecasting?** We took an AI model pre-trained on ImageNet and treated time series data like images. By turning numbers into "images" and having the model reconstruct them, we bridged the gap between vision and forecasting. Amazingly, this worked without any extra training on actual time series data.
Our model, VisionTS, outperformed existing specialized time series models **even without any time series training**. This suggests images and time series share surprising hidden similarities. Using visual models for forecasting could be a powerful "free lunch," opening exciting new paths for AI research across different data types.
Link To Code: https://github.com/Keytoyze/VisionTS
Primary Area: General Machine Learning->Sequential, Network, and Time Series Modeling
Keywords: time series forecasting, foundation models, transfer learning
Submission Number: 10743