VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

Published: 21 May 2026, Last Modified: 21 May 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.
Submission Type: Regular submission (no more than 12 pages of main content)
Supplementary Material: zip
Assigned Action Editor: ~Vincent_Fortuin1
Submission Number: 7486