Abstract: As the global shift towards green energy accelerates, photovoltaic systems are being deployed at an unprecedented scale. However, the high penetration of solar energy introduces volatility and uncertainty into the grid, posing significant challenges to grid stability and power dispatch. To address these challenges and maximize solar energy utilization, advanced forecasting techniques are urgently needed. While numerous studies have focused on solar irradiance and photovoltaic power prediction, most existing models cannot generalize well to new solar power plants with limited training data. This limitation is critical given the extensive forecasting demands of newly installed plants. Therefore, this paper proposes a novel zero-shot multimodal learning framework for predicting global horizontal irradiance (GHI). The framework integrates satellite images and numerical time series data. First, a pre-trained Vision Transformer is employed to extract feature embeddings from satellite images, while the temporal dependencies in historical irradiance and related variables are captured by an iTransformer. Second, a filtering cross-attention mechanism is introduced to eliminate inter-modal redundant information and enhance cross-modal fusion. Finally, a multivariate shared multi-layer perceptron (MLP) produces the multi-step prediction. Experimental results demonstrate that the proposed method achieves a forecasting skill (FS) of 25.39%, outperforming existing multimodal models with fewer parameters and lower memory usage. The code is available at https://doi.org/10.5281/zenodo.16779630.
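The pipeline described in the abstract can be summarized in a minimal sketch, assuming PyTorch. The ViT backbone is replaced here by a linear projection of pre-extracted patch features, and the sigmoid filtering gate, the 96-step lookback, 16-step horizon, 6 input variables, and all module names are illustrative assumptions based only on the abstract, not the authors' implementation.

```python
# Minimal architectural sketch based only on the abstract (not the authors' code).
import torch
import torch.nn as nn


class InvertedTemporalEncoder(nn.Module):
    """iTransformer-style encoder: each variate's full history becomes one token."""

    def __init__(self, lookback: int, d_model: int, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(lookback, d_model)  # whole series -> variate token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                          # x: (B, lookback, n_vars)
        tokens = self.embed(x.transpose(1, 2))     # (B, n_vars, d_model)
        return self.encoder(tokens)


class FilteringCrossAttention(nn.Module):
    """Cross-attention from variate tokens to image tokens, followed by a learned
    sigmoid gate meant to suppress redundant cross-modal information
    (the gate is an assumption; the paper's exact filter is not in the abstract)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, ts_tokens, img_tokens):
        fused, _ = self.attn(ts_tokens, img_tokens, img_tokens)
        g = self.gate(torch.cat([ts_tokens, fused], dim=-1))  # per-token filter
        return ts_tokens + g * fused


class ZeroShotGHIForecaster(nn.Module):
    """Fuses pre-extracted ViT patch embeddings with an inverted temporal
    encoder and maps the GHI token to a multi-step forecast."""

    def __init__(self, lookback=96, horizon=16, d_model=128, vit_dim=768):
        super().__init__()
        self.img_proj = nn.Linear(vit_dim, d_model)   # stand-in for a frozen ViT
        self.temporal = InvertedTemporalEncoder(lookback, d_model)
        self.fusion = FilteringCrossAttention(d_model)
        # Shared MLP head: the same weights are applied to every variate token.
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                  nn.Linear(d_model, horizon))

    def forward(self, series, vit_patches):
        # series: (B, lookback, n_vars); vit_patches: (B, n_patches, vit_dim)
        ts_tokens = self.temporal(series)
        img_tokens = self.img_proj(vit_patches)
        fused = self.fusion(ts_tokens, img_tokens)
        return self.head(fused)[:, 0]                 # GHI assumed to be variate 0


if __name__ == "__main__":
    model = ZeroShotGHIForecaster()
    out = model(torch.randn(2, 96, 6), torch.randn(2, 196, 768))
    print(out.shape)  # torch.Size([2, 16])
```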
External IDs: doi:10.1016/j.segan.2025.102044