Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: weather forecasting, knowledge distillation, model efficiency
Abstract: State-of-the-art machine learning weather forecasting systems, such as FuXi, achieve skillful global predictions but at the cost of large model sizes and high training demands. In this work, we investigate how far such architectures can be reduced without significant loss of accuracy. Specifically, we compress FuXi-short by replacing its 48 SwinTransformerV2 blocks with only 6 (additionally evaluating 4- and 2-block variants), and probe two training strategies: (i) training the reduced model from scratch, including an efficient one-step regression initialization to quickly adapt the architecture to weather data, and (ii) block-wise distillation, where each reduced block is trained to approximate every 8th block of the original model using MSE loss. Despite the eightfold reduction in depth, accuracy on the most critical variables remains effectively unchanged. For example, mean sea level pressure RMSE increases by 0.026\% relative to the mean and 1.95\% relative to the standard deviation, while temperature RMSE changes by only 0.068\% and 0.87\%, respectively. Importantly, at one-eighth the depth, the model is substantially faster to train, enabling more agile adaptation to changing climate data. These results highlight both the potential and the limits of architectural compression of large models, showing that forecasting skill can be retained even under drastic reduction in depth. In this ongoing work we will quantify the benefits of this approach, explore further compression strategies, and assess robustness across seasons and longer forecast horizons.
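A minimal sketch of the block-wise distillation strategy described in the abstract, assuming a PyTorch setting. The block class, feature dimension, and independent per-block training scheme below are illustrative placeholders, not the actual FuXi implementation; only the 48-to-6 depth mapping and the MSE objective come from the abstract.

```python
# Illustrative block-wise distillation: each student block is trained to
# reproduce the teacher's hidden state after every 8th teacher block.
import torch
import torch.nn as nn

DIM = 256          # hypothetical feature dimension
N_TEACHER = 48     # depth of the original (teacher) stack
N_STUDENT = 6      # depth of the reduced (student) stack
STRIDE = N_TEACHER // N_STUDENT  # student block i targets teacher block (i+1)*STRIDE

def make_block(dim: int) -> nn.Module:
    # Stand-in for a SwinTransformerV2 block.
    return nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

teacher = nn.ModuleList([make_block(DIM) for _ in range(N_TEACHER)])
student = nn.ModuleList([make_block(DIM) for _ in range(N_STUDENT)])

for p in teacher.parameters():
    p.requires_grad_(False)  # teacher is frozen during distillation

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
mse = nn.MSELoss()

def distill_step(x: torch.Tensor) -> float:
    """One block-wise distillation step on a batch of hidden states x."""
    # Collect teacher hidden states after every STRIDE-th block.
    targets = []
    h = x
    with torch.no_grad():
        for i, blk in enumerate(teacher):
            h = blk(h)
            if (i + 1) % STRIDE == 0:
                targets.append(h)

    # Each student block consumes the teacher state at its input position,
    # so all blocks can be trained block-wise against their MSE targets.
    inputs = [x] + targets[:-1]
    loss = sum(mse(s_blk(inp), tgt)
               for s_blk, inp, tgt in zip(student, inputs, targets))

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with random tokens standing in for encoded weather fields.
batch = torch.randn(2, 128, DIM)   # (batch, tokens, features)
print(distill_step(batch))
```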
Submission Number: 106