S$^2$Transformer: Scalable Structured Transformers for Global Station Weather Forecasting

TMLR Paper 5895 Authors

15 Sept 2025 (modified: 02 Mar 2026). Decision pending for TMLR. License: CC BY 4.0
Abstract: Global Station Weather Forecasting (GSWF) is a key area of meteorological research, critical to energy, aviation, and agriculture. When forecasting for large numbers of stations worldwide, existing time series forecasting methods either ignore spatial correlation or model it only unidirectionally. This contradicts the intrinsic structure of observations of the global weather system and limits forecast performance. To address this, we propose a novel Spatial Structured Attention Block. It partitions the spatial graph into a set of subgraphs, applies Intra-subgraph Attention to learn local spatial correlation within each subgraph, and aggregates nodes into subgraph representations for message passing among subgraphs via Inter-subgraph Attention, accounting for both spatial proximity and global correlation. Building on this block, we develop S$^2$Transformer, a multiscale spatiotemporal forecasting model that progressively expands subgraph scales. The resulting model is scalable, produces structured spatial correlation, and is easy to implement. Experiments show that it improves over time series forecasting baselines by up to 16.8% at low running cost.
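The block described in the abstract maps naturally onto standard attention primitives: per-subgraph attention over stations, pooling to one token per subgraph, and attention over those tokens. Below is a minimal PyTorch sketch of this structure, not the authors' implementation (see the linked repository for that); the `assignment` partition, the mean pooling of subgraph tokens, and the broadcast-and-add fusion step are all illustrative assumptions.

```python
# Illustrative sketch of a Spatial Structured Attention Block.
# Assumptions (not from the paper): mean pooling for subgraph tokens,
# broadcast-and-add fusion, and a precomputed node-to-subgraph assignment.
import torch
import torch.nn as nn


class SpatialStructuredAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_subgraphs: int):
        super().__init__()
        self.n_subgraphs = n_subgraphs
        # Intra-subgraph attention: local spatial correlation within a subgraph.
        self.intra_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Inter-subgraph attention: message passing among subgraph representations.
        self.inter_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, assignment: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_stations, d_model) station embeddings
        # assignment: (n_stations,) subgraph index per station (each subgraph
        # assumed non-empty), e.g. from a spatial graph partitioner
        out = x.clone()
        sub_tokens = []
        for g in range(self.n_subgraphs):
            idx = (assignment == g).nonzero(as_tuple=True)[0]
            nodes = x[:, idx]                                # (B, |V_g|, d)
            local, _ = self.intra_attn(nodes, nodes, nodes)  # local correlation
            out[:, idx] = local
            sub_tokens.append(local.mean(dim=1))             # pool to one token
        # Exchange global messages among subgraph tokens.
        tokens = torch.stack(sub_tokens, dim=1)              # (B, n_subgraphs, d)
        mixed, _ = self.inter_attn(tokens, tokens, tokens)
        # Broadcast each subgraph's global message back to its member stations.
        for g in range(self.n_subgraphs):
            idx = (assignment == g).nonzero(as_tuple=True)[0]
            out[:, idx] = out[:, idx] + mixed[:, g].unsqueeze(1)
        return out
```

The multiscale model described in the abstract would then stack such blocks while progressively coarsening the partition (fewer, larger subgraphs at deeper layers), so that local detail and global correlation are captured at successive scales.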
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have revised the manuscript to address the comments from the Action Editor and Reviewers. The main changes are as follows:
1. We added a new ablation variant w/o MA in Section 5.4 to evaluate the impact of the hierarchical multiscale architecture.
2. We provided a rigorous mathematical definition of "structured spatial correlation" in Appendix C.
3. We included a performance comparison between S$^2$Transformer and physics-informed models in Appendix E.
4. We added Appendix F, which discusses the model's sensitivity to the number of subgraphs and the robustness of the graph partitioning strategy.
5. We rewrote the Broader Impact section to provide a more comprehensive discussion of the societal implications of our work.
Code: https://github.com/hongyichenhitsz/S2Transformer
Assigned Action Editor: ~Chuxu_Zhang2
Submission Number: 5895