Abstract: Operational Numerical Weather Prediction (NWP) systems rely on computationally expensive physics-based models. Recently, transformer models have shown remarkable potential in weather forecasting, achieving state-of-the-art results. However, traditional transformers discretize the spatio-temporal dimensions, which limits their ability to model continuous dynamical weather processes. Moreover, their reliance on increased depth to capture complex dependencies results in higher computational cost and parameter redundancy. We address these issues with \textbf{STC-ViT}, a Spatio-Temporal Continuous Vision Transformer for weather forecasting. STC-ViT integrates a Fourier Neural Operator (FNO) for global spatial operators with a transformer-parameterized Neural ODE for continuous-time dynamics, yielding a space–time continuous model of weather forecasting. Our proposed method achieves competitive forecasting performance even with a shallow, single-layer transformer encoder, and scales further with depth as shown in our analysis (Section \ref{sec:scale}). STC-ViT generates complete forecast trajectories with an inference time of only 0.125 seconds and achieves strong medium-range forecasting skill on $1.5^\circ$ WeatherBench 2 compared with state-of-the-art data-driven and NWP models trained on higher-resolution data, at lower data and compute cost. We also provide a detailed empirical analysis of the model's performance with respect to denser time grids, higher-accuracy ODE solvers, and deeper transformer stacks. Our code is available at \url{https://anonymous.4open.science/r/STCViT-CC8B}.
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=foib4M4UXm&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)
Changes Since Last Submission: This is a major revision that re-architects STC-ViT to address the previous round's concerns: (i) what the Neural ODE genuinely adds over a discrete transformer with finer time sampling, (ii) the validity of the physics-informed losses, (iii) discrepancies between paper and code, and (iv) fair baselines.
From fixed-time to continuous-time: In the previous submission, the continuous-time claim rested on (a) finite-difference temporal derivatives $\frac{V(x,y,t)-V(x,y,t-1)}{\Delta t}$ injected into the attention queries, keys, and values, and (b) a Neural ODE layer applied after a discrete transformer block. Both operate on a fixed time grid, and we acknowledged the limitation in the previous paper: 'While this formulation aligns with continuous-depth modeling principles, the practical implementation focuses on fixed-interval evaluations rather than arbitrary time points' (previous Section 4.1).
In the new version, the model itself is an ODE. The pre-norm ViT defines a learnable vector field $f_\theta(z, e(t))$ on the latent space $\mathbb{R}^{B\times N_p\times D}$, and the trajectory $z(t)$ is obtained by integrating this vector field with \texttt{torchdiffeq.odeint} from $t=0$ to any requested lead time $t_\ell$. Concretely:
1. The latent state evolves through intermediate time points the user never specifies. RK4 evaluates $f_\theta$ at four sub-steps between every pair of output times, with no extra supervision required.
2. Solver order is a free hyperparameter at inference: swapping Euler $\to$ RK4 yields an $\approx$5\% RMSE reduction (Section 5.5) without changing the parameters. Only an ODE-based model exhibits this behaviour, and it is direct evidence that the integration matters beyond fixed sampling.
3. Time enters $f_\theta$ via a continuous sinusoidal embedding $e(t) \in \mathbb{R}^D$, not via a discrete index, so the dynamics are differentiable in $t$. The fixed scalar $\tau = 10^{-2}$ rescales physical lead times (hours) into the dimensionless solver variable; it improves conditioning but does not affect continuity.
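The three points above can be illustrated with a toy sketch. It is not the STC-ViT implementation: the vector field is a hypothetical scalar decay rate standing in for the ViT, the solvers are hand-written fixed-step Euler and classical RK4 rather than \texttt{torchdiffeq.odeint}, and all function names and the embedding layout (sine half, cosine half) are assumptions made for illustration. The point is only that the same vector field, with unchanged parameters, can be integrated with solvers of different order, and that time enters through a continuous embedding $e(t)$.

```python
import math

def time_embedding(t, dim=8, max_period=10000.0):
    """Sinusoidal embedding of a real-valued time t (assumed layout: sin half, cos half)."""
    half = dim // 2
    freqs = [max_period ** (-i / half) for i in range(half)]
    return [math.sin(t * w) for w in freqs] + [math.cos(t * w) for w in freqs]

def vector_field(z, t):
    """Toy stand-in for f_theta(z, e(t)): decay whose rate depends on e(t)[0] = sin(t)."""
    rate = 1.0 + 0.5 * time_embedding(t)[0]
    return [-rate * zi for zi in z]

def euler_step(f, z, t, h):
    """First-order step: one evaluation of f per step."""
    k1 = f(z, t)
    return [zi + h * a for zi, a in zip(z, k1)]

def rk4_step(f, z, t, h):
    """Classical fourth-order Runge-Kutta step: four evaluations of f per step."""
    k1 = f(z, t)
    k2 = f([zi + 0.5 * h * a for zi, a in zip(z, k1)], t + 0.5 * h)
    k3 = f([zi + 0.5 * h * b for zi, b in zip(z, k2)], t + 0.5 * h)
    k4 = f([zi + h * c for zi, c in zip(z, k3)], t + h)
    return [zi + h / 6.0 * (a + 2 * b + 2 * c + d)
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]

def integrate(f, z0, t_out, step):
    """Return the state at each requested output time, one solver step per interval."""
    z, traj = list(z0), [list(z0)]
    for t0, t1 in zip(t_out[:-1], t_out[1:]):
        z = step(f, z, t0, t1 - t0)
        traj.append(z)
    return traj

def exact_solution(z0_i, t):
    """Closed form of dz/dt = -(1 + 0.5 sin t) z, used only to measure solver error."""
    return z0_i * math.exp(-(t + 0.5 * (1.0 - math.cos(t))))

def max_error(traj, z0, t_out):
    return max(abs(zi - exact_solution(z0_i, t))
               for z, t in zip(traj, t_out) for zi, z0_i in zip(z, z0))

t_out = [0.0, 0.5, 1.0, 1.5, 2.0]   # requested output times (solver units)
z0 = [1.0, -2.0]                    # toy latent state

traj_euler = integrate(vector_field, z0, t_out, euler_step)
traj_rk4 = integrate(vector_field, z0, t_out, rk4_step)
err_euler = max_error(traj_euler, z0, t_out)
err_rk4 = max_error(traj_rk4, z0, t_out)
print(f"Euler max error: {err_euler:.4f}, RK4 max error: {err_rk4:.6f}")
```

In this toy setting the higher-order solver tracks the exact trajectory more closely on the same output grid. The size of the gap here says nothing about the $\approx$5\% figure reported for the real model.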
Architecture redesign: Beyond making the time evolution genuinely continuous, we also removed the Temporal Continuous Attention (TCA) module, the Spatial Attention split, and the derivation pre-processing step from the previous submission. The entire $L$-layer pre-norm ViT now plays the role of $f_\theta$ in the ODE above (Section 4.4, Eqs. 9-13). To anchor global spatial structure and spherical geometry, we add two spectral branches: a Fourier Neural Operator (FNO) layer for global low-frequency spatial mixing, and a spherical-harmonic (SH) positional encoder for topology-aware coordinates on the sphere. The initial latent is now $z(0) = T_{\mathrm{disc}} + T_{\mathrm{pos}} + T_{\mathrm{FNO}} + T_{\mathrm{SH}}$, where $T_{\mathrm{disc}}$ is the ClimaX-style variable-tokenised patch embedding, $T_{\mathrm{pos}}$ is a learnable 2-D sinusoidal positional embedding (now explicitly named in Eq. 9 to match the implementation), $T_{\mathrm{FNO}}$ is the patch-pooled FNO feature, and $T_{\mathrm{SH}}$ is the spherical-harmonic positional feature.
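As a rough illustration of what "global low-frequency spatial mixing" followed by patch pooling can look like, the NumPy sketch below applies a single-channel spectral filter in the style of an FNO layer and then averages the result over patches. It makes several assumptions not stated in the text: one channel, a flat 2-D FFT rather than any spherical treatment, mean pooling, and hypothetical names (`spectral_mix`, `patch_pool`) and sizes. The actual layer in the paper may differ.

```python
import numpy as np

def spectral_mix(x, weights, modes_h, modes_w):
    """FNO-style mixing: keep only low Fourier modes and scale them by (learnable) weights.

    x:       real field of shape (H, W)
    weights: complex array of shape (2, modes_h, modes_w), one block for the low
             positive and one for the low negative frequencies along the first axis
    """
    H, W = x.shape
    x_hat = np.fft.rfft2(x)                      # shape (H, W // 2 + 1)
    out_hat = np.zeros_like(x_hat)
    out_hat[:modes_h, :modes_w] = x_hat[:modes_h, :modes_w] * weights[0]
    out_hat[-modes_h:, :modes_w] = x_hat[-modes_h:, :modes_w] * weights[1]
    return np.fft.irfft2(out_hat, s=(H, W))      # back to physical space

def patch_pool(x, p):
    """Average a (H, W) field over non-overlapping p x p patches, giving H*W/p^2 tokens."""
    H, W = x.shape
    return x.reshape(H // p, p, W // p, p).mean(axis=(1, 3)).reshape(-1)

# Hypothetical sizes chosen for illustration only.
H, W, modes, patch = 16, 32, 4, 4
ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
low = np.cos(2 * np.pi * ii / H) + np.cos(2 * np.pi * 2 * jj / W)   # low-frequency field
high = np.cos(2 * np.pi * 7 * jj / W)                               # high-frequency field

identity_weights = np.ones((2, modes, modes), dtype=complex)
low_out = spectral_mix(low, identity_weights, modes, modes)
high_out = spectral_mix(high, identity_weights, modes, modes)
tokens = patch_pool(low_out, patch)   # one pooled value per patch
```

With identity weights the low-frequency field passes through unchanged while the high-frequency field is filtered out, which is the sense in which such a branch carries only global, smooth structure. In a trained model the weights would be learned and a projection to the token dimension $D$ would follow.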
Physics-informed losses replaced by architectural priors: The previous kinetic, potential, and thermodynamic-balance loss terms, whose physical validity reviewers questioned, have been removed. The continuity and geometry priors are now carried by the architecture itself (FNO for global low-frequency spatial structure, spherical harmonics for topology on the sphere, Neural ODE for continuous-time dynamics) rather than by soft penalties on the output. The training objective is plain latitude-weighted MSE.
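For readers unfamiliar with the objective, a minimal sketch of a latitude-weighted MSE follows. It assumes the common convention of weighting each grid row by $\cos(\text{latitude})$ normalised to mean one, a single variable, and a hypothetical $121 \times 240$ equiangular grid; the exact normalisation and reduction used in the paper's code are not stated here.

```python
import numpy as np

def latitude_weights(lat_deg):
    """cos(latitude) weights, normalised so that they average to one."""
    w = np.cos(np.deg2rad(lat_deg))
    return w / w.mean()

def lat_weighted_mse(pred, target, lat_deg):
    """Mean squared error over a (lat, lon) grid, with each row weighted by latitude."""
    w = latitude_weights(lat_deg)[:, None]        # broadcast over longitude
    return float(np.mean(w * (pred - target) ** 2))

# Hypothetical 1.5-degree grid including both poles.
lat = np.linspace(-90.0, 90.0, 121)
n_lon = 240
target = np.zeros((lat.size, n_lon))

uniform_error = np.ones_like(target)              # error of 1 at every grid point
equator_error = np.zeros_like(target)
equator_error[60, :] = 1.0                        # error only on the equator row
pole_error = np.zeros_like(target)
pole_error[0, :] = 1.0                            # error only on the south-pole row

loss_uniform = lat_weighted_mse(uniform_error, target, lat)
loss_equator = lat_weighted_mse(equator_error, target, lat)
loss_pole = lat_weighted_mse(pole_error, target, lat)
```

Because the weights average to one, a uniform unit error gives a loss of one, while the same error confined to a polar row counts far less than on the equator, reflecting the smaller area those grid cells cover.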
Scaling analysis (new Section 5.5): Addresses the AE's question of what the Neural ODE adds over a discrete transformer with finer sampling. We sweep solver order (Euler vs RK4: $\approx$5\% RMSE reduction), transformer depth ($L=1$ vs $L=8$: $\approx$20\% RMSE reduction), and trajectory time-grid density. Forecast skill improves along all three axes, providing concrete evidence that the continuous formulation contributes beyond fixed-interval discrete sampling.
Code-paper alignment: All tensor shapes and operations have been re-derived from the implementation. The paper now reflects the code, and the code reflects the paper. The training and evaluation pipelines have been restructured to match the paper's description, and all hyperparameters are explicitly stated in the text (Section 5.1) and Table 1.
Fair baselines: Following the WB2 evaluation protocol (Rasp et al., 2024), all baselines (GraphCast and Pangu-Weather at $0.25^\circ$, IFS-HRES at $0.1^\circ$) are first regridded to a common $1.5^\circ$ grid before scoring (Section 5.4). Parameter counts and device comparisons are in Table 2.
Assigned Action Editor: ~Dit-Yan_Yeung2
Submission Number: 9208