MATEY: multiscale adaptive transformer models for spatiotemporal physical systems

TMLR Paper5023 Authors

03 Jun 2025 (modified: 17 Jul 2025) · Under review for TMLR · CC BY 4.0
Abstract: Accurately representing the multiscale features of spatiotemporal physical systems with vision transformer (ViT) architectures requires extremely long, computationally prohibitive token sequences. To address this issue, we propose two adaptive tokenization schemes that dynamically adjust patch sizes based on local features: one guarantees convergence to uniform patch refinement, while the other offers better computational efficiency. Moreover, we present a set of spatiotemporal attention schemes, in which the temporal or axial spatial dimensions are decoupled, to evaluate their baseline computational and data efficiencies and to determine whether adaptive tokenization can improve on those baselines. We assess the performance of the proposed multiscale adaptive model, MATEY, in a sequence of experiments. We find that, compared to a full spatiotemporal attention scheme or a scheme that decouples only the temporal dimension, fully decoupled axial attention is less efficient and less expressive, requiring more training time and model weights to reach the same accuracy. The experiments on the adaptive tokenization schemes show that, compared to a uniformly refined model, the proposed schemes achieve comparable or better accuracy at a much lower cost. Finally, in two fine-tuning tasks featuring different physics, we demonstrate that models pretrained on PDEBench data outperform models trained from scratch, especially in the low-data regime with frozen attention.
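For intuition, the sketch below illustrates one way feature-adaptive tokenization can work; it is not the authors' implementation. Coarse patches are subdivided (one level, for brevity) wherever a local feature criterion exceeds a cutoff, so tokens concentrate where the field has fine-scale structure. The variance criterion and the threshold `tau` are illustrative assumptions.

```python
import torch

def adaptive_patches(field: torch.Tensor, coarse: int = 16, tau: float = 1e-3):
    """Split a 2D field (H, W) into patches; refine high-variance ones.

    Returns a list of (y, x, size) patch descriptors.
    """
    H, W = field.shape
    patches = []
    for y in range(0, H, coarse):
        for x in range(0, W, coarse):
            block = field[y:y + coarse, x:x + coarse]
            if block.var() > tau:  # local feature criterion (assumed)
                half = coarse // 2  # subdivide into four finer patches
                for dy in (0, half):
                    for dx in (0, half):
                        patches.append((y + dy, x + dx, half))
            else:
                patches.append((y, x, coarse))
    return patches

# Example: a field with fine-scale structure in one corner is refined there only.
field = torch.zeros(64, 64)
field[:16, :16] = torch.rand(16, 16)
tokens = adaptive_patches(field)
print(len(tokens), "adaptive patches vs", (64 // 16) ** 2, "uniform coarse patches")
```

Under this assumed criterion, the token count stays close to the coarse baseline on smooth regions and grows only near sharp features, which is the cost advantage over uniform refinement that the abstract describes.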
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jean_Kossaifi1
Submission Number: 5023