Higher Order Transformers With Kronecker-Structured Attention

TMLR Paper 5420 Authors

18 Jul 2025 (modified: 22 Oct 2025) · Decision pending for TMLR · CC BY 4.0
Abstract: Modern datasets are increasingly high-dimensional and multiway, often represented as tensor-valued data with multi-indexed variables. While Transformers excel in sequence modeling and high-dimensional tasks, their direct application to multiway data is computationally prohibitive due to the quadratic cost of dot-product attention and the need to flatten inputs, which disrupts tensor structure and cross-dimensional dependencies. We propose the Higher-Order Transformer (HOT), a novel factorized attention framework that represents multiway attention as sums of Kronecker products or sums of mode-wise attention matrices. HOT efficiently captures dense and sparse relationships across dimensions while preserving tensor structure. Theoretically, HOT retains the expressiveness of full high-order attention and allows complexity control via factorization rank. Experiments on 2D and 3D datasets show that HOT achieves competitive performance in multivariate time series forecasting and image classification, with significantly reduced computational and memory costs. Visualizations of mode-wise attention matrices further reveal interpretable high-order dependencies learned by HOT, demonstrating its versatility for complex multiway data across diverse domains.
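
To make the factorized attention described in the abstract concrete, the sketch below shows how a sum of Kronecker products of mode-wise attention matrices can be applied to a 2D (matrix-valued) input of shape $N_1 \times N_2 \times D$ without ever materializing the full $(N_1 N_2) \times (N_1 N_2)$ attention matrix. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`mode_attention`, `kron_factored_attention`), the mean-pooling used to form per-mode queries and keys, and the parameter layout are all hypothetical simplifications.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mode_attention(X, Wq, Wk, mode):
    """Attention matrix over a single tensor mode.

    X has shape (N1, N2, D); queries/keys for the chosen mode are formed
    by mean-pooling over the other mode (an illustrative simplification).
    """
    ctx = X.mean(axis=1 - mode)                      # (N_mode, D)
    Q, K = ctx @ Wq, ctx @ Wk                        # (N_mode, d)
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N_mode, N_mode)

def kron_factored_attention(X, params, rank):
    """Apply a rank-R sum of Kronecker products of mode-wise attention
    matrices to X, without materializing the (N1*N2) x (N1*N2) matrix."""
    out = np.zeros_like(X)
    for r in range(rank):
        (Wq1, Wk1), (Wq2, Wk2), Wv = params[r]
        A1 = mode_attention(X, Wq1, Wk1, mode=0)     # (N1, N1)
        A2 = mode_attention(X, Wq2, Wk2, mode=1)     # (N2, N2)
        V = X @ Wv                                   # (N1, N2, D) values
        # Sequential mode products: cost O((N1^2*N2 + N1*N2^2) * D)
        # instead of O((N1*N2)^2 * D) for attention over the flattened input.
        tmp = np.einsum('ij,jkd->ikd', A1, V)        # mix along mode 1
        out += np.einsum('kl,ild->ikd', A2, tmp)     # mix along mode 2
    return out

# Toy usage: a 12 x 16 grid of 32-dim tokens, rank-2 factorization.
rng = np.random.default_rng(0)
N1, N2, D, d, R = 12, 16, 32, 8, 2
X = rng.standard_normal((N1, N2, D))
params = [((rng.standard_normal((D, d)), rng.standard_normal((D, d))),
           (rng.standard_normal((D, d)), rng.standard_normal((D, d))),
           rng.standard_normal((D, D))) for _ in range(R)]
Y = kron_factored_attention(X, params, R)            # shape (12, 16, 32)
```

The sequential mode products compute, per feature channel, $A_1 V A_2^{\top}$, which is equivalent (up to vectorization ordering) to multiplying the flattened input by the Kronecker product of the two mode-wise attention matrices; summing over the rank index gives the sum-of-Kronecker-products structure.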
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: Since the previous submission, we have incorporated several clarifications, additional experiments, and improvements based on reviewer feedback:

1. **Scalability and New Dataset**: We added results on the SSL4EOL benchmark dataset for pixel-wise segmentation of multispectral satellite images ($264 \times 264 \times 7$ input size), demonstrating that HOT outperforms ViT and performs on par with ResNet variants while requiring significantly fewer parameters and less computation.
2. **Memory and Runtime Analysis**: We updated Figure 7 to include detailed training and inference time (ms/sample) and GPU memory footprint (GB) comparisons across HOT, ViT, and ResNet models. We also added a new plot in Figure 6 showing how input size affects FLOPs, highlighting the computational efficiency of HOT's Kronecker attention compared to full and divided attention mechanisms.
3. **Additional Baseline Comparisons**: We included MViT and TimeSformer results on MedMNIST3D, showing that HOT consistently outperforms these models on all 3D benchmarks.
4. **Hyperparameter Selection**: We clarified that the number of attention heads (the factorization rank) is treated as a hyperparameter, optimized via grid search for both HOT and the vanilla (non-factorized) transformer. Figure 5 has been updated to include vanilla non-factorized attention results.
5. **Complexity Expression**: The missing hidden dimension $D$ was added to the attention complexity formula in Section 4.1 (see the illustrative sketch after this list).
6. **Contributions and Presentation**: The introduction now contains a more detailed and finer-grained list of contributions. System details, previously in Section B.3 of the appendix, have been moved to the main body for improved clarity.
7. **Typos and Text Corrections**: All text errors and typos pointed out by reviewers have been corrected.
8. **Broader Impact Statement**: We included a broader impact statement discussing potential benefits, risks, and future directions for the safe and ethical use of HOT in multiway data applications.

These changes aim to improve clarity, reproducibility, and completeness, and further demonstrate HOT's effectiveness, efficiency, and applicability across multiple domains.
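Regarding item 5, the corrected expression itself is in Section 4.1 of the paper; the block below only sketches the general form such a comparison takes for an $M$-way input with assumed notation (mode sizes $N_1, \dots, N_M$, hidden dimension $D$, factorization rank $R$). The exact formula and constants in the paper may differ.

```latex
% Illustrative only -- notation (N_m, D, R) is assumed, not quoted from Section 4.1.
\begin{align*}
  \text{full attention over the flattened tensor:} \quad
    & \mathcal{O}\!\Big(\Big(\textstyle\prod_{m=1}^{M} N_m\Big)^{2} D\Big), \\
  \text{mode-wise (Kronecker-factorized) attention:} \quad
    & \mathcal{O}\!\Big(R\, D \textstyle\sum_{m=1}^{M} N_m^{2} \prod_{k \neq m} N_k\Big).
\end{align*}
```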
Assigned Action Editor: ~Hankook_Lee1
Submission Number: 5420