Abstract: Modeling long sequences of high-order data calls for architectures more efficient than Transformers. In this paper, we investigate two key aspects of extending linear recurrent models, particularly those originally designed for long-context language modeling, to high-order data: scanning strategies and attention-hybrid architectures. Empirical results suggest that scanning provides limited benefits, whereas attention-hybrid models yield promising results. Focusing on the latter, we further evaluate which types of attention are worth integrating and find that tiled high-order sliding window attention (SWA) is efficient in both theory and practice. We term the resulting hybrid of linear recurrence and high-order SWA Efficient N-dimensional Attention (ENA), and conduct several experiments to evaluate its effectiveness. With performance comparable to Transformers and higher efficiency, ENA offers a promising and practical solution for ultra-long high-order data modeling.
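To make the hybrid design concrete, the sketch below illustrates one plausible reading of an ENA-style block: a gated linear recurrence for global mixing followed by attention restricted to local 2-D tiles. This is a minimal, hypothetical sketch and not the paper's implementation; all class and parameter names (LinearRecurrence, TiledSWA, ENABlock, tile) are placeholders, the 2-D case stands in for general N-dimensional data, and the Python-level sequential scan is for clarity only, whereas a real implementation would use fused parallel kernels.

```python
# Minimal illustrative sketch of a hybrid "linear recurrence + tiled SWA" block.
# Assumptions: 2-D data flattened row-major to (B, H*W, D); names are hypothetical.
import torch
import torch.nn as nn


class LinearRecurrence(nn.Module):
    """Elementwise gated linear recurrence: h_t = a_t * h_{t-1} + b_t * x_t."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 2 * dim)

    def forward(self, x):                        # x: (B, L, D)
        a, b = torch.sigmoid(self.gate(x)).chunk(2, dim=-1)
        h, out = torch.zeros_like(x[:, 0]), []
        for t in range(x.size(1)):               # sequential scan, for clarity only
            h = a[:, t] * h + b[:, t] * x[:, t]
            out.append(h)
        return torch.stack(out, dim=1)


class TiledSWA(nn.Module):
    """Self-attention restricted to non-overlapping 2-D tiles (local windows)."""

    def __init__(self, dim: int, heads: int = 4, tile: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tile = tile

    def forward(self, x, H: int, W: int):        # x: (B, H*W, D)
        B, _, D = x.shape
        t = self.tile
        # split the flattened grid into (H/t) x (W/t) tiles of t*t tokens each
        x = x.view(B, H // t, t, W // t, t, D).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(B * (H // t) * (W // t), t * t, D)
        x, _ = self.attn(x, x, x)                # attention only within each tile
        x = x.view(B, H // t, W // t, t, t, D).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H * W, D)


class ENABlock(nn.Module):
    """Hybrid block: linear recurrence for global mixing, tiled SWA for local detail."""

    def __init__(self, dim: int):
        super().__init__()
        self.rec, self.swa = LinearRecurrence(dim), TiledSWA(dim)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, H: int, W: int):
        x = x + self.rec(self.norm1(x))
        x = x + self.swa(self.norm2(x), H, W)
        return x


if __name__ == "__main__":
    B, H, W, D = 2, 16, 16, 64
    x = torch.randn(B, H * W, D)
    print(ENABlock(D)(x, H, W).shape)            # torch.Size([2, 256, 64])
```

The design intent the sketch tries to convey is that the recurrence scales linearly in sequence length while the attention cost is bounded by the tile size, so the combined block stays sub-quadratic for ultra-long high-order inputs.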
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Meisam_Razaviyayn1
Submission Number: 6131