High-Order Self-Attention Mechanism: A Deep Attention Model in Extended Parameter Space

ACL ARR 2025 May Submission 1761 Authors

18 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Since the introduction of self-attention in 2016, Transformer-based pre-trained models have achieved remarkable success, driving breakthroughs across a wide range of NLP tasks. Inspired by how graph attention networks aggregate information from neighboring nodes, we revisit the self-attention mechanism and explore its potential for capturing higher-order relationships in sequence modeling. Specifically, we propose a novel High-Order Self-Attention mechanism that enhances the expressive power of standard self-attention through multiple self-attention aggregations combined with positional embeddings. Integrating this mechanism into self-attention-based models during pre-training with limited data and model capacity yields up to a 35% improvement in accuracy for RoBERTa on masked token prediction and up to a 75% increase in ROUGE-2 for GPT-2 on its pre-training task, under identical experimental conditions, demonstrating the robustness and efficiency of the proposed method even in low-resource settings. The mechanism also enables a novel parameter stacking approach that allows models to train more efficiently and scale more readily. These findings demonstrate the potential of High-Order Self-Attention for advancing sequence modeling and pre-training workflows.
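The abstract describes the mechanism only at a high level (multiple self-attention aggregations plus positional embeddings). The sketch below is one possible reading of that idea, not the authors' implementation: it applies a shared attention layer `order` times, injecting a learned positional embedding before each aggregation. The class name, the `order` hyperparameter, the per-step embedding tables, and the residual/normalization choices are all illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's code): "high-order" attention
# as K repeated self-attention aggregations with positional embeddings.
import torch
import torch.nn as nn


class HighOrderSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, order: int = 2, max_len: int = 512):
        super().__init__()
        self.order = order
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One positional embedding table per aggregation step (assumption).
        self.pos = nn.ModuleList(nn.Embedding(max_len, d_model) for _ in range(order))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        for k in range(self.order):
            h = x + self.pos[k](positions)   # inject positional signal for step k
            out, _ = self.attn(h, h, h)      # k-th self-attention aggregation
            x = self.norm(x + out)           # residual connection + normalization
        return x


if __name__ == "__main__":
    layer = HighOrderSelfAttention(d_model=64, n_heads=4, order=2)
    tokens = torch.randn(2, 16, 64)
    print(layer(tokens).shape)  # torch.Size([2, 16, 64])
```

Reusing a single attention module across aggregation steps is consistent with the abstract's framing of stacking parameters rather than adding independent layers, but the exact weight-sharing scheme used in the paper is not specified here.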
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: self-attention, robustness, scaling
Contribution Types: Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 1761