Keywords: sparse attention, Transformer optimization, long-sequence processing, hierarchical chunking, dynamic gating
TL;DR: This paper proposes a novel optimization framework that integrates a dynamic sparse attention mechanism with a hierarchical chunking technique.
Abstract: The Transformer model, while effective at capturing long-range dependencies, faces significant challenges in processing ultra-long sequence data (e.g., 10k+ time steps) due to its quadratic computational complexity O(n^2) and excessive memory demands. To address these limitations, this paper proposes a novel optimization framework that integrates a dynamic sparse attention mechanism with a hierarchical chunking technique. The dynamic sparse attention employs a learnable gating module to adaptively prune redundant attention heads, eliminating unnecessary computation. The hierarchical chunking strategy divides sequences into localized blocks and introduces lightweight cross-block interactions, balancing efficiency against global dependency modeling. Experiments on machine translation (WMT 2014 En-De), time-series forecasting (ETTh1), and text classification (IMDb) demonstrate that the proposed method achieves a 2.19× training speedup and a 25% reduction in peak GPU memory usage compared to the vanilla Transformer, while maintaining competitive accuracy (e.g., the BLEU-4 score drops by only 0.2 on translation). Ablation studies validate the synergistic benefits of combining dynamic sparsity and chunking. Additionally, adaptive block-size adjustment further optimizes memory efficiency without compromising performance. This work provides a scalable solution for deploying Transformer-based models in resource-constrained scenarios such as edge computing for healthcare and financial analytics.
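
The abstract describes the gating module only at a high level, so the following is a minimal PyTorch sketch of one way a learnable per-head gate for dynamic sparse attention could be realized. The class names, the mean-pooled gating input, and the straight-through hard threshold are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeadGate(nn.Module):
    """Hypothetical learnable gate that scores attention heads per input
    and prunes low-scoring heads. A sketch, not the paper's code."""

    def __init__(self, d_model: int, n_heads: int, tau: float = 1.0):
        super().__init__()
        self.scorer = nn.Linear(d_model, n_heads)  # one logit per head
        self.tau = tau                             # temperature for the soft gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> one summary vector per sequence
        pooled = x.mean(dim=1)                     # (batch, d_model)
        logits = self.scorer(pooled)               # (batch, n_heads)
        soft = torch.sigmoid(logits / self.tau)    # differentiable gate in (0, 1)
        hard = (soft > 0.5).float()                # pruned heads get weight 0
        # Straight-through estimator: hard values forward, soft gradients backward.
        return hard + soft - soft.detach()         # (batch, n_heads)


class GatedMultiheadAttention(nn.Module):
    """Multi-head self-attention whose per-head outputs are scaled by the gate."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.gate = HeadGate(d_model, n_heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (t.view(b, n, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads = attn @ v                            # (batch, heads, seq, head_dim)
        g = self.gate(x).view(b, self.n_heads, 1, 1)
        heads = heads * g                           # zero out pruned heads
        return self.out(heads.transpose(1, 2).reshape(b, n, d))
```

Note that this sketch only zeroes the outputs of pruned heads; to obtain an actual speedup like the one reported, a real implementation would also skip computing the attention for heads whose gate is closed.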
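Likewise, the hierarchical chunking strategy and its lightweight cross-block interaction are only named in the abstract, not specified. The sketch below assumes one common realization: dense attention inside fixed-size blocks, followed by a much smaller attention pass over per-block summary tokens. The block_size default and the mean-pooled summaries are illustrative choices rather than the paper's design.

```python
import torch
import torch.nn as nn


class ChunkedAttentionLayer(nn.Module):
    """Hypothetical hierarchical-chunking layer: full attention inside each block,
    plus a lightweight attention step over per-block summaries for global context."""

    def __init__(self, d_model: int, n_heads: int, block_size: int = 256):
        super().__init__()
        self.block_size = block_size
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        bs = self.block_size
        pad = (-n) % bs                                  # pad so n divides evenly
        if pad:
            # A real implementation would also mask the padded positions.
            x = torch.cat([x, x.new_zeros(b, pad, d)], dim=1)
        n_blocks = x.shape[1] // bs

        # Local pass: cost ~ n_blocks * bs^2 instead of n^2.
        blocks = x.view(b * n_blocks, bs, d)
        local, _ = self.local_attn(blocks, blocks, blocks)

        # Global pass: one mean-pooled summary per block attends to all summaries.
        summaries = local.view(b, n_blocks, bs, d).mean(dim=2)   # (b, n_blocks, d)
        ctx, _ = self.global_attn(summaries, summaries, summaries)

        # Broadcast block-level context back to every token in its block.
        out = local.view(b, n_blocks, bs, d) + ctx.unsqueeze(2)
        return out.reshape(b, n_blocks * bs, d)[:, :n]           # drop padding
```

Under this reading, a block size b gives a local pass that scales as O(n·b) and a global pass as O((n/b)^2), which is also why an adaptive block-size adjustment, as mentioned in the abstract, is a natural knob for trading memory against accuracy.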
Submission Number: 24