Abstract: Large language models (LLMs) have demonstrated remarkable progress in generating high-quality natural language through extensive pre-training over Transformer architectures. However, the quadratic complexity of the Transformer with respect to sequence length greatly limits its ability to be trained efficiently on long sequences. To this end, we divide the input sequence of the Transformer network into two distinct components: the target part, used for next-token prediction, and the memory part, which serves as the conditional context for predicting the target part. Building on this decomposition, we analyze the statistical patterns of attention in long-context modeling and find that the sparsity of attention between the memory part and the target part is highly positively correlated with sequence length. We summarize this observation as the Pareto Principle of the Transformer. Accordingly, in this paper we introduce Sparse Training, a simple training technique that reduces the complexity of Transformer models for long-sequence generalization by sparsifying the memory part. Specifically, we apply a sampling policy over the memory part whose density decays with distance from the target part, producing a sparse memory while preserving the original token positions. Without any architectural modifications, our method enables existing Transformer-based LLMs to capture long-range dependencies within a fixed window size during training. Experimental results on multiple datasets further demonstrate the effectiveness and efficiency of Sparse Training in mitigating the complexity of the Transformer network when modeling long-sequence dependencies.
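To make the distance-decaying sampling policy concrete, the sketch below shows one way a memory part could be subsampled while preserving original positions. The function name `sample_sparse_memory`, the exponential keep-probability schedule, and the `decay` hyperparameter are illustrative assumptions, not the paper's exact policy.

```python
# Minimal sketch, assuming an exponential decay of the keep probability
# with distance from the target part; the paper's actual schedule may differ.
import torch


def sample_sparse_memory(seq_len: int, target_len: int, decay: float = 0.01,
                         generator: torch.Generator | None = None):
    """Return kept memory indices (original positions preserved) and target indices."""
    memory_len = seq_len - target_len
    # Distance of each memory token from the start of the target part.
    distance = torch.arange(memory_len, 0, -1, dtype=torch.float32)
    keep_prob = torch.exp(-decay * distance)              # decays with distance
    kept_mask = torch.rand(memory_len, generator=generator) < keep_prob
    memory_idx = torch.nonzero(kept_mask, as_tuple=True)[0]  # original positions
    target_idx = torch.arange(memory_len, seq_len)
    return memory_idx, target_idx


# Usage: gather the sampled tokens but feed their *original* position ids,
# so positional encodings still reflect the true long-range offsets.
mem_idx, tgt_idx = sample_sparse_memory(seq_len=8192, target_len=1024)
kept_positions = torch.cat([mem_idx, tgt_idx])            # ids for position embeddings
```

Keeping the original position ids for the sampled memory tokens is what lets the model see long-range offsets while the actual attention window stays fixed in size.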
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Large Language Models, Long Sequence, Length Extrapolation, Efficiency
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 454