Abstract: Large language models (LLMs) have demonstrated remarkable progress in generating high-quality natural language through extensive pre-training over Transformer architectures. However, the quadratic complexity of the Transformer with respect to sequence length greatly limits its ability to be trained efficiently on long sequences. To this end, we divide the input sequence of the Transformer network into two distinct components: the target part, used for next-token prediction, and the memory part, which serves as the conditional context for predicting the target part. Building on this decomposition, we analyze the statistical patterns of attention in long-context modeling and find that the sparsity of attention between the memory part and the target part is highly positively correlated with sequence length. We summarize this observation as the Pareto Principle of the Transformer. Accordingly, in this paper we introduce Sparse Training, a simple training technique that reduces the complexity of Transformer models for long-sequence generalization by sparsifying the memory part. Specifically, we apply a sampling policy over the memory part whose density decays with distance from the target part, producing a sparse memory while preserving the original token positions. Without any architectural modifications, our method enables existing Transformer-based LLMs to capture long-range dependencies within a fixed window size during training. Experimental results on multiple datasets further demonstrate the effectiveness and efficiency of Sparse Training in mitigating the complexity of the Transformer network when modeling long-sequence dependencies.
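To make the distance-decaying sampling policy concrete, the sketch below shows one way a memory part could be subsampled while preserving original positions. The function name `sample_sparse_memory`, the exponential keep-probability schedule, and the `decay` hyperparameter are illustrative assumptions, not the paper's exact policy.

```python
# Minimal sketch, assuming an exponential decay of the keep probability
# with distance from the target part; the paper's actual schedule may differ.
import torch


def sample_sparse_memory(seq_len: int, target_len: int, decay: float = 0.01,
                         generator: torch.Generator | None = None):
    """Return kept memory indices (original positions preserved) and target indices."""
    memory_len = seq_len - target_len
    # Distance of each memory token from the start of the target part.
    distance = torch.arange(memory_len, 0, -1, dtype=torch.float32)
    keep_prob = torch.exp(-decay * distance)              # decays with distance
    kept_mask = torch.rand(memory_len, generator=generator) < keep_prob
    memory_idx = torch.nonzero(kept_mask, as_tuple=True)[0]  # original positions
    target_idx = torch.arange(memory_len, seq_len)
    return memory_idx, target_idx


# Usage: gather the sampled tokens but feed their *original* position ids,
# so positional encodings still reflect the true long-range offsets.
mem_idx, tgt_idx = sample_sparse_memory(seq_len=8192, target_len=1024)
kept_positions = torch.cat([mem_idx, tgt_idx])            # ids for position embeddings
```

Keeping the original position ids for the sampled memory tokens is what lets the model see long-range offsets while the actual attention window stays fixed in size.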
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Large Language Models, Long Sequence, Length Extrapolation, Efficiency
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 454