Abstract: Large language models (LLMs) have made significant strides on complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs require months of pretraining on a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computation, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that reduces Floating Point Operations (FLOPs) by about $75\%$ while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, proceeding in three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections. Throughout these phases, the model is trained with a dynamically evolving sparse topology and an HSA mechanism, maintaining performance while minimizing training FLOPs. Our experiments on GPT-2 show a $4\times$ FLOP reduction without compromising performance.
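Below is a minimal, purely illustrative sketch of the kind of three-phase sparsity schedule the abstract describes (warm-up, ultra-sparsification, restoration). The function name `mst_sparsity`, the phase fractions, the peak sparsity, and the cubic ramps are assumptions for illustration only; the paper's actual update formulas are given in *Section 3.1*.

```python
# Illustrative sketch (NOT the paper's exact formulas): a three-phase
# sparsity schedule in the spirit of MST. Phase boundaries, peak sparsity,
# and the cubic ramps below are hypothetical choices for demonstration.

def mst_sparsity(step: int,
                 total_steps: int,
                 s_max: float = 0.75,
                 warmup_frac: float = 0.1,
                 restore_frac: float = 0.1) -> float:
    """Return a target weight sparsity for the given training step.

    Phases:
      1. warm-up:              sparsity rises from 0 to s_max (cubic ramp),
      2. ultra-sparsification: sparsity held at s_max,
      3. restoration:          sparsity decays from s_max back to 0 (cubic).
    """
    warmup_end = int(warmup_frac * total_steps)
    restore_start = int((1.0 - restore_frac) * total_steps)

    if step < warmup_end:                      # phase 1: dense -> sparse
        t = step / max(warmup_end, 1)
        return s_max * t ** 3
    if step < restore_start:                   # phase 2: train ultra-sparse
        return s_max
    # phase 3: gradually reinstate connections
    t = (total_steps - step) / max(total_steps - restore_start, 1)
    return s_max * t ** 3


if __name__ == "__main__":
    T = 100_000
    for s in (0, 5_000, 50_000, 95_000, 100_000):
        print(s, round(mst_sparsity(s, T), 3))
```

In this sketch, the schedule only produces a target sparsity level; in a DST setup that target would be enforced by periodically pruning and regrowing connections, which is where the dynamically evolving sparse topology comes in.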
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: - The wall-clock times of the algorithms in our experiments are included in *Appendix B.1*.
- The cubic update formulas in *Section 3.1* are revised, and explanations of their design are added to *Section 3.1*.
- The link to our code is included in *Section 4*.
- A more detailed explanation of why MST is orthogonal to hardware-level and system-level accelerations is included in *Section 5*.
- Author information and acknowledgements are added.
Code: https://github.com/hupihe/Mixed-Sparsity-Training
Assigned Action Editor: ~Vincent_Tan1
Submission Number: 3545