Abstract: Large language models (LLMs) have made significant strides on complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs require months of pretraining on a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computation, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that reduces Floating Point Operations (FLOPs) by about $75\%$ while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, proceeding in three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections. Throughout these phases, the model is trained with a dynamically evolving sparse topology and an HSA mechanism, maintaining performance while minimizing training FLOPs. Our experiments on GPT-2 show a $4\times$ FLOP reduction without compromising performance.
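Below is a minimal, purely illustrative sketch of the kind of three-phase sparsity schedule the abstract describes (warm-up, ultra-sparsification, restoration). The function name `mst_sparsity`, the phase fractions, the peak sparsity, and the cubic ramps are assumptions for illustration only; the paper's actual update formulas are given in *Section 3.1*.

```python
# Illustrative sketch (NOT the paper's exact formulas): a three-phase
# sparsity schedule in the spirit of MST. Phase boundaries, peak sparsity,
# and the cubic ramps below are hypothetical choices for demonstration.

def mst_sparsity(step: int,
                 total_steps: int,
                 s_max: float = 0.75,
                 warmup_frac: float = 0.1,
                 restore_frac: float = 0.1) -> float:
    """Return a target weight sparsity for the given training step.

    Phases:
      1. warm-up:              sparsity rises from 0 to s_max (cubic ramp),
      2. ultra-sparsification: sparsity held at s_max,
      3. restoration:          sparsity decays from s_max back to 0 (cubic).
    """
    warmup_end = int(warmup_frac * total_steps)
    restore_start = int((1.0 - restore_frac) * total_steps)

    if step < warmup_end:                      # phase 1: dense -> sparse
        t = step / max(warmup_end, 1)
        return s_max * t ** 3
    if step < restore_start:                   # phase 2: train ultra-sparse
        return s_max
    # phase 3: gradually reinstate connections
    t = (total_steps - step) / max(total_steps - restore_start, 1)
    return s_max * t ** 3


if __name__ == "__main__":
    T = 100_000
    for s in (0, 5_000, 50_000, 95_000, 100_000):
        print(s, round(mst_sparsity(s, T), 3))
```

In this sketch, the schedule only produces a target sparsity level; in a DST setup that target would be enforced by periodically pruning and regrowing connections, which is where the dynamically evolving sparse topology comes in.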
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: - The wall-clock times of the algorithms in our experiments are included in *Appendix B.1*.
- The cubic update formulas in *Section 3.1* are revised, and explanations of their design are added to *Section 3.1*.
- The link to our code is included in *Section 4*.
- A more detailed explanation of why MST is orthogonal to hardware-level and system-level accelerations is included in *Section 5*.
- Author information and acknowledgements are added.
Code: https://github.com/hupihe/Mixed-Sparsity-Training
Assigned Action Editor: ~Vincent_Tan1
Submission Number: 3545