Sparsity has become a promising method to compress and accelerate Deep Neural Networks (DNNs). Structured sparsity has garnered significant interest as a result of relatively modest hardware overhead and improved efficiency on contemporary DNN accelerators. In particular, N:M sparsity is attractive because hardware accelerator architectures can harness specific variations of N:M structured sparsity to enhance computational efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the DNN memory footprint owing to its modest representation overhead. Although there have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions (∼50%). As a consequence, the performance of models trained using these approaches tends to decline when confronted with high-sparsity regions. In this work, we extensively study the effectiveness of existing training recipes for N:M structured sparsity at high-sparsity regions and argue that these methods fail to sustain model quality on par with low-sparsity regions. We demonstrate that the significant factor contributing to this disparity is the presence of elevated levels of induced noise in the gradient magnitudes. To mitigate this undesirable effect, we present two new sparse training recipes, namely Mask Decay Gradient Flow (MdGf) and Structure Decay Gradient Flow (SdGf), which employ decay mechanisms to progressively restrict the flow of gradients. Our results demonstrate that enabling the propagation of gradients plays a crucial role in preserving superior model performance while simultaneously attaining a high level of sparsity. Our evaluations of diverse sparsity configurations demonstrate that the proposed methods consistently achieve SOTA accuracy against conventional sparse recipes in a range of attention-based models used for various tasks encompassing both vision (up to ∆(Acc) ~2%) and language (up to ∆(Acc) ~5%). We provide the anonymized code at https://anonymous.4open.science/r/n_m_decay_1605-E77F
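To make the N:M structure and the decay idea concrete, the sketch below (in PyTorch) builds a 2:4 magnitude-based mask and applies a hypothetical linear decay factor to the pruned positions so gradients can still flow through them early in training. This is an illustrative example only, not the authors' implementation; the function names, the 2:4 setting, and the linear schedule are assumptions for exposition.

```python
# Illustrative sketch (not the paper's code): N:M structured-sparsity mask
# plus a hypothetical "mask decay" factor that gradually shrinks pruned
# weights instead of zeroing them outright, so gradients keep flowing early on.
import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m consecutive
    weights along the last dimension (e.g., 2:4 sparsity)."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "last dimension must be divisible by m"
    groups = weight.abs().reshape(out_features, in_features // m, m)
    topk = groups.topk(n, dim=-1).indices          # top-n magnitudes per group
    mask = torch.zeros_like(groups).scatter_(-1, topk, 1.0)
    return mask.reshape(out_features, in_features)

def decayed_weight(weight, mask, step, total_decay_steps=1000):
    """Hypothetical decay schedule: pruned positions are scaled by a factor
    that goes from 1 to 0, rather than being masked to zero immediately."""
    decay = max(0.0, 1.0 - step / total_decay_steps)
    return weight * (mask + (1.0 - mask) * decay)

w = torch.randn(8, 16)
mask = nm_mask(w, n=2, m=4)
print(mask.sum(dim=-1))  # each row keeps 2 of every 4 weights -> 8 of 16
```

In this toy setup, `decay = 1` at the start of training leaves the dense weights untouched, and `decay = 0` after the schedule recovers a strict 2:4 sparse layer; the intermediate values are what allow gradient information to propagate through pruned weights while sparsity is being imposed.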