Sparse Iso-FLOP Transformations for Maximizing Training Efficiency

21 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: sparsity, sparse training, efficient training
TL;DR: We introduce a family of Sparse Iso-FLOP Transformations which can be used as drop-in replacements for dense layers to improve their modeling capacity and FLOP efficiency. We obtain significant wins across both CV and NLP domains.
Abstract: Recent studies have explored the application of weight sparsity to enhance the training efficiency of DNNs in terms of test accuracy w.r.t. training FLOPs. These studies have focused on reducing training FLOPs, but training with sparse weights often results in accuracy degradation or requires prolonged training schedules to reach accuracy comparable to that of the original dense models, making the actual training efficiency gains less evident. In contrast, our work leverages sparsity to increase accuracy while maintaining the same FLOPs as the dense model, thereby demonstrating improved training efficiency through higher accuracy. We introduce Sparse-IFT, a family of Sparse Iso-FLOP Transformations that serve as drop-in replacements for dense layers, enhancing their representational capacity and FLOP efficiency. Each transformation is parameterized by a single hyperparameter (i.e., the sparsity level), offering a broader search space for identifying optimal sparse masks. Substituting dense layers with Sparse-IFT, without altering any training hyperparameters, yields substantial improvements across a range of computer vision and natural language processing tasks: ResNet-18 on ImageNet (+3.5%) and GPT-3 Small on WikiText-103 (-0.4 PPL), both matching larger dense models that use 2x or more FLOPs. To our knowledge, this is the first work to demonstrate the use of sparsity for improving the accuracy of dense models while maintaining a consistent training FLOP budget via a simple set of sparse transformations.
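To make the iso-FLOP idea concrete, the following is a minimal, illustrative sketch (not the authors' code) of one plausible drop-in replacement for a dense linear layer: a sparse two-matrix factorization whose hidden width is chosen so that its non-zero multiply-accumulates roughly equal the dense layer's FLOPs at a given sparsity level. The class name `SparseIsoFLOPLinear`, the static random masks, and the specific factorized form are assumptions for illustration; the paper's actual family of transformations and its mask-update schedule may differ.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseIsoFLOPLinear(nn.Module):
    """Illustrative drop-in replacement for nn.Linear: a sparse factorization
    whose non-zero FLOPs approximately match the original dense layer."""

    def __init__(self, in_features: int, out_features: int, sparsity: float = 0.75):
        super().__init__()
        assert 0.0 <= sparsity < 1.0
        # Choose hidden width h so that (1 - s) * h * (d_in + d_out) ~= d_in * d_out,
        # i.e., the non-zero multiply-accumulates equal the dense layer's FLOPs.
        hidden = max(1, round(in_features * out_features /
                              ((1.0 - sparsity) * (in_features + out_features))))
        self.up = nn.Parameter(torch.empty(hidden, in_features))
        self.down = nn.Parameter(torch.empty(out_features, hidden))
        nn.init.kaiming_uniform_(self.up, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.down, a=math.sqrt(5))
        # Static random masks for illustration only; sparse-training methods
        # typically update masks during training (e.g., prune-and-regrow).
        self.register_buffer("up_mask", (torch.rand_like(self.up) >= sparsity).float())
        self.register_buffer("down_mask", (torch.rand_like(self.down) >= sparsity).float())
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.linear(x, self.up * self.up_mask)
        return F.linear(h, self.down * self.down_mask, self.bias)


if __name__ == "__main__":
    dense = nn.Linear(512, 512)
    sparse = SparseIsoFLOPLinear(512, 512, sparsity=0.75)
    nnz = int(sparse.up_mask.sum() + sparse.down_mask.sum())
    print("dense weights:", dense.weight.numel(), "| sparse non-zeros:", nnz)
    print(sparse(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```

At 75% sparsity and 512x512 dimensions, the hidden width works out to 1024, giving roughly 512x512 non-zero weights in total, so the sparse replacement spends the same FLOP budget as the dense layer while spanning a wider representation.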
Supplementary Material: zip
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3870