Sparser, Better, Deeper, Stronger: Improving Sparse Training with Exact Orthogonal Initialization

22 Sept 2023 (modified: 11 Feb 2024), Submitted to ICLR 2024
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Sparse Training, Pruning, Orthogonal Initialization
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We present a new sparse orthogonal initialization based on random Givens rotations
Abstract: Sparse training aims to train sparse models from scratch and has achieved remarkable results in recent years. A key design choice in sparse training is the sparse initialization, which determines the trainable sub-network through a binary mask. Existing methods mainly revolve around selecting the mask based on a predefined dense weight initialization. However, such an approach may not efficiently leverage the mask's potential impact on training parameters and optimization. An alternative direction, inspired by research into dynamical isometry, is to introduce orthogonality in the sparse subnetwork. This helps prevent the gradient signal from vanishing or exploding, ultimately enhancing the reliability of the backpropagation process. In this work, we propose Exact Orthogonal Initialization (EOI), a novel sparse orthogonal initialization scheme based on composing random Givens rotations. In contrast to existing approaches, our method provides exact (not approximate) orthogonality and enables the creation of layers with arbitrary densities. Through experiments on contemporary network architectures, we demonstrate the effectiveness of EOI and show that it consistently outperforms other commonly used sparse initialization techniques. Furthermore, to showcase the full potential of our method, we show that it enables the training of highly sparse 1000-layer MLP and CNN networks without any residual connections or normalization techniques. Our research highlights the importance of weight initialization in sparse training, underscoring the vital part it plays alongside sparse mask selection.
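
The following is a minimal, illustrative sketch (not the paper's code) of the core idea described in the abstract: composing random Givens rotations yields a matrix that is exactly orthogonal yet sparse, with density controlled by the number of rotations. The function name, arguments, and the simple uniform sampling of coordinate pairs and angles are assumptions made for illustration; the actual EOI procedure for producing a layer with a target density may differ.

import numpy as np

def random_givens_orthogonal(n, num_rotations, seed=None):
    """Compose random Givens rotations into an n x n orthogonal matrix.

    Each Givens rotation mixes exactly two coordinates, so the product of a
    modest number of rotations stays sparse while being exactly orthogonal
    (up to floating-point error). The resulting density grows with
    num_rotations. Illustrative sketch only; not the paper's implementation.
    """
    rng = np.random.default_rng(seed)
    q = np.eye(n)
    for _ in range(num_rotations):
        # Choose two distinct coordinates and a random rotation angle.
        i, j = rng.choice(n, size=2, replace=False)
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        # Left-multiply q by the Givens rotation: only rows i and j change.
        row_i, row_j = q[i].copy(), q[j].copy()
        q[i] = c * row_i - s * row_j
        q[j] = s * row_i + c * row_j
    return q

if __name__ == "__main__":
    q = random_givens_orthogonal(64, num_rotations=64, seed=0)
    print(np.allclose(q.T @ q, np.eye(64)))   # orthogonality holds up to floating-point error
    print(np.count_nonzero(q) / q.size)       # fraction of non-zero entries (layer density)

One natural way to target a specific layer density in this sketch (an assumption on our part, not a claim about EOI's exact schedule) is to keep composing rotations until the fraction of non-zero entries reaches the desired level, then use the non-zero pattern as the binary mask of the sparse layer.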
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4859