Keywords: optimization, soft-thresholding, sparse training, structured sparsity
TL;DR: An adaptive regularization scheme that finds minimizers with a prescribed sparsity structure, enabling the pruning of foundation models.
Abstract: The recent trend of scaling neural networks to unprecedented sizes demands efficient structured sparsity for practical deployment, yet precise control of sparsity levels and patterns for hardware acceleration remains challenging. This paper introduces the Adaptive Soft-Thresholding Algorithm (ASTRA), which achieves a target sparsity by adapting group-wise regularization strength based on computationally inexpensive sparsity characterizations. We establish ASTRA's theoretical foundations, proving the existence of stable regularizations that realize the desired sparsity. We demonstrate sublinear and linear convergence rates for both the model parameters and the regularization weight in deterministic settings and, crucially, an almost sure $O(1/t)$ convergence rate in the practical stochastic-gradient setting. ASTRA provides a theoretically grounded method for direct, precise control over structured sparsity, enabling the pruning and fine-tuning of foundation models into Bonsai Networks: accelerator-friendly miniatures trained to match the original (teacher) model's outputs while preserving downstream performance.
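The abstract describes ASTRA only at a high level; the following minimal sketch is an illustrative assumption, not the paper's actual algorithm. It shows how a group-wise soft-thresholding (proximal) step, combined with a simple feedback update on the regularization weight driven by a cheap sparsity measurement, could steer a model toward a target sparsity. The names (`group_soft_threshold`, `astra_step`, `target_sparsity`, `adapt_rate`), the proximal-gradient outer loop, and the proportional adaptation rule are all hypothetical.

```python
import numpy as np

def group_soft_threshold(w, lam, groups):
    """Proximal operator of the group-lasso penalty: shrink each
    group's l2 norm by lam, zeroing groups whose norm is below lam.
    `groups` is a list of index arrays partitioning the parameters."""
    out = np.zeros_like(w)
    for g in groups:
        norm = np.linalg.norm(w[g])
        if norm > lam:
            out[g] = (1.0 - lam / norm) * w[g]
    return out

def astra_step(w, grad, lam, groups, lr, target_sparsity, adapt_rate):
    """One hypothetical ASTRA-style iteration: a proximal-gradient step,
    then an adaptive update of the regularization weight lam based on
    an inexpensive sparsity characterization (fraction of zero groups)."""
    w = group_soft_threshold(w - lr * grad, lr * lam, groups)
    sparsity = np.mean([np.allclose(w[g], 0.0) for g in groups])
    # Increase lam when below the target sparsity, decrease when above
    # (a proportional-control heuristic standing in for the paper's rule).
    lam = max(0.0, lam + adapt_rate * (target_sparsity - sparsity))
    return w, lam
```

Under this reading, the regularization weight acts as a control variable: rather than hand-tuning it, the loop adjusts it until the measured group sparsity matches the requested level.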
Supplementary Material: zip
Primary Area: optimization
Submission Number: 22992