3BASiL: An Algorithmic Framework for Sparse plus Low-Rank Compression of LLMs

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Sparse plus Low-Rank, Model Compression, Large Language Models, LoRA, PEFT, ADMM, Optimization
TL;DR: We introduce 3BASiL-TM, a highly efficient one-shot post-training method for Sparse plus Low-Rank decomposition of LLMs that reduces the WikiText2 perplexity gap to the dense model by over $30\%$ compared to prior methods.
Abstract: Sparse plus Low-Rank $(\mathbf{S} + \mathbf{L}\mathbf{R})$ decomposition of Large Language Models (LLMs) has emerged as a promising direction in $\textit{model compression}$, aiming to decompose pre-trained model weights into a sum of sparse and low-rank matrices, $\mathbf{W} \approx \mathbf{S} + \mathbf{LR}$. Despite recent progress, existing methods often suffer from substantial performance degradation compared to dense models. In this work, we introduce $\texttt{3BASiL-TM}$, an efficient one-shot post-training method for $(\mathbf{S} + \mathbf{L}\mathbf{R})$ decomposition of LLMs that addresses this gap. Our approach first introduces a novel 3-Block Alternating Direction Method of Multipliers (ADMM) method, termed $\texttt{3BASiL}$, to minimize the layer-wise reconstruction error with convergence guarantees. We then design a transformer-matching ($\texttt{TM}$) refinement step that jointly optimizes the sparse and low-rank components across transformer layers. This step minimizes a novel memory-efficient loss that aligns outputs at the transformer level. Notably, the $\texttt{TM}$ procedure is universal, as it can enhance any $(\mathbf{S} + \mathbf{L}\mathbf{R})$ decomposition, including pure sparsity. Our numerical experiments show that $\texttt{3BASiL-TM}$ reduces the WikiText2 perplexity gap to the dense LLaMA-8B model by over 30% under a (2:4 Sparse + 64 LR) configuration, compared to prior methods. Moreover, our method achieves over 2.5x faster compression runtime on an A100 GPU compared to the SOTA $(\mathbf{S} + \mathbf{L}\mathbf{R})$ method. Our code is available at https://github.com/mazumder-lab/3BASiL.
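To make the $\mathbf{W} \approx \mathbf{S} + \mathbf{LR}$ decomposition concrete, the sketch below shows a generic alternating scheme for splitting a weight matrix into a sparse part (largest-magnitude entries of the residual) and a rank-$r$ part (truncated SVD of the remaining residual). This is an illustrative baseline only, not the paper's 3-block ADMM (`3BASiL`) or its `TM` refinement; the function name, the magnitude-based sparsity projection, and the iteration count are all assumptions for illustration.

```python
import numpy as np

def sparse_plus_low_rank(W, sparsity=0.5, rank=4, iters=20):
    """Illustrative alternating sketch of an S + LR split (not 3BASiL's ADMM).

    S keeps the largest-magnitude entries of the residual W - LR;
    LR is the best rank-r fit to W - S (Eckart-Young, via truncated SVD).
    """
    m, n = W.shape
    L = np.zeros((m, rank))
    R = np.zeros((rank, n))
    k = int(round((1 - sparsity) * W.size))  # number of entries S may keep
    S = np.zeros_like(W)
    for _ in range(iters):
        # S-step: project the residual onto the k-sparse set by magnitude.
        res = W - L @ R
        S = np.zeros_like(W)
        if k > 0:
            idx = np.argpartition(np.abs(res).ravel(), -k)[-k:]
            S.ravel()[idx] = res.ravel()[idx]
        # LR-step: best rank-r approximation of what S did not capture.
        U, s, Vt = np.linalg.svd(W - S, full_matrices=False)
        L = U[:, :rank] * s[:rank]
        R = Vt[:rank]
    return S, L, R
```

Unlike this unstructured top-k projection, the paper's setting also covers structured patterns such as 2:4 sparsity (two nonzeros per group of four weights), which the S-step projection would enforce groupwise instead.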
Primary Area: Optimization (e.g., convex and non-convex, stochastic, robust)
Submission Number: 24116