Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

ACL ARR 2025 February Submission 844 Authors

11 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Recent Large Language Model (LLM) pruning methods typically operate in the post-training phase without expensive weight finetuning; however, their pruning criteria often rely on \textbf{heuristically hand-crafted metrics}, potentially leading to suboptimal performance. We instead propose a novel \textbf{optimization-based structural pruning} method that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve efficiency, our method \textbf{eliminates back-propagation} through the LLM \emph{per se} during the optimization, requiring only \textbf{the forward pass of the LLM}. We achieve this by learning an underlying \texttt{Bernoulli} distribution from which binary pruning masks are sampled; since the \texttt{Bernoulli} parameters are decoupled from the LLM loss, the distribution can be optimized efficiently via a \emph{policy gradient estimator} without back-propagation. As a result, our method is able to 1) \emph{support global and heterogeneous pruning} (i.e., it automatically determines different redundancy for different layers), and 2) \emph{optionally initialize with a metric-based method} (for our \texttt{Bernoulli} distributions). Extensive experiments on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models using the C4 and WikiText2 datasets demonstrate the promising efficiency and effectiveness of our method.
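To make the idea in the abstract concrete, the sketch below shows one way a Bernoulli mask distribution can be trained with a REINFORCE-style policy gradient, so that the model itself is only ever run forward. This is not the authors' implementation: the loss function, unit count, sparsity penalty, and all hyperparameters are illustrative assumptions, and a toy surrogate stands in for the frozen LLM's forward pass.

```python
# Hedged sketch: policy-gradient learning of Bernoulli pruning masks.
# Only the mask logits receive gradients; the "model" is a toy surrogate
# standing in for a frozen LLM evaluated purely in forward mode.
import torch

num_units = 1024          # assumed number of prunable structures (e.g., heads/channels)
target_keep = 0.5         # assumed fraction of units to keep
logits = torch.zeros(num_units, requires_grad=True)   # Bernoulli parameters
opt = torch.optim.Adam([logits], lr=1e-2)

# Toy stand-in for the pruned model's loss: pruning "important" units hurts more.
importance = torch.rand(num_units)

def evaluate_loss(mask: torch.Tensor) -> torch.Tensor:
    """Placeholder for a forward pass of the frozen LLM with the mask applied."""
    return (importance * (1.0 - mask)).sum()

for step in range(500):
    probs = torch.sigmoid(logits)
    dist = torch.distributions.Bernoulli(probs=probs)
    mask = dist.sample()                      # binary pruning mask, 1 = keep
    with torch.no_grad():                     # no back-prop through the model
        reward = -evaluate_loss(mask)         # lower loss -> higher reward
    # REINFORCE estimator: d/d(logits) E[reward] ~= reward * grad log p(mask)
    log_prob = dist.log_prob(mask).sum()
    sparsity_penalty = (probs.mean() - target_keep).abs()   # assumed soft constraint
    objective = -(reward * log_prob) + 10.0 * sparsity_penalty
    opt.zero_grad()
    objective.backward()                      # gradients touch only the mask logits
    opt.step()
```

In practice one would replace `evaluate_loss` with the pruned LLM's language-modeling loss on calibration data and add variance reduction (e.g., a reward baseline); the point of the sketch is only that the gradient flows through `log_prob`, not through the model.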
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: pruning, NLP in resource-constrained settings, optimization methods
Contribution Types: Approaches to low-resource settings, Approaches to low compute settings (efficiency)
Languages Studied: English
Submission Number: 844