BaWA: Automatic Optimizing Pruning Metric for Large Language Models with Balanced Weight and Activation
TL;DR: We identify biases in existing pruning metrics and propose a novel automatic optimization framework that derives an appropriate pruning metric for better pruning results.
Abstract: One-shot post-training pruning enhances the deployment of billion-scale large language models (LLMs), with the pruning metric playing a pivotal role in determining which weights to remove. However, existing metrics underperform due to their reliance on a simple symbolic combination of weights and activations, overlooking imbalanced weight magnitudes and the disproportionate influence of activation outliers.
To overcome these limitations, we introduce \textbf{BaWA}, a novel pruning metric that systematically \underline{Ba}lances \underline{W}eight and \underline{A}ctivation distributions for more effective pruning.
BaWA introduces two key innovations: \textbf{magnitude normalization}, which mitigates weight imbalance across channels for fairer pruning decisions, and \textbf{outlier regularization}, which reduces the impact of activation outliers, ensuring more appropriate channel prioritization.
To further enhance its effectiveness, BaWA incorporates an efficient and automatic framework for optimizing normalization and regularization hyperparameters. Extensive experiments validate BaWA as a state-of-the-art (SOTA) pruning metric. For instance, applying BaWA to induce 2:4 sparsity in Mistral-7B reduces perplexity in language comprehension by 2.49 and improves average downstream task accuracy by 3.08\%, outperforming the previous SOTA method Wanda.
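To make the idea concrete, below is a minimal, hypothetical sketch of a BaWA-style importance score for one linear layer. It starts from the Wanda-style score $|W_{ij}| \cdot \|X_j\|_2$ and adds (a) a per-output-channel weight normalization and (b) an exponent that damps activation outliers; the exponents `alpha` and `beta`, the function names, and the 2:4 pruning helper are illustrative assumptions and do not reproduce the exact BaWA formula or its automatic hyperparameter search.

```python
import numpy as np

def bawa_style_scores(W, X, alpha=0.5, beta=0.5, eps=1e-8):
    """Hypothetical BaWA-style importance scores for a linear layer.

    W: (out_features, in_features) weight matrix.
    X: (n_samples, in_features) calibration activations.
    alpha: assumed exponent for per-channel weight-magnitude normalization.
    beta:  assumed exponent (< 1) that shrinks the influence of activation outliers.
    """
    act_norm = np.linalg.norm(X, axis=0)                     # per input channel ||X_j||_2
    row_scale = np.abs(W).mean(axis=1, keepdims=True) + eps  # per output channel magnitude
    # Wanda baseline would be |W| * act_norm; the two balancing terms are the sketch here.
    return (np.abs(W) / row_scale**alpha) * act_norm**beta

def prune_2_4(W, scores):
    """Apply 2:4 semi-structured sparsity: in every group of 4 consecutive
    input weights, zero the 2 with the lowest importance scores."""
    W = W.copy()
    out_f, in_f = W.shape
    assert in_f % 4 == 0
    grouped_scores = scores.reshape(out_f, in_f // 4, 4)
    drop = np.argsort(grouped_scores, axis=-1)[..., :2]      # 2 lowest-scoring per group
    W_grouped = W.reshape(out_f, in_f // 4, 4)
    np.put_along_axis(W_grouped, drop, 0.0, axis=-1)
    return W_grouped.reshape(out_f, in_f)

# Toy usage: random layer and calibration batch
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
X = rng.normal(size=(32, 16))
scores = bawa_style_scores(W, X, alpha=0.5, beta=0.5)
W_pruned = prune_2_4(W, scores)
print("sparsity:", (W_pruned == 0).mean())  # ~0.5 under 2:4 sparsity
```

In this sketch, `alpha` controls how strongly weight magnitudes are equalized across output channels and `beta` controls how much activation outliers are down-weighted; BaWA's automatic framework would search over such hyperparameters rather than fixing them by hand.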
Lay Summary: This paper introduces a new method called BaWA to make large AI language models smaller and more efficient without losing performance. Large models like ChatGPT have billions of parameters, making them slow and resource-heavy. Traditional methods for simplifying these models often remove parts unevenly, either cutting too much from some areas or missing less obvious but important components. BaWA solves this by balancing two key factors: the size of the model’s internal parameters (weights) and the impact of unusual data points (outliers) during calculations. It adjusts how much each part of the model matters, ensuring fairer decisions about what to remove. Additionally, BaWA automatically fine-tunes its settings to find the best balance, taking just minutes to optimize even for huge models. Tests show BaWA outperforms existing methods. For example, when applied to the Mistral-7B model, it reduced errors in language understanding by 2.49 points and improved accuracy on tasks by 3.08% compared to the previous best method. It works well across different model sizes and can be combined with other techniques for even better results. This advancement helps deploy powerful AI models faster and cheaper, especially on devices with limited resources like phones or laptops.
Primary Area: General Machine Learning->Hardware and Software
Keywords: LLM Pruning; Automatic Framework
Submission Number: 178