ACE: Exploring Activation Variance for Accurate and Calibration-Efficient LLM Pruning

18 Sept 2025 (modified: 26 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LLM, Pruning, Efficient AI
Abstract: With the rapid expansion of large language models (LLMs), the demand for memory and computational resources has grown significantly. Recent advances in LLM pruning aim to reduce the size and computational cost of these models. However, existing methods often suffer from either suboptimal pruning performance or low time efficiency during the pruning process. In this work, we propose an efficient and effective pruning method that achieves both high pruning performance and fast, calibration-efficient pruning. Our approach: (1) introduces an activation variance-guided pruning metric, a new metric that better preserves the semantic distinctions in the output activations after pruning; (2) enables model pruning with only a small calibration sequence length, while maintaining pruning performance similar to baselines that rely on much longer calibration sequences (e.g., 2048 tokens for Wanda and RIA). We conduct extensive experiments on prevalent LLMs, such as OPT, LLaMA, LLaMA-2, LLaMA-3, and Qwen2.5, as well as MoE-based models such as Mixtral 8x7B. The experimental results show that our method achieves up to an 18% decrease in perplexity on WikiText-2 and up to 63% less pruning time, demonstrating its effectiveness.
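The abstract does not give the exact ACE metric, so as a rough illustration only, below is a minimal PyTorch sketch of what an activation-variance-reweighted, Wanda-style per-weight importance score could look like. The function name `variance_guided_prune_mask`, the specific reweighting by the per-channel activation standard deviation, and the per-row top-k selection are all assumptions for illustration, not the paper's actual method.

```python
import torch

def variance_guided_prune_mask(W: torch.Tensor, X: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Hypothetical sketch of an activation-variance-guided pruning metric.

    W: (out_features, in_features) weight matrix of a linear layer.
    X: (num_tokens, in_features) calibration activations feeding that layer.
    Returns a boolean mask (True = keep) at the given unstructured sparsity.
    """
    X = X.float()
    # Per-input-channel statistics over the calibration tokens.
    chan_norm = X.norm(p=2, dim=0)   # Wanda-style ||X_j||_2, shape (in_features,)
    chan_std = X.var(dim=0).sqrt()   # per-channel activation std, shape (in_features,)

    # Wanda-style score |W_ij| * ||X_j||_2, reweighted by channel variance so that
    # channels whose activations vary more (carry more distinction) are preserved.
    score = W.abs() * (chan_norm * chan_std).unsqueeze(0)  # broadcast over output rows

    # Prune the lowest-scoring `sparsity` fraction of weights in each output row.
    k = int(W.shape[1] * sparsity)
    _, prune_idx = torch.topk(score, k, dim=1, largest=False)
    mask = torch.ones_like(W, dtype=torch.bool)
    mask.scatter_(1, prune_idx, False)
    return mask
```

Under the abstract's claim, `X` here could be collected from calibration sequences much shorter than the 2048-token sequences used by Wanda and RIA, with the variance term compensating for the weaker channel-norm estimates.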
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12224