Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining

Published: 11 Feb 2025; Last Modified: 06 Mar 2025. CPAL 2025 (Proceedings Track), Poster. License: CC BY 4.0.
Keywords: Efficient, Structured Pruning, LLMs
Abstract: To remove redundant components of large language models (LLMs) without incurring significant pruning costs, this work focuses on single-shot structured pruning without a retraining phase. We simplify the pruning process for Transformer-based LLMs by identifying a depth-2 pruning structure that functions independently. Additionally, we propose two inference-aware pruning criteria derived from the optimization perspective of output approximation, which outperform traditional training-aware metrics such as the gradient and the Hessian. We also introduce a two-step reconstruction technique to mitigate pruning errors without model retraining. Experimental results demonstrate that our strategy significantly reduces pruning costs and hardware requirements while maintaining superior performance across various datasets and models.
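The core idea of output approximation can be illustrated with a minimal sketch: score each prunable structure by how much its removal would perturb the layer's output on calibration data, then greedily drop the least-impactful ones. The function name, the NumPy formulation, and the greedy loop below are illustrative assumptions, not the paper's actual criteria or implementation.

```python
import numpy as np

def prune_by_output_approximation(W, X, n_prune):
    """Greedily remove the input channels of a linear layer whose removal
    least perturbs the layer's output on calibration inputs (a sketch of
    output-approximation-based structured pruning, not the paper's method).

    W: (d_out, d_in) weight matrix; X: (n, d_in) calibration inputs.
    Returns the sorted indices of the channels that are kept.
    """
    kept = list(range(W.shape[1]))
    for _ in range(n_prune):
        # Dropping channel j changes the output Y = X @ W.T by exactly
        # outer(X[:, j], W[:, j]); its Frobenius norm is the output error.
        errs = [np.linalg.norm(np.outer(X[:, j], W[:, j])) for j in kept]
        kept.pop(int(np.argmin(errs)))  # remove the least impactful channel
    return sorted(kept)
```

Because the layer is linear, each channel's contribution to the output is independent, so the greedy loop simply discards the channels with the smallest output contributions on the calibration set.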
Submission Number: 49