Adaptive Dual-Granularity Pruning Method for Large Language Models

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: model compression, structured and unstructured pruning, adaptive pruning, sparsity and robustness, large language models
Abstract: With the rapid development of large language models (LLMs), their parameter scales continue to grow, posing significant challenges for efficient deployment. Pruning, a mainstream compression technique, can effectively reduce model size; however, at high pruning ratios it often suffers from robustness degradation and uncontrollable model size. In this work, we propose ADAP (Adaptive Dual-Granularity Pruning) to address these two issues. ADAP combines the global constraints of structured pruning with the flexibility of unstructured pruning, dynamically adjusting their respective proportions and introducing an intra-layer adaptive allocation of pruning ratios, thereby overcoming the performance bottlenecks of conventional single-mode pruning. Moreover, we adopt compression ratio as a unified metric, replacing the commonly used pruning ratio, to achieve precise control over model size. Experimental results demonstrate that ADAP significantly outperforms existing structured and unstructured pruning methods in high-compression scenarios, delivering better task performance while keeping model scale controllable.
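To make the dual-granularity idea concrete, the sketch below shows one simple way to combine structured pruning (dropping whole rows by L2 norm) with unstructured magnitude pruning under a single parameter-removal budget. This is an illustration under assumed details, not the paper's ADAP algorithm: the function name, the fixed `struct_frac` split, and the row/magnitude criteria are all hypothetical stand-ins for the adaptive mechanisms the abstract describes.

```python
import numpy as np

def dual_granularity_prune(W, target_compression, struct_frac=0.5):
    """Illustrative dual-granularity pruning (hypothetical; not ADAP itself).

    Zeroes out `target_compression` of W's parameters: a fraction
    `struct_frac` of the budget via structured pruning (whole rows
    with the smallest L2 norm), the remainder via unstructured
    magnitude pruning on the surviving weights.
    """
    n_rows, n_cols = W.shape
    total_to_remove = int(target_compression * W.size)

    # Structured part: drop whole rows with the smallest L2 norm.
    struct_budget = int(struct_frac * total_to_remove)
    rows_to_drop = min(struct_budget // n_cols, n_rows)
    row_norms = np.linalg.norm(W, axis=1)
    drop_idx = np.argsort(row_norms)[:rows_to_drop]
    W_pruned = W.copy()
    W_pruned[drop_idx, :] = 0.0

    # Unstructured part: spend the remaining budget on the
    # smallest-magnitude weights among those still nonzero.
    remaining = total_to_remove - rows_to_drop * n_cols
    flat = W_pruned.ravel()          # view into W_pruned
    nonzero = np.flatnonzero(flat)
    if remaining > 0 and nonzero.size > 0:
        order = np.argsort(np.abs(flat[nonzero]))
        flat[nonzero[order[:remaining]]] = 0.0

    return W_pruned
```

Because both granularities draw from one shared budget expressed as a fraction of total parameters, the resulting model size is exactly controlled, which mirrors the abstract's argument for using compression ratio rather than per-layer pruning ratios as the unified target.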
Primary Area: generative models
Submission Number: 18599