HA-PAT: Hierarchically-Adaptive Pruning-Aware Tuning for Large Language Models

ICLR 2026 Conference Submission 17384 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Model Compression, Large Language Models, Pruning-Aware Tuning, Structural Pruning
Abstract: The enormous size of large language models (LLMs) limits their deployment and application. Some research utilizes structural pruning to alleviate this by removing redundant weights in a hardware-agnostic manner. However, existing methods tend to apply a uniform pruning strategy across all layers, ignoring layer-wise functional diversity and risking the removal of essential model components. To tackle this challenge, we propose a Hierarchically-Adaptive Pruning-Aware Tuning (HA-PAT) method. Built on the pruning-aware tuning framework, HA-PAT employs Hierarchical Pruning Ratio Scheduling (HPRS) to derive optimal layer-wise sparsity guided by each layer's unique functionality. It preserves the general linguistic functions of shallow layers while aggressively pruning the deeper layers that primarily encode task-specific features. To better preserve model performance, HA-PAT introduces a magnitude vector into the compensation mechanism, enabling the reconstruction of pruned weights from a broader information space. Experimental results show that our method consistently outperforms the baseline in both average accuracy and inference efficiency. On LLaMA2-13B at a 25\% pruning ratio, our approach surpasses the PAT baseline by 4.01\% in average accuracy across 14 benchmarks, along with a 30\% inference speedup. Further experiments on downstream tasks indicate that HA-PAT better preserves pre-trained language understanding capabilities.
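To illustrate the idea of depth-dependent sparsity described in the abstract, here is a minimal sketch of a layer-wise pruning-ratio schedule. The function name, the linear ramp, and the `min_ratio` parameter are illustrative assumptions, not the paper's HPRS, which derives layer-wise sparsity from each layer's functionality rather than from depth alone.

```python
# Illustrative sketch only: a linear, depth-dependent pruning-ratio schedule.
# Shallow layers are pruned least, deep layers most, while the mean ratio
# matches the global target (e.g., 25%). This does NOT reproduce HPRS.

def layerwise_pruning_ratios(num_layers: int, target_ratio: float,
                             min_ratio: float = 0.05) -> list[float]:
    """Assign smaller ratios to shallow layers and larger ones to deep layers,
    keeping the mean ratio equal to `target_ratio`."""
    # Linear ramp from 0 (shallowest layer) to 1 (deepest layer).
    ramp = [i / (num_layers - 1) for i in range(num_layers)]
    # Scale the ramp so that min_ratio + scale * mean(ramp) == target_ratio.
    mean_ramp = sum(ramp) / num_layers
    scale = (target_ratio - min_ratio) / mean_ramp
    return [min_ratio + scale * r for r in ramp]


if __name__ == "__main__":
    ratios = layerwise_pruning_ratios(num_layers=40, target_ratio=0.25)
    print([round(r, 3) for r in ratios])        # shallow layers pruned least
    print(round(sum(ratios) / len(ratios), 3))  # mean equals the 0.25 global target
```

In practice the schedule would be combined with a compensation step when weights are removed; the abstract's magnitude-vector mechanism serves that role in HA-PAT.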
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17384