Abstract: Large language models (LLMs) have achieved significant success on complex tasks across many domains, but these achievements come with high computational cost and long inference latency. Pruning is an effective optimization technique that simplifies a model's structure by removing redundant components, thereby improving generalization and operational efficiency. Existing retraining-free pruning algorithms are fast, but on encoder-based language models they tend to settle for locally optimal solutions and lack a comprehensive exploration of globally optimal ones, which may hurt overall model performance. To address this issue, we propose RL-Pruner, a novel retraining-free structured pruning algorithm. It consists of two main stages: Mask Rearrangement Based on Asynchronous Advantage Actor-Critic (MA3C) and a BiConjugate Gradient Solver for Mask Tuning (BGMT). The algorithm aims to explore the intra-layer interactions of mask variables and to efficiently find the globally optimal solution without retraining. We evaluate the method with BERT-base and DistilBERT on the GLUE and SQuAD benchmarks. Experimental results show that RL-Pruner significantly improves accuracy on SQuAD1.1: under a 60% FLOPs constraint, the F1 score increases by 4.25% compared with existing retraining-free pruning algorithms.
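To make "mask variables under a FLOPs constraint" concrete, the following is a minimal sketch of a generic greedy mask search. It is not RL-Pruner and implements neither MA3C nor BGMT; it shows the kind of local, importance-ranked selection that retraining-free baselines rely on. The `Unit` type, the `greedy_mask` helper, and all FLOPs and importance numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """A prunable structure (attention head or FFN block). Hypothetical type."""
    name: str
    flops: float       # cost of keeping this unit (made-up numbers)
    importance: float  # any importance score (made-up numbers)


def greedy_mask(units, flops_ratio):
    """Keep units with the highest importance per FLOP within the budget.

    Returns (mask, kept_ratio): mask maps unit name to 1 (keep) or 0 (prune),
    kept_ratio is the fraction of the original FLOPs that remains.
    """
    total = sum(u.flops for u in units)
    budget = flops_ratio * total
    mask = {u.name: 0 for u in units}
    used = 0.0
    ranked = sorted(units, key=lambda u: u.importance / u.flops, reverse=True)
    for u in ranked:
        if used + u.flops <= budget:  # keep the unit only if it still fits
            mask[u.name] = 1
            used += u.flops
    return mask, used / total


# Toy layer: four attention heads and two FFN blocks, 100 FLOPs in total.
units = [
    Unit("head0", 12, 0.9),
    Unit("head1", 12, 0.1),
    Unit("head2", 12, 0.5),
    Unit("head3", 12, 0.3),
    Unit("ffn0", 26, 2.4),
    Unit("ffn1", 26, 0.6),
]

mask, kept_ratio = greedy_mask(units, flops_ratio=0.6)
print(mask)
print(kept_ratio)
```

Because each unit is scored independently, this greedy pass ignores interactions between masks in the same layer, which is the limitation the abstract says RL-Pruner is designed to address.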
External IDs:dblp:conf/ijcnn/YaoWZDDHQLXDPZZWZ25