DP-Prune: Global Optimal Strategy for Retraining-Free Pruning of Transformer Models

Published: 2024, Last Modified: 08 May 2026, IPCCC 2024, CC BY-SA 4.0
Abstract: Transformer models have achieved significant success in various complex tasks, but their high computational cost and long inference latency are limiting factors. Pruning has been widely adopted as an efficient way to reduce these costs for Transformer models. Existing retraining-free pruning algorithms are fast, but they often find only local optima when assessing the importance of attention heads and feed-forward networks. This limitation can yield unstable solutions and degrade the overall performance of the model. To address these challenges, we propose DP-Prune (Dynamic Programming-Prune), a retraining-free structured pruning algorithm that employs a global optimization strategy. The algorithm consists of two parts, DPMO (Dynamic Programming Mask Optimization) and GSMT (GCROTMK Solver Mask Tuning), designed to find global optima quickly and effectively. We evaluate the method with BERT-base and DistilBERT models on the GLUE and SQuAD benchmarks. Experimental results demonstrate significant accuracy improvements on the SQuAD 2.0 task without any further training. Under a 60% FLOPs constraint, DP-Prune achieves an 8.42% increase in F1 score compared with some existing retraining-free pruning algorithms.
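The abstract does not spell out how DPMO works, so the following is only an illustrative sketch of the general idea it contrasts: choosing a pruning mask greedily can stop at a local optimum, while a dynamic program over the budget finds the best mask. It is a textbook 0/1 knapsack, not the paper's algorithm. The function names `greedy_mask` and `knapsack_mask`, the importance scores, and the integer cost units are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def greedy_mask(importance: List[float], cost: List[int], budget: int) -> Tuple[float, List[int]]:
    """Keep units in order of decreasing importance while they fit the budget."""
    mask = [0] * len(importance)
    total, remaining = 0.0, budget
    for i in sorted(range(len(importance)), key=lambda j: -importance[j]):
        if cost[i] <= remaining:
            mask[i] = 1
            total += importance[i]
            remaining -= cost[i]
    return total, mask


def knapsack_mask(importance: List[float], cost: List[int], budget: int) -> Tuple[float, List[int]]:
    """0/1 knapsack: maximize kept importance subject to total cost <= budget."""
    n = len(importance)
    # best[i][b] = best total importance using the first i units with cost at most b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        c, v = cost[i - 1], importance[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: prune unit i
            if c <= b and best[i - 1][b - c] + v > best[i][b]:
                best[i][b] = best[i - 1][b - c] + v  # option 2: keep unit i
    # Walk the table backwards to recover which units were kept.
    mask = [0] * n
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            mask[i - 1] = 1
            b -= cost[i - 1]
    return best[n][budget], mask


# Hypothetical toy instance: three prunable units (e.g. heads) with made-up
# importance scores and integer cost units, under a budget of 6 cost units.
importance = [6.0, 5.0, 5.0]
cost = [5, 3, 3]
budget = 6

greedy_total, greedy_choice = greedy_mask(importance, cost, budget)
dp_total, dp_choice = knapsack_mask(importance, cost, budget)
print(greedy_total, greedy_choice)  # 6.0 [1, 0, 0]
print(dp_total, dp_choice)          # 10.0 [0, 1, 1]
```

In this toy instance the greedy choice keeps the single most important unit and then has no room for anything else, while the dynamic program keeps the two cheaper units for a higher total. How the paper scores importance, models FLOPs, and tunes the resulting mask with the GCROTMK solver is not described in the abstract and is not reproduced here.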