Abstract: Recent pre-trained language models (PLMs) such as BERT come with ever-increasing computational and memory overhead. In this paper, we focus on automatically pruning BERT into efficient architectures for natural language understanding tasks. Specifically, we propose differentiable architecture pruning (DAP), which prunes redundant attention heads and hidden dimensions in BERT and benefits from both network pruning and neural architecture search. Moreover, DAP adapts to different resource constraints, allowing the pruned BERT to be deployed on a variety of edge devices. Empirical results show that the \(\text{BERT}_\text{BASE}\) architecture pruned by DAP achieves a \(5\times\) speed-up with only a minor performance drop. The code is available at https://github.com/OscarYau525/DAP-BERT.
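To make the idea of differentiable pruning of attention heads concrete, the sketch below shows a common relaxation: each head receives a learnable gate, optimized jointly with the task loss and a sparsity penalty, so that heads whose gates collapse toward zero can be removed. This is a minimal illustrative sketch, not the authors' DAP implementation; names such as `HeadGate` and `sparsity_loss` are hypothetical.

```python
# Minimal sketch of differentiable attention-head pruning (assumed, not the paper's code).
import torch
import torch.nn as nn

class HeadGate(nn.Module):
    """Learnable per-head gates applied to multi-head attention outputs."""
    def __init__(self, num_heads: int):
        super().__init__()
        # One logit per head; a sigmoid relaxes the binary keep/prune decision.
        self.logits = nn.Parameter(torch.zeros(num_heads))

    def forward(self, head_outputs: torch.Tensor) -> torch.Tensor:
        # head_outputs: (batch, num_heads, seq_len, head_dim)
        gates = torch.sigmoid(self.logits).view(1, -1, 1, 1)
        return head_outputs * gates

    def sparsity_loss(self) -> torch.Tensor:
        # Pushes gates toward zero so redundant heads can be pruned away.
        return torch.sigmoid(self.logits).sum()

# Usage sketch: add the sparsity term to the task loss during fine-tuning,
# then remove heads whose gates fall below a threshold and fine-tune the pruned model.
gate = HeadGate(num_heads=12)
dummy = torch.randn(2, 12, 16, 64)   # (batch, heads, seq_len, head_dim)
task_loss = gate(dummy).mean()       # placeholder for the real task loss
total_loss = task_loss + 0.01 * gate.sparsity_loss()
total_loss.backward()
```

The same gating idea extends to hidden dimensions by attaching gates to feed-forward units; resource constraints can then be enforced by weighting the sparsity penalty or thresholding gates to meet a target budget.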