Efficient Pruning for Large-Scale Seq2Seq Speech Models without Back-Propagation

Published: 01 Jan 2025, Last Modified: 26 Jul 2025 · ICASSP 2025 · CC BY-SA 4.0
Abstract: Large-scale Seq2Seq speech models such as Whisper excel at speech recognition but are limited by their high computational demands, making them difficult to deploy on resource-constrained devices. This paper introduces a novel, efficient pruning method for compressing these models without retraining or back-propagation, focusing on encoder-decoder architectures. We adapt layer-wise pruning to large speech models and introduce a mixed sparsity allocation strategy that relies only on forward propagation. This approach effectively reduces model size while maintaining high performance. Evaluated on Whisper-large-v3 across various datasets, our method largely preserves Whisper's performance and robustness while reducing parameters by about 60%. It can also be combined with other model compression methods, such as distillation, to further reduce model size.
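To make the idea of back-propagation-free, layer-wise pruning with mixed (per-layer) sparsity more concrete, the sketch below prunes each Linear layer by weight magnitude and picks that layer's sparsity with a forward-only probe on a calibration batch. This is an illustrative assumption of how such a scheme can look in PyTorch, not the paper's actual algorithm; the helper names (`choose_layer_sparsity`, `prune_model_`), the candidate sparsities, and the error tolerance are all hypothetical.

```python
# Illustrative sketch (not the paper's method): layer-wise magnitude pruning
# with per-layer sparsity chosen using forward passes only.
import torch
import torch.nn as nn


def magnitude_prune_(linear: nn.Linear, sparsity: float) -> None:
    """Zero out the smallest-magnitude weights of a Linear layer in place."""
    w = linear.weight.data
    k = int(sparsity * w.numel())
    if k == 0:
        return
    threshold = w.abs().flatten().kthvalue(k).values
    w.mul_((w.abs() > threshold).to(w.dtype))


@torch.no_grad()
def choose_layer_sparsity(linear: nn.Linear, calib_x: torch.Tensor,
                          candidates=(0.3, 0.5, 0.7), tol: float = 0.05) -> float:
    """Pick the highest candidate sparsity whose relative output error on the
    calibration input stays below `tol`. Uses forward passes only."""
    ref = linear(calib_x)
    original = linear.weight.data.clone()
    best = 0.0
    for s in candidates:
        linear.weight.data.copy_(original)
        magnitude_prune_(linear, s)
        err = (linear(calib_x) - ref).norm() / ref.norm().clamp_min(1e-8)
        if err.item() <= tol:
            best = s
    linear.weight.data.copy_(original)  # restore; caller applies the final prune
    return best


@torch.no_grad()
def prune_model_(model: nn.Module, calib_x: torch.Tensor) -> dict:
    """Assign a per-layer sparsity to every Linear layer and prune it."""
    # Record each Linear layer's input on one calibration forward pass.
    inputs, handles = {}, []

    def make_hook(name):
        def hook(module, inp, out):
            inputs[name] = inp[0].detach()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            handles.append(module.register_forward_hook(make_hook(name)))
    model(calib_x)
    for h in handles:
        h.remove()

    allocation = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            s = choose_layer_sparsity(module, inputs[name])
            magnitude_prune_(module, s)
            allocation[name] = s
    return allocation


if __name__ == "__main__":
    # Toy stand-in for an encoder-decoder block; applying this to Whisper would
    # mean iterating over its attention/MLP Linear layers with real calibration audio features.
    toy = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
    calib = torch.randn(32, 64)
    print(prune_model_(toy, calib))
```

The key property the sketch shares with the paper's setting is that sparsity allocation needs no gradients: each layer is probed with forward passes on calibration activations, so even very large encoder-decoder models can be pruned without the memory cost of back-propagation.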