Abstract: Large-scale Seq2Seq speech models such as Whisper excel at speech recognition but are limited by their high computational demands, making them difficult to deploy on resource-constrained devices. This paper introduces a novel and efficient pruning method for compressing these models without retraining or back-propagation, focusing on encoder-decoder architectures. We adapt layer-wise pruning to large speech models and introduce a mixed sparsity allocation strategy that relies only on forward propagation. This approach effectively reduces model size while maintaining high performance. Evaluated with Whisper-large-v3 across various datasets, our method largely preserves Whisper's performance and robustness while reducing parameters by about 60%. It can also be combined with other model compression techniques, such as distillation, to further reduce model size.
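To make the idea of forward-only, layer-wise pruning with a mixed (per-layer) sparsity allocation concrete, the sketch below shows one possible realization in PyTorch. The sensitivity heuristic (output reconstruction error on calibration activations) and the allocation rule are illustrative assumptions, not the paper's exact procedure; all function names are hypothetical.

```python
# Minimal sketch: layer-wise magnitude pruning with per-layer sparsity chosen
# from a forward-pass sensitivity estimate (no back-propagation, no retraining).
import torch
import torch.nn as nn


@torch.no_grad()
def layer_sensitivity(layer: nn.Linear, calib_x: torch.Tensor, probe_sparsity: float) -> float:
    """Estimate how much pruning this layer at `probe_sparsity` perturbs its
    output on calibration activations, using only a forward pass."""
    ref_out = layer(calib_x)
    w = layer.weight
    k = max(1, int(probe_sparsity * w.numel()))
    threshold = w.abs().flatten().kthvalue(k).values
    pruned_w = torch.where(w.abs() <= threshold, torch.zeros_like(w), w)
    pruned_out = nn.functional.linear(calib_x, pruned_w, layer.bias)
    return (ref_out - pruned_out).pow(2).mean().item()


@torch.no_grad()
def allocate_and_prune(layers, calib_acts, target_sparsity=0.6, probe_sparsity=0.6):
    """Give less sensitive layers higher sparsity (mixed allocation), keeping the
    average near `target_sparsity`, then prune each layer by magnitude in place."""
    sens = torch.tensor([layer_sensitivity(l, x, probe_sparsity)
                         for l, x in zip(layers, calib_acts)])
    inv = 1.0 / (sens + 1e-8)                      # less sensitive -> larger share
    alloc = target_sparsity * len(layers) * inv / inv.sum()
    alloc = alloc.clamp(0.0, 0.95)                 # cap per-layer sparsity
    for layer, s in zip(layers, alloc.tolist()):
        w = layer.weight
        k = int(s * w.numel())
        if k == 0:
            continue
        threshold = w.abs().flatten().kthvalue(k).values
        layer.weight.data[w.abs() <= threshold] = 0.0
```

In this sketch, `layers` would be the linear sublayers of the encoder-decoder model and `calib_acts` their recorded input activations from a small calibration set; the key point it illustrates is that both the sparsity allocation and the pruning itself use only forward computation.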