Ensemble Knowledge Distillation from Speech SSL Models Considering Inter-Teacher Differences

Published: 2024 · Last Modified: 10 Feb 2026 · ISCSLP 2024 · CC BY-SA 4.0
Abstract: In speech processing, Self-Supervised Learning (SSL) models such as HuBERT are widely used for tasks such as Automatic Speech Recognition (ASR) and Spoken Language Understanding (SLU), and have achieved impressive results. However, these speech SSL models are often large and require significant computational resources. Many previous studies have therefore used Knowledge Distillation (KD) to learn compact models from complex ones. Ensemble Knowledge Distillation (EKD) distills multiple teacher SSL models into a single student model via multiple prediction heads. We focus on the differences among teacher models and treat them as residuals that the student learns as additional targets. To this end, we propose Residual Prediction Heads and Residual Regularization. Using RobustHuBERT combined with WavLM+ or Data2vec-base as teachers, we evaluate the resulting student models on six tasks from the SUPERB benchmark. The results show that taking inter-teacher residuals into account improves performance.
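The abstract outlines ensemble distillation with per-teacher prediction heads plus residual heads that target inter-teacher differences. The following is a minimal sketch of that idea under several assumptions not stated in the abstract: the class and function names (`StudentWithResidualHeads`, `ekd_loss`), linear heads, MSE losses, pairing each residual against the first teacher, the `res_weight` value, and the L2 form of the regularizer are all illustrative choices, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StudentWithResidualHeads(nn.Module):
    """Prediction heads applied to the student's hidden states.
    The student backbone itself is omitted for brevity."""

    def __init__(self, student_dim: int, teacher_dim: int, num_teachers: int):
        super().__init__()
        # One prediction head per teacher, as in standard ensemble KD.
        self.heads = nn.ModuleList(
            nn.Linear(student_dim, teacher_dim) for _ in range(num_teachers)
        )
        # Residual heads predict the difference between the first teacher and
        # each remaining teacher (an illustrative choice of residual pairing).
        self.res_heads = nn.ModuleList(
            nn.Linear(student_dim, teacher_dim) for _ in range(num_teachers - 1)
        )

    def forward(self, h_student: torch.Tensor):
        preds = [head(h_student) for head in self.heads]
        residuals = [head(h_student) for head in self.res_heads]
        return preds, residuals


def ekd_loss(h_student, teacher_feats, model, res_weight: float = 0.1):
    # teacher_feats: list of (batch, time, teacher_dim) tensors, one per teacher.
    preds, residuals = model(h_student)
    # Standard per-teacher distillation term.
    loss = sum(F.mse_loss(p, t) for p, t in zip(preds, teacher_feats))
    # Residual terms: predict inter-teacher differences, and keep the residual
    # outputs small (an assumed form of the proposed Residual Regularization).
    for r, t in zip(residuals, teacher_feats[1:]):
        target = t - teacher_feats[0]                 # inter-teacher difference
        loss = loss + F.mse_loss(r, target)           # residual prediction loss
        loss = loss + res_weight * r.pow(2).mean()    # residual regularization
    return loss


# Usage sketch: two teachers with 768-dim frame features, a 256-dim student.
model = StudentWithResidualHeads(student_dim=256, teacher_dim=768, num_teachers=2)
h = torch.randn(4, 100, 256)                          # student hidden states
teachers = [torch.randn(4, 100, 768) for _ in range(2)]
print(ekd_loss(h, teachers, model).item())
```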