Abstract: Compared to large speech foundation models, small student models exhibit degraded noise robustness. The student's robustness can be improved by introducing noise at the inputs during pre-training. Even so, the standard distillation loss still yields a student with degraded performance. This paper therefore proposes improving student robustness via distillation with correlation metrics. The student learns teacher behavior by driving the cross-correlation matrix between teacher and student representations towards the identity. Noise robustness is encouraged by minimizing the student's self-correlation. The proposed method consistently outperforms the previous approach on the Intent Classification, Keyword Spotting, and Automatic Speech Recognition tasks of the SUPERB Challenge.
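The two correlation terms described in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything below is an assumption: the function names, the per-feature standardization, the squared-error penalties, the reading of "self-correlation minimization" as a penalty on off-diagonal entries, the `weight` parameter, and the requirement that teacher and student features share one dimensionality (for example after a projection layer).

```python
import numpy as np


def _standardize(x, eps=1e-8):
    # x has shape (N, D): N frames or utterances, D feature dimensions.
    # Each feature is shifted to zero mean and scaled to unit variance.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


def cross_correlation_loss(teacher, student):
    # Hypothetical teacher-matching term: push the (D, D) cross-correlation
    # matrix between teacher and student features towards the identity.
    n = teacher.shape[0]
    c = _standardize(teacher).T @ _standardize(student) / n
    return float(np.sum((c - np.eye(c.shape[0])) ** 2))


def self_correlation_loss(student):
    # Hypothetical robustness term: penalize correlation between different
    # student feature dimensions (off-diagonal entries only).
    n = student.shape[0]
    s = _standardize(student)
    c = s.T @ s / n
    off_diagonal = c - np.diag(np.diag(c))
    return float(np.sum(off_diagonal ** 2))


def correlation_distillation_loss(teacher, student, weight=1.0):
    # Assumed combination: a weighted sum of the two terms.
    return cross_correlation_loss(teacher, student) + weight * self_correlation_loss(student)
```

In this sketch, a student whose features match the teacher's and are mutually decorrelated gets a loss near zero, while redundant (highly correlated) student features raise the second term.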