Towards a Universal Local Speech Feature Extractor through Distillation
Keywords: speech model, feature extraction, distillation
Presentation Preference: Open to it if recommended by organizers
Abstract: In speech models, CNNs are widely used as local feature extractors. Recent work has shown that representations across different models appear to be converging, even when the models are trained on different data. We hypothesize that the distributions of CNN representations across speech models are highly similar, suggesting that these extractors could be replaced by a single model with universal applicability. Moreover, since previous work has shown that the convolutional layers account for 33% of the multiply-accumulate operations in the entire forward computation, there is room to improve the efficiency of such a universal model. We offer indicative support for the hypothesis through similarity analysis, and we develop a simple three-layer model, distilled from the transformer encoder inputs of HuBERT-base, Data2vec-base, and WavLM-base, as the universal feature extractor. Tested on SUPERB, the model largely retains the performance of the three vanilla teacher models while achieving a 20x reduction in memory usage and a 10x decrease in runtime.
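To make the distillation setup concrete, below is a minimal sketch, not the authors' implementation. It assumes a single HuBERT-base teacher loaded through HuggingFace `transformers` (the paper uses three teachers), an MSE distillation loss, and a hypothetical three-layer student configuration (channel widths, kernel sizes, and strides are assumptions; only the three-layer depth comes from the abstract). The target is the transformer encoder input, i.e. the CNN features after HuBERT's feature projection.

```python
# Minimal distillation sketch; student architecture and loss are assumptions.
import torch
import torch.nn as nn
from transformers import HubertModel

class StudentExtractor(nn.Module):
    """Hypothetical three-layer CNN student producing 768-d frame features."""
    def __init__(self, hidden=512, out_dim=768):
        super().__init__()
        # Combined stride 10 * 8 * 4 = 320 matches the teacher's 320x
        # downsampling of 16 kHz audio to 50 Hz frame features.
        self.conv = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=10, stride=10), nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=8, stride=8), nn.GELU(),
            nn.Conv1d(hidden, out_dim, kernel_size=4, stride=4), nn.GELU(),
        )

    def forward(self, wav):  # wav: (batch, samples)
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)  # (batch, frames, 768)

teacher = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
student = StudentExtractor()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(wav):
    with torch.no_grad():
        # Teacher target: CNN output passed through the feature projection,
        # which is what the transformer encoder receives as input.
        feats = teacher.feature_extractor(wav).transpose(1, 2)  # (B, T, 512)
        target = teacher.feature_projection(feats)              # (B, T, 768)
    pred = student(wav)
    n = min(pred.size(1), target.size(1))  # guard against off-by-one frame counts
    loss = nn.functional.mse_loss(pred[:, :n], target[:, :n])
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example: one step on a batch of two 1-second waveforms at 16 kHz.
print(distill_step(torch.randn(2, 16000)))
```

The frame-count guard is needed because the student's receptive field differs slightly from the teacher's, so the two can disagree by a frame at sequence edges; extending this sketch to three teachers would mean regressing onto each teacher's encoder input (or their average) rather than a single target.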
Submission Number: 36