Toward a Universal Local Speech Feature Extractor through Distillation

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 | Venue: MSLD 2026 Poster | Readers: Everyone | License: CC BY 4.0
Keywords: local feature extractor, distillation
Abstract: Self-supervised speech models commonly consist of convolutional neural network (CNN) layers that act as local feature extractors, followed by transformer layers. The CNN layers are computationally demanding, accounting for about a third of the multiply-accumulate operations (MACs) at inference. In addition, the CNN representations are similar across models and resemble the much simpler mel spectral features. We hypothesize that the CNN layers can be replaced by a simple, universal model. To test this hypothesis, we learn a two-layer feature extractor through distillation from the transformer input of several models (HuBERT, data2vec, and WavLM). Experiments on SUPERB(-SG) benchmark tasks show that our model largely retains the performance of the three teacher models and can also be used in wav2vec 2.0 Base. Our universal feature extractor requires only 3% of the MACs of the CNN layers, reducing end-to-end inference runtime by more than 20% across models.
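Illustrative note: the MAC figures in the abstract can be put in context with a rough count for a convolutional front end. The sketch below is not taken from the submission. It assumes the publicly documented seven-layer, 512-channel encoder layout of wav2vec 2.0 Base (kernels 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2), ignores biases and normalization, and uses hypothetical names (`ASSUMED_CNN`, `conv_macs`) chosen here for illustration.

```python
# Rough MAC count for a stack of 1-D convolutions without padding.
# ASSUMPTION: the layer layout below is the commonly documented
# wav2vec 2.0 Base feature encoder; the submission does not list it.
# Each entry is (in_channels, out_channels, kernel_size, stride).
ASSUMED_CNN = (
    [(1, 512, 10, 5)]
    + [(512, 512, 3, 2)] * 4
    + [(512, 512, 2, 2)] * 2
)


def conv_macs(num_samples, layers):
    """Return (total MACs, output frames) for an unpadded Conv1d stack."""
    length = num_samples
    total = 0
    for c_in, c_out, kernel, stride in layers:
        # Output length of an unpadded, undilated 1-D convolution.
        length = (length - kernel) // stride + 1
        # One MAC per (output position, output channel, input channel, tap).
        total += length * c_in * c_out * kernel
    return total, length


# One second of 16 kHz audio.
macs, frames = conv_macs(16000, ASSUMED_CNN)
# Budget implied by the abstract's "3% of the MACs of the CNN layers".
budget = 0.03 * macs
print(f"CNN MACs per second of audio: {macs / 1e9:.2f} G ({frames} frames)")
print(f"3% budget: {budget / 1e6:.1f} M MACs")
```

Under these assumptions the front end costs on the order of a few billion MACs per second of audio, so a replacement at 3% of that cost would sit in the tens of millions. The exact numbers for the models in the submission may differ.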
Submission Number: 125