Towards a Universal Local Speech Feature Extractor through Distillation
Keywords: speech model, feature extraction, distillation
Presentation Preference: Open to it if recommended by organizers
Abstract: In speech models, CNNs are widely used as local feature extractors. Recent work has shown that representations across different models appear to be converging, even when the models are trained on different data. We hypothesize that the distributions of CNN representations across speech models are highly similar, suggesting that these extractors could be replaced by a single model with universal applicability. Moreover, since previous work has shown that the convolutional layers account for 33% of the multiply-accumulate operations in the entire forward computation, there is room to improve the efficiency of such a universal model. We offer indicative support for the hypothesis through similarity analysis, and we develop a simple three-layer model, distilled from the transformer encoder inputs of HuBERT-base, Data2vec-base, and WavLM-base, as the universal feature extractor. Tested on SUPERB, the model largely retains the performance of the three vanilla teacher models while achieving a 20x reduction in memory usage and a 10x decrease in runtime.
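To make the distillation setup concrete, below is a minimal sketch, not the authors' implementation. It assumes a single HuBERT-base teacher loaded through HuggingFace `transformers` (the paper uses three teachers), an MSE distillation loss, and a hypothetical three-layer student configuration (channel widths, kernel sizes, and strides are assumptions; only the three-layer depth comes from the abstract). The target is the transformer encoder input, i.e. the CNN features after HuBERT's feature projection.

```python
# Minimal distillation sketch; student architecture and loss are assumptions.
import torch
import torch.nn as nn
from transformers import HubertModel

class StudentExtractor(nn.Module):
    """Hypothetical three-layer CNN student producing 768-d frame features."""
    def __init__(self, hidden=512, out_dim=768):
        super().__init__()
        # Combined stride 10 * 8 * 4 = 320 matches the teacher's 320x
        # downsampling of 16 kHz audio to 50 Hz frame features.
        self.conv = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=10, stride=10), nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=8, stride=8), nn.GELU(),
            nn.Conv1d(hidden, out_dim, kernel_size=4, stride=4), nn.GELU(),
        )

    def forward(self, wav):  # wav: (batch, samples)
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)  # (batch, frames, 768)

teacher = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
student = StudentExtractor()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(wav):
    with torch.no_grad():
        # Teacher target: CNN output passed through the feature projection,
        # which is what the transformer encoder receives as input.
        feats = teacher.feature_extractor(wav).transpose(1, 2)  # (B, T, 512)
        target = teacher.feature_projection(feats)              # (B, T, 768)
    pred = student(wav)
    n = min(pred.size(1), target.size(1))  # guard against off-by-one frame counts
    loss = nn.functional.mse_loss(pred[:, :n], target[:, :n])
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example: one step on a batch of two 1-second waveforms at 16 kHz.
print(distill_step(torch.randn(2, 16000)))
```

The frame-count guard is needed because the student's receptive field differs slightly from the teacher's, so the two can disagree by a frame at sequence edges; extending this sketch to three teachers would mean regressing onto each teacher's encoder input (or their average) rather than a single target.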
Submission Number: 36