Enhancing Event Tagger with Automatic Speech Recognizer for Audio Multi-task Scenarios by Distillation with Pre-Trained Large Models
Abstract: As robotics and digital humans spread into practical applications, the demands on their auditory systems are growing: they typically need a more efficient speech recognition framework that can handle multiple tasks while using fewer resources. In previous audio processing frameworks, each type of task typically requires its own standalone deep network model, so building models for multi-task audio scenarios demands more training data and longer training time. Recent improvements in transformer-based audio models have produced methods that can handle multiple audio tasks concurrently. However, these methods still require retraining on the multi-task targets with a considerable amount of data, and after training they achieve only mediocre results in multi-task scenarios compared with a simple combination of standalone methods run separately. To better build a model that can handle multiple audio tasks, we propose a novel framework that enhances an event tagger with an automatic speech recognizer through distillation from pre-trained large models. Multiple rounds of experiments on several audio datasets verify that the proposed framework achieves better results than the baseline in multi-task scenarios, and comparable results with fewer parameters than the baselines in single-task scenarios.