Abstract: Trained on 680k hours of weakly supervised multilingual and multi-task speech transcription and translation data, Whisper [1] is a robust system for both automatic speech recognition (ASR) and speech translation (ST). Whisper uses a simple model architecture, a Mel spectrogram front end followed by two convolutional layers and a sequence-to-sequence Transformer, which is easy to fine-tune for conditional generation tasks. This paper analyzes how to fine-tune Whisper for Chinese ASR [2] and named entity recognition (NER), covering (1) how to design different prompts for different generative tasks; (2) how to train the ASR and NER tasks jointly; and (3) whether performance can be further improved by weakly supervised data augmentation. Experiments on AISHELL [3] and AISHELL-NER [4] show that multi-task fine-tuning of Whisper effectively improves the performance of both ASR and NER.
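To make the prompt-design idea concrete, below is a minimal sketch of one way to prompt a single Whisper decoder for both tasks, assuming the HuggingFace Transformers Whisper interface. The `<|ner|>` task token and the inline entity-tagged target format are illustrative assumptions, not details taken from the paper; Whisper's built-in special tokens (`<|startoftranscript|>`, `<|zh|>`, `<|transcribe|>`, `<|notimestamps|>`) are real.

```python
# A minimal sketch of task-specific prompting for multi-task Whisper
# fine-tuning. The <|ner|> token and the entity-tagging scheme are
# hypothetical, added here purely for illustration.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Add a new task token so ASR and NER share one decoder: plain
# transcripts are prompted with <|transcribe|>, entity-tagged
# transcripts with the assumed <|ner|> token.
processor.tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|ner|>"]}
)
model.resize_token_embeddings(len(processor.tokenizer))

def build_labels(text: str, task: str) -> torch.Tensor:
    """Encode a target transcript behind a task-specific prompt prefix."""
    prefix = "<|startoftranscript|><|zh|>"
    prefix += "<|ner|>" if task == "ner" else "<|transcribe|>"
    prefix += "<|notimestamps|>"
    ids = processor.tokenizer(prefix + text, add_special_tokens=False).input_ids
    return torch.tensor(ids)

# ASR sample: plain transcript. NER sample: entity spans marked inline
# (the bracket style is an assumption about the tagging format).
asr_labels = build_labels("今天北京天气很好", task="asr")
ner_labels = build_labels("今天(LOC 北京)天气很好", task="ner")
```

With targets built this way, both tasks can be mixed in one training batch and optimized with the usual cross-entropy objective, letting the task token at the start of the decoder prompt select the desired output format.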