Abstract: Trained on 680k hours of weakly supervised multilingual and multi-task speech transcription and translation data, Whisper [1] is a robust system for both automatic speech recognition (ASR) and speech translation (ST). Whisper uses a simple model architecture, a Mel spectrogram front end followed by two convolutional layers and a sequence-to-sequence Transformer, which is easy to fine-tune for conditional generation tasks. This paper analyzes how to fine-tune Whisper for Chinese ASR [2] and named entity recognition (NER), covering (1) how to design different prompts for different generative tasks; (2) how to train the ASR and NER tasks jointly; and (3) whether performance can be further improved by weakly supervised data augmentation. Experiments on AISHELL [3] and AISHELL-NER [4] show that multi-task fine-tuning of Whisper effectively improves the performance of both ASR and NER.
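To make the prompt-design idea concrete, below is a minimal sketch of one way to prompt a single Whisper decoder for both tasks, assuming the HuggingFace Transformers Whisper interface. The `<|ner|>` task token and the inline entity-tagged target format are illustrative assumptions, not details taken from the paper; Whisper's built-in special tokens (`<|startoftranscript|>`, `<|zh|>`, `<|transcribe|>`, `<|notimestamps|>`) are real.

```python
# A minimal sketch of task-specific prompting for multi-task Whisper
# fine-tuning. The <|ner|> token and the entity-tagging scheme are
# hypothetical, added here purely for illustration.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Add a new task token so ASR and NER share one decoder: plain
# transcripts are prompted with <|transcribe|>, entity-tagged
# transcripts with the assumed <|ner|> token.
processor.tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|ner|>"]}
)
model.resize_token_embeddings(len(processor.tokenizer))

def build_labels(text: str, task: str) -> torch.Tensor:
    """Encode a target transcript behind a task-specific prompt prefix."""
    prefix = "<|startoftranscript|><|zh|>"
    prefix += "<|ner|>" if task == "ner" else "<|transcribe|>"
    prefix += "<|notimestamps|>"
    ids = processor.tokenizer(prefix + text, add_special_tokens=False).input_ids
    return torch.tensor(ids)

# ASR sample: plain transcript. NER sample: entity spans marked inline
# (the bracket style is an assumption about the tagging format).
asr_labels = build_labels("今天北京天气很好", task="asr")
ner_labels = build_labels("今天(LOC 北京)天气很好", task="ner")
```

With targets built this way, both tasks can be mixed in one training batch and optimized with the usual cross-entropy objective, letting the task token at the start of the decoder prompt select the desired output format.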