Abstract: Large language models (LLMs) can enhance automatic speech recognition (ASR) systems through generative error correction (GEC). In this paper, we propose Pinyin-enhanced GEC (PY-GEC), which leverages Pinyin—the phonetic representation of Mandarin Chinese—as supplementary information to improve Chinese ASR error correction. Our approach utilizes only synthetic errors for training and employs the one-best hypothesis during inference. Additionally, we introduce a multitask training approach involving conversion tasks between Pinyin and text to align their feature spaces. Experiments on the Aishell-1 and Common Voice datasets demonstrate that our approach consistently outperforms GEC with text-only input. More importantly, we provide intuitive explanations for the effectiveness of PY-GEC and multitask training from two aspects: 1) increased attention weight on Pinyin features; and 2) aligned feature space between Pinyin and text hidden states.
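To make the setup concrete, the following is a minimal sketch of how the one-best hypothesis and its Pinyin might be combined into an LLM prompt, and how the auxiliary Pinyin↔text conversion pairs for multitask training could be constructed. The prompt template, field names, and example sentences are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical PY-GEC prompt construction: the one-best ASR hypothesis is
# paired with its Pinyin transcription as supplementary phonetic input.
# Template wording is an assumption for illustration only.

def build_py_gec_prompt(hypothesis: str, pinyin: str) -> str:
    """Combine the one-best hypothesis with its Pinyin for the corrector LLM."""
    return (
        "Correct the Chinese ASR hypothesis using its Pinyin as a phonetic hint.\n"
        f"Hypothesis: {hypothesis}\n"
        f"Pinyin: {pinyin}\n"
        "Corrected:"
    )

def build_multitask_examples(text: str, pinyin: str) -> list[tuple[str, str]]:
    """Auxiliary Pinyin<->text conversion pairs used to align the two feature spaces."""
    return [
        (f"Convert to Pinyin: {text}", pinyin),   # text -> Pinyin task
        (f"Convert to text: {pinyin}", text),     # Pinyin -> text task
    ]

# "汽" is a plausible homophone error for "气" (both qi4), the kind of
# substitution that Pinyin input helps the corrector recover from.
prompt = build_py_gec_prompt("今天天汽很好", "jin1 tian1 tian1 qi4 hen3 hao3")
print(prompt)
```

Because homophone substitutions preserve the Pinyin sequence, supplying it alongside the hypothesis gives the model a signal that text-only input lacks.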
External IDs: dblp:conf/icassp/LiQZZT0025