Abstract: ASR Error Correction (AEC) aims to post-process the output of ASR systems and further reduce the word error rate. In this paper, we propose a cross-modal training framework with contrastive learning for the AEC task. This framework enables a shared encoder-decoder model to learn text, pinyin (phoneme), and audio information simultaneously, and is trained on three subtasks: text correction, pinyin-to-text conversion, and ASR. On this basis, we introduce a contrastive learning loss to shrink the distance between representations of the three modalities and construct a unified representation space. Experiments on four AEC datasets show that our method corrects a large number of ASR errors and achieves state-of-the-art results.
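To make the cross-modal contrastive objective concrete, the following is a minimal sketch (not the authors' code) of a pairwise InfoNCE-style loss that pulls together text, pinyin, and audio encodings of the same utterance. The embedding shapes, the temperature value, and the averaging over the three modality pairs are assumptions for illustration.

```python
# Minimal sketch of a three-way cross-modal contrastive loss.
# Assumes each modality has been encoded (e.g., by a shared encoder with
# mean-pooled hidden states) into a (batch, dim) embedding matrix.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE between two embedding batches; matched rows are positives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def cross_modal_contrastive_loss(text_emb, pinyin_emb, audio_emb):
    """Average the pairwise InfoNCE losses over the three modality pairs."""
    return (info_nce(text_emb, pinyin_emb)
            + info_nce(text_emb, audio_emb)
            + info_nce(pinyin_emb, audio_emb)) / 3.0

# Usage with random stand-in embeddings:
batch, dim = 8, 256
loss = cross_modal_contrastive_loss(
    torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim))
```

In a full training loop, this loss would be combined with the three subtask losses (text correction, pinyin-to-text, ASR); the weighting between objectives is left unspecified here.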