Abstract: ASR Error Correction (AEC) aims to post-process the output of ASR systems and further reduce the word error rate. In this paper, we propose a cross-modal training framework with contrastive learning for the AEC task. This framework enables a shared encoder-decoder model to learn text, pinyin (phoneme), and audio information simultaneously, and is trained on three subtasks: text correction, pinyin-to-text conversion, and ASR. On this basis, we introduce a contrastive learning loss to shrink the distance between representations of the three modalities and construct a unified representation space. Experiments on four AEC datasets show that our method corrects a large number of ASR errors and achieves state-of-the-art results.
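To make the cross-modal contrastive objective concrete, the following is a minimal sketch (not the authors' code) of a pairwise InfoNCE-style loss that pulls together text, pinyin, and audio encodings of the same utterance. The embedding shapes, the temperature value, and the averaging over the three modality pairs are assumptions for illustration.

```python
# Minimal sketch of a three-way cross-modal contrastive loss.
# Assumes each modality has been encoded (e.g., by a shared encoder with
# mean-pooled hidden states) into a (batch, dim) embedding matrix.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE between two embedding batches; matched rows are positives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def cross_modal_contrastive_loss(text_emb, pinyin_emb, audio_emb):
    """Average the pairwise InfoNCE losses over the three modality pairs."""
    return (info_nce(text_emb, pinyin_emb)
            + info_nce(text_emb, audio_emb)
            + info_nce(pinyin_emb, audio_emb)) / 3.0

# Usage with random stand-in embeddings:
batch, dim = 8, 256
loss = cross_modal_contrastive_loss(
    torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim))
```

In a full training loop, this loss would be combined with the three subtask losses (text correction, pinyin-to-text, ASR); the weighting between objectives is left unspecified here.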