Cross-Modal Learning for CTC-Based ASR: Leveraging CTC-Bertscore and Sequence-Level Training

Mun-Hak Lee, Sang-Eon Lee, Ji-Eun Choi, Joon-Hyuk Chang

Published: 2023, Last Modified: 24 Apr 2026. Venue: ASRU 2023. License: CC BY-SA 4.0

Abstract: Because neural networks easily overfit the training set, neural network-based speech recognition models are vulnerable to prior shifts in the data distribution and to unseen words. Previous studies have therefore sought to overcome this problem by using language models trained on unpaired text corpora, which are relatively easy to obtain. In this paper, we present a new training method that uses BERT to improve the performance of a connectionist temporal classification (CTC)-based automatic speech recognition (ASR) model. The proposed method follows a cross-modal learning scenario and induces the CTC model to better embed contextual information by means of an auxiliary objective function that operates at the sequence level. We applied the proposed method to fine-tune a pre-trained wav2vec 2.0 model with the CTC loss and confirmed that it improves the generalization performance of the ASR model.
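
As a rough illustration of the training setup described in the abstract, the overall objective can be sketched as the CTC loss plus a weighted auxiliary term. The abstract does not state how the two terms are combined, so the weighted-sum form, the weight $\lambda$, and the symbol names below are assumptions made for illustration, not the authors' stated formulation.

```latex
% Assumed form: weighted sum of the CTC loss and a sequence-level auxiliary loss.
% x: input speech, y: reference transcript, \lambda: hypothetical trade-off weight.
\mathcal{L}_{\text{total}}(x, y)
  = \mathcal{L}_{\text{CTC}}(x, y)
  + \lambda \, \mathcal{L}_{\text{seq}}(x, y),
  \qquad \lambda \ge 0
```

Here $\mathcal{L}_{\text{seq}}$ stands for the sequence-level auxiliary objective computed with the help of BERT. Under this assumed form, setting $\lambda = 0$ recovers plain CTC fine-tuning.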