Abstract: Serialized Output Training (SOT) has emerged as the mainstream approach to multi-talker overlapped speech recognition due to its simplicity. However, SOT suffers from cross-domain performance degradation, which hinders its application. Meanwhile, traditional domain adaptation methods may harm the accuracy of speaker change point prediction, evaluated by UD-CER, which is an important metric in SOT. To solve these issues, we propose Pseudo-Labeling based SOT (PL-SOT) for domain adaptation, which treats the speaker change token ($<$sc$>$) specially during training to increase the accuracy of speaker change point prediction. First, we improve the CTC loss by proposing a Weakening and Enhancing CTC (WE-CTC) loss, which weakens the learning of error-prone labels surrounding $<$sc$>$ while enhancing the emission probability of $<$sc$>$ by modifying the posteriors of the pseudo-labels. Second, we introduce a Weighted Confidence Filter (WCF) that assigns higher scores to $<$sc$>$, so that low-quality pseudo-labels can be excluded without hurting $<$sc$>$ prediction. Experimental results show that PL-SOT achieves an average relative reduction of 17.7%/12.8% in CER/UD-CER, with AliMeeting as the source domain and AISHELL-4 together with MagicData-RAMC as the target domains.
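To make the two mechanisms concrete, the sketch below illustrates their intuition in plain Python; it is not the paper's implementation. PL-SOT modifies posteriors inside the CTC loss itself, whereas this sketch only builds per-token weights (damping the error-prone neighbors of $<$sc$>$ and boosting $<$sc$>$ itself) and a $<$sc$>$-weighted confidence score for pseudo-label filtering. All names and hyperparameters here (SC_TOKEN, sc_boost, neighbor_damp, sc_weight, threshold) are illustrative assumptions, not values from the paper.

```python
SC_TOKEN = "<sc>"  # speaker change token (illustrative name)

def token_loss_weights(tokens, sc_boost=1.5, neighbor_damp=0.5):
    """WE-CTC intuition (simplified): boost the loss contribution of <sc>
    and damp the error-prone tokens immediately surrounding it."""
    w = [1.0] * len(tokens)
    for i, t in enumerate(tokens):
        if t == SC_TOKEN:
            w[i] = sc_boost
            if i > 0 and tokens[i - 1] != SC_TOKEN:
                w[i - 1] = min(w[i - 1], neighbor_damp)
            if i + 1 < len(tokens) and tokens[i + 1] != SC_TOKEN:
                w[i + 1] = min(w[i + 1], neighbor_damp)
    return w

def weighted_confidence(tokens, log_probs, sc_weight=2.0):
    """WCF intuition (simplified): an average per-token confidence in which
    <sc> tokens count more, so filtering keys on speaker-change quality."""
    weights = [sc_weight if t == SC_TOKEN else 1.0 for t in tokens]
    return sum(w * lp for w, lp in zip(weights, log_probs)) / sum(weights)

def filter_pseudo_labels(hyps, threshold=-0.3):
    """Keep only pseudo-labels whose weighted confidence clears a threshold."""
    return [h for h in hyps
            if weighted_confidence(h["tokens"], h["log_probs"]) >= threshold]

# Usage: each hypothesis carries decoded tokens and per-token log-probabilities.
hyps = [
    {"tokens": ["hi", "<sc>", "yes"], "log_probs": [-0.1, -0.05, -0.2]},
    {"tokens": ["uh", "<sc>", "no"], "log_probs": [-1.4, -0.9, -1.1]},
]
print(filter_pseudo_labels(hyps))             # keeps only the confident first hypothesis
print(token_loss_weights(hyps[0]["tokens"]))  # [0.5, 1.5, 0.5]
```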