Implicit Enhancement of Target Speaker in Speaker-Adaptive ASR through Efficient Joint Optimization

Published: 2024, Last Modified: 01 Apr 2026, ICASSP 2024, CC BY-SA 4.0
Abstract: In multi-speaker scenarios, automatic speech recognition (ASR) models rely on audio that has been pre-processed by speaker separation. However, when the target speaker is not accurately separated, ASR models cannot reach their peak performance. To address this issue, we propose a speaker-adaptive ASR framework that gains a stronger implicit target-speaker enhancement capability by efficiently and jointly optimizing the speaker recognition (SR) and ASR models. Our framework introduces a shared self-supervised learning representation, optimization transfer, and hierarchical speaker-gated attention. In this manner, it maximizes the effectiveness of the embedding bias and emphasizes the target speaker corresponding to semantic units. In the CHiME-7 DASR sub-track, the proposed method achieves a 28.19% relative reduction in word error rate (WER) on the development sets compared to the official baseline. Notably, this framework has also been employed in the champion system for the CHiME-7 DASR task.
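For readers unfamiliar with the metric, the following minimal sketch shows how a relative WER reduction is conventionally computed from a baseline WER and a system WER. The function name `relative_wer_reduction` and the example values are illustrative assumptions, not figures or code from the paper; the abstract reports only the relative reduction, not the underlying absolute WERs.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the relative WER reduction of a system over a baseline, in percent.

    Both inputs are absolute WERs expressed in percent (e.g. 40.0 for 40%).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - system_wer) / baseline_wer


# Hypothetical values chosen only to illustrate the arithmetic;
# they are NOT taken from the paper.
reduction = relative_wer_reduction(baseline_wer=40.0, system_wer=30.0)
print(f"{reduction:.2f}%")  # prints "25.00%"
```

Under this definition, a relative reduction describes the fraction of the baseline's errors that were removed, so the same relative figure can correspond to different absolute WER gaps depending on the baseline.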