Parameter-efficient adaptation with multi-channel adversarial training for far-field speech recognition

Published: 01 Jan 2025, Last Modified: 20 May 2025, EURASIP J. Audio Speech Music Process. 2025, CC BY-SA 4.0
Abstract: Despite notable advances in automatic speech recognition (ASR), background noise, reverberation, and speaker distance still degrade the performance of far-field speech recognition (FSR). Although large-scale pre-trained models have shown promise, their adaptation to FSR is hampered by high training costs and the risk of overfitting. To overcome these problems, we introduce parameter-efficient speech prefix tuning (SPT) for FSR, applied to Whisper for the first time. This method prepends fixed-length vector sequences to the input features of each layer, enhancing the model's capability to handle complex FSR tasks. To address the channel-interference problem in FSR, we propose a multi-channel adversarial training (MCAT) approach, which incorporates a channel recognizer in an adversarial manner to guide the model toward learning channel-invariant speech representations. Experimental results on multiple datasets demonstrate that speech prefix tuning surpasses LoRA, reducing WER by 5.76% relative while using fewer parameters. Moreover, multi-channel adversarial training achieves additional relative WER reductions of 3.92% and 3.71%. A t-SNE visualization shows that MCAT maps samples from different channels onto similar representations, effectively reducing channel interference.
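
Below is a minimal PyTorch sketch of the per-layer prefix-tuning idea the abstract describes; the class name PrefixTunedLayer, the prefix length of 16, and the initialization scale are illustrative assumptions, not the paper's exact configuration:

import torch
import torch.nn as nn

class PrefixTunedLayer(nn.Module):
    """Wraps a frozen pre-trained layer and prepends a fixed-length
    learnable prefix to its input sequence (illustrative SPT sketch)."""
    def __init__(self, layer: nn.Module, d_model: int, prefix_len: int = 16):
        super().__init__()
        self.layer = layer
        for p in self.layer.parameters():
            p.requires_grad = False  # the pre-trained weights stay frozen
        # the only trainable parameters: one fixed-length prefix per layer
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); broadcast the prefix over the batch
        prefix = self.prefix.unsqueeze(0).expand(x.size(0), -1, -1)
        out = self.layer(torch.cat([prefix, x], dim=1))
        # drop the prefix positions so the output length matches the input
        return out[:, self.prefix.size(0):, :]

# usage: wrap each encoder layer of a frozen backbone
base = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
tuned = PrefixTunedLayer(base, d_model=512)
h = tuned(torch.randn(4, 100, 512))  # -> shape (4, 100, 512)

Because only the per-layer prefixes receive gradients, the number of trainable parameters scales with prefix length and model depth rather than with the backbone size, which is what makes the approach parameter-efficient.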
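
The abstract does not spell out how the channel recognizer is coupled adversarially; one standard realization is a gradient-reversal layer in the style of domain-adversarial training (Ganin and Lempitsky), sketched below under that assumption. The names GradReverse and ChannelRecognizer, the mean-pooling over time, and the scaling factor lam are hypothetical:

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient in
    the backward pass, so minimizing the channel loss trains the encoder
    to remove channel information."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class ChannelRecognizer(nn.Module):
    """Predicts which microphone channel produced an encoder state."""
    def __init__(self, d_model: int, n_channels: int, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model // 2),
            nn.ReLU(),
            nn.Linear(d_model // 2, n_channels),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, d_model) -> pool over time, reverse gradients
        pooled = GradReverse.apply(h.mean(dim=1), self.lam)
        return self.head(pooled)

# training objective (sketch): the ASR loss plus the adversarial channel
# loss, e.g. loss = asr_loss + ce(recognizer(encoder_states), channel_ids)

Under this formulation the recognizer is optimized to identify the channel while the reversed gradient pushes the encoder toward channel-invariant representations, consistent with the clustering behavior the t-SNE analysis reports.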