CycleGAN-Based Speech Mode Transformation Model for Robust Multilingual ASR

Published: 2022, Last Modified: 02 Aug 2025. Venue: Circuits Syst. Signal Process. 2022. License: CC BY-SA 4.0
Abstract: In this work, we propose a multilingual speech mode transformation (MSMT) model as a front end that improves the robustness of a speech recognition system by transforming the characteristics of conversation and extempore modes of speech into the read mode of speech. The proposed front end comprises a multilingual speech mode classification (MSMC) system and mode-specific MSMT models. The mode-specific MSMT models are developed using a variant of the cycle-consistent generative adversarial network (CycleGAN) called weighted CycleGAN (WeCycleGAN). In these models, the generator loss is multiplied by a relevant weight to learn a strong mapping from conversation and extempore speech to read speech while preserving the linguistic content. The proposed model is trained on non-parallel speech samples of the three modes using adversarial networks, which learn a mapping between two distributions (extempore vs. read, or conversation vs. read) rather than a direct mapping between parallel speech samples. Experiments are conducted on a non-parallel speech dataset of conversation, extempore, and read modes in four Indian languages, namely Bengali, Odia, Telugu, and Kannada. The objective evaluation shows that the transformed feature vectors are highly correlated with the target feature vectors. The subjective evaluation shows that the quality of the transformed speech is close to that of the target speech mode. The significance of the proposed MSMT model is demonstrated on a speech recognition system. The results show that speech recognition performance improves significantly when the MSMT model is used.
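The idea of weighting the generator loss can be illustrated with a minimal NumPy sketch. The abstract does not specify which loss terms are weighted or what values are used, so everything here is an assumption for illustration only: the least-squares adversarial loss, the L1 cycle-consistency loss, the function names, and the parameters `adv_weight` and `cycle_weight` are hypothetical and are not taken from the WeCycleGAN paper.

```python
import numpy as np

def lsgan_generator_loss(d_fake):
    # Least-squares adversarial loss for a generator: push D(G(x)) toward 1.
    # (Assumed loss form; the paper's exact adversarial loss is not given here.)
    return float(np.mean((np.asarray(d_fake) - 1.0) ** 2))

def l1_loss(a, b):
    # Mean absolute error, used for the cycle-consistency term.
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(b))))

def weighted_generator_loss(x_src, x_tgt, g_s2t, g_t2s, d_tgt, d_src,
                            adv_weight=2.0, cycle_weight=10.0):
    # x_src: source-mode features (e.g. conversation or extempore frames).
    # x_tgt: target-mode features (read speech), non-parallel to x_src.
    # g_s2t, g_t2s: generator callables; d_tgt, d_src: discriminator callables.
    fake_tgt = g_s2t(x_src)   # source -> read
    fake_src = g_t2s(x_tgt)   # read -> source

    # Adversarial term for both generators.
    adv = lsgan_generator_loss(d_tgt(fake_tgt)) + lsgan_generator_loss(d_src(fake_src))

    # Cycle-consistency term: mapping there and back should recover the input,
    # which is what encourages preservation of linguistic content.
    cyc = l1_loss(g_t2s(fake_tgt), x_src) + l1_loss(g_s2t(fake_src), x_tgt)

    # Hypothetical weighting: the adversarial generator loss is scaled by
    # adv_weight, the cycle term by cycle_weight.
    return adv_weight * adv + cycle_weight * cyc
```

In a real system the generators and discriminators would be neural networks operating on spectral features; plain callables are used here only so that the weighting itself is easy to see.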