Abstract: The phonetic structures of dysarthric speech are more difficult to discriminate than those of normal speech. In this paper, we therefore propose a novel voice conversion framework for dysarthric speech that learns disentangled audio-transcription representations. The novelty of this method is that it takes both the audio and its corresponding transcription as training inputs simultaneously. We constrain the linguistic representation extracted from the audio input to be close to the linguistic representation extracted from the transcription input, forcing the two to share the same distribution. As a result, the proposed model can generate appropriate linguistic representations without any transcripts at the testing stage. Objective and subjective evaluations show that the proposed method achieves higher intelligibility and better speaker similarity in the converted speech than the baseline approaches.
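The abstract describes a dual-input training setup: an audio encoder and a text encoder each produce a linguistic representation, and a consistency constraint pulls the audio-derived representation toward the transcription-derived one so that inference needs audio only. The sketch below is a minimal, hypothetical illustration of that idea; the encoder architectures, dimensions, and the choice of an L2 distance on pooled representations are assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

class DualEncoderVC(nn.Module):
    """Hypothetical sketch of the dual-input setup from the abstract:
    an audio encoder and a text encoder each extract a linguistic
    representation, to be constrained toward a shared distribution."""

    def __init__(self, mel_dim=80, vocab_size=100, hidden=256):
        super().__init__()
        # Audio encoder: linguistic representation from mel-spectrogram frames.
        self.audio_enc = nn.GRU(mel_dim, hidden, batch_first=True)
        # Text encoder: linguistic representation from transcription token IDs.
        self.text_emb = nn.Embedding(vocab_size, hidden)
        self.text_enc = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, mel, tokens):
        z_audio, _ = self.audio_enc(mel)                      # (B, T_a, H)
        z_text, _ = self.text_enc(self.text_emb(tokens))      # (B, T_t, H)
        return z_audio, z_text

def consistency_loss(z_audio, z_text):
    # The abstract does not name the distance used to align the two
    # representations; an L2 penalty on time-averaged representations is
    # one plausible stand-in for the distribution-sharing constraint.
    return nn.functional.mse_loss(z_audio.mean(dim=1), z_text.mean(dim=1))

# Training uses both inputs; at test time only audio_enc is run, so no
# transcript is required, matching the abstract's claim.
model = DualEncoderVC()
mel = torch.randn(4, 120, 80)            # batch of mel-spectrograms
tokens = torch.randint(0, 100, (4, 30))  # batch of transcription token IDs
z_a, z_t = model(mel, tokens)
loss = consistency_loss(z_a, z_t)
```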