Mapping frames with DNN-HMM recognizer for non-parallel voice conversion

Minghui Dong, Chenyu Yang, Yanfeng Lu, Jochen Walter Ehnes, Dong-Yan Huang, Huaiping Ming, Rong Tong, Siu Wa Lee, Haizhou Li

2015 (modified: 07 Apr 2022), APSIPA 2015, Readers: Everyone

Abstract: To convert one speaker's voice to another's, a mapping between corresponding speech segments of the source and target speakers must first be obtained. In parallel voice conversion, dynamic time warping (DTW) is normally used to align the source and target signals. For non-parallel speech data, however, DTW-based mapping does not work, because the two speakers do not utter the same content. In this paper, we propose using a DNN-HMM recognizer to recognize each frame of both the source and target speech signals. Each frame is then represented by its vector of pseudo-likelihoods, and the similarity between two frames is measured by the distance between their vectors. A clustering method groups the source and target frames together, and the frame mapping from source to target is established from the clustering result. Experiments show that the proposed method produces conversion results similar to those of parallel voice conversion.
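The mapping step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, the distance measure, or how a target frame is chosen within a cluster, so everything below is an assumption: k-means with farthest-point initialisation, squared Euclidean distance, and nearest-neighbour selection inside the shared cluster. The function names `kmeans` and `map_frames` are hypothetical, and the input arrays stand in for the per-frame pseudo-likelihood vectors that a DNN-HMM recognizer would produce (the recognizer itself is not shown).

```python
import numpy as np

def kmeans(X, k, n_iter=20):
    """Lloyd's k-means with farthest-point initialisation (an assumed choice)."""
    centers = [X[0]]
    for _ in range(1, k):
        # distance of every row to its closest already-chosen center
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # squared Euclidean distance from every row to every center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):  # leave an empty cluster's center unchanged
                centers[c] = members.mean(axis=0)
    return labels

def map_frames(src, tgt, k=8):
    """Return, for each source frame, the index of a mapped target frame.

    src, tgt: arrays of shape (n_frames, n_states) holding one
    pseudo-likelihood vector per frame (assumed input format).
    """
    src = np.asarray(src, dtype=float)
    tgt = np.asarray(tgt, dtype=float)
    # cluster source and target frames jointly in the same vector space
    labels = kmeans(np.vstack([src, tgt]), k)
    src_lab, tgt_lab = labels[:len(src)], labels[len(src):]
    all_tgt = np.arange(len(tgt))
    mapping = np.empty(len(src), dtype=int)
    for i, v in enumerate(src):
        cand = all_tgt[tgt_lab == src_lab[i]]
        if cand.size == 0:  # no target frame in this cluster: search all
            cand = all_tgt
        d = ((tgt[cand] - v) ** 2).sum(axis=1)
        mapping[i] = cand[d.argmin()]
    return mapping
```

Calling `map_frames(src_vectors, tgt_vectors, k=64)` would return one target frame index per source frame; the resulting frame pairs could then serve as training data for a conversion model in place of DTW-aligned pairs. The value of `k` here is illustrative only.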
