Keywords: Mandarin dialect, speech recognition, speaker verification, subdialect identification, voice conversion
Abstract: This paper introduces KeSpeech, an open-source speech dataset comprising 1,542 hours of speech recorded by 27,237 speakers in 34 cities in China, covering standard Mandarin and its 8 subdialects. The dataset has several notable properties. First, it provides multiple labels, including content transcription, speaker identity, and subdialect, and thus supports a variety of speech processing tasks such as speech recognition, speaker recognition, and subdialect identification, as well as more advanced techniques such as multi-task learning and conditional learning. Second, some of the text samples were recorded in parallel in both standard Mandarin and a particular subdialect, enabling new applications such as subdialect style conversion. Third, the number of speakers is much larger than in other open-source datasets, making it suitable for tasks that require training data from a large number of speakers. Finally, the speech signals were recorded in two phases, which makes it possible to study how human speech varies over time. We present the design principles of the KeSpeech dataset and four baseline systems built on it: speech recognition, speaker verification, subdialect identification, and voice conversion. The dataset is free for all academic use.