Abstract: The scarcity of parallel data is the key challenge in the accent conversion (AC) problem, in which both the pronunciation units and the prosody pattern need to be converted.
We propose a two-stage generative framework, "convert-and-speak", in which conversion operates only at the semantic token level and the speech is then synthesized, conditioned on the converted semantic tokens, by a speech generative model trained in the target-accent domain. This decoupled design enables the "speaking" module to exploit a massive amount of target-accent speech and relieves the parallel-data requirement of the "conversion" module. Converting through the bridge of semantic tokens also removes the need for data with text transcriptions and unlocks language pre-training techniques to further reduce the amount of parallel accented speech required.
To reduce the complexity and latency of "speaking", a single-stage autoregressive (AR) generative model is designed to achieve good quality at a lower computational cost. Experiments on Indian-English to general American-English conversion show that the proposed framework achieves state-of-the-art performance in accent similarity, speech quality, and speaker maintenance with only 15 minutes of weakly parallel data, which is not constrained to the same speaker.
Extensive experimentation with diverse accent types suggests that the framework is highly adaptable and readily scalable to other accents under low-resource data conditions. Audio samples are available at https://convert-and-speak.github.io/demo/
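The two-stage pipeline described above can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: all three components (`extract_semantic_tokens`, `convert_tokens`, `speak`) are hypothetical stand-ins for, respectively, a speech tokenizer, a seq2seq conversion model trained on weakly parallel data, and the single-stage AR generative model trained on target-accent speech.

```python
# Toy sketch of the "convert-and-speak" two-stage pipeline.
# Every function body here is a placeholder; only the data flow
# (speech -> semantic tokens -> converted tokens -> speech) reflects
# the framework described in the abstract.

def extract_semantic_tokens(source_speech: str) -> list[int]:
    # Stand-in for a speech tokenizer producing discrete semantic units.
    return [ord(c) % 100 for c in source_speech]

def convert_tokens(tokens: list[int]) -> list[int]:
    # Stage 1 ("convert"): maps source-accent tokens to the
    # target-accent token space; a placeholder mapping here.
    return [(t + 1) % 100 for t in tokens]

def speak(tokens: list[int], speaker_prompt) -> list[float]:
    # Stage 2 ("speak"): AR generative model conditioned on the
    # converted tokens (and a speaker prompt); placeholder "waveform".
    return [t / 100.0 for t in tokens]

def convert_and_speak(source_speech: str, speaker_prompt=None) -> list[float]:
    tokens = extract_semantic_tokens(source_speech)
    converted = convert_tokens(tokens)
    return speak(converted, speaker_prompt)

wave = convert_and_speak("hello")
print(len(wave))  # one output value per semantic token in this toy sketch
```

The key design point the sketch captures is the decoupling: only `convert_tokens` needs (weakly) parallel accent data, while `speak` can be trained on abundant non-parallel target-accent speech.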
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Generation] Multimedia Foundation Models, [Experience] Multimedia Applications, [Generation] Generative Multimedia
Relevance To Conference: The submitted work, titled "Convert and Speak," significantly advances the multimedia domain in several key areas. Firstly, it contributes to the burgeoning field of accent-related studies, such as accent conversion and correction. Compared with other speech processing tasks, this research area is notably underdeveloped, lacking public evaluation systems and contemporary methodologies, which leaves it far from practical use. In practical applications, accent conversion proves invaluable in both online and offline cross-country meetings by enhancing content comprehension and participant interaction. Secondly, the innovative two-stage generative framework introduced by this work provides a new paradigm for various speech processing and editing tasks, including packet loss concealment, speech inpainting, speech destuttering, and noise suppression. With extensive training on large datasets, this framework has the potential to serve as a foundational model for a broad spectrum of speech-related tasks.
Supplementary Material: zip
Submission Number: 4548