Correct and speak: accent reduction with minimum supervision

24 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Voice Conversion, Spoken Language Models, speech tokenizer, In-context Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We design a separate sequence-to-sequence task based on an autoregressive decoder-only transformer to accomplish accent correction.
Abstract: Accent conversion (AC) aims to convert non-native accented speech to native speech by changing the pronunciation pattern and prosody of the source speaker while preserving linguistic content and speaker identity. The problem is challenging because 1) parallel data in which the same speaker speaks the same content in different accents rarely exists, and 2) accent not only affects prosody but also corrupts pronunciation units in heavy accents such as Indian-accented English. In this work, we propose a new framework, built on speech generative models, consisting of a correction module and a speaking module, in which accent removal is achieved by correcting the source accented semantic tokens to the target native ones. Specifically, a separate sequence-to-sequence task based on an autoregressive decoder-only transformer is designed to perform the correction. Conditioned on the corrected semantic tokens, a speech generative model based on TF-Codec, trained on large amounts of native speech, is proposed to generate speech with native prosody. Unlike the multi-stage generation used in other generative models, we use single-stage autoregressive generation to reduce the complexity and latency of the generation process. To reduce the dependence on parallel data, we first pretrain the correction module with a pretext task in a self-supervised manner on large amounts of native speech, so that it learns the probability space of the target native semantic tokens and only small amounts of parallel data are needed to learn the mapping from specific corrupted pronunciation units to their native targets. Experimental results show that the proposed framework achieves state-of-the-art performance in terms of accentedness, speech quality, and speaker maintenance. With the pretraining, only 15 minutes of parallel data, which need not come from the same speaker, are required to achieve good correction quality. The proposed generative model also achieves higher speech quality and speaker similarity with lower complexity and latency (50 AR steps per second of audio) than multi-stage speech generation methods (75 AR steps + 7 NAR steps per second of audio). With less supervision from parallel data, this framework can easily be extended to other accents with low-resource data.
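The abstract describes the correction module as a sequence-to-sequence task handled by an autoregressive decoder-only transformer over semantic tokens. Below is a minimal sketch of one common way to frame such a task as next-token prediction over the concatenated sequence [accented tokens ; SEP ; native tokens]; the vocabulary size, model dimensions, separator token, and loss masking are illustrative assumptions, and the TF-Codec speaking module and self-supervised pretext task are not shown. This is not the authors' implementation.

```python
# Minimal sketch of a "correct" stage: a decoder-only transformer trained to map
# accented semantic tokens to native ones via next-token prediction over the
# concatenation [accented ; SEP ; native]. All sizes/vocabularies are assumptions.
import torch
import torch.nn as nn

VOCAB = 1024            # assumed size of the semantic-token codebook
SEP = VOCAB             # separator token between accented and native spans
D_MODEL, N_LAYERS, N_HEADS = 512, 6, 8

class CorrectionLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, D_MODEL)   # +1 for SEP
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.head = nn.Linear(D_MODEL, VOCAB + 1)

    def forward(self, tokens):
        # A causal mask makes the encoder stack behave as a decoder-only LM.
        T = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.backbone(self.embed(tokens), mask=mask)
        return self.head(h)

def correction_loss(model, accented, native):
    # Predict the next token over [accented ; SEP ; native]; only positions
    # inside the native span contribute to the loss.
    sep = torch.full((accented.size(0), 1), SEP, dtype=torch.long)
    seq = torch.cat([accented, sep, native], dim=1)
    logits = model(seq[:, :-1])
    targets = seq[:, 1:].clone()
    targets[:, : accented.size(1)] = -100       # ignore the accented prefix
    return nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB + 1), targets.reshape(-1), ignore_index=-100)

# Toy usage with random ids standing in for extracted semantic tokens.
model = CorrectionLM()
acc = torch.randint(0, VOCAB, (2, 50))
nat = torch.randint(0, VOCAB, (2, 50))
loss = correction_loss(model, acc, nat)
loss.backward()
```

In this framing, inference would autoregressively sample the native span after the separator, and the corrected tokens would then condition the speaking module; how the paper's TF-Codec-based generator consumes them is not specified here.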
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8861