A$^2$-Flow: Alignment-Aware Pre-training for Speech Synthesis with Flow Matching

ICLR 2025 Conference Submission 14052 Authors

28 Sept 2024 (modified: 27 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Generative Pre-training; Flow Matching; Alignment Learning; Discrete Speech Units
TL;DR: Alignment-aware pre-training method for speech synthesis with flow matching; Leveraging Discrete Speech Units for Pre-training
Abstract: Recent advances in speech synthesis have enabled highly natural and speaker-adaptive speech generation by leveraging large-scale transcribed datasets. However, requiring tens of thousands of hours of annotated speech is impractical in low-resource settings. Existing pre-trained speech models often utilize masked speech inpainting for pre-training and show strong performance on various speech generation tasks using limited task-specific data. Nonetheless, these models still require external alignment mechanisms or extensive additional training to learn alignment for alignment-aware tasks, such as text-to-speech (TTS). In this paper, we propose A$^2$-Flow, an alignment-aware pre-training method for flow matching models in speech synthesis. A$^2$-Flow integrates alignment learning directly into the pre-training process using discrete speech units, enabling the model to efficiently adapt to alignment-aware tasks without the need for separate alignment mechanisms. By embedding alignment learning into pre-training, A$^2$-Flow facilitates alignment-free voice conversion (VC) and allows for faster convergence during TTS fine-tuning, even with limited transcribed data, making it highly suitable for low-resource scenarios. Experimental results show that A$^2$-Flow achieves superior zero-shot VC performance compared to existing models and matches state-of-the-art TTS performance using only a small amount of transcribed data. Moreover, we demonstrate that A$^2$-Flow can be more efficiently applied to alignment-aware speech synthesis tasks than existing pre-training methods, providing a practical and scalable solution for high-quality speech synthesis across diverse settings.
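To make the abstract's core idea concrete, below is a minimal sketch of a conditional flow-matching training step in which the vector-field estimator is conditioned on frame-level discrete speech units, following the standard optimal-transport path of Lipman et al. (2023). This is not the paper's implementation: the network `UnitConditionedVectorField`, its dimensions, and the conditioning scheme are hypothetical stand-ins, and the paper's actual masking strategy and alignment-learning details are not shown.

```python
# Minimal sketch of unit-conditioned flow matching, assuming the linear
# (OT) probability path x_t = (1 - t) * x0 + t * x1 with target field x1 - x0.
# All module names and sizes are illustrative, not A^2-Flow's architecture.
import torch
import torch.nn as nn

class UnitConditionedVectorField(nn.Module):
    """Toy vector-field estimator conditioned on frame-aligned discrete units."""
    def __init__(self, mel_dim=80, n_units=1024, hidden=256):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, hidden)
        self.net = nn.Sequential(
            nn.Linear(mel_dim + hidden + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, x_t, t, units):
        # x_t: (B, T, mel_dim); t: (B,); units: (B, T) discrete unit ids
        cond = self.unit_emb(units)                       # (B, T, hidden)
        t_feat = t[:, None, None].expand(-1, x_t.size(1), 1)
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

def cfm_loss(model, x1, units):
    """Conditional flow-matching loss along the linear path."""
    x0 = torch.randn_like(x1)                             # noise endpoint
    t = torch.rand(x1.size(0), device=x1.device)          # uniform time
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    target = x1 - x0                                      # OT-path vector field
    return ((model(x_t, t, units) - target) ** 2).mean()

# Usage: mel spectrogram (B, T, 80) and frame-aligned unit ids (B, T),
# e.g. k-means indices of self-supervised speech features.
model = UnitConditionedVectorField()
mel = torch.randn(4, 100, 80)
units = torch.randint(0, 1024, (4, 100))
loss = cfm_loss(model, mel, units)
loss.backward()
```

Because the conditioning sequence here is derived from the speech itself (discrete units) rather than from text, the model can learn frame-level alignment behavior during pre-training; swapping in text-derived conditioning at fine-tuning time is the alignment-aware adaptation the abstract describes.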
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 14052