Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: audio generation, speech synthesis, dialog synthesis, parallel decoding
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Modeling the tokens of a neural audio codec unlocked rapid progress in audio generation, producing high-quality, coherent audio. However, this approach requires modeling long sequences, thus increasing training and inference costs. In this work, we propose SoundStorm, a model for efficient, parallel audio generation, which scales gracefully to long sequences without compromising the quality of the generated audio. SoundStorm receives as input coarse, discrete audio representations, and relies on bidirectional attention and confidence-based parallel decoding to sample the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We also demonstrate the ability of our model to synthesize high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers' voices.
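The abstract mentions confidence-based parallel decoding. As a rough illustration only, and not the authors' implementation, the sketch below shows MaskGIT-style iterative decoding for a single codec level: all positions start masked, candidate tokens are sampled in parallel, and only the highest-confidence candidates are kept each round under a cosine unmasking schedule. All names here (predict_logits, MASK, num_iters) are hypothetical placeholders.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for masked positions


def parallel_decode_level(predict_logits, seq_len, num_iters=8, rng=None):
    """Confidence-based parallel decoding sketch (MaskGIT-style).

    predict_logits: callable mapping a partially masked token array of
    shape (seq_len,) to logits of shape (seq_len, vocab_size).
    This is an illustrative sketch, not SoundStorm's actual API.
    """
    rng = rng or np.random.default_rng(0)
    tokens = np.full(seq_len, MASK, dtype=np.int64)

    for it in range(num_iters):
        logits = predict_logits(tokens)                    # (seq_len, vocab)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)

        # Sample a candidate token at every position in parallel.
        sampled = np.array([rng.choice(len(p), p=p) for p in probs])
        conf = probs[np.arange(seq_len), sampled]
        conf[tokens != MASK] = np.inf                      # keep already-fixed tokens

        # Cosine schedule: unmask progressively more tokens each iteration.
        frac_masked = np.cos(np.pi / 2 * (it + 1) / num_iters)
        num_masked = int(np.ceil(frac_masked * seq_len))

        # Commit the highest-confidence candidates; re-mask the rest.
        keep = np.argsort(-conf)[: seq_len - num_masked]
        new_tokens = np.full(seq_len, MASK, dtype=np.int64)
        new_tokens[keep] = np.where(tokens[keep] != MASK,
                                    tokens[keep], sampled[keep])
        tokens = new_tokens

    return tokens
```

The key design choice this illustrates is that, unlike autoregressive decoding, the number of model calls is fixed by the iteration count rather than by the sequence length, which is what allows decoding cost to stay low for long token sequences.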
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5911