Abstract. Modeling the tokens of a neural audio codec has unlocked rapid progress in audio generation, producing high-quality, coherent audio. However, this approach requires modeling long sequences, which increases training and inference costs. In this work, we propose SoundStorm, a model for efficient, parallel audio generation, which scales gracefully to long sequences without compromising the quality of the generated audio. SoundStorm receives as input coarse, discrete audio representations, and relies on bidirectional attention and confidence-based parallel decoding to sample the tokens of a neural audio codec. Compared to the autoregressive generation approach of AudioLM, our model produces audio of the same quality and with higher consistency in voice and acoustic conditions, while being two orders of magnitude faster. SoundStorm generates 30 seconds of audio in 0.5 seconds on a TPU-v4. We also demonstrate the ability of our model to synthesize high-quality, natural dialogue segments, given a transcript annotated with speaker turns and a short prompt with the speakers’ voices.
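To make the decoding scheme concrete, here is a minimal sketch of confidence-based parallel decoding in the style of MaskGIT, which SoundStorm applies one RVQ level at a time. The model call is a random-logits stand-in, and the codebook size, sequence length, iteration count, and cosine masking schedule are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of confidence-based parallel decoding for one RVQ level.
# The model call is a stand-in (random logits); in the real system it would be
# a bidirectional Transformer over the whole token sequence.
import numpy as np

MASK = -1          # sentinel id for masked positions
VOCAB = 1024       # assumed codebook size of one RVQ level
SEQ_LEN = 50       # assumed number of frames to generate

def model_logits(tokens, rng):
    """Stand-in for the bidirectional network: per-position logits over the codebook."""
    return rng.standard_normal((len(tokens), VOCAB))

def parallel_decode_level(num_iters=8, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.full(SEQ_LEN, MASK, dtype=np.int64)          # start fully masked
    for it in range(num_iters):
        logits = model_logits(tokens, rng)
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        # Sample a candidate token for every position in parallel.
        sampled = np.array([rng.choice(VOCAB, p=p) for p in probs])
        confidence = probs[np.arange(SEQ_LEN), sampled]
        confidence[tokens != MASK] = np.inf                   # committed tokens stay fixed
        # Cosine schedule: fraction of positions left masked after this step.
        mask_ratio = np.cos(np.pi / 2 * (it + 1) / num_iters)
        num_keep_masked = int(np.floor(mask_ratio * SEQ_LEN))
        # Commit the most confident predictions, re-mask the least confident ones.
        order = np.argsort(confidence)                        # ascending confidence
        new_tokens = sampled.copy()
        new_tokens[order[:num_keep_masked]] = MASK
        new_tokens[tokens != MASK] = tokens[tokens != MASK]
        tokens = new_tokens
    return tokens

print(parallel_decode_level())
```

After the final iteration the schedule leaves zero positions masked, so the level is fully decoded in a fixed, small number of forward passes instead of one pass per token.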
SoundStorm, coupled with the text-to-semantic modeling stage of SPEAR-TTS (Kharitonov et al., 2023), can synthesize high-quality, natural dialogues, allowing one to control the spoken content (via transcripts), speaker voices (via short voice prompts), and speaker turns (via transcript annotations). When synthesizing dialogue segments of 30 seconds, we measured a runtime of 2 seconds on a single TPU-v4. The following transcripts and speaker voices were not seen during training; a sketch of the synthesis pipeline follows the examples below.
Text | Voice Prompt | Synthesized Dialogue |
---|---|---|
Where did you go last summer?<br>I went to Greece, it was amazing.<br>Oh, that's great. I've always wanted to go to Greece. What was your favorite part?<br>Uh it's hard to choose just one favorite part, but yeah I really loved the food. The seafood was especially delicious.<br>yeah<br>And the beaches were incredible.<br>uhhuh<br>We spent a lot of time swimming, uh sunbathing, and and exploring the islands.<br>Oh that sounds like a perfect vacation! I'm so jealous.<br>It was definitely a trip I'll never forget<br>I really hope I'll get to visit someday! | *(audio)* | *(audio)* |
Something really funny happened to me this morning.<br>Oh wow, what?<br>Well, uh I woke up as usual.<br>Uhhuh<br>Went downstairs to have uh breakfast.<br>Yeah<br>Started eating. Then uh 10 minutes later I realized it was the middle of the night.<br>Oh no way, that's so funny! | *(audio)* | *(audio)* |
I'm going to Istanbul for the Champions League final.<br>That's awesome. Who are you supporting?<br>Liverpool. I've always been a big fan.<br>Ah Liverpool is a great team but I I think it will be it will be a close match.<br>Yeah, I can't wait, you know, I'm super excited to be going there!<br>Yeah I can imagine.<br>Are you coming as well?<br>Ah, no, unfortunately, I I can't. | *(audio)* | *(audio)* |
I've always wanted to learn how to play the guitar.<br>What kind of guitar do you have in mind?<br>Um I'm not sure, I guess I'd uh like to learn to play both acoustic and electric.<br>Yeah, that's a great idea. Both types of guitars have their own uh their own unique sounds and uh and playing styles.<br>I know, but uh it's hard to decide which one to start with.<br>Well, um if you're not sure, I would recommend starting with an acoustic guitar.<br>Interesting<br>They're a little easier to learn on<br>ah<br>and they can be played anywhere.<br>That's good to know. | *(audio)* | *(audio)* |
I didn't sleep well last night.<br>Oh, no. What happened?<br>I don't know. I I just couldn't seem to uh to fall asleep somehow, I kept tossing and turning all night.<br>That's too bad. Maybe you should uh try going to bed earlier tonight or uh maybe you could try reading a book.<br>Yeah, thanks for the suggestions, I hope you're right.<br>No problem. I I hope you get a good night's sleep. | *(audio)* | *(audio)* |
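To illustrate the control points described above (content via the transcript, turns via annotations, voice via the prompt), here is a minimal sketch of how a dialogue could be driven through the two-stage pipeline. The turn-marker format and the function names `text_to_semantic`, `soundstorm_generate`, and `codec_decode` are hypothetical stand-ins, not a public API.

```python
# Sketch of the dialogue synthesis pipeline: transcript with speaker-turn
# annotations -> semantic tokens -> acoustic tokens -> waveform.
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker id, text)

def serialize_dialogue(turns: List[Turn]) -> str:
    """Join turns into a single transcript with explicit speaker-turn markers
    (the [A]/[B] format is illustrative, not the paper's exact annotation)."""
    return " ".join(f"[{speaker}] {text}" for speaker, text in turns)

def synthesize_dialogue(turns: List[Turn],
                        prompt_audio,
                        text_to_semantic: Callable,
                        soundstorm_generate: Callable,
                        codec_decode: Callable):
    transcript = serialize_dialogue(turns)
    semantic_tokens = text_to_semantic(transcript)                    # SPEAR-TTS-style stage
    acoustic_tokens = soundstorm_generate(semantic_tokens,
                                          voice_prompt=prompt_audio)  # SoundStorm stage
    return codec_decode(acoustic_tokens)                              # neural codec decoder

# Example transcript (first row of the table above):
turns = [("A", "Where did you go last summer?"),
         ("B", "I went to Greece, it was amazing.")]
print(serialize_dialogue(turns))
```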
We demonstrate the capability of SoundStorm to generate audio conditioned on the semantic tokens of AudioLM (Borsos et al., 2022), with and without a 3-second voice prompt. In the unprompted case SoundStorm samples different speakers; in the prompted case it maintains the speaker's voice with high consistency, while generating audio two orders of magnitude faster than AudioLM's acoustic generator. The original samples are from LibriSpeech test-clean; a sketch of the two conditioning setups follows the samples below.
*(Audio samples: Original | Unprompted | Prompted)*
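As a rough illustration of the difference between the two settings, the sketch below assumes that voice prompting works by fixing the prompt's acoustic tokens as an unmasked prefix of the decoding canvas, so that only the remainder is generated; the frame rate, number of RVQ levels, and sequence length are illustrative assumptions.

```python
# Sketch of unprompted vs. prompted generation: the prompt pins down an unmasked
# prefix of the acoustic-token canvas, and decoding only fills the masked part.
import numpy as np

MASK = -1
NUM_LEVELS = 8        # assumed number of RVQ levels
TOTAL_FRAMES = 1500   # e.g. 30 s at an assumed 50 frames/s

def init_canvas(prompt_acoustic_tokens=None):
    """Initial token canvas: fully masked when unprompted, prompt tokens fixed
    as a prefix when prompted (these positions are never re-masked)."""
    canvas = np.full((TOTAL_FRAMES, NUM_LEVELS), MASK, dtype=np.int64)
    if prompt_acoustic_tokens is not None:           # prompted case
        n = prompt_acoustic_tokens.shape[0]
        canvas[:n] = prompt_acoustic_tokens
    return canvas

# Unprompted: every position is generated from scratch (the speaker is sampled freely).
unprompted = init_canvas()
# Prompted: a ~3-second prompt (here 150 frames of dummy tokens) fixes the voice.
prompt = np.zeros((150, NUM_LEVELS), dtype=np.int64)
prompted = init_canvas(prompt)
print((unprompted == MASK).mean(), (prompted == MASK).mean())
```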
In the prompted case, SoundStorm's generations show higher acoustic consistency and preserve the speaker's voice from the prompt better than AudioLM. Compared to RVQ level-wise greedy decoding with the same model, SoundStorm produces higher-quality audio; a sketch of the greedy baseline follows the samples below.
*(Audio samples: Original | AudioLM | Greedy | SoundStorm)*
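For contrast with the iterative scheme sketched earlier, the following is a minimal sketch of an RVQ level-wise greedy baseline: each level is committed in a single pass (argmax here for simplicity), with no confidence-based re-masking. The model call is again a random-logits stand-in and the sizes are illustrative.

```python
# Sketch of RVQ level-wise greedy decoding: one forward pass per level,
# committing every position at once.
import numpy as np

NUM_LEVELS = 8
SEQ_LEN = 50
VOCAB = 1024

def model_logits(canvas, level, rng):
    """Stand-in for the network conditioned on all previously decoded levels."""
    return rng.standard_normal((SEQ_LEN, VOCAB))

def greedy_levelwise_decode(seed=0):
    rng = np.random.default_rng(seed)
    canvas = np.full((SEQ_LEN, NUM_LEVELS), -1, dtype=np.int64)
    for level in range(NUM_LEVELS):
        logits = model_logits(canvas, level, rng)
        canvas[:, level] = logits.argmax(axis=-1)   # commit the whole level in one shot
    return canvas

print(greedy_levelwise_decode().shape)
```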
SoundStorm is a model for high-quality, efficient generation of neural audio codec-derived representations of audio. In this work, we use it as a replacement for the acoustic generation pipeline of AudioLM and SPEAR-TTS. We acknowledge that the audio samples produced by the model may reflect biases present in the training data, for instance in terms of the accents and voice characteristics represented. In our generated samples, we demonstrate that we can reliably control speaker characteristics via prompting; however, a more thorough analysis of the training data and its limitations remains future work. In turn, the ability to mimic a voice can have numerous malicious applications, including bypassing biometric identification and impersonation. Thus, it is crucial to put safeguards against potential misuse in place: to this end, we have verified that, after replacing the acoustic generation pipeline with SoundStorm, the generated audio remains detectable by a dedicated classifier (98.5%, using the same classifier as Borsos et al. (2022)). Hence, as a component of a larger system, we believe that SoundStorm is unlikely to introduce risks beyond those discussed by Borsos et al. (2022) and Kharitonov et al. (2023). At the same time, we hope that relaxing the memory and computational requirements of AudioLM will make research in audio generation accessible to a wider community.