TL;DR: A variational autoencoder combined with a token-based speech language model to improve the naturalness of speech synthesis.
Abstract: The success of large language models in text processing has inspired their adaptation to speech modeling.
However, since speech is continuous and complex, it is often discretized for autoregressive modeling.
Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information.
As a result, models trained on these tokens can generate speech with reduced naturalness.
Existing approaches try to fix this by adding pitch features to the semantic tokens.
However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering.
To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens.
Our approach eliminates the need for manual extraction and selection of paralinguistic features.
Moreover, it produces preferred speech continuations according to human raters.
Code, samples and models are available at https://github.com/b04901014/vae-gslm.
Lay Summary: Imagine teaching a computer to talk just like a normal human. That's the task of Generative Spoken Language Model (GSLM). Think of it as a program that learns to predict what should come next in speech, much like your phone tries to guess the next word you're typing, but with voices instead of text. We noticed that earlier methods of these models, while good at generating meaningful language content, sometimes lost important details that make human voices sound natural. These are things like pitch (the up-and-down of your voice) and loudness variations when you speak.
Can we make the computer's speech sound more natural and still make perfect sense? We developed a way for our model to pay special attention to these natural vocal qualities. But instead of us humans telling it exactly what to look for, we let it learn by listening to a huge amount of human speech data, figuring out on its own how to capture all those subtle elements that make speech sound real.
When people listened to our model, they found its speech much more natural and meaningful compared to previous approaches. This means our method could improve how natural and clear the voices of existing conversational agents, making interactions with computers much more engaging and convenient.
Link To Code: https://github.com/b04901014/vae-gslm
Primary Area: Applications->Language, Speech and Dialog
Keywords: Generative Spoken Language Modeling;Speech Language Model
Submission Number: 12402
Loading