Keywords: spontaneous speech synthesis, text-to-speech, self-supervised learning, mean-opinion-score prediction
TL;DR: We train and evaluate a large number of systems to gain insight into how different self-supervised speech representations can be used in TTS and MOS prediction on spontaneous speech.
Abstract: Self-supervised learning (SSL) speech representations learned
from large amounts of diverse, mixed-quality speech data
without transcriptions are gaining ground in many speech-technology
applications. Prior work has shown that SSL is
an effective intermediate representation in two-stage text-to-speech
(TTS) for both read and spontaneous speech. However,
it is still not clear which SSL model, and which layer of each
model, is most suited for spontaneous TTS. We address this
shortcoming by extending the scope of comparison for SSL in
spontaneous TTS to 6 different SSLs and 3 layers within each
SSL. Furthermore, SSL has also shown potential in predicting
the mean opinion scores (MOS) of synthesized speech, but this
has only been done in read-speech MOS prediction. We extend
an SSL-based MOS prediction framework previously developed
for scoring read speech synthesis and evaluate its performance
on synthesized spontaneous speech. All experiments are conducted
on each of two different spontaneous corpora in order to
find generalizable trends. Overall, we present comprehensive
experimental results on the use of SSL in spontaneous TTS and
MOS prediction to further quantify and understand how SSL
can be used in spontaneous TTS. Audio samples:
https://www.speech.kth.se/tts-demos/sp_ssl_tts
Supplementary Material: zip
Community Implementations: [4 code implementations](https://www.catalyzex.com/paper/on-the-use-of-self-supervised-speech/code)