Decoding Strategy with Perceptual Rating Prediction for Language Model-Based Text-to-Speech Synthesis

Published: 10 Oct 2024, Last Modified: 22 Oct 2024 · Audio Imagination: NeurIPS 2024 Workshop · CC BY 4.0
Keywords: decoding strategy, text-to-speech synthesis, discrete speech token, perceptual rating prediction
TL;DR: We propose a novel decoding strategy that leverages perceptual rating prediction for language model-based speech synthesis.
Abstract: Recently, text-to-speech (TTS) synthesis models that use language models (LMs) to autoregressively generate discrete speech tokens, such as neural audio codec tokens, have gained attention. They successfully improve the diversity and expressiveness of synthetic speech while addressing repetitive generation issues by incorporating sampling-based decoding strategies. However, sampling randomness can lead to undesirable outputs, such as artifacts, and destabilize the quality of synthetic speech. To address this issue, we propose BOK-PRP, a novel sampling-based decoding strategy for LM-based TTS. Our strategy incorporates a best-of-K (BOK) selection process based on perceptual rating prediction (PRP), filtering out undesirable outputs while maintaining output diversity. Importantly, the perceptual rating predictor is trained on human ratings independently of the TTS model, allowing BOK-PRP to be applied to various pre-trained LM-based TTS models without requiring additional TTS training. Results from subjective evaluations demonstrate that BOK-PRP significantly improves the naturalness of synthetic speech.
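The best-of-K selection described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_tts_candidate` and `predict_rating` are hypothetical stand-ins for a stochastic LM-based TTS decoding pass and the externally trained perceptual rating predictor, respectively.

```python
import random

def sample_tts_candidate(text: str, seed: int) -> list[int]:
    # Hypothetical stand-in for one sampling-based decoding pass of an
    # LM-based TTS model, which would autoregressively sample discrete
    # speech tokens (e.g., neural audio codec tokens).
    rng = random.Random((text, seed).__hash__())
    return [rng.randrange(1024) for _ in range(16)]

def predict_rating(tokens: list[int]) -> float:
    # Hypothetical stand-in for the perceptual rating predictor, which in
    # the paper is trained on human ratings independently of the TTS model.
    return sum(tokens) / len(tokens)

def bok_prp_decode(text: str, k: int = 5) -> list[int]:
    # BOK-PRP: draw K candidate outputs via sampling, then keep the one
    # the perceptual rating predictor scores highest, filtering out
    # low-rated samples while preserving sampling diversity.
    candidates = [sample_tts_candidate(text, seed) for seed in range(k)]
    return max(candidates, key=predict_rating)
```

Because selection happens purely at decoding time with an external predictor, this wrapper could in principle be placed around any pre-trained sampling-based TTS decoder without retraining it.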
Submission Number: 8