Text-to-speech synthesizer based on combination of composite wavelet and hidden Markov models

Published: 01 Jan 2013 · Last Modified: 21 May 2025 · SSW 2013 · CC BY-SA 4.0
Abstract: This paper proposes a text-to-speech synthesis (TTS) system based on a combined model of the Composite Wavelet Model (CWM) and Hidden Markov Model (HMM). Conventional HMM-based TTS systems using cepstral features tend to produce over-smoothed spectra, which often result in muffled and buzzy synthesized speech. This over-smoothing is caused primarily by the averaging of the spectra associated with each phoneme during the learning process. To avoid over-smoothing of the generated spectra, we consider it important to focus on a different representation of the generative process of speech spectra. In particular, we choose to characterize speech spectra by the CWM, whose parameters correspond to the frequency, gain, and peakiness of each underlying formant. This idea is motivated by our expectation that averaging these parameters would not directly cause over-smoothing of the spectra, as opposed to averaging cepstral representations. To describe the entire generative process of a sequence of speech spectra, we combine the generative process of a formant trajectory using an HMM with the generative process of a speech spectrum using the CWM. A parameter learning algorithm for this combined model is derived based on an auxiliary function approach. We confirmed through experiments that our speech synthesis system was able to generate speech spectra with clear peaks and dips, which resulted in natural-sounding synthetic speech.
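The intuition behind the formant-parameter representation can be illustrated with a small numerical sketch. The kernel shape (Gaussian), the parameter values, and the function names below are assumptions made for illustration only; they are not the paper's actual CWM kernel or training procedure. The point shown is that averaging two spectra with slightly shifted formants flattens and broadens the peaks, whereas averaging the formant parameters themselves keeps the peaks sharp.

```python
import numpy as np

def formant_envelope(freqs_hz, formants):
    """Toy spectral envelope: a sum of Gaussian-shaped formant kernels.

    Each formant is (center_freq_hz, gain, peakiness); a larger peakiness
    value gives a narrower, sharper peak. This is only a stand-in for the
    CWM parameterization described in the abstract.
    """
    env = np.zeros_like(freqs_hz, dtype=float)
    for center, gain, peakiness in formants:
        env += gain * np.exp(-peakiness * (freqs_hz - center) ** 2)
    return env

freqs = np.linspace(0.0, 5000.0, 2048)

# Two realizations of the "same" phone with slightly different formants
# (hypothetical values chosen only to make the effect visible).
a = [(700.0, 1.0, 5e-4), (1200.0, 0.8, 5e-4), (2600.0, 0.5, 5e-4)]
b = [(760.0, 0.9, 5e-4), (1300.0, 0.7, 5e-4), (2750.0, 0.6, 5e-4)]

# Averaging in the spectral domain: the shifted peaks partially cancel,
# so the result is lower and broader (the over-smoothing effect).
spectrum_avg = 0.5 * (formant_envelope(freqs, a) + formant_envelope(freqs, b))

# Averaging the formant parameters instead: the result is still a set of
# sharp peaks, located at the averaged frequencies with averaged gains.
param_avg = [tuple(0.5 * (pa + pb) for pa, pb in zip(fa, fb))
             for fa, fb in zip(a, b)]
parameter_avg_spectrum = formant_envelope(freqs, param_avg)

print("first-formant peak, spectrum-averaged :", spectrum_avg[freqs < 1000].max())
print("first-formant peak, parameter-averaged:", parameter_avg_spectrum[freqs < 1000].max())
```

Running this prints a noticeably lower first-formant peak for the spectrum-averaged envelope than for the parameter-averaged one, which is the behaviour the abstract attributes to cepstral/spectral averaging versus formant-parameter averaging.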