Modeling of various speaking styles and emotions for HMM-based speech synthesis

Junichi Yamagishi, Koji Onishi, Takashi Masuko, Takao Kobayashi

Published: 2003, Last Modified: 19 Oct 2024. INTERSPEECH 2003. CC BY-SA 4.0

Abstract: This paper presents an approach to realizing various emotional expressions and speaking styles in synthetic speech using HMM-based speech synthesis. We describe two methods for modeling speaking styles and emotions. In the first method, called "style dependent modeling," each speaking style and emotion is modeled individually. In the second method, called "style mixed modeling," speaking style or emotion is treated as a contextual factor alongside phonetic, prosodic, and linguistic factors, and all speaking styles and emotions are modeled simultaneously by a single acoustic model. We chose four styles, "reading," "rough," "joyful," and "sad," and compared the two modeling methods using these styles. The results of subjective tests show that both modeling methods have almost the same performance, and that it is possible to synthesize speech with speaking styles and emotions similar to those of the recorded speech. In addition, it is shown that style mixed modeling can reduce the number of output distributions compared with style dependent modeling.
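The difference between the two methods can be illustrated with a minimal sketch of how training data might be organized in each case. This is not the authors' implementation: the toy corpus, the label format `phone/style:NAME`, and the function names are all hypothetical, and the sketch assumes that context clustering can ask questions about a label's style, which the abstract does not spell out.

```python
from collections import defaultdict

# Toy corpus of (style, phone sequence) pairs. Hypothetical data,
# using style names taken from the abstract.
CORPUS = [
    ("reading", ["a", "k", "a"]),
    ("joyful", ["a", "k", "i"]),
    ("sad", ["k", "a"]),
]


def style_dependent_sets(corpus):
    """Style dependent modeling: one training set (and one model) per style."""
    sets = defaultdict(list)
    for style, phones in corpus:
        sets[style].extend(phones)
    return dict(sets)


def style_mixed_labels(corpus):
    """Style mixed modeling: one pooled training set in which the style
    is attached to every label as an extra contextual factor."""
    return [f"{p}/style:{style}" for style, phones in corpus for p in phones]


def is_style(label, style):
    """A style question that a clustering step could ask about a label
    (assumed label format, see above)."""
    return label.endswith(f"/style:{style}")


dependent = style_dependent_sets(CORPUS)
mixed = style_mixed_labels(CORPUS)
```

In the pooled setting, labels whose acoustics do not differ across styles can in principle share an output distribution, which is one plausible reading of why the style mixed model needs fewer distributions.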