Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

Published: 15 Jun 2023 · Last Modified: 28 Jun 2023 · SSW12
Keywords: text-to-speech, speech-to-gesture, joint multimodal synthesis, deep generative model, diffusion model, evaluation
TL;DR: We propose and evaluate the first diffusion probabilistic model for synthesising speech and body gestures together.
Abstract: With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach.
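To make the core idea concrete, the sketch below illustrates denoising-diffusion training on a joint speech-and-gesture representation: acoustic and motion features are concatenated and a single denoiser is trained to predict the noise added to them. This is a minimal illustration under assumed shapes, a toy MLP denoiser, and a standard DDPM-style noise schedule; it is not the actual Diff-TTSG architecture described in the paper, and all module names and dimensions here are hypothetical.

```python
# Minimal sketch of joint diffusion training on speech + gesture features.
# Hypothetical shapes and modules; NOT the actual Diff-TTSG architecture.
import torch
import torch.nn as nn

T = 1000  # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)            # standard DDPM noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class JointDenoiser(nn.Module):
    """Predicts the noise added to concatenated speech + gesture features."""
    def __init__(self, mel_dim=80, gesture_dim=45, hidden=256):
        super().__init__()
        d = mel_dim + gesture_dim
        self.net = nn.Sequential(
            nn.Linear(d + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, x_t, t):
        # x_t: (batch, frames, mel_dim + gesture_dim); t: (batch,) in [0, T)
        t_feat = (t.float() / T)[:, None, None].expand(*x_t.shape[:2], 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def diffusion_loss(model, mel, gesture):
    """One DDPM-style training step on the joint (speech, gesture) features."""
    x0 = torch.cat([mel, gesture], dim=-1)           # joint clean features
    t = torch.randint(0, T, (x0.shape[0],))
    a = alphas_cumprod[t][:, None, None]
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps       # forward noising q(x_t | x_0)
    return nn.functional.mse_loss(model(x_t, t), eps)

model = JointDenoiser()
mel = torch.randn(4, 100, 80)       # dummy mel-spectrogram frames
gesture = torch.randn(4, 100, 45)   # dummy pose-feature frames
loss = diffusion_loss(model, mel, gesture)
loss.backward()
```

Because both modalities are denoised jointly from shared noise, samples of speech and motion stay mutually consistent, which a pair of independently trained unimodal models would not guarantee.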