Diffusion-based co-speech gesture generation using joint text and audio representation

Anna Deichler; Shivam Mehta; Simon Alexanderson; Jonas Beskow

Published: 04 Sept 2023, Last Modified: 15 Jun 2025 | Venue: GENEA Challenge 2023 Main proceeding | Readers: Everyone

Keywords: gesture generation, semantic gestures, motion synthesis, diffusion models, contrastive pre-training

Abstract: This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns joint embeddings for speech and gestures with the aim of learning a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model to achieve semantically aware co-speech gesture generation. Our entry achieved the highest human-likeness rating and the highest speech-appropriateness rating among the submitted entries. This indicates that our system is a promising approach to generating human-like co-speech gestures that carry semantic meaning in embodied agents.
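
Illustrative sketch (not from the paper): the abstract does not state which loss CSMP uses, so the example below assumes a CLIP-style symmetric contrastive objective over paired speech and motion embeddings, one common way to learn a joint embedding space. The function names, the `temperature` value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(speech_emb, motion_emb, temperature=0.07):
    """Assumed CLIP-style loss: pair i of speech and motion is the positive,
    every other pairing in the batch is a negative."""
    s = [l2_normalize(v) for v in speech_emb]
    m = [l2_normalize(v) for v in motion_emb]
    # Cosine similarity of every speech/motion pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(si, mj)) / temperature for mj in m]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the speech-to-motion and motion-to-speech directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: two speech clips and their (hypothetical) embeddings.
speech = [[1.0, 0.0], [0.0, 1.0]]
motion_aligned = [[1.0, 0.0], [0.0, 1.0]]
motion_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(speech, motion_aligned)
loss_swapped = symmetric_contrastive_loss(speech, motion_swapped)
```

Under this assumed objective, embeddings of matching speech and gesture segments are pulled together and mismatched ones pushed apart, so `loss_aligned` comes out lower than `loss_swapped`. In the system described above, the resulting joint embedding would then serve as the conditioning signal for the diffusion model.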

Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/diffusion-based-co-speech-gesture-generation/code)
