Abstract: Speech-driven gesture generation aims to synthesize a gesture sequence synchronized with the input speech signal. Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence, ignoring the semantic association between the two modalities and failing to handle salient gestures. In this paper, we propose a novel speech-driven gesture generation method that emphasizes the semantic consistency of salient postures. Specifically, we first learn a joint manifold space for the individual representations of audio and body pose to exploit the inherent semantic association between the two modalities, and enforce this consistency via a consistency loss. Furthermore, we emphasize the semantic consistency of salient postures by introducing a weakly-supervised detector to identify them, and by reweighting the consistency loss to focus on learning the correspondence between salient postures and the high-level semantics of the speech content. In addition, we extract audio features dedicated to facial expression and body gesture separately, and design separate branches for face and body gesture synthesis. Extensive experiments and visualization results demonstrate the superiority of our method over state-of-the-art approaches.
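To make the reweighted consistency objective concrete, below is a minimal sketch (not the authors' implementation, whose exact formulation is not given here). It assumes hypothetical audio and pose encoders that already map both modalities into the shared manifold space, measures per-frame semantic consistency with cosine similarity, and boosts the contribution of frames flagged by a salient-posture detector via an assumed weighting hyperparameter `alpha`.

```python
# Minimal sketch of a saliency-reweighted consistency loss (assumptions:
# embeddings already live in the joint manifold; `saliency` comes from a
# weakly-supervised salient-posture detector; `alpha` is hypothetical).
import torch
import torch.nn.functional as F

def reweighted_consistency_loss(audio_feat, pose_feat, saliency, alpha=2.0):
    """audio_feat, pose_feat: (B, T, D) per-frame embeddings in the joint space.
    saliency: (B, T) soft scores in [0, 1] indicating salient postures.
    alpha: extra emphasis placed on salient frames.
    """
    # Per-frame semantic inconsistency: 1 - cosine similarity of the embeddings.
    per_frame = 1.0 - F.cosine_similarity(audio_feat, pose_feat, dim=-1)  # (B, T)
    # Upweight frames with salient postures so they dominate the loss.
    weights = 1.0 + alpha * saliency
    return (weights * per_frame).sum() / weights.sum()

# Toy usage with random tensors standing in for encoder outputs.
B, T, D = 2, 32, 128
loss = reweighted_consistency_loss(torch.randn(B, T, D),
                                   torch.randn(B, T, D),
                                   torch.rand(B, T))
print(loss.item())
```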
Primary Subject Area: [Generation] Generative Multimedia
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: In this paper, we propose a novel co-speech gesture generation method that enhances the learning of the cross-modal association between speech and gesture. Our model learns a joint manifold space for the representations of audio and body pose to exploit the inherent association between the two modalities, and enforces semantic consistency using a consistency loss. Our method achieves more promising results than existing works across several subjects. We believe that our paper will be of great interest to readers in the areas of multi-modality learning and gesture video generation, and thus fits the aims and scope of MM.
Supplementary Material: zip
Submission Number: 800