SemGest: A Multimodal Feature Space Alignment and Fusion Framework for Semantic-aware Co-speech Gesture Generation

Published: 27 Aug 2025 · Last Modified: 27 Aug 2025 · GENEA Workshop 2025 · CC BY 4.0
Abstract: This paper addresses the challenge of 3D co-speech gesture generation, aiming to generate body gestures that align with spoken content. Existing methods leverage multimodal features, such as speech and transcripts, to improve the expressiveness of generated gestures; however, generating gestures that convey the semantic meaning of the speech remains challenging. To address this limitation, we propose SemGest, a framework featuring a semantic-to-gesture alignment mechanism and a feature fusion module that effectively integrates speech features with semantic features extracted from the transcribed text. A diffusion-based model is then conditioned on the fused features to generate realistic, semantic-aware co-speech gestures. By aligning the semantic and gesture spaces and adaptively fusing speech and semantic features, the resulting feature space is more robust, aiding the conditional generation process. A detailed experimental analysis demonstrates the advantages of the proposed framework over baseline algorithms in generating vivid co-speech gestures, and ablation studies validate the effectiveness of the semantic-to-gesture alignment and feature fusion mechanisms.
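To make the abstract's pipeline concrete, the sketch below shows one plausible reading of the two components it names: a semantic-to-gesture alignment mechanism and an adaptive speech/semantic fusion module whose output conditions a diffusion model. This is not the authors' code; the module names, feature dimensions, the contrastive alignment objective, and the cross-attention-with-gating fusion are all assumptions for illustration.

```python
# Minimal sketch (assumptions, not the authors' implementation) of
# semantic-to-gesture alignment and speech/semantic feature fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGestureAligner(nn.Module):
    """Projects semantic (text) and gesture features into a shared space."""

    def __init__(self, sem_dim=768, ges_dim=165, shared_dim=256):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, shared_dim)
        self.ges_proj = nn.Linear(ges_dim, shared_dim)

    def forward(self, sem_feat, ges_feat):
        z_sem = F.normalize(self.sem_proj(sem_feat), dim=-1)
        z_ges = F.normalize(self.ges_proj(ges_feat), dim=-1)
        return z_sem, z_ges

    @staticmethod
    def alignment_loss(z_sem, z_ges, temperature=0.07):
        # Symmetric InfoNCE-style objective; the paper may use a different loss.
        logits = z_sem @ z_ges.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))


class SpeechSemanticFusion(nn.Module):
    """Adaptively fuses frame-level speech features with semantic features."""

    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, speech_feat, sem_feat):
        # Speech queries attend over semantic tokens; a learned gate blends
        # the attended semantics back into the speech stream.
        attended, _ = self.cross_attn(speech_feat, sem_feat, sem_feat)
        g = self.gate(torch.cat([speech_feat, attended], dim=-1))
        return speech_feat + g * attended


if __name__ == "__main__":
    B, T, D = 2, 40, 256
    aligner = SemanticGestureAligner(shared_dim=D)
    fusion = SpeechSemanticFusion(dim=D)

    sem = torch.randn(B, 768)           # utterance-level semantic embedding
    ges = torch.randn(B, 165)           # pooled gesture feature (e.g. pose vector)
    loss = aligner.alignment_loss(*aligner(sem, ges))

    speech_seq = torch.randn(B, T, D)   # frame-level speech features
    sem_seq = torch.randn(B, T, D)      # frame-aligned semantic features
    cond = fusion(speech_seq, sem_seq)  # conditioning signal for the diffusion model
    print(loss.item(), cond.shape)
```

In this reading, the fused sequence `cond` would serve as the conditioning input to the diffusion-based gesture generator, with the alignment loss trained jointly to keep the semantic space consistent with gesture structure.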
Submission Number: 3