SGEAG: Semantic-guided emotional-aware gesture generation from audio

Published: 22 Sept 2025 · Last Modified: 22 Sept 2025 · WiML @ NeurIPS 2025 · CC BY 4.0
Keywords: Gesture generation, Emotion-aware, Speech-to-gestures, Semantic-guided
Abstract: Gestures and facial expressions are fundamental to natural human communication, yet synthesizing them from speech remains a complex challenge due to the many-to-many mapping between audio and motion, the scarcity of semantic gestures, and the difficulty of capturing emotional nuance. While facial expressions tend to align with phonetic cues, whole-body gestures are more strongly driven by rhythm, semantics, and emotional context. Prior work has made progress in rhythm-aligned gesture generation but often neglects the semantic intent of speech [1] and the emotional tone of the speaker [2], resulting in gestures that appear generic or lack expressiveness.

To address this gap, we propose SGEAG, a framework that generates whole-body co-speech gestures, including the face, hands, and upper and lower body, directly from audio. SGEAG introduces a two-module architecture that separates gesture generation into (1) an audio2face module that maps speech content to facial expressions and (2) an audio2body module that leverages rhythmic and semantic cues to drive natural body movements. A key component is the Semantic-Guided Mechanism (SGM), which dynamically regulates the relative importance of rhythmic and semantic features, enabling the model to capture rare but meaningful gestures. To ensure expressiveness, we further incorporate style and emotion adaptation, applied separately to the face and body, so that the generated gestures reflect speaker individuality.

Experiments on the BEAT2 dataset show that SGEAG consistently outperforms state-of-the-art methods such as EMAGE [3] and CAMN [4]. Quantitatively, our model reduces the Fréchet Gesture Distance (FGD) to 0.609, improves Beat Alignment (BA) to 0.811, and achieves superior emotion classification accuracy of 64.75%, a gain of more than 13% over EMAGE. Qualitative evaluations further confirm that SGEAG produces motion sequences that are semantically relevant and emotionally expressive. Ablation studies highlight the crucial role of both the semantic-guided mechanism and the style-emotion adaptation modules, showing significant performance degradation when these components are removed. Beyond numerical performance, a user study with 20 participants found that gestures generated by SGEAG were perceived as more natural, diverse, emotionally aligned, and semantically appropriate than those of competing methods; participants rated our model highest across all categories, with improvements in perceived naturalness (0.82) and emotional alignment (0.79). These findings suggest that the benefits of SGEAG extend beyond computational metrics to human perception, an essential benchmark for real-world deployment.

The innovation of SGEAG lies in its joint modeling of semantics, rhythm, and emotion for co-speech gesture generation, enabled by a modular design, cross-attention between the face and body branches, and semantic-guided regulation of feature importance. This contribution has significant implications for virtual humans, digital assistants, educational platforms, and immersive entertainment, where natural and expressive gestures enhance interaction quality. Future research may extend SGEAG to multilingual and culturally diverse datasets, broadening its generalizability across communication contexts.
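The abstract does not include implementation details, but the Semantic-Guided Mechanism it describes amounts to dynamically weighting rhythmic against semantic audio features. The sketch below shows one plausible way such a gate could be realized in PyTorch; the module name, feature dimensions, and sigmoid gating are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a gate that dynamically weights
# rhythm vs. semantic features, in the spirit of the Semantic-Guided
# Mechanism (SGM) described in the abstract.
import torch
import torch.nn as nn


class SemanticGuidedGate(nn.Module):
    """Blends rhythm and semantic features with a learned, per-frame weight."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # The gate sees both feature streams and predicts a weight in [0, 1].
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, rhythm_feat: torch.Tensor, semantic_feat: torch.Tensor) -> torch.Tensor:
        # rhythm_feat, semantic_feat: (batch, time, feat_dim)
        alpha = self.gate(torch.cat([rhythm_feat, semantic_feat], dim=-1))  # (B, T, 1)
        # alpha near 1 emphasizes semantic cues (e.g. rare semantic gestures);
        # alpha near 0 falls back to rhythm-driven motion.
        return alpha * semantic_feat + (1.0 - alpha) * rhythm_feat


if __name__ == "__main__":
    gate = SemanticGuidedGate(feat_dim=256)
    rhythm = torch.randn(2, 100, 256)    # e.g. beat/onset-derived features
    semantic = torch.randn(2, 100, 256)  # e.g. text/ASR-derived features
    print(gate(rhythm, semantic).shape)  # torch.Size([2, 100, 256])
```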
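The cross-attention between the face and body branches can likewise be pictured as the body stream attending to facial features so that body motion stays consistent with the generated expression. The following is a minimal sketch under assumed shapes and layer choices (nn.MultiheadAttention with a residual connection and LayerNorm); it is not taken from the paper.

```python
# Minimal sketch (assumed design, not the authors' code): the body branch
# attends to facial features, one way to realize the face-body cross-attention
# mentioned in the abstract.
import torch
import torch.nn as nn


class FaceBodyCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, body_feat: torch.Tensor, face_feat: torch.Tensor) -> torch.Tensor:
        # body_feat queries the face features; both are (B, T, dim).
        attended, _ = self.attn(query=body_feat, key=face_feat, value=face_feat)
        return self.norm(body_feat + attended)
```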
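For context on the reported metrics, the Fréchet Gesture Distance is conventionally computed like FID: a Fréchet distance between Gaussian fits of real and generated gesture embeddings. The standard-formula sketch below assumes embeddings have already been extracted; the embedding network and any BEAT2-specific preprocessing are outside its scope.

```python
# Standard FID-style formula (assumed to match the FGD reported in the abstract):
# d^2 = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})
import numpy as np
from scipy.linalg import sqrtm


def frechet_gesture_distance(real_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """real_emb, gen_emb: (num_samples, embedding_dim) gesture embeddings."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    # Matrix square root of the covariance product; drop tiny imaginary parts.
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```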
Submission Number: 32