Dual-Path Transformer-Based GAN for Co-speech Gesture Synthesis

Published: 2025 · Last Modified: 22 Jan 2026 · Int. J. Soc. Robotics 2025 · CC BY-SA 4.0
Abstract: Co-speech gestures have a significant impact on conveying information. For social agents, producing realistic and smooth gestures is crucial to enabling natural interaction with humans; this is a challenging task that depends on many influencing factors (e.g., speech audio, content, and the interacting person). In this paper, we tackle the cross-modal fusion problem through a novel fusion mechanism for end-to-end learning-based co-speech gesture generation. In particular, we employ parallel directional cross-modal transformers together with an interactive, cascaded 2D attention module to achieve selective fusion of gesture-related cues. In addition, we propose new metrics to evaluate gesture diversity and speech-gesture correspondence without requiring 3D pose annotations. Experiments on a public dataset show that the proposed method produces diverse, human-like poses and outperforms competitive state-of-the-art methods in both objective and subjective evaluations.
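To make the described fusion mechanism concrete, below is a minimal PyTorch sketch of a dual-path cross-modal fusion: two parallel directional cross-attention transformers (audio queries attending to text and vice versa) followed by a simple selective fusion over the two paths and over time. All module names, tensor shapes, and the specific gating/attention choices here are illustrative assumptions; this is not the authors' implementation.

```python
# Illustrative sketch only: the real architecture, dimensions, and 2D attention design
# are not specified in the abstract and are assumed here for demonstration.
import torch
import torch.nn as nn


class DirectionalCrossTransformer(nn.Module):
    """One directional cross-modal path: queries from one modality attend to the other."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Cross-attention: `query` tokens gather gesture-related cues from `context`.
        h, _ = self.attn(self.norm1(query), context, context)
        query = query + h
        return query + self.ffn(self.norm2(query))


class DualPathFusion(nn.Module):
    """Parallel audio->text and text->audio paths, fused selectively across paths and time."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.audio_to_text = DirectionalCrossTransformer(dim, heads)  # audio queries attend to text
        self.text_to_audio = DirectionalCrossTransformer(dim, heads)  # text queries attend to audio
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # attention over frames
        self.path_gate = nn.Linear(dim, 2)  # per-frame weights over the two directional paths

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio, text: (batch, frames, dim), assumed pre-aligned to the same frame rate.
        a = self.audio_to_text(audio, text)
        t = self.text_to_audio(text, audio)
        stacked = torch.stack([a, t], dim=2)                                   # (batch, frames, 2, dim)
        weights = torch.softmax(self.path_gate(stacked.mean(dim=2)), dim=-1)   # (batch, frames, 2)
        fused = (stacked * weights.unsqueeze(-1)).sum(dim=2)                   # selective fusion of paths
        out, _ = self.time_attn(fused, fused, fused)                           # temporal self-attention
        return out


# Usage: fuse 64 frames of audio and text features into gesture-related features
# that a pose decoder (e.g., a GAN generator) could consume downstream.
fusion = DualPathFusion(dim=128)
audio_feats = torch.randn(2, 64, 128)
text_feats = torch.randn(2, 64, 128)
gesture_feats = fusion(audio_feats, text_feats)  # shape: (2, 64, 128)
```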