SpeechCAT: Cross-Attentive Transformer for Audio to Motion Generation

Published: 01 Jan 2025, Last Modified: 14 May 2025, HRI 2025, CC BY-SA 4.0
Abstract: Audio-to-motion generation is an important task with applications in virtual avatar creation for XR systems and intelligent robot control in everyday scenarios. However, most existing motion generation methods rely on a single encoder-decoder architecture to model all body parts simultaneously, which limits their ability to capture the diverse and complex motions exhibited by humans. In this paper, we propose a novel method, SpeechCAT, that employs three separate encoder-decoder modules to individually model the motions of the face, body, and hands. To capture the relationships and synchronization among these body parts, we introduce a cross-attention mechanism that effectively learns their correlations. SpeechCAT provides sufficient capacity to model the unique characteristics of each body part while preserving coherence among them. Our experimental results demonstrate the superiority of SpeechCAT over baseline methods, highlighting its effectiveness in generating diverse, realistic, and synchronized motions across the face, body, and hands.
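To make the architecture described above concrete, the following is a minimal PyTorch sketch of the general idea: three part-specific encoder-decoder streams driven by shared audio features, fused through cross-attention so each part attends to the other two. All names (PartDecoder, SpeechCATSketch), dimensions, and layer counts are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class PartDecoder(nn.Module):
    """One encoder-decoder stream for a single body part (face, body, or hands).

    Hypothetical sketch: dimensions and layer counts are illustrative, not the
    paper's configuration.
    """

    def __init__(self, audio_dim=128, motion_dim=64, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Cross-attention that lets this part attend to the other parts' features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.motion_head = nn.Linear(d_model, motion_dim)

    def encode(self, audio_feats):
        return self.encoder(self.audio_proj(audio_feats))

    def decode(self, own_feats, other_feats):
        # Query with this part's features; key/value with the other parts' features.
        fused, _ = self.cross_attn(own_feats, other_feats, other_feats)
        return self.motion_head(self.decoder(own_feats + fused))


class SpeechCATSketch(nn.Module):
    """Three part-specific streams fused by cross-attention (illustrative only)."""

    def __init__(self, audio_dim=128, part_dims=(51, 63, 90)):
        super().__init__()
        # part_dims are placeholder output sizes for face/body/hand motion vectors.
        self.parts = nn.ModuleDict({
            name: PartDecoder(audio_dim, dim)
            for name, dim in zip(("face", "body", "hands"), part_dims)
        })

    def forward(self, audio_feats):
        # Encode the shared audio features once per part-specific stream.
        encoded = {name: p.encode(audio_feats) for name, p in self.parts.items()}
        motions = {}
        for name, p in self.parts.items():
            # Each part cross-attends to the concatenated features of the other two,
            # which is what ties the separate streams into synchronized motion.
            others = torch.cat([f for n, f in encoded.items() if n != name], dim=1)
            motions[name] = p.decode(encoded[name], others)
        return motions


if __name__ == "__main__":
    audio = torch.randn(2, 100, 128)  # (batch, frames, audio feature dim)
    out = SpeechCATSketch()(audio)
    print({k: v.shape for k, v in out.items()})
```

In this sketch, keeping the streams separate gives each part its own capacity, while the cross-attention step is the only point where the face, body, and hand streams exchange information, mirroring the coherence mechanism the abstract describes.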