MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation

Published: 20 Jul 2024, Last Modified: 06 Aug 2024 · MM2024 Poster · CC BY 4.0
Abstract: Recent advancements in the field of Diffusion Transformers have substantially improved the generation of high-quality 2D images, 3D videos, and 3D shapes. However, the effectiveness of the Transformer architecture in the domain of co-speech gesture generation remains relatively unexplored, as prior methodologies have predominantly employed Convolutional Neural Networks (CNNs) or only a few simple Transformer layers. To bridge this research gap, we introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G, which directly performs the denoising process on gesture sequences. To enhance contextual reasoning for temporally aligned speech-driven gestures, MDT-A2G employs a mask modeling scheme specifically designed to strengthen temporal relation learning among gesture sequences, thereby expediting the learning process and yielding coherent and realistic motions. Apart from audio, our MDT-A2G model also integrates multi-modal information, encompassing text, emotion, and identity. Furthermore, we propose an efficient inference strategy that reduces the denoising computation by reusing previously calculated results, achieving a speedup with negligible performance degradation. Experimental results demonstrate that MDT-A2G excels in gesture generation, with a learning speed over 6$\times$ faster than traditional diffusion transformers and an inference speed 5.7$\times$ faster than the standard diffusion model.
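The abstract does not spell out the mask modeling scheme, so the following PyTorch sketch is only an illustrative guess at the general technique, not the authors' code: during training, a random fraction of frame embeddings in the noisy gesture sequence is replaced by a learned mask token, forcing the Transformer denoiser to reconstruct those frames from temporal context and the fused conditions. The pose dimension (141), the additive condition fusion, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class MaskedGestureDenoiser(nn.Module):
    """Hypothetical masked-diffusion denoiser for gesture sequences.
    A random subset of frames is replaced by a learned mask token so the
    model must infer them from temporal context (the mask-modeling idea)."""

    def __init__(self, pose_dim=141, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, pose_dim)

    def forward(self, noisy_poses, cond, mask_ratio=0.5):
        # noisy_poses: (B, T, pose_dim) gesture sequence at a diffusion step
        # cond: (B, T, d_model) fused speech/text/emotion/identity features
        x = self.in_proj(noisy_poses)
        if self.training and mask_ratio > 0:
            B, T, _ = x.shape
            keep = torch.rand(B, T, device=x.device) > mask_ratio  # True = visible
            x = torch.where(keep.unsqueeze(-1), x,
                            self.mask_token.expand(B, T, -1))
        x = self.encoder(x + cond)  # additive fusion is an assumption
        return self.out_proj(x)     # predicted clean poses (or noise)
```

Because the masked frames can only be recovered from their neighbors, the training signal emphasizes temporal relations among frames, which is consistent with the faster convergence the abstract reports.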
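The efficient inference strategy "reuses previously calculated results" across denoising steps. A minimal stand-in, assuming a standard DDPM update and a denoiser with signature `model(x, t, cond)`: re-run the network only every `reuse_every` steps and reuse the cached noise estimate in between. The linear beta schedule and the tensor shapes are illustrative assumptions, not the paper's exact method.

```python
import torch

@torch.no_grad()
def sample_with_reuse(model, cond, num_steps=1000, reuse_every=2,
                      shape=(1, 120, 141)):
    """DDPM-style sampling that re-runs the denoiser only every
    `reuse_every` steps, reusing the cached prediction in between."""
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # start from pure noise
    eps = None
    for t in reversed(range(num_steps)):
        if eps is None or t % reuse_every == 0:
            # expensive network call; skipped on the reused steps
            eps = model(x, torch.full((shape[0],), t, dtype=torch.long), cond)
        # standard DDPM posterior mean with the (possibly cached) estimate
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x
```

With `reuse_every=2`, roughly half of the network evaluations are skipped; since adjacent denoising steps produce highly correlated predictions, the quality loss from reuse is typically small, matching the abstract's claim of a speedup with negligible degradation.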
Primary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This work substantially advances multimedia and multimodal processing by connecting audio and gesture data. The proposed algorithm draws on multiple modalities, such as audio, to create natural, varied, and smooth gestures, enabling more lifelike and engaging interactions in virtual spaces, gaming, and animation. The study improves multimedia systems' capacity to process and align disparate modalities, fostering the creation of more coherent and expressive gestures.
Supplementary Material: zip
Submission Number: 1492