MMIDM: Generating 3D Gesture from Multimodal Inputs with Diffusion Models

Published: 01 Jan 2024, Last Modified: 14 Nov 2024 · PRCV (6) 2024 · CC BY-SA 4.0
Abstract: Multimodal-driven gesture generation has received increasing attention recently. However, a key challenge remains: how to mine the relationships between multimodal conditional inputs and gestures so as to generate more diverse and realistic gestures. To address this challenge, we propose a novel framework, 3D gesture generation from MultiModal Inputs with Diffusion Models (MMIDM), which effectively fuses information from multiple modalities (such as text, music, facial expressions, character information, and emotion) as the condition guiding gesture generation. Specifically, we design a multimodal self-evaluation fusion network that captures the features highly related to gestures and automatically weighs the importance of the different conditional inputs using a mixture-of-experts mechanism. Moreover, we find that a diffusion model guided by multimodal conditions suffers from severe jitter in the generated gesture motions. To alleviate this jitter, we employ a novel timestep embedding strategy in which the timestep embedding is injected into every transformer block of the diffusion model. We evaluate the proposed method on the BEAT multimodal dataset, and the experimental results demonstrate the effectiveness of our approach.
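
The sketch below illustrates, in PyTorch, the two mechanisms the abstract describes: a mixture-of-experts-style gate that weighs per-modality features before fusing them into a single condition, and a timestep embedding re-injected into every transformer block of the diffusion denoiser. The module layout, dimensions, pose size, sinusoidal embedding, and the simple softmax gate standing in for the paper's self-evaluation fusion network are all illustrative assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn


def timestep_embedding(t, dim):
    """Standard sinusoidal embedding of diffusion timesteps (assumed choice)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)


class MoEFusion(nn.Module):
    """Gate that scores each modality's relevance and returns a weighted fusion."""
    def __init__(self, num_modalities, dim):
        super().__init__()
        self.gate = nn.Linear(num_modalities * dim, num_modalities)

    def forward(self, feats):                      # feats: list of (B, dim) tensors
        stacked = torch.stack(feats, dim=1)        # (B, M, dim)
        weights = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)   # (B, M)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)              # (B, dim)


class DenoiserBlock(nn.Module):
    """Transformer block that receives the timestep embedding at every layer."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_proj = nn.Linear(dim, dim)          # per-block timestep injection
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, t_emb):                   # x: (B, T, dim), t_emb: (B, dim)
        h = x + self.t_proj(t_emb).unsqueeze(1)    # timestep added inside the block
        n = self.norm1(h)
        h = h + self.attn(n, n, n)[0]
        return h + self.mlp(self.norm2(h))


class GestureDenoiser(nn.Module):
    """Predicts the noise on a gesture sequence, conditioned on the fused modalities."""
    def __init__(self, pose_dim=141, dim=256, depth=4, num_modalities=4):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim, dim)
        self.fusion = MoEFusion(num_modalities, dim)
        self.blocks = nn.ModuleList(DenoiserBlock(dim) for _ in range(depth))
        self.out_proj = nn.Linear(dim, pose_dim)

    def forward(self, noisy_poses, t, modality_feats):
        t_emb = timestep_embedding(t, self.in_proj.out_features)
        cond = self.fusion(modality_feats)                  # (B, dim)
        x = self.in_proj(noisy_poses) + cond.unsqueeze(1)   # condition every frame
        for block in self.blocks:
            x = block(x, t_emb)                             # timestep fed to each block
        return self.out_proj(x)


# Example: four modality feature vectors (e.g. text, music, face, character), all assumed 256-d.
model = GestureDenoiser()
feats = [torch.randn(2, 256) for _ in range(4)]
noise_pred = model(torch.randn(2, 120, 141), torch.randint(0, 1000, (2,)), feats)
print(noise_pred.shape)  # torch.Size([2, 120, 141])
```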