THMM-DiT: Talking Head Motion Modeling with Diffusion Transformer

Sejong Yang, Seoung Wug Oh, Yang Zhou, Seon Joo Kim

03 Oct 2023 (modified: 25 Sept 2024) | OpenReview Archive Direct Upload | Readers: Everyone

Abstract: Generating natural talking head motion is challenging because of the one-to-many nature of the speech-to-motion mapping, the high dimensionality of RGB video, and the difficulty of modeling dynamic head poses. In this technical report, we propose an approach that addresses these challenges. We use a diffusion model, conditioned on the given audio, to model the distribution of possible head poses and produce natural-looking talking head motion. We also use a face template to reduce the computational resources required to generate high-quality RGB videos. Finally, we employ long clue frames with the spatio-temporal attention of a transformer to generate natural long-term sequences of head poses. Our approach generates dynamic head poses over the long term while accurately synchronizing mouth shapes with the given audio.
