Abstract: Generating natural talking head motion is a challenging task due to the one-to-many nature of speech-to-motion mapping, the high dimensionality of RGB video, and the difficulty of modeling dynamic head poses. In this technical report, we propose a new approach to generating natural talking head motion that addresses these challenges. Our approach uses a diffusion model, conditioned on the input audio, to capture the distribution of possible head poses and produce natural-looking talking heads. We also use a face template to reduce the computational resources required to generate high-quality RGB videos. Finally, we employ clue frames, processed with a transformer's spatio-temporal attention, to generate natural head-pose sequences over long durations. Our approach generates dynamic head poses over long time spans while accurately synchronizing mouth shapes with the given audio.
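Below is a minimal sketch, not the authors' code, of the core idea the abstract describes: audio-conditioned diffusion sampling over a head-pose sequence, with a transformer providing temporal attention over pose tokens. All module names, dimensions, the DDPM-style noise schedule, and the use of PyTorch are assumptions for illustration.

```python
# A minimal sketch (not the paper's implementation) of audio-conditioned
# diffusion sampling over head-pose sequences. Names and dimensions are
# hypothetical; the sampler follows standard DDPM ancestral sampling.
import torch
import torch.nn as nn

class PoseDenoiser(nn.Module):
    """Transformer that predicts the noise added to a head-pose sequence,
    conditioned on per-frame audio features and a diffusion timestep."""
    def __init__(self, pose_dim=6, audio_dim=128, d_model=256, n_layers=4):
        super().__init__()
        self.pose_in = nn.Linear(pose_dim, d_model)
        self.audio_in = nn.Linear(audio_dim, d_model)
        self.t_embed = nn.Embedding(1000, d_model)  # diffusion-step embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # temporal attention
        self.pose_out = nn.Linear(d_model, pose_dim)

    def forward(self, noisy_pose, audio_feat, t):
        # Fuse noisy pose tokens, audio features, and the timestep embedding.
        h = self.pose_in(noisy_pose) + self.audio_in(audio_feat)
        h = h + self.t_embed(t)[:, None, :]
        return self.pose_out(self.encoder(h))

@torch.no_grad()
def sample_poses(model, audio_feat, pose_dim=6, steps=1000):
    """Draw one head-pose trajectory from the one-to-many
    speech-to-motion distribution via DDPM ancestral sampling."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    B, T, _ = audio_feat.shape
    x = torch.randn(B, T, pose_dim)  # start from Gaussian noise over T frames
    for t in reversed(range(steps)):
        eps = model(x, audio_feat, torch.full((B,), t, dtype=torch.long))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # (B, T, pose_dim) head-pose sequence synced to the audio

# Usage: audio = torch.randn(1, 100, 128)
#        poses = sample_poses(PoseDenoiser(), audio)
```

Because sampling starts from noise, repeated draws for the same audio yield different but plausible pose trajectories, which is how a diffusion model accommodates the one-to-many speech-to-motion mapping.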