Abstract: We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key to our method is combining the sample diversity of vector quantization with the high-frequency detail obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion-only and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset are available on the project page.
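To make the two-stage idea in the abstract concrete, below is a minimal sketch (not the authors' code) of how a VQ model can supply diverse, coarse guide poses from audio while a diffusion model conditioned on those guides restores high-frequency detail. All module names, shapes, and hyperparameters are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of VQ-guided diffusion for audio-driven gesture motion.
# Shapes and architectures are assumptions for illustration only.
import torch
import torch.nn as nn

class AudioToGuidePoseVQ(nn.Module):
    """Coarse stage: map audio features to discrete codebook indices, then
    decode them into low-frequency guide poses. Sample diversity comes from
    the discrete codes."""
    def __init__(self, audio_dim=128, code_dim=64, num_codes=256, pose_dim=104):
        super().__init__()
        self.encoder = nn.GRU(audio_dim, code_dim, batch_first=True)
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.decoder = nn.Linear(code_dim, pose_dim)

    def forward(self, audio_feats):                     # (B, T, audio_dim)
        h, _ = self.encoder(audio_feats)                # (B, T, code_dim)
        # Quantize each timestep to its nearest codebook entry.
        dists = torch.cdist(h, self.codebook.weight)    # (B, T, num_codes)
        codes = dists.argmin(dim=-1)                    # (B, T)
        quantized = self.codebook(codes)                # (B, T, code_dim)
        return self.decoder(quantized)                  # coarse guide poses

class GuidedMotionDiffusion(nn.Module):
    """Fine stage: predict the noise in a noisy motion sequence, conditioned
    on audio and the VQ guide poses, adding back high-frequency detail."""
    def __init__(self, pose_dim=104, audio_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim * 2 + audio_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, noisy_motion, guide_poses, audio_feats, t):
        # Broadcast the diffusion timestep to every frame as conditioning.
        t_emb = t.view(-1, 1, 1).expand(-1, noisy_motion.shape[1], 1)
        x = torch.cat([noisy_motion, guide_poses, audio_feats, t_emb], dim=-1)
        return self.net(x)  # predicted noise (epsilon)

# Toy usage: one denoising call on random data.
B, T = 2, 30
audio = torch.randn(B, T, 128)
guides = AudioToGuidePoseVQ()(audio)
eps_hat = GuidedMotionDiffusion()(torch.randn(B, T, 104), guides, audio, torch.rand(B))
print(eps_hat.shape)  # torch.Size([2, 30, 104])
```

The design choice this sketch illustrates is the division of labor the abstract claims: sampling over discrete codes yields diverse coarse motion, and the conditional denoiser contributes the fine detail that VQ decoding alone tends to smooth away.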