Abstract: We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key behind our method is combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset are available on the project page.