Synthesizing real-time speech-driven facial animation

Published: 01 Jan 2014 · Last Modified: 13 Nov 2024 · ICASSP 2014 · License: CC BY-SA 4.0
Abstract: We present a real-time speech-driven facial animation system in which Gaussian mixture models (GMMs) perform the audio-to-visual conversion. The conventional GMM-based method converts each frame independently using minimum mean square error (MMSE) estimation. While reasonably effective, this frame-by-frame approach often introduces discontinuities into the estimated visual feature sequences. To address this problem, we incorporate the previous visual features into the conversion, so that the procedure proceeds in the manner of a Markov chain. After audio-to-visual conversion, the estimated visual features are transformed into blendshape weights to synthesize facial animation. Experiments show that our system converts audio features into visual features accurately, with accuracy comparable to a current state-of-the-art trajectory-based approach. Moreover, our system runs in real time and outputs high-quality lip-sync animations.
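To make the conventional frame-by-frame conversion concrete, the following is a minimal sketch of GMM-based MMSE audio-to-visual mapping: a GMM is fit on joint audio-visual feature vectors, and the visual features for an audio frame are estimated as the posterior-weighted sum of the per-component conditional means. The function names (`fit_joint_gmm`, `mmse_convert`), the feature dimensions, and the use of scikit-learn's `GaussianMixture` are illustrative assumptions, not details from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical dimensions, chosen only for illustration.
AUDIO_DIM = 13   # e.g. MFCCs per frame
VISUAL_DIM = 8   # e.g. visual feature vector per frame

def fit_joint_gmm(audio_feats, visual_feats, n_components=16):
    """Fit a full-covariance GMM on joint [audio; visual] vectors (one row per frame)."""
    joint = np.hstack([audio_feats, visual_feats])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=0)
    gmm.fit(joint)
    return gmm

def mmse_convert(gmm, x):
    """MMSE estimate of the visual features y given one audio frame x.

    For each component m, with the joint mean/covariance partitioned into
    audio (x) and visual (y) blocks, the conditional mean is
        E[y | x, m] = mu_y + S_yx S_xx^{-1} (x - mu_x),
    and the MMSE estimate is the sum of these weighted by p(m | x).
    """
    d = x.shape[0]
    log_resp = np.empty(gmm.n_components)
    y_cond = np.empty((gmm.n_components, gmm.means_.shape[1] - d))
    for m in range(gmm.n_components):
        mu_x = gmm.means_[m, :d]
        mu_y = gmm.means_[m, d:]
        S_xx = gmm.covariances_[m, :d, :d]
        S_yx = gmm.covariances_[m, d:, :d]
        diff = x - mu_x
        solve = np.linalg.solve(S_xx, diff)
        # log[w_m * N(x; mu_x, S_xx)] up to constants shared by all components
        _, logdet = np.linalg.slogdet(S_xx)
        log_resp[m] = np.log(gmm.weights_[m]) - 0.5 * (diff @ solve + logdet)
        y_cond[m] = mu_y + S_yx @ solve
    log_resp -= log_resp.max()          # stabilize before exponentiating
    resp = np.exp(log_resp)
    resp /= resp.sum()                  # posterior responsibilities p(m | x)
    return resp @ y_cond

# Usage with random stand-in data:
# rng = np.random.default_rng(0)
# A = rng.standard_normal((500, AUDIO_DIM))
# V = rng.standard_normal((500, VISUAL_DIM))
# gmm = fit_joint_gmm(A, V)
# y_hat = mmse_convert(gmm, A[0])       # estimated visual features, shape (VISUAL_DIM,)
```

Under this formulation, the Markov-chain extension described in the abstract would amount to conditioning on the previous frame's output as well: train the joint GMM on [x_t; y_{t-1}; y_t] and pass [x_t; y-hat_{t-1}] as the conditioning vector, which suppresses the frame-to-frame discontinuities of the independent MMSE mapping.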