Mining audio/visual database for speech driven face animation

Yiqiang Chen, Wen Gao, Zhaoqi Wang, Jun Miao, Dalong Jiang

2001 (modified: 31 Oct 2023), SMC 2001, Readers: Everyone

Abstract: The authors present a data mining framework for audio-visual interaction and apply it to a speech-driven lip-motion facial animation system. First, an unsupervised clustering algorithm is proposed to build a set of clusters, each containing similar configurations. Then, a statistical visual model is constructed by specifying all the possible cluster trajectories. The audio is analyzed with regard to the learned clusters of facial gestures. For every cluster, two neural networks are trained to map audio features to the cluster label and to velocity, respectively. Given a new vocal track, the statistical visual model and the neural networks are combined to analyze the control audio, yielding the most likely facial state sequence. The proposed method not only automatically incorporates vocal and facial dynamics such as co-articulation, but is also easy to train and more robust, extensible, and interpretable. Two approaches to evaluating the system are also proposed. The results show that the proposed learning algorithm is suitable and greatly improves the realism of face animation during speech.
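The step that combines the statistical visual model with the network outputs can be illustrated with a small sketch. The abstract does not specify the decoding procedure, so everything here is an assumption: the visual model is taken to be a first-order transition table between clusters, the label network is taken to output per-frame cluster probabilities, and the most likely sequence is found with a standard Viterbi recursion. The function name and all parameters are hypothetical, and the velocity network is left out.

```python
import math


def most_likely_cluster_sequence(frame_probs, transition, initial):
    """Viterbi decoding over facial-configuration clusters (illustrative only).

    frame_probs[t][k]: assumed per-frame probability of cluster k from the audio network
    transition[j][k]:  assumed probability of moving from cluster j to cluster k
    initial[k]:        assumed prior probability of starting in cluster k
    """
    if not frame_probs:
        return []

    def log(p):
        # Work in log space to avoid underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    n_clusters = len(initial)
    # Best log-score of any path that ends in cluster k at the first frame.
    score = [log(initial[k]) + log(frame_probs[0][k]) for k in range(n_clusters)]
    back_pointers = []

    for probs in frame_probs[1:]:
        new_score, pointers = [], []
        for k in range(n_clusters):
            # Best predecessor cluster for landing in cluster k at this frame.
            best_j = max(range(n_clusters),
                         key=lambda j: score[j] + log(transition[j][k]))
            new_score.append(score[best_j] + log(transition[best_j][k]) + log(probs[k]))
            pointers.append(best_j)
        score = new_score
        back_pointers.append(pointers)

    # Trace the best path backwards from the best final cluster.
    state = max(range(n_clusters), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy example: two clusters that rarely switch, with one ambiguous middle frame.
transition = [[0.9, 0.1], [0.1, 0.9]]
initial = [0.5, 0.5]
frame_probs = [[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]
print(most_likely_cluster_sequence(frame_probs, transition, initial))  # [0, 0, 0]
```

In this toy input, picking the most probable cluster frame by frame would give 0, 1, 0. The transition table penalizes the brief switch, so the decoded sequence stays in cluster 0, which is the kind of temporal smoothing a trajectory model is meant to provide.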
