Abstract: We present a deep learning approach for generating facial animation in real time from combined video and audio input, targeting low-end devices.
Our method produces control signals from the audio and video
inputs separately, and mixes them to animate a character rig. The architecture relies on two specialized networks that are
trained on a combination of synthetic and real-world data and are
heavily optimized for efficiency in order to support high-quality avatar
faces even on low-end devices. In addition, the system supports
several levels of detail that degrade gracefully for additional scaling
and efficiency. We show how user testing has been employed
to improve performance, and we present a comparison with the state of the art.
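
As an illustration of the mixing step described above, the following is a minimal sketch and not the method's actual implementation. The abstract does not specify the mixing rule, so a per-control convex blend is assumed here; the function name `mix_controls`, the three example controls, and the weight values are all hypothetical.

```python
from typing import List, Sequence


def mix_controls(
    video: Sequence[float],
    audio: Sequence[float],
    audio_weight: Sequence[float],
) -> List[float]:
    """Blend two per-control signal vectors using per-control weights in [0, 1].

    Assumption: a simple convex combination per rig control. The actual
    mixing rule is not given in the abstract.
    """
    if not (len(video) == len(audio) == len(audio_weight)):
        raise ValueError("all inputs must have one entry per rig control")
    mixed = []
    for v, a, w in zip(video, audio, audio_weight):
        if not 0.0 <= w <= 1.0:
            raise ValueError("weights must lie in [0, 1]")
        # w = 0 keeps the video-driven value, w = 1 keeps the audio-driven value.
        mixed.append((1.0 - w) * v + w * a)
    return mixed


# Hypothetical three-control rig: jaw open, lip pucker, brow raise.
video_signal = [0.2, 0.0, 0.8]
audio_signal = [0.6, 0.4, 0.0]
# Assumed weighting: favor audio for mouth controls, video for the brow.
weights = [0.5, 1.0, 0.0]
print(mix_controls(video_signal, audio_signal, weights))
```

Under this assumption, a weight vector lets each rig control draw on whichever modality is more informative for it, for example audio for lip shapes and video for brow motion.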