Abstract: Talking head animation transforms a source anime image to a target pose, where the transformation includes changes of facial expression and head movement. In contrast to existing approaches that operate on low-resolution images ($256 \times 256$), we study this task at a higher resolution, e.g., $512 \times 512$. High-resolution talking head animation, however, raises two major challenges: i) how to achieve smooth global transformation while maintaining rich details of anime characters under large-displacement pose variations; ii) how to address the shortage of data, because no related dataset is publicly available. In this paper, we present a Hierarchical Feature Warping and Blending (HFWB) model, which tackles talking head animation hierarchically. Specifically, we use low-level features to control global transformation and high-level features to determine the details of anime characters, under the guidance of feature flow fields. These features are then blended by selective fusion units, outputting transformed anime images. In addition, we construct an anime pose dataset, AniTalk-2K, to alleviate the shortage of data. It contains around 2000 anime characters, each with thousands of different face/head poses at a resolution of $512 \times 512$. Extensive experiments on AniTalk-2K demonstrate the superiority of our approach in generating high-quality anime talking heads over state-of-the-art methods.
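To make the core mechanism concrete, the following is a minimal NumPy sketch of the two operations the abstract names: warping a feature map by a flow field, and blending two warped feature levels with a per-pixel gate. The function names (`warp`, `selective_fusion`) and the scalar gate are illustrative assumptions, not the paper's actual (learned) implementation.

```python
import numpy as np

def warp(feat, flow):
    """Warp a feature map (C, H, W) by a flow field (2, H, W)
    using bilinear sampling; flow gives (dy, dx) offsets per pixel."""
    C, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Each output pixel samples the source at (y + dy, x + dx), clipped to bounds.
    sy = np.clip(ys + flow[0], 0, H - 1)
    sx = np.clip(xs + flow[1], 0, W - 1)
    y0, x0 = np.floor(sy).astype(int), np.floor(sx).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = sy - y0, sx - x0  # bilinear interpolation weights
    return (feat[:, y0, x0] * (1 - wy) * (1 - wx)
          + feat[:, y0, x1] * (1 - wy) * wx
          + feat[:, y1, x0] * wy * (1 - wx)
          + feat[:, y1, x1] * wy * wx)

def selective_fusion(low_level, high_level, gate):
    """Blend a warped low-level (global structure) feature map with a
    warped high-level (detail) map via a gate in [0, 1] -- a stand-in
    for the paper's learned selective fusion unit."""
    return gate * low_level + (1 - gate) * high_level
```

With zero flow, `warp` is the identity; a constant flow of `dx = 1` shifts the map one pixel, which mirrors how a flow field encodes the pose-driven displacement of each feature location.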