Enhancing Facial Consistency in Conditional Video Generation via Facial Landmark Transformation

Published: 01 Jan 2024, Last Modified: 02 Aug 2025CoRR 2024EveryoneRevisionsBibTeXCC BY-SA 4.0
Abstract: Creating realistic pose-guided image-to-video character animations while preserving facial identity remains challenging, especially in complex and dynamic scenarios such as dancing, where precise identity consistency is crucial. Existing methods frequently encounter difficulties maintaining facial coherence due to misalignments between facial landmarks extracted from driving videos that provide head pose and expression cues and the facial geometry of the reference images. To address this limitation, we introduce the Facial Landmarks Transformation (FLT) method, which leverages a 3D Morphable Model to address this limitation. FLT converts 2D landmarks into a 3D face model, adjusts the 3D face model to align with the reference identity, and then transforms them back into 2D landmarks to guide the image-to-video generation process. This approach ensures accurate alignment with the reference facial geometry, enhancing the consistency between generated videos and reference images. Experimental results demonstrate that FLT effectively preserves facial identity, significantly improving pose-guided character animation models.
Loading