Keywords: Diffusion model, Portrait Animation, Image and Video Generation
Abstract: Portrait animation aims to synthesize images or videos that transfer expressions and poses from a driving reference to a source portrait while preserving the source identity. Existing methods often rely on high-level expression encoders, which capture only coarse semantics and miss fine-grained structural details in critical regions such as the eyes, eyebrows, and mouth, leading to noticeable expression discrepancies and suboptimal fidelity.
To address this, we propose StrucBooth, a framework that injects pixel-level expression structure into the model through case-specific optimization while preserving the generator's inherent capabilities. StrucBooth combines (i) PGT-based self-tuning, which uses a preliminary prediction as Pseudo Ground Truth (PGT) for lightweight refinement, and (ii) pixel-level structural supervision, which extracts gradient variations (Facial Structural Gradients) from expression-related patches and aligns them during refinement to inject fine-grained structural detail (a sketch follows the abstract).
Extensive evaluations under both cross-driven and self-driven settings demonstrate that StrucBooth consistently improves expression accuracy over strong baselines, highlighting that integrating pixel-space structural signals is an effective direction for faithful and visually consistent portrait animation.
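To make component (ii) concrete, here is a minimal sketch of what a pixel-level structural supervision term could look like, assuming the Facial Structural Gradients are Sobel-style image gradients computed over landmark-derived patches around the eyes, eyebrows, and mouth. The function names (`sobel_gradients`, `structural_gradient_loss`), the patch-box interface, and the L1 alignment are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sobel_gradients(x: torch.Tensor) -> torch.Tensor:
    """Per-channel horizontal/vertical image gradients via 3x3 Sobel filters.

    x: (B, C, H, W) image tensor -> (B, 2*C, H, W) stacked gradient maps.
    """
    kx = torch.tensor([[-1.0, 0.0, 1.0],
                       [-2.0, 0.0, 2.0],
                       [-1.0, 0.0, 1.0]], device=x.device, dtype=x.dtype)
    kx = kx.view(1, 1, 3, 3)
    ky = kx.transpose(-1, -2)  # vertical kernel is the transpose of the horizontal one
    c = x.shape[1]
    gx = F.conv2d(x, kx.repeat(c, 1, 1, 1), padding=1, groups=c)  # depthwise conv
    gy = F.conv2d(x, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.cat([gx, gy], dim=1)

def structural_gradient_loss(pred: torch.Tensor,
                             target: torch.Tensor,
                             patch_boxes) -> torch.Tensor:
    """Align gradient maps of pred and target inside expression-related
    patches (eyes, eyebrows, mouth).

    patch_boxes: list of (y0, y1, x0, x1) crops, e.g. derived from facial
    landmarks (hypothetical interface).
    """
    loss = pred.new_zeros(())
    for y0, y1, x0, x1 in patch_boxes:
        g_pred = sobel_gradients(pred[:, :, y0:y1, x0:x1])
        g_tgt = sobel_gradients(target[:, :, y0:y1, x0:x1])
        loss = loss + F.l1_loss(g_pred, g_tgt)
    return loss / max(len(patch_boxes), 1)
```

In a PGT-based self-tuning loop, a term like this would presumably be combined with a reconstruction loss against the frozen preliminary prediction (the PGT) over a small number of lightweight optimization steps, so the refinement gains fine-grained structural detail without drifting from the generator's original output.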
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2923