StrucBooth: Structural Gradient Supervised Tuning for Enhanced Portrait Animation

ICLR 2026 Conference Submission 2923 Authors

08 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Diffusion model, Portrait Animation, Image and Video Generation
Abstract: Portrait animation aims to synthesize images or videos that transfer expressions or poses from a reference while preserving identity. Existing methods often rely on high-level expression encoders, which capture only coarse semantics and miss fine-grained structural details in critical regions such as the eyes, eyebrows, and mouth, leading to noticeable discrepancies and suboptimal expression fidelity. To address this, we propose StrucBooth, a framework that binds pixel-level expression structure to the model through case-specific optimization while preserving the generator’s inherent capabilities. StrucBooth combines (i) PGT-based self-tuning, which uses a preliminary prediction as a Pseudo Ground Truth (PGT) for lightweight refinement, and (ii) pixel-level structural supervision, which extracts gradient variations (Facial Structural Gradients) from expression-related patches and aligns them to inject fine-grained structural information. Extensive evaluations under both cross-driven and self-driven settings demonstrate that StrucBooth consistently improves expression accuracy over strong baselines, highlighting that integrating pixel-space structural signals is an effective direction for faithful and visually consistent portrait animation.
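
As a rough illustration of the pixel-level structural supervision described in the abstract, the sketch below aligns spatial gradients of a prediction with those of a pseudo ground truth inside expression-related regions. The Sobel operator, the L1 alignment, the mask-based patch selection, and all names (spatial_gradients, structural_gradient_loss, patch_mask) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Sketch of a "Facial Structural Gradient" alignment term (assumptions noted above):
# spatial gradients are taken from expression-related patches of the current
# prediction and of the pseudo ground truth (PGT), then matched with an L1 penalty.

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def spatial_gradients(img: torch.Tensor) -> torch.Tensor:
    """Per-channel Sobel gradients of a (B, C, H, W) image."""
    b, c, h, w = img.shape
    flat = img.reshape(b * c, 1, h, w)
    gx = F.conv2d(flat, SOBEL_X.to(img), padding=1)
    gy = F.conv2d(flat, SOBEL_Y.to(img), padding=1)
    return torch.cat([gx, gy], dim=1).reshape(b, 2 * c, h, w)

def structural_gradient_loss(pred: torch.Tensor,
                             pgt: torch.Tensor,
                             patch_mask: torch.Tensor) -> torch.Tensor:
    """L1 gradient alignment inside expression-related patches (eyes, brows, mouth).

    patch_mask: (B, 1, H, W) binary mask marking the expression-related regions.
    """
    g_pred = spatial_gradients(pred)
    g_pgt = spatial_gradients(pgt.detach())  # PGT is a fixed target during self-tuning
    diff = (g_pred - g_pgt).abs() * patch_mask
    return diff.sum() / patch_mask.sum().clamp_min(1.0)
```

In a case-specific tuning loop of the kind the abstract describes, this term would be added to the refinement objective so that gradients of the masked facial regions in the prediction move toward those of the PGT.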
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2923