Keywords: Video Diffusion Model, Portrait Animation
Abstract: Portrait animation generates realistic animated videos from static portrait images and plays a crucial role in a wide range of real-world applications. Despite substantial advances in realism, existing portrait animation methods suffer from two critical limitations: inference speeds too slow for interactive scenarios and the absence of behavioral interaction capabilities, which significantly restricts immersive user experiences. To address these limitations, we propose InterAvatar, the first framework that adapts a real-time video diffusion transformer for portrait animation conditioned on behavioral interaction prompts. Specifically, InterAvatar is built upon the diffusion transformers Wan2.1-1.3B and LTX-Video-2B with diffusion distillation, and is conditioned on a reference image, audio signals, and behavioral interaction prompts to animate avatars. To enhance appearance consistency and reduce drift in real-time animation frameworks, we introduce a representation decoupling strategy that separates identity and attribute information from the reference appearance. To the best of our knowledge, we are also the first to introduce behavioral interaction prompts into portrait animation, and we propose strategies for encoding these prompts and injecting them into diffusion transformers. In addition, we introduce a hybrid data curation pipeline for systematically collecting, filtering, and annotating real and synthetic videos with behavioral interaction prompts. Extensive evaluations on HDTF, CelebV-HQ, and RAVDESS demonstrate that InterAvatar achieves video quality comparable to state-of-the-art models while effectively simulating realistic behavioral interactions, enhancing the interactive user experience. InterAvatar can generate 80 video frames at 512×512 resolution in just 5 seconds on an Nvidia H800 GPU, offering a favorable balance between accuracy and efficiency.
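The abstract states that the model is conditioned on a reference image, audio signals, and behavioral interaction prompts within a diffusion transformer. The sketch below is a minimal, hypothetical illustration (not the authors' implementation) of one common way such conditions could be injected into a DiT block: reference tokens joined to the video tokens for self-attention, and audio and interaction-prompt tokens attached via cross-attention. All module and argument names are illustrative assumptions.

```python
# Hypothetical sketch, not the InterAvatar code: one plausible conditioning scheme
# for a diffusion-transformer block driven by a reference image, audio features,
# and a behavioral interaction prompt.
import torch
import torch.nn as nn


class ConditionedDiTBlock(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Separate cross-attention paths for audio and interaction-prompt tokens.
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.prompt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, ref_tokens, audio_tokens, prompt_tokens):
        # Reference-image tokens are concatenated with the noisy video tokens so
        # self-attention can read appearance and identity cues directly.
        x = torch.cat([ref_tokens, video_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Audio and behavioral-interaction conditions enter via cross-attention.
        h = self.norm2(x)
        x = x + self.audio_attn(h, audio_tokens, audio_tokens, need_weights=False)[0]
        h = self.norm3(x)
        x = x + self.prompt_attn(h, prompt_tokens, prompt_tokens, need_weights=False)[0]
        x = x + self.mlp(self.norm4(x))
        # Drop the reference tokens before returning the updated video tokens.
        return x[:, ref_tokens.shape[1]:]


# Toy shapes only; real token counts depend on the latent video resolution.
block = ConditionedDiTBlock()
out = block(
    torch.randn(2, 256, 1024),  # noisy video latent tokens
    torch.randn(2, 64, 1024),   # reference-image tokens
    torch.randn(2, 40, 1024),   # audio feature tokens
    torch.randn(2, 8, 1024),    # behavioral interaction prompt tokens
)
print(out.shape)  # torch.Size([2, 256, 1024])
```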
Supplementary Material: zip
Primary Area: generative models
Submission Number: 4423