Keywords: Video Diffusion Model, Portrait Animation
Abstract: Portrait animation generates realistic animated videos from static portrait images and plays a crucial role in a wide range of real-world applications. Despite substantial advances in realism, existing portrait animation methods suffer from two critical limitations: inference speeds too slow for interactive scenarios and the absence of behavioral interaction capabilities, which significantly restricts immersive user experiences. To address these limitations, we propose InterAvatar, the first framework that adapts a real-time video diffusion transformer for portrait animation conditioned on behavioral interaction prompts. Specifically, InterAvatar is built upon the diffusion transformers Wan2.1-1.3B and LTX-Video-2B with diffusion distillation, and is conditioned on a reference image, audio signals, and behavioral interaction prompts to animate avatars. To enhance appearance consistency and reduce drift in real-time animation frameworks, we introduce a representation decoupling strategy that separates identity and attribute information from the reference appearance. To the best of our knowledge, we are also the first to introduce behavioral interaction prompts into portrait animation, and we propose strategies for encoding these prompts and injecting them into diffusion transformers. In addition, we introduce a hybrid data curation pipeline for systematically collecting, filtering, and annotating real and synthetic videos with behavioral interaction prompts. Extensive evaluations on HDTF, CelebV-HQ, and RAVDESS demonstrate that InterAvatar achieves video quality comparable to state-of-the-art models while effectively simulating realistic behavioral interactions, enhancing the interactive user experience. InterAvatar can generate 80 video frames at 512×512 resolution in just 5 seconds on an Nvidia H800 GPU, offering a favorable balance between accuracy and efficiency.
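The abstract states that the model is conditioned on a reference image, audio signals, and behavioral interaction prompts within a diffusion transformer. The sketch below is a minimal, hypothetical illustration (not the authors' implementation) of one common way such conditions could be injected into a DiT block: reference tokens joined to the video tokens for self-attention, and audio and interaction-prompt tokens attached via cross-attention. All module and argument names are illustrative assumptions.

```python
# Hypothetical sketch, not the InterAvatar code: one plausible conditioning scheme
# for a diffusion-transformer block driven by a reference image, audio features,
# and a behavioral interaction prompt.
import torch
import torch.nn as nn


class ConditionedDiTBlock(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Separate cross-attention paths for audio and interaction-prompt tokens.
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.prompt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, ref_tokens, audio_tokens, prompt_tokens):
        # Reference-image tokens are concatenated with the noisy video tokens so
        # self-attention can read appearance and identity cues directly.
        x = torch.cat([ref_tokens, video_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Audio and behavioral-interaction conditions enter via cross-attention.
        h = self.norm2(x)
        x = x + self.audio_attn(h, audio_tokens, audio_tokens, need_weights=False)[0]
        h = self.norm3(x)
        x = x + self.prompt_attn(h, prompt_tokens, prompt_tokens, need_weights=False)[0]
        x = x + self.mlp(self.norm4(x))
        # Drop the reference tokens before returning the updated video tokens.
        return x[:, ref_tokens.shape[1]:]


# Toy shapes only; real token counts depend on the latent video resolution.
block = ConditionedDiTBlock()
out = block(
    torch.randn(2, 256, 1024),  # noisy video latent tokens
    torch.randn(2, 64, 1024),   # reference-image tokens
    torch.randn(2, 40, 1024),   # audio feature tokens
    torch.randn(2, 8, 1024),    # behavioral interaction prompt tokens
)
print(out.shape)  # torch.Size([2, 256, 1024])
```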
Supplementary Material: zip
Primary Area: generative models
Submission Number: 4423