Physically Grounded Avatar Generation

12 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: video generation, audio-driven avatar, physically grounded human behaviors
TL;DR: This paper presents a physically grounded DiT model for audio-driven avatar generation using discrete diffusion-based physical state supervision, MLLM-based physical planning guidance, and interleaved interpolation-based long-form inference.
Abstract: Recent advances in diffusion transformer (DiT) models have greatly improved audio-driven video avatar generation, enabling the synthesis of realistic avatars from a single reference image and an audio clip. However, generating avatars with $\textit{physically grounded human behaviors}$ remains challenging, primarily due to ($\textbf{i}$) overreliance on shallow audio-visual correlations and ($\textbf{ii}$) misalignment between semantic intent and behavioral expression. Consequently, existing methods often produce facial expressions and gestures that appear constrained, lack emotional depth, and fail to capture realistic human dynamics. In this paper, we present a $\textbf{Phys}$ically grounded DiT model for $\textbf{Avatar}$ generation, termed $\textbf{PhysAvatar}$, which produces realistic, contextually coherent, long-form avatar videos with human-like behavioral fidelity. PhysAvatar introduces three key innovations: ($\textbf{i}$) physical state supervision, which embeds human behavioral dynamics into the video DiT model via discrete diffusion; ($\textbf{ii}$) physical planning guidance, which leverages a multimodal language model to jointly analyze audio and visual inputs and direct the avatar's behaviors according to semantic intent; and ($\textbf{iii}$) efficient long-form inference with interleaved video interpolation, which improves temporal coherence and identity preservation. Extensive experiments on our in-house dataset, as well as PATS and Vlogger, demonstrate that PhysAvatar outperforms state-of-the-art baselines in both generative quality and behavioral realism, consistently producing avatars that are more physically grounded, expressive, and lifelike.
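The third innovation, long-form inference with interleaved interpolation, can be illustrated with a generic chunked-generation sketch. This is not the paper's implementation: the generator is stubbed with random latents, and all names (`generate_chunk`, `long_form_inference`, the chunk/overlap sizes) are hypothetical. It shows only the general idea of generating overlapping chunks and blending the shared frames to smooth transitions.

```python
import numpy as np

def generate_chunk(seed_frames, length, rng):
    # Stand-in for the DiT video generator (illustrative only):
    # continue from the last seed frame with small random latent steps.
    base = seed_frames[-1] if seed_frames else np.zeros(4)
    return [base + rng.normal(scale=0.1, size=4) for _ in range(length)]

def long_form_inference(total_frames, chunk=16, overlap=4, seed=0):
    """Sketch of overlap-and-blend long-form inference: consecutive
    chunks share an `overlap` region whose frames are linearly
    interpolated, smoothing transitions (temporal coherence) while
    reusing context frames as conditioning (identity preservation)."""
    rng = np.random.default_rng(seed)
    video = generate_chunk([], chunk, rng)
    while len(video) < total_frames:
        context = video[-overlap:]          # condition on the tail
        nxt = generate_chunk(context, chunk, rng)
        # Blend overlapping frames: weight ramps from old chunk to new.
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)
            video[-overlap + i] = (1 - w) * video[-overlap + i] + w * nxt[i]
        video.extend(nxt[overlap:])
    return video[:total_frames]
```

In practice the blending would operate on DiT latents rather than raw frames, and the conditioning path would pass the context through the model rather than a list slice; the loop structure, however, captures the standard shape of chunked long-form video inference.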
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4263