CyberHost: A One-stage Diffusion Framework for Audio-driven Talking Body Generation

Published: 22 Jan 2025, Last Modified: 03 Mar 2025 · ICLR 2025 Oral · License: CC BY 4.0
Keywords: Audio-driven Human Animation, Diffusion Model, Generative Model, Human Video Generation
TL;DR: We propose a one-stage audio-driven talking body generation framework, CyberHost, designed to produce human videos that match the input audio with high expressiveness and realism.
Abstract: Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. While breakthroughs have been made in driving human animation through various modalities for portraits, most current solutions for human body animation still focus on video-driven methods, leaving audio-driven talking body generation relatively underexplored. In this paper, we introduce CyberHost, a one-stage audio-driven talking body generation framework that addresses common synthesis degradations in half-body animation, including hand integrity, identity consistency, and natural motion. CyberHost's key designs are twofold. First, the Region Attention Module (RAM) maintains a set of learnable, implicit, identity-agnostic latent features and combines them with identity-specific local visual features to enhance the synthesis of critical local regions. Second, the Human-Prior-Guided Conditions introduce more human structural priors into the model, reducing uncertainty in generated motion patterns and thereby improving the stability of the generated videos. To our knowledge, CyberHost is the first one-stage audio-driven human diffusion model capable of zero-shot video generation for the human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects. CyberHost can also be extended to video-driven and audio-video hybrid-driven scenarios, achieving similarly satisfactory results.
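To make the Region Attention Module concrete, below is a minimal, illustrative PyTorch sketch of the mechanism the abstract describes: a bank of learnable, identity-agnostic latent tokens cross-attends to identity-specific local visual features (e.g., tokens extracted from a hand or face crop). The class name, dimensions, and exact fusion scheme are assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RegionAttentionSketch(nn.Module):
    """Hypothetical sketch of the RAM idea: learnable, identity-agnostic
    latents fused with identity-specific local visual features via
    cross-attention. Shapes and hyperparameters are assumptions."""

    def __init__(self, dim: int = 512, num_latents: int = 16, num_heads: int = 8):
        super().__init__()
        # Learnable, identity-agnostic latent feature bank (shared across identities).
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        # Cross-attention: latent tokens query the local visual features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_feats: torch.Tensor) -> torch.Tensor:
        # local_feats: (B, N, dim) visual tokens from a local region crop.
        B = local_feats.shape[0]
        q = self.latents.unsqueeze(0).expand(B, -1, -1)  # (B, num_latents, dim)
        fused, _ = self.cross_attn(q, local_feats, local_feats)
        # Residual connection; the resulting region tokens could then be
        # injected into the denoising backbone.
        return self.norm(fused + q)
```

Under these assumptions, the output region tokens combine shared structural knowledge (from the latent bank) with the current subject's appearance (from the local crop), which is one plausible reading of how RAM enhances critical regions such as hands and face.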
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4230