From Silence to Sound: Towards Audio-Visual Subject Customization

02 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Video Generation, Audio-Visual Subject Customization, Decoupled Learning, Classifier-Free Guidance
TL;DR: We present a new task termed audio-visual subject customization, and propose VauCustom, a method designed to generate videos of user-defined characters with consistent visuals and natural audio.
Abstract: We introduce a novel audio-visual subject customization task that generates videos featuring user-defined characters, emphasizing both the visual and audio dimensions. A key challenge is bridging the gap between visual synthesis and audio learning. To tackle this, we propose VauCustom (Video-Audio Custom), a two-stage method that leverages zero-shot text-to-speech to create personalized audio and then conditions video synthesis on this audio, unifying the audio and visual modalities. During training, we design a decoupled audio-visual learning strategy that models character appearance independently before joint training, thereby preserving the visual fidelity of pre-trained text-to-video models. In addition, we propose a local classifier-free guidance mechanism tailored for audio, which selectively emphasizes character regions based on cross-attention similarity, enhancing audio-visual synchronization while reducing the impact on irrelevant background regions. Experiments demonstrate that VauCustom delivers consistent character appearance, natural audio quality, and precise audio-video synchronization across diverse scenarios, including real humans, animated human characters, and animal characters. We will release all data, code, and models to support future research.
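The abstract's local classifier-free guidance idea can be illustrated with a minimal sketch: instead of a single global guidance scale, the scale is boosted only where a cross-attention map indicates the character region. The function name, scale values, and thresholding below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def local_cfg(eps_uncond, eps_cond, attn_map,
              base_scale=7.5, extra_scale=2.0, thresh=0.5):
    """Blend unconditional/conditional noise predictions with a
    spatially varying guidance scale: pixels whose (hypothetical)
    character cross-attention exceeds `thresh` get a boosted scale,
    while background pixels keep the base scale."""
    # Normalize the cross-attention map to [0, 1].
    attn = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-8)
    # Binary mask of character regions (1 inside, 0 in the background).
    mask = (attn > thresh).astype(eps_cond.dtype)
    # Per-pixel guidance scale: base everywhere, boosted on the character.
    scale = base_scale + extra_scale * mask
    # Standard classifier-free guidance update, applied element-wise.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With zero unconditional and unit conditional predictions, background pixels receive the base scale (7.5) and character pixels the boosted scale (9.5), so only character regions are pushed harder toward the audio condition.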
Supplementary Material: zip
Primary Area: generative models
Submission Number: 972