From Silence to Sound: Towards Audio-Visual Subject Customization

02 Sept 2025 (modified: 13 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Video Generation, Audio-Visual Subject Customization, Decoupled Learning, Classifier-Free Guidance
TL;DR: We present a new task termed audio-visual subject customization, and propose VauCustom, a method designed to generate videos of user-defined characters with consistent visuals and natural audio.
Abstract: We introduce a novel audio-visual subject customization task that generates videos featuring user-defined characters, emphasizing both the visual and audio dimensions. A key challenge is bridging the gap between visual synthesis and audio learning. To tackle this, we propose VauCustom (Video-Audio Custom), a two-stage method that leverages zero-shot text-to-speech to create personalized audio and then conditions video synthesis on this audio, unifying the audio and visual modalities. During training, we design a decoupled audio-visual learning strategy that models character appearance independently before joint training, thereby preserving the visual fidelity of pre-trained text-to-video models. In addition, we propose a local classifier-free guidance mechanism tailored for audio, which selectively emphasizes character regions based on cross-attention similarity, enhancing audio-visual synchronization while reducing the impact on irrelevant background regions. Experiments demonstrate that VauCustom delivers consistent character appearance, natural audio quality, and precise audio-video synchronization across diverse scenarios, including real humans, animated human characters, and animal characters. We will release all data, code, and models to support future research.
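The abstract's local classifier-free guidance idea can be illustrated with a minimal sketch: instead of a single global guidance scale, the scale is boosted only where a cross-attention map indicates the character region. The function name, scale values, and thresholding below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def local_cfg(eps_uncond, eps_cond, attn_map,
              base_scale=7.5, extra_scale=2.0, thresh=0.5):
    """Blend unconditional/conditional noise predictions with a
    spatially varying guidance scale: pixels whose (hypothetical)
    character cross-attention exceeds `thresh` get a boosted scale,
    while background pixels keep the base scale."""
    # Normalize the cross-attention map to [0, 1].
    attn = (attn_map - attn_map.min()) / (attn_map.max() - attn_map.min() + 1e-8)
    # Binary mask of character regions (1 inside, 0 in the background).
    mask = (attn > thresh).astype(eps_cond.dtype)
    # Per-pixel guidance scale: base everywhere, boosted on the character.
    scale = base_scale + extra_scale * mask
    # Standard classifier-free guidance update, applied element-wise.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With zero unconditional and unit conditional predictions, background pixels receive the base scale (7.5) and character pixels the boosted scale (9.5), so only character regions are pushed harder toward the audio condition.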
Supplementary Material: zip
Primary Area: generative models
Submission Number: 972