Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners

Published: 01 Jan 2024 · Last Modified: 11 Apr 2025 · CVPR 2024 · CC BY-SA 4.0
Abstract: Video and audio content creation is a core technique for the movie industry and professional users. Recently, existing diffusion-based methods have tackled video and audio generation separately, which hinders the transfer of these techniques from academia to industry. In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint visual-audio generation. We observe the powerful generation ability of off-the-shelf video and audio generation models. Thus, instead of training giant models from scratch, we propose to bridge the existing strong models through a shared latent representation space. Specifically, we propose a multimodal latent aligner built on the pre-trained ImageBind model. Our latent aligner shares a similar core with classifier guidance, which steers the diffusion denoising process at inference time. Through a carefully designed optimization strategy and loss functions, we demonstrate the superior performance of our method on joint video-audio generation, visual-steered audio generation, and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/.
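To make the abstract's core mechanism concrete, below is a minimal sketch of what a classifier-guidance-style latent aligner could look like: at each denoising step, the diffusion latent is nudged down the gradient of an alignment loss that pulls the ImageBind embedding of the current sample toward the embedding of the conditioning modality. This is an illustrative reconstruction, not the authors' code; `decode_latent`, `imagebind_embed`, `guidance_scale`, and the generic `unet`/`scheduler` interface are all assumed placeholders.

```python
import torch
import torch.nn.functional as F

def aligned_denoise_step(unet, scheduler, latent, t, cond_embed,
                         decode_latent, imagebind_embed,
                         guidance_scale=0.1):
    """One denoising step with an ImageBind alignment gradient.

    cond_embed: ImageBind embedding of the conditioning modality
    (e.g., the audio embedding when steering video generation).
    All helper names here are hypothetical stand-ins.
    """
    latent = latent.detach().requires_grad_(True)

    # Standard diffusion noise prediction for this timestep.
    noise_pred = unet(latent, t)

    # Alignment loss: cosine distance between the ImageBind embedding
    # of the decoded current sample and the conditioning embedding.
    sample_embed = imagebind_embed(decode_latent(latent))
    loss = 1.0 - F.cosine_similarity(sample_embed, cond_embed, dim=-1).mean()

    # Classifier-guidance-style update: push the latent down the
    # gradient of the alignment loss before the scheduler step.
    grad = torch.autograd.grad(loss, latent)[0]
    latent = (latent - guidance_scale * grad).detach()

    # Proceed with the ordinary scheduler update on the shifted latent.
    return scheduler.step(noise_pred.detach(), t, latent).prev_sample
```

For joint visual-audio generation, the same idea would apply a symmetric loss to both latents at once, pulling their ImageBind embeddings toward each other rather than toward a fixed condition.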