MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

15 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Sounding video generation, multi-modal generation, diffusion model
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Sounding video generation (SVG) is a challenging audio-video joint generation task that requires both single-modal realism and cross-modal consistency. Previous diffusion-based methods tackle SVG directly in the original signal space, which incurs a substantial computational burden. In this paper, we introduce a novel multi-modal latent diffusion model (MM-LDM), which establishes a compressed latent space that is perceptually equivalent to the original audio-video signal space but drastically reduces computational complexity. We unify the representation of audio and video signals and construct a shared high-level semantic feature space to bridge the information gap between the two modalities. Furthermore, we present a novel cross-modal sampling guidance that extends our generative model to audio-to-video and video-to-audio conditional generation tasks. We obtain new state-of-the-art results with significant quality and efficiency gains; in particular, our method improves all evaluation metrics while achieving faster training and sampling.
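To make the cross-modal sampling guidance mentioned in the abstract concrete, below is a minimal, hypothetical sketch in the spirit of classifier-free guidance: a joint latent denoiser is queried once with the conditioning modality's latent (e.g., audio for audio-to-video generation) and once with that condition dropped, and the two noise estimates are combined with a guidance scale. All module names, latent shapes, and the exact guidance form are assumptions for illustration, not the paper's definitive formulation.

```python
import torch
import torch.nn as nn


class TinyLatentDenoiser(nn.Module):
    """Stand-in for the joint audio-video latent denoiser (hypothetical architecture/shapes)."""

    def __init__(self, dim: int = 16):
        super().__init__()
        # Input: video latent + audio latent + timestep scalar -> predicted noise on video latent.
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, z_video, z_audio, t):
        h = torch.cat([z_video, z_audio, t], dim=-1)
        return self.net(h)


def cross_modal_guided_eps(model, z_video, z_audio, t, scale: float = 2.0):
    """Assumed classifier-free-style cross-modal guidance:
    mix the audio-conditioned noise estimate with an unconditional one."""
    eps_cond = model(z_video, z_audio, t)                      # conditioned on the audio latent
    eps_uncond = model(z_video, torch.zeros_like(z_audio), t)  # audio condition dropped (null token)
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Toy usage: one guided noise prediction on random latents.
model = TinyLatentDenoiser()
z_v, z_a = torch.randn(4, 16), torch.randn(4, 16)
t = torch.full((4, 1), 0.5)
print(cross_modal_guided_eps(model, z_v, z_a, t).shape)  # torch.Size([4, 16])
```

The same pattern would run in the opposite direction (video-to-audio) by swapping which modality's latent is treated as the condition.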
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 281