MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

15 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Sounding video generation, multi-modal generation, diffusion model
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Sounding video generation (SVG) is a challenging audio-video joint generation task that requires both single-modal realism and cross-modal consistency. Previous diffusion-based methods tackle SVG directly in the original signal space, which incurs a substantial computational burden. In this paper, we introduce a novel multi-modal latent diffusion model (MM-LDM), which establishes a compressed latent space that is perceptually equivalent to the original audio-video signal space but drastically reduces computational complexity. We unify the representation of audio and video signals and construct a shared high-level semantic feature space to bridge the information gap between the two modalities. Furthermore, we present a novel cross-modal sampling guidance that extends our generative model to audio-to-video and video-to-audio conditional generation tasks. We obtain new state-of-the-art results with significant quality and efficiency gains; in particular, our method improves all evaluation metrics while achieving faster training and sampling.
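To make the cross-modal sampling guidance mentioned in the abstract concrete, below is a minimal, hypothetical sketch in the spirit of classifier-free guidance: a joint latent denoiser is queried once with the conditioning modality's latent (e.g., audio for audio-to-video generation) and once with that condition dropped, and the two noise estimates are combined with a guidance scale. All module names, latent shapes, and the exact guidance form are assumptions for illustration, not the paper's definitive formulation.

```python
import torch
import torch.nn as nn


class TinyLatentDenoiser(nn.Module):
    """Stand-in for the joint audio-video latent denoiser (hypothetical architecture/shapes)."""

    def __init__(self, dim: int = 16):
        super().__init__()
        # Input: video latent + audio latent + timestep scalar -> predicted noise on video latent.
        self.net = nn.Sequential(nn.Linear(dim * 2 + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, z_video, z_audio, t):
        h = torch.cat([z_video, z_audio, t], dim=-1)
        return self.net(h)


def cross_modal_guided_eps(model, z_video, z_audio, t, scale: float = 2.0):
    """Assumed classifier-free-style cross-modal guidance:
    mix the audio-conditioned noise estimate with an unconditional one."""
    eps_cond = model(z_video, z_audio, t)                      # conditioned on the audio latent
    eps_uncond = model(z_video, torch.zeros_like(z_audio), t)  # audio condition dropped (null token)
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Toy usage: one guided noise prediction on random latents.
model = TinyLatentDenoiser()
z_v, z_a = torch.randn(4, 16), torch.randn(4, 16)
t = torch.full((4, 1), 0.5)
print(cross_modal_guided_eps(model, z_v, z_a, t).shape)  # torch.Size([4, 16])
```

The same pattern would run in the opposite direction (video-to-audio) by swapping which modality's latent is treated as the condition.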
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 281