MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · CC BY 4.0
Abstract: Sounding Video Generation (SVG) is an audio-video joint generation task challenged by high-dimensional signal spaces, distinct data formats, and different patterns of content information. To address these issues, we introduce a novel multi-modal latent diffusion model (MM-LDM) for the SVG task. We first unify the representation of audio and video data by converting both into one or a few images. We then introduce a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space. The former is perceptually equivalent to each modality's raw signal space but drastically reduces signal dimensionality; the latter bridges the information gap between modalities and provides more insightful cross-modal guidance. Our proposed method achieves new state-of-the-art results with significant gains in quality and efficiency: it improves all evaluation metrics and trains and samples faster on the Landscape and AIST++ datasets. We further evaluate it on open-domain sounding video generation, long sounding video generation, audio continuation, video continuation, and conditional single-modal generation, where MM-LDM demonstrates strong adaptability and generalization ability.
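The abstract describes the two-level design only at a high level. The minimal PyTorch sketch below illustrates the idea of per-modality perceptual latents plus a shared semantic space; it is an assumption-based illustration, not the authors' implementation, and the class name, layer sizes, and the treatment of audio as a single-channel spectrogram image are all hypothetical.

```python
# Minimal sketch of a hierarchical multi-modal autoencoder in the spirit of the
# abstract. All module names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn

class HierarchicalMMAutoencoder(nn.Module):
    def __init__(self, latent_ch=4, sem_dim=512):
        super().__init__()
        # Per-modality perceptual encoders: since both audio (spectrogram) and
        # video (frame images) are represented as images, 2D conv encoders apply.
        self.video_enc = nn.Sequential(nn.Conv2d(3, 64, 3, 2, 1), nn.SiLU(),
                                       nn.Conv2d(64, latent_ch, 3, 2, 1))
        self.audio_enc = nn.Sequential(nn.Conv2d(1, 64, 3, 2, 1), nn.SiLU(),
                                       nn.Conv2d(64, latent_ch, 3, 2, 1))
        # Projections from each perceptual latent into a shared semantic space.
        self.video_sem = nn.Linear(latent_ch, sem_dim)
        self.audio_sem = nn.Linear(latent_ch, sem_dim)

    def encode(self, video_img, audio_img):
        # Low-level perceptual latents: much smaller than the raw signals.
        zv = self.video_enc(video_img)            # (B, C, H/4, W/4)
        za = self.audio_enc(audio_img)            # (B, C, H/4, W/4)
        # High-level semantic features via global pooling + projection;
        # both modalities land in one shared space for cross-modal guidance.
        sv = self.video_sem(zv.mean(dim=(2, 3)))  # (B, sem_dim)
        sa = self.audio_sem(za.mean(dim=(2, 3)))  # (B, sem_dim)
        return zv, za, sv, sa
```

In this reading, the diffusion model operates on the compact latents (zv, za) while the shared features (sv, sa) supply the cross-modal conditioning signal.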
Primary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: This paper contributes to multimodal processing by introducing the first multi-modal latent diffusion model for the SVG task, incorporating several specialized designs and extending to various multi-modal generation tasks:
1) Unified Representation: The paper addresses the challenge of synthesizing realistic high-dimensional video and audio signals by unifying their representation. Converting both audio and video signals into one or a few images simplifies processing and bridges the gap between the data representation and the conveyed content.
2) Hierarchical Multi-Modal Autoencoder: The paper proposes a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space. This architecture captures both low-level perceptual details and high-level semantics, facilitating better understanding and synthesis of multimodal content.
3) Cross-Modal Guidance: The shared high-level semantic feature space, derived from the perceptual spaces, provides insightful cross-modal guidance that bridges the information gap between modalities and enhances the consistency and coherence of the synthesized content.
4) Extension to Cross-Modal Synthesis: The method can synthesize one modality conditioned on another via cross-modal sampling guidance, e.g., generating audio from a given video, which broadens the versatility and applicability of the approach (see the sketch after this list).
5) State-of-the-Art Results: MM-LDM achieves new state-of-the-art results in both quality and efficiency, improving all evaluation metrics and training and sampling faster on the Landscape and AIST++ datasets.
Overall, the paper's contribution is an approach that synthesizes realistic multimodal content efficiently while outperforming existing methods.
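To make item 4) concrete, the sketch below shows one common way such cross-modal sampling guidance can be realized: steering the audio latent's denoising step with a semantic feature extracted from the video. The update shown is the standard classifier-free guidance form; the paper's exact guidance rule may differ, and `denoiser` and `video_sem` are hypothetical placeholders.

```python
# Illustrative cross-modal guidance step (assumption-based, not the paper's code).
import torch

@torch.no_grad()
def guided_audio_step(denoiser, z_audio_t, t, video_sem, scale=2.0):
    # Unconditional noise prediction (semantic condition dropped).
    eps_uncond = denoiser(z_audio_t, t, cond=None)
    # Prediction conditioned on the video's shared semantic feature.
    eps_cond = denoiser(z_audio_t, t, cond=video_sem)
    # Push the sample toward regions consistent with the video condition.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

The same pattern runs in the other direction (video conditioned on audio), which is what enables the conditional single-modal generation tasks mentioned in the abstract.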
Supplementary Material: zip
Submission Number: 792