CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling

TMLR Paper 2353 Authors

08 Mar 2024 (modified: 15 Mar 2024) · Under review for TMLR · CC BY-SA 4.0
Abstract: We introduce a multi-modal diffusion model for the bi-directional conditional generation of video and audio. Because accurate alignment between video and audio events is critical in multi-modal generation, we propose a joint contrastive training loss that improves synchronization between visual and auditory occurrences. We conduct comprehensive experiments on multiple datasets to evaluate the proposed model, assessing generation quality and alignment with both objective and subjective metrics. Our findings show that the proposed model outperforms the baseline in both effectiveness and efficiency. Notably, the contrastive loss improves audio-visual alignment, particularly on the high-correlation video-to-audio generation task. These results suggest that the proposed model is a robust approach to improving the quality and alignment of multi-modal generation, contributing to the advancement of video and audio conditional generation systems.
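The abstract does not specify the exact form of the joint contrastive loss. A common choice for aligning paired embeddings across modalities is a symmetric InfoNCE objective, so the sketch below is a minimal illustration under that assumption only; the function name `av_contrastive_loss`, the embedding shapes, and the `temperature` value are hypothetical and not taken from the paper.

```python
# Minimal sketch of a joint audio-visual contrastive loss (symmetric
# InfoNCE). This is an assumed formulation for illustration, not the
# paper's confirmed objective.
import torch
import torch.nn.functional as F

def av_contrastive_loss(video_emb: torch.Tensor,
                        audio_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (video, audio) embeddings
    of shape (batch, dim): matched pairs are positives, all other in-batch
    pairs serve as negatives."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature              # (batch, batch) similarities
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2a = F.cross_entropy(logits, targets)       # video -> audio direction
    loss_a2v = F.cross_entropy(logits.t(), targets)   # audio -> video direction
    return 0.5 * (loss_v2a + loss_a2v)

# Usage with dummy embeddings, e.g. pooled features from the two branches:
video_emb = torch.randn(8, 512)
audio_emb = torch.randn(8, 512)
loss = av_contrastive_loss(video_emb, audio_emb)
```

In a joint training setup, a loss of this form would typically be added to the diffusion objective with a weighting coefficient, encouraging temporally paired video and audio representations to align while pushing apart mismatched pairs.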
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Sebastian_Tschiatschek1
Submission Number: 2353