Keywords: AIGC, Diffusion Model, Sounding Video Generation
Abstract: Recent AIGC advances have rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task, enabling synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced proprietary systems such as Veo3, existing open-source methods still fall short in generation quality, temporal synchrony, and alignment with human preferences. This paper presents a concise yet powerful framework for efficient and effective JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables effective cross-modal communication while enhancing single-modality generation quality. Second, we propose a temporal-aligned RoPE (TA-RoPE) strategy that achieves explicit, frame-level synchronization between audio and video tokens. In addition, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preferences across the quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance with only around 1M training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies validate the effectiveness of the proposed modules.
We hope this work can serve as a milestone for native JAVG and bring new inspiration to the community.
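Illustration of temporal alignment (not from the submission): the sketch below is a minimal PyTorch-style example of the general idea of placing audio and video tokens on a shared timeline before applying rotary position embeddings, so that tokens covering the same instant receive the same rotary phase. All specifics (clip duration, token counts, head dimension, and the seconds-to-position scale) are hypothetical assumptions for illustration and do not describe the paper's TA-RoPE implementation.

import torch

def rotary_angles(positions: torch.Tensor, head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard 1D RoPE: map positions to rotation angles of shape (len(positions), head_dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    return torch.outer(positions.to(torch.float32), inv_freq)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate query/key features x of shape (num_tokens, head_dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Hypothetical clip: 4 seconds, 16 video latent frames, 64 audio latent steps.
duration_s, n_video, n_audio, head_dim = 4.0, 16, 64, 64

# Vanilla per-modality RoPE would give each stream independent 0..N-1 indices.
# A temporally aligned variant instead places both streams on a shared timeline,
# so audio and video tokens covering the same instant get matching rotary phases.
video_t = torch.linspace(0.0, duration_s, n_video)
audio_t = torch.linspace(0.0, duration_s, n_audio)

scale = 25.0  # hypothetical seconds-to-position scale shared by both modalities
video_q = apply_rope(torch.randn(n_video, head_dim), rotary_angles(video_t * scale, head_dim))
audio_k = apply_rope(torch.randn(n_audio, head_dim), rotary_angles(audio_t * scale, head_dim))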
Supplementary Material: zip
Primary Area: generative models
Submission Number: 19605