PAV-DiT: A Cross-modal Alignment Projected Latent Diffusion Transformer for Synchronized Audio-Video Generation

15 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: video generation, diffusion model, diffusion transformer, sounding video generation
Abstract: Sounding video generation (SVG) is a challenging task due to inherent cross-modal temporal and semantic misalignment and the high computational cost of multimodal data. To address these issues, we propose the Projected Latent Audio-Video Diffusion Transformer (PAV-DiT), a novel diffusion transformer explicitly designed for synchronized audio-video synthesis. Our approach introduces a Multi-scale Dual-stream Spatio-temporal Autoencoder (MDSA) that bridges the audio and video modalities through a unified cross-modal latent space, compressing audio and video inputs into 2D latents that each capture distinct aspects of the signals. To further enhance audiovisual consistency and facilitate cross-modal interaction, MDSA incorporates a multi-scale attention mechanism that enables temporal alignment across resolutions and supports fine-grained fusion between modalities. To capture the fine-grained spatio-temporal dependencies inherent in SVG, we introduce the Spatio-Temporal Diffusion Transformer (STDiT) as the generator of our framework. Extensive experiments show that our method achieves state-of-the-art results on standard benchmarks (Landscape and AIST++), surpassing existing approaches across all evaluation metrics while substantially accelerating training and sampling. We further explore open-domain SVG on AudioSet, demonstrating the generalization ability of PAV-DiT.
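To make the dual-stream idea concrete, below is a minimal PyTorch sketch of an encoder that compresses video frames and a mel spectrogram into per-step latent tokens and fuses the streams with bidirectional cross-attention. All names, shapes, and hyperparameters (`DualStreamEncoder`, 64×64 frames, 128-bin mels, `dim=256`) are illustrative assumptions, not the authors' implementation; the multi-scale attention pyramid and the STDiT generator are omitted.

```python
import torch
import torch.nn as nn


class DualStreamEncoder(nn.Module):
    """Hypothetical dual-stream encoder: compresses video frames and a mel
    spectrogram into per-step latent tokens, then fuses the two streams with
    bidirectional cross-attention for coarse temporal alignment."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        # Video stream: 3 x 64 x 64 RGB frames -> one token per frame.
        self.video_proj = nn.Conv2d(3, dim, kernel_size=8, stride=8)
        # Audio stream: 128-bin mel spectrogram -> one token per 4 mel frames.
        self.audio_proj = nn.Conv1d(128, dim, kernel_size=4, stride=4)
        # Cross-modal attention in both directions (video <-> audio).
        self.v_from_a = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # video: (B, T, 3, 64, 64), audio: (B, 128, S)
        B, T = video.shape[:2]
        v = self.video_proj(video.flatten(0, 1))    # (B*T, dim, 8, 8)
        v = v.flatten(2).mean(-1).view(B, T, -1)    # (B, T, dim), pooled per frame
        a = self.audio_proj(audio).transpose(1, 2)  # (B, S // 4, dim)
        # Each stream queries the other; residuals keep the unimodal content.
        v_fused, _ = self.v_from_a(v, a, a)         # video tokens attend to audio
        a_fused, _ = self.a_from_v(a, v, v)         # audio tokens attend to video
        return v + v_fused, a + a_fused             # aligned 2D latent sequences


# Smoke test with toy shapes (hypothetical: 16 frames, 64 mel steps).
enc = DualStreamEncoder()
v_lat, a_lat = enc(torch.randn(2, 16, 3, 64, 64), torch.randn(2, 128, 64))
print(v_lat.shape, a_lat.shape)  # torch.Size([2, 16, 256]) torch.Size([2, 16, 256])
```

Under these assumptions, the fused token sequences would be the 2D latents handed to a joint spatio-temporal diffusion transformer, and the abstract's multi-scale attention would presumably repeat this fusion at several temporal resolutions rather than at the single scale shown here.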
Primary Area: generative models
Submission Number: 5639