Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers

Jibin Song; Mingi Kwon; Jaeseok Jeong; Youngjung Uh

Published: 26 Jan 2026, Last Modified: 11 Apr 2026

Venue: ICLR 2026 Poster

License: CC BY 4.0

Keywords: Audio-to-Video Generation, Multimodal Synthesis, Temporal Synchronization, Diffusion Transformer, Video Generation, Audio-Conditioned Generation

TL;DR: We improve audio-aligned video generation by leveraging a pretrained video generation model while preserving its original performance.

Abstract: Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation. However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity. We present Syncphony, which generates 380×640, 24 fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization: (1) Motion-aware Loss, which emphasizes learning in high-motion regions; and (2) Audio Sync Guidance, which guides the full model using a visually aligned but off-sync model without audio layers, so that audio cues are better exploited at inference while visual quality is maintained. To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures how well the motion cues in the generated video allow the original audio to be reconstructed. Experiments on the AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality.
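The abstract names the two components without giving formulas, so the following is only a rough sketch of the general ideas, not the authors' formulation. It assumes (a) that "emphasizing high-motion regions" can be approximated by weighting a squared-error loss by frame-to-frame change, and (b) that guidance extrapolates from the off-sync model's prediction toward the full model's prediction, in the style of classifier-free guidance. All function names, the `alpha` and `scale` parameters, and the use of flat lists in place of video tensors are hypothetical.

```python
from typing import List, Sequence


def motion_weights(prev_frame: Sequence[float],
                   next_frame: Sequence[float],
                   alpha: float = 1.0) -> List[float]:
    """Assumed motion proxy: weights grow with the absolute frame difference."""
    diffs = [abs(b - a) for a, b in zip(prev_frame, next_frame)]
    mean_diff = sum(diffs) / len(diffs)
    if mean_diff == 0.0:
        # Static clip: fall back to uniform weighting.
        return [1.0] * len(diffs)
    return [1.0 + alpha * d / mean_diff for d in diffs]


def motion_aware_mse(pred: Sequence[float],
                     target: Sequence[float],
                     weights: Sequence[float]) -> float:
    """Weighted mean squared error, normalized by the total weight."""
    total = sum(w * (p - t) ** 2 for p, t, w in zip(pred, target, weights))
    return total / sum(weights)


def audio_sync_guidance(full_pred: Sequence[float],
                        off_sync_pred: Sequence[float],
                        scale: float) -> List[float]:
    """Assumed guidance rule: off_sync + scale * (full - off_sync).

    scale == 1.0 returns the full model's prediction unchanged;
    scale > 1.0 pushes further in the direction the audio layers contribute.
    """
    return [o + scale * (f - o) for f, o in zip(full_pred, off_sync_pred)]
```

Under these assumptions, elements that change more between frames contribute more to the training loss, and at inference a `scale` above 1 amplifies whatever the audio-conditioned model predicts beyond the off-sync model.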

Supplementary Material: zip

Primary Area: generative models

Submission Number: 15982
