Videos generated from the same input image but with temporally shifted audio show that the generated motion follows the alignment of the audio cues.
Qualitative comparison of videos generated by Syncphony (Ours), AVSyncD, and Pyramid Flow (fine-tuned), a variant of our model without the audio cross-attention layers. Our method generates motion that is temporally aligned with audio events and produces clearer motion dynamics and more stable appearances, whereas AVSyncD often suffers from saturation artifacts and weakened motion.
Incorporating the Motion-aware Loss improves both the magnitude and the temporal precision of motion, particularly at the onsets and offsets of dynamic actions.
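A minimal sketch of one plausible form of such a motion-aware weighting, assuming a diffusion/flow-matching-style regression loss over video latents; the frame-difference weighting, the `alpha` knob, and the tensor layout are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def motion_aware_loss(pred, target, video_latents, alpha=1.0):
    """Per-location regression loss reweighted by motion magnitude.

    pred, target:  (B, T, C, H, W) model prediction and regression target
    video_latents: (B, T, C, H, W) clean video latents used to estimate motion
    alpha:         strength of the motion-based upweighting (assumed knob)
    """
    # Temporal frame difference as a simple proxy for motion magnitude.
    diff = video_latents[:, 1:] - video_latents[:, :-1]       # (B, T-1, C, H, W)
    motion = diff.abs().mean(dim=2, keepdim=True)             # (B, T-1, 1, H, W)
    # Pad the first frame so the weight map matches the sequence length.
    motion = F.pad(motion, (0, 0, 0, 0, 0, 0, 1, 0))          # (B, T, 1, H, W)

    # Upweight high-motion locations; static regions keep weight ~1.
    weight = 1.0 + alpha * motion / (motion.mean() + 1e-6)

    return (weight * (pred - target) ** 2).mean()
```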
Applying Audio Sync Guidance (ASG) captures subtle yet important sounds and generates motion precisely aligned with the audio cues (the full model with w=2 hits the exact target).
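A minimal sketch of a guidance step of this kind, assuming a classifier-free-guidance-style formulation in which the guided prediction extrapolates from a weakly audio-synchronized branch toward the audio-conditioned one with weight w; the `model` interface, the branch choice, and the w=2 default are assumptions suggested by the caption.

```python
import torch

@torch.no_grad()
def audio_sync_guided_step(model, x_t, t, audio_feats, w=2.0):
    """One denoising step with audio-sync guidance (hypothetical interface).

    The guided prediction pushes the sample away from the weakly synced
    branch and toward the audio-conditioned branch, scaled by w.
    """
    # Audio-conditioned prediction (audio cross-attention active).
    pred_sync = model(x_t, t, audio=audio_feats)
    # Reference prediction with the audio conditioning disabled.
    pred_base = model(x_t, t, audio=None)
    # CFG-style extrapolation: w=1 recovers the conditioned prediction,
    # w>1 strengthens the influence of the audio cues.
    return pred_base + w * (pred_sync - pred_base)
```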
Applying Audio RoPE to the audio features yields tighter temporal alignment between motion and sound events.
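A minimal sketch of applying rotary position embeddings to a sequence of audio tokens before cross-attention; the idea of expressing audio-token positions in video-frame units so the two modalities share a time axis is an assumption suggested by the caption, and the `rope` helper below is a standard RoPE implementation, not the paper's code.

```python
import torch

def rope(x, positions, base=10000.0):
    """Rotary position embedding over the last dimension of x.

    x:         (B, N, D) audio token features, D even
    positions: (N,) temporal position of each audio token (e.g. scaled to
               video-frame units so audio and video share a time axis)
    """
    B, N, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)  # (D/2,)
    angles = positions[:, None] * freqs[None, :]                 # (N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each feature pair (x1, x2) by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```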