Abstract: Highlights
• Audio conditions can represent unlimited variation and temporally dynamic features.
• Audio-to-video generation struggles to fully represent semantically complex audio in video.
• We focus on generating videos with multiple objects by utilizing audio source separation, enhancing the representation of single-source audio.
• Our approach outperforms existing audio-to-video generation models on video quality and audio-visual alignment metrics, as well as in a user study.