Highlights
• First publicly proposed text–image–audio-to-video generation task.
• Different designs for interactions among the visual, text, and audio modalities.
• Better performance from combining diffusion and GAN models.
• Creation of three triple-modality datasets as further reliable benchmarks.