Abstract: Content creation is a growing field in Artificial Intelligence (AI) that achieves promising results using generative models. With recent advances in generative models such as Generative Adversarial Networks (GANs), videos can be generated under specific conditions or even without any conditioning. In this paper, we propose an end-to-end model that generates videos from audio signals containing both transcript and music. We call our model PhonicsGAN because it draws an alphabetic graphical video and animates it given a phonics song. PhonicsGAN is among the first attempts to create preliminary graphical videos that can inspire and support graphic designers and educators, saving them time and effort. Since available graphical datasets lack acoustic signals, a suitable candidate domain for our proposed application is phonics videos for children. PhonicsGAN handles videos that are diverse in content, motion, and soundtrack by employing Gated Recurrent Unit (GRU) layers to encode the soundtrack. A Convolutional Neural Network (CNN) then generates a phonics video based on the encoded audio signal and the provided label. The preliminary results are promising and show improvements over LSTM and MoCoGAN, two state-of-the-art frameworks in the video generation domain.
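The abstract does not include code, so the following is only a minimal PyTorch sketch of the pipeline it describes: GRU layers encode an audio sequence, and a label-conditioned transposed-CNN decodes the encoded states into video frames. All names, layer sizes, the mel-spectrogram input, and the fixed frame count are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PhonicsGenerator(nn.Module):
    """Hypothetical sketch: GRU audio encoder + label-conditioned CNN frame generator."""
    def __init__(self, n_mels=80, hidden=256, n_labels=26, label_dim=32, frames=16):
        super().__init__()
        self.frames = frames
        self.audio_enc = nn.GRU(input_size=n_mels, hidden_size=hidden, batch_first=True)
        self.label_emb = nn.Embedding(n_labels, label_dim)  # e.g., one label per letter
        self.fc = nn.Linear(hidden + label_dim, 512 * 4 * 4)
        self.deconv = nn.Sequential(  # upsample 4x4 -> 32x32 RGB frames
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(128, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, mel, label):
        # mel: (B, T, n_mels) log-mel spectrogram of the phonics song
        out, _ = self.audio_enc(mel)                      # (B, T, hidden)
        # pick `frames` evenly spaced audio states, one per video frame
        idx = torch.linspace(0, out.size(1) - 1, self.frames).long()
        states = out[:, idx]                              # (B, frames, hidden)
        lab = self.label_emb(label).unsqueeze(1).expand(-1, self.frames, -1)
        z = torch.cat([states, lab], dim=-1)              # condition each frame
        x = self.fc(z).view(-1, 512, 4, 4)                # (B*frames, 512, 4, 4)
        video = self.deconv(x)                            # (B*frames, 3, 32, 32)
        return video.view(mel.size(0), self.frames, 3, 32, 32)

# Usage: a batch of 2 songs (100 mel frames each) with letter labels
gen = PhonicsGenerator()
video = gen(torch.randn(2, 100, 80), torch.tensor([0, 5]))
print(video.shape)  # torch.Size([2, 16, 3, 32, 32])
```

In a full GAN setup this generator would be trained adversarially against a video discriminator, following standard conditional-GAN practice; that training loop is omitted here.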