Abstract: Writing a video from a text script (i.e., video editing) is an important yet challenging multimedia task. Although a number of recent works have begun to develop deep learning models for video editing, they mainly focus on writing a video from a generic text script and are thus not well suited to specific domains (e.g., song lyrics). In this paper, we therefore introduce a novel video editing task called song-to-video translation (S2VT), which aims to write a video from song lyrics based on multimodal pre-training. Similar to generic video editing, S2VT involves three main steps: lyric-to-shot retrieval, shot selection, and shot stitching. However, it differs substantially from generic video editing in that song lyrics are often more abstract than a common text script, and thus a large-scale multimodal pre-training model is needed for lyric-to-shot retrieval. To facilitate research on S2VT, we construct a benchmark dataset with human annotations according to three evaluation metrics (i.e., semantic consistency, content coherence, and rhythm matching). Further, we propose a baseline method for S2VT by training three classifiers (one per metric) and developing a beam shot-selection algorithm based on the trained classifiers. Extensive experiments demonstrate the effectiveness of the proposed baseline method on the S2VT task.
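To give a concrete sense of how a beam shot-selection algorithm might combine the three classifier scores, the following is a minimal sketch. The function names, signatures, and the simple additive scoring scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of beam shot-selection over per-lyric-line candidates.
# Assumes each lyric line has candidate shots scored by three classifiers
# (semantic consistency, content coherence, rhythm matching); names are assumed.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Shot:
    shot_id: str
    duration: float  # seconds

def beam_select(
    candidates_per_line: List[List[Shot]],            # candidate shots per lyric line
    score_semantic: Callable[[int, Shot], float],     # classifier 1 (assumed interface)
    score_coherence: Callable[[Shot, Shot], float],   # classifier 2 (assumed interface)
    score_rhythm: Callable[[int, Shot], float],       # classifier 3 (assumed interface)
    beam_width: int = 5,
) -> List[Shot]:
    # Each beam entry is (cumulative score, partial shot sequence).
    beams: List[Tuple[float, List[Shot]]] = [(0.0, [])]
    for line_idx, candidates in enumerate(candidates_per_line):
        expanded: List[Tuple[float, List[Shot]]] = []
        for score, seq in beams:
            for shot in candidates:
                s = score_semantic(line_idx, shot) + score_rhythm(line_idx, shot)
                if seq:  # coherence is scored between consecutive shots
                    s += score_coherence(seq[-1], shot)
                expanded.append((score + s, seq + [shot]))
        # Keep only the top-scoring partial sequences.
        expanded.sort(key=lambda b: b[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1]  # best-scoring full shot sequence
```

The sketch keeps the `beam_width` best partial shot sequences at each lyric line, which avoids the combinatorial blow-up of exhaustive search while still accounting for inter-shot coherence.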