Shot Retrieval and Assembly with Text Script for Video Montage Generation

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Video montage generation, text-to-shot retrieval, transformer, dataset construction
TL;DR: We propose a novel transformer-based model for video montage generation that retrieves and assembles shots from arbitrary text scripts.
Abstract: With the growth of video sharing websites, many users want to create their own attractive video montages. However, it is difficult for inexperienced users to produce a well-edited montage without professional expertise, and it is time-consuming even for experts, who must select shots from abundant candidates and assemble them effectively. Instead of relying on manual creation, a number of automatic methods have been proposed for video montage generation. However, these methods typically take a single sentence as input for text-to-shot retrieval and ignore cross-sentence semantic coherence when the input is a complicated text script of multiple sentences. To overcome this drawback, we propose a novel model for video montage generation that retrieves and assembles shots from arbitrary text scripts. To this end, a sequence consistency transformer is devised for cross-sentence coherence modeling. More importantly, with this transformer, two novel sequence-level tasks are defined for sentence-shot alignment: the Cross-Modal Sequence Matching (CMSM) task and the Chaotic Sequence Recovering (CSR) task. To facilitate research on video montage generation, we construct a new, highly varied dataset of thousands of video-script pairs collected from documentaries. Extensive experiments on the constructed dataset demonstrate the superior performance of the proposed model. The dataset and generated video demos are available at https://github.com/RATVDemo/RATV
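The abstract names the two sequence-level pretext tasks but does not specify their formulation. The following is a minimal, hypothetical sketch of one plausible reading of the Chaotic Sequence Recovering (CSR) idea: shot features in a sequence are randomly permuted and a small transformer encoder is trained to predict each shot's original position. All module names, dimensions, and the loss choice here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a shuffle-and-recover (CSR-style) objective over shot
# features. The architecture and loss are assumptions for illustration only.
import torch
import torch.nn as nn

class SequenceOrderRecovery(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8, num_layers=2, max_len=16):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Classify each (shuffled) shot into its original position index.
        self.position_head = nn.Linear(feat_dim, max_len)

    def forward(self, shot_feats):
        # shot_feats: (batch, seq_len, feat_dim) shot embeddings in shuffled order
        encoded = self.encoder(shot_feats)
        return self.position_head(encoded)  # (batch, seq_len, max_len) logits

def csr_loss(model, shot_feats):
    """Shuffle each shot sequence and train the model to recover the order."""
    batch, seq_len, feat_dim = shot_feats.shape
    # perm[b, i] is the original index of the shot placed at position i.
    perm = torch.stack([torch.randperm(seq_len) for _ in range(batch)])
    shuffled = torch.gather(
        shot_feats, 1, perm.unsqueeze(-1).expand(-1, -1, feat_dim))
    logits = model(shuffled)  # predict each shot's original position
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), perm.reshape(-1))

# Toy usage with random features standing in for shot embeddings.
model = SequenceOrderRecovery()
feats = torch.randn(4, 8, 512)
loss = csr_loss(model, feats)
loss.backward()
```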
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)
Supplementary Material: zip