Abstract: In the mobile internet era, short videos are inundating people's lives. However, visual language models designed specifically for short videos remain underexplored. Short videos are not merely videos of limited duration: their prominent visual details and high information density set them apart from long videos. In this paper, we propose SpatioTemporal Fine-grained Video Description (STFVD), a problem that emphasizes the uniqueness of short videos and entails capturing the intricate details of the main subject and fine-grained movements. To this end, we create a comprehensive Short Video Advertisement Description (SVAD) dataset comprising 34,930 clips from 5,046 videos. The dataset covers a range of topics, including 191 sub-industries, 649 popular products, and 470 trending games. The annotation process was designed to ensure the inclusion of fine-grained spatiotemporal information, resulting in 34,930 high-quality annotations. Compared to existing datasets, samples in SVAD exhibit a higher text information density, suggesting that SVAD is better suited to the analysis of short videos. Based on the SVAD dataset, we develop SVAD-VLM to generate spatiotemporal fine-grained descriptions for short videos. We use a prompt-guided keyword generation task to efficiently learn key visual information. Moreover, we utilize dual visual alignment to exploit the advantages of mixed-dataset training. Experiments on the SVAD dataset demonstrate the challenge of STFVD and the competitive performance of the proposed method compared to previous ones.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: 1. We describe the uniqueness of short videos in video understanding and present a new problem of spatiotemporal fine-grained video description.
2. We create the Short Video Advertisements Description (SVAD) dataset with videos from a broad spectrum of categories and spatiotemporal fine-grained descriptions of considerable linguistic complexity. SVAD is, to the best of our knowledge, the first dataset aimed at the fine-grained description of short video advertisements.
3. We develop SVAD-VLM to facilitate spatiotemporal fine-grained video description for short video advertisements. We propose prompt-guided keyword generation to address the challenge posed by semantically rich fine-grained descriptions. Furthermore, we introduce dual visual alignment to leverage the benefits of mixed training with auxiliary datasets and enhance the model's generalization capability.
Supplementary Material: zip
Submission Number: 3463