Spatiotemporal Fine-grained Video Description for Short Videos

Published: 01 Jan 2024, Last Modified: 11 Apr 2025 · ACM Multimedia 2024 · CC BY-SA 4.0
Abstract: In the mobile internet era, short videos are inundating people's lives. However, visual language models specifically designed for short videos have not yet received sufficient attention. Short videos are not merely videos of limited duration: their prominent visual details and high information density differentiate them from long videos. In this paper, we propose SpatioTemporal Fine-grained Video Description (STFVD), which emphasizes the uniqueness of short videos by capturing the intricate details of the main subject and its fine-grained movements. To this end, we create a comprehensive Short Video Advertisements Description (SVAD) dataset comprising 34,930 clips from 5,046 videos. The dataset covers a range of topics, including 191 sub-industries, 649 popular products, and 470 trending games. Various efforts were made during annotation to ensure the inclusion of fine-grained spatiotemporal information, resulting in 34,930 high-quality annotations. Compared to existing datasets, samples in SVAD exhibit a higher text information density, suggesting that SVAD is better suited to the analysis of short videos. Based on the SVAD dataset, we develop a visual language model (SVAD-VLM) to generate spatiotemporal fine-grained descriptions for short videos. We use a prompt-guided keyword generation task to efficiently learn key visual information, and we employ dual visual alignment to exploit the advantages of mixed-dataset training. Experiments on the SVAD dataset demonstrate the challenge of STFVD and the competitive performance of the proposed method compared to previous ones.