Keywords: Video Chapter Generation, High-level Video Understanding, MultiModal Machine Learning
Abstract: We aim to address the problem of video chapter generation.
This task differs significantly from traditional video activity analysis.
The videos in chapter generation are much longer and contain many complex temporal structures.
Moreover, the association between video frames and narrations plays a crucial role in expressing underlying information.
To facilitate the research along this direction, we introduce a large-scale dataset called ChapterGen, which consists of approximately $10k$ user-generated videos with annotated chapter descriptions.
Our data collection procedure is fast, scalable, and does not require any additional manual annotation.
On top of this dataset, we propose a two-stage framework to perform chapter localization and chapter title generation.
This framework captures two aspects of a video: visual dynamics and narration text.
To parse the whole video efficiently, we build the framework based on a flexible clip sliding window.
Our experiments demonstrate that the proposed framework outperforms existing methods in both accuracy and efficiency.
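
The clip sliding window mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the window length, the stride, or how clips are fed to the model, so the function name `sliding_clip_windows` and its parameters are hypothetical and only show the general idea of covering a long video with fixed-size, possibly overlapping clips.

```python
from typing import List, Tuple


def sliding_clip_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges that cover the whole video.

    Hypothetical helper: window and stride are illustrative assumptions,
    not values taken from the paper.
    """
    if num_frames <= 0 or window <= 0 or stride <= 0:
        raise ValueError("num_frames, window, and stride must be positive")

    windows: List[Tuple[int, int]] = []
    start = 0
    while start < num_frames:
        # Clamp the last clip so it never runs past the end of the video.
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows


if __name__ == "__main__":
    # A 10-frame video, 4-frame clips, stride 3 (adjacent clips overlap by one frame).
    print(sliding_clip_windows(10, 4, 3))
```

With a stride smaller than the window, adjacent clips overlap, so a chapter boundary that falls near the edge of one clip also appears inside a neighboring clip.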
One-sentence Summary: We propose a new dataset and a two-stage framework to address the video chapter generation problem.