Abstract: Video scene segmentation is a crucial task that temporally parses long-form videos into basic story units. Most advanced self-supervised methods for video scene segmentation focus heavily on learning video shot features in the pre-training stage. However, these methods fail to encode shot relations, which are essential to video scene segmentation, resulting in over-segmentation of video scenes. A straightforward solution to this problem is to model shot relations using a sufficient number of scene boundaries. In this paper, we introduce a high-quality pseudo-scene-boundary generation method, Semantic Transition Detection (STD), which discovers semantic inconsistencies in temporal video chunks. Taking scene boundary prediction as the pretext task, we propose a self-supervised method for video scene segmentation in which an STD pre-trained model is fine-tuned with limited scene boundary ground truths. In addition, considering the impact of shot duration on segmentation results, we integrate shot duration information into the fine-tuning stage. Experiments on widely used benchmark datasets demonstrate that our approach effectively mitigates over-segmentation and achieves remarkable results compared with the state of the art.
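To make the core idea of pseudo-boundary generation concrete, the following is a minimal illustrative sketch, not the paper's actual STD implementation: given per-shot embeddings, it flags a candidate boundary wherever the mean embeddings of the chunks on either side of a shot transition are semantically inconsistent (low cosine similarity). The function name `pseudo_boundaries` and the `window` and `threshold` parameters are assumptions introduced here for illustration only.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_embedding(shots):
    # Element-wise mean of a list of same-length embedding vectors (a "chunk").
    dim = len(shots[0])
    return [sum(s[i] for s in shots) / len(shots) for i in range(dim)]

def pseudo_boundaries(shot_embs, window=2, threshold=0.5):
    # Hypothetical STD-style detector: mark a pseudo scene boundary before
    # shot t when the chunk of `window` shots to its left and the chunk of
    # `window` shots starting at t disagree semantically.
    boundaries = []
    for t in range(window, len(shot_embs) - window):
        left = mean_embedding(shot_embs[t - window:t])
        right = mean_embedding(shot_embs[t:t + window])
        if cosine(left, right) < threshold:
            boundaries.append(t)
    return boundaries
```

For example, on a sequence of four shots embedded near one direction followed by four near an orthogonal direction, the only detected boundary falls at the semantic transition between the two groups.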