Semantic Fusion Based Graph Network for Video Scene Detection

Published: 01 Jan 2024, Last Modified: 05 Nov 2025 · IJCNN 2024 · CC BY-SA 4.0
Abstract: Video scene detection, an initial step of video analysis, temporally divides a heterogeneous video into semantic segments, and is widely used in video summarization, search, browsing, and retrieval. Video scene detection typically first cuts the video into shots and then groups these shots into segments. In this process, modeling the complex dependency relationships among shots is a major barrier. Existing methods use Recurrent Neural Networks and Hidden Markov Models to simulate the dependencies between shots; however, such linear approaches cope poorly with the hierarchical structure of video. In this paper, a GNN-based network is proposed instead to model the complex structure of videos. In addition, the semantic gap between low-level features and high-level semantics is another major obstacle in video scene detection. Here, three visual semantic elements of each shot, namely environment, object, and action, together with an audio feature, are extracted as the shot representation. We then utilize a multi-modal fusion strategy, combining early fusion and late fusion, to bridge the semantic gap between low-level features and high-level semantics. The proposed method was evaluated on the BBC Planet Earth dataset and the Open Video Scene Detection (OVSD) dataset; the experimental results demonstrate that it outperforms the state of the art on the video scene detection task.
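To make the described pipeline concrete, below is a minimal PyTorch sketch of the overall idea: per-shot multi-modal features (environment, object, action, audio) are early-fused, passed through a simple message-passing layer over a shot-similarity graph, and combined with per-modality boundary scores as a form of late fusion. The module names, feature dimensions, cosine-similarity graph construction, and the averaging used for late fusion are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch only (not the authors' implementation): early fusion of
# per-shot modality features, one round of message passing over a soft
# shot-similarity graph, and late fusion with per-modality boundary heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShotGraphFusion(nn.Module):
    def __init__(self, dims=None, hidden=256):
        super().__init__()
        # Assumed modality dimensions; the paper does not specify these here.
        self.dims = dims or {"env": 512, "obj": 512, "act": 512, "aud": 128}
        # Early fusion: project the concatenated modality features of each shot.
        self.early = nn.Linear(sum(self.dims.values()), hidden)
        # One graph message-passing step over the shot graph.
        self.msg = nn.Linear(hidden, hidden)
        # Late fusion: one boundary head per modality, plus one on the fused feature.
        self.heads = nn.ModuleDict({k: nn.Linear(d, 1) for k, d in self.dims.items()})
        self.fused_head = nn.Linear(hidden, 1)

    def forward(self, feats):
        # feats: dict of modality name -> (num_shots, dim) tensor
        x = self.early(torch.cat(list(feats.values()), dim=-1))      # (N, hidden)
        # Soft adjacency from pairwise cosine similarity between shots
        # (an assumed, simple way to build the shot graph).
        z = F.normalize(x, dim=-1)
        adj = F.softmax(z @ z.t(), dim=-1)                           # (N, N)
        x = F.relu(x + adj @ self.msg(x))                            # message passing
        # Late fusion: average the fused logit with per-modality logits.
        logits = self.fused_head(x)
        for k, f in feats.items():
            logits = logits + self.heads[k](f)
        return torch.sigmoid(logits / (len(feats) + 1)).squeeze(-1)  # (N,) boundary probs

# Usage on random features for 10 shots:
feats = {"env": torch.randn(10, 512), "obj": torch.randn(10, 512),
         "act": torch.randn(10, 512), "aud": torch.randn(10, 128)}
probs = ShotGraphFusion()(feats)  # per-shot scene-boundary probabilities
```

In this sketch, thresholding the per-shot probabilities would yield scene boundaries, after which consecutive shots between boundaries form a scene; the actual grouping procedure in the paper may differ.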