Abstract: A dynamic video summarization system detects the key parts of an input video to generate a compact representation of it. Such summaries can be used for efficient management of video data. This paper proposes Video Summarization based on a Multi-CNN Model (VSMCNN), an approach that exploits major aspects of human cognition to generate meaningful summaries from videos. As the method targets dynamic summarization, the input video is first divided into a set of shots. A multi-CNN model, a combination of different pre-trained CNN models, is used to extract features from the shots. Salient features are then drawn from the resulting high-dimensional feature vector using an unsupervised feature-reduction technique applied in multiple subspaces to rank the features. Finally, the distance between shot feature vectors is thresholded to detect the prime parts of the tested video. Experiments performed on the SumMe dataset show that our approach successfully detects the portions of the tested video that carry its essential message, and the analysis shows that the method outperforms state-of-the-art methods in the literature. A further comparison with the human-generated summaries in the ground truth confirms the effectiveness of the proposed method. The paper also presents a detailed analysis of which combination of pre-trained CNN models is best suited to generating dynamic summaries.
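The abstract leaves the feature-ranking criterion, the distance measure, and the threshold unspecified, so the following is only a minimal sketch of the selection pipeline it describes, not the authors' implementation. It assumes per-shot descriptors (e.g., concatenated outputs of two pre-trained CNNs) are already stacked into an array, uses variance-based ranking as a stand-in for the unsupervised reduction, Euclidean distance as the distance measure, and a hypothetical fixed threshold `tau`.

```python
import numpy as np


def reduce_features(X, k):
    """Rank the columns of the (n_shots, d) descriptor matrix X by an
    unsupervised score (variance, as a stand-in for the paper's multi-subspace
    ranking) and keep the top-k features."""
    scores = X.var(axis=0)
    top = np.argsort(scores)[::-1][:k]
    return X[:, top]


def select_key_shots(X, tau):
    """Mark shot i as a key shot when its reduced descriptor lies at least
    `tau` away (Euclidean) from the last shot already placed in the summary."""
    keys = [0]  # seed the summary with the first shot
    for i in range(1, len(X)):
        if np.linalg.norm(X[i] - X[keys[-1]]) >= tau:
            keys.append(i)
    return keys


# Toy usage: 12 shots with descriptors concatenated from two CNNs
# (e.g., 2048 + 1024 dimensions); values are random placeholders.
rng = np.random.default_rng(0)
feats = rng.normal(size=(12, 3072))
reduced = reduce_features(feats, k=256)
print(select_key_shots(reduced, tau=20.0))
```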