Abstract: Video highlight detection aims to select the most interesting and attractive clips from lengthy videos, a capability that is crucial for improving video editing and viewing on social media platforms. Existing methods rely predominantly on the visual modality and underutilize the other modalities that videos contain. Furthermore, in supervised video analysis tasks, subjective judgments during annotation can produce noisy labels that degrade learning. To address these issues, we propose a noise-robust multimodal video highlight detection approach. Our approach first enhances feature representation by combining a video’s visual and auditory information, which allows complementary information to be extracted from the different modalities. We then apply a noise-cleaning mechanism that uses multiple modalities to clean noisy samples. This suppresses their negative impact on learning, so that the network learns more robust features from clean samples. We evaluate our approach on two public datasets, YouTube Highlights and TVSum, and demonstrate that it mitigates the impact of noisy labels while improving the accuracy and robustness of video highlight detection.
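The abstract does not specify how the noise-cleaning mechanism works. As a purely illustrative sketch, one common way to use two modalities for this purpose is small-loss selection with cross-modal agreement: a sample is kept as clean only when both the visual branch and the audio branch assign it a low training loss. The function names, the `keep_ratio` parameter, and the agreement rule below are assumptions made for illustration, not the authors' method.

```python
from typing import List


def small_loss_mask(losses: List[float], keep_ratio: float) -> List[bool]:
    """Mark the keep_ratio fraction of samples with the smallest loss."""
    n_keep = max(1, int(round(keep_ratio * len(losses))))
    # Indices sorted from smallest to largest loss.
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    keep = set(order[:n_keep])
    return [i in keep for i in range(len(losses))]


def cross_modal_clean_mask(
    visual_losses: List[float],
    audio_losses: List[float],
    keep_ratio: float = 0.75,
) -> List[bool]:
    """Hypothetical rule: a sample counts as clean only if both
    modality branches place it among their small-loss samples."""
    if len(visual_losses) != len(audio_losses):
        raise ValueError("both modalities must score the same samples")
    visual_keep = small_loss_mask(visual_losses, keep_ratio)
    audio_keep = small_loss_mask(audio_losses, keep_ratio)
    return [v and a for v, a in zip(visual_keep, audio_keep)]


# Made-up per-clip losses; the third clip has a high loss in both branches.
visual = [0.1, 0.2, 2.5, 0.3]
audio = [0.2, 0.1, 1.9, 0.4]
clean = cross_modal_clean_mask(visual, audio, keep_ratio=0.75)
```

In a training loop, such a mask would typically be used to exclude or down-weight the flagged clips when computing the loss for the next update.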