Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Video Summarization, Self-Supervised, Contrastive Learning
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Training video summarization models with self-supervised learning.
Abstract: The goal of video summarization is to extract the most important parts from the original video.
Most existing methods are based on supervised learning and have demonstrated strong performance.
However, the scarcity of annotated data is a major obstacle in the video summarization task.
To mitigate this scarcity, several weakly-supervised and unsupervised methods have been proposed.
Although they yield promising results, these methods ignore the intrinsic associations between video clips.
To address this, we introduce a new self-supervised learning method called TCL-VS. Our main insight is that
an excellent summary must not only preserve the original video content but also eliminate redundant information.
Motivated by this observation, our method comprises two separate modules that respectively assess
the temporal consistency and the diversity of video clips. Each module predicts a per-clip score sequence,
and the two sequences are then combined by weighted fusion (illustrated below). Extensive experiments demonstrate that
our method achieves state-of-the-art performance on two video summarization benchmarks: SumMe and TVSum.
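As a rough, hypothetical illustration of the weighted fusion mentioned in the abstract: the sketch below combines two per-clip score sequences with a balancing weight. The function name `fuse_scores`, the weight `alpha`, and the normalization step are assumptions for illustration only, not details taken from the submission.

```python
import numpy as np

def fuse_scores(consistency_scores, diversity_scores, alpha=0.5):
    """Hypothetical weighted fusion of two per-clip score sequences.

    `alpha` balances the temporal-consistency and diversity modules;
    the submission's actual fusion scheme is not specified here.
    """
    c = np.asarray(consistency_scores, dtype=float)
    d = np.asarray(diversity_scores, dtype=float)
    # Min-max normalize each sequence to [0, 1] so the weighting
    # remains meaningful if the modules score on different scales.
    c = (c - c.min()) / (c.max() - c.min() + 1e-8)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)
    return alpha * c + (1.0 - alpha) * d

# Example: five clips scored by the two (hypothetical) modules.
consistency = [0.9, 0.4, 0.7, 0.2, 0.8]
diversity = [0.3, 0.8, 0.5, 0.9, 0.4]
print(fuse_scores(consistency, diversity, alpha=0.6))
```

Normalizing each sequence before fusing keeps the single weight interpretable even when the two modules produce scores with different ranges.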
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Supplementary Material: zip
Submission Number: 2429