Clustering-Assisted Foreground and Background Separation for Weakly-supervised Temporal Action Localization

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: weakly-supervised temporal action localization, online clustering
Abstract: Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. Existing methods mainly embrace a localization-by-classification pipeline that optimizes snippet-level predictions with a video classification loss. However, this formulation suffers from the discrepancy between classification and detection, resulting in noisy foreground and background (F&B) snippet separation. To alleviate this problem, we propose to explore the underlying structure among the snippets via unsupervised snippet clustering, rather than relying heavily on the video classification loss. Specifically, we propose a novel clustering-assisted F&B separation network, dubbed CASE, which achieves F&B separation through two core components: a snippet clustering component that groups the snippets into multiple latent clusters, and a cluster classification component that further classifies each cluster as foreground or background. In the absence of ground-truth labels to train these two components, we adopt an online self-training algorithm that interleaves pseudo-label rectification with model training. More importantly, we propose a distribution-constrained labeling strategy that uses different priors to regularize the distribution of the pseudo-labels and thereby improve their quality. With the aid of the online self-training algorithm and the distribution-constrained labeling strategy, our method exploits latent clusters that are simultaneously typical of the snippets and discriminative for F&B. The cluster assignments of the snippets can then be associated with their F&B labels to enable F&B separation. The effectiveness of the proposed CASE is demonstrated by experimental results on three publicly available benchmarks: THUMOS14, ActivityNet v1.2, and ActivityNet v1.3.
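To make the abstract's pipeline concrete, below is a minimal PyTorch sketch of the two components it names. This is an assumption-laden illustration, not the authors' code: the prototype-based clustering head (`CaseHead`), the per-cluster F&B logit, the feature dimension, the cluster count, and the uniform-prior Sinkhorn normalization standing in for "distribution-constrained labeling" are all hypothetical choices consistent with, but not specified by, the abstract.

```python
# Illustrative sketch of CASE's two components (assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaseHead(nn.Module):
    """Snippet clustering + cluster-level F&B classification (hypothetical)."""
    def __init__(self, feat_dim: int = 2048, num_clusters: int = 16):
        super().__init__()
        # Snippet clustering component: cosine similarity to K learnable prototypes.
        self.prototypes = nn.Linear(feat_dim, num_clusters, bias=False)
        # Cluster classification component: one foreground/background logit per cluster.
        self.cluster_fb_logit = nn.Parameter(torch.zeros(num_clusters))

    def forward(self, snippets: torch.Tensor):
        # snippets: (T, D) features of one video's T snippets.
        feats = F.normalize(snippets, dim=-1)
        protos = F.normalize(self.prototypes.weight, dim=-1)
        cluster_logits = feats @ protos.t()          # (T, K) snippet-to-cluster scores
        assign = cluster_logits.softmax(dim=-1)      # soft cluster assignments
        # Snippet foreground score = expected cluster-level foreground probability,
        # i.e. cluster assignments associated with cluster F&B labels.
        fg_score = assign @ torch.sigmoid(self.cluster_fb_logit)  # (T,)
        return cluster_logits, fg_score

@torch.no_grad()
def distribution_constrained_labels(cluster_logits: torch.Tensor,
                                    eps: float = 0.05, n_iters: int = 3):
    """Pseudo-label rectification under a distribution constraint.
    Here the constraint is a uniform prior over clusters, enforced by
    Sinkhorn normalization (one plausible instantiation; the abstract
    does not specify the mechanism)."""
    Q = torch.exp(cluster_logits / eps).t()  # (K, T)
    Q /= Q.sum()
    K, T = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)
        Q /= K   # rows: each cluster receives 1/K of the total mass
        Q /= Q.sum(dim=0, keepdim=True)
        Q /= T   # columns: each snippet distributes one unit of mass
    return (Q * T).t()  # (T, K) rectified soft pseudo-labels

# One online self-training step: pseudo-labels are rectified from the current
# predictions and immediately used to supervise the same network.
head = CaseHead()
snippets = torch.randn(100, 2048)  # e.g. 100 snippets from one video
cluster_logits, fg_score = head(snippets)
pseudo = distribution_constrained_labels(cluster_logits)
cluster_loss = -(pseudo * F.log_softmax(cluster_logits, dim=-1)).sum(dim=-1).mean()
cluster_loss.backward()
```

In this sketch the prior is a uniform cluster-marginal constraint; the "different priors" mentioned in the abstract could equally be non-uniform (e.g., reflecting expected foreground/background proportions), and the cluster loss would be combined with the video classification loss in practice.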
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)