Keywords: Few-shot action localization, common action localization, commonality
Abstract: The goal of this paper is to localize action instances in a long untrimmed query video using only a few trimmed support videos representing a common action whose class information is not given. In this task, it is crucial not only to correctly align temporal segments (proposals) of the query video with the support videos, but also to increase the compatibility among the support videos. The latter has been understudied, even though the context (e.g., background, camera angle) varies across the support videos. To address both points, we design a dual cross-attention coupled with a stabilizer (DCAPS). First, we develop an attention mechanism based on cross-correlation and apply it independently to each support video (with the query video) in order to manage the heterogeneity among the support videos. Next, we devise a stabilizer to increase the compatibility among the support videos. The cross-attention is then applied again so that the stabilized support videos attend to and enhance the query proposals. Finally, we also develop a relational classifier head based on the query and support video representations. Hence, our contributions better utilize a few support videos for representing query proposals and thus attain precise common action localization. We demonstrate the effectiveness of our approach with state-of-the-art performance on benchmark datasets (ActivityNet1.3 and THUMOS14) and extensively analyze each component.
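To make the three stages described above concrete, here is a minimal NumPy sketch of the pipeline as it reads from the abstract: per-support cross-attention, a stabilizer over the supports, a second cross-attention to enhance the query proposals, and a relational scoring head. All shapes, names (e.g., `cross_attention`, `stabilized`, `prototype`), and the specific choices (mean pooling as the stabilizer, cosine similarity as the relational head) are illustrative assumptions, not the paper's actual learned modules.

```python
# Illustrative sketch of the DCAPS flow described in the abstract (assumed shapes/ops).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, dim):
    """Scaled dot-product (cross-correlation) attention: queries attend to keys_values."""
    attn = softmax(queries @ keys_values.T / np.sqrt(dim), axis=-1)  # (Nq, Nk)
    return attn @ keys_values                                        # (Nq, dim)

rng = np.random.default_rng(0)
D, P, K, T = 64, 10, 5, 8               # feature dim, #query proposals, #supports, frames per support
proposals = rng.normal(size=(P, D))     # query-video proposal features (assumed precomputed)
supports = rng.normal(size=(K, T, D))   # K trimmed support videos

# Stage 1: cross-attention applied independently to each support video,
# so each heterogeneous support is aligned with the query proposals.
aligned = np.stack([cross_attention(proposals, s, D) for s in supports])  # (K, P, D)

# Stage 2: stabilizer -- a plain average over supports stands in for the
# learned module that increases compatibility among support videos.
stabilized = aligned.mean(axis=0)                                         # (P, D)

# Stage 3: cross-attention again, letting the stabilized support
# representation attend to and enhance the query proposals.
enhanced = proposals + cross_attention(proposals, stabilized, D)          # (P, D)

# Relational head (illustrative): cosine similarity between each enhanced
# proposal and the pooled support representation gives a commonality score.
prototype = stabilized.mean(axis=0)
scores = (enhanced @ prototype) / (
    np.linalg.norm(enhanced, axis=1) * np.linalg.norm(prototype) + 1e-8)
print(scores.shape)  # (P,) -- one score per query proposal
```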
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (eg, speech processing, computer vision, NLP)
TL;DR: For few-shot common action localization, where no class cue for the support videos is given, we propose a 3-stage cross-attention that aligns a long untrimmed query video with trimmed support videos without losing compatibility among the support videos.