Abstract: Action detection in untrimmed video has been a long-standing goal in computer vision.
Recently, single-frame annotation has emerged as a promising direction that bridges
the gap between video-level weak supervision and costly full supervision. We
tackle the problem of single-frame supervised temporal action localization, where only
one frame is annotated for each action instance in the video. Contextual information
is crucial for recognizing and localizing action instances. However, existing methods
for single-frame action detection still rely on isolated features with limited context.
In this thesis, we propose the Selective Feature Aggregation module, which (1)
dynamically aggregates contextual information to strengthen the expressive power
of the per-frame features, and (2) utilizes a set of selective functions, which encode a
general prior for selecting neighbors, to guide the feature aggregation. We find that
this module reduces context confusion and attention collapse when training a
feature aggregator with a very sparse set of labels.
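To make the idea concrete, the following is a minimal PyTorch sketch of how selective functions could guide per-frame feature aggregation, under our own assumptions rather than the thesis's actual implementation: the class name, the attention-based aggregator, and the particular selective functions (past-only, future-only, local-window masks) are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveFeatureAggregation(nn.Module):
    """Illustrative sketch (not the authors' code): each frame aggregates
    context from temporal neighbours via attention, and a set of selective
    functions restricts which neighbours each aggregation branch may use."""

    def __init__(self, dim, window=16):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.window = window

    def selective_masks(self, T, device):
        # Relative offset between every pair of frames: offset[i, j] = j - i.
        idx = torch.arange(T, device=device)
        offset = idx[None, :] - idx[:, None]
        past = offset <= 0                    # current frame and earlier neighbours
        future = offset >= 0                  # current frame and later neighbours
        local = offset.abs() <= self.window   # a local temporal window
        return torch.stack([past, future, local])  # (3, T, T) boolean masks

    def forward(self, x):
        # x: (T, dim) per-frame features of one untrimmed video.
        T, dim = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.t() / dim ** 0.5                      # (T, T) similarities
        masks = self.selective_masks(T, x.device)            # (3, T, T)
        # Each selective function yields its own masked attention map.
        neg_inf = torch.full_like(scores, float('-inf'))
        masked = torch.where(masks, scores.unsqueeze(0), neg_inf)
        attn = F.softmax(masked, dim=-1)
        context = attn @ v                                   # (3, T, dim)
        # Fuse the selectively aggregated context with the original features.
        return x + context.mean(dim=0)
```

In this sketch, the hard masks encode the "general prior for selecting neighbors" as fixed temporal constraints; the thesis's selective functions may instead be learned or soft, but the role they play in gating the aggregation is the same.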
We demonstrate that our proposed module effectively improves performance over previous methods on three benchmarks: THUMOS'14, GTEA, and BEOID. Concretely, we improve IoU-averaged mAP by 3.1%, 7.9%, and 2.8%, respectively, over
the baseline SFNet. The benefits are particularly striking in the challenging setting
with an IoU threshold of 0.7, where we improve by 10.8% over competitive methods on BEOID.