Abstract: Action detection in untrimmed video has been a long-standing goal in computer vision.
Recently, single-frame annotation has emerged as a promising direction that bridges
the gap between video-level weak supervision and costly full supervision. We
tackle the problem of single-frame supervised temporal action localization, where only
one frame is annotated for each action instance in the video. Contextual information
is crucial for recognizing and localizing action instances. However, existing methods
for single-frame action detection still rely on isolated features with limited context.
In this thesis, we propose the Selective Feature Aggregation module, which (1)
dynamically aggregates contextual information to strengthen the expressive power
of the per-frame features, and (2) utilizes a set of selective functions, which encode a
general prior for selecting neighbors, to guide the feature aggregation. We find that
this module reduces context confusion and attention collapse when training a
feature aggregator with a very sparse set of labels.
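To make the idea concrete, the following is a minimal PyTorch sketch of how selective functions could guide per-frame feature aggregation, under our own assumptions rather than the thesis's actual implementation: the class name, the attention-based aggregator, and the particular selective functions (past-only, future-only, local-window masks) are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveFeatureAggregation(nn.Module):
    """Illustrative sketch (not the authors' code): each frame aggregates
    context from temporal neighbours via attention, and a set of selective
    functions restricts which neighbours each aggregation branch may use."""

    def __init__(self, dim, window=16):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.window = window

    def selective_masks(self, T, device):
        # Relative offset between every pair of frames: offset[i, j] = j - i.
        idx = torch.arange(T, device=device)
        offset = idx[None, :] - idx[:, None]
        past = offset <= 0                    # current frame and earlier neighbours
        future = offset >= 0                  # current frame and later neighbours
        local = offset.abs() <= self.window   # a local temporal window
        return torch.stack([past, future, local])  # (3, T, T) boolean masks

    def forward(self, x):
        # x: (T, dim) per-frame features of one untrimmed video.
        T, dim = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.t() / dim ** 0.5                      # (T, T) similarities
        masks = self.selective_masks(T, x.device)            # (3, T, T)
        # Each selective function yields its own masked attention map.
        neg_inf = torch.full_like(scores, float('-inf'))
        masked = torch.where(masks, scores.unsqueeze(0), neg_inf)
        attn = F.softmax(masked, dim=-1)
        context = attn @ v                                   # (3, T, dim)
        # Fuse the selectively aggregated context with the original features.
        return x + context.mean(dim=0)
```

In this sketch, the hard masks encode the "general prior for selecting neighbors" as fixed temporal constraints; the thesis's selective functions may instead be learned or soft, but the role they play in gating the aggregation is the same.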
We demonstrate that our proposed module effectively improves performance over previous methods on three benchmarks: THUMOS'14, GTEA, and BEOID. Concretely, we improve IoU-averaged mAP by 3.1%, 7.9%, and 2.8%, respectively, over
the baseline SFNet. The benefits are particularly striking in the challenging setting
with an IoU threshold of 0.7, where we improve by 10.8% over competitive methods on BEOID.