Abstract: Localizing actions in video is a core task in computer vision.
The weakly supervised temporal localization problem
investigates whether this task can be adequately solved with
only video-level labels, significantly reducing the amount of
expensive and error-prone annotation that is required. A
common approach is to train a frame-level classifier where
frames with the highest class probability are selected to
make a video-level prediction. Frame-level activations are
then used for localization. However, the absence of frame-level
annotations causes the classifier to impart class bias on
every frame. To address this, we propose the Action Selection
Learning (ASL) approach to capture the general concept of action,
a property we refer to as “actionness”. Under ASL, the model
is trained with a novel class-agnostic task to predict which frames
will be selected by the classifier. Empirically, we show that ASL
outperforms leading baselines on two popular benchmarks,
THUMOS-14 and ActivityNet-1.2, with 12.3% and 5.7% relative
improvement, respectively. We further analyze the properties of ASL
and demonstrate the importance of actionness. The full code for
this work will be released at the time of publication.
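To make the selection mechanism described above concrete, the following is a minimal PyTorch-style sketch of top-k score pooling for the video-level prediction and a class-agnostic actionness head trained to predict which frames the classifier selects. The tensor shapes, the k hyper-parameter, the two-head architecture, and the loss details are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeaklySupervisedLocalizer(nn.Module):
    """Hypothetical two-head model: frame-level classifier + actionness head."""
    def __init__(self, feat_dim=2048, num_classes=20, k=8):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)  # per-frame class scores
        self.actionness = nn.Linear(feat_dim, 1)             # class-agnostic actionness score
        self.k = k

    def forward(self, feats):
        # feats: (batch, T, feat_dim) pre-extracted frame features
        cas = self.classifier(feats)                  # (batch, T, num_classes) class activations
        act = self.actionness(feats).squeeze(-1)      # (batch, T) actionness logits
        # Video-level prediction: average the top-k frame scores per class.
        topk_scores, topk_idx = cas.topk(self.k, dim=1)   # (batch, k, num_classes)
        video_logits = topk_scores.mean(dim=1)            # (batch, num_classes)
        return cas, act, video_logits, topk_idx

def asl_losses(model_out, video_labels):
    cas, act, video_logits, topk_idx = model_out
    # Weak supervision: video-level multi-label classification from top-k pooling.
    cls_loss = F.binary_cross_entropy_with_logits(video_logits, video_labels)
    # Class-agnostic actionness target: 1 for frames selected by the classifier
    # for any class present in the video, 0 otherwise (one plausible reading of
    # "predict which frames will be selected"; details here are assumptions).
    batch, T, _ = cas.shape
    selected = torch.zeros(batch, T, device=cas.device)
    for c in range(video_labels.shape[1]):
        mask = video_labels[:, c] > 0                  # videos containing class c
        if mask.any():
            selected[mask] = selected[mask].scatter(1, topk_idx[mask, :, c], 1.0)
    act_loss = F.binary_cross_entropy_with_logits(act, selected)
    return cls_loss + act_loss
```

At inference time, one would presumably combine the class activations with the actionness scores to localize action segments, but thresholds and post-processing are beyond the scope of this sketch.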